Variational Inference for Gaussian Process Models with Linear Complexity
Abstract
Large-scale Gaussian process inference has long faced practical challenges due to time and space complexity that is superlinear in dataset size. While sparse variational Gaussian process models are capable of learning from large-scale data, standard strategies for sparsifying the model can prevent the approximation of complex functions. In this work, we propose a novel variational Gaussian process model that decouples the representation of mean and covariance functions in reproducing kernel Hilbert space. We show that this new parametrization generalizes previous models. Furthermore, it yields a variational inference problem that can be solved by stochastic gradient ascent with time and space complexity that is only linear in the number of mean function parameters, regardless of the choice of kernels, likelihoods, and inducing points. This strategy makes the adoption of large-scale expressive Gaussian process models possible. We run several experiments on regression tasks and show that this decoupled approach greatly outperforms previous sparse variational Gaussian process inference procedures.
Ching-An Cheng, Institute for Robotics and Intelligent Machines, Georgia Institute of Technology, Atlanta, GA 30332, cacheng@gatech.edu. Byron Boots, Institute for Robotics and Intelligent Machines, Georgia Institute of Technology, Atlanta, GA 30332, bboots@cc.gatech.edu.
1 Introduction
Gaussian process (GP) inference is a popular nonparametric framework for reasoning about functions under uncertainty. However, the expressiveness of GPs comes at a price: solving (approximate) inference for a GP with N data instances has time and space complexities in O(N^3) and O(N^2), respectively. Therefore, GPs have traditionally been viewed as a tool for problems with small or medium-sized datasets.
Recently, the concept of inducing points has been used to scale GPs to larger datasets. The idea is to summarize a full GP model with statistics on a sparse set of M fictitious observations [18, 24]. By representing a GP with these inducing points, the time and space complexities are reduced to O(NM^2 + M^3) and O(NM + M^2), respectively. To further process datasets that are too large to fit into memory, stochastic approximations have been proposed for regression [10] and classification [11]. These methods have similar complexity bounds, but with N replaced by the size of a mini-batch N̂.
Despite the success of sparse models, the scalability issues of GP inference are far from resolved. The major obstruction is that the cubic complexity in M in the aforementioned upper bound is also a lower bound, which results from the inversion of an M-by-M covariance matrix defined on the inducing points. As a consequence, these models can only afford to use a small set of M basis functions, limiting the expressiveness of GPs for prediction.
In this work, we show that superlinear complexity is not completely necessary. Inspired by the reproducing kernel Hilbert space (RKHS) representation of GPs [2], we propose a generalized variational GP model, called DGPs (Decoupled Gaussian Processes), which decouples the bases for the mean and the covariance functions. Specifically, let M_α and M_β be the numbers of basis functions used to model the mean and the covariance functions, respectively, and assume M_α ≥ M_β. We show that, when DGPs are used as a variational posterior [24], the associated variational inference problem can be solved by stochastic gradient ascent with space complexity O(M_α d + M_β^2) and time complexity O(M_α d + M_β^3) per iteration, where d is the input dimension. We name this algorithm svdgp. As a result, we can choose M_α ≫ M_β, which allows us to keep the time and space complexity similar to previous methods (by choosing M_β on the order of M) while greatly increasing accuracy. To the best of our knowledge, this is the first variational GP algorithm that admits linear complexity in M_α, without any assumption on the choice of kernel and likelihood.
While we design svdgp for general likelihoods, in this paper we study its effectiveness on Gaussian process regression (GPR) tasks. We consider this to be without loss of generality, as most of the sparse variational GPR algorithms in the literature can be modified to handle general likelihoods by introducing additional approximations (e.g., as in Hensman et al. [11] and Sheth et al. [22]). Our experimental results show that svdgp significantly outperforms the existing techniques, achieving higher variational lower bounds and lower prediction errors when evaluated on held-out test sets.
1.1 Related Work
Table 1: Comparison of GPR algorithms: the update rule used for the variational parameters (a, B), the basis (inducing points/functions), and the hyperparameters θ (sga: stochastic gradient ascent; snga: stochastic natural gradient ascent; sma: stochastic mirror ascent; cg: nonlinear conjugate gradient ascent); whether the mean and covariance share the same basis; whether the model is sparse; and the resulting time and space complexity.

Algorithm | a, B | basis | θ    | Shared basis | Sparse | Time              | Space
svdgp     | sga  | sga   | sga  | no           | yes    | O(M_α d + M_β^3)  | O(M_α d + M_β^2)
svi       | snga | sga   | sga  | yes          | yes    | O(N̂M^2 + M^3)     | O(N̂M + M^2)
ivsgpr    | sma  | sma   | sga  | yes          | yes    | O(N̂M^2 + M^3)     | O(N̂M + M^2)
vsgpr     | cg   | cg    | cg   | yes          | yes    | O(NM^2 + M^3)     | O(NM + M^2)
gpr       | cg   | cg    | cg   | yes          | no     | O(N^3)            | O(N^2)
Our framework is based on the variational inference problem proposed by Titsias [24], which treats the inducing points as variational parameters to allow direct approximation of the true posterior. This is in contrast to Seeger et al. [21], Snelson and Ghahramani [23], Quiñonero-Candela and Rasmussen [18], and Lázaro-Gredilla et al. [15], which all use inducing points as hyperparameters of a degenerate prior. While both approaches have the same time and space complexity, the latter additionally introduces a large set of unregularized hyperparameters and, therefore, is more likely to suffer from overfitting [1].
In Table 1, we compare svdgp with recent GPR algorithms in terms of the assumptions made and the time and space complexity. Each algorithm can be viewed as a special way to solve the maximization of the variational lower bound (5), presented in Section 3.2. Our algorithm svdgp generalizes the previous approaches to allow the basis functions for the mean and the covariance to be decoupled, so an approximate solution can be found by stochastic gradient ascent in linear complexity.
To illustrate the idea, we consider a toy GPR example in Figure 1. The dataset contains 500 noisy observations of a one-dimensional function. Given the same training data, we conduct experiments with three different GP models. Figures 1(a) and 1(c) show the results of the traditional coupled basis, which can be solved by any of the variational algorithms listed in Table 1, and Figure 1(b) shows the result of the decoupled approach svdgp. The sizes of the basis and of the observations are selected to emulate a large-dataset scenario. We observe that svdgp achieves a nice trade-off between prediction performance and complexity: it achieves almost the same prediction accuracy as the full-scale model in Figure 1(c) and preserves the overall shape of the predictive variance.
In addition to the sparse algorithms above, some recent attempts aim to revive the nonparametric property of GPs through structured covariance functions. For example, Wilson and Nickisch [27] propose to place the inducing points on a multidimensional lattice, so that the time and space complexities of using a product kernel become O(N + dM^{1+1/d}) and O(N + dM^{2/d}), respectively. However, because M = m^d, where m is the number of grid points per dimension, the overall complexity is exponential in d and infeasible for high-dimensional data. Another interesting approach by Hensman et al. [12] combines variational inference [24] and a sparse spectral approximation [15]. By equally spacing the inducing points on the spectrum, they show that the covariance matrix on the inducing points has diagonal-plus-low-rank structure. With MCMC, the algorithm can achieve complexity linear in M. However, the proposed structure in [12] does not help to reduce the complexity when an approximate Gaussian posterior is favored or when the kernel hyperparameters need to be updated.
Other kernel methods with linear complexity have been proposed using functional gradient descent [14, 5]. However, because these methods use a model strictly the same size as the entire dataset, they fail to estimate the predictive covariance, which would require O(N^2) space complexity. Moreover, they cannot learn hyperparameters online. The latter drawback also applies to greedy algorithms based on rank-one updates, e.g. the algorithm of Csató and Opper [4].
In contrast to these previous methods, our algorithm applies to all choices of inducing points, likelihoods, and kernels, and we allow both variational parameters and hyperparameters to adapt online as more data are encountered.
2 Preliminaries
In this section, we briefly review the inference for GPs and the variational framework proposed by Titsias [24]. For now, we will focus on GPR for simplicity of exposition. We will discuss the case of general likelihoods in the next section when we introduce our framework, DGPs.
2.1 Inference for GPs
Let f : X → R be a latent function defined on a compact domain X ⊆ R^d. Here we assume a priori that f is distributed according to a Gaussian process GP(m, k). That is, for any x, x' ∈ X, E[f(x)] = m(x) and Cov[f(x), f(x')] = k(x, x'). In short, we write f ~ GP(m, k).
A GP probabilistic model is composed of a likelihood p(y|f(x)) and a GP prior GP(m, k); in GPR, the likelihood is assumed to be Gaussian, i.e. y = f(x) + ε with ε ~ N(0, σ²). Usually, the likelihood and the GP prior are parameterized by some hyperparameters, which we summarize as θ. This includes, for example, the variance σ² and the parameters implicitly involved in defining m and k. For notational convenience, and without loss of generality, we assume m = 0 in the prior distribution and omit explicitly writing the dependence of distributions on θ.
Assume we are given a dataset D = {(x_n, y_n)}_{n=1}^N, in which x_n ∈ X and y_n ∈ R. Let X = {x_n}_{n=1}^N and y = (y_n)_{n=1}^N. (In notation, we use boldface to distinguish finite-dimensional vectors (lowercase) and matrices (uppercase) that are used in computation from scalar and abstract mathematical objects.) Inference for GPs involves solving for the posterior p(f_*|y) for any new input x_* ∈ X, where f_* = f(x_*). For example, in GPR, because the likelihood is Gaussian, the predictive posterior is also Gaussian, with mean and covariance
m_{|y}(x_*) = K_{x_* X} (K_X + σ² I)^{-1} y,
k_{|y}(x_*, x_*') = K_{x_* x_*'} − K_{x_* X} (K_X + σ² I)^{-1} K_{X x_*'},    (1)
and the hyperparameters θ can be found by nonlinear conjugate gradient ascent on the log marginal likelihood [19]:

θ* = argmax_θ log p(y|θ) = argmax_θ −(1/2) ( y^⊤ (K_X + σ² I)^{-1} y + log|K_X + σ² I| + N log 2π ),    (2)
where K_{x_* X}, K_X, and K_{X x_*'} denote the covariances between the sets in the subscripts. (If the two sets are the same, only one is listed.) One can show that these two functions, m_{|y} and k_{|y}, define a valid GP. Therefore, given observations y, we say f|y ~ GP(m_{|y}, k_{|y}).
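The posterior (1) and the marginal likelihood (2) can be sketched in a few lines of NumPy. This is a minimal illustration assuming a squared-exponential kernel with fixed hyperparameters (the paper's actual setting uses the se-ard kernel with learned hyperparameters):

```python
import numpy as np

def rbf(X1, X2, ell=1.0, sf=1.0):
    """Squared-exponential kernel K[i, j] = sf^2 exp(-|x_i - x_j|^2 / (2 ell^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

def gpr_posterior(X, y, Xs, sigma2=0.1):
    """Exact GPR posterior (1) and log marginal likelihood (2).

    O(N^3) time / O(N^2) space: the Cholesky factorization of the
    N x N matrix K_X + sigma^2 I dominates the cost.
    """
    N = X.shape[0]
    K = rbf(X, X) + sigma2 * np.eye(N)
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K_X + s2 I)^{-1} y
    Ks = rbf(Xs, X)                                      # K_{x* X}
    mean = Ks @ alpha                                    # eq. (1), mean
    V = np.linalg.solve(L, Ks.T)
    cov = rbf(Xs, Xs) - V.T @ V                          # eq. (1), covariance
    logml = (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
             - 0.5 * N * np.log(2 * np.pi))              # eq. (2)
    return mean, cov, logml
```

In practice, (2) would be maximized over the kernel hyperparameters and σ² with a gradient-based optimizer, recomputing the Cholesky factor at every step.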
Although theoretically GPs are nonparametric and can model any function as N → ∞, in practice this is difficult. As the inference has time complexity O(N^3) and space complexity O(N^2), applying vanilla GPs to large datasets is infeasible.
2.2 Variational Inference with Sparse GPs
To scale GPs to large datasets, Titsias [24] introduced a scheme to compactly approximate the true posterior with a sparse GP, q(f) = GP(m̃, k̃), defined by the statistics on M function values u = (u_m)_{m=1}^M with u_m = (L_{z_m} f), where L_z is a bounded linear operator (here we use the notation (L_z f) loosely for compactness of writing; rigorously, L_z is a bounded linear operator acting on the RKHS of the prior and its dual, not necessarily on all sample paths f) and z_m ∈ X. (L_z f) is called an inducing function and z an inducing point. Common choices of L_z include the identity map, i.e. (L_z f) = f(z) (as used originally by Titsias [24]), and integrals to achieve better approximation or to consider multi-domain information [26, 7, 3]. Intuitively, we can think of u as a set of potentially indirect observations that capture salient information about the unknown function f.
Titsias [24] solves for q(f) by variational inference. Let Z = {z_m}_{m=1}^M, and let f_X and u be the (inducing) function values defined on X and Z, respectively. Let p(f_X, u) be the prior given by GP(m, k) and define q(f_X, u) = p(f_X|u) q(u) to be its variational posterior, where q(u) = N(u | m̂, S) and m̂ and S are the mean and the covariance of the approximate posterior of u. Titsias [24] proposes to use q(f_X, u) as the variational posterior to approximate p(f_X, u | y) and to solve for q(u) together with the hyperparameters θ through

max_{q(u), θ} L(q(u), θ),    L(q(u), θ) = E_{q(f_X)}[log p(y|f_X)] − KL[q(u) ∥ p(u)],    (3)

where L(q(u), θ) is a variational lower bound of log p(y|θ), p(f_X|u) is the conditional probability given in p(f_X, u), and q(f_X) = ∫ p(f_X|u) q(u) du.
At first glance, the specific choice of variational posterior seems heuristic. However, although parameterized finitely, it resembles a full-fledged GP, q(f) = GP(m̃, k̃):

m̃(x) = k_{xZ} K_Z^{-1} m̂,
k̃(x, x') = k_{xx'} − k_{xZ} K_Z^{-1} k_{Zx'} + k_{xZ} K_Z^{-1} S K_Z^{-1} k_{Zx'}.    (4)
This result is further studied in Matthews et al. [16] and Cheng and Boots [2], where it is shown that (3) is indeed minimizing a proper KL-divergence between Gaussian processes/measures.
By comparing (2) and (3), one can show that the time and space complexities now reduce to O(NM^2 + M^3) and O(NM + M^2), respectively, due to the low-rank structure of the variational posterior [24]. To further reduce the complexity, stochastic optimization, such as stochastic natural gradient ascent [10] or stochastic mirror descent [2], can be applied. In this case, N in the above asymptotic bounds is replaced by the size of a mini-batch N̂. The above results can be modified to consider general likelihoods as in [22, 11].
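As a concrete sketch of the predictive equations (4) and their cost, the following computes the variational predictive mean and covariance, assuming identity inducing functions (so u holds function values at the inducing points Z) and a generic kernel supplied by the caller:

```python
import numpy as np

def sparse_gp_predict(Xs, Z, m_u, S, kernel, jitter=1e-8):
    """Predictive mean/covariance of the variational posterior (4).

    Z holds M inducing points and q(u) = N(m_u, S). The cost is O(M^3)
    once for factorizing K_Z plus O(M^2) per test point -- independent
    of the training-set size N.
    """
    M = Z.shape[0]
    Kz = kernel(Z, Z) + jitter * np.eye(M)
    Lz = np.linalg.cholesky(Kz)
    Ksz = kernel(Xs, Z)                                   # k_{xZ}
    A = np.linalg.solve(Lz.T, np.linalg.solve(Lz, Ksz.T)) # K_Z^{-1} k_{Zx}
    mean = A.T @ m_u                                      # eq. (4), mean
    cov = kernel(Xs, Xs) - Ksz @ A + A.T @ S @ A          # eq. (4), covariance
    return mean, cov
```

Setting q(u) equal to the prior p(u), i.e. m_u = 0 and S = K_Z, recovers the prior covariance, reflecting the non-degeneracy of the variational approximation.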
3 Variational Inference with Decoupled Gaussian Processes
Despite the success of sparse GPs, the scalability issues of GPs persist. Although parameterizing a GP with inducing points/functions enables learning from large datasets, it also restricts the expressiveness of the model. As the time and space complexities still scale in O(M^3) and O(M^2), respectively, we cannot learn or use a complex model with large M.
In this work, we show that these two complexity bounds, which have long accompanied GP models, are not strictly necessary, but are due to the tangled representation canonically used in the GP literature. To elucidate this, we adopt the dual representation of Cheng and Boots [2], which treats GPs as linear operators in an RKHS. But, unlike Cheng and Boots [2], we show how to decouple the basis representation of the mean and covariance functions of a GP and derive a new variational problem, which can be viewed as a generalization of (3). We show that this problem—with arbitrary likelihoods and kernels—can be solved by stochastic gradient ascent with linear complexity in M_α, the number of parameters used to specify the mean function for prediction.
In the following, we first review the results in [2]. We next introduce the decoupled representation, DGPs, and its variational inference problem. Finally, we present svdgp and discuss the case with general likelihoods.
3.1 Gaussian Processes as Gaussian Measures
Let an RKHS H be a Hilbert space of functions with the reproducing property: for every x ∈ X there is ψ_x ∈ H such that f(x) = ⟨ψ_x, f⟩_H for all f ∈ H. (To simplify the notation, we write ψ_x^⊤ f for ⟨ψ_x, f⟩_H and ψ_x^⊤ Σ ψ_{x'} for ⟨ψ_x, Σ ψ_{x'}⟩_H, where ψ_x, ψ_{x'} ∈ H and Σ : H → H, even if H is infinite-dimensional.) A Gaussian process GP(m, k) is equivalent to a Gaussian measure q on a Banach space B of functions which possesses an RKHS H [2] (such an H can, w.l.o.g., be identified as the natural RKHS of the covariance function of a zero-mean prior GP): there is a mean functional μ ∈ H and a bounded positive semidefinite linear operator Σ : H → H, such that for any x, x' ∈ X, we can write m(x) = ψ_x^⊤ μ and k(x, x') = ψ_x^⊤ Σ ψ_{x'}. The triple (B, H, q) is known as an abstract Wiener space [9, 6], in which H is also called the Cameron–Martin space. Here the restriction that μ and Σ are RKHS objects is necessary, so that the variational inference problem in the next section is well-defined.
We call N(μ, Σ) the dual representation of a GP in RKHS (the mean function m and the covariance function k are realized as linear operators μ and Σ defined in H). With abuse of notation, we write q = N(μ, Σ) in short. This notation does not mean a GP has a Gaussian distribution in H, nor does it imply that the sample paths from q are necessarily in H. Precisely, B contains the sample paths of q and H is dense in B. In most applications of GP models, B is the Banach space of continuous functions and H is the span of the covariance function. As a special case, if H is finite-dimensional, B and H coincide, and N(μ, Σ) becomes equivalent to a Gaussian distribution in a Euclidean space.
In relation to our previous notation in Section 2.1: suppose k(x, x') = φ(x)^⊤ φ(x') and φ is a feature map to some Hilbert space H. Then we have assumed a priori that p = N(0, I) is a normal Gaussian measure; that is, p samples functions in the form f(x) = Σ_i w_i φ_i(x), where the w_i are independent standard Gaussian variables. Note that if H is infinite-dimensional, with probability one f is not in H, but fortunately H is large enough for us to approximate the sampled functions. In particular, it can be shown that the posterior in GPR has a dual RKHS representation in the same RKHS as the prior GP [2].
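To make the feature-map view concrete, the snippet below builds a finite-dimensional H from random Fourier features (an illustrative choice, not a construction used in the paper) and draws sample paths f(x) = Σ_i w_i φ_i(x) with w ~ N(0, I):

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature map phi: R^d -> R^D whose span is a finite-dimensional H;
# with these random Fourier features, phi(x)^T phi(x') approximates the
# unit-length-scale RBF kernel.
d, D = 1, 500
W = rng.normal(size=(D, d))            # spectral frequencies
b = rng.uniform(0, 2 * np.pi, size=D)
phi = lambda X: np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# A normal Gaussian measure on H: w ~ N(0, I). Each draw of w defines one
# sample path f(x) = sum_i w_i phi_i(x) of the induced GP prior.
X = np.linspace(-3, 3, 200)[:, None]
F = phi(X) @ rng.normal(size=(D, 3))   # three sample paths

# The induced covariance phi(X) phi(X)^T approximates the RBF kernel,
# and the approximation error shrinks as D grows.
K_approx = phi(X) @ phi(X).T
K_exact = np.exp(-0.5 * (X - X.T) ** 2)
err = np.abs(K_approx - K_exact).max()
```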
3.2 Variational Inference in Gaussian Measures
Cheng and Boots [2] propose a dual formulation of (3) in terms of Gaussian measures (we assume q is absolutely continuous with respect to p, which is true as Σ is non-degenerate; the integral ∫ log p(y|f) dq(f) denotes the expectation of log p(y|f) over q, and dq/dp denotes the Radon–Nikodym derivative):

max_{q, θ} L(q, θ),    L(q, θ) = ∫ log p(y|f) dq(f) − ∫ log (dq/dp) dq,    (5)
where q = N(μ, Σ) is a variational Gaussian measure and p = N(0, I) is the normal prior. Its connection to the inducing points/functions in (3) can be summarized as follows [2, 3]: Define a linear operator Ψ_Z : R^M → H as Ψ_Z a = Σ_{m=1}^M a_m ψ_{z_m}, where ψ_{z_m} ∈ H is defined such that ψ_{z_m}^⊤ f = (L_{z_m} f). Then (3) and (5) are equivalent, if q has a subspace parametrization,

μ = Ψ_Z a,    Σ = I + Ψ_Z A Ψ_Z^⊤,    (6)
with a ∈ R^M and A ∈ R^{M×M} satisfying m̂ = K_Z a and S = K_Z + K_Z A K_Z. In other words, the variational inference algorithms in the literature all use a variational Gaussian measure in which μ and Σ are parametrized by the same basis Ψ_Z.
Compared with (3), the formulation in (5) is neater: it follows the definition of the very basic variational inference problem. This is not surprising, since GPs can be viewed as Bayesian linear models in an infinite-dimensional space. Moreover, in (5) all hyperparameters are isolated in the likelihood p(y|f), because the prior p is fixed as a normal Gaussian measure.
3.3 Disentangling the GP Representation with DGPs
While Cheng and Boots [2] treat (5) as an equivalent form of (3), here we show that it is a generalization. By further inspecting (5), it is apparent that sharing the basis Ψ_Z between μ and Σ in (6) is not strictly necessary, since (5) seeks to optimize two linear operators, μ and Σ. With this in mind, we propose a new parametrization that decouples the bases for μ and Σ:

μ = Ψ_α a,    Σ = (I + Ψ_β B Ψ_β^⊤)^{-1},    (7)

where a ∈ R^{M_α}, B ∈ R^{M_β×M_β}, and Ψ_α and Ψ_β denote linear operators defined similarly to Ψ_Z, but on the bases α = {α_i}_{i=1}^{M_α} and β = {β_j}_{j=1}^{M_β}. Compared with (6), here we parametrize Σ through its inversion with B, so the condition that 0 ⪯ Σ ⪯ I can be easily realized as B ⪰ 0. This form agrees with the posterior covariance in GPR [2] and will give a posterior that is strictly less uncertain than the prior.
The decoupled subspace parametrization (7) corresponds to a DGP, GP(m̃, k̃), with mean and covariance functions (in practice, we can parametrize B with a Cholesky factor L, i.e. B = LL^⊤, so the problem is unconstrained; the required terms in (8) and later in (9) can be stably computed as (K_β + B^{-1})^{-1} = L(I + L^⊤ K_β L)^{-1} L^⊤ and log|I + B K_β| = log|I + L^⊤ K_β L|):

m̃(x) = k_{xα} a,
k̃(x, x') = k_{xx'} − k_{xβ} (K_β + B^{-1})^{-1} k_{βx'}.    (8)
While the structure of (8) looks similar to (4), directly replacing the basis Z in (4) with α and β is not trivial. Because the equations in (4) are derived from the traditional viewpoint of GPs as statistics on function values, the original optimization problem (3) is not defined when α ≠ β, and therefore it is not clear how to learn a decoupled representation traditionally. Conversely, by using the dual RKHS representation, the objective function to learn (8) follows naturally from (5), as we will show next.
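A short sketch of evaluating the DGP mean and covariance (8), using the Cholesky parametrization B = LL^⊤ and the stable identity (K_β + B^{-1})^{-1} = L(I + L^⊤ K_β L)^{-1} L^⊤ mentioned above (the kernel and basis points are placeholders supplied by the caller):

```python
import numpy as np

def dgp_mean_cov(Xs, A_pts, B_pts, a, Lchol, kernel):
    """Decoupled GP (8): the mean uses basis 'alpha' (M_a points), the
    covariance uses basis 'beta' (M_b points) with B = L L^T.

    Cost: O(M_a) kernel evaluations per test point for the mean and
    O(M_b^2) per test point (plus one O(M_b^3) factorization) for the
    covariance -- the two bases never mix.
    """
    mean = kernel(Xs, A_pts) @ a                 # k_{x alpha} a
    Ksb = kernel(Xs, B_pts)                      # k_{x beta}
    Kb = kernel(B_pts, B_pts)
    Mb = B_pts.shape[0]
    # (K_beta + B^{-1})^{-1} = L (I + L^T K_beta L)^{-1} L^T  (stable form)
    inner = np.eye(Mb) + Lchol.T @ Kb @ Lchol
    T = np.linalg.solve(np.linalg.cholesky(inner), Lchol.T @ Ksb.T)
    cov = kernel(Xs, Xs) - T.T @ T               # eq. (8), covariance
    return mean, cov
```

Setting L = 0 (i.e. B = 0) recovers the prior covariance, and any B ⪰ 0 can only shrink the predictive variance, matching the discussion around (7).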
3.4 svdgp: Algorithm and Analysis
Substituting the decoupled subspace parametrization (7) into the variational inference problem in (5) results in a numerical optimization problem, max_{a, B, θ} L(a, B, θ), with L(a, B, θ) given by the difference between the expected log-likelihood (10) and the KL-divergence (9):

KL[q ∥ p] = (1/2) a^⊤ K_α a + (1/2) log|I + B K_β| − (1/2) tr((K_β + B^{-1})^{-1} K_β),    (9)

∫ log p(y|f) dq(f) = Σ_{n=1}^N E_{q(f(x_n))}[log p(y_n | f(x_n))],    (10)
where each expectation is over a scalar Gaussian q(f(x_n)) = N(m̃(x_n), k̃(x_n, x_n)) given by (8) as a function of a and B. Our objective function contains that of [11] as a special case, which assumes α = β. In addition, we note that Hensman et al. [11] indirectly parametrize the posterior by m̂ and S, whereas we parametrize it directly by (7), with a for scalability and B ⪰ 0 for better stability (which always reduces the uncertainty in the posterior compared with the prior).
We notice that a and B are completely decoupled in (9) and potentially combined again in (10). In particular, if p(y|f) is Gaussian as in GPR, we have an additional decoupling, i.e. L(a, B, θ) = L_a(a, θ) + L_B(B, θ) for some functions L_a and L_B. Intuitively, the optimization over a aims to minimize the fitting error, and the optimization over B aims to memorize the samples encountered so far; the mean and the covariance functions only interact indirectly through the optimization of the hyperparameters θ.
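For a Gaussian likelihood, the decoupling is explicit: E_{f ~ N(m, s)}[log N(y | f, σ²)] = log N(y | m, σ²) − s/(2σ²), so each expected log-likelihood term splits into a part depending only on the mean (hence on a) and a part depending only on the variance (hence on B). A quick Monte Carlo sanity check of this identity:

```python
import numpy as np

rng = np.random.default_rng(1)
y, m, s, s2 = 0.3, -0.5, 0.7, 0.2   # arbitrary observation, mean, variance, noise

# Monte Carlo estimate of E_{f ~ N(m, s)}[log N(y | f, s2)]
f = rng.normal(m, np.sqrt(s), size=2_000_000)
mc = np.mean(-0.5 * np.log(2 * np.pi * s2) - (y - f) ** 2 / (2 * s2))

# Closed form: log N(y | m, s2) - s / (2 s2); the variance enters only
# through the additive penalty -s / (2 s2), which is where B lives.
closed = (-0.5 * np.log(2 * np.pi * s2)
          - (y - m) ** 2 / (2 * s2) - s / (2 * s2))
# mc and closed agree up to Monte Carlo error
```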
One salient feature of svdgp is that it tends to overestimate, rather than underestimate, the variance when we select M_β ≪ M_α. This is inherited from the non-degeneracy property of the variational framework [24] and can be seen in the toy example in Figure 1. In the extreme case when M_β = 0, the covariance in (8) becomes the same as the prior; moreover, the objective function of svdgp becomes similar to that of kernel methods (exactly the same as kernel ridge regression when the likelihood is Gaussian). The additional inclusion of expected log-likelihoods here allows svdgp to learn the hyperparameters in a unified framework, as its objective function can be viewed as minimizing a generalization upper bound in PAC-Bayes learning [8].
svdgp solves the above optimization problem by stochastic gradient ascent. Here we purposefully ignore the specific details of p(y|f) to emphasize that svdgp can be applied to general likelihoods, as it only requires unbiased first-order information, which e.g. can be found in [22]. In addition to having a more adaptive representation, the main benefit of svdgp is that the computation of an unbiased gradient requires only linear complexity in M_α, as shown below (see Appendix A for details).
KL-Divergence
Assume a ∈ R^{M_α} and B = LL^⊤. By (9), one can show that ∇_a KL[q ∥ p] = K_α a and that ∇_L KL[q ∥ p] involves only M_β-by-M_β quantities such as (I + L^⊤ K_β L)^{-1}. Therefore, the time complexity to compute an unbiased estimate of ∇_a KL[q ∥ p] can be reduced to O(M_α N̂ d) if we sample over the columns of K_α with a mini-batch of size N̂. By contrast, the time complexity to compute ∇_L KL[q ∥ p] will always be O(M_β^3) and cannot be further reduced, regardless of the parametrization of B. (Due to K_β, the complexity would remain Ω(M_β^2) even if B were constrained to be diagonal.) The gradients with respect to the bases α and β can be derived similarly and have time complexity O(M_α N̂ d) and O(M_β^3 + M_β^2 d), respectively.
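The column-sampling trick for ∇_a KL[q ∥ p] = K_α a can be sketched as follows (the basis points and coefficients are random placeholders; uniform sampling with replacement is one simple unbiased scheme):

```python
import numpy as np

rng = np.random.default_rng(2)
k = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

M_a, d, Nhat = 800, 3, 40
Alpha = rng.normal(size=(M_a, d))   # mean-basis points (placeholder values)
a = rng.normal(size=M_a)

# grad_a KL(q || p) = K_alpha a. The exact product costs O(M_a^2 d);
# sampling Nhat columns uniformly with replacement and rescaling by
# M_a / Nhat gives an unbiased O(M_a * Nhat * d) estimate.
idx = rng.integers(0, M_a, size=Nhat)
g_hat = k(Alpha, Alpha[idx]) @ a[idx] * (M_a / Nhat)

g_exact = k(Alpha, Alpha) @ a       # E[g_hat] = g_exact
```

A single draw of g_hat is noisy but cheap; averaging over mini-batches (as stochastic gradient ascent implicitly does) recovers the exact direction in expectation.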
Expected Log-Likelihood
Let m = (m̃(x_n))_{n=1}^N and s = (k̃(x_n, x_n))_{n=1}^N be the vectors of the means and variances of the scalar Gaussians q(f(x_n)) for n = 1, …, N. As (10) is a sum over N terms, by sampling with a mini-batch of size N̂, an unbiased gradient of (10) with respect to (m, s) can be computed in O(N̂). To compute the full gradient with respect to a and L, we compute the derivatives of m and s with respect to a and L and then apply the chain rule. These steps take O(M_α N̂ d) and O(M_β^3 + M_β^2 N̂ + M_β N̂ d) for a and L, respectively.
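For the special case of a Gaussian likelihood, the per-sample gradients of (10) with respect to the scalar means and variances have closed forms, d/dm_n = (y_n − m_n)/σ² and d/ds_n = −1/(2σ²), and the mini-batch estimator is a few lines (the uniform-sampling scheme here is one simple choice):

```python
import numpy as np

def ell_grads_minibatch(y, m, s2, idx):
    """Unbiased mini-batch gradient of (10) w.r.t. the scalar means m and
    variances s, for a Gaussian likelihood with noise variance s2:
        d/dm_n = (y_n - m_n) / s2,   d/ds_n = -1 / (2 s2).
    Rescaling by N / Nhat keeps the estimate unbiased; the cost is O(Nhat)
    before the chain rule maps these back onto a and L.
    """
    N, Nhat = len(y), len(idx)
    scale = N / Nhat
    dm = np.zeros(N)
    ds = np.zeros(N)
    np.add.at(dm, idx, scale * (y[idx] - m[idx]) / s2)  # handles repeated indices
    np.add.at(ds, idx, -scale / (2 * s2))
    return dm, ds
```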
The above analysis shows that the curse of dimensionality in GPs originates in the covariance function. For space complexity, the decoupled parametrization (7) requires memory in O(M_α d + M_β^2); for time complexity, an unbiased gradient with respect to a can be computed in O(M_α N̂ d), but that with respect to B has time complexity O(M_β^3). This motivates choosing M_α ≫ M_β, e.g. with M_β in O(M_α^{1/3}), which maintains the same complexity as previous variational techniques while greatly improving the prediction performance.
4 Experimental Results
We compare our new algorithm, svdgp, with the state-of-the-art incremental algorithms for sparse variational GPR, svi [10] and ivsgpr [2], as well as classical gpr and the batch algorithm vsgpr [24]. As discussed in Section 1.1, these methods can be viewed as different ways to optimize (5). Therefore, in addition to the normalized mean squared error (nMSE) [19] in prediction, we report the performance in terms of the variational lower bound (VLB) (5), which also captures the quality of the predictive variance and of hyperparameter learning. (The exact marginal likelihood is computationally infeasible to evaluate for our large models.) These two metrics are evaluated on held-out test sets in all of our experimental domains.
Algorithm 1 summarizes the online learning procedure used by all the stochastic algorithms (the algorithms differ only in whether the bases are shared and in how the model is updated; see Table 1), where each learner has to optimize all the parameters on-the-fly using i.i.d. data. The hyperparameters are first initialized heuristically by the median trick using the first mini-batch. We incrementally build up the variational posterior by including observations in each mini-batch as the initialization of new variational basis functions. Then all the hyperparameters and the variational parameters are updated online. These steps are repeated for a fixed number of iterations.
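The median trick mentioned above can be sketched as follows; the exact variant used in our implementation is not spelled out here, and this snippet shows a common default that sets the kernel length-scale to the median pairwise distance within the first mini-batch:

```python
import numpy as np

def median_trick(X):
    """Heuristic length-scale initialization from a mini-batch X (Nhat x d):
    the median of the pairwise Euclidean distances between inputs."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    return np.sqrt(np.median(d2[np.triu_indices_from(d2, k=1)]))
```

Because it only needs one mini-batch, this initialization fits naturally at the start of the online procedure, before any gradient updates are taken.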
For all the algorithms, we assume the prior covariance is defined by the se-ard kernel [19], and we use the generalized se-ard kernel [2] as the inducing functions in the variational posterior (see Appendix B for details). We note that all algorithms in the comparison use the same kernel and optimize both the variational parameters (including the inducing points) and the hyperparameters.
In particular, we implement sga with adam [13] using its default parameters. The step size for each stochastic algorithm follows a decaying schedule whose initial value is selected manually for each algorithm to maximize the improvement in the objective function after the first 100 iterations. We test each stochastic algorithm for a fixed number of iterations with a fixed mini-batch size and increment size. Finally, the model sizes are chosen as follows: svdgp uses a decoupled basis with M_α ≫ M_β; svi and vsgpr use M = 1,024; ivsgpr uses M = 256; the batch algorithms vsgpr and gpr are additionally run on a subsampled training set. These settings share a similar order of time complexity in our current Matlab implementation.
4.1 Datasets
Inverse Dynamics of KUKA Robotic Arm
This dataset records the inverse dynamics of a KUKA arm performing rhythmic motions at various speeds [17]. The original dataset consists of two parts, kuka1 and kuka2, each of which has 17,560 offline data points and 180,360 online data points with 28 attributes and 7 outputs. In the experiments, we mix the online and the offline data and then split 90% as training data (178,128 instances) and 10% as testing data (19,792 instances) to satisfy the i.i.d. assumption.
Walking MuJoCo
MuJoCo (Multi-Joint dynamics with Contact) is a physics engine for research in robotics, graphics, and animation, created by Todorov et al. [25]. In this experiment, we gather 1,000 walking trajectories by running trpo [20]. In each time frame, the MuJoCo transition dynamics have a 23-dimensional input and a 17-dimensional output. We consider two regression problems to predict 9 of the 17 outputs from the input (because of the structure of the MuJoCo dynamics, the remaining 8 outputs can be trivially known from the input): mujoco1, which maps the input of the current frame (23 dimensions) to the output, and mujoco2, which maps the inputs of the current and the previous frames (46 dimensions) to the output. In each problem, we randomly select 90% of the data as training data (842,745 instances) and 10% as test data (93,608 instances).
4.2 Results
We summarize part of the experimental results in Table 2 in terms of the nMSE in prediction and the VLB. While each output is treated independently during learning, Table 2 presents the mean and the standard deviation over all the outputs, as the selected metrics are normalized. For the complete experimental results, please refer to Appendix C.
We observe that svdgp consistently outperforms the other approaches with much higher VLBs and much lower prediction errors; svdgp also has smaller standard deviation. These results validate our initial hypothesis that adopting a large set of basis functions for the mean can help when modeling complicated functions. ivsgpr has the next best result after svdgp, despite using a basis size of 256, much smaller than that of 1,024 in svi, vsgpr, and gpr. Similar to svdgp, ivsgpr also generalizes better than the batch algorithms vsgpr and gpr, which only have access to a smaller set of training data and are more prone to overfitting. By contrast, the performance of svi is surprisingly worse than vsgpr. We conjecture this might be due to the fact that the hyperparameters and the inducing points/functions are only crudely initialized in online learning. We additionally find that the stability of svi is more sensitive to the choice of step size than other methods. This might explain why in [10, 2] batch data was used to initialize the hyperparameters and the learning rate to update the hyperparameters was selected to be much smaller than that for stochastic natural gradient ascent.
To further investigate the properties of the different stochastic approximations, we show the change of the VLB and the prediction error over iterations and time in Figure 2. Overall, whereas ivsgpr and svi share a similar convergence rate, the behavior of svdgp is different. We see that ivsgpr converges the fastest, both in time and in sample complexity. Afterwards, svdgp starts to improve faster and surpasses the other two methods. From Figure 2, we can also observe that although svi has convergence similar to ivsgpr, it slows down earlier and therefore achieves a worse result. These phenomena are observed in multiple experiments.
5 Conclusion
We propose a novel, fully-differentiable framework, Decoupled Gaussian Processes (DGPs), for large-scale GP problems. By decoupling the representation, we derive a variational inference problem that can be solved with stochastic gradients in linear time and space complexity. Compared with existing algorithms, svdgp can adopt a much larger set of basis functions to predict more accurately. Empirically, svdgp significantly outperforms state-of-the-art variational sparse GPR algorithms in multiple regression tasks. These encouraging experimental results motivate further application of svdgp to end-to-end learning with neural networks in large-scale, complex real-world problems.
Acknowledgments
This work was supported in part by NSF NRI award 1637758.
References
 Bauer et al. [2016] Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems, pages 1525–1533, 2016.
 Cheng and Boots [2016] Ching-An Cheng and Byron Boots. Incremental variational sparse Gaussian process regression. In Advances in Neural Information Processing Systems, pages 4403–4411, 2016.
 Cheng and Huang [2016] Ching-An Cheng and Han-Pang Huang. Learn the Lagrangian: A vector-valued RKHS approach to identifying Lagrangian systems. IEEE Transactions on Cybernetics, 46(12):3247–3258, 2016.
 Csató and Opper [2001] Lehel Csató and Manfred Opper. Sparse representation for Gaussian process models. Advances in Neural Information Processing Systems, pages 444–450, 2001.
 Dai et al. [2014] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F. Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041–3049, 2014.
 Eldredge [2016] Nathaniel Eldredge. Analysis and probability on infinite-dimensional spaces. arXiv preprint arXiv:1607.03591, 2016.
 Figueiras-Vidal and Lázaro-Gredilla [2009] Anibal Figueiras-Vidal and Miguel Lázaro-Gredilla. Inter-domain Gaussian processes for sparse inference using inducing features. In Advances in Neural Information Processing Systems, pages 1087–1095, 2009.
 Germain et al. [2016] Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems, pages 1884–1892, 2016.
 Gross [1967] Leonard Gross. Abstract Wiener spaces. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Contributions to Probability Theory, Part 1, pages 31–42. University of California Press, 1967.
 Hensman et al. [2013] James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
 Hensman et al. [2015] James Hensman, Alexander G. de G. Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In International Conference on Artificial Intelligence and Statistics, 2015.
 Hensman et al. [2016] James Hensman, Nicolas Durrande, and Arno Solin. Variational Fourier features for Gaussian processes. arXiv preprint arXiv:1611.06740, 2016.
 Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kivinen et al. [2004] Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.
 Lázaro-Gredilla et al. [2010] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11(Jun):1865–1881, 2010.
 Matthews et al. [2016] Alexander G. de G. Matthews, James Hensman, Richard E. Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics, 2016.
 Meier et al. [2014] Franziska Meier, Philipp Hennig, and Stefan Schaal. Incremental local Gaussian regression. In Advances in Neural Information Processing Systems, pages 972–980, 2014.
 Quiñonero-Candela and Rasmussen [2005] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959, 2005.
 Rasmussen and Williams [2006] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. MIT Press, 2006.
 Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889–1897, 2015.
 Seeger et al. [2003] Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Artificial Intelligence and Statistics 9, number EPFL-CONF-161318, 2003.
 Sheth et al. [2015] Rishit Sheth, Yuyang Wang, and Roni Khardon. Sparse variational inference for generalized GP models. In Proceedings of the 32nd International Conference on Machine Learning, pages 1302–1311, 2015.
 Snelson and Ghahramani [2005] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2005.
 Titsias [2009] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.
 Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
 Walder et al. [2008] Christian Walder, Kwang In Kim, and Bernhard Schölkopf. Sparse multiscale Gaussian process regression. In Proceedings of the 25th International Conference on Machine Learning, pages 1112–1119. ACM, 2008.
 Wilson and Nickisch [2015] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, pages 1775–1784, 2015.
Appendix A Variational Inference with Decoupled Gaussian Processes
Here we provide the details of the variational inference problem used to learn DGPs:
$$\max_{a,\, B \succeq 0} \; \mathcal{L}(a, B) = \sum_{i=1}^{N} \mathbb{E}_{q(f(x_i))}\big[\log p(y_i \mid f(x_i))\big] - \mathrm{KL}(q \,\|\, p), \qquad (11)$$
where $q = \mathcal{N}(\mu, \Sigma)$ with $\mu = \Psi_\alpha a$ and $\Sigma = (I + \Psi_\beta B \Psi_\beta^\top)^{-1}$, and $p = \mathcal{N}(0, I)$ is the GP prior expressed in the RKHS.
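For concreteness, the objective can be evaluated in closed form when the likelihood is Gaussian, since the expectation of a Gaussian log-density under a Gaussian marginal is analytic. The sketch below is our own illustration (RBF kernel with unit signal variance; `Xa` and `Xb` play the roles of the mean and covariance bases, and all names are ours), not code from the paper:

```python
import numpy as np

def rbf(X, Y, ell=1.0):
    """RBF kernel matrix with unit signal variance."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def elbo(X, y, Xa, Xb, a, L, noise_var=0.1):
    """Evaluate (11) for a Gaussian likelihood.
    q has mean k(., Xa) a and covariance parametrized by B = L L^T on basis Xb."""
    Kaa, Kbb = rbf(Xa, Xa), rbf(Xb, Xb)
    Kxa, Kxb = rbf(X, Xa), rbf(X, Xb)
    C = L.T @ Kbb @ L
    I = np.eye(C.shape[0])
    lam = L @ np.linalg.solve(I + C, L.T)            # (B^{-1} + K_bb)^{-1}
    mean = Kxa @ a                                   # marginal means at X
    var = 1.0 - np.sum((Kxb @ lam) * Kxb, axis=1)    # marginal variances (k(x,x) = 1)
    # E_q[log N(y | f, noise_var)] in closed form
    exp_ll = (-0.5 * np.log(2 * np.pi * noise_var)
              - ((y - mean) ** 2 + var) / (2 * noise_var))
    _, logdet = np.linalg.slogdet(I + C)
    kl = 0.5 * (a @ Kaa @ a + logdet - np.trace(lam @ Kbb))
    return exp_ll.sum() - kl
```

Because the expected log-likelihood decomposes over data points, the sum can be subsampled in minibatches, which is what enables stochastic gradient ascent on (11).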
A.1 KL Divergence
A.1.1 Evaluation
First, we show how to evaluate the KL divergence. We do so by extending the KL divergence between two finite-dimensional subspace-parametrized Gaussian measures to an infinite-dimensional space and showing that it is well-defined.
Recall that for two $d$-dimensional Gaussian distributions $q_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $q_2 = \mathcal{N}(\mu_2, \Sigma_2)$, the KL divergence is given as follows.
Proposition 1.
$$\mathrm{KL}(q_1 \,\|\, q_2) = \frac{1}{2} \left( \operatorname{tr}(\Sigma_2^{-1} \Sigma_1) - d + (\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) + \log \frac{|\Sigma_2|}{|\Sigma_1|} \right).$$
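As a sanity check, Proposition 1 can be transcribed directly into NumPy (this snippet is illustrative and not part of the original text):

```python
import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1) || N(mu2, sigma2)) for d-dimensional Gaussians,
    following Proposition 1 term by term."""
    d = mu1.shape[0]
    sigma2_inv = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    trace_term = np.trace(sigma2_inv @ sigma1)   # tr(Sigma2^{-1} Sigma1)
    quad_term = diff @ sigma2_inv @ diff         # (mu2-mu1)^T Sigma2^{-1} (mu2-mu1)
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    return 0.5 * (trace_term - d + quad_term + logdet2 - logdet1)
```

As expected, the divergence of a Gaussian to itself is zero, and in one dimension $\mathrm{KL}(\mathcal{N}(0,1)\,\|\,\mathcal{N}(0,2)) = \tfrac{1}{2}(\log 2 - \tfrac{1}{2})$.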
Now consider $q_1$ and $q_2$ that are subspace parametrized as
$$\mu_i = \Psi a_i, \qquad \Sigma_i = (I + \Psi B_i \Psi^\top)^{-1}, \qquad i = 1, 2, \qquad (12)$$
where $\Psi = (\psi_1, \dots, \psi_M) \in \mathcal{H}^M$ is a set of basis functions, $a_i \in \mathbb{R}^M$, and $B_i \in \mathbb{R}^{M \times M}$ is positive semidefinite.
By Proposition 1, we derive a representation of the KL divergence which is applicable even when $\dim \mathcal{H}$ is infinite. Recall that in the infinite-dimensional case, $\mu_1$, $\mu_2$, $\Sigma_1$, and $\Sigma_2$ are objects in the RKHS $\mathcal{H}$ (the Cameron–Martin space).
Theorem 1.
Assume $q_1$ and $q_2$ are two subspace-parametrized Gaussian measures given as (12). Regardless of the dimension of $\mathcal{H}$, the following holds:
$$\mathrm{KL}(q_1 \,\|\, q_2) = \frac{1}{2} \Big( \operatorname{tr}(B_2 K) - \operatorname{tr}(\Lambda_1 K) - \operatorname{tr}(B_2 K \Lambda_1 K) + \delta^\top (K + K B_2 K)\, \delta + \log|I + B_1 K| - \log|I + B_2 K| \Big), \qquad (13)$$
where $K = \Psi^\top \Psi \in \mathbb{R}^{M \times M}$, $\Lambda_1 = (B_1^{-1} + K)^{-1}$, and $\delta = a_1 - a_2$.
In particular, if $q_2$ is normal (i.e. $a_2 = 0$ and $B_2 = 0$), then
$$\mathrm{KL}(q_1 \,\|\, q_2) = \frac{1}{2} \left( a_1^\top K a_1 + \log|I + B_1 K| - \operatorname{tr}(\Lambda_1 K) \right).$$
Proof.
To prove the theorem, we derive each term in (13) as follows.
First, we derive $\operatorname{tr}(\Sigma_2^{-1} \Sigma_1) - d$. Define $\Lambda_i = (B_i^{-1} + K)^{-1}$. Then we can write
$$\Sigma_i = (I + \Psi B_i \Psi^\top)^{-1} = I - \Psi \Lambda_i \Psi^\top. \qquad (14)$$
Using (14), we can derive
$$\Sigma_2^{-1} \Sigma_1 = (I + \Psi B_2 \Psi^\top)(I - \Psi \Lambda_1 \Psi^\top) = I + \Psi (B_2 - \Lambda_1 - B_2 K \Lambda_1) \Psi^\top,$$
and therefore
$$\operatorname{tr}(\Sigma_2^{-1} \Sigma_1) - d = \operatorname{tr}(B_2 K) - \operatorname{tr}(\Lambda_1 K) - \operatorname{tr}(B_2 K \Lambda_1 K).$$
Note this term does not depend on the ambient dimension $d$.
Second, we derive $\log(|\Sigma_2| / |\Sigma_1|)$: Since $|I + \Psi B_i \Psi^\top| = |I + B_i K|$,
it holds that
$$\log \frac{|\Sigma_2|}{|\Sigma_1|} = \log|I + B_1 K| - \log|I + B_2 K|.$$
Finally, we derive the quadratic term:
$$(\mu_2 - \mu_1)^\top \Sigma_2^{-1} (\mu_2 - \mu_1) = \delta^\top \Psi^\top (I + \Psi B_2 \Psi^\top) \Psi \delta = \delta^\top (K + K B_2 K)\, \delta.$$
∎
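The identity in (13) can also be checked numerically in finite dimensions by comparing it against a direct evaluation of Proposition 1 in the ambient space. The sketch below is our own illustration (it assumes $B_1$ is invertible, so that $\Lambda_1$ can be formed naively); the left-hand side involves only $M \times M$ matrices:

```python
import numpy as np

def kl_via_theorem(a1, B1, a2, B2, K):
    """Right-hand side of (13); only M x M matrices (B1 assumed invertible here)."""
    M = K.shape[0]
    lam1 = np.linalg.inv(np.linalg.inv(B1) + K)      # Lambda_1
    delta = a1 - a2
    trace_term = (np.trace(B2 @ K) - np.trace(lam1 @ K)
                  - np.trace(B2 @ K @ lam1 @ K))
    quad_term = delta @ (K + K @ B2 @ K) @ delta
    _, ld1 = np.linalg.slogdet(np.eye(M) + B1 @ K)
    _, ld2 = np.linalg.slogdet(np.eye(M) + B2 @ K)
    return 0.5 * (trace_term + quad_term + ld1 - ld2)

def kl_direct(a1, B1, a2, B2, Psi):
    """Proposition 1 evaluated in the d-dimensional ambient space."""
    d = Psi.shape[0]
    mu1, mu2 = Psi @ a1, Psi @ a2
    s1 = np.linalg.inv(np.eye(d) + Psi @ B1 @ Psi.T)  # Sigma_1
    s2_inv = np.eye(d) + Psi @ B2 @ Psi.T             # Sigma_2^{-1}
    diff = mu2 - mu1
    _, ls1 = np.linalg.slogdet(s1)
    _, ls2 = np.linalg.slogdet(np.linalg.inv(s2_inv))
    return 0.5 * (np.trace(s2_inv @ s1) - d + diff @ s2_inv @ diff + ls2 - ls1)
```

The two evaluations agree for any ambient dimension $d \geq M$, which is exactly the content of Theorem 1.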
Notes to Practitioners

The above expression is well defined even when $B_1^{-1}$ does not exist (i.e. $B_1$ is singular), because $\Lambda_1 = (B_1^{-1} + K)^{-1} = B_1 (I + K B_1)^{-1}$. In particular, we can parametrize $B = L L^\top$ with a Cholesky factor $L$ in practice, so the problem is unconstrained. The required terms can be stably computed: $\log|I + B K| = \log|I + C|$ and $(B^{-1} + K)^{-1} = L (I + C)^{-1} L^\top$, where $C = L^\top K L$.

In implementations, a Jacobi preconditioner can optionally be used. That is, we can parametrize $a$ and $L$ through $\tilde{a}$ and $\tilde{L}$ as $a = D^{-1/2} \tilde{a}$ and $L = D^{-1/2} \tilde{L}$
to improve the stability of first-order methods, where $D = \operatorname{diag}(K)$ denotes the diagonal part of $K$.
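Under the Cholesky parametrization above, the KL divergence to a normal prior can be computed without ever forming $B^{-1}$. A minimal NumPy sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def kl_to_prior(a, L, K):
    """KL(q || N(0, I)) for q with mean Psi a and covariance (I + Psi B Psi^T)^{-1},
    where B = L L^T. Works through C = L^T K L, so it remains valid when B is
    rank-deficient (L may even have fewer columns than rows)."""
    C = L.T @ K @ L
    I = np.eye(C.shape[0])
    _, logdet = np.linalg.slogdet(I + C)         # log|I + B K|
    lam = L @ np.linalg.solve(I + C, L.T)        # (B^{-1} + K)^{-1} = L (I + C)^{-1} L^T
    return 0.5 * (a @ K @ a + logdet - np.trace(lam @ K))
```

For invertible $B$ this matches the naive formula from Theorem 1, while for a low-rank $L$ the naive formula is undefined but this version still returns a valid (nonnegative) divergence.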
A.1.2 Gradients
Here we derive the gradients for the variational inference problem of svdgp. The purpose is to show the complexity of calculating the gradients. These equations are useful for implementing svdgp with basic linear algebra routines; computational-graph libraries based on automatic differentiation are also applicable and easier to use.
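As a small illustration of the kind of gradient involved, the quadratic term $\tfrac{1}{2} a^\top K a$ in the KL divergence has gradient $K a$ with respect to $a$ (for symmetric $K$), which can be verified by finite differences. This sketch is our own check, not the paper's derivation:

```python
import numpy as np

def quad_term(a, K):
    """The 0.5 a^T K a part of the KL divergence."""
    return 0.5 * a @ K @ a

def quad_grad(a, K):
    """Analytic gradient: d/da [0.5 a^T K a] = K a for symmetric K."""
    return K @ a

def finite_diff_grad(f, a, eps=1e-6):
    """Central finite differences, useful for checking analytic gradients."""
    g = np.zeros_like(a)
    for i in range(a.size):
        e = np.zeros_like(a)
        e[i] = eps
        g[i] = (f(a + e) - f(a - e)) / (2 * eps)
    return g
```

The same finite-difference check applies to the full objective when implementing the gradient equations below by hand.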
To derive the gradients, we first introduce some shorthand