Orthogonally Decoupled Variational Gaussian Processes
Abstract
Gaussian processes (GPs) provide a powerful nonparametric framework for reasoning over functions. Despite appealing theory, their superlinear computational and memory complexities have presented a longstanding challenge. State-of-the-art sparse variational inference methods trade modeling accuracy against complexity. However, the complexities of these methods still scale superlinearly in the number of basis functions, implying that sparse GP methods are able to learn from large datasets only when a small model is used. Recently, a decoupled approach was proposed that removes the unnecessary coupling between the complexities of modeling the mean and the covariance functions of a GP. It achieves a linear complexity in the number of mean parameters, so an expressive posterior mean function can be modeled. While promising, this approach suffers from optimization difficulties due to ill-conditioning and non-convexity. In this work, we propose an alternative decoupled parametrization. It adopts an orthogonal basis in the mean function to model the residues that cannot be learned by the standard coupled approach. Therefore, our method extends, rather than replaces, the coupled approach to achieve strictly better performance. This construction admits a straightforward natural gradient update rule, so the structure of the information manifold that is lost during decoupling can be leveraged to speed up learning. Empirically, our algorithm demonstrates significantly faster convergence in multiple experiments.
1 Introduction
Gaussian processes (GPs) are flexible Bayesian nonparametric models that have achieved state-of-the-art performance in a range of applications (Diggle and Ribeiro, 2007; Snoek et al., 2012). A key advantage of GP models is that they have large representational capacity, while being robust to overfitting (Rasmussen and Williams, 2006). This property is especially important for robotic applications, where there may be an abundance of data in some parts of the space but a scarcity in others (Deisenroth and Rasmussen, 2011). Unfortunately, exact inference in GPs scales cubically in computation and quadratically in memory with the size of the training set, and is only available in closed form for Gaussian likelihoods.
To learn from large datasets, variational inference provides a principled way to find tractable approximations to the true posterior. A common approach to approximate GP inference is to form a sparse variational posterior, which is designed by conditioning the prior process at a small set of inducing points (Titsias, 2009). The sparse variational framework trades accuracy against computation, but its complexities still scale superlinearly in the number of inducing points. Consequently, the representation power of the approximate distribution is greatly limited.
Various attempts have been made to reduce the complexities in order to scale up GP models for better approximation. Most of them, however, rely on certain assumptions on the kernel structure and input dimension. In the extreme, Hartikainen and Särkkä (2010) show that, for 1D-input problems, exact GP inference can be solved in linear time for kernels with finitely many nonzero derivatives. For low-dimensional inputs and stationary kernels, variational inference with structured kernel approximation (Wilson and Nickisch, 2015) or Fourier features (Hensman et al., 2018) has been proposed. Both approaches, nevertheless, scale exponentially with input dimension, except for the special case of sum-and-product kernels (Gardner et al., 2018). Approximate kernels have also been proposed as GP priors with low-rank structure (Snelson and Ghahramani, 2006; Quiñonero-Candela et al., 2005) or a sparse spectrum (Lázaro-Gredilla et al., 2010). Another family of methods partitions the input space into subsets and performs prediction aggregation (Tresp, 2000; Rasmussen and Ghahramani, 2002; Nguyen and Bonilla, 2014; Deisenroth and Ng, 2015), and Bayesian aggregation of local experts with attractive theoretical properties was recently proposed by Rullière et al. (2018).
A recent decoupled framework (Cheng and Boots, 2017) takes a different direction to address the complexity issue of GP inference. In contrast to the above approaches, this decoupled framework is agnostic to problem setups (e.g. likelihoods, kernels, and input dimensions) and extends the original sparse variational formulation (Titsias, 2009). The key idea is to represent the variational distribution in the reproducing kernel Hilbert space (RKHS) induced by the covariance function of the GP. The sparse variational posterior by Titsias (2009) turns out to be equivalent to a particular parameterization in the RKHS, where the mean and covariance both share the same basis. Cheng and Boots (2017) suggest relaxing the requirement of basis sharing. Since the computation only scales linearly in the mean parameters, many more basis functions can be used for modeling the mean function to achieve higher accuracy in prediction.
However, the original decoupled basis (Cheng and Boots, 2017) turns out to have optimization difficulties (Havasi et al., 2018). In particular, the non-convexity of the optimization problem means that a suboptimal solution may be found, leading to performance that is potentially worse than the standard coupled case. While Havasi et al. (2018) suggest using a preconditioner to mitigate the problem, their algorithm incurs an additional cubic computational cost; therefore, its applicability is limited to small, simple models.
Inspired by the success of natural gradients in variational inference (Hoffman et al., 2013; Salimbeni et al., 2018), we propose a novel RKHS parameterization of decoupled GPs that admits efficient natural gradient computation. We decompose the mean parametrization into a part that shares the basis with the covariance, and an orthogonal part that models the residues that the standard coupled approach fails to capture. We show that, with this particular choice, the natural gradient update rules further decouple into the natural gradient descent of the coupled part and the functional gradient descent of the residual part. Based on these insights, we propose an efficient optimization algorithm that preserves the desired properties of decoupled GPs and converges faster than the original formulation (Cheng and Boots, 2017).
We demonstrate that our basis is more effective than the original decoupled formulation on a range of classification and regression tasks. We show that the natural gradient updates improve convergence considerably and can lead to much better performance in practice. Crucially, we show also that our basis is more effective than the standard coupled basis for a fixed computational budget.
2 Background
We consider the inference problem of GP models. Given a dataset and a GP prior on a latent function , the goal is to infer the (approximate) posterior of for any query input . In this work, we adopt the recent decoupled RKHS reformulation of variational inference (Cheng and Boots, 2017), and, without loss of generality, we will assume is a scalar function. For notation, we use boldface to distinguish finitedimensional vectors and matrices that are used in computation from scalar and abstract mathematical objects.
2.1 Gaussian Processes and their RKHS Representation
We first review the primal and dual representations of GPs, which form the foundation of the RKHS reformulation. Let be the domain of the latent function. A GP is a distribution of functions, which is described by a mean function and a covariance function . We say a latent function is distributed according to , if for any , , , and for any finite subset is Gaussian distributed.
We call the above definition, in terms of the function values and , the primal representation of GPs. Alternatively, one can adopt a dual representation of GPs, by treating functions and as RKHS objects (Cheng and Boots, 2016). This is based on observing that the covariance function satisfies the definition of positive semidefinite functions, so can also be viewed as a reproducing kernel (Aronszajn, 1950). Specifically, given , without loss of generality, we can find an RKHS such that
(1) 
for some , bounded positive semidefinite selfadjoint operator , and feature map . Here we use to denote the inner product in , even when is infinite.
For notational clarity, we use symbols and (or ) to denote the mean and covariance functions, and symbols and to denote the RKHS objects; we use to distinguish the (approximate) posterior covariance function from the prior covariance function .
If satisfying (1), we also write .
To concretely illustrate the primal-dual connection, we consider the GP regression problem. Suppose $f \sim \mathcal{GP}(0, k)$ a priori and $y_n = f(x_n) + \epsilon_n$, where $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$. Let $\mathbf{y} = (y_n)_{n=1}^N$ and $X = (x_n)_{n=1}^N$, where the notation $(\cdot)_{n=1}^N$ denotes stacking the elements. Then, with $\mathbf{y}$ observed, it can be shown that the posterior is again a GP, where
$$m_{\mathrm{pos}}(x) = \mathbf{K}_{xX}(\mathbf{K}_{XX} + \sigma^2 \mathbf{I})^{-1}\mathbf{y}, \qquad k_{\mathrm{pos}}(x, x') = k(x, x') - \mathbf{K}_{xX}(\mathbf{K}_{XX} + \sigma^2 \mathbf{I})^{-1}\mathbf{K}_{Xx'}, \quad (2)$$
where $\mathbf{K}_{XX}$, $\mathbf{K}_{xX}$, and $\mathbf{K}_{Xx'}$ denote the covariances between the subscripted sets. Equivalently, in the dual representation,
$$\mu = \Phi_X(\mathbf{K}_{XX} + \sigma^2 \mathbf{I})^{-1}\mathbf{y}, \qquad \Sigma = I - \Phi_X(\mathbf{K}_{XX} + \sigma^2 \mathbf{I})^{-1}\Phi_X^\top, \quad (3)$$
where $\Phi_X = (\phi(x_n))_{n=1}^N$.
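As a numerical sketch of this closed-form posterior (an illustrative example with a squared-exponential kernel and synthetic data, not code from the paper):

```python
import numpy as np

def rbf(A, B, lengthscale=0.2):
    """Squared-exponential kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def exact_gp_posterior(X, y, Xs, noise_var=1e-4):
    """Closed-form GP regression posterior at test inputs Xs (O(N^3) cost)."""
    K = rbf(X, X) + noise_var * np.eye(len(X))   # K_XX + sigma^2 I
    Ks = rbf(Xs, X)                              # K_xX
    mean = Ks @ np.linalg.solve(K, y)
    cov = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

# Synthetic 1-D data: with small noise the posterior mean nearly interpolates.
X = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
mean, cov = exact_gp_posterior(X, y, X)
```

The cubic cost of the two linear solves is exactly the bottleneck that the sparse variational methods below are designed to avoid.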
2.2 Variational Inference Problem
Inference in GP models is challenging because the closedform expressions in (2) have computational complexity that is cubic in the size of the training dataset, and are only applicable for Gaussian likelihoods. For nonGaussian likelihoods (e.g. classification) or for large datasets (i.e. more than 10,000 data points), we must adopt approximate inference.
Variational inference provides a principled approach to search for an approximate but tractable posterior. It seeks a variational posterior that is close to the true posterior in terms of KL divergence, i.e. it solves . For GP models, the variational posterior must be defined over the entire function, so a natural choice is to use another GP. This choice is also motivated by the fact that the exact posterior is a GP in the case of a Gaussian likelihood as shown in (2). Using the results from Section 2.1, we can represent this posterior process via a mean and a covariance function or, equivalently, through their associated RKHS objects.
We denote these RKHS objects as and , which uniquely determine the GP posterior . In the following, without loss of generality, we shall assume that the prior GP is zeromean and the RKHS is selected such that a priori.
The variational inference problem in GP models leads to the optimization problem
(4) 
where and is the KL divergence between the approximate posterior GP and the prior GP . It can be shown that up to an additive constant (Cheng and Boots, 2017).
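In standard notation (a textbook form, with $q$ the variational posterior process and $p$ the prior process), this objective can be written as the negative evidence lower bound:

```latex
\mathcal{L}(q) \;=\; -\sum_{n=1}^{N} \mathbb{E}_{q(f(x_n))}\!\big[\log p(y_n \mid f(x_n))\big] \;+\; \mathrm{KL}\!\big[\,q \,\|\, p\,\big].
```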
2.3 Decoupled Gaussian Processes
Directly optimizing the possibly infinite-dimensional RKHS objects and is computationally intractable except for the special case of a Gaussian likelihood and a small training set size . Therefore, in practice, we need to impose a certain sparse structure on and . Inspired by the functional form of the exact solution in (3), Cheng and Boots (2017) propose to model the approximate posterior GP in the decoupled subspace parametrization (which we refer to as a decoupled basis for short) with
(5) 
where and are the sets of inducing points to specify the bases and in the RKHS, and and are the coefficients such that . With only finite perturbations from the prior, the construction in (5) ensures the KL divergence is finite (Matthews et al., 2016; Cheng and Boots, 2017) (see Appendix A). Importantly, this parameterization decouples the variational parameters for the mean and the variational parameters for the covariance . As a result, the computational complexities related to the two parts become independent, and a large set of parameters can be adopted for the mean to model complicated functions, as discussed below.
Coupled Basis
The form in (5) covers the sparse variational posterior (Titsias, 2009). Let be some fictitious inputs and let be the vector of function values. Based on the primal viewpoint of GPs, Titsias (2009) constructs the variational posterior as the posterior GP conditioned on with marginal , where and . The elements in along with and are the variational parameters to optimize. The mean and covariance functions of this process are
(6) 
which is reminiscent of the exact result in (2). Equivalently, it has the dual representation
(7) 
which conforms with the form in (5). The computational complexity of using the coupled basis reduces from to . Therefore, when is selected, the GP can be applied to learning from large datasets (Titsias, 2009).
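As a hedged sketch of the coupled predictive equations in (6), assuming a squared-exponential kernel, hypothetical inducing inputs `Z`, and variational parameters `m` and `S` (all names are our own choices):

```python
import numpy as np

def rbf(A, B):
    """Squared-exponential kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

def coupled_predict(Xs, Z, m, S):
    """Sparse variational posterior: mean K_xZ Kzz^{-1} m and covariance
    K_xx - K_xZ Kzz^{-1} (Kzz - S) Kzz^{-1} K_Zx."""
    Kzz = rbf(Z, Z) + 1e-8 * np.eye(len(Z))
    A = np.linalg.solve(Kzz, rbf(Xs, Z).T).T   # K_xZ Kzz^{-1}: O(M^3 + N M^2)
    mean = A @ m
    cov = rbf(Xs, Xs) - A @ (Kzz - S) @ A.T
    return mean, cov

# Sanity check: with m = 0 and S = Kzz, the posterior reduces to the prior.
rng = np.random.default_rng(0)
Z = rng.uniform(0, 1, (8, 1))
Xs = rng.uniform(0, 1, (15, 1))
Kzz = rbf(Z, Z) + 1e-8 * np.eye(len(Z))
mean, cov = coupled_predict(Xs, Z, np.zeros(len(Z)), Kzz)
```

The prior-recovery check in the last lines is a useful unit test for any implementation of this basis.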
Inversely Parametrized Decoupled Basis
Directly parameterizing the dual representation in (5) admits more flexibility than the primal functionvalued perspective. To ensure that the covariance of the posterior strictly decreases compared with the prior, Cheng and Boots (2017) propose a decoupled basis with an inversely parametrized covariance operator
(8) 
where and is further parametrized by its Cholesky factors in implementation. It can be shown that the choice in (8) is equivalent to setting in (5). In this parameterization, because the bases for the mean and the covariance are decoupled, the computational complexity of solving (4) with the decoupled basis in (8) becomes , as opposed to of (7). Therefore, while it is usually assumed that is in the order of , with a decoupled basis, we can freely choose for modeling complex mean functions accurately.
3 Orthogonally Decoupled Variational Gaussian Processes
While the particular decoupled basis in (8) is more expressive, its optimization problem is ill-conditioned and non-convex, and empirically slow convergence has been observed (Havasi et al., 2018). To improve the speed of learning decoupled models, we consider the use of natural gradient descent (Amari, 1998). In particular, we are interested in the update rule for natural parameters, which has empirically demonstrated impressive convergence performance over other choices of parametrization (Salimbeni et al., 2018).
However, it is unclear what the natural parameters for the general decoupled basis in (5) are, and whether finite-dimensional natural parameters even exist for such a model. In this paper, we show that when a decoupled basis is appropriately structured, natural parameters do exist. Moreover, they admit a very efficient (approximate) natural gradient update rule, as detailed in Section 3.4. As a result, large-scale decoupled models can be quickly learned, combining the fast convergence of the coupled approach (Hensman et al., 2013) with the flexibility of the decoupled approach (Cheng and Boots, 2017).
3.1 Alternative Decoupled Bases
To motivate the proposed approach, let us first introduce some alternative decoupled bases that improve the optimization properties of (8) and discuss their limitations. The inversely parameterized decoupled basis (8) is likely to have different optimization properties from the standard coupled basis (7), due to the inversion in its covariance parameterization. To avoid these potential difficulties, we reparameterize the covariance of (8) as the one in (7) and consider instead the basis
(9) 
The basis (9) can be viewed as a decoupled version of (7): it can be readily identified that setting and recovers (7). Note that we do not want to define a basis in terms of as that incurs the cubic complexity that we intend to avoid.
The alternate choice (9) addresses the difficulty in optimizing the covariance operator, but it still suffers from one serious drawback: while using more inducing points, (9) is not necessarily more expressive than the standard basis (7), for example, when is selected badly. To eliminate the worstcase setup, we can explicitly consider to be part of and use
(10) 
where . This is exactly the hybrid basis suggested in the appendix of Cheng and Boots (2017), which is strictly more expressive than (7) and yet has the same complexity as (8). Also, the explicit inclusion of inside is pivotal to defining proper finite-dimensional natural parameters, which we discuss later.
3.2 Orthogonally Decoupled Representation
But is (10) the best possible decoupled basis? Upon closer inspection, we find that there is redundancy in the parameterization of this basis: as is not orthogonal to in general, optimizing and jointly would create coupling and make the optimization landscape more ill-conditioned.
To address this issue, under the partition that , we propose a new decoupled basis as
(11) 
where , and is parametrized by its Cholesky factor. We call the model parameters and refer to (11) as the orthogonally decoupled basis, because is orthogonal to (i.e. ). By substituting and , we can compare (11) to (7): (11) has an additional part parameterized by to model the mean function residues that cannot be captured by using the inducing points alone. In prediction, our basis has time complexity in because can be precomputed.
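To make the orthogonality concrete, here is a small numerical sketch (squared-exponential kernel and made-up inducing sets, not the paper's code): subtracting the RKHS projection onto the β-basis leaves γ-features whose RKHS inner products with every basis function of the β set vanish.

```python
import numpy as np

def rbf(A, B):
    """Squared-exponential kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

rng = np.random.default_rng(0)
beta = rng.uniform(0, 1, (5, 1))    # inducing points shared with the covariance
gamma = rng.uniform(0, 1, (8, 1))   # extra inducing points for the mean

Kbb = rbf(beta, beta) + 1e-6 * np.eye(len(beta))  # jitter for stability
Kbg = rbf(beta, gamma)

def residual_features(X):
    """Evaluate k_gamma(x) - K_gb Kbb^{-1} k_beta(x): the gamma basis with its
    projection onto span{k(., beta_i)} removed."""
    return rbf(gamma, X) - Kbg.T @ np.linalg.solve(Kbb, rbf(beta, X))

# RKHS inner products <k(., beta_i), residual_j>: these vanish by construction.
G = Kbg - Kbb @ np.linalg.solve(Kbb, Kbg)
```

Because the projection term can be precomputed once, evaluating the residual features at prediction time costs only a matrix-vector product per test point.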
Compared with the original decoupled basis in (8), our choice in (11) has attractive properties:

The explicit inclusion of as a subset of leads to the existence of natural parameters.
Our setup in (11) introduces a projection operator before in the basis (10), and therefore can be viewed as the unique hybrid parametrization that confines the function modeled by to be orthogonal to the span of the basis. Consequently, there is no correlation between optimizing and , making the problem better conditioned.
3.3 Natural Parameters and Expectation Parameters
To identify the natural parameters of GPs structured as (11), we revisit the definition of natural parameters in exponential families. A distribution belongs to an exponential family if we can write , where is the sufficient statistic, is the natural parameter, is the log-partition function, and is the carrier measure.
Based on this definition, we can see that the choice of natural parameters is not unique. Suppose for some constant matrix and vector . Then is also an admissible natural parameter, because we can write , where , , and . In other words, the natural parameter is only unique up to affine transformations. If the natural parameter is transformed, the corresponding expectation parameter also transforms accordingly to . It can be shown that the Legendre primaldual relationship between and is also preserved: is also convex, and it satisfies and , where denotes the Legendre dual function (see Appendix C).
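As a concrete instance of this non-uniqueness (a standard exponential-family fact, stated in our own symbols), the univariate Gaussian has

```latex
p(x \mid \mu, \sigma^2) \propto \exp\!\left(\tfrac{\mu}{\sigma^2}\,x \;-\; \tfrac{1}{2\sigma^2}\,x^2\right),
\qquad
\theta = \left(\tfrac{\mu}{\sigma^2},\, -\tfrac{1}{2\sigma^2}\right),
\qquad
\eta = \left(\mathbb{E}[x],\, \mathbb{E}[x^2]\right) = \left(\mu,\, \mu^2 + \sigma^2\right),
```

and any invertible affine map of $\theta$ is an equally valid natural parameter, with $\eta$ transformed accordingly.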
We use this trick to identify the natural and expectation parameters of (11).
Natural Parameters
Recall that for Gaussian distributions the natural parameters are conventionally defined as . Therefore, to find the natural parameters of (11), it suffices to show that of (11) can be written as an affine transformation of some finite-dimensional parameters. The matrix inversion lemma and the orthogonality of and yield
(12) 
Expectation Parameters
Once the new natural parameters are selected, we can also derive the corresponding expectation parameters. Recall that for the natural parameters , the associated expectation parameters are . Using the relationship between transformed natural and expectation parameters, we find the expectation parameters of (11) using the adjoint operators: and , where we have
(13) 
Note that the equations for in (3.3) and (13) exhibit exactly the same relationship between natural and expectation parameters as in the standard Gaussian case, i.e. .
3.4 Natural Gradient Descent
Natural gradient descent updates parameters according to the information geometry induced by the KL divergence (Amari, 1998). It is invariant to reparametrization and can normalize the problem to be well conditioned (Martens, 2014). Let be the Fisher information matrix, where denotes a distribution with natural parameter . Natural gradient descent for natural parameters performs the update where is the step size. Because directly computing the inverse is computationally expensive, we use the duality between natural and expectation parameters in exponential families and adopt the equivalent update (Hoffman et al., 2013; Salimbeni et al., 2018).
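This expectation-parameter trick can be illustrated on a toy problem (a univariate-Gaussian sketch of our own, not the paper's model). For exponential families, the KL divergence to a target distribution, viewed as a function of the expectation parameter, has gradient equal to the difference of natural parameters; a natural gradient step in the natural parameters with unit step size therefore lands exactly on the target:

```python
import numpy as np

def to_natural(mean, var):
    """Natural parameters (mu/var, -1/(2 var)) of a univariate Gaussian."""
    return np.array([mean / var, -0.5 / var])

def to_mean_var(theta):
    """Invert the natural parametrization back to (mean, variance)."""
    var = -0.5 / theta[1]
    return theta[0] * var, var

# Target p and initial q (made-up values).
theta_p = to_natural(2.0, 0.5)
theta_q = to_natural(-1.0, 4.0)

# dKL(q||p)/d eta_q = theta_q - theta_p, so the natural-gradient step
# theta <- theta - step * dKL/d eta interpolates toward the target, and a
# unit step recovers it exactly.
step = 1.0
theta_q = theta_q - step * (theta_q - theta_p)

mean, var = to_mean_var(theta_q)
```

Smaller step sizes trace a convex combination of the current and target natural parameters, which is the behavior the exact update rules below inherit.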
Exact Update Rules
For our basis in (11), the natural gradient descent step can be written as
(14) 
where we recall is the negative variational lower bound in (4), in (3.3), and in (13). As is defined in terms of , to compute these derivatives we use the chain rule (provided by the relationship in Figure 1) and obtain
(15a)  
(15b)  
(15c) 
Due to the orthogonal choice of natural parameters, the updates for the and parts are independent. Furthermore, one can show that the update for and is exactly the same as the natural gradient descent rule for the standard coupled basis (Hensman et al., 2013), and that the update for the residue part is equivalent to functional gradient descent (Kivinen et al., 2004) in the subspace orthogonal to the span of .
Approximate Update Rule
We described the natural gradient descent update for the orthogonally decoupled GPs in (11). However, in the regime where , computing (15a) becomes infeasible. Here we propose an approximation of (15a) by approximating with a diagonal-plus-low-rank structure. Because the inducing points are selected to globally approximate the function landscape, one sensible choice is to approximate with a Nyström approximation based on and a diagonal correction term: , where , , and denotes extracting the diagonal part of a matrix. FITC (Snelson and Ghahramani, 2006) uses a similar idea to approximate the prior distribution (Quiñonero-Candela et al., 2005), whereas here it is used to derive an approximate update rule without changing the problem. This leads to a simple update rule
(16) 
where a jitter is added to ensure stability. This update rule uses a diagonal scaling , which is independent of and . Therefore, while one could directly use the update (16), in implementation we propose to replace (16) with an adaptive coordinate-wise gradient descent algorithm (e.g. ADAM, Kingma and Ba (2015)) to update the part. Due to the orthogonal structure, the overall computational complexity is in . While this is more than the complexity of the original decoupled approach (Cheng and Boots, 2017), the experimental results suggest the additional computation is worth the large performance improvement.
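A numerical sketch of the Nyström-plus-diagonal surrogate described above (squared-exponential kernel and made-up point sets; all symbols are our own):

```python
import numpy as np

def rbf(A, B, lengthscale=0.5):
    """Squared-exponential kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (50, 1))   # stand-in for the large gamma set
Z = rng.uniform(0, 1, (10, 1))   # beta inducing points

Kxx = rbf(X, X)
Kxz = rbf(X, Z)
Kzz = rbf(Z, Z) + 1e-6 * np.eye(len(Z))    # jitter for stability

Q = Kxz @ np.linalg.solve(Kzz, Kxz.T)      # Nystrom approximation of Kxx
K_hat = Q + np.diag(np.diag(Kxx - Q))      # diagonal correction
```

The corrected matrix matches the true kernel matrix exactly on the diagonal, which is what makes the cheap diagonal scaling in (16) possible without any cubic solve in the larger set.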
4 Results
We empirically assess the performance of our algorithm in multiple regression and classification tasks. We show that:
(i) given fixed wall-clock time, the proposed orthogonally decoupled basis outperforms existing approaches;
(ii) given the same number of inducing points for the covariance, our method almost always improves on the coupled approach (in contrast to the previous decoupled bases);
(iii) using natural gradients can dramatically improve performance, especially in regression.
We compare updating our orthogonally decoupled basis with adaptive gradient descent using the Adam optimizer (Kingma and Ba, 2015) (Orth), and using the approximate natural gradient descent rule described in Section 3.4 (OrthNat).
As baselines, we consider the original decoupled approach of Cheng and Boots (2017) (Decoupled) and the hybrid approach suggested in their Appendix (Hybrid). We compare also to the standard coupled basis with and without natural gradients (CoupledNat and Coupled, respectively). We make generic choices for hyperparameters, inducing point initializations, and data processing, which are detailed in Appendix E.
Our code
Illustration
Figures 1(a) and 1(b) show a simplified setting to illustrate the difference between the methods. In this example, we fixed the inducing inputs and hyperparameters, and optimized the rest of the variational parameters. All the decoupled methods then have the same global optimum, so we can easily assess their convergence properties. With a Gaussian likelihood we also computed the optimal solution analytically as an optimal baseline, although this requires inverting an -sized matrix and therefore is not useful as a practical method. We include the expressions of the optimal solution in Appendix D. We set for all bases and conducted experiments on the 3droad dataset (, ) for regression with a Gaussian likelihood and the ringnorm dataset (, ) for classification with a Bernoulli likelihood. Overall, the natural gradient methods converge much faster than their ordinary gradient counterparts. Decoupled fails to converge to the optimum after 20K iterations, even in the Gaussian case. We emphasize that, unlike our proposed approaches, Decoupled leads to a non-convex optimization problem.
Wallclock comparison
To investigate large-scale performance, we used 3droad with a large number of inducing points. We used a computer with a Tesla K40 GPU and found that, in wall-clock time, the orthogonally decoupled basis with was equivalent to a coupled model with (about seconds per iteration) in our tensorflow (Abadi et al., 2016) implementation. Under this setting, we show the ELBO in Figure 1(c) and the test log-likelihood and accuracy in Figure 3 of Appendix F. OrthNat performs the best, both in terms of log-likelihood and accuracy. The highest test log-likelihood of OrthNat is , followed by Orth (). Coupled () and CoupledNat () both outperform Hybrid () and Decoupled ().
Regression benchmarks
We applied our models to 12 regression datasets ranging from 15K to 2M points. To enable feasible computation on multiple datasets, we downscale (but keep the same ratio) the number of inducing points to for the decoupled models, and for the coupled model. We compare also to the coupled model with to establish whether extra computation always improves the performance of the decoupled basis. The test mean absolute error (MAE) results are summarized in Table 1, and the full results for both test log-likelihood and MAE are given in Appendix F. Overall, OrthNat is the most competitive basis, and by all measures the orthogonal bases outperform their coupled counterparts with the same , except for Hybrid and Decoupled.
Coupled  CoupledNat  Coupled  CoupledNat  Orth  OrthNat  Hybrid  Decoupled  

Mean  0.298  0.295  0.291  0.290  0.284  0.282  0.298  0.361 
Median  0.221  0.219  0.215  0.213  0.211  0.210  0.225  0.299 
Avg Rank  6.083(0.19)  5.00(0.33)  3.750(0.26)  2.417(0.31)  2.500(0.47)  1.833(0.35)  6.417(0.23)  8(0.00) 
Classification benchmarks
We compare our method with state-of-the-art fully connected neural networks with Selu activations (Klambauer et al., 2017). We adopted the experimental setup from (Klambauer et al., 2017), using the largest 19 datasets (4898 to 130000 data points). For the binary datasets we used the Bernoulli likelihood, and for the multiclass datasets we used the robustmax likelihood (Hernández-Lobato et al., 2011). The same basis settings as for the regression benchmarks were used here. Orth performs the best in terms of median, and OrthNat is best ranked. The neural network wins in terms of mean, because it substantially outperforms all the GP models on one particular dataset (chesskrvk), which skews the mean performance over the 19 datasets. We see that our orthogonal bases on average improve on the coupled bases with equivalent wall-clock time, although for some datasets the coupled bases are superior. Unlike in the regression case, it is not always the case that using natural gradients improves performance, although on average it does. This holds for both the coupled and decoupled bases.
Selu  Coupled  CoupledNat  Orth  OrthNat  Hybrid  Decoupled  

Mean  91.6  90.4  90.2  90.6  90.3  89.9  89.0 
Median  93.1  94.8  93.6  95.6  93.6  93.4  92.0 
Average rank  4.16(0.67)  3.89(0.42)  3.53(0.45)  3.68(0.35)  3.42(0.31)  3.89(0.38)  5.42(0.51) 
Overall, the empirical results demonstrate that the orthogonally decoupled basis is superior to the coupled basis with the same wall-clock time, averaged over datasets. It is important to note that for the same , adding extra increases performance for the orthogonally decoupled basis in almost all cases, but not for Hybrid or Decoupled. While this does add computation, the ratio between the extra computation for additional and that for additional decreases to zero as increases. That is, eventually the cubic scaling in will dominate the linear scaling in .
5 Conclusion
We present a novel orthogonally decoupled basis for variational inference in GP models. Our basis is constructed by extending the standard coupled basis with an additional component to model the mean residues. Therefore, it extends the standard coupled basis (Titsias, 2009; Hensman et al., 2013) and achieves better performance. We show how the natural parameters of our decoupled basis can be identified and propose an approximate natural gradient update rule, which significantly improves the optimization performance over the original decoupled approach (Cheng and Boots, 2017). Empirically, our method demonstrates strong performance in multiple regression and classification tasks.
Appendix
Appendix A Variational Inference Problem
In this section, we provide details of implementing the variational inference problem
(4) 
when the variational posterior is parameterized using a decoupled basis
(5) 
Without loss of generality, we assume . That is, we focus on the following form of parametrization with ,
(17) 
a.1 KL Divergence
We first show how the KL divergence can be computed using finite-dimensional variables. The proof is similar to the strategy in (Cheng and Boots, 2017, Appendix).
Proposition A.1.
For and with
Then it satisfies
For the orthogonally decoupled basis (11) in particular, we can write
a.2 Expected Log-Likelihoods
The expected loglikelihood can be computed by first computing the predictive Gaussian distribution for each data point . For example, for the orthogonally decoupled basis (11), this is given as
Given and , the expected log-likelihood can be computed exactly (in the Gaussian case) or using a quadrature approximation.
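A hedged sketch of the quadrature route (helper names are our own; a Gaussian likelihood is used so the closed form is available for comparison):

```python
import numpy as np

def expected_loglik(y, m, v, noise_var, n_points=20):
    """E_{f ~ N(m, v)}[log N(y | f, noise_var)] via Gauss-Hermite quadrature."""
    # hermegauss targets the weight exp(-x^2/2); normalize by sqrt(2*pi).
    x, w = np.polynomial.hermite_e.hermegauss(n_points)
    f = m + np.sqrt(v) * x                      # quadrature nodes under N(m, v)
    ll = -0.5 * np.log(2 * np.pi * noise_var) - 0.5 * (y - f) ** 2 / noise_var
    return (w * ll).sum() / np.sqrt(2 * np.pi)

def expected_loglik_exact(y, m, v, noise_var):
    """Closed form for the Gaussian likelihood: E[(y - f)^2] = (y - m)^2 + v."""
    return -0.5 * np.log(2 * np.pi * noise_var) - ((y - m) ** 2 + v) / (2 * noise_var)
```

Since the Gaussian log-likelihood is quadratic in the latent function value, the quadrature is exact here; for non-Gaussian likelihoods (e.g. Bernoulli) only the quadrature route applies.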
a.3 Gradient Computation
Using the above formulas, a differentiable computational graph can be constructed, and the gradient can then be computed using automatic differentiation. When in the KL divergence is further approximated by column sampling , an unbiased gradient can be computed in time complexity .
Appendix B Convexity of the Variational Inference Problem
Here we show that the objective function in (4) is strictly convex in if the likelihood is log-strictly-concave.
b.1 KL Divergence
We first study the KL divergence term. It is easy to see that it is strongly convex in . When , where is lower triangular with positive diagonal terms, the KL divergence is strongly convex in as well. To see this, we notice that
is strictly convex and
is strongly convex, because .
b.2 Expected Log-Likelihood
Here we show that the negative expected log-likelihood part is strictly convex. For the negative expected log-likelihood, let and we can write
in which and .
Then we give a lemma below.
Lemma B.1.
Suppose is strictly convex. Then is also strictly convex.
Proof.
Let and . Let .
Because is strictly convex when the likelihood is log-strictly-concave and is linearly parametrized, the desired strict convexity follows.
Appendix C Uniqueness of Parametrization and Natural Parameters
Here we provide some additional details regarding natural parameters and natural gradient descent.
c.1 Necessity of Including as Subset of
We show that the partition condition in Section 3.2 is necessary to derive proper natural parameters. Suppose on the contrary that is a general set of inducing points. Using a derivation similar to that of Section 3.3, we can show that
where and . Therefore, we might consider choosing as a candidate for natural parameters. However the above choice of parametrization is actually coupled due to the condition that and have to satisfy, i.e.
Thus, they cannot satisfy the requirement of being natural parameters. This is mainly because is given in one basis only, whereas is given in both and bases.
c.2 Alternate Choices of Natural Parameters
As discussed previously in Section 3.3, the choice of natural parameters is only unique up to affine transformation. While in this paper we propose to use the unique orthogonal version, other choices of parametrization are possible. For instance, here we consider the hybrid parametrization in (Cheng and Boots, 2017, appendix) and give an overview on finding its natural parameters.
The hybrid parametrization uses the following decoupled basis:
To facilitate a clear comparison, here we remove the in the original form suggested by Cheng and Boots (2017), which uses . Note in the experiments, their original form was used.
As the covariance part above is the same form as our orthogonally decoupled basis in (11), here we only consider the mean part. Following a similar derivation, we can write
That is, we can choose the natural parameters as
(18) 
This set of natural parameters, unlike the one in the previous section, is proper, because is included as a subset of .
c.3 Invariance of Natural Gradient Descent
As discussed above, the choice of natural parameters for the mean part is not unique, but here we show they all lead to the same natural gradient descent update. Therefore, our orthogonal choice (11), among all possible equivalent parameterizations, has the cleanest update rule.
This equivalence between different parameterizations can be easily seen from the fact that the KL divergence between Gaussians is quadratic in . Therefore, the natural gradient update of takes the form of the proximal update below
for some function , vector and positivedefinite matrix .
To see the invariance under invertible linear transformations, suppose we reparametrize the above as and , for some invertible and . Then the update becomes
which represents the same update step in because
c.4 Transformation of Natural Parameters and Expectation Parameters
Here we provide a more rigorous proof of identifying natural and expectation parameters of decoupled bases, as the density function , which is used to illustrate the idea in Section 3.3, is not defined for GPs. Here we show the transformation of natural parameters and expectation parameters based on KL divergence. We start from dimensional exponential families and then show that the formulation extends to arbitrary .
Consider a finite-dimensional exponential family. Its KL divergence can be written as
(19) 
where is the log-partition function, is the Legendre dual of , is the expectation parameter, and is the natural parameter. It satisfies the duality properties that and .
As (19) is expressed in terms of inner products, it holds for arbitrary and remains finite for GPs with a decoupled basis (Cheng and Boots, 2017). Therefore, we show here that when we parametrize the problem by , is also a candidate natural parameter satisfying (19) for some transformed expectation parameter . This can be shown as follows:
where we define
It can be verified that is indeed the Legendre dual of .