
# Orthogonally Decoupled Variational Gaussian Processes

## Abstract

Gaussian processes (GPs) provide a powerful non-parametric framework for reasoning over functions. Despite appealing theory, its superlinear computational and memory complexities have presented a long-standing challenge. State-of-the-art sparse variational inference methods trade modeling accuracy against complexity. However, the complexities of these methods still scale superlinearly in the number of basis functions, implying that sparse GP methods are able to learn from large datasets only when a small model is used. Recently, a decoupled approach was proposed that removes the unnecessary coupling between the complexities of modeling the mean and the covariance functions of a GP. It achieves a linear complexity in the number of mean parameters, so an expressive posterior mean function can be modeled. While promising, this approach suffers from optimization difficulties due to ill-conditioning and non-convexity. In this work, we propose an alternative decoupled parametrization. It adopts an orthogonal basis in the mean function to model the residues that cannot be learned by the standard coupled approach. Therefore, our method extends, rather than replaces, the coupled approach to achieve strictly better performance. This construction admits a straightforward natural gradient update rule, so the structure of the information manifold that is lost during decoupling can be leveraged to speed up learning. Empirically, our algorithm demonstrates significantly faster convergence in multiple experiments.


## 1 Introduction

Gaussian processes (GPs) are flexible Bayesian non-parametric models that have achieved state-of-the-art performance in a range of applications (Diggle and Ribeiro, 2007; Snoek et al., 2012). A key advantage of GP models is that they have large representational capacity, while being robust to overfitting (Rasmussen and Williams, 2006). This property is especially important for robotic applications, where there may be an abundance of data in some parts of the space but a scarcity in others (Deisenroth and Rasmussen, 2011). Unfortunately, exact inference in GPs scales cubically in computation and quadratically in memory with the size of the training set, and is only available in closed form for Gaussian likelihoods.

To learn from large datasets, variational inference provides a principled way to find tractable approximations to the true posterior. A common approach to approximate GP inference is to form a sparse variational posterior, which is designed by conditioning the prior process at a small set of inducing points (Titsias, 2009). The sparse variational framework trades accuracy against computation, but its complexities still scale superlinearly in the number of inducing points. Consequently, the representation power of the approximate distribution is greatly limited.

Various attempts have been made to reduce the complexities in order to scale up GP models for better approximation. Most of them, however, rely on certain assumptions on the kernel structure and input dimension. In the extreme, Hartikainen and Särkkä (2010) show that, for 1D-input problems, exact GP inference can be solved in linear time for kernels with finitely many non-zero derivatives. For low-dimensional inputs and stationary kernels, variational inference with structured kernel approximation (Wilson and Nickisch, 2015) or Fourier features (Hensman et al., 2018) has been proposed. Both approaches, nevertheless, scale exponentially with input dimension, except for the special case of sum-and-product kernels (Gardner et al., 2018). Approximate kernels have also been proposed as GP priors with low-rank structure (Snelson and Ghahramani, 2006; Quiñonero-Candela et al., 2005) or a sparse spectrum (Lázaro-Gredilla et al., 2010). Another family of methods partitions the input space into subsets and performs prediction aggregation (Tresp, 2000; Rasmussen and Ghahramani, 2002; Nguyen and Bonilla, 2014; Deisenroth and Ng, 2015), and Bayesian aggregation of local experts with attractive theoretical properties was recently proposed by Rullière et al. (2018).

A recent decoupled framework (Cheng and Boots, 2017) takes a different direction to address the complexity issue of GP inference. In contrast to the above approaches, this decoupled framework is agnostic to problem setups (e.g. likelihoods, kernels, and input dimensions) and extends the original sparse variational formulation (Titsias, 2009). The key idea is to represent the variational distribution in the reproducing kernel Hilbert space (RKHS) induced by the covariance function of the GP. The sparse variational posterior by Titsias (2009) turns out to be equivalent to a particular parameterization in the RKHS, where the mean and covariance both share the same basis. Cheng and Boots (2017) suggest to relax the requirement of basis sharing. Since the computation only scales linearly in the mean parameters, many more basis functions can be used for modeling the mean function to achieve higher accuracy in prediction.

However, the original decoupled basis (Cheng and Boots, 2017) turns out to have optimization difficulties (Havasi et al., 2018). In particular, the non-convexity of the optimization problem means that a suboptimal solution may be found, leading to performance that is potentially worse than the standard coupled case. While Havasi et al. (2018) suggest using a pre-conditioner to ameliorate the problem, their algorithm incurs an additional cubic computational cost; therefore, its applicability is limited to small, simple models.

Inspired by the success of natural gradients in variational inference (Hoffman et al., 2013; Salimbeni et al., 2018), we propose a novel RKHS parameterization of decoupled GPs that admits efficient natural gradient computation. We decompose the mean parametrization into a part that shares the basis with the covariance, and an orthogonal part that models the residues that the standard coupled approach fails to capture. We show that, with this particular choice, the natural gradient update rules further decouple into the natural gradient descent of the coupled part and the functional gradient descent of the residual part. Based on these insights, we propose an efficient optimization algorithm that preserves the desired properties of decoupled GPs and converges faster than the original formulation (Cheng and Boots, 2017).

We demonstrate that our basis is more effective than the original decoupled formulation on a range of classification and regression tasks. We show that the natural gradient updates improve convergence considerably and can lead to much better performance in practice. Crucially, we show also that our basis is more effective than the standard coupled basis for a fixed computational budget.

## 2 Background

We consider the inference problem of GP models. Given a dataset $\{(x_n, y_n)\}_{n=1}^N$ and a GP prior on a latent function $f$, the goal is to infer the (approximate) posterior of $f(x)$ for any query input $x$. In this work, we adopt the recent decoupled RKHS reformulation of variational inference (Cheng and Boots, 2017), and, without loss of generality, we will assume $f$ is a scalar function. For notation, we use boldface to distinguish finite-dimensional vectors and matrices that are used in computation from scalar and abstract mathematical objects.

### 2.1 Gaussian Processes and their RKHS Representation

We first review the primal and dual representations of GPs, which form the foundation of the RKHS reformulation. Let $\mathcal{X}$ be the domain of the latent function. A GP is a distribution over functions, described by a mean function $m(\cdot)$ and a covariance function $k(\cdot,\cdot)$. We say a latent function $f$ is distributed according to $\mathcal{GP}(m, k)$ if, for any $x, x' \in \mathcal{X}$, $\mathbb{E}[f(x)] = m(x)$ and $\mathrm{Cov}[f(x), f(x')] = k(x, x')$, and the function values on any finite subset of $\mathcal{X}$ are jointly Gaussian distributed.

We call the above definition, in terms of the function values $m(x)$ and $k(x,x')$, the primal representation of GPs. Alternatively, one can adopt a dual representation of GPs by treating the mean and covariance functions as RKHS objects (Cheng and Boots, 2016). This is based on observing that the covariance function satisfies the definition of positive semi-definite functions, so it can also be viewed as a reproducing kernel (Aronszajn, 1950). Specifically, given $\mathcal{GP}(m, k)$, without loss of generality, we can find an RKHS $\mathcal{H}$ such that

$$m(x) = \phi(x)^\top \mu, \qquad k(x, x') = \phi(x)^\top \Sigma\, \phi(x') \tag{1}$$

for some $\mu \in \mathcal{H}$, bounded positive semidefinite self-adjoint operator $\Sigma$, and feature map $\phi(\cdot)$. Here we use $^\top$ to denote the inner product in $\mathcal{H}$, even when $\mathcal{H}$ is infinite-dimensional. For notational clarity, we use symbols $m$ and $k$ (or $s$) to denote the mean and covariance functions, and symbols $\mu$ and $\Sigma$ to denote the RKHS objects; we use $s$ to distinguish the (approximate) posterior covariance function from the prior covariance function $k$. If (1) is satisfied, we also write $\mathcal{GP}(\mu, \Sigma)$.

To concretely illustrate the primal-dual connection, we consider the GP regression problem. Suppose a priori $f \sim \mathcal{GP}(0, k)$ and $y_n = f(x_n) + \epsilon_n$, where $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$ is i.i.d. noise. Let $X = \{x_n\}_{n=1}^N$ and $y = (y_n)_{n=1}^N$, where the notation $(\cdot)$ denotes stacking the elements. Then, with $y$ observed, it can be shown that a posteriori $f \sim \mathcal{GP}(m, s)$, where

$$m(x) = k_{x,X}(K_X + \sigma^2 I)^{-1} y, \qquad s(x, x') = k_{x,x'} - k_{x,X}(K_X + \sigma^2 I)^{-1} k_{X,x'}, \tag{2}$$

where $K_X$, $k_{x,X}$, and $k_{X,x'}$ denote the (cross-)covariances between the subscripted sets. We can also equivalently write the posterior GP in (2) in its dual RKHS representation: suppose the feature map is selected such that $k(x, x') = \phi(x)^\top \phi(x')$; then $\mu = 0$ and $\Sigma = I$ a priori, and a posteriori

$$\mu = \Phi_X (K_X + \sigma^2 I)^{-1} y, \qquad \Sigma = I - \Phi_X (K_X + \sigma^2 I)^{-1} \Phi_X^\top, \tag{3}$$

where $\Phi_X$ denotes the operator formed by stacking the features $\phi(x_n)$ for $x_n \in X$.
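To make the primal computation in (2) concrete, here is a minimal NumPy sketch of exact GP regression; the squared-exponential kernel and all function names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel matrix between row-stacked inputs A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def exact_gp_posterior(X, y, Xs, sigma2=1e-4):
    # Posterior mean and covariance of eq. (2) at test inputs Xs.
    # Cost is cubic in N because of the (K_X + sigma^2 I) solves.
    K = rbf(X, X) + sigma2 * np.eye(len(X))
    alpha = np.linalg.solve(K, y)                 # (K_X + s^2 I)^{-1} y
    Ks = rbf(Xs, X)                               # k_{x,X}
    mean = Ks @ alpha
    cov = rbf(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov
```

At the training inputs, the posterior mean interpolates the targets up to the noise level and the posterior variance drops below the prior variance, which gives a quick sanity check of the formulas.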

### 2.2 Variational Inference Problem

Inference in GP models is challenging because the closed-form expressions in (2) have computational complexity that is cubic in the size of the training dataset, and are only applicable for Gaussian likelihoods. For non-Gaussian likelihoods (e.g. classification) or for large datasets (i.e. more than 10,000 data points), we must adopt approximate inference.

Variational inference provides a principled approach to search for an approximate but tractable posterior. It seeks a variational posterior $q$ that is close to the true posterior in terms of KL divergence, i.e. it solves $\min_q \mathrm{KL}(q\,\|\,p(\cdot|y))$. For GP models, the variational posterior must be defined over the entire function, so a natural choice is to use another GP. This choice is also motivated by the fact that the exact posterior is a GP in the case of a Gaussian likelihood, as shown in (2). Using the results from Section 2.1, we can represent this posterior process via a mean and a covariance function or, equivalently, through their associated RKHS objects.

We denote these RKHS objects as $\mu$ and $\Sigma$, which uniquely determine the GP posterior $q = \mathcal{GP}(\mu, \Sigma)$. In the following, without loss of generality, we shall assume that the prior GP is zero-mean and the RKHS is selected such that $\mu = 0$ and $\Sigma = I$ a priori.

The variational inference problem in GP models leads to the optimization problem

$$\mathcal{L}(q) = -\sum_{n=1}^N \mathbb{E}_{q(f(x_n))}\!\left[\log p(y_n | f(x_n))\right] + \mathrm{KL}(q\,\|\,p), \tag{4}$$

where $q = \mathcal{GP}(\mu, \Sigma)$ and $\mathrm{KL}(q\,\|\,p)$ is the KL divergence between the approximate posterior GP $q$ and the prior GP $p$. It can be shown that minimizing (4) is equivalent to minimizing $\mathrm{KL}(q\,\|\,p(\cdot|y))$ up to an additive constant (Cheng and Boots, 2017).

### 2.3 Decoupled Gaussian Processes

Directly optimizing the possibly infinite-dimensional RKHS objects $\mu$ and $\Sigma$ is computationally intractable except for the special case of a Gaussian likelihood and a small training set size $N$. Therefore, in practice, we need to impose a certain sparse structure on $\mu$ and $\Sigma$. Inspired by the functional form of the exact solution in (3), Cheng and Boots (2017) propose to model the approximate posterior GP in the decoupled subspace parametrization (which we will refer to as a decoupled basis for short) with

$$\mu = \Psi_\alpha a, \qquad \Sigma = I + \Psi_\beta A \Psi_\beta^\top, \tag{5}$$

where $\alpha$ and $\beta$ are the sets of inducing points specifying the bases $\Psi_\alpha$ and $\Psi_\beta$ in the RKHS, and $a$ and $A$ are the coefficients chosen such that $\Sigma$ remains positive definite. With only finite perturbations from the prior, the construction in (5) ensures the KL divergence is finite (Matthews et al., 2016; Cheng and Boots, 2017) (see Appendix A). Importantly, this parameterization decouples the variational parameters $(\alpha, a)$ for the mean from the variational parameters $(\beta, A)$ for the covariance. As a result, the computational complexities related to the two parts become independent, and a large set of parameters can be adopted for the mean to model complicated functions, as discussed below.

#### Coupled Basis

The form in (5) covers the sparse variational posterior (Titsias, 2009). Let $Z$ be some fictitious inputs (the inducing points) and let $f_Z$ be the vector of function values at $Z$. Based on the primal viewpoint of GPs, Titsias (2009) constructs the variational posterior as the posterior GP conditioned on $f_Z$ with marginal $q(f_Z) = \mathcal{N}(m, S)$, where $m$ is the mean and $S$ the covariance. The elements in $Z$ along with $m$ and $S$ are the variational parameters to optimize. The mean and covariance functions of this process are

$$m(x) = k_{x,Z} K_Z^{-1} m, \qquad s(x, x') = k_{x,x'} + k_{x,Z} K_Z^{-1} (S - K_Z) K_Z^{-1} k_{Z,x'}, \tag{6}$$

which is reminiscent of the exact result in (2). Equivalently, it has the dual representation

$$\mu = \Psi_Z K_Z^{-1} m, \qquad \Sigma = I + \Psi_Z K_Z^{-1} (S - K_Z) K_Z^{-1} \Psi_Z^\top, \tag{7}$$

which conforms with the form in (5). The computational complexity of using the coupled basis reduces from $O(N^3)$ to $O(N|Z|^2 + |Z|^3)$. Therefore, when $|Z| \ll N$ is selected, the GP can be applied to learning from large datasets (Titsias, 2009).
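As an illustration of the coupled predictive equations (6), the following NumPy sketch computes the sparse mean and covariance from the variational parameters $(Z, m, S)$; the unit-lengthscale RBF kernel and all names are illustrative assumptions.

```python
import numpy as np

def rbf(A, B):
    # Unit-lengthscale squared-exponential kernel (an illustrative choice).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

def coupled_predict(Xs, Z, m, S):
    # Mean and covariance of eq. (6); cost is cubic only in |Z|.
    Kz = rbf(Z, Z)
    A = np.linalg.solve(Kz, rbf(Xs, Z).T).T     # k_{x,Z} K_Z^{-1}
    mean = A @ m
    cov = rbf(Xs, Xs) + A @ (S - Kz) @ A.T
    return mean, cov
```

Setting $S = K_Z$ and $m = 0$ recovers the prior, which makes the role of the variational parameters easy to verify.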

#### Inversely Parametrized Decoupled Basis

Directly parameterizing the dual representation in (5) admits more flexibility than the primal function-valued perspective. To ensure that the covariance of the posterior strictly decreases compared with the prior, Cheng and Boots (2017) propose a decoupled basis with an inversely parametrized covariance operator

$$\mu = \Psi_\alpha a, \qquad \Sigma = (I + \Psi_\beta B \Psi_\beta^\top)^{-1}, \tag{8}$$

where $B \succeq 0$ is further parametrized by its Cholesky factor in implementation. It can be shown that the choice in (8) is equivalent to setting $A = -(B^{-1} + K_\beta)^{-1}$ in (5). In this parameterization, because the bases for the mean and the covariance are decoupled, the computational complexity of solving (4) with the decoupled basis in (8) becomes linear in $|\alpha|$ and cubic in $|\beta|$, as opposed to cubic in $|Z|$ for (7). Therefore, while it is usually assumed that $|Z|$ is in the order of hundreds, with a decoupled basis we can freely choose $|\alpha| \gg |\beta|$ for modeling complex mean functions accurately.

## 3 Orthogonally Decoupled Variational Gaussian Processes

While the particular decoupled basis in (8) is more expressive, its optimization problem is ill-conditioned and non-convex, and empirically slow convergence has been observed (Havasi et al., 2018). To improve the speed of learning decoupled models, we consider the use of natural gradient descent (Amari, 1998). In particular, we are interested in the update rule for natural parameters, which has empirically demonstrated impressive convergence performance over other choices of parametrizations (Salimbeni et al., 2018).

However, it is unclear what the natural parameters for the general decoupled basis in (5) are, and whether finite-dimensional natural parameters even exist for such a model. In this paper, we show that when a decoupled basis is appropriately structured, natural parameters do exist. Moreover, they admit a very efficient (approximate) natural gradient update rule, as detailed in Section 3.4. As a result, large-scale decoupled models can be quickly learned, joining the fast convergence property of the coupled approach (Hensman et al., 2013) and the flexibility of the decoupled approach (Cheng and Boots, 2017).

### 3.1 Alternative Decoupled Bases

To motivate the proposed approach, let us first introduce some alternative decoupled bases that improve the optimization properties of (8) and discuss their limitations. The inversely parameterized decoupled basis (8) is likely to have different optimization properties from the standard coupled basis (7), due to the inversion in its covariance parameterization. To avoid these potential difficulties, we reparameterize the covariance of (8) as the one in (7) and consider instead the basis

$$\mu = \Psi_\alpha a, \qquad \Sigma = (I - \Psi_\beta K_\beta^{-1} \Psi_\beta^\top) + \Psi_\beta K_\beta^{-1} S K_\beta^{-1} \Psi_\beta^\top. \tag{9}$$

The basis (9) can be viewed as a decoupled version of (7): it can be readily identified that setting $\alpha = \beta = Z$ and $a = K_Z^{-1} m$ recovers (7). Note that we do not want to define a basis in terms of $K_\alpha^{-1}$, as that incurs the cubic complexity in $|\alpha|$ that we intend to avoid.

The alternate choice (9) addresses the difficulty in optimizing the covariance operator, but it still suffers from one serious drawback: despite using more inducing points, (9) is not necessarily more expressive than the standard basis (7), for example, when $\alpha$ is selected badly. To eliminate this worst-case setup, we can explicitly include $\beta$ as part of $\alpha$ and use

$$\mu = \Psi_\gamma a_\gamma + \Psi_\beta K_\beta^{-1} m_\beta, \qquad \Sigma = (I - \Psi_\beta K_\beta^{-1} \Psi_\beta^\top) + \Psi_\beta K_\beta^{-1} S K_\beta^{-1} \Psi_\beta^\top, \tag{10}$$

where $\alpha = \gamma \cup \beta$. This is exactly the hybrid basis suggested in the appendix of Cheng and Boots (2017), which is strictly more expressive than (7) and yet has the same complexity as (8). Also, the explicit inclusion of $\beta$ inside $\alpha$ is pivotal to defining proper finite-dimensional natural parameters, which we will discuss later.

### 3.2 Orthogonally Decoupled Representation

But is (10) the best possible decoupled basis? Upon closer inspection, we find that there is redundancy in the parameterization of this basis: as $\Psi_\gamma$ is not orthogonal to $\Psi_\beta$ in general, optimizing $a_\gamma$ and $m_\beta$ jointly would create coupling and make the optimization landscape more ill-conditioned.

To address this issue, under the partition $\alpha = \gamma \cup \beta$, we propose a new decoupled basis as

$$\mu = (I - \Psi_\beta K_\beta^{-1} \Psi_\beta^\top)\Psi_\gamma a_\gamma + \Psi_\beta a_\beta, \qquad \Sigma = (I - \Psi_\beta K_\beta^{-1} \Psi_\beta^\top) + \Psi_\beta K_\beta^{-1} S K_\beta^{-1} \Psi_\beta^\top, \tag{11}$$

where $a_\gamma$ and $a_\beta$ are the mean coefficients, and $S \succ 0$ is parametrized by its Cholesky factor. We call $(a_\gamma, a_\beta, S)$ the model parameters and refer to (11) as the orthogonally decoupled basis, because $(I - \Psi_\beta K_\beta^{-1}\Psi_\beta^\top)\Psi_\gamma$ is orthogonal to $\Psi_\beta$ (i.e. $\Psi_\beta^\top (I - \Psi_\beta K_\beta^{-1}\Psi_\beta^\top)\Psi_\gamma = 0$). By substituting $a_\gamma = 0$ and $a_\beta = K_\beta^{-1} m$, we can compare (11) to (7): (11) has an additional part parameterized by $a_\gamma$ to model the mean function residues that cannot be captured by using the inducing points $\beta$ alone. In prediction, our basis has time complexity linear in $|\gamma|$, because $K_\beta^{-1} K_{\beta,\gamma}$ can be precomputed.
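The mean function of (11) evaluates to $m(x) = (k_{x,\gamma} - k_{x,\beta} K_\beta^{-1} K_{\beta,\gamma}) a_\gamma + k_{x,\beta} a_\beta$ (see Appendix A). A small NumPy sketch under an assumed RBF kernel illustrates the two claims above: the residue term vanishes on the $\beta$-basis, and $a_\gamma = 0$ with $a_\beta = K_\beta^{-1} m$ recovers the coupled mean (7). All names are illustrative.

```python
import numpy as np

def rbf(A, B):
    # Illustrative unit-lengthscale squared-exponential kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

def orth_mean(Xs, beta, gamma, a_gamma, a_beta):
    # m(x) = (k_{x,gamma} - k_{x,beta} K_b^{-1} K_{b,gamma}) a_gamma + k_{x,beta} a_beta
    Kb = rbf(beta, beta)
    resid = rbf(Xs, gamma) - rbf(Xs, beta) @ np.linalg.solve(Kb, rbf(beta, gamma))
    return resid @ a_gamma + rbf(Xs, beta) @ a_beta
```

Evaluating the residue part at the $\beta$ inducing points returns zero, which is the orthogonality property in function-value form.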

Compared with the original decoupled basis in (8), our choice in (11) has attractive properties:

1. The explicit inclusion of $\beta$ as a subset of $\alpha$ leads to the existence of natural parameters.

2. If the likelihood is strictly log-concave (e.g. Gaussian and Bernoulli likelihoods), then the variational inference problem in (4) is strictly convex in the model parameters (see Appendix B).

Our setup in (11) introduces a projection operator before $\Psi_\gamma$ in the basis (10), and therefore it can be viewed as the unique hybrid parametrization that confines the function modeled by $a_\gamma$ to be orthogonal to the span of the $\beta$-basis. Consequently, there is no correlation between optimizing $a_\gamma$ and optimizing $(a_\beta, S)$, making the problem better conditioned.

### 3.3 Natural Parameters and Expectation Parameters

To identify the natural parameters of GPs structured as (11), we revisit the definition of natural parameters in exponential families. A distribution belongs to an exponential family if we can write $p(x) = \exp(\langle T(x), \theta \rangle - A(\theta))\,h(x)$, where $T(x)$ is the sufficient statistic, $\theta$ is the natural parameter, $A(\theta)$ is the log-partition function, and $h(x)$ is the carrier measure.

Based on this definition, we can see that the choice of natural parameters is not unique. Suppose $\theta = C\tilde\theta + c$ for some constant invertible matrix $C$ and vector $c$. Then $\tilde\theta$ is also an admissible natural parameter, because we can write $p(x) = \exp(\langle \tilde T(x), \tilde\theta \rangle - \tilde A(\tilde\theta))\,\tilde h(x)$, where $\tilde T(x) = C^\top T(x)$, $\tilde A(\tilde\theta) = A(C\tilde\theta + c)$, and $\tilde h(x) = h(x)\exp(\langle T(x), c \rangle)$. In other words, the natural parameter is only unique up to affine transformations. If the natural parameter is transformed, the corresponding expectation parameter $m = \mathbb{E}[T(x)]$ also transforms accordingly to $\tilde m = \mathbb{E}[\tilde T(x)] = C^\top m$. It can be shown that the Legendre primal-dual relationship between the transformed parameters is also preserved: the dual of $\tilde A$ is also convex, and it satisfies $\nabla \tilde A(\tilde\theta) = \tilde m$ and $\nabla \tilde A^*(\tilde m) = \tilde\theta$, where $^*$ denotes the Legendre dual function (see Appendix C).

We use this trick to identify the natural and expectation parameters of (11). The relationships between natural, expectation, and model parameters are summarized in Figure 1.

#### Natural Parameters

Recall that for Gaussian distributions the natural parameters are conventionally defined through $\Sigma^{-1}\mu$ and $\Sigma^{-1}$. Therefore, to find the natural parameters of (11), it suffices to show that $(\Sigma^{-1}\mu, \tfrac12\Sigma^{-1})$ of (11) can be written as an affine transformation of some finite-dimensional parameters. The matrix inversion lemma and the orthogonality between $(I - \Psi_\beta K_\beta^{-1}\Psi_\beta^\top)\Psi_\gamma$ and $\Psi_\beta$ yield

$$\Sigma^{-1}\mu = (I - \Psi_\beta K_\beta^{-1}\Psi_\beta^\top)\Psi_\gamma j_\gamma + \Psi_\beta j_\beta, \qquad \tfrac{1}{2}\Sigma^{-1} = \tfrac{1}{2}(I - \Psi_\beta K_\beta^{-1}\Psi_\beta^\top) + \Psi_\beta \Theta \Psi_\beta^\top,$$
$$\text{where} \quad j_\gamma = a_\gamma, \quad j_\beta = S^{-1} K_\beta a_\beta, \quad \Theta = \tfrac{1}{2} S^{-1}. \tag{12}$$

Therefore, we call $(j_\gamma, j_\beta, \Theta)$ the natural parameters of (11). This choice is unique in the sense that $\Sigma^{-1}\mu$ is orthogonally parametrized. The explicit inclusion of $\beta$ as part of $\alpha$ is important; otherwise there would be a constraint between the parameters of $\Sigma^{-1}\mu$, because $\Sigma^{-1}$ can only be parametrized by the $\beta$-basis (see Appendix C).

#### Expectation Parameters

Once the new natural parameters are selected, we can also derive the corresponding expectation parameters. Recall that for the natural parameters $(\Sigma^{-1}\mu, \tfrac12\Sigma^{-1})$, the associated expectation parameters are $(\mu, -(\Sigma + \mu\mu^\top))$. Using the relationship between transformed natural and expectation parameters, we find the expectation parameters of (11) to be $(m_{\gamma\perp\beta}, m_\beta, \Lambda)$, where

$$m_{\gamma\perp\beta} = (K_\gamma - K_{\gamma,\beta} K_\beta^{-1} K_{\beta,\gamma})\, j_\gamma, \qquad m_\beta = S j_\beta, \qquad \Lambda = -S - m_\beta m_\beta^\top. \tag{13}$$

Note that the pairs $(j_\beta, \Theta)$ in (12) and $(m_\beta, \Lambda)$ in (13) have exactly the same relationship as the natural and expectation parameters in the standard Gaussian case.

### 3.4 Natural Gradient Descent

Natural gradient descent updates parameters according to the information geometry induced by the KL divergence (Amari, 1998). It is invariant to reparametrization and can normalize the problem to be well conditioned (Martens, 2014). Let $F(\theta)$ be the Fisher information matrix of the distribution with natural parameter $\theta$. Natural gradient descent on the natural parameters performs the update $\theta \leftarrow \theta - \tau F(\theta)^{-1}\nabla_\theta \mathcal{L}$, where $\tau$ is the step size. Because directly computing the inverse $F(\theta)^{-1}$ is computationally expensive, we use the duality between natural and expectation parameters in exponential families and adopt the equivalent update $\theta \leftarrow \theta - \tau \nabla_m \mathcal{L}$ (Hoffman et al., 2013; Salimbeni et al., 2018).
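The identity behind this equivalent update (for an exponential family, the Fisher matrix equals the Jacobian $\partial m / \partial \theta$, so $F^{-1}\nabla_\theta \mathcal{L} = \nabla_m \mathcal{L}$) can be checked numerically on a one-dimensional Gaussian. The sketch below uses finite differences and is purely illustrative; all names are assumptions.

```python
import numpy as np

def kl_std_normal(mean, var):
    # KL(N(mean, var) || N(0, 1)) as an example objective.
    return 0.5 * (var + mean**2 - 1.0 - np.log(var))

def to_model(theta):
    # Natural params (t1, t2) = (m/v, -1/(2v)) -> model params (mean, var).
    t1, t2 = theta
    var = -0.5 / t2
    return t1 * var, var

def to_expectation(theta):
    # Natural -> expectation params (E[x], E[x^2]).
    mean, var = to_model(theta)
    return np.array([mean, var + mean**2])

def from_expectation(eta):
    # Expectation -> natural params (inverse map).
    mean, var = eta[0], eta[1] - eta[0]**2
    return np.array([mean / var, -0.5 / var])

def grad_fd(f, x, eps=1e-6):
    # Central finite-difference gradient of a scalar function.
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g
```

Computing the Fisher matrix as the Jacobian of `to_expectation`, the natural gradient $F^{-1}\nabla_\theta \mathcal{L}$ agrees with the plain gradient taken with respect to the expectation parameters.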

#### Exact Update Rules

For our basis in (11), the natural gradient descent step can be written as

$$j \leftarrow j - \tau \nabla_m \mathcal{L}, \qquad \Theta \leftarrow \Theta - \tau \nabla_\Lambda \mathcal{L}, \tag{14}$$

where we recall that $\mathcal{L}$ is the negative variational lower bound in (4), $j = (j_\gamma, j_\beta)$ is defined in (12), and $m = (m_{\gamma\perp\beta}, m_\beta)$ in (13). As $\mathcal{L}$ is defined in terms of the model parameters, to compute these derivatives we use the chain rule (provided by the relationships in Figure 1) and obtain

$$j_\gamma \leftarrow j_\gamma - \tau\,(K_\gamma - K_{\gamma,\beta} K_\beta^{-1} K_{\beta,\gamma})^{-1} \nabla_{a_\gamma}\mathcal{L}, \tag{15a}$$
$$j_\beta \leftarrow j_\beta - \tau\,(K_\beta^{-1}\nabla_{a_\beta}\mathcal{L} - 2\,\nabla_S\mathcal{L}\; m_\beta), \tag{15b}$$
$$\Theta \leftarrow \Theta + \tau\,\nabla_S\mathcal{L}. \tag{15c}$$

Due to the orthogonal choice of natural parameters, the updates for the $\gamma$-part and the $\beta$-part are independent. Furthermore, one can show that the update for $j_\beta$ and $\Theta$ is exactly the same as the natural gradient descent rule for the standard coupled basis (Hensman et al., 2013), and that the update for the residue part $j_\gamma$ is equivalent to functional gradient descent (Kivinen et al., 2004) in the subspace orthogonal to the span of the $\beta$-basis.

#### Approximate Update Rule

We described the natural gradient descent update for the orthogonally decoupled GPs in (11). However, in the regime where $|\gamma| \gg |\beta|$, computing (15a) becomes infeasible, as it requires inverting a $|\gamma| \times |\gamma|$ matrix. Here we propose an approximation of (15a) based on a diagonal-plus-low-rank structure. Because the inducing points $\beta$ are selected to globally approximate the function landscape, one sensible choice is to approximate $K_\gamma$ with a Nyström approximation based on $\beta$ plus a diagonal correction term: $K_\gamma \approx Q_\gamma + D_{\gamma|\beta}$, where $Q_\gamma = K_{\gamma,\beta} K_\beta^{-1} K_{\beta,\gamma}$, $D_{\gamma|\beta} = \mathrm{diag}(K_\gamma - Q_\gamma)$, and $\mathrm{diag}(\cdot)$ denotes extracting the diagonal part of a matrix. FITC (Snelson and Ghahramani, 2006) uses a similar idea to approximate the prior distribution (Quiñonero-Candela et al., 2005), whereas here it is used to derive an approximate update rule without changing the problem. This leads to the simple update rule

$$j_\gamma \leftarrow j_\gamma - \tau\,(D_{\gamma|\beta} + \epsilon I)^{-1} \nabla_{a_\gamma}\mathcal{L}, \tag{16}$$

where a jitter $\epsilon > 0$ is added to ensure stability. This update rule uses a diagonal scaling $(D_{\gamma|\beta} + \epsilon I)^{-1}$, which is independent of $j_\beta$ and $\Theta$. Therefore, while one could directly use the update (16), in implementation we propose to replace (16) with an adaptive coordinate-wise gradient descent algorithm (e.g. ADAM, Kingma and Ba (2015)) to update the $\gamma$-part. Due to the orthogonal structure, the overall computational complexity remains linear in $|\gamma|$. While this is more than the complexity of the original decoupled approach (Cheng and Boots, 2017), the experimental results suggest that the additional computation is worth the large performance improvement.
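A hedged NumPy sketch of the diagonal preconditioner used in (16); the helper names are illustrative. When $\gamma \subseteq \beta$ the Nyström approximation is exact, so the correction vanishes, and in general the Schur-complement structure keeps it nonnegative.

```python
import numpy as np

def rbf(A, B):
    # Illustrative unit-lengthscale squared-exponential kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

def nystrom_diag(gamma, beta):
    # D_{gamma|beta} = diag(K_gamma - Q_gamma), Q_gamma = K_{gb} K_b^{-1} K_{bg}.
    Kg = rbf(gamma, gamma)
    Kgb = rbf(gamma, beta)
    Q = Kgb @ np.linalg.solve(rbf(beta, beta), Kgb.T)
    return np.diag(Kg - Q).copy()

def approx_step_16(j_gamma, grad_a_gamma, D, tau=0.1, eps=1e-6):
    # Update (16): a cheap element-wise solve instead of a |gamma|^3 inverse.
    return j_gamma - tau * grad_a_gamma / (D + eps)
```

This replaces the cubic-cost inverse in (15a) with an element-wise division, which is why the overall update stays linear in $|\gamma|$.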

## 4 Results

We empirically assess the performance of our algorithm in multiple regression and classification tasks. We show that:

1. given fixed wall-clock time, the proposed orthogonally decoupled basis outperforms existing approaches;

2. given the same number of inducing points for the covariance, our method almost always improves on the coupled approach (which is in contrast to the previous decoupled bases);

3. using natural gradients can dramatically improve performance, especially in regression.

We compare updating our orthogonally decoupled basis with adaptive gradient descent using the Adam optimizer (Kingma and Ba, 2015) (Orth), and using the approximate natural gradient descent rule described in Section 3.4 (OrthNat). As baselines, we consider the original decoupled approach of Cheng and Boots (2017) (Decoupled) and the hybrid approach suggested in their appendix (Hybrid). We compare also to the standard coupled basis with and without natural gradients (CoupledNat and Coupled, respectively). We make generic choices for hyperparameters, inducing point initializations, and data processing, which are detailed in Appendix E. Our code and datasets are publicly available.

#### Illustration

Figures 1(a) and 1(b) show a simplified setting to illustrate the difference between the methods. In this example, we fixed the inducing inputs and hyperparameters, and optimized the rest of the variational parameters. All the decoupled methods then have the same global optimum, so we can easily assess their convergence properties. With a Gaussian likelihood we also computed the optimal solution analytically as an optimal baseline, although this requires inverting a matrix whose size grows with the dataset, and is therefore not useful as a practical method. We include the expressions of the optimal solution in Appendix D. We used the same basis sizes for all bases and conducted experiments on the 3droad dataset for regression with a Gaussian likelihood and the ringnorm dataset for classification with a Bernoulli likelihood. Overall, the natural gradient methods are much faster to converge than their ordinary gradient counterparts. Decoupled fails to converge to the optimum after 20K iterations, even in the Gaussian case. We emphasize that, unlike our proposed approaches, Decoupled leads to a non-convex optimization problem.

#### Wall-clock comparison

To investigate large-scale performance, we used 3droad with a large number of inducing points. We used a computer with a Tesla K40 GPU and found that, in wall-clock time per iteration, the orthogonally decoupled basis was equivalent to a coupled model with a smaller number of inducing points in our tensorflow (Abadi et al., 2016) implementation. Under this setting, we show the ELBO in Figure 1(c) and the test log-likelihood and accuracy in Figure 3 of Appendix F. OrthNat performs the best, achieving the highest test log-likelihood and accuracy, followed by Orth. Coupled and CoupledNat both outperform Hybrid and Decoupled.

#### Regression benchmarks

We applied our models to 12 regression datasets ranging from 15K to 2M points. To enable feasible computation on multiple datasets, we downscale the number of inducing points (keeping the same ratio between the bases) for the decoupled models and for the coupled model. We also compare to the coupled model with a larger basis to establish whether the extra computation always improves the performance of the decoupled basis. The test mean absolute error (MAE) results are summarized in Table 1, and the full results for both test log-likelihood and MAE are given in Appendix F. OrthNat is overall the most competitive basis, and, by all measures, the orthogonal bases outperform their coupled counterparts with the same covariance basis size, which is not the case for Hybrid and Decoupled.

#### Classification benchmarks

We compare our method with state-of-the-art fully connected neural networks with Selu activations (Klambauer et al., 2017). We adopted the experimental setup from (Klambauer et al., 2017), using the largest 19 datasets (4898 to 130000 data points). For the binary datasets we used the Bernoulli likelihood, and for the multiclass datasets we used the robust-max likelihood (Hernández-Lobato et al., 2011). The same basis settings as for the regression benchmarks were used here. Orth performs the best in terms of median, and OrthNat is best ranked. The neural network wins in terms of mean, because it substantially outperforms all the GP models in one particular dataset (chess-krvk), which skews the mean performance over the 19 datasets. We see that our orthogonal bases on average improve over the coupled bases with equivalent wall-clock time, although for some datasets the coupled bases are superior. Unlike in the regression case, it is not always true that using natural gradients improves performance, although on average it does. This holds for both the coupled and decoupled bases.

Overall, the empirical results demonstrate that the orthogonally decoupled basis is superior to the coupled basis with the same wall-clock time, averaged over datasets. It is important to note that for the same $\beta$-basis size, adding extra $\gamma$ inducing points increases performance for the orthogonally decoupled basis in almost all cases, but not for Hybrid or Decoupled. While this does add additional computation, the ratio between the extra computation for additional $\gamma$ points and that for additional $\beta$ points decreases to zero as $|\beta|$ increases. That is, eventually the cubic scaling in $|\beta|$ will dominate the linear scaling in $|\gamma|$.

## 5 Conclusion

We present a novel orthogonally decoupled basis for variational inference in GP models. Our basis is constructed by extending the standard coupled basis with an additional component to model the mean residues. Therefore, it extends the standard coupled basis (Titsias, 2009; Hensman et al., 2013) and achieves better performance. We show how the natural parameters of our decoupled basis can be identified and propose an approximate natural gradient update rule, which significantly improves the optimization performance over the original decoupled approach (Cheng and Boots, 2017). Empirically, our method demonstrates strong performance in multiple regression and classification tasks.

## Appendix A Variational Inference Problem

In this section, we provide details of implementing the variational inference problem

$$\mathcal{L}(q) = -\sum_{n=1}^N \mathbb{E}_{q(f(x_n))}\!\left[\log p(y_n | f(x_n))\right] + \mathrm{KL}(q\,\|\,p) \tag{4}$$

when the variational posterior is parameterized using a decoupled basis

$$\mu = \Psi_\alpha a, \qquad \Sigma = I + \Psi_\beta A \Psi_\beta^\top. \tag{5}$$

Without loss of generality, we assume $A = K_\beta^{-1} S K_\beta^{-1} - K_\beta^{-1}$. That is, we focus on the following form of parametrization with $S \succ 0$,

$$\mu = \Psi_\alpha a, \qquad \Sigma = I + \Psi_\beta A \Psi_\beta^\top \coloneqq (I - \Psi_\beta K_\beta^{-1}\Psi_\beta^\top) + \Psi_\beta K_\beta^{-1} S K_\beta^{-1} \Psi_\beta^\top. \tag{17}$$

### a.1 KL Divergence

We first show how the KL divergence can be computed using finite-dimensional variables. The proof is similar to the strategy in (Cheng and Boots, 2017, Appendix).

###### Proposition A.1.

Let $p$ denote the prior GP and $q = \mathcal{GP}(\mu, \Sigma)$ with

$$\mu = \Psi_\alpha a, \qquad \Sigma = (I - \Psi_\beta K_\beta^{-1}\Psi_\beta^\top) + \Psi_\beta K_\beta^{-1} S K_\beta^{-1} \Psi_\beta^\top.$$

Then the KL divergence satisfies

$$\mathrm{KL}(q\,\|\,p) = \frac{1}{2}\left(a^\top K_\alpha a + \mathrm{tr}(S K_\beta^{-1}) - \log|S| + \log|K_\beta| - |\beta|\right).$$

For the orthogonally decoupled basis (11) in particular, we can write

$$\mathrm{KL}(q\,\|\,p) = \frac{1}{2}\left(a_\gamma^\top (K_\gamma - K_{\gamma,\beta} K_\beta^{-1} K_{\beta,\gamma}) a_\gamma + a_\beta^\top K_\beta a_\beta + \mathrm{tr}(S K_\beta^{-1}) - \log|S| + \log|K_\beta| - |\beta|\right).$$

### a.2 Expected Log-Likelihoods

The expected log-likelihood can be computed by first computing the predictive Gaussian distribution $q(f(x_n)) = \mathcal{N}(m(x_n), s(x_n))$ for each data point $x_n$. For example, for the orthogonally decoupled basis (11), this is given as

$$m(x) = (k_{x,\gamma} - k_{x,\beta} K_\beta^{-1} K_{\beta,\gamma}) a_\gamma + k_{x,\beta} a_\beta,$$
$$s(x) = (k_x - k_{x,\beta} K_\beta^{-1} k_{\beta,x}) + k_{x,\beta} K_\beta^{-1} S K_\beta^{-1} k_{\beta,x}.$$

Given $m(x_n)$ and $s(x_n)$, the expected log-likelihood can then be computed exactly (in the Gaussian case) or using a quadrature approximation.
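A minimal NumPy sketch of the predictive variance $s(x)$ above, under an assumed unit-variance RBF kernel (so $k_x = 1$); all names are illustrative. As a sanity check, setting $S = K_\beta$ recovers the prior variance.

```python
import numpy as np

def rbf(A, B):
    # Illustrative unit-lengthscale, unit-variance squared-exponential kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

def orth_predict_var(Xs, beta, S):
    # s(x) = (k_x - k_{x,b} K_b^{-1} k_{b,x}) + k_{x,b} K_b^{-1} S K_b^{-1} k_{b,x}
    Kb = rbf(beta, beta)
    Kxb = rbf(Xs, beta)
    A = np.linalg.solve(Kb, Kxb.T)               # K_b^{-1} k_{b,x}
    prior = np.ones(len(Xs))                     # k(x, x) = 1 for the unit RBF
    return prior - np.einsum('ij,ji->i', Kxb, A) + np.einsum('ji,jk,ki->i', A, S, A)
```

Shrinking $S$ below $K_\beta$ then yields variances strictly between zero and the prior, as expected of a valid posterior.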

Using the above formulas, a differentiable computational graph can be constructed, and the gradient can then be computed using automatic differentiation. When the KL divergence is further approximated by column sampling, an unbiased gradient can be computed at reduced time complexity.

## Appendix B Convexity of the Variational Inference Problem

Here we show that the objective function in (4) is strictly convex in the model parameters if the likelihood is strictly log-concave.

### b.1 KL Divergence

We first study the KL divergence term. It is easy to see that it is strongly convex in $a$. When $S = LL^\top$, where $L$ is lower triangular with positive diagonal terms, the KL divergence is strongly convex in $L$ as well. To see this, we notice that

 −log|S|=−log|LL⊤|=−2|β|∑i=1log|Lii|

is strictly convex and

is strongly convex, because .
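A quick numeric spot-check (a sketch, not from the paper) of the Cholesky log-determinant identity used above:

```python
import numpy as np

# With S = L L^T and L lower triangular with positive diagonal,
# -log|S| = -2 * sum_i log L_ii.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
S = M @ M.T + 5 * np.eye(5)                   # positive definite
L = np.linalg.cholesky(S)                     # lower triangular, positive diagonal

lhs = -np.linalg.slogdet(S)[1]
rhs = -2 * np.sum(np.log(np.diag(L)))
assert np.allclose(lhs, rhs)
```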

### B.2 Expected Log-Likelihood

Here we show that the negative expected log-likelihood part is strictly convex. Let $F(\cdot) = -\log p(y_n \mid \cdot)$; we can write

$$E_n = \mathbb{E}_{q(f(x_n))}\left[-\log p(y_n \mid f(x_n))\right] = \mathbb{E}_{\zeta,\xi}\left[F\!\left(m(x_n) + \zeta + k_{x_n,\beta} K_\beta^{-1} L \xi\right)\right]$$

in which $\zeta \sim \mathcal{N}(0,\, k_{x_n} - k_{x_n,\beta} K_\beta^{-1} k_{\beta,x_n})$ and $\xi \sim \mathcal{N}(0, I)$.

We then use the following lemma.

###### Lemma B.1.

Suppose $f$ is $\theta$-strictly convex. Then $g(x) \coloneqq f(Ax + b)$ is also $\theta$-strictly convex.

###### Proof.

Let $u = Ax + b$ and $v = Ay + b$. Then

$$f(v) - f(u) \geq \langle \nabla f(u), v - u\rangle + \tfrac{\theta}{2}\left(\langle \nabla f(u), v - u\rangle\right)^2 = \langle \nabla f(u), A(y - x)\rangle + \tfrac{\theta}{2}\left(\langle \nabla f(u), A(y - x)\rangle\right)^2 = \langle A^\top \nabla f(u), y - x\rangle + \tfrac{\theta}{2}\left(\langle A^\top \nabla f(u), y - x\rangle\right)^2$$

Since $f(v) = g(y)$, $f(u) = g(x)$, and $\nabla g(x) = A^\top \nabla f(u)$, this is exactly $\theta$-strict convexity of $g$.

Because $F$ is strictly convex when the likelihood is strictly log-concave, and its argument is linearly parametrized in the variational parameters, the desired strict convexity of $E_n$ follows from Lemma B.1.

## Appendix C Uniqueness of Parametrization and Natural Parameters

Here we provide some additional details regarding natural parameters and natural gradient descent.

### C.1 Necessity of Including β as a Subset of α

We show that the partition condition in Section 3.2 is necessary to derive proper natural parameters. Suppose, to the contrary, that $\alpha$ is a general set of inducing points. Using a derivation similar to Section 3.3, we can show that

$$\frac{1}{2}\Sigma^{-1} = \frac{1}{2}(I - \Psi_\beta K_\beta^{-1} \Psi_\beta^\top) + \frac{1}{2}\Psi_\beta S^{-1} \Psi_\beta^\top$$
$$\Sigma^{-1}\mu = \left((I - \Psi_\beta K_\beta^{-1} \Psi_\beta^\top) + \Psi_\beta S^{-1} \Psi_\beta^\top\right)\Psi_\alpha a = (I - \Psi_\beta K_\beta^{-1} \Psi_\beta^\top)\Psi_\alpha \tilde{j}_\alpha + \Psi_\beta \tilde{j}_\beta$$

where $\tilde{j}_\alpha = a$ and $\tilde{j}_\beta = S^{-1} K_{\beta,\alpha} a$. Therefore, we might consider choosing $(\tilde{j}_\alpha, \tilde{j}_\beta)$ as candidate natural parameters. However, the above choice of parametrization is actually coupled, due to the condition that $\tilde{j}_\alpha$ and $\tilde{j}_\beta$ have to satisfy, i.e.

$$\tilde{j}_\beta = S^{-1} K_{\beta,\alpha} \tilde{j}_\alpha.$$

Thus, they cannot satisfy the requirement of being natural parameters. This is mainly because $\mu$ is expressed in the $\Psi_\alpha$ basis only, whereas $\Sigma^{-1}\mu$ involves both the $\Psi_\alpha$ and $\Psi_\beta$ bases.
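The inverse-covariance identity used in the derivation above can be checked numerically in the same finite-dimensional sketch as before (feature matrix `Phi_b` standing in for $\Psi_\beta$, with `Phi_b.T @ Phi_b` equal to $K_\beta$):

```python
import numpy as np

# Check: Sigma^{-1} = (I - Phi_b K_b^{-1} Phi_b^T) + Phi_b S^{-1} Phi_b^T
# when Sigma = (I - Phi_b K_b^{-1} Phi_b^T) + Phi_b K_b^{-1} S K_b^{-1} Phi_b^T.
rng = np.random.default_rng(2)
d, nb = 10, 3
Phi_b = rng.standard_normal((d, nb))
K_b = Phi_b.T @ Phi_b
Kb_inv = np.linalg.inv(K_b)
M = rng.standard_normal((nb, nb))
S = M @ M.T + nb * np.eye(nb)

Sigma = (np.eye(d) - Phi_b @ Kb_inv @ Phi_b.T) + Phi_b @ Kb_inv @ S @ Kb_inv @ Phi_b.T
Sigma_inv = (np.eye(d) - Phi_b @ Kb_inv @ Phi_b.T) + Phi_b @ np.linalg.inv(S) @ Phi_b.T

assert np.allclose(Sigma @ Sigma_inv, np.eye(d))
```

Intuitively, $\Sigma$ acts as the identity on the orthogonal complement of $\mathrm{span}(\Psi_\beta)$ and as $K_\beta^{-1} S K_\beta^{-1}$-type curvature on the span, so the inverse swaps $K_\beta^{-1} S K_\beta^{-1}$ for $S^{-1}$.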

### C.2 Alternate Choices of Natural Parameters

As discussed previously in Section 3.3, the choice of natural parameters is only unique up to affine transformation. While in this paper we propose to use the unique orthogonal version, other choices of parametrization are possible. For instance, here we consider the hybrid parametrization in (Cheng and Boots, 2017, appendix) and give an overview of finding its natural parameters.

The hybrid parametrization uses the following decoupled basis:

$$\mu = \Psi_\gamma a_\gamma + \Psi_\beta a_\beta, \qquad \Sigma = (I - \Psi_\beta K_\beta^{-1} \Psi_\beta^\top) + \Psi_\beta K_\beta^{-1} S K_\beta^{-1} \Psi_\beta^\top$$

To facilitate a clear comparison, here we use the covariance parametrization above in place of the one in the original form suggested by Cheng and Boots (2017). Note that in the experiments, their original form was used.

As the covariance part above has the same form as our orthogonally decoupled basis in (11), here we only consider the mean part. Following a similar derivation, we can write

$$\Sigma^{-1}\mu = \left((I - \Psi_\beta K_\beta^{-1} \Psi_\beta^\top) + \Psi_\beta S^{-1} \Psi_\beta^\top\right)(\Psi_\gamma a_\gamma + \Psi_\beta a_\beta)$$
$$= \Psi_\gamma a_\gamma + \Psi_\beta (S^{-1} - K_\beta^{-1}) K_{\beta,\gamma} a_\gamma + \Psi_\beta S^{-1} K_\beta a_\beta$$
$$= (\Psi_\gamma - \Psi_\beta K_\beta^{-1} K_{\beta,\gamma}) a_\gamma + \Psi_\beta S^{-1} (K_\beta a_\beta + K_{\beta,\gamma} a_\gamma)$$
$$= (I - \Psi_\beta K_\beta^{-1} \Psi_\beta^\top) \Psi_\gamma j_\gamma + \Psi_\beta j_\beta$$

That is, we can choose the natural parameters as

$$j_\gamma = a_\gamma, \qquad j_\beta = S^{-1}(K_\beta a_\beta + K_{\beta,\gamma} a_\gamma), \qquad \Theta = \frac{1}{2} S^{-1} \tag{18}$$

This set of natural parameters, unlike the one in the previous section, is proper, because $\beta$ is included as a subset of $\alpha$.

Comparing (18) with the natural parameters in Section 3.3, we can see that there is a coupling between $a_\gamma$ and $j_\beta$ in (18). This would lead to a more complicated update rule when computing the natural gradient. This coupling phenomenon also applies to other choices of parametrization, except for our orthogonally decoupled basis.
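The decomposition underlying (18) can be verified numerically in the same finite-dimensional sketch (feature matrices standing in for $\Psi_\beta$, $\Psi_\gamma$):

```python
import numpy as np

# Check: Sigma^{-1} mu = (I - Phi_b K_b^{-1} Phi_b^T) Phi_g j_g + Phi_b j_b
# with j_g = a_g and j_b = S^{-1}(K_b a_b + K_bg a_g), as in (18).
rng = np.random.default_rng(3)
d, nb, ng = 10, 3, 4
Phi_b = rng.standard_normal((d, nb))
Phi_g = rng.standard_normal((d, ng))
K_b, K_bg = Phi_b.T @ Phi_b, Phi_b.T @ Phi_g
Kb_inv = np.linalg.inv(K_b)
M = rng.standard_normal((nb, nb))
S = M @ M.T + nb * np.eye(nb)
a_g, a_b = rng.standard_normal(ng), rng.standard_normal(nb)

mu = Phi_g @ a_g + Phi_b @ a_b                       # hybrid mean
Sigma = (np.eye(d) - Phi_b @ Kb_inv @ Phi_b.T) + Phi_b @ Kb_inv @ S @ Kb_inv @ Phi_b.T

j_g = a_g
j_b = np.linalg.solve(S, K_b @ a_b + K_bg @ a_g)
rhs = (np.eye(d) - Phi_b @ Kb_inv @ Phi_b.T) @ Phi_g @ j_g + Phi_b @ j_b

assert np.allclose(np.linalg.solve(Sigma, mu), rhs)
```

Note how `j_b` mixes `a_g` and `a_b`, which is exactly the coupling discussed above.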

### C.3 Invariance of Natural Gradient Descent

As discussed above, the choice of natural parameters for the mean part is not unique, but here we show they all lead to the same natural gradient descent update. Therefore, our orthogonal choice (11), among all possible equivalent parameterizations, has the cleanest update rule.

This equivalence between different parametrizations can be easily seen from the fact that the KL divergence between Gaussians is quadratic in the mean parameters. Therefore, the natural gradient step has the form of the proximal update below:

$$\operatorname*{argmin}_x \, \langle \nabla_x f, x \rangle + \frac{1}{2}(x - y)^\top Q (x - y) = y - Q^{-1} \nabla_x f$$

for some function $f$, vector $y$, and positive-definite matrix $Q$.

To see the invariance under invertible linear transformations, suppose we reparametrize the above as $x = Au + b$ and $y = Av + b$, for some invertible matrix $A$ and vector $b$. Then the update becomes

$$\operatorname*{argmin}_u \, \langle A^\top \nabla_x f, u \rangle + \frac{1}{2}(u - v)^\top A^\top Q A (u - v) = v - A^{-1} Q^{-1} \nabla_x f$$

which represents the same update step in $x$, because

$$A\left(v - A^{-1} Q^{-1} \nabla_x f\right) + b = y - Q^{-1} \nabla_x f.$$
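This invariance can be illustrated numerically (a sketch with randomly generated $Q$, $A$, $b$, and gradient): run the proximal update in the original coordinates $x$ and in the reparametrized coordinates $u$, then map the latter back.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)                       # positive-definite metric
A = rng.standard_normal((n, n)) + n * np.eye(n)   # invertible reparametrization
b = rng.standard_normal(n)
grad = rng.standard_normal(n)        # gradient of f at the current iterate
y = rng.standard_normal(n)           # current iterate in x-coordinates
v = np.linalg.solve(A, y - b)        # same iterate in u-coordinates (x = A u + b)

# x-space update: argmin_x <grad, x> + 1/2 (x - y)^T Q (x - y)
x_new = y - np.linalg.solve(Q, grad)

# u-space update with gradient A^T grad and metric A^T Q A, mapped back to x.
u_new = v - np.linalg.solve(A.T @ Q @ A, A.T @ grad)
assert np.allclose(A @ u_new + b, x_new)
```

The key step is $A (A^\top Q A)^{-1} A^\top = Q^{-1}$, so both coordinate systems produce the same point.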

### C.4 Transformation of Natural Parameters and Expectation Parameters

Here we provide a more rigorous proof of identifying the natural and expectation parameters of decoupled bases, as the density function of $q$, which is used to illustrate the idea in Section 3.3, is not defined for GPs. Instead, we derive the transformation of natural parameters and expectation parameters based on the KL divergence. We start from finite-dimensional exponential families and then show that the formulation extends to the infinite-dimensional case.

Consider a finite-dimensional exponential family. Its KL divergence can be written as

$$\mathrm{KL}(q\,\|\,p) = A(\eta_p) + A^*(\theta_q) - \langle \eta_p, \theta_q \rangle \tag{19}$$

where $A$ is the log-partition function, $A^*$ is the Legendre dual of $A$, $\theta_q$ is the expectation parameter of $q$, and $\eta_p$ is the natural parameter of $p$. The duality properties $\nabla A(\eta) = \theta$ and $\nabla A^*(\theta) = \eta$ hold.

As (19) is expressed in terms of inner products, it holds in arbitrary dimensions, and it is defined finitely for GPs with a decoupled basis (Cheng and Boots, 2017). Therefore, here we show that when we parametrize the problem by $\eta = H\tilde{\eta} + b$, $\tilde{\eta}$ is also a candidate natural parameter satisfying (19) for some transformed expectation parameter $\tilde{\theta}$. This can be shown as below:

$$\mathrm{KL}(q\,\|\,p) = A(\eta_p) + A^*(\theta_q) - \langle \eta_p, \theta_q \rangle$$
$$= A(H\tilde{\eta}_p + b) + A^*(\theta_q) - \langle H\tilde{\eta}_p + b, \theta_q \rangle$$
$$= A(H\tilde{\eta}_p + b) + A^*(\theta_q) - \langle \tilde{\eta}_p, H^\top \theta_q \rangle - \langle b, \theta_q \rangle$$
$$= A(H\tilde{\eta}_p + b) + \left(A^*(H^{-\top}\tilde{\theta}_q) - \langle b, H^{-\top}\tilde{\theta}_q \rangle\right) - \langle \tilde{\eta}_p, \tilde{\theta}_q \rangle$$
$$\eqqcolon \tilde{A}(\tilde{\eta}_p) + \tilde{A}^*(\tilde{\theta}_q) - \langle \tilde{\eta}_p, \tilde{\theta}_q \rangle$$

where we define

$$\tilde{\theta}_q = H^\top \theta_q, \qquad \tilde{A}(\tilde{\eta}_p) = A(H\tilde{\eta}_p + b), \qquad \tilde{A}^*(\tilde{\theta}_q) = A^*(H^{-\top}\tilde{\theta}_q) - \langle b, H^{-\top}\tilde{\theta}_q \rangle$$
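The transformation can be checked on a concrete family (an assumed example, not from the paper): Gaussians with identity covariance, where the natural parameter is the mean, $A(\eta) = \tfrac{1}{2}\|\eta\|^2$, $\theta = \eta$, and $A^*(\theta) = \tfrac{1}{2}\|\theta\|^2$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
mu_q, mu_p = rng.standard_normal(n), rng.standard_normal(n)
A  = lambda eta: 0.5 * eta @ eta      # log-partition (identity-covariance Gaussian)
As = lambda th:  0.5 * th @ th        # its Legendre dual

kl_true = 0.5 * np.sum((mu_q - mu_p) ** 2)                      # KL(N(mu_q,I)||N(mu_p,I))
assert np.allclose(kl_true, A(mu_p) + As(mu_q) - mu_p @ mu_q)   # (19)

# Affine reparametrization eta = H eta_t + b, with the transformed quantities
# theta_t = H^T theta, A_t(eta_t) = A(H eta_t + b),
# A_t*(theta_t) = A*(H^-T theta_t) - <b, H^-T theta_t>.
H = rng.standard_normal((n, n)) + n * np.eye(n)   # invertible
b = rng.standard_normal(n)
eta_t_p = np.linalg.solve(H, mu_p - b)
theta_t_q = H.T @ mu_q
At  = A(H @ eta_t_p + b)
Ats = As(np.linalg.solve(H.T, theta_t_q)) - b @ np.linalg.solve(H.T, theta_t_q)
assert np.allclose(kl_true, At + Ats - eta_t_p @ theta_t_q)
```

Both the original and the transformed parameters reproduce the same KL divergence, as the derivation requires.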

It can be verified that $\tilde{A}^*$ is indeed the Legendre dual of $\tilde{A}$:

$$\max_x \, \langle w, x \rangle - \tilde{A}(x) = \max_x \, \langle w, x \rangle - A(Hx + b)$$
$$= \max_z \, \langle w, H^{-1}(z - b) \rangle - A(z)$$
$$= -\langle H^{-\top} w, b \rangle + \max_z \, \langle H^{-\top} w, z \rangle - A(z)$$
$$= -\langle H^{-\top} w, b \rangle + A^*(H^{-\top} w) = \tilde{A}^*(w).$$