# A Tutorial on Sparse Gaussian Processes and Variational Inference

## Abstract

Gaussian processes (GPs) provide a mathematically elegant framework for Bayesian inference and they can offer principled uncertainty estimates for a large range of problems. For example, if we consider certain regression problems with Gaussian likelihoods, a GP model enjoys a posterior in closed form. However, identifying the posterior GP scales cubically with the number of training examples and furthermore requires to store all training examples in memory. In order to overcome these practical obstacles, sparse GPs have been proposed that approximate the true posterior GP with a set of pseudo-training examples (a.k.a. inducing inputs or inducing points). Importantly, the number of pseudo-training examples is user-defined and enables control over computational and memory complexity. In the general case, sparse GPs do not enjoy closed-form solutions and one has to resort to approximate inference. In this context, a convenient choice for approximate inference is variational inference (VI), where the problem of Bayesian inference is cast as an optimization problem—namely, to maximize a lower bound of the logarithm of the marginal likelihood. This paves the way for a powerful and versatile framework, where pseudo-training examples are treated as optimization arguments of the approximate posterior that are jointly identified together with hyperparameters of the generative model (i.e. prior and likelihood) in the course of training. The framework can naturally handle a wide scope of supervised learning problems, ranging from regression with heteroscedastic and non-Gaussian likelihoods to classification problems with discrete labels, but also multilabel problems (where the regression or classification targets are multidimensional). The purpose of this tutorial is to provide access to the basic matter for readers without prior knowledge in both GPs and VI. It turns out that a proper exposition to the subject enables also convenient access to more recent advances in the field of GPs (like importance-weighted VI as well as inderdomain, multioutput and deep GPs) that can serve as an inspiration for exploring new research ideas.

Variational Inference, Importance-Weighted Variational Inference, Latent-Variable Variational Inference, Bayesian Layers, Bayesian Deep Learning, Sparse Gaussian Process, Sparse Variational Gaussian Process, Interdomain Gaussian Process, Multioutput Gaussian Process, Deep Gaussian Process

###### Contents

## 1 Introduction

Gaussian Processes (GPs) [Rasmussen and Williams, 2006] are a natural way to generalize the concept of a multivariate normal distribution. While a multivariate normal distribution describes random variables that are vectors, a GP describes random variables that are real-valued functions defined over some input domain. Imagine the input domain are the real numbers, then the random variable described by a GP can be thought of as a “vector” of uncountably infinite extent and with infinite resolution that is “indexed” by a real number rather than a discrete index. GPs allow however for a wider range of input domains such as Euclidean vector spaces, but also non-continuous input domains like sets containing graph-theoretical objects or character sequences, and many more.

GPs are a popular tool for regression where the goal is to identify an unknown real-valued function given noisy function observations at some input locations. More precisely, given input/output tuples where is a scalar and an input from some input domain, the modelling assumption is that the data has been generated by where is a real-valued function with input that is sampled from a GP, and where are scalar i.i.d. random variables corresponding to observation noise. In this context, the prior knowledge on the data generation process can be encapsulated in the distribution of , and the likelihood (i.e. the observation model) determines how likely a noisy function observation is given the corresponding noise-free function observation. If the observation noise is Gaussian, the posterior process is also a GP that enjoys a closed-form expression. In order to compute the posterior GP, one has to resort to Bayes’ rule to infer the posterior distribution over functions given the input/output tuples and the likelihood. It turns out that the denominator in Bayes’ rule, known as the marginal likelihood that depends on both the prior GP and the likelihood, provides a natural way to identify point estimates for hyperparameters of the generative model that are not subject to inference [Bishop, 2006].

In the case of general regression or classification problems, the exact posterior process is usually no longer a GP. In logistic regression for example, the noisy function observations are binary values (which is why logistic regression is actually a classification problem). The likelihood is a Bernoulli distribution whose mean is obtained by squashing the output of a real-valued function evaluated at the input (represented by a GP) through a sigmoid function . This yields a probability value for each input location indicating the probability of the observed noisy function value being . But even regression problems that have Gaussian likelihoods are not unproblematic. It turns out that computing the exact posterior GP requires to store and invert an -matrix that is quadratic in the amount of training data . This means quadratic memory consumption and cubic computational complexity—both of which are infeasible for large data sets.

All of these problems can be addressed with recent advances in the field of GP research: sparse GPs [Titsias, 2009]. Sparse GPs limit the amount of pseudo data that is used to represent a (possibly non-Gaussian) posterior process, where the limit is user-defined and determines memory and computational complexity. Intuitively, in a regression problem that has a closed-form solution, the optimal sparse GP should be “as close as possible” to the true intractable posterior GP. In this regard, “as close as possible” can be for example defined as low Kullback-Leibler (KL) divergence between the sparse GP and the true posterior GP, and the goal is to identify the sparse GP’s pseudo data in such a way that this KL becomes minimal. In general, it is not possible to identify such optimal sparse GPs in closed-form solution. One way to approximate optimal sparse GPs is to resort to optimization and gradient-based methods, one particular example of which is variational inference (VI) that is equivalent to minimizing said KL divergence. Other examples comprise Markov-chain-Monte-Carlo methods and expectation propagation [Hensman et al., 2015a, Bui et al., 2017]. However, in this tutorial, we put emphasis on VI due to its popularity and convenience (and to limit the scope).

VI is a particular type of approximate inference technique that translates the problem of Bayesian inference into an optimization objective to be optimized w.r.t. the parameters of the approximate posterior. Interestingly, this objective is a lower bound to the logarithm of the aforementioned marginal likelihood, which enables hence convenient joint optimization over hyperparameters (that are not treated as random variables) in addition to the approximate posterior’s parameters. VI does not only provide a principled way to identify approximate posterior processes via sparse GPs when the likelihood is Gaussian, but it also provides a solution to scenarios with arbitrary likelihoods and where the true posterior process is typically not a GP (such as in logistic regression). It turns out that the framework can be readily extended to multimodal likelihood problems, as well as multiregression and -classification problems where unknown noisy vector-valued functions need to be identified.

The remainder of this manuscript is organized as follows. In Section 2, we provide an overview over sparse GPs and recent extensions that enable further computational gains. Importantly, Section 2 only explains sparse GPs but outside the scope of approximate inference. This is subject of Section 3 giving a general background on VI where we refer to weight space models (such as deep neural networks) to ease the exposition. In Section 4, we combine the previous two sections and elucidate how to do VI with sparse GPs (that are function space models), but we also provide some tricks for practitioners. In the end, we cover some further reaching topics around sparse GPs and VI (Section 5) before concluding with a summary in Section 6.

## 2 Sparse Gaussian Processes

Informally, GPs can be imagined as a generalization of multivariate Gaussians that are indexed by a (possibly continuous) input domain rather than an index set. Exact and approximate inference techniques with GPs leverage conditioning operations that are conceptually equivalent to those in multivariate Gaussians. In Section 2.1, we therefore provide an overview of the most important conditioning operations in multivariate Gaussians and present their GP counterparts in Section 2.2. It turns out that these conditioning operations provide a natural way to express sparse GPs and generalize readily to interdomain GPs (Section 2.3), GPs with multiple outputs (Section 2.4) and deep GPs consisting of multioutput GPs stacked on top of one another (Section 2.5).

### 2.1 Multivariate Gaussian Identities for Conditioning

The identities presented in this section might evoke the impression of being a bit out of context at first sight but will turn out to be essential for understanding sparse GPs as presented in Section 2.2 and subsequent sections. We start by noting that conditionals of multivariate Gaussians are Gaussian as well. For that purpose, imagine a multivariate Gaussian whose random variables are partitioned into two vectors and respectively. The joint distribution then assumes the following form:

(1) |

where and refer to the marginal means of and respectively, and , , and to (cross-)covariance matrices. The conditional distribution of given can then be expressed as:

(2) |

Let’s imagine that other than the marginal distribution with mean and covariance from Equation (1), there is another Gaussian distribution over with mean and covariance :

(3) |

Denoting the distribution from Equation (2) as , and the distribution from Equation (3) as , one obtains a marginal distribution over as that is again Gaussian:

(4) |

A quick sanity check reveals that if we had integrated with the marginal distribution from Equation (1) instead of from Equation (3), we would have recovered the original marginal distribution with mean and covariance from Equation (1) as expected.

Importantly, Equations (1) to (4) remain valid if we define as via a linear transformation of the random variable . In this case, since and are given, the only remaining quantities to be identified are the mean and the (cross)-covariance matrices , and , yielding:

(5) | |||||

(6) | |||||

(7) |

Note that the joint covariance matrix of and is degenerate (i.e. singular) because is the result of a linear transformation of and hence completely determined by .

### 2.2 Gaussian Processes and Conditioning

A GP represents a distribution denoted as over real-valued functions defined over an input domain that we assume to be continuous throughout this tutorial (although this does not need to be the case). While multivariate Gaussians represent a distribution over finite-dimensional vectors, GPs represent a distribution over uncountably infinite-dimensional functions—vector indexes in multivariate Gaussian random variables conceptually correspond to specific evaluation points in GP random functions. Formally, a GP is defined through two real-valued functions: a mean function and a positive-definite covariance function a.k.a. kernel [Rasmussen and Williams, 2006]:

(8) |

where and in accordance with multivariate Gaussian notation. Importantly, if we evaluate the GP at any finite subset of with cardinality , we would obtain an -dimensional multivariate Gaussian random variable :

(9) |

where and are the mean and covariance matrix obtained by evaluating the mean and covariance function respectively at , i.e. and . Here, square brackets conveniently refer to numpy indexing notation for vectors and matrices: refers to the entry at index of the vector , and refers to the entry at row index and column index of the matrix . While the notation with subscript puts emphasis on the random variable and is standard in the literature, it unfortunately hides away the explicit “dependence” on the evaluation points which might confuse readers new to the subject.

Similarly to Equation (1) from the previous section on multivariate Gaussians, we can partition the uncountably infinite set of random variables represented by a GP into two sets: one that contains a finite subset denoted as evaluated at a finite set of evaluation points , such that , with mean and covariance matrix ; and one that contains the remaining uncountably infinite set of random variables denoted as evaluated at all locations in except for the finitely many points :

(10) |

where and denote vector-valued functions that express the cross-covariance between the finite-dimensional random variable and the uncountably infinite-dimensional random variable , i.e. and . Note that is row-vector-valued whereas is column-vector-valued, and that holds.

One might ask why we have chosen the same notation in Equation (10) as in Equation (8) although both random variables have different evaluation domains ( without inducing points versus all of ). The answer is a more careful notation could have been adopted [Matthews et al., 2016] but, technically, Equation (10) is correct and poses a degenerate distribution over and , because is completely determined by (since is the GP evaluated at the inducing points ).

The conditional GP of conditioned on corresponding to the joint from Equation (10) is obtained in a similar fashion as the conditional multivariate Gaussian from Equation (2) is obtained from the joint multivariate Gaussian in Equation (1), resulting in:

(11) |

Similarly to the multivariate Gaussian case, one can assume another marginal distribution over as in Equation (3) (other than which is the marginal distribution over with mean and covariance in accordance with Equation (10)). Denoting the conditional GP from Equation (11) as and integrating out with yields , which is a GP:

(12) |

and conceptually equivalent to its multivariate Gaussian counterpart from Equation (4). With Equation (12), we have arrived at the definition of a sparse GP as used in contemporary literature [Titsias, 2009]. In this context, the evaluation points are usually called “inducing points” or “inducing inputs” that refer to pseudo-training examples, and is called “inducing variable” conceptually referring to noise-free pseudo-outputs observed at the inducing inputs.

The number of inducing points governs the expressiveness of the sparse GP—more inducing points mean more pseudo-training examples and hence a more accurate approximate representation of a function posterior. However, since determines the dimension of , more inducing points also mean higher memory requirements (as needs to be stored) and higher computational complexity (as needs to be inverted which is a cubic operation ). Practical limitations therefore incentivise low explaining why the formulation is referred to as a “sparse” GP in the first place. Note that this line of reasoning assumes a naive approach for dealing with covariance matrices—there are are recent advances that enable approximate computations based on conjugate-gradient-type algorithms [Gardner et al., 2018], but this is outside the scope of this tutorial.

At this point, we need to highlight that the notation related to the approximate posterior process is mathematically sloppy since it colloquially refers to a distribution over functions for which no probability density exists. We will nevertheless continue with this notation occasionally—or with the notation to refer to the prior distribution over —where we feel it makes the subject more easily digestible.

In practice, is used to approximate an intractable posterior process through variational inference—we will learn more about what this means in Section 4. In short and on a high level, variational inference phrases an approximate inference problem as an optimization problem where inducing points , as well as the mean and the covariance of the inducing variable distribution , are treated as optimization arguments that are automatically identified in the course of training. The importance of Equation (12) is substantiated by the fact that it remains valid for interdomain GPs (Section 2.3) and multioutput GPs (Section 2.4), and forms the central building block of modern deep GPs (Section 2.5), all of which can be practically trained with variational inference.

### 2.3 Interdomain Gaussian Processes

In the preceding section, when introducing sparse GPs, we have defined as a random variable obtained when evaluating the GP on a set of inducing points . In line with the second part of Section 2.1, we could have alternatively defined differently as through a linear functional on with help of a set of “inducing features” that are real-valued functions [Lazaro-Gredilla and Figueiras-Vidal, 2009]. It turns out that in this case, Equation (12) remains valid and the only quantities to be identified are the mean for inducing variables as well as (cross-)covariances and , which are computed as follows:

(13) | |||||

(14) | |||||

(15) | |||||

We ask the reader at this stage to pause and carefully compare the definition of in this section and Equations (13) to (15), with the definition of at the end of Section 2.1 on multivariate Gaussians and Equations (5) to (7). Equations (5) to (7) can be rewritten in numpy indexing notation as:

(16) | |||||

(17) | |||||

(18) |

This provides a good intuition of how interdomain GPs relate to linear transformations of random variables in the mutlivariate Gaussian case: mean functions and covariance functions correspond to mean vectors and covariance matrices, inducing features to feature vectors, and integrals over a continuous input domain to sums over discrete indices. Mathematically, is a real-valued linear integral operator (which maps functions to real numbers) that generalizes the concept of a real-valued linear transformation operating on finite-dimensional vector spaces.

After obtaining a mathematical intuition for interdomain GPs and how they relate to linear transformations in multivariate Gaussians, it can be insightful to get a conceptual intuition with concrete examples. An important characteristic of the interdomain formulation is that inducing points can live in a domain different from the one in which the GP operates (which is ), hence the naming. Of practical importance is also the question how to choose the features such that the covariance formulations from Equations (14) and (15) have closed-form expressions (the mean formulation from Equation (13) evaluates trivially to zero for a zero-mean function that assigns zero to every input location , which is a common practical choice).

In the following, we present four examples: Dirac, Fourier, kernel-eigen and derivative features. Dirac features recover ordinary inducing points as a special case of the interdomain formulation. Fourier features enable inducing points to live in a frequency domain (different from that is considered as time/space domain in this context). Kernel-eigen features are conceptually equivalent to principal components of a finite-dimensional covariance matrix but for uncountably-infinite dimensional kernel functions. Derivative features enable inducing points to evaluate the derivative of w.r.t. specific dimensions of , as opposed to the the ordinary inducing point formulation that enables inducing points to evaluate .

#### Dirac Features.

Dirac features are trivially defined as through the Dirac delta function that puts all probability mass on . This makes inducing points live in as expected, and , and recover the expressions from Section 2.2 for ordinary sparse GPs. Since Dirac features recover the ordinary inducing point formulation through a linear integral operator, it becomes clear that we can choose the same notation for both random variables in Equations (10) and (8), without worrying too much about one input domain being excluding finitely many inducing points and the other one all of .

#### Fourier Features.

Fourier features are defined as where refers to an inducing frequency vector and to the complex unit. Note that, practically, boundary conditions need to be defined for Fourier features. Otherwise, would have infinite-valued entries on its diagonal for any stationary kernel (stationary kernels are a specific type of kernel where the covariance between two input locations and only depends on the distance between and ). Also note that for specific stationary kernels, and have real-valued closed-form expressions—see Hensman et al. [2018] for details. Fourier features underpin the term “interdomain” because the integral operator enables inducing points to live in a frequency domain different from the time/space domain . While Hensman et al. [2018] chose fixed inducing frequencies arranged in a grid-wise fashion, an interesting future research direction is to treat as optimization arguments in the context of approximate inference (e.g. variational inference as explained later in Section 3). We need to stress that the original Fourier feature formulation from Hensman et al. [2018] does not use the inner product between and to define inducing variables as we did at the beginning of this section, but introducing alternative ways how to define the inner product between two functions is outside the scope of this tutorial.

#### Kernel-Eigen Features.

Kernel-eigen features can be given as where refers to the -th eigenfunction of the kernel with eigenvalue . Eigenfunctions of kernels are defined as , similarly to eigenvectors of square matrices when replacing integrals over with sums over indexes. Equation (14) then trivially evaluates to:

(19) |

and is conceptually equivalent to whitening finite-dimensional vectors via principal component analysis. The covariance matrix from Equation (15) is computed as follows:

(20) | |||||

which is a diagonal matrix [Burt et al., 2019]. The step from the second to the third line merely utilizes the eigenfunction definition. The step from the third to the fourth line leverages that eigenfunctions are orthonormal, which means equals one if equals and is zero otherwise. This is of great practical importance since a diagonal is more memory-efficient and invertible in linear rather than cubic time. Identifying eigenfunctions and eigenvalues for arbitrary kernels in closed-form is non-trivial, but solutions do exist for some [Rasmussen and Williams, 2006, Borovitskiy et al., 2020, Burt et al., 2020, Dutordoir et al., 2020a, Riutort-Mayol et al., 2020].

#### Derivative Features.

In addition to the linear integral operator formulation from above in Equations (13) to (15) that contains Dirac, Fourier and kernel-eigen features as special cases, there is another possibility to define interdomain variables via the derivatives of a GP’s function values [Adam et al., 2020, van der Wilk et al., 2020]. In this case, the inducing variable is expressed as where refers to an inducing point and to the input dimension of over which the partial derivative is performed (as determined by the interdomain variable with index ). Every interdomain variable needs to specify a particular input dimension of the derivative of is taken with respect to. Equations (13) to (15) then become:

(21) | |||||

(22) | |||||

(23) |

presupposing differentiable mean and kernel functions. This can be useful, e.g. when is a time domain and represents a space domain, but one seeks to express interdomain variables in a velocity domain [O’Hagan, 1992, Rasmussen and Williams, 2006]. Mathematically, the results above are not surprising since differentiation is a linear operator over function spaces.

### 2.4 Multioutput Gaussian Processes

A multioutput GP extends a GP to a distribution over vector-valued functions , where refers to the number of outputs [Alvarez et al., 2012]. While an ordinary GP outputs a scalar Gaussian random variable when evaluated at a specific input location , a multioutput GP outputs a -dimensional multivariate Gaussian random variable when evaluated at a specific input location . Formally, a multioutput GP is defined similarly to an ordinary singleoutput GP as in Equation (8), but with a vector-valued mean function and a matrix-valued covariance function [Micchelli and Pontil, 2005]. Note that needs to output a proper covariance matrix for every -pair and must hence satisfy the following symmetry relation , i.e. in numpy indexing notation. We would like to remind the reader at this stage to not confuse multioutput notation with singleoutput notation from earlier for random vectors , mean vectors and covariance matrices that result from evaluating a singleoutput GP in multiple locations .

One might ask how to evaluate a multioutput GP since this would naively lead to a random variable of extend where refers to the number of evaluation points. The answer is that we can flatten the random variable into a vector of size . This results in a multivariate Gaussian random variable as described in Equation (9) with a mean vector of size and a covariance matrix of size . While the choice of flattening is up to the user, one could e.g. concatenate all -dimensional random variables for each evaluation point. This yields a partitioned mean vector with partitions of size and a block-covariance matrix with blocks each holding a matrix. The block covariance matrix is required to form a proper covariance matrix, i.e. it needs to be symmetric (which is guaranteed by the imposed symmetry relation from the previous paragraph).

The “flattening trick” already hints at the fact that a multioutput GP can be indeed defined differently as a distribution over real-valued functions with a real-valued mean function and a real-valued covariance function. This is made possible through the “output-as-input” view where a multioutput GP’s input domain is extended through an index set to index output dimensions [van der Wilk et al., 2020]. More formally, this yields distributed according to the mean function and the covariance function , where the dot notation now refers to a pair (the first element of the pair being the input and the second the output index ). In this notation, the kernel only needs to satisfy the same symmetry relation as an ordinary singleoutput kernel and Equation (8) for singleouput GPs remains applicable (under the extended input domain).

The advantage of defining a multioutput GP this way is that it makes the evaluation at arbitrary input-output-index pairs more convenient, i.e. it is not required to evaluate the multioutput GP at all outputs for a specific input . The latter comes in handy e.g. for (approximate) Bayesian inference in multilabel problems (that have multidimensional targets) when training examples can have missing labels (i.e. the set of labels for some can be incomplete).

The output-as-input view not only ensures that Equation (8) remains valid for multioutput GPs but also ensures that Equation (12) remains valid for sparse multioutput GPs under an output-index-extended input domain. Naively, one could specify inducing points (or inducing features in the interdomain formulation) and inducing variables — for each output head respectively. In Equation (12), this would lead and to be -dimensional vectors, and to be matrices, and and to be -dimensional vector-valued functions.

At this point, we have acquired an understanding of sparse multioutput GPs, but it can be insightful to continue with a discussion on computational efficiency and how to design computationally efficient multioutput GPs in practice. The biggest computational chunk in Equation (12) is the inversion of which poses a cubic operation in —i.e. —under the naive specification from earlier. This can be addressed two-fold: by a separate independent multioutput GP that sacrifices the ability to model output correlations for the sake of computational efficiency [van der Wilk et al., 2020], or by squashing a latent separate independent multioutput GP through a linear transformation to couple outputs—the latter can be achieved with a linear model of coregionalization [Journel and Huijbregts, 1978] or convolutional GPs [Alvarez et al., 2010, van der Wilk and Rasmussen, 2017].

#### Separate Independent Multioutput GP.

A separate independent multioutput GP specifies separate singleoutput GPs—one for each output dimension—and specifies different output dimensions to have zero-covariance, i.e. is a diagonal matrix-valued function under the “output-as-output” view from earlier. Under the -flatting outlined previously, a separate multiouput GP makes be a block matrix with blocks each of which contains a diagonal matrix of size . However, we could have alternatively chosen a -flattening view rather than an -flattening view, in which case would be a block-diagonal matrix with blocks, where each diagonal block is a full matrix but each off-diagonal block contains only zeros. The latter rearrangement enables to invert by inverting the diagonal blocks (each of size ) separately, yielding an improved computational complexity of .

#### Linear Model of Coregionalization.

A linear model of coregionalization [Journel and Huijbregts, 1978] provides a simple approach to construct a multioutput GP that is both computationally efficient and ensures correlated output heads. Resorting back to the output-as-output view, the idea is to squash a latent separate independent multioutput GP with a diagonal matrix-valued kernel through a linear transformation [Dutordoir et al., 2018]. The linear transformation W is defined to be a matrix where denotes the number of latent outputs. Since is user-defined, it enables control over computational complexity because it determines the computational burden of inverting latent covariance matrices. This allows for efficient fully-correlated multioutput GPs with high-dimensional outputs where . Denoting the latent GP in the output-as-output view as:

(24) |

where refers to the latent vector-valued random function and to the latent vector-valued mean function, a proper multioutput GP can be obtained via , resulting in:

(25) |

Rather than defining the latent multiouput GP with Equation (24), we could have alternatively used the corresponding output-as-output view of a sparse multioutput GP according to Equation (12) in combination with the kernel function , in which case Equation (25) would have represented a fully correlated but sparse multioutput GP with efficient matrix inversion in latent space. A more detailed discussion on the topic can be found once more in van der Wilk et al. [2020].

#### Convolutional GP.

In similar vein to a linear model of coregionalization, we can construct coupled output heads from a latent separate independent multioutput GP as given in Equation (24) via a convolutional GP. The idea is to define where is a matrix-valued function that outputs a matrix for every input . The corresponding process over is a proper multioutput GP because of the linearity of the convolution operator, and is given by:

(26) |

where the matrix-valued function is usually chosen such as to yield tractable integrals. The presentation of convolutional GPs here follows the descriptions in Alvarez et al. [2010] and van der Wilk et al. [2020], but similar models have been proposed in earlier work [Higdon, 2002, Boyle and Frean, 2004, Alvarez and Lawrence, 2008, Alvarez et al., 2009]. Note that Equation (26) builds upon a latent GP according to Equation (24) for conceptual convenience, but we could have used a sparse GP according to Equation (12) instead.

#### Image-Convolutional GP.

Despite the similarity in naming, an image-convolutional GP is different from a convolutional GP. Let’s imagine a domain of images and that a single image is subdivided into a set of (possibly overlapping) patches, all of equal size and indexed by . For notational convenience, we define the -th patch of as . We can then define an ordinary singleoutput GP operating in a latent patch space with random functions denoted as and with mean function and kernel , where the notation refers to the -th patch of an input image and where the input image is indicated by the usual dot notation . It turns out that this latent singleouput GP defined in patch space induces a multiouptut GP over vector-valued functions that operates in image space (where the number of outputs equals the number of patches). The vector-valued random function then relates to the latent real-valued function as . Similarly, the multiouput mean function can be described as and the multioutput kernel as . This design was inspired by convolutional neural networks and a first description can be found in van der Wilk and Rasmussen [2017] which was later extended to deep architectures by Blomqvist et al. [2019] and Dutordoir et al. [2020b]. The difference between a convolutional and an image-convolutional GP is hence that the former performs a convolution operation in the input domain, whereas the latter performs an operation that resembles a discrete two-dimensional convolution on a single element from an image input domain.

#### Derivative GP.

We conclude this section with derivative GPs because they provide a natural example of multioutput GPs. Earlier, we have seen how to define interdomain variables that are partial derivatives of a singleoutput GP’s random function (evaluated at specific locations). It turns out that for any singleoutput GP with mean function and kernel , the derivative of w.r.t. gives rise to a proper multioutput GP where the number of output dimensions equals the dimension of —presupposing that and are differentiable. More precisely, in the output-as-output view, we obtain with the corresponding multioutput mean function and the multioutput kernel function . Analysing the derivative of a multioutput GP, the resulting random function and its mean function are matrix-valued—in Jacobian notation and —, and the kernel function is a hypercubical tensor of dimension four (we refrain from the mathematical notation for preserving a clear view). However, by applying the flattening trick from earlier, we can again obtain a multioutput GP in the output-as-output view—e.g. by flattening the matrix-valued random function and its matrix-valued mean function into vector-valued functions, which induces a flattening of the four-dimensional hypercubical tensor-valued kernel function into a matrix-valued kernel function.

### 2.5 Deep Gaussian Processes

A deep GP is obtained by stacking multioutput GPs on top of each other [Damianou and Lawrence, 2013]. The output of one GP determines where the next GP is evaluated, i.e. the output dimension (= number of outputs) of the GP below needs to conform with the input dimension of the GP on top. More formally, imagine multioutput GPs with random vector-valued functions denoted in the output-as-output notation as . A single input is propagated through the deep GP as follows. The first GP with index is evaluated at yielding a vector-valued random variable . Drawing a single sample from determines where to evaluate the second GP with index . This yields another vector-valued random variable that determines where to evaluate the third GP, and so forth. This process is repeated until we arrive at the last GP with index —the random variable is considered as output of the deep GP for input . The graphical model illustrating this process is depicted in Figure 1.

Note that for deep GPs, the notation f refers to a vector-valued random variable as a result of evaluating all output heads of a multioutput GP at a single input location . This clashes with the same notation used for shallow singleoutput GPs where f refers to a vector-valued random variable as a result of evaluating a singleoutput GP in multiple locations . We apologize for the confusion but there are only so many ways to represent vector-valued quantities. Also note that if we propagate samples through a deep GP (instead of just one as depicted above), we would need to evaluate the first GP at all inputs yielding a random variable of size in the output-as-output view (where refers to the number of outputs of the first GP). Sampling precisely once from this random variable results in vector-valued samples of size each, that would be used to evaluate the second GP, and so on. The final output would be a random variable of size , where refers to the number of outputs of the last GP.

The behavior of a deep GP is illustrated in Figure 2 for singleoutput GP building blocks that have one-dimensional input domains (for reasons of interpretability). With increasing depth, function values become more narrow-ranged and the function’s behavior changes from smoothly to more abruptly [Duvenaud et al., 2014]. The latter is not surprising since once samples at intermediate layers are mapped to similar function values, they won’t assume very different function values in subsequent layers. While this enables to potentially model challenging functions that are less smooth (which may be difficult with an ordinary shallow GP), the marginal distributions over function values in every layer (except for the first one of course) are no longer Gaussian and hence impede analytical uncertainty estimates, which is usually considered a hallmark of GP models.

We conclude this section by noting that the definition of a deep GP is independent of the type of shallow GP used as a building block. Modern deep GPs stack sparse GPs, as presented in Equation (12), on top of each other to be computationally efficient [Salimbeni and Deisenroth, 2017, Salimbeni et al., 2019]. Up to this point, we haven’t yet addressed how to do inference in (deep) sparse GPs. The reason is that exact inference is not possible and one has to resort to approximate inference techniques. Variational inference (VI) is a convenient tool used by contemporary literature in this context. We therefore dedicate the next section (Section 3) to explain general VI, before coming back to VI in shallow and deep sparse GPs later on in Section 4.

## 3 Variational Inference

VI is a specific type of approximate Bayesian inference. As the name indicates, approximate Bayesian inference deals with approximating posterior distributions that are computationally intractable. The goal of this section is to provide an overview of VI, where we resort mostly to the parameter space view and where we assume a supervised learning scenario. How to combine sparse GPs (that are function space models) with VI is subject of Section 4 later on. We begin with Section 3.1, where we derive vanilla VI. In Section 3.2, we demonstrate an alternative way to derive VI with importance weighting that provides a more accurate solution at the expense of increased computational complexity. After that, in Section 3.3, we introduce latent-variable VI to enable more flexible models. In Section 3.4, we combine latent-variable VI with importance weighting to trade computational cost for accuracy. Finally, in Section 3.5, we present how to do VI in a hierarchical and compositional fashion giving rise to a generic framework for Bayesian deep learning via the concept of Bayesian layers [Tran et al., 2019].

### 3.1 Vanilla Variational Inference

Let’s start be revisiting Bayesian inference for supervised learning. Imagine some input , some observed variable and the probability of observing given via a parametric distribution with hyperparameters and unknown parameters . Our goal is to infer and we have some prior belief over through the distribution , where we assume for notational convenience that both and are hyperparameterized by . Inference over is then obtained via the posterior distribution over after observing and :

(27) |

where is referred to as likelihood, as prior and as marginal likelihood (or evidence). The corresponding graphical model for this inference problem is depicted in Figure 3 A) on the left-hand side (denoted as “general formulation”). The graphical model represents the joint distribution between and given , which is also referred to as “generative model”. The challenge in computing the posterior is that the marginal likelihood usually does not have a closed form solution except for special cases, e.g. when the prior is conjugate to the likelihood (which we won’t consider in this tutorial). When the marginal likelihood does have a closed form solution, it is usually maximized w.r.t. to the hyperparemeters of the generative model before the exact posterior is computed [Bishop, 2006, Rasmussen and Williams, 2006]. Note that the hyperparameters are also referred to as “generative parameters”.

The idea in VI is to introduce an approximation , parameterized via , to the intractable posterior , and to optimize for such that the approximate posterior becomes close to the true posterior. In this regard, the approximate posterior is also referred to as “variational model” and as “variational parameters”. The question is which optimization objective to choose to identify optimal variational parameters . We are going to respond to this question shortly but for now, we commence with the negative Kullback-Leibler divergence (KL) between the approximate and the true posterior, which can be written as (by applying Bayes’ rule to the true posterior):

(28) |

Rearranging by bringing the log marginal likelihood term to the left yields:

(29) |

The term on the right-hand side is referred to as the evidence lower bound [Rasmussen and Williams, 2006] since it poses a lower bound to the log marginal likelihood (a.k.a. log evidence)—“log evidence lower bound” might hence be a more suitable description but omitting “log” is established convention. The ELBO is a lower bound because the KL between the posterior approximation and the true posterior is non-negative. Since the log marginal likelihood does not depend on the variational parameters , the ELBO assumes its maximum when the approximate posterior equals the true one, i.e. , in which case the KL term on the left is zero and the ELBO recovers the log marginal likelihood exactly.

Note how the formulation for the ELBO does not require to know the true posterior in its functional form a priori in order to identify an optimal approximation, because Equation (29) was obtained via decomposing the intractable posterior via Bayes’ rule. Also note that the log marginal likelihood is usually the preferred objective to maximize for the generative hyperparameters , as mentioned earlier. Contemporary VI methods therefore maximize the evidence lower bound jointly w.r.t. both generative parameters and variational parameters . Some current methods with deep function approximators choose a slight modification of Equation (29) by multiplying the KL term between the approximate posterior and the prior with a positive -parameter. This is called “-VI” and recovers a maximum expected log likelihood objective as a special case when . It has been proposed by Higgins et al. [2017], and Wenzel et al. [2020] provide a recent discussion.

Assuming an optimal approximate posterior has been identified after optimizing the ELBO w.r.t. both variational parameters and generative parameters , the next question is how to use it, namely how to predict for a new data point that is not part of the training data. The answer is:

(30) |

by forming the joint between the likelihood and the approximate posterior , and integrating out . If the integration has no closed form, one has to resort to Monte Carlo methods—i.e. replace the integral over with an empirical average via samples obtained from .

So far, we haven’t made any assumptions on how the generative model looks like precisely. In supervised learning, it is however common to assume an i.i.d. dataset in the sense that the training set comprises i.i.d. training examples in the form of ()-pairs. The corresponding graphical model is depicted in Figure 3 A) on the right denoted as “i.i.d. dataset”. In this case, the likelihood is given by and the ELBO becomes:

(31) |

An interesting fact to note is that when is large, Equation (31) can be approximated by Monte Carlo using minibatches, which can be used for parameter updates without exceeding potential memory limits [Hensman et al., 2013]. The corresponding predictions for new data points in the i.i.d. setting are:

(32) |

Note that we have deliberately not made any assumptions on the dimensions of , , , and to keep the notation light (which doesn’t mean that these quantities need to be scalars). We also chose the weight space view by using instead of the function space view, although both are conceptually equivalent. The function space view can be obtained by replacing with in all formulations and equations above. Practically, one would need to be careful with expectations and KL divergences between infinite-dimensional random variables. We are going to address this issue later in Section 4 when talking about VI in sparse GPs (that naturally assume the function space view).

Since the probability distributions have been held generic so far, it can be insightful to provide some examples. To this end, imagine an i.i.d. regression problem with one-dimensional labels. Assume the prior is a mean field multivariate Gaussian over the vectorized weights in a neural network, with mean vector and variance vector . Let the likelihood be a homoscedastic Gaussian with variance , whose mean depends on the neural net’s output—i.e. where denotes the output of the neural net for the input . In this context, the superscript marks the likelihood variance as a generative parameter. The variational approximation could then be a mean field multivariate Gaussian as well, with mean vector and variance vector , where the superscript marks variational parameters. We have just arrived at a vanilla Bayesian neural network as in Blundell et al. [2015].

We can increase the expressiveness of the likelihood by making it heteroscedastic, i.e. by letting the neural net output a two-dimensional vector instead of a scalar to encode both the mean and the variance. This is achieved by defining and where and index both network outputs and is a strictly positive function (because the neural net’s output is unbounded in general). In the latter case, there wouldn’t be any generative parameter anymore because the likelihood variance has become a function of the neural net’s output.

In practice, during optimization with gradient methods, the reparameterization trick [Kingma and Welling, 2014, Rezende et al., 2014] is applied to the random variable in order to establish a differentiable relationship between and the parameters of the distribution from which is sampled, i.e. the mean and the variance parameters and . This is known to produce parameter updates with lower variability leading to better optimization. If we replace the prior and the approximate posterior with (multioutput) GPs and replace the notation and with and accordingly in the likelihood, we would obtain the GP equivalents of the homoscedastic and heteroscedastic Bayesian neural networks respectively. However, in GPs, one usually treats certain kernel hyperparameters as generative parameters , which means the prior is subject to optimization as opposed to the Bayesian neural network case.

We conclude by mentioning a related approximate inference scheme called expectation propagation (EP) [Bishop, 2006, Bui et al., 2017] that also encourages an approximate posterior to be close to the true posterior via a KL objective, similar to VI. In fact, EP chooses a similar objective as in Equation (28) but with swapped arguments in the KL. The practical difference is that VI tends to provide mode-centered solutions whereas EP tends to provide support-covering solutions at the price of potentially significant mode mismatch [Bishop, 2006] (if the approximation is unimodal but the true posterior multimodal). There are at least two more advantages of VI over vanilla EP. First, in VI, the expectation is w.r.t. to the approximate posterior and hence amenable to sampling and stochastic optimization with gradient methods, whereas the expectation in EP is w.r.t. to the unknown optimal posterior. And second, VI is principled in that it lower-bounds the log marginal likelihood therefore encouraging convenient optimization not only over variational but also generative parameters.

### 3.2 Importance-Weighted Variational Inference

Importance-weighted VI provides a way of lower-bounding the log marginal likelihood more tightly and with less estimation variance at the expense of increased computational complexity. We start by showing that there is an alternative way to derive the ELBO, in addition to the derivation from the previous section, according to the following formulation:

(33) | |||||

(34) |

where the inequality comes from applying Jensen’s inequality that swaps the logarithm with the expectation over . While this derivation is straightforward, it has the disadvantage of only demonstrating that the ELBO lower-bounds the log marginal likelihood but not by how much, namely the KL between the approximate and the true posterior, as shown in Equation (29).

In order to obtain an importance-weighted formulation that bounds the log marginal likelihood more tightly [Burda et al., 2016, Domke and Sheldon, 2018], we need to proceed from Equation (33) before applying Jensen:

(35) | |||||

(36) | |||||

(37) | |||||

(38) |

In Equation (36), the expectation in Equation (35) is replicated times by introducing i.i.d. variables