Conditionally Independent Multiresolution Gaussian Processes

# Conditionally Independent Multiresolution Gaussian Processes

Jalil Taghia
Systems and Control Division
Department of Information Technology, Uppsala University
Uppsala, Sweden
jalil.taghia@it.uu.se
&Thomas B. Schön
Systems and Control Division
Department of Information Technology, Uppsala University
Uppsala, Sweden
thomas.schon@it.uu.se
###### Abstract

We propose a multiresolution Gaussian process (GP) model which assumes conditional independence among GPs across resolutions. The model is built on the hierarchical application of predictive processes using a particular representation of the GP via the Karhunen-Loève expansion with a Bingham prior model, where each basis vector of the expansion consists of an axis and a scale factor, referred to as the basis axis and the basis-axis scale. The basis axes have unique characteristics: They are zero-mean by construction and live on the unit sphere. These properties allow us to further assume that the axes are shared across all resolutions while their scales remain resolution specific. The properties of the Bingham distribution makes it the natural choice when it comes to modeling the axes. We drive a fully Bayesian inference for the model using a structured variational inference with a partially factorized mean-field approximation which learns a joint Gaussian-Bingham posterior distribution over the basis-axis scales and the basis axes. Relaxing the full independence assumption enables the construction of models which are robust to overfitting in the sense of sensitivity to the chosen resolution and predictions that are smooth at the boundaries. Our new model and inference algorithm are compared against current state of the art on 2 synthetic and 9 real-world datasets.

## 1 Introduction

There is rich literature on methods designed to avoid the computational bottleneck incurred by the vanilla Gaussian process (GP) solution, including sub-sampling [33], low rank approximations [9], covariance tapering [14], inducing variables [32; 35], predictive processes [3], and multiresolution models [34; 31], to name just a few. Here, we focus mainly on the low rank approximations.

A GP typically assume certain smoothness properties that can undesirably soften abrupt local changes of interest. Although some less smooth kernel choices can be helpful at times, they assume stationary processes that do not adapt well to varying levels of smoothness. The undesirable smoothness characteristic of traditional GPs could further get pronounced in approximate GP methods in general and rank-reduced approximations in particular [39]. A way to overcome the limitations of low rank approximations is to recognize that the long-range dependencies tend to be of lower rank than short-range dependencies. This idea has been explored in the context of hierarchical matrices [16; 4; 2] and in multiresolution models [34; 31; 22].

Multiresolution GPs, seen as hierarchical models, connect collections of smooth GPs each of which is defined over an element of a random nested partition [15; 12; 11]. The long-range dependencies are captured by the GP at the top of hierarchy while the bottom-level GPs capture the local changes. We can also view the multiresolution GPs as a hierarchical application of predictive processes—approximations of the true process arising from conditioning the initial process on parts of the data [3; 32]. Application of such models has recently been exploited in spatial statistics [34; 31; 22] for modeling large spatial datasets. Refer to [12] and [22] for an overview of each category.

Current multiresolution models based on the predictive processes, although effective in terms of computational complexity, assume full independence among resolutions. The independence assumption however results in models which are inherently susceptible to the chosen resolution and approximations which are non-smooth at the boundaries. The latter one stems from the fact that the multiresolution framework, e.g., [22], recursively splits each region at each resolution into a set of subregions. As discussed by Katzfuss and Gong [23], since the remainder process is assumed to be independent between these subregions, this can lead to discontinuities at the region boundaries. A heuristic solution based on tapering functions is proposed in [23] which uses Kanter’s function as the modulating function to address this limitation. The former one, sensitivity to the chosen resolution, is partly due to the nature of the remainder process and the unconstrained representative flexibility of the GPs which manifests itself most noticeably at higher resolutions. As the size of the region under consideration decreases as the resolution increases, the remainder process may inevitably include certain aspects of data which might not be the patterns of interest. When all GPs are forced to be independent, there is no natural mechanism to constrain the representative flexibility of the GPs.

These limitations can be naturally addressed by allowing uncertainty to propagate between resolutions, specifically, by conditioning the GPs on each other. Thus, here, we propose a new model which unlike the previous models that impose full independence among resolutions, instead assumes conditional independence. Relaxing the full independence assumption is shown to result in models robust to overfitting in the sense of reduced sensitivity to the chosen resolution—that is regardless of the extra computational complexity, arbitrary increasing the resolution has small effect on the optimal model performance. Furthermore, it results in predictions which are smooth at the boundaries. This is facilitated by constructing a low-rank representation of the GP via a Karhunen-Loève expansion with the Bingham prior model that consists of basis axes and basis-axis scales. Our multiresolution model ties all GPs, across all resolutions, to the same set of basis axes. These axes are learnt successively in a Bayesian recursive fashion. We consider a fully Bayesian treatment for the proposed model and derive a structured variational inference based on the partially factorized mean-field approximation.

The idea of using conditional independence in the context of multiresolution GPs has previously been studied by Fox and Dunson [12]. The two models differ in their underlying generative models and in their inference. While the computational complexity of the proposed model scales linearly with respect to the number of samples, Fox & Dunson’s model scales cubically and relies on inference which may further limit its application to large data analysis.

Our main contribution is to develop the conditionally independent multiresolution GP model and to derive a variational inference algorithm to learn this model from data. The Bingham distribution [6] is an important distribution in directional and axial statistics commonly used for shape analysis where the inference is typically based on [25], [30], and [27]. Hence, our use of the Bingham prior model and the corresponding variational inference solution for this prior model could also appeal to the researchers in directional statistics [29].

Finally, this work mainly describes the theoretical derivations in general settings independent of the choice of the basis functions and the spectral densities of the covariance functions used in the series expansion. Details of derivations and experiments are presented in the supplementary materials111An implementation of the model will be made available via GitHub.. The resulting model in its standard form can be useful for analysis of large data. It can also be directly used as a building block in more complex models, for example, nonlinear state-space models [40; 13] and deep Gaussian processes [10]. We illustrate key advantages of the model on both synthetic data and real datasets.

## 2 Karhunen-Loève representation of the Gaussian process

Consider a minimalistic model of GP regression, , where denotes a zero-mean GP prior, denotes the overall bias, denotes Gaussian noise with zero mean and variance , denotes input variables, and denotes the measurements, . The standard solution involves inversion of a Gram matrix which is an operation in general. In the following, we consider low rank representations of the GP enabled via Karhunen-Loève expansion theorem.

##### Gaussian model.

For a -dimensional input variable on the interval , the GP can be represented using the Karhunen-Loève expansion according to [37], , where denotes the basis vectors of the series expansion, denotes the basis intervals such that , denotes the orthogonal eigenfunctions (basis functions) with corresponding eigenvalues , and denotes the spectral density of the covariance function. Note that, unlike the minimalistic representation used by Solin and Särkkä [37], we have explicitly included the basis interval in the representation, which are treated as random variables and are optimized using maximum likelihood estimation.

To ensure that the representation satisfies the dual orthogonality requirement of the Karhunen-Loève expansion, all the basis vectors  must be zero-mean. Normally, we would assign a zero-mean Gaussian distribution over , or alternatively one could assign a zero-mean matrix-normal distribution over as was done by Svensson and Schön [40]. The choice of zero-mean Gaussian priors over the basis vectors would lead to Gaussian posteriors with non-zero means. In our multiresolution model, as we shall see later in Sec. 3, the basis vector posterior needs to be learnt in a recursive fashion such that the posterior from the current resolution is used as the prior for the resolution in the next level of the hierarchy. Now, as the expansion requires the prior to be zero-mean, we would then need a posterior over basis vectors which is zero-mean by construction. If we were going to use Gaussian priors, we would then have had a multiresolution model where all GPs must be fully independent.

To address this issue, we now break the basis vectors into two parts: basis axes and basis-axis scales. The basis axis vectors are defined to be antipodally symmetric and thus zero-mean by construction. They primarily carry information about the direction and we can for that reason without loss of generality assume them to be on the unit sphere. The axes will be shared across resolutions such that given the axes, all GPs are independent. Although GPs are tied to the same set of axes, they will be scaled by resolution-specific variables, namely the basis-axis scales. The axial distributions from directional statistics [29] make for a perfect fit in modeling these axes. In the following we consider a particular choice of prior model, the Bingham prior model, which conveniently allows for the design of a conditionally independent multiresolution model.

##### Bingham model.

Let denote the unit sphere. Furthermore, let such that and denote the basis axes and the basis-axis scales, respectively. Without loss of generality, we can now express the noisy measurements as

 \vcyt=p∑i=1ai\vcuiϕi(\vcxt,\vcτ)+\vcb+\vcet,∀t∈T. (1)

The basis axes are modeled as Bingham distributions [6] according to , where denotes the Bingham distribution parameterized with a real-symmetric matrix . It is straightforward to show that satisfies the Karhunen-Loève expansion requirements. Importantly, the Bingham distribution is antipodally symmetric, meaning that , which in turn implies that by construction [29, Ch. 9.4]. We can then assign zero-mean Gaussian distributions as priors over the basis-axis scale variables . Assuming , and using , this choice of prior over and is conveniently conjugate to the data likelihood.

The main limitation of our choice of using the Bingham prior model is the implicit requirement of , as the Bingham density is defined on . Other prior models should be considered for the special case of . One possible choice is constituted by the one-parameter variant of the Bingham model [24] for modeling axes concentrated asymmetrically near a small circle222Note that, for , if we simply assume , the model reduces to a multiresolution architecture with fully independent GPs.. Here, we restrict ourselves to the Bingham prior model and cases where .

## 3 Model

##### Notation.

Consider a recursive partitioning of the index set across resolutions. At each resolution , is partitioned into a number of non-overlapping regions. The partitioning of can be structured or random. Without loss of generality, consider a uniform subdivision of the index set across resolutions by a factor of , such that is first partitioned into regions, each of which is then partitioned into subregions. The partitioning continues until resolution  where the index sets at various resolution are denoted by

 T(0)\colonequalsT,T(1)={T(0)1,…,T(0)q},⋯,T(m) ={T(m−1)1,…,T(m−1)q},

where , , and . An example of such partitioning by a factor of is shown in Fig. 1-(a). As a convention, we will use the notation to indicate the -th element of the set , which corresponds to the index set related to region at resolution . We also define and , where .

##### Generative model.

As before, let be the stochastic process of interest. Once the process is observed at , it gives rise to the noisy observations . By making use of a Gaussian process as the prior over , the observations at resolution are modeled according to (1). In a multiresolution setting based on the hierarchical application of predictive processes, we approximate according to , where is the approximate predictive process at resolution , and is the so-called remainder process. Let indicate the noisy instantiations of the latent process at . We will treat as a latent variable, and model it using a conditionally independent GP prior, , , where the basis axes  are shared among all the processes while the basis-axis scales  are region specific. At the higher resolution, , the latent process is in turn approximated by . In general, for resolution we have , where is the remainder process at resolution whose noisy instantiations on are modeled according to: .

Throughout, has been written without indexing w.r.t.  and . This is to emphasize that these are shared across all resolutions and regions such that in transition from one resolution to another, the axes of the basis vectors remain the same but they may be scaled differently via a region-specific and resolution-specific variable . We have expressed noise with indexing on both and , but one could assume the noise to be only a resolution-specific variable. In a multiresolution model, bias may not be simply removed as a part of the preprocessing step, as the bias at each resolution carries uncertainties from the previous resolutions. These parameters are expressed using indexing on both and . We have indicated the basis functions with indexing on , as generally one might consider a different choice of basis functions at different resolutions. The basis interval variables are learnt from data and hence expressed with indexing on both and .

The recursive procedure continues until resolution is reached. By assuming that the latent remainder process at approaches zero, we can approximate  as the sum of predictive processes from all resolutions, , where captures global patterns and finer details are captured at higher resolutions.

## 4 Bayesian inference

##### Notation.

Let where denote the set of noisy observations, and denote the set of latent variables for , where . We denote the latent function instantiations at by . Similarly, let . Furthermore, to keep the notation uncluttered, let:

 \vczT–––––––={\vczT(j)}mj=1,\vcxT–––––––={\vcxT(j)}mj=0,fT–––={fT(j)}mj=0,˜f––(j)={fT(j′)}j−1j′=0,∀j≥1, \vca–––––={{\vca(j)l}|T(j)|l=1}mj=0, \vca(0)≡\vca,γ––={{γ(j)l}|T(j)|l=1}mj=0, γ(0)≡γ, \vcb–––––={{\vcb(j)l}|T(j)|l=1}mj=0, \vcb(0)≡\vcb,\vcθ–––––={{\vcθ(j)l}|T(j)|l=1}mj=0, \vcθ(j)l={\vca(j)l,\vcU,\vcb(j)l,γ(j)l}.
##### Joint distribution.

The complete joint distribution of observations and all variables is expressed as

 (2)

where the pair of and are hierarchical parameters which shall be discussed shortly. The corresponding graphical model is shown in Fig. 1-(b).

The prior model parameter in (2) is factorized as

 p(\vcθ(j)l,\vcΓ,\vcr)=p(\vcb(j)l∣γ(j)l)p(γ(j)l)p(\vcU∣\vcΓ)p(\vca(j)l∣\vcr)p(\vcr∣\vcΓ)p(\vcΓ). (3)

To facilitate expression of the conditional distributions, let , indicate a binary switch parameter such that when and when . The conditional distribution of the observations is expressed by

 p(\vcyT∣fT(1),\vcxT,\vcθ(0))=∏k∈{0,1}[|T(1)|∏l=1∏t∈T(1)lp0(ft)]1−Z(j)k[∏t∈T(0)N(\vcyt;\vcb+p∑i=1ai\vcuiϕ(0)i(\vcxt,\vcτ(0)),γ−1)]Z(j)k,

and the conditional distribution of the latent variables , , is expressed by

 p(\vczT(j)l∣fT(j+1),˜f––(j)l,\vcxT(j)l,\vcθ(j)l)=∏k∈{j,j+1}[|T(j+1)|∏l=1∏t∈T(j+1)lp0(ft)]1−Z(j)k[∏t∈T(j)lN(\vcz(j)t,l;¯\vcz(j)t,l,γ(j)l−1)]Z(j)k,

where , , is defined as: , and approaches the Dirac point density , while the exact form of this density is not critical.

As mentioned earlier, in the expression of the joint distribution (2) we have introduced hierarchical parameters and , which are not explicit in the generative model, Fig. 1-(b). The parameters represent the precision of the basis-axis scale parameter and are shared across resolutions and regions. These parameters will enable automatic determination of the effective number of basis axes, as the posterior will approach zero for axes that are effectively not used. Thus at each resolution and at each region, only a subset of the basis axes will be used and others will have little to no influence. Furthermore, our recursive framework requires the indexing of axes of to be the same across resolutions. More precisely, we shall learn the posterior distribution over in a Bayesian recursive fashion such that the posterior from the previous resolution is used as the prior for the current resolution. A complication is that , might get totally arbitrary indexing at each resolution. To formally handle the axis-index ambiguity across resolutions, we have introduced a latent sparse matrix of binary indicator variables to account for the possible index permutation between prior and posterior of basis axes in transition from resolution to . A matrix element indicates that the axis identified by index in the posterior model of resolution is identical to the axis denoted by index of the current resolution . In defining the prior distributions, Eq. (A.1) and Eq. (A.2), we have conditioned both and on to ensure accumulation of “aligned prior beliefs” of these parameters across resolutions (see (3) and Fig. 1-b).

The explicit form of the prior distributions over all variables in (3) is discussed in details in App. A.

##### Variational inference.

Using variational inference [21; 7], the goal is to find a tractable approximation to the true posterior distribution. Here, we consider a variational posterior in the form of:

 q(\vczT–––––––,fT–––,fT(m+1),\vcU,\vca–––––,\vcΓ,\vcr)=[m∏j=0|T(j)|∏l=1q(\vcθ(j)l,\vcΓ,\vcr)][m∏j=1|T(j)|∏l=1q(\vczT(j)l)p(fT(j)l∣\vczT(j)l)]p(fT(m+1)),

where further using a partially factorized mean-field approximation:

 q(\vcθ(j)l,\vcΓ,\vcr)=q(\vcb(j)l∣γ(j)l)q(γ(j)l)q(\vca(j)l∣\vcU)q(\vcU)q(\vcr)q(\vcΓ). (4)

We then take and to match the ones in the prior model of the joint expression (2) allowing a tractable solution. A similar approach was used by Frigola et al. [13] and Damianou and Lawrence [10]. Furthermore, notice the difference in factorization of the prior (3) and the posterior (4). In particular, we have considered a joint posterior over basis axes and their scales, . The joint posterior allows us to conveniently use the posterior as the prior in the factorized prior for the sequential (recursive) learning procedure.

Given the joint distribution and our choice of the variational posterior distribution, the variational lower bound is expressed by

 L=L\vcyT+m∑j=1L\vczT(j), (5)

where can be written as the sum of the likelihood and the negative Kullback-Leibler divergence (KLD) between the posterior and the prior,

 L\vcyT=⟨logp(\vcyT∣\vcxT,fT(1),\vcθ(0))⟩q(\vcθ(0))p(fT(1))−⟨logq(\vcθ(0),\vcΓ,\vcr)p(\vcθ(0),\vcΓ,\vcr)⟩q(\vcθ(0),\vcΓ,\vcr).

The notation is used to denote the expectation of its variable with respect to its variational posterior distribution. Similarly can be expressed as the sum of the likelihood and the negative KLD between the posterior and the prior plus the posterior entropy of the remainder term,

 L\vczT(j)⟨logp(\vczT(j)∣\vcxT(j),fT(j+1),˜f––(j),\vcθ(j))⟩q(\vczT(j))q(\vcθ(j))p(˜f––(j))p(fT(j+1))−⟨logq(\vcθ(j),\vcΓ,\vcr)p(\vcθ(j),\vcΓ,\vcr)⟩q(\vcθ(j),\vcΓ,\vcr)−⟨logq(\vczT(j))⟩q(\vczT(j)),

where using (3) and (4), the second term factorizes as:

 ⟨logq(\vcθ(j),\vcΓ,\vcr)p(\vcθ(j),\vcΓ,\vcr)⟩q(\vcθ(j),\vcΓ,\vcr)=⟨logq(\vcb(j),γ(j))p(\vcb(j),γ(j))⟩q(\vcb(j),γ(j))+⟨logq(\vcU)p(\vcU∣\vcΓ)⟩q(\vcU)q(\vcΓ)++⟨logq(\vca(j)∣\vcU)p(\vca(j)∣\vcr)⟩q(\vca(j)∣\vcU)q(\vcr)+⟨logq(\vcΓ)p(\vcΓ)⟩q(\vcΓ)+⟨logq(\vcr)p(\vcr∣\vcΓ)⟩q(\vcr)q(\vcΓ).

Taking into account the convenient form of (5), the optimal posterior distribution can now be obtained by maximizing the lower bound using standard variational inference.

The explicit forms of the optimized variational posterior distributions are derived in App. B. Descriptive statistics of the posterior distributions are summarized in App. C. The predictive process is discussed in App. D. The optimization of the basis interval parameters is discussed in App. E.

## 5 Experiments

Throughout of this section, we consider spectral densities of the Matérn class of covariance functions (order and length scale ), [33, ch. 4], and we consider eigenfunctions of the Laplace operator as the basis functions across all resolutions. Thus, for -dimensional input variable , we choose the basis functions, , with corresponding eigenvalues . The number of basis functions is set to .

In all experiments, we compare the performance of two different multiresolution model architectures, conditionally independent and fully independent models, namely and . Note that here is obtained from by forcing all GPs across resolutions to be independent. For simplicity, we consider uniform subdivision of the index set by a factor of . Finally, for instance, the notation is used to refer to of resolution .

##### Conditional independence versus full independence.

We begin with an illustrative experiment which demonstrates limitations of the full independence assumption, non-smooth boundaries and overfitting in the sense of sensitivity to the chosen resolution. For this demonstration, we compare performance of and at various resolutions on synthetic data and real data. Figure 2-(a) presents a regression task of identifying (-dimensional) latent functions from noisy measurements on dataset, App. F.2.1. The dotted lines show the ground-truth and the solid lines indicate the predictions at test locations within the input range. At resolution , both models and perform comparatively. However, with increasing resolution, these two perform very differently. In particular, notice the non-smooth boundaries in the case of full independence at the highest resolution, , which are almost non-existing in . Given that the training set includes data samples, at practically every single data point is a region, . Also notice that is closely following these data points, exhibiting signs of overfitting. The overfitting issue associated with is partly due to the unconstrained flexibility of GPs which manifest itself at the higher resolutions where the size of the regions under consideration becomes increasingly smaller. In our experiments on real data, however, the overfitting even happened at the lower resolutions. An example on the dataset, a subset of data recorded from a magnetic field, App. F.2.2, is shown in Fig. 2-(b). The -dimensional noisy measurements are shown by dotted lines and the predicted strength of the magnetic fields at three different heights estimated by each method are shown with solid lines. At , both methods and perform equally well but with the increase of resolution to , begins to fail which worsens as the resolution increases, while family of methods remain intact and comparative at all resolutions.

##### Regression on multiple datasets.

We now compare performance of various MRGP models on a number of datasets in a more structured fashion. As baselines, we include other scalable GP methods in this comparison. Key features of the datasets and methods are summarized in Table 3, and are described in more details in App. F. The performance is evaluated in terms of the root-mean-square error (RMSE) and the mean log-likelihood (MLL) on test sets, shown respectively in Table 3 and Table 3. The method