
# Non-linear process convolutions for multi-output Gaussian processes

## Abstract

The paper introduces a non-linear version of the process convolution formalism for building covariance functions for multi-output Gaussian processes. The non-linearity is introduced via Volterra series, one series per output. We provide closed-form expressions for the mean function and the covariance function of the approximated Gaussian process at the output of the Volterra series. The mean function and covariance function for the joint Gaussian process are derived using formulae for the product moments of Gaussian variables. We compare the performance of the non-linear model against the classical process convolution approach on one synthetic dataset and two real datasets.

## 1 Introduction

A multi-output Gaussian process (MOGP) is a Gaussian process (GP) with a covariance function that accounts for dependencies between multiple related outputs (Bonilla et al., 2008). Models that exploit such dependencies are particularly important when some of the outputs are expensive to measure and other, less expensive outputs can be used as surrogates to improve their prediction. A typical example comes from geostatistics, where the accuracy of predicting the concentration of toxic heavy metals like lead or copper, which can be expensive to measure, can be improved by including measurements of pH as secondary variables, something that is significantly less expensive to measure (Goovaerts, 1997).

One of the challenges in multi-output GPs is defining a cross-covariance function between outputs that leads to a valid covariance function for the joint GP. There is extensive literature on ways to build such cross-covariance functions (Álvarez et al., 2012). One such approach is known as process convolution, for which each output is the convolution integral between a smoothing kernel and a latent random process. The approach was introduced by Barry and Ver Hoef (1996) to build covariance functions for single-output GPs, and later for multiple outputs in Ver Hoef and Barry (1998) and Higdon (2002). The convolution integral linearly transforms the underlying latent process, which is usually assumed to be a Gaussian process. The output process is then a GP whose covariance function is obtained by applying the convolution operators to the covariance function of the latent GP.

The main contribution of this paper is the introduction of a non-linear version of the process convolution construction, suitable both for single-output and multi-output GPs. The non-linear model is constructed using a Volterra series where the input function is a latent random process. The Volterra series has been widely studied in the literature on non-linear dynamical systems (Haber and Keviczky, 1999); it generalises the Taylor expansion to the case of non-instantaneous input functions. We treat the latent process as a Gaussian process and, using formulae for the product moments of Gaussian variables, we provide closed-form expressions for the mean function and covariance function of the output process. We approximate the output as a Gaussian process using these mean and covariance functions.

Most attempts to generate non-linear models that involve Gaussian processes come from an alternative representation of the convolution integral based on state space approaches (Hartikainen and Särkkä, 2012; Särkkä et al., 2013). Exceptions include the works by Lawrence et al. (2007) and Titsias et al. (2009), where the non-linearity is a static transformation of the underlying latent GP. We review these and other approaches in Section 4.

We compare the performance of the non-linear model against the classical process convolution approach on one synthetic dataset and two real datasets, and show that the non-linear version provides better performance than the traditional approach.

## 2 Background

In this section, we briefly review the MOGP with process convolutions. We will refer to this particular construction as the convolved MOGP (CMOGP). We also briefly review the Volterra series and the formulae for the product moments of Gaussian variables that we use to construct our non-linear model.

### 2.1 Process convolutions for multi-output Gaussian processes

We want to jointly model the set of processes $\{f_d(t)\}_{d=1}^{D}$ using a joint Gaussian process. In the process convolution approach to building such a Gaussian process, dependencies between different outputs are introduced by assuming that each output is the convolution integral between a smoothing kernel $G_d(t)$ and some latent function $u(t)$,

$$f_d(t) = \int G_d(t-\tau)\, u(\tau)\, \mathrm{d}\tau, \qquad (1)$$

where the smoothing kernel $G_d(t)$ should have finite energy. Assuming that the latent process $u(t)$ is a GP with zero mean function and covariance function $k(t,t')$, the set of processes $\{f_d(t)\}_{d=1}^{D}$ are jointly Gaussian with zero mean function and cross-covariance function between $f_d(t)$ and $f_{d'}(t')$ given by

$$k_{d,d'}(t,t') = \operatorname{cov}\big[f_d(t), f_{d'}(t')\big] = \iint G_d(t-\tau_i)\, G_{d'}(t'-\tau_j)\, k(\tau_i,\tau_j)\, \mathrm{d}\tau_i\, \mathrm{d}\tau_j. \qquad (2)$$

In Álvarez et al. (2012), the authors have shown that the covariance function above generalises the well-known linear model of coregionalization, a form of covariance function for multiple outputs commonly used in machine learning and geostatistics.

The expression for $f_d(t)$ in the form of the convolution integral in (1) is also the representation of a linear dynamical system with impulse response $G_d(t)$. In the context of latent force models, such convolution expressions have been used to compute covariance functions informed by physical systems, where the smoothing kernel is related to the so-called Green's function representation of an ordinary differential operator (Álvarez et al., 2013).
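As a concrete illustration of (1), the convolution can be approximated on a discrete grid. The sketch below is our own illustration rather than the paper's implementation; the EQ covariance, the Gaussian smoothing kernel and all parameter values are assumptions:

```python
import numpy as np

def eq_kernel(t, tp, ell=0.3):
    # Exponentiated quadratic covariance k(t, t').
    return np.exp(-0.5 * (t[:, None] - tp[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
tau = np.linspace(0, 5, 400)          # integration grid
dtau = tau[1] - tau[0]

# Sample the latent process u ~ GP(0, k) on the grid.
K = eq_kernel(tau, tau) + 1e-8 * np.eye(tau.size)
u = rng.multivariate_normal(np.zeros(tau.size), K)

# Smoothing kernel G_d (assumed Gaussian-shaped for the sketch).
def G(s, ell_g=0.2):
    return np.exp(-0.5 * s ** 2 / ell_g ** 2)

# Discretised convolution f_d(t) = sum_j G_d(t - tau_j) u(tau_j) dtau.
t = np.linspace(0, 5, 100)
f = (G(t[:, None] - tau[None, :]) * u[None, :]).sum(axis=1) * dtau
```

The output `f` is a smoothed version of the latent sample; repeating the last two steps with a different `G_d` per output yields correlated outputs driven by the same `u`.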

### 2.2 Representing non-linear dynamics with Volterra series

We borrow ideas from the representation of non-linear systems to extend the CMOGP to the non-linear case. One such representation is the Taylor series, which is the expansion of a time-invariant non-linear system as a polynomial about a fixed working point:

$$f(t) = \sum_{c=0}^{\infty} g_c\, u^{c}(t) = g_0 + g_1 u(t) + g_2 u^{2}(t) + \dots$$

While the Taylor series is widely used in the approximation of non-linear systems, it can only approximate systems for which the input is instantaneous (Haber and Keviczky, 1999).

By the Stone-Weierstraß theorem, a given continuous non-linear system with finite-dimensional vector input can be uniformly approximated by a finite polynomial series (Gallman and Narendra, 1976). The Volterra series is such a polynomial expansion, describing a series of nested convolution operators:

$$f(t) = \sum_{c=0}^{\infty} \int \cdots \int G^{(c)}(t-\tau_1,\dots,t-\tau_c) \prod_{j=1}^{c} u(\tau_j)\, \mathrm{d}\tau_j = G^{(0)} + \int G^{(1)}(t-\tau_1)\, u(\tau_1)\, \mathrm{d}\tau_1 + \iint G^{(2)}(t-\tau_1, t-\tau_2)\, u(\tau_1)\, u(\tau_2)\, \mathrm{d}\tau_1\, \mathrm{d}\tau_2 + \dots$$

The leading term $G^{(0)}$ is a constant, which in practice is assumed to be zero-valued, and the series starts from $c=1$. Because of the convolutions involved, the series no longer models an instantaneous input at time $t$, giving the series a so-called memory effect (Haber and Keviczky, 1999). As with the Taylor series, the approximant needs a cut-off for the infinite sum, denoted $C$; a Volterra series with $C$ sum terms is called $C$-th order.

The representation of a degree-$c$ kernel, i.e. $G^{(c)}$, can be expressed in different forms, such as the symmetric or triangular form (Haber and Keviczky, 1999). A common assumption is that the kernels are homogeneous and separable, such that $G^{(c)}$ is the product of $c$ first-degree kernels. Separability is a stronger assumption, but it reduces the number of unique parameters, which can be very large for a full Volterra series (Schetzen, 1980). It should also be noted that a truncated Volterra series with separable homogeneous kernels is equivalent to a Wiener model (Cheng et al., 2017).
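To make the separable, homogeneous case concrete: each order-$c$ term then collapses to the $c$-th power of a single first-order convolution, so a truncated series can be evaluated as below. This is an illustrative sketch; the input signal, kernel shape and truncation order are assumptions:

```python
import numpy as np

def volterra_separable_homogeneous(u, g, dt, C):
    """Truncated Volterra series with separable, homogeneous kernels:
    f(t) = sum_{c=1}^C ((g * u)(t))^c, where * is discrete convolution."""
    conv = np.convolve(u, g, mode="same") * dt   # first-order response (g * u)(t)
    return sum(conv ** c for c in range(1, C + 1))

dt = 0.01
t = np.arange(0, 2, dt)
u = np.sin(2 * np.pi * t)                   # toy input signal (assumed)
g = np.exp(-0.5 * (t - 1) ** 2 / 0.05 ** 2) # Gaussian-shaped smoothing kernel (assumed)
f = volterra_separable_homogeneous(u, g, dt, C=3)
```

With `C=1` the function reduces to the linear convolution model; higher orders add the memoryful polynomial terms.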

### 2.3 Product moments for multivariate Gaussian random variables

Several of the results that we present in the following section involve the computation of the expected value of the product of several multivariate Gaussian random variables. For this, we use results derived in Song and Lee (2015); we are interested in those results for which the Gaussian random variables have zero mean. In particular, let $X_1,\dots,X_c$ be jointly Gaussian random variables with zero mean values and covariance between $X_i$ and $X_j$ given as $\phi_{ij}$. According to Corollary 1 in Song and Lee (2015), the expression for $\mathbb{E}\big[\prod_{i=1}^{c} X_i^{a_i}\big]$ follows as

$$\mathbb{E}\left[\prod_{i=1}^{c} X_i^{a_i}\right] = \sum_{L\in T_{\mathbf{a}}} \frac{\prod_{k=1}^{c} a_k!}{2^{\operatorname{tr}[L]} \prod_{i=1}^{c}\prod_{j=i}^{c} l_{ij}!} \prod_{i=1}^{c}\prod_{j=i}^{c} (\phi_{ij})^{l_{ij}}, \qquad (3)$$

where $\mathbf{a} = [a_1,\dots,a_c]^{\top}$ is a vector consisting of the random-variable exponents and $T_{\mathbf{a}}$ is the set of symmetric matrices $L = [l_{pq}]_{c\times c}$ whose elements meet the condition defined by

$$T_{\mathbf{a}} = \left\{ [l_{pq}]_{c\times c} \;\middle|\; a_k - l_{kk} - \sum_{j=1}^{c} l_{jk} = 0,\; \forall k \in \{1,\dots,c\} \right\}. \qquad (4)$$

If the sum of exponents, $\sum_{k=1}^{c} a_k$, is an odd number, then $\mathbb{E}\big[\prod_{i=1}^{c} X_i^{a_i}\big] = 0$ for the zero-mean case, as described in Corollary 2 in Song and Lee (2015).

An additional result that we will use later is that if $a_i = 1$ for all $i$, then (3), by Remark 5 in Song and Lee (2015), reduces to

$$\mathbb{E}\left[\prod_{i=1}^{c} X_i\right] = \sum_{L\in T_{\mathbf{a}}} \prod_{i=1}^{c}\prod_{j=i}^{c} \phi_{ij}^{l_{ij}}. \qquad (5)$$
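For unit exponents, (5) recovers the classical Isserlis (Wick) result: the moment is a sum over all perfect pairings of the variables, with each pairing contributing the product of the corresponding covariances. A minimal sketch (helper names are ours):

```python
import numpy as np

def pairings(idx):
    """Yield all perfect matchings of the index list idx."""
    if not idx:
        yield []
        return
    first, rest = idx[0], idx[1:]
    for k, partner in enumerate(rest):
        for sub in pairings(rest[:k] + rest[k + 1:]):
            yield [(first, partner)] + sub

def product_moment(phi):
    """E[X_1 ... X_c] for zero-mean jointly Gaussian variables with
    covariance matrix phi, via the pairing sum in (5)."""
    c = phi.shape[0]
    if c % 2 == 1:
        return 0.0  # odd moments vanish for zero means (Corollary 2)
    return sum(np.prod([phi[i, j] for i, j in m]) for m in pairings(list(range(c))))

# Four perfectly correlated unit-variance variables reduce to E[Z^4]:
# the three pairings each contribute 1, giving 3.0.
m4 = product_moment(np.ones((4, 4)))
```

For large `c` the number of pairings grows as $(c-1)!!$, so this enumeration is only practical for small orders, which is the regime of the truncated series.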

## 3 A Non-Linear CMOGP Based on Volterra Series

We represent a vector-valued non-linear dynamic system with a system of Volterra series of order $C$. For a given output dimension $d$, we approximate the function $f_d(t)$ with a truncated Volterra series as follows

$$f_d^{(C)}(t) = \sum_{c=1}^{C} \int \cdots \int G_d^{(c)}(t-\tau_1,\dots,t-\tau_c) \prod_{j=1}^{c} u(\tau_j)\, \mathrm{d}\tau_j, \qquad (6)$$

where $G_d^{(c)}$ are degree-$c$ Volterra kernels.

Our approach is to assume that $u(t)$, the latent driving function, follows a GP prior. For $C=1$, we recover the expression for the process convolution construction of a multi-output GP as defined in (1). In contrast to the linear case, the output $f_d^{(C)}(t)$ is no longer a GP. However, we approximate it with a GP $\tilde{f}_d^{(C)}(t)$ whose mean and covariance functions are computed from the moments of the output process $f_d^{(C)}(t)$:

$$\tilde{f}_d^{(C)}(t) \sim \mathcal{GP}\big(\mu_d^{(C)}(t),\, k_{d,d'}^{(C)}(t,t')\big), \qquad (7)$$

where $\mu_d^{(C)}(t) = \mathbb{E}\big[f_d^{(C)}(t)\big]$ and $k_{d,d'}^{(C)}(t,t') = \operatorname{cov}\big[f_d^{(C)}(t), f_{d'}^{(C)}(t')\big]$. Approximating a non-Gaussian distribution with a Gaussian, particularly for non-linear systems, is common in state space modelling, for example in the unscented Kalman filter (Särkkä, 2013), or as a choice of variational distribution in variational inference (Blei et al., 2017). We refer to the joint process as the non-linear convolved multi-output GP (NCMOGP).

Furthermore, we will assume that the degree-$c$ Volterra kernels are separable, such that

$$G_d^{(c)}(t-\tau_1,\dots,t-\tau_c) = \prod_{i=1}^{c} G_d^{(c,i)}(t-\tau_i), \qquad (8)$$

where $G_d^{(c,i)}$ are first-degree Volterra kernels.

Using this separable form, we express the output $f_d^{(C)}(t)$ as

$$f_d^{(C)}(t) = \sum_{c=1}^{C} \prod_{i=1}^{c} \int G_d^{(c,i)}(t-\tau_i)\, u(\tau_i)\, \mathrm{d}\tau_i = \sum_{c=1}^{C} \prod_{i=1}^{c} f_d^{(c,i)}(t),$$

where we define

$$f_d^{(c,i)}(t) = \int G_d^{(c,i)}(t-\tau_i)\, u(\tau_i)\, \mathrm{d}\tau_i.$$

Assuming $u(t)$ has a GP prior with zero mean and covariance $k(t,t')$, and due to the linearity of the expression above, we can compute the corresponding mean and covariance functions for the joint Gaussian process over the functions $f_d^{(c,i)}(t)$. We compute the cross-covariance function between $f_d^{(c,i)}(t)$ and $f_{d'}^{(c',j)}(t')$ using

$$k_{d,d'}^{(c,i),(c',j)}(t,t') = \operatorname{cov}\big[f_d^{(c,i)}(t),\, f_{d'}^{(c',j)}(t')\big] = \iint G_d^{(c,i)}(t-\tau_i)\, G_{d'}^{(c',j)}(t'-\tau_j)\, k(\tau_i,\tau_j)\, \mathrm{d}\tau_i\, \mathrm{d}\tau_j. \qquad (9)$$

This expression is similar to the one in (2) for the CMOGP. For particular forms of the Volterra kernels $G_d^{(c,i)}$ and of the covariance of the latent process $u(t)$, the covariance can be computed analytically.

### 3.1 NCMOGP with separable Volterra kernels

In this section, we derive expressions for $\mu_d^{(C)}(t)$ and $k_{d,d'}^{(C)}(t,t')$ under the assumption of separability of the Volterra kernels in (8).

#### Mean function

Let us first compute the mean function, $\mu_d^{(C)}(t)$. Using the definition of the expected value, we get

$$\mathbb{E}\big[f_d^{(C)}(t)\big] = \sum_{c=1}^{C} \mathbb{E}\left[\prod_{i=1}^{c} f_d^{(c,i)}(t)\right]. \qquad (10)$$

The expected value of the product of the Gaussian processes, $\mathbb{E}\big[\prod_{i=1}^{c} f_d^{(c,i)}(t)\big]$, can be computed using the results for the expected value of the product of multivariate Gaussian random variables introduced in Section 2.3.

Applying the result in (5) to the expression of the expected value in (10), we get

$$\mathbb{E}\big[f_d^{(C)}(t)\big] = \sum_{c=1}^{C} \sum_{L\in T_{\mathbf{a}}} \prod_{i=1}^{c}\prod_{j=i}^{c} \big(k_{d,d}^{(c,i),(c,j)}(t,t)\big)^{l_{ij}}, \qquad (11)$$

where

$$k_{d,d}^{(c,i),(c,j)}(t,t) = \operatorname{cov}\big[f_d^{(c,i)}(t),\, f_d^{(c,j)}(t)\big].$$

Note that only the terms for which $c$ is even are non-zero.

Example 1 To see an example of the kind of expressions that the expected value takes, let us assume that $c = 4$. We then have

$$\mathbb{E}\left[\prod_{i=1}^{4} f_d^{(4,i)}(t)\right] = \sum_{L\in T_{\mathbf{a}}} \prod_{i=1}^{4}\prod_{j=i}^{4} \big(k_{d,d}^{(4,i),(4,j)}(t,t)\big)^{l_{ij}},$$

where

$$k_{d,d}^{(4,i),(4,j)}(t,t) = \operatorname{cov}\big[f_d^{(4,i)}(t),\, f_d^{(4,j)}(t)\big].$$

We now need to find the set $T_{\mathbf{a}}$ containing all symmetric matrices $L$, the elements of which meet the condition described in (4), where $\mathbf{a} = [1, 1, 1, 1]^{\top}$. This leads to the following system of equations

$$2l_{11} + l_{12} + l_{13} + l_{14} = 1,\quad l_{12} + 2l_{22} + l_{23} + l_{24} = 1,\quad l_{13} + l_{23} + 2l_{33} + l_{34} = 1,\quad l_{14} + l_{24} + l_{34} + 2l_{44} = 1,$$

where we have used the symmetry of $L$, so $l_{ij} = l_{ji}$. It can be seen from the above system that the set $T_{\mathbf{a}}$ contains three unique symmetric matrices:

$$T_{\mathbf{a}} = \left\{ \begin{bmatrix} 0&1&0&0\\ 1&0&0&0\\ 0&0&0&1\\ 0&0&1&0 \end{bmatrix},\; \begin{bmatrix} 0&0&1&0\\ 0&0&0&1\\ 1&0&0&0\\ 0&1&0&0 \end{bmatrix},\; \begin{bmatrix} 0&0&0&1\\ 0&0&1&0\\ 0&1&0&0\\ 1&0&0&0 \end{bmatrix} \right\}.$$

We now have an expression for the expected value, by (11):

$$\mathbb{E}\left[\prod_{i=1}^{4} f_d^{(4,i)}(t)\right] = k_{d,d}^{(4,1),(4,2)}(t,t)\, k_{d,d}^{(4,3),(4,4)}(t,t) + k_{d,d}^{(4,1),(4,3)}(t,t)\, k_{d,d}^{(4,2),(4,4)}(t,t) + k_{d,d}^{(4,1),(4,4)}(t,t)\, k_{d,d}^{(4,2),(4,3)}(t,t).$$
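The set $T_{\mathbf{a}}$ in the example can also be found mechanically, by brute-force search over small symmetric non-negative integer matrices satisfying the constraint in (4). A sketch with our own helper names:

```python
import numpy as np
from itertools import product

def find_Ta(a, max_entry=None):
    """Enumerate symmetric non-negative integer matrices L satisfying
    a_k - l_kk - sum_j l_jk = 0 for every k (the condition in eq. 4)."""
    c = len(a)
    if max_entry is None:
        max_entry = max(a)
    # Only the upper triangle (including the diagonal) is free; symmetry fills the rest.
    upper = [(i, j) for i in range(c) for j in range(i, c)]
    out = []
    for vals in product(range(max_entry + 1), repeat=len(upper)):
        L = np.zeros((c, c), dtype=int)
        for (i, j), v in zip(upper, vals):
            L[i, j] = L[j, i] = v
        if all(a[k] - L[k, k] - L[:, k].sum() == 0 for k in range(c)):
            out.append(L)
    return out

# For a = [1,1,1,1] this recovers the three pairing matrices of Example 1.
Ta = find_Ta([1, 1, 1, 1])
```

The exhaustive search is exponential in the number of free entries, but the sets needed here are tiny since the truncation order $C$ is small.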

#### Cross-covariance function

To compute the covariance function $k_{d,d'}^{(C)}(t,t')$, we first need the second moment between $f_d^{(C)}(t)$ and $f_{d'}^{(C)}(t')$. The second moment is given as

$$\mathbb{E}\big[f_d^{(C)}(t)\, f_{d'}^{(C)}(t')\big] = \mathbb{E}\left[\sum_{c=1}^{C}\prod_{i=1}^{c} f_d^{(c,i)}(t) \sum_{c'=1}^{C}\prod_{j=1}^{c'} f_{d'}^{(c',j)}(t')\right] = \sum_{c=1}^{C}\sum_{c'=1}^{C} \mathbb{E}\left[\prod_{i=1}^{c} f_d^{(c,i)}(t) \prod_{j=1}^{c'} f_{d'}^{(c',j)}(t')\right] = \sum_{c=1}^{C}\sum_{c'=1}^{C} \mathbb{E}\left[\prod_{i=1}^{c+c'} \bar{f}_{d,d'}^{(i)}(t)\right], \qquad (12)$$

where $\bar{f}_{d,d'}^{(i)}(t)$ denotes the $i$-th element of a vector-valued function consisting of all functions in the product,

$$\bar{\mathbf{f}}_{d,d'}(t) = \big[f_d^{(c,1)}(t)\;\cdots\; f_d^{(c,c)}(t)\;\; f_{d'}^{(c',1)}(t')\;\cdots\; f_{d'}^{(c',c')}(t')\big]^{\top}.$$

We have assumed that both $f_d^{(C)}(t)$ and $f_{d'}^{(C)}(t')$ share the same value of $C$, although a more general expression can be obtained where each output has its own order. We can apply the expressions in Song and Lee (2015) to the above moment of the product of Gaussian random variables, as we did when computing the mean function in Section 3.1.1. Using the expression for the expected value of the product of Gaussian variables, we get

$$\mathbb{E}\left[\prod_{i=1}^{c+c'} \bar{f}_{d,d'}^{(i)}(t)\right] = \sum_{L\in T_{\mathbf{a}}} \prod_{i=1}^{c+c'}\prod_{j=i}^{c+c'} \big(\operatorname{cov}\big[\bar{f}_{d,d'}^{(i)}(t),\, \bar{f}_{d,d'}^{(j)}(t)\big]\big)^{l_{ij}},$$

where the covariance element is defined in (9) as the cross-covariance of two latent functions.

Example 2 For illustration purposes, let us assume that $c = 3$ and $c' = 1$. In this case, we would have

$$\mathbb{E}\left[\prod_{i=1}^{3} f_d^{(3,i)}(t)\, f_{d'}^{(1,1)}(t')\right] = \sum_{L\in T_{\mathbf{a}}} \prod_{i=1}^{4}\prod_{j=i}^{4} \big(\operatorname{cov}\big[\bar{f}_{d,d'}^{(i)}(t),\, \bar{f}_{d,d'}^{(j)}(t)\big]\big)^{l_{ij}},$$

where we have defined $\bar{\mathbf{f}}_{d,d'}$ as

$$\bar{\mathbf{f}}_{d,d'} = \big[f_d^{(3,1)}(t)\;\; f_d^{(3,2)}(t)\;\; f_d^{(3,3)}(t)\;\; f_{d'}^{(1,1)}(t')\big]^{\top}.$$

The set $T_{\mathbf{a}}$ contains the same matrices found in Example 1 in Section 3.1.1, leading to

$$\mathbb{E}\left[\prod_{i=1}^{3} f_d^{(3,i)}(t)\, f_{d'}^{(1,1)}(t')\right] = k_{d,d}^{(3,1),(3,2)}(t,t)\, k_{d,d'}^{(3,3),(1,1)}(t,t') + k_{d,d}^{(3,1),(3,3)}(t,t)\, k_{d,d'}^{(3,2),(1,1)}(t,t') + k_{d,d'}^{(3,1),(1,1)}(t,t')\, k_{d,d}^{(3,2),(3,3)}(t,t).$$

The cross-covariance function between $f_d^{(C)}(t)$ and $f_{d'}^{(C)}(t')$ is then computed using

$$\operatorname{cov}\big[f_d^{(C)}(t),\, f_{d'}^{(C)}(t')\big] = \mathbb{E}\big[f_d^{(C)}(t)\, f_{d'}^{(C)}(t')\big] - \mathbb{E}\big[f_d^{(C)}(t)\big]\, \mathbb{E}\big[f_{d'}^{(C)}(t')\big].$$

We have expressions for both $\mathbb{E}\big[f_d^{(C)}(t)\big]$ and $\mathbb{E}\big[f_d^{(C)}(t)\, f_{d'}^{(C)}(t')\big]$ by (10) and (12), respectively.

### 3.2 NCMOGP with separable and homogeneous Volterra kernels

In the section above, we introduced a model that allows for different first-order Volterra kernels $G_d^{(c,i)}$ when building the general Volterra kernel of order $c$. A further assumption we can make is that all first-order Volterra kernels are the same, i.e.,

$$G_d^{(c,i)}(t-\tau_i) = G_d(t-\tau_i),\quad \forall c,\; \forall i,$$

meaning that $f_d^{(c,i)}(t) = f_d(t)$, for all $c$ and for all $i$. We will refer to this as the separable and homogeneous version of the NCMOGP.

We then get

$$f_d^{(C)}(t) = \sum_{c=1}^{C} \prod_{i=1}^{c} f_d^{(c,i)}(t) = \sum_{c=1}^{C} \prod_{i=1}^{c} f_d(t) = \sum_{c=1}^{C} f_d^{c}(t),$$

where

$$f_d(t) = \int G_d(t-\tau)\, u(\tau)\, \mathrm{d}\tau,$$

and $f_d^{c}(t) = \big(f_d(t)\big)^{c}$. The cross-covariance function between $f_d(t)$ and $f_{d'}(t')$ is again $k_{d,d'}(t,t')$ as in (2).

As we did in Section 3.1, we can compute the mean function for $f_d^{(C)}(t)$ and the cross-covariance function between $f_d^{(C)}(t)$ and $f_{d'}^{(C)}(t')$.

### 3.3 Mean function

Using expression (3), the mean function follows as

$$\mathbb{E}\big[f_d^{(C)}(t)\big] = \sum_{c=1}^{C} \mathbb{E}\big[f_d^{c}(t)\big] = \sum_{c=1}^{C} \frac{c!}{2^{l_{11}}\, l_{11}!}\, \big(k_{d,d}(t,t)\big)^{l_{11}},$$

where $l_{11}$ is such that $L = [l_{11}]$ satisfies $c - 2l_{11} = 0$. This means that the above expression only has solutions for even-valued $c$, with $l_{11} = c/2$. Then,

$$\mathbb{E}\big[f_d^{(C)}(t)\big] = \sum_{c=1}^{C} \frac{c!}{2^{c/2}\, (c/2)!}\, \big(k_{d,d}(t,t)\big)^{c/2},$$

for even $c$; the terms with odd $c$ are zero.
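Each term of this sum is the standard even moment of a zero-mean Gaussian, since the coefficient $c!/(2^{c/2}(c/2)!)$ equals the double factorial $(c-1)!!$. A small sketch of the resulting mean computation (function names are ours):

```python
import math

def mean_term(c, k_dd):
    """E[f_d^c(t)] for f_d(t) ~ N(0, k_dd): zero for odd c,
    c!/(2^{c/2} (c/2)!) * k_dd^{c/2} for even c."""
    if c % 2 == 1:
        return 0.0
    h = c // 2
    return math.factorial(c) / (2 ** h * math.factorial(h)) * k_dd ** h

def mean_fC(C, k_dd):
    """Mean of the homogeneous NCMOGP output: sum of the even-c terms up to C."""
    return sum(mean_term(c, k_dd) for c in range(1, C + 1))
```

For example, `mean_term(4, 1.0)` recovers the fourth moment of a standard Gaussian, which is 3.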

### 3.4 Cross-covariance function

We can compute the second moment using

$$\mathbb{E}\big[f_d^{(C)}(t)\, f_{d'}^{(C)}(t')\big] = \mathbb{E}\left[\sum_{c=1}^{C} f_d^{c}(t) \sum_{c'=1}^{C} f_{d'}^{c'}(t')\right] = \sum_{c=1}^{C}\sum_{c'=1}^{C} \mathbb{E}\big[f_d^{c}(t)\, f_{d'}^{c'}(t')\big].$$

Once again, we use expression (3) to compute $\mathbb{E}\big[f_d^{c}(t)\, f_{d'}^{c'}(t')\big]$, leading to

$$\mathbb{E}\big[f_d^{c}(t)\, f_{d'}^{c'}(t')\big] = \sum_{L\in T_{\mathbf{a}}} A_{c,c',L}\, \big(k_{d,d}(t,t)\big)^{l_{11}}\, \big(k_{d,d'}(t,t')\big)^{l_{12}}\, \big(k_{d',d'}(t',t')\big)^{l_{22}},$$

where $A_{c,c',L}$ is defined as

$$A_{c,c',L} = \frac{c!\, c'!}{2^{l_{11}+l_{22}}\, l_{11}!\, l_{12}!\, l_{22}!},$$

and $k_{d,d'}(t,t')$ is defined in (2). To avoid computer overflow due to the factorial operators when computing $A_{c,c',L}$, we compute $\log A_{c,c',L}$ instead.

As stated previously, the expected value will be zero if $c + c'$ is not even.
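One way to carry out the log-domain computation mentioned above is via `lgamma`; the helper below is our own sketch, not the authors' code:

```python
import math

def log_A(c, cp, L):
    """log A_{c,c',L} = log c! + log c'! - (l11 + l22) log 2
                        - log l11! - log l12! - log l22!,
    using lgamma(n + 1) = log n! to avoid factorial overflow."""
    l11, l12, l22 = L[0][0], L[0][1], L[1][1]
    return (math.lgamma(c + 1) + math.lgamma(cp + 1)
            - (l11 + l22) * math.log(2.0)
            - math.lgamma(l11 + 1) - math.lgamma(l12 + 1) - math.lgamma(l22 + 1))

def A(c, cp, L):
    # Exponentiate only at the end, once the log-domain arithmetic is done.
    return math.exp(log_A(c, cp, L))
```

For `c = c' = 3` this yields the coefficients 6 and 9 that appear in Example 3 below.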

Example 3 Let us assume that $c = 3$ and $c' = 3$. The expected value $\mathbb{E}\big[f_d^{3}(t)\, f_{d'}^{3}(t')\big]$ follows as

$$\mathbb{E}\big[f_d^{3}(t)\, f_{d'}^{3}(t')\big] = \sum_{L\in T_{\mathbf{a}}} A_{3,3,L}\, \big(k_{d,d}(t,t)\big)^{l_{11}}\, \big(k_{d,d'}(t,t')\big)^{l_{12}}\, \big(k_{d',d'}(t',t')\big)^{l_{22}},$$

where $\mathbf{a} = [3, 3]^{\top}$. To find the elements in $T_{\mathbf{a}}$, we need to solve equations similar to the ones in Example 1. We would have

$$2l_{11} + l_{12} = c = 3, \qquad l_{12} + 2l_{22} = c' = 3.$$

We can see that there are two valid solutions for $L$:

$$T_{\mathbf{a}} = \left\{ \begin{bmatrix} 0 & 3\\ 3 & 0 \end{bmatrix},\; \begin{bmatrix} 1 & 1\\ 1 & 1 \end{bmatrix} \right\}.$$

The expression for $\mathbb{E}\big[f_d^{3}(t)\, f_{d'}^{3}(t')\big]$ is thus

$$\mathbb{E}\big[f_d^{3}(t)\, f_{d'}^{3}(t')\big] = 6\, k_{d,d'}^{3}(t,t') + 9\, k_{d,d}(t,t)\, k_{d,d'}(t,t')\, k_{d',d'}(t',t').$$

Again, we compute the covariance now that we have expressions for the mean of $f_d^{(C)}(t)$ and for the expected value of the product between $f_d^{(C)}(t)$ and $f_{d'}^{(C)}(t')$.
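As a sanity check on the $c = c' = 3$ closed form, the same cross moment can be derived by writing one Gaussian variable as a linear combination of the other plus independent noise, using the known moments $\mathbb{E}[X^4] = 3v_x^2$ and $\mathbb{E}[X^6] = 15v_x^3$. The function names below are ours:

```python
def third_cross_moment(vx, vy, kxy):
    """E[X^3 Y^3] for zero-mean jointly Gaussian (X, Y) with variances vx, vy
    and cross-covariance kxy, via Y = a X + Z, a = kxy / vx,
    Z ~ N(0, vy - kxy^2 / vx) independent of X."""
    a = kxy / vx
    s2 = vy - kxy ** 2 / vx
    # E[X^3 (aX + Z)^3] = a^3 E[X^6] + 3 a E[X^4] E[Z^2]
    return a ** 3 * 15 * vx ** 3 + 3 * a * 3 * vx ** 2 * s2

def closed_form(vx, vy, kxy):
    """Closed form from the Volterra moment expansion: 6 kxy^3 + 9 vx kxy vy."""
    return 6 * kxy ** 3 + 9 * vx * kxy * vy
```

Expanding `third_cross_moment` algebraically gives $15k_{xy}^3 + 9k_{xy}v_xv_y - 9k_{xy}^3 = 6k_{xy}^3 + 9v_xk_{xy}v_y$, matching the expression above.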

## 4 Related Work

In the work by Lawrence et al. (2007), non-linear dynamics are introduced by placing a GP prior within a non-linear function, and inference is performed using the Laplace approximation, with the convolution operator itself approximated as a discrete sum. Similarly, Titsias et al. (2009) approximate the posterior of the non-linear system over a GP using an MCMC sampling approach. Approaches to non-linear likelihoods with GP priors, not limited to MOGPs, include the warped GP model (Lázaro-Gredilla, 2012) and chained GPs (Saul et al., 2016), which make use of variational approximations. Techniques from state space modelling, including the Taylor series linearisation and sigma points used to approximate Gaussians in the extended and unscented Kalman filters, respectively, have been applied to non-linear Gaussian processes, both for single-output and multitask learning (Steinberg and Bonilla, 2014; Bonilla et al., 2016).

An alternative perspective on the linear convolution process, in particular latent force models, is to construct it as a continuous-discrete state space model (SSM) driven by white noise (Hartikainen and Särkkä, 2012; Särkkä et al., 2013). Hartikainen et al. (2012) use this approach for the general non-linear Wiener system, approximating the posterior with unscented versions of the Kalman filter and Rauch-Tung-Striebel smoother. The SSM approach benefits from inference being performed in linear time, but relies on certain constraints on the nature of the underlying covariance functions. In particular, a kernel must have a rational power spectrum to be used in exact form, which precludes the use of, for example, the exponentiated quadratic kernel for exact Gaussian process regression without introducing additional approximation error (Särkkä et al., 2013). Wills et al. (2013) also use a state space representation to approximate Hammerstein-Wiener models, albeit with sequential Monte Carlo and a maximum-likelihood approach.

## 5 Implementation

**Multi-output regression with NCMOGP** In this paper, we are interested in the multi-output regression case. Therefore, we restrict the likelihood models for each output to be Gaussian. In particular, we assume that each observed output $y_d(t)$ follows

$$y_d(t) = \tilde{f}_d^{(C)}(t) + w_d(t),$$

where $w_d(t)$ is a white Gaussian noise process with variance $\sigma_d^2$. Other types of likelihoods are possible, as for example in Moreno-Muñoz et al. (2018).

**Kernel functions** In all the experiments, we use an exponentiated quadratic (EQ) form for the smoothing kernels $G_d(t)$ and an EQ form for the kernel $k(t,t')$ of the latent function $u(t)$. With these forms, the kernel $k_{d,d'}(t,t')$ also follows an EQ form. We use the expressions for $k_{d,d'}(t,t')$ obtained by Álvarez and Lawrence (2011).

**High-dimensional inputs** The resulting mean and covariance functions assume that the input space is one-dimensional. We can extend the approach to high-dimensional inputs by assuming that both the mean function and the covariance function factorise across the input dimensions, using the same expressions for the kernels in each factorised dimension.

**Hyperparameter learning** We optimise the log-marginal likelihood to find point estimates of the hyperparameters of the NCMOGP. Hyperparameters include the parameters of the smoothing kernels $G_d(t)$, of the kernel function $k(t,t')$, and the variances $\sigma_d^2$ of the white noise processes $w_d(t)$, for $d = 1,\dots,D$. For simplicity in the notation, we assume that all the outputs are evaluated at the same set of inputs $\mathbf{t} = [t_1,\dots,t_N]^{\top}$. Let $\mathbf{y} = [\mathbf{y}_1^{\top},\dots,\mathbf{y}_D^{\top}]^{\top}$, with $\mathbf{y}_d = [y_d(t_1),\dots,y_d(t_N)]^{\top}$. The log-marginal likelihood is then given as

$$-\frac{ND}{2}\log(2\pi) - \frac{1}{2}\big(\mathbf{y} - \boldsymbol{\mu}^{(C)}\big)^{\top}\big(\mathbf{K}_{\mathbf{f},\mathbf{f}}^{(C)} + \boldsymbol{\Sigma}\big)^{-1}\big(\mathbf{y} - \boldsymbol{\mu}^{(C)}\big) - \frac{1}{2}\log\big|\mathbf{K}_{\mathbf{f},\mathbf{f}}^{(C)} + \boldsymbol{\Sigma}\big|,$$

where $\boldsymbol{\mu}^{(C)}$ has entries given by $\mu_d^{(C)}(t_n)$, $\mathbf{K}_{\mathbf{f},\mathbf{f}}^{(C)}$ has entries computed using $k_{d,d'}^{(C)}(t_n, t_m)$, and $\boldsymbol{\Sigma}$ is a diagonal matrix containing the variances of the noise processes per output. We use a gradient-based optimisation procedure to estimate the hyperparameters that maximise the log-marginal likelihood.
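The log-marginal likelihood above can be evaluated stably with a Cholesky factorisation instead of an explicit inverse. A generic sketch, assuming the mean vector, kernel matrix and per-observation noise variances are already computed:

```python
import numpy as np

def log_marginal_likelihood(y, mu, K, noise_var):
    """Gaussian log-marginal likelihood:
    -0.5 n log(2 pi) - 0.5 (y - mu)^T (K + Sigma)^{-1} (y - mu)
    - 0.5 log |K + Sigma|, via a Cholesky factorisation of K + Sigma."""
    n = y.size
    Ky = K + np.diag(noise_var)
    L = np.linalg.cholesky(Ky)                           # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu))
    logdet = 2.0 * np.log(np.diag(L)).sum()              # log |Ky| from the factor
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * (y - mu) @ alpha - 0.5 * logdet
```

Gradients with respect to the hyperparameters can then be obtained by automatic differentiation or the standard trace identities; the sketch only covers the objective itself.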

**Predictive distribution** The predictive distribution is the same one used for the single-output case. Let $\mathbf{t}_*$ be the input test set. The predictive distribution follows as $p(\mathbf{y}_* \mid \mathbf{y}) = \mathcal{N}\big(\mathbf{y}_* \mid \boldsymbol{\mu}_*, \mathbf{K}_*\big)$, with $\boldsymbol{\mu}_* = \boldsymbol{\mu}_*^{(C)} + \mathbf{K}_{\mathbf{f}_*,\mathbf{f}}^{(C)}\mathbf{K}_{\mathbf{y}}^{-1}\big(\mathbf{y} - \boldsymbol{\mu}^{(C)}\big)$ and $\mathbf{K}_* = \mathbf{K}_{\mathbf{f}_*,\mathbf{f}_*}^{(C)} - \mathbf{K}_{\mathbf{f}_*,\mathbf{f}}^{(C)}\mathbf{K}_{\mathbf{y}}^{-1}\mathbf{K}_{\mathbf{f},\mathbf{f}_*}^{(C)}$, where $\mathbf{K}_{\mathbf{y}} = \mathbf{K}_{\mathbf{f},\mathbf{f}}^{(C)} + \boldsymbol{\Sigma}$. In these expressions, $\boldsymbol{\mu}_*^{(C)}$ has entries given by $\mu_d^{(C)}$ evaluated at the test inputs; $\mathbf{K}_{\mathbf{f}_*,\mathbf{f}}^{(C)}$ has entries given by $k_{d,d'}^{(C)}$ between test and training inputs; and $\mathbf{K}_{\mathbf{f}_*,\mathbf{f}_*}^{(C)}$ has entries given by $k_{d,d'}^{(C)}$ between test inputs.
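The predictive equations can be sketched in the same Cholesky-based style; the argument names are ours and the kernel blocks are assumed precomputed:

```python
import numpy as np

def predict(y, mu, mu_star, K, K_star_f, K_star_star, noise_var):
    """GP posterior predictive:
    mean = mu_* + K_{*,f} Ky^{-1} (y - mu),
    cov  = K_{*,*} - K_{*,f} Ky^{-1} K_{f,*},  with Ky = K + Sigma."""
    Ky = K + np.diag(noise_var)
    L = np.linalg.cholesky(Ky)
    solve = lambda B: np.linalg.solve(L.T, np.linalg.solve(L, B))
    mean = mu_star + K_star_f @ solve(y - mu)
    cov = K_star_star - K_star_f @ solve(K_star_f.T)
    return mean, cov
```

Predicting at the training inputs with near-zero noise returns the training targets, which is a convenient correctness check.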

## 6 Experimental Results

Experimental results are provided for the NCMOGP with homogeneous and separable kernels. In all the experiments that follow, hyperparameter estimation is performed through maximization of the log-marginal likelihood, as explained in Section 5. We use the normalised mean squared error (NMSE) and the negative log-predictive density (NLPD) to assess performance.
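The two metrics can be sketched as follows, assuming independent Gaussian predictive marginals for the NLPD (helper names are ours):

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalised mean squared error: MSE divided by the variance of the test targets."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def nlpd(y_true, mean, var):
    """Average negative log-predictive density under independent
    Gaussian predictive marginals N(mean, var)."""
    return np.mean(0.5 * np.log(2 * np.pi * var) + 0.5 * (y_true - mean) ** 2 / var)
```

Lower is better for both; the NLPD additionally penalises over- and under-confident predictive variances, which the NMSE ignores.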

### 6.1 Toy example

We set up a problem of outputs where the smoothing kernels are given as , with parameters , and . The latent function follows as . We then numerically solve the convolution integral for datapoints in the input range . We compute and obtain the observed data by adding Gaussian noise with a variance of . We randomly split the dataset into a training set per output, and the remaining datapoints are used to assess performance.