# Functional Gaussian processes for regression with linear PDE models

## Abstract

In this paper, we present a new statistical approach to the problem of incorporating experimental observations into a mathematical model described by linear partial differential equations (PDEs) to improve the prediction of the state of a physical system. We augment the linear PDE with a functional that accounts for the uncertainty in the mathematical model and is modeled as a *Gaussian process*. This gives rise to a stochastic PDE which is characterized by the Gaussian functional. We develop a *functional Gaussian process regression* method to determine the posterior mean and covariance of the Gaussian functional, thereby solving the stochastic PDE to obtain the posterior distribution for our prediction of the physical state. Our method has the following features which distinguish itself from other regression methods. First, it incorporates both the mathematical model and the observations into the regression procedure. Second, it can handle the observations given in the form of linear functionals of the field variable. Third, the method is non-parametric in the sense that it provides a systematic way to optimally determine the prior covariance operator of the Gaussian functional based on the observations. Fourth, it provides the posterior distribution quantifying the magnitude of uncertainty in our prediction of the physical state. We present numerical results to illustrate these features of the method and compare its performance to that of the standard Gaussian process regression.

## 1Introduction

Partial differential equations (PDEs) are used to mathematically model a wide variety of physical phenomena such as heat transfer, fluid flows, electromagnetism, and structural deformations. A PDE model of a physical system is typically described by conservation laws, constitutive laws, material properties, boundary conditions, boundary data, and geometry. In practical applications, the mathematical model described by the PDEs is only an approximation to the real physical system due to (*i*) the deliberate simplification of the mathematical model to keep it tractable (by ignoring certain physics or certain boundary conditions that pose computational difficulties), and (*ii*) the uncertainty of the available data (by using geometry, material property and boundary data that are not exactly the same as those of the physical system). We refer to the PDE model (available to us) as the *best knowledge PDE model* [23] and to its solution as the *best knowledge state*. To assess the accuracy of the best knowledge model in predicting the physical system, the best knowledge state needs to be compared against experimental data, which typically will have some level of noise.

In cases where the discrepancy between the PDE model and the experimental data is beyond an acceptable level of accuracy, we need to improve the current PDE model. There are several approaches to defining a new improved model. *Parameter estimation* [19] involves calibrating some parameters in the model to match the data. An alternative approach to obtain an improved model is *data assimilation* [7]. Broadly speaking, data assimilation is a numerical procedure by which we incorporate observations into a mathematical model to reflect the errors inherent in our mathematical modeling of the physical system. Although data assimilation shares the same objective as parameter estimation, it differs from the latter in methodology. More specifically, data assimilation does not assume any parameters to be calibrated; instead, data assimilation defines a new model that matches the observations as well as possible, while being as close as possible to the best knowledge model. Another approach is *data interpolation* [2] which involves computing a collection of solutions (snapshots) of a parametrized or time-varying mathematical model and reconstructing the physical state by fitting the experimental data to the snapshots.

A widely used technique for obtaining an improved model in parameter estimation and data assimilation is *least squares regression* [3]. Least squares is a deterministic regression approach that provides an estimate for the physical state which is optimal in least squares sense. However, it does not provide a means to quantify the prediction uncertainty. A recent work [23] poses the least-square regression as a regularized saddle point Galerkin formulation which admits interpretation from a variational framework and permits its extension to *Petrov-Galerkin formulation*. While the Petrov-Galerkin formulation provides more flexibility than the Galerkin formulation, it does not quantify the uncertainty in the prediction either. A popular statistical approach in parameter estimation and data assimilation is *Bayesian inference* [4]. In Bayesian inference an estimate of the physical state is described by random variables and the posterior probability distribution of the estimate is determined by the data according to Bayes’ rule [6]. Therefore, Bayesian inference provides a powerful framework to quantify the prediction uncertainties.

In this paper, we introduce a new statistical data assimilation approach to the problem of incorporating observations into the best knowledge model to predict the state of physical system. Our approach has its root in Gaussian process (GP) regression [13]. We augment the linear PDE with a functional that accounts for the uncertainty in the mathematical model and is modeled as a *Gaussian process*.^{3}*functions of vectors* to *functionals of functions*, we develop a *functional Gaussian process regression* method to determine the posterior distribution of the Gaussian functional, thereby solving the stochastic PDE for our prediction of the physical state. Our method is devised as follows. We first derive a *functional regression problem* by making use of the adjoint states and the observations. We next solve the functional regression problem by an application of the principle of Gaussian processes to obtain the posterior mean and covariance of the Gaussian functional. Finally, we compute the posterior distribution for our estimate of the physical state. A crucial ingredient in our method is the *covariance operator* representing the *prior* of the Gaussian functional. The bilinear covariance operators considered incorporate a number of free parameters (the so-called *hyperparameters*) that can be optimally determined from the measured data by maximizing a marginal likelihood.

Our functional Gaussian process regression method can be viewed as a generalization of the standard GP regression from a finite dimensional vector (input) space to an infinite dimensional function (input) space. GP regression is a well-established technique to construct maps between inputs and outputs based on a set of sample, or training, input and output pairs, but does not offer a direct method to incorporate prior knowledge, albeit approximate, from an existing model relating the inputs and outputs. By combining the best knowledge model with the data, our method can greatly improve the prediction of the physical system.

Furthermore, we introduce a *nonparametric Bayesian inference* method for linear functional regression with Gaussian noise. It turns out that nonparametric Bayesian inference and functional GP regression represent two different views of the same procedure. Specifically, we can think of functional GP regression as defining a distribution over functionals and doing inference in the space of functionals — the *functional-space view*. We can think of nonparametric Bayesian inference as defining a distribution over weights and doing inference in the space of weights — the *weight-space view*. Theoretically, functional GP regression can be interpreted as an application of the *kernel trick* [15] to nonparametric Bayesian inference, thereby avoiding an explicit construction of the feature map.

The paper is organized as follows. In Section 2, we present a description of the problem considered. In Section 3, we give an overview of Gaussian processes for regression problems. In Section 4, we introduce functional Gaussian processes for functional regression problems via linear PDE models. In Section 5, we present numerical results to demonstrate our method and compare its performance to that of function Gaussian process regression. In Section 6, we provides some concluding remarks on future research. Finally, in the Appendix, we describe our nonparametric Bayesian inference method.

## 2Motivation and Problem Statement

Let denote a bounded open domain with Lipschitz boundary. Let be an appropriate real-valued function space in which the true state resides. A weak formulation of the true PDE model can be stated as: Find and such that

where is a bilinear form, is a linear functional, and are observation functionals. We assume that the true PDE model ( ?) is well defined and accurately describes the physical system of interest.

In actual practice, we do not have access to , and . Hence, we can not compute and . However, we assume that we have access to the “best knowledge” of , and , which shall be denoted by , and , respectively. We then define the best knowledge PDE model: Find and such that

In the remainder of this paper, we shall drop the superscript “bk” for the quantities associated with the best knowledge model to simplify the notation. (In practice, we replace the continuous function space with a discrete approximation space, which is assumed to be large enough that the discrete solution is indistinguishable from the continuous one.)

We now assume that we are given the observed data , which are the measurements of the true output vector . We further assume that the measurements differ from the true outputs by additive Gaussian noise , namely,

where are independent, identically distributed Gaussian distributions with zero mean and variance . If is sufficiently small within the acceptable accuracy then we can use the observed data to validate the best knowledge model ( ?). If the best knowledge outputs are close enough to within the noise level then we may trust the best knowledge model to predict the behavior of the true model. In many cases, the best knowledge outputs do not match the observed data due to various sources of uncertainty from physical modeling, constitutive laws, boundary conditions, boundary data, material properties, and geometry.

We are interested in improving the best knowledge model when it does not produce a good estimate of the true state. In particular, we propose a method to compute a better estimate for the true state by combining the best knowledge model with the observed data. Our method has its root in Gaussian process regression. Before proceeding to describe the proposed method we review the ideas behind Gaussian processes.

## 3Gaussian Process Regression

We begin by assuming that we are given a training set of observations

where denotes an input vector of dimension and denotes a scalar real-valued output. The training input vectors are aggregated in the real-valued matrix , and the outputs are collected in the real-valued vector , so we can write . We assume that

where is the true but unknown function which we want to infer. The unknown function is modeled as a Gaussian process^{4}^{5}

From a Bayesian perspective, we encode our belief that instances of are drawn from a Gaussian process with zero mean and covariance function *prior to taking into account observations*.^{6}

Let be the matrix that contains test input vectors as its columns. Since , the joint distribution of the observed outputs and the function values at the test input vectors is

where , , , , and have entries

respectively. We next apply the conditional distribution formula (see [15]) to the joint distribution (Equation 4) to obtain the *predictive distribution* for as

where

and and are given by

Note that the predictive mean is a linear combination of kernel functions, each one centered on a training input vector. Note also that the predictive covariance does not explicitly depend on the observed data , but only on the training input vectors and the covariance function . This is a property of the Gaussian distribution. However, as discussed below, the covariance function can be determined by using the observed data. As a result, the predictive covariance implicitly depends on the observed data.

Typically, the covariance function has some free parameters , so that the matrix depends on . These free parameters are called *hyperparameters*. The hyperparameters have a significant impact on the predictive mean and covariance. They are determined by maximizing the log marginal likelihood (see [15]):

Once we choose a specific form for and determine its hyperparameters, we can compute and for any given . Gaussian processes also provide us a mean to choose an appropriate family among many possible families of covariance functions. Choosing a covariance function for a particular application involves both determining hyperparameters within a family and comparing across different families. This step is termed as *model selection* [15].

We see that the standard GP regression provides us not only a posterior mean, but also a posterior covariance which characterizes uncertainty in our prediction of the true function. Moreover, it allows us to determine the optimal covariance function and thus the optimal reproducing Kernel Hilbert space in which the true function is believed to reside. These features differentiate GP regression from parametric regression methods such as least-squares regression, which typically provides the maximum likelihood estimate only. However, GP regression tends to require larger sample sizes than parametric regression methods because the data must supply enough information to yield a good covariance function by using model selection.

There are a number of obstacles that prevent us from applying the standard GP regression to our problem of interest described in the previous section. First, our outputs are in general not the evaluations of the state at spatial coordinates. Instead, they are linear functionals of the state. Second, the standard GP regression described here does not allow us to make use of the best knowledge model. The best knowledge model plays an important role because it carries crucial prior information about the true model. By taking advantage of the best knowledge model, we may be able to use far less observations to obtain a good prediction and thus address the main disadvantage of the standard GP regression. We propose a new approach to overcome these obstacles.

## 4Functional Gaussian Process Regression

### 4.1A stochastic PDE model

Let be a linear functional. We introduce a new mathematical model: Find and such that

Notice that the new model ( ?) differs from the best knowledge model ( ?) by the functional . We can determine and only if is known. The functional thus characterizes the solution and the output vector of the model ( ?). We note that if then we . Unfortunately, this particular choice of requires the true state which we do not know and thus want to infer.

In order to capture various sources of uncertainty in the best knowledge model, we represent as a *functional Gaussian process*^{7}

Notice that there are three main differences between the functional Gaussian process (Equation 6) and the Gaussian process (Equation 3). First, is a functional, whereas is a function. Second, is a function, whereas is a vector. And third, is generally a differential and integral operator of and , whereas is a function of and . We will require that the covariance operator is symmetric positive-definite. That is,

As the covariance operator characterizes the space of all possible functionals prior to taking into account the observations, it plays an important role in our method. The selection of a covariance operator will be discussed later.

Since is a functional Gaussian process, the model ( ?) becomes a stochastic PDE. In order to solve the stochastic PDE ( ?), we need to compute the *posterior mean* and *posterior covariance* of after accounting for the observed data . To this end, we formulate a functional regression problem and describe a procedure for solving it as follows.

### 4.2Functional regression problem

We first introduce the adjoint problems: for we find such that

We note that the adjoint states depend only on the output functionals and the bilinear form . It follows from ( ?), ( ?), and (Equation 7) that

for . Moreover, we would like our stochastic PDE to produce the outputs that are consistent with the observed data in such a way that

This equation is analogous to (Equation 1) which relates the observed data to the true outputs . We substitute into (Equation 8) to obtain

Notice that this expression characterizes the relationship between and in the same way (Equation 2) characterizes the relationship between and .

We now introduce a training set of observations

and use this training set to learn about . More specifically, we wish to determine for any given based on the training set . This problem is similar to the regression problem described in the previous section and is named the *functional regression problem* to emphasize that the object of interest is a functional. We next describe the solution of the functional regression problem.

### 4.3Regression procedure

Let be a collection of adjoint states as determined by (Equation 7). Let be a collection of test functions. The joint distribution of the observed outputs and the functional values for the test functions according to the prior (Equation 6) is given by

where , , , , and have entries

respectively. It thus follows that the *predictive distribution* for is

where

and and are given by

Notice that we have correspondence with function Gaussian process regression described in the previous section, when identifying with , with , and with .

While our approach share similarities with function Gaussian process regression, it differs from the latter in many important ways. We summarize in Table 1 the differences between function Gaussian process regression and functional Gaussian process regression.

Quantities | Function Gaussian process | Functional Gaussian process |
---|---|---|

[-2ex] Input | vector | function |

[1ex] | ||

[-2ex] Output | function | functional |

[1ex] | ||

[-2ex] Prior | ||

[1ex] | ||

[-2ex] Kernel | function | operator |

[1ex] | ||

[-2ex] Training inputs | adjoint states | |

[1ex] | ||

[-2ex] Observations | ||

[1ex] | ||

[-2ex] Coefficients | ||

[1ex] | ||

[-2ex] Test inputs | ||

[1ex] | ||

[-2ex] Mean | ||

[1ex] | ||

[-2ex] Covariance | ||

[1ex] |

In the Appendix A, we introduce a nonparametric Bayesian framework for linear functional regression with Gaussian noise. It turns out that this nonparametric Bayesian framework is equivalent to the functional GP regression described here. In fact, functional GP regression can be viewed as an application of the *kernel trick* to nonparametric Bayesian inference for linear functional regression, thereby avoiding the computation of the eigenfunctions of the covariance operator . We next introduce a family of bilinear covariance operators and then describe a method for determining the hyperparameters.

### 4.4Covariance operators

The covariance operator is a crucial ingredient in our approach. Here, we consider a class of bilinear covariance operators parametrized by of the form:

More general forms of the covariance operator are possible provided that they are symmetric and positive definite.

In order for a covariance operator to be used in our method, we need to specify its hyperparameters . Fortunately, Gaussian processes allow us to determine the hyperparameters by using the observed data. In order to do this, we first calculate the probability of the observed data given the hyperparameters, or *marginal likelihood* and choose so that this likelihood is maximized. We note from (Equation 10) that

where the matrix as defined in (Equation 11) depends on and thus on as well. Rather than maximizing (Equation 12), it is more convenient to maximize the log marginal likelihood which is given by,

Thus, we find by solving the maximization problem

Hence, the hyperparameters are chosen as the maximizer of the log marginal likelihood.

Once we determine the covariance operator, we can compute for any given set of test functions as described in Subsection 4.3. Therefore, functional GP regression is *non-parametric* in the sense that both the hyperparameters are chosen in light of the observed data. In order words, the data are used to define both the prior covariance and the posterior covariance. In contrast, parametric regression methods use a number of parameters to define the prior and combine this prior with the data to determine the posterior prediction. It remains to describe how to compute the posterior mean and covariance of the solution of the stochastic PDE.

### 4.5Computation of the mean state and covariance

We recall that our stochastic PDE model consists of finding such that

Let be a “suitable” basis set of the function space , where is the dimension of . Since the functional is Gaussian and the best knowledge model is linear, we can express the solution of the stochastic PDE (Equation 14) as

In order to determine and , we choose in (Equation 14) to arrive at the stochastic linear system:

where , for , and with

for . It thus follows from (Equation 16) that

as is Gaussian and is invertible.

Now let be spatial points at which we would like to evaluate the predictive mean and covariance of . Let be a matrix with entries . It then follows from (Equation 15) that

where , . It follows from (Equation 18) and (Equation 19) that

where

with . We examine the posterior distribution as given by (Equation 20). Note first that the posterior mean is the difference between two terms: the first term is simply the best knowledge state evaluated at ; the second term is a correction term to the best knowledge state and is obtained by using our functional Gaussian process regression. Note also that the posterior covariance is a quadratic form of with the posterior covariance matrix , showing that the predictive uncertainty grows with the magnitude of . Hence, the predictive uncertainty depends on the inverse matrix . The implementation of our method for computing the posterior distribution (Equation 20) is shown in Figure 1.

### 4.6Relationship with least-squares regression

Here, we show an alternative approach to computing the posterior mean in (Equation 20) by solving a deterministic least-squares problem. We note that the posterior mean satisfies

Here, for any , is the posterior mean of and given by

where the adjoint states satisfy (Equation 7) and the coefficient vector is the solution of (Equation 11) . It thus follows that the mean state satisfies

where is the weighted sum of the adjoint states. We then evaluate the mean outputs of the stochastic PDE model as

The following lemma sheds light on the relationship between the mean outputs and the observed data.

We first note from the adjoint equation (Equation 7) and (Equation 23) that

We next recall that satisfies

where is the Kronecker delta. Moreover, we obtain from the best knowledge model ( ?) and the adjoint equation (Equation 7) that

The desired result immediately follows from the above three equations. This completes the proof.

This lemma shows that the mean outputs differs from the observed data by the product of the noise level and the coefficient vector . When the observed data is noise-free (namely, ) we have that the mean output vector is exactly equal to the observed data. Henceforth, whenever is sufficiently large and is relatively small, we expect that our method will yield a much better estimate of the true state than the best knowledge model. The following theorem shows the optimality of the mean state.

We introduce the Lagrangian

where and are the Lagrange multipliers of the constraints. The optimal solution satisfies

which yields

Note that when taking the partial derivatives we have used the assumption that is bilinear. The first two equations of ( ?) yield that

where are the adjoint states. And the third equation ( ?c) gives

Therefore, if we can show that then ( ?d) and (Equation 24) imply that . To this end, we note from ( ?d) and the adjoint equation (Equation 7) that

Moreover, we obtain from the best knowledge model ( ?) and the adjoint equation (Equation 7) that

Finally, it follows from ( ?e), (Equation 24), (Equation 25), (Equation 26), and (Equation 27) that

which implies that . This completes the proof.

This theorem establishes a connection between functional GP regression and traditional least-squares regression when the covariance operator is bilinear. In particular, the posterior mean state is the optimal solution of a least-squares minimization. This is hardly a surprise as the posterior mean state is also the maximum a posteriori (MAP) estimate of linear functional regression model in the Bayesian framework discussed in the Appendix A. It is well known that the MAP estimate coincides with the least-squares solution. The main advantage of our approach over least-squares regression is that we can compute not only the posterior mean state but also the posterior covariance. Another advantage of our approach is that it allows us to choose a covariance operator based on the observed data by exploiting the marginal likelihood function, whereas least-squares regression does not provide a mechanism to optimally set the prior covariance operator.

## 5A Simple Heat Conduction Example

### 5.1Problem description

For the true PDE model we consider a one-dimensional heat equation:

with Dirichlet boundary conditions . The function space is then given by

The true state satisfies

where

We prescribe a *synthetic source term* as . It is easy to see that

The true state is unknown to us and will serve to assess the performance of our method.

We next assume that we know almost everything about the true model except for the source term and the boundary data. We introduce a function space with as

where the boundary data and will be determined from the observed data. We then define our best knowledge model: find such that

where

In practice, we replace the continuous space with a discrete counterpart, for this problem, a 2000-element linear finite element space.

### 5.2Model specifications

We now specify the observation functionals , where is the Dirac delta function and the are the extended Chebyshev nodes [refs] in the interval :

These functionals correspond to pointwise observations taken at the points . Note that the set of measurement points varies with . Note also that the finite element mesh is designed to include in its grid points.

We shall assume that the observations are noise-free, that is we have and . Since the observations at and are used to define the function space in (Equation 29) (that is we set and ), we can only use the remaining observations to construct the training set as

where satisfies

and are the outputs of the best knowledge model. Hence, the training set has only samples. Furthermore, we use a bilinear covariance operator of the form

The parameters are determined by maximizing the log marginal likelihood (Equation 13).

We will compare our method to the standard Gaussian process regression described in Section 3 which ignores the best knowledge model and utilizes only the data. To this end, we employ a squared-exponential covariance function of the form

where represents the signal variance, while represents the length scale. These parameters are set by maximizing the log marginal likelihood (Equation 5).

### 5.3Results and discussions

We consider for the best knowledge model (Equation 30)-(Equation 31). This yields the best knowledge state . Figure 1 shows the true state and the best knowledge state . We observe that is considerably different from . Therefore, the best knowledge model does not produce a good prediction of the heat equation (Equation 28). We now apply functional GP regression to this example and present numerical results to demonstrate the performance of our method relative to the standard GP regression.

We present in Table ? the optimal hyperparameters, the norm of the prediction error, and the norm of the posterior standard deviation (the square root of the posterior variance) for our method and the standard GP regression. Here the norm of a function is defined as . We observe that while is always zero, decreases as increases, indicating that the prediction uncertainty is reduced as the number of observations increases. We also note that the length scale of the squared-exponential covariance function is relatively small for , indicating that the training set may be inadequate for the standard GP regression to produce a good prediction.

We see from Table ? that the prediction error in our method converges significantly faster than that in the standard GP regression as increases. Therefore, our method requires fewer observations to achieve the same accuracy. In particular, our method with observations has slightly smaller error than the standard GP regression with observations. This is made possible because our method uses both the best knowledge model and the observations to do regression on the space of functionals, whereas the standard GP progression uses the observations only to do regression on the space of functions. We also observe that the posterior standard deviation (measured in norm) shrinks with increasing albeit at a slower rate than the prediction error, indicating that our posterior variance of the prediction error is rigorous. In contrast, the standard GP regression has the posterior standard deviation even smaller than the prediction error for small values of , indicating that the posterior variance of the standard GP regression may not be rigorous when the training set is inadequate. This can be attributed to the fact the standard GP regression requires a large enough set of observations to provide accurate prediction and rigorous error estimation.

Finally, we show in Figure 2 the true state, the mean prediction, and the 95% confidence region (shaded area) for our method (left panels) and the standard GP regression (right panels). Here the 95% confidence region is an area bounded by the mean prediction plus and minus two times the standard deviation function. Note that the prediction error is zero at the measurement points, which is consistent with the theoretical result stated in Lemma 1. Moreover, the standard deviation function is also zero at the measurement points — a consequence of the fact that the prediction error is zero at those points. We see that our method does remarkably well even with just 4 observations when it is compared to the standard GP regression. For our method the true state resides in the 95% confidence region which shrinks rapidly with as increases, whereas for the standard GP regression does not always reside in the 95% confidence region. Indeed, as seen in Figure 2(d), the standard GP regression gives poor prediction and erroneous 95% confidence region for . Although the standard GP regression provides more accurate prediction and rigorous 95% confidence region for , it is still not as good as our method. In summary, the numerical results obtained for this simple example show that our method outperforms the standard GP regression.

## 6Conclusions

In this paper, we have presented a new statistical approach to the problem of combining the observed data with a mathematical model to improve our prediction of a physical system. A new *functional Gaussian process* regression approach is presented, which has its root in Gaussian processes for functions. Our approach has the following unique properties. First, it allows for the incorporation of the best knowledge model into the regression procedure. Second, it can handle observations given in the form of linear functionals of the field variable. Third, the method is non-parametric in the sense that it provides a systematic way to optimally determine the prior covariance based on the data. Fourth, our method can compute not only the mean prediction but also the posterior covariance, characterizing the uncertainty in our prediction of the physical state. These features distinguish our method from other regression methods. Numerical results were presented to highlight these features and demonstrate the superior performance of the proposed method relative to the standard GP regression.

We conclude the paper by pointing out several possible extensions and directions for further research. We would like to extend the proposed approach to nonlinear PDE models, which will broaden the application domain of our method. Nonlinear PDEs represent some significant challenges because the adjoint problems will depend on the true state which we do not know and because our stochastic PDE does not preserve the Gaussian property of the functional due to nonlinearity. We would also like to extend the method to goal-oriented statistical estimation in which we would like to infer new outputs rather the state of the physical system.

## Acknowledgments

We would like to thank Professor Robert M. Freund of Sloan School of Management at MIT for fruitful discussions. This work was supported by AFOSR Grant No. FA9550-11-1-0141, AFOSR Grant No. FA9550-12-0357, and the Singapore-MIT Alliance.

## ABayesian linear functional regression

In this section, we develop a nonparametric Bayesian framework for linear functional regression with Gaussian noise. We then show that this framework is equivalent to our functional Gaussian process regression described in Section 3.

We begin by introducing an orthornormal basis set such that