# Physics-Informed CoKriging: A Gaussian-Process-Regression-Based Multifidelity Method for Data-Model Convergence

Xiu Yang (xiu.yang@pnnl.gov), Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, Richland, WA 99352

David Barajas-Solano (david.barajas-solano@pnnl.gov), Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, Richland, WA 99352

Guzel Tartakovsky (guzel.tartakovsky@pnnl.gov), Hydrology Group, Pacific Northwest National Laboratory, Richland, WA 99352

Alexandre M. Tartakovsky (alexandre.tartakovsky@pnnl.gov), Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, Richland, WA 99352
###### Abstract

In this work, we propose a new Gaussian process regression (GPR)-based multifidelity method: physics-informed CoKriging (CoPhIK). In CoKriging-based multifidelity methods, the quantities of interest are modeled as linear combinations of multiple parameterized stationary Gaussian processes (GPs), and the hyperparameters of these GPs are estimated from data via optimization. In CoPhIK, we construct a GP representing low-fidelity data using physics-informed Kriging (PhIK), and model the discrepancy between low- and high-fidelity data using a parameterized GP with hyperparameters identified via optimization. Our approach reduces the cost of optimization for inferring hyperparameters by incorporating partial physical knowledge. We prove that the physical constraints in the form of deterministic linear operators are satisfied up to an error bound. Furthermore, we combine CoPhIK with a greedy active learning algorithm for guiding the selection of additional observation locations. The efficiency and accuracy of CoPhIK are demonstrated for reconstructing the partially observed modified Branin function, reconstructing the sparsely observed state of a steady state heat transport problem, and learning a conservative tracer distribution from sparse tracer concentration measurements.

Keywords: physics-informed, Gaussian process regression, CoKriging, multifidelity, active learning, error bound.

## 1 Introduction

Gaussian processes (GPs) are a widely used tool in applied mathematics, statistics, and machine learning for regression, classification, and optimization [14, 40, 44]. GP regression (GPR), also known as Kriging in geostatistics, constructs a statistical model of a partially observed process by assuming that its observations are a realization of a GP. A GP is uniquely described by its mean and covariance function (also known as the kernel). In standard (referred to here as data-driven) GPR, parameterized forms of the mean and covariance functions are usually assumed, and the hyperparameters of these functions (e.g., variance and correlation length) are estimated from data by maximizing the log marginal likelihood of the data. GPR is also closely related to kernel machines in machine learning, but it provides a richer characterization of the result, as it supplies uncertainty estimates [48]. GPs are also connected to infinite neural networks, that is, networks with an infinite number of hidden units [29].

There are several variants of GPR, including simple, ordinary, and universal Kriging [21]. Ordinary Kriging is the most widely used GPR method. It assumes stationarity of the random field, including constant mean and variance, and a prescribed stationary covariance function. The stationarity assumption reduces the number of hyperparameters and the model complexity. By contrast, in universal Kriging, the mean is modeled as a linear combination of basis functions [1], which increases the number of unknown parameters and may lead to non-convex optimization problems. Although the assumption of stationarity may not be suitable for some applications, it is often necessary because there are usually not enough data to compute accurate estimates of non-stationary mean and covariance functions. Progress has been made in incorporating physical knowledge into kernels, e.g., [42, 18, 6, 7, 38, 39], by computing kernels for systems governed by linear and weakly nonlinear (allowing accurate linearization) ordinary and partial differential equations. Such kernels are computed by substituting a GPR approximation of the system's state variables into the governing equation and obtaining a system of equations for the kernel hyperparameters. For complex linear systems, computing the kernel in this way can become prohibitively expensive, and for strongly nonlinear systems, it may not be possible at all.

In our previous work [52], we proposed the physics-informed Kriging method (PhIK) that incorporates (partial) physical knowledge into GPR. In modeling complex systems, it is common to treat unknown parameters and fields as random parameters and fields, and the resulting realizations of the state of the system are employed to study the uncertainty of the model or the system. The goal of PhIK is to exploit the information about the system provided by these realizations to assimilate observations. In PhIK, such random realizations are used to compute the prior mean and covariance. A similar idea is used in the ensemble Kalman filter (EnKF) [13], whose "filtering step" formula is equivalent to the PhIK prediction Eq. (2.13). Whereas EnKF introduces uncertainty mainly from the observation noise and evolves an ensemble of state variables drawn from the posterior distribution of the previous time step, PhIK utilizes the stochasticity in models and directly uses simulation outputs for prediction without redrawing the ensemble in each time step. Not only does PhIK provide prediction or reconstruction in the form of the posterior mean, it also performs uncertainty reduction (UR) in the form of the posterior variance. More importantly, the PhIK posterior mean satisfies linear physical constraints with a bounded error [52], which is critical for guaranteeing the predictive value of the method. The main drawback of PhIK is that it is highly dependent on the physical model, because the prior mean and covariance are determined entirely by the model and are not informed by data. Therefore, convergence of PhIK to the true solution with an increasing number of available observations is slower than that of data-driven GPR if the physical model is incorrect.

In this work, we propose a physics-informed CoKriging (CoPhIK) method, an extension of the CoKriging-based multifidelity framework [20, 15] to physics-informed Kriging. In this context, the direct observations of a physical system are considered as high-fidelity data and the stochastic physical model outputs are treated as low-fidelity data. CoPhIK uses PhIK to construct a GP $Y_L$ that regresses the low-fidelity data, and uses another parameterized GP $Y_d$ to model the discrepancy between low- and high-fidelity data by assuming a specific kernel; it then infers the hyperparameters of the GP model $Y_d$ via optimization. Subsequently, CoPhIK uses a linear combination of $Y_L$ and $Y_d$ to represent the high-fidelity data. The mean and covariance in CoPhIK integrate physical model outputs and observation data; therefore, CoPhIK is expected to have better accuracy than PhIK in some applications (e.g., the first two numerical examples in Section 3). On the other hand, due to the introduction of the GP $Y_d$, CoPhIK may lose some capacity for satisfying physical constraints with respect to PhIK, as will be shown in the error estimate provided by Theorem 2.2.

This work is organized as follows: Section 2 summarizes the GPR framework and physics-informed Kriging (Sections 2.1 to 2.3), and introduces the CoPhIK method (Section 2.4). Section 3 provides three numerical examples to demonstrate the efficiency of the proposed method. Conclusions are presented in Section 4.

## 2 Methodology

We begin this section by reviewing the general GPR framework [48], the ordinary Kriging method based on the assumption of stationary GP [14], and the PhIK method [52]. Then, we introduce the modified PhIK and CoPhIK methods.

### 2.1 GPR framework

We consider the spatial dependence of a scalar state of a physical system. Let $D \subset \mathbb{R}^d$ be the spatial domain, $y(x): D \to \mathbb{R}$ denote the state of interest, and let $y^{(i)}$, $i = 1, \dots, N$, denote observations of $y$ collected at the observation locations $x^{(i)} \in D$, where $y^{(i)} = y(x^{(i)})$. The observations are arranged into the observation vector $y = \left(y^{(1)}, \dots, y^{(N)}\right)^\top$. We aim to predict $y$ at any new location $x^* \in D$. The GPR method assumes that the observation vector $y$ is a realization of the $N$-dimensional random vector with multivariate Gaussian distribution

$$Y = \left(Y(x^{(1)}, \omega), Y(x^{(2)}, \omega), \dots, Y(x^{(N)}, \omega)\right)^\top,$$

where $Y(x, \omega): D \times \Omega \to \mathbb{R}$ is a GP defined on the probability space $(\Omega, \mathcal{F}, P)$. Of note, the observation coordinates $x$ can be considered as parameters for the GP, such that $Y(x, \cdot)$ is a Gaussian random variable for any $x \in D$. For brevity, we denote $Y(x, \omega)$ by $Y(x)$. The GP $Y(x)$ is usually represented using GP notation as

$$Y(x) \sim \mathcal{GP}\left(\mu(x), k(x, x')\right), \qquad (2.1)$$

where $\mu(x)$ and $k(x, x')$ are the mean and covariance functions

$$\mu(x) = \mathbb{E}\{Y(x)\}, \qquad (2.2)$$

$$k(x, x') = \mathrm{Cov}\{Y(x), Y(x')\} = \mathbb{E}\left\{[Y(x) - \mu(x)][Y(x') - \mu(x')]\right\}. \qquad (2.3)$$

The variance of $Y(x)$ is $\sigma^2(x) = k(x, x)$, and its standard deviation is $\sigma(x)$. The covariance matrix of the random vector $Y$ is then given by

$$C = \begin{pmatrix} k(x^{(1)}, x^{(1)}) & \cdots & k(x^{(1)}, x^{(N)}) \\ \vdots & \ddots & \vdots \\ k(x^{(N)}, x^{(1)}) & \cdots & k(x^{(N)}, x^{(N)}) \end{pmatrix}. \qquad (2.4)$$

When the functions $\mu(x)$ and $k(x, x')$ are parameterized, their hyperparameters are identified by maximizing the log marginal likelihood of the observations (see examples in Section 2.2) [48]:

$$\ln L = -\frac{1}{2}(y - \mu)^\top C^{-1}(y - \mu) - \frac{1}{2}\ln|C| - \frac{N}{2}\ln 2\pi, \qquad (2.5)$$

where $\mu = \left(\mu(x^{(1)}), \dots, \mu(x^{(N)})\right)^\top$.

The GPR prediction at $x^*$ consists of the posterior distribution $y(x^*) \sim \mathcal{N}\left(\hat y(x^*), \hat s^2(x^*)\right)$, with posterior mean $\hat y(x^*)$ and variance $\hat s^2(x^*)$ given by

$$\hat y(x^*) = \mu(x^*) + c(x^*)^\top C^{-1}(y - \mu), \qquad (2.6)$$

$$\hat s^2(x^*) = \sigma^2(x^*) - c(x^*)^\top C^{-1} c(x^*), \qquad (2.7)$$

where $c(x^*)$ is the vector of covariances

$$c(x^*) = \left(k(x^{(1)}, x^*), k(x^{(2)}, x^*), \dots, k(x^{(N)}, x^*)\right)^\top. \qquad (2.8)$$

In practice, it is common to employ the posterior mean $\hat y(x^*)$ as the prediction. The variance $\hat s^2(x^*)$ is often called the mean squared error (MSE) of the prediction because $\hat s^2(x^*) = \mathbb{E}\left\{[\hat y(x^*) - Y(x^*)]^2\right\}$ [14]. Consequently, $\hat s(x^*)$ is called the root mean squared error (RMSE).

To account for observation noise, one can model the noise as independent and identically distributed (iid) Gaussian random variables with zero mean and variance $\delta^2$, and replace $C$ in Eqs. (2.5) to (2.8) with $C + \delta^2 I$, where $I$ is the identity matrix. In this study, we assume that observations of $y$ are noiseless. If $C$ is not invertible or its condition number is very large, one can add a small regularization term $\alpha I$, where $\alpha$ is a small positive real number, to $C$ so that it becomes full rank. Adding the regularization term is equivalent to assuming iid observation noise with variance $\alpha$.
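The prediction formulas above translate directly into code. The following is a minimal sketch of Eqs. (2.6) and (2.7), including the regularization term just discussed; the Gaussian kernel, its correlation length, and the test data are illustrative assumptions, not part of the framework itself.

```python
import numpy as np

def gpr_posterior(X, y, x_star, mu, kernel, alpha=1e-10):
    """Posterior mean and variance at x_star, following Eqs. (2.6)-(2.7).

    X      : (N, d) array of observation locations
    y      : (N,) observation vector
    mu     : prior mean function, mu(x) -> float
    kernel : prior covariance function, kernel(x, xp) -> float
    alpha  : small regularization (equivalent to iid noise of variance alpha)
    """
    N = len(X)
    # Covariance matrix of the observations, Eq. (2.4), with regularization
    C = np.array([[kernel(xi, xj) for xj in X] for xi in X]) + alpha * np.eye(N)
    # Vector of covariances between the prediction point and the data, Eq. (2.8)
    c = np.array([kernel(xi, x_star) for xi in X])
    resid = y - np.array([mu(xi) for xi in X])
    y_hat = mu(x_star) + c @ np.linalg.solve(C, resid)       # Eq. (2.6)
    s2 = kernel(x_star, x_star) - c @ np.linalg.solve(C, c)  # Eq. (2.7)
    return y_hat, s2

# Illustrative setup: zero prior mean and a Gaussian kernel with l = 0.3
k = lambda x, xp: np.exp(-0.5 * np.sum((x - xp) ** 2) / 0.3 ** 2)
X = np.linspace(0, 1, 5).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel()
y_hat, s2 = gpr_posterior(X, y, np.array([0.45]), lambda x: 0.0, k)
```

Because the observations are treated as noiseless, the posterior mean interpolates the data and the posterior variance nearly vanishes at the observation locations.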

### 2.2 Stationary GPR

In the widely used ordinary Kriging method, a stationary GP is assumed. Specifically, $\mu(x)$ is set as a constant $\mu$, and $k(x, x') = k(\tau)$, where $\tau = x - x'$. Consequently, $\sigma^2 = k(0)$ is a constant. Popular forms of kernels include polynomial, exponential, Gaussian (squared-exponential), and Matérn functions. For example, the Gaussian kernel can be written as $k(\tau) = \sigma^2\exp\left(-\frac{1}{2}\|\tau\|_w^2\right)$, where the weighted norm is defined as $\|\tau\|_w^2 = \sum_{i=1}^d \left(\tau_i/l_i\right)^2$. The constant $\sigma^2$ and the correlation lengths along each direction, $l_i$ ($i = 1, \dots, d$), are the hyperparameters of the Gaussian kernel.

For a stationary kernel, the covariance matrix of observations, $C$, can be written as $C = \sigma^2\Psi$, where $\Psi_{ij} = \exp\left(-\frac{1}{2}\|x^{(i)} - x^{(j)}\|_w^2\right)$ for the Gaussian kernel. In the maximum likelihood estimation (MLE) framework, the estimators of $\mu$ and $\sigma^2$, denoted as $\hat\mu$ and $\hat\sigma^2$, are

$$\hat\mu = \frac{\mathbf{1}^\top \Psi^{-1} y}{\mathbf{1}^\top \Psi^{-1} \mathbf{1}}, \qquad \hat\sigma^2 = \frac{(y - \mathbf{1}\hat\mu)^\top \Psi^{-1} (y - \mathbf{1}\hat\mu)}{N}, \qquad (2.9)$$

where $\mathbf{1}$ is a vector of $1$s. The correlation lengths $l_i$ are estimated by maximizing the log marginal likelihood, Eq. (2.5). The prediction of $y$ at location $x^*$ is

$$\hat y(x^*) = \hat\mu + \psi^\top \Psi^{-1}(y - \mathbf{1}\hat\mu), \qquad (2.10)$$

where $\psi$ is a vector of correlations between the observed data and the prediction, given by

$$\psi = \psi(x^*) = \frac{1}{\sigma^2}\left(k(x^{(1)} - x^*), \dots, k(x^{(N)} - x^*)\right)^\top,$$

and the MSE of the prediction is

$$\hat s^2(x^*) = \hat\sigma^2\left(1 - \psi^\top \Psi^{-1}\psi\right). \qquad (2.11)$$

A more general approach to GPR is to employ parameterized nonstationary covariance kernels. Nonstationary kernels can be obtained by modifying stationary covariance kernels, e.g., [41, 27, 32, 37, 4, 28], or from neural networks with specific activation functions, e.g., [29, 34], among other approaches. Many of these approaches assume a specific functional form for the correlation function, chosen according to expert knowledge. The key computational challenge in these data-driven GPR methods is the optimization step of maximizing the (log marginal) likelihood. In many practical cases, this is a non-convex optimization problem, and the condition number of $C$ or $\Psi$ can be quite large. Another fundamental challenge is that parameterized models for the mean and covariance usually do not account for physical constraints, and therefore require a large amount of data to accurately model the physics.
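The closed-form estimators of Eq. (2.9) and the prediction formulas (2.10) and (2.11) can be sketched as follows. For brevity the correlation length is fixed rather than optimized, which is an assumption for illustration only; in practice it is found by maximizing the log marginal likelihood.

```python
import numpy as np

def ordinary_kriging(X, y, x_star, l=0.3):
    """Ordinary Kriging with a Gaussian correlation model, Eqs. (2.9)-(2.11).

    The correlation length l is fixed here purely for brevity; in practice it
    is estimated by maximizing the log marginal likelihood, Eq. (2.5).
    """
    N = len(X)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Psi = np.exp(-0.5 * d2 / l ** 2) + 1e-10 * np.eye(N)   # correlation matrix
    ones = np.ones(N)
    Psi_inv_y = np.linalg.solve(Psi, y)
    Psi_inv_1 = np.linalg.solve(Psi, ones)
    mu_hat = (ones @ Psi_inv_y) / (ones @ Psi_inv_1)        # Eq. (2.9)
    resid = y - mu_hat
    sigma2_hat = (resid @ np.linalg.solve(Psi, resid)) / N  # Eq. (2.9)
    psi = np.exp(-0.5 * np.sum((X - x_star) ** 2, axis=-1) / l ** 2)
    y_hat = mu_hat + psi @ np.linalg.solve(Psi, resid)      # Eq. (2.10)
    s2 = sigma2_hat * (1.0 - psi @ np.linalg.solve(Psi, psi))  # Eq. (2.11)
    return y_hat, s2

# Illustrative data: six noiseless samples of a cosine
X = np.linspace(0, 1, 6).reshape(-1, 1)
y = np.cos(2 * np.pi * X).ravel()
y_hat, s2 = ordinary_kriging(X, y, X[3])
```

As in standard GPR, the predictor interpolates the noiseless data, and the MSE of Eq. (2.11) is nearly zero at the observation locations.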

### 2.3 PhIK

The recently proposed PhIK method [52] takes advantage of existing expert knowledge in the form of stochastic physics-based models. These stochastic models of physical systems include random parameters or random fields to reflect the lack of understanding (of physical laws) or knowledge (of the coefficients, parameters, etc.) of the real system. Monte Carlo (MC) simulations of the stochastic physical model can be conducted to generate an ensemble of state variables, from which the mean and covariance are estimated. These mean and covariance estimates are then employed to construct a GP model of the state variables. As such, there is no need to assume a specific parameterized covariance kernel or to solve an optimization problem for the hyperparameters of the kernel.

Given $M$ realizations of a stochastic model $u(x;\omega)$, $x \in D$, $\omega \in \Omega$, denoted as $\{Y^m(x)\}_{m=1}^M$, we build the following GP model:

$$Y(x) \sim \mathcal{GP}\left(\mu_{MC}(x), k_{MC}(x, x')\right), \qquad (2.12)$$

where $\mu_{MC}$ and $k_{MC}$ are the ensemble mean and covariance functions

$$\mu(x) \approx \mu_{MC}(x) = \frac{1}{M}\sum_{m=1}^M Y^m(x), \qquad k(x, x') \approx k_{MC}(x, x') = \frac{1}{M-1}\sum_{m=1}^M \left[Y^m(x) - \mu_{MC}(x)\right]\left[Y^m(x') - \mu_{MC}(x')\right]. \qquad (2.13)$$

The covariance matrix of observations, $C$, can be estimated as

$$C \approx C_{MC} = \frac{1}{M-1}\sum_{m=1}^M \left(Y^m - \mu_{MC}\right)\left(Y^m - \mu_{MC}\right)^\top, \qquad (2.14)$$

where $Y^m = \left(Y^m(x^{(1)}), \dots, Y^m(x^{(N)})\right)^\top$ and $\mu_{MC} = \left(\mu_{MC}(x^{(1)}), \dots, \mu_{MC}(x^{(N)})\right)^\top$. The prediction and MSE at location $x^* \in D$ are

$$\hat y(x^*) = \mu_{MC}(x^*) + c_{MC}(x^*)^\top C_{MC}^{-1}\left(y - \mu_{MC}\right), \qquad (2.15)$$

$$\hat s^2(x^*) = \hat\sigma^2_{MC}(x^*) - c_{MC}(x^*)^\top C_{MC}^{-1} c_{MC}(x^*), \qquad (2.16)$$

where $\hat\sigma^2_{MC}(x^*)$ is the variance of the set $\{Y^m(x^*)\}_{m=1}^M$, and $c_{MC}(x^*) = \left(k_{MC}(x^{(1)}, x^*), \dots, k_{MC}(x^{(N)}, x^*)\right)^\top$.

It was demonstrated in [52] that PhIK predictions satisfy linear physical constraints up to an error bound that depends on the numerical error, the discrepancy between the physical model and the real system, and the smallest eigenvalue of the matrix $C_{MC}$. Linear physical constraints include periodic, Dirichlet, or Neumann boundary conditions, and linear equations $\mathcal{L}u(x) = g(x)$, where $\mathcal{L}$ is a linear differential or integral operator. For example, let $u(x;\omega)$ be a stochastic model of the velocity potential for an incompressible flow, i.e., $\nabla\cdot\left(\nabla u(x;\omega)\right) = 0$; then PhIK guarantees that $\nabla\hat y(x)$ is a divergence-free field.

In PhIK, the MC simulation of the stochastic physical model for computing $\mu_{MC}$ and $k_{MC}$ can be replaced by more efficient approaches, such as quasi-Monte Carlo [30], multilevel Monte Carlo (MLMC) [17], probabilistic collocation [49], analysis of variance (ANOVA) [50], compressive sensing [51], moment equation and PDF methods [45, 3], and the bi-fidelity method [53]. Linear physical constraints are preserved if $\mu_{MC}$ and $k_{MC}$ are computed as linear combinations of the realizations $\{Y^m(x)\}_{m=1}^M$. As an example, we present the MLMC-based PhIK [52] in Appendix A.
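The PhIK construction of Eqs. (2.13) to (2.16) uses only the model ensemble, with no kernel assumption or hyperparameter optimization. A minimal sketch follows; the toy ensemble (sine profiles with a random amplitude standing in for model runs) and the helper name `phik_predict` are assumptions for illustration.

```python
import numpy as np

def phik_predict(Y_ens, obs_idx, y_obs, alpha=1e-8):
    """PhIK posterior mean and variance on the model grid, Eqs. (2.13)-(2.16).

    Y_ens   : (M, n) ensemble of model realizations on n grid points
    obs_idx : indices of the N observed grid points
    y_obs   : (N,) observations at those points (assumed noiseless)
    """
    M, n = Y_ens.shape
    mu = Y_ens.mean(axis=0)                    # ensemble mean, Eq. (2.13)
    dev = Y_ens - mu
    K = dev.T @ dev / (M - 1)                  # ensemble covariance, Eq. (2.13)
    C = K[np.ix_(obs_idx, obs_idx)] + alpha * np.eye(len(obs_idx))  # Eq. (2.14)
    w = np.linalg.solve(C, y_obs - mu[obs_idx])
    y_hat = mu + K[:, obs_idx] @ w             # Eq. (2.15) at every grid point
    quad = np.einsum('ij,jk,ik->i', K[:, obs_idx], np.linalg.inv(C), K[:, obs_idx])
    s2 = np.maximum(K.diagonal() - quad, 0.0)  # Eq. (2.16), clipped at zero
    return y_hat, s2

# Toy "stochastic model": sine profiles with random amplitude
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
Y_ens = rng.normal(1.0, 0.2, size=(100, 1)) * np.sin(2 * np.pi * x)[None, :]
obs_idx = [3, 10, 16]
y_obs = 1.1 * np.sin(2 * np.pi * x[obs_idx])
y_hat, s2 = phik_predict(Y_ens, obs_idx, y_obs)
```

Note that the prediction is a linear combination of the ensemble mean and ensemble covariance columns, which is what allows linear physical constraints to carry over from the realizations to the posterior mean.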

Further, the matrix $C_{MC}$ and vector $\mu_{MC}$ are fixed in Eq. (2.5) for a given ensemble $\{Y^m(x)\}_{m=1}^M$. Thus, the log marginal likelihood is fixed. We can modify PhIK by adding a correction term to $\mu_{MC}$ to increase the likelihood. Specifically, we replace $\mu_{MC}$ by $\mu_{MC} + \Delta\mu\mathbf{1}$, where $\Delta\mu$ is a constant. Then, taking the derivative of $\ln L$ with respect to $\Delta\mu$ and setting it to zero yields

$$\Delta\mu = \frac{\mathbf{1}^\top C_{MC}^{-1}\left(y - \mu_{MC}\right)}{\mathbf{1}^\top C_{MC}^{-1}\mathbf{1}}. \qquad (2.17)$$

This modification has the potential to increase the accuracy of the prediction, but it may also violate some physical constraints, e.g., the Dirichlet boundary condition. We name this method modified PhIK, and provide the following theorem on how well it preserves linear physical constraints.
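A short numerical check makes the role of Eq. (2.17) concrete: with the covariance held fixed, the shift $\Delta\mu$ is the unique stationary point of the log marginal likelihood, so applying it can only increase the likelihood. The SPD covariance and biased mean below are toy stand-ins for $C_{MC}$ and $\mu_{MC}$, assumed purely for illustration.

```python
import numpy as np

def log_marginal_likelihood(y, mu_vec, C):
    """Eq. (2.5) for a fixed covariance matrix C."""
    r = y - mu_vec
    _, logdet = np.linalg.slogdet(C)
    return (-0.5 * r @ np.linalg.solve(C, r)
            - 0.5 * logdet
            - 0.5 * len(y) * np.log(2 * np.pi))

def delta_mu(y, mu_vec, C):
    """Optimal constant mean shift of Eq. (2.17)."""
    ones = np.ones(len(y))
    return (ones @ np.linalg.solve(C, y - mu_vec)) / (ones @ np.linalg.solve(C, ones))

# Toy SPD covariance and deliberately biased mean, standing in for C_MC, mu_MC
rng = np.random.default_rng(2)
A = rng.normal(size=(8, 8))
C = A @ A.T + 8 * np.eye(8)
y = rng.normal(size=8)
mu_vec = np.full(8, -0.5)
dm = delta_mu(y, mu_vec, C)
```

By construction, the derivative $\mathbf{1}^\top C^{-1}(y - \mu - \Delta\mu\mathbf{1})$ vanishes at the computed shift.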

###### Theorem 2.1.

Assume that a stochastic model $u(x;\omega)$ defined on $D \times \Omega$ ($D \subseteq \mathbb{R}^d$) satisfies $\left\|\mathcal{L}u(x;\omega) - g(x;\omega)\right\| \le \epsilon$ for any $\omega \in \Omega$, where $\mathcal{L}$ is a deterministic bounded linear operator, $g(x;\omega)$ is a well-defined function on $D \times \Omega$, and $\|\cdot\|$ is a well-defined function norm. Let $\{Y^m(x)\}_{m=1}^M$ be a finite number of realizations of $u(x;\omega)$, i.e., $Y^m(x) = u(x;\omega^m)$. Then, the prediction $\hat y(x)$ from modified PhIK satisfies

$$\left\|\mathcal{L}\hat y(x) - \overline{g(x)}\right\| \le \epsilon + \left[\frac{2\epsilon\sqrt{M}}{M-1} + \sigma\left(g(x;\omega^m)\right)\right]\left\|C_{MC}^{-1}\right\|_2\left\|y - \mu_{MC} - \Delta\mu\mathbf{1}\right\|_2\sum_{i=1}^N \sigma\left(Y^m(x^{(i)})\right) + \left\|\mathcal{L}\Delta\mu\right\|, \qquad (2.18)$$

where $\sigma\left(g(x;\omega^m)\right)$ is the standard deviation of the data set $\{g(x;\omega^m)\}_{m=1}^M$ for fixed $x$, $\overline{g(x)} = \frac{1}{M}\sum_{m=1}^M g(x;\omega^m)$, and $\sigma\left(Y^m(x^{(i)})\right)$ is the standard deviation of the set $\{Y^m(x^{(i)})\}_{m=1}^M$.

###### Proof.

The modified PhIK prediction can be written as

$$\hat y(x) = \mu_{MC}(x) + \Delta\mu + \sum_{i=1}^N \tilde a_i k_{MC}(x, x^{(i)}), \qquad (2.19)$$

where $\tilde a_i$ is the $i$-th entry of $\tilde a = C_{MC}^{-1}\left(y - \mu_{MC} - \Delta\mu\mathbf{1}\right)$. According to Theorem 2.1 and Corollary 2.2 in [52],

$$\begin{aligned} \left\|\mathcal{L}\hat y(x) - \overline{g(x)}\right\| &\le \left\|\mathcal{L}\left(\mu_{MC}(x) + \sum_{i=1}^N \tilde a_i k_{MC}(x, x^{(i)})\right) - \overline{g(x)}\right\| + \left\|\mathcal{L}\Delta\mu\right\| \\ &\le \epsilon + \left[\frac{2\epsilon\sqrt{M}}{M-1} + \sigma\left(g(x;\omega^m)\right)\right]\left\|C_{MC}^{-1}\right\|_2\left\|y - \mu_{MC} - \Delta\mu\mathbf{1}\right\|_2\sum_{i=1}^N \sigma\left(Y^m(x^{(i)})\right) + \left\|\mathcal{L}\Delta\mu\right\|. \end{aligned}$$

∎

For $\Delta\mu = 0$, the bound (2.18) reverts to the bound in [52], and the modified PhIK method reverts to PhIK. In some cases, the term $\|\mathcal{L}\Delta\mu\| = 0$, e.g., when $\mathcal{L}$ is a differential operator such as the Neumann boundary condition operator.

### 2.4 CoPhIK

CoKriging was originally formulated to compute predictions of sparsely observed states of physical systems by leveraging observations of other states or parameters of the system [43, 22]. Recently, it has been employed for constructing multi-fidelity models [20, 25, 35], and has been applied in various areas, e.g., [24, 5, 33]. In this work, we propose a novel multi-fidelity method, CoPhIK, that integrates PhIK and CoKriging by combining numerical simulations and high-fidelity observations. Our multi-fidelity method is based on Kennedy and O’Hagan’s CoKriging framework presented in [20, 14].

We briefly review the formulation of CoKriging for two-level multifidelity modeling in [15]. Suppose that we have high-fidelity data (e.g., accurate measurements of states) $y_H = \left(y_H^{(1)}, \dots, y_H^{(N_H)}\right)^\top$ at locations $X_H = \{x_H^{(i)}\}_{i=1}^{N_H}$, and low-fidelity data (e.g., simulation results) $y_L = \left(y_L^{(1)}, \dots, y_L^{(N_L)}\right)^\top$ at locations $X_L = \{x_L^{(i)}\}_{i=1}^{N_L}$, where $x_H^{(i)}, x_L^{(i)} \in D$. By concatenating the observation locations and data respectively, i.e., $\tilde X = \{X_L, X_H\}$ and $\tilde y = \left(y_L^\top, y_H^\top\right)^\top$, we can construct a multivariate GP via Kriging as detailed in [2]. Kennedy and O'Hagan proposed an alternative formulation of CoKriging based on the auto-regressive model for $Y_H$:

$$Y_H(x) = \rho Y_L(x) + Y_d(x), \qquad (2.20)$$

where $\rho \in \mathbb{R}$ is a regression parameter and $Y_d(x)$ is a GP that models the difference between $Y_H$ and $\rho Y_L$. This model assumes that

$$\mathrm{Cov}\left\{Y_H(x), Y_L(x') \mid Y_L(x)\right\} = 0, \quad \text{for all } x' \neq x, \; x, x' \in D. \qquad (2.21)$$

It was shown in [31] that the assumption of Eq. (2.21) implies the auto-regressive model of Eq. (2.20). The covariance matrix of the observations, $\tilde C$, is then given by

$$\tilde C = \begin{pmatrix} C_L(X_L, X_L) & \rho C_L(X_L, X_H) \\ \rho C_L(X_H, X_L) & \rho^2 C_L(X_H, X_H) + C_d(X_H, X_H) \end{pmatrix}, \qquad (2.22)$$

where $C_L$ is the covariance matrix based on GP $Y_L$'s kernel $k_L(\cdot,\cdot)$, and $C_d$ is the covariance matrix based on GP $Y_d$'s kernel $k_d(\cdot,\cdot)$. One can assume parameterized forms for these kernels (e.g., the Gaussian kernel) and then simultaneously identify their hyperparameters along with $\rho$ by maximizing the following log marginal likelihood:

$$\ln\tilde L = -\frac{1}{2}\left(\tilde y - \tilde\mu\right)^\top \tilde C^{-1}\left(\tilde y - \tilde\mu\right) - \frac{1}{2}\ln\left|\tilde C\right| - \frac{N_H + N_L}{2}\ln 2\pi. \qquad (2.23)$$

Alternatively, one can employ the following two-step approach [15, 14]:

1. Use Kriging to construct $Y_L(x)$ using $\{X_L, y_L\}$.

2. Denote $y_d = y_H - \rho y_L(X_H)$, where $y_L(X_H)$ are the values of $y_L$ at the locations common to those of $y_H$; then construct $Y_d(x)$ using $\{X_H, y_d\}$ via Kriging.

The posterior mean and variance of $y_H$ at $x^*$ are given by

$$\hat y(x^*) = \mu_H(x^*) + \tilde c(x^*)^\top \tilde C^{-1}\left(\tilde y - \tilde\mu\right), \qquad (2.24)$$

$$\hat s^2(x^*) = \rho^2\sigma_L^2(x^*) + \sigma_d^2(x^*) - \tilde c(x^*)^\top \tilde C^{-1}\tilde c(x^*), \qquad (2.25)$$

where $\mu_H(x^*) = \rho\mu_L(x^*) + \mu_d(x^*)$, $\mu_L$ is the mean of $Y_L$, $\mu_d$ is the mean of $Y_d$, $\sigma_L^2(x^*) = k_L(x^*, x^*)$, $\sigma_d^2(x^*) = k_d(x^*, x^*)$, and

$$\tilde\mu = \begin{pmatrix} \mu_L \\ \mu_H \end{pmatrix} = \begin{pmatrix} \left(\mu_L(x_L^{(1)}), \dots, \mu_L(x_L^{(N_L)})\right)^\top \\ \left(\mu_H(x_H^{(1)}), \dots, \mu_H(x_H^{(N_H)})\right)^\top \end{pmatrix}, \qquad (2.26)$$

$$\tilde c(x^*) = \begin{pmatrix} \rho\, c_L(x^*) \\ c_H(x^*) \end{pmatrix} = \begin{pmatrix} \left(\rho k_L(x^*, x_L^{(1)}), \dots, \rho k_L(x^*, x_L^{(N_L)})\right)^\top \\ \left(k_H(x^*, x_H^{(1)}), \dots, k_H(x^*, x_H^{(N_H)})\right)^\top \end{pmatrix}, \qquad (2.27)$$

where $k_H(x, x') = \rho^2 k_L(x, x') + k_d(x, x')$. Here, we have neglected a small contribution to $\hat s^2$ (see [14]).
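The two-step procedure of [15] reviewed above can be sketched compactly. In this sketch, $\rho$ and the correlation lengths are fixed and the prior means are set to zero, all purely for brevity (they are assumptions, not the method's prescription; in practice they are identified by maximizing Eq. (2.23)); the helper names are hypothetical.

```python
import numpy as np

def krig(X, y, Xs, l):
    """Zero-prior-mean Kriging predictor with a Gaussian kernel (for brevity)."""
    d2 = lambda A, B: np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * d2(X, X) / l ** 2) + 1e-10 * np.eye(len(X))
    return np.exp(-0.5 * d2(Xs, X) / l ** 2) @ np.linalg.solve(K, y)

def cokriging_two_step(XL, yL, XH, yH, Xs, rho=1.0, lL=0.15, ld=0.15):
    """Two-step CoKriging [15]: step 1 fits Y_L from the low-fidelity data;
    step 2 fits the discrepancy Y_d; the prediction then follows Eq. (2.20).

    Assumes the high-fidelity locations XH are a subset of XL.
    """
    yL_at_H = krig(XL, yL, XH, lL)   # step 1: y_L at high-fidelity locations
    yd = yH - rho * yL_at_H          # step 2: discrepancy data y_d
    return rho * krig(XL, yL, Xs, lL) + krig(XH, yd, Xs, ld)

# Toy data: biased low-fidelity curve, sparse high-fidelity samples
XL = np.linspace(0, 1, 11).reshape(-1, 1)
yL = 0.8 * np.sin(2 * np.pi * XL).ravel()
XH = XL[::2]
yH = np.sin(2 * np.pi * XH).ravel()
pred = cokriging_two_step(XL, yL, XH, yH, XH)
```

Because both steps interpolate their (noiseless) data, the combined predictor reproduces the high-fidelity observations at $X_H$ while borrowing the low-fidelity trend elsewhere.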

Now we describe the CoPhIK method. We set $X_L = X_H = X$ to simplify the formulas and computation, and denote $N = N_L = N_H$. We employ PhIK to construct the GP $Y_L$ using realizations $\{Y^m(x)\}_{m=1}^M$ of a stochastic model on $D$. Specifically, we set $\mu_L(x) = \mu_{MC}(x)$ and $k_L(x, x') = k_{MC}(x, x')$, where $\mu_{MC}$ and $k_{MC}$ are given by Eq. (2.13). The GP model $Y_d$ is constructed using the same approach as in the second step of the Kennedy and O'Hagan CoKriging framework. In other words, CoPhIK replaces the first step of their framework with PhIK, and follows the same procedure for the second step.

We proceed to describe the construction of $Y_d$ in more detail. First, we set $y_L = \left(\mu_L(x^{(1)}), \dots, \mu_L(x^{(N)})\right)^\top$. The reason for this choice is that $\mu_L(X)$ is the most probable observation of the GP $Y_L$. Next, we need to assume a specific form of the kernel function $k_d(x, x')$. Without loss of generality, in the following theoretical analysis and computational examples, we use the stationary Gaussian kernel model and a constant $\mu_d$. Once $y_d = y_H - \rho y_L$ is computed, and the forms of $\mu_d$ and $k_d$ are decided, $Y_d$ can be constructed as in ordinary Kriging. Now all components in $\tilde C$ are specified except for the $\rho$ in Eq. (2.22), which we set by maximizing the log marginal likelihood Eq. (2.23). The algorithm is summarized in Algorithm 1.
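Under the simplifications above, the CoPhIK posterior mean admits a compact sketch: because $y_L$ is set to the PhIK mean at the observation locations, the low-fidelity residual vanishes, and the prediction reduces to $\rho\mu_L$ plus an ordinary-Kriging correction built from the discrepancy data. In the sketch below, $\rho$, the correlation length, and the zero discrepancy mean are fixed (assumptions for illustration; the paper identifies them by likelihood maximization), and the toy ensemble and helper name are hypothetical.

```python
import numpy as np

def cophik_predict(Y_ens, x_grid, obs_idx, y_obs, rho=1.0, ld=0.2, alpha=1e-8):
    """Simplified CoPhIK posterior mean on the model grid.

    Y_L comes from PhIK (mu_L = mu_MC, k_L = k_MC); Y_d is an ordinary-Kriging
    GP for the discrepancy y_H - rho * y_L with a Gaussian kernel.
    """
    n = Y_ens.shape[1]
    mu_L = Y_ens.mean(axis=0)              # PhIK mean, Eq. (2.13)
    y_L = mu_L[obs_idx]                    # low-fidelity "data" at X
    y_d = y_obs - rho * y_L                # discrepancy data
    # Ordinary Kriging of the discrepancy with a Gaussian kernel
    xg = x_grid.reshape(n, -1)
    d2 = np.sum((xg[:, None, :] - xg[None, :, :]) ** 2, axis=-1)
    K_d = np.exp(-0.5 * d2 / ld ** 2)
    C_d = K_d[np.ix_(obs_idx, obs_idx)] + alpha * np.eye(len(obs_idx))
    y_hat_d = K_d[:, obs_idx] @ np.linalg.solve(C_d, y_d)
    return rho * mu_L + y_hat_d            # Eq. (2.20): rho * Y_L + Y_d

# Toy ensemble with a systematic bias relative to the observed system
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
Y_ens = rng.normal(1.0, 0.2, size=(100, 1)) * np.sin(2 * np.pi * x)[None, :]
obs_idx = [4, 15, 24]
y_obs = np.sin(2 * np.pi * x[obs_idx]) + 0.1
y_hat = cophik_predict(Y_ens, x, obs_idx, y_obs)
```

The discrepancy GP absorbs the systematic model bias near the observations, while the PhIK mean supplies the physics-informed trend away from them.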

Next, we analyze the form of the posterior mean $\hat y$. Recalling the choice $X_L = X_H = X$ and introducing the notation $C_1 = C_L(X, X)$ and $C_2 = C_d(X, X)$ in Eq. (2.22), we can write the inverse of $\tilde C$ as

$$\tilde C^{-1} = \begin{pmatrix} C_1^{-1} + \rho^2 C_2^{-1} & -\rho C_2^{-1} \\ -\rho C_2^{-1} & C_2^{-1} \end{pmatrix}. \qquad (2.28)$$

Thus,

$$\begin{aligned} \tilde C^{-1}\left(\tilde y - \tilde\mu\right) &= \begin{pmatrix} C_1^{-1} + \rho^2 C_2^{-1} & -\rho C_2^{-1} \\ -\rho C_2^{-1} & C_2^{-1} \end{pmatrix}\begin{pmatrix} y_L - \mu_L \\ y_H - \mu_H \end{pmatrix} \\ &= \begin{pmatrix} \left(C_1^{-1} + \rho^2 C_2^{-1}\right)(y_L - \mu_L) - \rho C_2^{-1}(y_H - \mu_H) \\ -\rho C_2^{-1}(y_L - \mu_L) + C_2^{-1}(y_H - \mu_H) \end{pmatrix} \\ &= \begin{pmatrix} C_1^{-1}(y_L - \mu_L) - \rho C_2^{-1}\left[(y_H - \mu_H) - \rho(y_L - \mu_L)\right] \\ C_2^{-1}\left[(y_H - \mu_H) - \rho(y_L - \mu_L)\right] \end{pmatrix} \\ &= \begin{pmatrix} C_1^{-1}(y_L - \mu_L) - \rho C_2^{-1}\left(y_H - \rho y_L - \mathbf{1}\mu_d\right) \\ C_2^{-1}\left(y_H - \rho y_L - \mathbf{1}\mu_d\right) \end{pmatrix}, \end{aligned} \qquad (2.29)$$

where $\mu_H = \rho\mu_L + \mathbf{1}\mu_d$. Therefore, the posterior mean at $x^*$, given by Eq. (2.24), can be rewritten as

$$\begin{aligned} \hat y(x^*) &= \mu_H(x^*) + \left(\rho\, c_L(x^*)^\top, \, c_H(x^*)^\top\right)\begin{pmatrix} C_1^{-1}(y_L - \mu_L) - \rho C_2^{-1}\left(y_H - \rho y_L - \mathbf{1}\mu_d\right) \\ C_2^{-1}\left(y_H - \rho y_L - \mathbf{1}\mu_d\right) \end{pmatrix} \\ &= \mu_H(x^*) + \rho\, c_L(x^*)^\top C_1^{-1}(y_L - \mu_L) + \left(c_H(x^*) - \rho^2 c_L(x^*)\right)^\top C_2^{-1}\left(y_H - \rho y_L - \mathbf{1}\mu_d\right). \end{aligned} \qquad (2.30)$$