Gaussian processes with linear operator inequality constraints


Christian Agrell    chrisagr@math.uio.no
Department of Mathematics
University of Oslo
P.O. Box 1053 Blindern, Oslo N-0316, Norway

Group Technology and Research
DNV GL
P.O. Box 300, 1322 Høvik, Norway
Abstract

This paper presents an approach for constrained Gaussian Process (GP) regression where we assume that a set of linear transformations of the process are bounded. It is motivated by machine learning applications for high-consequence engineering systems, where this kind of information is often made available from phenomenological knowledge, and the resulting constraints may be essential to achieve the level of confidence needed. We consider a GP $f$ over functions on $\mathbb{R}^{n}$ taking values in $\mathbb{R}$, where the transformed process $\mathcal{L}f$ is still Gaussian when $\mathcal{L}$ is a linear operator. Our goal is to model $f$ under the constraint that realizations of $\mathcal{L}f$ are confined to a convex set of functions. In particular, we require that $a \le \mathcal{L}f \le b$ for two given functions $a$ and $b$ with $a \le b$ pointwise. This formulation provides a consistent way of encoding multiple linear constraints, such as shape constraints based on e.g. boundedness, monotonicity or convexity. We adopt the approach of using a sufficiently dense set of virtual observation locations where the constraint is required to hold, and derive the exact posterior for a conjugate likelihood. The results needed for stable numerical implementation are derived, together with an efficient sampling scheme for estimating the posterior process which is exact in the limit. A few numerical examples focusing on noiseless observations are given. This setting is relevant for computer code emulation, and is also more computationally demanding than the alternative scenario with i.i.d. Gaussian noise.


Pre-print

Keywords: Gaussian processes, uncertainty quantification, linear constraints

1 Introduction

Gaussian Processes (GPs) are a flexible tool for Bayesian nonparametric function estimation, and widely used for applications that require inference on functions such as regression and classification. A useful property of GPs is that they automatically produce estimates on prediction uncertainty, and it is often possible to encode prior knowledge in a principled manner in the modelling of prior covariance. Some early well-known applications of GPs are within spatial statistics, e.g. meteorology (Thompson, 1956), and in geostatistics (Matheron, 1973) where it is known as kriging. More recently GPs have become a popular choice within probabilistic machine learning (Rasmussen and Williams, 2005; Ghahramani, 2015). Since GPs can act as interpolators when observations are noiseless, they have also become the main approach for uncertainty quantification and analysis involving computer experiments (Sacks et al., 1989; Kennedy and O'Hagan, 2001).

Often, the modeler performing function estimation has prior knowledge, or at least hypotheses, on some properties of the function to be estimated. This is typically related to the function shape with respect to some of the input parameters, such as boundedness, monotonicity or convexity. Various methods have been proposed for imposing these types of constraints on GPs (see Section 4 for a short review). For engineering and physics based applications, constraints based on integral operators and partial differential equations are also relevant (Jidling et al., 2017; Särkkä, 2011). What the above constraints have in common is that they are linear operators, and so any combination of such constraints can be written as a single linear operator. For instance, the constraints $a_1 \le f \le b_1$ and $a_2 \le \partial f / \partial x_i \le b_2$ for some functions (or distributions over functions) $a_1, b_1, a_2, b_2$, can be written as $a \le \mathcal{L}f \le b$ for $a = [a_1, a_2]^T$, $b = [b_1, b_2]^T$ and $\mathcal{L}$ being the linear operator $\mathcal{L}f = [f, \partial f / \partial x_i]^T$.

The motivation for including these constraints on GPs is usually to improve predictions and to obtain a reduced and more realistic estimate on the uncertainty, the latter having significant impact for risk-based applications. For many real-world systems, information related to constraints in this form is often available from phenomenological knowledge. For engineering systems this is typically knowledge related to some underlying physical phenomenon. Being able to make use of these constraints in probabilistic modelling is particularly relevant for high-consequence applications, where obtaining realistic uncertainty estimates in subsets of the domain where data is scarce is a challenge. Furthermore, information on whether these types of constraints are likely to hold given a set of observations is also useful for explainability and model falsification. For a broader discussion see (Agrell et al., 2018) and (Eldevik et al., 2018).

In this paper we present a model for estimating a function $f$ by a constrained GP (CGP) $f \mid Y,\ a \le \mathcal{L}f \le b$, where $Y$ is a set of observations of $f$, possibly including additive white noise, and $f$ is a GP with mean and covariance function that are chosen such that existence of $\mathcal{L}f$ is ensured. Due to the linearity of $\mathcal{L}$, both $\mathcal{L}f$ and $\mathcal{L}f \mid Y$ remain Gaussian, and our approach is based on modelling $f \mid Y$ under the constraint $a \le \mathcal{L}f \le b$. To model the constraint that $a(x) \le \mathcal{L}f(x) \le b(x)$ for all inputs $x$, we take the approach of using a finite set of input locations where the constraint is required to hold. That is, we require that $a(x_v) \le \mathcal{L}f(x_v) \le b(x_v)$ for a finite set of inputs $x_v \in X_v$, called the set of virtual observation locations. With this approach the CGP is not guaranteed to satisfy the constraint on the entire domain, but a finite set of points can be found so that the constraint holds globally with sufficiently high probability.

The model presented in this paper is inspired by the research on shape constrained GPs (Wang and Berger, 2016; Golchi et al., 2015; Riihimäki and Vehtari, 2010; Lenk and Choi, 2017; Da Veiga and Marrel, 2012, 2015; Maatouk and Bay, 2017; López-Lopera et al., 2018; Abrahamsen and Benth, 2001; Lin and Dunson, 2014), and the need for a simple framework for imposing multiple such constraints. In the case where we condition on derivatives of the process, our approach is most similar to that in (Wang and Berger, 2016), where the authors make use of a similar sampling scheme for noiseless GP regression applied to computer code emulation. With the exception of (Maatouk and Bay, 2017; López-Lopera et al., 2018), most of the approaches to constrained GPs (including ours) rely on the constraint being satisfied at a specified set of virtual locations. The use of virtual constraint observations may seem ad hoc at first, as the set of virtual observation locations has to be dense enough to ensure that the constraint holds globally with sufficiently high probability. Inversion of the covariance matrix of the joint GP may therefore be of concern, both because this scales with the number of observations cubed and because there is typically high serial correlation if many virtual observations lie close together. The general solution is to restrict the virtual observation set to regions where the probability that the constraint holds is low (Riihimäki and Vehtari, 2010; Wang and Berger, 2016). According to Wang and Berger (2016), when they followed this approach in their experiments they found that only a modest number of virtual observations were typically needed, that these points were usually rather dispersed, and that the resulting serial correlation was not severe. A further benefit of the virtual observation approach is that implementation of constraints that only hold on subsets of the domain is straightforward.
For practical use of the model presented in this paper we also pay special attention to numerical implementation. The computations involving only real observations and those involving only virtual observations are separated, which is convenient when only the constraints are changed, such as in algorithms for finding a sparse set of virtual observation locations or for testing/validation of constraints. We also provide algorithms based on Cholesky factorization for stable numerical implementation, and an efficient sampling scheme for estimating the posterior process and for finding an optimal set of virtual observation locations. These algorithms are based on the derivation of the exact posterior of the constrained Gaussian process using a general linear operator, and constitute the main contribution of this paper.
The paper is structured as follows: In Section 2 we state the results needed on GP regression and GPs under linear transformations. Our main results are given in Section 3 where we introduce the constrained GP (CGP) and present the model for GP regression under linear inequality constraints. In particular, given some training data we derive the posterior distribution of the CGP evaluated at a finite set of inputs, which is a compound Gaussian with a truncated Gaussian mean (Section 3.1). Section 3.2 presents an algorithm for sampling from the posterior, and parameter estimation is addressed in Section 3.3. Section 3.4 and Section 3.5 are dedicated to optimization of the set of virtual observation locations needed to ensure that the constraint holds with sufficiently high probability. Finally, numerical examples considering monotonicity and boundedness constraints are presented in Section 4, and some concluding remarks are given in Section 5.

2 Gaussian processes and linear operators

We are interested in GP regression on functions $f : \mathbb{R}^{n_x} \to \mathbb{R}$, under the additional inequality constraint $a(x) \le \mathcal{L}f(x) \le b(x)$ for some specified functions $a, b : \mathbb{R}^{n_x} \to \mathbb{R}^{n_c}$ and the class of linear operators $\mathcal{L}$ that map functions from $\mathbb{R}^{n_x}$ to $\mathbb{R}$ into functions from $\mathbb{R}^{n_x}$ to $\mathbb{R}^{n_c}$. Here $n_x$ and $n_c$ are positive integers, and the subscripts are just used to indicate the relevant underlying space over $\mathbb{R}$. We will make use of the properties of GPs under linear transformations given below.

2.1 Gaussian process regression

We consider a Gaussian process $f(x) \sim \mathcal{GP}\big(\mu(x), K(x, x')\big)$ given as a prior over functions $f : \mathbb{R}^{n_x} \to \mathbb{R}$, which is specified by its mean and covariance function

$\mu(x) = E[f(x)], \qquad K(x, x') = E\big[(f(x) - \mu(x))(f(x') - \mu(x'))\big].$   (1)

Let $x$ denote a vector in $\mathbb{R}^{n_x}$ and $X$ the matrix of $N$ such input vectors. The distribution over the vector $\mathbf{f}$ of latent values corresponding to $X$ is then multivariate Gaussian,

$\mathbf{f} \mid X \sim \mathcal{N}\big(\mu(X),\ K(X, X)\big),$

where $K(X, X')$ denotes the Gram matrix with entries $K(X, X')_{i,j} = K(x_i, x'_j)$ for two matrices of input vectors $X$ and $X'$. Given a set of observations $Y = [y_1, \dots, y_N]^T$, and under the assumption that the relationship between the latent function values and observed output is Gaussian, $y_i = f(x_i) + \varepsilon_i$ with i.i.d. $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, the predictive distribution for new observations $X^*$ is still Gaussian with mean and covariance

$E[\mathbf{f}^* \mid Y] = \mu(X^*) + K(X^*, X)\big(K(X, X) + \sigma^2 I\big)^{-1}\big(Y - \mu(X)\big),$
$\mathrm{cov}(\mathbf{f}^* \mid Y) = K(X^*, X^*) - K(X^*, X)\big(K(X, X) + \sigma^2 I\big)^{-1} K(X, X^*).$   (2)

Here $\mathbf{f}^*$ is the predictive distribution of $f(X^*)$ and $\mathbf{f}^* \mid Y$ is the predictive posterior given the data $Y$. For further details see e.g. Rasmussen and Williams (2005).
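As an illustration of the predictive equations (2), the sketch below computes the posterior mean and covariance for a small toy data set. The squared exponential kernel, the noise level and the function names are choices made for this illustration only.

```python
import numpy as np

def rbf_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    # Squared exponential kernel evaluated on two sets of inputs (rows are points).
    sq_dist = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_predict(X, Y, X_star, sigma=1e-6, mean=0.0):
    # Predictive mean and covariance of f(X_star) | Y, cf. equation (2).
    K = rbf_kernel(X, X)
    K_s = rbf_kernel(X, X_star)
    K_ss = rbf_kernel(X_star, X_star)
    G = K + sigma**2 * np.eye(len(X))
    alpha = np.linalg.solve(G, Y - mean)
    mu_star = mean + K_s.T @ alpha
    cov_star = K_ss - K_s.T @ np.linalg.solve(G, K_s)
    return mu_star, cov_star

# Example: near-noiseless observations of a smooth function.
X = np.linspace(0, 1, 5).reshape(-1, 1)
Y = np.sin(2 * np.pi * X).ravel()
mu, cov = gp_predict(X, Y, np.array([[0.25], [0.8]]))
```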

2.2 Linear operations on Gaussian processes

Let $\mathcal{L}$ be a linear operator on realizations of $f \sim \mathcal{GP}(\mu, K)$. As GPs are closed under linear operators (Rasmussen and Williams, 2005; Papoulis and Pillai, 2002), $\mathcal{L}f$ is still a GP (we assume here that $\mathcal{L}f$ exists; for instance, if $\mathcal{L}$ involves differentiation then the process $f$ must be differentiable, see e.g. (Adler, 1981) for details on proving existence). We will assume that the operator produces functions with range in $\mathbb{R}^{n_c}$, but where the input domain is unchanged. That is, the operator produces functions from $\mathbb{R}^{n_x}$ to $\mathbb{R}^{n_c}$. This type of operators on GPs has also been considered by Särkkä (2011) with applications to stochastic partial differential equations. The mean and covariance of $\mathcal{L}f$ are given by applying $\mathcal{L}$ to the mean and covariance of the argument,

$E[\mathcal{L}f(x)] = \mathcal{L}\mu(x), \qquad \mathrm{cov}\big(\mathcal{L}f(x_1), \mathcal{L}f(x_2)\big) = \mathcal{L}_1 \mathcal{L}_2^T K(x_1, x_2),$

and the cross-covariance is given as

$\mathrm{cov}\big(\mathcal{L}f(x_1), f(x_2)\big) = \mathcal{L}_1 K(x_1, x_2), \qquad \mathrm{cov}\big(f(x_1), \mathcal{L}f(x_2)\big) = \mathcal{L}_2^T K(x_1, x_2).$

The notation $\mathcal{L}_i$ for $i = 1, 2$ here means that the operator acts on $K(x_1, x_2)$ as a function of the $i$-th parameter, keeping the other one fixed, i.e. $\mathcal{L}_1 K(x_1, x_2) = \big(\mathcal{L}K(\cdot, x_2)\big)(x_1)$ and $\mathcal{L}_2 K(x_1, x_2) = \big(\mathcal{L}K(x_1, \cdot)\big)(x_2)$. The transpose operator $\mathcal{L}^T$ is defined by $\mathcal{L}^T g = (\mathcal{L}g)^T$, so that the covariance expressions above have the appropriate dimensions when $\mathcal{L}f$ takes values in $\mathbb{R}^{n_c}$. In the following sections we make use of the predictive distribution (2) where observations correspond to the transformed GP under $\mathcal{L}$.
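As a concrete example, take $\mathcal{L} = d/dx$ in one dimension with the squared exponential kernel $K(x, x') = \sigma_f^2 \exp\big(-(x - x')^2 / 2\ell^2\big)$. The transformed (cross-)covariances above are then obtained by differentiating the kernel; the closed-form expressions in this sketch are specific to this kernel choice and are included for illustration only.

```python
import numpy as np

def rbf(x1, x2, var=1.0, ell=1.0):
    d = x1[:, None] - x2[None, :]
    return var * np.exp(-0.5 * d**2 / ell**2)

def L1_rbf(x1, x2, var=1.0, ell=1.0):
    # Operator d/dx applied to the first argument: cov(f'(x1), f(x2)).
    d = x1[:, None] - x2[None, :]
    return -(d / ell**2) * rbf(x1, x2, var, ell)

def L1L2_rbf(x1, x2, var=1.0, ell=1.0):
    # Operator applied to both arguments: cov(f'(x1), f'(x2)).
    d = x1[:, None] - x2[None, :]
    return (1.0 / ell**2 - d**2 / ell**4) * rbf(x1, x2, var, ell)

x = np.linspace(0, 1, 4)
K_ff = rbf(x, x)        # cov(f, f)
K_df = L1_rbf(x, x)     # cov(Lf, f), here Lf = f'
K_dd = L1L2_rbf(x, x)   # cov(Lf, Lf)
```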

3 Gaussian processes with linear inequality constraints

Following Section 2.1 and Section 2.2, we let $f \sim \mathcal{GP}\big(\mu(x), K(x, x')\big)$ be a GP over real valued functions on $\mathbb{R}^{n_x}$, and $\mathcal{L}$ a linear operator producing functions from $\mathbb{R}^{n_x}$ to $\mathbb{R}^{n_c}$. The matrix $X$ and the vector $Y$ will represent the $N$ noise perturbed observations: $y_i = f(x_i) + \varepsilon_i$ with i.i.d. $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ for $i = 1, \dots, N$.

We would like to model the posterior GP conditioned on the observations $(X, Y)$, and on the event that $a(x) \le \mathcal{L}f(x) \le b(x)$ for two functions $a, b : \mathbb{R}^{n_x} \to \mathbb{R}^{n_c}$ where $a(x) \le b(x)$ for all $x$. To achieve this approximately we start by assuming that the constraint only holds at a finite set of inputs $X_v$, which we refer to as virtual observation locations. Later we will consider how to specify the set of virtual observation locations such that the initial constraint defined for all $x$ holds with sufficiently high probability. Furthermore, we will also assume that virtual observations of the transformed process, $\mathcal{L}f(X_v)$, come with additive white noise with variance $\sigma_v^2$. We can write this as $\mathcal{L}f(X_v) + \varepsilon_v$, where $X_v$ is the matrix containing the virtual observation locations and $\varepsilon_v$ is a multivariate Gaussian with diagonal covariance of elements $\sigma_v^2$.

We will make use of the following notation: Let $\tilde{C}(X_v)$ be the matrix with rows $\tilde{C}(x_v^i) = \mathcal{L}f(x_v^i) + \varepsilon_v^i$ for i.i.d. $\varepsilon_v^i \sim \mathcal{N}(0, \sigma_v^2 I)$, and let $C$ denote the event $a(X_v) \le \tilde{C}(X_v) \le b(X_v)$. $C$ thus represents the event that the constraint is satisfied for all points in $X_v$, and it is defined through the latent variable $\mathcal{L}f(X_v)$.

In summary, the process we will consider is stated as

$f(X^*) \mid Y, C, \qquad Y = f(X) + \varepsilon, \qquad C : a(X_v) \le \mathcal{L}f(X_v) + \varepsilon_v \le b(X_v),$

where $f \sim \mathcal{GP}\big(\mu(x), K(x, x')\big)$ is a Gaussian process, $(X, Y)$ is the training data and $X_v$ are the locations where the transformed process $\mathcal{L}f$ is bounded. The additive noise $\varepsilon$ and $\varepsilon_v$ are multivariate Gaussian with diagonal covariance matrices of elements $\sigma^2$ and $\sigma_v^2$ respectively.

Here we assume that observations of all parts of $\mathcal{L}f$ come with i.i.d. white noise with variance $\sigma_v^2$. The reason for this is mainly numerical stability, where in computations we will choose a tiny variance to approximate noiseless observations. Similarly, $\sigma^2$ may be chosen as a fixed small number for interpolation in the standard GP regression setting. In the following derivations, the results for exact noiseless observations can be obtained by setting the relevant variance to zero. We also assume that any sub-operator of $\mathcal{L}$ is constrained at the same set of virtual locations $X_v$. This is mainly for notational convenience, and this assumption is easily relaxed (see Section 3.5).

3.1 Posterior predictive distribution

Our goal is to obtain the posterior predictive distribution $f^* \mid Y, C$. That is: the distribution of $f^* = f(X^*)$ for some new inputs $X^*$, conditioned on the observed data $Y$ and the constraint $C$.

To simplify the notation we write $p(f^* \mid Y, C)$, excluding the dependency on the inputs $X$, $X_v$ and $X^*$ (as well as any hyperparameter of the mean and covariance function). The posterior predictive distribution is given by marginalizing over the latent variable $\tilde{C} = \tilde{C}(X_v)$:

$p(f^* \mid Y, C) = \frac{1}{p(C \mid Y)} \int_{a(X_v)}^{b(X_v)} p(f^* \mid Y, \tilde{C})\, p(\tilde{C} \mid Y)\, d\tilde{C},$

where the limits correspond to the hyper-rectangle in $\mathbb{R}^{n_c \cdot N_v}$ given by the functions $a$ and $b$ evaluated at each $x_v \in X_v$, with $N_v$ the number of virtual observation locations. The predictive distribution $p(f^* \mid Y, C)$ and the probability $p(C \mid Y)$ are given in Lemma 1. $p(C \mid Y)$ is of interest in its own right, as it is the probability that the constraint holds at $X_v$ given the data $Y$.

In the remainder of the paper we will use the shortened notation $\mu = \mu(X)$, $\mu^* = \mu(X^*)$, $\mathcal{L}\mu_v = \mathcal{L}\mu(X_v)$, $a = a(X_v)$ and $b = b(X_v)$. For vectors with elements in $\mathbb{R}^{n_c}$, such as $\mathcal{L}f(X_v)$ and $\tilde{C}(X_v)$, we interpret inequalities like $a \le \tilde{C}(X_v) \le b$ elementwise.

Lemma 1

The predictive distribution $f^* \mid Y, C$ is a compound Gaussian with truncated Gaussian mean:

$f^* \mid Y, C, \tilde{C} \sim \mathcal{N}\big(\mu^* + A(\tilde{C} - \mathcal{L}\mu_v) + B(Y - \mu),\ \Sigma\big),$   (3)
$\tilde{C} \mid Y, C \sim \mathcal{TN}\big(\mathcal{L}\mu_v + A_1(Y - \mu),\ B_1,\ a,\ b\big),$   (4)

where $\mathcal{TN}(\cdot, \cdot, a, b)$ is the Gaussian conditioned on the hyper-rectangle $[a_1, b_1] \times \dots \times [a_k, b_k]$, and

$A_1 = \mathcal{L}_1 K(X_v, X)\big(K(X, X) + \sigma^2 I\big)^{-1},$
$B_1 = \mathcal{L}_1 \mathcal{L}_2^T K(X_v, X_v) + \sigma_v^2 I - A_1\, \mathcal{L}_2^T K(X, X_v),$
$A_2 = \mathcal{L}_2^T K(X^*, X_v) - K(X^*, X)\big(K(X, X) + \sigma^2 I\big)^{-1} \mathcal{L}_2^T K(X, X_v),$
$B_2 = K(X^*, X^*) - K(X^*, X)\big(K(X, X) + \sigma^2 I\big)^{-1} K(X, X^*),$
$A = A_2 B_1^{-1}, \qquad B = \big(K(X^*, X) - A\, \mathcal{L}_1 K(X_v, X)\big)\big(K(X, X) + \sigma^2 I\big)^{-1}, \qquad \Sigma = B_2 - A A_2^T.$

Moreover, the constraint probability $p(C \mid Y)$ is the probability that the unconstrained version of $\tilde{C}$ falls within the constraint region, i.e.

$p(C \mid Y) = \Pr\big(a \le \hat{C} \le b\big), \qquad \hat{C} \sim \mathcal{N}\big(\mathcal{L}\mu_v + A_1(Y - \mu),\ B_1\big),$   (5)

and the unconstrained predictive distribution is

$f^* \mid Y \sim \mathcal{N}\big(\mu^* + K(X^*, X)\big(K(X, X) + \sigma^2 I\big)^{-1}(Y - \mu),\ B_2\big).$

The derivation in Lemma 1 is based on conditioning the joint multivariate Gaussian of $f^*$, $\tilde{C}$ and $Y$, and the proof is given in Appendix A. For practical implementation, the matrix inversions involved in Lemma 1 may be prone to numerical instability. A numerically stable alternative is given in Lemma 2.
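The conditioning underlying Lemma 1 is the standard formula for a partitioned multivariate Gaussian. A minimal sketch, with generic mean vectors and covariance blocks standing in for those of $f^*$, $\tilde{C}$ and $Y$:

```python
import numpy as np

def condition_gaussian(mu1, mu2, S11, S12, S22, x2):
    # For a jointly Gaussian vector [z1, z2] with means mu1, mu2 and covariance
    # blocks S11, S12, S22, return the mean and covariance of z1 | z2 = x2.
    w = np.linalg.solve(S22, x2 - mu2)
    mu_cond = mu1 + S12 @ w
    S_cond = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return mu_cond, S_cond
```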

In the following lemma, $\mathrm{Chol}(A)$ is the lower triangular Cholesky factor of a matrix $A$. We also let $A \backslash B$ denote the solution to the linear system $AZ = B$ for matrices $A$ and $B$, which may be efficiently computed using forward or backward substitution when $A$ is triangular.

Lemma 2

Let $L = \mathrm{Chol}\big(K(X, X) + \sigma^2 I\big)$, $v_1 = L \backslash \mathcal{L}_2^T K(X, X_v)$ and $v_2 = L \backslash K(X, X^*)$.

Then the matrices in Lemma 1 can be computed as

$A_1 = (L^T \backslash v_1)^T, \qquad B_1 = \mathcal{L}_1 \mathcal{L}_2^T K(X_v, X_v) + \sigma_v^2 I - v_1^T v_1,$
$A_2 = \mathcal{L}_2^T K(X^*, X_v) - v_2^T v_1, \qquad B_2 = K(X^*, X^*) - v_2^T v_2.$

Moreover, $B_1$ is symmetric and positive definite. By letting $L_1 = \mathrm{Chol}(B_1)$ and $v_3 = L_1 \backslash A_2^T$ we also have

$A = (L_1^T \backslash v_3)^T, \qquad B = \big(L^T \backslash (v_2 - v_1 A^T)\big)^T, \qquad \Sigma = B_2 - v_3^T v_3.$

The numerical complexity of the procedures in Lemma 2 is $\mathcal{O}(n^3)$ for Cholesky factorization of an $n \times n$ matrix and $\mathcal{O}(n^2 m)$ for solving a triangular system where the unknown matrix is $n \times m$. In the derivation of Lemma 1 and Lemma 2 the order of operations was chosen such that the first Cholesky factor, $L$, only depends on the training data $(X, Y)$. This is convenient in the case where the posterior is calculated multiple times for different constraints or virtual observation locations $X_v$, but where the data remain unchanged. For such computations, where $X$ contains $N$ observations, the $\mathcal{O}(N^3)$ factorization of the data covariance only needs to be computed once.
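As a sketch of how the quantities in Lemma 2 could be computed in practice, assuming the covariance blocks for the data, the virtual locations and the prediction points (and their $\mathcal{L}$-transformed versions) have already been assembled; the argument names below are illustrative, not part of the paper's notation.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def posterior_blocks(K_XX, K_Xv, K_vv, K_Xs, K_sv, K_ss, sigma=1e-6, sigma_v=1e-6):
    """Cholesky-based computation of the conditional covariance blocks.

    K_XX: cov of f(X); K_Xv: cross-cov of f(X) and Lf(Xv); K_vv: cov of Lf(Xv);
    K_Xs: cross-cov of f(X) and f(X*); K_sv: cross-cov of f(X*) and Lf(Xv);
    K_ss: cov of f(X*).
    """
    # Factor that depends on the training data only (reusable across virtual sets).
    L = cholesky(K_XX + sigma**2 * np.eye(K_XX.shape[0]), lower=True)
    v1 = solve_triangular(L, K_Xv, lower=True)
    v2 = solve_triangular(L, K_Xs, lower=True)
    B1 = K_vv + sigma_v**2 * np.eye(K_vv.shape[0]) - v1.T @ v1   # cov(Lf(Xv)+eps_v | Y)
    A2 = K_sv - v2.T @ v1                                        # cov(f*, Lf(Xv) | Y)
    B2 = K_ss - v2.T @ v2                                        # cov(f* | Y)
    L1 = cholesky(B1, lower=True)                                # second factor, SPD
    v3 = solve_triangular(L1, A2.T, lower=True)
    A = solve_triangular(L1.T, v3, lower=False).T                # A = A2 B1^{-1}
    Sigma = B2 - v3.T @ v3                                       # cov(f* | Y, Lf(Xv))
    return L, L1, B1, A2, B2, A, Sigma
```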

3.2 Sampling from the posterior distribution

In order to sample from the posterior $f^* \mid Y, C$ we can first sample from the constraint distribution (4), and then use these samples in the mean of (3) to create the final samples of $f^* \mid Y, C$.

To generate $k$ samples of the posterior at the new input locations $X^*$, we use the following procedure.

Algorithm 3

Sampling from the posterior distribution

  1. Find a matrix $Q$ s.t. $QQ^T = \Sigma$, e.g. by Cholesky or a spectral decomposition.

  2. Generate $\tilde{C}_s$, a matrix where each of the $k$ columns is a sample of $\tilde{C} \mid Y, C$ from the distribution in (4).

  3. Generate $U$, a matrix with $k$ columns of samples from the standard normal $\mathcal{N}(0, 1)$.

  4. The matrix where each column is a sample from $f^* \mid Y, C$ is then obtained by

    $\big[\mu^* + B(Y - \mu) - A\,\mathcal{L}\mu_v\big] + A\tilde{C}_s + QU,$

    where the addition of the bracketed vector means that it is added to each column of the matrix on the right hand side.

This procedure is based on the well-known method for sampling from multivariate Gaussian distributions, where we have used the property that in the distribution (3) of $f^* \mid Y, C, \tilde{C}$, only the mean depends on the samples from the constraint distribution.

The challenging part of this procedure is the second step, where samples have to be drawn from a truncated multivariate Gaussian. The simplest approach is rejection sampling, i.e. generating samples from the corresponding unconstrained normal distribution and rejecting those that fall outside the bounds. In order to generate $k$ samples with rejection sampling, the expected number of samples needed is $k / p(C \mid Y)$, where the acceptance rate $p(C \mid Y)$ is the constraint probability given in (5). If the acceptance rate is low then rejection sampling becomes inefficient, and alternative approaches such as Gibbs sampling (Kotecha and Djuric, 1999) are typically used. In our numerical experiments (presented in Section 4) we made use of a new method based on simulation via minimax tilting developed by Botev (2017). Note also that when sampling from the posterior for a new set of input locations $X^*$, when the data and virtual observation locations are unchanged, the samples generated in step 2 may be reused.
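A minimal sketch of Algorithm 3, using plain rejection sampling in step 2; this is only viable when the acceptance rate $p(C \mid Y)$ is not too small, and in practice the minimax tilting or Gibbs samplers mentioned above would take its place. The function and argument names are illustrative.

```python
import numpy as np

def sample_truncated_gaussian(mean, cov, lower, upper, n_samples, rng):
    # Step 2: rejection sampling from N(mean, cov) restricted to [lower, upper].
    samples = []
    while len(samples) < n_samples:
        z = rng.multivariate_normal(mean, cov)
        if np.all(z >= lower) and np.all(z <= upper):
            samples.append(z)
    return np.column_stack(samples)            # one sample per column

def sample_posterior(mu_star, A, B, Y_centered, Lmu_v, Sigma,
                     c_mean, c_cov, lower, upper, n_samples, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: factor Q with Q Q^T = Sigma (spectral decomposition is robust to
    # a numerically near-singular Sigma).
    w, V = np.linalg.eigh(Sigma)
    Q = V @ np.diag(np.sqrt(np.clip(w, 0.0, None)))
    # Step 2: samples of the truncated Gaussian constraint distribution (4).
    C_s = sample_truncated_gaussian(c_mean, c_cov, lower, upper, n_samples, rng)
    # Step 3: standard normal samples.
    U = rng.standard_normal((len(mu_star), n_samples))
    # Step 4: combine; the fixed vector is added to each column.
    fixed = mu_star + B @ Y_centered - A @ Lmu_v
    return fixed[:, None] + A @ C_s + Q @ U
```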

3.3 Parameter estimation

To estimate the parameters of the CGP we make use of the marginal maximum likelihood approach (MLE). We define the marginal likelihood function of the CGP as

$L(\theta) = p(Y, C \mid \theta) = p(Y \mid \theta)\, p(C \mid Y, \theta),$   (6)

i.e. as the probability of the data and constraint combined, given the set of parameters represented by $\theta$. We assume that both the mean and covariance function of the GP prior (1) and the noise variance $\sigma^2$ may depend on $\theta$.

The log-likelihood, $\log L(\theta) = \log p(Y \mid \theta) + \log p(C \mid Y, \theta)$, is thus given as the sum of the unconstrained log-likelihood, $\log p(Y \mid \theta)$, which is optimized in unconstrained MLE, and $\log p(C \mid Y, \theta)$, where $p(C \mid Y, \theta)$ is the probability that the constraint holds at $X_v$, given in (5). When the constraint probability is computed using Lemma 2, the unconstrained part of the likelihood is also given by

$\log p(Y \mid \theta) = -\tfrac{1}{2} \lVert L \backslash (Y - \mu) \rVert^2 - \sum_{i} \log L_{ii} - \tfrac{N}{2}\log 2\pi,$   (7)

for $L = \mathrm{Chol}\big(K(X, X) + \sigma^2 I\big)$ as in Lemma 2 (see e.g. Rasmussen and Williams, 2005). It is often reasonable to assume that the maximum marginal likelihood estimate of $\theta$, $\hat{\theta}$, should be close to the unconstrained estimate given by maximizing (7), in which case using the unconstrained estimate as an initial guess in numerical optimization should be an efficient strategy. For numerical optimization of (6) it might also be beneficial to make use of the gradient of the marginal log-likelihood. This involves differentiation of $\log p(C \mid Y, \theta)$, which could be done by making use of the first two moments of a truncated multivariate Gaussian (Lee and Scott, 2012), and the derivatives of the covariance function w.r.t. $\theta$.
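For reference, a sketch of evaluating the unconstrained log marginal likelihood (7) via the Cholesky factor of $K(X, X) + \sigma^2 I$; in the constrained setting this term would be combined with an estimate of $\log p(C \mid Y, \theta)$ from (5) to form the objective (6).

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def log_marginal_likelihood_unconstrained(Y, mu, K_XX, sigma):
    # Unconstrained Gaussian log marginal likelihood, cf. equation (7),
    # computed via the Cholesky factor L of K(X, X) + sigma^2 I.
    N = len(Y)
    L = cholesky(K_XX + sigma**2 * np.eye(N), lower=True)
    alpha = solve_triangular(L, Y - mu, lower=True)
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * N * np.log(2 * np.pi))
```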

In (Bachoc et al., 2018) the authors study the asymptotic distribution of the MLE for shape-constrained GPs, and show that for large sample sizes the effect of including the constraint in the MLE is negligible. But for small or moderate sample sizes the constrained MLE is generally more accurate so taking the constraint into account is beneficial. In the numerical experiments presented in this paper we make use of the unconstrained MLE as an initial guess, and to find the set of virtual observation locations needed to impose the constraint with high probability (see Section 3.4). When the set of virtual locations has been determined the MLE can be updated where the constraint probability is included.

3.4 Finding the virtual observations

For the constraint to be satisfied locally at any input location in some bounded set $\Omega \subset \mathbb{R}^{n_x}$ with sufficiently high probability, the set of virtual observation locations has to be sufficiently dense. We will specify a target probability $p_{\text{target}}$ and find a set $X_v$ such that, when the constraint is satisfied at all virtual locations in $X_v$, the probability that the constraint is satisfied at any $x \in \Omega$ is at least $p_{\text{target}}$. The number of virtual observations needed depends on the smoothness properties of the kernel, and for a given kernel it is of interest to find a set that is effective in terms of numerical computation. As we need to sample from a truncated Gaussian involving cross-covariances between all elements in $X_v$, we would like the set to be small, and we would like to avoid points in $X_v$ close together that could lead to high serial correlation.

Seeking an optimal set of virtual observations has also been discussed in (Wang and Berger, 2016; Golchi et al., 2015; Riihimäki and Vehtari, 2010; Da Veiga and Marrel, 2012, 2015) and the intuitive idea is to iteratively place virtual observations where the probability that the constraint holds is low. The general approach presented in this section is most similar to that in (Wang and Berger, 2016). In Section 3.5 we extend this to derive a more efficient algorithm for multiple constraints.

In order to estimate the probability that the constraint holds at some new location $x$, we first derive the posterior distribution of the constraint process $\mathcal{L}f(x)$.

Lemma 4

The predictive distribution of the constraint process $\mathcal{L}f(x)$ for some new input $x$, conditioned on the data $Y$, is given by

$\mathcal{L}f(x) \mid Y \sim \mathcal{N}\big(\mathcal{L}\mu(x) + \bar{A}_1(Y - \mu),\ \bar{B}_1\big),$   (8)

and when $\mathcal{L}f(x)$ is conditioned on both the data and the virtual constraint observations, $Y$ and $\tilde{C}$, the posterior becomes

$\mathcal{L}f(x) \mid Y, \tilde{C} \sim \mathcal{N}\big(\mathcal{L}\mu(x) + \bar{A}(\tilde{C} - \mathcal{L}\mu_v) + \bar{B}(Y - \mu),\ \bar{\Sigma}\big).$   (9)

Here $\bar{A}_1$, $\bar{B}_1$, $\bar{A}$, $\bar{B}$ and $\bar{\Sigma}$ are defined as in Lemma 2, with the prediction inputs $X^*$ replaced by the single input $x$ and with the covariances involving $X^*$ replaced by their $\mathcal{L}$-transformed counterparts, i.e. $K(X^*, X)$ and $\mathcal{L}_2^T K(X^*, X_v)$ replaced by $\mathcal{L}_1 K(x, X)$ and $\mathcal{L}_1 \mathcal{L}_2^T K(x, X_v)$; $\tilde{C}$ is the truncated Gaussian distribution in (4).

The proof is given in Appendix C. Using the posterior distribution of $\mathcal{L}f(x)$ in Lemma 4 we define the constraint probability as

$p_c(x) = E\big[\mathbb{1}_{x, \xi}\big(\mathcal{L}f(x)\big) \mid Y, C\big],$   (10)

where $\mathbb{1}_{x, \xi}(z) = 1$ for $a(x) - \xi \le z \le b(x) + \xi$ and $0$ otherwise. The quantity $\xi$ is a non-negative fixed number that is included to ensure that it will be possible to increase $p_c(x)$ using observations with additive noise. When we use virtual observations that come with noise $\mathcal{N}(0, \sigma_v^2)$, $\xi$ can be chosen based on $\sigma_v$ and the normal cumulative distribution function $\Phi$. Note that $\sigma_v$, and in this case $\xi$, will be small numbers included mainly for numerical stability. In the numerical examples presented in this paper this noise variance was set to a small fixed value.

In the case where $X_v = \emptyset$, computation of (10) is straightforward as $\mathcal{L}f(x) \mid Y$ is Gaussian. Otherwise we will rely on the following estimate of $p_c(x)$:

$\hat{p}_c(x) = \frac{1}{m}\sum_{j=1}^{m} \Pr\big(a(x) - \xi \le \mathcal{L}f(x) \le b(x) + \xi \mid Y, \tilde{C} = c_j\big),$   (11)

where $c_1, \dots, c_m$ are samples of $\tilde{C} \mid Y, C$ given in (4).
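A sketch of the Monte Carlo estimate (11) for a scalar constraint ($n_c = 1$), where the posterior mean function and standard deviation of $\mathcal{L}f(x)$ from Lemma 4 are assumed to be supplied by the caller; the interface and names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def constraint_probability(mean_fn, sd, a_x, b_x, C_samples, xi=0.0):
    # mean_fn(c): posterior mean of Lf(x) given Y and virtual observations c, cf. (9);
    # sd: the corresponding posterior standard deviation (does not depend on c);
    # C_samples: samples of the truncated Gaussian (4), one per column.
    probs = [norm.cdf(b_x + xi, mean_fn(C_samples[:, j]), sd)
             - norm.cdf(a_x - xi, mean_fn(C_samples[:, j]), sd)
             for j in range(C_samples.shape[1])]
    return float(np.mean(probs))
```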

We outline an algorithm for finding a set of virtual observation locations $X_v$, such that the probability that the constraint holds locally at any $x \in \Omega$ is at least $p_{\text{target}}$, for some specified set $\Omega$ and target probability $p_{\text{target}}$. That is, $p_c(x) \ge p_{\text{target}}$ for all $x \in \Omega$. The algorithm can be used starting with no initial virtual observation locations, $X_v = \emptyset$, or using some pre-defined set $X_v$. The latter may be useful e.g. if the data is updated, in which case only a few additions to the previous set may be needed.

Algorithm 5

Finding locations of virtual observations $X_v$ s.t. $p_c(x) \ge p_{\text{target}}$ for all $x \in \Omega$.

  1. Compute the Cholesky factor $L = \mathrm{Chol}\big(K(X, X) + \sigma^2 I\big)$, which depends on the training data only.

  2. Until convergence do:

    (a) If $X_v \ne \emptyset$, compute $B_1$ and $L_1 = \mathrm{Chol}(B_1)$ as defined in Lemma 2, and generate samples of $\tilde{C} \mid Y, C$ from the distribution in (4).

    (b) If $X_v = \emptyset$, compute $(x^*, p^*) = \big(\arg\min_{x \in \Omega} p_c(x),\ \min_{x \in \Omega} p_c(x)\big)$ using (10). Otherwise compute $(x^*, p^*)$ with $\hat{p}_c(x)$ defined as in (11), using the samples generated in step (a).

    (c) Terminate if $p^* \ge p_{\text{target}}$, otherwise update $X_v \leftarrow X_v \cup \{x^*\}$.

The rate of convergence of Algorithm 5 depends on the probability that the constraint holds initially, before any virtual observations are included, and for practical application one may monitor $p^*$ as a function of the number of virtual observation locations, $|X_v|$, to find an appropriate stopping criterion.
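A sketch of the greedy loop in Algorithm 5, with the search over $\Omega$ replaced by a finite candidate grid and the constraint probability supplied as a callback; both simplifications are assumptions made for this illustration.

```python
import numpy as np

def find_virtual_points(candidates, p_c, p_target, max_points=50):
    """Greedy selection of virtual observation locations (cf. Algorithm 5).

    candidates: array of candidate locations in Omega (rows are points);
    p_c(x, Xv): estimated probability that the constraint holds at x given the
                current virtual set Xv, e.g. (10) or the estimate (11).
    """
    Xv = []                                   # start with no virtual locations
    for _ in range(max_points):
        probs = np.array([p_c(x, Xv) for x in candidates])
        i = int(np.argmin(probs))             # location where the constraint is least likely
        if probs[i] >= p_target:              # constraint satisfied everywhere on the grid
            break
        Xv.append(candidates[i])              # add a virtual observation there
    return np.array(Xv)
```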

3.5 Separating virtual observation locations for sub-operators

Let $\mathcal{L} = [\mathcal{L}^1, \dots, \mathcal{L}^{n_c}]^T$ be the linear operator where each $\mathcal{L}^i$ is a linear operator leaving both the domain and range of its argument unchanged, i.e. $\mathcal{L}^i$ produces functions from $\mathbb{R}^{n_x}$ to $\mathbb{R}$, subjected to an interval constraint $a_i(x) \le \mathcal{L}^i f(x) \le b_i(x)$. Until now we have assumed that the constraint holds at a common set of virtual observation locations $X_v$, which means that $a_i(x_v) \le \mathcal{L}^i f(x_v) + \varepsilon_{v, i} \le b_i(x_v)$ for all $x_v \in X_v$ and all $i = 1, \dots, n_c$.

However, it might not be necessary to constrain each of the sub-operators at the same points $X_v$. Intuitively, constraints with respect to $\mathcal{L}^i$ need only be imposed at locations where the probability that this part of the constraint is violated is not negligibly small. To accommodate this we let $X_v$ be the concatenation of the matrices $X_v^1, \dots, X_v^{n_c}$, and require only that the $i$-th constraint holds at the locations in $X_v^i$. This is equivalent to removing some of the rows in $\mathcal{L}f(X_v)$, and all of the results in this paper still apply.

In particular, the matrices needed to make use of Lemma 1 and Lemma 2 are $\mathcal{L}\mu(X_v)$, $\mathcal{L}_1 K(X_v, X)$, $\mathcal{L}_2^T K(X, X_v)$, $\mathcal{L}_2^T K(X^*, X_v)$ and $\mathcal{L}_1 \mathcal{L}_2^T K(X_v, X_v)$. Using that $\mathcal{L}f = [\mathcal{L}^1 f, \dots, \mathcal{L}^{n_c} f]^T$, these are given by stacking the blocks corresponding to each sub-operator evaluated at its own virtual locations, e.g.

$\mathcal{L}\mu(X_v) = \big[\mathcal{L}^1\mu(X_v^1), \dots, \mathcal{L}^{n_c}\mu(X_v^{n_c})\big]^T, \qquad \mathcal{L}_1 K(X_v, X) = \big[\mathcal{L}_1^1 K(X_v^1, X), \dots, \mathcal{L}_1^{n_c} K(X_v^{n_c}, X)\big]^T,$

where also $\mathcal{L}_2^T K(X^*, X_v)$ is given by the above equation with $X^*$ in place of $X$ (transposed accordingly). Finally, $\mathcal{L}_1 \mathcal{L}_2^T K(X_v, X_v)$ is the block matrix with blocks

$\big[\mathcal{L}_1 \mathcal{L}_2^T K(X_v, X_v)\big]_{i, j} = \mathcal{L}_1^i \mathcal{L}_2^{j, T} K(X_v^i, X_v^j), \qquad i, j = 1, \dots, n_c.$

In this setting we can also improve the algorithm in Section 3.4 for finding the set of virtual observation locations by considering each sub-operator individually. To do this we will make use of the partial constraint probabilities defined as

$\hat{p}_c^{\,i}(x) = \frac{1}{m}\sum_{j=1}^{m} \Pr\big(a_i(x) - \xi \le Z_{i, j} \le b_i(x) + \xi\big),$   (12)

where $Z_{i, j}$ is the univariate Normal distribution given by the $i$-th row of (9) with $\tilde{C} = c_j$, and $c_1, \dots, c_m$ are samples of $\tilde{C}$ given in (4) as before. Similarly, $p_c^{\,i}(x)$ can be defined as in (10) by considering only the $i$-th sub-operator. In the case where $n_c = 1$ this means that $\hat{p}_c^{\,1}(x) = \hat{p}_c(x)$.
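A sketch of evaluating the partial constraint probabilities (12) at a single input, assuming the per-sub-operator posterior means (one per sample of (4)) and standard deviations from Lemma 4 have already been computed; the array layout is an illustrative choice.

```python
import numpy as np
from scipy.stats import norm

def partial_constraint_probabilities(means, sds, a_x, b_x, xi=0.0):
    """Per-sub-operator constraint probabilities at a single input x, cf. (12).

    means: (n_c, m) array; row i holds the posterior mean of L^i f(x) under each
           of the m samples of the truncated Gaussian (4);
    sds:   length-n_c array of the corresponding posterior standard deviations;
    a_x, b_x: length-n_c arrays with the bounds a_i(x), b_i(x).
    """
    n_c = means.shape[0]
    p = np.empty(n_c)
    for i in range(n_c):
        p[i] = np.mean(norm.cdf(b_x[i] + xi, means[i], sds[i])
                       - norm.cdf(a_x[i] - xi, means[i], sds[i]))
    return p
```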

For the individual sub-operators $\mathcal{L}^i$, the sets of virtual observation locations $X_v^i$ needed to ensure that $p_c^{\,i}(x) \ge p_{\text{target}}$ for all $x \in \Omega$ can then be found using the following algorithm.

Algorithm 6

Finding locations of virtual observations $X_v^1, \dots, X_v^{n_c}$ s.t. $p_c^{\,i}(x) \ge p_{\text{target}}$ for all $x \in \Omega$ and all sub-operators $\mathcal{L}^i$.

  1. Compute the Cholesky factor $L = \mathrm{Chol}\big(K(X, X) + \sigma^2 I\big)$, which depends on the training data only.

  2. Until convergence do:

    (a) If $X_v \ne \emptyset$, compute $B_1$ and $L_1 = \mathrm{Chol}(B_1)$ as defined in Lemma 2, and generate samples of $\tilde{C} \mid Y, C$ from the distribution in (4).

    (b) If $X_v = \emptyset$, compute $(x_i^*, p_i^*) = \big(\arg\min_{x \in \Omega} p_c^{\,i}(x),\ \min_{x \in \Omega} p_c^{\,i}(x)\big)$ for all $i$. Otherwise compute $(x_i^*, p_i^*)$ for all $i$ with $\hat{p}_c^{\,i}(x)$ defined as in (12), using the samples generated in step (a).

    (c) Let $(x^*, i^*)$ correspond to the smallest probability: $i^* = \arg\min_i p_i^*$ and $x^* = x_{i^*}^*$.

    (d) Terminate if $p_{i^*}^* \ge p_{\text{target}}$, otherwise update $X_v^{i^*} \leftarrow X_v^{i^*} \cup \{x^*\}$.

Using Algorithm 6 instead of Algorithm 5 will provide a more economical set of virtual observation locations. However, if one were to impose conflicting constraints there is no guarantee of convergence, in the sense that adding virtual locations for one constraint may reduce the probability for another. For practical applications it might be wise to monitor the smallest partial constraint probability as a function of the points added, as this will reveal conflicting constraints or constraints that do not agree with the training data. Moreover, if convergence is slow (i.e. many virtual locations are needed) this might suggest that the prior GP is not suitable.

3.6 Prediction using the posterior distribution

For the unconstrained GP in this paper, where the likelihood is given by Gaussian white noise, the posterior mean and covariance are sufficient to describe predictions as the posterior remains Gaussian. It is also known that in this case there is a correspondence between the posterior mean of the GP and the optimal estimator in the Reproducing Kernel Hilbert Space (RKHS) associated with the GP (Kimeldorf and Wahba, 1970). Interestingly, a similar correspondence holds for the constrained case. Maatouk et al. (2016) show that for constrained interpolation, the Maximum A Posteriori (MAP) estimate, or mode of the posterior, is the optimal constrained interpolation function in the RKHS. This holds when the GP is constrained to a convex set of functions, which is the case in this paper where we condition on linear transformations of a function restricted to a convex set.

4 Gaussian process modelling with boundedness and monotonicity constraints

In this section we present some examples related to function estimation where we assume that the function and some of its partial derivatives are bounded. This is also the scenario considered in the literature on shape constrained GPs: One approach by Golchi et al. (2015), based on a similar idea as Riihimäki and Vehtari (2010), makes use of a probit link to represent interval observations of function derivatives. Here Golchi et al. (2015) estimate the exact posterior from sampling, whereas Riihimäki and Vehtari (2010) make use of expectation propagation for approximate inference. A different approach is taken by Da Veiga and Marrel (2012, 2015), where only truncated moments of the constrained process are approximated. Maatouk and Bay (2017) and López-Lopera et al. (2018) on the other hand set out to model a conditional process where the constraints hold in the entire domain, as opposed to at a finite set of virtual locations. They achieve this by making use of finite-dimensional approximations of the GP that converge uniformly pathwise. Some other alternatives can also be found in (Lenk and Choi, 2017; Abrahamsen and Benth, 2001; Lin and Dunson, 2014). In the setting where we condition on the derivative process our approach is most similar to that in (Wang and Berger, 2016), where the authors make use of a similar sampling scheme for noiseless GP regression applied to computer code emulation.

In this section we will make use of the following constraints:

$a_0(x) \le f(x) \le b_0(x) \quad \text{and} \quad a_i(x) \le \frac{\partial f}{\partial x_i}(x) \le b_i(x), \quad i = 1, \dots, k,$

for all $x$ in some bounded subset $\Omega$ of $\mathbb{R}^{n_x}$. Without loss of generality we assume that the constraints on partial derivatives are with respect to the first $k$ components of $x$, i.e. $i = 1, \dots, k$ for some $k \le n_x$.

As the prior GP we will assume a constant mean $\mu(x) = \mu$ and the squared exponential covariance function, also called the radial basis function (RBF) kernel, given by

$K(x, x') = \sigma_K^2 \exp\left(-\frac{1}{2}\sum_{i=1}^{n_x} \frac{(x_i - x_i')^2}{\ell_i^2}\right).$
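For the constraints in this section the operator can be taken as $\mathcal{L}f = [f, \partial f/\partial x_1, \dots, \partial f/\partial x_k]^T$, so the transformed covariances of Section 2.2 require first and mixed second derivatives of this kernel. The sketch below extends the one-dimensional example of Section 2.2 to the anisotropic RBF kernel with separate lengthscales; the expressions follow from differentiating the kernel and are included for illustration only.

```python
import numpy as np

def rbf(X1, X2, var=1.0, ls=None):
    # Anisotropic squared exponential kernel; rows of X1, X2 are inputs.
    ls = np.ones(X1.shape[1]) if ls is None else np.asarray(ls)
    D = (X1[:, None, :] - X2[None, :, :]) / ls
    return var * np.exp(-0.5 * np.sum(D**2, axis=-1))

def d1_rbf(X1, X2, i, var=1.0, ls=None):
    # cov(df/dx_i (X1), f(X2)): derivative of the kernel w.r.t. the i-th
    # coordinate of the first argument.
    ls = np.ones(X1.shape[1]) if ls is None else np.asarray(ls)
    d_i = X1[:, None, i] - X2[None, :, i]
    return -(d_i / ls[i]**2) * rbf(X1, X2, var, ls)

def d1d2_rbf(X1, X2, i, j, var=1.0, ls=None):
    # cov(df/dx_i (X1), df/dx_j (X2)): mixed derivative w.r.t. the i-th coordinate
    # of the first argument and the j-th coordinate of the second.
    ls = np.ones(X1.shape[1]) if ls is None else np.asarray(ls)
    d_i = X1[:, None, i] - X2[None, :, i]
    d_j = X1[:, None, j] - X2[None, :, j]
    K = rbf(X1, X2, var, ls)
    delta = 1.0 / ls[i]**2 if i == j else 0.0
    return (delta - d_i * d_j / (ls[i]**2 * ls[j]**2)) * K
```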