Gaussian Processes with Linear Operator Inequality Constraints
This paper presents an approach for constrained Gaussian Process (GP) regression where we assume that a set of linear transformations of the process is bounded. It is motivated by machine learning applications for high-consequence engineering systems, where this kind of information is often available from phenomenological knowledge. We consider a GP f over functions on R^d taking values in R, where the transformed process Lf is still Gaussian when L is a linear operator. Our goal is to model f under the constraint that realizations of Lf are confined to a convex set of functions. In particular, we require that a(x) ≤ Lf(x) ≤ b(x), given two functions a and b where a ≤ b pointwise. This formulation provides a consistent way of encoding multiple linear constraints, such as shape constraints based on e.g. boundedness, monotonicity or convexity. We adopt the approach of using a sufficiently dense set of virtual observation locations where the constraint is required to hold, and derive the exact posterior for a conjugate likelihood. The results needed for stable numerical implementation are derived, together with an efficient sampling scheme for estimating the posterior process.
Journal of Machine Learning Research 20 (2019) 1-LABEL:LastPage. Submitted 1/19; Revised 7/19; Published 8/19. Paper 19-065. Christian Agrell. Running head: GPs with Linear Inequality Constraints.
Keywords: Gaussian processes, linear constraints, virtual observations, uncertainty quantification, computer code emulation
Gaussian Processes (GPs) are a flexible tool for Bayesian nonparametric function estimation, and are widely used in applications that require inference on functions, such as regression and classification. A useful property of GPs is that they automatically produce estimates of prediction uncertainty, and it is often possible to encode prior knowledge in a principled manner in the modelling of the prior covariance. Some early well-known applications of GPs are within spatial statistics, e.g. meteorology (Thompson, 1956), and geostatistics (Matheron, 1973), where the method is known as kriging. More recently, GPs have become a popular choice within probabilistic machine learning (Rasmussen and Williams, 2005; Ghahramani, 2015). Since GPs can act as interpolators when observations are noiseless, they have also become the main approach for uncertainty quantification and analysis involving computer experiments (Sacks et al., 1989; Kennedy and O'Hagan, 2001).
Often, the modeler performing function estimation has prior knowledge, or at least hypotheses, on some properties of the function to be estimated. This is typically related to the function shape with respect to some of the input parameters, such as boundedness, monotonicity or convexity. Various methods have been proposed for imposing these types of constraints on GPs (see Section 4.1 for a short review). For engineering and physics based applications, constraints based on integral operators and partial differential equations are also relevant (Jidling et al., 2017; Särkkä, 2011). What the above constraints have in common is that they are linear operators, and so any combination of such constraints can be written as a single linear operator. For instance, constraints of the form L_1 f ≥ 0, L_2 f ≥ 0 and L_3 f ≥ 0, for some function (or distribution over functions) f, can be written as Lf ≥ 0, with L being the combined linear operator (L_1, L_2, L_3)^T.
The motivation for including constraints is usually to improve predictions and to obtain a reduced and more realistic estimate of the uncertainty, the latter having significant impact for risk-based applications. For many real-world systems, information related to constraints in this form is often available from phenomenological knowledge. For engineering systems, this is typically knowledge related to some underlying physical phenomenon. Being able to make use of these constraints in probabilistic modelling is particularly relevant for high-consequence applications, where obtaining realistic uncertainty estimates in subsets of the domain where data is scarce is a challenge. Furthermore, information on whether these types of constraints are likely to hold given a set of observations is also useful for explainability and model falsification. For a broader discussion see (Agrell et al., 2018; Eldevik et al., 2018).
In this paper, we present a model for estimating a function f by a constrained GP (CGP). Here the data Y is a set of observations of f, possibly including additive white noise, and f is a GP with mean and covariance function chosen such that the existence of Lf is ensured. Due to the linearity of L, the transformed process Lf remains Gaussian, and our approach is based on modelling f under the constraint a(x) ≤ Lf(x) ≤ b(x). To model this constraint for all inputs x, we take the approach of using a finite set of input locations where the constraint is required to hold. That is, we require the constraint only at a finite set of inputs X_v called the set of virtual observation locations. With this approach the CGP is not guaranteed to satisfy the constraint on the entire domain, but a finite set of points can be found so that the constraint holds globally with sufficiently high probability.
The model presented in this paper is inspired by the research on shape-constrained GPs, in particular (Wang and Berger, 2016; Da Veiga and Marrel, 2012, 2015; Riihimäki and Vehtari, 2010; Golchi et al., 2015; Maatouk and Bay, 2017; López-Lopera et al., 2018). We refer to Section 4 for further discussion of these alternatives. In the case of noiseless observations, our approach is most similar to that of Wang and Berger (2016), where the authors make use of a similar sampling scheme for noiseless GP regression applied to computer code emulation. Many of the approaches to constrained GPs, including ours, rely on the constraint being satisfied at a specified set of virtual locations. The use of virtual constraint observations may seem ad hoc at first, as the set of virtual observation locations has to be dense enough to ensure that the constraint holds globally with sufficiently high probability. Inversion of the covariance matrix of the joint GP may therefore be of concern, both because this scales with the number of observations cubed and because there is typically high serial correlation if there are many virtual observations close together. The general solution is to restrict the virtual observation set to regions where the probability that the constraint holds is low (Riihimäki and Vehtari, 2010; Wang and Berger, 2016). According to Wang and Berger (2016), when they followed this approach in their experiments, they found that only a modest number of virtual observations was typically needed, that these points were usually rather disperse, and that the resulting serial correlation was not severe. We draw the same conclusion in our experiments. One benefit of the virtual observation approach is that the implementation of constraints that hold only on subsets of the domain is straightforward.
For practical use of the model presented in this paper, we also pay special attention to numerical implementation. The computations involving only real observations and those involving only virtual observations are separated, which is convenient when only the constraints change, such as in algorithms for finding a sparse set of virtual observation locations or for testing and validation of constraints. We also provide algorithms based on Cholesky factorization for stable numerical implementation, and an efficient sampling scheme for estimating the posterior process. These algorithms are based on the derivation of the exact posterior of the constrained Gaussian process using a general linear operator, and they constitute the main contribution of this paper.
The paper is structured as follows: In Section 2 we state the results needed on GP regression and GPs under linear transformations. Our main results are given in Section 3, where we introduce the constrained GP (CGP) and present the model for GP regression under linear inequality constraints. In particular, given some training data, we derive the posterior predictive distribution of the CGP evaluated at a finite set of inputs, which is a compound Gaussian with a truncated Gaussian mean (Section 3.1). Section 3.2 presents an algorithm for sampling from the posterior, and parameter estimation is addressed in Section 3.3. Section 3.4 and Section 3.5 are dedicated to optimization of the set of virtual observation locations needed to ensure that the constraint holds with sufficiently high probability. Some relevant alternative approaches from the literature on GPs under linear constraints are discussed in Section 4, followed by numerical examples considering monotonicity and boundedness constraints. A Python implementation is available at https://github.com/cagrell/gp_constr, together with the code used for the examples. We end with some concluding remarks in Section 5.
2 Gaussian Processes and Linear Operators
We are interested in GP regression on functions f : R^d → R under the additional inequality constraint a(x) ≤ (Lf)(x) ≤ b(x), for some specified functions a and b, and a class of linear operators L mapping such functions to functions from R^d to R^m. Here d and m are positive integers, and subscripts will be used to indicate the relevant underlying space over R where needed. We will make use of the properties of GPs under linear transformations given below.
2.1 Gaussian Process Regression
We consider a Gaussian process f given as a prior over functions f : R^d → R, which is specified by its mean and covariance function:

    f(x) ~ GP(μ(x), K(x, x')),
    μ(x) = E[f(x)],   K(x, x') = E[(f(x) − μ(x))(f(x') − μ(x'))].    (1)
Let x denote a vector in R^d and X the matrix of n such input vectors. The distribution over the vector f of latent values corresponding to X is then multivariate Gaussian with

    f | X ~ N(μ(X), K(X, X)),
where K(X, X') denotes the Gram matrix, with entries K(x_i, x'_j), for two matrices of input vectors X and X'. Given a set of observations Y, and under the assumption that the relationship between the latent function values and observed output is Gaussian, Y | f ~ N(f, σ² I_n), the predictive distribution for new observations at inputs X* is still Gaussian with mean and covariance

    E[f* | X, Y, X*] = μ(X*) + K(X*, X)(K(X, X) + σ² I_n)⁻¹(Y − μ(X)),
    cov(f* | X, Y, X*) = K(X*, X*) − K(X*, X)(K(X, X) + σ² I_n)⁻¹ K(X*, X)^T.    (2)
Here f* denotes the latent values at X*, and (2) is the predictive posterior given the data {X, Y}. For further details see e.g. Rasmussen and Williams (2005).
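As a concrete illustration of the standard predictive equations, the following is a minimal sketch (not the paper's implementation; the RBF kernel, noise level and all names are our own choices):

```python
import numpy as np

def rbf(X1, X2, ell=1.0, var=1.0):
    """Squared-exponential kernel Gram matrix K(X1, X2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X, y, Xs, noise=1e-6, ell=1.0, var=1.0):
    """Posterior mean and covariance of a zero-mean GP at test inputs Xs,
    following the standard predictive equations."""
    K = rbf(X, X, ell, var) + noise * np.eye(len(X))
    Ks = rbf(Xs, X, ell, var)
    mean = Ks @ np.linalg.solve(K, y)
    cov = rbf(Xs, Xs, ell, var) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 0.0])
mean, cov = gp_posterior(X, y, X, noise=1e-8)
# with near-noiseless observations the posterior mean interpolates the data
```

With a tiny noise variance the posterior mean reproduces the training outputs, which is the interpolation property mentioned in the introduction.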
2.2 Linear Operations on Gaussian Processes
Let L be a linear operator acting on realizations of f ~ GP(μ, K). As GPs are closed under linear operators (Rasmussen and Williams, 2005; Papoulis and Pillai, 2002), Lf is still a GP.¹ We will assume that the operator produces functions with range in R^m, but where the input domain is unchanged. That is, the operator produces functions from R^d to R^m. This type of operator on GPs has also been considered by Särkkä (2011) with applications to stochastic partial differential equations. The mean and covariance of Lf are given by applying L to the mean and covariance of the argument:

    E[Lf(x)] = Lμ(x),   cov(Lf(x), Lf(x')) = L_x L_{x'} K(x, x'),    (3)

¹ We assume here that Lf exists. For instance, if L involves differentiation, then the process must be differentiable. See e.g. (Adler, 1981) for details on proving existence.
and the cross-covariance is given as

    cov(f(x), Lf(x')) = L_{x'} K(x, x').    (4)
The notation L_x and L_{x'} is used to indicate when the operator acts on K(x, x') as a function of x and of x', respectively. That is, L_x K(x, x') = (L K(·, x'))(x) and L_{x'} K(x, x') = (L K(x, ·))(x'). With the transpose operator, the cross-covariance in the reverse order follows from (4). In the following sections we make use of the predictive distribution (2), where observations correspond to the transformed GP under L.
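To illustrate how a differential operator acts on a covariance function, the sketch below computes the derivative-based covariances for a one-dimensional RBF kernel, i.e. L = d/dx. The closed-form expressions are standard; the function names are ours, and the analytic derivatives are checked against finite differences:

```python
import numpy as np

ell = 1.0  # kernel lengthscale

def k(x, xp):
    """RBF kernel for scalar inputs."""
    return np.exp(-0.5 * (x - xp) ** 2 / ell ** 2)

def dk_dxp(x, xp):
    """Cross-covariance cov(f(x), f'(xp)) = dK/dxp, cf. (4)."""
    return (x - xp) / ell ** 2 * k(x, xp)

def d2k(x, xp):
    """cov(f'(x), f'(xp)) = d2K/(dx dxp), cf. (3)."""
    return (1.0 / ell ** 2 - (x - xp) ** 2 / ell ** 4) * k(x, xp)

# sanity check: analytic derivative vs. central finite difference
h = 1e-5
x, xp = 0.3, 1.1
fd = (k(x, xp + h) - k(x, xp - h)) / (2 * h)
assert abs(fd - dk_dxp(x, xp)) < 1e-6
```

These are exactly the cross-covariance blocks needed when the operator encodes a monotonicity constraint on a one-dimensional input.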
3 Gaussian Processes with Linear Inequality Constraints
Following Section 2.1 and Section 2.2, we let f ~ GP(μ, K) be a GP over real-valued functions on R^d, and L a linear operator producing functions from R^d to R^m. The matrix X of n input vectors and the vector Y of n outputs will represent noise-perturbed observations: y_i = f(x_i) + ε_i, with ε_i i.i.d. N(0, σ²) for i = 1, …, n.
We would like to model the posterior GP conditioned on the observations Y, and on the event that a(x) ≤ Lf(x) ≤ b(x) for two functions a and b, where a_i(x) ≤ b_i(x) for all x and i = 1, …, m. To achieve this approximately, we start by assuming that the constraint only holds at a finite set of inputs X_v that we refer to as virtual observation locations. Later, we will consider how to specify the set of virtual observation locations such that the constraint holds for any x with sufficiently high probability. Furthermore, we will also assume that virtual observations of the transformed process, Lf(X_v), come with additive white noise with variance σ_v². We can write this as C = Lf(X_v) + ε_v, where X_v is the matrix containing the virtual observation locations and ε_v is multivariate Gaussian with diagonal covariance of elements σ_v².
We will make use of the following notation: let C be the matrix with rows Lf(x_i) + ε_i for i.i.d. Gaussian ε_i, for each x_i in X_v, and consider the event that a(X_v) ≤ C ≤ b(X_v). This event represents the constraint being satisfied for all points in X_v, and it is defined through the latent variable C.
In summary, the process we will consider is stated as

    f ~ GP(μ, K),   Y = f(X) + ε,   a(X_v) ≤ Lf(X_v) + ε_v ≤ b(X_v),

where f is a Gaussian process, {X, Y} is the training data and X_v are the locations where the transformed process is bounded. The additive noise terms ε and ε_v are multivariate Gaussian with diagonal covariance matrices of elements σ² and σ_v², respectively.
Here we assume that observations of all parts of Lf come with i.i.d. white noise with variance σ_v². The reason for this is mainly numerical stability: in computations we will choose a tiny variance to approximate noiseless observations. Similarly, σ² may be chosen as a fixed small number for interpolation in the standard GP regression setting. In the following derivations, the results for exact noiseless observations can be obtained by setting the relevant variance to zero.
We also assume that any sub-operator of L is constrained at the same set of virtual locations X_v. This is mainly for notational convenience, and the assumption will be relaxed in Section 3.5. In the following, we let N_v denote the total number of virtual observation locations. For now all sub-operators share the same locations, whereas we will later consider the case where the i-th sub-operator is associated with its own set of virtual observation locations.
3.1 Posterior Predictive Distribution
Our goal is to obtain the posterior predictive distribution of the CGP. That is: the distribution of f(X*) for some new inputs X*, conditioned on the observed data Y and the constraint.
To simplify the notation we suppress the dependency on the inputs X, X_v and X* (as well as any hyperparameters of the mean and covariance function). The posterior predictive distribution is given by marginalizing over the latent variable C:
where the limits of integration correspond to the hyper-rectangle given by the functions a and b evaluated at each x in X_v. The predictive distribution and the probability that the constraint holds are given in Lemma 3.1. The latter probability is of interest in its own right, as it is the probability that the constraint holds at X_v given the data Y.
In the remainder of the paper we will use the shortened notation , , and . For vectors with elements in , such as , we interpret this elementwise. E.g. is given by the column vector .
We start by deriving the posterior predictive distribution at some new locations X*. The predictive distribution is represented by a Gaussian with a fixed covariance matrix and a mean that depends on the random variable C. The variable C remains Gaussian after conditioning on the observations Y, with some expectation and covariance matrix that can be computed using (3) and (4). Applying the constraint to the random variable C | Y just means restricting C to lie in the hyper-rectangle defined by the bounds a(X_v) and b(X_v). This means that the constrained C is a truncated multivariate Gaussian. The full derivation of the distribution parameters of C and of the predictive distribution is given in Lemma 3.1 below, whereas Lemma 3.2 provides an alternative algorithmic representation suitable for numerical implementation.
The predictive distribution is a compound Gaussian with truncated Gaussian mean:
where C is the Gaussian C | Y conditioned on the hyper-rectangle defined by a(X_v) and b(X_v), and
Moreover, the probability that the unconstrained variable C | Y falls within the constraint region is given by
and the unconstrained predictive distribution is
The derivation in Lemma 3.1 is based on conditioning the joint multivariate Gaussian of the predictive values and C, and the proof is given in Appendix A. For practical implementation the matrix inversions involved in Lemma 3.1 may be prone to numerical instability. A numerically stable alternative is given in Lemma 3.2 below.
In the following lemma, Chol(A) is the lower triangular Cholesky factor of a matrix A. We also let A \ B denote the solution X to the linear system AX = B for matrices A and B, which may be computed efficiently when A is triangular using forward or backward substitution.
Then the matrices in Lemma 3.1 can be computed as
Moreover, is symmetric and positive definite. By letting and we also have
The proof is given in Appendix B. The numerical complexity of the procedures in Lemma 3.2 is O(n³) for Cholesky factorization of n × n matrices and O(mn²) for solving triangular systems where the unknown matrix is n × m. In the derivation of Lemma 3.1 and Lemma 3.2, the order of operations was chosen such that the first Cholesky factor only depends on the training data. This is convenient in the case where the posterior is calculated multiple times for different constraints or virtual observations X_v, but where the data {X, Y} remain unchanged.
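A minimal sketch of this style of Cholesky-based computation for the unconstrained part (illustrative only; it mirrors the standard stable GP posterior rather than the exact matrix identities of the lemma, and `np.linalg.solve` stands in for dedicated forward/backward substitution):

```python
import numpy as np

def chol_posterior(K, Ks, Kss, y, noise=1e-6):
    """GP posterior via a Cholesky factor and triangular solves.
    K = K(X, X), Ks = K(X*, X), Kss = K(X*, X*)."""
    n = K.shape[0]
    L = np.linalg.cholesky(K + noise * np.eye(n))   # K + noise*I = L L^T
    A = np.linalg.solve(L, Ks.T)                    # A = L^{-1} Ks^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks @ alpha                               # predictive mean
    cov = Kss - A.T @ A                             # predictive covariance
    return mean, cov
```

Factoring once and reusing the triangular solves is what makes the separation of data-dependent and constraint-dependent computations pay off.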
3.2 Sampling from the Posterior Distribution
To generate samples of the posterior at new input locations X*, we use the following procedure:
Sampling from the posterior distribution
Find a matrix B such that BB^T equals the fixed covariance matrix of the predictive distribution, e.g. by Cholesky or spectral decomposition.
Generate a matrix where each column is a sample of C from the truncated Gaussian distribution in (6).
Generate U, a matrix with samples from the standard normal N(0, I).
The matrix where each column is a sample from the posterior is then obtained by
where means that the vector on the left hand side is added to each column of the matrix on the right hand side.
This procedure is based on the well-known method for sampling from multivariate Gaussian distributions, where we have used the property that in the predictive distribution, only the mean depends on samples from the constraint distribution.
The challenging part of this procedure is the second step, where samples have to be drawn from a truncated multivariate Gaussian. The simplest approach is rejection sampling, i.e. generating samples from the normal distribution and rejecting those that fall outside the bounds. The expected number of draws needed is then inversely proportional to the acceptance rate, which is the probability given in (7). If the acceptance rate is low, rejection sampling becomes inefficient, and an alternative approach such as Gibbs sampling (Kotecha and Djuric, 1999) is typically used. In our numerical experiments (presented in Section 4.2) we made use of a new method based on simulation via minimax tilting by Botev (2017), developed for high-dimensional exact sampling. Botev (2017) proves strong efficiency properties and demonstrates accurate simulation in high dimensions with small acceptance probabilities, taking about the same time as one cycle of Gibbs sampling. For higher dimensions in the thousands, the method is used to accelerate existing Gibbs samplers by jointly sampling hundreds of highly correlated variables. In our experiments, we found that this method worked well in cases where Gibbs sampling was challenging. A detailed comparison with other sampling alternatives for an application similar to ours is also given in (López-Lopera et al., 2018). An important observation in Algorithm 3.2 is that for inference at a new set of input locations X*, when the data and virtual observation locations are unchanged, the samples generated in step 2 can be reused.
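The procedure above can be sketched as follows, with rejection sampling standing in for step 2 (in practice one would switch to Gibbs sampling or the minimax-tilting sampler of Botev (2017) when the acceptance rate is low). The matrices B and Q, and all names, are illustrative, not the paper's notation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_truncated_mvn(mu, Sigma, lb, ub, k, max_tries=100_000):
    """Step 2 via rejection sampling: draw from N(mu, Sigma) and keep
    only draws inside the hyper-rectangle [lb, ub]."""
    L = np.linalg.cholesky(Sigma)
    out = []
    for _ in range(max_tries):
        z = mu + L @ rng.standard_normal(len(mu))
        if np.all(z >= lb) and np.all(z <= ub):
            out.append(z)
            if len(out) == k:
                break
    return np.array(out).T  # one sample per column

def sample_posterior(mu_f, B, Q, C_samples):
    """Steps 3-4: each posterior sample is mu_f + B @ C_j + Q @ u_j.
    Only the mean term depends on the constraint samples C_j;
    Q Q^T is the fixed posterior covariance (illustrative names)."""
    k = C_samples.shape[1]
    U = rng.standard_normal((Q.shape[1], k))
    return mu_f[:, None] + B @ C_samples + Q @ U
```

Because only the mean depends on the constraint samples, the same columns of `C_samples` can be reused for inference at new input locations, as noted above.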
3.3 Parameter Estimation
To estimate the parameters of the CGP we make use of the maximum marginal likelihood approach (MLE). We define the marginal likelihood function of the CGP as
i.e. as the probability of the data and constraint combined, given the set of parameters represented by θ. We assume that the mean and covariance function of the GP prior (1) may depend on θ. The log-likelihood is thus given as the sum of the unconstrained log-likelihood, which is optimized in unconstrained MLE, and the log of the probability that the constraint holds at X_v, given in (7).
In (Bachoc et al., 2018) the authors study the asymptotic distribution of the MLE for shape-constrained GPs, and show that for large sample sizes the effect of including the constraint in the MLE is negligible. For small or moderate sample sizes the constrained MLE is generally more accurate, so taking the constraint into account is beneficial. However, due to the added numerical complexity of optimizing a function that includes the constraint probability term, it might not be worthwhile. Efficient parameter estimation using the full likelihood (8) is a topic of future research. In the numerical experiments presented in this paper, we therefore make use of the unconstrained MLE. This also makes it possible to compare models with and without constraints in a more straightforward manner.
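For reference, the unconstrained log marginal likelihood term can be computed stably via a Cholesky factor. This is a standard sketch, not the paper's code; the constrained likelihood (8) would add the log-probability that the constraint holds, which is the expensive truncated-Gaussian term:

```python
import numpy as np

def log_marginal_likelihood(K, y, noise):
    """Unconstrained GP log marginal likelihood log p(Y | theta)."""
    n = len(y)
    L = np.linalg.cholesky(K + noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))    # = 0.5 * log det(K + noise*I)
            - 0.5 * n * np.log(2.0 * np.pi))
```

Maximizing this quantity over kernel hyperparameters is the unconstrained MLE used in the experiments.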
3.4 Finding the Virtual Observation Locations
For the constraint to be satisfied locally at any input location in some bounded set Ω with sufficiently high probability, the set of virtual observation locations has to be sufficiently dense. We will specify a target probability p_target and find a set X_v such that, when the constraint is satisfied at all virtual locations in X_v, the probability that the constraint is satisfied at any x in Ω is at least p_target. The number of virtual observation locations needed depends on the smoothness properties of the kernel, and for a given kernel it is of interest to find a set that is effective in terms of numerical computation. As we need to sample from a truncated Gaussian involving cross-covariances between all elements in X_v, we would like the set to be small, and also to avoid points in X_v that are close together, which could lead to high serial correlation.
Seeking an optimal set of virtual observation locations has also been discussed in (Wang and Berger, 2016; Golchi et al., 2015; Riihimäki and Vehtari, 2010; Da Veiga and Marrel, 2012, 2015), and the intuitive idea is to iteratively place virtual observation locations where the probability that the constraint holds is low. The general approach presented in this section is most similar to that of Wang and Berger (2016). In Section 3.5 we extend this to derive a more efficient method for multiple constraints.
In order to estimate the probability that the constraint holds at some new location x, we first derive the posterior distribution of the constraint process.
The predictive distribution of the constraint process at some new input x, conditioned on the data, is given by
and when it is conditioned on both the data and the virtual constraint observations, the posterior becomes
The proof is given in Appendix D. The predictive distribution in Lemma 3.4 was defined for a single input x, and we will make use of the result in this context. But we could just as well consider an input matrix with multiple rows, where the only change in Lemma 3.4 is to replace the single input with that matrix. In this case we also note that the variances are more efficiently computed elementwise, avoiding construction of the full covariance matrix.
Using the posterior distribution of in Lemma 3.4 we define the constraint probability as
where the indicator in the definition equals 1 when the constraint holds and 0 otherwise. The tolerance is a non-negative fixed number that is included to ensure that it will be possible to increase the constraint probability using observations with additive noise. When we use virtual observations that come with noise variance σ_v², the tolerance can be expressed through the normal cumulative distribution function. Note that σ_v, and in this case the tolerance, will be small numbers included mainly for numerical stability. In the numerical examples presented in this paper this noise variance was set to a small fixed value.
In the case where we condition only on the data, computation of (11) is straightforward as the distribution is Gaussian. Otherwise, we rely on the following estimate:
where the terms are computed from samples of C given in (6).
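The estimate (12) averages univariate Gaussian probabilities over samples of C. A sketch, where `cond_means` denotes the conditional means of the constraint process at x given each sample (our notation, not the paper's):

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def constraint_prob_mc(cond_means, cond_sd, a, b):
    """Monte Carlo estimate in the spirit of (12): average, over the
    conditional means derived from samples C_j, of the univariate
    Gaussian probability that the constraint process lies in [a, b]."""
    ps = [Phi((b - m) / cond_sd) - Phi((a - m) / cond_sd) for m in cond_means]
    return float(np.mean(ps))
```

As noted above, the same samples of C can be reused when this estimate is evaluated at many candidate locations x.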
We outline an algorithm for finding a set of virtual observation locations X_v such that the probability that the constraint holds locally at any x in Ω is at least p_target, for some specified set Ω and target probability p_target. The algorithm can be used starting with no initial virtual observation locations, or with some pre-defined set. The latter may be useful e.g. if the data are updated, in which case only a few additions to the previous set might be needed.
Finding locations of virtual observations X_v s.t. the constraint probability is at least p_target for all x in Ω.
The rate of convergence of Algorithm 3.4 depends on the probability that the constraint holds initially, and for practical applications one may monitor the estimated constraint probability as a function of the number of virtual observation locations to find an appropriate stopping criterion.
Except for low-dimensional inputs x, the optimization step is in general a hard non-convex optimization problem. But with respect to how the constraint probabilities are used in the algorithm, some simplifications can be justified. First, we note that when computing the estimate (12) for multiple x, the samples are reused. It is also not necessary to find the absolute minimum, as long as a small enough value is found in each iteration. Within the global optimization one might therefore decide to stop after the first occurrence of a probability less than some threshold value. With this idea one could also search over finite candidate sets, using a fixed number of random points in Ω. This approach might produce a larger set X_v, but where the selection of each new location is faster in each iteration. Some of the alternative strategies for locating virtual observations in Algorithm 3.4 are studied further in our numerical experiments in Section 4.2.
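The finite-candidate-set variant of this idea can be sketched as a greedy loop; `constraint_prob(x, Xv)` is a hypothetical user-supplied estimator such as (12), and the interface is ours, not the paper's:

```python
import numpy as np

def place_virtual_points(candidates, constraint_prob, p_target, max_points=50):
    """Greedy sketch of the placement loop: repeatedly add the candidate
    location where the estimated constraint probability is lowest, until
    it exceeds p_target everywhere in the candidate set."""
    Xv = []
    for _ in range(max_points):
        probs = np.array([constraint_prob(x, Xv) for x in candidates])
        i = int(np.argmin(probs))
        if probs[i] >= p_target:
            break                 # constraint probable enough everywhere
        Xv.append(candidates[i])  # impose the constraint where it is weakest
    return Xv
```

In the full algorithm the candidate set would be replaced by a (non-convex) global search over Ω, with the early-stopping heuristics discussed above.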
With the above algorithm we aim to impose constraints on some bounded set Ω. Here Ω has to be chosen with respect to both training and test data. For a single boundedness constraint, it might be sufficient that the constraint only holds at the points that will be used for prediction. But if we consider constraints related to monotonicity (see Example 1, Section 4.2), dependency with respect to the latent function's properties at the training locations is lost with this strategy. In the examples we give in this paper we consider a convex set Ω, in particular a bounded hyper-rectangle, and assume that training data, test data and any input relevant for prediction lie within Ω.
3.5 Separating Virtual Observation Locations for Sub-operators
Let L be a linear operator defined by the column vector (L_1, …, L_m)^T, where each L_i is a linear operator leaving both the domain and range of its argument unchanged, i.e. L_i produces functions from R^d to R, subject to an interval constraint a_i(x) ≤ L_i f(x) ≤ b_i(x). Until now we have assumed that the constraint holds at a common set of virtual observation locations X_v, which means that a(x) ≤ Lf(x) ≤ b(x) for all x in X_v.
However, it might not be necessary to constrain each of the sub-operators at the same points. Intuitively, constraints with respect to L_i need only be imposed at locations where the probability that the i-th constraint is violated is non-negligible. To accommodate this we let X_v be the concatenation of sub-operator specific location matrices, and define C accordingly. This is equivalent to removing some of the rows in the original formulation, and all of the results in this paper still apply. In this setting we can improve the algorithm in Section 3.4 for finding the set of virtual observation locations by considering each sub-operator individually. This is achieved using estimated partial constraint probabilities, defined as in (11) by considering only the i-th sub-operator. We may then use the estimate
where each term is a univariate Normal probability given by the i-th row of the relevant mean and covariance expressions, computed from samples of C given in (6) as before. Algorithm 3.4 can then be improved by minimizing (13) with respect to both the location x and the sub-operator index i. The details are presented in Appendix C, Algorithm C.
3.6 Prediction using the Posterior Distribution
For the unconstrained GP in this paper, where the likelihood is given by Gaussian white noise, the posterior mean and covariance are sufficient to describe predictions, as the posterior remains Gaussian. It is also known that in this case there is a correspondence between the posterior mean of the GP and the optimal estimator in the Reproducing Kernel Hilbert Space (RKHS) associated with the GP (Kimeldorf and Wahba, 1970). This is a Hilbert space of functions defined by the positive semidefinite kernel of the GP. Interestingly, a similar correspondence holds for the constrained case. Maatouk et al. (2016) show that for constrained interpolation, the Maximum A Posteriori (MAP) or mode of the posterior is the optimal constrained interpolation function in the RKHS, and also illustrate in simulations that the unconstrained mean and constrained MAP coincide only when the unconstrained mean satisfies the constraint. This holds when the GP is constrained to a convex set of functions, which is the case in this paper where we condition on linear transformations of a function restricted to a convex set.
3.7 An Alternative Approach based on Conditional Expectations
Da Veiga and Marrel (2012, 2015) propose an approach for approximating the first two moments of the constrained posterior using conditional expectations of the truncated multivariate Gaussian. This means, in the context of this paper, that the first two moments of the predictive distribution are computed using the first two moments of the latent variable C. To apply this idea using the formulation of this paper, we can make use of the following result.
Let the matrices and the truncated Gaussian random variable C be as defined in Lemma 3.1, and let μ_C and Σ_C be the expectation and covariance of C. Then the expectation and covariance of the predictive distribution are given as
Moreover, if the matrices are as defined in Lemma 3.4, then the expectation and variance of the predictive distribution of the constraint are given as
Da Veiga and Marrel (2012, 2015) make use of a Genz approximation (Genz, 1992, 1997) to compute the truncated moments for inference using (14). They also introduce a crude but faster correlation-free approximation that can be used in the search for virtual observation locations. With this approach, (15) is used where the moments of C are computed under the assumption that its covariance matrix is diagonal. We can state this approximation as follows:
where the i-th component of the approximation is given by the moments of the i-th univariate Gaussian truncated to its constraint interval.
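Under the diagonal-covariance assumption, the correlation-free approximation reduces to the classical moments of a univariate truncated normal, sketched below (standard closed-form expressions; the function name is ours):

```python
from math import erf, exp, pi, sqrt

def truncnorm_moments(mu, sd, a, b):
    """Mean and variance of N(mu, sd^2) truncated to [a, b] -- the
    univariate building block of the correlation-free approximation."""
    phi = lambda z: exp(-0.5 * z * z) / sqrt(2.0 * pi)  # standard normal pdf
    Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))    # standard normal cdf
    al, be = (a - mu) / sd, (b - mu) / sd
    Z = Phi(be) - Phi(al)                               # normalizing constant
    m = mu + sd * (phi(al) - phi(be)) / Z
    v = sd ** 2 * (1.0 + (al * phi(al) - be * phi(be)) / Z
                   - ((phi(al) - phi(be)) / Z) ** 2)
    return m, v
```

Applying this componentwise avoids any multivariate truncated-Gaussian computation, which is what makes the approximation fast, at the cost of ignoring the correlations in C.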