Sparse Pseudo-input Local Kriging for Large Spatial Datasets with Exogenous Variables

Babak Farmanesh and Arash Pourhabib School of Industrial Engineering and Management, Oklahoma State University, Stillwater, Oklahoma, {babak.farmanesh,arash.pourhabib}@okstate.edu
Abstract

We study large-scale spatial systems that contain exogenous variables, e.g., environmental factors that are significant predictors in spatial processes. Building predictive models for such processes is challenging because the large number of observations makes it inefficient to apply full Kriging. In order to reduce computational complexity, this paper proposes Sparse Pseudo-input Local Kriging (SPLK), which utilizes hyperplanes to partition a domain into smaller subdomains and then applies a sparse approximation of full Kriging to each subdomain. We also develop an optimization procedure to find the desired hyperplanes. To alleviate the problem of discontinuity in the global predictor, we impose continuity constraints on the boundaries of neighboring subdomains. Furthermore, partitioning the domain into smaller subdomains makes it possible to use different parameter values for the covariance function in each region and, therefore, to effectively capture the heterogeneity in the data structure. Numerical experiments demonstrate that SPLK outperforms, or is comparable to, the algorithms commonly applied to spatial datasets.

Keywords: Gaussian process regression, Local Kriging, Sparse approximation, Spatial datasets

1 Introduction

Advances in data collection technologies for geostatistics have created unprecedented opportunities to build more effective data-driven models. Of paramount importance in many engineering applications is to build predictive models for spatial processes that include environmental factors such as temperature or irrigation as significant predictors (Gao et al., 2014; Zhang et al., 2017). We call these environmental factors exogenous variables to distinguish them from simple spatial information such as latitude, longitude, and altitude.

Kriging (Cressie, 1990), also known as Gaussian process regression (GPR) (Rasmussen and Williams, 2006), is a powerful tool for modeling spatial processes. Theoretically, GPR can benefit from very large datasets since it is a non-parametric model whose flexibility and performance generally increase with more data points (Friedman et al., 2009). However, the computational complexity of GPR is dominated by the inversion of covariance matrices, which is of $\mathcal{O}(N^3)$, where $N$ is the number of data points.

To reduce GPR computation time, various approaches have been developed to approximate the covariance matrix of the GP, resulting in less costly matrix operations. One class of such methods approximates the covariance matrix with sparse matrices, i.e., matrices with many zero elements (Furrer et al., 2006; Zhang and Du, 2008; Kaufman et al., 2008), and another class of methods finds low-rank approximations of the covariance matrix (Williams and Seeger, 2001; Smola and Schölkopf, 2000; Snelson, 2007; Snelson and Ghahramani, 2007; Quiñonero-Candela and Rasmussen, 2005; Pourhabib et al., 2014). However, these methods do not directly take the heterogeneous structure of the data into account: if the behavior of the response variable strongly depends on the underlying geology (Kim et al., 2005) or the exogenous variables, it is reasonable to assume different values for the parameters of a given covariance function (hence, heterogeneity). Although there is a rich body of literature in spatial statistics that proposes different methods to capture inhomogeneous covariance structures (see Sampson and Guttorp (1992); Schmidt and O’Hagan (2003); Damian et al. (2003); Paciorek and Schervish (2006); Lindgren et al. (2011); Fuglstad et al. (2015)), the application of these methods is generally limited to small datasets with up to two-dimensional input spaces. As such, for large spatial datasets, and especially data with exogenous variables, it is beneficial to allow for different covariance parameters in each region, while addressing the computational challenge of handling the large number of observations.

In order to reduce the computational complexity of GPR, while at the same time improving its ability to tackle inhomogeneous covariance structures for large spatial datasets, one idea is to use a class of local Kriging that assumes distinct covariance functions for each region of the data domain. Local Kriging uses a partitioning policy that decomposes the domain into smaller subdomains and applies local GPR in each subdomain (Haas, 1990; Park et al., 2011; Gramacy and Lee, 2008). Therefore, local Kriging reduces the total computational complexity to roughly $\mathcal{O}(kn^3)$, where $n$ is the number of local data points in each of the $k$ subdomains and $n \ll N$. This idea, however, presents two related challenges: discontinuity in prediction on the boundaries of the subdomains and devising an efficient partitioning policy.

To address discontinuity on the boundaries, one category of local Kriging methods uses various averaging techniques to smooth the prediction surface close to the boundaries. Examples in this category include Bayesian committee machine, BCM (Tresp, 2000), mixtures of Gaussian processes, MGP (Rasmussen and Ghahramani, 2002), treed Gaussian process models, TGP (Urtasun and Darrell, 2008; Gramacy and Lee, 2008), bagged Gaussian process, BGP (Chen and Ren, 2009), and local probabilistic regression, LPR (Urtasun and Darrell, 2008). Such averaging techniques, however, come at the cost of higher computational complexity at prediction time.

Another category of methods to alleviate the discontinuity problem enforces continuity constraints on the boundaries of subdomains. This class of approaches includes domain decomposition method, DDM (Park et al., 2011), patching local Gaussian processes, PGP (Park and Huang, 2016), and patchwork Kriging, PWK (Park and Apley, 2018). Experimental studies suggest that directly imposing continuity constraints generally outperforms the averaging techniques (Park et al., 2011; Park and Huang, 2016; Park and Apley, 2018). However, due to the complexity of handling boundary conditions, only PWK can be applied to higher-dimensional domains; DDM and PGP are limited in practice to only two-dimensional domains (Park and Apley, 2018).

Furthermore, none of the local Kriging approaches above take the data structure, which is manifested in the covariance function, into account when partitioning the data domain: DDM and PGP use a uniform mesh that partitions the domain of the input data into rectangles. TGP and PWK, on the other hand, use simple tree-based partitioning to iteratively bisect the input domain. Moreover, in order to obtain time-efficient algorithms, the number of data points in each subdomain must be kept small, e.g., up to 600 data points in each subdomain (Park et al., 2011; Park and Apley, 2018). However, there is a trade-off between the number of subdomains and the accuracy of prediction: as the number of subdomains, and thus boundaries, increases, the prediction accuracy on the boundaries of the subdomains decreases, regardless of the method used to handle the boundary conditions.

To address the limitations of existing local Kriging methods, this paper proposes a new method, Sparse Pseudo-input Local Kriging (SPLK), which utilizes covariance information to partition the data domain into subdomains. The data is partitioned using parallel hyperplanes, and continuity constraints are enforced on the boundaries of subdomains. This partitioning approach minimizes the number of boundaries and simplifies boundary conditions, allowing application to datasets with moderate dimensional spaces. We develop an optimization algorithm to find the desired hyperplanes that result in lower errors for the covariance approximations in each region, and provide theoretical justification for the use of such parallel hyperplanes to create the subdomains based on analysis of the covariance structure. Therefore, SPLK is essentially a hybrid method combining low-rank approximations and local GPR to seamlessly integrate a partitioning policy into local approximations to improve prediction accuracy.

One potential disadvantage of this proposed partitioning is that it can result in large subdomains, which makes the application of full GPR in each subdomain computationally inefficient; this limitation is overcome by using covariance approximation methods for each region. This approximation also has the added benefit of increasing the flexibility of choosing the sizes of the subdomains to further reduce the number of boundaries. While SPLK has a higher computational complexity compared to sparse and low-rank approximation methods due to the handling of the boundary conditions, the use of local covariance functions in each subdomain better captures heterogeneous data structures than low-rank approximation methods do. Another trade-off is that since the covariance structure of a spatial process can vary in different directions, partitioning in one direction using parallel hyperplanes may not be the most flexible way of capturing such structures. Nonetheless, this simple partitioning significantly reduces computation time over existing local Kriging methods while still maintaining acceptable prediction accuracy.

As the dimension of the data domain increases, handling the boundary conditions of SPLK becomes more computationally expensive due to the expansion of the boundary spaces. Therefore, we suggest applying SPLK to spatial datasets with a moderate number of exogenous variables. However, we note that the methodology is general and can be efficiently applied to any large dataset (on the order of hundreds of thousands of data points) with a small number of input variables (say ten or fewer). Our numerical studies demonstrate that SPLK outperforms, or performs as well as, the competing algorithms in terms of computation time or accuracy on two- and three-dimensional spatial data, higher-dimensional spatial data with exogenous variables, and a nine-dimensional non-spatial dataset.

The remainder of this paper is organized as follows. Section 2 introduces GPR and a few approximation techniques that are relevant to this paper. Section 3 explains the proposed method, including domain partitioning, training local models subject to boundary conditions, and choosing the directions of cuts. Section 4 compares the proposed method to commonly used algorithms. Section 5 concludes the paper and suggests future research. The supplemental material includes proofs of all theorems and other technical details and analyses related to the proposed approach.

2 Gaussian Process Regression

Given an index set $\mathcal{X}$, a Gaussian process (GP) is a stochastic process $\{f(\mathbf{x}): \mathbf{x} \in \mathcal{X}\}$ such that for any finite subset $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ of $\mathcal{X}$, the random vector $(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^\top$ follows a multivariate normal distribution (Rasmussen and Williams, 2006), where $f(\mathbf{x})$ is a realization of a measurable function for a given $\mathbf{x}$. Here, we consider the index set to be a subset of $\mathbb{R}^d$, such that for a given $\{\mathbf{x}_i\}_{i \in [n]} \subset \mathcal{X}$, the random vector $(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^\top$ follows a multivariate normal distribution, where $\mathbf{x}_i \in \mathcal{X}$ for all $i \in [n]$, and $[n]$ denotes the set of positive integers smaller than or equal to $n$, i.e., $[n] = \{1, 2, \ldots, n\}$.

We say a GP is fully specified when the function $f$ follows a GP distribution with mean function $\mu(\cdot)$ and covariance function $K(\cdot, \cdot)$. In other words, given $\mu$ and $K$, the mean vector and the covariance matrix of the random vector $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^\top$ can be calculated, i.e., $\mu(\mathbf{x}_i) = \mathbb{E}[f(\mathbf{x}_i)]$ and $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbb{E}[(f(\mathbf{x}_i) - \mu(\mathbf{x}_i))(f(\mathbf{x}_j) - \mu(\mathbf{x}_j))]$, where $\mathbb{E}$ denotes the expectation operator. This means that $\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$, where $\boldsymbol{\mu}_i = \mu(\mathbf{x}_i)$ and $\mathbf{K}_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$.

In the context of regression, given a training dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ consisting of noise-contaminated observations, i.e., $y_i = f(\mathbf{x}_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, Gaussian process regression (GPR) seeks the predictive distribution of $f_* = f(\mathbf{x}_*)$ at a test location $\mathbf{x}_*$ given $\mathcal{D}$. We can derive this predictive distribution directly from the definition of the GP using the joint Gaussian distribution (assuming, for simplicity, a zero-mean GP)

$\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K} + \sigma^2\mathbf{I} & \mathbf{k}_* \\ \mathbf{k}_*^\top & k_{**} \end{bmatrix}\right),$   (1)

where $\mathbf{K}$ is the covariance matrix of pairwise elements in $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, $\mathbf{k}_*$ is the vector of covariances between the training inputs and $\mathbf{x}_*$, and $k_{**}$ is the variance at $\mathbf{x}_*$. Hence, the GPR predictive distribution can be obtained by conditioning on $\mathbf{y}$ in (1),

$f_* \mid \mathbf{y} \sim \mathcal{N}\left(\mathbf{k}_*^\top(\mathbf{K} + \sigma^2\mathbf{I})^{-1}\mathbf{y},\; k_{**} - \mathbf{k}_*^\top(\mathbf{K} + \sigma^2\mathbf{I})^{-1}\mathbf{k}_*\right).$   (2)

Since calculating (2) entails inverting matrices of size $N \times N$, the computational complexity is of $\mathcal{O}(N^3)$, which is generally too slow for most practical applications, especially in spatial statistics. Low-rank covariance approximation methods (Williams and Seeger, 2001; Quiñonero-Candela and Rasmussen, 2005) approximate the original covariance matrix through the Nyström method,

$\mathbf{K} \approx \tilde{\mathbf{K}} = \mathbf{K}_{Nm}\mathbf{K}_{mm}^{-1}\mathbf{K}_{mN},$   (3)

where $\mathbf{K}_{Nm}$ is the covariance matrix between the training inputs and a set $\bar{\mathbf{X}}$ of $m \ll N$ locations, $\mathbf{K}_{mm}$ is the covariance matrix of $\bar{\mathbf{X}}$, and $\bar{\mathbf{X}}$ is either a subset of the training inputs or a set of unobserved pseudo-inputs, which are a new set of parameters used to approximate the likelihood of GPR. In particular, Sparse Pseudo-input Gaussian Process (SPGP) (Snelson and Ghahramani, 2007) assumes that the observations are conditionally independent given the pseudo-outputs defined on the pseudo-input set $\bar{\mathbf{X}}$. This implies the joint Gaussian likelihood,

$p(\mathbf{y}) = \mathcal{N}\left(\mathbf{y} \mid \mathbf{0},\; \tilde{\mathbf{K}} + \boldsymbol{\Lambda} + \sigma^2\mathbf{I}\right), \qquad \boldsymbol{\Lambda} = \mathrm{diag}(\mathbf{K} - \tilde{\mathbf{K}}),$   (4)

and the predictive mean and variance,

$\mu_* = \tilde{\mathbf{k}}_*^\top\left(\tilde{\mathbf{K}} + \boldsymbol{\Lambda} + \sigma^2\mathbf{I}\right)^{-1}\mathbf{y},$   (5)
$\sigma_*^2 = k_{**} - \tilde{\mathbf{k}}_*^\top\left(\tilde{\mathbf{K}} + \boldsymbol{\Lambda} + \sigma^2\mathbf{I}\right)^{-1}\tilde{\mathbf{k}}_* + \sigma^2,$   (6)

where $\tilde{\mathbf{k}}_*$ is the low-rank covariance vector between $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and the test data point $\mathbf{x}_*$ calculated by (3). Section 3 explains how the low-rank approximation in SPGP helps us devise a simple but efficient partitioning policy for our proposed local Kriging method.
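To make the computational contrast concrete, the following is a minimal NumPy sketch of the exact GPR predictor (2) and an SPGP-style low-rank predictor in the spirit of (3)–(6). It is not the authors' implementation: the squared exponential kernel, the random selection of pseudo-inputs, and all function names are our own illustrative choices.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, sigma_f=1.0):
    """Squared exponential covariance between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * d2 / lengthscale**2)

def gpr_predict(X, y, Xs, noise=0.1):
    """Exact GPR predictive mean and variance, cf. (2); costs O(N^3)."""
    K = sq_exp_kernel(X, X) + noise**2 * np.eye(len(X))
    Ks = sq_exp_kernel(Xs, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = sq_exp_kernel(Xs, Xs).diagonal() - np.sum(Ks * np.linalg.solve(K, Ks.T).T, 1)
    return mean, var

def spgp_predict(X, y, Xs, Xm, noise=0.1):
    """SPGP-style low-rank prediction with pseudo-inputs Xm, cf. (3)-(6)."""
    Kmm = sq_exp_kernel(Xm, Xm) + 1e-8 * np.eye(len(Xm))    # jitter for numerical stability
    Knm = sq_exp_kernel(X, Xm)
    Ksm = sq_exp_kernel(Xs, Xm)
    Qdiag = np.sum(Knm * np.linalg.solve(Kmm, Knm.T).T, 1)  # diagonal of the Nystrom matrix
    lam = 1.0 - Qdiag + noise**2                            # diag(K - K_tilde) + noise; k(x,x) = 1 here
    A = Kmm + Knm.T @ (Knm / lam[:, None])                  # only an m x m system is inverted
    mean = Ksm @ np.linalg.solve(A, Knm.T @ (y / lam))
    var = (sq_exp_kernel(Xs, Xs).diagonal()
           - np.sum(Ksm * np.linalg.solve(Kmm, Ksm.T).T, 1)
           + np.sum(Ksm * np.linalg.solve(A, Ksm.T).T, 1)
           + noise**2)
    return mean, var

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, (500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
Xs = np.linspace(0.0, 10.0, 5)[:, None]
Xm = X[rng.choice(len(X), 20, replace=False)]               # 20 pseudo-inputs picked from the data
print(gpr_predict(X, y, Xs)[0])
print(spgp_predict(X, y, Xs, Xm)[0])                        # close to the exact predictions
```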

3 Sparse Pseudo-input Local Kriging

This section describes our proposed method, Sparse Pseudo-input Local Kriging (SPLK), where we partition the domain of the data into smaller subdomains with simple boundaries, train local predictors that utilize a low-rank covariance matrix in each subdomain, and connect neighboring local predictors on their joint boundaries to obtain a continuous global predictor. To partition the input domain, we use parallel hyperplanes, i.e., $(d-1)$-dimensional linear spaces embedded in the $d$-dimensional input space (see Section F.1 for the details). This partitioning minimizes the number of boundaries, because for $k$ subdomains we only need $k-1$ parallel hyperplanes regardless of the dimension of the input space. This reduction in the number of boundaries improves the prediction accuracy, since the accuracy of local models decreases in the regions close to the boundaries regardless of the way the boundary conditions are handled. Moreover, partitioning by parallel hyperplanes creates simple boundary conditions (see Section 3.1), as each boundary is shared by exactly two subdomains; hence, each boundary only involves two local predictors. However, the drawback is that this partitioning policy can result in very large subdomains, where a full GPR is computationally inefficient. We overcome this problem by using covariance approximation techniques that utilize pseudo-inputs.

Among the infinite possible ways to partition a domain by parallel hyperplanes, we seek those that improve the accuracy of local predictors, i.e., the covariance approximation in each subdomain has the smallest error. We present two theorems that together determine the policy for creating subdomains. We begin by presenting the local mean and variance calculations, assuming the subdomains have already been determined. Then we discuss justifications for the proposed parallel hyperplanes for creating subdomains. (See Appendix F for practical aspects of SPLK’s implementation such as constructing hyperplanes, hyperparameter learning, and selection of control points).

Any partitioning policy that results in subdomains whose boundaries do not intersect, e.g., concentric hyperspheres, would benefit from having a small number of boundaries and simple boundary conditions. What makes parallel hyperplanes particularly appealing is that the boundary spaces are minimal compared to any other non-intersecting partitioning policy. In addition, the simple structure of parallel hyperplanes allows us to analyze the direction of partitioning based on the underlying covariance structure, which might not be feasible for other partitioning policies.
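As a concrete illustration of this partitioning step, the sketch below assigns points to slabs separated by equally spaced parallel hyperplanes orthogonal to a given unit direction. The equal spacing, the chosen direction, and the function name are our own assumptions, not the paper's implementation (which constructs the hyperplanes as described in Section F.1).

```python
import numpy as np

def partition_by_hyperplanes(X, u, k):
    """Assign each row of X to one of k slabs separated by k-1 parallel hyperplanes
    orthogonal to the unit direction u (equally spaced cuts assumed here)."""
    t = X @ u                                      # scalar projection onto the cut direction
    edges = np.linspace(t.min(), t.max(), k + 1)   # slab edges; the interior ones are the cuts
    labels = np.clip(np.searchsorted(edges, t, side="right") - 1, 0, k - 1)
    return labels, edges[1:-1]                     # subdomain index per point, cut offsets

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, (1000, 3))
u = np.array([1.0, 0.0, 0.0])                      # e.g., cut orthogonally to the first axis
labels, cuts = partition_by_hyperplanes(X, u, k=4)
print(np.bincount(labels), cuts)                   # subdomain sizes and hyperplane offsets
```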

3.1 Mean and Variance Prediction

Let $\Omega \subset \mathbb{R}^d$ denote the input domain, i.e., the set containing all the training inputs. We partition $\Omega$ into $k$ subdomains $\Omega_j$, for $j = 1, \ldots, k$, such that $\bigcup_{j=1}^{k}\Omega_j = \Omega$ and $\Omega_j \cap \Omega_l = \emptyset$ for $j \neq l$. We also denote by $\mathbf{X}_j$ the training inputs falling in $\Omega_j$ and by $\mathbf{y}_j$ the vector of observations corresponding to $\mathbf{X}_j$ (see Section 3.2 for an explanation of determining $\Omega_j$). The partitioning scheme explained in Section 3.2 and Appendix F.1 can lead to subdomains containing a large number of training data points, which makes the application of a full GPR inefficient. Therefore, for each $\Omega_j$, we use a set of local pseudo-inputs to form the local low-rank covariance approximation,

$\mathbf{K}_j \approx \tilde{\mathbf{K}}_j = \mathbf{K}_{j,m}\mathbf{K}_{m,m}^{-1}\mathbf{K}_{m,j},$   (7)

where $\mathbf{K}_{j,m}$ is the covariance matrix between $\mathbf{X}_j$ and the local pseudo-inputs and $\mathbf{K}_{m,m}$ is the covariance matrix of the local pseudo-inputs.

It is easy to verify that among all linear predictors of the form $\mathbf{w}^\top\mathbf{y}$, where $\mathbf{w} \in \mathbb{R}^N$ and $(\mathbf{y}, f_*)$ follows distribution (1), the GPR mean predictor in (2) minimizes the expected squared error $\mathbb{E}[(f_* - \mathbf{w}^\top\mathbf{y})^2]$. We extend this idea to find the local low-rank predictor for each subdomain by assuming that $(\mathbf{y}_j, f_*)$ follows the local version of SPGP's joint likelihood distribution (4). As such, we solve

$\min_{\mathbf{w}} \; \mathbb{E}\left[(f_* - \mathbf{w}^\top\mathbf{y}_j)^2\right] \quad \text{subject to } (\mathbf{y}_j, f_*) \text{ following the local version of distribution (4)},$   (8)

where the covariance terms in (8) are the local versions of those in (4), and the covariance vector between the test data point $\mathbf{x}_*$ and $\mathbf{X}_j$ is calculated using low-rank approximation formula (7). Expanding the objective function with respect to the constraint in (8) and removing the constant term, which does not depend on $\mathbf{w}$, results in the unconstrained optimization problem for each subdomain $\Omega_j$,

(9)

Note that setting the derivative of (9) with respect to $\mathbf{w}$ to zero recovers SPGP's mean predictor for subdomain $\Omega_j$. Next, we modify the optimization problem to alleviate the problem of discontinuity in the predictions on the boundaries.

To impose continuity on the boundaries, we use a small number of control points on the boundaries of each subdomain (Park and Apley, 2018). We intend to force the local predictor to be equal to the boundary values at the control-point locations on the boundaries of $\Omega_j$,

(10)

where the boundary function evaluates the response at each control point (see Section F.1 for the details). Adding constraints (10) to local model (9) gives the constrained local optimization for each subdomain $\Omega_j$,

(11)

The objective function in optimization problem (11) is convex. Considering that the constraints are affine functions, we can solve optimization problem (11) analytically by transforming it into an unconstrained optimization problem using the Lagrange duality principle (Bazaraa et al., 2013). Appendix A in the supplemental material presents the solution procedure.
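The exact matrices involved in (11) are given in Appendix A and are not reproduced here; however, the structure of the solution step, a convex quadratic objective subject to affine equality constraints solved through its Lagrangian/KKT system, can be illustrated generically. All names below are illustrative and the toy data are arbitrary.

```python
import numpy as np

def solve_equality_constrained_qp(H, g, A, b):
    """Minimize 0.5 * w^T H w + g^T w subject to A w = b (H positive definite),
    by solving the KKT system obtained from the Lagrangian."""
    n, m = H.shape[0], A.shape[0]
    KKT = np.block([[H, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(KKT, np.concatenate([-g, b]))
    return sol[:n], sol[n:]                       # optimal w and the Lagrange multipliers

# toy instance: three weights, one continuity-style constraint w1 + w2 + w3 = 1
H = np.array([[2.0, 0.2, 0.0], [0.2, 2.0, 0.1], [0.0, 0.1, 2.0]])
g = np.array([-1.0, 0.0, 0.5])
A = np.ones((1, 3))
b = np.array([1.0])
w, nu = solve_equality_constrained_qp(H, g, A, b)
print(w, A @ w)                                   # A @ w reproduces b, so the constraint holds
```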

Solving optimization problem (11) yields the optimal solution

(12)

where the auxiliary quantities in (12) are defined in Appendix A. Therefore, the local mean predictor for subdomain $\Omega_j$ becomes

(13)

Also, the objective function of local problem (9) is in fact the local variance predictor. Therefore, plugging the optimal solution (12) into (9) yields the predictive variance for subdomain $\Omega_j$,

(14)

where the last term is the constant initially removed from the optimization. Note that in both (13) and (14), the first term is exactly the predictive mean and variance, respectively, of the local SPGP, and the remaining terms, which are amplified for local points close to the boundaries, appear in order to maintain the continuity of the global predictive function.

3.2 Subdomain selection

As mentioned in Section 3, for computational efficiency we only consider subdomains that are separated by parallel hyperplanes. We call these hyperplanes “cutting hyperplanes,” because each of them partitions, or “cuts,” the domain into two non-overlapping sets on different sides of the hyperplane. However, there are infinitely many ways of choosing the direction of the cutting hyperplanes. In Proposition 1 of this section, we first provide a criterion that defines what a “good” direction of cutting means, given a stationary covariance function. Next, using the two theorems that follow, we characterize the direction that optimizes this criterion. Finally, we introduce a constrained optimization problem that finds the desired direction using the likelihood function of a sample of the training data.

Recall that each subdomain uses a low-rank approximation for its covariance matrix. A natural criterion, therefore, is to look for subdomains such that the error of this approximation is minimized. That is, given a symmetric positive semidefinite kernel $K$, our objective is to create subdomains for which the expected covariance approximation error, induced by approximating the covariance at any pair of points in the subdomain using a set of pseudo-inputs, i.e.,

(15)

where the expectation is taken with respect to the points in the subdomain and the pseudo-inputs, is minimized (see Appendix B for the derivation of the covariance approximation error). However, since the expected error has a complicated form and its direct calculation is a challenging task, we seek an upper bound for this term and minimize that instead.

Proposition 1.

Let $K$ denote a stationary covariance function. Then the expected covariance approximation error (15) over a subdomain is bounded above by a quantity that decreases as $\mathbb{E}[K(\mathbf{x} - \mathbf{x}')^2]$ increases, where $\mathbf{x}$ and $\mathbf{x}'$ are i.i.d. random vectors sampled from the subdomain according to some probability distribution.

Proof.

See Appendix C in the supplemental material for proofs of all theorems and propositions. ∎

Proposition 1 provides an upper bound on expected error (15) that decreases as the expected squared covariance within a subdomain increases (see Appendix D for a simulation study showing that the relation between the expected squared covariance and expected error (15) is in fact stronger; under certain conditions, as the expected squared covariance increases, the expected error itself monotonically decreases). Therefore, we seek to construct the subdomains so that the expected squared covariance is maximized, i.e., the upper bound of the expected error is minimized. We note that Proposition 1 makes a stationarity assumption, and therefore the results may not hold for other types of covariance functions. However, because we use independent covariance functions in each subdomain, we are still able to handle heterogeneity, i.e., use different parameters for each local covariance function.
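The following small Monte Carlo sketch illustrates, under our own choice of kernel, slab shapes, and sampling scheme, the two quantities this argument relates: the average Nyström approximation error within a subdomain and the average squared covariance between points of that subdomain. Narrowing the slab along a fast-decay direction should raise the latter and lower the former; this is an illustration, not a reproduction of the simulation study in Appendix D.

```python
import numpy as np

def sq_exp(A, B, theta):
    """Squared exponential covariance with per-dimension decay rates theta."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.einsum("ijk,k->ij", diff**2, theta))

def subdomain_diagnostics(width, theta, n=400, m=15, seed=0):
    """Average Nystrom error and average squared covariance for a slab of a given width."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, (n, len(theta)))
    X[:, 0] *= width                                  # restrict the first coordinate to the slab
    Xm = X[rng.choice(n, m, replace=False)]           # pseudo-inputs drawn from the slab
    K = sq_exp(X, X, theta)
    Knm = sq_exp(X, Xm, theta)
    Kmm = sq_exp(Xm, Xm, theta) + 1e-8 * np.eye(m)
    err = np.mean(np.abs(K - Knm @ np.linalg.solve(Kmm, Knm.T)))
    return err, np.mean(K**2)

theta = np.array([50.0, 1.0])                         # covariance decays fastest along the first axis
for width in (1.0, 0.5, 0.25):                        # thinner slabs cut orthogonally to that axis
    err, mean_sq_cov = subdomain_diagnostics(width, theta)
    print(f"width={width:4.2f}  nystrom_error={err:.4f}  mean_squared_cov={mean_sq_cov:.4f}")
```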

For our theoretical framework, we consider a general scenario where, after standardizing the data, the domain of the data, $\Omega$, is (or is inscribed in) a hypercube with edge length $L$, one vertex at the origin, and all edges parallel to the axes of $\mathbb{R}^d$. The assumption that the domain of the data is inscribed in a hypercube is valid even if each dimension of the original input domain has a different length, because after standardization all the dimensions have the same length. We also assume that the data points are uniformly sampled from $\Omega$, specifically,

$p(\mathbf{x}) = \frac{1}{L^d}, \qquad \mathbf{x} \in \Omega.$   (16)

We call such an $\Omega$ a uniform straight hypercube.

Moreover, we consider the anti-isotropic squared exponential function as the choice of the covariance function,

$K(\mathbf{x}, \mathbf{x}') = \exp\left\{-\tfrac{1}{2}(\mathbf{x} - \mathbf{x}')^\top\boldsymbol{\Theta}(\mathbf{x} - \mathbf{x}')\right\},$   (17)

where $\boldsymbol{\Theta}$ is a diagonal matrix with length-scale parameters $\theta_1, \ldots, \theta_d$ on the diagonal, and, without loss of generality, we set the signal variance to one. We note that the squared function of (17), i.e., $K^2$, is a new squared exponential covariance function with length-scale parameters $2\theta_1, \ldots, 2\theta_d$. Hence, as $\theta_i$ increases, the rate of covariance decay along the $i$th axis increases.
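To make the squaring step explicit (a one-line check under the parameterization we assumed for (17) above):

$K(\mathbf{x}, \mathbf{x}')^2 = \exp\left\{-(\mathbf{x} - \mathbf{x}')^\top\boldsymbol{\Theta}(\mathbf{x} - \mathbf{x}')\right\} = \exp\left\{-\tfrac{1}{2}(\mathbf{x} - \mathbf{x}')^\top(2\boldsymbol{\Theta})(\mathbf{x} - \mathbf{x}')\right\},$

so squaring the kernel doubles each decay parameter and preserves both the squared exponential form and its directional structure.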

Given a primary axis, a vector of angles describing the tilt of the cuts relative to that axis, and the assumption that the cutting hyperplanes are equidistant (at distance $\delta$ from each other), all the subdomains and cutting hyperplanes can be fully characterized (see Appendix C). We denote the resulting subdomains of $\Omega$ by indexing them with the vector of angles, the primary axis, and the distance $\delta$. Note that the cutting hyperplanes are orthogonal to the primary axis only if all the angles are zero.

Theorem 1.

Let $\Omega$ be a uniform straight hypercube with side length $L$, and let $K$ denote covariance function (17). Then, for a fixed primary axis and a fixed distance $\delta$ between the cutting hyperplanes, setting all the angles to zero, i.e., cutting orthogonally to the primary axis, maximizes the expected squared covariance within the resulting subdomains.

While Theorem 1 shows that cutting orthogonally to a given axis maximizes the expected covariance compared to any tilted cut, Theorem 2 further shows that, among all the subdomains created by cutting orthogonally to a primary axis, the one created by cutting orthogonally to the axis associated with the fastest direction of change, i.e., the axis associated with the largest $\theta_i$, has the maximum expected covariance.

Theorem 2.

Let $\Omega$ be a uniform straight hypercube with side length $L$, and let $K$ denote covariance function (17). Then, for a fixed distance $\delta$ and cuts orthogonal to a primary axis, among all the subdomains created by cutting orthogonally to one of the primary axes, cutting orthogonally to the axis with the largest $\theta_i$ gives the maximum expected squared covariance.

Theorems 1 and 2, along with the property of covariance function (17) that larger values of $\theta_i$ imply a faster covariance decay along the $i$th axis, provide a partitioning policy for the domain $\Omega$: cutting orthogonally to the direction of the fastest covariance decay reduces the upper bound of expected error (15) and, therefore, gives a more accurate covariance approximation in each subdomain. The policy of cutting orthogonally to the direction of the fastest covariance decay also minimizes the correlation between neighboring subdomains, because the covariance across each boundary decays faster than in any other direction.

However, we note that the direction of the fastest covariance decay may not necessarily be a primary axis of the input domain. To overcome this drawback, we relax the restriction of choosing one of the primary axes as the direction of the fastest covariance decay by using a general form of the squared exponential covariance function, $K(\mathbf{x}, \mathbf{x}') = \exp\{-\tfrac{1}{2}(\mathbf{x} - \mathbf{x}')^\top\mathbf{M}(\mathbf{x} - \mathbf{x}')\}$, where $\mathbf{M}$ is a positive definite matrix (Rasmussen and Williams, 2006). For the purpose of this discussion, we define $\mathbf{M}$ through a unit direction vector $\mathbf{u}$ in the input space and a joint length-scale parameter $\ell$, i.e., $\mathbf{M} = \ell^{-2}\,\mathbf{u}\mathbf{u}^\top$, to obtain the following covariance function,

$K(\mathbf{x}, \mathbf{x}') = \exp\left\{-\frac{1}{2\ell^2}\left(\mathbf{u}^\top(\mathbf{x} - \mathbf{x}')\right)^2\right\},$   (18)

which involves the dot product $\mathbf{u}^\top(\mathbf{x} - \mathbf{x}')$. This means that for a given distance $\|\mathbf{x} - \mathbf{x}'\|$, the angle between $\mathbf{u}$ and $\mathbf{x} - \mathbf{x}'$ determines the covariance. In particular, the direction $\mathbf{u}$ itself has relatively the highest rate of covariance decay.

Although in practice a single direction of fastest covariance decay may not exist, fitting covariance function (18) to the data using maximum likelihood estimation (MLE) finds the best choice of $\mathbf{u}$ under the MLE criterion. Therefore, under the GP assumptions, we maximize the logarithm of the likelihood function to find the optimal value of the vector $\mathbf{u}$,

$\max_{\mathbf{u}} \;\; -\tfrac{1}{2}\,\mathbf{y}^\top\left(\mathbf{K}_{\mathbf{u}} + \sigma^2\mathbf{I}\right)^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log\left|\mathbf{K}_{\mathbf{u}} + \sigma^2\mathbf{I}\right| \;-\; \tfrac{N}{2}\log 2\pi,$   (19)

where $\mathbf{K}_{\mathbf{u}}$ is the covariance matrix formed based on covariance function (18).

Here, since we only want to find the direction $\mathbf{u}$, the variance and length-scale parameters are nuisance parameters. Therefore, to shrink the parameter space, we fix them to small values after standardizing the data.

Note that optimization problem (19) is of $\mathcal{O}(N^3)$, which is the same order of complexity as the original problem. However, since the output of optimization (19) is merely used to find a desired direction, and is not used for prediction, we utilize a small subset of the data of size $s \ll N$. Further, since $\mathbf{u}$ is a unit direction vector, we write it in terms of its components and add the unity constraint $\|\mathbf{u}\|_2 = 1$ to the optimization problem. Consequently,

$\max_{\mathbf{u}} \;\; -\tfrac{1}{2}\,\mathbf{y}_s^\top\left(\mathbf{K}_{\mathbf{u},s} + \sigma^2\mathbf{I}\right)^{-1}\mathbf{y}_s \;-\; \tfrac{1}{2}\log\left|\mathbf{K}_{\mathbf{u},s} + \sigma^2\mathbf{I}\right| \quad \text{subject to} \quad \|\mathbf{u}\|_2 = 1,$   (20)

where $\mathbf{y}_s$ is the response vector of the small subset of data and $\mathbf{K}_{\mathbf{u},s}$ is the covariance matrix evaluated by covariance function (18) on the same subset (see Appendix E for solving optimization problem (20) using projected gradient descent (Nesterov and Nemirovskii, 1994)).
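Below is a minimal sketch of this direction-finding step, written under the rank-one directional kernel assumed in our reading of (18). It uses a normalized central finite-difference gradient and a projection back onto the unit sphere for brevity; the paper's Appendix E presumably uses analytic derivatives. All names, the subsample, and the kernel parameters are illustrative.

```python
import numpy as np

def directional_kernel(X, u, ell=0.3):
    """Covariance in the style of (18): decays only along the unit direction u."""
    proj = X @ u
    return np.exp(-0.5 * (proj[:, None] - proj[None, :]) ** 2 / ell**2)

def neg_log_likelihood(u, X, y, noise=0.1):
    """Negative GP log-likelihood of the subsample under the directional kernel."""
    K = directional_kernel(X, u) + noise**2 * np.eye(len(X))
    return 0.5 * (y @ np.linalg.solve(K, y) + np.linalg.slogdet(K)[1])

def find_cut_direction(X, y, iters=150, step=0.05, eps=1e-4, seed=0):
    """Projected gradient descent over the unit sphere, cf. optimization problem (20)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(X.shape[1])
    u /= np.linalg.norm(u)
    for _ in range(iters):
        grad = np.array([(neg_log_likelihood(u + eps * e, X, y)
                          - neg_log_likelihood(u - eps * e, X, y)) / (2 * eps)
                         for e in np.eye(X.shape[1])])     # central finite differences
        grad /= np.linalg.norm(grad) + 1e-12               # normalized step for robustness
        u = u - step * grad
        u /= np.linalg.norm(u)                             # project back onto the unit sphere
    return u

rng = np.random.default_rng(2)
Xs = rng.uniform(0.0, 1.0, (200, 3))                       # small subsample, s << N
ys = np.sin(8.0 * Xs[:, 0]) + 0.05 * rng.standard_normal(200)
print(find_cut_direction(Xs, ys))                          # ideally aligns with the first axis (up to sign)
```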

We also note that optimization (20) finds the direction of the fastest covariance decay independent of the assumptions stated for Theorems 1 and 2. Our experiments in Section 4.3.2 show that the directions found by optimization (20) can significantly increase the accuracy of SPLK, even if the original input domains are not hypercubes or if the data points are not uniformly distributed. We refer the reader to Appendix D for intuition behind the theoretical results presented above.

3.3 Computational complexity analysis of SPLK

This section presents the computational complexity analysis for SPLK. To this end, we look at the computational costs of calculating the local mean and variance predictors in Section 3.1, and finding the direction of cut in Section 3.2. In addition, we analyze training the boundary functions presented in Appendix F.1 and training the local models presented in Appendix F.3.

Calculating the local mean and variance predictors in each subdomain (explained in Section 3.1 and Appendix A) requires inverting the low-rank covariance matrix of the subdomain and the covariance matrix associated with the boundary control points. Using the Woodbury–Sherman–Morrison matrix inversion lemma (Hager, 1989), the computational complexity of inverting the low-rank covariance matrix of a subdomain with $n$ data points and $m$ local pseudo-inputs, where $m \ll n$, is of the order of $\mathcal{O}(nm^2)$; therefore, the cost of calculating each local mean and variance predictor is dominated by this term together with the cost of the boundary system. Also, training each local model (explained in Appendix F.3) requires maximizing the local likelihood function (69). Snelson (2007) shows that the cost of finding the derivatives and maximizing (69) is of the order of $\mathcal{O}(nm^2)$. Therefore, with $\bar{m}$ denoting the average number of pseudo-inputs in each subdomain and $\bar{c}$ the average number of control points on each boundary, the total cost of calculating the local mean and variance predictors and training the local models scales linearly in the total number of data points and quadratically in $\bar{m}$, with an additional term driven by $\bar{c}$ for the boundary conditions.
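As an illustration of why the low-rank structure makes this inversion cheap, the sketch below applies the Woodbury identity to a diagonal-plus-low-rank matrix of the kind that arises in the SPGP likelihood; the sizes, the core matrix, and the function name are arbitrary choices of ours.

```python
import numpy as np

def woodbury_solve(d, U, C, b):
    """Solve (diag(d) + U C U^T) x = b using only an m x m dense solve,
    so the cost is O(n m^2) rather than the O(n^3) of a direct solve."""
    Dinv_b = b / d
    Dinv_U = U / d[:, None]
    core = np.linalg.inv(C) + U.T @ Dinv_U            # m x m matrix
    return Dinv_b - Dinv_U @ np.linalg.solve(core, U.T @ Dinv_b)

rng = np.random.default_rng(3)
n, m = 2000, 30
U = rng.standard_normal((n, m))                       # low-rank factor of the covariance
C = np.eye(m)                                         # core term of the low-rank part
d = 1.0 + rng.uniform(0.0, 1.0, n)                    # diagonal correction plus noise term
b = rng.standard_normal(n)
x_fast = woodbury_solve(d, U, C, b)
x_slow = np.linalg.solve(np.diag(d) + U @ C @ U.T, b) # dense reference solve
print(np.max(np.abs(x_fast - x_slow)))                # agreement up to numerical round-off
```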

Furthermore, since we train the boundary function (68) (explained in Appendix F.1) using full GPR on the set of data points neighboring each boundary, the computational complexity of training the boundary functions is cubic in the average number of neighboring data points per boundary. Also, finding the direction of cut through solving optimization problem (20) is dominated by the inversion of the $s \times s$ covariance matrix of the subsample, which is of the order of $\mathcal{O}(s^3)$.

Consequently, the total computational complexity of SPLK is the sum of the terms above. We note that as the dimension of the training data increases, more control points are required on the boundaries, so the number of control points implicitly depends on the dimension; however, since the current study focuses on moderate-dimensional problems, the complexity of SPLK is dominated by the cost of training the local models, under the assumption that the numbers of control points and boundary-neighboring data points do not grow with the size of the dataset. Section 3.4 will discuss this assumption and the choice of other tuning parameters.

3.4 Choice of the tuning parameters of SPLK

This section presents some guidelines for the selection of the tuning parameters of SPLK: the density of local pseudo-inputs in each subdomain, the number of subdomains $k$, the number of control points on each boundary, the size $s$ of the subset used to find the cutting direction, and the average number of data points neighboring each boundary.

As the average number of local pseudo-inputs in each subdomain, $\bar{m}$, increases, the accuracy of SPLK increases at the expense of higher computation time. Such a trade-off rules out an “optimal” value for $\bar{m}$. Williams et al. (2002) show that as the eigenspectrum of the underlying covariance function decays more quickly, given a fixed set of pseudo-inputs, the Nyström approximation becomes more accurate. Therefore, the choice of $\bar{m}$ depends on the covariance structure of the function of interest. However, since SPLK optimizes the distribution of pseudo-inputs in each subdomain using the SPGP approximation, SPLK generally requires a smaller number of local pseudo-inputs compared to other approximation methods that use ad hoc selection of pseudo-inputs (Snelson, 2007). In order to have a computationally efficient algorithm, we suggest tying $\bar{m}$ to the subdomain size through a tuning parameter that determines the density of pseudo-inputs in each subdomain. Our experiments in Section 4 show that setting this density to a small value results in efficient computation time and relatively high accuracy. Alternatively, we note that a Bayesian approach can also be used for the selection of pseudo-inputs (Pourhabib et al., 2014); however, the computation time is then significantly increased by the Markov chain Monte Carlo sampling required in the Bayesian approach.

Similar to $\bar{m}$, a trade-off exists between accuracy and computation time for $k$, the number of subdomains. As mentioned earlier, regardless of the approach used for handling the boundary conditions, a larger number of subdomains reduces the computation time as well as the prediction accuracy; smaller local models can be trained more efficiently but result in a larger number of boundaries, which in turn reduces the overall accuracy. This is because the accuracy of the local models decreases in regions close to their boundaries. We suggest choosing $k$ such that each subdomain contains between 500 and 5,000 data points. Based on our experiments, choosing a $k$ that results in subdomains with more than roughly 5,000 data points makes the parameter estimation of each local model computationally inefficient. On the other hand, a value of $k$ that results in subdomains with fewer than roughly 500 data points creates too many boundaries, which reduces the accuracy.

Next, we discuss the number of control points on each boundary. As the dimension of the input domain increases, we need to locate more control points to efficiently handle the boundary conditions. We suggest setting the number of control points according to the dimension of the boundary, through a density parameter that determines how densely the control points cover each boundary space. Moreover, in order to balance the computation time between training the subdomains and handling the boundary conditions, we suggest an upper bound on the number of control points per boundary. Theoretically, SPLK can be applied to even higher-dimensional spaces, but as the dimension of the input domain increases, this upper bound forces a sparser distribution of the control points (Park and Apley, 2018). Also, SPLK uses a uniform distribution of control points on the boundaries (see Appendix F.1), which might not be efficient in higher dimensions due to the sparsity of the control points. Therefore, we do not recommend the application of SPLK to very high-dimensional spaces. Our experiments in Section F.2 show that a small control-point density provides satisfactory results in terms of both computation time and accuracy.

As mentioned in Section 3.2, a small subset of data of size $s$ is used merely to find a desired direction for applying the cutting hyperplanes, and as such we suggest $s \ll N$. For our experiments in Section 4, we choose a small $s$ to find the cutting direction through solving optimization problem (20), which results in a small computational overhead.

Finally, for the choice of the average number of neighboring data points of each boundary, we suggest bounding the maximum distance of the neighboring data points to their associated boundary by a fraction of the width of each subdomain (see Section F.1). This choice ensures that the local data points used for training the boundary functions are reasonably close to the boundaries. Moreover, assuming the data points are uniformly distributed in the input domain, we cap the number of neighboring data points for subdomains with sizes ranging between 500 and 5,000, which reduces the computational overhead of training the boundary functions.

4 Experimental results

In this section, we apply SPLK to four real datasets and compare its performance with local probabilistic regression (LPR) (Urtasun and Darrell, 2008), Bayesian committee machine (BCM) (Tresp, 2000), bagged Gaussian process (BGP) (Chen and Ren, 2009), partially independent conditional GP (PIC) (Snelson and Ghahramani, 2007), DDM (Park et al., 2012), and PWK (Park and Apley, 2018). We use the BGP, LPR, BCM, and DDM implementations in the GPLP toolbox (Park et al., 2012), and the PWK and BCM implementations provided by the authors of Park and Apley (2018) and by Schwaighofer and Tresp (2003), respectively. We also conduct a sensitivity analysis of the parameters in SPLK and propose some guidelines for their selection.

4.1 Datasets and evaluation criteria

We implement SPLK in MATLAB and test it on four real datasets:

  1. The spatial dataset, TCO, which contains 65000 observations, collected by the NIMBUS7 satellite for NASA’s Total Ozone Mapping Spectrometer (TOMS) project (https://www.nodc.noaa.gov). The global measurement was conducted on a two-dimensional grid, i.e., latitude and longitude, from 1978 to 2003 on a daily basis. We select the measurements of “total column of ozone” on this grid for the data collected on January 1, 2003. The dataset is highly non-stationary and is an appropriate dataset for comparing SPLK and DDM because it is constructed on a two-dimensional input space.

  2. The spatial dataset, Levitus, which contains 56000 observations, is a part of the world ocean atlas that measures the annual means of major ocean parameters (http://iridl.ldeo.columbia.edu/SOURCES/.LEVITUS94). The global measurement was conducted on a three-dimensional grid, i.e., latitude, longitude, and depth, in 1994. We select the “apparent oxygen utilization” as the response variable on this grid.

Recalling that handling exogenous variables in spatial datasets also motivates this paper, we use a third real dataset.

  3. The spatial dataset, Dasilva, which contains 70000 observations, is a part of a five-volume atlas series of Surface Marine Data (http://iridl.ldeo.columbia.edu/SOURCES/.DASILVA/.SMD94/.halfbyhalf/.climatology/). The global measurement was conducted on a two-dimensional grid, i.e., latitude and longitude, on a monthly basis in 1994. We select three exogenous variables, “constrained outgoing heat flux”, “zonal heat flux”, and “sea minus air temperature”, and the objective is to predict “long wave Chi sensitivity” based on the data collected in January 1994.

Although SPLK was developed to handle spatial datasets, the methodology is general and can be efficiently applied to non-spatial data that have a moderate number (say ten or fewer) of exogenous variables. We use a fourth dataset to demonstrate the performance of SPLK on non-spatial data.

  4. The non-spatial dataset, Protein, which contains 46000 measurements, is a collection of Physicochemical Properties of Protein Tertiary Structure (http://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure). This dataset contains nine “physicochemical properties” of proteins as explanatory variables and “size of the residue” as the response variable.

We randomly partition each dataset into a training set and a test set. We use three measures to evaluate the performance of each method. The first one is the measure of prediction accuracy, which is assessed by the Mean Squared Error (MSE),

$\mathrm{MSE} = \frac{1}{n_t}\sum_{i=1}^{n_t}\left(y_i^* - \hat{y}_i^*\right)^2,$   (21)

where $y_i^*$ is the noisy observation at the $i$th test location, $\hat{y}_i^*$ is the mean prediction at this test location, and $n_t$ is the number of test points. The second measure is the Negative Log Predictive Density (NLPD), which takes into account uncertainty in prediction in addition to accuracy, specifically

$\mathrm{NLPD} = \frac{1}{n_t}\sum_{i=1}^{n_t}\left[\frac{\left(y_i^* - \hat{y}_i^*\right)^2}{2\hat{\sigma}_i^2} + \frac{1}{2}\log\left(2\pi\hat{\sigma}_i^2\right)\right],$   (22)

where $\hat{\sigma}_i^2$ is the variance of the predictor at the $i$th test location. The third measure is the computation time, i.e., training plus testing time, which evaluates the success of SPLK in speeding up GPR. Note that the computation time on its own is not an appropriate measure, and the corresponding MSE or NLPD must also be taken into account, as a reduction in training time without an accurate prediction is not useful. Finally, variable selection is beyond the scope of the current study, as we assume that the input variables in each dataset are significant predictors that have passed a variable selection process based on domain knowledge or a statistical procedure.
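A short sketch of these two measures as we have written them in (21)–(22), assuming a Gaussian predictive density for the NLPD; the function names and the toy numbers are purely illustrative.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over the test set, cf. (21)."""
    return np.mean((y_true - y_pred) ** 2)

def nlpd(y_true, y_pred, var_pred):
    """Average negative log predictive density under Gaussian predictions, cf. (22)."""
    return np.mean(0.5 * np.log(2.0 * np.pi * var_pred)
                   + 0.5 * (y_true - y_pred) ** 2 / var_pred)

# toy usage with hypothetical test observations and predictions
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.2])
var_pred = np.array([0.05, 0.04, 0.06])
print(mse(y_true, y_pred), nlpd(y_true, y_pred, var_pred))
```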

4.2 Computation time and prediction accuracy

Here, we compare the computation time and the prediction accuracy of SPLK with those of the competing algorithms. Specifically, we consider the MSE and NLPD as functions of the computation time and plot the set of best results for each algorithm. Under this criterion, the algorithm associated with the curve closest to the origin will be superior. The parameter selection for each algorithm is as follows.

For SPLK, we solve optimization problem (20) for each dataset to find the direction of the cuts. It turns out that for the spatial datasets the best direction, based on the criterion of optimization problem (20), is one of the primary axes of the dataset domains: for dataset TCO, the best direction is the direction of the first primary axis (i.e., latitude); for dataset Levitus, it is the direction of the third primary axis (i.e., depth); and for dataset Dasilva, it is the direction of the first primary axis (i.e., latitude). For dataset Protein, which is not a spatial dataset, the best direction is not the direction of any of the primary axes of the input domain (see Section 4.3.2 for a discussion of cuts in other directions). It is insightful to observe that for the spatial datasets used in this study the solution to optimization problem (20) is aligned with one of the primary axes, which may reflect a relationship between the response surface and the underlying geology. For example, for measuring “long wave Chi sensitivity” in dataset Dasilva, the direction of the fastest change is the same as latitude; and for dataset Levitus, the covariance decays fastest when we change the depth of the measurement for “apparent oxygen utilization.”

We use the guidelines discussed in Section 3.4 for choosing the tuning parameters. We choose the number of subdomains $k$ so as to keep the number of local data points in each subdomain within the suggested range, except for the dataset Levitus: since we cut the domain of the data along the third direction, which has 33 distinct levels, we choose $k$ from a different set of values. Our experiments in Section F.2 suggest that setting the control-point density to small values results in good performance and that increasing it does not affect the algorithm's accuracy much. Therefore, we set the density parameter to 3 for datasets TCO, Levitus, and Dasilva and to 2.2 for dataset Protein. We also fix the pseudo-input density for all the datasets (see Section 4.3 for a discussion of varying its value). Note that as $k$ increases, computation time decreases, so the points with smaller computation times belong to larger values of $k$ in Figures 1 and 2.

The tuning parameters for DDM are the number of control points on each boundary and the number of subdomains. For the two-dimensional dataset TCO, we fix the number of control points and choose the number of subdomains so as to keep the average size of the subdomains between 100 and 600, as instructed in Park et al. (2011). As expected, for smaller numbers of subdomains, i.e., larger subdomains, the efficiency of the algorithm deteriorates in terms of computation time; therefore, the points with higher computation times belong to smaller numbers of subdomains in Figures 1 and 2.

For PWK, the major tuning parameters are the number of boundary pseudo-observations and the number of subdomains. Similar to DDM, PWK suggests keeping the average size of the subdomains between 100 and 600; therefore, we choose the number of subdomains accordingly. We also choose the number of boundary pseudo-observations from the set suggested in Park and Apley (2018). Among the 18 possible combinations of the two parameters, we choose five combinations that have different computation times for the sake of clear demonstration. In Figures 1 and 2, the points with higher computation times belong to smaller numbers of subdomains.

PIC, which is the localized version of SPGP, has two tuning parameters: the number of local models and the number of pseudo-inputs. We use $k$-means clustering to partition the domain of the data into local models. We note that the number of pseudo-inputs is the major tuning parameter affecting the algorithm's computation time. Therefore, we fix the number of local models to a reasonable value and vary the number of pseudo-inputs. After testing various values of the number of local models in the range of 100 to 800, we find a value that is a reasonable choice for our experiments and use it for all four datasets. In Figures 1 and 2, the points with higher computation times belong to larger numbers of pseudo-inputs.

For BCM, we use $k$-means clustering to partition the domain of the data into local experts, similar to PIC, and choose the number of local experts from a set of candidate values. In Figures 1 and 2, the points with higher computation times belong to the configurations with larger local experts.

LPR has three major tuning parameters: the number of local experts, the size of each local expert, and the size of the subset used for local hyperparameter learning. The locations of the data points used for local hyperparameter learning can be chosen randomly or by clustering; for the sake of a fair comparison, we use clustering to choose these locations. Moreover, we choose the values of the three parameters from candidate sets. For each dataset, we fix one of the parameters to a value that results in better performance in terms of computation time and MSE, and choose five combinations, out of the nine possible combinations of the other two, that have different computation times.

Last, BGP has two tuning parameters: the number of bags and the number of data points assigned to each bag. Based on our experiments, the bag size is the major tuning parameter affecting the algorithm's computation time; therefore, we vary the bag size over a set of values and fix the number of bags to a reasonable number. After varying the number of bags in the range 10 to 80, we chose 40 as its fixed value. In Figures 1 and 2, the data points with higher computation times belong to larger bag sizes.

For the two-dimensional dataset TCO, SPLK, DDM, PWK, and BCM perform almost the same, but they are faster and more accurate than the other algorithms, as shown in Figure 1(a). However, in terms of NLPD, SPLK, DDM, and PWK perform better than BCM, as shown in Figure 2(a). We attribute BCM's higher NLPD values to the underestimation of the predictive variance in the BCM algorithm. Also, despite the fact that SPLK uses a low-rank covariance approximation, it performs as efficiently as DDM and PWK, mainly because it creates fewer boundaries, thus compensating for the inaccuracy of the low-rank approximations in the subdomains. Note that for the other datasets we cannot compare the performance of DDM with the other competing algorithms, because DDM's implementation is restricted to one- or two-dimensional spaces.

For the three-dimensional dataset Levitus, SPLK, LPR, and PWK outperform the other algorithms in terms of MSE, as shown in Figure 1(b). However, in terms of NLPD, the performance of SPLK and PWK is superior (Figure 2(b)), meaning that SPLK and PWK obtain a better goodness of fit compared to LPR.

For the five-dimensional dataset Dasilva, SPLK, PWK, and LPR outperform the other competing algorithms, as shown in Figures 1(c) and 2(c). Comparing SPLK and PWK, however, indicates that SPLK can reach a higher level of accuracy in terms of MSE, while the lower predictive variance gives PWK better NLPD values. The performance of SPLK for this dataset can be better understood by noting that as the covariance decays faster in one direction, i.e., as the corresponding $\theta_i$ increases, partitioning parallel to that direction reduces the prediction accuracy close to the boundaries. This is due to the fact that the short range of covariance allows a higher degree of mismatch on the boundaries, as shown through a simulation study in Section 5.1 of Park and Apley (2018). However, because SPLK avoids partitioning along the direction of the largest $\theta_i$, it partially reduces the degree of mismatch on the boundaries. This becomes particularly helpful when the rates of covariance decay differ greatly in various directions, and as such SPLK performs better compared to the other algorithms that do not consider the covariance structure in partitioning the domain. In fact, for the dataset Dasilva, the third, fourth, and fifth directions have much lower rates of covariance decay compared to the first two directions. We further investigate this hypothesis by comparing the performance of the competing algorithms on a simulated dataset having a covariance structure similar to Dasilva in Appendix G.

Finally, for the nine-dimensional dataset Protein, SPLK, PIC, and PWK perform much better than the other algorithms, as shown in Figures 1(d) and 2(d). However, similar to the analysis of TCO and Levitus, the lower NLPD values of SPLK and PWK make them more desirable than PIC. We note that, unlike for the other datasets in this study, we do not set the control-point density parameter to 3, since the additional control points on each boundary slow down SPLK's performance without having a significant effect on accuracy (see Section F.2). Therefore, we set the density to a smaller value of 2.2.

Figure 1: MSE versus computation time for the four datasets: (a) TCO, (b) Levitus, (c) Dasilva, (d) Protein.

Figure 2: NLPD versus computation time for the four datasets: (a) TCO, (b) Levitus, (c) Dasilva, (d) Protein.

4.3 Sensitivity analysis

This section describes the sensitivity analysis we conduct on the tuning parameters of SPLK. Section 4.3.1 discusses some guidelines for selecting the size of the subdomains and the density of local pseudo-inputs. Section 4.3.2 explains the significance of cutting from various directions. We discuss the effect of the number of control points in Section F.2.

4.3.1 Number of cuts and local pseudo-inputs

In this section, we show the trade-off between accuracy and computation time for the choices of the number of subdomains and the density of local pseudo-inputs. In our experiment, for each dataset, we vary the number of subdomains, $k$, and the density of local pseudo-inputs, and use the values of MSE, NLPD, and computation time as the measures of efficiency. To illustrate the effect of various settings on the algorithm's efficiency, we plot the values of MSE, NLPD, and computation time for varying $k$ and a fixed density, as shown by the curves in Figures 3 and 4.

Figures 3(e), 3(f), 4(e), and 4(f) show that as $k$ increases, i.e., as the size of the subdomains decreases, SPLK runs faster for a fixed value of the pseudo-input density. Moreover, the curves belonging to smaller density values always lie below the curves with larger density values, meaning that as the density of the pseudo-inputs increases, the algorithm becomes slower. Consequently, the algorithm takes longer to run as the size of the subdomains or the number of local pseudo-inputs increases.

On the other hand, Figures 3(a), 3(b), 4(a), and 4(b) show a positive relationship between $k$ and MSE, i.e., for a fixed value of the pseudo-input density, SPLK performs more accurately in terms of MSE as the size of the subdomains increases. Moreover, comparing the curves shows that as the density of the pseudo-inputs increases, SPLK becomes more accurate for a fixed value of $k$. Figures 3(c), 3(d), 4(c), and 4(d) show the same trend for the values of NLPD. Therefore, we conclude that our algorithm attains higher accuracy in terms of MSE and NLPD by increasing the density of local pseudo-inputs or enlarging the size of the subdomains.

In summary, by increasing the size of the subdomains or the density of local pseudo-inputs, the algorithm's accuracy improves but its computation time increases. Therefore, we suggest using sufficiently large values of the pseudo-input density together with smaller subdomains, because, as shown in Figures 3 and 4, the MSEs remain small even with a large number of subdomains while computation times stay relatively low.

Figure 3: MSE, NLPD, and computation time versus the number of subdomains $k$ for datasets TCO and Levitus (panels (a)–(b): MSE; (c)–(d): NLPD; (e)–(f): computation time). Each curve is associated with a value of the local pseudo-input density.
Figure 4: MSE, NLPD, and computation time versus the number of subdomains $k$ for datasets Dasilva and Protein (panels (a)–(b): MSE; (c)–(d): NLPD; (e)–(f): computation time). Each curve is associated with a value of the local pseudo-input density.

4.3.2 Direction of cuts

This section demonstrates how cutting from different directions affects SPLK's performance. To assess the significance of cutting from the direction obtained from optimization (20), we fix the pseudo-input density, vary the values of $k$ and the direction of cuts for each dataset, and measure the accuracy of prediction in terms of MSE. Note that since there are infinitely many possible cutting directions, for the sake of comparison we only consider the best direction, i.e., the direction found through solving optimization problem (20), along with the directions of the primary axes of the input space for each dataset. In Figure 5, each curve shows the trend of changes in MSE for a particular direction and varying values of $k$.

For dataset TCO, the best direction of cuts is the direction of the first primary axis, as shown in Figure 5(a). Cutting from this direction attains higher accuracy for the varying values of $k$ compared to the direction of the second primary axis.

For dataset Levitus, the best direction of cuts is the direction of the third primary axis, as shown in Figure 5(b). Cutting from this direction attains higher accuracy compared to the directions of the other primary axes.

For dataset Dasilva, the best direction of cuts is the direction of the first primary axis, as shown in Figure 5(c). Cutting from this direction attains much higher accuracy compared to the directions of the third, fourth, and fifth primary axes. However, the performance of cutting from the direction of the second primary axis is almost the same as that of the direction found through solving optimization problem (20). This can be justified by considering the objective values of optimization (20) for these two directions: the objective values for the directions of the first and second primary axes are very close to each other and far better than those of the other directions. Therefore, we observe a similarly accurate performance when cutting from either of these two directions compared to the other directions.

Finally, for dataset Protein, the best direction of cuts, which is not the direction of one of the primary axes of the input domain, is compared with the directions of the first six primary axes, as shown in Figure 5(d). Cutting from the direction found by solving optimization problem (20) attains much higher accuracy compared to the directions of the second and third primary axes, and slightly higher accuracy than the direction of the sixth primary axis.

Figure 5: Effects of cutting directions on MSE for the four datasets: (a) TCO, (b) Levitus, (c) Dasilva, (d) Protein.

5 Summary

GPR is a powerful tool in the analysis of spatial systems, but it does not scale efficiently to large datasets. In addition, many spatial datasets have highly heterogeneous covariance structures which cannot be modeled effectively with a single covariance function, and the problem is exacerbated when the spatial data contains environmental variables. This paper proposed Sparse Pseudo-input Local Kriging (SPLK), which simultaneously addressed scalability and heterogeneity by partitioning the data domain into subdomains. The partitioning used parallel hyperplanes to create non-overlapping subdomains and fitted a sparse GPR to the data within each subdomain, which allowed the selected partitions to have large numbers of data points. Two theorems were proposed, and an algorithm was developed to find the desired hyperplanes, which resulted in more accurate approximations of the covariance structures in each subdomain. SPLK also alleviated the discontinuity of the overall prediction surface by putting control points on the boundary of neighboring partitions. SPLK was applied to a spatial dataset with exogenous variables, two spatial datasets without exogenous variables, and a non-spatial dataset. The latter demonstrated that the methodology was general and was not restricted to spatial datasets. The results showed that SPLK maintained a good balance between prediction accuracy and computation time.

The limitations of SPLK could be better understood by using a larger number of real spatial datasets with exogenous variables. We also suggest four paths for future research. First, more flexible cuts, such as parallel hyper-curves or concentric hyper-spheres, which give the same number of boundaries as parallel hyperplanes, could be used. Second, from a theoretical perspective, results that relate the expected covariance function to the expected error in the low-rank covariance approximation should be developed under less restrictive assumptions. Third, the tuning parameter that determines the density of pseudo-inputs in each subdomain could potentially be determined with more rigorous approaches, such as using an estimated rate of decay of the eigenspectrum of the covariance matrix. Finally, the proposed method could benefit from more sophisticated techniques for sampling control points, as opposed to using a uniform distribution. This would especially improve the model's performance on higher-dimensional problems, in which the density of control points decreases close to the boundaries.

References

  • Bazaraa et al. (2013) Bazaraa, M. S., H. D. Sherali, and C. M. Shetty (2013). Nonlinear Programming: Theory and Algorithms. John Wiley & Sons.
  • Becker and Rannacher (2001) Becker, R. and R. Rannacher (2001). An optimal control approach to a posteriori error estimation in finite element methods. Acta Numerica 10, 1–102.
  • Chen and Ren (2009) Chen, T. and J. Ren (2009). Bagging for Gaussian process regression. Neurocomputing 72(7-9), 1605–1610.
  • Cressie (1990) Cressie, N. (1990). The origins of kriging. Mathematical Geology 22(3), 239–252.
  • Damian et al. (2003) Damian, D., P. D. Sampson, and P. Guttorp (2003). Variance modeling for nonstationary spatial processes with temporal replications. Journal of Geophysical Research: Atmospheres 108(D24).
  • Friedman et al. (2009) Friedman, J., T. Hastie, and R. Tibshirani (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction, Volume 1. Springer.
  • Fuglstad et al. (2015) Fuglstad, G.-A., F. Lindgren, D. Simpson, and H. Rue (2015). Exploring a new class of non-stationary spatial Gaussian random fields with varying local anisotropy. Statistica Sinica, 115–133.
  • Furrer et al. (2006) Furrer, R., M. G. Genton, and D. Nychka (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics 15(3), 502–523.
  • Gao et al. (2014) Gao, S., Z. Zhu, S. Liu, R. Jin, G. Yang, and L. Tan (2014). Estimating the spatial distribution of soil moisture based on Bayesian maximum entropy method with auxiliary data from remote sensing. International Journal of Applied Earth Observation and Geoinformation 32, 54–66.
  • Gramacy and Lee (2008) Gramacy, R. B. and H. K. H. Lee (2008). Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association 103, 1119–1130.
  • Haas (1990) Haas, T. C. (1990). Kriging and automated variogram modeling within a moving window. Atmospheric Environment. Part A. General Topics 24(7), 1759–1769.
  • Hager (1989) Hager, W. W. (1989). Updating the inverse of a matrix. SIAM Review 31(2), 221–239.
  • Kaufman et al. (2008) Kaufman, C. G., M. J. Schervish, and D. W. Nychka (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association 103(484), 1545–1555.
  • Kim et al. (2005) Kim, H.-M., B. K. Mallick, and C. Holmes (2005). Analyzing nonstationary spatial data using piecewise Gaussian processes. Journal of the American Statistical Association 100(470), 653–668.
  • Lindgren et al. (2011) Lindgren, F., H. Rue, and J. Lindström (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(4), 423–498.
  • Nesterov and Nemirovskii (1994) Nesterov, Y. and A. Nemirovskii (1994). Interior-Point Polynomial Algorithms in Convex Programming, Volume 13. SIAM.
  • Paciorek and Schervish (2006) Paciorek, C. J. and M. J. Schervish (2006). Spatial modelling using a new class of nonstationary covariance functions. Environmetrics: The official journal of the International Environmetrics Society 17(5), 483–506.
  • Park and Apley (2018) Park, C. and D. Apley (2018). Patchwork Kriging for large-scale Gaussian process regression. Journal of Machine Learning Research 19(7).
  • Park and Huang (2016) Park, C. and J. Z. Huang (2016). Efficient computation of Gaussian process regression for large spatial data sets by patching local Gaussian processes. Journal of Machine Learning Research 17(174), 1–29.
  • Park et al. (2011) Park, C., J. Z. Huang, and Y. Ding (2011). Domain decomposition approach for fast Gaussian process regression of large spatial data sets. Journal of Machine Learning Research 12, 1697–1728.
  • Park et al. (2012) Park, C., J. Z. Huang, and Y. Ding (2012). GPLP: a local and parallel computation toolbox for Gaussian process regression. Journal of Machine Learning Research 13, 775–779.
  • Philip (2007) Philip, J. (2007). The probability distribution of the distance between two random points in a box. TRITA MAT 10(7).
  • Pourhabib et al. (2014) Pourhabib, A., F. Liang, and Y. Ding (2014). Bayesian site selection for fast Gaussian process regression. IIE Transactions 46(5), 543–555.
  • Quiñonero-Candela and Rasmussen (2005) Quiñonero-Candela, J. and C. E. Rasmussen (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research 6, 1939–1959.
  • Rasmussen and Ghahramani (2002) Rasmussen, C. E. and Z. Ghahramani (2002). Infinite mixtures of Gaussian process experts. Advances in Neural Information Processing Systems 2, 881–888.
  • Rasmussen and Williams (2006) Rasmussen, C. E. and C. K. I. Williams (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.
  • Sampson and Guttorp (1992) Sampson, P. D. and P. Guttorp (1992). Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association 87(417), 108–119.
  • Schmidt and O’Hagan (2003) Schmidt, A. M. and A. O’Hagan (2003). Bayesian inference for non-stationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(3), 743–758.
  • Schwaighofer and Tresp (2003) Schwaighofer, A. and V. Tresp (2003). Transductive and inductive methods for approximate Gaussian process regression. In Advances in Neural Information Processing Systems, pp. 977–984. MIT Press.
  • Smola and Schölkopf (2000) Smola, A. J. and B. Schölkopf (2000). Sparse greedy matrix approximation for machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann.
  • Snelson (2007) Snelson, E. (2007). Flexible and Efficient Gaussian Process Models for Machine Learning. Dissertation, Gatsby Computational Neuroscience Unit, University College London, London, England.
  • Snelson and Ghahramani (2007) Snelson, E. and Z. Ghahramani (2007). Local and global sparse Gaussian process approximations. In International Conference on Artificial Intelligence and Statistics 11. Society for Artificial Intelligence and Statistics.
  • Tresp (2000) Tresp, V. (2000). A Bayesian committee machine. Neural Computation 12(11), 2719–2741.
  • Urtasun and Darrell (2008) Urtasun, R. and T. Darrell (2008). Sparse probabilistic regression for activity-independent human pose inference. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE.
  • Williams et al. (2002) Williams, C. K., C. E. Rasmussen, A. Schwaighofer, and V. Tresp (2002). Observations on the Nyström method for Gaussian process prediction.
  • Williams and Seeger (2001) Williams, C. K. and M. Seeger (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pp. 682–688.
  • Zhang and Du (2008) Zhang, H. and J. Du (2008). Covariance tapering in spatial statistics. Positive definite functions: From Schoenberg to space-time challenges, 181–196.
  • Zhang et al. (2017) Zhang, J., X. Li, Q. Liu, L. Zhao, and B. Dou (2017). An extended kriging method to interpolate soil moisture data measured by wireless sensor network. Sensors 17(6), 1390.

Appendix A Solving optimization problem (11)

Because optimization problem (11) has a convex objective function and affine constraints, the duality gap between its primal and dual problems is zero by the Lagrange duality principle (Bazaraa et al., 2013). This allows us to transform the constrained problem (11) into an unconstrained one and maximize the Lagrangian of (11) instead,

(23)

where is the number of control points located on the boundaries of subdomain , and is the vector of Lagrange multipliers.
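To spell out the duality argument, the following generic statement summarizes the zero-duality-gap property used above; the notation is ours and merely stands in for problem (11), with $f$ denoting its convex objective and the affine system $A\boldsymbol{\mu} = \mathbf{b}$ collecting the continuity constraints imposed at the control points:

\[
\begin{aligned}
\text{(primal)}\quad & \min_{\boldsymbol{\mu}} \; f(\boldsymbol{\mu}) \quad \text{s.t.} \quad A\boldsymbol{\mu} = \mathbf{b},\\
\text{(Lagrangian)}\quad & L(\boldsymbol{\mu},\boldsymbol{\lambda}) \;=\; f(\boldsymbol{\mu}) + \boldsymbol{\lambda}^{\top}\!\bigl(A\boldsymbol{\mu} - \mathbf{b}\bigr),\\
\text{(dual)}\quad & \max_{\boldsymbol{\lambda}} \; \min_{\boldsymbol{\mu}} \; L(\boldsymbol{\mu},\boldsymbol{\lambda}).
\end{aligned}
\]

Because $f$ is convex and the constraints are affine, the primal and dual optimal values coincide whenever the primal is feasible, so the constrained problem can be solved through the unconstrained saddle-point problem; at the optimum, stationarity of $L$ in the primal variables holds together with primal feasibility, which is what the first-order conditions below express.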

Assuming depends on the covariance between and , and depends on the covariance between and , we write and as suggested by Park et al. (2011), where is a square matrix whose size equals the number of data points in , and is the Lagrange parameter associated with that does not depend on . Consequently, we can rewrite the Lagrangian (23) as

(24)

where is a diagonal matrix with diagonal elements , and is the vector of boundary values of .

Due to the convexity of function (24), we can calculate the optimal values of and analytically by writing out the first-order necessary conditions,