Efficient Multiscale Gaussian Process Regression using Hierarchical Clustering
Abstract
Standard Gaussian Process (GP) regression, a powerful machine learning tool, is computationally expensive when applied to large datasets, and potentially inaccurate when data points are sparsely distributed in a high-dimensional feature space. To address these challenges, a new multiscale, sparsified GP algorithm is formulated, with the goal of application to large scientific computing datasets. In this approach, the data is partitioned into clusters and the cluster centers are used to define a reduced training set, resulting in an improvement over standard GPs in terms of training and evaluation costs. Further, a hierarchical technique is used to adaptively map the local covariance representation to the underlying sparsity of the feature space, leading to improved prediction accuracy when the data distribution is highly non-uniform. A theoretical investigation of the computational complexity of the algorithm is presented. The efficacy of this method is then demonstrated on smooth and discontinuous analytical functions and on data from a direct numerical simulation of turbulent combustion.
Keywords:
Gaussian Processes, Sparse regression, Clustering.
1 Introduction
The rapid growth in computing power has resulted in the generation of massive amounts of highly resolved datasets in many fields of science and engineering. Against this backdrop, machine learning (ML) is becoming a widely used tool in the identification of patterns and functional relationships, and has resulted in improved understanding of underlying physical phenomena [mjolsness2001], characterization and improvement of models [yairi1992]; [ling2015]; [eric2015], control of complex systems [gautier2015], etc. In this work, we are interested in further developing Gaussian Process (GP) regression, which is a popular supervised learning technique. While GPs have been demonstrated to provide accurate predictions of the mean and variance of functional outputs, application to large-scale datasets in high-dimensional feature spaces remains a hurdle. This is because of the high computational costs and memory requirements during the training stage and the expense of computing the mean and variance during the prediction stage. To accurately represent high-dimensional function spaces, a large number of training data points must be used. In this work, an extension to GP regression is developed with the specific goal of applicability to noisy, large-scale scientific computing datasets.
The computational complexity of GPs at the training stage is related to the kernel inversion and the computation of the log-marginal likelihood. For example, an algorithm based on the Cholesky decomposition is one well-known approach [chol]. The computational complexity of the Cholesky decomposition for a problem with $N$ training points is $O(N^3)$. For high-dimensional problems with large $N$, this can be prohibitive. Even if this cost can be reduced using, for instance, iterative methods [gumeroviter], the computation of the predictive mean and variance at a test point will be $O(N)$ and $O(N^2)$, respectively. The evaluation (or testing) time can be of great significance if GP evaluations are made frequently during a computational process. Such a situation can occur in a scientific computing setting [transprev]; [eric2015], where GP evaluations may be performed every time-step or iteration of the solver. Reducing both the training and evaluation costs while preserving the prediction accuracy is the goal of the current work.
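To make this cost structure concrete, the following sketch implements standard GP regression with a squared-exponential kernel via the Cholesky route. It is an illustrative example in Python with NumPy/SciPy, not code from this paper; the kernel, lengthscale, and noise level are assumptions chosen for the demo.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(A, B, s=0.2):
    # Squared-exponential covariance between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * s * s))

rng = np.random.default_rng(0)
N = 200
X = rng.uniform(size=(N, 1))                      # training inputs
y = np.sin(6 * X[:, 0]) + 0.1 * rng.standard_normal(N)

sigma_n = 0.1
K = rbf_kernel(X, X) + sigma_n**2 * np.eye(N)     # O(N^2) storage
L = cho_factor(K, lower=True)                     # O(N^3) factorization
alpha = cho_solve(L, y)                           # "weights" K^{-1} y

Xs = np.linspace(0.0, 1.0, 5)[:, None]            # test inputs
Ks = rbf_kernel(Xs, X)
mean = Ks @ alpha                                 # O(N) per test point
# variance: one triangular solve per test point -> O(N^2) each
v = cho_solve(L, Ks.T)
var = rbf_kernel(Xs, Xs).diagonal() - (Ks * v.T).sum(axis=1)
```

The single $O(N^3)$ factorization at training time and the $O(N^2)$-per-point variance at test time are exactly the terms the sparse approach below seeks to reduce.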
Much effort has been devoted to the construction of algorithms of reduced complexity. Among these is the family of sparse Gaussian process regression methods, which seek to strategically reduce the size of the training set or to find appropriate sparse representations of the covariance matrix through the use of induced variables. The covariance matrix, $K$, contains the pairwise covariances between the training points. Descriptions of several methods of this family are provided by Quiñonero-Candela and Rasmussen [quinonero2005], who define a set of $M$ induced variables $\bar x_i$, such that the prediction of a function is given by
(1) $\bar f(x_*) = k_M(x_*)^{\top}\left(K_{MN} K_{NM} + \sigma_n^2 K_{MM}\right)^{-1} K_{MN}\, y,$
where $k_M(x_*)$ is a vector whose elements are $k(x_*, \bar x_i)$, and $K_{MN}$ (with $K_{NM} = K_{MN}^{\top}$) contains the covariances $k(\bar x_i, x_j)$ between the induced variables and the training points. $M$ is the number of induced variables used in this representation, and $d$ is the dimensionality of the inputs, i.e. $x \in \mathbb{R}^d$. $x_*$ is a single test input vector, and $k(\cdot,\cdot)$ is the covariance function. From this, it is evident that if $M \ll N$, the matrix inversion will be less costly to perform.
The process of determining induced variables that optimally represent the training set can introduce additional computational costs. The most basic method is to randomly select a fraction of the data points and define them as the new training set, i.e. choosing from $(X, y)$, where $X$ and $y$ are the original training inputs and outputs. Seeger et al. [seeger2003], Smola and Bartlett [smola2001], and others have developed selection methods that provide better test results than random selection. In particular, Walder et al. [walder2008] have extended the concept to allow the covariance function to vary such that its hyperparameters are unique to each induced variable. These methods often depend on optimizing modified likelihoods, or on approximating the true log-marginal likelihood [titsias2009]. Methods that introduce new induced variables, instead of using a subset of the training points, can involve solving optimization problems, again requiring the user to consider the additional computational burden.
In addition to choosing the inducing variables, one may also change the form of the covariance function from $k(x_i, x_j)$ to $\tilde k(x_i, x_j)$, where
(2) $\tilde k(x_i, x_j) = k_M(x_i)^{\top} K_{MM}^{-1}\, k_M(x_j).$
In contrast to the case where only a subset of the training data is used, all training points are considered in this approach, and the matrices involved are only of rank $M$. There are many variations on this type of manipulation of the covariance. Alternatively, one can approximate the kernel directly by drawing random samples, as described by Rahimi and Recht [recht2007]. However, altering the covariance matrix can cause undesirable behaviors when computing the test variance, since some of the resulting matrices no longer mathematically correspond to a Gaussian process. The work of Ambikasaran et al. [ambikasaran] improves the speed of GP regression through hierarchical factorization of the dense covariance matrix into block matrices, resulting in a less costly inversion process.
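As an illustration of this low-rank idea, the following sketch builds a Nyström-style rank-$M$ approximation of the covariance matrix from $M$ randomly chosen inducing inputs and compares it against the full matrix. This is a hedged demonstration of the general technique, not the specific construction of any cited paper; the kernel, lengthscale, and sizes are assumptions.

```python
import numpy as np

def rbf(A, B, s=0.3):
    # Squared-exponential covariance between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * s * s))

rng = np.random.default_rng(1)
N, M = 300, 60
X = rng.uniform(size=(N, 2))
K = rbf(X, X)                                   # full N x N covariance

idx = rng.choice(N, size=M, replace=False)      # M inducing inputs from the data
Knm = rbf(X, X[idx])                            # N x M cross-covariance
Kmm = rbf(X[idx], X[idx]) + 1e-10 * np.eye(M)   # jitter for numerical stability
K_hat = Knm @ np.linalg.solve(Kmm, Knm.T)       # rank-M approximation of K

rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

Because the approximation has rank at most $M$, downstream solves cost $O(NM^2)$ rather than $O(N^3)$, at the price of the approximation error measured by `rel_err`.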
Other methods of accelerating GP regression focus on dividing the training set into smaller sets of more manageable size. Snelson and Ghahramani [snelson2007local] employ a local GP consisting of points in the neighborhood of the test point, in addition to a simplified global estimate, to improve test results. In a similar fashion, Park and Choi [park2010] take a hierarchical approach, with the global-level GP informing the mean of the local GP. Urtasun and Darrell [urtasun2008] use the local GP directly, without induced variables. However, these methods require the test points to be either assigned to a precomputed neighborhood of training points [snelson2007local] or determined online [urtasun2008]. Both approaches to defining the neighborhood can be quite expensive depending on the training set size, though massive parallelization may mitigate the cost in some respects [keane2002]; [weiss2014].
In this work, the data is partitioned into clusters based on modified k-center algorithms, and the cluster centers are used to define a reduced training set. This leads to an improvement over standard GPs in terms of training and testing costs. The technique also adapts the covariance function to the sparsity of a given neighborhood of training points. The new technique will be referred to as Multiscale GP and abbreviated as MGP. Similar to the work of Snelson and Ghahramani [snelson2007local], this flexibility may allow MGP to produce more accurate outputs by means of a more informed perspective on the input space. The sparse GP technique of Walder et al. [walder2008] can be seen as a very general interpretation of our approach, though a key difference is that the points within our reduced training set originate entirely from the original training set, whereas [walder2008] computes new induced variables. Further, in our work, the hyperparameters are restricted by the explicit choice and hierarchy of scales in order to streamline their optimization. The work of Zhou et al. [zhou2006] also employs multiple scales of covariance functions, but it neither takes into account the neighborhoods of training points nor seeks to reduce the size of the training set.
The next section of this paper recapitulates the main aspects of standard GP regression. Section 3 introduces the philosophy behind the proposed approach and presents the algorithm. Section 4 analyses the complexity of the MGP algorithm for both training and testing. In Section 5, quantitative demonstrations are presented on simple analytical problems as well as on data derived from numerical simulations of turbulent combustion. Following this, the key contributions are summarized.
2 Preliminaries of GP regression
In this section, a brief review of standard GP regression is presented. While a detailed treatment can be found in the book by Rasmussen and Williams [rbfgp2], the description and notation below provide the necessary context for the development of the multiscale algorithm proposed in this paper.
We consider both the finite-dimensional weight-space view and the infinite-dimensional function-space view, as in Ref. [rbfgp2]. Given a training set of $N$ observations, $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is an input vector and $y_i$ is a scalar output, the goal is to obtain the predictive mean and variance at an arbitrary test point $x_*$. The training inputs are collected in the design matrix $X$, and the outputs form the vector $y$. Furthermore, it is assumed that
(3) $y = f(x) + \epsilon,$
where $\epsilon$ is Gaussian noise with zero mean and variance $\sigma_n^2$.
For low-dimensional inputs, linear regression cannot capture the variety of dependencies encountered in practice. Therefore, the input $x$ is mapped into a high-dimensional feature space, and the linear regression model is applied in that space. To avoid any confusion with the original feature space $\mathbb{R}^d$, this space will be referred to as the "extended feature space" and its dimensionality $D$ as the "size" of the system. Denoting the mapping function by $\phi(x)$, we have
(4) $f(x) = \phi(x)^{\top} w.$
The predictive mean and the covariance matrix can then be computed as
(5) $\bar f(x_*) = \phi(x_*)^{\top}\bar w, \qquad \bar w = \sigma_n^{-2} A^{-1}\,\Phi\, y, \qquad A = \sigma_n^{-2}\,\Phi\,\Phi^{\top} + I_D,$
where $\Phi$ is the $D \times N$ matrix whose columns are $\phi(x_i)$.
The Bayesian formalism provides Gaussian distributions for the training outputs $y$ and for the prediction $f_*$ at a test point:
(6) $y \,|\, X \sim \mathcal{N}\!\left(0,\; \Phi^{\top}\Phi + \sigma_n^2 I\right),$
(7) $f_* \,|\, x_*, X, y \sim \mathcal{N}\!\left(\phi(x_*)^{\top}\bar w,\; \phi(x_*)^{\top} A^{-1}\phi(x_*)\right),$
where $I$ is the $N \times N$ identity matrix. These expressions allow for the computation of the log-marginal likelihood (LML),
(8) $\log p(y\,|\,X) = -\tfrac{1}{2}\, y^{\top} K_y^{-1}\, y - \tfrac{1}{2}\log\left|K_y\right| - \tfrac{N}{2}\log 2\pi, \qquad K_y = \Phi^{\top}\Phi + \sigma_n^2 I.$
Maximization of this function leads to optimal values of hyperparameters such as $\sigma_n$ and the other variables that define $\phi$.
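A minimal sketch of this maximization is given below, using SciPy's Nelder-Mead on the negative LML of a standard squared-exponential GP. The kernel, data, and initial values are illustrative assumptions, not the paper's setup (the paper uses MATLAB's fminsearch for the same purpose).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N = 100
X = rng.uniform(size=(N, 1))
y = np.sin(4 * X[:, 0]) + 0.05 * rng.standard_normal(N)

def neg_lml(theta):
    # theta = (log s, log sigma_n); negative of Eq. (8) up to the 2*pi constant.
    s, sig = np.exp(theta)
    d2 = (X - X.T) ** 2
    K = np.exp(-d2 / (2 * s * s)) + (sig**2 + 1e-8) * np.eye(N)
    c, low = cho_factor(K, lower=True)
    alpha = cho_solve((c, low), y)
    # 0.5 y^T K^{-1} y + 0.5 log|K|, with log|K| = 2 sum(log diag(L))
    return 0.5 * y @ alpha + np.log(np.diag(c)).sum()

theta0 = np.log([1.0, 0.3])
res = minimize(neg_lml, theta0, method="Nelder-Mead")
s_opt, sig_opt = np.exp(res.x)
```

Optimizing in log-space keeps the scale and noise positive without explicit constraints, a common choice for GP hyperparameters.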
3 Multiscale sparse GP regression
As stated earlier, the goal of the proposed GPbased regression process is to decrease the complexity of both training and testing and to make the prediction more robust for datasets that have a highly irregular distribution of features.
We consider Gaussian basis functions,
(9) $\phi_i(x) = \exp\!\left(-\frac{|x - x_i|^2}{2 s_i^2}\right),$
where each function is characterized not only by its center $x_i$, but now also by a scale $s_i$. The need for multiple scales may arise from the underlying physics (e.g. particle density estimation) or from substantial non-uniformity of the input data distribution, which could, for instance, demand smaller scales in denser regions. Note that the matrix $\Phi$ in Eq. (5) is then given by
(10) $\Phi_{ij} = \phi_i(x_j).$
3.1 Sparse representations
While $N$ training points may be available, and maximum information can be obtained when the size of the extended feature space is $D = N$, we will search for subsets of these points leading to a lower size $D < N$. The size of the extended feature space is related to the accuracy of the low-rank approximation of the matrix $K$ built on the entire training set (i.e. for the case $D = N$).
Assume for a moment that we have a single scale, $s$. If $s$ is chosen to be smaller than the minimum distance between the training points, the matrix $K$ will have a low condition number and a low marginal likelihood. At the other extreme, if $s$ is substantially larger than the maximum distance between the training points, the problem will be ill-posed and the marginal likelihood will also be low. The optimum should lie between these extremes. If the solution is not sensitive to the removal of a few training points, a good low-rank approximation can be sought.
The fact that the matrix $K$ can be well-approximated by a matrix of rank $D$ means that $N - D$ rows or columns of this matrix can be expressed as linear combinations of the other rows or columns with relatively small error. Assuming that the locations (or centers) of the radial basis functions (or Gaussians) are given and denoting these centers by $\bar x_i$, $i = 1, \ldots, D$, we have
(11) $\phi_j(x) = \sum_{i=1}^{D} c_{ji}\,\bar\phi_i(x) + \epsilon_\phi(x),$
where $c_{ji}$ are coefficients to be determined. Hence, the output $y$ (Eq. (3)), where $f$ is expanded as in Eq. (4) with basis functions centered at the $\bar x_i$, can be written as
(12) $y = \sum_{i=1}^{D} \bar w_i\,\bar\phi_i(x) + \epsilon_\phi + \epsilon.$
This shows that if $\epsilon_\phi$ is smaller than or comparable to the noise $\epsilon$, then a reduced basis can be used: one can simply set the size of the extended feature space to the rank of the low-rank approximation, $D$, and the coefficients $\bar w$ can be determined by solving a $D \times D$ system instead of an $N \times N$ system.
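The reduced-basis solve can be sketched as follows: a weight-space regression with $D$ Gaussian basis functions, where only a $D \times D$ system is formed and solved. The centers, scale, and regularization here are assumptions for the demonstration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
N, D = 500, 25
X = rng.uniform(size=(N, 1))
y = np.sin(5 * X[:, 0]) + 0.1 * rng.standard_normal(N)

centers = np.linspace(0.0, 1.0, D)[:, None]   # D basis centers, D << N
s = 0.1                                       # single scale for this sketch
Phi = np.exp(-((X - centers.T) ** 2) / (2 * s * s))   # N x D design matrix

# Weight-space solve: a D x D system instead of an N x N one
sigma2 = 0.01
w = np.linalg.solve(Phi.T @ Phi + sigma2 * np.eye(D), Phi.T @ y)
pred = Phi @ w
```

Forming the normal equations costs $O(ND^2)$ and the solve $O(D^3)$, both far below the $O(N^3)$ of the full problem when $D \ll N$.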
3.2 Representative training subsets
Let us now consider the problem of determining the centers of the basis functions and the scales $s_i$. If each data point is assigned a different scale, the optimization problem becomes unwieldy, as $O(N)$ hyperparameters have to be determined. Some compromise is needed, and we instead use a set of $K$ scales, $\{s_1, \ldots, s_K\}$. In the limit $K = 1$, we have a single-scale model, while at $K = N$ we prescribe a scale to each training point. To reduce the number of scales while providing a broad spectrum of them, we propose the use of hierarchical scales, e.g. $s_k = s_1 q^{k-1}$, $k = 1, \ldots, K$, where $s_1$ is of the order of the size of the computational domain and $q < 1$ is a dimensionless parameter controlling the scale reduction.
While there exist several randomization-based strategies to obtain a low-rank representation by choosing representative input points [quinonero2005], we propose a more regular, structured approach. For a given scale $s_k$, we specify a distance $r_k = \theta s_k$, where the parameter $\theta$ is chosen such that the distance from any input point to the nearest basis point is less than $r_k$. Such a construction provides an approximately uniform covering of the entire training set with $d$-dimensional balls of radius $r_k$.
The problem of constructing an optimal set as described above is computationally NP-hard. However, approximate solutions can be obtained with a modification of the well-known k-means algorithm [kmeans]. In computational geometry problems, the k-means algorithm partitions a domain in $d$-dimensional space into a prescribed number, $k$, of clusters. In the present problem, the number of clusters is unknown, but the distance parameter $r$ is prescribed; thus, the number of centers depends on the input point distribution and can be determined only after the algorithm is executed. Since only the cluster centers are required for the present problem, the following algorithm is used:
Algorithm #1: Determination of cluster centers and $k$
Input: set of training points $T = \{x_1, \ldots, x_N\}$, max cluster radius $r$.
1. Define $T_1 = T$ and set $i = 1$;
2. Do steps 3 to 5 while $T_i \neq \emptyset$:
3. Assign $c_i = x$ (where $x$ is a random point from set $T_i$);
4. Find the set $S_i$ of all points $x_j \in T_i$ such that $|x_j - c_i| \leq r$;
5. Define $T_{i+1} = T_i \setminus S_i$ and set $i = i + 1$.
Output: set of cluster centers $\{c_1, \ldots, c_k\}$, number of clusters $k$.
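The steps above can be sketched in Python as follows; this is an illustrative implementation of the greedy ball-covering idea, with the dataset and radius chosen as assumptions for the demo.

```python
import numpy as np

def cluster_centers(X, r, rng):
    """Algorithm #1: greedy covering of X with balls of radius r.

    Repeatedly pick a random remaining point as a center and remove
    every point within distance r of it; return the centers.
    """
    remaining = list(range(len(X)))
    centers = []
    while remaining:
        i = remaining[rng.integers(len(remaining))]
        c = X[i]
        centers.append(c)
        # drop all points covered by the new center (including i itself)
        remaining = [j for j in remaining
                     if np.linalg.norm(X[j] - c) > r]
    return np.array(centers)

rng = np.random.default_rng(3)
X = rng.uniform(size=(500, 2))
C = cluster_centers(X, 0.2, rng)

# every training point lies within r of some center
dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=-1).min(axis=1)
```

By construction, the centers are mutually more than $r$ apart, and every input point is within $r$ of its nearest center, which is exactly the covering property required above.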
The construction of training subsets in the case of multiple scales requires a further modification of the algorithm. Starting with the coarsest scale $s_1$, cluster centers can be determined using Algorithm #1 with distance parameter $r_1$. In our approach, we select the bases such that each input point can serve as the center of only one basis function; the cluster centers found at scale $s_1$ are therefore removed from the initial training set before proceeding. Next, we partition the remaining set using Algorithm #1 with distance parameter $r_2$ and determine the cluster centers for the second scale. After removal of these points, we repeat the process until the finest scale $s_K$ is reached, at which point we determine the last set of cluster centers and stop. This algorithm is described below.
Algorithm #2: Determination of cluster centers in multiple scales
Input: set of training points $T$, set of scales $\{s_1, \ldots, s_K\}$, parameter $\theta$.
1. Define $T^{(1)} = T$;
2. Do steps 3 to 4 for $k = 1, \ldots, K$:
3. Execute Algorithm #1 with input $T^{(k)}$ and max cluster radius $r_k = \theta s_k$, yielding cluster centers $C^{(k)}$ and the number of clusters $m_k$;
4. Set $T^{(k+1)} = T^{(k)} \setminus C^{(k)}$.
Output: sets of cluster centers $C^{(k)}$ and numbers of clusters $m_k$ for each scale $s_k$, $k = 1, \ldots, K$.
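The hierarchical procedure can be sketched as below: cluster at each scale from coarse to fine, removing the chosen centers from the pool before moving to the next scale. The scales and data are assumptions for the demo, with $s_k$ following the geometric progression suggested earlier.

```python
import numpy as np

def cluster_centers(X, r, rng):
    # Algorithm #1: greedy ball covering with radius r.
    remaining = np.arange(len(X))
    centers = []
    while remaining.size:
        c = X[remaining[rng.integers(remaining.size)]]
        centers.append(c)
        keep = np.linalg.norm(X[remaining] - c, axis=1) > r
        remaining = remaining[keep]
    return np.array(centers)

def multiscale_centers(X, radii, rng):
    """Algorithm #2: cluster at each radius, coarse to fine, removing
    selected centers from the pool before the next (smaller) radius."""
    pool = X.copy()
    out = []
    for r in radii:
        C = cluster_centers(pool, r, rng)
        out.append(C)
        # remove the selected centers from the remaining pool
        mask = np.ones(len(pool), bool)
        for c in C:
            mask &= ~np.all(np.isclose(pool, c), axis=1)
        pool = pool[mask]
        if pool.size == 0:
            break
    return out

rng = np.random.default_rng(4)
X = rng.uniform(size=(400, 2))
radii = [0.4 * 0.5 ** k for k in range(3)]    # e.g. r_k = r_1 * q^{k-1}, q = 0.5
levels = multiscale_centers(X, radii, rng)
```

Each level yields progressively more centers with smaller covering radii, so the union of levels forms the multiscale basis while keeping the total count below $N$.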
Figure 1 illustrates the clustering of a 2-dimensional dataset. In principle, one could continue this process until all points in the training set become cluster centers at some scale (i.e. $D = N$). However, as mentioned above, the reduction of the basis size, $D < N$, is important for reducing the computational complexity of the overall problem.
4 Complexity
The asymptotic complexity of Algorithm #1 is $O(kN)$, since each of the $k$ selected centers requires a pass over at most $N$ remaining points; that of Algorithm #2 is correspondingly $O(DN)$, where $D$ is the total number of cluster centers over all scales. While fast neighbor search methods [cheng1984] can accelerate these algorithms, the overall computational cost of the regression is still expected to be dominated by the training and evaluation costs. It is thus instructive to discuss the complexities of the different algorithms available for training and testing, especially when $D \ll N$.
4.1 Training
The goal of the training stage is to determine quantities necessary for computations in the testing stage. At this point, two types of computations should be recognized: i) precomputations, when the set of hyperparameters is given, and ii) the optimization problem to determine the hyperparameters.
To accomplish the first task, we use the following steps. The first step involves the determination of the weights $\bar w$ from Eq. (5). This requires solving a $D \times D$ system; if solutions are obtained through direct methods, the cost is $O(D^3)$, plus $O(ND^2)$ to form the matrix $A$. Following this, the inverse $A^{-1}$ should be determined. The Cholesky decomposition is typically used in this situation [rbfgp2]:
(13) $A = L L^{\top},$
where $L$ is a lower triangular matrix. The decomposition can be performed with cost $O(D^3)$. Note that the Cholesky decomposition can also be used to solve the system for $\bar w$, in which case the complexity of the solution becomes $O(D^2)$ per right-hand side.
The second task requires the computation of the log-marginal likelihood. If one uses Eq. (8) directly, the task becomes computationally very expensive (complexity $O(N^3)$). The complexity can be reduced with the aid of the following lemma.
Lemma 1.
(14) $\left|\Phi^{\top}\Phi + \sigma_n^2 I_N\right| = \sigma_n^{2N}\left|A\right|.$
Proof.
According to definition (6), the determinant appearing in Eq. (14) can be written as
(15) $\left|\Phi^{\top}\Phi + \sigma_n^2 I_N\right| = \sigma_n^{2N}\left|I_N + \sigma_n^{-2}\,\Phi^{\top}\Phi\right|.$
Note now the Sylvester determinant theorem, which states that
(16) $\det\left(I_n + U V\right) = \det\left(I_m + V U\right),$
where $U$ is an $n \times m$ matrix and $V$ is $m \times n$, while $I_n$ and $I_m$ are the $n \times n$ and $m \times m$ identity matrices. Thus, we have in our case
(17) $\left|I_N + \sigma_n^{-2}\,\Phi^{\top}\Phi\right| = \left|I_D + \sigma_n^{-2}\,\Phi\,\Phi^{\top}\right|.$
From Eq. (5) we have
(18) $A = \sigma_n^{-2}\,\Phi\,\Phi^{\top} + I_D.$
Combining results (15), (17), and (18), one can see that Eq. (14) holds.
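The determinant identity at the heart of this proof is easy to verify numerically; the following quick sanity check (not part of the paper) evaluates both sides of the Sylvester theorem on random rectangular matrices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 6, 3
U = rng.standard_normal((n, m))
V = rng.standard_normal((m, n))

# Sylvester's determinant theorem: det(I_n + UV) = det(I_m + VU)
lhs = np.linalg.det(np.eye(n) + U @ V)
rhs = np.linalg.det(np.eye(m) + V @ U)
```

In the GP setting this is precisely what lets an $N \times N$ determinant be traded for a $D \times D$ one.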
Now, replacing the first term on the right-hand side of Eq. (8) with expressions involving $\bar w$ and $A$ from Eqs. (5) and (6), and using Eq. (14) in the second term, we obtain
(19) $\log p(y\,|\,X) = -\frac{1}{2\sigma_n^2}\left(y^{\top}y - y^{\top}\Phi^{\top}\bar w\right) - \frac{1}{2}\log|A| - N\log\sigma_n - \frac{N}{2}\log 2\pi.$
Using the Cholesky decomposition (13) of $A$, we obtain
(20) $\log|A| = 2\sum_{i=1}^{D}\log L_{ii}.$
Here, the cost of computing the first term on the right-hand side of Eq. (19) is $O(ND)$, and the cost of the second term is $O(D)$.
Multidimensional optimization is usually an iterative process. Assuming that the number of iterations is $N_{it}$, the overall cost of the training stage scales as
(21) $O\!\left(N_{it}\left(N D^2 + D^3\right)\right).$
4.2 Testing
The cost of testing can be estimated from Eq. (7) as the cost of computing the predictive mean and variance. Taking into account the Cholesky decomposition (13), these costs can be written as
(22) $O\!\left(N_t D\right) \ \text{(mean)}, \qquad O\!\left(N_t D^2\right) \ \text{(variance)},$
where $N_t$ is the number of test points. The cost of computing the variance is much higher than that of the mean, because it is determined by the cost of solving a triangular system of size $D \times D$ per evaluation point.
5 Numerical Results
In this section, a set of simple analytical examples is first formulated to critically evaluate the accuracy and effectiveness of MGP and to contrast its performance with conventional GP regression. Following this, data from a turbulent combustion simulation is used to assess the viability of the approach in scientific computing problems.
5.1 Analytical examples
The first numerical example we present is a 1D step function. We compare the standard kernel GP with a Gaussian kernel against the present MGP algorithm restricted to a single scale ($K = 1$). The training set of size $N$ was randomly selected from 10,000 points distributed on a uniform grid; points not used for training were used as test points. The initial data was generated by adding normally distributed noise with standard deviation $\sigma_n$ to the step function. Optimal hyperparameters were found using the Lagarias algorithm for the Nelder-Mead method (as implemented by MATLAB's fminsearch function) [lagarias]. For the standard GP, the scale and $\sigma_n$ were optimized, while for the MGP algorithm, the optimal $D$ was determined in addition to these. In both cases, the optimal $\sigma_n$ was close to the actual value used for data generation. The ratio between the optimal $\sigma_n$ for the kernel and finite-dimensional ("weight-space") approaches was found to be consistent with the difference predicted by the theory (the distinction between the weight-space and function-space approaches is provided in Ref. [rbfgp2]).
The plots in Figure 2 show that there is no substantial difference in the mean and the variance computed using the two methods, while the sparse algorithm required only $D$ basis functions compared to the $N$ required by the standard method. Figure 3 shows the size of the extended feature space ($D$) as a function of the size of the training set ($N$). Ideally, this dependence should be linear, since the step function is scale-independent. Due to random noise and the possibility that the optimization may converge to local minima of the objective function, however, the relationship is not exactly linear. Since $D$ is nevertheless several times smaller than $N$, the wall-clock time for the present algorithm is shorter than that for the standard GP in both testing and training, per optimizer iteration. The total training time for the present algorithm can be larger than that of the standard algorithm due to the overhead associated with the larger number of hyperparameters and the resultant increase in the number of optimization iterations required. However, we observed this only for relatively small values of $N$; for larger $N$, the present algorithm was approximately 5, 10, and 20 times faster than the standard algorithm for the three largest training-set sizes tested, respectively.
Figure 4 illustrates a case where the training points are distributed non-uniformly. Such situations frequently appear in practical problems, where regions of high functional gradients are sampled with higher density to provide a good representation of the function; adaptive meshes used to capture phenomena such as shock waves and boundary layers in fluid flow fall into this category. In the case illustrated here, the training points were distributed with exponential density near the jump, with locations generated from the nodes of a regular grid. For MGP, $K$ scales were used, and the other hyperparameters were optimized using the same routine as before. Due to multiple extrema in the objective function, it is rather difficult to optimize the number of scales, $K$; in practice, one should start from several initial guesses or fix some parameters such as $q$. We used several values of $K$ and observed almost no differences between the larger values, while results for the smallest values of $K$ were substantially different.
It is seen that MGP provides a much better fit of the step function in this case than the standard method, owing to its broad spectrum of scales. In the present example, the optimal scales were distributed as a geometric progression, $s_k = s_1 q^{k-1}$, with optimal values of $\sigma_n$ and $q$ obtained alongside them; for the standard GP, a single optimal scale was found. Figure 4 shows that with only a single intermediate scale, it is impossible to approximate the function between training points with large spacing, whereas MGP provides a much better approximation. Moreover, since the smallest MGP scale lies below the standard GP's optimal scale, we also obtain a better approximation of the jump, i.e. of small-scale behavior. This is clearly visible in the figure: the jump for the standard GP is stretched over about 10 intervals between sampling points, while the jump for MGP extends over only 3 intervals. Note that in the present example $D$ was comparable to $N$, so the wall-clock time for testing is not faster than that of the standard GP. However, this case illustrates that multiple scales can provide good results for substantially non-uniform distributions where a single-scale description is not sufficient.
As a final analytical example, we explore a sine wave with varying frequency, depicted in Figure 5. As before, the training points are randomly chosen from a uniform distribution, and the remaining points are used for testing. Compared to the previous examples, this is an intermediate case in terms of scale dependence. One noteworthy result from this dataset is that $D$ is almost constant with respect to $N$, as seen in Figure 6. This shows that when the function is relatively smooth, the optimization process is not limited to producing a linear relationship between $D$ and $N$. Another result is that the output variance of the multiscale method is visibly higher than that of the standard method; in the previous cases, the variances were either close to equal, or the standard method produced the higher variance. This could be due to the fact that the scales and $D$ are inherently related for the current method, whereas the scale is unrestricted for the standard GP, so the multiscale scales are typically greater than the standard method's optimal scale. According to Eq. (7), this would result in greater variance.
5.2 Data from Turbulent Combustion
Combustion in the presence of turbulent flow involves an enormous disparity in time and length scales, meaning that direct numerical simulations (DNS) are not possible in practical problems. Large eddy simulation (LES), in which the effect of scales smaller than the mesh resolution (i.e. subgrid scales) is modeled, is often a pragmatic choice. A key difficulty in LES of combustion is the modeling of the subgrid-scale fluxes in the species transport equations [kant1]; [kant2]. These terms arise as a result of the low-pass filtering, represented by an overbar, of the governing equations, and are of the form
(23) $\tau_i = \bar\rho\left(\widetilde{u_i Y} - \tilde u_i\,\tilde Y\right),$
where $\rho$, $u_i$, and $Y$ represent density, velocity, and species mass fraction, respectively.
Subgrid-scale closures based on concepts from non-reacting flows, such as the gradient-diffusion model
(24) $\tau_i = -\bar\rho\,\frac{(C_s \Delta)^2\,|\tilde S|}{Sc_t}\,\frac{\partial \tilde Y}{\partial x_i},$
are found to be inadequate for turbulent combustion. Modeling of the scalar fluxes thus continues to be an active area of research, and many analytical models are being evaluated by the community. Reference [chakraborthy] provides a concise summary of such developments in the area of premixed turbulent flames.
In this work, we apply GP and MGP to identify the model relationships from data. The simulation data is obtained from Ref. [raman], in which DNS of a propagating turbulent flame is performed. The configuration involves a premixed hydrogen-air flame propagating in the x-direction; the flame forms a thin front between the burnt and unburnt gases. Specifically, we attempt to learn the normalized flux in the x-direction as a function of seven variables:
(25) 
A total of 3 million training points were generated from the dataset by performing low-pass filtering. Some care was taken in choosing training points: since the flame is thin relative to the size of the domain, the majority of the data points lie outside the region where the flux is nonzero. To mitigate this disadvantageous distribution, 80 percent of the training points were randomly chosen from the data where the flux is nonzero, and 20 percent from the data where it vanishes, i.e. outside the flame. In total, 45,000 training points were used. For testing, a single plane of around 6,500 points was set aside.
The predictive results on this plane are shown in Table 1, and Figure 8 shows the ML output versus the true DNS values. From these plots, it is seen that MGP achieves a fiftyfold increase in evaluation speed for a corresponding two percent increase in error. In the scatter plot, MGP appears to perform better than GP for low values of the flux and worse for high values, but the overall difference is small. Figure 7 is a side-by-side comparison of contours of the flux from DNS and from GP and MGP. It is especially evident in the contour plots that GP and MGP are both able to capture the features of the flame, whereas the analytical models described in Ref. [chakraborthy] were not as accurate.
Table 1: Test error and evaluation time for GP and MGP.
         GP        MGP
Error    0.1087    0.1105
Time
6 Conclusion
In this paper, a new Gaussian process (GP) regression technique was presented. The method, referred to as MGP, introduces multiple scales among the Gaussian basis functions and employs hierarchical clustering to select the centers of these sparse basis functions. These modifications reduce the computational complexity of GP regression and also achieve better accuracy than standard GP regression when training points are non-uniformly distributed in the feature space. We illustrated these improvements through analytical examples and a turbulent combustion dataset.
In analytical examples with smooth functions, the MGP was shown to be at least an order of magnitude faster than the standard GP for a similar level of accuracy. In problems with discontinuities, the MGP is shown to provide a much better fit. In the turbulent combustion example in 7 dimensions, the MGP was shown to achieve a fiftyfold increase in evaluation speed for a corresponding two percent increase in error over the GP method. Overall, MGP is wellsuited for regression problems where the inputs are unevenly distributed or where training and testing speeds are critical.
Based on the results presented in this work, MGP offers promise as a potentially attractive method for use in many scientific computing applications in which datasets may be large, and sparsely distributed in a highdimensional feature space. The MGP can be especially useful when predictive evaluations are performed frequently. However, further developments and more detailed application studies are warranted. It was observed that the optimization process is more likely to terminate at local minima compared to conventional GPs. An immediate area of investigation could explore more efficient and robust techniques for optimization of the hyperparameters. Since an appealing feature of MGP is the reduced complexity when working with large datasets, efficient parallelization strategies should be explored.
The software and all the examples in this paper are openly available at http://umich.edu/~caslab/#software.
Acknowledgments
This work was performed under the NASA LEARN grant titled “A Framework for Turbulence Modeling Using Big Data.”
Footnotes
 email: kdur@umich.edu
 journal: Machine Learning
 Alternatively, efficient decompositions enabling solutions with multiple right hand sides can be used.
 $C_s$ and $Sc_t$ are typically constants, and $\tilde S$ is the strain-rate tensor. The tilde denotes Favre-filtering, defined by $\tilde f = \overline{\rho f}/\bar\rho$. $\Delta$ is the filter size.
 These results are not shown.
References
 Mjolsness, E. and DeCoste, D., “Machine learning for science: state of the art and future prospects,” Science, Vol. 293, No. 5537, 2001, pp. 2051–2055.
 Yairi, T., Kawahara, Y., Fujimaki, R., Sato, Y., and Machida, K., “Telemetry-mining: a machine learning approach to anomaly detection and fault diagnosis for space systems,” Second IEEE International Conference on Space Mission Challenges for Information Technology (SMC-IT 2006), IEEE, 2006, 8 pp.
 Ling, J. and Templeton, J., “Evaluation of machine learning algorithms for prediction of regions of high Reynolds averaged Navier Stokes uncertainty,” Physics of Fluids, Vol. 27, No. 8, 2015.
 Parish, E. and Duraisamy, K., “A Paradigm for datadriven predictive modeling using field inversion and machine learning,” Journal of Computational Physics, Vol. 305, 2016, pp. 758–774.
 Gautier, N., Aider, J., Duriez, T., Noack, B. R., Segond, M., and Abel, M., “Closed-loop separation control using machine learning,” Journal of Fluid Mechanics, Vol. 770, 2015, pp. 442–457.
 Wilkinson, J., The algebraic eigenvalue problem, Vol. 87, Clarendon Press Oxford, 1965.
 Gumerov, N. A. and Duraiswami, R., “Fast radial basis function interpolation via preconditioned Krylov iteration,” SIAM Journal on Scientific Computing, Vol. 29, No. 5, 2007, pp. 1876–1899.
 Duraisamy, K., Zhang, Z., and Singh, A., “New Approaches in Turbulence and Transition Modeling Using Data-driven Techniques,” 2014.
 Quiñonero-Candela, J. and Rasmussen, C. E., “A unifying view of sparse approximate Gaussian process regression,” The Journal of Machine Learning Research, Vol. 6, 2005, pp. 1939–1959.
 Seeger, M., Williams, C., and Lawrence, N., “Fast forward selection to speed up sparse Gaussian process regression,” Artificial Intelligence and Statistics 9, No. EPFL-CONF-161318, 2003.
 Smola, A. J. and Bartlett, P., “Sparse Greedy Gaussian Process Regression,” Advances in Neural Information Processing Systems 13, MIT Press, 2001, pp. 619–625.
 Walder, C., Kim, K. I., and Schölkopf, B., “Sparse multiscale Gaussian process regression,” Proceedings of the 25th International Conference on Machine learning, 2008, pp. 1112–1119.
 Titsias, M. K., “Variational learning of inducing variables in sparse Gaussian processes,” International Conference on Artificial Intelligence and Statistics, 2009, pp. 567–574.
 Rahimi, A. and Recht, B., “Random features for large-scale kernel machines,” Advances in Neural Information Processing Systems, 2007, pp. 1177–1184.
 Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D., and O’Neil, M., “Fast Direct Methods for Gaussian Processes,” 2016.
 Snelson, E. and Ghahramani, Z., “Local and global sparse Gaussian process approximations,” International Conference on Artificial Intelligence and Statistics, 2007, pp. 524–531.
 Park, S. and Choi, S., “Hierarchical Gaussian Process Regression,” ACML, 2010, pp. 95–110.
 Urtasun, R. and Darrell, T., “Sparse probabilistic regression for activity-independent human pose inference,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), IEEE, 2008, pp. 1–8.
 Choudhury, A., Nair, P. B., and Keane, A. J., “A Data Parallel Approach for Large-Scale Gaussian Process Modeling,” SDM, SIAM, 2002, pp. 95–111.
 Gramacy, R. B., Niemi, J., and Weiss, R. M., “Massively parallel approximate Gaussian process regression,” SIAM/ASA Journal on Uncertainty Quantification, Vol. 2, No. 1, 2014, pp. 564–584.
 Zhou, Y., Zhang, T., and Li, X., “Multiscale Gaussian Processes model,” Journal of Electronics (China), Vol. 23, No. 4, 2006, pp. 618–622.
 Rasmussen, C. E., “Gaussian processes for machine learning,” 2006.
 Hartigan, J. A. and Wong, M. A., “Algorithm AS 136: A k-means clustering algorithm,” Applied Statistics, 1979, pp. 100–108.
 Cheng, D., Gersho, A., Ramamurthi, B., and Shoham, Y., “Fast search algorithms for vector quantization and pattern matching,” Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’84., Vol. 9, IEEE, 1984, pp. 372–375.
 Lagarias, J. C., Reeds, J. A., Wright, M. H., and Wright, P. E., “Convergence properties of the Nelder–Mead simplex method in low dimensions,” SIAM Journal on optimization, Vol. 9, No. 1, 1998, pp. 112–147.
 Lipatnikov, A. N. and Chomiak, J., “Effects of premixed flames on turbulence and turbulent scalar transport,” Progress in Energy and Combustion Science, Vol. 36, No. 1, 2010, pp. 1–102.
 Tullis, S. and Cant, R. S., “Scalar transport modeling in large eddy simulation of turbulent premixed flames,” Proceedings of the Combustion Institute, Vol. 29, No. 2, 2002, pp. 2097–2104.
 Gao, Y., Klein, M., and Chakraborty, N., “Assessment of subgrid scalar flux modelling in premixed flames for Large Eddy Simulations: A priori Direct Numerical Simulation analysis,” European Journal of Mechanics - B/Fluids, Vol. 52, 2015, pp. 97–108.
 Hassanaly, M., Raman, V., Koo, H., and Colkett, M. B., “Influence of Fuel Stratification on Turbulent Flame Propagation,” 2014.