Efficient Multiscale Gaussian Process Regression using Hierarchical Clustering

Abstract

Standard Gaussian Process (GP) regression, a powerful machine learning tool, is computationally expensive when it is applied to large datasets, and potentially inaccurate when data points are sparsely distributed in a high-dimensional feature space. To address these challenges, a new multiscale, sparsified GP algorithm is formulated, with the goal of application to large scientific computing datasets. In this approach, the data is partitioned into clusters and the cluster centers are used to define a reduced training set, resulting in an improvement over standard GPs in terms of training and evaluation costs. Further, a hierarchical technique is used to adaptively map the local covariance representation to the underlying sparsity of the feature space, leading to improved prediction accuracy when the data distribution is highly non-uniform. A theoretical investigation of the computational complexity of the algorithm is presented. The efficacy of this method is then demonstrated on smooth and discontinuous analytical functions and on data from a direct numerical simulation of turbulent combustion.

Keywords:
Gaussian Processes, Sparse regression, Clustering.

1 Introduction

The rapid growth in computing power has resulted in the generation of massive amounts of highly-resolved datasets in many fields of science and engineering. Against this backdrop, machine learning (ML) is becoming a widely-used tool for identifying patterns and functional relationships, which has resulted in improved understanding of underlying physical phenomena [mjolsness2001], characterization and improvement of models [yairi1992, ling2015, eric2015], control of complex systems [gautier2015], etc. In this work, we are interested in further developing Gaussian Process (GP) regression, which is a popular supervised learning technique. While GPs have been demonstrated to provide accurate predictions of the mean and variance of functional outputs, application to large-scale datasets in high-dimensional feature spaces remains a hurdle. This is because of the high computational costs and memory requirements during the training stage and the expense of computing the mean and variance during the prediction stage. To accurately represent high-dimensional function spaces, a large number of training data points must be used. In this work, an extension to GP regression is developed with the specific goal of applicability to noisy and large-scale scientific computing datasets.

The computational complexity of GPs at the training stage is related to the kernel inversion and the computation of the log-marginal likelihood. For example, an algorithm based on the Cholesky decomposition is one well-known approach [chol]. The computational complexity of the Cholesky decomposition for a problem with N training points is O(N^3). For high-dimensional problems with large N, this can be prohibitive. Even if this cost can be reduced using, for instance, iterative methods [gumerov-iter], the computation of the predictive mean and variance at N_t test points will be O(N N_t) and O(N^2 N_t), respectively. The evaluation (or testing) time can be of great significance if GP evaluations are made frequently during a computational process. Such a situation can occur in a scientific computing setting [transprev, eric2015], where GP evaluations may be performed every time-step or iteration of the solver. Reducing both the training and evaluation costs while preserving the prediction accuracy is the goal of the current work.

Much effort has been devoted to the construction of algorithms of reduced complexity. Among these is the family of methods of sparse Gaussian process regression. These methods seek to strategically reduce the size of the training set or find appropriate sparse representations of the correlation matrix through the use of induced variables. The covariance matrix, K, contains the pairwise covariances between the training points. Descriptions of several methods of this family are provided by Quiñonero-Candela and Rasmussen [quinonero2005], who define M induced variables u = {u_1, ..., u_M}, such that the prediction of a function at a test point x_* is given by

\bar{f}(\mathbf{x}_*) = \mathbf{k}_u(\mathbf{x}_*)^T \left(\sigma_n^2 K_{uu} + K_{uf}K_{fu}\right)^{-1} K_{uf}\,\mathbf{y} \qquad (1)

where k_u(x_*) is a vector whose elements are k(x_*, u_i), K_{uu} is the M × M covariance matrix of the induced variables, and K_{uf} is the M × N covariance matrix between the induced variables and the training inputs. M is the number of induced variables used in this representation, and n is the dimensionality of the inputs; i.e. u_i, x_* ∈ R^n. x_* is a single test input vector, and k_u(x_i) is the i'th column of K_{uf}. k(·,·) is the covariance function. From this, it is evident that if M ≪ N, the matrix inversion will be far less costly to perform.

The process of determining the induced variables that can optimally represent the training set can introduce additional computational costs. The most basic method is to randomly select a fraction of the data points and define them as the new training set, i.e. choosing a subset of (X, y), where X and y are the original training inputs and outputs. Seeger et al. [seeger2003], Smola and Bartlett [smola2001], and others have developed selection methods that provide better test results than random selection. In particular, Walder et al. [walder2008] have extended the concept to allow the covariance function to vary such that its hyperparameters are unique to each induced variable. These methods often depend on optimizing modified likelihoods based on the induced variables, or on approximating the true log marginal likelihood [titsias2009]. Methods that introduce new induced variables instead of using a subset of the training points can involve solving optimization problems, again requiring the user to consider the additional computational burden.

In addition to choosing the inducing variables, one may also change the form of the covariance function from k(x, x') to an approximation \tilde{k}(x, x'), where

\tilde{k}(\mathbf{x}, \mathbf{x}') = \mathbf{k}_u(\mathbf{x})^T K_{uu}^{-1}\, \mathbf{k}_u(\mathbf{x}') \qquad (2)

In contrast to the case where only a subset of the training data is used, all training points are considered in this approach, and the resulting matrices are only of rank M. There are many variations on this type of manipulation of the covariance. Alternatively, one can approximate the kernel directly by obtaining random samples of the kernel, as described by Rahimi and Recht [recht2007]. However, altering the covariance matrix can cause undesirable behaviors when computing the test variance, since some matrices no longer mathematically correspond to a Gaussian process. The work of Ambikasaran et al. [ambikasaran] improves the speed of GP regression through the hierarchical factorization of the dense covariance matrix into block matrices, resulting in a less costly inversion process.
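
To make the role of the induced variables concrete, the following minimal NumPy sketch builds a rank-M (Nystrom-type) approximation of the full covariance matrix from a randomly chosen subset of inducing points, in the spirit of Eq. (2). The kernel choice, jitter value, and variable names are illustrative assumptions rather than part of the formulation above.

```python
import numpy as np

def rbf_kernel(A, B, scale=1.0):
    """Squared-exponential covariance between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / scale**2)

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))                      # N training inputs
U = X[rng.choice(len(X), size=25, replace=False)]   # M inducing inputs (a subset here)

K_nm = rbf_kernel(X, U, scale=0.3)                  # N x M cross-covariance
K_mm = rbf_kernel(U, U, scale=0.3)                  # M x M covariance of inducing points

# Rank-M approximation K ~ K_nm K_mm^{-1} K_nm^T: only M x M systems are solved.
K_lowrank = K_nm @ np.linalg.solve(K_mm + 1e-10 * np.eye(len(U)), K_nm.T)

K_full = rbf_kernel(X, X, scale=0.3)
print("relative Frobenius error:",
      np.linalg.norm(K_full - K_lowrank) / np.linalg.norm(K_full))
```

With M ≪ N, the dominant cost shifts from inverting an N × N matrix to solving M × M systems, which is the source of the savings discussed above.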

Other methods of accelerating GP regression focus on dividing the training set into smaller sets of more manageable size. Snelson and Ghahramani [snelson2007local] employ a local GP consisting of points in the neighborhood of the test point, in addition to a simplified global estimate, to improve test results. In a similar fashion, the work of Park and Choi [park2010] also takes a hierarchical approach, with the global-level GP informing the mean of the local GP. Urtasun and Darrell [urtasun2008] use the local GP directly, without induced variables. However, this requires the test points to be either assigned to a pre-computed neighborhood of training points [snelson2007local], or to be determined on-line [urtasun2008]. Both approaches to defining the neighborhood can be quite expensive depending on the training set size, though massive parallelization may mitigate the cost in some aspects [keane2002, weiss2014].

In this work, the data is partitioned into clusters based on modified k-center algorithms, and the cluster centers are used to define a reduced training set. This leads to an improvement over standard GPs in terms of training and testing costs. This technique also adapts the covariance function to the sparsity of a given neighborhood of training points. The new technique will be referred to as Multiscale GP and abbreviated as MGP. Similar to the work of Snelson and Ghahramani [snelson2007local], this flexibility may allow MGP to produce more accurate outputs by means of a more informed perspective on the input space. The sparse GP technique of Walder et al. [walder2008] can be seen as a very general interpretation of our approach, though a key difference is that the points within the reduced training set originate entirely from the original training set, whereas Ref. [walder2008] computes new induced variables. Further, in our work, hyperparameters are restricted by the explicit choice and hierarchy of scales in order to streamline the process of their optimization. The work of Zhou et al. [zhou2006] also employs multiple scales of covariance functions, but it does not take into account the neighborhoods of training points, and neither does it seek to reduce the size of the training set.

The next section of this paper recapitulates the main aspects of standard GP regression. Section 3 introduces the philosophy behind the proposed approach and presents the algorithm. Section 4 analyses the complexity of the MGP algorithm for both training and testing. In Section 5, quantitative demonstrations are presented on simple analytical problems as well as on data derived from numerical simulations of turbulent combustion. Following this, key contributions are summarized.

2 Preliminaries of GP regression

In this section, a brief review of standard GP regression is presented. While a detailed review can be found in the excellent book by Rasmussen [rbfgp2], the description and notation below provide the necessary context for the development of the multiscale algorithm proposed in this paper.

We consider the finite-dimensional weight-space view and the infinite-dimensional function-space view as in Rasmussen [rbfgp2]. Given a training set of N observations, {(x_i, y_i), i = 1, ..., N}, where x_i ∈ R^n is an input vector and y_i is a scalar output, the goal is to obtain the predictive mean and variance at an arbitrary test point x_*. Training inputs are collected in the design matrix X, and the outputs form the vector y. Furthermore, it is assumed that

y = f(\mathbf{x}) + \varepsilon \qquad (3)

where ε is Gaussian noise with zero mean and variance σ_n^2.

For inputs of low dimensionality, linear regression cannot capture the variety of dependencies encountered in practice. Therefore, the input x is mapped into a high-dimensional feature space, and the linear regression model is applied in this space. To avoid any confusion with the original feature space, this space will be referred to as the "extended feature space" and its dimensionality D will be referred to as the "size" of the system. Denoting the mapping function by φ(x), we have

f(\mathbf{x}) = \boldsymbol{\phi}(\mathbf{x})^T \mathbf{w}, \qquad \mathbf{w} \in \mathbb{R}^D \qquad (4)

The weight vector w̄, the matrix A, and the resulting predictive mean and variance can then be computed as

\bar{\mathbf{w}} = A^{-1}\Phi\mathbf{y}, \qquad A = \Phi\Phi^T + \sigma_n^2 I_D, \qquad \bar{f}(\mathbf{x}_*) = \boldsymbol{\phi}(\mathbf{x}_*)^T\bar{\mathbf{w}}, \qquad \mathbb{V}[f(\mathbf{x}_*)] = \sigma_n^2\,\boldsymbol{\phi}(\mathbf{x}_*)^T A^{-1}\boldsymbol{\phi}(\mathbf{x}_*) \qquad (5)

where Φ = [φ(x_1), ..., φ(x_N)] is the D × N matrix of training features.

The Bayesian formalism provides Gaussian distributions for the training outputs y and, in the equivalent function-space view, for the predictive mean and variance:

p(\mathbf{y} \mid X) = \mathcal{N}\!\left(\mathbf{0},\, Q\right), \qquad Q = \Phi^T\Phi + \sigma_n^2 I_N \qquad (6)
\bar{f}(\mathbf{x}_*) = \mathbf{k}_*^T Q^{-1}\mathbf{y}, \qquad \mathbb{V}[f(\mathbf{x}_*)] = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T Q^{-1}\mathbf{k}_* \qquad (7)

where I_N is the N × N identity matrix, k(x, x') = φ(x)^T φ(x') is the covariance function, and k_* is the vector with entries k(x_i, x_*). These expressions allow for the computation of the log-marginal likelihood (LML),

\log p(\mathbf{y} \mid X) = -\frac{1}{2}\,\mathbf{y}^T Q^{-1}\mathbf{y} - \frac{1}{2}\log\det Q - \frac{N}{2}\log 2\pi \qquad (8)

Maximization of this function leads to optimal values of hyper-parameters such as σ_n and the other variables that define the covariance function.
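
The following is a minimal NumPy sketch of the standard (function-space) GP computations summarized by Eqs. (6)-(8): a Cholesky factorization of Q, the predictive mean and variance, and the log-marginal likelihood. The squared-exponential kernel and the variable names are illustrative assumptions, not prescriptions of this paper.

```python
import numpy as np

def rbf_kernel(A, B, scale):
    """Squared-exponential covariance between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / scale**2)

def gp_fit_predict(X, y, X_test, scale, sigma_n):
    """Standard GP: predictive mean/variance (Eq. 7) and LML (Eq. 8)."""
    N = len(X)
    Q = rbf_kernel(X, X, scale) + sigma_n**2 * np.eye(N)    # Eq. (6)
    L = np.linalg.cholesky(Q)                               # O(N^3) factorization
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # Q^{-1} y

    K_s = rbf_kernel(X_test, X, scale)                      # test/train covariances
    mean = K_s @ alpha                                      # O(N) per test point
    V = np.linalg.solve(L, K_s.T)                           # O(N^2) per test point
    var = rbf_kernel(X_test, X_test, scale).diagonal() - (V**2).sum(axis=0)

    lml = (-0.5 * y @ alpha                                 # Eq. (8)
           - np.log(np.diag(L)).sum()
           - 0.5 * N * np.log(2.0 * np.pi))
    return mean, var, lml
```

The O(N^3) factorization and the O(N^2) work per test point in this sketch are precisely the costs that the multiscale, sparsified formulation of the following sections aims to avoid.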

3 Multiscale sparse GP regression

As stated earlier, the goal of the proposed GP-based regression process is to decrease the complexity of both training and testing and to make the prediction more robust for datasets that have a highly irregular distribution of features.

We consider Gaussian basis functions,

\phi_i(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \hat{\mathbf{x}}_i\|^2}{2 s_i^2}\right), \qquad i = 1, \dots, D \qquad (9)

where each function is characterized not only by its center x̂_i, but now also by a scale s_i. The need for multiple scales may arise from the underlying physics (e.g. particle density estimation) or from the substantial non-uniformity of the input data distribution, which could, for instance, demand smaller scales in denser regions. Note that the matrix Φ in Eq. (5) is then given by

\Phi_{ij} = \phi_i(\mathbf{x}_j), \qquad i = 1, \dots, D, \quad j = 1, \dots, N \qquad (10)
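
As a concrete illustration, the short sketch below assembles the D × N feature matrix of Eqs. (9)-(10) from given arrays of centers and per-basis scales; the function and argument names are illustrative only.

```python
import numpy as np

def phi_matrix(X, centers, scales):
    """Multiscale Gaussian basis: Phi[i, j] = exp(-|x_j - c_i|^2 / (2 s_i^2)).

    X       : (N, n) training inputs
    centers : (D, n) basis centers (selected training points)
    scales  : (D,)   scale s_i associated with each center
    """
    d2 = ((centers[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # (D, N) squared distances
    return np.exp(-0.5 * d2 / scales[:, None] ** 2)
```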

3.1 Sparse representations

While N training points may be available, and maximum information can be obtained when the size of the extended feature space is D = N, we will search for subsets of these points leading to a lower size D < N. The size of the extended feature space is related to the accuracy of the low-rank approximation of the covariance matrix built on the entire training set (i.e. for the case D = N).

Assume for a moment that we have a single scale s. If s is chosen to be smaller than the minimum distance between the training points, the covariance matrix will have a low condition number, but the marginal likelihood will be low. At the other extreme, if s is substantially larger than the maximum distance between the training points, the problem will be ill-posed and the marginal likelihood will also be low. The optimum should lie between these extremes. If the solution is not sensitive to the removal of a few training points, a good low-rank approximation can be sought.

The fact that the covariance matrix can be well-approximated by a matrix of rank D < N means that its rows or columns can be expressed as linear combinations of the other rows or columns with relatively small error. Assuming that the locations (or centers) of the radial basis functions (or Gaussians) are given, and denoting such centers by x̂_j, j = 1, ..., D, we have

k(\mathbf{x}, \mathbf{x}_i) = \sum_{j=1}^{D} c_{ij}\, k(\mathbf{x}, \hat{\mathbf{x}}_j) + \epsilon_i(\mathbf{x}), \qquad i = 1, \dots, N \qquad (11)

where c_ij are coefficients to be determined and ε_i is the approximation error. Hence, the output y (Eq. 3), where f is expanded as in Eq. 4 over the full set of training points, can be written as

y = \sum_{j=1}^{D} \tilde{w}_j\, k(\mathbf{x}, \hat{\mathbf{x}}_j) + \tilde{\epsilon}(\mathbf{x}) + \varepsilon \qquad (12)

This shows that if the approximation error ε̃ is lower than or comparable with the noise σ_n, then a reduced basis can be used: one can simply set the size of the extended feature space to be the rank of the low-rank approximation, D, and the coefficients can be determined by solving a D × D system instead of an N × N system.

3.2 Representative training subsets

Let us now consider the problem of determining the centers x̂_j of the basis functions and the scales s_i. If each data point is assigned a different scale, the optimization problem will become unwieldy, as N scale hyperparameters will have to be determined. Some compromises are made, and we instead use a set of n_s scales {s_1, ..., s_{n_s}}. In the limit n_s = 1, we have a single-scale model, while at n_s = N we prescribe a scale to each training point. To reduce the number of scales while providing a broad spectrum of scales, we propose the use of hierarchical scales, e.g. s_l = s_0 q^l, l = 1, ..., n_s, where s_0 is of the order of the size of the computational domain and q < 1 is a dimensionless parameter controlling the scale reduction.

While there exist several randomization-based strategies to obtain a low-rank representation by choosing representative input points [quinonero2005], we propose a more regular, structured approach. For a given scale s_l, we specify a distance r_l proportional to s_l, where r_l is chosen such that

  • The distance from any input point to the nearest basis point is less than r_l. Such a construction provides an approximately uniform coverage of the entire training set with n-dimensional balls of radius r_l.

  • The number of such balls is, in some sense, optimal (i.e. as small as practical). Note that a smaller r_l results in smaller errors ε_i and ε̃ and a larger D in Eqs. (11) and (12). If r_l is smaller than the minimal distance between the points in the training set, then D = N and the approximation error vanishes, which corresponds to a full-rank representation.

The problem of constructing an optimal set as described above is computationally NP-hard. However, approximate solutions can be obtained with a modification of the well-known k-means algorithm [kmeans]. In computational geometry problems, the k-means algorithm partitions a domain in n-dimensional space into a prescribed number, k, of clusters. In the present problem, the number of clusters is unknown, but the distance parameter r is prescribed. Thus, the number of centers depends on the input point distribution and can be determined only after the algorithm is executed. Since only the cluster centers are required for the present problem, the following algorithm is used:

Algorithm #1: Determination of cluster centers and k

Input: set of training points X = {x_1, ..., x_N}, max cluster radius r.

  1. Define X^(1) = X and set k = 0;

  2. Do steps 3 to 6 while X^(k+1) is not empty:

  3. Assign c_{k+1} = x, where x is a random point from set X^(k+1);

  4. Find all points x_j in X^(k+1) such that |x_j - c_{k+1}| <= r;

  5. Define X^(k+2) as X^(k+1) with these points removed;

  6. Set k = k + 1.

Output: set of cluster centers {c_1, ..., c_k}, number of clusters k.
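
A possible NumPy implementation of Algorithm #1, under the reading given above, is sketched below; the random selection of centers and the function name are illustrative.

```python
import numpy as np

def cluster_centers(X, r, rng=None):
    """Algorithm #1 (sketch): cover the training set with balls of radius r.

    Returns the cluster centers (a subset of X) and their number k.
    """
    rng = np.random.default_rng() if rng is None else rng
    remaining = np.arange(len(X))                    # indices of points not yet covered
    centers = []
    while remaining.size > 0:
        c = remaining[rng.integers(remaining.size)]  # random point becomes a new center
        centers.append(c)
        d = np.linalg.norm(X[remaining] - X[c], axis=1)
        remaining = remaining[d > r]                 # drop every point within radius r
    return X[np.array(centers, dtype=int)], len(centers)
```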

The construction of training subsets in the case of multiple scales requires a further modification of the algorithm. Starting with the coarsest scale s_1, k_1 centers are determined using Algorithm #1 with distance parameter r_1. In our approach, we select the bases such that each input point can serve as the center of only one basis function; therefore, the cluster centers of scale s_1 should be removed from the initial training set before proceeding. Next, we partition the remaining set using Algorithm #1 with distance parameter r_2 and determine k_2 cluster centers. After removal of these points, we repeat the process until the finest scale s_{n_s} is reached, at which point k_{n_s} cluster centers are determined and the process stops. This algorithm is described below.

Algorithm #2: Determination of cluster centers in multiple scales

Input: set of training points X, set of scales {s_1, ..., s_{n_s}}, parameter γ relating the cluster radius to the scale (r_l = γ s_l).

  1. Define Y^(1) = X;

  2. Do steps 3 to 4 for l = 1, ..., n_s:

  3. Execute Algorithm #1 with input Y^(l) and max cluster radius r_l = γ s_l, and get cluster centers C_l and number of clusters k_l;

  4. Set Y^(l+1) = Y^(l) \ C_l.

Output: sets of cluster centers C_l and numbers of clusters k_l for each scale s_l, l = 1, ..., n_s.
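
Building on the cluster_centers sketch above, a possible implementation of Algorithm #2 is given below; the proportionality factor gamma between cluster radius and scale, and the geometric scale hierarchy in the usage lines, are assumptions consistent with the description in the text.

```python
import numpy as np

def multiscale_centers(X, scales, gamma=1.0, rng=None):
    """Algorithm #2 (sketch): cluster centers for each scale, coarsest to finest."""
    rng = np.random.default_rng() if rng is None else rng
    remaining = X.copy()
    all_centers, counts = [], []
    for s in scales:                                   # s_1 > s_2 > ... > s_{n_s}
        C, k = cluster_centers(remaining, gamma * s, rng)
        all_centers.append(C)
        counts.append(k)
        # remove the selected centers so each point anchors at most one basis function
        keep = ~(remaining[:, None, :] == C[None, :, :]).all(-1).any(1)
        remaining = remaining[keep]
    return all_centers, counts

# hierarchical scales s_l = s_0 q^l (geometric progression), as suggested in Section 3.2
s0, q, n_s = 1.0, 0.5, 4
scales = [s0 * q**l for l in range(1, n_s + 1)]
```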

Figure 1: The separation of 500 random points distributed inside a unit square into clusters. The cluster centers are shown in yellow; points belonging to different clusters are in different shades of gray. Left: small cluster radius r. Right: large cluster radius r.

Figure 1 illustrates the clustering of a two-dimensional dataset. In principle, one can carry the process even further until all the points in the training set become cluster centers at some scale (i.e. D = N). However, as mentioned above, the reduction of the basis size, D < N, is important for reducing the computational complexity of the overall problem.

4 Complexity

The asymptotic complexity of Algorithm #1 is O(kN), and that of Algorithm #2 is O(DN), where D is the total number of cluster centers over all scales. While fast neighbor-search methods [cheng1984] can accelerate these algorithms, the overall computational cost of the regression is still expected to be dominated by the training and evaluation costs. It is thus instructive to discuss the complexities of the different algorithms available for training and testing, especially when D ≪ N.

4.1 Training

The goal of the training stage is to determine quantities necessary for computations in the testing stage. At this point, two types of computations should be recognized: i) pre-computations, when the set of hyperparameters is given, and ii) the optimization problem to determine the hyperparameters.

To accomplish the first task, we use the following steps:

  • The initial step is to form the matrix Φ and then the matrix A in Eq. (5). The cost of this step is O(ND^2). Note that this cost is much larger than the O(DN) required to determine the basis function centers using the k-means-type algorithm described in Section 3.2.

  • The next step involves the determination of the weights w̄ from Eq. (5). This requires solving a D × D linear system. If the solution is obtained through direct methods, the cost is O(D^3).

  • Following this, the inverse A^{-1} should be determined (see footnote 5). The Cholesky decomposition is typically used in this situation [rbfgp2]:

    A = L L^T \qquad (13)

    where L is a lower triangular D × D matrix. The decomposition can be performed at a cost of O(D^3/3). Note that the Cholesky factor can also be used to solve the system for the weights w̄, in which case the additional cost of the solution is O(D^2).

The second task requires the computation of the log marginal likelihood. If one uses Eq. (8) directly, the task becomes computationally very expensive (complexity O(N^3)). The complexity can be reduced with the aid of the following lemma.

Lemma 1

The determinants of the matrices A and Q, defined by Eqs. (5) and (6), are related as

\det A = \sigma_n^{2(D-N)} \det Q \qquad (14)
Proof

According to the definition of Q in Eq. (6), the right-hand side of Eq. (14) can be written as

\sigma_n^{2(D-N)} \det Q = \sigma_n^{2(D-N)} \det\!\left(\Phi^T\Phi + \sigma_n^2 I_N\right) = \sigma_n^{2D} \det\!\left(I_N + \sigma_n^{-2}\Phi^T\Phi\right) \qquad (15)

Note now the Sylvester determinant theorem, which states that

\det\!\left(I_N + UV\right) = \det\!\left(I_D + VU\right) \qquad (16)

where U is an N × D matrix and V is D × N, while I_N and I_D are the N × N and D × D identity matrices. Thus, we have in our case

\det\!\left(I_N + \sigma_n^{-2}\Phi^T\Phi\right) = \det\!\left(I_D + \sigma_n^{-2}\Phi\Phi^T\right) \qquad (17)

From Eq. (5) we have

\det A = \det\!\left(\Phi\Phi^T + \sigma_n^2 I_D\right) = \sigma_n^{2D}\,\det\!\left(I_D + \sigma_n^{-2}\Phi\Phi^T\right) \qquad (18)

Combining results (15), (17), and (18), one can see that Eq. (14) holds.

Now, replacing Q^{-1} in the first term on the right-hand side of Eq. (8) with its expression in terms of A and Φ from Eqs. (5) and (6) (via the matrix inversion lemma), and using Eq. (14) in the second term, we obtain

\log p(\mathbf{y}\mid X) = -\frac{1}{2\sigma_n^2}\left[\mathbf{y}^T\mathbf{y} - (\Phi\mathbf{y})^T A^{-1}(\Phi\mathbf{y})\right] - \frac{1}{2}\log\det A - (N-D)\log\sigma_n - \frac{N}{2}\log 2\pi \qquad (19)

In the case of D ≪ N, and using the Cholesky decomposition of A in Eq. (13), we obtain

\log p(\mathbf{y}\mid X) = -\frac{1}{2\sigma_n^2}\left[\|\mathbf{y}\|^2 - \|L^{-1}\Phi\mathbf{y}\|^2\right] - \sum_{i=1}^{D}\log L_{ii} - (N-D)\log\sigma_n - \frac{N}{2}\log 2\pi \qquad (20)

Here, the cost of computing the first term on the right-hand side is O(ND), and the cost of the second term is O(D).

Multi-dimensional optimization is usually an iterative process. Assuming that the number of iterations is N_it, we can write the costs of the training steps, marked by the respective superscripts, as follows

C^{train} = N_{it}\left(C^{(\Phi)} + C^{(chol)} + C^{(LML)}\right) = N_{it}\left[O(ND^2) + O(D^3) + O(ND)\right] \qquad (21)
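
The sketch below collects the pre-computations of this subsection for a fixed set of hyperparameters: forming A, its Cholesky factor, the weights, and the reduced-form LML. It assumes the weight-space expressions as written in Eqs. (5), (13), and (20) above and a feature matrix such as the phi_matrix helper from Section 3; function and variable names are illustrative.

```python
import numpy as np
from scipy.linalg import cho_solve, solve_triangular

def mgp_train(Phi, y, sigma_n):
    """Pre-computations for given hyperparameters: weights, Cholesky factor, and LML.

    Phi : (D, N) feature matrix built from the selected multiscale basis
    """
    D, N = Phi.shape
    A = Phi @ Phi.T + sigma_n**2 * np.eye(D)        # Eq. (5), cost O(N D^2)
    L = np.linalg.cholesky(A)                       # Eq. (13), cost O(D^3 / 3)
    Phi_y = Phi @ y                                 # O(N D)
    w = cho_solve((L, True), Phi_y)                 # weights w = A^{-1} Phi y, O(D^2)

    z = solve_triangular(L, Phi_y, lower=True)      # L^{-1} Phi y
    lml = (-0.5 / sigma_n**2 * (y @ y - z @ z)      # Eq. (20)
           - np.log(np.diag(L)).sum()
           - (N - D) * np.log(sigma_n)
           - 0.5 * N * np.log(2.0 * np.pi))
    return w, L, lml
```

Repeated inside an optimizer loop, each such evaluation costs O(ND^2 + D^3), consistent with Eq. (21).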

4.2 Testing

The cost of testing can be estimated from Eq. (5) as the cost of computing the predictive mean and variance. Taking into account the Cholesky decomposition (13), these costs can be written as

C^{mean} = O(D N_t), \qquad C^{var} = O(D^2 N_t) \qquad (22)

where N_t is the number of test points. The cost of computing the variance is much higher than that of computing the mean, because it is determined by the cost of solving a triangular system of size D × D for each evaluation point.
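
A matching evaluation-stage sketch is given below; it reuses the Cholesky factor and weights returned by the training sketch above, and the inclusion of only the model (not the observation-noise) variance is an assumption.

```python
from scipy.linalg import solve_triangular

def mgp_predict(Phi_test, w, L, sigma_n):
    """Predictive mean and variance at N_t test points (weight-space form of Eq. 5).

    Phi_test : (D, N_t) basis functions evaluated at the test inputs
    """
    mean = Phi_test.T @ w                             # O(D) per test point
    V = solve_triangular(L, Phi_test, lower=True)     # triangular solve, O(D^2) per test point
    var = sigma_n**2 * (V**2).sum(axis=0)             # sigma_n^2 * phi^T A^{-1} phi
    return mean, var
```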

5 Numerical Results

In this section, a set of simple analytical examples is first formulated to critically evaluate the accuracy and effectiveness of MGP and to contrast its performance with conventional GP regression. Following this, data from a turbulent combustion simulation is used to assess the viability of the approach in scientific computing problems.

5.1 Analytical examples

Figure 2: Comparison of the output of the standard kernel GP using a Gaussian kernel (left) and MGP with a single scale (right) for a step function. The crosses are training points, while the continuous lines and shaded areas show the predictive mean and variance.
Figure 3: Comparison of the performance of the standard GP and MGP for the step function in Figure 2. The plot on the left shows the size of the extended feature space D versus the training set size N for different levels of input noise σ_n. The wall-clock time is obtained using MATLAB on a typical desktop PC.
Figure 4: Comparison of the standard GP (left) and MGP (right) for a step function. In contrast to Figure 2, the training point distribution is non-uniform, with an exponential density increase near the jump. Gaussian noise was added. The crosses represent training points, while the continuous lines and shaded areas show the predictive mean and variance. The dots on the continuous line have the abscissas of the training points.
Figure 5: Comparison of the performance of the standard GP (left) and MGP (right) for a sine function, with added Gaussian noise. Crosses represent training points; continuous lines and shaded regions show the predictive mean and variance.
Figure 6: Comparison of the performance of the standard GP and MGP for the sine function in Figure 5. The wall-clock time is obtained using MATLAB on a typical desktop PC.

The first numerical example we present is a 1-D step function. We compare the standard kernel GP with a Gaussian kernel and the present MGP algorithm restricted to a single scale (n_s = 1). The training set of size N was randomly selected from 10000 points distributed on a uniform grid. Points not used for training were used as test points. The initial data was generated by adding normally distributed noise with standard deviation σ_n to the step function. Optimal hyperparameters were found using the Lagarias et al. variant of the Nelder-Mead method (as implemented by MATLAB's fminsearch function) [lagarias]. For the standard GP, the scale and σ_n were optimized, while for the MGP algorithm an additional parameter was optimized along with the scale and σ_n. In both cases, the optimal σ_n was close to the actual value used for data generation. The ratio between the optimal scales for the kernel and the finite-dimensional ("weight-space") approaches was found to be consistent with the difference predicted by the theory (the distinction between the weight-space and function-space approaches is provided in Ref. [rbfgp2]).
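
For readers wishing to reproduce a similar setup, the sketch below optimizes the scale and noise hyperparameters of the standard GP by maximizing the LML with a Nelder-Mead search (scipy's minimize standing in for MATLAB's fminsearch). It assumes the gp_fit_predict helper from the Section 2 sketch is in scope; the data sizes, noise level, and initial guesses are illustrative, not those of the experiments reported here.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 1))                                     # illustrative 1-D inputs
y = (X[:, 0] > 0.5).astype(float) + 0.05 * rng.normal(size=400)    # noisy step function

def neg_lml(log_params):
    scale, sigma_n = np.exp(log_params)          # log-parameterization keeps both positive
    _, _, lml = gp_fit_predict(X, y, X[:1], scale, sigma_n)
    return -lml

res = minimize(neg_lml, x0=np.log([0.1, 0.1]), method="Nelder-Mead")
scale_opt, sigma_opt = np.exp(res.x)
```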

The plots in Figure 2 show that there is no substantial difference in the mean and the variance computed using the two methods, while the sparse algorithm required only D basis functions compared to the N required for the standard method. Figure 3 shows the relationship between the size of the extended feature space (D) and the size of the training set (N). Ideally, this should be a linear dependence, since the step function is scale-independent. Due to random noise and the possibility that the optimization may converge to local minima of the objective function, however, the relationship is not exactly linear. Since D is nevertheless several times smaller than N, the wall-clock time for the present algorithm is shorter than that for the standard GP in both testing and training, per optimizer iteration. The total training time for the present algorithm can be larger than that of the standard algorithm due to the overhead associated with the larger number of hyperparameters and the resultant increase in the number of optimization iterations required. However, we observed this only for relatively small values of N. For larger values of N, the present algorithm was approximately 5 to 20 times faster than the standard algorithm.

Figure 4 illustrates a case where the training points are distributed non-uniformly. Such situations frequently appear in practical problems, where regions of high functional gradients are sampled with higher density to provide a good representation of the function. For example, adaptive meshes used to capture phenomena such as shock waves and boundary layers in fluid flow fall into this category. In the case illustrated here, the training points were mapped from the nodes of a regular grid through an exponential transformation that concentrates them near the jump. For MGP, multiple scales were used and the hyperparameters were optimized using the same routine as before. Due to multiple extrema in the objective function, it is rather difficult to optimize the number of scales, n_s. In practice, one should start from several initial guesses or fix some parameters such as n_s. We used several values of n_s and observed almost no differences beyond a certain number of scales, while results with too few scales were substantially different.

It is seen that MGP provides a much better fit of the step function in this case than the standard method. This is achieved due to its broad spectrum of scales. In the present example, the optimal scales were distributed as a geometric progression, and the remaining hyperparameters were optimized as well; for the standard GP, a single intermediate optimal scale was obtained. Figure 4 shows that with only a single intermediate scale, it is impossible to approximate the function between training points with a large spacing, whereas MGP provides a much better approximation. Moreover, since the smallest MGP scale is smaller than the single optimal scale of the standard GP, we also have a better approximation of the jump, i.e. of the small-scale behavior. This is clearly visible in the figure; the jump for the standard GP is stretched over about 10 intervals between sampling points, while the jump for MGP only extends over 3 intervals. Note that in the present example, we obtained D comparable to N, so the wall-clock time for testing is not faster than that of the standard GP. However, this case illustrates that multiple scales can provide good results for substantially non-uniform distributions where a single-scale description is not sufficient.

As a final analytical example, we explore a sine wave with varying frequency, depicted in Figure 5. As before, the training points are randomly chosen from a uniform distribution, and the remaining points are used as test points. For MGP, multiple scales were used. Compared to the previous examples, this is an intermediate case in terms of scale dependence. One noteworthy result from this dataset is that D is almost constant with respect to N, as seen in Figure 6. This shows that when the function is relatively smooth, the optimization process is not limited to producing a linear relationship between D and N. Another result is that the output variance of the multiscale method is visibly higher than that of the standard method. For the previous cases, the variances had either been close to equal, or the standard method produced the higher variance. This could be due to the fact that the basis and the noise level σ_n are inherently related for the current method, whereas σ_n is unrestricted for the standard GP. Since D < N, the optimized σ_n of the multiscale method is typically greater than that of the standard method. According to Eq. (7), this would result in a greater variance.

5.2 Data from Turbulent Combustion

Combustion in the presence of turbulent flow involves an enormous disparity in time and length scales, meaning that direct numerical simulations (DNS) are not possible in practical problems. Large eddy simulation (LES), in which the effects of scales smaller than the mesh resolution (i.e. subgrid scales) are modeled, is often a pragmatic choice. A key difficulty in LES of combustion is the modeling of the subgrid-scale fluxes in the species transport equations [kant1, kant2]. These terms arise as a result of the low-pass filtering of the governing equations, represented by the overbar operator, and are of the form

\overline{\rho u_i Y} - \bar{\rho}\,\tilde{u}_i\,\tilde{Y} \qquad (23)

where ρ, u_i, and Y represent density, velocity, and species mass fraction, respectively. Subgrid-scale closures based on concepts from non-reacting flows, such as the gradient-transport expression below (see footnote 6),

\overline{\rho u_i Y} - \bar{\rho}\,\tilde{u}_i\,\tilde{Y} = -\frac{\bar{\rho}\,(C_s\Delta)^2\,|\tilde{S}|}{Sc_t}\,\frac{\partial \tilde{Y}}{\partial x_i} \qquad (24)

are found to be inadequate for turbulent combustion. Modeling of the scalar fluxes thus continues to be an active area of research, and many analytical models are being evaluated by the community. Reference [chakraborthy] provides a concise summary of such developments in the area of premixed turbulent flames.

Figure 7: Contours of the normalized subgrid flux on the test plane. Left: DNS values (exact). Center: GP output. Right: MGP output.
Figure 8: Original test output versus learned result for the normalized subgrid flux. Red regions extend one noise standard deviation (the optimized estimate of the input noise) from the diagonal.

In this work, we intend to apply GP and MGP to identify the model relationships from data. The simulation data is obtained from Ref. [raman], in which DNS of a propagating turbulent flame is performed. The configuration involves a premixed hydrogen-air flame propagating in the x-direction. The flame forms a thin front between the burnt and unburnt gases. Specifically, we attempt to learn the normalized flux in the x-direction as a function of seven variables:

(25)

A total of 3 million data points were generated from the DNS dataset by performing low-pass filtering. Some care was taken in choosing training points from this set. Since the flame is thin relative to the size of the domain, the majority of the data points were found to lie outside the region where the flux is nonzero. To mitigate this disadvantageous distribution, 80 percent of the training points were randomly chosen from the data with non-negligible flux, and 20 percent were chosen from the data with negligible flux, i.e. outside the flame. 45000 training points were used in total. For testing, a single plane of around 6500 points was set aside.
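
The 80/20 stratified selection described above can be sketched as follows; the threshold used to distinguish flame from non-flame points and the array names are hypothetical, and the candidate pools are assumed large enough for the requested sample sizes.

```python
import numpy as np

def stratified_sample(features, flux, n_train, frac_flame=0.8, tol=1e-6, rng=None):
    """Choose n_train points: ~80% where the target flux is non-negligible, ~20% outside."""
    rng = np.random.default_rng() if rng is None else rng
    in_flame = np.flatnonzero(np.abs(flux) > tol)      # points inside the flame region
    outside = np.flatnonzero(np.abs(flux) <= tol)      # points outside the flame
    n_in = min(int(frac_flame * n_train), in_flame.size)
    idx = np.concatenate([
        rng.choice(in_flame, size=n_in, replace=False),
        rng.choice(outside, size=n_train - n_in, replace=False),
    ])
    return features[idx], flux[idx]
```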

The predictive results on this plane are shown in Table 1, and Figure 8 shows the ML output versus the true DNS values. From these results, it is seen that MGP achieves a fifty-fold increase in evaluation speed for a corresponding two percent increase in error. In the scatter plot, MGP appears to perform better than GP for low values of the flux and worse for high values, but the overall difference is small. Figure 7 is a side-by-side comparison of contours of the flux from DNS, GP, and MGP. It is especially evident in the contour plots that GP and MGP are both able to capture features of the flame, whereas the analytical models described in Ref. [chakraborthy] were not as accurate (see footnote 7). Figure 9 plots the flux along several lines perpendicular and parallel to the flame in the test plane.

         Error     Time (s)
GP       0.1087
MGP      0.1105

Table 1: Numerical results for learning the normalized flux in the flame. The error is a norm of the difference between the vector of exact outputs and the vector of GP or MGP outputs at the test points. Time is the test time in seconds, obtained in MATLAB. MGP utilized 3 scales.
Figure 9: The normalized flux as a function of one in-plane coordinate, for four different constant values of the other coordinate, on the test plane (panels (a)-(d)). Dashed black lines are from DNS. GP results are on the left, MGP results on the right. Red regions are one standard deviation of the output.

6 Conclusion

In this paper, a new Gaussian process (GP) regression technique was presented. The method, referred to as MGP, introduces multiple scales among the Gaussian basis functions and employs hierarchical clustering to select centers for these sparse basis functions. These modifications reduce the computational complexity of GP regression and also achieve better accuracy than standard GP regression when training points are non-uniformly distributed in the feature space. We illustrated these improvements through analytical examples and a turbulent combustion dataset.

In the analytical examples with smooth functions, MGP was shown to be at least an order of magnitude faster than the standard GP for a similar level of accuracy. In problems with discontinuities, MGP was shown to provide a much better fit. In the seven-dimensional turbulent combustion example, MGP was shown to achieve a fifty-fold increase in evaluation speed for a corresponding two percent increase in error over the GP method. Overall, MGP is well-suited for regression problems where the inputs are unevenly distributed or where training and testing speeds are critical.

Based on the results presented in this work, MGP offers promise as a potentially attractive method for use in many scientific computing applications in which datasets may be large, and sparsely distributed in a high-dimensional feature space. The MGP can be especially useful when predictive evaluations are performed frequently. However, further developments and more detailed application studies are warranted. It was observed that the optimization process is more likely to terminate at local minima compared to conventional GPs. An immediate area of investigation could explore more efficient and robust techniques for optimization of the hyperparameters. Since an appealing feature of MGP is the reduced complexity when working with large datasets, efficient parallelization strategies should be explored.

The software and all the examples in this paper are openly available at http://umich.edu/~caslab/#software.

Acknowledgments

This work was performed under the NASA LEARN grant titled “A Framework for Turbulence Modeling Using Big Data.”

Footnotes

  1. email: kdur@umich.edu
  2. email: kdur@umich.edu
  3. email: kdur@umich.edu
  4. journal: Machine Learning
  5. Alternatively, efficient decompositions enabling solutions with multiple right hand sides can be used.
  6. The model coefficients are typically constants, and S̃ is the Favre-filtered strain-rate tensor. The tilde denotes Favre filtering, defined by \tilde{f} = \overline{\rho f}/\bar{\rho}. Δ is the filter size.
  7. These results are not shown

References

  1. Mjolsness, E. and DeCoste, D., “Machine learning for science: state of the art and future prospects,” Science, Vol. 293, No. 5537, 2001, pp. 2051–2055.
  2. Yairi, T., Kawahara, Y., Fujimaki, R., Sato, Y., and Machida, K., “Telemetry-mining: a machine learning approach to anomaly detection and fault diagnosis for space systems,” Space Mission Challenges for Information Technology, 2006. SMC-IT 2006. Second IEEE International Conference on, IEEE, 1992, pp. 8–pp.
  3. Ling, J. and Templeton, J., “Evaluation of machine learning algorithms for prediction of regions of high Reynolds averaged Navier Stokes uncertainty,” Physics of Fluids, Vol. 27, No. 8, 2015.
  4. Parish, E. and Duraisamy, K., “A Paradigm for data-driven predictive modeling using field inversion and machine learning,” Journal of Computational Physics, Vol. 305, 2016, pp. 758–774.
  5. Gautier, N., Aider, J., Duriez, T., Noack, B. R., Segond, M., and Abel, M., “Closed-loop separation control using machine learning,” Journal of Fluid Mechanics, Vol. 770, 2015, pp. 442–457.
  6. Wilkinson, J., The algebraic eigenvalue problem, Vol. 87, Clarendon Press Oxford, 1965.
  7. Gumerov, N. A. and Duraiswami, R., “Fast radial basis function interpolation via preconditioned Krylov iteration,” SIAM Journal on Scientific Computing, Vol. 29, No. 5, 2007, pp. 1876–1899.
  8. Duraisamy, K., Zhang, Z., and Singh, A., “New Approaches in Turbulence and Transition Modeling Using Data-driven Techniques,” 2014.
  9. Quiñonero-Candela, J. and Rasmussen, C. E., “A unifying view of sparse approximate Gaussian process regression,” The Journal of Machine Learning Research, Vol. 6, 2005, pp. 1939–1959.
  10. Seeger, M., Williams, C., and Lawrence, N., “Fast forward selection to speed up sparse Gaussian process regression,” Artificial Intelligence and Statistics 9, No. EPFL-CONF-161318, 2003.
  11. Smola, A. J. and Bartlett, P., “Sparse Greedy Gaussian Process Regression,” Advances in Neural Information Processing Systems 13, MIT Press, 2001, pp. 619–625.
  12. Walder, C., Kim, K. I., and Schölkopf, B., “Sparse multiscale Gaussian process regression,” Proceedings of the 25th International Conference on Machine learning, 2008, pp. 1112–1119.
  13. Titsias, M. K., “Variational learning of inducing variables in sparse Gaussian processes,” International Conference on Artificial Intelligence and Statistics, 2009, pp. 567–574.
  14. Rahimi, A. and Recht, B., “Random features for large-scale kernel machines,” Advances in neural information processing systems, 2007, pp. 1177–1184.
  15. Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D., and O’Neil, M., “Fast Direct Methods for Gaussian Processes,” 2016.
  16. Snelson, E. and Ghahramani, Z., “Local and global sparse Gaussian process approximations,” International Conference on Artificial Intelligence and Statistics, 2007, pp. 524–531.
  17. Park, S. and Choi, S., “Hierarchical Gaussian Process Regression,” ACML, 2010, pp. 95–110.
  18. Urtasun, R. and Darrell, T., “Sparse probabilistic regression for activity-independent human pose inference,” Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, 2008, pp. 1–8.
  19. Choudhury, A., Nair, P. B., and Keane, A. J., “A Data Parallel Approach for Large-Scale Gaussian Process Modeling.” SDM, SIAM, 2002, pp. 95–111.
  20. Gramacy, R. B., Niemi, J., and Weiss, R. M., “Massively parallel approximate Gaussian process regression,” SIAM/ASA Journal on Uncertainty Quantification, Vol. 2, No. 1, 2014, pp. 564–584.
  21. Zhou, Y., Zhang, T., and Li, X., “Multi-scale Gaussian Processes model,” Journal of Electronics (China), Vol. 23, No. 4, 2006, pp. 618–622.
  22. Rasmussen, C. E., “Gaussian processes for machine learning,” 2006.
  23. Hartigan, J. A. and Wong, M. A., “Algorithm AS 136: A k-means clustering algorithm,” Applied statistics, 1979, pp. 100–108.
  24. Cheng, D., Gersho, A., Ramamurthi, B., and Shoham, Y., “Fast search algorithms for vector quantization and pattern matching,” Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’84., Vol. 9, IEEE, 1984, pp. 372–375.
  25. Lagarias, J. C., Reeds, J. A., Wright, M. H., and Wright, P. E., “Convergence properties of the Nelder–Mead simplex method in low dimensions,” SIAM Journal on optimization, Vol. 9, No. 1, 1998, pp. 112–147.
  26. Lipatnikov, A. N. and Chomiak, J., “Effects of premixed flames on turbulence and turbulent scalar transport,” Progress in Energy and Combustion Science, Vol. 36, No. 1, 2010, pp. 1–102.
  27. Tullis, S. and Cant, R. S., “Scalar transport modeling in large eddy simulation of turbulent premixed flames,” Proceedings of the Combustion Institute, Vol. 29, No. 2, 2002, pp. 2097–2104.
  28. Gao, Y., Klein, M., and Chakraborty, N., “Assessment of sub-grid scalar flux modelling in premixed flames for Large Eddy Simulations: A-priori Direct Numerical Simulation analysis,” European Journal of Mechanics - B/Fluids, Vol. 52, 2015, pp. 97 – 108.
  29. Hassanaly, M., Raman, V., Koo, H., and Colkett, M. B., “Influence of Fuel Stratification on Turbulent Flame Propagation,” 2014.