Standing Wave Decomposition Gaussian Process

Chi-Ken Lu, Scott Cheng-Hsin Yang, and Patrick Shafto
Department of Mathematics and Computer Science, Rutgers University, Newark, New Jersey 07302
July 15, 2019
Abstract

We propose a Standing Wave Decomposition (SWD) approximation to Gaussian Process regression (GP). GP involves a costly matrix inversion operation, which limits its applicability to the analysis of large data sets. For an input space that can be approximated by a grid and when correlations among data are short-ranged, the kernel matrix inversion can be replaced by analytic diagonalization using the SWD. We show that this approach applies to uni- and multi-dimensional input data, extends to include longer-range correlations, and that the grid can lie in a latent space and serve as inducing points. Through simulations, we show that our approximate method outperforms existing methods in predictive accuracy per unit time in the regime where data are plentiful. Our SWD-GP is recommended for regression analyses where there is a relatively large amount of data and/or there are constraints on computation time.

I Introduction

Gaussian Process (GP) regression [1] is powerful as well as elegant because of its ability to transform correlation into prediction. This nonparametric and Bayesian approach need not specify a predetermined function for the data to fit, and only assumes that any subset of data follows a joint Gaussian distribution characterized by the mean and covariance functions. GP is equivalent to a class of neural networks with an infinite number of hidden units [2, 3]. As a flexible model for highly complicated functions and physical processes, GP is a familiar tool for physicists in many contexts, such as predicting melting temperatures [4], inter-atomic potentials [5], fractional Brownian motion [6], DNA replication kinetics [7, 8], and gravitational waveforms [9]. Furthermore, in the machine learning community, GP can be used for classification [1, 10], data dimension reduction, and signal reconstruction [11].

However, the scalability of GP regression is limited by its training time, which scales cubically with the number of training points. Inverting the kernel matrix is the critical step that essentially transforms correlation into prediction. The task of speeding up GP training has been taken up by computer scientists, who use inducing points and the Nyström method to approximate the kernel matrix [12], and, recently, by physicists [13, 14] who employ quantum algorithms capable of solving systems of linear equations [15]. In this paper, we propose an approximate GP based on sparsifying the kernel matrix of a latent grid in the input space, which allows us to analytically diagonalize the kernel matrix by employing a standing wave decomposition. Observing that correlation among nearest neighbors is most relevant to prediction, we approximate the full kernel matrix with its tridiagonal part (in a tridiagonal matrix, the only non-zero values are on the main diagonal, superdiagonal, and subdiagonal). This tridiagonal matrix has analytic eigenvalues and eigenvectors resembling standing waves, a fact that has been extensively exploited in studies of the electronic structure of solids [16]. If the training data lie on a grid, as with time-series data, then the approximated kernel matrix can be used immediately in the usual GP prediction algorithm. When the given data are not on a grid in the input space, a latent grid is treated as inducing points and the function values are obtained by Bayesian projection from the off-grid data. Our approach works best for the squared exponential (SE) kernel, a very popular kernel for many machine learning tasks, because its off-diagonal matrix elements decay rapidly. For high-dimensional grid data, the SE kernel matrix has a Kronecker-product form, which makes the present approach applicable as well.

One constraint of the approximated GP is that the ratio of the SE kernel length scale to the grid spacing has an upper bound, because the kernel matrix must be positive semi-definite, which is guaranteed if the smallest eigenvalue is positive. The tridiagonal approximation works well when the neglected off-diagonal matrix elements have little effect on the small eigenvalues and the corresponding eigenvectors. An improvement to the tridiagonal approximation, which also raises the upper bound on the length scale, is the pentadiagonal approximation, in which the next-nearest-neighbor correlation is included in the kernel matrix. The pentadiagonal matrix, however, does not have a simple closed form for its eigenvalues and eigenvectors. Thus, using the eigenvectors of the tridiagonal kernel matrix, we reconstruct a kernel matrix that includes the next-nearest-neighbor correlation terms and whose eigenvectors remain exact while the eigenvalues are modified. Finally, a simulation with synthetic data is carried out and compared against other GP approximations to highlight the conditions under which the proposed method is at its best.

The rest of the paper is structured as follows. In Sec. II, an overview of GP regression along with an analytic example in terms of eigenbasis decomposition is given. In Sec. III, we introduce the approximate tridiagonal kernel matrix for one-dimensional grid input and derive the standing wave eigenbasis. The extension to two-dimensional grid input using the Kronecker product is provided in Sec. IV. To account for longer-ranged correlations, we demonstrate in Sec. V the reconstruction of a nearly pentadiagonal kernel matrix, which allows the standing wave decomposition Gaussian process (SWD-GP) to be applied with a longer length scale. Based on the above techniques, in Sec. VI our SWD-GP is combined with a latent grid (LG) to model data that do not lie on a grid. Our LGSWD-GP method is compared with popular existing methods, and the simulated accuracy and run time are reported in Sec. VII. Sec. VIII is devoted to the investigation of the time complexity of our approximate methods and a comparison with some popular algorithms. Finally, related work and a brief discussion are given in Sec. IX and Sec. X, respectively.

II Overview of GP

Here we give an overview of GP by considering two unobserved values $f_1$ and $f_2$ at two distinct locations $x_1$ and $x_2$, respectively. To make the problem tractable, we may assume the values follow a joint normal distribution with zero mean and covariance matrix

$$K = \begin{pmatrix} \sigma^2 & k \\ k & \sigma^2 \end{pmatrix}. \qquad (1)$$

The parameter $k$ is a function of the distance $|x_1 - x_2|$, which quantifies the correlation between $f_1$ and $f_2$. Here $|k| \le \sigma^2$ so that $K$ is positive semidefinite. It is then interesting to see how the joint distribution adjusts itself after one of the variables, say $f_1$, has been observed. The way the correlation is transformed into prediction is seen by first decomposing the inverse of the kernel matrix using its eigenvalues and eigenvectors, which yields $K^{-1} = \sum_{n=1,2} \lambda_n^{-1}\,\mathbf{v}_n \mathbf{v}_n^{\top}$. The eigenvalues are $\lambda_{1,2} = \sigma^2 \pm k$ and the corresponding eigenvectors read

$$\mathbf{v}_{1,2} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ \pm 1 \end{pmatrix}. \qquad (2)$$

After some algebra, we find that the joint distribution is factorized into,

$$p(f_1, f_2) \propto \exp\!\left[-\frac{f_1^2}{2\sigma^2}\right] \exp\!\left[-\frac{\left(f_2 - k f_1/\sigma^2\right)^2}{2\left(\sigma^2 - k^2/\sigma^2\right)}\right]. \qquad (3)$$

Thus, the adjusted distribution for $f_2$ becomes conditional on the observed $f_1$. It is easy to verify that this conditional distribution is still Gaussian, with updated mean $k f_1/\sigma^2$ and variance $\sigma^2 - k^2/\sigma^2$, respectively.
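To make the conditioning step concrete, the following minimal sketch (in Python with NumPy) reproduces the two-variable example numerically; the names sigma2, k, and f1 mirror the notation assumed above, and the numerical values are purely illustrative.

```python
import numpy as np

# Two-variable example: condition a zero-mean bivariate Gaussian on f1.
# sigma2, k, and f1 are illustrative values, not taken from the paper.
sigma2, k = 1.0, 0.6
K = np.array([[sigma2, k], [k, sigma2]])
f1 = 0.8

# Closed-form conditional moments of f2 given f1.
mean_closed = k * f1 / sigma2
var_closed = sigma2 - k**2 / sigma2

# The same result via the eigen-decomposition of K: eigenvalues sigma2 +/- k,
# eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
vals, vecs = np.linalg.eigh(K)
K_inv = (vecs / vals) @ vecs.T                 # sum_n v_n v_n^T / lambda_n
mean_eig = -K_inv[1, 0] / K_inv[1, 1] * f1     # conditional mean from the precision matrix
var_eig = 1.0 / K_inv[1, 1]                    # conditional variance from the precision matrix

print(np.isclose(mean_closed, mean_eig), np.isclose(var_closed, var_eig))  # True True
```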

Generalizing the above procedure to a multivariate joint distribution over the set of observed and unobserved variables with a general kernel matrix constitutes GP regression, which seeks the underlying latent function mapping the input points to function values. Because of the Gaussianity, the correlation among the data leads to a conditional distribution over the unobserved variable, which is another Gaussian with updated mean,

$$\mu(\mathbf{x}_*) = \mathbf{k}_*^{\top} K^{-1} \mathbf{f}, \qquad (4)$$

and variance,

$$\sigma^2(\mathbf{x}_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^{\top} K^{-1} \mathbf{k}_*, \qquad (5)$$

where $K$ denotes the kernel matrix associated with the training data $\mathbf{f}$ and $\mathbf{k}_*$ denotes the vector of kernel values between the training points and the test point $\mathbf{x}_*$. Corresponding to the above example, $K = \sigma^2$, $\mathbf{k}_* = k$, and $\mathbf{f} = f_1$.

In the following, we shall focus on kernel matrices generated by the squared exponential kernel function,

$$k(\mathbf{x}, \mathbf{x}') = \sigma^2 \prod_{d=1}^{D} \exp\!\left[-\frac{(x_d - x_d')^2}{2\ell^2}\right], \qquad (6)$$

where the product form appears because the squared distance separates into a sum along each dimension. The hyperparameters include the variance $\sigma^2$ and the length scale $\ell$.
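As a baseline for the approximations developed in the rest of the paper, here is a minimal sketch of exact GP prediction with the SE kernel of (6), following the standard formulas in (4) and (5). The helper name se_kernel, the toy target function, and all numerical values are illustrative choices rather than the paper's setup.

```python
import numpy as np

def se_kernel(X, Y, sigma2=1.0, ell=0.1):
    """Squared exponential kernel in product form over the input dimensions."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return sigma2 * np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))            # training inputs (illustrative)
f = np.sin(3 * np.pi * X[:, 0])                # noise-free training targets
Xs = np.linspace(0, 1, 200)[:, None]           # test inputs

K = se_kernel(X, X) + 1e-10 * np.eye(len(X))   # small jitter for numerical stability
Ks = se_kernel(Xs, X)                          # test-train kernel matrix

alpha = np.linalg.solve(K, f)
mean = Ks @ alpha                              # predictive mean, cf. (4)
var = se_kernel(Xs, Xs).diagonal() - np.einsum(
    'ij,ij->i', Ks, np.linalg.solve(K, Ks.T).T)  # predictive variance, cf. (5)
```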

III Standing Wave Eigen Basis

It is observed that the kernel matrix given by (6) has a simplified form if all the input points lie on a one-dimensional grid, since the matrix elements then depend only on the index difference $|i-j|$, as in a Toeplitz matrix [17]. Here we make a further simplification by retaining only the most relevant terms, namely the matrix elements with $|i-j| \le 1$, which is quite legitimate when the grid spacing $a$ is larger than the length scale. Notice that the remaining matrix elements decrease rapidly as $\epsilon^{(i-j)^2}$, where $\epsilon \equiv e^{-a^2/2\ell^2}$. The eigenvalue equations in terms of the coefficients of an eigenvector read

(7a)
(7b)
(7c)

where (7b) applies to the interior indices, and the eigenvalues are rescaled by the overall kernel variance $\sigma^2$. We observe that (7) is similar to the Hamiltonian of an electron hopping on a semi-infinite one-dimensional tight-binding lattice [16]. Defining the z-transform of the coefficient sequence together with an auxiliary boundary term, we find from (7a) and (7b) a closed-form expression for the transform. Moreover, the coefficient sequence can be reproduced from the contour integral [16],

(8)

followed by employing the residue theorem. We note that the transform has a pair of poles, with which the eigenvalues are consistent. Thus, the coefficients are shown to be,

(9)

for all nonnegative indices. The final stage is to take care of (7c), which is not necessarily consistent with the solution in (9) for arbitrary parameters. The only way to make (9) a valid solution is to set the auxiliary term to zero, which results in the quantized arguments

$$\theta_n = \frac{n\pi}{N+1}, \qquad n = 1, 2, \ldots, N, \qquad (10)$$

and the corresponding eigenvalues,

$$\lambda_n = \sigma^2\left(1 + 2\epsilon\cos\theta_n\right), \qquad (11)$$

Consequently, we may conclude the orthonormal standing wave basis,

$$\phi_n(j) = \sqrt{\frac{2}{N+1}}\,\sin\!\left(j\,\theta_n\right), \qquad j = 1, \ldots, N. \qquad (12)$$

Fig. 1 demonstrates four pairs of symmetric eigenvectors of a 20×20 kernel matrix. As seen from the identity in (10), it is interesting to remark that the arguments $\theta_n$ and $\theta_{N+1-n}$ always sum to $\pi$, which implies that the components of $\boldsymbol{\phi}_n$ and $\boldsymbol{\phi}_{N+1-n}$ have the same magnitude. Such symmetry is also manifested in Fig. 1 where, for instance, in the upper left panel the star symbols representing $\boldsymbol{\phi}_{N+1-n}$ alternate in sign whereas the dot symbols associated with $\boldsymbol{\phi}_n$ do not. Finally, the inverted kernel matrix can be simply decomposed as,

$$K^{-1} = \sum_{n=1}^{N} \frac{1}{\lambda_n}\,\boldsymbol{\phi}_n \boldsymbol{\phi}_n^{\top}. \qquad (13)$$

Figure 1: Four pairs of eigenvectors $\boldsymbol{\phi}_n$ and $\boldsymbol{\phi}_{N+1-n}$ are shown for the tridiagonal kernel matrix with N = 20. Circle (star) symbols denote the components of $\boldsymbol{\phi}_n$ ($\boldsymbol{\phi}_{N+1-n}$) for n = 1, 2, 3, and 4.

Here, we note that the same sparse approximation that yields the tridiagonal $K$ must also be applied to the cross-kernel vector $\mathbf{k}_*$, so that the product $\mathbf{k}_*^{\top} K^{-1}$ reduces to a row vector that is one at the $i$th element and zero otherwise whenever the test point coincides with the $i$th training point. The expression for $K^{-1}$ in (13) then implies that the prediction mean of SWD-GP passes through the training points as long as all the eigenvalues are positive. Next, we employ the inverted kernel matrix (13) in the GP learning algorithm (4) and (5), and apply this SWD-GP to on-grid data evenly spaced over the input interval. The upper and lower panels in Fig. 2 show the results for a smaller and a larger length scale, respectively, the latter corresponding to a ratio of 0.54 between length scale and grid spacing. The prediction mean associated with the smaller length scale passes through all the training points but is not able to capture the longer-distance features. Away from the training points, the variance is close to the prior variance $\sigma^2$ in (11), which is expected from the squared exponential kernel. On the other hand, the prediction mean and variance in the lower panel are optimal for the length scale used, because increasing the length scale further would lead to zero or negative eigenvalues in (11).
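The pieces of this section can be put together in a short sketch: build the tridiagonal approximation of the SE kernel on a grid, check the standing-wave eigenpairs of (11) and (12) against a numerical eigendecomposition, and use the analytic inverse (13) for prediction. The grid size, length scale, target function, and the particular truncation rule for the cross-kernel vector are illustrative choices.

```python
import numpy as np

# Tridiagonal SWD-GP on a one-dimensional grid (illustrative parameters).
N, sigma2, ell = 20, 1.0, 0.03
x_grid = np.linspace(0, 1, N)
a = x_grid[1] - x_grid[0]                      # grid spacing
eps = np.exp(-a**2 / (2 * ell**2))             # nearest-neighbour correlation

# Tridiagonal approximation of the SE kernel matrix on the grid.
K = sigma2 * (np.eye(N) + eps * (np.eye(N, k=1) + np.eye(N, k=-1)))

# Analytic standing-wave eigenpairs of the tridiagonal Toeplitz matrix.
n = np.arange(1, N + 1)
theta = n * np.pi / (N + 1)
lam = sigma2 * (1 + 2 * eps * np.cos(theta))
j = np.arange(1, N + 1)[:, None]
phi = np.sqrt(2 / (N + 1)) * np.sin(j * theta[None, :])
assert np.allclose(K @ phi, phi * lam)         # K phi_n = lam_n phi_n
assert np.allclose(phi.T @ phi, np.eye(N))     # orthonormality

# SWD-GP prediction: analytic inverse from (13), with the cross-kernel
# vector sparsified in the same spirit as K (nearest grid points only).
y = np.sin(2 * np.pi * x_grid)                 # illustrative on-grid data
alpha = ((phi / lam) @ phi.T) @ y              # K^{-1} y

def swd_gp_mean(x_star):
    d = x_star[:, None] - x_grid[None, :]
    k_star = sigma2 * np.exp(-d**2 / (2 * ell**2))
    k_star[np.abs(d) > 1.5 * a] = 0.0          # keep only nearest-neighbour terms
    return k_star @ alpha

assert np.allclose(swd_gp_mean(x_grid), y)     # the mean interpolates the grid data
```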

Figure 2: Regression using on-grid data (red dots) generated by the true function (black dashed line). The blue line and the yellow dashed lines represent the prediction mean of SWD-GP and the band of plus/minus one standard deviation around the mean. The 10 grid points are uniformly placed in the input interval. The length scales for the upper and lower panels differ, with 0.06 used for the lower panel.

IV Multi-dimensional Grid

For higher-dimensional input on a grid, the squared exponential kernel function, being an instance of a product kernel, allows a compact and efficient Kronecker-product representation of the kernel matrix [18],

$$K = K^{(1)} \otimes K^{(2)} \otimes \cdots \otimes K^{(D)}, \qquad (14)$$

where the matrix element associated with the $d$th dimension is given by $K^{(d)}_{ij} = \exp\!\left[-(x^{(d)}_i - x^{(d)}_j)^2/2\ell^2\right]$. Fig. 3 shows the regression results for data generated from a two-dimensional test function using a 10×10 (middle) and a 20×20 (right) grid, respectively.
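A short sketch of the Kronecker structure in (14): the SE kernel matrix over a small two-dimensional grid factorizes into one-dimensional kernel matrices. The grid sizes and length scale below are arbitrary illustrative values.

```python
import numpy as np

ell = 0.1
x1 = np.linspace(0, 1, 10)                      # grid coordinates, dimension 1
x2 = np.linspace(0, 1, 10)                      # grid coordinates, dimension 2

def se_1d(x):
    return np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)

K1, K2 = se_1d(x1), se_1d(x2)
K_kron = np.kron(K1, K2)                        # kernel over the full 100-point grid

# Direct construction over all grid points for comparison.
X = np.array([[u, v] for u in x1 for v in x2])
d2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
K_full = np.exp(-0.5 * d2 / ell**2)

print(np.allclose(K_kron, K_full))              # True: the SE kernel factorizes
# The eigen-decomposition factorizes in the same way, so the standing-wave
# treatment can be applied dimension by dimension.
```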

Figure 3: Two-dimensional grid used for regression with data generated by a two-dimensional test function.

V Kernel Matrix Reconstruction

Thus far it has been seen how the eigendecomposition of the kernel matrix in the standing wave basis (12) leads to the GP prediction given one-dimensional (Fig. 2) and two-dimensional grid inputs (Fig. 3). Although the size of the grid and its extent can be varied arbitrarily, the length scale is always constrained. More precisely, demanding positive semi-definiteness of the kernel matrix, which is guaranteed if the minimum eigenvalue in (11) is positive, leads to an upper bound on the ratio of length scale to grid spacing. However, an exact GP in which every correlation term is retained is free of such a constraint. In the following, we show how to incrementally relax the constraint by reconstructing the kernel matrix.

As a motivation, let us examine the upper panel in Fig. 4, which shows the results from the exact GP (green stars) and the tridiagonal SWD-GP (blue solid line) for data generated by the true function (black dashed line), using a length scale of 0.03 on a grid of 20 points in [0,1]. First, we note that the exact GP and the present SWD-GP have identical results, suggesting that the SWD-GP is indeed a good approximation. Secondly, both results are unable to capture the curvature near the extreme points; there is a subtle discrepancy against the true function close to the local extrema. Besides the fact that there are not sufficiently many data points near these points, increasing the length scale may help, as suggested by the exact GP result (green stars) in the lower panel of Fig. 4, where a longer length scale is used. However, our simulation shows that directly applying the tridiagonal SWD-GP with the longer length scale does not generate as smooth a prediction mean as the shorter one does. This is because, as the length scale approaches the upper bound, the smallest eigenvalue approaches zero and creates numerical instabilities.

In order to extend the applicable range of the length scale, we are interested in restoring the next-nearest-neighbor (NNN) correlation terms in the kernel matrix, i.e., the matrix elements with $|i-j| = 2$, which are proportional to $\epsilon^4$. Inspection of (7) suggests that each eigenvalue acquires a new contribution, so that the new eigenvalue reads

$$\lambda_n = \sigma^2\left(1 + 2\epsilon\cos\theta_n + 2\epsilon^4\cos 2\theta_n\right), \qquad (15)$$

while the eigenvectors are left intact. In fact, neither the eigenvalues nor the eigenvectors are exact because the correlation among the grid points is not symmetric near the boundary.

Instead of solving for the true eigenvalues and eigenvectors of the kernel matrix containing the NNN correlation, we reconstruct a kernel matrix that is still diagonalized by the eigenbasis in (12) and has the eigenvalues in (15). We apply the following identity to reconstruct the kernel matrix,

(16)

with the coefficient defined as,

(17)

for indices away from the boundary, with a modified value otherwise. It is easy to see that the constant term together with the first cosine term in (15) gives rise to the original tridiagonal matrix. The last term in (15), through the right-hand side of (16), leads to the desired NNN correlation terms. However, special care is needed near the boundary, and it is easily verified that the reconstructed kernel matrix elements corresponding to the end points of the grid are slightly modified.

Figure 4: Regression results from the SWD-GP (blue lines) and exact GP (green star symbols) in comparison with the true function (black dashed lines). Red dots denote the training data points generated by the true function on a grid of 20 points in [0, 1]. The upper panel is obtained by applying the tridiagonal SWD-GP with length scale 0.03, while the lower panel uses the reconstructed pentadiagonal kernel matrix with length scale 0.04.

In summary, the reconstructed pentadiagonal kernel matrix, which contains the NNN correlation terms, reads,

(18)

As shown in the lower panel of Fig. 4, we employ the pentadiagonal kernel with the longer length scale (0.04) to repeat the regression task, and we are able to reproduce the same result as the exact GP with the same length scale. Moreover, including the NNN correlation also improves the accuracy near the local extreme points.
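The reconstruction idea can be sketched as follows: keep the standing-wave eigenvectors of (12) and modify the eigenvalues to absorb the NNN correlation. The cos(2θ) form used below for the eigenvalue correction is our reading of (15) and should be treated as an assumption; all parameters are illustrative.

```python
import numpy as np

N, a, ell, sigma2 = 20, 1.0 / 19, 0.04, 1.0
eps = np.exp(-a**2 / (2 * ell**2))

n = np.arange(1, N + 1)
theta = n * np.pi / (N + 1)
j = np.arange(1, N + 1)[:, None]
phi = np.sqrt(2 / (N + 1)) * np.sin(j * theta[None, :])        # standing waves

lam_tri = sigma2 * (1 + 2 * eps * np.cos(theta))               # tridiagonal eigenvalues
lam_penta = lam_tri + sigma2 * 2 * eps**4 * np.cos(2 * theta)  # assumed NNN correction

# Reconstructed kernel: same eigenvectors, modified eigenvalues.
K_rec = (phi * lam_penta) @ phi.T
K_rec_inv = (phi / lam_penta) @ phi.T                          # analytic inverse, as in (13)

# The second superdiagonal now carries the NNN band sigma2 * eps**4, while the
# end points of the main diagonal come out slightly modified, consistent with
# the boundary effect discussed in the text.
print(np.allclose(np.diag(K_rec, k=2), sigma2 * eps**4))       # True
print(K_rec[0, 0], K_rec[N - 1, N - 1])
```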

VI Latent Grid SWD-GP

VI.1 Bayesian Data Projection

When the method is applied to data whose input points are randomly located rather than on a grid, we construct a latent grid so that the previous SWD-GP approach can be employed. The missing ingredients are the corresponding function values evaluated at these grid points. We treat these latent function values as a set of random variables and find their most probable values in a Bayesian approach. We first calculate the posterior distribution associated with the latent grid values,

(19)

where we have assumed independence among the given data. The likelihood function follows the usual GP predictive distribution, assuming the latent grid values are known,

(20)

and the variance at each input point is denoted by

(21)

The prior over the latent grid values is a zero-mean Gaussian with the kernel matrix of the grid,

(22)

With some tedious calculations, the posterior distribution can be shown to be,

(23)

where the updated variance is encoded in

(24)

with the matrix

(25)

Here the diagonal matrix collects the variances defined in (21). Furthermore, the mean vector represents the most probable projected function values,

(26)

Now, by marginalizing over the latent function values on the grid, we obtain the predictive distribution for the function value associated with the test point,

(27)

which is consistent with the results presented in [19, 20] in the noise-free case.
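For a concrete starting point, here is a minimal sketch of the projection onto a latent grid in the spirit of (19)-(27). It follows the standard FITC-style algebra [19, 20], so the matrices below correspond to (24)-(26) only up to details; the data-generating function, noise level, and sizes are illustrative.

```python
import numpy as np

def se(x, y, sigma2=1.0, ell=0.08):
    return sigma2 * np.exp(-0.5 * (x[:, None] - y[None, :])**2 / ell**2)

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)                                   # off-grid inputs
y = np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(200)  # illustrative data
u = np.linspace(0, 1, 20)                                    # latent grid (inducing points)

K_uu = se(u, u) + 1e-8 * np.eye(len(u))
K_uf = se(u, x)

# Per-point conditional variances, cf. (21), plus the noise variance.
lam = se(x, x).diagonal() - np.einsum('ij,ij->j', K_uf, np.linalg.solve(K_uu, K_uf)) + 0.05**2

Sigma = K_uu + (K_uf / lam) @ K_uf.T                         # cf. the matrix in (25)
m_u = K_uu @ np.linalg.solve(Sigma, (K_uf / lam) @ y)        # projected grid values, cf. (26)

# Predictive mean at test points, cf. (27).
x_test = np.linspace(0, 1, 101)
mu = se(x_test, u) @ np.linalg.solve(K_uu, m_u)
```

In LGSWD-GP the dense solves above are replaced by the standing-wave eigenbasis of the grid kernel and the perturbative treatment of Sec. VI.2.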

VI.2 Perturbation and Matrix Inversion

Unlike the case of grid data, to which the inverted kernel matrix is readily applicable, here we have to invert the matrix in (25) in order to find the mean of the predictive distribution in (27). To approach this problem, we shall eventually write this matrix in terms of the standing-wave eigenbasis of the grid kernel matrix. Hence, the second term in (25) is treated as a perturbation, so that the eigenvalues in (11) as well as the eigenvectors in (12) are modified. To make the calculation more efficient and accessible, we regroup the eigenvectors with indices $n$ and $N+1-n$ into pairs. Projecting onto the subspace spanned by each pair, we arrive at the following eigenvalue equation, whose associated eigenvalues and eigenvectors can be obtained by solving

(28)

The diagonal entries contain the first-order corrections to the unperturbed eigenvalues due to the perturbation, while the off-diagonal entry is the cross term between the two members of the pair. It can be shown that the updated eigenvalues and eigenvectors are given by,

(29)

and

(30)

with the rotation angle specified by

(31)

Consequently, the matrix in (25) has the following decomposition,

(32)
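The pairwise treatment can be sketched as follows: the perturbation is projected onto each pair of standing waves and the resulting 2×2 block is diagonalized by a rotation, with the first-order result obtained by keeping only the diagonal shifts. The perturbation matrix V below is a random illustrative stand-in for the second term of (25).

```python
import numpy as np

N = 20
n = np.arange(1, N + 1)
theta = n * np.pi / (N + 1)
j = np.arange(1, N + 1)[:, None]
phi = np.sqrt(2 / (N + 1)) * np.sin(j * theta[None, :])   # unperturbed standing waves
lam0 = 1 + 2 * 0.3 * np.cos(theta)                        # unperturbed eigenvalues (eps = 0.3)

rng = np.random.default_rng(2)
B = rng.standard_normal((N, 200))
V = B @ B.T / 200                                         # illustrative positive perturbation

lam_new = lam0.copy()
vecs_new = phi.copy()
for p in range(N // 2):
    q = N - 1 - p                                         # partner index (modes n and N+1-n)
    pa, pb = phi[:, p], phi[:, q]
    Vaa, Vbb, Vab = pa @ V @ pa, pb @ V @ pb, pa @ V @ pb
    # First order: shift each eigenvalue by its diagonal term (lam0 + Vaa, Vbb).
    # Second order: diagonalize the full 2x2 block, which also rotates the pair.
    block = np.array([[lam0[p] + Vaa, Vab], [Vab, lam0[q] + Vbb]])
    w, R = np.linalg.eigh(block)
    lam_new[[p, q]] = w
    vecs_new[:, [p, q]] = np.column_stack([pa, pb]) @ R
```

Keeping only the diagonal shifts corresponds to the first-order treatment used in the simulations of Sec. VII.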

Figure 5: Illustration of latent-grid SWD-GP with first-order [diagonal terms only in (28)] and second-order (including the off-diagonal term) perturbation, using data generated from a test function (red dashed line) plus noise. A latent grid of 20 points is used, and the prediction mean is denoted by the blue solid line. (a) First-order perturbation with 200 data points. (b) Second-order perturbation with the same parameters as (a). (c) Same as (a) but with 1000 data points. (d) Same as (b) but with the number of data points increased to 1000.

Next, we apply the above approach to data generated by a test function with inputs randomly selected from [0,1]. The top and bottom panels in Fig. 5 show the results for 200 and 1000 training points, respectively. Panels (a) and (c) in Fig. 5 are obtained with first-order perturbation, i.e., by setting the off-diagonal cross term in (28) to zero. With a larger number of training points [Fig. 5(c) versus (a)], the most probable projected function values are closer to the true function values on the grid. Panels (b) and (d) correspond to the fuller second-order perturbation treatment, which includes the eigenvector correction due to a nonzero cross term in (28). Incremental improvements, in particular near the end points, can be seen by comparing (a) with (b), or (c) with (d).

VII Simulations

We compare the method presented here with three existing, popular methods: exact GP, fully independent training conditional GP (FITC-GP) [19, 20], and Kernel Interpolation for Scalable Structured GP (KISS-GP) [21]. The main quantities of interest are data efficiency, i.e., the amount of data needed to reach a certain accuracy, and run time. We apply all methods to a toy regression problem, where the x-values are drawn uniformly at random from a fixed interval and the corresponding y-values are given by a fixed test function. Accuracy is measured by the standardized mean-square error (SMSE). Run time is measured on a 2015 MacBook with a 1.1 GHz dual-core Intel Core M and 8 GB of 1600 MHz LPDDR3 memory. All methods are implemented in MATLAB 2017b. We largely followed an example in the documentation of the GPML toolbox (http://www.gaussianprocess.org/gpml/code/matlab/doc/demoGrid1d.m) for using the existing methods. For the implementation of our latent grid standing wave decomposition GP (LGSWD-GP), we used only the first-order correction on the eigenvalues (i.e., setting the off-diagonal term in (28) to zero), since Fig. 5 suggests that the improvement from adding the second-order correction on the eigenvectors is relatively small compared to increasing the number of training points. Our code is available at: TBA.

For all methods, the squared exponential kernel is used. The kernel's overall variance is fixed, and the likelihood variance is set to its true value. Exact GP and KISS-GP are allowed to optimize the kernel's length scale, and FITC-GP is additionally allowed to optimize the locations of the inducing points. These hyper-parameter optimizations are limited to 20 iterations. Note that for our LGSWD-GP, the length scale can be determined without invoking hyper-parameter optimization (see Sec. V). The number of inducing points is fixed throughout for FITC-GP, KISS-GP, and LGSWD-GP. The inducing grids in KISS-GP and LGSWD-GP are evenly spaced over the input interval, and the test points are evenly spaced over the same interval. Example fits are shown in Fig. 6.

Figure 6: Example fits of exact GP, FITC-GP, KISS-GP, and our LGSWD-GP. The black line represents the true underlying function; markers represent the mean of the predictive distribution at the test points. In the LGSWD-GP panel, red circles and blue crosses correspond to the tridiagonal and pentadiagonal reconstructions, respectively. The number of training data points differs across panels and corresponds to the largest number used for each method in Fig. 7.

Figure 7 shows the accuracy and run time as a function of the number of training points for the different methods. From the error plot, we see that our method has a lower data efficiency compared to existing methods; that is, the LGSWD-GP needs more data to achieve the same accuracy. Nevertheless, the run-time plot shows that the poorer data efficiency is offset by a radical increase in computational efficiency. The LGSWD-GP is roughly two orders of magnitude faster than FITC-GP and KISS-GP and is capable of reaching or even surpassing these methods in accuracy in the same amount of time. A trade-off between speed and accuracy is expected. What is interesting is that these seemingly bold tridiagonal and pentadiagonal kernel approximations provide enough data efficiency to make the trade-off favorable.

Figure 7: Accuracy and run time as a function of number of training points for exact GP, FITC-GP, KISS-GP, and LGSWD-GP. LGSWD-3 and LGSWD-5 correspond to the tridiagonal and pentadiagonal approximations, respectively.
Figure 8: Accuracy by region. SMSE is calculated for the test points in the region labeled in each panel's title. In theory, all the training points affect the prediction at every test point; thus, the x-axis is the same as that in Fig. 7. But effectively, only roughly one fifth of the training points (those in the test region) contribute to the fit. The error plots for the remaining regions all look similar to the right panel.

Next, we analyze the accuracy by breaking the test interval into five equal-sized regions. Note that the effective length scale of the underlying function is different in each region (see Fig. 6). Two quantities are useful in characterizing the accuracy behavior of GP in a region: the effective length scale relative to the grid spacing, and the number of training points falling in the region. First, all methods do relatively poorly when the effective length scale is short relative to the available resolution (Fig. 8, left panel) and relatively well when it is long enough (Fig. 8, right panel), which is just a restatement of classical sampling theory. Second, methods that depend on inducing points reach an error plateau as the number of training points grows (Fig. 8, left panel) and thus are limited in data efficiency. Third, when the grid spacing becomes sufficiently small relative to the effective length scale (Fig. 8, right panel), this limitation no longer exists. Based on these observations, exact GP, FITC-GP or KISS-GP, and LGSWD-GP are each preferable in different regimes of these two quantities; in particular, LGSWD-GP is recommended when the latent grid is relatively fine and the number of training points is large. In other words, among the methods shown, our method is the most accurate and time-efficient one in that regime.

VIII Time Complexity Analysis

For $N$ training data points on a grid, (4) and (13) give the predictive mean for SWD-GP:

(33)

Since the second inner product has time complexity $O(N)$, the sum over the $N$ standing-wave modes has a total time complexity of $O(N^2)$. The space complexity of SWD-GP is $O(N)$. For $m$ inducing points on a grid and $n$ training data points, from Sec. VI one can show that the predictive mean for LGSWD-GP is

(34)

where the remaining quantities follow from the standing-wave decomposition of Sec. VI. The time complexity for computing each term is proportional to the number of training points times the bandwidth $b$ of the approximation ($b = 3$ for tridiagonal and $b = 5$ for pentadiagonal); thus, the sum gives a total time complexity of $O(mbn)$. The same time complexity is found for computing the remaining factors. The space complexity of LGSWD-GP is $O(bn)$, with the bottleneck being the representation of the sparse cross-kernel matrix.

Next, for clarity of comparison, we show how our method differs from the other methods in terms of their bottleneck operations. Exact GP requires composing the kernel matrix and inverting it. The bottleneck operation is the inverse, which has time complexity $O(n^3)$ for $n$ training points. If the data lie on a grid, the SWD method can compose the inverted kernel analytically via (13), although in practice, as shown by (33), we never need to explicitly represent the inverted kernel matrix to compute the predictive mean.
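As an illustration of this point, the predictive mean in (33) can be accumulated mode by mode, so the N×N inverse never has to be formed explicitly; each mode contributes two O(N) inner products, giving O(N^2) time and O(N) working memory. Variable names are illustrative.

```python
import numpy as np

def swd_mean(k_star, y, eps, N, sigma2=1.0):
    """Predictive mean via the standing-wave sum, without forming K^{-1}."""
    theta = np.arange(1, N + 1) * np.pi / (N + 1)
    lam = sigma2 * (1 + 2 * eps * np.cos(theta))
    j = np.arange(1, N + 1)
    mean = 0.0
    for t, l in zip(theta, lam):
        phi_n = np.sqrt(2 / (N + 1)) * np.sin(j * t)   # one eigenvector at a time, O(N)
        mean += (phi_n @ k_star) * (phi_n @ y) / l     # two O(N) inner products
    return mean

# Usage: k_star is the (sparsified) cross-kernel vector between the test point
# and the N grid points, and y holds the on-grid training values.
```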

The bottleneck operation of FITC-GP is the matrix multiplication in (25), which has time complexity $O(nm^2)$ for $m$ inducing points and $n$ training points. If the inducing points lie on a grid, the corresponding multiplication in LGSWD-GP with a $b$-diagonal kernel approximation has time complexity $O(nb^2)$, because each row in the cross-kernel matrix has at most $b$ non-zero elements. Note that one can choose not to explicitly compute this particular matrix product, as it is absorbed into the terms in (34). In addition, whenever a matrix inverse operation is required, such as in (21), LGSWD-GP benefits from being able to construct the inverse directly.
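A sketch of the sparsity argument, using a generic banded truncation of the cross-kernel (an illustrative choice, not necessarily the paper's exact prescription): each training point is linked to only b grid points, so forming the cross-kernel touches O(nb) entries and the product touches O(nb^2).

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(3)
M, n_pts, b, ell = 50, 5000, 3, 0.02
u = np.linspace(0, 1, M)                               # inducing grid
x = rng.uniform(0, 1, n_pts)                           # off-grid training inputs

rows, cols, vals = [], [], []
for jcol, xj in enumerate(x):
    nearest = np.argsort(np.abs(u - xj))[:b]           # b nearest grid points
    # (A real implementation would use the grid structure to locate them in O(1).)
    rows.extend(nearest)
    cols.extend([jcol] * b)
    vals.extend(np.exp(-0.5 * (u[nearest] - xj)**2 / ell**2))

K_uf = sparse.csr_matrix((vals, (rows, cols)), shape=(M, n_pts))
lam = np.full(n_pts, 0.1)                              # placeholder per-point variances
A = K_uf @ sparse.diags(1.0 / lam) @ K_uf.T            # M x M product, cheap to form
```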

KISS-GP, which also uses inducing points on a grid, replaces the operations in computing the predictive mean in (27) by solving a linear system through linear conjugate gradients, using a number of iterations $j \ll n$ [21]. The optimization has time complexity $O(j(n + m\log m))$ because each matrix-vector product can be computed in $O(n + m\log m)$ given the sparse approximation made by KISS-GP, and because the optimization can be fixed to $j$ steps. The contributions of $n$ and $j$ in KISS-GP correspond to the contributions of $n$ and $m$ in LGSWD-GP, respectively: the linear-in-$n$ factors stem from the sparse approximations, while $j$ and a good choice of $m$ depend on the number of contributing eigenvalues.

Lastly, we mention a couple of ways to further speed up SWD-GP and LGSWD-GP. First, as a result of the sine functions appearing in the standing-wave basis, we may rewrite the mean and variance in (4) and (5) in terms of the discrete Fourier transform of the training data vector,

(35)

where the transform runs over the grid index. As such, the mean function becomes

(36)

and the predictive variance is

(37)

For SWD-GP, this reduces the time complexity from $O(N^2)$ to $O(N\log N)$, analogous to how Toeplitz structure reduces the time complexity of matrix multiplication. For LGSWD-GP, although this does not reduce the bottleneck time complexity, it can speed up many of the operations involving matrix products with the eigenvectors. Second, the sums in the eigendecompositions (12) and (32) can be computed term by term independently and are thus directly amenable to parallelization.
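The connection to fast transforms can be made concrete: the projections of a data vector onto the standing-wave basis coincide with a type-I discrete sine transform (available in SciPy), which reduces the cost of computing the full set of projections from O(N^2) to O(N log N).

```python
import numpy as np
from scipy.fft import dst

N = 1024
rng = np.random.default_rng(4)
y = rng.standard_normal(N)                              # illustrative data vector

j = np.arange(1, N + 1)[:, None]
theta = np.arange(1, N + 1) * np.pi / (N + 1)
phi = np.sqrt(2 / (N + 1)) * np.sin(j * theta[None, :])

proj_slow = phi.T @ y                                   # O(N^2) matrix product
proj_fast = np.sqrt(2 / (N + 1)) * dst(y, type=1) / 2   # O(N log N) sine transform
print(np.allclose(proj_slow, proj_fast))                # True
```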

IX Related Work

Previous works have also taken advantage of structured data to increase the computational efficiency of GP. For input data that lie on a grid, [22, 23] sidestep the matrix inversion by solving the corresponding linear system with gradient-based methods. Importantly, the optimization is fast because the Toeplitz/circulant structure can be exploited for fast matrix multiplication. These fast multiplications take forms similar to (35). For input data that do not lie on a grid, KISS-GP [21] introduces a latent grid, finds a sparse, approximate representation of the full kernel matrix in terms of the latent-grid kernel matrix, and makes inference using the same kind of optimization problem as in [22, 23]. The sparsification in KISS-GP and that in our LGSWD-GP share the same flavor of ignoring the diminishing long-range correlations among data points. Recently, [24] further extended this optimization approach to efficiently handle high-dimensional data by applying the Nyström approximation for the eigendecomposition and by exploiting properties of Kronecker and Khatri-Rao products for fast matrix multiplication. Overall, our method and these methods all take advantage of the grid structure to form fast matrix multiplications; however, our method differs in that it overcomes the matrix-inversion bottleneck via analytic diagonalization rather than optimization.

Variational approaches are another popular route to GP inference. Variational methods turn the inference of the predictive mean and variance into an optimization problem [25]. Variational GP often provides more accurate predictions than FITC [26] and has been made progressively faster through a series of developments, from stochastic variational inference [27], to distributed variational inference [28], to asynchronous distributed variational inference [29], in order to handle billions of input data points. Similar to our work and [24], [30] uses grid inducing points with stochastic variational inference, which allows added computational efficiencies via the Kronecker and Khatri-Rao products.

X Discussion

In this paper we present an analytic approach to inverting a class of kernel matrices for GP regression with grid inputs through standing wave decomposition (SWD). This class of kernels is roughly Toeplitz (the tridiagonal kernel is exactly Toeplitz; for the pentadiagonal kernel, (18) shows which elements are modified and by how much, and the procedure is readily extendable to more diagonals) and thus approximates all translation-invariant kernels with grid inputs. Here we show results for tridiagonal Toeplitz kernels, which can be thought of as approximating a squared exponential kernel by keeping only the nearest-neighbor correlation among data points, and for a pentadiagonal kernel, which includes the next-nearest-neighbor correlation. We demonstrate that the approach can be extended to approximate product kernels in higher dimensions. Lastly, we apply the approach to a latent grid of inducing points so as to handle training data that do not lie on a grid.

Our SWD approach is unique in that the inversion of the (approximated) kernel matrix is done analytically. This analytic inverse, along with the sparse matrices induced by keeping only the most important correlations among data points, makes the approach computationally very efficient (Sec. VIII). Furthermore, the optimal length-scale hyperparameter in the SWD approach can be determined without the need to invoke optimization (Sec. V). When applied to training data on a grid, the predictions of SWD-GP are very similar to those of exact GP if the two use the same length scale (Fig. 4), although they are obtained vastly faster. For data that do not lie on a grid, our results show that LGSWD-GP can be both faster and more accurate than existing approximate GP methods when the number of training points is large (Figs. 7 and 8). We expect this advantage to be magnified in situations where a highly precise latent representation is not crucial, such as GP classification. The analytic form of the eigendecomposition also allows straightforward parallelization of our approach, which we leave to future work. Other future directions include extending the approach to kernel approximations with longer-range correlations, to other types of kernels such as periodic kernels, and to latent grids in high dimensions.

Acknowledgements.
This material is based on research sponsored by the Air Force Research Laboratory and DARPA under agreement number FA8750-17-2-0146 to P.S. and S.Y. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

References

  • (1) C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
  • (2) Radford M Neal. Monte carlo implementation of gaussian process models for bayesian regression and classification. arXiv preprint physics/9701026, 1997.
  • (3) Christopher KI Williams. Computing with infinite networks. In Advances in neural information processing systems, pages 295–301, 1997.
  • (4) Atsuto Seko, Tomoya Maekawa, Koji Tsuda, and Isao Tanaka. Machine learning with systematic density-functional theory calculations: Application to melting temperatures of single-and binary-component solids. Physical Review B, 89(5):054303, 2014.
  • (5) Albert P Bartók, Mike C Payne, Risi Kondor, and Gábor Csányi. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Physical review letters, 104(13):136403, 2010.
  • (6) Tridib Sadhu, Mathieu Delorme, and Kay Jörg Wiese. Generalized arcsine laws for fractional brownian motion. Physical review letters, 120(4):040603, 2018.
  • (7) A Baker, B Audit, SC-H Yang, J Bechhoefer, A Arneodo, et al. Inferring where and when replication initiates from genome-wide replication timing data. Physical review letters, 108(26):268101, 2012.
  • (8) A Baker and J Bechhoefer. Inferring the spatiotemporal dna replication program from noisy data. Physical Review E, 89(3):032703, 2014.
  • (9) Zoheyr Doctor, Ben Farr, Daniel E Holz, and Michael Pürrer. Statistical gravitational waveform models: What to simulate next? Physical Review D, 96(12):123011, 2017.
  • (10) Pablo Morales-Álvarez, Adrián Pérez-Suay, Rafael Molina, and Gustau Camps-Valls. Remote sensing image classification with large-scale gaussian processes. IEEE Transactions on Geoscience and Remote Sensing, 2017.
  • (11) M. Titsias and N. Lawrence. Bayesian gaussian process latent variable model. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 844–851, 2010.
  • (12) Christopher KI Williams and Matthias Seeger. Using the nyström method to speed up kernel machines. In Advances in neural information processing systems, pages 682–688, 2001.
  • (13) Siddhartha Das, George Siopsis, and Christian Weedbrook. Continuous-variable quantum gaussian process regression and quantum singular value decomposition of nonsparse low-rank matrices. Physical Review A, 97(2):022315, 2018.
  • (14) Zhikuan Zhao, Jack K Fitzsimons, and Joseph F Fitzsimons. Quantum assisted gaussian process regression. arXiv preprint arXiv:1512.03929, 2015.
  • (15) Aram W Harrow, Avinatan Hassidim, and Seth Lloyd. Quantum algorithm for linear systems of equations. Physical review letters, 103(15):150502, 2009.
  • (16) R. A. Marcus. Tight-binding approximation for semi-infinite solids. application of a transform method and of delta function normalization. The Journal of Chemical Physics, 98(7):5604–5611, 1993.
  • (17) Robert M Gray et al. Toeplitz and circulant matrices: A review. Foundations and Trends® in Communications and Information Theory, 2(3):155–239, 2006.
  • (18) Yunus Saatçi. Scalable inference for structured Gaussian process models. PhD thesis, University of Cambridge, 2012.
  • (19) E. Snelson and Z. Ghahramani. Sparse gaussian processes using pseudo-inputs. In Advances in neural information processing systems, pages 1257–1264, 2006.
  • (20) Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
  • (21) Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured gaussian processes (kiss-gp). In International Conference on Machine Learning, pages 1775–1784, 2015.
  • (22) John P Cunningham, Krishna V Shenoy, and Maneesh Sahani. Fast gaussian process methods for point process intensity estimation. In Proceedings of the 25th international conference on Machine learning, pages 192–199. ACM, 2008.
  • (23) Richard E Turner. Statistical models for natural sounds. PhD thesis, UCL (University College London), 2010.
  • (24) Trefor W Evans and Prasanth B Nair. Scalable gaussian processes with grid-structured eigenfunctions (gp-grief). 2017.
  • (25) Michalis Titsias. Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.
  • (26) Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse gaussian process approximations. In Advances in neural information processing systems, pages 1533–1541, 2016.
  • (27) James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
  • (28) Yarin Gal, Mark Van Der Wilk, and Carl Edward Rasmussen. Distributed variational inference in sparse gaussian process regression and latent variable models. In Advances in Neural Information Processing Systems, pages 3257–3265, 2014.
  • (29) Hao Peng, Shandian Zhe, Xiao Zhang, and Yuan Qi. Asynchronous distributed variational gaussian process for regression. In International Conference on Machine Learning, pages 2788–2797, 2017.
  • (30) Thomas Nickson, Tom Gunter, Chris Lloyd, Michael A Osborne, and Stephen Roberts. Blitzkriging: Kronecker-structured stochastic gaussian processes. arXiv preprint arXiv:1510.07965, 2015.