Constant-Time Predictive Distributions for Gaussian Processes


Abstract

One of the most compelling features of Gaussian process (GP) regression is its ability to provide well-calibrated posterior distributions. Recent advances in inducing point methods have drastically sped up marginal likelihood and posterior mean computations, leaving posterior covariance estimation and sampling as the remaining computational bottlenecks. In this paper we address this shortcoming by using the Lanczos decomposition algorithm to rapidly approximate the predictive covariance matrix. Our approach, which we refer to as LOVE (LanczOs Variance Estimates), substantially reduces the time and space complexity over any previous method. In practice, it can compute predictive covariances and draw samples from the predictive distribution orders of magnitude faster than existing methods, all without sacrificing accuracy.


1 Introduction

Gaussian processes (GPs) are fully probabilistic models which can naturally estimate predictive uncertainty through posterior variances. These uncertainties play a pivotal role in many application domains. For example, uncertainty information is crucial when incorrect predictions could have catastrophic consequences, such as in medicine [Schulam & Saria(2017)Schulam and Saria] or large-scale robotics [Deisenroth et al.(2015)Deisenroth, Fox, and Rasmussen]; Bayesian optimization approaches typically incorporate model uncertainty when choosing actions [Snoek et al.(2012)Snoek, Larochelle, and Adams, Deisenroth & Rasmussen(2011)Deisenroth and Rasmussen, Wang & Jegelka(2017)Wang and Jegelka]; and reliable uncertainty estimates are arguably useful for establishing trust in predictive models, especially when predictions would be otherwise difficult to interpret [Doshi-Velez & Kim(2017)Doshi-Velez and Kim, Zhou et al.(2017)Zhou, Arshad, Luo, and Chen].

Although predictive uncertainties are a primary advantage of GP models, they have recently become their primary computational bottleneck. Historically, this has not always been the case. The use of GPs used to be limited to problems with small datasets, since learning and inference computations naïvely scale cubically with the number of data points $n$ (i.e. $\mathcal{O}(n^3)$). However, recent advances in inducing point methods have managed to scale up GPs to much larger datasets [Snelson & Ghahramani(2006)Snelson and Ghahramani, Quiñonero-Candela & Rasmussen(2005)Quiñonero-Candela and Rasmussen, Titsias(2009)]. For example, Kernel Interpolation for Scalable Structured GPs (KISS-GP) scales to millions of data points [Wilson & Nickisch(2015)Wilson and Nickisch, Wilson et al.(2015)Wilson, Dann, and Nickisch]. For a given test point $x^*$, KISS-GP expresses the GP’s predictive mean as $a^\top w_{x^*}$, where $a$ is a pre-computed vector dependent only on training data, and $w_{x^*}$ is a sparse interpolation vector. This particular formulation affords the ability to compute predictive means in constant time, independent of $n$.

However, these computational savings do not extend naturally to predictive uncertainties. With KISS-GP, computing the predictive covariance between two test points requires additional linear solves with the $n \times n$ training kernel matrix, at a cost that grows with both the number of training points $n$ and the number of inducing points $m$ (see Table 1). While this asymptotic complexity is lower than that of standard Gaussian process inference, it quickly becomes prohibitive when $n$ or $m$ is large, or when we wish to make many repeated computations. Additionally, drawing samples from the predictive distribution – a necessary operation in many applications – is similarly expensive. Matching the reduced complexity of predictive mean inference has remained an open problem.

Figure 1: Comparison of predictive variances on airline passenger extrapolation. The variances predicted with LOVE are accurate to within four decimal places, yet can be computed orders of magnitude faster.

In this paper, we provide a solution based on the tridiagonalization algorithm of Lanczos (1950). Our method takes inspiration from KISS-GP’s mean computations: we express the predictive covariance between two test points $x^*$ and $x^{*\prime}$ as $w_{x^*}^\top C\, w_{x^{*\prime}}$, where $C$ is an $m \times m$ matrix dependent only on training data and $w_{x^*}$, $w_{x^{*\prime}}$ are sparse interpolation vectors. However, crucially, we take advantage of the fact that $C$ affords fast matrix-vector multiplications (MVMs) to avoid ever computing $C$ explicitly. Using the Lanczos algorithm, we can efficiently decompose $C$ into two rank-$k$ factors in nearly linear time. After this one-time upfront computation, and due to the sparsity of the interpolation vectors, each (co)variance entry can be computed in constant time. Further, we extend our method to sample from the predictive distribution at $t$ test points in time linear in $t$ – independent of the training dataset size.

We refer to this method as LanczOs Variance Estimates, or LOVE for short. LOVE has the lowest asymptotic complexity of any method for computing predictive (co)variances and drawing samples with GPs. We empirically validate LOVE on seven datasets and find that it consistently provides substantial speedups over existing methods: variances and samples are accurate to within four decimal places, and can be computed up to 18,000 times faster.

2 Background

A Gaussian process is a prior over functions, $f \sim \mathcal{GP}(\mu, k)$, specified by a prior mean function $\mu(\cdot)$ and a prior covariance function $k(\cdot, \cdot)$. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ of observations and a Gaussian noise model, the posterior is again a Gaussian process with mean $\mu_{f \mid \mathcal{D}}$ and covariance $k_{f \mid \mathcal{D}}$:

$\mu_{f|\mathcal{D}}(x^*) = \mu(x^*) + k_{x^*X}\, \hat{K}_{XX}^{-1} (y - \mu_X)$   (1)

$k_{f|\mathcal{D}}(x^*, x^{*\prime}) = k_{x^*x^{*\prime}} - k_{x^*X}\, \hat{K}_{XX}^{-1} k_{Xx^{*\prime}}$   (2)

where $K_{XX'}$ denotes the kernel matrix between the sets of points $X$ and $X'$, $\hat{K}_{XX} = K_{XX} + \sigma^2 I$ (for observed noise variance $\sigma^2$), $y = [y_1, \ldots, y_n]^\top$, and $\mu_X = [\mu(x_1), \ldots, \mu(x_n)]^\top$. Given a set of $t$ test points $X^*$, the equations above give rise to a $t$-dimensional multivariate Gaussian joint distribution over the function values at the test points. This last property allows for sampling functions from a posterior Gaussian process by sampling from this joint predictive distribution. For a full overview, see [Rasmussen & Williams(2006)Rasmussen and Williams].

2.1 Inference with matrix-vector multiplies

Computing predictive means and variances with (1) and (2) requires linear solves with the kernel matrix $\hat{K}_{XX}$ (e.g. $\hat{K}_{XX}^{-1}(y - \mu_X)$). Linear solves for GPs are often computed using the Cholesky decomposition of $\hat{K}_{XX}$, which requires $\mathcal{O}(n^3)$ time to compute. Linear conjugate gradients (linear CG) provides an alternative approach that solves linear systems through only matrix-vector multiplies (MVMs) with $\hat{K}_{XX}$. Linear CG exploits the fact that the solution $\hat{K}_{XX}^{-1} b$ is the unique minimizer of the quadratic function $\frac{1}{2} v^\top \hat{K}_{XX} v - v^\top b$ for positive definite matrices [Golub & Van Loan(2012)Golub and Van Loan]. This function is minimized with a simple three-term recurrence, where each iteration involves a single MVM with the matrix $\hat{K}_{XX}$.

After $n$ iterations CG is guaranteed to converge to the exact solution, although in practice numerical convergence depends on the conditioning of $\hat{K}_{XX}$ rather than on $n$. Extremely accurate solutions typically require only $k \ll n$ iterations (depending on the conditioning of $\hat{K}_{XX}$), and a modest constant number of iterations suffices in most cases [Golub & Van Loan(2012)Golub and Van Loan]. For the kernel matrix $\hat{K}_{XX}$, the standard running time of linear CG is $\mathcal{O}(kn^2)$ (the time for $k$ MVMs). This runtime, which is already faster than the Cholesky decomposition, can be greatly improved if the kernel matrix affords fast MVMs. Fast MVMs are possible if the data are structured [Cunningham et al.(2008)Cunningham, Shenoy, and Sahani, Saatçi(2012)], or by using a structured inducing point method [Wilson & Nickisch(2015)Wilson and Nickisch].
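To make the MVM-only viewpoint concrete, the sketch below implements plain linear conjugate gradients against a black-box matrix-vector multiply. It is a minimal illustration under our own naming conventions, not the preconditioned implementation a GP library would use.

```python
import numpy as np

def conjugate_gradients(mvm, b, max_iters=100, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, given only a
    function mvm(v) that returns the matrix-vector product A v."""
    x = np.zeros_like(b)
    r = b - mvm(x)                 # residual
    p = r.copy()                   # search direction
    rs_old = r @ r
    for _ in range(max_iters):
        Ap = mvm(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy usage: solve with an SPD matrix accessed only through its MVM.
rng = np.random.default_rng(0)
M = rng.standard_normal((500, 500))
A = M @ M.T + 500 * np.eye(500)    # well-conditioned SPD matrix
b = rng.standard_normal(500)
x = conjugate_gradients(lambda v: A @ v, b)
print(np.linalg.norm(A @ x - b))   # small residual
```

In a GP setting, the `mvm` callable would be the structured kernel MVM described in 2.3 rather than a dense matrix product.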

Table 1: Asymptotic complexities of computing predictive (co)variances and of sampling from the predictive distribution ($n$ training points, $m$ inducing points, $k$ Lanczos/CG iterations, $s$ samples, $t$ test points). Rows: Standard GP, SGPR, KISS-GP, and KISS-GP (w/ LOVE); columns: pre-computation time and storage, time to compute variances, and time to draw samples. KISS-GP with LOVE attains the lowest complexity in every column.

2.2 The Lanczos algorithm

The Lanczos algorithm factorizes a symmetric matrix $A$ as $A = Q T Q^\top$, where $T$ is symmetric tridiagonal and $Q$ is orthonormal. For a full discussion of the Lanczos algorithm see Golub & Van Loan (2012). Briefly, the Lanczos algorithm uses a probe vector $b$ and computes an orthogonal basis of the Krylov subspace

$\mathcal{K}_k(A, b) = \mathrm{span}\{ b, Ab, A^2 b, \ldots, A^{k-1} b \}.$

Applying Gram–Schmidt orthogonalization to these vectors produces the orthonormal columns $q_1, \ldots, q_k$ of $Q$, with $q_1 = b / \lVert b \rVert_2$ (here $\lVert \cdot \rVert_2$ is the Euclidean norm). The orthogonalization coefficients are collected into $T$. Because $A$ is symmetric, each vector needs only be orthogonalized against the two preceding vectors, which results in the tridiagonal structure of $T$ [Golub & Van Loan(2012)Golub and Van Loan]. The orthogonalized vectors and coefficients are computed in an iterative manner: $k$ iterations of the algorithm produce the first $k$ orthogonal vectors of $Q$ (i.e. $Q_k = [q_1, \ldots, q_k]$) and their corresponding coefficients $T_k$. Crucially, $k$ iterations require only $k$ matrix-vector multiplies with the original matrix $A$, which is ideal for matrices that afford fast MVMs.
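As an illustration, the following sketch runs $k$ iterations of the Lanczos process using only an MVM routine, with full reorthogonalization for numerical stability (see Section 5). The function and variable names are ours, not the paper's.

```python
import numpy as np

def lanczos(mvm, b, k):
    """Return Q (n x k) with orthonormal columns and tridiagonal T (k x k)
    such that Q^T A Q = T, where A is accessed only through mvm(v) = A v."""
    n = b.shape[0]
    Q = np.zeros((n, k))
    alphas, betas = np.zeros(k), np.zeros(max(k - 1, 0))
    q = b / np.linalg.norm(b)
    for j in range(k):
        Q[:, j] = q
        v = mvm(q)
        alphas[j] = q @ v
        # Subtract components along all previous basis vectors; in exact
        # arithmetic this reduces to the three-term recurrence.
        v -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ v)
        if j < k - 1:
            beta = np.linalg.norm(v)   # assumes no breakdown (beta > 0)
            betas[j] = beta
            q = v / beta
    T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
    return Q, T
```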

The Lanczos algorithm can be used in the context of GPs for computing log determinants [Dong et al.(2017)Dong, Eriksson, Nickisch, Bindel, and Wilson], and can be used to speed up inference when there is product structure [Gardner et al.(2018)Gardner, Pleiss, Wu, Weinberger, and Wilson]. Another application of the Lanczos algorithm is performing matrix solves [Lanczos(1950), Parlett(1980), Saad(1987)]. Given a symmetric matrix $A$ and a single vector $b$, the matrix solve $A^{-1}b$ is computed by running the Lanczos algorithm on $A$ with probe vector $b$. After $k$ iterations, the solution can be approximated using the computed Lanczos factors $Q_k$ and $T_k$ as

$A^{-1} b \approx \lVert b \rVert_2 \, Q_k T_k^{-1} e_1$   (3)

where $e_1$ is the first unit vector $[1, 0, \ldots, 0]^\top$. In fact, the linear CG algorithm can be derived from (3) when $A$ is positive definite [Golub & Van Loan(2012)Golub and Van Loan]. Linear CG tends to be preferred for matrix solves in practice, since Lanczos solves require storing the $n \times k$ matrix $Q_k$. However, one advantage of Lanczos is that the $Q_k$ and $T_k$ matrices can be used to jump-start subsequent matrix solves $A^{-1}b'$ against new right-hand sides $b'$. Parlett (1980), Saad (1987), Schneider & Willsky (2001), and Nickisch et al. (2009) argue that subsequent solves can be approximated as

$A^{-1} b' \approx Q_k T_k^{-1} Q_k^\top b'$   (4)
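As a concrete illustration of (4), the hypothetical helper below re-uses cached factors from the Lanczos sketch above to approximate solves against new right-hand sides; the approximation is only accurate when the new right-hand sides lie close to the Krylov subspace spanned by the columns of $Q_k$.

```python
import numpy as np

def lanczos_solve(Q, T, rhs):
    """Approximate A^{-1} rhs ≈ Q T^{-1} Q^T rhs using cached Lanczos
    factors. rhs may be a vector or a matrix of right-hand-side columns."""
    return Q @ np.linalg.solve(T, Q.T @ rhs)
```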

2.3 Kernel Interpolation for Scalable Structured GPs

Structured kernel interpolation (SKI) [Wilson & Nickisch(2015)Wilson and Nickisch] is an inducing point method explicitly designed for fast MVM-based inference. Given a set of $m$ inducing points $U = [u_1, \ldots, u_m]$, SKI assumes that a data point $x_i$ can be well-approximated as a local interpolation of the inducing points. For example, using cubic interpolation [Keys(1981)], $x_i$ is expressed in terms of its 4 closest inducing points, and the interpolation weights are captured in a sparse vector $w_{x_i}$. The interpolation vectors are used to approximate the kernel matrix $K_{XX}$:

$K_{XX} \approx W K_{UU} W^\top$   (5)

Here, $W = [w_{x_1}, \ldots, w_{x_n}]^\top$ contains the interpolation vectors for all $x_i$, and $K_{UU}$ is the covariance matrix between inducing points. MVMs with the approximate kernel (i.e. $W K_{UU} W^\top v$) require at most $\mathcal{O}(n + m^2)$ time due to the sparsity of $W$. Wilson & Nickisch (2015) reduce this runtime even further with Kernel Interpolation for Scalable Structured GPs (KISS-GP), an instantiation of their SKI approach in which all inducing points lie on a regularly spaced grid. This gives $K_{UU}$ Toeplitz structure (or Kronecker and Toeplitz structure), resulting in the ability to perform MVMs in at most $\mathcal{O}(n + m \log m)$ time.
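A minimal sketch of an SKI-style MVM under the stated assumptions: $W$ is stored as a sparse matrix with four nonzeros per row, and $K_{UU}$ is accessed only through an MVM callable (a dense stand-in here for brevity; a Toeplitz or Kronecker MVM would be substituted in practice). Names are ours.

```python
import numpy as np
import scipy.sparse as sp

def ski_mvm(W, kuu_mvm, v):
    """Approximate K_XX v as W K_UU W^T v, using a sparse interpolation
    matrix W (n x m) and an MVM routine for K_UU (m x m)."""
    return W @ kuu_mvm(W.T @ v)

# Toy example with random interpolation weights and an identity K_UU.
n, m = 1000, 50
rng = np.random.default_rng(0)
rows = np.repeat(np.arange(n), 4)
cols = rng.integers(0, m, size=4 * n)          # 4 inducing points per row
vals = rng.random(4 * n)
W = sp.csr_matrix((vals, (rows, cols)), shape=(n, m))
K_UU = np.eye(m)                                # placeholder covariance
out = ski_mvm(W, lambda u: K_UU @ u, rng.standard_normal(n))
```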

Computing predictive means.

One advantage of KISS-GP’s fast MVMs is the ability to perform constant time predictive mean calculations [Wilson & Nickisch(2015)Wilson and Nickisch, Wilson et al.(2015)Wilson, Dann, and Nickisch]. Substituting the KISS-GP approximate kernel into (1) (and assuming a prior mean of 0 for brevity), the predictive mean is given by

$\mu_{f|\mathcal{D}}(x^*) \approx w_{x^*}^\top \big[ K_{UU} W^\top (W K_{UU} W^\top + \sigma^2 I)^{-1} y \big]$   (6)

Because $w_{x^*}$ is the only term in (6) that depends on the test point $x^*$, the bracketed remainder of the equation (denoted $a$) is pre-computed, giving $\mu_{f|\mathcal{D}}(x^*) \approx w_{x^*}^\top a$. (Throughout this paper, bracketed terms highlight computations that do not depend on test data.) This pre-computation takes $\mathcal{O}(k(n + m \log m))$ time using linear CG [Wilson et al.(2015)Wilson, Dann, and Nickisch]. After computing and caching $a$, the multiplication $w_{x^*}^\top a$ requires $\mathcal{O}(1)$ time, as $w_{x^*}$ has only four nonzero elements.
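A sketch of the constant-time mean query under these assumptions, where the length-$m$ cache $a$ has already been formed (e.g. with the CG routine above) and the interpolation vector has only four nonzeros; function and variable names are hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def kissgp_predictive_mean(w_xstar, a):
    """Constant-time KISS-GP predictive mean: mu(x*) ≈ w_{x*}^T a,
    where a is a precomputed length-m vector and w_xstar is 4-sparse."""
    return (w_xstar @ a)[0]   # O(1): only four nonzero interpolation weights

# `a` would be cached once after training; here it is a placeholder.
m = 50
a = np.zeros(m)
w_xstar = sp.csr_matrix(([0.1, 0.4, 0.4, 0.1], ([0] * 4, [10, 11, 12, 13])),
                        shape=(1, m))
mu = kissgp_predictive_mean(w_xstar, a)
```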

3 Lanczos Variance Estimates (Love)

In this section we introduce LOVE, an approach to efficiently approximate the predictive covariance

$k_{f|\mathcal{D}}(X^*, X^*) = K_{X^*X^*} - K_{X^*X}\, \hat{K}_{XX}^{-1} K_{XX^*},$

where $X^* = [x^*_1, \ldots, x^*_t]$ denotes a set of $t$ test points, and $X = [x_1, \ldots, x_n]$ a set of $n$ training points.

Solving this linear system naïvely scales as $\mathcal{O}(n^3)$, but the factorization of $\hat{K}_{XX}$ is typically done as a one-time pre-computation during training since $\hat{K}_{XX}$ does not depend on the test data. At test time, each covariance calculation then still requires work that scales with the number of training points $n$. In what follows, we will gradually build a significantly faster method for computing predictive covariances, mirroring the pre-compute phase and test phase structure of the naïve approach.

3.1 A first approach with KISS-GP

It is possible to obtain some computational savings when using inducing point methods. For example, we can replace $k_{Xx^*}$ and $k_{x^*x^{*\prime}}$ with their corresponding KISS-GP approximations, which we restate here:

$k_{Xx^*} \approx W K_{UU} w_{x^*}, \qquad k_{x^*x^{*\prime}} \approx w_{x^*}^\top K_{UU} w_{x^{*\prime}}.$

$W$ is the sparse interpolation matrix for the training points $X$; $w_{x^*}$ and $w_{x^{*\prime}}$ are the sparse interpolation vectors for $x^*$ and $x^{*\prime}$ respectively. Substituting these into (2) results in the following approximation to the predictive covariance:

$k_{f|\mathcal{D}}(x^*, x^{*\prime}) \approx w_{x^*}^\top K_{UU} w_{x^{*\prime}} - w_{x^*}^\top K_{UU} W^\top \hat{K}_{XX}^{-1} W K_{UU} w_{x^{*\prime}}$   (7)

By fully expanding the second term in (7), we obtain

$k_{f|\mathcal{D}}(x^*, x^{*\prime}) \approx w_{x^*}^\top K_{UU} w_{x^{*\prime}} - w_{x^*}^\top \big[ K_{UU} W^\top (W K_{UU} W^\top + \sigma^2 I)^{-1} W K_{UU} \big] w_{x^{*\prime}}$   (8)

Precomputation phase.

Denote by $C$ the bracketed portion of (8), i.e. $C = K_{UU} W^\top (W K_{UU} W^\top + \sigma^2 I)^{-1} W K_{UU}$. Critically, $C$ depends on neither the test point $x^*$ nor $x^{*\prime}$, and we can pre-compute it offline during training. The predictive covariance calculation becomes:

$k_{f|\mathcal{D}}(x^*, x^{*\prime}) \approx w_{x^*}^\top K_{UU} w_{x^{*\prime}} - w_{x^*}^\top C\, w_{x^{*\prime}}$   (9)

Computing $C$ requires $m$ linear solves with $\hat{K}_{XX}$: one for each column vector in $W K_{UU}$. Computing the right-hand sides $W K_{UU}$ takes $\mathcal{O}(nm)$ time total because $W$ is an $n \times m$ matrix with four non-zero elements per row. Because $\hat{K}_{XX}$ is exactly the training covariance matrix used in KISS-GP, each of these solves takes $\mathcal{O}(k(n + m \log m))$ time with $k$ iterations of linear CG [Wilson & Nickisch(2015)Wilson and Nickisch], and we can solve all $m$ systems in $\mathcal{O}(km(n + m \log m))$ time.

Test phase.

As $w_{x^*}$ and $w_{x^{*\prime}}$ each contain only four nonzero elements, the quadratic form $w_{x^*}^\top C\, w_{x^{*\prime}}$ touches only a $4 \times 4$ sub-block of $C$ and therefore requires $\mathcal{O}(1)$ time during testing.

Limitations.

Although this technique offers computational savings over non-inducing point methods, the quadratic dependence on $m$ in the pre-computation phase – both the time to compute $C$ and its $\mathcal{O}(m^2)$ storage – is a limiting computational bottleneck. In contrast, all other operations with KISS-GP require at most linear storage and near-linear time. Indeed, one of the hallmarks of KISS-GP is the ability to use a very large number of inducing points so that kernel computations are nearly exact.

In practice, it is often preferable to avoid this pre-computation phase entirely due to its quadratic dependence on $m$. Instead, (7) is computed directly, using linear CG for the matrix solve. This costs $\mathcal{O}(k(n + m \log m))$ per test covariance – the time to perform $k$ iterations of linear CG with $\hat{K}_{XX}$. The dependence on $n$ and $m$ is problematic in the common case of large training sets or many inducing points.

3.2 Fast predictive (co-)variances with LOVE

We propose to overcome this limitation through an altered pre-computation step. In particular, we approximate $C$ in (9) with a low-rank factorization. Letting $R$ and $R'$ be $k \times m$ matrices such that $R^\top R' \approx C$, we rewrite (9) as:

$k_{f|\mathcal{D}}(x^*, x^{*\prime}) \approx w_{x^*}^\top K_{UU} w_{x^{*\prime}} - (R\, w_{x^*})^\top (R'\, w_{x^{*\prime}})$   (10)

which reduces the test-time complexity to $\mathcal{O}(k)$, due to the sparsity of $w_{x^*}$ and $w_{x^{*\prime}}$. Recalling the Lanczos solve approximation from 2.2, we substitute $\hat{K}_{XX}^{-1} \approx Q_k T_k^{-1} Q_k^\top$ into the definition of $C$ and derive forms for $R$ and $R'$:

$C \approx K_{UU} W^\top Q_k T_k^{-1} Q_k^\top W K_{UU} = (Q_k^\top W K_{UU})^\top \big( T_k^{-1} Q_k^\top W K_{UU} \big) = R^\top R'.$

To compute $Q_k$ and $T_k$ in an efficient manner, we run $k$ iterations of the Lanczos algorithm to achieve $\hat{K}_{XX} \approx Q_k T_k Q_k^\top$, using the average column of $W K_{UU}$ as the probe vector. Both $R$ and $R'$ are $k \times m$ matrices, and thus require $\mathcal{O}(km)$ storage.

Computing .

To compute $R = Q_k^\top W K_{UU}$, we first form the product $Q_k^\top W$. Since $W$ is sparse, this takes $\mathcal{O}(nk)$ time and results in a $k \times m$ matrix. Next we multiply this result by $K_{UU}$. Since $K_{UU}$ is Toeplitz, each MVM takes $\mathcal{O}(m \log m)$ time, and there are $k$ of them. The total running time to compute $R$ is therefore $\mathcal{O}(nk + km \log m)$.

Computing .

To compute $R'$, we note that $R' = T_k^{-1} R$. We can perform this solve by Cholesky-decomposing $T_k$, which is positive definite since $\hat{K}_{XX}$ is. The decomposition takes $\mathcal{O}(k)$ time, and each of the $m$ column solves similarly takes $\mathcal{O}(k)$ time because $T_k$ is tridiagonal [Loan(1999)], for $\mathcal{O}(km)$ in total.

After this one-time pre-computation, every predictive (co)variance can be computed in $\mathcal{O}(k)$ time using (10). The number of Lanczos iterations $k$ depends on the conditioning of the matrix and not on its size [Golub & Van Loan(2012)Golub and Van Loan]. In practice, a small constant number of iterations is sufficient for most matrices [Golub & Van Loan(2012)Golub and Van Loan], and therefore $k$ can be considered constant. Running times are summarized in Table 1.
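Putting the pieces together, the sketch below mirrors the pre-compute/test split described above, re-using the hypothetical `lanczos` helper from 2.2: it forms $R$ and $R'$ once and then answers each (co)variance query with two small matrix-vector products and a $k$-dimensional dot product. It is a simplified illustration of the idea (dense $K_{UU}$ stand-in, no Toeplitz exploitation, general rather than tridiagonal solve), not the GPyTorch implementation.

```python
import numpy as np
import scipy.linalg as la

def love_precompute(W, K_UU, sigma2, k):
    """One-time LOVE cache.

    W: (n x m) sparse interpolation matrix; K_UU: (m x m) inducing covariance.
    Returns R, Rprime (both k x m) with R^T Rprime ≈ K_UU W^T K̂_XX^{-1} W K_UU."""
    WK = W @ K_UU                                      # right-hand sides, (n x m)
    mvm_khat = lambda v: W @ (K_UU @ (W.T @ v)) + sigma2 * v   # MVM with K̂_XX
    b = np.asarray(WK.mean(axis=1)).ravel()            # average column as probe vector
    Q, T = lanczos(mvm_khat, b, k)                     # K̂_XX ≈ Q T Q^T
    R = Q.T @ WK                                       # (k x m)
    Rprime = la.solve(T, R)                            # T^{-1} R (tridiagonal solve in the method)
    return R, Rprime

def love_covariance(w1, w2, K_UU, R, Rprime):
    """Predictive covariance between two test points with interpolation
    vectors w1, w2 (length m; four nonzeros each make this O(k) in practice)."""
    prior = w1 @ (K_UU @ w2)                           # ≈ prior covariance term
    return prior - (R @ w1) @ (Rprime @ w2)
```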

3.3 Predictive distribution sampling with LOVE

A distinctive property of LOVE is that it allows us not only to efficiently compute predictive variances, but also predictive covariances and operations involving the full predictive covariance matrix. Let $X^* = [x^*_1, \ldots, x^*_t]$ be a test set of $t$ points. In many applications, (2) will be used only to compute the predictive variance term for each test point, i.e. $k_{f|\mathcal{D}}(x^*_i, x^*_i)$. However, if we want to draw samples from $p(f(X^*) \mid \mathcal{D})$ – the predictive function evaluated on $X^*$ – then we must evaluate the cross-covariance terms as well. Sampling functions from the posterior is an operation that arises commonly when using Gaussian processes. In Bayesian optimization, it is a necessary step in several popular acquisition functions such as predictive entropy search [Hernández-Lobato et al.(2014)Hernández-Lobato, Hoffman, and Ghahramani], max-value entropy search [Wang & Jegelka(2017)Wang and Jegelka], and the knowledge gradient [Frazier et al.(2009)Frazier, Powell, and Dayanik]. There are many other applications where GP sampling could be useful; however, parametric approximations are often used instead due to the large asymptotic cost of sampling [Deisenroth & Rasmussen(2011)Deisenroth and Rasmussen].

The predictive distribution at $X^*$ is multivariate normal with mean $\mu_{f|\mathcal{D}}(X^*)$ and covariance $k_{f|\mathcal{D}}(X^*, X^*)$. We can draw a sample $f^*$ from this distribution by first sampling Gaussian noise $u \sim \mathcal{N}(0, I)$ and reparameterizing:

$f^* = \mu_{f|\mathcal{D}}(X^*) + S u$   (11)

where $S$ is a matrix such that $S S^\top = k_{f|\mathcal{D}}(X^*, X^*)$. Typically $S$ is taken to be the factor from the Cholesky decomposition of $k_{f|\mathcal{D}}(X^*, X^*)$. Computing the Cholesky decomposition incurs an additional $\mathcal{O}(t^3)$ cost on top of the $\mathcal{O}(t^2)$ predictive covariance evaluations. This may be prohibitively expensive, even with constant-time predictive covariances.

A Fast Low-Rank Sampling Matrix.

We use LOVE and KISS-GP to rewrite (10) as

$k_{f|\mathcal{D}}(X^*, X^*) \approx W^* \big[ K_{UU} - K_{UU} W^\top \hat{K}_{XX}^{-1} W K_{UU} \big] W^{*\top}$   (12)

where $W^* = [w_{x^*_1}, \ldots, w_{x^*_t}]^\top$ is the interpolation matrix for the test points. Once again we have reduced the full covariance matrix to a test-independent term – the bracketed $m \times m$ matrix – that can be pre-computed. Directly computing this matrix would result in $\mathcal{O}(m^2)$ time and space complexity. Therefore, we apply the Lanczos algorithm once again in the pre-computation phase to obtain a rank-$k$ approximation:

$K_{UU} - K_{UU} W^\top \hat{K}_{XX}^{-1} W K_{UU} \approx Q' T' Q'^\top$   (13)

This Lanczos decomposition requires $k$ matrix-vector multiplies with the bracketed matrix in (12), each of which is fast given the structure of $K_{UU}$ and the cached quantities from 3.2. Substituting (13) into (12), we get:

$k_{f|\mathcal{D}}(X^*, X^*) \approx W^* Q' T' Q'^\top W^{*\top}$   (14)

If we take the Cholesky decomposition $T' = L_{T'} L_{T'}^\top$ (an $\mathcal{O}(k)$ operation since $T'$ is tridiagonal), we can rewrite (14) as

$k_{f|\mathcal{D}}(X^*, X^*) \approx W^* (Q' L_{T'}) (Q' L_{T'})^\top W^{*\top}$   (15)

Setting $S = Q' L_{T'}$, we see that $(W^* S)(W^* S)^\top \approx k_{f|\mathcal{D}}(X^*, X^*)$. $S$ can be precomputed and cached since it does not depend on test data. In total, this pre-computation takes time near-linear in $m$ in addition to what is required for fast variances. To sample from the predictive distribution, we evaluate (11), which involves multiplying $W^* S u$. Multiplying $u$ by $S$ requires $\mathcal{O}(km)$ time, and finally multiplying by the sparse $W^*$ takes $\mathcal{O}(t)$ time. Therefore, drawing $s$ samples (corresponding to $s$ different values of $u$) takes $\mathcal{O}(s(t + km))$ time total during the testing phase (see Table 1). Note that alternative approaches have a cubic dependence on $t$, whereas this approach has a linear dependence on $t$.
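Under the same assumptions, a sketch of the sampling pre-computation and the per-sample cost. It re-uses the hypothetical `lanczos` helper from 2.2 and applies MVMs with the bracketed matrix through the cached factors $R$, $R'$ from the variance sketch; the probe-vector choice and the small jitter are our own illustrative choices rather than details taken from the paper.

```python
import numpy as np

def love_sampling_cache(K_UU, R, Rprime, k):
    """Cache S (m x k) such that (W* S)(W* S)^T approximates the predictive
    covariance over any set of test points."""
    mvm_C = lambda v: K_UU @ v - R.T @ (Rprime @ v)   # MVM with K_UU - R^T R'
    probe = K_UU.mean(axis=1)                          # illustrative probe vector
    Qp, Tp = lanczos(mvm_C, probe, k)                  # bracketed matrix ≈ Q' T' Q'^T
    Tp = Tp + 1e-8 * np.eye(k)                         # jitter for numerical stability
    L = np.linalg.cholesky(Tp)
    return Qp @ L                                      # S = Q' L_{T'}

def draw_samples(mean, W_star, S, num_samples, rng=np.random.default_rng()):
    """Draw samples f* = mean + W* S u with u ~ N(0, I_k); W_star is the
    sparse (t x m) test interpolation matrix."""
    U = rng.standard_normal((S.shape[1], num_samples))  # (k x s)
    return mean[:, None] + W_star @ (S @ U)              # (t x s)
```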

Algorithm 1: LOVE for fast predictive variances.

Input: $w_{x^*}$, $w_{x^{*\prime}}$ – interpolation vectors for $x^*$ and $x^{*\prime}$; $K_{UU}$ – prior covariance between inducing points; $b$ – average column of $W K_{UU}$; mvm_Kxx – a function that performs MVMs with $\hat{K}_{XX}$; mvm_Kux – a function that performs MVMs with $K_{UU} W^\top$.
Output: Approximate predictive (co)variance $k_{f|\mathcal{D}}(x^*, x^{*\prime})$.

  if $R$, $R'$ have not previously been computed:
    $Q_k, T_k \leftarrow$ lanczos(mvm_Kxx, $b$)    // $k$ iterations of Lanczos with the $\hat{K}_{XX}$ matrix and probe vector $b$
    $L_{T_k} \leftarrow$ cholesky_factor($T_k$)
    $R \leftarrow$ (mvm_Kux($Q_k$))$^\top$    // $R = Q_k^\top W K_{UU}$
    $R' \leftarrow$ cholesky_solve($R$, $L_{T_k}$)    // $R' = T_k^{-1} R$
  $u \leftarrow$ sparse_mm($R$, $w_{x^*}$)
  $v \leftarrow$ sparse_mm($R'$, $w_{x^{*\prime}}$)
  return $w_{x^*}^\top K_{UU} w_{x^{*\prime}} - u^\top v$

3.4 Extension to additive kernel compositions

LOVE is applicable even when the KISS-GP approximation is used with an additive composition of kernels,

$k(x, x') = k^{(1)}(x, x') + k^{(2)}(x, x') + \cdots + k^{(D)}(x, x').$

Additive structure has recently been a focus in several Bayesian optimization settings, as it causes the cumulative regret to depend only linearly on the number of dimensions [Kandasamy et al.(2015)Kandasamy, Schneider, and Póczos, Wang et al.(2017)Wang, Li, Jegelka, and Kohli, Gardner et al.(2017)Gardner, Guo, Weinberger, Garnett, and Grosse, Wang & Jegelka(2017)Wang and Jegelka]. Additionally, deep kernel learning models [Wilson et al.(2016a)Wilson, Hu, Salakhutdinov, and Xing, Wilson et al.(2016b)Wilson, Hu, Salakhutdinov, and Xing] typically use kernels that are sums of one-dimensional kernel functions applied to each deep network feature. To apply LOVE, we note that this additive composition can be re-written as

$k(x, x') \approx \begin{bmatrix} w^{(1)}_{x} \\ \vdots \\ w^{(D)}_{x} \end{bmatrix}^\top \begin{bmatrix} K^{(1)}_{UU} & & \\ & \ddots & \\ & & K^{(D)}_{UU} \end{bmatrix} \begin{bmatrix} w^{(1)}_{x'} \\ \vdots \\ w^{(D)}_{x'} \end{bmatrix}$   (16)

The block matrices in (16) are analogs of their one-dimensional counterparts in (5). Therefore, we can directly apply 3.3, replacing $w_{x^*}$, $w_{x^{*\prime}}$, $W$, and $K_{UU}$ with their block forms (see the sketch below). The block interpolation vectors are $4D$-sparse, and therefore interpolation takes $\mathcal{O}(D)$ time. MVMs with the block-diagonal matrix take $\mathcal{O}(Dm \log m)$ time by exploiting the block-diagonal structure. With $D$ additive components, predictive variance computations cost only a factor $D$ more than their one-dimensional counterparts.
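For the additive case, the block-diagonal MVM can be implemented by looping over the components. A minimal sketch under the assumption that each component has its own interpolation matrix $W_d$ and inducing covariance $K^{(d)}_{UU}$ (names are ours):

```python
import numpy as np

def additive_ski_mvm(Ws, kuu_mvms, sigma2, v):
    """MVM with K̂_XX = sum_d W_d K_UU^(d) W_d^T + sigma^2 I for an additive kernel.

    Ws: list of (n x m) sparse interpolation matrices, one per component.
    kuu_mvms: list of MVM routines for the corresponding K_UU^(d) matrices."""
    out = sigma2 * v
    for W_d, kuu_mvm in zip(Ws, kuu_mvms):
        out = out + W_d @ kuu_mvm(W_d.T @ v)   # each term costs O(n + m log m)
    return out
```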

4 Results

Table 2: Speedup and accuracy of KISS-GP with LOVE for computing predictive variances on UCI regression datasets (Airfoil, Skillcraft, Parkinsons, PoleTele, Elevators, Kin40k, Protein). For each dataset, the table reports the speedup over KISS-GP without LOVE and over SGPR – both when computing variances from scratch and after pre-computation – as well as the variance SMAE of LOVE compared to KISS-GP without LOVE. For KISS-GP, all results reported use deep kernel learning. All models utilize GPU acceleration. ($n$ is the number of data points, $d$ is the data dimensionality.)

In this section we demonstrate the effectiveness of LOVE, both at computing predictive variances and at sampling from predictive distributions. When using LOVE, we use a fixed number of Lanczos iterations $k$, and all KISS-GP models use the same number of inducing points unless otherwise stated. We implement LOVE and all KISS-GP models with the GPyTorch library.1 All experiments utilize GPU acceleration. Timing results are performed on an NVIDIA GTX 1080.

4.1 Predictive Variances

Table 3: Accuracy and computation time of drawing samples from the predictive distribution on PolTele, Elevators, BayesOpt (Eggholder), and BayesOpt (Styblinski-Tang). For each dataset, the table reports the sample covariance error of a standard GP with Cholesky, random Fourier features, and KISS-GP with LOVE, along with the speedup over the standard GP with Cholesky achieved by Fourier features and by KISS-GP with LOVE (from scratch and after pre-computation).

In these experiments, we aim to measure the accuracy and speed of LOVE at computing predictive variances with KISS-GP models. To measure accuracy, we train a single KISS-GP model and compute its variances both with LOVE and without LOVE (i.e. with standard variance computations). We then compare the LOVE variances to the standard (non-LOVE) variances, which are exact and can therefore be taken as ground truth. We report the scaled mean absolute error (SMAE)2 [Rasmussen & Williams(2006)Rasmussen and Williams] of the LOVE variances measured against the standard variances. (Note that the predictive posterior means are identical with both methods, as LOVE only affects variance calculations.) In order to control for variations during training, we use the same GP model for both variance computations and only vary the prediction procedure.

One-dimensional example.

We first demonstrate LOVE on a complex one-dimensional regression task. The airline passenger dataset (Airline) measures the average monthly number of passengers from 1949 to 1961 [Hyndman(2005)]. We aim to extrapolate the numbers for the final 4 years (48 measurements) given data for the first 8 years (96 measurements). Accurate extrapolation on this dataset requires a kernel function capable of expressing various patterns, such as the spectral mixture (SM) kernel [Wilson & Adams(2013)Wilson and Adams]. Our goal is to evaluate if LOVE produces reliable predictive variances, even with complex kernel functions.

We train a KISS-GP model with a spectral mixture kernel and compute the model’s predicted variances with and without LOVE. In Figure 1, we see that the confidence intervals from both computations agree almost everywhere. The SMAE of LOVE’s predicted variances (compared with the non-LOVE variances) is on the order of $10^{-4}$. Although not shown in the plot, we confirm the reliability of these predictions by computing the log-likelihood of the test data. We compare the KISS-GP model (with and without LOVE) to an exact GP and a sparse variational GP (SGPR) model3 [Titsias(2009), Hensman et al.(2013)Hensman, Fusi, and Lawrence]. All methods achieve nearly identical test log-likelihoods.

Speedup on large datasets.

We measure the speed of computing predictive variances with LOVE for the entire test set on several large-scale regression benchmarks from the UCI repository [Asuncion & Newman(2007)Asuncion and Newman]. For each dataset, we train a KISS-GP model with deep kernel learning (DKL/KISS-GP) using the architectures described in [Wilson et al.(2016b)Wilson, Hu, Salakhutdinov, and Xing]. We compare the speed of computing the DKL/KISS-GP model’s predictive variances with and without LOVE. In addition, we compare against the speed of SGPR, one of the leading approaches to scalable GPs; the SGPR model uses a standard RBF kernel and its own set of inducing points. On all datasets, we measure the time to compute variances from scratch (i.e. assuming no terms have been pre-computed) and the time to compute variances after pre-computing all terms that are not specific to test points. (Note that the KISS-GP benchmark without LOVE does not pre-compute any terms.)

Figure 2: Predictive variance error as a function of the number of Lanczos iterations $k$ (KISS-GP model; Protein, Kin40k, PoleTele, and Elevators UCI datasets).

Table 2 summarizes the results. We see that KISS-GP with LOVE yields a substantial speedup over both KISS-GP without LOVE and SGPR across all datasets, both when computing variances from scratch and especially after pre-computation. The biggest improvements are obtained on the largest datasets, since the running time of LOVE is independent of dataset size after pre-computation, unlike the other methods. It is worth noting that these speedups come with almost no loss in accuracy. In the last column of Table 2, we report the SMAE of the LOVE variances from the KISS-GP model, as compared against the standard variances (without LOVE). On all datasets, we find that the scaled mean absolute error of this approximation is on the order of $10^{-4}$ or less.

Accuracy vs. Lanczos iterations.

In Figure 2, we measure the SMAE of LOVE’s predictive variances as a function of the number of Lanczos iterations $k$. We train a DKL/KISS-GP model with a deep RBF kernel on the four largest datasets from Table 2 and measure the error of LOVE’s predictive variances relative to the standard KISS-GP variances. As seen in Figure 2, the error decreases exponentially with the number of Lanczos iterations up to a point, after which it levels off; this plateau, as well as the small subsequent fluctuations, may be an artifact of floating-point precision. Recall that the number of Lanczos iterations $k$ corresponds to the rank of the $R$ and $R'$ matrices in (10).

4.2 Sampling

We evaluate the quality of posterior samples drawn as described in 3.3. We compare these samples to those drawn from an exact GP, and to samples drawn with random Fourier features [Rahimi & Recht(2008)Rahimi and Recht], a baseline described in both [Wang & Jegelka(2017)Wang and Jegelka] and [Hernández-Lobato et al.(2014)Hernández-Lobato, Hoffman, and Ghahramani]. We use 5,000 random Fourier features, which was the maximum number of features that could be used without exhausting available GPU memory. To control for variations during training, we learn hyperparameters on the exact GP and then use the same hyperparameters for all methods.

We test on the two largest UCI datasets that can still be solved exactly (PolTele, Elevators) and on two Bayesian optimization benchmark functions after 100 iterations of optimization (Eggholder – 2 dimensional, and Styblinski-Tang – 10 dimensional). On the two UCI datasets, we use the same DKL architecture as in the previous section. We use a standard (non-deep) RBF kernel for Eggholder BayesOpt. For Styblinski-Tang, we exploit the function’s inherent additive structure and use the additive kernel decomposition suggested by Kandasamy et al. (2015).4

Sample accuracy.

In Table 3 we evaluate the accuracy of the different sampling methods. We draw samples at a set of test locations and compare the sample covariance matrix with the true posterior covariance matrix in terms of element-wise mean absolute error. It is worth noting that all methods incur some error – even the Cholesky method, which computes an “exact” sampling matrix. Nevertheless, both Cholesky and LOVE produce very accurate sample covariance matrices: both achieve between 1 and 3 orders of magnitude less error than the random Fourier feature method. Though LOVE and Cholesky are comparable in terms of accuracy, LOVE is significantly faster. Even without pre-computation (i.e. sampling “from scratch”), LOVE is comparable to the Fourier feature method in terms of speed. After pre-computation, LOVE is several orders of magnitude faster.

Bayesian optimization.

Many acquisition functions in Bayesian optimization rely on sampling from the posterior GP. For example, max-value entropy search [Wang & Jegelka(2017)Wang and Jegelka] draws samples from the posterior GP in order to estimate the distribution of the function’s maximum value. The corresponding distribution over the location of the maximum is also the primary distribution of interest in predictive entropy search [Hernández-Lobato et al.(2014)Hernández-Lobato, Hoffman, and Ghahramani].

Figure 3: A kernel density estimate of the sampled maximum-value distribution for the (normalized) Eggholder function after 100 iterations of BayesOpt. Samples drawn with LOVE closely match samples drawn using a Cholesky decomposition (Exact). See Wang & Jegelka (2017) for details.

In Figure 3, we test how faithful each sampling method is to the true max-value distribution on the Eggholder benchmark. We plot kernel density estimates of the sampled maximum-value distributions after 100 iterations of BayesOpt. Since the Cholesky method computes an “exact” sampling matrix, its sampled max-value distribution can be considered closest to ground truth. The Fourier feature sampled distribution differs significantly; in contrast, LOVE’s sampled distribution very closely resembles the “exact” Cholesky distribution. Moreover, LOVE is substantially faster than the exact Cholesky method on this dataset (Table 3).

5 Discussion, Related Work, and Conclusion

This paper has primarily focused on KISS-GP as the underlying inducing point method due to its near-constant asymptotic complexities when used with LOVE. However, LOVE and MVM inference are fully compatible with other inducing point techniques as well. Many inducing point methods make use of the subset of regressors (SOR) kernel approximation $k_{\mathrm{SOR}}(x, x') = k_{xU} K_{UU}^{-1} k_{Ux'}$, optionally with a diagonal correction [Snelson & Ghahramani(2006)Snelson and Ghahramani], and focus on the problem of learning the inducing point locations [Quiñonero-Candela & Rasmussen(2005)Quiñonero-Candela and Rasmussen, Titsias(2009)]. After $\mathcal{O}(m^3)$ work to Cholesky decompose $K_{UU}$, this approximate kernel affords $\mathcal{O}(nm)$ MVMs. One could apply LOVE to these methods, computing a test-invariant cache in time linear in $n$ and then computing single predictive covariances in time independent of $n$. We note that, since these inducing point methods by construction make a rank-$m$ approximation to the kernel, setting $k = m$ results in the Lanczos decomposition being exact, and recovers exactly the pre-computation and prediction times of these methods.

Ensuring Lanczos solves are accurate.

Given a matrix of right-hand sides, the Lanczos decomposition is designed to approximate a solve $\hat{K}_{XX}^{-1} b$ for a single probe vector $b$ (in 3.2, the average column of $W K_{UU}$). As argued in 2.2, the resulting $Q_k$ and $T_k$ can usually be re-used to approximate the solves against the remaining columns, $\hat{K}_{XX}^{-1}(W K_{UU})$. This property of the Lanczos decomposition is why LOVE can compute fast predictive variances. While this procedure usually produces accurate solves, the solves will not be accurate if some columns of $W K_{UU}$ are (nearly) orthogonal to the Krylov subspace spanned by the columns of $Q_k$. In this scenario, Saad (1987) suggests that additional Lanczos iterations with a new probe vector will correct these errors. In practice, we find that these countermeasures are almost never necessary with LOVE – the Lanczos solves are almost always accurate.

Numerical stability of Lanczos.

A practical concern for LOVE is the effect of round-off errors on the Lanczos algorithm. In particular, it is common in floating-point arithmetic for the columns of $Q_k$ to lose orthogonality [Paige(1970), Simon(1984), Golub & Van Loan(2012)Golub and Van Loan], resulting in an incorrect decomposition. To correct for this, several methods such as full reorthogonalization and partial or selective reorthogonalization exist [Golub & Van Loan(2012)Golub and Van Loan]. In our implementation, we use full reorthogonalization when a loss of orthogonality is detected. In practice, the cost of this correction is absorbed by the parallel performance of the GPU due to the small value of $k$, and does not impact the final running time.

Conclusion.

In this paper, we have demonstrated a method for computing predictive covariances and drawing samples from the predictive distribution in nearly constant time with almost no loss in accuracy. Whereas the running times of previous state-of-the-art methods depend on dataset size, LOVE provides constant-time predictive variances. In addition to providing scalable predictions, LOVE’s fast sampling procedure has the potential to dramatically simplify a variety of applications of Gaussian processes. Many applications require both posterior samples and speed [Deisenroth & Rasmussen(2011)Deisenroth and Rasmussen, Hernández-Lobato et al.(2014)Hernández-Lobato, Hoffman, and Ghahramani, Wang & Jegelka(2017)Wang and Jegelka]. As sampling has traditionally scaled cubically with the number of test points, many methods have resorted to parametric approximations or finite-basis approaches to approximate sampling efficiently. Because LOVE produces reliable posterior samples with minimal overhead, it may render these approaches unnecessary. The ability to obtain samples in linear time may unlock new applications for Gaussian processes in the future.

Acknowledgements

JRG, GP, and KQW are supported in part by the III-1618134, III-1526012, and IIS-1149882 grants from the National Science Foundation, as well as the Bill and Melinda Gates Foundation and the Office of Naval Research. AGW is supported by NSF IIS-1563887.

Footnotes

  1. github.com/cornellius-gp/gpytorch
  2. Mean absolute error divided by the variance of y.
  3. Implemented in GPflow [Matthews et al.(2017)Matthews, van der Wilk, Nickson, Fujii, Boukouvalas, León-Villagrá, Ghahramani, and Hensman].
  4. The Styblinski-Tang KISS-GP model uses the sum of 10 RBF kernels – one for each dimension – each with its own set of inducing points.

References

  1. Asuncion, Arthur and Newman, David. Uci machine learning repository. https://archive.ics.uci.edu/ml/, 2007. Last accessed: 2018-02-05.
  2. Cunningham, John P, Shenoy, Krishna V, and Sahani, Maneesh. Fast gaussian process methods for point process intensity estimation. In ICML, pp. 192–199. ACM, 2008.
  3. Deisenroth, Marc and Rasmussen, Carl E. Pilco: A model-based and data-efficient approach to policy search. In ICML, pp. 465–472, 2011.
  4. Deisenroth, Marc Peter, Fox, Dieter, and Rasmussen, Carl Edward. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.
  5. Dong, Kun, Eriksson, David, Nickisch, Hannes, Bindel, David, and Wilson, Andrew Gordon. Scalable log determinants for gaussian process kernel learning. In NIPS, 2017.
  6. Doshi-Velez, Finale and Kim, Been. A roadmap for a rigorous science of interpretability. arXiv preprint arXiv:1702.08608, 2017.
  7. Frazier, Peter, Powell, Warren, and Dayanik, Savas. The knowledge-gradient policy for correlated normal beliefs. INFORMS journal on Computing, 21(4):599–613, 2009.
  8. Gardner, Jacob, Guo, Chuan, Weinberger, Kilian, Garnett, Roman, and Grosse, Roger. Discovering and exploiting additive structure for bayesian optimization. In AISTATS, pp. 1311–1319, 2017.
  9. Gardner, Jacob R., Pleiss, Geoff, Wu, Ruihan, Weinberger, Kilian Q., and Wilson, Andrew Gordon. Product kernel interpolation for scalable gaussian processes. In AISTATS, 2018.
  10. Golub, Gene H and Van Loan, Charles F. Matrix computations, volume 3. JHU Press, 2012.
  11. Hensman, James, Fusi, Nicolo, and Lawrence, Neil D. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
  12. Hernández-Lobato, José Miguel, Hoffman, Matthew W, and Ghahramani, Zoubin. Predictive entropy search for efficient global optimization of black-box functions. In NIPS, pp. 918–926, 2014.
  13. Hyndman, Rob J. Time series data library. http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/, 2005. Last accessed: 2018-02-05.
  14. Kandasamy, Kirthevasan, Schneider, Jeff, and Póczos, Barnabás. High dimensional bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pp. 295–304, 2015.
  15. Keys, Robert. Cubic convolution interpolation for digital image processing. IEEE transactions on acoustics, speech, and signal processing, 29(6):1153–1160, 1981.
  16. Lanczos, Cornelius. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Governm. Press Office Los Angeles, CA, 1950.
  17. Loan, Charles F Van. Introduction to scientific computing: a matrix-vector approach using MATLAB. Prentice Hall PTR, 1999.
  18. Matthews, Alexander G de G, van der Wilk, Mark, Nickson, Tom, Fujii, Keisuke, Boukouvalas, Alexis, León-Villagrá, Pablo, Ghahramani, Zoubin, and Hensman, James. Gpflow: A gaussian process library using tensorflow. Journal of Machine Learning Research, 18(40):1–6, 2017.
  19. Nickisch, Hannes, Pohmann, Rolf, Schölkopf, Bernhard, and Seeger, Matthias. Bayesian experimental design of magnetic resonance imaging sequences. In Advances in Neural Information Processing Systems, pp. 1441–1448, 2009.
  20. Paige, CC. Practical use of the symmetric lanczos process with re-orthogonalization. BIT Numerical Mathematics, 10(2):183–195, 1970.
  21. Parlett, Beresford N. A new look at the lanczos algorithm for solving symmetric systems of linear equations. Linear algebra and its applications, 29:323–346, 1980.
  22. Quiñonero-Candela, Joaquin and Rasmussen, Carl Edward. A unifying view of sparse approximate gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
  23. Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In Advances in neural information processing systems, pp. 1177–1184, 2008.
  24. Rasmussen, Carl Edward and Williams, Christopher KI. Gaussian processes for machine learning, volume 1. MIT press Cambridge, 2006.
  25. Saad, Youcef. On the lanczos method for solving symmetric linear systems with several right-hand sides. Mathematics of computation, 48(178):651–662, 1987.
  26. Saatçi, Yunus. Scalable inference for structured Gaussian process models. PhD thesis, University of Cambridge, 2012.
  27. Schneider, Michael K and Willsky, Alan S. Krylov subspace estimation. SIAM Journal on Scientific Computing, 22(5):1840–1864, 2001.
  28. Schulam, Peter and Saria, Suchi. What-if reasoning with counterfactual gaussian processes. In NIPS, 2017.
  29. Simon, Horst D. The lanczos algorithm with partial reorthogonalization. Mathematics of Computation, 42(165):115–142, 1984.
  30. Snelson, Edward and Ghahramani, Zoubin. Sparse Gaussian processes using pseudo-inputs. In NIPS, pp. 1257–1264, 2006.
  31. Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
  32. Titsias, Michalis K. Variational learning of inducing variables in sparse gaussian processes. In AISTATS, pp. 567–574, 2009.
  33. Wang, Zi and Jegelka, Stefanie. Max-value entropy search for efficient bayesian optimization. In ICML, 2017.
  34. Wang, Zi, Li, Chengtao, Jegelka, Stefanie, and Kohli, Pushmeet. Batched high-dimensional bayesian optimization via structural kernel learning. arXiv preprint arXiv:1703.01973, 2017.
  35. Wilson, Andrew and Adams, Ryan. Gaussian process kernels for pattern discovery and extrapolation. In ICML, pp. 1067–1075, 2013.
  36. Wilson, Andrew G, Hu, Zhiting, Salakhutdinov, Ruslan R, and Xing, Eric P. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pp. 2586–2594, 2016a.
  37. Wilson, Andrew Gordon and Nickisch, Hannes. Kernel interpolation for scalable structured gaussian processes (kiss-gp). In ICML, pp. 1775–1784, 2015.
  38. Wilson, Andrew Gordon, Dann, Christoph, and Nickisch, Hannes. Thoughts on massively scalable gaussian processes. arXiv preprint arXiv:1511.01870, 2015.
  39. Wilson, Andrew Gordon, Hu, Zhiting, Salakhutdinov, Ruslan, and Xing, Eric P. Deep kernel learning. In AISTATS, pp. 370–378, 2016b.
  40. Zhou, Jianlong, Arshad, Syed Z, Luo, Simon, and Chen, Fang. Effects of uncertainty and cognitive load on user trust in predictive decision making. In IFIP Conference on Human-Computer Interaction, pp. 23–39. Springer, 2017.