1 Introduction

Gaussian Process Learning via Fisher Scoring

of Vecchia’s Approximation

Joseph Guinness

Cornell University, Department of Statistics and Data Science

Abstract

We derive a single pass algorithm for computing the gradient and Fisher information of Vecchia’s Gaussian process loglikelihood approximation, which provides a computationally efficient means for applying the Fisher scoring algorithm for maximizing the loglikelihood. The advantages of the optimization techniques are demonstrated in numerical examples and in an application to Argo ocean temperature data. The new methods are more accurate and much faster than an optimization method that uses only function evaluations, especially when the covariance function has many parameters. This allows practitioners to fit nonstationary models to large spatial and spatial-temporal datasets.

## 1 Introduction

The Gaussian process model is an indispensible tool for the analysis of spatial and spatial-temporal datasets and has become increasingly popular as a general-purpose model for functions. Because of its high computational burden, researchers have devoted substantial effort to developing numerical approximations for Gaussian process computations. Much of the work focuses on efficient approximation of the likelihood function. Fast likelihood evaluations are crucial for optimization procedures that require many evaluations of the likelihood, such as the default Nelder-Mead algorithm (Nelder and Mead, 1965) in the R optim function. The likelihood must be repeatedly evaluated in MCMC algorithms as well.

Compared to the amount of literature on efficient likelihood approximations, there has been considerably less development of techniques for numerically maximizing the likelihood (see Geoga et al. (2018) for one recent example). This article aims to address the disparity by providing:

1. Formulas for evaluating the gradient and Fisher information for Vecchia’s likelihood approximation in a single pass through the data, so that the Fisher scoring algorithm can be applied. Fisher scoring is a modification of the Newton-Raphson optimization method, replacing the Hessian matrix with the Fisher information matrix.

2. Numerical examples with simulated and real data demonstrating the practical advantages that the new techniques provide over an optimizer that uses function evaluations alone.

Among the sea of Gaussian process approximations proposed over the past several decades, Vecchia’s approximation (Vecchia, 1988) has emerged as a leader. It can be computed in linear time and with linear memory burden, and it can be parallelized. Maximizing the approximation corresponds to solving a set of unbiased estimating equations, leading to desirable statistical properties (Stein et al., 2004). It is general in that it does not require gridded data nor a stationary model assumption. The approximation forms a valid multivariate normal model, and so it can be used for simulation and conditional simulation. As an approximation to the target model, it is highly accurate relative to competitors (Guinness, 2018). Vecchia’s approximation also forms a conceptual hub in the space of Gaussian process approximations, since a generalization includes many well-known approximations as special cases (Katzfuss and Guinness, 2017). Lastly, there are publicly available R packages implementing it (Finley et al., 2017; Guinness and Katzfuss, 2018).

The numerical examples in this paper show that, in realistic data and model scenarios, the new techniques offer significant computational advantages over default optimization techniques. Although it is more expensive to evaluate the gradient and Fisher information in addition to the likelihood, the Fisher scoring algorithm converges in a small number of iterations, leading to a large advantage in total computing time over an optimization method that uses only the likelihood. For isotropic Matérn models, the speedup is roughly 2 to 4 times, and on more complicated models with more parameters, the new techniques can be more than 40 times faster. This is a significant practical improvement that will be attractive to practitioners choosing among various methods.

## 2 Background

Let be locations in a domain . At each , we observe a scalar response , collected into column vector . Along with the response, we observe covariates collected into an design matrix . In the Gaussian process, we model as a multivariate normal vector with expected value (), and covariance matrix , where the entry of is . The function is positive definite on and depends on covariance parameters . The loglikelihood for and is

 logfβ,θ(y)=−n2log(2π)−12logdetΣθ−12(y−Xβ)TΣ−1θ(y−Xβ). (1)

Unless has some exploitable structure, evaluation of the loglikelihood involves storing the entries of and performing floating point operations to obtain the Cholesky factor of , both of which are computationally prohibitive when is large.

Vecchia’s loglikelihood approximation is a modification of the conditional representation of a joint density function. Let , and be the corresponding subvector of . Vecchia’s loglikelihood approximation is

 ℓ(β,θ)=n∑i=1logfβ,θ(yi|yg(i)), (2)

leading to computational savings when is small. As mentioned in the introduction, Vecchia’s likelihood approximation corresponds to a valid multivariate normal distribution with mean and a covariance matrix . To motivate why obtaining the gradient and Fisher information poses an analytical challenge, consider the partial derivative of Vecchia’s loglikelihood with respect to :

 ∂ℓ(β,θ)∂θj=12(y−Xβ)T˜Σ−1θ∂˜Σθ∂θj˜Σ−1θ(y−Xβ)−12Tr(˜Σ−1θ∂˜Σθ∂θj), (3)

where is an matrix of partial derivatives of with respect to . Not only is too large to store in memory, the covariances are not easily computable, nor are their partial derivatives. In the next section, we outline a simple reframing of Vecchia’s likelihood that leads to a computationally tractable method of evaluating the gradient and Fisher information.

## 3 Derivations for Single Pass Algorithm

To derive formulas for the gradient and Fisher information, it is helpful to rewrite the conditional likelihoods in terms of marginals. To this end, define and . Define the design matrices for and , respectively, as and , and define the covariance matrices for and , respectively as and (suppressing dependence on ). The notation is chosen to follow the mnemonic device that the first of the two letters alphabetically is a subvector or submatrix of the second letter. Vecchia’s loglikelihood can then be rewritten as

 ℓ(β,θ)= m∑i=1logfβ,θ(vi)−logfβ,θ(ui) (4) = −12n∑i=1[logdetBi−logdetAi] (5) −12n∑i=1[(vi−Riβ)TB−1i(vi−Riβ)−(ui−Qiβ)TA−1i(ui−Qiβ)]−n2log(2π). (6)

Our proposed algorithm for obtaining the likelihood, gradient, and Fisher information involves computing the following quantities in a single pass through the data.

 (logdet) =n∑i=1(logdetBi−logdetAi) (7) (dlogdetj) =n∑i=1(Tr(B−1iBi,j)−Tr(A−1iAi,j)) (8) (ySy) =n∑i=1(vTiB−1ivi−uTiA−1iui) (9) (XSy) =n∑i=1(RTiB−1ivi−QTiA−1iui) (10) (XSX) =n∑i=1(RTiB−1iRi−QTiA−1iQi) (11) (dySyj) =−n∑i=1(vTiB−1iBi,jB−1ivi−uTiA−1iAi,jA−1iui) (12) (dXSyj) =−n∑i=1(RTiB−1iBi,jB−1ivi−QTiA−1iAi,jA−1iui) (13) (dXSXj) =−n∑i=1(RTiB−1iBi,jB−1iRi−QTiA−1iAi,jA−1iQi) (14) (Trjk) =n∑i=1[Tr(B−1iBi,jB−1iBi,k)−Tr(A−1iAi,jA−1iAi,k)], (15)

where and are the matrices of partial derivatives of and , respectively, with respect to . The quantities having the form are simply the partial derivatives of the corresponding quantity with respect to . Each of these quantities can be updated at each , and so all can be evaluated in a single pass through the data. We refer to them collectively as our single-pass quantities.

### 3.1 Profile Likelihood, Gradient, and Fisher Information

Given covariance parameter , denote the maximum Vecchia likelihood estimate of as . Since has a closed form expression (Section 3.2), we can maximize the profile likelihood over alone. The profile likelihood can be written in terms of our single-pass quantities as

 ℓ(ˆβ(θ),θ)=−n2log(2π)−12(logdet)−12[(ySy)−2(XSy)ˆβ(θ)+ˆβ(θ)T(XSX)ˆβ(θ)]. (16)

Therefore the partial derivatives can also be written in terms of the single pass quantities as

 ∂ℓ(ˆβ(θ),θ)∂θj= −12(dlogdetj)−12[(dySyj)−2(dXSyj)ˆβ(θ)+ˆβ(θ)T(dXSXj)ˆβ(θ)] −12[−2(XSy)∂ˆβ(θ)∂θj+2ˆβ(θ)T(XSX)∂ˆβ(θ)∂θj], (17)

where is the column vector of partial derivatives of the entries of with respect to covariance parameter . The Fisher information is

 I(θ)jk=12n∑i=1[Tr(B−1iBi,jB−1iBi,k)−Tr(A−1iAi,jA−1iAi,k)]=12(Trjk).

It remains to be shown that and can be computed using our single-pass quantities.

### 3.2 Mean Parameters

The profile likelihood estimate satisfies for every . These partial derivatives are

 ⎡⎢ ⎢⎣∂ℓ(β,θ)/∂β1⋮∂ℓ(β,θ)/∂βp⎤⎥ ⎥⎦=n∑i=1RTiB−1i(vi−Riβ)−QTiA−1i(ui−Qiβ), (18)

giving the equation

 [n∑i=1(RTiB−1iRi−QTiA−1iQi)]ˆβ(θ)=[n∑i=1(RTiB−1ivi−QTiA−1iui)]. (19)

Therefore, the profile likelihood estimate of is

 ˆβ(θ)=(XSX)−1(XSy), (20)

a function of our single pass quantities. Taking partial derivatives with respect to yields

 (21)

also a function of our single pass quantities.

## 4 Numerical Studies

This section contains timing results, comparing the R optim implementation of the Nelder-Mead algorithm to the Fisher scoring algorithm. In Fisher scoring, we reject steps that do not increase the loglikelihood, dividing the step size by 2 iteratively. If we cannot increase the loglikelihood along the Fisher step direction, we attempt to step along the gradient. For both Nelder-Mead and Fisher scoring, we first generate estimates using Vecchia’s approximation with nearest neighbors, then refine the estimates using . In Nelder-Mead, we evaluate only the likelihood, not the gradient and Fisher information. The Fisher scoring algorithm stops when the dot product between the step and the gradient is less than 1e-4. Default stopping criteria were used for Nelder-Mead algorithm. We simulate all datasets from the same model:

 Y(s)=μ+Z(s)+ε(s), (22)

where , is a Gaussian process with exponential covariance function , and are i.i.d.  with . We take . Data are simulated on an evenly spaced grid of 4900 locations on . In addition to the exponential covariance with unknown variance and range, we estimate parameters in three covariance models that generalize the exponential:

 K(s1,s2) =σ2Γ(ν)2ν−1(∥s1−s2∥α)νKν(∥s1−s2∥α) (23) K(s1,s2) =σ2Γ(ν)2ν−1(∥Ls1−Ls2∥)νKν(∥Ls1−Ls2∥) (24) K(s1,s2) =exp(J∑j=1bj(ϕj(s1)+ϕj(s2)))σ2Γ(ν)2ν(∥s1−s2)∥α)νKν(∥s1−s2∥α). (25)

The first is an isotropic Matérn covariance function. The second is a geometrically anisotropic Matérn covariance, with anisotropy parameterized by the lower triangular matrix . The third is a Matérn covariance with a nonstationary variance function. The nonstationary variances are defined in terms of pre-specified known basis functions and unknown parameters . For identifiability purposes, the basis functions are an orthogonal basis that is also orthogonal to a constant function. The orthogonal basis is formed by applying Gram-Schmidt orthogonalization to a set of Gaussian basis functions.

Excluding , which is estimated by profile maximum likelihood, but including the nugget variance , the four models have 3, 4, 6, and 12 unknown parameters. Each model has a multiplicative variance parameter . In the Nelder-Mead algorithm, we profile out , whereas In Fisher scoring, we do not. We found that profiling does not substantially influence convergence speed in Fisher scoring. All positive parameters are mapped to the real line by a log transform. Each model was fit to each dataset on an independent R process running on a single thread of an 8-core (16 thread) Intel Xeon W-2145 (3.7GHz, 4.5GHz Turbo) processor with 16GB RAM. Fifteen datasets were sent to each of 14 threads, yielding 210 simulated datasets.

Hisotograms of the timing results are given in Figure 1. Considering the median times, Fisher scoring is 2-3 times faster for the isotropic models (3 and 4 parameters), more than 10 times faster for the stationary Matérn model (6 parameters), and 47 times faster for the nonstationary model (12 parameters). There is also no noticeable loss in accuracy, which we can evaluate by comparing the maximum loglikelihoods returned by Fisher scoring to the loglikelihoods returned by Nelder-Mead. For the isotropic models, the two loglikelihoods never differed by more than 0.001 units. In the stationary model, the Fisher scoring loglikelihood was more than 0.01 larger than the Nelder-Mead loglikelihood in 11 of the 210 datasets, whereas the Nelder-Mead loglikelihood was never more than 0.01 larger than the Fisher scoring loglikelihood. For the nonstationary model, the Fisher scoring loglikelihoods were more than 0.01 larger than the Nelder-Mead loglikelihoods in 199 of the 210 datasets, whereas the Nelder-Mead loglikelihoods were more than 0.01 larger in just 3 of the 210 datasets. Not only does Nelder-Mead take nearly 50 times longer than Fisher scoring in the nonstationary model, it usually does not find the actual maximum likelihood estimates.

### 4.1 Identifiability

Surprisingly, the maximum likelihood estimate of the nugget variance can be a negative number. This is not an error of the optimization algorithm (a triumph rather); in these cases, the algorithm finds a negative nugget with a higher likelihood than any nonnegative nugget can produce. Negative nugget estimates occur most frequently when the data are evenly spaced and when the maximum likelihood estimate of the smoothness parameter is small (). In this scenario, the covariance function has a narrow peak at distances smaller than the minimum spacing between locations. This happened frequently enough in our testing of the Fisher scoring algorithm that we found it necessary to impose a penalty on very small values of the nugget and smoothness parameters, since it is not sensible to return negative nugget estimates. The penalties are

 pen(τ2)=−0.01log(1+0.01/τ2),pen(ν)=−0.01log(1+0.2/ν).

The likelihood function also has difficulty jointly identifying variance and range parameters when the range parameter estimate is much larger than the maximum distance between points in the dataset. This is a theoretically well-studied problem (Zhang, 2004) that no optimization routine can overcome. We have found that penalizing large variance parameters helps improve convergence of Fisher scoring without sacrificing accuracy. We used the penalty

 pen(σ2)=log(1+eσ2/˜σ2−6),

where is the estimate of the residual variance parameter in a least squares fit of the response to the constant covariate. This imposes essentially no penalty on the parameter unless it is several times larger than the least squares estimate, after which the penalty increases roughly linearly in . These two identifiability problems can be handled more elegantly in a Bayesian framework, but we do not pursue that here because identifiability is not the focus of this paper.

## 5 Case Study: Argo Ocean Temperature Data

Argo is a global program that deploys floating ocean temperature sensors (International Argo Program, 2019). Each Argo float operates on a 10 day cycle, during which it descends to a 2000m depth and returns to the surface, collecting temperature and salinity measurements along the depth profile. The floats drift freely in the horizontal direction with ocean currents. As of May 2019, 3,799 floats covered the globe. We analyze a subset of the observations collected at 100 dbar (approximately 100m depth) between January 1 and March 31, 2016. Preprocessed data were provided by Mikael Kuusela and are described in more detail in Kuusela and Stein (2018). In total, there are 32,492 measurements over the three month period. The data are plotted in Figure 2.

We model the data from day and location on the sphere as

 Y(s,t)=β0+β1L(s)+β2L2(s)+Z(s+Φ(s),t)+ε(s,t),

where is the latitude of location , is a Gaussian process with covariance function , is a spatial warping function, and are i.i.d. mean zero normals with variance . We consider both spatial and spatial-temporal models for :

 Kθ((s1,t1),(s2,t2)) =σ22ν−1Γ(ν)(dα(s1,s2))νKν(dα(s1,s2)), Kθ((s1,t1),(s2,t2)) =σ22ν−1Γ(ν)(dα((s1,t1),(s2,t2)))νKν(dα((s1,t1),(s2,t2))).

The function is Euclidean distance scaled by either a spatial range parameter or spatial and temporal range parameters

 dα(s1,s2) =∥s1−s2∥α,dα((s1,t1),(s2,t2))=(∥s1−s2∥2α21+|t1−t2|2α22)1/2.

The warping function is assumed to be a linear combination of the gradients of the five spherical harmonic functions of degree 2, where the gradient is with respect to the three Euclidean coordinates. We use degree 2 because the degree 0 function is constant, and the degree 1 spherical harmonics have constant partial derivatives (as a function of ), and so degree 1 warpings simply translate all points by the same vector and do not affect the covariances. We also consider the special case of for all , which corresponds to isotropic models in space and time. The spatial warping model has 9 parameters, while the space-time warping model has 10. The isotropic models have 4 and 5 parameters.

We fit each model using both Fisher scoring and Nelder-Mead, with the results given in Table 1. Fisher scoring is able to fit the space-time warping model in 12.13 minutes, whereas Nelder-Mead ran for 190 minutes and returned a loglikelihood value 2.038 units lower. In the spatial-only warping model, Fisher Scoring finished in 6.56 minutes, whereas Nelder-Mead returned a loglikelihood value 0.01 lower after 80 minutes. The two methods produced nearly the same loglikelihoods on the isotropic models, with Fisher scoring running more than twice as fast. The results closely mirror the numerical study, where Fisher scoring had its largest improvements in both speed and accuracy when fitting models with the many parameters. Finally, in Figure 2, we plot as a function of , with km. The images show that the warping model produces an anisotropic variogram, with larger increment variances near the equator.

## 6 Discussion

We believe that practitioners will benefit from the availability of high quality algorithms for fitting nonstationary Gaussian process models to large spatial and spatial-temporal datasets. The methods are applicable to any covariance function that is differentiable with respect to its parameters. This is important because it separates the tasks of constructing models and developing methods for fitting the models, freeing us to select the most appropriate covariance function for the data rather than the most appropriate model for which a specialized method exists. The Fisher scoring algorithm, as well as anisotropic, nonstationary variance, and warping covariance functions, will be implemented in version 0.2.0 of the GpGp R package (Guinness and Katzfuss, 2018).

Acknowledgements

This work was supported by the National Science Foundation under grant No. 1613219 and the National Institutes of Health under grant No. R01ES027892.

## References

• Finley et al. (2017) Finley, A., Datta, A., Banerjee, S., and Mckim, A. (2017). spNNGP: Spatial Regression Models for Large Datasets using Nearest Neighbor Gaussian Processes. R package version 0.1.1.
• Geoga et al. (2018) Geoga, C. J., Anitescu, M., and Stein, M. L. (2018). Scalable Gaussian process computations using hierarchical matrices. arXiv preprint arXiv:1808.03215.
• Guinness (2018) Guinness, J. (2018). Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics, 60(4):415–429.
• Guinness and Katzfuss (2018) Guinness, J. and Katzfuss, M. (2018). GpGp: Fast Gaussian Process Computation Using Vecchia’s Approximation. R package version 0.1.0.
• International Argo Program (2019) International Argo Program (2019). Online: accessed 2019-05-19.
• Katzfuss and Guinness (2017) Katzfuss, M. and Guinness, J. (2017). A general framework for Vecchia approximations of Gaussian processes. arXiv preprint arXiv:1708.06302.
• Kuusela and Stein (2018) Kuusela, M. and Stein, M. L. (2018). Locally stationary spatio-temporal interpolation of Argo profiling float data. Proceedings of the Royal Society A, 474(2220):20180400.
• Nelder and Mead (1965) Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. The computer journal, 7(4):308–313.
• Stein et al. (2004) Stein, M. L., Chi, Z., and Welty, L. J. (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(2):275–296.
• Vecchia (1988) Vecchia, A. V. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society: Series B (Methodological), 50(2):297–312.
• Zhang (2004) Zhang, H. (2004). Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics. Journal of the American Statistical Association, 99(465):250–261.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters