Going off grid: Computationally efficient inference for log-Gaussian Cox processes
This paper introduces a new method for performing computational inference on log-Gaussian Cox processes. The likelihood is approximated directly by making novel use of a continuously specified Gaussian random field. We show that for sufficiently smooth Gaussian random field prior distributions, the approximation can converge with arbitrarily high order, while an approximation based on a counting process on a partition of the domain only achieves first-order convergence. The given results improve on the general theory of convergence of the stochastic partial differential equation models, introduced by Lindgren et al. (2011). The new method is demonstrated on a standard point pattern data set and two interesting extensions to the classical log-Gaussian Cox process framework are discussed. The first extension considers variable sampling effort throughout the observation window and implements the method of Chakraborty et al. (2011). The second extension constructs a log-Gaussian Cox process on the world’s oceans. The analysis is performed using integrated nested Laplace approximation for fast approximate inference.
Biometrika \jyear20?? \jvol?? \jnum? \accessdateAdvance Access publication on ?? ???? 20?? \copyrightinfo\Copyright 20?? Biometrika Trust
Printed in Great Britain
pproximation of Gaussian random fields; Gaussian Markov random fields; Integrated nested Laplace approximation; Spatial point processes; Stochastic partial differential equations.
Data consisting of sets of locations at which some objects are present are common in biology, ecology and economics. The appropriate statistical models for this type of data are spatial point process models, which have been extensively studied by statisticians and probabilists (Møller & Waagepetersen, 2004; Illian et al., 2008) but are less commonly used by the scientists producing the data sets. Point process models are often hard to fit, so scientists often resort to using inappropriate methods. (Chakraborty et al., 2011) discuss this in the context of presence-only datasets, and outline various ad hoc approaches used by ecologistsThere is an interesting discussion of this in the context of presence only data sets (Chakraborty et al., 2011), which outlines a number of ad hoc approaches taken by the ecological community.
Many real data sets do not have the simple structure usually considered in the classical statistical literature, i.e., that of a simple point pattern that has been observed everywhere within a simple, often rectangular, plot. For instance, in real data sets the observation process is often not straightforward due to practical limitations, or the observation window is complex. This includes data sets mapping the locations of bird species, for which very little data have been collected in the Himalayas for obvious reasons. Therefore, on top of sampling issues such as incompletely observed point patterns, positional errors, etc., this data set has a large hole where it is believed that birds reside, but it is impractical to look for them. Very different, but similarly complex, data deal with freak waves in the oceans. Even if we ignore temporal aspects, or the uncertainty in the observed locations, this data set remains complicated, as the observation window covers most of a sphere and has a very complicated boundary. Motivated by data sets of this nature, this paper proposes an easy to use, computationally efficient method for performing inference on spatial point process models that is sufficiently flexible to handle these and other data structures.
In this paper we focus on log-Gaussian Cox processes, a class of flexible models that is particularly useful in the context of modelling aggregation relative to some underlying unobserved environmental field (Møller et al., 1998; Illian et al., 2012). However, standard methods for fitting Cox processes are computationally expensive and the Markov chain Monte Carlo methods that are commonly used are difficult to tune for this problem. Recently, Illian et al. (2012) developed a fast, flexible framework for fitting log-Gaussian Cox processes using integrated nested Laplace approximation (Rue et al., 2009). They construct a Poisson approximation to the true log-Gaussian Cox process likelihood to perform inference on a regular lattice over the observation window, counting the number of points in each cell. If the lattice is fine enough and the latent Gaussian field is appropriately discretised, this approximation is quite good (Waagepetersen, 2004), but it can be computationally wasteful, especially when the process intensity is high or the observation window is large or oddly shaped. New results on the strong convergence of the lattice approximation, provided in the Appendix, show that the rate of convergence on a lattice is fundamentally limited to by the counting approximation.
In the Appendix, we provide detailed results on the convergence of the approximations proposed in this paper. In particular, we show that, for a Gaussian random field with fixed parameters, the posterior distributions generated using the proposed method will converge strongly to the true posterior distribution. Furthermore, it is shown that these posterior distributions can converge with arbitrarily high order and the convergence is limited only by the smoothness of the random field. In this paper, we place particular emphasis on the combination of this method with the flexible stochastic partial differential equation models of Lindgren et al. (2011) and we significantly improve the existing convergence theory for these models. In particular, we show that the approximate posterior distributions converge weakly and the error when computing a posterior functional is almost .
The first of our aims is to re-examine the standard methodology for Bayesian inference on log-Gaussian Cox processes and to propose an approach that is much more computationally efficient based on continuously specified finite-dimensional Gaussian random fields. The key characteristic of our approach is that the specification of the Gaussian random field is completely separated from the approximation of the likelihood, leading to far greater flexibility. The second aim is to demonstrate that this approach can be handled within the general approximation framework of Rue et al. (2009), by modelling the Gaussian random field through a stochastic partial differential equation (Lindgren et al., 2011). This provides a unified modelling structure. An associated R-package makes our methods that accessible to scientists.
2 Log-Gaussian Cox processes
Consider a bounded region . A simple point process model is the inhomogeneous Poisson process, in which the number of points within a region is Poisson distributed with mean , where is the intensity surface of the point process. Given the intensity surface and a point pattern the likelihood of an inhomogeneous Poisson process is
This likelihood is analytically intractable, as it requires the integral of the intensity function, which typically cannot be calculated explicitly. This integral can, however, be computed numerically using standard methods.
Treating the intensity surface as a realisation of a random field yields a particularly flexible class of point processes known as Cox or doubly stochastic Poisson processes (Møller & Waagepetersen, 2004). These are typically used to model aggregation in point patterns resulting from observed or unobserved environmental variation. In this paper we consider log-Gaussian Cox processes, where the intensity surface is modelled as , and is a Gaussian random field. Conditional on a realisation of , a log-Gaussian Cox process is an inhomogeneous Poisson process. The likelihood for such a process is of the form (1), where the integral is further complicated by the stochastic nature of , and methods for approximating (1) are the focus of the next two sections. Log-Gaussian Cox processes fit naturally within the Bayesian hierarchical modelling framework and are latent Gaussian models. They may be fitted using the integrated nested Laplace approximation approach of Rue et al. (2009), allowing us to construct models that include covariates, marks and non-standard observation processes while still allowing computationally efficient inference (Illian et al., 2012). Therefore, approximating the likelihood in (1) constitutes a basic calculation for practical problems such as those discussed in Section 7.
3 Computation on fine lattices is wasteful
A common method for performing inference with log-Gaussian Cox processes is to take the observation window , construct a fine regular lattice over it, and then consider the number of points observed in each cell of the lattice (Møller et al., 1998; Illian et al., 2012). It is a simple consequence of the definition of a log-Gaussian Cox process that the may be considered as independent Poisson random variables, that is where is the total intensity in each cell. It is impossible to compute the total intensity for each cell and we therefore use the approximation , where is a representative value of within the cell and is the area of cell . The log-Gaussian Cox process model can then be treated within the generalised linear mixed model framework. This method has been used in a number of applications and converges to the true solution as the size of the cells decreases to zero; see Corollary A1 or Waagepetersen (2004).
The computational challenge is that, if is a general Gaussian random field, the multivariate Gaussian vector that contains the s will have a dense covariance matrix. The resulting computational complexity limits this method to quite small lattices. If is stationary and the observation window is a rectangle, it is possible to use the block Toeplitz structure of the covariance matrix to speed up some computations (Møller et al., 1998). Unfortunately, the block Toeplitz structure is fragile and any inference method that constructs a second-order approximation to the posterior distribution, such as manifold Markov chain Monte Carlo simulation (Girolami & Calderhead, 2011) or the integrated nested Laplace approximation, will destroy the computational savings.
A common computationally efficient approach is to model as a conditional autoregressive model on the fine lattice and use this to perform fast computations (Rue & Held, 2005). The conditional autoregressive approach has been used extensively in applications and may be fitted using the integrated nested Laplace approximation (Illian et al., 2012). Both methods rely heavily on the regularity of the lattice, as it is quite difficult to construct a conditional autoregressive model on an irregular lattice that is resolution-consistent (Rue & Held, 2005).
However, these methods are unsatisfactory since the computational lattice has two fundamentally different roles. The first and most natural role is to approximate the latent Gaussian random field . The second and rather unnatural role of the computational lattice is to approximate the locations of the points, even though the data have often been collected with high precision. Clearly, the finer the lattice is, the less information is lost, so the quality of the likelihood approximation primarily depends on the size of the grid. In fact, Corollary A1 shows that this binning process is the dominant source of error in the lattice approximation. As a result, we are required to compute on a much finer grid than is necessary for the approximation of the latent Gaussian field, making lattice-based approaches inherently wasteful in this context.
The inflexibility inherent in lattice-based methods also implies that the approximation to the latent random field cannot be locally refined. In the problem considered in Section 7.3, a large region has not been sampled. Generating a high resolution approximation to the latent field over this area would be computationally wasteful. It would be more efficient to reduce the resolution in these areas without affecting that in those that have been sampled. While this is impossible with lattice-based methods, the flexible method introduced here allows local changes to the resolution of the approximation.
4 Approximating the likelihood using a finite-dimensional random field
Rather than defining a Gaussian random field over a fine lattice, we propose a finite-dimensional continuously specified random field of the form
where is a multivariate Gaussian random vector and is a set of linearly independent deterministic basis functions. This is similar in spirit to the Karhunen–Loève decomposition of stochastic processes, which is based on eigen-decomposition of the covariance function of the process. Three other common approximations to Gaussian random fields can also be expressed as in (2). Process convolution models (Higdon, 1998) use the approximation where the first integral is a white noise integral, the are independent Gaussian random variables, and the points lie on a lattice within . The second class of models uses correlated weights and selects basis functions, either based on a parent Gaussian process as for predictive processes (Banerjee et al., 2008), or from other considerations, as in fixed-rank kriging (Cressie & Johannesson, 2008). Chakraborty et al. (2011) investigated log-Gaussian Cox process models using predictive processes. The third class comprises the stochastic partial differential equation models of Lindgren et al. (2011), which take to be compactly-supported piecewise linear functions. This choice of delivers considerable computational benefits and will be further explored in Section 5 and Appendix .4. All of the examples in this paper use stochastic partial differential equation models for the latent process .
With the continuous Gaussian random field model in place, we are in a position to attack the intractable likelihood (1). In this section, we outline a procedure for approximating the likelihood that extends the standard approximation to the non-lattice, unbinned data case. The log-likelihood consists of two terms: the stochastic integral, and the evaluation of the field at the data points. While the continuously-specified stochastic partial differential equation models allow us to compute the sum term exactly, we must approximate the integral by a sum. Consider a deterministic integration rule of the general form for fixed, deterministic nodes and weights . Using this integration rule, we can construct the approximation
where is a constant, is a matrix containing the values of the latent Gaussian model (2) at the integration nodes , and evaluates the latent Gaussian field at the observed points .
The advantage of the approximation (3) is that it is of Poisson form. In particular, given and , the approximate likelihood consists of independent Poisson random variables. To see this, we write and . Then, if we construct some pseudo-observations , the approximate likelihood factors as
which is similar to the likelihood for observing conditionally independent Poisson random variables with means and observed values .
Numerical integration schemes that lead to likelihood approximations of the form (4) were also considered by Baddeley & Turner (2000) for approximating pseudolikelihoods of Gibbs-type point processes. However, to the best of our knowledge, these ideas have not been extended to log-Gaussian Cox processes, probably due to the paucity of computationally efficient continuously specified Gaussian random field models.
In the Appendix, we show that the approximate posterior distribution converges to the true posterior distribution generated using the correct log-Gaussian Cox process likelihood at a rate that depends on the smoothness of the field and the quality of the integration rule. Hence, while Baddeley & Turner (2000) suggest placing “one […] point, either systematically or randomly”, for log-Gaussian Cox processes, there is a strong advantage to carefully designing the underlying integration scheme.
5 Stochastic partial differential equations and Markov random fields
The approximation outlined in Section 4 will work for any finite-dimensional random field (2). This section shows how this approach fits naturally with our preferred finite-dimensional random field model. In particular, we review the stochastic partial differential equation construction of Lindgren et al. (2011) and show how this naturally extends the conditional autoregressive modelling strategy of Illian et al. (2012).
The basic idea of Lindgren et al. (2011) is that, given a surface, an appropriate lower-resolution approximation to the surface can be constructed by sampling the surface in a set of well designed points and constructing a piecewise linear interpolant. We will, therefore, take the basis functions in (2) to be a set of piecewise linear functions defined over a triangular mesh, which gives more geometric flexibility than does a traditional grid-based method.
We consider Matérn random fields, i.e. zero-mean Gaussian stationary, isotropic random fields with covariance function , where , is the modified Bessel function of the second kind, is the smoothing parameter, is the range parameter, is a scaling parameter, and the normalisation is chosen to link it with the representation (5). The subset of Matérn random fields for which is an integer, where is the dimension of the space, yields computationally efficient piecewise linear representations by representing of the Matérn field as the stationary solution to the stochastic partial differential equation
where is an integer, is the Laplacian operator, and is spatial white noise. This representation was first constructed by Whittle (1954, 1963) while proving that the classical second-order conditional autoregression model converges under lattice refinement to a Matérn field with .
Piecewise linear approximations to deterministic partial differential equations are commonly constructed in physics, engineering and applied mathematics using the finite element method, which was used by Lindgren et al. (2011) to efficiently represent the appropriate Matérn fields. When , the final outcome of their procedure replaces the stochastic partial differential equation (5) with a simple equation for the weights in the basis expansion (2)
where , and are sparse matrices with entries
The boundary of is , while is the normal derivative of and is diagonal, see Appendix C.5 in Lindgren et al. (2011) for a discussion on the choice of . Lindgren et al. (2011) also show that these models lead exactly to the classical conditional autoregressive models when computed over a regular lattice. This model can be extended to non-stationary, anisotropic, multivariate and spatiotemporal random fields (Cameletti et al., 2013; Fuglstad et al., 2015), and the methods described in this paper extend to these cases in a straightforward way, although the implementation of these models may be non-trivial.
The matrix in (6) encodes information on the process on the boundary of the observation window . The effect of physical boundaries in spatial models has received very little attention in the literature. A notable example in the context of Bayesian smoothing is Wood et al. (2008). For the remainder of this paper, we will set , which corresponds to Neumann, or no-flux, boundary conditions. These specify that the normal derivative of the field at the boundary is zero and can be physically related to an insulating boundary from which no heat escapes. We discuss the interpretation of this condition in Section 7.4.
We suggest a meshing strategy that constructs a regular triangulation of the observation window, and refine it in areas where there are a large number of points. Point pattern data hold information on the relevant point process even in areas with only a few points. Hence, in order to avoid approximation bias introduced by the choice of mesh, the triangulation needs to cover the space in a fairly regular way. On the other hand, we are unlikely to be able to infer the fine-scale latent structure in areas where we have no points or there has been little sampling. A detailed discussion of mesh selection can be found in Chapter 6 of Blangiardo & Cameletti (2015).
In order to complete the model specification, we must define an integration scheme to be used in (3). The simplest option is to attach to each node in the mesh a region for which the value of the basis function is greater than the value of any other basis function. This construction, shown in Fig. 1, corresponds to the important notion of the dual mesh. The corresponding integration rule sets to be the node location and to be the volume of the dual cell. This approximation, known as the midpoint rule, is second-order accurate on a regular grid but will be first-order accurate on an irregular mesh. We can use the structure of the mesh in other ways when constructing the integrator, for example constructing an integration scheme as the sum of optimal Gaussian integration rules on the individual triangles in the mesh. The weights and integration points for general triangles can be found in books on numerical analysis or finite element methods (Ern & Guermond, 2004). We discuss this further in the Appendix.
6 Convergence of the approximations
The proposed method (3) for approximating the likelihood in a log-Gaussian Cox process has two distinct approximations: one to the integral in the likelihood and another to the latent Gaussian random field. Here, we show that both of these converge. The proofs and more general statements of all of the results can be found in the Appendix.
The first aspect of the approximation, discussed in Section 4, replaces the intractable integral of the random intensity with a numerical quadrature scheme. The following theorem shows that, for any Gaussian random field , this approximation converges and the Hellinger distance between the true posterior distribution and the posterior distribution constructed from the approximate likelihood is bounded by the error in the integration scheme.
Assume that has, almost surely, square-integrable derivatives, and that the -point integration scheme in (3) has deterministic error of order . Then the Hellinger distance between the posterior distributions generated with the true and approximate likelihoods is .
An interesting aspect of using stochastic partial differential equation models as our finite-dimensional Gaussian random field is that the prior distribution converges as the mesh is refined (Lindgren et al., 2011; Simpson et al., 2012b). This is distinct from predictive processes or fixed-rank kriging approaches, where the finite-dimensional model (2) is taken as the true underlying model. The convergence of this approximation was established by Lindgren et al. (2011). The following theorem refines this result and shows that the approximate posterior distribution converges weakly.
Assume that the observation window is a convex polygon in and let be the maximum edge length in the mesh. Let be the solution of (5) with . If is a uniformly Lipschitz continuous function, then the error in the posterior expectation due to the stochastic partial differential equation approximation is of order for any .
7.1 Log-Gaussian Cox processes with extensions
We consider the application of log-Gaussian Cox processes in three increasingly complicated situations. In the first, a log-Gaussian Cox process with covariates is fitted to a real data set observed everywhere in a rectangular area (Rue et al., 2009; Illian et al., 2012). The second example is a simulation study in the vein of Chakraborty et al. (2011), where the point pattern is incompletely observed due to varying sampling effort across the region of interest. The third case-study is inspired by the problem of mapping the risk associated with freak waves on oceans. We have constructed a point process defined only on the world’s oceans, i.e., over a very irregular, multiply-connected bounded region on a sphere. To the best of our knowledge, no other method can be practically extended to fit a log-Gaussian Cox process in this situation.
The examples are run using the R-INLA package (Rue et al., 2009; Martins et al., 2013) which implements both the stochastic partial differential equation models and the integrated nested Laplace approximation in the statistical computing language R (R Core Team, 2013). Wherever not specified otherwise, we use independent Gaussian prior distributions with mean and variance on and . In all examples, .
7.2 Comparison with a lattice-based approach for rainforest data
This case-study is a standard application of spatial point processes, associating species with soil properties in tropical rainforests. The complete data set consists of the location of all trees with diameter at breast height of 1cm or greater, for a total of tree species within a 50 ha rainforest plot on Barro Colorado Island in Panama that has never been logged. We model the large spatial pattern formed by 4294 trees of species Protium tenuifolium, shown in Fig. 2, relative to the covariate phosphorus, which is given on an interpolated grid (John et al., 2007; Hubbell et al., 1999; Condit, 1998). The plot on Barro Colorado Island is only one plot within a large network of 50 ha plots t established as part of an international effort to understand species survival and coexistence in species-rich ecosystems (Burslem et al., 2001).
Data sets with a similar structure have been analysed both with descriptive (Law et al., 2009) and model-based approaches (Waagepetersen, 2007; Waagepetersen & Guan, 2009; Wiegand et al., 2007). Integrated nested Laplace approximation can be used to fit a log-Gaussian Cox process to similar data, and also to a joint model of the pattern and covariates (Rue et al., 2009; Illian et al., 2012). For illustration, we fit a simple model, where the latent field is , where is a constant mean, is a spatially varying covariate describing the level of phosphorus in the soil and is an approximately intrinsic stochastic partial differential equation model with , which corresponds to a range much larger than the spatial domain. The parameter is assigned a vague Gaussian prior distribution with mean zero and variance .
For comparison, we fit a lattice model with linear predictor where is a vector of ones, the phosphorus concentration, is an intrinsic second-order conditional autoregression (Rue & Held, 2005) and . Both models required around seconds to fit in R-INLA. The posterior means for the spatial random effects are shown in Fig. 3 and are centred at the same location. We believe the difference between the posterior distributions can be accounted for by the different prior distributions for and the different precision parameters. The posterior distribution for the effect of soil phosphorus on the locations of trees are shown in Fig. 2.
7.3 Incorporating variable sampling effort
A major challenge when applying spatial point process models to real data sets is that the point pattern is rarely captured exactly, so sampling effort must be included in the observation process (Chakraborty & Gelfand, 2010; Chakraborty et al., 2011; Niemi & Fernandez, 2009). In this example, we consider the case where there is a sub-area in the data set with no measurements, but where presences are possible. This type of situation occurs, for instance, when considering the spatial distribution of an animal species over an area that contains a region that is impossible to survey (Elith et al., 2006). In a related situation, data sampling effort varies spatially and is higher in areas where the scientists expect a good chance of presence, as in preferential sampling models (Diggle et al., 2010).
Following Chakraborty et al. (2011), we include known sampling effort in our model by writing the intensity as , where is a known function describing the sampling effort at location . In this example, we assume that the point pattern has been observed perfectly, except in a rectangle where the pattern is not observed; see Fig. 4. We therefore define to be zero inside this rectangle and unity everywhere else. It is straightforward to see from (1) that, with this choice of , the unsampled area does not contribute to the integral in the likelihood. We can therefore choose the mesh to be quite coarse in this area, as long as this does not adversely affect the stochastic partial differential equation approximation to the random field. Figure 4 shows a mesh that has been coarsened in a rectangular region corresponding to a hole in the sampling effort. When coarsening the mesh, it is important to remember that we still want small triangles in the vicinity of the observed region, and we want these to gradually change to larger triangles. This ensures that the stochastic partial differential equation approximation is stable. In Fig. 4 this transition can be clearly seen. The changes to the R-INLA code necessary to add sampling effort to basic point process code are minimal. This method can be extended in a straightforward manner to cover more complicated designs, although Chakraborty et al. (2011) suggest it is necessary to assume that the design is known.
In order to test our method on this type of problem, we simulated a log-Gaussian Cox process on and removed the points from the rectangle to simulate the variable sampling. The simulated data set is shown in Fig. 4, and the difference in the posterior mean generated from the full data and the censored data is shown in Fig. 5. There is very little difference between the two posterior means outside the censored area, whereas there are missing features within the censored area. We also compared the results obtained for two different meshes with the same maximum edge length, a regular lattice that covers the entire domain and contains points, and the irregular mesh consisting of points that is coarsened in the censored area, shown in Fig. 4. The posterior marginal distributions for the parameters for these two meshes are extremely similar. In order to avoid boundary effects in the finite-dimensional random field, it is important to have points inside the censored area to ensure that the random field behaves properly. Use of the mesh correctly adapted to the problem resulted in a significant decrease in computational time. With the regular grid, the full inference took seconds on a Linux laptop with a 2.2 GHz i7 4702HQ processor, whereas the computation on the irregular mesh required only seconds, a reduction by coarsening of the total mesh.
7.4 A point process over the ocean
In applications, point processes often occur over complicated domains rather than rectangles, and the topology, topography and geometry of the domain will typically be meaningful when modelling the covariance structure, see the discussion of Wood et al. (2008) in the context of spatial smoothers. For this case-study, we have simulated a log-Gaussian Cox process on the oceans, motivated by a model for assessing the risk of freak waves.
The oceans form a non-convex, multiply-connected bounded region on the sphere and it is, therefore, necessary to construct a Gaussian random field model over this region. The main complication beyond those considered by Lindgren et al. (2011) is that we need a model for the covariance at the boundary. This difficult issue has been discussed very little in the statistics literature. As we are working with simulated data, we can choose a relatively simple, yet realistic, boundary model. We expect that wave heights vary more near the coast than in the deep ocean and, as the designation of a freak wave is relative to the expected wave height, the random field is defined using the Neumann boundary conditions, which approximately doubles the variance near the boundaries of the domain (Lindgren et al., 2011, Theorem 1).
The point process shown in Fig. 6 was constructed by simulating a Gaussian random field associated with the mesh in Fig. 6. The resulting point pattern has 913 points. Inference was performed on this model and the posterior mean is shown in Fig. 7. The posterior mean shows the same large-scale features as the sample used to generate the log-Gaussian Cox process, see Fig. 7, with the expected loss of information due to the uninformative nature of point pattern data.
Effects induced by the boundary conditions can be seen in Fig. 8. The pointwise posterior standard deviation of the latent Gaussian field is shown in Fig. 8. The standard deviation is reasonably constant away from the coasts, but is much higher near the boundaries. There are some interesting effects in the Gulf of Carpentaria in Australia, and in the North Sea. This is an effect of the prior model, which increases the variance near the boundaries and in areas with high curvature of the coastline.
In the context of freak wave modelling, the most important result is displayed in Fig. 8, showing the probability that the log-risk will be greater than . Once again we see pronounced effects near the coastlines. This type of map can easily be computed using the function inla.pmarginal in the R-INLA package. It is also possible to use the excursions package (Bolin & Lindgren, 2015) in R to construct joint exceedance maps.
8 Discussion and future work
The approximation to analyse log-Gaussian Cox processes introduced in this paper is valid also when using kernel methods (Higdon, 1998), predictive processes (Banerjee et al., 2008) or fixed-rank kriging (Cressie & Johannesson, 2008). The problem with using these methods in the given context is that their basis functions are typically non-local and, therefore, the point evaluation matrices in (4) are dense; see Simpson et al. (2012b) for a further discussion of the choice of basis functions in spatial statistics.
In Section 7.4, we consider a point process over a complicated region of the sphere. To the best of our knowledge, there are no other applicable inference methods for this example that include a covariance model at the boundaries. In general, modelling of boundary effects for point processes has not previously been discussed in the literature. We argue that by using Neumann, or no-flux, boundary conditions the variance at the boundaries increases. Similarly, Dirichlet boundary conditions, which correspond to fixing the value of the field at the boundaries, decrease the variance. A future question is to construct good boundary models, and study their effect in a statistical context.
There is work to be done on the theoretical properties of the approximation presented in this paper. Some partial results are given in the Appendix, but they are not the complete story. In particular, it would be interesting to study the effect of both the likelihood approximation and the finite-dimensional approximation of the hyper-parameters of the model. These parameters, which control range, variance and, in more complicated cases, non-stationarity, are often of scientific interest and determining the rate of convergence will help us understand their interpretations.
Moving our considerations to more general finite-dimensional expansion (2), it is also of interest to quantify the link between the basis functions and the statistical properties of the estimator. Although there has been work by Stein (2014), there are a number of open questions. This is a challenging problem as the interest is in non-asymptotic behaviour both in the number of basis functions and in the amount of data. In order to do practical spatial statistics, we need to give something up and often methods will be asymptotically incorrect. However, it may be that in realistic regimes, the resulting statistical error is manageable.
Finally, the approximation in Section 4 applies even when the latent random field is not Gaussian. The only requirement is that it has the basis function expansion (2) and that the statistical properties of are known. In particular, this approximation applies to stochastic partial differential equation models with non-Gaussian noise. This has been investigated for type- Lévy processes, and especially for Laplace random fields (Bolin, 2014). Similarly, replacing Gaussian white noise with Poisson noise would result in shot-noise Cox process models of the Matérn type. It may be possible to avoid the assumptions that the random field is Gaussian in the Appendix. The main use of Gaussianity is in the form of Fernique’s theorem, which is a statement about the tails of a Gaussian random field and it is possible that similar results would hold for non-Gaussian fields after modifying the growth conditions on the likelihood and the functionals.
The authors wish to thank the associate editor and the anonymous reviewers for their useful suggestions and, in particular, for pushing us to work out the convergence results. The authors gratefully acknowledge the financial support of Research Councils UK for Illian. We would also like to thank David Burslem for long discussions on the rainforest data.
The BCI forest dynamics research project was made possible by National Science Foundation grants to Stephen P. Hubbell: DEB-0640386, DEB-0425651,DEB-0346488, DEB-0129874, DEB-00753102, DEB-9909347, DEB-9615226, DEB-9615226, DEB-9405933, DEB-9221033, DEB-9100058, DEB-8906869, DEB-8605042,DEB-8206992, DEB-7922197, support from the Center for Tropical Forest Science, the Smithsonian Tropical Research Institute, the John D. and Catherine T. MacArthur Foundation, the Mellon Foundation, the Small World Institute Fund, and numerous private individuals, and through the hard work of over 100 people from 10 countries over the past two decades. The plot project is part of the Center for Tropical Forest Science, a global network of large-scale demographic tree plots.
.1 Likelihood approximation
Throughout this appendix, we assume that the parameters in the covariance model for are known and fixed. We show that, for a fixed random field , the posterior distribution computed using the likelihood approximation converges strongly to the true posterior distribution, and the rate of convergence can increase as the smoothness of the field increases. We also show that for either the true or approximate likelihood, the posterior distribution generated using the finite-dimensional stochastic partial differential equation model converges weakly to that generated using the limiting Matérn model. Finally, we show that under further assumptions the convergence rate depends both on the basis functions and the smoothness of the underlying random field. The main tools used in this appendix come from the inverse problems literature, surveyed in Stuart (2010), which deals with inference of indirectly observed continuous Gaussian random fields.
In order to show that the approximate posterior distributions converge to the true posterior distribution, it is useful to re-write the problem in terms of measures. Let be the Gaussian measure defined by the Gaussian random field prior measure on . If we define
then the posterior probability measure for conditioned on can be defined through its Radon–Nikodym derivative where is a normalising constant required to ensure that is a probability measure. We can, in a similar fashion, define the approximate posterior measure as
where and is a normalising constant. Cotter et al. (2010) showed that, under conditions and , the Hellinger distance
between the approximate and true posterior distributions converges to zero. Stuart (2010) notes that convergence in Hellinger distance implies convergence in the total variation metric and it can be related to convergence of functionals using the identity
The following theorem shows that their theory applies to our approximate likelihood.
Consider a Gaussian random field defined on a Lipschitz domain and assume that its paths are almost surely in the Sobolev space with . Assume that the integration rule satisfies
where as and . Then, as , . Furthermore, if is an integer, then
This result follows directly from Theorem 2.4 of Cotter et al. (2010) if we can show that the potential is bounded above and below and that the error in the likelihood approximation is integrable. Let , and let be the number of points in the point pattern . Firstly we note that, by assumption, and are almost surely finite. Then, if , straightforward calculation shows that Similarly, when ,
where the last inequality follows from Sobolev’s embedding theorem and is true for every . Similar arguments show that is also bounded above and below independently of .
To show that the error in the likelihood induces a similar error in the posterior distribution, we need to verify that, for sufficiently small , there exists a that does not depend on such that By assumption, this reduces to showing that for large enough . Let be an integer. Now, for any realisation of , there exists an extension such that has compact support and . Using the quotient space structure of a Sobolev space on a domain, it follows that
where the first inequality follows from Theorems 2 and 3 of Bourdaud & Sickel (2010), the second inequality follows from the boundedness of the extension operator and the constant changes from line to line.
The condition that is an integer can probably be relaxed, but it is an open question whether can be bounded for non-integer in the same way as in the integer case. If this was true, it would suggest the use of integration rules of order rather than and would slightly improve the convergence rate.
The techniques used to prove Theorem 3 also allow us to give a more informative convergence result for the traditional counting process approximation to the log-Gaussian Cox process than those considered by Waagepetersen (2004).
Assume . Then the classical lattice approximation to the log-Gaussian Cox process converges in the Hellinger distance at a rate of .
For simplicity, we will assume that the observation window is square and the lattice is equally spaced in both directions. The lattice approximation is of the form (7) with
where is the lattice cell and is the centroid of . The first term in (10) is the midpoint rule approximation to , which, due to the regularity of the lattice satisfies (9) with , (Theorem 8.5, Ern & Guermond, 2004). The error in the likelihood arising from the approximation of by for any can be bounded using Taylor’s theorem as where the second inequality is a consequence of Sobolev’s embedding theorem (Brenner & Scott, 2007, Corollary 1.4.7). It follows using the arguments in the proof of Theorem 3 that for a two-dimensional lattice, , and the result follows from Theorem 2.4 of Cotter et al. (2010).
Examining the proof of Corollary A1, it can be seen that the rate of convergence is determined by the binning procedure and using the lattice quadrature rule and the approximate likelihood proposed in this paper the rate of convergence would be for smooth enough fields.
.2 Random field approximation
While Appendix .1 shows that for fixed the likelihood approximation introduced in this paper converges, this is not enough to show that the posterior distributions computed in Section 7 converge as we are simultaneously approximating the log-Gaussian Cox process likelihood and the Gaussian random field using the approximation outlined in Section 5. In this appendix, we close this gap when the hyperparameters are fixed and show the convergence of a general class of finite-dimensional approximations to problems in which the indirectly observed unknown random function is equipped a priori with a Gaussian random field.
There are a number of technical challenges to showing convergence of this approximation. The first is that we need to compare a measure on an infinite-dimensional space with a sequence of measures on different finite-dimensional spaces. We will, therefore, no longer be able to consider convergence in the Hellinger metric, but rather we will consider a weaker mode of convergence of an approximating measure to , that is the convergence of functionals of the form for Lipschitz continuous functions that satisfy a growth condition to ensure the functionals are finite. This is slightly stronger than convergence in distribution, for which bounded Lipschitz functions suffice (Bogachev, 2007, Section 8.3). When the finite-dimensional approximation to the Gaussian random field is computed by truncating its Karhunen–Loève expansion, Dashti & Stuart (2011) showed convergence. Their techniques, which relied heavily on the idea that truncation of the Karhunen–Loève expansion is an projection, are not directly applicable to the approximation outlined in Section 5.
In Section .3, we extend Theorem 2.6 of Dashti & Stuart (2011) to a general class of finite-dimensional approximations. In particular we show that if the approximation to is stable, in the sense that uniformly in , then the convergence of the functionals is governed by the deterministic error in the pathwise approximation. In Section .4 we show that for approximations of the general form of the stochastic partial differential equation approximation, this error is controlled by the ability of the finite-dimensional basis functions to approximate realisations of the true prior model. These results mirror previous quantitative results (Simpson et al., 2012a, b; Bolin & Lindgren, 2013), in which the stable, convergent approximation properties of piecewise linear functions were used to argue for the adoption of stochastic partial differential equation models.
.3 A general result on the convergence of finite-dimensional approximations
Let be Banach spaces and assume that . Assume that the Gaussian random field has paths almost surely in and define the approximate random field , where is a deterministic linear operator, and is an –dimensional vector space that is not necessarily a subspace of . In the special case that and is a projector, the arguments of Dashti & Stuart (2011) can be used to show convergence.
Extending the notation from Appendix .1, we define to be the law of and consider the infinite-dimensional posterior distribution defined by Similarly, we define the law of to be and define the approximate posterior distributions as where is a normalising constant. We make the following assumptions on the potential (Dashti & Stuart, 2011).
Consider the potential function . Assume that for every and , and there exists a , which may change from line to line, such that, for all , Also, for every , where , and for every ,
The following theorem says that for nice functionals, the error in the approximation depends on how well the approximate random field approximates the true random field in a pathwise sense as well as on the quality of the likelihood approximation. While the argument holds mutatis mutandis for Banach space-valued functionals (see Dashti & Stuart, 2011), for simplicity we restrict ourselves to real-valued functionals.
Assume that Assumption .3 holds for , and uniformly in and . Let be a Lipschitz continuous function such that, for every , there exists a such that, for every and , If the restriction operator satisfies the stability estimate
for all then
Using the notation of Appendix .1, it follows that
The normalising constants and are bounded both above and below uniformly in (Theorems 4.1 and 4.2, Stuart, 2010).
Let be the law of the coupling . Then, for any ,
where the second inequality follows from standard bounds on the exponential function, the assumptions on and , and the stability assumption (11). The third inequality follows from the observation that almost surely, almost surely and the embedding , and the final inequality follows from Fernique’s theorem, which ensures that the expectation is finite (Stuart, 2010).
To bound , we first note that uniformly in by assumption and Fernique’s theorem. Then it is enough to note that
using the reasoning above.
.4 Convergence of the stochastic partial differential equation approximation
In order to apply Theorem 4 to stochastic differential equation models, it is useful to consider the abstract version of the approximation outlined in Section 5. Let be an operator and define the random field through the equation Then is a Gaussian random field over the sample space with covariance operator , where the star denotes the adjoint operator. If is the Galerkin approximation to over defined by for all , then the corresponding approximate Gaussian random field has covariance operator given by , where is the pseudoinverse of satisfies . With this setup in mind, the restriction operator is defined by the equation , from which it can be seen that is a natural choice. If converges in distribution to , which is the case for the models in Section 5 (Lindgren et al., 2011), we can use Skorohod’s representation theorem to construct, possibly on a different probability space, the coupling , defined by almost surely, that is required in Theorem A4. Hence
and the rate of convergence is governed by how well solutions to the partial differential equation can be approximated by solutions to .
The following Theorem shows that, for fixed parameters, the approximate posterior distributions computed using the stochastic partial differential equation approach introduced by Lindgren et al. (2011) converge.
Let be a convex polygon. Let be a Lipschitz continuous function that satisfies the assumptions of Theorem A4. Assume that and the family of triangulations is quasi-uniform (Definition 4.4.13 Brenner & Scott, 2007). Then, if the approximate posterior measure is defined using the approximation and the integration rule outlined in Section 5, then, for any , where is the length of the largest edge in the mesh.
The use of Theorem A4 is complicated by the lack of Sobolev regularity of the Gaussian random field. In particular, the field considered in Section 5 is almost surely in for all (Lemma 6.2.7 Stuart, 2010). We then take and define the differential operator as . We define the approximation space to be the space of piecewise linear functions defined over the triangulation and let be the maximum edge length. Under the assumptions on , (Ern & Guermond, 2004, 3.12), where a Sobolev space with negative index is defined as the dual of the space with the corresponding positive index. This is consistent with the fact that white noise can be considered a random function in (Walsh, 1986). In order to define , we need the –orthogonal projector and we define the Galerkin approximation as .
Fix and and let be the distributional solution to . We emphasise that is not a function in an ordinary sense, but a distribution, and in the remainder of this proof integrals containing () should be interpreted as where the angle brackets denote the duality pairing.
As the standard convergence theory for finite element methods (Brenner & Scott, 2007; Ern & Guermond, 2004), would require the sample paths to be almost surely in , we modify the arguments used to prove Proposition 1 in Scott (1976). The crucial step in Scott’s method is to approximate by a piecewise linear function defined over such that is controlled by a negative power of . We define as for all , where the second integral is understood in the sense of distributions and makes sense because (Belgacem & Brenner, 2001). For an arbitrary , let be the orthogonal projection of onto . Then
where the final inequality follows from equations (1.5) and (1.6) of Belgacem & Brenner (2001). As was arbitrary, this gives an appropriate bound for .
Define to be the solution of for all and consider the finite element approximation defined as for all The dependence of on is suppressed for readability. The key observation is that can be considered a finite element approximation to both and as for every . It follows from standard finite element theory (Brenner & Scott, 2007, Theorem 5.7.6) that
The final ingredient of the proof is to bound . Fix and let be the solution to , where is the adjoint of . Then it follows that, for any ,