On The Inverse Geostatistical Problem of Inference on Missing Locations
The standard geostatistical problem is to predict the values of a spatially continuous phenomenon,
say, at locations using data where is the
realisation at location of , or of a random variable that is stochastically
related to .
In this paper we address the inverse problem of predicting the locations of
observed measurements . We discuss how knowledge of the sampling mechanism can and should
inform a prior specification, say, for the joint distribution of the
measurement locations ,
and propose an efficient Metropolis-Hastings algorithm for drawing samples from the
resulting predictive distribution of the missing elements of .
An important feature in many applied settings is that this predictive
distribution is multi-modal, which severely limits the usefulness of simple summary measures
such as the mean or median. We present two simulated examples to
demonstrate the importance of the specification for , and
analyse rainfall data from Paraná State, Brazil
to show how, under additional assumptions, an
empirical of estimate of can be used when no prior information
on the sampling design is available.
Keywords: Geostatistics; kernel density estimation; missing locations; multi-modal distributions.
Geostatistics was originally developed as a self-contained methodology for spatial prediction (e.g. matheron1963) but is now embedded as a sub-branch of spatial statistics with applications in many different disciplines. The canonical geostatistical problem is to predict the value of a spatially continuous process, say, at any required location in a region of interest , using data consisting of a set of measured values at each of locations in . A widely used geostatistical model is that the are realisations of random variables , where are mutually independent, zero-mean Gaussian variables, and is a Gaussian process (diggle1998). Predictive inference for is then based on the predictive distribution , where means “the distribution of” and . Conventionally, the set of measurement locations is regarded either as fixed or, if stochastic, as non-informative in the sense that and are stochastically independent; diggle2010 refer to this assumption as non-preferential sampling.
In this paper we address the inverse problem of predictive inference for missing locations associated with a subset of the measured variables . Our interest in this problem was motivated by wasser2004. They proposed a geostatistical model for data on 399 DNA samples of elephant tusks collected from 28 distinct known locations acrossAfrica, and showed how their model could be used to infer the geographic location of a DNA sample of unknown origin, in order to identify areas of intense poaching. For their Bayesian analysis, they specified a uniform prior for sample locations over either savannah or forest regions of Africa, according to each sample’s known origin and used the corresponding posterior median of each spatial coordinate as a point prediction of the unknown location. They evaluated predictive performance empirically using leave-one-out cross-validation, and found that their geostatistical approach outperformed other methods that do not use geographic information.
Our objectives in this paper are the following: to extend the approach of wasser2004 to include multiple missing locations; to show how prior information on the sampling design can and should inform the analysis; and to highlight the unsatisfactory nature of simple statistics such as the mean or the median as summaries of the predictive distribution of missing locations.
The paper is structured as follows. In Section 2 we propose a modelling framework for predictive inference on multiple missing locations, incorporating prior knowledge about the sampling design. In Section 3 we show how numerical quadrature can be used for predictive inference on a single unknown location, and propose an MCMC algorithm for sampling from the joint predictive distributions of multiple unknown locations. In both cases we assume that parameter values are known, which in practice corresponds to ignoring the uncertainty parameter estimates. In Section 4, we describe two simulated examples to illustrate the limitations of simple summary statistics of the predictive distribution for the missing locations. We also report an analysis of rainfall data from Parańa State, Brazil, where we use a non-parametric density estimate as an empirical prior for . Section 5 is a concluding discussion.
2 Model formulation
2.1 Measurement data
We adopt a standard geostatistical model,
where is a zero-mean stationary Gaussian process with variance and correlation function indexed by the parameter , and the are mutually independent variates. Equivalently, the are conditionally independent given , with . Write for a set of unknown locations at which measurements have been made, and ; the observed quantities are and . For any set of points write ; hence, . Assume that and are stochastically independent. The joint distribution of , and is then
Our assumption that the sampling design is non-preferential allows a straightforward marginalisation of (2) to give
where is a multivariate Gaussian distribution with mean vector and covariance matrix with diagonal elements and off-diagonal elements .
2.2 Sampling design
Depending on the problem under investigation, the set of sampling locations might be the result of a natural process, for example the locations of nests in a colony of birds. Alternatively, they might be obtained by using a random or regular lattice designs, as it is often the case for household surveys or agricultural field trials, respectively. Knowledge of the underlying process generating the sampling locations should then be incorporated into the specification of the distribution . We now briefly outline some approaches to this specification, and propose a non-parametric approach that can be used when information on the underlying sampling process is limited.
One approach would be to model as an inhomogeneous Poisson process over the region of interest , with intensity
where is a -dimensional vector of spatial covariates, such as population density in the case of a randomised household survey, and is the associated vector of regression coefficients.
When no information on the sampling design is available a non-informative uniform distribution could be used, hence for all . An alternative approach is to estimate the intensity from the data, using a kernel method. Let and denote the coordinates of the horizontal and vertical axes for a given point . Then, the kernel density estimate of , i.e. the marginal density function of any component of , based on the observed locations is given by
where is a bivariate kernel with symmetric and positive definite 2 by 2 smoothing matrix . If we choose a Gaussian kernel, then is the variance matrix of a bivariate Gaussian density and
The elements of can be estimated by optimising an estimate of the mean-square-error (breman1989). Alternatively, if we assume that is an independent random sample from a bivariate Gaussian distribution, the optimal in the sense of minimisng the integrated mean-square-error is (lucy2002), where is the sample covariance matrix.
Finally, a common practice in geostatistical investigations is to choose locations that are more regularly distributed over than would be a realisaton of a homogeneous Poisson process. In this case, a more appropriate prior for would be an inhibitory point process (diggle2013, pages 110-111); we give an example in Section 4.2.
3 Computational details
3.1 One unknown location
If consists of a single unknown location, numerical quadrature can be used for efficient computation of the predictive distribution . Let be a grid of spatial points in the region of interest and let denote the density function of . It follows from (3) that
By treating the Gaussian process as constant within each grid cell, we approximate the above density function by
where and the elements of are the grid points closest to and to the corresponding elements of , respectively. Summaries of the predictive distribution, such as mean, mode and component-wise median, can then be approximately computed through . Additionally, high density regions of coverage can also obtained as
3.2 Multiple unknown locations
When there is more than one unknown location, the numerical solution is no longer feasible. Instead, we have developed an MCMC algorithm that takes account of the presence of the multiple modes that typically characterize the density function of , each mode corresponding to a location where the absolute difference between a value of associated with an unknown location and an observed value of is small. At each iteration of the MCMC, and with a pre-specified probability, the algorithm proposes a draw from a mixture distribution with a mode centred on each observed location.
Let and denote the number of unknown and known locations, respectively; we propose the following Metropolis-Hastings algorithm to simulate from the target density given by (6).
Propose a new value as follows. For :
with probability perform a random walk by proposing a value from a bivariate Gaussian distribution with mean the -th element of and covariance matrix , where and is a 2 by 2 identity matrix;
with probability sample a data point uniformly from the set of observed locations and propose a value from a bivariate Gaussian distribution with mean and covariance matrix , where .
Accept with probability
In 8, is the density function of a standard Gaussian distribution and is the Euclidean distance.
Repeat 2 and 3 for the desired number of iterations.
In this algorithm, the standard deviations , and should be tuned manually via pilot runs.
4.1 Simulated data from a homogeneous Poisson process
Our first example highlights the difficulty of summarizing the multi-modal distribution by a single measure of location. We simulated two data sets of size , using an isotropic exponential correlation function for the Guassian process . The parameter values were set as , , , and or . From each of the two simulated data-sets we treated one of the 201 locations, chosen at random, as unknown. Locations were generated uniformly in the unit square, hence for lies in the unit square and 0 otherwise.
Figures 1(a)-(b) and 1(c)-(d) show the results for and , respectively. The multi-modality of the distribution is indicated by numerous black patches of high density, that become more spread when . In Figure 1(d), the mean lies outside the highest density region, whilst the mode, although not having that unpleasant feature, has only slightly larger predictive density than other local modes.
4.2 Simulated data from an inhibitory point process
We now consider the case when is a simple sequential inhibition process on the unit square. Denote by the minimum permissible distance between any two locations. A sample from is obtained by a sequential sampling of points from a by regular lattice, where each new location given is generated uniformly on the intersection of the unit square with .
We simulated two datasets of observations with model parameter values set as in the previous example but fixing and letting take values and ; again, only one location is treated as unknown. The resulting density of the predictive distribution is shown in Figure 2. As expected, the extent of the predictive distribution is reduced by imposition of the minimum permissible distance . In particular, for in Figure 2(b), only a few small, disjoint regions remain as admissible for the missing location; note also that the mean and the median, in this specific example, lie outside the support of the predictive distribution.
4.3 Rainfall data from Paraná State in Brazil
We now consider a data set, previously analysed by diggle2002, on average rainfall over different years for the period May-June (dry-season) recorded in 143 recording stations throughout Paraná State, Brazil. Data locations are reported in Figure 4, three of which, denoted by triangles, were randomly selected and treated as unknown.
Using the 140 observations with known locations, we first fitted the model , where is an isotropic Gaussian process with variance and matern1986 correlation function
where is distance between two locations, denotes the modified Bessel function of the second kind of order and is a scale parameter. The maximum likelihood estimates are: , , , and . As also indicated by the resulting theoretical semi-variogram in Figure 4(b), the data show evidence of long-range spatial correlation: the most distant pair of observations at 619.492 km from each other have an estimated correlation of about 0.25.
In order to make inference on the three unknown locations, we model in two different ways: 1) is uniform over the square ; 2) is estimated from the data using kernel density estimation as described in Section 2. The resulting non-parametric estimate of is shown in Figure 4(a); note that the boundaries of Paraná State play no part in the analysis but are displayed only to add context.
In both scenarios, using the MCMC algorithm described in Section 3, we obtained 10000 samples from the predictive distribution , iterating 110000 times and retaining every 10th sample after a burn-in of 10000 samples; we set , and . The autocorrelogram plots for the vertical and horizontal coordinates in Figure 4 suggest rapid convergence of the MCMC algorithm. The resulting predictive distributions for one of the three missing location, specifically , are shown in Figure 5 with corresponding high density regions. Panels (b) and (d) show that the high density regions obtained using a uniform distribution have a much wider extent than in the second case and are little informative on the possible positioning of . However, when using the kernel density estimate , the true location is only contained in the high density region; indeed, as indicated in Figure 4(a), this point lies in a region where is relatively small.
In this paper we have proposed a methodology that allows to make inference on unknown locations of geostatistical data. We have developed efficient quadrature and MCMC algorithms for sampling from the predictive distribution for single and multiple unknown locations , respectively, and applied these to two simulated data-sets and to rainfall data from Paraná state, Brazil, with three unknown locations. In other examples not shown, we found that the MCMC algorithm continued to mix well when there are more unknown locations.
The conjunction of a sufficiently large data-set and many unknown locations would increase the computational burden of the MCMC algorithm. A computationally more efficient, but approximate procedure would be to use numerical quadrature in turn for each of the missing locations.
As shown in Section 4.1, the distribution is often characterized by multiple modes and disjoint high density regions, in which case commonly used indices, such as the mean and the median, are misleading summaries.
We have shown that the use of a homogeneous Poisson process prior may result in a very diffuse predictive distribution with widespread regions of high density. The use of a kernel density estimate for , based on the set of observed locations is useful when the empirical distribution of is spatially heterogeneous and the conditional distribution of the unknown locations is expected to follow the same pattern. Conversely, an inhibitory process for is more appropriate when the context suggests that the complete set of locations is likely to show some degree of spatial regularity.
We have not considered incorporation of observations with unknown locations into a likelihood for parameter estimation. However, this can be pursued by noticing that, starting from (3), the likelihood is given by
where the first factor is the standard likelihood function obtained from data at known locations. The second factor is an intractable integral of dimension twice the number of unknown locations, say. As discussed in Section 3, either numerical quadrature or MCMC can be used to evaluate the integral according to the value of . More efficient Monte Carlo techniques based on importance sampling or the EM algorithm could also be considered. For Bayesian inference, we would need additionally to specify a prior for the model parameters. It would then be interesting to determine under what circumstances the incorporation of observations with missing locations into the model-fitting process leads to materially better parameter estimates. We conjecture that the additional information will be small unless the prior for is highly informative.
Finally, we have assumed throughout that , and hence , is stochastically independent of . Extensions of the modelling framework that allow for stochastic dependence between and the latent process would add to the predictability of unknown locations but would also complicate the fitting of the model. One of a number of possibilities would be to model as a Cox Process with intensity stochastically dependent on , as in diggle2010.