Augmenting correlation structures in spatial data using deep generative models

Konstantin Klemmer*
The Alan Turing Institute & University of Warwick
kklemmer@turing.ac.uk

Adriano Koshiyama*
The Alan Turing Institute & University College London
akoshiyama@turing.ac.uk

Sebastian Flennerhag
The Alan Turing Institute & University of Manchester
sflennerhag@turing.ac.uk

* Equal contribution

Abstract

State-of-the-art deep learning methods have shown a remarkable capacity to model complex data domains, but struggle with geospatial data. In this paper, we introduce SpaceGAN, a novel generative model for geospatial domains that learns neighbourhood structures through spatial conditioning. We propose to enhance spatial representation beyond mere spatial coordinates by conditioning each data point on feature vectors of its spatial neighbours, thus allowing for a more flexible representation of the spatial structure. To overcome issues of training convergence, we employ a metric capturing the loss in local spatial autocorrelation between real and generated data as a stopping criterion for SpaceGAN parametrization. This way, we ensure that the generator produces synthetic samples faithful to the spatial patterns observed in the input. SpaceGAN is successfully applied for data augmentation and outperforms other methods of synthetic spatial data generation. Finally, we propose an ensemble learning framework for the geospatial domain, taking augmented SpaceGAN samples as training data for a set of ensemble learners. We empirically show the superiority of this approach over conventional ensemble learning approaches and rival spatial data augmentation methods, using synthetic and real-world prediction tasks. Our findings suggest that SpaceGAN can be used as a tool for (1) artificially inflating sparse geospatial data and (2) improving generalization of geospatial models.

 


Preprint. Under review.

1 Introduction

The empirical analysis of geospatial patterns has a long tradition, with applications ranging from estimating rainfall patterns [4] to predicting housing prices [5]. Recently, machine learning methods have become increasingly popular for these tasks. Traditional techniques to model spatial dependencies include clustering [22] and kernel methods like Gaussian Processes (GPs) [8]. Recent years have seen efforts to scale GP models to high-dimensional data [15] and the emergence of convolutional neural networks (CNNs) for learning spatial representations [41]. But while deep learning methods like CNNs improve upon GP models by supporting non-Euclidean, graph-structured data [19], they appear to struggle with long-range spatial dependencies [31]. A recent review paper by Reichenstein et al. [39] highlights further problems of deep learning applications with spatial data and sets a research agenda that aims to improve the representation of spatial structures, particularly in deep learning methods.

Furthering this agenda, we explore how generative adversarial nets (GANs) [17] can capture spatially dependent data and how we can leverage them to learn observed spatial patterns. As they perform well on visual data, GANs have been used in the geospatial context for generating satellite imagery [30]. However, geospatial point patterns, i.e. data points distributed across continuous or discrete $2$-dimensional space with one or more feature dimensions, remain unexplored in that regard. While previous studies have examined GAN performance in the presence of one-dimensional autocorrelation, such as temporal point processes [46] or financial time-series [26], the multi-dimensional correlation structures in geospatial point patterns pose a more complex challenge. We tackle this issue by introducing SpaceGAN: borrowing well established techniques from geographic information science, we use spatial neighbourhoods as context to train a conditional GAN (cGAN) and optimize cGAN selection for the best representation of the input's local spatial autocorrelation structures. GANs are difficult to train, often failing to converge to a stable solution. Our novel stopping criterion explicitly measures the quality of the representation of observed spatial patterns. Furthermore, this approach enables us to work with data distributed in discrete and continuous space. Representations learned by SpaceGAN can be used for downstream tasks, even on out-of-sample geospatial locations. We show how this can be used for prediction via an ensemble learning framework. We test our approach on synthetic and real-world geospatial prediction tasks and evaluate the results using spatial cross-validation.

The main contributions of this study are as follows: First, we introduce a novel cGAN approach for geospatial data domains, focusing on capturing spatial dependencies. Second, we introduce a novel ensemble learning method tailored to spatial prediction tasks by utilizing SpaceGAN samples as training data for a set of base learners. Across different experimental settings, we show that SpaceGAN-generated samples can substantially improve the performance of predictive models. As such, the results also have practical implications: our proposed framework can be used to inflate low-dimensional spatial data. This allows for enhanced model training and reduced bias by compensating for a lack of training data. We thus improve generalization performance, even when compared to existing methods for data augmentation.

The remainder of this paper is structured as follows: Section 2 introduces the SpaceGAN framework and elaborates on the technical details with respect to the cGAN architecture and spatial autocorrelation representation. In Section 3, we evaluate SpaceGAN empirically using synthetic and real-world data, comparing it to existing methods for spatial data augmentation and ensemble learning. Section 4 reviews existing literature related to our study.

2 SpaceGAN

2.1 Spatial Correlation Structures

Figure 1: Examples for spatial weight matrices in the discrete and continuous case.

The so-called "First Law of Geography", made famous by Waldo Tobler, states that "everything is related to everything else, but near things are more related than distant things" [44]. Following this premise, when working with geospatial data, inherent local inter-dependencies represent an additional information layer that can be exploited. A brief example illustrates this concept: in a typical city, when we want to estimate the price of a house, we might want to check house prices at nearby locations. If, for instance, the house is located in a rich, spatially contained neighbourhood, just knowing the price of a nearby property, without any further knowledge about the features of the house (e.g. size, age), can provide us with an informed guess.

Let us formalize this intuition by first defining the $i$-th data point as a tuple $d_i = (\mathbf{x}_i, y_i, \mathbf{c}_i)$, where $\mathbf{x}_i$ describes a set of features, $y_i$ describes the target (stacked over all points into the target vector $y$) and $\mathbf{c}_i$ describes the point coordinates in space. While a supervised learning setting with target $y$ is not needed, we introduce it here for simplicity since we apply this during the experiments in Section 3. The features $\mathbf{x}$ can be distributed across space randomly, or follow a global or local spatial process. This can be examined by measuring the correlation of a feature with its local neighbourhood, the so-called local spatial autocorrelation, which is given by the Moran's I metric [35]. While originally theorized for phenomena distributed in $2$-dimensional space, the concept was widely popularized in geostatistics by Luc Anselin [1]. His formalization gives a local autocorrelation coefficient for a vector distributed across space. While this can be applied to any vector in the feature set of $d$, we will explain the concept using the target vector $y$ here. We assume $y$ to follow some spatial process. As such, $y$ consists of $n$ real-valued observations $y_i$ referenced by an index set $N = \{1, \dots, n\}$ indicating the spatial unit $i$ corresponding to the coordinate $\mathbf{c}_i$. Let the neighbourhood of the spatial unit $i$ be $\mathcal{N}_i \subseteq N \setminus \{i\}$.
In accordance with our conceptualization above, we can then compute its local spatial autocorrelation $I_i$ as:

$$ I_i = (n-1) \, \frac{y_i - \bar{y}}{\sum_{j=1, j \neq i}^{n} (y_j - \bar{y})^2} \sum_{j=1, j \neq i}^{n} w_{i,j} \, (y_j - \bar{y}) \tag{1} $$

where $\bar{y}$ represents the mean of the $y_i$'s and $w_{i,j}$ are components of a weight matrix $W$ indicating membership of the local neighbourhood set between the observations $i$ and $j$. For $y$ distributed in continuous space, the weight matrix can, for example, correspond to a $k$-nearest-neighbourhood with $w_{i,j} = 1$ if $j \in \mathcal{N}_i$ and $w_{i,j} = 0$ otherwise. For $y$ distributed in discrete space (e.g. non-overlapping, bordering polygons), the weight matrix could for example correspond to a queen neighbourhood (see Figure 1). The Moran's I metric hence takes in a vector distributed in space and its corresponding neighbourhood structure to calculate how strongly (positively or negatively) the vector is autocorrelated with its spatial neighbourhood at any given location. This makes the selection of the weight matrix $W$, i.e. the definition of "neighbourhood", an important design choice which we have to account for when trying to augment spatial data while imitating the spatial autocorrelation structures of the input. For this augmentation process, we turn towards a popular family of generative models: GANs.
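The local Moran's I computation described above can be illustrated with a small NumPy sketch. It assumes a binary $k$-nearest-neighbour weight matrix and the leave-one-out variance scaling commonly used for Anselin's local statistic; the function names `knn_weights` and `local_morans_i` are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def knn_weights(coords, k):
    """Binary kNN weight matrix: w[i, j] = 1 if j is among the k points nearest to i."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    # Pairwise Euclidean distances; a point is never its own neighbour.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    nearest = np.argsort(dist, axis=1)[:, :k]
    w = np.zeros((n, n))
    w[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return w

def local_morans_i(y, w):
    """Local Moran's I at every location for values y and weights w (zero diagonal)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    dev = y - y.mean()
    # Scaling term: sum of squared deviations of all other points, divided by (n - 1).
    m2 = (np.sum(dev ** 2) - dev ** 2) / (n - 1)
    # w @ dev is the weighted sum of neighbour deviations for each location.
    return dev / m2 * (w @ dev)

# Usage: a smooth gradient along a line is positively autocorrelated everywhere.
coords = np.arange(10, dtype=float)[:, None]
w = knn_weights(coords, k=2)
smooth = local_morans_i(np.arange(10.0), w)
```

A smooth pattern yields positive values at every location, whereas an alternating pattern yields negative values wherever both neighbours take the opposite value.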

2.2 Spatially-conditioned GANs

GANs are a class of models employing two neural networks: a Generator ($G$) and a Discriminator ($D$). The Generator is responsible for producing synthetic samples that attempt to replicate a given data generation process. It is defined as a neural network $G(z; \theta_G)$ with parameters $\theta_G$, mapping noise $z \sim p_z(z)$ to some feature space ($G: z \to x$). The Discriminator, a neural network $D(x; \theta_D)$, aims to probabilistically distinguish the synthetic input created by the Generator from real data ($D: x \to [0, 1]$). Both networks compete in a minimax game, improving their performance until the real and synthetic data are indistinguishable from one another. But while GANs have been successfully applied in many areas, training them is highly non-trivial [40, 18] and remains an area of intense study [3, 18, 45, 32]. This is further complicated by the non-i.i.d. nature of geospatial data, in which learning an unconditional model would ignore inherent local dependencies. To overcome this, a sampling process that takes spatial structure into account is needed, thus preserving statistical properties such as local spatial autocorrelation.

Conditional GANs (cGANs) [34] are therefore a better fit for context-dependent data generation, such as geospatial data. In cGANs, the inputs to both the generator and the discriminator are augmented by a context vector $v$. Typically, $v$ represents a class label that we want the cGAN to generate an input for, but it can be any form of contextualization. Formally, we can define a cGAN by including the conditional variable $v$ in the original formulation so that $G: (z, v) \to x$ and $D: (x, v) \to [0, 1]$. The minimax game between $G$ and $D$ is then given as:

$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x \mid v) \right] + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D(G(z \mid v)) \right) \right] \tag{2} $$

cGANs have previously been used for spatial conditioning of image data, using pixel coordinates. In our formulation, this would translate to setting $v_i = \mathbf{c}_i$ [29, 20]. However, this approach is not sufficient for our problem, since conditioning on the point coordinate alone would omit valuable information about the local neighbourhood of each point. Instead, for each point $d_i$ we are interested in capturing how its features relate to those of neighbouring points $d_j$, $j \in \mathcal{N}_i$. As such, we define the SpaceGAN context vector of point $d_i$ as $v_i = \{d_j\}_{j \in \mathcal{N}_i}$.

Similarly to our intuition of spatial autocorrelation, outlined above, we assume that the features of nearby data points may offer valuable information on the point-of-interest. By conditioning each data point on all neighbouring points we allow for the learning of local patterns across the feature space. Beyond this, the versatility of constructing spatial weights enables experimentation with and optimization of different spatial neighbourhood definitions. This offers a flexibility that is not provided by point coordinate conditioning.
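To make the neighbourhood conditioning concrete, the following sketch builds one context vector per point by stacking the feature vectors of its $k$ nearest neighbours. It is a minimal illustration under assumptions: the function name `neighbour_context`, the kNN neighbourhood and the closest-first flattening order are choices made here, not details confirmed by the paper.

```python
import numpy as np

def neighbour_context(features, coords, k):
    """Return an (n, k * p) array holding, for each point, its k neighbours' features."""
    features = np.asarray(features, dtype=float)
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)               # exclude the point itself
    nearest = np.argsort(dist, axis=1)[:, :k]    # (n, k) neighbour indices, closest first
    # Fancy indexing gives (n, k, p); flatten to one context vector per point.
    return features[nearest].reshape(len(features), -1)

# Usage: four points on a line, one feature each, two neighbours per point.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
features = np.array([[10.0], [20.0], [30.0], [40.0]])
context = neighbour_context(features, coords, k=2)
```

Swapping the kNN rule for another neighbourhood definition (for example a queen neighbourhood on a grid) only changes how `nearest` is computed, which reflects the flexibility of weight-based conditioning discussed above.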

2.3 Training and Selecting Generators for Spatial Data

One problem concerning GANs is that they often fail to converge to a stable solution. To overcome this, we seek to tie training convergence to some measure of quality of the synthesized data. Accordingly, we propose to evaluate the generator performance by the faithfulness of its produced spatial patterns in relation to the true patterns observed in the input. For this, we introduce a new metric, the Mean Moran's I Error (MIE). It is defined as the mean absolute difference between the local spatial autocorrelation of the input $y$ and that of the generated samples $\hat{y}$:

$$ MIE = \frac{1}{n} \sum_{i=1}^{n} \left| I(y_i) - I(\hat{y}_i) \right| \tag{3} $$

We apply this metric for model selection by choosing the model that minimizes $MIE$, i.e. the loss of local spatial autocorrelation between real $y$ and generated $\hat{y}$. In our supervised learning setting, we are particularly interested in a faithful representation of the target vector and hence use $y$ to calculate $MIE$. Of course, $MIE$ can also be calculated using any other feature vector from $\mathbf{x}$. A multivariate formulation of the local statistic is given by Anselin [2]; alternatively, the metric can be averaged over multiple features. To train SpaceGAN, we proceed as when training a normal cGAN, but include the stopping criterion. Algorithm 1 details our training procedure.
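The MIE can be sketched in a few lines of NumPy. This is an illustrative reading of the definition as a mean absolute difference of local Moran's I values; the helper names and the chain-shaped weight matrix in the usage lines are assumptions made for the example.

```python
import numpy as np

def local_morans_i(y, w):
    """Local Moran's I for values y under a weight matrix w with zero diagonal."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    m2 = (np.sum(dev ** 2) - dev ** 2) / (len(y) - 1)
    return dev / m2 * (w @ dev)

def moran_i_error(y_real, y_generated, w):
    """Mean absolute difference between real and generated local Moran's I."""
    diff = local_morans_i(y_real, w) - local_morans_i(y_generated, w)
    return float(np.mean(np.abs(diff)))

# Usage on a chain of 10 locations where each point neighbours the adjacent ones.
n = 10
w = np.zeros((n, n))
idx = np.arange(n - 1)
w[idx, idx + 1] = 1.0
w[idx + 1, idx] = 1.0
y_real = np.arange(n, dtype=float)
y_scrambled = np.array([3.0, 7.0, 0.0, 9.0, 1.0, 8.0, 2.0, 6.0, 4.0, 5.0])
error = moran_i_error(y_real, y_scrambled, w)
```

Because local Moran's I is invariant to shifting and rescaling the values, the metric compares spatial patterns rather than magnitudes, which is one reason to track a prediction error such as the RMSE alongside it.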

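1: Require: $snap$ (snapshot frequency), $C$ (number of samples), $L$ (minibatch size): hyper-parameters
2: for number of training steps ($tsteps$) do
3:     Sample minibatch of $L$ noise samples $\{z^{(1)}, \dots, z^{(L)}\}$ from noise prior $p_z(z)$
4:     Sample minibatch of $L$ examples $\{(d^{(1)}, v^{(1)}), \dots, (d^{(L)}, v^{(L)})\}$ from $p_{data}(d \mid v)$
5:     Update the discriminator by ascending its stochastic gradient: $\nabla_{\theta_D} \frac{1}{L} \sum_{l=1}^{L} \left[ \log D(d^{(l)} \mid v^{(l)}) + \log \left( 1 - D(G(z^{(l)} \mid v^{(l)})) \right) \right]$
6:     Sample minibatch of $L$ noise samples $\{z^{(1)}, \dots, z^{(L)}\}$ from noise prior $p_z(z)$
7:     Update the generator by ascending its stochastic gradient: $\nabla_{\theta_G} \frac{1}{L} \sum_{l=1}^{L} \log D(G(z^{(l)} \mid v^{(l)}))$
8:     if $tsteps \bmod snap = 0$ then
9:         $k \leftarrow tsteps / snap$; store current $G$, $D$ as $G_k$, $D_k$; initiate $MIE_k \leftarrow 0$
10:         for $c = 1, \dots, C$ do (draw $C$ samples from $G_k$)
11:              for $i = 1, \dots, n$ do (generate spatial data)
12:                  sample noise vector $z_i \sim p_z(z)$
13:                  draw $\hat{y}_i^{(c)} = G_k(z_i \mid v_i)$
14:              Measure SpaceGAN samples spatial autocorrelation goodness-of-fit:
                   $$ MIE_k^{(c)} = \frac{1}{n} \sum_{i=1}^{n} \left| I(y_i) - I(\hat{y}_i^{(c)}) \right| \tag{4} $$
15:         Average of all samples: $MIE_k = \frac{1}{C} \sum_{c=1}^{C} MIE_k^{(c)}$
16: return $G_{k^*}$, $D_{k^*}$ with $k^* = \arg\min_k MIE_k$
Algorithm 1 SpaceGAN Training and Selection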
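The selection part of Algorithm 1 (scoring stored snapshots and keeping the one with the lowest average MIE) can be sketched independently of any deep learning framework. In the sketch below, each snapshot is assumed to be a callable `generator(z, context)` returning one generated value per location; this interface, the function name `select_snapshot` and the Gaussian noise prior are assumptions for illustration, and the adversarial training updates themselves are omitted.

```python
import numpy as np

def local_morans_i(y, w):
    """Local Moran's I for values y under a weight matrix w with zero diagonal."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    m2 = (np.sum(dev ** 2) - dev ** 2) / (len(y) - 1)
    return dev / m2 * (w @ dev)

def select_snapshot(snapshots, y_real, context, w, n_samples=500, noise_dim=8, seed=0):
    """Return (best_index, scores), where scores[k] is the average MIE of snapshot k."""
    rng = np.random.default_rng(seed)
    i_real = local_morans_i(y_real, w)
    scores = []
    for generator in snapshots:
        errors = []
        for _ in range(n_samples):
            # One noise vector per spatial location (standard normal prior assumed).
            z = rng.standard_normal((len(y_real), noise_dim))
            y_fake = generator(z, context)   # hypothetical generator interface
            errors.append(np.mean(np.abs(i_real - local_morans_i(y_fake, w))))
        scores.append(float(np.mean(errors)))
    return int(np.argmin(scores)), scores
```

In a full training loop, `snapshots` would be filled every `snap` training steps with a frozen copy of the current generator, so that the returned index identifies the checkpoint to keep.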

The set of user-defined hyperparameters for running SpaceGAN Training and Selection mainly encompasses: the $G$ and $D$ architectures, the number of lags (neighbours), the noise vector size and prior distribution, the minibatch size $L$, the number of epochs, the snapshot frequency ($snap$), the number of samples $C$, as well as the parameters associated with the stochastic gradient optimizer. For a precise description of the architecture and specific settings, see the experiments in Section 3 and the Appendix. Notably, our proposed stopping criterion can be seen as choosing the best member from a population of GANs acquired during training. In this way, our approach resembles "snapshot ensembling", introduced by Huang et al. [21].

2.4 Ganning: GAN augmentation for ensemble learning

Figure 2: Architecture of SpaceGAN for ensemble learning (Ganning).
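1: Require: $B$ (number of samples), $M$ (base learner), $G$ (trained generator)
2: for $b = 1, \dots, B$ do (generate spatial data)
3:     for $i = 1, \dots, n$ do
4:         sample noise vector $z_i \sim p_z(z)$
5:         draw $\hat{d}_i^{(b)} = G(z_i \mid v_i)$
6:     train base learner: $M^{(b)}$ on $\{\hat{d}_i^{(b)}\}_{i=1}^{n}$
7: return ensemble $\{M^{(1)}, \dots, M^{(B)}\}$
Algorithm 2 "Ganning" for ensemble learning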
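Algorithm 2 can be sketched in Python as follows. The sketch is deliberately framework-free: `sample_synthetic` stands in for drawing one full synthetic dataset from a trained generator, and an ordinary least-squares fit stands in for the base learner. Both are placeholders chosen for illustration and do not reflect the base learners used in the experiments.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in base learner: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def ganning(sample_synthetic, fit_base_learner, n_models=100, seed=0):
    """Train one base learner per synthetic dataset and average their predictions."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        X_fake, y_fake = sample_synthetic(rng)   # one synthetic dataset per learner
        models.append(fit_base_learner(X_fake, y_fake))
    # The ensemble prediction is the mean over all base learners.
    return lambda X_new: np.mean([m(X_new) for m in models], axis=0)

# Usage with a toy "generator" that samples noisy points from y = 2x + 1.
def toy_sampler(rng):
    X = rng.uniform(0.0, 1.0, size=(50, 1))
    return X, 2.0 * X[:, 0] + 1.0 + 0.1 * rng.standard_normal(50)

predict = ganning(toy_sampler, fit_linear, n_models=20)
prediction = predict(np.array([[0.5]]))
```

The structure mirrors Bagging, except that each learner sees a freshly generated dataset instead of a bootstrap resample of the observed data.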

A common use-case of geospatial data is spatial prediction. We approach this from an ensemble learning perspective. In ensemble learning, individually "weak" base learners (e.g. Regression Trees) can be aggregated and as such outperform "strong" learners (e.g. Support Vector Machines). Traditionally, this idea includes models like Random Forest, Gradient Boosting Trees and other implementations that make use of Bagging, Boosting or Stacking principles [14, 11]. Here, we follow Koshiyama et al. [26] and utilize SpaceGAN-generated samples as training data for the ensemble learners. This approach has not been applied to spatial data before, and since it is analogous to Bagging, we will refer to it as "Ganning" from here on. Algorithm 2 outlines this approach. Assuming a fully trained and parametrized SpaceGAN, we repeatedly draw SpaceGAN samples and train a base learner on each. After repeating this for $B$ samples we return the whole set of base models as an ensemble. The benefits of ensemble learning schemes can be best explained using the variance reduction lemma [14]. Intuitively, we can reduce the variance of the ensemble by averaging many weakly correlated predictors. Following the concept of the bias-variance trade-off [14, 11], the ensemble Mean Squared Error (MSE) decreases, particularly when low-bias, high-variance base learners such as Deep Decision Trees are used. Nevertheless, there is a potential risk to this approach. Should SpaceGAN fail to replicate the true data generation process faithfully, SpaceGAN samples might not only be more diverse, but also more "biased". Consequently, this could lead to base learners missing obvious patterns, or finding new patterns that do not exist in the real data.
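The variance reduction argument can be made explicit with a standard result for averages of correlated predictors. As an illustration, assume $B$ base learners $f_1, \dots, f_B$ whose predictions at a given point are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$ (the symbols $\sigma^2$ and $\rho$ are introduced here for this illustration only):

```latex
\operatorname{Var}\!\left( \frac{1}{B} \sum_{b=1}^{B} f_b \right)
  = \frac{1}{B^2} \left[ B \sigma^2 + B (B - 1) \rho \sigma^2 \right]
  = \rho \sigma^2 + \frac{1 - \rho}{B} \, \sigma^2
```

The second term vanishes as $B$ grows, so the remaining variance is governed by $\rho$. Under these assumptions, more diverse training samples help only insofar as they lower the correlation between base learners without adding bias.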

3 Experiments

(a) Target vector (top) and its Moran's I value (bottom) of the observed and SpaceGAN generated data for Toy 1.
(b) Target vector (top) and its Moran's I value (bottom) of the observed and SpaceGAN generated data for Toy 2.
(c) Target vector (top) and its Moran's I value (bottom) of the observed data, SpaceGAN generated data and a Gaussian Process smooth for California Housing 50.
Figure 3: Experiment 1: We compare the real data to SpaceGAN generated samples (averaged over 500 samples), showing both the target $y$ and its Moran's I value $I(y)$. The data is synthesized out-of-sample using spatial cross-validation.

We evaluate our proposed methods in two experiments. First, we assess SpaceGAN’s ability to generate spatial data, including realistic representations of its internal spatial autocorrelation structure. Second, we analyze the use of SpaceGAN samples in an ensemble learning approach for spatial predictive modeling. For this, we use three different datasets:

Toy 1: The data points are a rectangular grid of regularly distributed, synthetic point coordinates $\mathbf{c}$, a random Gaussian noise vector $x$ and an outcome variable $y$, a simple quadratic function of the spatial coordinates $\mathbf{c}$ and the random vector $x$.
Toy 2: The data points are a rectangular grid of regularly distributed, synthetic point coordinates $\mathbf{c}$, a random Gaussian noise vector $x$ and an outcome variable $y$. Here, $y$ is a more complex combination of two nonlinear component functions and a linear global pattern of the inputs.
California Housing: This real-world dataset describes the prices of California houses, taken from the 1990 census. The house prices $y$ come with point coordinates $\mathbf{c}$ and some further predictor variables $\mathbf{x}$, such as house age or number of bedrooms. The dataset was introduced by Pace and Barry [25] and is a standard example for continuous, spatially autocorrelated data.

All our experiments are conducted using 10-fold spatial cross-validation [37]. Here, points spatially close to the test set are removed from the training set. This is done to prevent overfitting in spatial prediction tasks: if spatial dependencies exist, training data close to the test set is also similar to it, and including it can lead to overconfident predictions. For a further elaboration on this scheme, see the Appendix. Note that for the real-world dataset, we refer to California Housing 15 as a $15$-nearest neighbour implementation of the spatial cross-validation, and California Housing 50 as a $50$-nearest neighbour implementation. For both toy datasets, we use a simple queen neighbourhood (see Figure 1). For a description of the specific neural network architectures for SpaceGAN used in the different experiments, see the Appendix.

3.1 Experiment 1: Reproducing spatial correlation patterns

| Dataset | GP | SpaceGAN |
|---|---|---|
| Toy 1 | 1.9495 (0.1750) | 0.3173 (0.1791) |
| Toy 2 | 0.2195 (0.0175) | 0.2141 (0.0157) |
| California Housing 15 | 1.9932 (0.0826) | 1.1468 (0.0416) |
| California Housing 50 | 3.8183 (0.2072) | 0.9333 (0.0288) |

Note: output and prediction were normalized before calculation.

Table 1: MIE (and its standard error) between real and augmented data for SpaceGAN and GP implementations.

Our first experiment investigates SpaceGAN's ability not only to generate data, but also to reproduce observed spatial patterns. We train SpaceGAN on the three experimental datasets and at each spatial location return 500 samples from the generator, as shown in Figure 3. Note that these results show out-of-sample extrapolations. For the dataset Toy 1, SpaceGAN is able to capture both the target vector and its spatial autocorrelation almost perfectly. In Toy 2, which represents a substantially more complicated pattern, we capture parts of the observed pattern seamlessly; however, the spatial areas characterized by more subtle patterns are not captured fully. Nevertheless, this result shows that SpaceGAN also works when the spatial correlation structure is homogeneous. Lastly, we assess the real-world dataset California Housing. Again, SpaceGAN is able to capture both the target and the spatial dependencies in the data. In the real-world setting we also compare SpaceGAN to a Gaussian Process (GP) smooth for data augmentation (implemented as Vanilla-GP with RBF kernel in sklearn [36]). We can see that the GP struggles with capturing both the target vector and its local spatial autocorrelation. Table 1 provides the MIE metric for SpaceGAN and a GP smooth, showing that SpaceGAN is better able to capture the spatial interdependencies in the input. Higher resolution figures and GP comparisons for Toy 1 and Toy 2 can be found in the appendix.

3.2 Experiment 2: Data augmentation for predictive modeling

Figure 4: Model performance of SpaceGAN and competing methods across different datasets. RMSE values are given for out-of-sample prediction using spatial cross-validation.

Our second experiment focuses on predictive modelling in a spatial setting. As outlined in Section 2.4, we seek to use SpaceGAN-generated samples in an ensemble learning setting, the so-called "Ganning". More specifically, we test two SpaceGAN configurations: first, a SpaceGAN using MIE as convergence criterion; second, a SpaceGAN using RMSE for convergence. These are compared to two ensemble baselines: first, a GP-Bagging approach, where we draw samples from a fully trained Gaussian Process posterior and use these to train base models for ensembling (GP); second, a traditional Bagging approach using spatial bootstrapping (Spatial Boot). Table 2 provides the out-of-sample prediction RMSE values for the four approaches. Figure 4 highlights the average RMSE (with confidence interval bars). We can observe that SpaceGAN (with MIE convergence) outperforms the competitors by a substantial margin on all three datasets.

Model (B = 100)

| Dataset | SpaceGAN-MIE | SpaceGAN-RMSE | GP | Spatial Boot |
|---|---|---|---|---|
| Toy 1 | 0.9921 (0.0995) | 1.1993 (0.1494) | 1.2388 (0.1490) | 1.2013 (0.1366) |
| Toy 2 | 1.0097 (0.1092) | 1.2065 (0.1496) | 1.3135 (0.1443) | 1.2962 (0.1413) |
| California Housing 15 | 139534 (12026) | 143983 (10341) | 159340 (8550) | 148830 (8660) |
| California Housing 50 | 128756 (7463) | 145612 (7152) | 156814 (8718) | 148546 (8611) |

Table 2: Experiment 2: Prediction scores (RMSE) and their standard errors across folds for different ensemble methods with $B = 100$ samples across the different prediction tasks.

4 Related work

We now want to contextualize our findings in relation to existing work in the field. As the academic field of machine learning advances, more and more sophisticated techniques are being developed with the aim to capture the complexity of the real world they are trying to model. This is particularly true for spatial methods, where assumptions like distributive independence or Euclidean distances restrict the performance of the most common algorithms. The motivation for this study originates from recent approaches of a more explicit modeling of spatial context within machine learning techniques. Among these are the emergence of vector embeddings for spatially distributed image data [23], the opportunities to model non-Euclidean spatial graphs using graph convolutional networks (GCNs) [10] and the modelling of spatial point processes using matrix factorization [33]. We see SpaceGAN as an addition to the family of spatially explicit machine learning methods.

GAN models have already been applied to data autocorrelated in one-dimensional space, e.g. time series [46, 26], two-dimensional space, e.g. remote sensing imagery [30, 48], and even three-dimensional space, e.g. point clouds [27, 12]. However, none of this previous work used measures of local autocorrelation to improve the representation of spatial patterns. In the context of data augmentation, GANs have become a popular tool for inflating training data and increasing model robustness [46, 42, 13, 6]. However, such a method does not yet exist for multivariate point data, where techniques such as the spatial bootstrap [7] or synthetic point generators [28, 38] are most commonly used. Spatial image data and point clouds, on the other hand, are often augmented using random perturbations, rotations or cropping [16, 47]. Lastly, ensemble learning is increasingly popular for spatial modeling [9], with applications ranging from forest fire susceptibility prediction [43] to class ambiguity correction in spatial classifiers [24]. Nevertheless, to our knowledge, no research has yet been conducted combining GAN augmentation and ensemble learning within a spatial data environment, highlighting the novelty of this study.

5 Conclusion

In this paper we introduce SpaceGAN, a novel data augmentation method for spatial data, reproducing both the data structure and its spatial dependencies through two key innovations: First, we provide a novel approach to spatially condition the GAN. Instead of conditioning on raw spatial features like coordinates, we use the feature vectors of spatially near data points for conditioning. Second, we introduce a novel convergence criterion for GAN training, the MIE. This metric measures how well the generator is able to imitate the observed spatial patterns. We show that this architecture succeeds at generating faithful samples in experiments using synthetic and real-world data. Turning towards predictive modeling, we propose an ensemble learning approach for spatial prediction tasks utilizing augmented SpaceGAN samples as training data for an ensemble of base models. We show that this approach outperforms existing methods in ensemble learning and spatial data augmentation.

In developing SpaceGAN, we seek to further the agenda of spatial representations in deep learning. As many real-world applications of deep learning algorithms deal with geospatial data, tools tailored to these tasks are required [39]. Nevertheless, the potential applications of SpaceGAN go beyond the geospatial and other $2$-dimensional data domains. While the use of neighbourhood structures as well as the Moran's I metric allow for the handling of data distributed in $n$-dimensional space, we seek to confirm the applicability in future studies. Further potentially fruitful research directions include experiments with different GAN architectures, e.g. Wasserstein loss functions, and application studies with sensitive spatial data, which could be obfuscated using SpaceGAN without losing desirable statistical properties.

Acknowledgments

The authors gratefully acknowledge funding from the UK Engineering and Physical Sciences Research Council, the EPSRC Centre for Doctoral Training in Urban Science (EPSRC grant no. EP/L016400/1); The Alan Turing Institute (EPSRC grant no. EP/N510129/1).

References

Appendix

A. Experimental Data

Here we provide a more elaborate description of the datasets used for evaluating Experiment 1 and Experiment 2.

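Toy 1: We create a synthetic dataset of $n = 400$ observations. Following the notation $d = (\mathbf{x}, y, \mathbf{c})$, we first set the spatial resolution, i.e. the spatial coordinates $\mathbf{c} = (c^{(1)}, c^{(2)})$, as a regular $20 \times 20$ grid:

$$ c^{(1)}, c^{(2)} \in \{2.5, 7.5, \dots, 97.5\} \tag{5} $$

We then add an independent feature $x$ as a random draw from a Gaussian distribution with mean $0$ and standard deviation $1$:

$$ x \sim \mathcal{N}(0, 1) \tag{6} $$

Now, we create the target variable $y$ as a function of the spatial coordinates $\mathbf{c}$ and the random noise $x$ as follows: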

(7)

The table below provides the summary statistics of the resulting synthetic dataset.

Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
400 50.000 28.868 2.500 26.250 73.750 97.500
400 50.000 28.868 2.500 26.250 73.750 97.500
400 0.000 1.000 0.846 0.731 0.426 3.743
400 0.008 0.960 2.993 0.641 0.638 2.702
Table 3: Summary statistics of the Toy 1 synthetic dataset.
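As a sanity check on the coordinate rows of Table 3, the sketch below builds a regular $20 \times 20$ grid with cell centres at $2.5, 7.5, \dots, 97.5$ and an independent standard normal feature. The grid spacing is inferred from the summary statistics (N, minimum, maximum and quartiles), the seed is arbitrary, and the target function is left out because its formula is not reproduced here; the function name `regular_grid` is illustrative.

```python
import numpy as np

def regular_grid(start, stop, step):
    """Return all (c1, c2) pairs of a square grid with the given axis spacing."""
    axis = np.arange(start, stop, step)
    c1, c2 = np.meshgrid(axis, axis)
    return np.column_stack([c1.ravel(), c2.ravel()])

rng = np.random.default_rng(0)             # arbitrary seed for reproducibility
coords = regular_grid(2.5, 100.0, 5.0)     # 20 values per axis: 2.5, 7.5, ..., 97.5
x = rng.standard_normal(len(coords))       # independent N(0, 1) feature

# Summary statistics of the first coordinate, comparable to the first row of Table 3.
summary = {
    "N": len(coords),
    "mean": coords[:, 0].mean(),
    "sd": coords[:, 0].std(ddof=1),
    "p25": np.percentile(coords[:, 0], 25),
    "p75": np.percentile(coords[:, 0], 75),
}
```

Changing the arguments to `regular_grid(1.75, 100.0, 3.5)` gives a $29 \times 29$ grid with 841 points, which matches the N, minimum and maximum reported for the Toy 2 coordinates.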

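Toy 2: We create a synthetic dataset of $n = 841$ observations. We again start by setting the spatial resolution, i.e. the spatial coordinates $\mathbf{c} = (c^{(1)}, c^{(2)})$, as a regular $29 \times 29$ grid:

$$ c^{(1)}, c^{(2)} \in \{1.75, 5.25, \dots, 99.75\} \tag{8} $$

We again add an independent variable $x$ as a random draw from a Gaussian distribution with mean $0$ and standard deviation $1$:

$$ x \sim \mathcal{N}(0, 1) \tag{9} $$

Lastly, we create the target variable $y$ as a more complex function of the spatial coordinates $\mathbf{c}$ and the random noise $x$ as follows: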

(10)

where . The table below again provides the summary statistics.

Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
841 50.750 29.301 1.750 26.250 75.250 99.750
841 50.750 29.301 1.750 26.250 75.250 99.750
841 0.000 1.000 2.294 0.606 0.631 2.488
841 0.032 1.021 3.372 0.700 0.660 3.495
Table 4: Summary statistics of the Toy 2 synthetic dataset.

California Housing: This real-world dataset, introduced by [25], is widely popular for analyzing spatial patterns and accessible via Kaggle (see https://www.kaggle.com/camnugent/california-housing-prices); it is also integrated into sklearn (see https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). The table below provides an overview of the features and their statistical properties:

| Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
|---|---|---|---|---|---|---|---|
| longitude | 20,640 | -119.570 | 2.004 | -124.350 | -121.800 | -118.010 | -114.310 |
| latitude | 20,640 | 35.632 | 2.136 | 32.540 | 33.930 | 37.710 | 41.950 |
| housing_median_age | 20,640 | 28.639 | 12.586 | 1 | 18 | 37 | 52 |
| total_rooms | 20,640 | 2,635.763 | 2,181.615 | 2 | 1,447.8 | 3,148 | 39,320 |
| total_bedrooms | 20,433 | 537.871 | 421.385 | 1.000 | 296.000 | 647.000 | 6,445.000 |
| population | 20,640 | 1,425.477 | 1,132.462 | 3 | 787 | 1,725 | 35,682 |
| households | 20,640 | 499.540 | 382.330 | 1 | 280 | 605 | 6,082 |
| median_income | 20,640 | 3.871 | 1.900 | 0.500 | 2.563 | 4.743 | 15.000 |
| median_house_value | 20,640 | 206,855.800 | 115,395.600 | 14,999 | 119,600 | 264,725 | 500,001 |

Table 5: Summary statistics of the California Housing dataset.

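We can break the dataset down into the familiar notation $d = (\mathbf{x}, y, \mathbf{c})$ as follows:

$$ \mathbf{x} = (\text{housing\_median\_age}, \text{total\_rooms}, \text{total\_bedrooms}, \text{population}, \text{households}, \text{median\_income}) \tag{11} $$
$$ y = \text{median\_house\_value} \tag{12} $$
$$ \mathbf{c} = (\text{longitude}, \text{latitude}) \tag{13} $$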

B. Experimental Setting

The tables below provide details on the architecture and configuration of the neural networks used in SpaceGAN during our experiments. Note that the kernel size parameter for Toy 1 and Toy 2 corresponds to the queen neighbourhood (for discrete spatial data) outlined in Figure 1 and is the same neighbourhood that is used for spatial conditioning and spatial cross-validation (see Appendix E). The kernel size for California Housing 15 and California Housing 50 corresponds to the same kNN-neighbourhood (with $k = 15$ and $k = 50$, respectively) that is used for spatial conditioning and spatial cross-validation.

Parameter | Values
--- | ---
Architecture | 1D-CNN
Number of hidden layers | 1
Training steps | 20000
Batch size | 100
Optimizer | Stochastic Gradient Descent
Optimizer parameters | learning rate = 0.01
Noise prior |
Snapshot frequency | 500
Number of samples for evaluation | 500
Input features scaling function | z-score (standardization)
Target scaling function | z-score (standardization)
Table 6: Overview of the general SpaceGAN architecture and its hyperparameters.

Parameter | Toy 1 | Toy 2 | California 15 | California 50
--- | --- | --- | --- | ---
(G, D) filters | (50, 50) | (100, 100) | (100, 100) | (200, 200)
(G, D) kernel size | (8, 8) | (8, 8) | (15, 15) | (50, 50)
(G, D) hidden layer function | (relu, tanh) | (relu, tanh) | (relu, tanh) | (relu, tanh)
(G, D) output layer function | (linear, sigmoid) | (linear, sigmoid) | (linear, sigmoid) | (linear, sigmoid)
Noise dimension | 8 | 8 | 15 | 15
Table 7: Dataset-specific configurations of the SpaceGAN architecture for the experiments.
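The link between the kernel size of 8 for the toy datasets and the queen neighbourhood can be illustrated with a minimal sketch. It assumes a regular grid stored as a list of rows; the function names `queen_neighbours` and `neighbour_values` are hypothetical and are not taken from the SpaceGAN code.

```python
def queen_neighbours(row, col, n_rows, n_cols):
    """Return the grid cells that share an edge or a corner with (row, col)."""
    neighbours = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the cell itself
            r, c = row + dr, col + dc
            if 0 <= r < n_rows and 0 <= c < n_cols:
                neighbours.append((r, c))
    return neighbours


def neighbour_values(grid, row, col):
    """Collect the values of a cell's queen neighbours (a conditioning vector)."""
    n_rows, n_cols = len(grid), len(grid[0])
    return [grid[r][c] for r, c in queen_neighbours(row, col, n_rows, n_cols)]


# Hypothetical 3x3 grid of target values.
grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0]]

interior = neighbour_values(grid, 1, 1)  # an interior cell has 8 queen neighbours
corner = neighbour_values(grid, 0, 0)    # a corner cell has only 3
```

Interior cells have exactly eight queen neighbours, which matches the kernel size used for Toy 1 and Toy 2. Border cells have fewer; how they are padded in the experiments is not stated here, so the sketch simply returns the shorter list.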

D. Training convergence: MIE vs. RMSE

(a) Toy 1
(b) California Housing 50
Figure 5: MIE and RMSE evolution through a typical SpaceGAN training cycle.

During Experiment 2, we compare the convergence (and performance) of two SpaceGAN implementations, one using MIE and one using RMSE as convergence criterion. For completeness, we define the RMSE (root mean squared error) as follows:

(14)
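For reference, the RMSE in its standard form is sketched below. The notation is an assumption: $y_i$ denotes the real target value at location $i$, $\hat{y}_i$ the corresponding generated value, and $n$ the number of data points; the authors' own symbols may differ.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```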

Figure 5 shows SpaceGAN training under the two convergence criteria for Toy 1 and California Housing 50 over the training steps of a typical training cycle. Interestingly, for Toy 1, the two criteria are almost antithetic: a local minimum under one criterion approximately coincides with a local maximum under the other at the same training step. Moreover, RMSE gives little guidance on when a convergence point is reached, as it shows several local minima of approximately similar value. This also holds for the California Housing 50 dataset. The MIE criterion, however, appears to have a relatively stable minimum at its first local minimum.
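Snapshot selection under either criterion can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes MIE is the sum of absolute differences between the local Moran's I of the real and the generated target, uses a row-standardised weights matrix, and relies on hypothetical function names and toy data.

```python
import math


def rmse(real, generated):
    """Root mean squared error between real and generated values."""
    return math.sqrt(sum((y - g) ** 2 for y, g in zip(real, generated)) / len(real))


def local_morans_i(values, weights):
    """Local Moran's I per location; weights[i][j] is the spatial weight of j for i.

    Assumes the values are not all identical (otherwise the variance is zero).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    m2 = sum(d * d for d in dev) / n  # second moment of the deviations
    return [dev[i] / m2 * sum(weights[i][j] * dev[j] for j in range(n))
            for i in range(n)]


def mie(real, generated, weights):
    """Assumed form of MIE: summed absolute difference in local Moran's I."""
    i_real = local_morans_i(real, weights)
    i_gen = local_morans_i(generated, weights)
    return sum(abs(a - b) for a, b in zip(i_real, i_gen))


def select_snapshot(real, snapshots, criterion):
    """Return the training step whose generated sample minimises the criterion."""
    return min(snapshots, key=lambda step: criterion(real, snapshots[step]))


# Four locations on a line, each weighted equally over its direct neighbours.
weights = [[0.0, 1.0, 0.0, 0.0],
           [0.5, 0.0, 0.5, 0.0],
           [0.0, 0.5, 0.0, 0.5],
           [0.0, 0.0, 1.0, 0.0]]
real = [1.0, 2.0, 3.0, 4.0]
# Hypothetical generator outputs saved at two snapshot steps.
snapshots = {500: [4.0, 3.0, 2.0, 1.0], 1000: [2.0, 2.0, 3.0, 3.0]}

best_rmse = select_snapshot(real, snapshots, rmse)
best_mie = select_snapshot(real, snapshots, lambda r, g: mie(r, g, weights))
```

In this toy case the two criteria pick different snapshots: the mirrored sample at step 500 reproduces the local autocorrelation pattern exactly but is far from the real values point by point, whereas the sample at step 1000 is closer in value but flattens the spatial gradient.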

E. Spatial Cross-Validation

Figure 6: Illustration of the spatial k-fold cross-validation process.

We use a variation of k-fold spatial cross-validation [37] to evaluate all our experiments. The goal of spatial cross-validation is to check the generalizability of spatial models and to avoid overfitting. In a naive cross-validation setting with spatial data, overfitting can occur when training and test points are spatially too close: assuming some spatial dependency between nearby points, this roughly amounts to training on the test set. Hence, we create a so-called buffer area around the test set, within which we remove all data points from the training set.

Given a set of data points, we first create spatially coherent test sets. In our case, we do this by slicing through each of the two dimensions of the coordinate space five times with equal binning, thus creating folds of the same width. We then define the training set of each fold as all points which are neither part of the test set nor neighbours of any test set point; the excluded neighbours form the buffer area. As a quick example, for the California Housing 50 dataset, we define the test set, then exclude all points which are not part of the test set but are among the 50 nearest neighbours of one of the test set points. The remaining points form the training set. While we chose to define the buffer zone according to the neighbourhood-based spatial weights matrix, other methods, such as defining a deadzone area using a radius around the test set, are also applicable. The spatial k-fold cross-validation process is illustrated in Figure 6.
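The buffered fold construction can be sketched in a few lines. The sketch below reads the slicing step as five equal-width strips along each of the two coordinate axes (ten folds in total), which is one interpretation of the description rather than a confirmed detail, and it uses a brute-force kNN search. All function names are hypothetical.

```python
def strip_folds(coords, n_bins=5):
    """Build test sets as n_bins equal-width strips along each coordinate axis.

    Assumes the coordinates are not all identical along an axis.
    """
    folds = []
    for axis in (0, 1):
        lo = min(c[axis] for c in coords)
        hi = max(c[axis] for c in coords)
        width = (hi - lo) / n_bins
        bins = [[] for _ in range(n_bins)]
        for idx, c in enumerate(coords):
            # Clamp so that the maximum coordinate falls into the last bin.
            b = min(int((c[axis] - lo) / width), n_bins - 1)
            bins[b].append(idx)
        folds.extend(set(b) for b in bins if b)
    return folds


def knn(coords, i, k):
    """Indices of the k nearest neighbours of point i (Euclidean, excluding i)."""
    others = [j for j in range(len(coords)) if j != i]
    others.sort(key=lambda j: (coords[j][0] - coords[i][0]) ** 2
                              + (coords[j][1] - coords[i][1]) ** 2)
    return others[:k]


def buffered_train_set(coords, test, k):
    """All points that are neither in the test set nor a kNN of a test point."""
    buffer = set()
    for i in test:
        buffer.update(knn(coords, i, k))
    return set(range(len(coords))) - test - buffer


# Hypothetical data: a regular 10 x 10 grid of locations.
coords = [(x, y) for x in range(10) for y in range(10)]
folds = strip_folds(coords)
test_idx = folds[0]  # leftmost strip
train_idx = buffered_train_set(coords, test_idx, k=4)
```

On this grid the leftmost strip covers two columns, and with k = 4 the buffer removes the adjacent column from the training set, so no training point is a direct neighbour of a test point. A radius-based deadzone, as mentioned above, would replace `knn` with a distance threshold.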

F. Experimental Results

Here, we provide higher-resolution images of the SpaceGAN-augmented data across the three example datasets. Note again that all synthetic samples are based on out-of-sample extrapolations from the respective generator (SpaceGAN or GP).

Figure 7: Real vs. synthetic data for the Toy 1 dataset. SpaceGAN and GPs are used for data augmentation. The upper row shows the target vector, the lower row its local spatial autocorrelation. All synthetic data is generated through out-of-sample extrapolation with spatial cross-validation.
Figure 8: Real vs. synthetic data for the Toy 2 dataset. SpaceGAN and GPs are used for data augmentation. The upper row shows the target vector, the lower row its local spatial autocorrelation. All synthetic data is generated through out-of-sample extrapolation with spatial cross-validation.
Figure 9: Real vs. synthetic data for the California Housing 15 dataset. Here, we use the 15 nearest neighbours of each datapoint for (1) defining the ConvNet kernel size, (2) calculating the local spatial autocorrelation and (3) spatial cross-validation. SpaceGAN and GPs are used for data augmentation. The upper row shows the target vector, the lower row its local spatial autocorrelation. All synthetic data is generated through out-of-sample extrapolation with spatial cross-validation.
Figure 10: Real vs. synthetic data for the California Housing 50 dataset. Here, we use the 50 nearest neighbours of each datapoint for (1) defining the ConvNet kernel size, (2) calculating the local spatial autocorrelation and (3) spatial cross-validation. SpaceGAN and GPs are used for data augmentation. The upper row shows the target vector, the lower row its local spatial autocorrelation. All synthetic data is generated through out-of-sample extrapolation with spatial cross-validation.