Infrastructure Quality Assessment in Africa using Satellite Imagery and Deep Learning

Barak Oshri, Annie Hu, Peter Adelson, Xiao Chen, Pascaline Dupas, Jeremy Weinstein, Marshall Burke, David Lobell, and Stefano Ermon (all Stanford University)

The UN Sustainable Development Goals allude to the importance of infrastructure quality in three of their seventeen goals. However, monitoring infrastructure quality in developing regions remains prohibitively expensive, which impedes efforts to measure progress toward these goals. To this end, we investigate the use of widely available remote sensing data for the prediction of infrastructure quality in Africa. We train a convolutional neural network to predict ground truth labels from the Afrobarometer Round 6 survey using Landsat 8 and Sentinel 1 satellite imagery.

Our best models predict infrastructure quality with AUROC scores of 0.881 on Electricity, 0.862 on Sewerage, 0.739 on Piped Water, and 0.786 on Roads using Landsat 8. These performances are significantly better than models that leverage OpenStreetMap or nighttime light intensity on the same tasks. We also demonstrate that our trained model can accurately make predictions in an unseen country after fine-tuning on a small sample of images. Furthermore, the model can be deployed in regions with limited samples to predict infrastructure outcomes with higher performance than nearest neighbor spatial interpolation.

deep learning, remote sensing, computational sustainability
KDD ’18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. Copyright ACM, 2018. Price: 15.00. DOI: 10.1145/3219819.3219924. ISBN: 978-1-4503-5552-0/18/08. CCS concepts: Applied computing (Economics); Computing methodologies (Neural networks).

1. Introduction

Basic infrastructure availability in developing regions is a crucial indicator of quality of life (Pottas, 2014). Reliable infrastructure measurements create opportunities for effective planning and distribution of resources, as well as guiding policy decisions on the basis of improving the returns of infrastructure investments (Varshney et al., 2015). Currently, the most reliable infrastructure data in the developing world comes from field surveys, and these surveys are expensive and logistically challenging (Dabalen et al., 2016). Some countries have not taken a census in decades (Xie et al., 2016), and data on key measures of infrastructure development are still lacking for much of the developing world (Jean et al., 2016; IEAG, 2014). Overcoming this data deficit with more frequent surveys is likely to be both prohibitively costly, perhaps costing hundreds of billions of U.S. dollars to measure every target of the United Nations Sustainable Development Goals in every country over a 15-year period (Jerven, 2014), and institutionally difficult, as some governments see little benefit in having their performance documented (Sandefur and Glassman, 2015; Jean et al., 2016).

Figure 1. Panels from left to right: sewerage, electricity, and piped water. We compare the labels for these variables, in the top row, with our predictions in the bottom row. Positive labels are shown in blue and negative labels in green.

One emerging technology for the global observation of infrastructure quality is satellite imagery. As satellite monitoring becomes more ubiquitous, with an increasing number of commercial players in the sector, improvements in spatial and temporal resolution open up new applications, uses, and markets (OECD, 2014), including the possibility to monitor important sustainability outcomes at scale. Additionally, satellite imagery can observe developing countries that do not have significant prior data, containing a wealth of observations that can be harnessed for social development applications (Jean et al., 2016), including infrastructure assessment.

Such rich and high quality image data enable advanced machine learning techniques to perform sophisticated tasks like object detection and classification, and deep learning in particular has shown great promise (Esteva et al., 2017; He et al., 2016; Dai et al., 2016; Albert et al., 2017). While a number of recent papers discuss the use of deep learning on satellite imagery for applications in land use cover (Albert et al., 2017), urban planning (Audebert et al., 2017), environmental science (Bragilevsky and Bajic, [n. d.]), etc. (Maharana et al., 2017; You et al., 2017; Pryzant et al., 2017), many unanswered questions remain in the field, particularly in the application of deep learning to social and economic development.

Our contributions in this paper are to both the applied deep learning literature and to socioeconomic studies involving remote sensing data. We propose a new approach that combines satellite imagery with field data to map infrastructure quality in Africa at 10m and 30m resolution. We explore multiple infrastructure outcomes, including but not limited to electricity, sewerage, piped water, and road, to identify the remote sensing predictability of different infrastructure categories on a continental level. Prediction maps for three outcomes are given in Figure 1. We show that, through fine-tuning a pretrained convolutional neural network (CNN), our models achieve area under the receiver operating characteristic curve (AUROC) scores of 0.881, 0.862, 0.739, and 0.786 on these outcomes and perform better than nighttime lights intensity (nightlights) and OpenStreetMap (OSM). Our primary datasets are 10m and 30m resolution satellite imagery from Sentinel 1 and Landsat 8, respectively, together with the georeferenced Afrobarometer Round 6 survey encompassing 36 countries in Africa. Our work provides the ability to assess infrastructure in an accurate and automated manner, to supplement the spatial extent of field survey data, and to generate predictions in unseen regions.

To the best of our knowledge, we are the first to use CNNs with Sentinel 1 imagery for social development research.

1.1. Organization of Paper

The remainder of the paper is organized as follows. Section 2 (Related Work) discusses recent applications of machine learning on satellite imagery and contextualizes previous work in infrastructure quality detection. Section 3 (Data) describes the survey data and satellite imagery data sources. Section 4 (Methodology) introduces the problem formulation and modeling techniques used in this paper. Section 5 (Experimental Results) presents the performance of our model. Section 6 (Baseline Models) benchmarks our model performance against three baselines. In Section 7 (Generalization Capabilities) we explore a few settings that test the deployment potential of the model, including its performance on urban and rural enumeration areas, as well as performance in countries that the model was not originally trained on. In Section 8 we discuss conclusions and future work.

2. Related Work

The application of CNNs to land use classification can be traced back to the work of Castelluccio et al. (2015) and Penatti et al. (2015) who trained deep models on the UC Merced land use dataset (Yang and Newsam, 2010), which consists of 2100 images spanning 21 classes. Similar early studies on land use classification that employ deep learning techniques are the works of Romero et al. (2016) and Papadomanolaki et al. (2016). In Liu et al. (2017) a spatial pyramid pooling technique is employed for land use classification using satellite imagery. These studies adapted architectures pre-trained to recognize natural images from the ImageNet dataset, such as VGGNet (Simonyan and Zisserman, 2014), to fine-tune them on their much smaller land use data. A more recent study (Albert et al., 2017) uses state-of-the-art deep CNNs VGG-16 (Simonyan and Zisserman, 2014) and Residual Neural Networks (He et al., 2016) to analyze land use in urban neighborhoods with large scale satellite data.

A few recent works related to infrastructure detection through deep learning inspire us to use additional data sources such as OSM (Haklay and Weber, 2008) to support our investigation. One project that is closely related to our investigation is DeepOSM, in which the authors pair OSM labels with satellite imagery obtained from Google Maps and use a convolutional architecture for classification. In (Yuan, 2016), the authors show that their model can achieve a precision of 0.74 and recall of 0.70 on building detection, training CNNs on 0.3 meter resolution OSM images. Their CNN consisted of 7 identical blocks of filtering, pooling, and convolutional layers. Mnih and Hinton (2010, 2012) built satellite image models for road detection, obtaining almost 0.8 precision, and 0.9 in the best case in one urban area. Recently, Albert et al. (2017) predicted land use classes in urban environments with 0.7 to 0.8 accuracy, commenting on the inherent difficulty of understanding high-level, subjective concepts of urban planning from satellite imagery.

Compared to prior work, our key contributions are that we train the first wide scale classifier of infrastructure quality via deep learning and publicly available satellite imagery on 11 infrastructure quality outcomes, and that we are able to achieve state-of-the-art performance in predicting infrastructure accessibility on a large imagery dataset in Africa.

3. Data

This project relates two data sources in a supervised learning setting: survey data from Afrobarometer as ground truth infrastructure quality labels, and satellite imagery from Landsat 8 and Sentinel 1 as input sources.

3.1. Afrobarometer Round 6

Figure 2. Distribution of piped water in the Afrobarometer Round 6 survey. Our study is the first to conduct a large scale infrastructure analysis at the continent-scale across 36 nations in Africa. Positive examples are shown in green and negative examples in 0.

Conducted over 2014 and 2015 across 36 countries in Africa, the Afrobarometer Round 6 survey collected surveyor-assessed quality indicators about infrastructure availability, access, and satisfaction based on respondent data (available by request through AidData; BenYishay et al., 2017). The dataset surveys 7022 enumeration areas with 36 attributes regarding various aspects of welfare, infrastructure quality, and wealth (Oyuke et al., [n. d.]). Afrobarometer data from previous rounds dates back to 1999; our application used only Round 6, but adding other rounds represents valuable further work. Each survey response is an aggregate of face-to-face surveys in the enumeration area, which can encompass a city, village, or rural town, and between 1200 and 2400 samples are collected over all enumeration areas for each country. Each country in the Round 6 survey has between 150 and 300 enumeration areas. Each enumeration area is georeferenced with its latitude and longitude, and we center a satellite image for each enumeration area around these coordinates (Afrobarometer, 2014). Figure 2 shows the spatial distribution of all enumeration areas.

The Afrobarometer Round 6 survey includes 11 binary infrastructure outcomes, with each denoting the availability and quality of that infrastructure in the enumeration area. We primarily focus on highlighting results in electricity, sewerage, piped water, and road for their novel contributions. We show results on all binary outcomes except for cellphone and school due to their high class imbalances. Due to the variation of class balances across all variables we assess performance on multiple metrics and stress AUROC due to its insensitivity to class imbalances. This helps assess comparability in performance between the outcomes.

| Infrastructure | Label 1 | Label 0 | Balance |
|---|---|---|---|
| Electricity | 4680 | 2343 | 0.667 |
| Sewerage | 2239 | 4784 | 0.319 |
| Piped Water | 4303 | 2720 | 0.613 |
| Road | 3886 | 3137 | 0.553 |
| Post Office | 1728 | 5295 | 0.246 |
| Market Stalls | 4811 | 2212 | 0.685 |
| Police Station | 2553 | 4470 | 0.364 |
| Bank | 1875 | 5148 | 0.267 |
| Cellphone | 6576 | 447 | 0.936 |
| School | 6082 | 941 | 0.866 |
| Health Clinic | 4115 | 2908 | 0.586 |

Table 1. Overview of the number of examples in each class for all binary infrastructure variables in the Afrobarometer Round 6 survey, including the balance (proportion of positive labels).

3.2. Satellite Imagery

Two primary sources of satellite observations were used, both offering coverage of most of the enumeration areas. The satellite data is temporally consistent with the survey data, from 2014 and 2015. For a given enumeration area with sampling location (latitude, longitude) at the center, we collect pixel images.

Landsat 8: Landsat 8 is a satellite with the objective of collecting publicly available multispectral imagery of the global landmass. Landsat 8 imagery has a 30m resolution providing coverage in the following six bands: Blue, Green, Red, Near Infrared, and two bands of Shortwave Infrared. Each pixel value represents the degree of reflectance in that specific band. Cloud cover removal is handled natively by the Landsat 7 Automatic Cloud Cover Assessment algorithm (Irish, 2000).

Sentinel 1: Sentinel 1 is the first satellite in the Copernicus Programme satellite constellation. It uses a C-band synthetic-aperture radar (SAR) instrument to acquire imagery regardless of weather and light conditions. Imagery obtained from Sentinel 1 has a resolution of 10 meters, providing coverage. It is processed to five bands, comprised of four polarizations and a look angle: VV, VH, Angle, VV, and VH. Each pixel value in the polarization channels represents the degree of backscatter in that specific band. For the Afrobarometer dataset, images were taken from two different orbital paths resulting in different look angles, ascending or descending, though not every enumerated area had both images. We choose the image with ascending path when available, otherwise we choose the image with descending path.

4. Methodology

4.1. Problem Formulation

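The infrastructure detection task is a multi-label binary classification problem. The input is a satellite image $X$ and the outputs are binary labels $y_1, \dots, y_n \in \{0, 1\}$, corresponding to quality indicators of $n$ different infrastructure outcomes. We optimize the mean binary cross entropy loss

$$L(X, y) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p(y_i = 1 \mid X) + (1 - y_i) \log p(y_i = 0 \mid X) \right]$$

where $p(y_i = j \mid X)$ is the probability that the model predicts that input $X$ has label $j$ for infrastructure outcome $i$.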
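As an illustration of the loss described above, the following is a minimal sketch of the mean binary cross entropy for a single image. The function name `mean_bce`, the clipping constant `eps`, and the example values are assumptions made for illustration and are not taken from our training code.

```python
import math

def mean_bce(labels, probs, eps=1e-12):
    """Mean binary cross entropy over the outcomes of one image.

    labels: list of 0/1 ground-truth values, one per infrastructure outcome
    probs:  list of predicted probabilities that each label equals 1
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# Hypothetical predictions for four outcomes of one enumeration area.
loss = mean_bce([1, 0, 1, 1], [0.9, 0.2, 0.6, 0.5])
```

A prediction of 0.5 on every outcome yields a loss of log 2, which is a useful reference point when reading training curves.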

4.2. Model

We train deep learning models to learn useful representations of data from input imagery. Convolutional Neural Networks (CNNs) have been particularly successful in vision tasks and have also been demonstrated to perform well on satellite imagery (Xie et al., 2016; Jean et al., 2016; Albert et al., 2017; Bragilevsky and Bajic, [n. d.]). For all experiments in the paper, we train a Residual Neural Network (ResNet) architecture (He et al., 2016). The following describe further specifications of our model:

ResNet. ResNet has achieved state-of-the-art results in ImageNet (He et al., 2016), and its main contribution over previous convolutional neural networks is learning residual functions in every forward propagation step with reference to the layer inputs. We posit that this is useful in satellite imagery analysis for retaining low-level features in high-level classifications. We train an 18 layer network.

Transfer Learning. Instead of training the network from random initializations, we initialize our network weights with those of a ResNet pre-trained on ImageNet (Krizhevsky et al., 2012). Even though the weights are initialized on an object recognition task, this approach has been demonstrated to be effective in training on new tasks compared with initializing using random weights (Oquab et al., 2014) and useful for learning low-level features like edges in satellite imagery.

Multi Channel Inputs. ImageNet architectures originally take inputs with three channels corresponding to RGB values. However, Landsat 8 and Sentinel 1 have six and five bands respectively. We change the first convolution layer in the network to accept more than three input channels by extending the convolutional filters to further channels. The number of output channels, stride, and padding for the first layer is the same as in the original ResNet. With Landsat 8, we initialize the RGB band parameters of the first layer with the same parameters as in the pre-trained ResNet weights, and initialize the non-RGB bands with Xavier initialization (Glorot and Bengio, 2010). With Sentinel 1, which does not include RGB bands, we initialize three bands with the pretrained RGB channel weights, and the other two with Xavier initialization. In Xavier initialization, each weight is sampled uniformly from

$$\left[ -\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \; \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}} \right]$$

where $n_{in}$ is the input dimensions of the convolutional filter and $n_{out}$ is the number of neurons in the next layer.
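The channel extension can be sketched in plain Python as follows. This is only an illustration under stated assumptions: filters are stored as nested lists indexed [filter][channel][row][column], the fan-in is taken as channels times kernel area, and the value passed as `n_out` is a placeholder. The helper names `xavier_bound` and `extend_first_layer` are hypothetical and do not correspond to a library API.

```python
import math
import random

def xavier_bound(n_in, n_out):
    """Half-width of the Xavier (Glorot) uniform sampling range."""
    return math.sqrt(6.0) / math.sqrt(n_in + n_out)

def extend_first_layer(rgb_filters, extra_channels, n_out, seed=0):
    """Keep pretrained RGB channels and append Xavier-initialized channels.

    rgb_filters: nested list indexed [filter][channel][row][col]
    """
    rng = random.Random(seed)
    k = len(rgb_filters[0][0])  # kernel size (square kernel assumed)
    total_channels = len(rgb_filters[0]) + extra_channels
    # Assumption: fan-in counts every input weight of one extended filter.
    bound = xavier_bound(total_channels * k * k, n_out)
    extended = []
    for filt in rgb_filters:
        copied = [[row[:] for row in channel] for channel in filt]
        new_channels = [
            [[rng.uniform(-bound, bound) for _ in range(k)] for _ in range(k)]
            for _ in range(extra_channels)
        ]
        extended.append(copied + new_channels)
    return extended

# Toy stand-in for pretrained weights: 64 filters, 3 channels, 7x7 kernels.
pretrained = [[[[0.1] * 7 for _ in range(7)] for _ in range(3)] for _ in range(64)]
landsat_filters = extend_first_layer(pretrained, extra_channels=3, n_out=64)
```

For Sentinel 1 the same helper would be called with `extra_channels=2`, since three of its five bands reuse the pretrained RGB weights.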

4.3. Data Processing

Our pipeline includes several data processing steps and augmentation strategies:

Unique Geocoded Images in Test Set. Each enumerated area in the Afrobarometer survey has a unique geocode field, and enumerated areas with the same geocode field have substantial spatial overlap. To ensure that there is no spatial overlap between images observed in the training set and in the test set, we enforce that only points with a unique geocode appear in the test set.

Cropping. Our satellite images are ingested at pixel bounding boxes. We try downsampling, cropping at random regions, and cropping around the center pixel to pixels, and we find that the latter has best performance and convergence.

Horizontal Flipping. To augment the limited size of our dataset, at training time we flip each image horizontally with 50% probability.

Normalization. We normalize our data channel-wise to zero mean and unit standard deviation.
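The cropping, flipping, and normalization steps can be sketched as below. Images are represented as nested lists indexed [channel][row][column]; the crop size, the per-channel statistics, and the toy image are illustrative assumptions, and the flip is implemented as a left-to-right mirror.

```python
import random

def center_crop(image, size):
    """Crop a [channel][row][col] image to size x size around its center."""
    height, width = len(image[0]), len(image[0][0])
    top, left = (height - size) // 2, (width - size) // 2
    return [[row[left:left + size] for row in channel[top:top + size]]
            for channel in image]

def random_horizontal_flip(image, rng, p=0.5):
    """Mirror every row left-to-right with probability p."""
    if rng.random() < p:
        return [[row[::-1] for row in channel] for channel in image]
    return image

def normalize(image, means, stds):
    """Standardize each channel with precomputed dataset statistics."""
    return [[[(v - m) / s for v in row] for row in channel]
            for channel, m, s in zip(image, means, stds)]

# Toy single-channel 4x4 image with pixel values 0..15.
toy = [[[r * 4 + c for c in range(4)] for r in range(4)]]
cropped = center_crop(toy, 2)                                   # [[[5, 6], [9, 10]]]
flipped = random_horizontal_flip(cropped, random.Random(0), p=1.0)
scaled = normalize(cropped, means=[7.5], stds=[2.5])
```

In practice the channel means and standard deviations would be computed once over the training set and reused at test time.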

| Satellite | Infrastructure | Balance | Accuracy | F1 Score | Precision | Recall | AUROC |
|---|---|---|---|---|---|---|---|
| L8 | Electricity | 0.667 | 0.832 | 0.873 | 0.877 | 0.870 | 0.881 |
| L8 | Sewerage | 0.319 | 0.815 | 0.700 | 0.756 | 0.650 | 0.862 |
| L8 | Piped Water | 0.613 | 0.673 | 0.725 | 0.730 | 0.720 | 0.739 |
| L8 | Road | 0.553 | 0.705 | 0.704 | 0.746 | 0.667 | 0.786 |
| L8 | Bank | 0.267 | 0.767 | 0.364 | 0.543 | 0.273 | 0.726 |
| L8 | Post Office | 0.246 | 0.753 | 0.427 | 0.434 | 0.420 | 0.712 |
| L8 | Market Stalls | 0.685 | 0.681 | 0.791 | 0.688 | 0.930 | 0.665 |
| L8 | Health Clinic | 0.586 | 0.622 | 0.719 | 0.632 | 0.833 | 0.664 |
| L8 | Police Station | 0.364 | 0.660 | 0.492 | 0.490 | 0.494 | 0.650 |
| S1 | Electricity | 0.667 | 0.769 | 0.820 | 0.820 | 0.821 | 0.819 |
| S1 | Sewerage | 0.319 | 0.802 | 0.659 | 0.678 | 0.842 | 0.862 |
| S1 | Piped Water | 0.613 | 0.663 | 0.722 | 0.716 | 0.728 | 0.725 |
| S1 | Road | 0.553 | 0.702 | 0.730 | 0.681 | 0.786 | 0.779 |

Table 2. Classification results for Landsat 8 (L8) and Sentinel 1 (S1). Our Landsat 8 model achieves AUROC scores above 0.85 on electricity and sewerage, and achieves scores above 0.7 on all but three outcome variables.

4.4. Training

We train independent models for Landsat 8 and Sentinel 1. Our models are trained with multi-label classifiers as well as single-label classifiers, and we report the higher performance for each variable. We train the network end-to-end using Adam optimizer with and (Kingma and Ba, 2014). We train the model with a batch size of 128, and we update our weights with a decaying learning rate of 0.0001. The model weights are regularized with L2 regularization of 0.001.

Due to the limited size of our dataset, we evaluate our model on all the data with K-fold cross validation where $K = 5$, producing a train-test split for every fold with 80% training and 20% testing. We train a model for each fold, predict values on the test set, and, once every fold has been tested, compute our evaluation metrics.
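A minimal sketch of such a cross-validation split is given below. An 80%/20% split per fold corresponds to five folds; the function name `k_fold_splits`, the seed, and the sample count are illustrative assumptions, and the sketch omits the unique-geocode constraint on the test set described in Section 4.3.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(n_samples=100, k=5))
```

Because every sample appears in exactly one test fold, the per-fold predictions can be pooled and the evaluation metrics computed once over the full dataset.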

5. Experimental results

5.1. Evaluation Metrics

Performance of the model was evaluated by a number of metrics: accuracy, F1-score, precision, recall, and AUROC (Christopher Brown, [n. d.]). F1 is calculated as the harmonic mean of precision and recall. AUROC corresponds to the probability that a classifier will rank a randomly chosen positive example higher than a randomly chosen negative example and generally ranges between 0.5 being a random classifier and 1.0 being a perfect predictor.
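The two less familiar metrics can be written directly from their definitions. The sketch below computes AUROC by comparing every positive-negative pair, which is quadratic in the number of examples and intended only as an illustration; the function names and toy scores are assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Precision and recall reported for electricity with Landsat 8 in Table 2.
electricity_f1 = round(f1_score(0.877, 0.870), 3)   # 0.873

# Toy ranking: three of the four positive-negative pairs are ordered correctly.
toy_auroc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])   # 0.75
```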

5.2. Classification Results

In Table 2 we show the classification results our model achieves on each category. Landsat 8 performs better than Sentinel 1 on every category; we show results for Sentinel 1 on our four highest scoring categories. The Balance column displays the proportion of images that have label 1 in that category, for comparison with the accuracy our model achieves.

On electricity and sewerage, our best model achieves AUROC greater than 0.85. In particular, the model achieves an F1 score of 0.873 on electricity. Our model does not perform as effectively on other variables, such as market stalls, health clinic, and police station. This is not surprising since Landsat 8 and Sentinel 1 operate at a resolution lower than that needed to resolve individual objects that signify the presence of these facilities. With the better performing categories, the imagery still cannot resolve individual electricity lines, roads, or water tanks at 30m resolution; however, the structures in aggregate might contribute to different spectral signatures. This means that the classification is likely relying on large-scale proxies, such as urban sprawl and geographical features, that correlate with the class values.

6. Baseline Models

We compare our model performance with several baselines of different input sources. To evaluate the difficulty of the task, we also compare against an ideal baseline that uses (expensive to collect) survey labels to make predictions. We suggest that the oracle defines a reasonable notion of great performance on this dataset.

| Infrastructure | Nightlights | OSM | Spatial | Oracle | L8 |
|---|---|---|---|---|---|
| Electricity | 0.79 | 0.73 | 0.78 | 0.89 | 0.88 |
| Sewerage | 0.75 | 0.77 | 0.78 | 0.89 | 0.86 |
| Piped Water | 0.73 | 0.73 | 0.75 | 0.89 | 0.74 |
| Road | 0.67 | 0.68 | 0.74 | 0.79 | 0.79 |
| Bank | 0.57 | 0.70 | 0.67 | 0.93 | 0.73 |
| Post Office | 0.56 | 0.64 | 0.70 | 0.92 | 0.71 |
| Market Stalls | 0.50 | 0.62 | 0.66 | 0.84 | 0.66 |
| Health Clinic | 0.52 | 0.61 | 0.64 | 0.85 | 0.66 |
| Police Station | 0.54 | 0.63 | 0.66 | 0.90 | 0.65 |

Table 3. We compare our model with four baselines on AUROC scores. Our Landsat 8 models outperform nightlights and OSM models and perform slightly better than or comparably with nearest neighbor spatial interpolation. Performance on three infrastructure outcomes is comparable with the oracle.
Figure 3. Visualization of four satellite imagery sources we use in this paper. Each image is centered around the same geolocation. From left to right: Landsat 8 (multispectral), Sentinel 1 (synthetic aperture radar), DMSP (coarse nightlights), and VIIRS (nightlights). We find that when coupled with the neural network architectures we considered, Landsat 8 is the most informative source of information about infrastructure quality, followed by Sentinel 1.

6.1. Nightlights Intensity

Jean et al. (2016) used nighttime light intensities as a proxy for poverty level. Since poverty and infrastructure are closely related, we use nighttime lights as a baseline predictor for infrastructure level. For example, we expect nightlight intensity to be a good proxy for electricity access. We use nighttime light intensity data from the Defense Meteorological Satellite Program (DMSP) (Center, 2013), imaged in 2013 with a resolution of 30 arc-seconds, and Visible Infrared Imaging Radiometer Suite (VIIRS) (Center, 2015), imaged in 2015 with a resolution of 15 arc-seconds.

For each survey response, we take a DMSP or VIIRS patch of pixels centered on the geolocation. For both sources, this corresponds to roughly a square, which matches the area coverage of the cropped Landsat images we used in our best model. Figure 3 visualizes all four satellite imagery sources used in this paper for one geolocation. We run a logistic regression classifier for each response variable using cross-validated parameters, and we take the prediction of the better-performing nightlights satellite for each variable. Table 3 shows full results from this baseline.

As expected, nightlights perform quite well at predicting electricity (AUROC of 0.79) and have some predictive power for water, sewerage, and roads. However, their performance on other outcomes is only slightly better than random chance (AUROC only a little better than 0.5). Using nightlights thus offers only a limited window into infrastructural provisions by proxying human activity as light emissions, and fails to attend to facilities that may be present without such evidence.

6.2. OpenStreetMap

OpenStreetMap (OSM) is a collaborative project for creating a map of the world with crowdsourced information. Users and organizations upload georeferenced tags about anything they would like to identify on the map. OSM contains a wealth of information on infrastructure where it is available. However, because of its crowdsourced nature, the data is less reliable compared to professional surveys (Helbich et al., 2012).

For each enumeration area in a bounding box around the center geocoordinate, we extract the total number of highways and buildings in OSM. We expand the set of input features for every area with several non-linear transformations on these counts, including , square root, and highway-to-building . We normalize each feature to zero mean and unit standard deviation, and then train logistic regression, support vector machine, and random forest classifiers on the set of input features. The logistic regression performs best.
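The feature expansion and normalization can be sketched as follows. Only the square root transform is taken from the description above; the `log1p` transform, the ratio with an added constant, and the toy counts are assumptions introduced for illustration, and the helper names are hypothetical.

```python
import math
import statistics

def osm_features(highways, buildings):
    """Expand raw OSM counts for one enumeration area into a feature vector."""
    return [
        float(highways),
        float(buildings),
        math.sqrt(highways),
        math.sqrt(buildings),
        math.log1p(highways),            # assumed transform
        math.log1p(buildings),           # assumed transform
        highways / (buildings + 1.0),    # assumed ratio; +1 avoids division by zero
    ]

def standardize(rows):
    """Scale each feature column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Made-up (highways, buildings) counts for four enumeration areas.
counts = [(4, 10), (25, 300), (0, 2), (9, 40)]
features = standardize([osm_features(h, b) for h, b in counts])
```

The standardized rows can then be passed to any off-the-shelf classifier such as logistic regression.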

We display results in Table 3. With AUROCs above 0.7 for electricity, piped water, and sewerage, results indicate meaningful predictive power. OSM performs worse than nightlights in electricity access, but is generally better for other tasks. Surprisingly, the OSM features are not predictive of the road category (AUROC 0.68). Overall, our results indicate that although OSM is imperfect, it does provide some useful insight into infrastructure quality metrics, achieving AUROCs between 0.61 and 0.77, indicating good performance at discriminating high vs. low infrastructure quality enumeration areas across the African continent.

6.3. Spatial Interpolation

We also compute the baseline performance on the infrastructure outcomes using spatial interpolation methods. For each enumeration area we consider how predictive its latitude and longitude are of each infrastructure variable nonparametrically. We uniformly sample 80% of the enumeration areas as the training set, and for each survey response in the test set use nearest neighbor grid interpolation, which labels the example with the label of its closest neighbor. We take the infrastructure value of the neighbor as our predicted value.
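A minimal sketch of nearest neighbor interpolation over coordinates is shown below. The use of great-circle (haversine) distance is an assumption made for illustration, as are the coordinates, labels, and function names.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nearest_neighbor_predict(train_points, train_labels, query):
    """Return the label of the training point closest to the query."""
    nearest = min(range(len(train_points)),
                  key=lambda i: haversine_km(train_points[i], query))
    return train_labels[nearest]

# Made-up training areas near Kampala, Dar es Salaam, and Nairobi.
train_points = [(0.3, 32.6), (-6.8, 39.3), (-1.3, 36.8)]
train_labels = [1, 0, 1]
prediction = nearest_neighbor_predict(train_points, train_labels, (-6.5, 39.0))
```

The linear scan is adequate for a few thousand enumeration areas; a spatial index would be preferable at larger scale.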

Our spatial interpolation model achieves better performance than the nightlights and OSM models but has lower performance compared with the Landsat 8 models on the highest performing infrastructure outcomes, especially in electricity and sewerage. This indicates that though geographic location is a non-negligible predictor of infrastructure development, satellite imagery is able to extract deeper and more useful insights. Additionally, since spatial interpolation methods are one long-established approach in survey data interpolation (Reibel, 2007), we suggest that our model can be used to improve how survey samples are interpolated to larger regions.

Figure 4. Predictions from left to right: true positive piped water (Egypt, urban), false positive piped water (Malawi, rural), true positive electricity (Burundi, urban), false positive electricity (Burkina Faso, rural).

6.4. Cross-label Predictions

Finally, we construct an oracle baseline to assess the discriminatory performance between the high quality infrastructure labels. In this baseline, we use all observed variables in the survey data to predict one held-out variable. That is, if there are $n$ infrastructure quality variables $y_1, \dots, y_n$, we learn parameters $w_i$ and $b_i$ such that for all $i$,

$$\hat{y}_i = \sigma\left( w_i^\top y_{-i} + b_i \right)$$

where $y_{-i}$ denotes the vector of all labels except $y_i$ and $\sigma$ is the sigmoid function. We optimize $w_i$ and $b_i$ with cross-entropy loss. We find that infrastructure categories have high predictability between the infrastructure labels, shown in Table 3. Electricity, piped water, and sewerage achieve 0.89 AUROC.
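The oracle amounts to one logistic regression per held-out label. The sketch below fits such a model with batch gradient descent on a toy label matrix; the optimizer settings, the toy data, and the function names are assumptions for illustration and do not reproduce our experimental setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_oracle(rows, target, epochs=500, lr=0.5):
    """Fit label `target` from all other labels by batch gradient descent."""
    others = [[v for j, v in enumerate(r) if j != target] for r in rows]
    ys = [r[target] for r in rows]
    w, b = [0.0] * len(others[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(others, ys):
            # Gradient of the cross-entropy loss with respect to the logit.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(ys) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(ys)
    return w, b

def predict(w, b, row, target):
    """Predicted probability for label `target`; row[target] is ignored."""
    x = [v for j, v in enumerate(row) if j != target]
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy survey: label 0 always agrees with label 1, label 2 is uninformative.
rows = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
w, b = fit_oracle(rows, target=0)
p_high = predict(w, b, [0, 1, 0], target=0)
p_low = predict(w, b, [0, 0, 1], target=0)
```

On this toy data the fitted model assigns a probability above 0.5 when the correlated label is present and below 0.5 when it is absent.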

These results offer a useful comparison for our model. If our model achieves performance similar to the oracle’s, then its predictive power is as potent as if it predicted a set of concrete infrastructure labels that were correlated with the target outcome variable. Additionally, the oracle represents the predictive performance using expensive and limited survey data, whereas satellite imagery is cheap and widely available. Our best model on electricity, sewerage, and road performs comparably to the oracle.

7. Generalization Capabilities

Ultimately, we are interested in deploying our deep learning approach to provide high-resolution maps of infrastructure quality that can be updated frequently based on relatively inexpensive remote sensing data. To this end, we evaluate the generalization capabilities of the model where we attempt to make predictions on data the model has not been explicitly trained on.

7.1. Urban-Rural Split

Each enumeration area in the Afrobarometer Round 6 survey is classified as being urban or rural. Urban and rural areas in Africa have significantly different infrastructural provisions. Urban areas are associated with improved water, and access to sanitation facilities is twice as great in urban areas compared with rural areas (Bentley et al., 2015). Additionally, urban and rural areas have large visual differences in the satellite imagery that make them likely to be correlated with the other infrastructure metrics.

We measure the simple matching coefficient over all enumeration areas between the urban/rural variable and several infrastructure quality variables. Given binary variables $A$ and $B$, the simple matching coefficient over a sampled set of paired observations is the number of observations on which $A$ and $B$ take the same value divided by the number of samples in the set. The simple matching coefficients between urban/rural and electricity, sewerage, and piped water are 0.70, 0.79, and 0.71 respectively.
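The simple matching coefficient is straightforward to compute; a small sketch with made-up indicator values follows. The function name and the toy vectors are illustrative assumptions.

```python
def simple_matching_coefficient(a, b):
    """Fraction of paired binary observations on which a and b agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# Made-up indicators for five enumeration areas.
urban = [1, 1, 0, 0, 1]
electricity = [1, 1, 0, 1, 1]
smc = simple_matching_coefficient(urban, electricity)   # 0.8
```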

To address the concern that our model might be classifying the urban and rural indicators as a proxy for infrastructure quality, we evaluate the performance of our model on infrastructure variables within the urban and rural classes. Table 4 shows the classification results on our best performing infrastructure metrics when the model is trained on only urban or rural areas. The AUROC scores are lower for all infrastructure variables, but not by enough to suggest that the model in the original classification task is exclusively learning to classify based on the urban/rural category. The AUROC of our highest performing outcome electricity drops from 0.88 across both classes to 0.76 in the urban class and 0.82 in the rural class.

| Satellite | Infrastructure | Urban/Rural | Balance | Accuracy | F1 Score | Precision | Recall | AUROC |
|---|---|---|---|---|---|---|---|---|
| L8 | Electricity | Urban | 0.953 | 0.861 | 0.923 | 0.971 | 0.880 | 0.763 |
| L8 | Electricity | Rural | 0.454 | 0.741 | 0.724 | 0.701 | 0.748 | 0.816 |
| L8 | Sewerage | Urban | 0.661 | 0.687 | 0.729 | 0.853 | 0.636 | 0.794 |
| L8 | Sewerage | Rural | 0.089 | 0.897 | 0.430 | 0.425 | 0.436 | 0.807 |
| L8 | Piped Water | Urban | 0.861 | 0.807 | 0.885 | 0.907 | 0.864 | 0.758 |
| L8 | Piped Water | Rural | 0.408 | 0.628 | 0.599 | 0.535 | 0.680 | 0.686 |

Table 4. Results on electricity, sewerage, and piped water when we stratify urban vs. rural areas. The model still performs well, indicating that it is not simply distinguishing urban and rural areas but is actually able to explain the variation within these classes.

7.2. Country Hold-out

In the original classification task, we sample our training and test sets uniformly among all 36 countries. With high probability, every country has data points that appear in the training set. However, we would also like to know whether deploying our model in an unobserved country leads to similarly strong classification results.

We perform an experiment in which we validate our model on countries that it has not been trained on. In this experiment, we train on the enumeration areas of 35 countries, holding out one country, and then test our model on the enumeration areas of the held-out country.
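The hold-out split itself is simple to express. The following is a minimal sketch under assumed data structures: the record fields (`country`, `ea_id`, `electricity`) and the function name are hypothetical and stand in for whatever representation of enumeration areas is used in practice.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def country_holdout_split(
    records: List[Record], held_out: str
) -> Tuple[List[Record], List[Record]]:
    """Train on every country except `held_out`; test only on `held_out`."""
    train = [r for r in records if r["country"] != held_out]
    test = [r for r in records if r["country"] == held_out]
    if not test:
        raise ValueError(f"no records for held-out country {held_out!r}")
    return train, test


# Toy records; field names and values are hypothetical.
records = [
    {"country": "Uganda", "ea_id": 1, "electricity": 0},
    {"country": "Uganda", "ea_id": 2, "electricity": 1},
    {"country": "Kenya", "ea_id": 3, "electricity": 1},
    {"country": "Tanzania", "ea_id": 4, "electricity": 0},
    {"country": "Kenya", "ea_id": 5, "electricity": 1},
]

train, test = country_holdout_split(records, "Uganda")
print(len(train), "training areas,", len(test), "held-out test areas")
```

Splitting by country rather than by enumeration area guarantees that no image from the test country influences training.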

Table 5 shows the results on the held-out countries Uganda, Tanzania, and Kenya, three of the countries with the most representation in the Afrobarometer survey. We train with strong regularization to prevent overfitting to the training countries. The results are not as strong as with the uniform sampling strategy; for example, AUROC on electricity falls from 0.853 on the test set when training with uniform sampling to 0.637 on Ugandan enumeration areas when Uganda is held out.

| Country | Infrastructure | Balance | Accuracy | AUROC |
|---|---|---|---|---|
| Uganda | Electricity | 0.348 | 0.464 | 0.637 |
| Uganda | Sewerage | 0.076 | 0.424 | 0.774 |
| Uganda | Piped Water | 0.268 | 0.527 | 0.638 |
| Tanzania | Electricity | 0.521 | 0.500 | 0.541 |
| Tanzania | Sewerage | 0.103 | 0.502 | 0.578 |
| Tanzania | Piped Water | 0.432 | 0.445 | 0.588 |
| Kenya | Electricity | 0.846 | 0.703 | 0.518 |
| Kenya | Sewerage | 0.137 | 0.714 | 0.813 |
| Kenya | Piped Water | 0.418 | 0.473 | 0.602 |

Table 5. Country hold-out results. We evaluate the performance of our model in a country not seen during training, simulating a realistic but challenging deployment situation. Compared to Table 2, performance drops but the model maintains its usefulness for some infrastructure variables.

This drop in performance is expected: the sample space of satellite images differs between countries, so enumeration areas in different countries likely have geographic differences that make the salient features for classification less predictive.

7.3. Fine-tuning Held-out Countries

Though the results for the country hold-out experiments suggest that the model does not immediately generalize to new countries, we aim to show that transfer learning with a small, labeled sample in a new country generalizes those predictions to a significantly larger sample if the distribution of the training sample is representative.

In this experiment, we repeat the procedure of the country hold-out experiments, but fine-tune the trained model on samples from the held-out country. We train with L2 regularization of 0.1 and with different proportions of uniformly sampled data from the held-out country, from 0% up to 80%. The lower end (0%) is equivalent to the hold-out experiment, and the upper end (80%) trains with the same amount of data as if the country had been included in the initial training phase. We freeze the ResNet's parameters, take the activations of its final layer as visual features, and train a logistic regression on those features to predict the class label. We require the training and testing distributions to have the same proportion of positive labels.
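The procedure can be sketched as a class-stratified split followed by an L2-regularized logistic regression on fixed feature vectors. The code below is an illustrative sketch only: the two-dimensional synthetic features stand in for frozen ResNet activations, the way the 0.1 penalty enters the loss, the learning rate, and the epoch count are all assumptions, and every function name is hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], train_frac: float, seed: int = 0):
    """Sample train_frac of each class so both parts share the positive rate."""
    rng = random.Random(seed)
    train_idx: List[int] = []
    test_idx: List[int] = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        k = round(train_frac * len(idx))
        train_idx += idx[:k]
        test_idx += idx[k:]
    return sorted(train_idx), sorted(test_idx)


def fit_logistic(features, labels, l2=0.1, lr=0.5, epochs=500) -> Tuple[List[float], float]:
    """Batch gradient descent on mean log-loss + (l2 / 2) * ||w||^2 (bias unpenalized)."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = predict_proba(w, b, x) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Sigmoid of the linear score."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic stand-in for frozen CNN features of one held-out country.
rng = random.Random(0)
labels = [1] * 30 + [0] * 30
features = [[(1.0 if y else -1.0) + rng.gauss(0.0, 0.3) for _ in range(2)] for y in labels]

# Fit on 20% of the held-out country and evaluate on the remaining 80%.
train_idx, test_idx = stratified_split(labels, train_frac=0.2)
w, b = fit_logistic([features[i] for i in train_idx], [labels[i] for i in train_idx])
correct = sum((predict_proba(w, b, features[i]) >= 0.5) == bool(labels[i]) for i in test_idx)
accuracy = correct / len(test_idx)
print(f"test accuracy with 20% of the country's data: {accuracy:.2f}")
```

Because the feature extractor is frozen, only the weights of the linear classifier are fitted, which is why a small labeled sample can suffice. Repeating the last four lines for several values of `train_frac` would trace out a curve like the one described in this experiment.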

Figure 5. AUROC scores on held-out countries when fine-tuned on samples of data. The x-axis corresponds to the percentage of data from the held-out country that the model was fine-tuned with.

Our best results on Uganda, Tanzania, and Kenya show that when each of these countries is held out, only 20% of the country’s data is needed to yield approximately the same AUROC as if the country had been sampled uniformly in the training set. Additionally, training with 80% of the country’s data yields test scores as good as or better than the average over all the other countries that the model was trained on, with AUROC up to 0.96 and accuracy up to 92%.

Figure 5 shows the AUROC results for Uganda, Tanzania, and Kenya on the electricity category as a function of the amount of data trained on. These results suggest that the model can be fine-tuned on limited data in a new country and still achieve good performance on a much larger test set in that country.

8. Conclusion and Future Work

Data on infrastructure quality outcomes in developing countries is lacking, and this work explored the use of globally available remote sensing data for predicting such outcomes. Using Afrobarometer survey data, we introduced a deep learning approach that demonstrates good predictive ability.

We experimented with Landsat 8 (multispectral) and Sentinel 1 (SAR) data, and obtained the best overall performance on Landsat 8. We believe the superior performance of Landsat 8 when compared to Sentinel 1 follows from Landsat 8 having RGB bands, allowing it to better use the ResNet’s pretrained parameters. Sentinel 1 has no RGB bands.

We found the best performance on electricity, sewerage, piped water, and roads. Accuracy far surpasses balance and random prediction, and the AUROC scores are greater than 0.85 on electricity and sewerage. The model significantly outperforms the OSM and nearest neighbor interpolation baselines on these two variables by an average of 0.1 AUROC. Results also surpass the nightlights baseline. Furthermore, these results are on par with the oracle baseline, indicating that the model is making meaningful and accurate predictions. Intuitively, and as our models show, these variables are feature-defined structures and infrastructure systems that are uniquely distinguishable from satellite imagery. Figure 1 shows the distribution of all test set predictions for sewerage, electricity, and piped water.

The first two images from left to right in Figure 3 show sample satellite images where our model predicted a true positive and a false positive, respectively, for access to piped water. The former shows a clear indication of high activity, with a large swath of developed buildings and roads, while the latter may have confused the model because of a high concentration of activity at the center of the image. Similarly, the two images on the right show true and false positive predictions for electricity. Both sets of predictions demonstrate a similar proclivity for developed areas of buildings and roads.

We found poor performance on outcomes like market stalls, health clinics, and police stations; such outcomes barely outperformed random guessing and often underperformed the OSM baseline. This makes sense, as there are few features from which to resolve the presence of these particular buildings in satellite imagery, and OSM data may offer more insightful features.

The model exhibits high confidence in most predictions, and training performance is significantly better than testing performance, but we do not observe total overfitting. Hyperparameter tuning was not able to resolve these issues while maintaining optimal model performance. Turning to the data, we found that images can appear highly similar even if they have different classifications. Possible solutions to this problem include both more data and deeper, more flexible models, although without sufficient data the latter approach risks overfitting.

Our results demonstrate an exciting step forward in remote sensing and infrastructure mapping, far surpassing the OSM baseline. However, this task is presently underexplored, and we believe further improvements could be made. For the model itself, Sentinel 1 performance could likely be improved, and more data could provide superior performance. Furthermore, transfer learning from other datasets, such as OSM, to these tasks offers a potential way to create a more effective model by learning to associate ground-level features and other observations with satellite imagery.

Within the more general task of infrastructure mapping, we have also identified valuable future work. First, using the previous rounds of Afrobarometer, this model could be tested on its ability to generalize temporally. The ability to extrapolate how infrastructure has developed over time using contemporaneous imagery would be another exciting step in development. Second, a model that simultaneously trains using images from different satellites is worthy of further investigation. Third, the 10m and 30m resolution imagery used in this project is far from the resolution of today’s satellites. We expect that higher resolution data would lead to better results and believe such an approach worthy of investigation. Finally, a model that could take into account prior beliefs about infrastructure availability could offer a powerful tool for practical use.

For all these endeavors, data will be a core issue. The quality of a deep model relies heavily on the availability of adequate data, and a large focus should be placed on making better use of existing image and survey data through strong cataloging and collating efforts. Nevertheless, our results provide a proof of concept that satellite imagery can be used to predict infrastructure quality.

9. Acknowledgments

We would like to acknowledge Zhongyi Tang, Hamza Husain, and George Azzari for support in data collection, and the Stanford Center on Global Poverty and Development for financial support.

