What Goes Where: Predicting Object Distributions from Above
In this work, we propose a cross-view learning approach, in which images captured from a ground-level view are used as weakly supervised annotations for interpreting overhead imagery. The outcome is a convolutional neural network for overhead imagery that is capable of predicting the type and count of objects that are likely to be seen from a ground-level perspective. We demonstrate our approach on a large dataset of geotagged ground-level and overhead imagery and find that our network captures semantically meaningful features, despite being trained without manual annotations.
Connor Greenwell, Scott Workman, Nathan Jacobs
Department of Computer Science, University of Kentucky, USA
Index Terms— weak supervision, semantic transfer
The goal of remote sensing is to use imagery to obtain some understanding of a particular location. Observations obtained from satellites and aerial imaging have long been used to monitor the Earth’s surface, for example, to map land use, predict the weather, understand urban infrastructure, and enable precision agriculture. A key challenge is that obtaining labeled training data for new tasks can be prohibitively expensive, especially if many manual annotations are required.
Recently, a significant amount of work has explored how deep learning techniques can be applied to remotely sensed data (see [deepRS] for a comprehensive review). We propose to use overhead imagery to understand the type and quantity of objects one would expect to see at a particular location. Instead of acquiring manual annotations, we consider labels inferred from nearby geotagged social media. Specifically, we use an off-the-shelf object detector, applied to ground-level imagery, to learn to interpret overhead imagery, a special case of what we call cross-view semantic transfer [zhai2017predicting]. Cross-view training approaches have been applied to a variety of tasks, for example, image geolocalization [workman2015wide], image-driven mapping [workman2017beauty], and constructing aural atlases [salem2018soundscape].
In our approach, we train a convolutional neural network, operating on overhead imagery, to predict the distribution of objects obtained from co-located ground-level imagery. We present results which demonstrate how such an approach captures visual representations that are geo-informative, despite being trained without manual annotations.
2 Problem Statement
The goal of this work is to estimate the expected distribution of objects for a given location, $P(h \mid l)$, where $h$ is a histogram of objects and $l$ is a geographic location. Estimating this distribution by directly conditioning on geographic location is challenging, because it would require that the distribution essentially memorize the entire Earth. We attempt to overcome this challenge by conditioning the distribution on the overhead image of a location. This makes intuitive sense because it is often possible to infer the type of objects that would be present at a particular location from an overhead perspective. Therefore, we focus on a particular form of this distribution, specifically, $P(h \mid I(l))$, where $I(l)$ is an overhead image, perhaps captured from a satellite or an airplane, of the location, $l$. In the remainder of this section, we describe how we constructed a dataset to support learning such a conditional distribution.
To construct a suitable dataset for evaluating our proposed methods, we begin with the Cross-View USA (CVUSA) dataset [workman2015wide], which was originally created to support training models for image geolocalization. It consists of 1,588,655 geotagged ground-level images, 551,851 of which are from Flickr; the remainder are from Google Street View. While there are three overhead images for each ground-level image, we only use the one with the highest resolution.
We use the Faster R-CNN ResNet 101 [ren2015faster] detector trained on the MS-COCO challenge dataset [lin2014microsoft] to detect objects in the CVUSA Flickr images. The activations from the final output layer are thresholded at . Instances of each class with score above the threshold are tallied up to form a histogram describing the objects present in each image. This was implemented in the TensorFlow Object Detection API [huang2016speed].
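The tallying step described above can be illustrated with a minimal sketch. It assumes detections arrive as (class_id, score) pairs; the function name, the input format, and the threshold value of 0.5 are hypothetical choices for illustration and are not taken from the paper or from the TensorFlow Object Detection API.

```python
NUM_CLASSES = 91        # size of the label space, matching the 91D outputs used later
SCORE_THRESHOLD = 0.5   # hypothetical value; the threshold is not stated in this text


def detections_to_histogram(detections, threshold=SCORE_THRESHOLD,
                            num_classes=NUM_CLASSES):
    """Count detections per class whose score exceeds the threshold.

    detections: iterable of (class_id, score) pairs (assumed format).
    Returns a list of length num_classes holding per-class counts.
    """
    hist = [0] * num_classes
    for class_id, score in detections:
        if score > threshold:
            hist[class_id] += 1
    return hist


# Toy detections for one ground-level image (made-up values).
detections = [(1, 0.92), (1, 0.81), (3, 0.40), (9, 0.77)]
hist = detections_to_histogram(detections)
```

Each ground-level image then contributes one such histogram as the training target for its co-located overhead image.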
In Figure 1, we show two histograms of object counts for ground-level images in the CVUSA dataset; the counts range from 0 to 78. A majority of the images in the dataset have at least one object detected. On average, each image contains 2.63 objects, excluding images with zero detections. The most frequently detected object category is person. As a baseline approach to mapping object distributions, Figure 2 shows a locally weighted average of object counts from each ground-level image. Several interesting patterns emerge, such as the extensive rail network near Chicago and major truck routes across the United States. While these patterns reflect our expectations, due to the sparsity of the imagery it is not possible to construct a high-resolution map in this manner.
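As a rough illustration of the baseline, a locally weighted average can be sketched as a kernel-weighted mean of nearby counts. The Gaussian kernel, the planar distance, and the bandwidth parameter are assumptions made for this sketch; the text does not specify which weighting was used.

```python
import math


def locally_weighted_count(query, samples, bandwidth=1.0):
    """Gaussian-kernel weighted mean of object counts around a query point.

    query: (x, y) location; samples: list of ((x, y), count) pairs.
    Coordinates are treated as planar, which is a simplification.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), count in samples:
        sq_dist = (x - query[0]) ** 2 + (y - query[1]) ** 2
        weight = math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        weighted_sum += weight * count
        weight_total += weight
    # If every weight underflows to zero there is no nearby evidence.
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Two toy observations: four objects at the origin, none at (10, 10).
samples = [((0.0, 0.0), 4), ((10.0, 10.0), 0)]
near_origin = locally_weighted_count((0.0, 0.0), samples)
midpoint = locally_weighted_count((5.0, 5.0), samples)
```

Because the estimate at any location depends on how many geotagged images fall within the kernel, sparse regions yield coarse or unreliable values, which is the limitation noted above.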
3 Learning to Predict Object Distributions
In this section, we describe our approach, which we call WhatGoesWhere (WGW), for predicting the geospatial distribution of ground-level objects. We use the cross-view learning framework, in which we train a network to interpret overhead imagery by having it predict features extracted from ground-level images. This allows us to learn to extract useful features from overhead imagery without the need for manual annotation.
Our model for predicting ground-level object counts from an overhead image (WGW-P) is based on the ResNet50 architecture [he2016deep]. We append two 2048D Dense-BatchNorm-LeakyReLU layers and a final 91D Dense layer. The outputs of this final layer encode the parameters of a collection of 91 Poisson distributions over object counts, one distribution per MS-COCO object category. We also train two additional models: the first based on the Negative Binomial distribution (WGW-NB) with two 91D output layers, and the second based on the Gaussian distribution (WGW-G), also with two 91D output layers. The final output layers of each model are passed through a softplus to ensure that the outputs are strictly greater than zero.
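To make the role of the softplus concrete, the following sketch maps raw final-layer outputs to Poisson rate parameters. The function names are hypothetical, and the example uses three values where the model would produce 91.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def poisson_rates(raw_outputs):
    """Map unconstrained network outputs to positive Poisson rates."""
    return [softplus(v) for v in raw_outputs]


# Three toy raw outputs standing in for the 91D final layer.
rates = poisson_rates([-5.0, 0.0, 5.0])
```

For a Poisson distribution the rate is also the mean, so each rate can be read directly as the expected count for its category. In floating point, very negative inputs can still round to zero, so an implementation may want to add a small constant before taking logarithms.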
We initialize the ResNet50 portion of the model with weights trained on the ImageNet task [russakovsky2015imagenet], and the subsequent Dense layers with Glorot Uniform random noise [bengio2011deep]. During training we minimize the mean negative log likelihood of the resulting distributions. All models were trained using the Nesterov-Adam optimizer with a learning rate of .
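For the Poisson variant, the training objective can be written out as follows. The notation is introduced here for illustration: $\lambda_{i,c}$ is the predicted rate for category $c$ given the $i$-th overhead image, $h_{i,c}$ is the corresponding detected count, and $N$ is the number of training pairs. The sketch assumes the 91 categories are treated as independent, so the per-category terms add.

```latex
\mathcal{L}
  = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{91} \log \mathrm{Poisson}\!\left(h_{i,c} \mid \lambda_{i,c}\right)
  = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{91}
    \left[ \lambda_{i,c} - h_{i,c} \log \lambda_{i,c} + \log\!\left(h_{i,c}!\right) \right]
```

The last term does not depend on the network outputs, so it affects the reported value of the loss but not its gradients.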
We evaluated our WGW models quantitatively on a randomly selected subset (25%) of the ground-level images in the CVUSA dataset, each with an object histogram and corresponding overhead image. For each overhead image, we predict the parameters of a probability distribution over object counts. Then, for each ground-level image, we measure the likelihood of the empirical object counts under the predicted distribution. Table 1 shows the mean log-likelihood on the test set for each model. The model based on the Poisson distribution, WGW-P, achieves the highest log-likelihood. Therefore, we focus on this model in the remaining evaluation.
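The evaluation metric can be sketched for the Poisson case as below. This is an illustrative reading of the procedure, assuming per-category log-probabilities are summed within an image and then averaged over images; the function names and toy values are hypothetical.

```python
import math


def poisson_log_pmf(count, rate):
    """Log-probability of an integer count under a Poisson with the given rate."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def mean_log_likelihood(rates_per_image, counts_per_image):
    """Average, over images, of the summed per-category log-likelihood."""
    total = 0.0
    for rates, counts in zip(rates_per_image, counts_per_image):
        total += sum(poisson_log_pmf(k, r) for k, r in zip(counts, rates))
    return total / len(rates_per_image)


# One toy image with two categories: observed counts 0 and 2.
observed = [[0, 2]]
well_matched = mean_log_likelihood([[1.0, 2.0]], observed)
poorly_matched = mean_log_likelihood([[1.0, 10.0]], observed)
```

Predicted rates that sit closer to the observed counts yield a higher (less negative) mean log-likelihood, which is the sense in which the models are ranked.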
We qualitatively evaluate WGW-P using the dense overhead imagery in the San Francisco database [workman2015geocnn] to generate fine-grained maps over a large, diverse region. Figure 3 visualizes the results of this experiment as a heatmap of expected counts for a subset of object classes. We observe that the model learns to discriminate using visual cues found in overhead imagery and that the results appear to be geographically consistent. For example, Figure 3 (f) shows that cars are most likely to be found in urban areas, while (b), (d), and (g) show that boats, surfboards, and birds are all found around major bodies of water. These heatmaps are much higher resolution than those shown in Figure 2.
To further visualize what WGW-P has learned, we present the overhead images from San Francisco which maximize the expected count of several object categories in Figure 4. For example, the overhead image for person is of a stadium, surfboard is of a beach, and airplane is of an airport.
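Selecting the images shown in such a figure amounts to an argmax over predicted expected counts for each category. A minimal sketch, assuming predictions are stored as a mapping from an image identifier to a list of expected counts (the data layout, identifiers, and values are hypothetical):

```python
def top_image_per_category(predicted_rates, category_names):
    """Return, for each category, the image id with the largest expected count.

    predicted_rates: dict mapping image id -> list of expected counts,
    ordered consistently with category_names.
    """
    best = {}
    for idx, name in enumerate(category_names):
        best[name] = max(predicted_rates,
                         key=lambda image_id: predicted_rates[image_id][idx])
    return best


# Toy predictions for two overhead tiles and two categories.
predicted = {
    "tile_a": [5.0, 0.1],
    "tile_b": [0.2, 3.0],
}
best = top_image_per_category(predicted, ["person", "surfboard"])
```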
In Figure 5 we show the results of performing k-means clustering () on the predicted Poisson distribution parameters for this area. As shown, this process results in visually coherent regions where one can expect to find objects in similar quantities. For example, the red cluster appears to most highly correlate with aquatic areas.
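The clustering step can be sketched with a plain implementation of Lloyd's algorithm applied to per-location rate vectors. This is a generic k-means sketch and not the authors' code; the number of clusters, the seeded random initialization, the iteration count, and the two-dimensional toy vectors are all assumptions for illustration.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns (assignments, centers) for a list of vectors."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centers


# Toy rate vectors (two categories instead of 91) from four locations.
rate_vectors = [[0.1, 5.0], [0.2, 5.2], [4.0, 0.1], [4.1, 0.2]]
assignments, centers = kmeans(rate_vectors, k=2)
```

Locations whose predicted rates are similar end up in the same cluster, which is what produces spatially coherent regions when the assignments are drawn on a map.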
We proposed to use an off-the-shelf object detector, applied to ground-level imagery, to learn to interpret overhead imagery, a special case of what we call cross-view semantic transfer. The idea was to use pairs of co-located overhead and ground-level images and train a CNN to predict the distribution of objects in the ground-level image using only the overhead image. We demonstrated that this approach is able to capture rich and subtle differences between locations. In some sense, what we learned is similar to land cover and land use classification; the difference is that our approach does not require us to commit to a particular set of classes in advance. Instead, the mixture of object types that are likely to occur in an area informs the types of representations we learn. There are many directions for future work, including applying this strategy to other off-the-shelf methods for interpreting ground-level imagery and using it as a pre-training strategy for a wide variety of overhead image interpretation tasks.
We gratefully acknowledge the support of an NSF CAREER award (IIS-1553116).