What Goes Where: Predicting Object Distributions from Above


In this work, we propose a cross-view learning approach, in which images captured from a ground-level view are used as weakly supervised annotations for interpreting overhead imagery. The outcome is a convolutional neural network for overhead imagery that is capable of predicting the type and count of objects that are likely to be seen from a ground-level perspective. We demonstrate our approach on a large dataset of geotagged ground-level and overhead imagery and find that our network captures semantically meaningful features, despite being trained without manual annotations.

Connor Greenwell, Scott Workman, Nathan Jacobs
Department of Computer Science, University of Kentucky, USA

Index Terms—  weak supervision, semantic transfer

1 Introduction

The goal of remote sensing is to use imagery to obtain some understanding of a particular location. Observations obtained from satellites and aerial imaging have long been used to monitor the Earth’s surface, for example to map land use, predict the weather, understand urban infrastructure, and enable precision agriculture. A key challenge is that obtaining labeled training data for new tasks can be prohibitively expensive, especially if many manual annotations are required.

Recently, a significant amount of work has explored how deep learning techniques can be applied to remotely sensed data (see [deepRS] for a comprehensive review). We propose to use overhead imagery to understand the type and quantity of objects one would expect to see at a particular location. Instead of acquiring manual annotations, we consider labels inferred from nearby geotagged social media. Specifically, we use an off-the-shelf object detector, applied to ground-level imagery, to learn to interpret overhead imagery, a special case of what we call cross-view semantic transfer [zhai2017predicting]. Cross-view training approaches have been applied to a variety of tasks, for example image geolocalization [workman2015wide], image-driven mapping [workman2017beauty], and constructing aural atlases [salem2018soundscape].

In our approach, we train a convolutional neural network, operating on overhead imagery, to predict the distribution of objects obtained from co-located ground-level imagery. We present results which demonstrate how such an approach captures visual representations that are geo-informative, despite being trained without manual annotations.

2 Problem Statement

Fig. 1: Object count histograms in the CVUSA dataset. (top) This histogram shows that most images contain very few objects. (bottom) This histogram shows that person, car, and truck are the most frequently detected object types.
Fig. 2: A low resolution heatmap of object frequency generated using our baseline method (Section 2.1), shown for (a) Person, (b) Train, and (c) Truck. Greener (darker) locations mean that an image at that location will typically feature more of that type of object.

The goal of this work is to estimate the expected distribution of objects for a given location, $P(H \mid l)$, where $H$ is a histogram of objects and $l$ is a geographic location. Estimating this distribution by directly conditioning on geographic location is challenging, because it would require that the distribution essentially memorize the entire Earth. We attempt to overcome this challenge by conditioning the distribution on the overhead image of a location. This makes intuitive sense because it is often possible to infer the types of objects that would be present at a particular location from an overhead perspective. Therefore, we focus on a particular form of this distribution, specifically $P(H \mid I(l))$, where $I(l)$ is an overhead image, perhaps captured from a satellite or an airplane, of the location $l$. In the remainder of this section, we describe how we constructed a dataset to support learning such a conditional distribution.

2.1 Dataset

To construct a suitable dataset for evaluating our proposed methods, we begin with the Cross-View USA (CVUSA) dataset [workman2015wide], which was originally created to support training models for image geolocalization. It consists of 1,588,655 geotagged ground-level images, 551,851 of which are from Flickr; the remainder are from Google Street View. While there are three overhead images for each ground-level image, we only use the one with the highest resolution.

We use the Faster R-CNN ResNet 101 [ren2015faster] detector trained on the MS-COCO challenge dataset [lin2014microsoft] to detect objects in the CVUSA Flickr images. The activations from the final output layer are thresholded at . Instances of each class with score above the threshold are tallied up to form a histogram describing the objects present in each image. This was implemented in the TensorFlow Object Detection API [huang2016speed].
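The tallying step can be illustrated with a minimal sketch. It assumes the detector output for one image has already been reduced to (category, score) pairs; the function name, the example categories, and the threshold of 0.5 are hypothetical and are not taken from the original pipeline.

```python
from collections import Counter


def detections_to_histogram(detections, threshold, categories):
    """Count detections per category whose score exceeds the threshold.

    detections: iterable of (category_name, score) pairs for one image.
    threshold:  score cutoff (hypothetical value chosen by the caller).
    categories: ordered list of category names defining the histogram bins.
    """
    counts = Counter(name for name, score in detections if score > threshold)
    return [counts.get(name, 0) for name in categories]


# Toy example with three of the MS-COCO categories.
categories = ["person", "car", "truck"]
detections = [("person", 0.92), ("person", 0.40), ("car", 0.81), ("truck", 0.10)]
histogram = detections_to_histogram(detections, threshold=0.5, categories=categories)
print(histogram)
```

Each ground-level image then contributes one fixed-length count vector, which is the form of label used for training.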

In Figure 1, we show two histograms of object counts for ground-level images in the CVUSA dataset. Per-image object counts range from 0 to 78. A majority of the images in the dataset have at least one object detected. On average each image contains 2.63 objects, excluding images with zero detections. The most frequently detected object category is person. As a baseline approach to mapping object distributions, Figure 2 shows a simple locally weighted average of object counts from each ground-level image. Several interesting patterns emerge, such as the extensive rail network near Chicago and major truck routes across the United States. While these patterns reflect our expectations, due to the sparsity of the imagery it is not possible to construct a high-resolution map in this manner.
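One way to realize such a locally weighted average is sketched below. The Gaussian kernel, the bandwidth parameter, and the treatment of latitude and longitude as planar coordinates are all assumptions made for illustration; the text does not specify the weighting function.

```python
import math


def locally_weighted_count(query, samples, bandwidth_deg):
    """Gaussian-kernel weighted average of object counts around a query point.

    query:         (lat, lon) of the map cell being estimated.
    samples:       list of ((lat, lon), count) pairs from ground-level images.
    bandwidth_deg: kernel width in degrees (hypothetical parameter).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (lat, lon), count in samples:
        # Squared planar distance in degrees (a simplification).
        d2 = (lat - query[0]) ** 2 + (lon - query[1]) ** 2
        weight = math.exp(-d2 / (2.0 * bandwidth_deg ** 2))
        weighted_sum += weight * count
        weight_total += weight
    # With no nearby imagery the estimate is undefined; return 0.0 here.
    return weighted_sum / weight_total if weight_total > 0 else 0.0


samples = [((38.0, -84.5), 2), ((38.1, -84.4), 4)]
estimate = locally_weighted_count((38.05, -84.45), samples, bandwidth_deg=0.5)
```

Because the estimate depends entirely on nearby ground-level images, sparse regions yield either no estimate or a heavily smoothed one, which is the limitation noted above.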

3 Learning to Predict Object Distributions

In this section, we describe our approach, which we call WhatGoesWhere (WGW), for predicting the geospatial distribution of ground-level objects. We use the cross-view learning framework, in which we train a network to interpret overhead imagery by having it predict features extracted from ground-level images. This allows us to learn to extract useful features from overhead imagery without the need for manual annotation.

Our model for predicting ground-level object counts from an overhead image (WGW-P) is based on the ResNet50 architecture [he2016deep]. We append two 2048D Dense-BatchNorm-LeakyReLU layers and a final 91D Dense layer. The outputs of this final layer encode the parameters of a collection of 91 Poisson distributions over object counts, one distribution per MS-COCO object category. We also train two additional models: the first based on the Negative Binomial distribution (WGW-NB) and the second based on the Gaussian distribution (WGW-G), each with two 91D output layers. The final output layers of each model are passed through a softplus to ensure that the outputs are strictly greater than zero.
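The role of the softplus can be shown with a small sketch in plain Python, independent of any deep learning framework. The function names and the toy inputs are hypothetical; the sketch only illustrates how raw output values map to positive Poisson rates.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def poisson_rates(raw_outputs):
    """Map raw final-layer outputs to positive Poisson rate parameters."""
    return [softplus(x) for x in raw_outputs]


# Toy raw outputs for three categories (the model itself has 91).
raw = [-3.0, 0.0, 2.5]
rates = poisson_rates(raw)
```

For large positive inputs the softplus is close to the identity, while negative inputs are mapped to small positive rates, so the predicted expected count never becomes negative.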

We initialize the ResNet50 portion of the model with weights trained on the ImageNet task [russakovsky2015imagenet], and the subsequent Dense layers with Glorot Uniform random noise [bengio2011deep]. During training we minimize the mean negative log likelihood of the resulting distributions. All models were trained using the Nesterov-Adam optimizer with a learning rate of .
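For the Poisson model, the training objective can be sketched as follows. Summing over categories within an image and averaging over images is an assumption about how the mean is taken; the function names are hypothetical.

```python
import math


def poisson_log_pmf(count, rate):
    """Log-probability of an integer count under a Poisson with the given rate."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def mean_poisson_nll(counts, rates):
    """Mean negative log-likelihood over a batch of images.

    counts: list of per-image lists of observed object counts.
    rates:  list of per-image lists of predicted Poisson rates (all > 0).
    """
    total = 0.0
    for count_row, rate_row in zip(counts, rates):
        # Categories are treated as independent, so log-probabilities add.
        total -= sum(poisson_log_pmf(k, r) for k, r in zip(count_row, rate_row))
    return total / len(counts)


# Two toy images with two categories each.
observed = [[0, 1], [2, 0]]
predicted = [[1.0, 1.0], [2.0, 0.5]]
loss = mean_poisson_nll(observed, predicted)
```

The loss is smallest when each predicted rate matches the typical count observed for that category at visually similar locations.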

4 Evaluation

Method | Distribution | Mean Log-Likelihood
WGW-P | Poisson |
WGW-NB | Neg. Binomial |
WGW-G | Gaussian |
Table 1: Quantitative results comparing models with different loss functions. Higher is better.

We evaluated our WGW models quantitatively on a randomly selected subset (25%) of the ground-level images in the CVUSA dataset, each with an object histogram and corresponding overhead image. For each overhead image, we predict the parameters of a probability distribution over object counts. Then, for each ground-level image, we measure the likelihood of the empirical object counts under the predicted distribution. Table 1 shows the mean log-likelihood on the test set for each of our models. The model based on the Poisson distribution, WGW-P, achieves the highest mean log-likelihood. Therefore, we focus on this model for the remaining evaluation.
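Comparing models in this way requires a log-likelihood for each distribution family. The sketch below gives the Negative Binomial and Gaussian cases for a single count. The (r, p) parameterization of the Negative Binomial and the (mean, std) parameterization of the Gaussian are assumptions; the text states only that each of these models has two 91D output layers.

```python
import math


def neg_binomial_log_pmf(count, r, p):
    """Log-probability of `count` failures before r successes, success prob p.

    r > 0 may be non-integer; 0 < p < 1. This parameterization is assumed.
    """
    return (math.lgamma(count + r) - math.lgamma(r) - math.lgamma(count + 1)
            + r * math.log(p) + count * math.log(1.0 - p))


def gaussian_log_pdf(value, mean, std):
    """Log-density of a Gaussian with the given mean and standard deviation."""
    return (-0.5 * math.log(2.0 * math.pi * std ** 2)
            - (value - mean) ** 2 / (2.0 * std ** 2))


# Log-likelihood of observing 3 objects under toy parameters.
nb_ll = neg_binomial_log_pmf(3, r=2.5, p=0.4)
gauss_ll = gaussian_log_pdf(3, mean=2.0, std=1.5)
```

Note that the Gaussian assigns a density rather than a probability mass, so the two kinds of log-likelihood are not measured on identical scales.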

We qualitatively evaluate WGW-P using the dense overhead imagery in the San Francisco database [workman2015geocnn] to generate fine-grained maps over a large, diverse region. Figure 3 visualizes the results of this experiment as a heatmap of expected counts for a subset of object classes. We observe that the model learns to discriminate using visual cues found in overhead imagery and that the results appear to be geographically consistent. For example, Figure 3 (f) shows that cars are most likely to be found in urban areas, while (b), (d), and (g) show that boats, surfboards, and birds are all found around major bodies of water. These heatmaps are much higher resolution than those shown in Figure 2.

To further visualize what WGW-P has learned, we present the overhead images from San Francisco which maximize the expected count of several object categories in Figure 4. For example, the overhead image for person is of a stadium, surfboard is of a beach, and airplane is of an airport.
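Selecting these images amounts to ranking overhead images by the predicted expected count for each category. A minimal sketch follows, assuming predictions are stored as a dictionary from a hypothetical image identifier to a list of expected counts.

```python
def top_images_per_category(expected_counts, categories, k=1):
    """Return the k image ids with the highest expected count per category.

    expected_counts: dict mapping image id -> list of expected counts,
                     ordered like `categories`.
    """
    result = {}
    for index, name in enumerate(categories):
        ranked = sorted(expected_counts,
                        key=lambda image_id: expected_counts[image_id][index],
                        reverse=True)
        result[name] = ranked[:k]
    return result


# Toy predictions for three overhead images and two categories.
categories = ["person", "boat"]
expected_counts = {
    "img_a": [5.0, 0.1],
    "img_b": [0.5, 2.0],
    "img_c": [1.0, 1.5],
}
top = top_images_per_category(expected_counts, categories, k=2)
```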

In Figure 5 we show the results of performing k-means clustering () on the predicted Poisson distribution parameters for this area. As shown, this process results in visually coherent regions where one can expect to find objects in similar quantities. For example, the red cluster appears to most highly correlate with aquatic areas.
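The clustering step can be sketched with a small pure-Python version of Lloyd's algorithm for k-means. The number of clusters, the iteration count, the random initialization, and the toy two-category rate vectors are hypothetical choices made for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster equal-length vectors with Lloyd's algorithm.

    Returns (labels, centers). Centers start from k randomly chosen points.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        for i, point in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: squared_distance(point, centers[c]))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy predicted rates (e.g. car, boat) for four locations.
rates = [[0.1, 5.0], [0.2, 4.8], [3.0, 0.1], [3.2, 0.0]]
labels, centers = kmeans(rates, k=2)
```

Locations with similar predicted rate vectors receive the same label, and coloring each location by its label yields a map of the kind shown in Figure 5.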

Fig. 3: A high resolution heatmap of object frequency generated using our WGW-P method, shown for (a) Person, (b) Boat, (c) Train, (d) Surfboard, (e) Truck, (f) Car, (g) Bird, and (h) Airplane. Each map is scaled such that the darkest (greenest) regions represent areas where we expect to see comparatively higher counts within an object category.
Fig. 4: Images that result in high expected counts for particular object categories, as estimated by our model, shown for (a) Person, (b) Train, (c) Truck, (d) Boat, (e) Surfboard, (f) Car, (g) Bird, and (h) Airplane. This figure shows that WGW-P learns to identify areas where large groups of the same object are often seen together. For example: (a) people at stadiums, (d) boats in marinas, and (h) airplanes at air strips.
Fig. 5: The k-means clustering over expected object counts for overhead imagery captured over San Francisco. Each location is colored according to the cluster to which it is assigned. The clusters appear to roughly correspond with terrain types in the area. For example, red corresponds with the Ocean/Bay, and green and orange with the urban areas that encircle the Bay.

5 Conclusion

We proposed to use an off-the-shelf object detector, applied to ground-level imagery, to learn to interpret overhead imagery, a special case of what we call cross-view semantic transfer. The idea was to use pairs of co-located overhead and ground-level images and train a CNN to predict the distribution of objects in the ground-level image using only the overhead image. We demonstrated that this approach is able to capture rich and subtle differences between locations. In some sense, what we learned is similar to land cover and land use classification; the difference is that our approach does not require us to commit to a particular set of classes in advance. Instead, the mixture of object types that are likely to occur in an area informs the types of representations we learn. There are many directions for future work, including applying this strategy to other off-the-shelf methods for interpreting ground-level imagery and using this as a pre-training strategy for a wide variety of overhead image interpretation tasks.


We gratefully acknowledge the support of an NSF CAREER award (IIS-1553116).

