SpaceNet MVOI: a Multi-View Overhead Imagery Dataset


Nicholas Weir (nweir@iqt.org), In-Q-Tel CosmiQ Works, Arlington, VA 22201 USA; David Lindenbaum, In-Q-Tel CosmiQ Works, Arlington, VA 22201 USA (current affiliation: Accenture Federal Services, Arlington, VA 22209 USA); Alexei Bastidas, Intel AI Lab, Santa Clara, CA 95054 USA; Adam Van Etten, In-Q-Tel CosmiQ Works, Arlington, VA 22201 USA; Sean McPherson, Intel AI Lab, Santa Clara, CA 95054 USA; Jacob Shermeyer, In-Q-Tel CosmiQ Works, Arlington, VA 22201 USA; Varun Kumar, Intel AI Lab, Santa Clara, CA 95054 USA; Hanlin Tang (hanlin.tang@intel.com), Intel AI Lab, Santa Clara, CA 95054 USA
Abstract

Detection and segmentation of objects in overhead imagery is a challenging task. The variable density, random orientation, small size, and instance-to-instance heterogeneity of objects in overhead imagery call for approaches distinct from existing models designed for natural scene datasets. Though new overhead imagery datasets are being developed, they almost universally comprise a single view taken from directly overhead ("at nadir"), failing to address one critical variable: look angle. By contrast, views vary in real-world overhead imagery, particularly in dynamic scenarios such as natural disasters where first looks are often well off-nadir. This represents an important challenge to computer vision methods, as changing view angle adds distortions, alters resolution, and changes lighting. At present, the impact of these perturbations on algorithmic detection and segmentation of objects is untested. To address this problem, we introduce the SpaceNet Multi-View Overhead Imagery (MVOI) Dataset, an extension of the SpaceNet open source remote sensing dataset. SpaceNet MVOI comprises 27 unique looks spanning a broad range of viewing angles (-32.5° to +54.0°). Each of these images covers the same 665 km2 geographic extent and is annotated with 126,747 building footprint labels, enabling direct assessment of the impact of viewpoint perturbation on model performance. We benchmark multiple leading segmentation and object detection models on: (1) building detection, (2) generalization to unseen viewing angles and resolutions, and (3) sensitivity of building footprint extraction to changes in resolution. We find that state-of-the-art segmentation and object detection models struggle to identify buildings in off-nadir imagery and generalize poorly to unseen views, presenting an important benchmark to explore the broadly relevant challenge of detecting small, heterogeneous target objects in visually dynamic contexts.

Recent years have seen increasing use of convolutional neural networks to analyze overhead imagery collected by aerial vehicles or space-based sensors, for applications ranging from agriculture [18] to surveillance [39, 32] to land type classification [3]. Segmentation and object detection of overhead imagery data requires identifying small, visually heterogeneous objects (e.g. cars and buildings) with varying orientation and density in images, a task ill-addressed by existing models developed for identification of comparatively larger and lower-abundance objects in natural scene images. The density and visual appearance of target objects change dramatically as look angle, geographic location, time of day, and seasonality vary, further complicating the problem. Addressing these challenges will provide broadly useful insights for the computer vision community as a whole: for example, how to build segmentation models to identify low-information objects in dense contexts.

The existing SpaceNet Dataset imagery [12] and other overhead imagery datasets [8, 22, 33, 19] explore geographic and sensor diversity, but they generally comprise a single view of the imaged location(s) taken nearly directly overhead ("at nadir"). Nadir imagery is not representative of collections during disaster response or other urgent situations: for example, the first public high-resolution cloud-free image of San Juan, Puerto Rico following Hurricane Maria was taken "off-nadir", i.e., at a substantial angle between the nadir point directly underneath the satellite and the center of the imaged scene [10]. The disparity between looks in public training data and relevant use cases hinders development of models applicable to real-world problems. More generally, satellite and drone images rarely capture identical looks at objects in different contexts, or even when repeatedly imaging the same geography. Furthermore, no existing datasets or metrics permit assessment of model robustness to different looks, preventing rigorous evaluation of generalization performance. These limitations extend to tasks outside of the geospatial domain: for example, convolutional neural networks perform inconsistently in many natural scene video frame classification tasks despite minimal pixel-level variation [1], and Xiao et al. showed that spatial transformation of images, effectively altering view, represents an effective adversarial attack against computer vision models [35]. Addressing generalization across views both within and outside of the geospatial domain requires two advancements: (1) a large multi-view dataset with diversity in land usage, population density, and views, and (2) a metric to assess model generalization.



[Figure 1 image grid: columns show Urban, Industrial, Dense Residential, and Sparse Residential scenes; rows show look angle bins 7° (NADIR), -32° (OFF), and 52° (VOFF).]
Figure 1: Sample imagery from SpaceNet MVOI. Four of the 2222 geographically unique image chips in the dataset are shown (columns), with three of the 27 views of each chip (rows), one from each angle bin. Negative look angles correspond to South-facing views, whereas positive look angles correspond to North-facing views (Figure 2). Chips are down-sampled from the high-resolution source images. In addition to the RGB images shown, the dataset comprises a high-resolution panchromatic (grayscale) band, a high-resolution near-infrared band, and a lower-resolution 8-band multispectral image for each geographic location/view combination.

To address the limitations detailed above, we introduce the SpaceNet Multi-View Overhead Imagery (MVOI) dataset, which includes 60,000 overhead images collected over Atlanta, Georgia USA and the surrounding areas. The dataset comprises 27 distinct looks, including both North- and South-facing views, taken during a single pass of a DigitalGlobe WorldView-2 satellite. The looks range from nearly directly overhead to as far as 54° off-nadir, with the same geographic area covered by each. Alongside the imagery, we open sourced 126,747 attendant building footprint labels created by expert labelers. To our knowledge, this is the first multi-viewpoint dataset for overhead imagery with dense object annotations. The dataset covers heterogeneous geographies, including highly treed rural areas, suburbs, industrial areas, and high-density urban environments, resulting in heterogeneous building size, density, context, and appearance (Figure 1). At the same time, the dataset abstracts away many other time-sensitive variables (e.g. seasonality), enabling careful assessment of the impact of look angle on model training and inference.

Though an ideal overhead imagery dataset would cover all the variables present in overhead imagery, i.e. look angle, seasonality, geography, weather condition, sensor, and light conditions, creating such a dataset is impossible with existing imagery. To our knowledge, the 27 unique looks in SpaceNet MVOI represent one of only two such imagery collections available in the commercial realm, even behind imagery acquisition company paywalls. We thus chose to focus on providing a diverse set of views with varying look angle and direction, a variable that is not represented in any existing overhead imagery dataset. SpaceNet MVOI could potentially be combined with existing datasets to train models which generalize across more variables.

We benchmark state-of-the art models on three tasks:

  1. Building segmentation and detection.

  2. Generalization of segmentation and object detection models to previously unseen angles.

  3. Consequences of changes in resolution for segmentation and object detection models.

Our benchmarking reveals that state-of-the-art detectors are challenged by SpaceNet MVOI, particularly in views left out during model training. Segmentation and object detection models struggled to account for displacement of building footprints, occlusion, shadows, and distortion in highly off-nadir looks (Figure 3). The challenge of addressing footprint displacement is of particular interest, as it requires models not only to learn visual features, but to adjust footprint localization dependent upon the view context. Addressing these challenges is relevant to a number of applications outside of overhead imagery analysis, e.g. autonomous vehicle vision.

To assess model generalization to new looks, we developed a generalization metric G, which reports the relative performance of computer vision models when they are applied to previously unseen looks. While specialized models designed for overhead imagery out-perform general baseline models in building footprint detection, we found that models developed for natural image computer vision tasks achieved better generalization scores on views absent during training. These observations highlight the challenges associated with developing robust models for multi-view object detection and semantic segmentation tasks. We therefore expect that developments in computer vision models for multi-view analysis made using SpaceNet MVOI, as well as analysis using our metric G, will be broadly relevant to many computer vision tasks.

1 Related Work

Object detection and segmentation are well-studied problems for natural scene images, but those objects are generally much larger and suffer minimally from the distortions exacerbated in overhead imagery. Natural scene research is driven by datasets such as MSCOCO [20] and PASCALVOC [13], but those datasets lack multiple views of each object. PASCAL3D [34], autonomous driving datasets such as KITTI [14] and CityScapes [7], existing multi-view datasets [29, 30], and tracking datasets such as MOT2017 [24] and OBT [36] contain different views, but they are confined to a narrow range of angles, lack sufficient heterogeneity to test generalization between views, and are restricted to natural scene images. Multiple viewpoints are found in 3D model datasets [5, 23], but those are not photorealistic and lack the occlusion and visual distortion properties encountered in real imagery.

Previous datasets for overhead imagery focus on classification [6], bounding box object detection [33, 19, 25], instance-based segmentation [12], and object tracking [26] tasks. None of these datasets comprise multiple collections of the same field of view from substantially different look angles, making it difficult to assess model robustness to new views. Within segmentation datasets, Van Etten et al. [12] represents the closest work, with dense building and road annotations. We summarize the key characteristics of each dataset in Table 1. Our dataset matches or exceeds existing datasets in terms of imagery size and annotation density, but critically includes varying look direction and angle to better reflect the visual heterogeneity of real-world imagery.

The effect of different views on segmentation or object detection in natural scenes has not been thoroughly studied, as feature characteristics are relatively preserved even under rotation of the object in that context. Nonetheless, preliminary studies of classification model performance on video frames suggest that minimal pixel-level changes can impact performance [1]. By contrast, substantial occlusion and distortion occurs in off-nadir overhead imagery, complicating segmentation and placement of geospatially accurate object footprints, as shown in Figure 3A-B. Furthermore, due to the comparatively small size of target objects (e.g. buildings) in overhead imagery, changing view substantially alters their appearance (Figure 3C-D). We expect similar challenges to occur when detecting objects in natural scene images at a distance or in crowded views. Existing solutions to occlusion are often domain specific, exploiting face-specific structure [37] for recognition or relying on attention mechanisms to identify common elements [40] or landmarks [38]. The heterogeneity in building appearance in overhead imagery, and the absence of landmark features to identify them, makes their detection an ideal research task for developing domain-agnostic models that are robust to occlusion.

Dataset Gigapixels # Images Resolution (m) Nadir Angles # Objects Annotation
SpaceNet [12, 8] 10.3 24586 0.31 On-Nadir 302701 Polygons
DOTA [33] 44.9 2806 Google Earth* On-Nadir 188282 Oriented Bbox
3K Vehicle Detection [21] N/A 20 0.20 Aerial 14235 Oriented Bbox
UCAS-AOD [41] N/A 1510 Google Earth* On-Nadir 3651 Oriented Bbox
NWPU VHR-10 [4] N/A 800 Google Earth* On-Nadir 3651 Bbox
MVS [2] 111 50 0.31-0.58 [5.3, 43.3] 0 None
FMoW [6] 1,084.0 523846 0.31-1.60 [0.22, 57.5] 132716 Classification
xView [19] 56.0 1400 0.31 On-Nadir – Bbox
SpaceNet MVOI (Ours) 50.2 60000 0.46-1.67 [-32.5, +54.0] 126747 Polygons
PascalVOC [13] - 21503 - - 62199 Bbox
MSCOCO [20] - 123287 - - 886266 Bbox
ImageNet [9] - 349319 - - 478806 Bbox
Table 1: Comparison with other computer vision and overhead imagery datasets. Our dataset has a similar scale as modern computer vision datasets, but to our knowledge is the first multi-view overhead imagery dataset designed for segmentation and object detection tasks. *Google Earth imagery is a mosaic from a variety of aerial and satellite sources and ranges from 15 cm to 12 m resolution [15].

2 Dataset Creation

Figure 2: Collection views. Locations of the collection points during the WorldView-2 satellite pass over Atlanta, GA USA.

Our dataset, SpaceNet MVOI, contains images of Atlanta, GA USA and surrounding geography collected by DigitalGlobe's WorldView-2 Satellite on December 22, 2009 [22]. The satellite collected 27 distinct views of the same 665 km2 ground area during a single pass over a 5 minute span. This produced 27 views with look angles (angular distance between the nadir point directly underneath the satellite and the center of the scene) ranging from roughly 7° to 54° off-nadir, with target azimuth angles (compass direction of image acquisition) spanning both North- and South-facing directions (see Figure 2). The 27 views in a narrow temporal band provide a dense set of visually distinct perspectives of static objects (buildings, roads, trees, utilities, etc.) while limiting complicating factors common to remote sensing datasets such as changes in cloud cover, sun angle, or land use. The imaged area is geographically diverse, including urban areas, industrial zones, forested suburbs, and undeveloped areas (Figure 1).

[Figure 3 image panels, "Challenges in off-nadir imagery" — Footprint offset and occlusion: (a) 7 degrees, (b) 53 degrees; Shadows: (c) 30 degrees, (d) -32 degrees.]
Figure 3: Challenges with off-nadir look angles. Though geospatially accurate building footprints (blue) perfectly match building roofs at nadir (A), this is not the case off-nadir (B), and many buildings are obscured by skyscrapers. (C-D): Visibility of some buildings changes at different look angles due to variation in reflected sunlight.

2.1 Preprocessing

Multi-view satellite imagery datasets are distinct from related natural image datasets in several interesting ways. First, as look angle increases in satellite imagery, the native resolution of the image decreases, because greater distortion is required to project the image onto a flat grid (Figure 1). Second, each view contains images with multiple spectral bands. For the purposes of our baselines, we used 3-channel images (RGB: red, green, blue), but also examined the contributions of the near-infrared (NIR) channel (see Supplementary Information). These images were enhanced with a separate, higher-resolution panchromatic (grayscale) channel to double the original resolution of the multispectral imagery (i.e., "pan-sharpened"). The entire dataset was tiled into chips and resampled to a consistent ground sample distance across all viewing angles. The dataset also includes lower-resolution 8-band multispectral imagery with additional color channels, as well as panchromatic images, both of which are common overhead imagery data types.

The 16-bit pan-sharpened RGB-NIR pixel intensities were truncated at 3,000 and then rescaled to an 8-bit range before normalization. We also trained models directly on Z-score normalized 16-bit images, with no appreciable difference in the results.
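A minimal sketch of this preprocessing step is shown below, assuming the imagery is loaded as a NumPy array; the final scaling to [0, 1] is our assumption of the normalization target, and the function and argument names are illustrative rather than part of the released tooling.

```python
import numpy as np

def preprocess_pixels(img_16bit, clip_max=3000):
    """Truncate raw 16-bit pan-sharpened intensities at `clip_max`,
    rescale to an 8-bit range, and normalize for network input.

    img_16bit: numpy array of shape (bands, H, W) holding raw pixel values.
    Returns a float32 array scaled to [0, 1] (our assumption of the
    normalization target).
    """
    clipped = np.clip(img_16bit.astype(np.float32), 0, clip_max)
    img_8bit = np.round(clipped / clip_max * 255.0).astype(np.uint8)  # 8-bit rescale
    return img_8bit.astype(np.float32) / 255.0                        # normalize
```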

2.2 Annotations

We undertook professional labeling to produce high-quality annotations. An expert geospatial team exhaustively labeled building footprints across the imaged area using the most on-nadir view. Importantly, the building footprint polygons represent geospatially accurate ground truth and therefore are shared across all views. For structures occluded by trees, only the visible portion was labeled. Finally, one independent validator and one remote sensing expert evaluated the quality of each label.

Figure 4: Dataset statistics. Distribution of (A) object sizes and (B) number of objects per tiled image in our dataset.

2.3 Dataset statistics

Our dataset labels comprise a broad distribution of building sizes, as shown in Figure 4A. Compared to natural image datasets, our dataset more heavily emphasizes small objects, with the majority of objects less than 700 pixels in area (roughly 26 pixels across). By contrast, objects in the PASCALVOC [13] or MSCOCO [20] datasets usually comprise 50-300 pixels along the major axis [33].

An additional challenge presented by this dataset, consistent with many real-world computer vision tasks, is the heterogeneity in target object density (Figure 4B). Images contained between zero and 200 footprints, with substantial coverage throughout that range. This variability presents a challenge to object detection algorithms, which often require estimation of the number of features per image [16]. Segmentation and object detection of dense or variable density objects is challenging, making this an ideal dataset to test the limits of algorithms’ performance.

Task Baseline models
Semantic Segmentation TernausNet [17], U-Net [27]
Instance Segmentation Mask R-CNN [16]
Object Detection Mask R-CNN [16], YOLT [11]
Table 2: Benchmark model selections for dataset baselines. TernausNet and YOLT are overhead imagery-specific models, whereas Mask R-CNN and U-Net are popular natural scene analysis models.

3 Building Detection Experiments

3.1 Dataset preparation for analysis

We split the training and test sets 80/20 by randomly selecting geographic locations and including all views of each location in a single split, ensuring that each type of geography was represented in both splits. We group each look angle into one of three categories: nadir (NADIR), off-nadir (OFF), and very off-nadir (VOFF). In all experiments, we trained baselines using all viewing angles (ALL) or one of the three subsets. These trained models were then evaluated on the test set for each of the 27 viewing angles individually.
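The key property of this split is that it samples geographic tiles rather than individual images, so all 27 views of a location land in the same split. The sketch below illustrates the idea; the record fields (e.g. `tile_id`) are hypothetical and not part of the released dataset schema.

```python
import random
from collections import defaultdict

def geographic_split(chip_records, test_fraction=0.2, seed=0):
    """Split image chips 80/20 by geographic tile so that all views of a
    given location fall into the same split.

    chip_records: iterable of dicts with a 'tile_id' key identifying the
    geographic location (field name is hypothetical).
    """
    by_tile = defaultdict(list)
    for record in chip_records:
        by_tile[record["tile_id"]].append(record)

    tile_ids = sorted(by_tile)
    random.Random(seed).shuffle(tile_ids)
    n_test = int(len(tile_ids) * test_fraction)
    test_tiles = set(tile_ids[:n_test])

    train = [r for t in tile_ids if t not in test_tiles for r in by_tile[t]]
    test = [r for t in tile_ids if t in test_tiles for r in by_tile[t]]
    return train, test
```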

3.2 Models

We measured several state-of-the-art baselines for semantic segmentation, instance segmentation, and object detection (Table 2). Where possible, we selected overhead imagery-specific models as well as models for natural scenes to compare their performance. Object detection baselines were trained using rectangular boundaries extracted from the building footprints. To compare fairly with the semantic segmentation models, the resulting bounding boxes were compared against the ground truth building polygons for scoring (see Metrics).
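For the detection baselines, axis-aligned boxes can be derived directly from the footprint polygons; a minimal sketch using Shapely is shown below (the WKT input format is an assumption for illustration).

```python
from shapely import wkt
from shapely.geometry import box

def footprint_to_bbox(footprint_wkt):
    """Convert a building footprint polygon (given as a WKT string) into an
    axis-aligned bounding box for object detection training."""
    polygon = wkt.loads(footprint_wkt)
    minx, miny, maxx, maxy = polygon.bounds
    return box(minx, miny, maxx, maxy)

# Example: a small rectangular footprint
bbox = footprint_to_bbox("POLYGON ((0 0, 20 0, 20 10, 0 10, 0 0))")
print(bbox.bounds)  # (0.0, 0.0, 20.0, 10.0)
```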

3.3 Segmentation Loss

Due to the class imbalance of the training data – only 9.5% of the pixels in the training set correspond to buildings – segmentation models trained with binary cross-entropy (BCE) loss failed to identify building pixels, a problem observed previously for overhead imagery segmentation models [31]. For the semantic segmentation models, we therefore utilized a hybrid loss function that combines the binary cross-entropy loss and the intersection over union (IoU) loss with a weight factor α [31]:

$\mathcal{L} = \alpha \, \mathcal{L}_{BCE} + (1 - \alpha) \, \mathcal{L}_{IoU}$    (1)
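A PyTorch-style sketch of the hybrid loss in Eq. (1) is shown below, using a differentiable (soft) IoU term; the value of the weight α and the exact IoU formulation are assumptions for illustration, not the authors' published configuration.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, targets, alpha=0.5, eps=1e-6):
    """Hybrid segmentation loss: alpha * BCE + (1 - alpha) * (1 - soft IoU).

    logits:  raw model outputs, shape (N, 1, H, W).
    targets: binary building masks of the same shape, float values in {0, 1}.
    alpha:   weighting between the two terms (0.5 here is illustrative).
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets)

    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum(dim=(1, 2, 3))
    union = (probs + targets - probs * targets).sum(dim=(1, 2, 3))
    iou_loss = 1.0 - ((intersection + eps) / (union + eps)).mean()

    return alpha * bce + (1.0 - alpha) * iou_loss
```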

The details of model training and evaluation, including augmentation, optimizers, and evaluation schemes can be found in the Supplemental Information.

3.4 Metrics

We measured performance using the building IoU-F1 score defined in Van Etten et al. [12]. Briefly, building footprint polygons were extracted from segmentation masks (or taken directly from object detection bounding box outputs) and compared to ground truth polygons. Predictions were labeled True Positive if they had an IoU with a ground truth polygon above 0.5, and all other predictions were deemed False Positives. Using these statistics and the number of undetected ground truth polygons (False Negatives), we calculated the precision and recall of the model predictions in aggregate. We then report the F1 score as

$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$    (2)

The F1 score was calculated within each angle bin (NADIR, OFF, or VOFF) and then averaged for an aggregate performance score. This metric assesses instance identification better than pixelwise metrics, as it measures the model's ability to delineate entire objects rather than individual pixels.
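A simplified sketch of the IoU-F1 scoring procedure for a single angle bin is given below; it uses greedy matching between proposal and ground-truth polygons, which approximates but may not exactly reproduce the official SpaceNet evaluation utilities.

```python
def iou_f1(proposals, ground_truth, iou_threshold=0.5):
    """Compute the building IoU-F1 score for one angle bin.

    proposals, ground_truth: lists of shapely Polygons.  A proposal counts
    as a true positive if it overlaps an unmatched ground-truth polygon
    with IoU above `iou_threshold`.
    """
    unmatched = list(ground_truth)
    true_pos = 0
    for pred in proposals:
        best_iou, best_gt = 0.0, None
        for gt in unmatched:
            union_area = pred.union(gt).area
            iou = pred.intersection(gt).area / union_area if union_area > 0 else 0.0
            if iou > best_iou:
                best_iou, best_gt = iou, gt
        if best_iou > iou_threshold:
            true_pos += 1
            unmatched.remove(best_gt)

    false_pos = len(proposals) - true_pos   # unmatched proposals
    false_neg = len(unmatched)              # undetected ground truth
    precision = true_pos / (true_pos + false_pos) if proposals else 0.0
    recall = true_pos / (true_pos + false_neg) if ground_truth else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```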

Task Model NADIR OFF VOFF Summary
Segmentation TernausNet 0.62 0.43 0.22 0.43
Segmentation U-Net 0.39 0.27 0.08 0.24
Segmentation Mask R-CNN 0.47 0.34 0.07 0.29
Detection Mask R-CNN 0.40 0.30 0.07 0.25
Detection YOLT 0.49 0.37 0.20 0.36
Table 3: Overall task difficulty. As a measure of overall task difficulty, the performance (F1 score) is assessed for the baseline models trained on all angles and tested on the three different viewing angle bins: nadir (NADIR), off-nadir (OFF), and very off-nadir (VOFF).

3.5 Results

[Figure 5 image grid: columns show the input image, Mask R-CNN, TernausNet, and YOLT outputs; rows show look angles 10° (NADIR), -29° (OFF), and 53° (VOFF).]

Figure 5: Sample imagery (left) with ground truth building footprints and Mask R-CNN bounding boxes (middle left), TernausNet segmentation masks (middle right), and YOLT bounding boxes (right). Ground truth masks (light blue) are shown under Mask R-CNN and TernausNet predictions (yellow). YOLT bounding boxes are shown in blue. The sign of the look angle represents look direction (negative = South-facing, positive = North-facing). Predictions are from models trained on all angles (see Table 3).
Figure 6: Performance by look angle for various training subsets. TernausNet (left), Mask R-CNN (middle), and YOLT (right) models, trained on ALL, NADIR, OFF, or VOFF, were evaluated on the building detection task; F1 scores are displayed for each evaluation look angle. Imagery acquired facing South is represented by a negative angle, whereas looks facing North are represented by a positive angle. Additionally, TernausNet models trained only on North-facing OFF imagery (positive OFF) and South-facing OFF imagery (negative OFF) were evaluated on each angle to explore the importance of look direction.

The state-of-the-art segmentation and object detection models we measured were challenged by this task. As shown in Table 3, TernausNet trained on all angles achieves F1 = 0.62 on the nadir angles, which is on par with previous building segmentation results and competitions [12, 8]. However, performance drops significantly for off-nadir (F1 = 0.43) and very off-nadir (F1 = 0.22) images. Other models display a similar degradation in performance. Example segmentation and object detection results are displayed in Figure 5.

Directional asymmetry. Figure 6 illustrates performance per angle for both segmentation and object detection models. Note that models trained on positive (north-facing) angles, such as Positive OFF (red), fare particularly poorly when tested on negative (south-facing) angles. This may be due to the smaller dataset size, but we hypothesize that the very different lighting conditions and shadows make some directions intrinsically more difficult (Figure 3C-D). This observation reinforces that developing models and datasets that can handle the diversity of conditions seen in overhead imagery in the wild remains an important challenge.

Model architectures. Interestingly, models designed specifically for overhead imagery (TernausNet and YOLT) significantly outperform general-purpose computer vision models (U-Net, Mask R-CNN). These experiments demonstrate the value of specializing computer vision models to the target domain of overhead imagery, which has different visual, object density, size, and orientation characteristics.

Effects of resolution. OFF and VOFF images have lower base resolutions, potentially confounding analyses of effects due exclusively to look angle. To test whether resolution might explain the observed performance drop, we ran a control study with normalized resolution. We trained TernausNet on images from all look angles artificially reduced to the same resolution of 1.67 m GSD, the lowest base resolution in the dataset. This model showed negligible change in performance versus the model trained on original-resolution data (original resolution: F1 = 0.43; resolution-equalized: F1 = 0.41) (Table 4). This experiment indicates that viewing angle-specific effects, not resolution, drive the decline in segmentation performance as viewing angle changes.

Test Angles Original (0.46-1.67 m) Equalized (1.67 m)
NADIR 0.62 0.59
OFF 0.43 0.41
VOFF 0.22 0.22
Summary 0.43 0.41
Table 4: TernausNet model trained on different resolution imagery. Building footprint extraction performance for a TernausNet model trained on ALL original-resolution imagery (0.46 m ground sample distance (GSD) at the most nadir view to 1.67 m GSD at the most off-nadir view), left, compared to the same model trained and tested on ALL imagery where every view is down-sampled to 1.67 m GSD (right). Rows display performance (F1 score) on different angle bins. The original-resolution imagery represents the same data as in Table 3. Training set imagery resolution had only negligible impact on model performance.
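The resolution-equalization control can be simulated by down-sampling each chip to the coarsest ground sample distance; one plausible implementation with OpenCV is sketched below. Resampling back to the original pixel grid so tile sizes match is our assumption, since the exact procedure is not specified in the text.

```python
import cv2

def equalize_gsd(image, native_gsd, target_gsd=1.67):
    """Simulate a coarser ground sample distance by down-sampling a chip
    from its native GSD to `target_gsd` (1.67 m, the coarsest native
    resolution in the dataset), then up-sampling back to the original
    pixel grid so all chips keep the same size.

    image: (H, W, C) uint8 array; GSD values are in meters per pixel.
    """
    height, width = image.shape[:2]
    scale = native_gsd / target_gsd  # < 1 when the native imagery is finer
    small = cv2.resize(image,
                       (max(1, int(width * scale)), max(1, int(height * scale))),
                       interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (width, height), interpolation=cv2.INTER_LINEAR)
```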

Generalization to unseen angles. Beyond exploring the performance of models trained with many views, we also explored how effectively models could identify building footprints at look angles absent during training. We found that the TernausNet model trained only on NADIR performed worse on evaluation images from OFF (0.32) than a model trained directly on OFF (0.44), as shown in Table 5. Similar trends are observed for object detection (Figure 6). To measure performance on unseen angles, we introduce a generalization score G, which measures the performance of a model trained on angle bin T and tested on angle bin E, normalized by the performance of a model trained on E and tested on E:

$G(T, E) = \frac{F_1(\mathrm{train}=T,\ \mathrm{test}=E)}{F_1(\mathrm{train}=E,\ \mathrm{test}=E)}$    (3)

This metric measures relative performance across viewing angles, normalized by the task difficulty of the test set. We measured G for all of our model/dataset combinations, as reported in Table 6. Even though the Mask R-CNN model has worse overall performance, it achieved higher generalization scores than TernausNet (Table 6), as its performance did not decline as rapidly when look angle increased. Overall, however, generalization scores on unseen angles were low, highlighting the importance of future study of this challenging task.
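Computing G from a table of cross-angle F1 scores is straightforward; the sketch below implements Eq. (3), with example values taken from the TernausNet entries of Table 5.

```python
def generalization_score(f1_table, train_bin, test_bin):
    """Eq. (3): F1 of a model trained on `train_bin` and tested on
    `test_bin`, normalized by the F1 of a model trained and tested on
    `test_bin`.  f1_table maps (train_bin, test_bin) -> F1 score."""
    baseline = f1_table[(test_bin, test_bin)]
    return f1_table[(train_bin, test_bin)] / baseline if baseline > 0 else 0.0

# TernausNet cross-angle F1 scores from Table 5, keyed by (train, test):
ternausnet_f1 = {
    ("NADIR", "NADIR"): 0.59, ("NADIR", "OFF"): 0.32, ("NADIR", "VOFF"): 0.04,
    ("OFF", "NADIR"): 0.23,   ("OFF", "OFF"): 0.44,   ("OFF", "VOFF"): 0.13,
    ("VOFF", "NADIR"): 0.13,  ("VOFF", "OFF"): 0.23,  ("VOFF", "VOFF"): 0.27,
}
print(generalization_score(ternausnet_f1, "NADIR", "OFF"))  # 0.32 / 0.44 ≈ 0.73
```

Aggregate scores such as those in Table 6 combine per-pair ratios across the unseen evaluation bins; we assume simple averaging, though the precise aggregation is not spelled out here.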

Training Angles
Test Angles All NADIR OFF VOFF
NADIR 0.62 0.59 0.23 0.13
OFF 0.43 0.32 0.44 0.23
VOFF 0.22 0.04 0.13 0.27
Summary 0.43 0.32 0.26 0.21
Table 5: TernausNet model tested on unseen angles. Performance (F1 score) of the TernausNet model when trained on one angle bin (columns) and then tested on each of the three bins (rows). The model trained on NADIR performs worse on unseen OFF and VOFF views compared to models trained directly on imagery from those views.

4 Conclusion

We present a new dataset that is critical for extending object detection to real-world applications but that also presents challenges to existing computer vision algorithms. Our benchmark found that segmenting building footprints from very off-nadir views was exceedingly difficult, even for state-of-the-art segmentation and object detection models tuned specifically for overhead imagery (Table 3). The relatively low scores for these tasks (maximum VOFF F1 score of 0.22) emphasize how much improvement further research could enable in this realm.

Furthermore, on all benchmark tasks we concluded that model generalization to unseen views represents a significant challenge. We quantify the performance degradation from nadir (F1 = 0.62) to very off-nadir (F1 = 0.22), and note an asymmetry between performance on well-lit North-facing imagery and South-facing imagery cloaked in shadows (Figure 3C-D and Figure 6). We speculate that distortions in objects, occlusion, and variable lighting in off-nadir imagery (Figure 3), as well as the small size of buildings in general (Figure 4), pose an unusual challenge for segmentation and object detection in overhead imagery.

The off-nadir imagery has a lower resolution than nadir imagery (due to simple geometry), which theoretically complicates building extraction for high off-nadir angles. However, by experimenting with imagery degraded to the same low resolution, we show that resolution has an insignificant impact on performance (Table 4). Rather, variations in illumination and viewing angle are the dominant factors. This runs contrary to recent observations [28], which found that object detection models identify small cars and other vehicles better in super-resolved imagery. Further study will help elucidate the differences between tasks that dictate the value of imagery resolution.

Generalization Score
Task Model NADIR OFF VOFF
Segmentation TernausNet 0.45 0.43 0.37
Segmentation U-Net 0.64 0.40 0.37
Segmentation Mask R-CNN 0.60 0.90 0.84
Detection Mask R-CNN 0.64 0.92 0.76
Detection YOLT 0.57 0.68 0.44
Table 6: Generalization scores. To measure model performance on unseen views, we compute a generalization score G (Equation 3), which quantifies performance on unseen views normalized by task difficulty. Each column corresponds to a model trained on one angle bin.

The generalization score G is low for the highest-performing, overhead imagery-specific models in these tasks (Table 6), suggesting that these models may be over-fitting to view-specific properties. This challenge is not specific to overhead imagery: for example, accounting for distortion of objects due to imaging perspective is an essential component of 3-dimensional scene modeling and rotation prediction tasks [23]. Taken together, this dataset and the G metric provide an exciting opportunity for future research on algorithmic generalization to unseen views.

Our aim for future work is to expose problems of interest to the larger computer vision community with the help of overhead imagery datasets. While overhead imagery analysis is only one specific application, advances that enable analysis of overhead imagery in the wild can concurrently address broader tasks. For example, we anecdotally observed that image translation and domain transfer models failed to convert off-nadir images to nadir images, potentially due to the spatial shifts in the image. Exploring these tasks, as well as other novel research avenues, will enable advancement on a variety of current computer vision challenges.

References

  • [1] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? CoRR, abs/1805.12177, 2018.
  • [2] M. Bosch, Z. Kurtz, S. Hagstrom, and M. Brown. A multiple view stereo benchmark for satellite imagery. In 2016 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), pages 1–9, Oct 2016.
  • [3] Y. Chen, X. Zhao, and X. Jia. Spectral–Spatial Classification of Hyperspectral Data Based on Deep Belief Network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(6):2381–2392, July 2015.
  • [4] G. Cheng, P. Zhou, and J. Han. Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 54:7405–7415, 2016.
  • [5] S. Choi, Q.-Y. Zhou, S. Miller, and V. Koltun. A large dataset of object scans. arXiv:1602.02481, 2016.
  • [6] G. Christie, N. Fendley, J. Wilson, and R. Mukherjee. Functional map of the world. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2018.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [8] I. Demir, K. Koperski, D. Lindenbaum, G. Pang, J. Huang, S. Basu, F. Hughes, D. Tuia, and R. Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
  • [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • [10] DigitalGlobe. Digitalglobe search and discovery. "https://discover.digitalglobe.com". Accessed: 2019-03-19.
  • [11] A. Van Etten. You only look twice: Rapid multi-scale object detection in satellite imagery. CoRR, abs/1805.09512, 2018.
  • [12] A. Van Etten, D. Lindenbaum, and T. M. Bacastow. SpaceNet: A remote sensing dataset and challenge series. CoRR, abs/1807.01232, 2018.
  • [13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
  • [14] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [15] Google. Google maps data help. https://support.google.com/mapsdata. Accessed: 2019-3-19.
  • [16] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [17] V. Iglovikov and A. Shvets. Ternausnet: U-Net with VGG11 encoder pre-trained on ImageNet for image segmentation. 2018.
  • [18] F. M. Lacar, M. M. Lewis, and I. T. Grierson. Use of hyperspectral imagery for mapping grape varieties in the Barossa Valley, South Australia. In IGARSS 2001. Scanning the Present and Resolving the Future. Proceedings. IEEE 2001 International Geoscience and Remote Sensing Symposium (Cat. No.01CH37217), pages 2875–2877 vol.6, 2001.
  • [19] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, and B. McCord. xView: Objects in context in overhead imagery. CoRR, abs/1802.07856, 2018.
  • [20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), Zürich, 2014. Oral.
  • [21] K. Liu and G. Máttyus. Fast multiclass vehicle detection on aerial images. IEEE Geoscience and Remote Sensing Letters, 12:1938–1942, 2015.
  • [22] N. Longbotham, C. Chaapel, L. Bleiler, C. Padwick, W. J. Emery, and F. Pacifici. Very high resolution multiangle urban classification analysis. IEEE Transactions on Geoscience and Remote Sensing, 50(4):1155–1170, April 2012.
  • [23] W. Lotter, G. Kreiman, and D. Cox. Unsupervised learning of visual structure using predictive generative networks, 2015.
  • [24] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. arXiv:1603.00831 [cs], Mar. 2016.
  • [25] T. N. Mundhenk, G. Konjevod, W. A. Sakla, and K. Boakye. A large contextual dataset for classification, detection and counting of cars with deep learning. ECCV, abs/1609.04453, 2016.
  • [26] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision – ECCV 2016, pages 549–565, Cham, 2016. Springer International Publishing.
  • [27] O. Ronneberger, P. Fischer, and T. Brox. U-Net - Convolutional Networks for Biomedical Image Segmentation. MICCAI, 9351(Chapter 28):234–241, 2015.
  • [28] J. Shermeyer and A. V. Etten. The effects of super-resolution on object detection performance in satellite imagery. CoRR, abs/1812.04098, 2018.
  • [29] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. Hand keypoint detection in single images using multiview bootstrapping. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
  • [30] S. Sridhar, A. Oulasvirta, and C. Theobalt. Interactive markerless articulated hand motion tracking using rgb and depth data. 2013 IEEE International Conference on Computer Vision, pages 2456–2463, 2013.
  • [31] T. Sun, Z. Chen, W. Yang, and Y. Wang. Stacked u-nets with multi-output for road extraction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
  • [32] B. Uzkent, A. Rangnekar, and M. J. Hoffman. Aerial vehicle tracking by adaptive fusion of hyperspectral likelihood maps. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 233–242, July 2017.
  • [33] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang. DOTA: A Large-scale Dataset for Object Detection in Aerial Images. IEEE Conference on Computer Vision and Pattern Recognition, Nov. 2017.
  • [34] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2014.
  • [35] C. Xiao, J.-Y. Zhu, B. Li, W. He, M. Liu, and D. Song. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
  • [36] M. Yang, J. Lim, and Y. Wu. Online object tracking: A benchmark. In 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2411–2418, June 2013.
  • [37] S. Yang, P. Luo, C.-C. Loy, and X. Tang. From facial parts responses to face detection: A deep learning approach. 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015.
  • [38] K. Yuen and M. M. Trivedi. An occluded stacked hourglass approach to facial landmark localization and occlusion estimation. IEEE Transactions on Intelligent Vehicles, 2:321–331, 2017.
  • [39] P. W. Yuen and M. Richardson. An introduction to hyperspectral imaging and its application for security, surveillance and target acquisition. The Imaging Science Journal, 58(5):241–253, 2010.
  • [40] S. Zhang, J. Yang, and B. Schiele. Occluded pedestrian detection through guided attention in cnns. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [41] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao. Orientation robust object detection in aerial images using deep convolutional neural network. 2015 IEEE International Conference on Image Processing (ICIP), pages 3735–3739, 2015.