Object detection in satellite imagery using 2-Step Convolutional Neural Networks

This paper presents an efficient object detection method from satellite imagery. Among a number of machine learning algorithms, we proposed a combination of two convolutional neural networks (CNN) aimed at high precision and high recall, respectively. We validated our models using golf courses as target objects. The proposed deep learning method demonstrated higher accuracy than previous object identification methods.


Hiroki Miyamoto (miyamoto-hrk-tomeken@aist.go.jp), Kazuki Uehara, Masahiro Murakawa,
Hidenori Sakanashi, Hirokazu Nosato, Toru Kouyama, Ryosuke Nakamura
National Institute of Advanced Industrial Science and Technology, Tokyo, Japan

Index Terms—  remote sensing, object detection, convolutional neural networks, golf course, negative mining

1 Introduction

Earth observation satellites have been monitoring changes on the Earth’s surface over long periods of time. High-resolution satellite imagery can reveal small objects such as ships, cars, aircraft, and individual houses, whereas medium-resolution satellite imagery can reveal relatively larger objects such as ports, roads, airports, and large buildings [1][6]. The total data volume, however, is far too large to be inspected by human eyes. We therefore need an efficient algorithm for automatic object detection in satellite imagery. Previous works employed higher-order local autocorrelation [1], random forests [2], and deep learning [3][4][5][6]. Among these, deep learning [7][8] has shown higher object detection accuracy than other machine learning methods.

As an example target object, we selected golf courses because they exist everywhere in the world and are typically of a recognizable size and shape at the 30-meter resolution of Landsat 8 imagery. According to the R&A report [9], there were 33,161 golf courses worldwide in 2016, which provides more than enough data for training and detection purposes. Once we establish an accurate algorithm, we can continuously monitor the new construction and disappearance of all the golf courses on Earth.

Among general object detection frameworks, Faster R-CNN [8] is the state of the art. It consists of a region proposal network that predicts candidate regions and a region classification network that classifies each object proposal. It is an end-to-end detector that outputs object locations and categories simultaneously. Similarly, we propose a model called the “high recall network” specifically for detecting candidate golf course regions and a model called the “high precision network” for further confirmation. We then compared the proposed method with existing methods [1][6].

2 Method

The framework of our object detection method, which involves a two-step process, is illustrated in Fig. 1. The first step employs the high recall network (HRN) model to find as many candidate regions as possible. The second step uses the high precision network (HPN) model for binary classification (golf course or not) of the HRN output. Each step is customized for its purpose, as described in Sections 2.1 and 2.2.

Fig. 1: The flowchart of our 2-step CNN model

2.1 High Recall Network (HRN)

The training and validation data set for the HRN is derived from 117 clear Landsat 8 scenes taken between 2013 and 2015 across Japan, excluding five areas reserved for testing (see Section 3.1 for details). Each area is observed 3 to 4 times at different dates to account for seasonal variation. Moreover, we have prepared Ground Truth (GT) polygons that outline all the golf courses in Japan. The Landsat scenes were gridded into tiles of 16×16 pixels. A tile is classified as positive if the coverage of GT polygons exceeds 20% of its area; tiles with no overlap with golf courses are classified as negative; and the remainder, with coverage between 0% and 20%, are classified as neither positive nor negative (Fig. 2).
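The tile-labelling rule above can be sketched as follows. This is a minimal illustration, not the authors’ code: the function name and the binary-mask representation of the GT polygons are assumptions.

```python
import numpy as np

def label_tiles(gt_mask, tile_size=16, pos_threshold=0.20):
    """Label fixed-size tiles from a binary ground-truth mask.

    A tile is positive if GT coverage exceeds 20% of its area,
    negative if it contains no GT pixels, and excluded otherwise.
    """
    h, w = gt_mask.shape
    labels = {}
    for row in range(0, h - tile_size + 1, tile_size):
        for col in range(0, w - tile_size + 1, tile_size):
            tile = gt_mask[row:row + tile_size, col:col + tile_size]
            coverage = tile.mean()  # fraction of GT pixels in the tile
            if coverage > pos_threshold:
                labels[(row, col)] = "positive"
            elif coverage == 0.0:
                labels[(row, col)] = "negative"
            # tiles with 0% < coverage <= 20% are left unlabeled
    return labels
```

Tiles in the intermediate coverage band are simply omitted from the returned dictionary, mirroring the “neither positive nor negative” class in the text.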

Ishii et al. [6] proposed an object detection method that applies classification to detection in the manner of a Fully Convolutional Network (FCN) [7] for satellite imagery. They found that recall increases as the relative abundance of negative images in the training data set decreases. Our HRN model is equivalent to their model (Fig. 2), but the percentage of negative data is adjusted to achieve higher recall (Table 1). In addition, we focus on recall rather than precision during training. The HRN training process outputs a learned snapshot model every 10 epochs. From the many snapshot models generated, we selected the one with the highest recall among those with a precision of at least 50%. When selecting this model, we use the validation data set rather than the training data set.
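The snapshot-selection rule can be expressed compactly. A sketch follows, assuming snapshots are summarized as (identifier, recall, precision) tuples measured on the validation set; the data layout is illustrative:

```python
def select_hrn_snapshot(snapshots, min_precision=0.5):
    """Pick the snapshot with the highest validation recall among
    those whose validation precision is at least the floor (50%)."""
    eligible = [s for s in snapshots if s[2] >= min_precision]
    if not eligible:
        raise ValueError("no snapshot reaches the precision floor")
    return max(eligible, key=lambda s: s[1])  # maximize recall
```

Applying the precision floor first, then maximizing recall, keeps the HRN from degenerating into a detector that flags everything.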

Fig. 2: The High Recall Network (HRN) has an FCN structure consisting of 4 convolution layers.

2.2 High Precision Network (HPN)

The HPN structure and training process are illustrated in Fig. 3. The HPN performs binary classification on the candidate regions resulting from the HRN. Input data for this process are derived by cropping the HRN output into tiles of 64×64 pixels (red rectangles in Fig. 3).

For HPN model training, positive images are generated by cropping the satellite image around the centroid of each GT golf course polygon. We conducted data augmentation by rotating each positive image in steps of 90° and flipping it horizontally and vertically. Negative images are produced by negative mining, i.e., cropping around the centroids of false-positive regions generated by the HRN. The number of convolution layers is increased to 8 to achieve the highest precision.
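The rotation-and-flip augmentation can be sketched as below. This is one plausible reading of the scheme (four 90° rotations plus horizontal and vertical flips of the original, six variants per tile); the exact variant count in the paper is not stated.

```python
import numpy as np

def augment_positive(tile):
    """Return augmented copies of a positive tile: the four
    90-degree rotations, plus horizontal and vertical flips."""
    variants = [np.rot90(tile, k) for k in range(4)]  # 0, 90, 180, 270 deg
    variants.append(np.fliplr(tile))                  # horizontal flip
    variants.append(np.flipud(tile))                  # vertical flip
    return variants
```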

Fig. 3: The High Precision Network (HPN) comprises 8 convolutional layers and a fully connected layer.

3 Experiments and Results

In this section, we conducted golf course detection experiments using our method as well as previous methods (Ishii et al. [6] and Uehara et al. [1]). For direct comparison, we employed the five scenes of Uehara et al. [1] as testing data (Fig. 4). It should be noted again that these testing data are not included in the training data. After removing cloudy areas, we classified the output tiles as true positives (TP) if they contain GT polygons and as false positives (FP) otherwise. GT polygons with no overlapping detected tiles are regarded as false negatives (FN). In Fig. 5, red lines are golf course regions from GT and red rectangles indicate the output tiles: (a) shows a TP in which a golf course was correctly detected; (b) shows an FP in which an area was erroneously identified as a golf course; and (c) shows an FN in which a golf course was not detected due to its atypical structure (a narrow course, no observable trees, etc.).

Table 2 compares the performance of the three methods measured by recall [TP/(TP+FN)], precision [TP/(TP+FP)], and F-measure [2·precision·recall/(precision+recall)]. Taking the F-measure as an overall performance proxy, our method improved on the two previous methods by about 4%. This improvement can be attributed to the collaboration between the HRN and the HPN. In actual applications, the F-measure is not necessarily the best performance index. Some applications require an exhaustive list of possible target candidates even at lower precision, while others need higher precision at the cost of lower recall. Our framework can easily be adapted to such diverse applications because it is explicitly separated into two complementary networks.
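The three metrics quoted above follow directly from the tile counts; the helper below is a sketch with illustrative counts, matching the formulas in the text.

```python
def detection_scores(tp, fp, fn):
    """Compute recall, precision, and F-measure from tile counts:
    recall = TP/(TP+FN), precision = TP/(TP+FP),
    F = 2*precision*recall/(precision+recall)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure
```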

                  HRN                     HPN
             positive   negative     positive   negative
training       52,276    936,000      110,314    298,945
validation     17,620  4,686,120       35,766    938,271
Table 1: Number of training and validation tiles used in the experiments
Fig. 4: Test images used for the experiment. The scene IDs are (A): LC81070302016189, (B): LC81080342015193, (C): LC81080362016196, (D): LC81110362015214, (E): LC81130372015212.
             A      B      C      D      E     Total
Recall
  Ishii    0.333  0.736  0.743  0.903  0.889  0.759
  Uehara   0.428  0.714  0.848  0.841  0.897  0.753
  Ours     0.892  0.929  0.833  0.937  0.908  0.901
Precision
  Ishii    0.443  1.000  1.000  0.981  0.986  0.924
  Uehara   0.715  0.947  0.924  0.910  0.911  0.895
  Ours     0.456  0.956  0.977  0.953  0.959  0.874
F-measure
  Ishii    0.380  0.848  0.852  0.940  0.935  0.833
  Uehara   0.535  0.814  0.884  0.874  0.904  0.818
  Ours     0.604  0.942  0.900  0.945  0.933  0.870
Table 2: Experimental results (recall, precision, and F-measure per test scene)
Fig. 5: An example of experimental results with the proposed method: (a) True Positives (b) False Positives (c) False Negatives. The red lines represent GT polygons of the golf course, while rectangles indicate the output tiles (64x64 pixels) with detection.

4 Conclusion

In this paper, we proposed a method to detect arbitrary objects in satellite imagery. The method integrates two convolutional neural networks (CNNs) devoted to high recall and high precision, respectively. We customized each CNN through its network structure, the selection of input training data, and the target parameters for optimization. Our method showed a 4% overall improvement over previous methods. Furthermore, we can flexibly tune the relative importance of recall and precision by balancing the two networks.

In future work, we will (1) expand the target areas to a global level and evaluate generalization performance; (2) increase the tile size for the HPN so as to apply state-of-the-art CNNs for general images [10][11]; and (3) compare with state-of-the-art two-step algorithms for general images, such as Faster R-CNN [8].

5 Acknowledgment

This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).



  • [1] Kazuki Uehara, Hidenori Sakanashi, Hirokazu Nosato, Masahiro Murakawa, Hiroki Miyamoto, and Ryosuke Nakamura, “Object Detection on Satellite Images Using Multi-Channel Higher-order Local Autocorrelation,” in 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC).
  • [2] Jordan M. Malof, Kyle Bradbury, Leslie M. Collins, and Richard G. Newell, “Automatic Detection of Solar Photovoltaic Arrays in High Resolution Aerial Imagery,” Applied Energy, vol. 183, pp. 229-240, 2016.
  • [3] Weijia Li, Haohuan Fu, Le Yu, and Arthur Cracknell, “Deep Learning Based Oil Palm Tree Detection and Counting for High-Resolution Remote Sensing Images,” Remote Sensing, vol. 9, 22, 2017.
  • [4] Lei Liu, Zongxu Pan, and Bin Lei, “Learning a Rotation Invariant Detector with Rotatable Bounding Box,” arXiv:1711.09405, 2017.
  • [5] Rodrigo F. Berriel, André Teixeira Lopes, Alberto F. de Souza, and Thiago Oliveira-Santos, “Deep Learning Based Large-Scale Automatic Satellite Crosswalk Classification,” IEEE Geoscience and Remote Sensing Letters, vol. 14, pp. 1513-1517, 2017.
  • [6] Tomohiro Ishii, Edgar Simo-Serra, Satoshi Iizuka, Yoshihiko Mochizuki, Akihiro Sugimoto, Ryosuke Nakamura, and Hiroshi Ishikawa, “Detection by Classification of Buildings in Multispectral Satellite Imagery,” in 2016 International Conference on Pattern Recognition (ICPR).
  • [7] Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully Convolutional Networks for Semantic Segmentation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in 2015 Neural Information Processing Systems (NIPS).
  • [9] The R&A, “Golf around the world 2017,” https://www.randa.org/
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [11] Gao Huang, Zhuang Liu, and Laurens van der Maaten, “Densely Connected Convolutional Networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).