A Comparison and Strategy of Semantic Segmentation on Remote Sensing Images
In recent years, with the development of aerospace technology, an increasing number of satellite images are used to obtain information. However, a large number of useless raw images, limited data storage resources and poor transmission capability on satellites hinder the use of valuable images. It is therefore necessary to deploy an on-orbit semantic segmentation model that filters out useless images before data transmission. In this paper, we present a detailed comparison of recent deep learning models. Considering the computing environment of satellites, we compare methods in terms of accuracy, parameters and resource consumption on the same public dataset, and analyze the relations between these indicators. Based on the experimental results, we further propose a viable on-orbit semantic segmentation strategy. It will be deployed on the TianZhi-2 satellite, which supports deep learning methods and will be launched soon.
Compared with natural images, remote sensing images usually contain more kinds of targets at a lower resolution, and irregular-shaped targets interfere with each other. These characteristics make object detection and recognition in remote sensing images difficult.
As a basic method of image understanding, semantic segmentation performs pixel-level classification of an image. Unlike image classification and object detection, semantic segmentation produces not only the category, size and quantity of a target, but also its accurate boundary and position. Therefore, it is well suited to processing remote sensing images.
Traditional semantic segmentation methods were proposed early on, such as the active contour model , the watershed algorithm  and graph cuts . They usually require manually set thresholds or interactive controls, and they are not accurate enough.
In recent years, deep learning has greatly promoted the development of semantic segmentation, and several deep learning based methods have been applied to remote sensing images with good performance. Kampffmeyer et al.  propose a deep Convolutional Neural Network (CNN) for segmenting small objects in remote sensing images of urban areas. Volpi et al.  use Fully Convolutional Networks (FCN)  with 50-layer deep residual networks  for image segmentation. Some works have focused on comparing deep learning based semantic segmentation methods. Garcia-Garcia et al.  introduce and compare the structures and evaluation results of most existing semantic segmentation methods on several natural image datasets. Ball et al.  survey applications and challenges of deep learning theories and related tools for remote sensing images. However, a comparative study of semantic segmentation models in terms of accuracy, parameters and resource consumption is still lacking for remote sensing images.
TianZhi satellites are designed to verify the feasibility of satellite-based intelligent applications and, ultimately, to build a space-based intelligent system. With the successful launch of TianZhi-1, this feasibility has been initially verified. Unlike TianZhi-1, TianZhi-2 is equipped with a powerful computing system that supports deep learning algorithms, and it will be launched soon. Based on this platform, we can deploy deep learning models to perform on-orbit semantic segmentation of remote sensing images.
Motivated by these prior studies, we conduct a comparison of recent popular semantic segmentation methods on remote sensing images. The main contributions of this paper can be summarized as follows:
First, we use a new image cropping method for a publicly available remote sensing image dataset. Using a training set of the same size, we fairly compare five efficient semantic segmentation methods in terms of structure, accuracy, parameters and resource consumption. These deep learning based methods are FCN, U-Net , SegNet , Pyramid Scene Parsing Network (PSPNet)  and DeepLab .
Second, we propose an on-orbit semantic segmentation strategy for TianZhi-2 based on experimental results.
The remainder of this paper is organized as follows. In Section 2, we introduce the comparison of the five methods in detail. We describe the experiments and discuss the results in Section 3. Section 4 presents the on-orbit semantic segmentation strategy for TianZhi-2. Finally, the paper is concluded in Section 5.
2 Methods Comparison
Many deep learning based semantic segmentation methods have been proposed, and some of them have been applied to remote sensing images. We select the five most representative methods: FCN, U-Net, SegNet, PSPNet and DeepLab. These methods and their variants achieve excellent results in public competitions such as the PASCAL VOC-2012 semantic segmentation task . According to their structure, the main deep learning methods for semantic segmentation can be divided into two classes: 1) Encoder-Decoder structures; 2) Multi-Scale representation structures. In this section, we introduce the five methods based on their relations. As shown in Figure 1, the methods are arranged by the time they were first proposed.
FCN  led to a rapid increase in the number of semantic segmentation networks. The model transforms all fully connected layers into convolutional layers and accepts input images of arbitrary size. In addition, it combines semantic information from deep layers with information from shallow layers through a skip architecture to produce the final segmentation. FCN has been used for remote sensing in . The method has three main variants:
FCN-32s The basic single-stream FCN structure, which produces the final segmentation by upsampling at stride 32.
FCN-16s This variant combines predictions from FCN-32s with those from the pool4 layer at stride 16, producing finer results.
FCN-8s This model further adds predictions from the pool3 layer at stride 8, providing additional precision. Because remote sensing images contain both coarse, large objects and small, important details, we use this variant in our experiments.
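The skip fusion of the three FCN variants can be illustrated with a small numpy sketch. The score maps below are hypothetical (random values standing in for per-class predictions), and nearest-neighbor upsampling stands in for FCN's learned deconvolution; the point is only the stride-32/16/8 fusion arithmetic.

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbor upsampling; FCN uses a learned deconvolution instead."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

# Hypothetical per-class score maps at three strides for a 256x256 input
# (spatial dimensions only; the class dimension is omitted for brevity).
score_pool5 = np.random.rand(8, 8)    # stride 32
score_pool4 = np.random.rand(16, 16)  # stride 16
score_pool3 = np.random.rand(32, 32)  # stride 8

# FCN-8s fusion: upsample 2x, add pool4; upsample 2x, add pool3; upsample 8x.
fused = upsample(score_pool5, 2) + score_pool4
fused = upsample(fused, 2) + score_pool3
segmentation = upsample(fused, 8)     # back to input resolution

print(segmentation.shape)  # (256, 256)
```

FCN-32s would upsample `score_pool5` directly by 32; FCN-16s stops after the first fusion and upsamples by 16.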
Built upon FCN, U-Net  adopts an Encoder-Decoder architecture that consists of a contracting path to capture context and a symmetric expanding path to enable precise localization. It was originally designed to segment medical images and achieves good results with fewer training samples. In recent years, studies have shown that U-Net is also suitable for remote sensing images , and it has great potential for improvement. In , Lu et al. propose a dual-resolution U-Net which uses pairs of images as inputs to capture both high- and low-resolution features.
Similar to U-Net, SegNet  is also built on an Encoder-Decoder structure. Its encoder network is topologically identical to the 13 convolutional layers of VGG-16 . The decoder network uses max-pooling indices recorded by the corresponding encoder to recover location information. In , Bischke et al. use SegNet with a new cascaded multi-task loss to better preserve semantic segmentation boundaries in high-resolution satellite images.
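The max-pooling-indices mechanism can be sketched in plain numpy: the encoder records where each 2x2 maximum came from, and the decoder places values back at exactly those positions. This is a minimal illustration of the idea, not SegNet's actual implementation.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling that also records argmax positions (encoder side)."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    indices = np.zeros((h // 2, w // 2), dtype=int)
    for i in range(h // 2):
        for j in range(w // 2):
            window = x[2*i:2*i+2, 2*j:2*j+2]
            indices[i, j] = window.argmax()  # position within the 2x2 window
            pooled[i, j] = window.max()
    return pooled, indices

def unpool(x, indices, out_shape):
    """Sparse upsampling that restores values to their recorded positions (decoder side)."""
    out = np.zeros(out_shape)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            di, dj = divmod(indices[i, j], 2)
            out[2*i + di, 2*j + dj] = x[i, j]
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 6., 3., 2.],
              [7., 1., 0., 4.]])
pooled, idx = max_pool_with_indices(x)
restored = unpool(pooled, idx, x.shape)
print(pooled)    # [[4. 5.] [7. 4.]]
print(restored)  # maxima back at their original locations, zeros elsewhere
```

Because the decoder reuses the encoder's indices instead of learning an upsampling, boundary locations survive the pooling/unpooling round trip with few extra parameters.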
DeepLab  exploits a powerful FCN architecture built mainly on three components:
Atrous Convolution The algorithm was originally proposed for computing the undecimated wavelet transform in . With a dilation rate of r, the effective size of a k × k filter kernel is enlarged to k_e = k + (k − 1)(r − 1).
The algorithm fills the gaps between consecutive filter values with zeros, so the parameters and computation of the model do not increase.
Atrous Spatial Pyramid Pooling (ASPP) ASPP applies atrous convolutions with different dilation rates in parallel to capture multi-scale information about objects.
Fully-Connected Conditional Random Fields (CRFs) A fully connected CRF, as proposed in , is integrated into DeepLab to refine boundaries blurred by the reduced spatial resolution of feature maps, and it greatly improves localization performance.
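The atrous convolution described above can be checked with a few lines of numpy: the effective kernel size follows the formula k + (k − 1)(r − 1), and a 1-D dilated convolution simply spaces its taps r samples apart. This is an illustrative sketch, not DeepLab's implementation.

```python
import numpy as np

def effective_kernel_size(k, r):
    """Effective receptive field of a k-tap filter with dilation rate r."""
    return k + (k - 1) * (r - 1)

def atrous_conv1d(signal, kernel, rate):
    """1-D atrous (dilated) convolution: taps are spaced `rate` samples apart."""
    k = len(kernel)
    span = effective_kernel_size(k, rate)
    out = np.zeros(len(signal) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * signal[i + j * rate] for j in range(k))
    return out

print(effective_kernel_size(3, 1))  # 3 (ordinary convolution)
print(effective_kernel_size(3, 2))  # 5
print(effective_kernel_size(3, 4))  # 9: larger field, still only 3 weights

signal = np.arange(10.0)
print(atrous_conv1d(signal, [1.0, 1.0, 1.0], 2))  # [ 6.  9. 12. 15. 18. 21.]
```

ASPP runs several such convolutions with different rates (e.g. 6, 12, 18, 24 in DeepLab) on the same feature map and combines their outputs, capturing context at multiple scales for the same parameter budget.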
Based on DeepLab, PSPNet  exploits a pyramid pooling module to aggregate global image context, trained with an auxiliary loss. Since DeepLab provides two versions of the model adapted from VGG-16 and ResNet-101, PSPNet can likewise be applied to VGG- and ResNet-based network structures.
Table 1 gives a detailed comparison of the methods described above.
3 Experiments and Results
In this section, we first introduce the dataset and the image cropping method. Then, we compare five deep learning models in semantic segmentation with multiple indicators.
We evaluate the methods on a public subset of the Inria aerial image labeling benchmark . The available dataset contains 180 images at 0.3 m resolution, collected from five cities: Austin, Chicago, Kitsap, Vienna and West Tyrol, with 36 images per city. As described in , the first 5 images of each city are used for testing and the remaining 155 images for training. The pixel-level labels contain two classes: building and non-building.
For all images, we extract patches with 12 pixels of overlap between adjacent patches. The patch size is better suited to existing semantic segmentation methods and hardware, and the overlap effectively prevents most small objects from being destroyed by image cropping. After extraction, we have 15,500 images for training and 2,500 for testing. We use the same training and test sets for all models.
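A minimal sketch of such overlapping patch extraction is shown below. Since the paper leaves the patch size unstated, the 256-pixel patch size here is an assumed value for illustration; only the 12-pixel overlap comes from the text.

```python
import numpy as np

def extract_patches(image, patch_size, overlap):
    """Slide a window across the image with the given overlap between patches.

    Border pixels are dropped when the image size is not covered exactly.
    """
    stride = patch_size - overlap
    patches = []
    h, w = image.shape[:2]
    for top in range(0, h - patch_size + 1, stride):
        for left in range(0, w - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches

# Hypothetical sizes for illustration.
image = np.zeros((500, 500))
patches = extract_patches(image, patch_size=256, overlap=12)
print(len(patches), patches[0].shape)  # 4 (256, 256) -- a 2x2 grid of patches
```

The overlap means an object lying on a patch boundary appears intact in at least one neighboring patch, which is why cropping destroys fewer small objects.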
3.2 Experimental Results
We compare the segmentation performance, parameters, resource consumption and edge prediction of five methods: FCN-8s, U-Net, SegNet, PSPNet and DeepLab v2. All of them are based on the VGG structure. The base learning rate is 0.0001 and is updated by the "poly" policy. All models are trained for 200 epochs, and the best results obtained during training are reported.
Segmentation Performance To evaluate segmentation performance, we use accuracy and the Intersection over Union (IoU), which is better suited to evaluating unbalanced datasets. We evaluate the methods on the overall test set and for each city separately. Table 2 shows that U-Net and DeepLab outperform the other models. More specifically, U-Net achieves the best results in the first three cities, while DeepLab is better in the last two. On the overall test set, DeepLab achieves the higher IoU, suggesting that DeepLab may be suitable for the unbalanced datasets that are more common in real-world environments.
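The two metrics are easy to state precisely. For a binary building mask, a toy example shows why IoU is the stricter measure on unbalanced data: a prediction can score high accuracy simply by matching the dominant background class, while IoU is computed only over pixels belonging to either mask.

```python
import numpy as np

def accuracy(pred, label):
    """Fraction of pixels classified correctly."""
    return (pred == label).mean()

def iou(pred, label):
    """Intersection over Union of a binary (building / non-building) mask."""
    pred, label = pred.astype(bool), label.astype(bool)
    intersection = np.logical_and(pred, label).sum()
    union = np.logical_or(pred, label).sum()
    return intersection / union if union else 1.0

pred  = np.array([[1, 1, 0, 0]])
label = np.array([[1, 0, 0, 0]])
print(accuracy(pred, label))  # 0.75
print(iou(pred, label))       # 0.5 -- the false positive costs far more
```

On an image that is mostly non-building, predicting all background already gives high accuracy but zero IoU for the building class, which is why IoU is the more informative number in Table 2.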
Parameters Figure 2 (left) shows the number of parameters of each model. FCN is the largest model and SegNet the smallest. Note that this indicator is closely related to the storage resources and transmission capability of satellites.
Resource Consumption Figure 2 (right) shows floating point operations (FLOPs), which reflect the resource consumption of each model. SegNet and U-Net require fewer FLOPs owing to their Encoder-Decoder structures, while the other models, with Multi-Scale representation structures, require more.
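For intuition, both indicators can be computed per layer with standard closed forms. The sketch below uses the common convention that one multiply-accumulate counts as 2 FLOPs (conventions differ; some papers report MACs instead), applied to a VGG-16-style first layer as an example.

```python
def conv_params(k, c_in, c_out):
    """Weights plus biases of a k x k convolution layer."""
    return (k * k * c_in + 1) * c_out

def conv_flops(k, c_in, c_out, h_out, w_out):
    """Multiply-accumulates over every output pixel, counted as 2 FLOPs each."""
    return 2 * k * k * c_in * c_out * h_out * w_out

# VGG-16's first layer: 3x3 conv, 3 -> 64 channels, 224x224 output.
print(conv_params(3, 3, 64))           # 1792
print(conv_flops(3, 3, 64, 224, 224))  # 173408256, ~0.17 GFLOPs for one layer
```

The asymmetry between the two is why parameter count and FLOPs rank the models differently: FLOPs scale with the output resolution at which each layer runs, so multi-scale branches operating on large feature maps are expensive even when they add few weights.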
Edge Prediction Figure 3 shows several segmentation results of the five methods. We observe that U-Net and DeepLab predict edges well, which helps to restore the shape of targets. The edges predicted by SegNet are also sharp, but many points are misclassified.
4 On-orbit semantic segmentation strategy for TianZhi-2
Due to the complexity of remote sensing image processing and the limited energy on satellites, it is necessary to consider segmentation performance and resource consumption simultaneously. Based on the experimental results in Section 3, we conclude that DeepLab is more suitable for on-orbit semantic segmentation, and we propose the following strategy for segmenting remote sensing images on the satellite:
Improve semantic segmentation performance for remote sensing images. First, because the target distribution in remote sensing images is uneven and irregular, extracting object regions before segmentation can guide the segmentation and improve its performance. Second, because there are many irregular-shaped targets in remote sensing images, optimizing target edge prediction can further improve segmentation performance.
Reduce resource consumption by model pruning. To reduce resource occupation on the satellite, we can use norm-based methods to prune the model while keeping performance almost intact or even improving it.
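A common norm-based variant ranks each convolution filter by the L1 norm of its weights and drops the weakest ones; the sketch below illustrates that ranking step on a hypothetical layer. It is a simplification: in practice pruning one layer also removes the corresponding input channels of the next layer, and the network is fine-tuned afterwards.

```python
import numpy as np

def prune_filters(weights, keep_ratio):
    """Keep the filters with the largest L1 norms; drop the rest.

    weights: (num_filters, c_in, k, k) convolution kernel tensor.
    """
    norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(round(keep_ratio * weights.shape[0])))
    keep = np.argsort(norms)[-n_keep:]    # indices of the strongest filters
    return weights[np.sort(keep)]

rng = np.random.default_rng(0)
layer = rng.normal(size=(64, 32, 3, 3))   # hypothetical conv layer
pruned = prune_filters(layer, keep_ratio=0.5)
print(layer.shape, "->", pruned.shape)    # (64, 32, 3, 3) -> (32, 32, 3, 3)
```

Halving the filters in this example halves the layer's parameters and output channels, which is exactly the kind of reduction the limited storage and energy budget on the satellite calls for.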
With the methods above, on-orbit semantic segmentation of remote sensing images can be achieved. After the satellite captures images, our image cropping method splits them into smaller patches. The model takes these patches as input and outputs the corresponding semantic segmentation results. From these results, many useful products can be derived, such as object image tiles, target locations and language descriptions. They will greatly reduce the data transmission cost of the satellite, offering a powerful solution to the limitation of satellite bandwidth.
In this paper, we study deep learning based semantic segmentation methods for remote sensing images. We conduct a detailed comparative analysis of their structures and evaluate their performance on the same public dataset. The experimental results show that DeepLab performs well in terms of accuracy, parameters and resource consumption. Using this model, we propose a strategy for on-orbit semantic segmentation on TianZhi-2. We will further adapt the method to the satellite environment and strive to apply it to TianZhi-2, which will be launched soon.
-  Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International journal of computer vision, no. 4, 321-331 (1988).
-  Meyer, F., Beucher, S.: Morphological segmentation. Journal of visual communication and image representation, no. 1, 21-46 (1990).
-  Boykov, Y. Y., Jolly, M. P.: Interactive graph cuts for optimal boundary and region segmentation of objects in ND images. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 1, pp. 105-112 (2001).
-  Kampffmeyer, M., Salberg, A. B., Jenssen, R.: Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 1-9 (2016).
-  Volpi, M., Tuia, D.: Dense semantic labeling of subdecimeter resolution images with convolutional neural networks.IEEE Transactions on Geoscience and Remote Sensing 55, no. 2, 881-893 (2017).
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431-3440 (2015).
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778 (2016).
-  Garcia-Garcia, A., Orts-Escolano, S., Oprea, S., Villena-Martinez, V., Garcia-Rodriguez, J.: A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857 (2017).
-  Ball, J. E., Anderson, D. T., Chan, C. S.: Comprehensive survey of deep learning in remote sensing: theories, tools, and challenges for the community. Journal of Applied Remote Sensing 11, no. 4, 042609 (2017).
-  Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234-241. Springer, Cham (2015).
-  Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2481-2495 (2017).
-  Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 2881-2890 (2017).
-  Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A. L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40, no. 4, 834-848 (2018).
-  Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International journal of computer vision 111, no. 1, 98-136 (2015).
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
-  Huang, B., Lu, K., Audebert, N., Khalel, A., Tarabalka, Y., Malof, J., Lefèvre, S.: Large-scale semantic classification: outcome of the first year of Inria aerial image labeling benchmark. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2018) (2018).
-  Lu, K., Sun, Y., Ong, S. H.: Dual-Resolution U-Net: Building Extraction from Aerial Images. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 489-494. IEEE (2018).
-  Bischke, B., Helber, P., Folz, J., Borth, D., Dengel, A., Waterman, M.S.: Multi-task learning for segmentation of building footprints with deep neural networks. arXiv preprint arXiv:1709.05932 (2017).
-  Holschneider, M., Kronland-Martinet, R., Morlet, J., Tchamitchian, P.: A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pp. 286-297. Springer, Berlin, Heidelberg (1990).
-  Krähenbühl, P., Koltun, V.: Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems pp. 109-117 (2011).
-  Maggiori, E., Tarabalka, Y., Charpiat, G., Alliez, P.: Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark. In IEEE International Symposium on Geoscience and Remote Sensing (IGARSS) (2017).