Boundary Loss for Remote Sensing Imagery Semantic Segmentation
In response to the growing importance of geospatial data, its analysis including semantic segmentation becomes an increasingly popular task in computer vision today. Convolutional neural networks are powerful visual models that yield hierarchies of features and practitioners widely use them to process remote sensing data. When performing remote sensing image segmentation, multiple instances of one class with precisely defined boundaries are often the case, and it is crucial to extract those boundaries accurately. The accuracy of segments boundaries delineation influences the quality of the whole segmented areas explicitly. However, widely-used segmentation loss functions such as BCE, IoU loss or Dice loss do not penalize misalignment of boundaries sufficiently. In this paper, we propose a novel loss function, namely a differentiable surrogate of a metric accounting accuracy of boundary detection. We can use the loss function with any neural network for binary segmentation. We performed validation of our loss function with various modifications of UNet on a synthetic dataset, as well as using real-world data (ISPRS Potsdam, INRIA AIL). Trained with the proposed loss function, models outperform baseline methods in terms of IoU score.
Keywords:Semantic Segmentation, Deep Learning, Aerial Imagery, CNN, Loss Function, Building Detection, Computer Vision
A. Bokhovkin, E. Burnaev
Semantic segmentation of remote sensing images is a critical process in the workflow of object-based image analysis, which aim is to assign each pixel to a semantic label [20, 13]. It has applications in environmental monitoring, urban planning, forestry, agriculture, and other geospatial analysis. Although a general image segmentation problem is relatively well investigated, we consider a particular type of this problem related to buildings segmentation.
The common aspect of urban aerial imagery is its high resolution (from 0.05 to 1.0 m). Higher GSD brings a lot of small details and structures, but also increases intra-class variance and decreases inter-class differences. This applies particularly to the class of buildings, including an enormous number of shapes and patterns. After the stage of segmenting buildings, further analysis could be done, such as distinct building classification and height estimation, population prediction, economic forecasting, etc. For most of the tasks, it is crucial to separate all buildings and minimize the number of false positive/false negative instances. Even with high-resolution imagery, it is difficult to segment every object separately, especially in high-density urban areas, although we get more accurate information about the boundaries of buildings. Therefore it is necessary to construct a method that increases the attention of a neural network, used for semantic segmentation, to borders of closely located objects.
CNNs for semantic segmentation typically use such loss functions as cross-entropy, dice score and direct optimization while training the network, but they are not sensitive enough to some misalignment of boundaries. Even being deviated from the ground truth by 5-10 pixels, predicted boundary does not significantly contribute to the value of the loss functions, mentioned above, and scores of common pixel-wise metrics. To better account for boundary pixels we select metric (see original work  and its extension ) to construct a differentiable surrogate and use it in training. The surrogate is not used alone for training, but as a weighted sum with loss (from direct optimization). We found that the impact of the boundary component of the loss function should be gradually increased during training, and so we proposed a policy for the weight update. The assumption why this works is that for the first epochs network learns to label instances as shapeless blobs with very uncertain boundaries, and after finding an instance we only need to adjust its borders because they define the whole segments explicitly.
In experiments we also found that network trained with a combination of and boundary loss converges faster: we increase the confidence of the network in its predictions on the border as fast as for intra-segment pixels; with other losses, the mask near the borders is very blurred. The next point is that network better handles edge effects while baseline models tend to miss or distort masks on the edge of an input image. Besides, if it is highly important to separate neighboring buildings, methods of instance segmentation are often applied, which can require extra data preparation from raw masks and they are commonly multistage. However, binary segmentation with boundary loss is end-to-end. In experiments on the challenging datasets ISPRS Potsdam and INRIA AIL, we obtain and pixel-wise score respectively. There is also a comparison with another method for accurate delineation of curvilinear structures , which encounters difficulties in remote sensing tasks. Our results show a consistent increase in the performance for all models on both datasets in comparison to various loss functions.
2 Related work
The subject of our work intersects with two branches of research, which are Deep Neural Networks (DNNs) and direct optimization of performance measures or their differentiable versions also called surrogates. Besides, we should mention methods for accurate boundary delineation, as one of them (see ) was compared to our approach.
DNNs and semantic segmentation. Semantic segmentation based on DNNs is a pixel-wise mapping to semantic labels. Since 2012 a lot of neural networks  for solving this task were constructed. The success of AlexNet  in ImageNet challenge marked the beginning of neural networks application in computer vision tasks such as classification, object detection, and semantic segmentation. Initially, for the latter problem, they adapted CNNs, used for classification. Namely, they remove fully connected layers, and use initial layers as an encoder to downsample an image and derive features that have stronger semantic information but low spatial resolution. Another part of the network called decoder obtains a high-resolution mask of labels, which upsamples features into the resulting mask.
For the last few years, DNNs have shown to be very effective for semantic segmentation task. There are several models now that are regarded as state-of-the-art. First of all, it is SegNet architecture  which fits online video segmentation very well. DeepLab network  uses atrous convolutions and CRFs at postprocessing step to refine small details of the segmentation. The FRRN  is an example of a model with multi-scale processing technique; it is based on two separate streams of data to better handle the high-level semantic information and the low-level pixel information simultaneously. Finally, UNet architecture  is very popular nowadays for its enormous flexibility. Among aforementioned models UNet remains preferable because of the high efficiency of feature extraction and ability to apply various neural network architectures as backbones. UNet is an asymmetric FCNN (Fully Convolutional Neural Network) with skip connections between the downsampling and upsampling paths. UNet has shown its high performance in competitions and research projects, and we use it for all experiments in the paper.
Performance measure optimization. There already exist papers concerning direct optimization of metrics or their differentiable versions. In various applications such measures as , -score, -area, are widely used. Constructing surrogates of these measures is the most tricky part because it can be very hard or even impossible to replace non-differentiable operations with differentiable ones and keep computational efficiency at the same time. For the task of segmentation, is usually used to measure the performance of any segmentation approach. As a result, there exists a lot of its surrogates, and the goal is to minimize the gap between the actual value and its differentiable approximation. In paper  they proposed loss, which approximates naive -loss with a neural network. Another approach  called Lovasz-Softmax loss, is based on the convex Lovasz extension of submodular losses. A critical property for this surrogate is that it effectively captures the absolute minimum of the original loss.
In this paper, we construct a differentiable version of the metric, presented in . The motivation is that boundaries of segments explicitly define them and their extraction is necessary for accurate building segmentation. Authors of  proposed a novel boundary metric that accounts for accuracy of contour extraction and overcomes the weaknesses of mainstream performance measures. For we will construct a differentiable surrogate and show how it increases the accuracy of segmenting borders.
Below we provide a list of metrics and loss functions to be used for comparative analysis:
Direct loss: ,
loss: where is a weight to balance two components,
loss: implemented in .
The terms denote pixel sets of true positive, false positive, true negative, false negative classes on a predicted binary map. For all loss functions, notations of corresponding metrics or approaches, inducing them, are provided in a subscript. By we denote a weighted sum
for an arbitrary weight .
Accurate boundary delineation. One possible solution to extract borders accurately is to use CRF (Conditional Random Fields). It works as an extra layer above the output of the original neural network. CRF is applied to capture additional contextual information and to produce much more refined prediction . Authors of  used CRF together with Bayesian decision theory and proposed a heuristic to maximize the value of . Another approach, a combination of CRF and superpixels , takes advantage of superpixel edge information and the constraint relationship among different pixels. Proposed algorithm of boundary optimization made it possible to improve score by 3% on PASCAL VOC 2012  and Cityscapes  datasets compared to the performance of plain FCN.
We compare an approach of  to the one, proposed in our paper. Authors claim that pixel-wise losses alone such as binary cross-entropy are unsuitable for predicting curvilinear structures. For this problem, they developed a new loss term based on features extracted with VGG19 network . Conceptually their approach is similar to ours because we add our surrogate to the weighted conventional loss function. In contrast to VGG19 features, using our surrogate, we manually extract boundaries which are features too and encourage a neural network to draw attention to them much stronger.
3 Our Method
In this work, we use deep neural network UNet as a base model, because it is one of the most reliable and universal models. As we already pointed out in the introduction, there are a lot of different types of encoders and decoders that we can use as backbones. Of course, this network can be extended to two-headed version (UNet with two decoders), where the first head predicts whole segments and the other predicts only their boundaries. However, this approach requires additional computational resources and data preparation compared to the original UNet. Our interest is to propose a method that would not increase implementation and computational complexities. Further, we first introduce metric from  and then describe how to construct its differentiable surrogate.
3.1 Boundary metric ()
Let , be the binary maps of class in the ground truth and predicted segmentation respectively, , — the boundaries for these binary maps. Then the precision and the recall for class are defined as follows:
where brackets denote indicator function of a logical expression, is Euclidean distance measured in pixels, - some predefined threshold on a distance; in all experiments we set to or . Here distance is calculated from a point to one of two sets (ground truth or predicted boundary) as the shortest distance from the point to the point of the boundary. The measure for class is defined as:
3.2 Surrogate construction ()
To construct the differentiable version of the metric , we need to extract the boundary of any segment somehow. There are several ways to do this. Let us denote by a binary map for an arbitrary class for some image , predicted by a neural network, a ground truth map for the same class and the same image. We expect values of the predicted map to be distributed in , and values of the ground truth map to be in . The boundaries can be defined as
Here applies a pixel-wise max-pooling operation to the inverted predicted or ground truth binary map with a sliding window of size (hyperparameter must be as small as possible to extract vicious boundary; usually we set ). Value corresponds to inversion of any pixel of the map. To compute Euclidean distances from pixels to boundaries a supporting map should be obtained, which is the map of the extended boundary:
The value of hyperparameter can be determined as not greater than the minimum distance between neighboring segments of the binary ground truth map. After that precision and recall can be computed as follows:
where operation denotes pixel-wise multiplication of two binary maps and operation — pixel-wise summation of a binary map. Finally, the reconstructed metric and a corresponding loss function are defined as:
All operations above are differentiable in terms of either derivative or subderivative. In Fig. 1 there is an example of how the proposed surrogate works. It is worth to mention that of course we can use convolutions with edge detection filters, such as Sobel, but in experiments, we found that this approach is not as effective as max-pooling. Nevertheless, both methods have their drawbacks. The disadvantages of both methods are that on the first step we extract boundary not narrower than - pixels in width, but with Sobel filters, boundaries are even wider and more blurred. The main drawback of the used max-pooling operation is that gradients are passed only in tensor cells with a maximum value within a current sliding window. Sobel operators do not have this problem and take into account all pixels of the sliding window.
4.1 Example on synthetic dataset (AICD)
To demonstrate capabilities of the novel boundary loss and test the hypothesis that can assist a synthetic dataset was used. It consists of images with one primitive segment in each image. We trained on this dataset without augmentations and postprocessing a not deep fully convolutional neural network with four conv layers in the encoder and four conv layers in the decoder to show that outperforms other losses under simple conditions.
|Ground truth||Precision map|
First of all, we present a working principle of the new loss in Fig. 1. In the upper row (see (b), (d) and (f)) we depict ground truth, where (b) is a binary mask for the ground truth segment, (d) is a binary mask of the gt boundary after applying (1). Then we compute expanded boundary (f) using pooling operations (2). The same pipeline is applied to the predicted mask (color intensity represents the probability that pixel belongs to a foreground). After that we obtain Precision map, which represents pixel-wise multiplication of maps corresponding to (d) and (g), and Recall map, which is a multiplication of maps (e) and (f). Precision and Recall are normalized pixel-wise sums of these maps.
Some examples of how loss outperforms loss are presented in Fig. 2. It is clear that in comparison with training with helps better delineate segment borders: they are very similar to ground truth boundaries and even meet at the corners in places where they should be. The second advantage is that the loss better handles edge effects. The third column clearly explains this kind of a problem when a building is located near the edge of the image. loss is unable to segment the building near the edge, however does it accurately. Finally, the boundary loss converges faster on this synthetic dataset, and such behavior is expected not only for this simple data distribution. The intuition behind is that network keeps attention mainly on a boundary, which consists of fewer pixels in comparison with the entire segment.
4.2 INRIA AIL dataset segmentation
The dataset  consists of 180 tiles (RGB) of 405 area in total with GSD of m. There are only two classes: building and not building.
Training on this dataset is more complex than on AICD. As for preprocessing, very strong augmentations were applied including vertical, horizontal flips, hue/saturation adjustments, many kinds of noises and blurs. Every orthophoto was split into patches of size and UNet with Inception-ResNet-v2  as backbone was trained on these patches. After that, all predicted masks for every patch were merged into one mask of the whole orthophoto. Also, test-time augmentation (flips, reflections) was applied during the inference step. As for the training process, first, the network was trained from ImageNet  weights with a frozen encoder for three epochs, and the learning rate was set to . After that all batch-normalization layers were unfrozen, and the learning rate was decreased to . Then all layers were unfrozen and trained for 100 epochs with the learning rate of . Finally, the network was trained for 60 epochs with the learning rate of . We used Adam optimizer with default Keras  settings. There were several experiments with different loss functions. As for , , and losses, they did not require any adjustments while training: these losses were trained as a weighted sum with one by one. As for loss, it requires an additional procedure for mini grid-search: after the 8th epoch for every 30 epochs and for every weight in equation a network was trained. Then the best weight is chosen, and the process repeats. In Fig. 4 we see the evolution of metric while training with combinations of and , , , , or . is also a very important metric for remote sensing image segmentation (see Fig. 4).
Finally, from figures, we see that the final segmentation is better than the baseline model trained with in terms of and . Now we want to understand what is better when training with the boundary loss. Below we provide an example comparing two segmentations.
In Fig. 6 on the right we see more accurate shapes of edges and corners. The important feature, which follows from Fig. 6, is that several buildings are standing in one row with a distance between adjacent constructions about 5 pixels. In comparison with , our loss managed to separate instances of buildings much better.
4.3 ISPRS Potsdam dataset segmentation
The dataset  contains 38 patches, each consisting of a true orthophoto extracted from the larger mosaic. There are five channels (RGB+NIR+DSM) with GSD m for each channel. As a model, UNet was trained with very strong augmentations similar to that used for INRIA AIL. We used test-time augmentation (flips, reflections), set the learning rate to and divided it by a factor of if for epochs there was no improvement in validation loss. We trained a neural network on patches of orthophotos with Adam optimizer.
In Fig. 8 and Fig. 8 we see the evolution of and metric while training with combinations of and , , , , . Here we see slightly faster convergence of for , the final segmentation is 1-2% better than baseline . Also there is an advantage that training with loss is much less noisy.
Even though value is the best with training, boundary metric converges faster with VGG loss; authors of the original paper mentioned that this loss function is also efficient for boundaries delineation. In Fig. 9 and Fig. 10 there are several examples segmented with and the boundary loss.
In Table 1 and Table 2 we provide final results of different losses in the task of binary segmentation of buildings for different models and datasets. Backbones for comparison in these tables are chosen is such way that they showed the best performance over all we have tried (VGG16, VGG19, ResNet34, DenseNet121, DenseNet169, Inception-ResNet-v2).
|AICD||ISPRS Potsdam||INRIA AIL|
|AICD||ISPRS Potsdam||INRIA AIL|
Finally, here are some constraints and recommendations:
It is better to use boundary loss only for segments with precise edges and corners such as buildings, roads etc.
Let us discuss the choice of parameters and (1) and (2) respectively. The best choice for the parameter is as less as possible, but at the same time it must be possible to extract solid boundaries of segments. The next constraint is that and should be smaller than minimum distance between segments, otherwise the performance of segmenting current element is affected by other segments because of overlapped expanded boundaries. Via the trial and error process we set to and to - as a proper choice, because theses values deliver the most accurate boundaries in all experiments.
In this work, we introduced a novel loss function to encourage a neural network better taking into account segment boundaries. As motivation, we applied this approach to a synthetic dataset of binary segmentation and after that proved its validity on real-world data. Our experimental results demonstrate that optimizing assisted by metric surrogate leads to better performance and more accurate boundaries delineation compared to using other loss functions. As for future work, we would like to extend our approach to handle multi-class semantic segmentation. We are going to elaborate on this approach by increasing its computational capabilities with large scale sparse convolutional neural networks , using a new loss function, specially tailored for imbalanced classification [28, 5], utilizing an approach for combining multi-modal data through CNNs features aggregation  and imposing a confidence measure on top of the segmentation model based on the non-parametric conformal approach [14, 25, 31].
-  Alam, F.I., Zhou, J., Liew, A.W., Jia, X., Chanussot, J., Gao, Y.: Conditional random field and deep feature learning for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 57(3), 1612–1628 (2019). https://doi.org/10.1109/TGRS.2018.2867679
-  Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 2481–2495 (2016)
-  Berman, M., Triki, A.R., Blaschko, M.B.: The lovasz-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 4413–4421 (2018)
-  Burnaev, E., Cichocki, A., Osin, V.: Fast multispectral deep fusion networks. Bull. Pol. Ac.: Tech. 66(4), 875–880 (2018)
-  Burnaev, E., Erofeev, P., Papanov, A.: Influence of resampling on accuracy of imbalanced classification. In: Eighth International Conference on Machine Vision, 987525 (December 8, 2015). Proc SPIE, vol. 9875 (2015). https://doi.org/10.1117/12.2228523
-  Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 834–848 (2018). https://doi.org/10.1109/TPAMI.2017.2699184
-  Chollet, F., et al.: Keras. https://github.com/fchollet/keras (2015)
-  Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
-  Csurka, G., Larlus, D., Perronnin, F.: What is a good evaluation measure for semantic segmentation? IEEE PAMI 26, 1–11 (2004)
-  Deng, J., Dong, W., Socher, R., Li, L., and: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009)
-  Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
-  Fernandez-Moral, E., Martins, R., Wolf, D., Rives, P.: A new metric for evaluating semantic segmentation: leveraging global and contour accuracy. In: Workshop on Planning, Perception and Navigation for Intelligent Vehicles, PPNIV17. Vancouver, Canada (Sep 2017), https://hal.inria.fr/hal-01581525
-  Ignatiev, V., Trekin, A., Lobachev, V., Potapov, G., Burnaev, E.: Targeted change detection in remote sensing images. In: Eleventh International Conference on Machine Vision (ICMV 2018); 110412H (2019). Proc. SPIE, vol. 11041 (2019). https://doi.org/doi.org/10.1117/12.2523141
-  Ishimtsev, V., Bernstein, A., Burnaev, E., Nazarov, I.: Conformal k-nn anomaly detector for univariate data streams. In: Proc. of 6th Workshop COPA. PRML, vol. 60, pp. 213–227. PMLR (2017)
-  Krizhevsky, A., Sutskever, I., E. Hinton, G.: Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems pp. 1097–1105 (2012)
-  Maggiori, E., Tarabalka, Y., Charpiat, G., Alliez, P.: Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark. In: 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). pp. 3226–3229 (2017). https://doi.org/10.1109/IGARSS.2017.8127684
-  Mosinska, A., Marquez-Neila, P., Kozinski, M., Fua, P.: Beyond the pixel-wise loss for topology-aware delineation. pp. 3136–3145 (06 2018). https://doi.org/10.1109/CVPR.2018.00331
-  Nagendar, G., Singh, D., Balasubramanian, V.N., Jawahar, C.V.: Neuro-iou: Learning a surrogate loss for semantic segmentation. In: British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018. pp. 278–289 (2018), http://bmvc2018.org/contents/papers/1055.pdf
-  Notchenko, A., Kapushev, Y., Burnaev, E.: Large-scale shape retrieval with sparse 3d convolutional neural networks. In: Analysis of Images, Social Networks and Texts. pp. 245–254. Springer (2018)
-  Novikov, G., Trekin, A., Potapov, G., Ignatiev, V., Burnaev, E.: Satellite imagery analysis for operational damage assessment in emergency situations. In: Abramowicz, W., Paschke, A. (eds.) Business Information Systems – 21st International Conference, BIS 2018, Berlin, Germany, July 18-20, 2018, Proceedings. Lecture Notes in Business Information Processing, vol. 320, pp. 347–358. Springer (2018)
-  Nowozin, S.: Optimal decisions from probabilistic models: The intersection-over-union case. 2014 IEEE Conference on Computer Vision and Pattern Recognition pp. 548–555 (2014)
-  Pohlen, T., Hermans, A., Mathias, M., Leibe, B.: Full-resolution residual networks for semantic segmentation in street scenes. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 3309–3318 (2017)
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. pp. 234–241. Springer International Publishing, Cham (2015)
-  Rottensteiner, F., Sohn, G., Jung, J., Gerke, M., Bailard, C., Benitez, S., Breitkopf, U.: The isprs benchmark on urban object classification and 3d building reconstruction. In: Shortis, M., Paparoditis, N., Mallett, C. (eds.) ISPRS 2012 Proceedings of the XXII ISPRS Congress : Imaging a Sustainable Future, 25 August - 01 September 2012, Melbourne, Australia. Peer reviewed Annals, Volume I-7, 2012. pp. 293–298. International Society for Photogrammetry and Remote Sensing (ISPRS) (8 2012)
-  Safin, A., Burnaev, E.: Conformal kernel expected similarity for anomaly detection in time-series data. Adv. in Systems Sci.and Appl. 17(3), 22–33 (2017)
-  Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4), 640–651 (2017). https://doi.org/10.1109/TPAMI.2016.2572683
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv 1409.1556 (09 2014)
-  Smoliakov, D., Korotin, A., Erifeev, P., Papanov, A., Burnaev, E.: Meta-learning for resampling recommendation systems. In: Eleventh International Conference on Machine Vision (ICMV 2018); 110411S (2019). Proc. SPIE, vol. 11041 (2019). https://doi.org/10.1117/12.2523103
-  Sulimowicz, L., Ahmad, I., Aved, A.J.: Superpixel-enhanced pairwise conditional random field for semantic segmentation. 2018 25th IEEE International Conference on Image Processing (ICIP) pp. 271–275 (2018)
-  Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: AAAI. pp. 4278–4284 (2016)
-  Volkhonskiy, D., Burnaev, E., Nouretdinov, I., Gammerman, A., Vovk, V.: Inductive conformal martingales for change-point detection. In: Proc. of 6th Workshop COPA. PRML, vol. 60, pp. 132–153. PMLR (2017)