MAP-Net: Multiple Attending Path Neural Network for Building Footprint Extraction from Remote Sensed Imagery
Building footprint extraction is a basic task in mapping, image understanding and computer vision. Accurately and efficiently extracting building footprints from a wide range of remote sensed imagery remains a challenge due to the complex structures, variety of scales and diverse appearances of buildings. Existing convolutional neural network (CNN)-based building extraction methods are criticized for their inability to detect tiny buildings because the spatial information of CNN feature maps is lost during repeated pooling operations. Additionally, large buildings still have inaccurate segmentation edges. Moreover, features extracted by a CNN are always partially restricted by the size of the receptive field, and large-scale buildings with low texture are often discontinuous and holey when extracted. To alleviate these problems, recent studies have introduced multi-scale strategies to extract buildings of different scales. However, the higher-resolution features are generally extracted from shallow layers, which carry insufficient semantic information for tiny buildings. This paper proposes a novel multiple attending path neural network (MAP-Net) for accurately extracting multi-scale building footprints and precise boundaries. Unlike existing multi-scale feature extraction strategies, MAP-Net learns spatial localization-preserved multi-scale features through a multi-parallel path in which each stage is gradually generated to extract high-level semantic features with fixed resolution. Then, an attention module adaptively squeezes channel-wise features extracted from each path for optimized multi-scale fusion, and a pyramid spatial pooling module captures global dependency for refining discontinuous building footprints.
Experimental results show that our method achieved 0.88%, 0.93% and 0.45% F1-score and 1.53%, 1.50% and 0.82% intersection over union (IoU) score improvements without increasing computational complexity compared with the latest HRNetv2 on the Urban 3D, Deep Globe and WHU datasets, respectively. Specifically, MAP-Net outperforms MA-FCN, a state-of-the-art (SOTA) algorithm that uses post-processing and model voting strategies, on the WHU dataset without pre-training or post-processing. The TensorFlow implementation is available at https://github.com/lehaifeng/MAPNet.
The rapid development of remote sensing technology has made it easier to acquire large numbers of high-resolution optical remote sensing images, which support the extraction of building footprints over wide areas [5, 46, 34]. Timely and accurate building footprint information is significant for illegal building monitoring, 3D building reconstruction, urban planning and disaster emergency response. Because of the low inter-class variance and high intra-class variance of buildings in optical remote sensed imagery, parking lots, roads and other non-building objects can look highly similar to buildings. Moreover, with the variety of building materials, scales and illumination, the appearance of buildings in remote sensed imagery varies significantly. Therefore, accurately and efficiently extracting building footprints from remote sensed imagery remains a challenge.
Over the past two decades, numerous algorithms have been proposed to extract building footprints. They can be divided into two categories: traditional image processing-based and CNN-based methods.
Traditional building extraction methods utilize the characteristics of the spectrum, texture, geometry and shadow [13, 31, 16, 6, 10, 50] to design feature operators for extracting buildings from optical images. Since these features vary under different illumination conditions, sensor types and building architectures, traditional methods can resolve only specific issues for specific data. The studies in [2, 14, 17, 39] combined optical imagery with GIS data and digital surface models (DSM) obtained from light detection and ranging (Lidar) or synthetic aperture radar interferometry to distinguish non-building areas that are highly similar to buildings. This increased the robustness of building extraction, although acquiring the corresponding multi-source data over a wide area is always costly.
Because buildings in remote sensing images are diverse in structure, appearance and scale, building extraction algorithms have evolved from handcrafted feature-based methods to learned feature-based methods, such as deep convolutional neural networks (DCNNs). Moreover, advances in CNN design have made very deep networks practical. For instance, ResNet with 152 layers introduced identity mapping in a residual block to alleviate the degradation problem in training very deep networks, making it possible to design deeper networks that extract richer semantic features.
Evolving from CNNs, fully convolutional network (FCN)-based building extraction methods [23, 27, 43, 36, 12, 35, 4, 40, 1, 48] achieve impressive results and are often used in building semantic segmentation tasks. Encoder-decoder frameworks [30, 25, 34, 37, 3, 28] obtain more accurate building footprints than FCN-based methods, particularly in boundary localization, since they recover spatial details through skip connections that fuse shallow high-resolution features in the decoder stage. Deep Encoding Network (DE-Net) applied recent techniques to an encoder-decoder network for building extraction and achieved higher performance. Nevertheless, coarse features introduced by shallow layers become the main obstacle to accurate building boundaries. The studies in [25, 34, 8, 32] used a conditional random field (CRF) for post-processing to refine the edges of footprints, which greatly improved building boundaries. Multi-scale aggregation FCN (MA-FCN) extracted multi-scale features based on a feature pyramid network and applied polygon regularization methods for boundary refinement, which achieved SOTA results on the WHU dataset.
For the problem of multi-scale building extraction, the studies in [29, 25] integrated hierarchical results extracted from multiple models, or designed a specific CNN architecture that handles multi-scale input, to accurately extract multi-scale buildings. CU-Net applied multi-constraints based on FCN to enhance multi-scale feature representation and performed better in building segmentation than FCN. SiU-Net designed two weight-shared branches for multi-scale input and was reported to improve the segmentation accuracy, especially for large buildings, yet it clearly increased the computational complexity. SRI-Net aggregated multi-scale features generated from a modified ResNet101 by a spatial residual inception module, which retains local details well owing to the higher-resolution features kept throughout the network. A pyramid spatial pooling module, which introduces several global pooling layers, was proposed to capture multi-scale features without significantly increasing the computational complexity. It is more efficient than  and  in multi-scale building extraction and improves continuity, since the global dependency is captured by the pooling layers. EU-Net proposed a deep spatial pyramids pooling (DSPP) module for multi-scale feature extraction, which benefits the extraction of localization details and multi-scale buildings. It also introduced focal loss to reduce the impact of mislabelling, which makes training more stable.
The attention mechanism [20, 15, 47, 7, 44, 9, 22] is another method for capturing global relations with long-range dependencies in the spatial or channel dimension, which effectively improves segmentation performance. Recently, HRNetv1 and HRNetv2 proposed a high-resolution CNN to address multi-scale feature extraction and achieved strong results in semantic segmentation. However, because features with different resolutions are fused with each other during feature extraction, the fine boundary details of building localization may be lost.
In previous studies, CNN-based building footprint extraction algorithms have mainly been encoder-decoder-based: they lose spatial details in the encoder stage and recover them by fusing shallow feature maps in the decoder stage. However, this causes inaccurate localization of building boundaries because of the coarse features introduced from shallow layers, and small buildings may go unrecognized. Additionally, the extracted features are always partially restricted by the local receptive field, and large-scale buildings with low texture are often discontinuous and holey when extracted.
Although many studies have noted the importance of multi-scale features for accurate building extraction, many aspects still need to be improved. This research proposes MAP-Net, inspired by HRNetv2, to solve the problems described above. First, a parallel multi-path network is generated gradually, stage by stage; each path consists of serial convolution blocks with fixed spatial resolution, so that it extracts high-level semantics while preserving multi-level localization details. Then, an attention mechanism-based feature enhancement module is introduced to adaptively squeeze channel-wise feature maps from each path for an optimal combination of multi-scale features. Pyramid spatial pooling operations follow to extract global semantic information for continuous building footprints. The main contributions of this study are as follows:
We propose MAP-Net for efficient and accurate multi-scale building footprint boundary extraction through parallel localization-preserved convolutional networks. Unlike SOTA methods such as HRNetv2, our basic hypothesis is that keeping the features of each path independent and single-scale is more beneficial. Hence, features are extracted independently in each path at a fixed scale, and the multi-scale features are fused only at the end of the encoder by the attention module.
We introduce a channel-wise attention module to adaptively squeeze multi-scale features extracted from the multi-path network. These features strengthen the building representation by optimally combining global semantics and spatial localization.
We validate the effectiveness of the introduced modules in MAP-Net and high-resolution features extracted from shallow layers on building extraction through extensive ablation studies.
The proposed method achieved 0.88%, 0.93% and 0.45% F1-score and 1.53%, 1.50% and 0.82% intersection over union (IoU) score improvements compared with the latest HRNetv2 on the Urban 3D, Deep Globe and WHU datasets, respectively, and outperforms other SOTA methods on the WHU dataset without pre-training.
The rest of the paper is organized as follows. Section II introduces the detailed structure of the proposed network for building extraction. Section III describes the experiments and analyses the results. The discussions and conclusions of this paper are presented in Section IV.
Repeated pooling layers or strided convolutions lose spatial localization during feature extraction. Existing CNN-based building extraction methods recover spatial localization either through skip connections that fuse shallow feature maps or by upsampling the feature maps of the last layer by interpolation. However, shallow feature maps contain coarse semantics, which introduces noise into building extraction. In addition, convolutions process local neighbourhood information and can hardly capture the global dependency needed for large buildings. We propose MAP-Net for multi-scale building footprint extraction with accurate boundaries and continuous entities. Fig. 1 illustrates the structure of the proposed MAP-Net. It mainly includes three components:
A parallel multi-path network to extract multi-scale high-level semantic features while preserving spatial detail information through fixed feature resolution in every path;
An attention-based module that adaptively squeezes multi-scale features, followed by a spatial pooling module for global semantic enhancement;
An interpolation-based upsampling module for building footprint extraction.
As shown in Fig. 1, the detail-preserved multi-scale feature extraction network includes three stages, and a parallel path is generated gradually in each stage to extract richer high-level semantic representations with fixed feature spatial resolution to preserve local details. At the end of the feature extraction module, the multi-scale features extracted from each path are upsampled by bilinear interpolation to the resolution of the features from path 1 and fused by concatenation for attention-based adaptive optimization. The spatial pooling module extracts global dependency to suppress holes and obtain continuous building entities in the final extraction module. The number of channels C and the resolution of the feature maps are marked in the figure. H and W represent the height and width of the input image, respectively.
The remainder of this section is arranged as follows. Section B presents the detail-preserved multi-path feature extraction network. Section C describes the attention-based adaptive squeeze of multi-scale features and the spatial global pooling enhancement. Finally, Section D describes the basic units and training strategies involved in this study.
II-B Localization-Preserved Multi-path Network
Compared with the encoder-decoder-based CNN structure, the advantage of a localization-preserved multi-path feature extraction network is that it extracts multi-scale features that contain rich high-level semantic representation and accurate spatial localization information, rather than recovering spatial details by fusing shallow feature maps during decoding. Multi-scale features extracted from different stages are fed into several parallel paths that are gradually generated to extract richer semantics and preserve spatial resolution without increasing the computational complexity of the network.
Fig. 2 illustrates part of the proposed multi-path network. In the previous stage, two parallel paths extract feature maps with different dimensions; in the next stage, a new path is generated to extract higher-level semantic features at half the resolution and with twice the channels, as the green arrow shows. The new path is created by a max pooling layer that downsamples the resolution and a 1×1 convolutional layer that increases the number of feature channels. The blue arrow represents a 3×3 convolutional layer that transports features from the last convolution block to the next one. Each convolution block is composed of four residual blocks that extract high-level semantic representations; details are described in part D. The feature maps extracted in each path maintain their spatial resolution, and richer semantics are extracted as the depth of the convolution layers increases. Features with higher resolution preserve more localization details, and those with lower resolution capture more global semantics. The most important difference from HRNetv2 is that the multi-scale features extracted from different paths are not fused with each other during extraction, since the fusion operation may weaken localization details.
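The creation of a new path can be illustrated with a minimal NumPy sketch. It is not the authors' TensorFlow code: the 2×2 pooling window, the random weight initialization and the helper names `max_pool_2x2`, `conv_1x1` and `spawn_path` are assumptions made for illustration.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W, C) array; H and W must be even."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def conv_1x1(x, weight):
    """1x1 convolution: a per-pixel linear map from C_in to C_out channels."""
    return x @ weight  # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)

def spawn_path(x, rng):
    """Hypothetical helper: halve the resolution and double the channels."""
    c = x.shape[-1]
    weight = rng.standard_normal((c, 2 * c)) * 0.01  # random init (assumption)
    return conv_1x1(max_pool_2x2(x), weight)

rng = np.random.default_rng(0)
features = rng.standard_normal((128, 128, 64))  # path 1 features for a 512x512 input
new_path = spawn_path(features, rng)
print(new_path.shape)  # (64, 64, 128)
```

The sketch only reproduces the change of shape; the residual convolution blocks that follow in each path are omitted.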
In the entire process of feature extraction, spatial resolutions and channels of feature maps in each path are fixed. Features in each path are extracted by a series of convolutional blocks that suppress the coarse semantics in high-resolution feature maps compared with encoder-decoder-based CNN. Because detailed representation is preserved in higher-resolution features, smaller buildings and localization of the boundary can be extracted exactly. The effect of a multi-path network that considers preserved localization and high-level semantics is explained in experiment 3.C.1).
Considering the trade-off between complexity and accuracy, MAP-Net is composed of three parallel paths for extracting multi-scale features, as analyzed in experiment 3.C.2). The resolutions of the feature maps are 1/4, 1/8 and 1/16 of the original image, and the corresponding numbers of channels are 64, 128 and 256.
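As a quick check of these numbers, the following sketch lists the feature shape carried by each path for a 512 × 512 input and the channel count after concatenation. The function name `path_shapes` is hypothetical.

```python
def path_shapes(height, width, base_channels=64, num_paths=3):
    """Return the (H, W, C) shape of the feature map carried by each path.

    Path 1 runs at 1/4 of the input resolution; every further path halves
    the resolution and doubles the number of channels.
    """
    shapes = []
    for k in range(num_paths):
        stride = 4 * 2 ** k
        shapes.append((height // stride, width // stride, base_channels * 2 ** k))
    return shapes

shapes = path_shapes(512, 512)
fused_channels = sum(c for _, _, c in shapes)
print(shapes)          # [(128, 128, 64), (64, 64, 128), (32, 32, 256)]
print(fused_channels)  # 448
```

With C = 64, the concatenated feature map has 64 + 128 + 256 = 448 = 7C channels, which matches the vector length 7C used by the attention module.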
II-C Attention-Based Feature Squeeze
Feature maps extracted from the multiple paths have different dimensions. Higher-resolution features contain localization details and high-level semantic information, while lower-resolution features provide richer global information. The features are upsampled to 1/4 of the original image size through bilinear interpolation and fused by concatenation, as shown in Fig. 3. It is important to explore the optimal channel-wise combination of the multi-scale features since they contain multi-level building localization details. We introduce a channel attention squeeze module that adaptively measures the significance of each channel to optimize the multi-scale features. A spatial pooling enhancement module is then introduced to capture global dependence for continuously extracting building entities, especially large buildings with low texture. The details are described as follows.
In previous CNN-based methods [40, 29, 25, 34, 26], multi-scale features were concatenated directly for the final pixel-wise prediction. In our research, however, the multi-scale features from different paths contain multi-level spatial localization and richer semantic representation, and features at different scales influence building extraction differently. The same holds for individual channels: some of them may weaken the semantic representation while still increasing the computational complexity. It is therefore necessary to identify the valuable channel-wise features for accurate and efficient building extraction, yet prior knowledge can hardly weight the importance of each channel. The attention-based feature adaptive squeeze module, inspired by previous work, learns a weight for each channel and automatically reconstructs the feature maps for optimal representation.
As illustrated in Fig. 4, a global average pooling operation produces a vector of length 7C from the concatenated multi-scale channel-wise features. A fully connected layer with a 7C × 7C weight matrix follows and learns a weight vector of length 7C, one entry per channel. The parameters of the fully connected layer are randomly initialized and gradually learned from the features. Finally, the vector that represents the significance of each channel is normalized by a sigmoid function and multiplied with the original features to reconstruct the enhanced feature maps.
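The forward pass of this channel attention can be sketched in NumPy as follows. This is an illustration under assumptions, not the released implementation: the bias term, the initialization scale and the small 32 × 32 spatial size are chosen here only to keep the example light.

```python
import numpy as np

def channel_attention(features, fc_weight, fc_bias):
    """Reweight the channels of an (H, W, K) feature map.

    Steps: global average pooling -> fully connected layer (K x K) ->
    sigmoid -> channel-wise multiplication with the input.
    """
    squeezed = features.mean(axis=(0, 1))        # (K,) one value per channel
    logits = squeezed @ fc_weight + fc_bias      # (K,)
    weights = 1.0 / (1.0 + np.exp(-logits))      # sigmoid, each weight in (0, 1)
    return features * weights, weights           # broadcast over H and W

rng = np.random.default_rng(0)
C = 64
K = 7 * C                                        # 64 + 128 + 256 concatenated channels
fused = rng.standard_normal((32, 32, K))         # small spatial size for illustration
fc_weight = rng.standard_normal((K, K)) * 0.01   # random init (assumption)
fc_bias = np.zeros(K)                            # bias term is an assumption
enhanced, weights = channel_attention(fused, fc_weight, fc_bias)
print(enhanced.shape, weights.shape)             # (32, 32, 448) (448,)
```

In training, `fc_weight` and `fc_bias` would be learned by backpropagation together with the rest of the network.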
Because the extracted features are always partially restricted by local receptive fields, a spatial pooling module is introduced to extract global dependence. The implementation is similar to earlier pyramid pooling modules, except that the global features are generated by four average pooling layers of different sizes, chosen according to the dimensions of the features, and are added pixel-wise to the original feature maps for global spatial enhancement. The module captures global spatial relations that a CNN cannot extract because of its local receptive field. Hence, the extracted buildings have better integrity.
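A rough NumPy sketch of such a spatial pooling enhancement is given below. Several details are assumptions, since the text does not specify them: the bin sizes (1, 2, 4, 8), the nearest-neighbour repetition used to bring pooled features back to full size, and the absence of any convolution on the pooled features.

```python
import numpy as np

def pooled_context(x, bins):
    """Average-pool an (H, W, C) map into a bins x bins grid, then repeat
    each cell back to (H, W, C). H and W must be divisible by bins."""
    h, w, c = x.shape
    cell_h, cell_w = h // bins, w // bins
    grid = x.reshape(bins, cell_h, bins, cell_w, c).mean(axis=(1, 3))  # (bins, bins, C)
    return np.repeat(np.repeat(grid, cell_h, axis=0), cell_w, axis=1)

def spatial_pooling_enhance(x, bin_sizes=(1, 2, 4, 8)):
    """Add four pooled context maps to the input pixel-wise (bin sizes assumed)."""
    out = x.copy()
    for bins in bin_sizes:
        out += pooled_context(x, bins)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16, 3))
enhanced_map = spatial_pooling_enhance(x)
print(enhanced_map.shape)  # (16, 16, 3)
```

The coarsest level (one bin) adds the global average of each channel to every pixel, which is how context from the whole image reaches locations that a local receptive field cannot cover.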
II-D Basic Block and Training Strategy
To avoid the noise contained in shallow features due to local receptive fields and to decrease the computational complexity, a downsampling block is introduced to decrease the resolution of the features before the multi-path network, as shown in Fig. 4. It consists of a strided convolution layer, two 3×3 convolutional layers and a max pooling layer, and it extracts feature maps with 64 channels at 1/4 of the spatial resolution of the input image. Fig. 4 also shows the conv block, which includes several residual blocks in series. The impact of different numbers of blocks on performance is explored in experiment 3.C.2). The residual block consists of a 1×1 convolutional layer for reducing the dimensions of the features, two 3×3 layers for extracting features, and a 1×1 convolutional layer for restoring the dimensions to those of the input; a shortcut fuses the input with the output through element-wise addition, and batch normalization (BN) and ReLU are applied before the convolutional layers, as illustrated in Fig. 4. The building footprint extraction module is also shown in Fig. 4: the resolution of the features is recovered through bilinear interpolation in two stages, and convolutional layers are used to decrease the number of channels. The output layer is a single-channel feature map with the same spatial resolution as the input, and each value represents the probability that the pixel belongs to a building.
Our method was implemented in TensorFlow and trained on a single 2080Ti GPU with 12 GB of memory. The Adam optimizer was chosen with an initial learning rate of 0.001, and beta1 and beta2 were left at their recommended default values. All compared methods were trained from scratch for approximately 80 epochs until convergence on the three building datasets described in Section III, with random rotation and flipping for data augmentation. The batch size was set to 4 given the restrictions of the GPU memory size, and the same hyperparameters were maintained across methods for a fair comparison.
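The random rotation and flipping can be sketched as below. The text does not state the rotation angles, so this example assumes rotations by multiples of 90 degrees and a flip probability of 0.5; the function name `augment` is hypothetical. The key point it shows is that the image and its label mask must receive the same transform.

```python
import numpy as np

def augment(image, mask, rng):
    """Apply the same random rotation and flips to an image and its mask.

    Assumptions: rotations are multiples of 90 degrees and each flip is
    applied with probability 0.5.
    """
    k = int(rng.integers(0, 4))                      # number of 90-degree turns
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    if rng.random() < 0.5:                           # horizontal flip
        image, mask = np.flip(image, axis=1), np.flip(mask, axis=1)
    if rng.random() < 0.5:                           # vertical flip
        image, mask = np.flip(image, axis=0), np.flip(mask, axis=0)
    return image, mask

rng = np.random.default_rng(0)
image = rng.random((512, 512, 3))
mask = (image[..., 0] > 0.5).astype(np.uint8)        # toy mask derived from the image
aug_image, aug_mask = augment(image, mask, rng)
print(aug_image.shape, aug_mask.shape)               # (512, 512, 3) (512, 512)
```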
Sigmoid cross-entropy was selected as the loss function because the task is pixel-wise binary classification. The loss at position $i$ is given in (1), where $\text{logits}_i$ is the predicted result and $y_i$ is the ground truth; the sigmoid function is applied to the logits so that the predicted probability $p_i \in (0, 1)$, as shown in (2). The loss value is the average of $L_i$ over all positions of an input image.

$$L_i = -\left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \tag{1}$$

$$p_i = \frac{1}{1 + e^{-\text{logits}_i}} \tag{2}$$
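The sigmoid cross-entropy described here can be sketched in NumPy as follows. The example uses the standard numerically stable rearrangement of the loss, which is algebraically equal to the direct form; the toy logits and labels are made up for illustration.

```python
import numpy as np

def sigmoid(logits):
    """Map logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logits))

def sigmoid_cross_entropy(logits, labels):
    """Mean pixel-wise binary cross-entropy computed from raw logits.

    Uses max(x, 0) - x * y + log(1 + exp(-|x|)), which equals
    -[y log p + (1 - y) log(1 - p)] with p = sigmoid(x) but avoids overflow.
    """
    per_pixel = (np.maximum(logits, 0) - logits * labels
                 + np.log1p(np.exp(-np.abs(logits))))
    return per_pixel.mean()

logits = np.array([[2.0, -1.0], [0.0, 3.0]])   # toy 2x2 prediction
labels = np.array([[1.0, 0.0], [1.0, 0.0]])    # toy 2x2 ground truth
loss = sigmoid_cross_entropy(logits, labels)

# Direct form for comparison.
p = sigmoid(logits)
naive = -(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean()
print(np.isclose(loss, naive))  # True
```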
III Experiment and Analysis
To evaluate the proposed method, we conducted comprehensive experiments on three open datasets: the WHU building dataset, the Deep Globe Building Extraction Challenge dataset and the USSOCOM Urban 3D Challenge dataset. The details are described as follows.
The WHU building dataset includes both aerial and satellite subsets with corresponding shapefiles and raster images. In our experiment, we selected the aerial subset, which has various appearances and scales of buildings, to evaluate the robustness of the proposed algorithm. It consists of more than 187,000 buildings, covering over 450 km², with a ground resolution of 30 cm. Each image has three bands, corresponding to red (R), green (G) and blue (B) wavelengths, with a size of 512 × 512 pixels. There are 8,188 image tiles in total, including 4,736, 2,416 and 1,036 tiles for the training, test and validation sets, respectively. We conducted our experiment with the originally provided dataset partitioning.
The Deep Globe Building Extraction Challenge dataset contains WorldView-3 satellite imagery captured over Las Vegas, Paris, Shanghai and Khartoum. In this research, the Las Vegas and Shanghai subsets were selected to evaluate the generalization performance of the proposed algorithm. There were approximately 243,382 buildings at a ground resolution of 30 cm, covering over 1,216 km², and the size of each image was 650 × 650 pixels. All images were randomly divided in a 6:1:3 ratio into the training, validation and test sets, respectively.
The USSOCOM Urban 3D Challenge dataset contains 208 orthorectified RGB images, with corresponding DSM and digital terrain models (DTM) generated from commercial satellite imagery. It contains approximately 157,000 buildings, covering over 360 km² with a ground resolution of 50 cm, and the size of each image is 2048 × 2048 pixels. DSM and DTM indicate the elevation of buildings, which clearly improves building extraction performance. We used only the RGB images in our experiment to evaluate the performance of the proposed method. The training, validation and test sets include 104, 62 and 42 tiles, respectively, following the original data partition, and we randomly clipped the images to 512 × 512 pixels for training and testing.
III-B Evaluation Metric
Generally, evaluation metric methodologies can be divided into two categories: pixel-level metrics and instance-level metrics. The pixel-level method counts the correctly classified and misclassified pixels pixel-wise. In the instance-level method, a building is correctly extracted only when the IoU between the prediction and ground truth is larger than a specific threshold. Semantic segmentation-based building footprint extraction aims to classify every pixel, whether or not it belongs to a building, for a specific input image. Therefore, we apply a pixel-level metric including precision, recall, F1-score and IoU to evaluate the performance of MAP-Net and other different methods.
There are four classification outcomes: true positive (TP), a building pixel predicted as building; false positive (FP), a background pixel predicted as building; true negative (TN), a background pixel predicted as background; and false negative (FN), a building pixel predicted as background. Precision is the percentage of TP among all positive predictions, recall is the percentage of TP among all positive samples, the F1-score is the harmonic mean of precision and recall, which considers both FP and FN, and IoU is the intersection of the prediction and the ground truth over their union, evaluated on the whole image set. The equations are given as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{IoU} = \frac{TP}{TP + FP + FN}$$
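These pixel-level metrics can be computed with a few lines of plain Python. The sketch below assumes binary labels (1 for building, 0 for background) and accumulates counts over all pixels before dividing; the function names and the toy inputs are illustrative only.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN and TN for two flat sequences of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn

def pixel_metrics(tp, fp, fn):
    """Return precision, recall, F1-score and IoU from pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

pred = [1, 1, 0, 0, 1]    # toy prediction
truth = [1, 0, 0, 1, 1]   # toy ground truth
tp, fp, fn, tn = confusion_counts(pred, truth)
precision, recall, f1, iou = pixel_metrics(tp, fp, fn)
print(tp, fp, fn, tn)     # 2 1 1 1
print(iou)                # 0.5
```

TN does not enter any of the four metrics, which is why they remain informative when background pixels dominate the image.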
III-C Experimental Setup
In this section, we first analysed the significance of the proposed multi-path architecture for extracting multi-scale buildings with exact localization on boundaries compared with the popular encoder-decoder framework. Second, we explored the impact of different network parameters on the complexity and accuracy of MAP-Net on a specific dataset. Third, a contrast experiment was carried out to compare the performance of MAP-Net with four classic semantic segmentation algorithms on building extraction. Then, we compared the performance of most recent studies on building extraction based on the WHU dataset. Finally, we conducted an ablation experiment to validate the significance of the proposed network and analysed the trade-off between complexity and accuracy among the compared methodologies. Details are described in the following sections.
Significance of Multi-path
Feature maps extracted from the proposed localization-preserved multi-path network are visualized in Fig. 5. Columns (b-d) are extracted from path (P) 1, with the same spatial resolution (R), at the end of each stage (S) for the sample image in column (a). This indicates that higher-resolution feature maps extracted from deeper convolutional layers (larger S) retain richer semantics; in other words, building and background can be distinguished clearly. Columns (d-f) show the feature maps extracted from each path at the end of stage 3, with decreasing spatial resolution, as shown in Fig. 1. Feature maps with lower resolution are more blurred at the edges of buildings, and in the worst cases small buildings may be lost completely, as shown in column (f), because exact localization is lost in the downsampling operations during feature extraction.
Encoder-decoder-based networks fuse higher-resolution feature maps extracted from shallow layers, such as columns (b) or (c), to recover exact localization through a skip connection at the decoder stage, which introduces noise information for the coarse semantic features. In addition, small buildings may be lost in the lowest resolution feature maps, such as column (f), which cannot be refined accurately during the decoder stage. As a result, extracted building footprints were inaccurate on the boundary, or worse, small buildings were unrecognized.
Multi-path networks extract multi-scale feature maps through parallel paths. The resolution of the feature maps in each path is fixed, and the multi-scale features from different paths are not fused at any point during feature extraction. Features with higher spatial resolution preserve exact localization and contain rich semantic information obtained through a deeper CNN, such as column (d). This is beneficial for extracting fine boundaries and small buildings compared with skip connections that bring in coarse shallow features, as well as with the multi-scale feature fusion used in HRNetv2. Additionally, the features with lower spatial resolution capture global semantic representations, which contribute to the extraction of large buildings. The multi-scale features extracted from the multiple paths are combined and enhanced at the end of stage 3 to extract buildings at multiple scales, which makes up for the shortcomings of existing networks.
The structure of the proposed network is mainly determined by the depth of the convolution network and the number of parallel paths, and the best choice depends on the specific dataset. Without loss of generality, we designed an experiment to explore the potential impact of different structures on the performance of MAP-Net on the WHU dataset. Readers should note that the best network parameters may differ from one dataset to another.
The depth is represented by the number of residual blocks (N-blocks) in each path; empirically, this was set from 3 to 6 in our experiments, considering the trade-off between accuracy and complexity. Similarly, the number of paths (N-paths) was chosen from 2 to 4 according to the resolution of the input image. The IoU metric was used to evaluate the performance, and the number of trainable parameters (Para.) was counted to represent the complexity of the network.
The experimental result is illustrated in Fig. 6. With the increase in N-blocks, the IoU score first increased and then decreased once N-blocks exceeded a specific value, which may be explained by the complexity of the network growing with N-blocks and weakening the generalization ability of the model. Moreover, Para. grows linearly with N-blocks but exponentially with N-paths, since each generated path doubles the feature channels, which greatly increases the parameters in the feature extraction and enhancement stages.
Features with specific resolution were extracted from each path. The different number of paths impacts the combination of multi-scale semantic features fused in MAP-Net. When the N-paths were equal to 3, the IoU metric was better than that of 2 or 4, as shown by the solid line marked by the red circle in Fig. 6.
Considering the balance between accuracy and complexity, the preferred structure of MAP-Net is composed of three parallel paths, with each convolutional block consisting of four residual blocks; it contains fewer parameters and performs better than the other configurations on the WHU dataset, as indicated by the solid circles marked in green.
Besides, the number of channels (C) of the feature maps in each path and the resolution of the input images may also affect the performance of the proposed method. Generally, a larger C leads to better performance when enough training data are available but increases complexity exponentially. The optimal number of paths in MAP-Net is probably affected by the resolution of the remote sensing imagery, because it changes the combination of multi-scale feature representations. A suitable structure of MAP-Net for another dataset could be explored similarly. As this is not the focus of this paper, we do not discuss it in detail here.
To evaluate the performance of the proposed network, we conducted contrast experiments to compare MAP-Net with four SOTA methods, including UNet, PSPNet with a ResNet50 backbone, ResNet101 and HRNetv2, on the datasets [11, 18, 24].
U-NetPlus, which is based on U-Net, achieved great improvement by redesigning the encoder with VGG11 and replacing the transposed convolution with nearest-neighbour interpolation; U-Net itself is widely used in remote sensing imagery segmentation. Since ResNet improved the training stability and performance of deeper CNNs by introducing residual connections, we re-implemented PSPNet with ResNet50 for feature extraction and modified ResNet101 for building extraction by replacing its upsampling module with the one used in MAP-Net.
Experimental results are shown in TABLE I, TABLE II and TABLE III. Our proposed method demonstrates a clear improvement over the other methods on the three experimental datasets and obtains approximately 0.82%, 1.50% and 1.53% IoU improvement and 0.45%, 0.93% and 0.88% F1-score improvement on the WHU dataset, Deep Globe dataset and Urban 3D dataset, respectively, compared with the latest research HRNetv2. The best records are marked in bold.
To compare different methods, some example results on each dataset are presented in Fig. 7, Fig. 8 and Fig. 9. Fig. 7 shows extracted building footprints on the WHU dataset. There are four examples, including buildings with various appearances and scales. Columns (a) and (g) represent the original image and corresponding ground truth, and columns (b-f) are extracted results from U-NetPlus, PSPNet, ResNet101, HRNetv2 and MAP-Net, respectively.
The results show that our proposed method outperforms the other four compared methods obviously, especially by more accurately recognizing small buildings and more completely extracting large buildings, which benefits from the localization-preserved multi-path feature extraction network and the multi-scale feature adaptively enhancement module. The boundary of buildings is more exactly based on the ground truth.
Comparison of Recent Methods
We compare our method with the most recent building extraction methods, including CU-Net, SiU-Net, SRI-Net, DE-Net, EU-Net and MA-FCN, on the WHU test dataset to evaluate the performance of MAP-Net. TABLE IV shows the segmentation results of recent studies and ours; the best records are marked in bold. We did not reproduce their results since the source code is not available.
CU-Net introduced multiple constraints to enhance the feature representation based on FCN. SiU-Net designed a multi-scale input with two weight-shared U-Nets, which improves the results, especially for large buildings. DE-Net introduced recent segmentation techniques for building extraction based on an encoder-decoder network and achieved an IoU higher than 90%. Considering the significance of multi-scale features for building extraction, SRI-Net generated multi-level features through a modified ResNet101 backbone and aggregated multi-scale contexts with a spatial residual inception module, achieving 89.23% IoU on the WHU dataset. EU-Net outperformed previous methods by a large margin with 90.56% IoU, since it captures multi-scale features with a purpose-designed deep spatial pyramid pooling (DSPP) module and introduces focal loss to reduce the impact of mislabelling, which benefits the extraction of fine details and multi-scale buildings. The most recent method, MA-FCN, extracts multi-scale features based on a feature pyramid network to handle multi-scale building extraction and applies several post-processing strategies for boundary refinement. It performs better than EU-Net and achieved SOTA results on the WHU dataset, benefiting significantly from a multi-model voting strategy.
As shown in TABLE IV, our localization-preserved multi-path feature extraction network with a feature enhancement module achieved 90.86% IoU and 95.21% F1-score on the WHU dataset, outperforming all of the most recent studies. In particular, it slightly outperforms the latest MA-FCN, on a benchmark where little room for improvement remains, without any pre-training or post-processing.
|Method||IoU (%)||Precision (%)||Recall (%)||F1-score (%)|
|DE-Net||90.12 ± 0.24||95.00 ± 0.16||94.60 ± 0.19||94.80 ± 0.18|
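For a binary building mask, IoU and F1-score are functions of the same true-positive (TP), false-positive (FP) and false-negative (FN) pixel counts, so each determines the other. The short derivation below uses the standard definitions, which are not restated in this section and are therefore an assumption about the exact formulas used in the experiments. It serves as a consistency check on the 95.21% F1-score and 90.86% IoU quoted for MAP-Net:

```latex
\begin{align}
\mathrm{IoU} &= \frac{TP}{TP + FP + FN}, \qquad
F_1 = \frac{2\,TP}{2\,TP + FP + FN}, \\
\mathrm{IoU} &= \frac{F_1}{2 - F_1}
\quad\Longrightarrow\quad
\frac{0.9521}{2 - 0.9521} \approx 0.9086 .
\end{align}
```

Because the mapping from F1-score to IoU is monotonic, a ranking of methods by one metric on the same test set implies the same ranking by the other when both are computed from pooled pixel counts.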
To explore the contributions of the different modules of MAP-Net, we conducted ablation experiments on the WHU dataset and evaluated IoU, precision, recall and F1-score.
Firstly, we validated the performance of the localization-preserved strategy (Baseline) compared with HRNetv2 and HRNetv1. Secondly, we explored the effect of the feature squeeze module and the global enhancement module on MAP-Net. Finally, we examined the influence on MAP-Net of shallow features (F), which have half the resolution of the input imagery and are introduced through a skip connection.
Experimental results are recorded in TABLE V, with the highest values highlighted in bold, and the results extracted by each method are visualized in Fig. 10 for comparison. False predictions are marked in red, false positives in blue and true positives in green.
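An error map of this kind can be produced from a predicted mask and its ground truth with a few array operations. The sketch below is illustrative only and is not taken from the released implementation. The function name `error_map` is hypothetical, and it assumes that the red "false prediction" class corresponds to missed building pixels (false negatives).

```python
import numpy as np

# Colour assignment follows the figure description. Treating the red class as
# false negatives (missed building pixels) is an assumption of this sketch.
RED, BLUE, GREEN = (255, 0, 0), (0, 0, 255), (0, 255, 0)

def error_map(pred, truth):
    """Return an RGB image with TP in green, FP in blue and FN in red."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    rgb = np.zeros(pred.shape + (3,), dtype=np.uint8)  # true negatives stay black
    rgb[pred & truth] = GREEN    # building predicted and present
    rgb[pred & ~truth] = BLUE    # building predicted but absent
    rgb[~pred & truth] = RED     # building present but missed
    return rgb

pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [1, 0]])
overlay = error_map(pred, truth)
```

The resulting array can be saved or blended with the input tile using any image library.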
Our baseline is based on HRNetv2, with the multi-scale feature fusion architecture optimized as described in Section II.B. The baseline outperforms HRNetv2 by 0.24% in IoU owing to the modified multi-path localization-preserved strategy, which performs no feature fusion in the encoder, and is 0.74% higher than HRNetv1 in IoU because it fuses multi-scale features in the decoder in the same way as HRNetv2.
On top of the baseline, the channel-wise adaptive feature squeeze module based on the attention mechanism (C) and the spatial pooling enhancement module (S) improved the IoU by 0.29% and 0.34%, respectively. With both the feature squeeze module and the global enhancement module, MAP-Net achieved a 0.58% IoU increase over our baseline. As described in Fig. 1, the resolution of the fused features in MAP-Net is a quarter of that of the input image. To evaluate the influence of higher-resolution shallow features on building extraction, we concatenated the features at half the input resolution, extracted from the downsample block, to the upsample block through a skip connection; this variant is named MAP-Net+F.
According to the experimental results, adding shallow features improved the IoU by 0.04%, but the shallow features also introduce coarse noise, as shown in the last column of Fig. 10, which results in inaccurate building boundaries. Without shallow features, MAP-Net obtains smoother building edges and reduces unnecessary recognition errors with little loss of accuracy.
It is worth noting that our algorithms from No.3 to No.6 obtained higher precision than HRNetv2 with the same threshold of 0.5. A probable explanation is that our methods suppress false-positive predictions, which can be attributed to the accurate multi-scale features extracted by the localization-preserved multi-path network without fusion during feature extraction. The same conclusion can be inferred for the other datasets from TABLE II and TABLE III.
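The two ablation variants discussed above can be sketched with plain array operations. This is a minimal NumPy illustration and not the authors' TensorFlow code. It assumes that module (C) follows the squeeze-and-excitation pattern (global average pooling, a bottleneck, sigmoid gates) and that MAP-Net+F upsamples the quarter-resolution features by nearest-neighbour repetition before concatenation. All function names, weight shapes, channel counts and the 256x256 input size are hypothetical.

```python
import numpy as np

def channel_attention(features, w1, w2):
    """SE-style channel reweighting of an (H, W, C) feature map.

    w1 with shape (C, C // r) and w2 with shape (C // r, C) are hypothetical
    bottleneck weights; r is a reduction ratio chosen for illustration.
    """
    squeezed = features.mean(axis=(0, 1))           # global average pool -> (C,)
    hidden = np.maximum(squeezed @ w1, 0.0)         # ReLU bottleneck -> (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))    # sigmoid gates in (0, 1) -> (C,)
    return features * gates, gates                  # gates broadcast over H and W

def fuse_with_shallow(fused_quarter, shallow_half):
    """Upsample quarter-resolution features 2x and append half-resolution ones."""
    upsampled = fused_quarter.repeat(2, axis=0).repeat(2, axis=1)  # nearest neighbour
    return np.concatenate([upsampled, shallow_half], axis=-1)

rng = np.random.default_rng(0)
fused = rng.normal(size=(64, 64, 16))       # quarter of a hypothetical 256x256 input
w1 = rng.normal(size=(16, 4)) * 0.1
w2 = rng.normal(size=(4, 16)) * 0.1
reweighted, gates = channel_attention(fused, w1, w2)

shallow = rng.normal(size=(128, 128, 8))    # half-resolution shallow features
with_shallow = fuse_with_shallow(reweighted, shallow)
print(reweighted.shape, with_shallow.shape)
```

The concatenation makes the trade-off visible: the shallow channels are passed to the decoder unchanged, so any noise they carry reaches the upsample block directly.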
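To make the role of the 0.5 threshold concrete, the sketch below binarizes a probability map and computes the four reported metrics from pixel counts. It is a hedged illustration using the standard definitions. The function name `building_metrics`, the use of `>=` at the threshold and the zero-division convention are assumptions rather than details taken from the paper.

```python
import numpy as np

def building_metrics(prob, truth, threshold=0.5):
    """Precision, recall, F1-score and IoU of a thresholded probability map."""
    pred = np.asarray(prob) >= threshold       # assumption: ties count as building
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    # Return 0.0 when a denominator is empty (a convention chosen for this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}

prob = np.array([[0.9, 0.6], [0.4, 0.2]])
truth = np.array([[1, 0], [1, 0]])
scores = building_metrics(prob, truth)
```

Every false positive removed at a fixed threshold lowers the precision denominator without touching the recall, which is the effect the paragraph above attributes to the multi-path features.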
|No.||Method||IoU (%)||Precision (%)||Recall (%)||F1-score (%)|
|7||MAP-Net + F||90.90||95.18||95.29||95.23|
Complexity of MAP-Net
Our proposed algorithm extracts features at multiple scales; specifically, some paths need to process feature maps with large resolution to preserve exact localization throughout the network, which could lead to a large number of parameters. To validate the trade-off between the performance and complexity of MAP-Net, we compared the FLOPs, trainable parameters and IoU scores of related methods on the WHU dataset.
As shown in TABLE VI, U-NetPlus has the lowest complexity but performs poorly. ResNet101 is the most complex model because it has the largest number of convolutional layers and the largest number of channels among the related methods. HRNetv2 has slightly more parameters than HRNetv1 and performs better owing to the concatenation-based fusion of multi-scale features in the decoder stage.
Our baseline has lower complexity and higher performance than HRNetv2 because of the re-designed encoder structure. Although the complexity of MAP-Net increases after the feature enhancement module is introduced, its performance is greatly improved while it still requires far fewer FLOPs than HRNetv2.
Although MAP-Net maintains a high-resolution feature map throughout the feature extraction process, which may lead to a large number of parameters, the number of channels remains small, allowing it to extract multi-scale features efficiently. To compare the complexity and performance of the related methods more intuitively, we present the experimental results in Fig. 11. The number of trainable parameters and the IoU represent the complexity and performance of each method, respectively. The radius of the green circle indicates the size of the model file. Our proposed methods maintain higher accuracy and lower complexity than the other related methods.
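A back-of-the-envelope calculation helps separate the two cost drivers mentioned here. For a single standard convolution, the parameter count depends only on the kernel size and channel widths, whereas the multiply-accumulate count also scales with the spatial resolution. The sketch below is a generic estimate and not a profile of MAP-Net. The function name `conv2d_cost` and the layer sizes are hypothetical, and a stride-1, same-padded convolution with bias is assumed.

```python
def conv2d_cost(height, width, c_in, c_out, kernel=3):
    """Parameters and multiply-accumulates of one stride-1, same-padded conv layer."""
    params = kernel * kernel * c_in * c_out + c_out           # weights plus biases
    macs = height * width * kernel * kernel * c_in * c_out    # one MAC per weight per output pixel
    return params, macs

high_res = conv2d_cost(128, 128, 64, 64)    # high-resolution path, narrow channels
low_res = conv2d_cost(32, 32, 64, 64)       # same layer at a quarter of the side length
wide = conv2d_cost(128, 128, 128, 128)      # same resolution, doubled channel width

print(high_res, low_res, wide)
```

In this estimate, keeping a path at four times the side length multiplies the compute by 16 while leaving the parameter count unchanged, and doubling the channel width roughly quadruples both. That is consistent with the observation that a small number of channels keeps a high-resolution path affordable.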
|Method||FLOPs||Trainable parameters||IoU (%)|
|Baseline + C||47.16||23.54||90.63|
|Baseline + S||47.58||23.75||90.56|
IV Discussion and Conclusion
To solve the problems of inaccurate boundaries, possibly lost small buildings and discontinuous large buildings in extracted footprints, we proposed a novel localization-preserved multi-path feature extraction network, inspired by HRNetv2, with an adaptive multi-scale feature fusion module and a spatial enhancement module for building footprint extraction. The multi-scale features extracted from the parallel paths contain multi-level local details as well as rich semantic representations, which allows the network to extract building footprints with exact edges and to recognize small buildings. The enhancement module further reconstructs and optimizes the features in the channel and spatial dimensions, which suppresses holes and yields continuous footprints for large buildings.
The experiments on three different benchmarks demonstrate that MAP-Net outperforms other classic semantic segmentation algorithms with higher accuracy and lower complexity, and achieves a SOTA result on the WHU dataset among the most recent building extraction methods. In addition, we conducted ablation experiments to evaluate the significance of the proposed modules and showed that the localization-preserved multi-path network achieves higher precision in building extraction than previous methods.
Generally, our research provides a new approach for accurately and efficiently extracting multi-scale objects, which are common in the real world. Currently, our experiments are limited to building extraction; in future work, we will study multiclass extraction tasks, such as land cover, to achieve automatic interpretation of remote sensed imagery.
References
- (2017) Simultaneous extraction of roads and buildings in remote sensing imagery with convolutional neural networks. ISPRS Journal of Photogrammetry and Remote Sensing 130, pp. 139–149.
- (2013) Automatic extraction of building roofs using LiDAR data and multispectral imagery. ISPRS Journal of Photogrammetry and Remote Sensing 83, pp. 1–18.
- (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495.
- (2017) Building extraction from remote sensing data using fully convolutional networks. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences 42.
- (2018) Building footprint extraction from VHR remote sensing images combined with normalized DSMs using fused fully convolutional networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11 (8), pp. 2615–2629.
- (2014) Detecting blind building façades from highly overlapping wide angle aerial imagery. ISPRS Journal of Photogrammetry and Remote Sensing 96, pp. 193–209.
- (2019) GCNet: non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492.
- (2017) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
- (2018) A^2-Nets: double attention networks. In Advances in Neural Information Processing Systems, pp. 352–361.
- (2012) Automatic rooftop extraction in nadir aerial imagery of suburban regions using corners and variational level set evolution. IEEE Transactions on Geoscience and Remote Sensing 51 (1), pp. 313–328.
- (2018) DeepGlobe 2018: a challenge to parse the earth through satellite images. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 172–17209.
- (2018) Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS Journal of Photogrammetry and Remote Sensing 145, pp. 3–22.
- (2019) A novel framework for 2.5-D building contouring from large-scale residential scenes. IEEE Transactions on Geoscience and Remote Sensing 57 (6), pp. 4121–4145.
- (2017) Automatic building extraction from LiDAR data fusion of point and grid-based features. ISPRS Journal of Photogrammetry and Remote Sensing 130, pp. 294–307.
- (2019) Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3146–3154.
- (2018) Automatic building footprint extraction from high-resolution satellite image using mathematical morphology. European Journal of Remote Sensing 51 (1), pp. 182–193.
- (2016) An automatic building extraction and regularisation technique using LiDAR point cloud data and orthoimage. Remote Sensing 8 (3), pp. 258.
- (2018) Urban 3D challenge: building footprint detection using orthorectified imagery and digital surface models from commercial satellites. In Geospatial Informatics, Motion Imagery, and Network Analytics VIII, Vol. 10645, pp. 1064503.
- (2019) U-NetPlus: a modified encoder-decoder U-Net architecture for semantic and instance segmentation of surgical instruments from laparoscopic images. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 7205–7211.
- (2019) Adaptive pyramid context network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7519–7528.
- (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645.
- (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
- (2016) Building extraction from multi-source remote sensing images via deep deconvolution neural networks. In 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 1835–1838.
- (2018) Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Transactions on Geoscience and Remote Sensing 57 (1), pp. 574–586.
- (2019) A scale robust convolutional neural network for automatic building extraction from aerial and satellite imagery. International Journal of Remote Sensing 40 (9), pp. 3308–3322.
- (2019) EU-Net: an efficient fully convolutional network for building extraction from optical remote sensing images. Remote Sensing 11 (23), pp. 2813.
- (2018) Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning. ISPRS Journal of Photogrammetry and Remote Sensing 145, pp. 60–77.
- (2018) Automatic pixelwise object labeling for aerial imagery using stacked U-Nets. arXiv preprint arXiv:1803.04953.
- (2018) A multiple-feature reuse network to extract buildings from remote sensing imagery. Remote Sensing 10 (9), pp. 1350.
- (2019) Semantic segmentation-based building footprint extraction using very high-resolution satellite images and multi-source GIS data. Remote Sensing 11 (4), pp. 403.
- (2014) Extracting man-made objects from high spatial resolution remote sensing images via fast level set evolutions. IEEE Transactions on Geoscience and Remote Sensing 53 (2), pp. 883–899.
- (2019) ESFNet: efficient network for building extraction from high-resolution aerial images. IEEE Access 7, pp. 54285–54294.
- (2019) DE-Net: deep encoding network for building extraction from high-resolution remote sensing imagery. Remote Sensing 11 (20), pp. 2380.
- (2019) Building footprint extraction from high-resolution images via spatial residual inception convolutional neural network. Remote Sensing 11 (7), pp. 830.
- (2016) Convolutional neural networks for large-scale remote-sensing image classification. IEEE Transactions on Geoscience and Remote Sensing 55 (2), pp. 645–657.
- (2018) Building extraction from LiDAR data applying deep convolutional neural networks. IEEE Geoscience and Remote Sensing Letters 16 (1), pp. 155–159.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- (2005) Rectangular building extraction from stereoscopic airborne radar images. IEEE Transactions on Geoscience and Remote Sensing 43 (10), pp. 2386–2395.
- (2007) Data fusion of high-resolution satellite imagery and LiDAR data for automatic building extraction. ISPRS Journal of Photogrammetry and Remote Sensing 62 (1), pp. 43–63.
- (2019) Fusion of multiscale convolutional neural networks for building extraction in very high-resolution images. Remote Sensing 11 (3), pp. 227.
- (2019) Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5693–5703.
- (2019) High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514.
- (2018) Developing a multi-filter convolutional neural network for semantic segmentation using high-resolution aerial imagery and LiDAR data. ISPRS Journal of Photogrammetry and Remote Sensing 143, pp. 3–14.
- (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803.
- (2019) Toward automatic building footprint delineation from aerial images using CNN and regularization. IEEE Transactions on Geoscience and Remote Sensing.
- (2018) Automatic building segmentation of aerial imagery using multi-constraint fully convolutional networks. Remote Sensing 10 (3), pp. 407.
- (2018) OCNet: object context network for scene parsing. arXiv preprint arXiv:1809.00916.
- (2018) Fusion of images and point clouds for the semantic segmentation of large-scale 3D scenes based on deep learning. ISPRS Journal of Photogrammetry and Remote Sensing 143, pp. 85–96.
- (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890.
- (2014) Seamless fusion of LiDAR and aerial imagery for building extraction. IEEE Transactions on Geoscience and Remote Sensing 52 (11), pp. 7393–7407.
- (2017) Deep learning in remote sensing: a review.