Squeeze-and-Attention Networks for Semantic Segmentation
Squeeze-and-excitation (SE) module enhances the representational power of convolution layers by adaptively re-calibrating channel-wise feature responses. However, the limitation of SE in terms of attention characterization lies in the loss of spatial information cues, making it less well suited for perception tasks with very high spatial inter-dependencies such as semantic segmentation. In this paper, we propose a novel squeeze-and-attention network (SANet) architecture that leverages a simple but effective squeeze-and-attention (SA) module to account for two distinctive characteristics of segmentation: i) pixel-group attention, and ii) pixel-wise prediction. Specifically, the proposed SA modules impose pixel-group attention on conventional convolution by introducing an ‘attention’ convolutional channel, thus taking into account spatial-channel inter-dependencies in an efficient manner. The final segmentation results are produced by merging outputs from four hierarchical stages of a SANet to integrate multi-scale contexts for obtaining enhanced pixel-wise prediction. Empirical experiments using two challenging public datasets validate the effectiveness of the proposed SANets, which achieved 83.2% mIoU (without COCO pre-training) on PASCAL VOC and a state-of-the-art mIoU of 54.4% on PASCAL Context.
Semantic segmentation, also known as scene parsing, involves three levels of task: image-level categorical recognition, pixel-level dense prediction, and pixel-group attention modeling. As illustrated in Fig. 1, semantic segmentation and image classification are closely related in that they share the objective of image level recognition but segmentation includes two other levels of dense prediction and pixel grouping. Previous segmentation works mainly focus on improving segmentation performance from the pixel-level but largely ignore the pixel-group attention [long2015fully, chen2018deeplab, zhao2017pyramid, zhang2018context, boykov2004experimental, boykov2006graph]. In the following paragraphs, we discuss semantic segmentation from the perspective of these three task levels.
Image-level categorical recognition is useful for both semantic segmentation and image classification, and this similarity enables the network backbones pre-trained for classification to be easily extended to semantic segmentation via replacing classification heads with segmentation heads. This approach derives from the Fully convolutional network (FCN) [long2015fully] for dense pixel-wise prediction and we have witnessed that FCN-based models significantly improve segmentation performance in multiple benchmarks [he2015spatial, everingham2015pascal, badrinarayanan2015segnet, liu2018path, chen2016attention, zhao2017pyramid, chen2018deeplab].
In contrast to image classification, semantic segmentation requires dense pixel-wise prediction, so recent works have proposed multiple approaches to extend standard classification networks to perform better for dense prediction. For example, pooling methods and dilated convolution are utilized to boost segmentation performance. Pooling-based modules are widely used to aggregate spatial information from different scales, like the pyramid pooling module [zhao2017pyramid] and atrous spatial pyramid pooling [chen2018deeplab]. Also, atrous convolution is used to enlarge the receptive fields of convolutional kernels to improve segmentation performance [chen2018encoder]. Although atrous convolution and multi-scale pyramid pooling have proved effective for semantic segmentation, existing methods have mainly used pooling layers to enhance the output of the last stage of backbone networks, not throughout the entire backbone network. With the introduction of skip connections [he2016deep], low-level features are fused with high-level features to encourage more accurate boundaries [ronneberger2015u]. Although these methods achieved promising results in segmentation benchmarks, their non-adaptive multi-scale feature learning modules hinder their generality, and they have not taken advantage of the multi-stage outputs of backbone networks.
Semantic segmentation implicitly facilitates pixel-group attention modeling through grouping pixels with different semantic meaning. Given the effectiveness of the SE module for image classification, we put forward a hypothesis: there exists a module that specifically accounts for pixel-level prediction and pixel-group attention. If this hypothesis is correct, this module should emphasize the spatial information that SE modules omit. Additionally, it should have a simple yet effective architecture like its counterpart for classification. Therefore, inspired by the effectiveness of the SE module for image classification [hu2017squeeze], we design a novel squeeze-and-attention (SA) module with a down-sampled but not fully squeezed convolutional channel to produce a flexible module. The SA module adapts the SE module from classification to segmentation, and it uses the down-sampled attention channel in a dual-usage mechanism. Specifically, this additional channel generates category-specific soft attention masks for pixel grouping, while adding scaled spatial features on top of the classical convolution channels for pixel-level prediction. In other words, the output of the down-sampled attention channel functions as both global attention masks and multi-scale contextual features simultaneously.
To take advantage of multi-scale features of backbone networks, we design a SA net (SANet) built on top of SA modules to merge their multi-stage outputs, resulting in better object boundaries and hence better scene parsing outcomes. 2D convolution can be used to generate attention masks because each convolutional kernel sweeps across input feature maps. The spatial attention mechanism introduced by the SA modules emphasizes the attention of pixel groups that belong to the same classes at different spatial scales. Additionally, the squeezed channel works as global attention masks. This simple but effective innovation makes it easier to generalize SANets to other related visual recognition tasks. We validate the SANets using two challenging segmentation datasets: PASCAL Context and PASCAL VOC 2012 [everingham2015pascal, zhou2016semantic, zhou2017scene].
The contributions of this paper are three-fold:
We disentangle semantic segmentation into three tasks: image-level categorization, pixel-level dense prediction, and pixel-group attention.
We design a squeeze-and-attention (SA) module that adapts SE modules for semantic segmentation via accounting for the multi-scale dense prediction of individual pixels and the spatial attention of pixel groups.
We propose a squeeze-and-attention network (SANet) with multi-level heads to exploit the representational boost from SA modules, and to integrate multi-scale contextual features and image-level categorical information.
2 Related Works
Multi-scale contexts. Recent improvements in semantic segmentation have mostly been made possible by incorporating multi-scale contextual features that help segmentation models extract discriminative features. A Laplacian pyramid structure was introduced to combine multi-scale features [ghiasi2016laplacian]. A multi-path RefineNet explicitly integrates features extracted from multi-scale inputs to boost segmentation outputs. Encoder-decoder architectures have been used to fuse features that have different levels of semantic meaning [badrinarayanan2015segnet, noh2015learning]. The most popular methods adopt pooling operations to collect spatial information from different scales [zhao2017pyramid, chen2018deeplab]. Similarly, EncNet employs an encoding module that projects different contexts in a Gaussian kernel space to encode multi-scale contextual features [zhang2018context]. Graphical models like CRF and MRF are used to impose smoothness constraints to obtain better segmentation results [zheng2015conditional, liu2015parsenet, arnab2016higher]. Recently, a gather-excite module was designed to alleviate the local feature constraints of classic convolution by gathering features from long-range contexts [hu2018gather]. We improve the multi-scale dense prediction by merging outputs from different stages of backbone residual networks.
Channel-wise attention. Selectively weighting the channels of feature maps effectively increases the representational power of conventional residual modules. A good example is the squeeze-and-excitation (SE) module because it emphasizes attention on the selected channels of feature maps. This module significantly improves classification accuracy of residual networks by grouping related classes together [hu2017squeeze]. EncNet also uses the categorical recognition capacity of SE modules [zhang2018context]. The Discriminative Feature Network (DFN) utilizes the channel-weighting paradigm in its smooth sub-network [lin2018multi]. In this work, we also use the SE module to provide categorical information that boosts segmentation performance.
Pixel-group attention. The success of attention mechanisms in natural language processing has fostered their adoption for semantic segmentation. Spatial Transformer Networks explicitly learn spatial attention in the form of affine transformation to increase feature invariance [jaderberg2015spatial]. Since machine translation and image translation share many similarities, RNN and LSTM have been used for semantic segmentation by connecting semantic labeling to translation [zheng2015conditional, lin2018multi]. [chen2016attention] employed a scale-sensitive attention strategy to enable networks to focus on objects of different scales. [zhao2018psanet] designed a specific spatial attention propagation mechanism, including a collection channel and a diffusion channel. [wang2018non] used self-attention masks by computing correlation metrics. [hu2018gather] designed a gather-and-excite operation via collecting local features to generate hard masks for image classification. Different from existing attention modules, we use down-sampled channels that are implemented by pooling layers to aggregate multi-scale features and generate soft global attention masks simultaneously. Therefore, the SA modules enhance the objective of pixel-level dense prediction and consider the pixel-group attention that has largely been ignored.
Classical convolution mainly focuses on spatial local feature encoding and Squeeze-and-Excitation (SE) modules enhance it by selectively re-weighting feature map channels through the use of global image information[hu2017squeeze]. Inspired by this simple but effective SE module for image-level categorization, we design a Squeeze-and-Attention (SA) module that incorporates the advantages of fully convolutional layers for dense pixel-wise prediction and additionally adds an alternative, more local form of feature map re-weighting, which we call pixel-group attention. Similar to the SE module that boosts classification performance, the SA module is designed specifically for improving segmentation results.
3.1 Residual Block Formulation
Residual networks (ResNets) are widely used as the backbones of segmentation networks because of their strong performance on image recognition, and it has been shown that ResNets pre-trained on the large image dataset ImageNet transfer well to other vision tasks, including semantic segmentation [zhao2017pyramid, chen2018deeplab]. Since classical convolution can be regarded as a spatial attention mechanism, we start from the residual blocks that serve as the fundamental components of ResNets. As shown in Fig. 2 (a), conventional residual blocks can be formulated as:
$$X_{out} = X_{in} + X_{res} = X_{in} + F(X_{in}; \Theta, \Omega)$$
where $F(\cdot)$ represents the residual function, which is parameterized by $\Theta$, and $\Omega$ denotes the structure of two convolutional layers. $X_{in}$ and $X_{out}$ are the input and output feature maps. The SE module improves the residual block by re-calibrating feature map channels. It is worth noting that we adopt the updated version of the SE module, which performs equivalently to the original one in [hu2017squeeze]. As shown in Fig. 2 (b), the SE module can be formulated as:
$$X_{out} = w * X_{in} + F(X_{in}; \Theta, \Omega)$$
where the learned weights $w$ for re-calibrating the channels of the input feature map $X_{in}$ are calculated as:
$$w = \Phi(W_2 * \sigma(W_1 * \mathrm{APool}(X_{in})))$$
where $\Phi(\cdot)$ represents the sigmoid function and $\sigma(\cdot)$ denotes the ReLU activation function. First, an average pooling layer $\mathrm{APool}(\cdot)$ is used to ‘squeeze’ the input feature map $X_{in}$. Then, two fully connected layers parameterized by $W_1$ and $W_2$ are adopted to get the ‘excitation’ weights. By adding such a simple re-weighting mechanism, the SE module effectively increases the representational capacity of residual blocks.
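The squeeze and excitation steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `se_reweight`, the tensor sizes, and the random stand-in for the residual branch are assumptions made for illustration, and the way the re-weighted input is added to the residual branch follows the description of the updated SE variant in this section.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_reweight(x_in, x_res, w1, w2):
    """SE-style channel re-calibration for a single feature map.

    x_in, x_res: arrays of shape (C, H, W)
    w1: first fully connected layer, shape (C_mid, C)
    w2: second fully connected layer, shape (C, C_mid)
    """
    squeezed = x_in.mean(axis=(1, 2))        # 'squeeze': global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)  # first FC layer followed by ReLU -> (C_mid,)
    weights = sigmoid(w2 @ hidden)           # second FC layer followed by sigmoid -> (C,)
    # Re-calibrate the input channels and add the residual branch (assumed form).
    return weights[:, None, None] * x_in + x_res, weights

rng = np.random.default_rng(0)
C, H, W, C_mid = 8, 6, 6, 4                  # hypothetical sizes
x_in = rng.standard_normal((C, H, W))
x_res = rng.standard_normal((C, H, W))       # stand-in for the two-convolution residual branch
w1 = rng.standard_normal((C_mid, C))
w2 = rng.standard_normal((C, C_mid))
x_out, weights = se_reweight(x_in, x_res, w1, w2)
print(x_out.shape, weights.shape)
```

Because the pooled descriptor has no spatial extent, every pixel in a channel receives the same weight, which is the loss of spatial cues that motivates the SA module.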
3.2 Squeeze-and-attention modules
Useful representations for semantic segmentation appear at both the global and local levels of an image. At the pixel level, convolution layers generate feature maps conditional on local information, as convolution is computed locally around each pixel. Pixel-level convolution lays the foundation of all semantic segmentation modules, and increasing the receptive field of convolution layers in various ways boosts segmentation performance [zhao2017pyramid, zhang2018context], showing that larger context is useful for semantic segmentation.
At the global image level, context can be exploited to determine which parts of feature maps are activated, because the contextual features indicate which classes are likely to appear together in the image. Also, [zhang2018context] shows that the global context provides a broader field of view, which is beneficial for semantic segmentation. Global context features encode these areas holistically, rather than learning a re-weighting independently for each portion of the image. However, there remains little investigation into encoding context at a more fine-grained scale, which is needed because different sections of the same image could contain totally different environments.
To this end, we design a squeeze-and-attention (SA) module to learn more representative features for the task of semantic segmentation through a re-weighting mechanism that accounts for both local and global aspects. The SA module expands the re-weighting channel of the SE module, as shown in Fig. 2 (b), with spatial information not fully squeezed, to adapt the SE module for scene parsing. Therefore, as shown in Fig. 2 (c), a simple squeeze-and-attention module is proposed and can be formulated as:
$$X_{out} = X_{attn} * X_{res} + X_{attn}$$
where $X_{res}$ is the output of the main convolution channel, $X_{attn} = \mathrm{Up}(\hat{X}_{attn})$, and $\mathrm{Up}(\cdot)$ is an up-sampling function that expands the output of the attention channel:
$$\hat{X}_{attn} = F_{attn}(\mathrm{Pool}(X_{in}); \Theta_{attn}, \Omega_{attn})$$
where $\hat{X}_{attn}$ represents the output of the attention convolution channel $F_{attn}(\cdot)$, which is parameterized by $\Theta_{attn}$ and the structure of attention convolution layers $\Omega_{attn}$. A max pooling layer $\mathrm{Pool}(\cdot)$ is used to perform the not-fully-squeezed operation, and then the output of the attention channel $\hat{X}_{attn}$ is up-sampled to match the output of the main convolution channel $X_{res}$.
In this way, the SA modules extend SE modules with preserved spatial information and the up-sampled output of the attention channel aggregates non-local extracted features upon the main channel.
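As a rough illustration of the SA module, the following NumPy sketch runs a full-resolution main channel and a down-sampled attention channel, then up-samples the attention output to match. Several choices are assumptions rather than details from the paper: 1x1 convolutions stand in for the convolution layers, up-sampling is nearest-neighbour, and the final combination (the attention output multiplies the main channel and is also added to it) is one reading of the dual-usage description. All function names are hypothetical.

```python
import numpy as np

def max_pool(x, k):
    """Non-overlapping k x k max pooling on a (C, H, W) array; H and W must be divisible by k."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).max(axis=(2, 4))

def upsample_nearest(x, k):
    """Nearest-neighbour up-sampling of a (C, H, W) array by an integer factor k."""
    return x.repeat(k, axis=1).repeat(k, axis=2)

def conv1x1_relu(x, weight):
    """1x1 convolution followed by ReLU; weight has shape (C_out, C_in)."""
    return np.maximum(np.einsum("oc,chw->ohw", weight, x), 0.0)

def squeeze_and_attention(x_in, w_main, w_attn, k=8):
    x_res = conv1x1_relu(x_in, w_main)                    # main channel at full resolution
    x_attn_hat = conv1x1_relu(max_pool(x_in, k), w_attn)  # attention channel, not fully squeezed
    x_attn = upsample_nearest(x_attn_hat, k)              # match the main channel's spatial size
    # Dual usage (assumed form): x_attn acts as a soft mask and as an added feature.
    return x_attn * x_res + x_attn

rng = np.random.default_rng(0)
x_in = rng.standard_normal((16, 32, 32))   # hypothetical input: 16 channels, 32 x 32
w_main = rng.standard_normal((4, 16))      # channels reduced to a fourth
w_attn = rng.standard_normal((4, 16))
x_out = squeeze_and_attention(x_in, w_main, w_attn, k=8)
print(x_out.shape)  # (4, 32, 32)
```

Unlike the SE weights, the attention output here keeps a coarse spatial grid (4 x 4 before up-sampling in this example), so different regions of the image can be re-weighted differently.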
3.3 Squeeze-and-attention network
We build a SA network (SANet) for semantic segmentation on top of the SA modules. Specifically, we use SA modules as heads to extract features from the four stages of backbone networks to fully exploit their multi-scale features. As illustrated in Fig. 3, we consider the three-level tasks of semantic segmentation and use three corresponding losses: categorical loss (Cat. loss) for image-level categorization, auxiliary loss (Aux. loss) for pixel-wise dense prediction, and attention loss (Attn. loss) for pixel-group attention. Therefore, the total loss of SANets can be represented as:
$$\mathcal{L}_{SANet} = \mathcal{L}_{attn} + \alpha \, \mathcal{L}_{cat} + \beta \, \mathcal{L}_{aux}$$
where $\alpha$ and $\beta$ are weighting parameters of the categorical loss and auxiliary loss, respectively. Each component of the total loss can be formulated as follows:
$$\mathcal{L}_{attn} = -\frac{1}{N \times M} \sum_{n=1}^{N} \sum_{i=1}^{M} \sum_{j=1}^{C} Y_{nij} \log \hat{Y}^{attn}_{nij}$$
$$\mathcal{L}_{cat} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{C} \left[ y_{nj} \log \hat{y}^{cat}_{nj} + (1 - y_{nj}) \log (1 - \hat{y}^{cat}_{nj}) \right]$$
$$\mathcal{L}_{aux} = -\frac{1}{N \times M} \sum_{n=1}^{N} \sum_{i=1}^{M} \sum_{j=1}^{C} Y_{nij} \log \hat{Y}^{aux}_{nij}$$
where $N$ is the number of training samples in each epoch, $M$ represents the number of spatial locations, and $C$ denotes the number of classes of a dataset. $\hat{Y}_{nij}$ and $Y_{nij}$ are the predictions of SANets and the ground truth, and $\hat{y}_{nj}$ and $y_{nj}$ are the categorical predictions and targets used to calculate the categorical loss $\mathcal{L}_{cat}$. $\mathcal{L}_{cat}$ takes a binary cross entropy form. $\mathcal{L}_{attn}$ and $\mathcal{L}_{aux}$ are typical cross entropy losses. The auxiliary head is similar to the strategy of deep supervision [zhao2017pyramid, zhang2018context], but its input comes from the fourth stage of the backbone ResNet instead of the commonly used third stage. The final prediction of SANets integrates the outputs of multiple SA heads and is regularized by a SE head. Hence, the final segmentation prediction of a SANet is:
Dilated FCNs are used as the backbones of SANets. For a given input image size, Table 1 shows the detailed feature map sizes of the SA heads, along with those of a SE module and an auxiliary head. The main channel of SA modules has the same channel numbers as their attention counterparts and the same spatial sizes as the input features. All convolution layers of SANets adopt 2D kernels. Empirically, we reduce the channel sizes of inputs to a fourth in both main and attention channels, set the downsample (max pooling) and upsample ratio of attention channels to 8, and set the channel number of the intermediate fully connected layer of SE modules to 4 for both datasets. We add a convolutional layer to adapt the outputs of SA heads to the class number of the PASCAL Context dataset. Therefore, the outputs of all SA heads have a depth of 59. To merge the outputs of the four SA heads, we upsample the outputs of SA head2-4 to match the spatial size of SA head1.
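The three-term training objective described in this section (a pixel-wise cross entropy on the main prediction, a weighted binary cross entropy on image-level categories, and a weighted pixel-wise cross entropy on the auxiliary head) can be sketched in NumPy as follows. This is an illustrative sketch under assumptions: the function names are hypothetical, the losses are averaged rather than summed, the image-level targets are derived by marking every class present in the label map, and the weights `alpha=0.5` and `beta=0.5` are arbitrary placeholders, not the values selected in the paper.

```python
import numpy as np

EPS = 1e-12  # numerical guard for log(0)

def cross_entropy(probs, labels):
    """Mean pixel-wise cross entropy.

    probs: (N, M, C) class probabilities; labels: (N, M) integer class ids.
    """
    n, m, _ = probs.shape
    picked = probs[np.arange(n)[:, None], np.arange(m)[None, :], labels]
    return float(-np.log(picked + EPS).mean())

def binary_cross_entropy(probs, targets):
    """Mean binary cross entropy; probs and targets have shape (N, C)."""
    terms = targets * np.log(probs + EPS) + (1.0 - targets) * np.log(1.0 - probs + EPS)
    return float(-terms.mean())

def image_level_targets(labels, num_classes):
    """Assumed target construction: 1 for every class present in an image, else 0."""
    return np.stack(
        [np.bincount(lab.ravel(), minlength=num_classes) > 0 for lab in labels]
    ).astype(float)

def sanet_loss(main_probs, aux_probs, cat_probs, labels, alpha, beta):
    cat_targets = image_level_targets(labels, main_probs.shape[-1])
    return (cross_entropy(main_probs, labels)
            + alpha * binary_cross_entropy(cat_probs, cat_targets)
            + beta * cross_entropy(aux_probs, labels))

# Toy batch: N=2 images, M=4 pixels each, C=3 classes.
labels = np.array([[0, 0, 1, 1], [2, 2, 2, 0]])
uniform = np.full((2, 4, 3), 1.0 / 3.0)   # uninformative pixel-wise predictions
cat_probs = np.full((2, 3), 0.5)          # uninformative image-level predictions
loss = sanet_loss(uniform, uniform, cat_probs, labels, alpha=0.5, beta=0.5)
print(round(loss, 4))
```

With uninformative predictions, each cross entropy term equals log 3 and the binary term equals log 2, which gives a quick sanity check on the weighting.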
4 Experimental Results
In this section, we first compare SA modules to SE modules, then conduct an ablation study using the PASCAL Context [mottaghi2014role] dataset to test the effectiveness of each component of the total training loss, and further validate SANets on the challenging PASCAL VOC dataset [everingham2010pascal]. Following the convention for scene parsing [chen2018deeplab, zhang2018context], we report both mean intersection over union (mIoU) and pixel-wise accuracy (PAcc) on PASCAL Context, and mIoU only on the PASCAL VOC dataset, to assess the effectiveness of segmentation models.
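For reference, both metrics can be computed from a confusion matrix. The sketch below is a generic NumPy illustration, not the evaluation code used in the paper; the function names and the toy label maps are made up, and details such as ignore labels or background handling are omitted.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    idx = target * num_classes + pred
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class equals the ground truth."""
    return cm.trace() / cm.sum()

def mean_iou(cm):
    """Mean over classes of intersection / union."""
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    valid = union > 0   # skip classes absent from both prediction and ground truth
    return (intersection[valid] / union[valid]).mean()

target = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
cm = confusion_matrix(pred, target, num_classes=3)
print(pixel_accuracy(cm))  # 4 of 6 pixels correct
print(mean_iou(cm))        # per-class IoU: 1/3, 2/3, 1/2
```

In this toy case the per-class IoU values are 1/3, 2/3, and 1/2, so the mIoU is 0.5 while the pixel accuracy is 2/3, showing how the two metrics can diverge.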
We use PyTorch [paszke2017automatic] to implement SANets and conduct ablation studies. For the training process, we adopt stochastic gradient descent with a poly learning rate decay schedule for both datasets, as in previous works [zhao2017pyramid, zhang2018context]. The starting learning rates for PASCAL Context and PASCAL VOC are 0.001 and 0.0001, respectively. For the PASCAL Context dataset, we train SANets for 40 epochs. For the PASCAL VOC dataset, we pretrain models on the COCO dataset and then train the networks for 50 epochs on the validation set. We adopt ResNet50 and ResNet101 as the backbones of SANets because these networks have been widely used for mainstream segmentation benchmarks. We set the batch size to 16 in all training cases and use the synchronized batch normalization across multiple GPUs recently implemented by [zhang2018context]. We use four SA heads to exploit the multi-scale features of different stages of the backbones and also to regularize deep networks.
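The poly learning rate schedule mentioned above is commonly written as the base rate multiplied by (1 - step / total_steps) raised to a power. A minimal pure-Python sketch follows; the exponent 0.9 is a widely used default and an assumption here, since the text does not state it, and the step counts are hypothetical.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial ('poly') decay: base_lr * (1 - step / total_steps) ** power."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return base_lr * (1.0 - step / total_steps) ** power

base_lr = 0.001      # starting rate quoted for PASCAL Context
total_steps = 1000   # hypothetical number of training iterations
schedule = [poly_lr(base_lr, s, total_steps) for s in (0, 250, 500, 750, 1000)]
print(schedule)
```

The rate starts at the base value and decays smoothly to zero at the final step, which avoids the abrupt drops of step-wise schedules.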
4.2 Results on PASCAL Context
The PASCAL Context dataset contains 59 classes, 4998 training images, and 5105 test images. Since this dataset is relatively small in size, we use it as the benchmark to design module architectures and select hyper-parameters, including the loss weights $\alpha$ and $\beta$ of the categorical and auxiliary losses. To conduct an ablation study, we explore each component of SA modules that contributes to enhancing the segmentation results of SANets.
The ablation study includes three parts. First, we test the impacts of the weights $\alpha$ and $\beta$ of the total training loss. As shown in Fig. 4, we vary $\alpha$ from 0 to 1.0 and keep the value with which the SANet works best. Similarly, we fix $\alpha$ at this value and select the $\beta$ that yields the best segmentation performance. Second, we study the impacts of the categorical loss and the auxiliary loss in equation (7) using the selected hyper-parameters. Table 2 shows that the SANet, which contains the four dual-usage SA modules and uses ResNet50 as the backbone, improves significantly (a 2.7% PAcc and 6.0% mIoU increase) over the FCN baseline. Also, the categorical loss and auxiliary loss boost the segmentation performance.
We compare SANets with state-of-the-art models to validate their effectiveness. As shown in Table 3, the SANet using ResNet101 as its backbone achieves 53.0% mIoU. When the background class is included, the mIoU equals 52.1%, and this result also outperforms other competitors. We also use the recently published EfficientNet (EffNet) [tan2019efficientnet] as the backbone. The EffNet version of the SANet achieves a state-of-the-art 54.4% mIoU, which sets a new record for the PASCAL Context dataset. Fig. 4 shows the segmentation results of a ResNet50 FCN and a SANet using the same backbone. In the first three rows, SANets generate better object boundaries and higher segmentation accuracy. However, for complex images like the last row, both models fail to generate clean parsing results. In general, the qualitative assessment is in line with the quantitative results.
We also validate the effectiveness of SA modules by comparing them with SE modules on top of the baseline dilated FCNs, including ResNet50 and ResNet101. Table 4 shows that the SANets achieve the best accuracy with significant improvement (4.1% and 4.5% mIoU increase) in both settings, while FCN-SE models barely improve the segmentation results.
4.3 Attention and Feature Maps
Classic convolution already yields inherent global attention because each convolutional kernel sweeps across spatial locations over input feature maps. Therefore, to better understand the effect of attention channels in SA modules, we visualize the attention and feature maps of an example from the PASCAL VOC set and compare Head1 and Head4 within a SANet. We use L2 distance to show the attention maps of the attention channel within a SA module, and select the most activated feature map channels for the outputs of the main channel within the same SA module. The activated areas (red color) of the output feature maps of SA modules can be regarded as the pixel groups of selected points. For the sake of visualization, we scale all feature maps illustrated in Figure 6 to the same size. We select three points (red, blue, and magenta) in this example to show that the attention channel emphasizes the pixel-group attention, which is complementary to the main channels of SA modules that focus on pixel-level prediction.
Interestingly, as shown in Figure 6, the attention channels in the low-level (SA head1) and high-level (SA head4) stages play different roles. For the low-level stage, the attention maps of the attention channel have a broad field of view, and the feature maps of the main channel focus on local feature extraction with object boundaries preserved. In contrast, for the high-level stage, the attention maps of the attention channel mainly focus on the areas surrounding the selected points, and the feature maps of the main channel appear more homogeneous, with clearer semantic meaning than those of head1.
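One plausible way to turn L2 distances into an attention map for a selected point is to measure, at every location, the distance between its feature vector and the feature vector at that point, then rescale so that similar locations score high. The text does not spell out the exact procedure, so the NumPy sketch below is an assumption, and the function names and tensor sizes are hypothetical.

```python
import numpy as np

def l2_distance_map(features, point):
    """L2 distance between the feature vector at `point` and every location.

    features: (C, H, W) attention-channel output; point: (row, col).
    Returns an (H, W) array of distances.
    """
    row, col = point
    ref = features[:, row, col][:, None, None]
    return np.sqrt(((features - ref) ** 2).sum(axis=0))

def to_attention(dist):
    """Rescale distances to [0, 1], with 1 at the most similar locations."""
    span = dist.max() - dist.min()
    if span == 0:
        return np.ones_like(dist)
    return 1.0 - (dist - dist.min()) / span

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 8))   # hypothetical attention-channel output
dist = l2_distance_map(features, point=(2, 3))
attn = to_attention(dist)
print(attn.shape, attn[2, 3])
```

Locations whose features lie close to those of the selected point receive values near 1 and can be read as belonging to the same pixel group.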
|SANet (ours)||ResNet101||86.1 2|
4.4 Results on PASCAL VOC
The PASCAL VOC dataset [everingham2010pascal] is the most widely studied segmentation benchmark. It contains 20 classes and is composed of 10582 training images, 1449 validation images, and 1456 test images. We train the SANet using augmented data for 80 epochs, as in previous works [long2015fully, dai2015boxsup].
First, we test the SANet without COCO pretraining. As shown in Table 5, the SANet achieves 83.2% mIoU, which is higher than its competitors, and it leads on multiple classes, including aeroplane, chair, cow, table, dog, plant, sheep, and tv monitor. This result validates the effectiveness of the dual-usage SA modules. Models [chollet2017xception, chen2017rethinking] that use extra datasets other than PASCAL VOC or COCO, such as JFT [sun2017revisiting], are not included in Table 5.
Then, we test the SANet with COCO pretraining. As shown in Table 6, the SANet achieves 85.4% mIoU using COCO data for pretraining, which is comparable to top-ranking models including PSPNet [zhao2017pyramid], and outperforms RefineNet [lin2017refinenet], which is built on a heavy ResNet152 backbone. Our SA module is more computationally efficient than the encoding module of EncNet [zhang2018context]. As shown in Fig. 6, the predictions of SANets yield clearer boundaries and better qualitative results than those of the baseline model. Figures 8-9 show some segmentation results on the PASCAL Context and PASCAL VOC datasets using trained SANets with ResNet101 as their backbones.
In this paper, we disentangle semantic segmentation into three tasks: categorical recognition, pixel-wise dense prediction, and pixel-group attention modeling. We design a SA module that enhances the pixel-wise prediction and emphasizes the largely ignored pixel-group attention. We propose SANets to aggregate multi-stage multi-scale extracted features, resulting in promising performance. Most importantly, the SANet using EffNet-b7 sets a new record on the PASCAL Context dataset. We hope that the simple yet effective SA modules and the SANets built on top of them can facilitate the segmentation research of other groups.