Location-aware Upsampling for Semantic Segmentation

Xiangyu He    Zitao Mo    Qiang Chen    Anda Cheng    Peisong Wang    Jian Cheng
NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China
{xiangyu.he, zitao.mo, qiang.chen, anda.cheng, peisong.wang, jcheng}@nlpr.ia.ac.cn

Many successful learning targets such as minimizing dice loss and cross-entropy loss have enabled unprecedented breakthroughs in segmentation tasks. Beyond these semantic metrics, this paper aims to introduce location supervision into semantic segmentation. Based on this idea, we present a Location-aware Upsampling (LaU) that adaptively refines the interpolating coordinates with trainable offsets. Then, location-aware losses are established by encouraging pixels to move towards well-classified locations. An LaU couples offset prediction with interpolation and is trained end-to-end to generate confidence scores at each position, from coarse to fine. Guided by location-aware losses, the new module can replace its plain counterpart (e.g., bilinear upsampling) in a plug-and-play manner to further boost the leading encoder-decoder approaches. Extensive experiments validate the consistent improvement over the state-of-the-art methods on benchmark datasets. Our code is available at https://github.com/HolmesShuan/Location-aware-Upsampling-for-Semantic-Segmentation.

1 Introduction

Recent advances in deep learning have empowered a wide range of applications [23, 36, 19, 41, 12], including semantic segmentation [52, 47, 29, 6]. The pioneering Fully Convolutional Network (FCN) [29] with deconvolutional operations [37] dramatically surpasses classical methods relying on hand-crafted features. Following FCNs, a series of CNN-based methods further enhance the state-of-the-art by introducing dilated (atrous) convolutions [47, 7], shortcut connections [1] and CRFs [45, 2, 27]. Latest semantic segmentation methods [50, 8] exploit fully trainable encoder-decoder architectures with simple bilinear upsamplings to extract high-resolution predictions.

Although recent works have exhibited state-of-the-art performance, we suggest that the location information hidden in the label map has been overlooked. To introduce location prediction into segmentation models, we revisit the idea of semantic segmentation: the task of assigning each pixel of a photograph to a semantic class label [24], then decompose this target into two relatively simple sub-tasks: pixel-level localization and classification. Instead of following previous methods to simply predict class labels, we further propose to predict locations simultaneously, more precisely, to learn offsets with additional supervision.

This fresh viewpoint leads directly to the Location-aware Upsampling (LaU), which upsamples classification results while accounting for the effect of offsets. Beyond that, LaU addresses two key challenges in semantic segmentation. First, CNN-based segmentation methods extract high-level semantic features at the cost of losing detailed pixel-level spatial information. Second, the feature distribution of objects is imbalanced: the interior points enjoy rich knowledge after high-ratio upsampling, yet pixels around edges suffer from indiscriminative features due to bilinear interpolation. A natural solution is to borrow the information from interior points to guide the categorization near the boundary, which exactly corresponds to our two sub-tasks: learning offsets and utilizing the information at the new position to make a prediction.

Our observation is that the feature maps used for pixel-level classification, e.g., the second last layer in decoders, can also be used for predicting locations. On top of these features, we construct LaUs by adding an offset prediction branch to produce offsets at each position in the high-resolution feature map (details in Section 3.3). To facilitate the offset prediction of interior points and improve convergence, we first define the candidate locations, then use the predicted offsets to regress fine-grained coordinates, i.e., $(i/r + \delta^{x}_{i,j},\, j/r + \delta^{y}_{i,j})$.

The proposed offset prediction makes it possible to further introduce location supervision implicitly/explicitly into the training process. In this paper, we present two location-aware losses forcing the offsets to predict fine-grained coordinates with the correct class. Besides, our location-aware losses avoid human labeling and produce “labels” online. In the experiments, we empirically show that the location-aware mechanism outperforms strong baseline methods, and that it is feasible to apply the rich knowledge in the design of location regression to the location-aware upsampling. We believe dense prediction tasks can widely benefit from the location-aware module, including but not limited to semantic segmentation.

Figure 1: (a) PixelShuffle [38] is in the form of periodical shuffling of multiple intermediate feature maps generated by independent filters, which cuts off the strong spatial relationship among adjacent pixels on the low-resolution feature map (connected by gray dashed lines). (b) DUpsampling [40] replaces convolutions with a matrix product to produce a small patch, yet can be transformed into PixelShuffle. In both cases, networks are forced to learn the new spatial relationship generated by hand-crafted rearrangement.

2 Related Work

Feature map upsampling has been widely used in deep models, such as encoder-decoder architectures [32, 37], generative models [35, 30], super-resolution tasks [13, 38, 20] and semantic segmentation [29, 6]. Deconvolution operations [48, 49], also known as transposed convolutions [44], first bridge the gap between low-resolution feature maps and desired full-resolution predictions. Despite its successes, recent works [33, 16] show that deconvolutions can easily suffer from the checkboard problem when the kernel size is not divisible by the stride.

Advances like PixelShuffle [38] and DUpsampling [40] turn to hand-crafted pixel rearrangement to reconstruct high-resolution features. However, the empirical design results in an inconsistent spatial relationship among adjacent pixels on the high-resolution feature map (as illustrated in Figure 1, pixels connected by blue lines are generated by different filters, yet the input of each filter corresponds to the same receptive field; both the filter and the receptive field are different for pixels connected by red lines). More recently, PixelDCL [16] generates intermediate feature maps sequentially so that later stages depend on previously generated ones, which is similar to pixel-recurrent neural networks (PixelRNNs) [43] and PixelCNNs [42]. Yet pixel-wise prediction is time-consuming, and there is no explicit time-sequence relation between adjacent pixels, i.e., previously generated pixels can still rely on later produced ones, which leads to a chicken-and-egg problem.

Conditional Random Field (CRF) is another group of methods, that is primarily used in [5] as a disjoint post-processing. Following works [27, 45, 28] further integrate CRF into networks to model the semantic relationships between objects. Due to the increase of computing complexity caused by integration, [3, 4] propose a Gaussian conditional random field to avoid the iterative optimization [53] while suffering from heavy gradient computation. Similar to CRFs, [2, 22] focus on refining the output label map by random walks. However, the dense matrix inversion term and the two branch model with cascaded random walks cannot be easily merged into leading encoder-decoder architectures.

Spatial Transformer Networks (STN) [21] is the first work to introduce spatial transformation capacity into neural networks, which enables global parametric transformation like 2D affine transformation. Since the pointwise location sampling in STN is weakly coupled with the input feature map at each position (only through global transformation parameters produced by the preceding feature maps), this greatly limits its capacity in pixel-wise predictions such as semantic segmentation. The module we present in this paper can be seen as a new transformation in a local and dense manner. The upsampled feature map can still enjoy the detailed information of low-resolution inputs by using pixel-wise offsets.

Deformable convolution networks (DCNN) [9, 55] achieve strong performance on object detection and semantic segmentation with trainable offsets. However, they are commonly used in feature extraction, and we empirically show that replacing the last bilinear upsampling with “DCNN+Bilinear” degrades performance (details in Table 9 and appendix section 2.2). As illustrated in Figure 2, DCNN aims at augmenting the spatial sampling grid of filters without extra supervision, whereas our method directly refines the sampling coordinates with location-guided losses, which contributes to higher performance.

Figure 2: The illustrations of the original convolution (a), deformable convolution [9] (b), bilinear upsampling (c) and LaU (d).

3 Methodology

Data-independent upsampling such as bilinear interpolation is one of the basic techniques in computer vision and image processing. Despite its high efficiency, this module can be overly simple (illustrated in Figure 3) and fail to capture the semantic information. In this section, we alleviate the existing problems in bilinear upsampling by introducing the location-aware mechanism.

Learning data-dependent offsets is the key step in LaU. This idea is based on the fact that adjacent boundary points on the high-resolution label map can degrade into the same pixel in the low-resolution feature map, since upsampling is commonly used in popular methods [7, 51, 50]. Instead of sampling from the local area, we propose to use the predicted offsets, generating the output map from the location-refined coordinates. We describe the formulation of a location-aware upsampling in the following sections.

Figure 3: Different cases in 2D bilinear interpolation. (a) An ideal case that fully exploits four points to obtain the desired estimate. (b) If one of the interpolating coordinates (i.e., $x$ or $y$) happens to be an integer, bilinear interpolation degrades into a linear interpolation. (c) The special case where both $x$ and $y$ are integers: bilinear interpolation is an injective non-surjective function. (d)/(e) A natural solution to the degradation in (b, c) is to slightly move the interpolation point, which leads to another problem: how to determine the offsets?

3.1 Spatial Upsampling

To estimate the values in the output feature map $V \in \mathbb{R}^{C \times rH \times rW}$, a sampler applies the sampling kernel $\mathcal{K}$ to a particular grid in the input feature map $U \in \mathbb{R}^{C \times H \times W}$. Each coordinate $(i, j)$ defines the spatial location in $V$ and determines the kernel function $\mathcal{K}$ at a particular pixel in channel $c$. Formally,

$$V^{c}_{i,j} = \sum_{n}^{H} \sum_{m}^{W} U^{c}_{n,m}\, \mathcal{K}(i, j, n, m; \Theta), \tag{1}$$

where $i$, $j$ are the integer coordinates in the output feature map, and $\Theta$ are the parameters of a generic sampling function which can be preset or data-dependent ($\Theta$ can be $\varnothing$ in bilinear interpolation, and the predicted offsets $\delta$ in LaU). Note that the existing upsamplings such as bilinear and PixelShuffle apply the same $\mathcal{K}$ to every channel of the input (i.e., data-independent), while LaU enables free changes of $\mathcal{K}$ at each position. That is, the sampling kernel in LaU changes with the input data and coordinates.

In theory, $\mathcal{K}$ is a (sub-)differentiable function of the coordinates $n$ and $m$, along with the parameters $\Theta$, which can be in any form. For example, $\mathcal{K}(i, j, n, m) = \max(0,\, 1 - |\frac{i}{r} - n|)\,\max(0,\, 1 - |\frac{j}{r} - m|)$, with upsampling ratio $r$, corresponds to the bilinear interpolation. Here, we show that PixelShuffle is another data-independent case, which may lead to sub-optimal results.
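To make the kernel form of Eq. (1) concrete, the following NumPy sketch (our own illustration, not the paper's code) upsamples a small map with the bilinear kernel $\max(0, 1-|i/r-n|)\max(0, 1-|j/r-m|)$; function names are illustrative:

```python
import numpy as np

def bilinear_kernel(u, n):
    # 1-D bilinear weight: max(0, 1 - |u - n|)
    return max(0.0, 1.0 - abs(u - n))

def upsample_bilinear(U, r):
    """Kernel form of Eq. (1) with the bilinear kernel:
    V[i, j] = sum_{n, m} U[n, m] * k(i/r, n) * k(j/r, m)."""
    H, W = U.shape
    V = np.zeros((r * H, r * W))
    for i in range(r * H):
        for j in range(r * W):
            for n in range(H):
                for m in range(W):
                    V[i, j] += U[n, m] * bilinear_kernel(i / r, n) * bilinear_kernel(j / r, m)
    return V

U = np.array([[0.0, 1.0],
              [2.0, 3.0]])
V = upsample_bilinear(U, 2)
# Grid-aligned outputs reproduce the inputs exactly (the injective case in Fig. 3c);
# half-integer positions average their four neighbours.
assert V[0, 0] == U[0, 0] and V[2, 2] == U[1, 1]
```

The quadruple loop is written for clarity; in practice the kernel has support on at most four input pixels per output position.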

3.1.1 PixelShuffle

PixelShuffle can also be reformulated in the form of Eq. (1). Specifically, the sampling kernel $\mathcal{K}$ in PixelShuffle upsampling is

$$\mathcal{K}(i, j, n, m; \Theta) = \delta\!\left(\lfloor \tfrac{i}{r} \rfloor - n\right) \delta\!\left(\lfloor \tfrac{j}{r} \rfloor - m\right),$$

applied to the intermediate channel indexed by $(i \bmod r,\, j \bmod r)$, where $\delta(\cdot)$ is the Kronecker delta function. Note that $\mathcal{K}$ changes along with channel $c$, yet is still fixed with respect to the input data. From the view of Eq. (1), this sampling kernel equates to the data-independent methods like bilinear interpolation. Furthermore, the output of the Kronecker delta is either $0$ or $1$, which partly pinpoints the root of the missing spatial relationship. That is, the mapping generated by PixelShuffle can be considered as a bijection. There is no direct relationship between adjacent elements in its codomain (output feature map), unless the source elements in its domain (input feature map) are strongly correlated. For example, neighboring pixels connected by red lines in Figure 1 belong to different receptive fields and are generated by different filters; they are likely to look different (i.e., uncorrelated) unless the filter weights are similar. Therefore, the network has to consider the high-resolution spatial relationships when producing low-resolution intermediate feature maps.
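The periodic shuffling can be sketched in a few lines of NumPy; the channel ordering below follows the common PyTorch-style convention, which is an assumption rather than something fixed by the paper:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Periodic shuffling: (C*r^2, H, W) -> (C, r*H, r*W).
    Each output pixel V[c, i, j] copies one input element at (i // r, j // r),
    i.e. a Kronecker-delta kernel on the floored coordinates -- data-independent."""
    C2, H, W = x.shape
    C = C2 // (r * r)
    out = np.zeros((C, r * H, r * W))
    for c in range(C):
        for i in range(r * H):
            for j in range(r * W):
                src_c = c * r * r + r * (i % r) + (j % r)  # assumed channel ordering
                out[c, i, j] = x[src_c, i // r, j // r]
    return out

U = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)  # r = 2, C = 1
V = pixel_shuffle(U, 2)
# Adjacent output pixels come from different intermediate channels,
# i.e. different filters -- the broken spatial relationship discussed above.
assert V[0, 0, 0] == U[0, 0, 0] and V[0, 0, 1] == U[1, 0, 0]
```

Note that neighbouring outputs share the same low-resolution location only within one $r \times r$ block, while belonging to independently generated channels.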

Figure 4: Illustrations of the location-aware loss. $(x, y)$ is the coordinate point for interpolating; $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ round the input upward and downward, respectively. (a) The offset-guided loss punishes the moved pixel with loss weight $\lambda$ if it yields an even larger loss than the original point. (b) Illustrations of the different sampling kernels of LaU and the left-top, left-bottom, right-top and right-bottom upsamplings. (c) The optimal candidate coordinates correspond to the pixels with the smallest loss among the LaU, left-top, left-bottom, right-top and right-bottom upsamplings. Ground-truth is the same as in subfigure (a).

3.1.2 Location-aware Upsampling

Mathematically, bilinear upsampling is a non-injective surjective function (not a bijection). Neighboring points in $V$ sampled from the same grid in $U$ are strongly correlated, since they are generated from the same elements. Compared with PixelShuffle, this scheme avoids checkerboard artifacts. As illustrated in Figure 2, location-aware upsampling is a simple yet effective improvement to the bilinear interpolation with the following sampling kernel

$$\mathcal{K}(i, j, n, m; \delta) = \max\!\left(0,\, 1 - \left|\tfrac{i}{r} + \delta^{x}_{i,j} - n\right|\right) \max\!\left(0,\, 1 - \left|\tfrac{j}{r} + \delta^{y}_{i,j} - m\right|\right), \tag{2}$$

where $r$ is the upsampling ratio and $(\delta^{x}_{i,j}, \delta^{y}_{i,j})$ are the predicted offsets. Inspired by the anchor-based methods in object detection [17, 36], we introduce $(i/r, j/r)$ as a candidate location, then regress the fine-grained location with the predicted offsets. To encourage the network to jump out of the local grid, we add no constraint on $\delta$.
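A minimal NumPy sketch of this sampling scheme, assuming per-position offsets shared across channels (function and argument names are ours); with zero offsets it reduces to plain bilinear upsampling:

```python
import numpy as np

def lau_upsample(U, r, dx, dy):
    """Eq. (2)-style location-aware upsampling: output pixel (i, j) bilinearly
    samples the input at the shifted coordinate (i/r + dx[i, j], j/r + dy[i, j])."""
    H, W = U.shape
    V = np.zeros((r * H, r * W))
    for i in range(r * H):
        for j in range(r * W):
            u, v = i / r + dx[i, j], j / r + dy[i, j]  # candidate location + offset
            for n in range(H):
                for m in range(W):
                    V[i, j] += U[n, m] * max(0, 1 - abs(u - n)) * max(0, 1 - abs(v - m))
    return V

U = np.array([[0.0, 1.0],
              [2.0, 3.0]])
zeros = np.zeros((4, 4))
# Sanity check: zero offsets recover bilinear behaviour at grid points
assert lau_upsample(U, 2, zeros, zeros)[0, 0] == 0.0
```

Since no constraint is placed on the offsets, a predicted offset can move the sampling point well outside the local $r \times r$ grid, which is exactly the "jump out of the local grid" behaviour described above.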

To learn the offsets in the upsampling module, we utilize the gradients with respect to $\delta^{x}_{i,j}$ and $\delta^{y}_{i,j}$. This allows the backpropagation of the loss function through the (sub-)differentiable location-aware upsampling module. For upsampling kernel (2), the partial derivatives are

$$\frac{\partial V^{c}_{i,j}}{\partial \delta^{x}_{i,j}} = \sum_{n}^{H} \sum_{m}^{W} U^{c}_{n,m}\, \max\!\left(0,\, 1 - \left|\tfrac{j}{r} + \delta^{y}_{i,j} - m\right|\right) \cdot \begin{cases} 0 & \text{if } \left|\tfrac{i}{r} + \delta^{x}_{i,j} - n\right| \ge 1 \\ 1 & \text{if } n \ge \tfrac{i}{r} + \delta^{x}_{i,j} \\ -1 & \text{if } n < \tfrac{i}{r} + \delta^{x}_{i,j} \end{cases}$$

and symmetrically for $\partial V^{c}_{i,j} / \partial \delta^{y}_{i,j}$.
With the gradients with respect to the offsets, we can further define the offset prediction branch to allow end-to-end training.
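The sub-gradient of the sampled value with respect to the x-offset can be checked numerically. The sketch below uses the standard STN-style bilinear gradient (an assumption consistent with the kernel above) and compares it against a central finite difference:

```python
import numpy as np

def sample(U, u, v):
    # Bilinear sample of U at real-valued coordinate (u, v), kernel form
    H, W = U.shape
    out = 0.0
    for n in range(H):
        for m in range(W):
            out += U[n, m] * max(0, 1 - abs(u - n)) * max(0, 1 - abs(v - m))
    return out

def grad_u(U, u, v):
    # Analytic sub-gradient w.r.t. the x-coordinate:
    # d/du max(0, 1 - |u - n|) = 0 if |u - n| >= 1, +1 if n >= u, -1 if n < u
    H, W = U.shape
    g = 0.0
    for n in range(H):
        for m in range(W):
            if abs(u - n) >= 1:
                continue
            g += U[n, m] * (1.0 if n >= u else -1.0) * max(0, 1 - abs(v - m))
    return g

U = np.array([[0.0, 1.0],
              [2.0, 3.0]])
u, v, eps = 0.3, 0.6, 1e-6
numeric = (sample(U, u + eps, v) - sample(U, u - eps, v)) / (2 * eps)
assert abs(grad_u(U, u, v) - numeric) < 1e-4
```

The gradient is piecewise constant and undefined exactly at integer coordinates, which is why the module is only (sub-)differentiable.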

3.2 Location-aware Loss

To fully exploit the effect of the predicted offsets $\delta$, we introduce (semi-)supervised location-aware losses into the training process. Since the offsets could point to any input location, as long as it contains the correct category, we first consider the offset-guided loss without hard constraints. Inspired by bounding box regression, we further propose the offset regression loss to guide the offset prediction directly.

3.2.1 Offset-guided Loss

In the common training process of semantic segmentation, the network is optimized by a per-pixel cross-entropy loss. Instead of learning from isolated per-pixel supervision, we consider the loss pair generated by bilinear upsampling and LaU (shown in Figure 4a). Formally, we introduce the auxiliary loss $\hat{\ell}_{i,j}(x; w)$ (with no gradient) and the target loss $\ell_{i,j}(x; w, \delta)$, where $x$ is the input image, $w$ is the model parameters and $\delta$ refers to the predicted offsets. The per-pixel offset-guided loss

$$\ell^{og}_{i,j} = \ell_{i,j} + \lambda\, \mathbb{1}\!\left[\ell_{i,j} \ge \hat{\ell}_{i,j}\right] \ell_{i,j} \tag{3}$$

with loss weight $\lambda$ forces the network to “take a step”, since the trivial solution (i.e., $\delta = 0$) will be punished by the extra cost $\lambda\, \ell_{i,j}$. Besides, $\mathbb{1}[\ell_{i,j} \ge \hat{\ell}_{i,j}]$ demonstrates that the moved pixel should have a smaller cross-entropy loss (i.e., a higher confidence score) than the original point. Therefore, (3) encourages the pixel to move towards a better position, which contains a potentially correct class.
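A sketch of this behaviour in NumPy, under the assumption that the offset-guided loss takes the hinge-style form $\ell \cdot (1 + \lambda\,\mathbb{1}[\ell \ge \hat{\ell}])$ implied by the description above; the function name and default $\lambda$ are illustrative:

```python
import numpy as np

def offset_guided_loss(ce_lau, ce_bilinear, lam=0.3):
    """Per-pixel offset-guided loss (assumed form): the auxiliary bilinear loss
    ce_bilinear acts as a no-gradient reference, and any moved pixel whose
    cross-entropy is not smaller than the original point pays an extra lam * loss."""
    penalty = (ce_lau >= ce_bilinear).astype(float)  # trivial solution (delta = 0) is penalized
    return ce_lau + lam * penalty * ce_lau

ce_lau = np.array([0.2, 0.9])       # losses after moving via predicted offsets
ce_bilinear = np.array([0.5, 0.5])  # auxiliary losses at the original points
loss = offset_guided_loss(ce_lau, ce_bilinear)
assert loss[0] == 0.2              # improved pixel: no penalty
assert abs(loss[1] - 1.17) < 1e-9  # worse pixel: 0.9 * (1 + 0.3)
```

In a real training loop `ce_bilinear` would be detached from the computation graph, matching the "with no gradient" note in the text.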

3.2.2 Regression Loss

Since there is no ground-truth label for the offsets, we could intuitively search for the closest well-classified point among the neighbouring pixels to generate the offsets. However, brute-force search is time-consuming for online label generation. Hence, we empirically constrain the search space to the integral coordinate points (shown in Figure 4b).

Following the previous subsection, we further introduce the auxiliary losses $\hat{\ell}^{lt}$, $\hat{\ell}^{lb}$, $\hat{\ell}^{rt}$ and $\hat{\ell}^{rb}$ of the left-top, left-bottom, right-top and right-bottom upsamplings, e.g., with the sampling kernels

$$\mathcal{K}^{lt}(i, j, n, m) = \delta\!\left(\lfloor \tfrac{i}{r} \rfloor - n\right) \delta\!\left(\lfloor \tfrac{j}{r} \rfloor - m\right), \qquad \mathcal{K}^{rb}(i, j, n, m) = \delta\!\left(\lceil \tfrac{i}{r} \rceil - n\right) \delta\!\left(\lceil \tfrac{j}{r} \rceil - m\right),$$

where $\delta(\cdot)$ is the “indicator function”, that is, $\delta(0) = 1$ and $\delta(t) = 0$ for $t \ne 0$. Then, we define the loss set

$$\mathcal{S}_{i,j} = \left\{ \hat{\ell}_{i,j},\ \hat{\ell}^{lt}_{i,j},\ \hat{\ell}^{lb}_{i,j},\ \hat{\ell}^{rt}_{i,j},\ \hat{\ell}^{rb}_{i,j} \right\}$$

and the coordinate set

$$\mathcal{C}_{i,j} = \left\{ \left(\tfrac{i}{r} + \delta^{x}_{i,j},\, \tfrac{j}{r} + \delta^{y}_{i,j}\right),\ \left(\lfloor \tfrac{i}{r} \rfloor, \lfloor \tfrac{j}{r} \rfloor\right),\ \left(\lceil \tfrac{i}{r} \rceil, \lfloor \tfrac{j}{r} \rfloor\right),\ \left(\lfloor \tfrac{i}{r} \rfloor, \lceil \tfrac{j}{r} \rceil\right),\ \left(\lceil \tfrac{i}{r} \rceil, \lceil \tfrac{j}{r} \rceil\right) \right\}.$$

Combining $\mathcal{S}_{i,j}$ and $\mathcal{C}_{i,j}$, we obtain the optimal candidate coordinates

$$(x^{*}_{i,j},\, y^{*}_{i,j}) = \mathcal{C}_{i,j}[k^{*}], \qquad k^{*} = \arg\min_{k} \mathcal{S}_{i,j}[k],$$

where ties are broken at the first minimum.
Given $(x^{*}_{i,j}, y^{*}_{i,j})$ (illustrated in Figure 4c), we formulate the per-pixel offset regression loss as

$$\ell^{reg}_{i,j} = \ell_{i,j} + \beta \left( \left|\tfrac{i}{r} + \delta^{x}_{i,j} - x^{*}_{i,j}\right| + \left|\tfrac{j}{r} + \delta^{y}_{i,j} - y^{*}_{i,j}\right| \right), \tag{4}$$

where $\beta$ controls the strength of the regression loss. In our experiments, we set $\beta$ to 0.1 without fine-tuning. Note that the auxiliary losses will be excluded from the backward propagation (with no gradient) and omitted at inference time.
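The candidate selection and the regression target can be sketched as follows; the candidate ordering, the tie-breaking rule, and the plain L1 distance are our assumptions, not necessarily the paper's exact formulation:

```python
import numpy as np

def optimal_candidate(i, j, r, delta, losses):
    """Pick the optimal candidate coordinate (cf. Fig. 4c): among the LaU-sampled
    point and the four surrounding integer corners, choose the one with the
    smallest auxiliary loss. `losses` is ordered as [lau, lt, lb, rt, rb]."""
    u, v = i / r + delta[0], j / r + delta[1]
    fl_u, fl_v = np.floor(i / r), np.floor(j / r)
    ce_u, ce_v = np.ceil(i / r), np.ceil(j / r)
    candidates = [(u, v), (fl_u, fl_v), (ce_u, fl_v), (fl_u, ce_v), (ce_u, ce_v)]
    return candidates[int(np.argmin(losses))]  # argmin breaks ties at the first minimum

def regression_loss(ce_lau, point, target, beta=0.1):
    # Target loss plus a beta-weighted L1 distance to the optimal candidate
    return ce_lau + beta * (abs(point[0] - target[0]) + abs(point[1] - target[1]))

# For output pixel (3, 3) at ratio r = 2, the left-top corner has the smallest loss:
tgt = optimal_candidate(3, 3, 2, (0.2, -0.1), [0.9, 0.1, 0.5, 0.6, 0.7])
assert tgt == (1.0, 1.0)
```

Because the auxiliary losses carry no gradient, only the L1 term in `regression_loss` pushes the predicted offsets towards the selected corner.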

3.3 Location-aware Network

The combination of predicted offsets and interpolation operation forms a location-aware upsampling (Figure 5). Note that the offset prediction branch can take any form in practice, such as a fully-connected network like RPN in Faster-RCNN [36]. In this paper, we use “Conv+LReLU+Conv+PixelShuffle” to illustrate the basic framework of location-aware mechanism in semantic segmentation.

The first convolution layer reduces the channel number of the input feature map. Then, the last convolution layer produces an offset tensor with $2k$ channels per output position, where $k$ can be the number of classes to allow a per-element shift. However, the extra computing cost will be unacceptable if $k$ is too large. When we set $k$ to 1, the output feature map shares the offsets across different channels. Compared with setting $k$ to the number of classes, this setting substantially reduces the FLOPs on the ADE20K dataset.
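Some shape bookkeeping makes the branch concrete. This sketch assumes the "Conv+LReLU+Conv+PixelShuffle" layout described above; the helper name, the channel-reduction factor of the first conv, and the default $k$ are illustrative assumptions:

```python
def offset_branch_shapes(C_in, H, W, r, k=1):
    """Shape flow of an assumed offset branch: the last conv emits 2*k*r^2 channels
    at low resolution, and PixelShuffle rearranges them into a (2*k, r*H, r*W)
    offset tensor -- one (dx, dy) pair per output position when k = 1."""
    mid = (C_in // 4, H, W)              # assumed channel reduction by the first conv
    pre_shuffle = (2 * k * r * r, H, W)  # offsets packed into r^2 sub-maps
    offsets = (2 * k, r * H, r * W)      # per-position (dx, dy) after PixelShuffle
    return mid, pre_shuffle, offsets

_, pre, off = offset_branch_shapes(C_in=512, H=60, W=60, r=2, k=1)
assert pre == (8, 60, 60) and off == (2, 120, 120)
```

With $k = 1$ the branch adds only a handful of output channels regardless of the number of classes, which is why sharing offsets across channels keeps the overhead small.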

Encoder-decoder architectures with LaU tend to first predict the general location and rough boundary, then reconstruct the detailed information using the predicted offsets. As shown in Figure 7, this two-step scheme decomposes the original problem into two sub-tasks: pixel-level localization and classification, which correspond to the two branches in the location-aware network.

Figure 5: The framework overview of LaU. Our approach employs the same encoder-decoder architecture as the original method, which adds a convolution layer over the decoder output to generate per-pixel classification scores. The offset prediction branch (green blocks) shares the input feature map with that convolution, then produces the offsets $\delta$. A location-aware upsampling merges the two branches and generates high-resolution outputs.

4 Experiments

We evaluate the proposed method on the PASCAL VOC 2012 semantic segmentation benchmark [14], PASCAL Context [31] and ADE20K [54] datasets. The standard evaluation metrics are pixel accuracy (pixAcc) and mean Intersection of Union (mIoU). In this section, we first introduce datasets used in this paper as well as the implementation details. We then apply LaU to the top of networks and the upsampler in decoders to show the effectiveness of the location-aware mechanism. Moreover, the visualization results demonstrate the superiority of optimizing the proposed location-aware segmentation (shown in Figure 6).

\thesubsubfigure Image
\thesubsubfigure GT
\thesubsubfigure EncNet
\thesubsubfigure LaU
Figure 6: Visualization results of different upsampling modules using EncNet-ResNet50 [50]. “EncNet” refers to the original bilinear upsampling. “LaU” is location-aware upsampling with the offset-guided loss.

4.1 Experimental Settings


PASCAL Context dataset [31] provides pixel-wise semantic annotations for whole scenes, which contains 4998 images for training and 5105 images for testing. Following [26, 15, 46, 50], we use the most frequent 59 object categories plus background (60 classes in total) as the ground-truth labels. PASCAL VOC 2012 consists of 1464 (train), 1449 (val) and 1456 (test) pixel-level annotated images, which involves 20 foreground object classes and one background class. Following previous works [8, 18], we augment the dataset with the extra annotations, resulting in 10582 (trainaug) training images. The mIoU is averaged across 21 classes. ADE20K [54] scene parsing benchmark consists of 150 categories. There are 20K images for training, 2K for validation and 3K for testing. The results on the test set are provided by the ADE20K online evaluation server.

Implementation Details

For the experiments on PASCAL Context, we adopt the same settings as described in [50, 46]. The learning rate starts at 0.001 and gradually decreases to 0 using the “poly” strategy (power = 0.9). The networks are trained for 80 epochs with SGD. Concretely, the momentum is set to 0.9 and the weight decay is 0.0001. To encourage the LaU to “look wider”, we set the weight decay of the convolutional layer in LaU to 0. For data augmentation, we randomly scale (from 0.5 to 2.0) and horizontally flip the input images, then apply a random crop. All the experiments are conducted on 4 GPUs with batch size 16. Since ImageNet pre-trained backbones can significantly contribute to higher performance, we follow the practice of [50, 46, 15, 40] and use the standard ResNet-50 and ResNet-101 [19].

For training on PASCAL VOC, we follow the settings in [40, 8]. The initial learning rate is set to 0.007 with the “poly” policy and the weight decay is 0.0001. ResNet-50 experiments are trained with batch size 48, and additional iterations are required for training ResNet-101 with an initial learning rate of 0.001 [8]. The scaling-and-flipping data augmentation and the LaU settings remain the same as on PASCAL Context.

For the experiments on ADE20K, as described in EncNet [50], we train our model on the training set for 180 epochs with learning rate 0.004 and evaluate it on the validation set using multi-scale testing. We then fine-tune our models on the train-val set for another 30 epochs with a smaller learning rate and submit the results to the test server. We use a fixed base size with random crops for training. The backbone networks are the same as in the previous experiments.

Unless specified, we employ the same loss function as described in [7, 50] plus a location-aware loss and report the single-scale results, without any bells and whistles.

Upsampling Ratio mIoU% Params FLOPs
Bilinear ([50]) 49.08 41.82M 150.69G
LaU 49.61 M G
49.67 M G
50.27 M G
49.65 M G
Table 1: Performance on Pascal Context val set with EncNet-ResNet50 (60 classes) using different upsampling ratios. “LaU” refers to LaU with offset-guided loss. We cover the remaining upsampling factor of LaU with bilinear upsampling.
λ 0.0 0.1 0.2 0.3 0.4
mIoU% 49.77 49.84 49.94 50.27 49.98
pixAcc% 78.69 78.77 78.75 79.01 78.89
Table 2: Ablation study on loss weight λ using EncNet-ResNet50+LaU (Pascal Context val 60 classes).
Ablation Study for LaU

To evaluate LaU, the key parameter is the upsampling factor. As listed in Table 1, the best upsampling ratio outperforms the other settings with negligible additional time and space complexity. With the proposed LaU, the results consistently outperform standard bilinear upsampling, and the best setting exceeds the baseline by +1.2 mIoU with only 0.16M extra parameters. Since the loss weight of the original auxiliary loss in EncNet is 0.2, in this section we intuitively set $\lambda$ to 0.3 to enlarge the gradient of the location-aware loss.

Head Upsampling OS mIoU%
EncNet [50] PixelShuffle 8 43.4
PixelShuffle 8 38.3
PixelShuffle 8 27.9
Bilinear 8 49.1
LaU 8 50.3
EncNet [50] PixelShuffle 32 37.8
Bilinear 32 45.7
LaU 32 46.7
ASPP [7] PixelShuffle 8 43.3
Bilinear 8 48.4
LaU 8 49.8
(a) Applying to encoder-decoder methods with ResNet-50. “OS” refers to the output stride. We first apply an upsampling with LaU/PixelShuffle, then cover the remaining upsampling factor with bilinear upsampling.
Method Backbone mIoU% MS
CCL [11] ResNet-101 50.7
DUpsampling [40] Xception-65 51.4
DUpsampling [40] Xception-71 52.5
DANet [15] ResNet-101 52.6
HRNet [39] HRNetV2-W48 53.1
JPU [46] ResNet-101 53.1
BFP [10] ResNet-101 52.7
EncNet [50] ResNet-101 51.7
EncNet + LaU ResNet-101 53.9
EncNet + LaU ResNet-101 52.7
EncNet + LaU ResNet-101 53.9
EncNet + LaU ResNet-101 52.8
(b) Results with state-of-the-art methods. “off” and “reg” are the offset-guided loss and the offset regression loss, respectively. “MS” means multi-scale testing.
Table 3: Semantic segmentation results on Pascal Context val set. mIoU on 60 classes w/ background.
Ablation Study for Location-aware Loss

We experiment with loss weight $\lambda$ between 0 and 0.4 and show the results in Table 2. Note that LaU without additional supervision ($\lambda = 0$) still produces a performance gain, and $\lambda = 0.3$ yields the best performance. Both the location-aware upsampling and the location-aware loss contribute to higher performance than the bilinear baseline (similar results are reported in Table 9). Due to the limited computing resources, we directly apply the best settings to the regression loss experiments and empirically set $\beta$ to 0.1 without grid search.

4.2 Applying to Label Prediction

In this section, we apply LaU to the top of the decoders (as shown in Figure 5), which corresponds to the prediction of label maps.

4.2.1 Quantitative Analysis

PASCAL Context

To estimate the generalization ability of LaU, we conduct experiments on two popular methods Encoding [50] and ASPP [7] with different output strides. Table 3(a) shows that our methods consistently outperform the original bilinear methods. Following the setting in super-resolution tasks [27, 20, 38], one may decompose a single upsampling into three cascaded upsamplings. To simplify the experimental settings, we leave this scheme as future work.

Compared with PixelShuffle [38] (Table 3(a)) and Deformable [9] bilinear upsampling (Table 9), LaU notably outperforms those methods with/without the location-aware loss. Note that PixelShuffle and Deformable damage the performance; we show the visualization results and give a brief discussion in appendix section 2.

Table 3(b) further shows that single-scale LaUs surpass the recent upsampling method DUpsampling [40] with the strong Xception-71 backbone (note that the ImageNet Top-1 accuracy of Xception-65 is higher than that of ResNet-101). Our multi-scale result achieves a +2.2 mIoU gain over EncNet, while LaU is still compatible with attention mechanisms and stronger encoder-decoder architectures to further enhance the leading performance.

Method Upsampling OS mIoU%
EncNet [50] Deformable+Bilinear 32 44.7
LaU 32 46.2
LaU 32 46.7
Deformable+Bilinear 8 48.0
LaU 8 49.8
LaU 8 50.3
ASPP [7] Deformable+Bilinear 8 47.9
LaU 8 49.4
LaU 8 49.8
Table 4: Comparison with Deformable Convolution Network [9] on Pascal Context val set. We add an “LaU/Deformable+Bilinear” module to the top of EncNet/ASPP. “LaU” refers to LaU without location-aware loss. “OS” is the output stride.

As shown in Table 5, our methods achieve similar results on the ADE20K dataset. With the ResNet-50 backbone, LaUs outperform the EncNet baseline by 1.67 and 1.30 mIoU. The regression loss seems to perform worse than the offset-guided loss. We attribute this to the hand-crafted design of the optimal candidate coordinates and the selection of $\beta$. The current search space contains only four points, which can be suboptimal. We will include a larger search space in our future work.

Although our methods achieve slightly worse performance than EncNet on val set, LaUs notably surpass EncNet and all entries in COCO-Place challenge 2017 on the final test score (see results in Table 6).

4.2.2 Qualitative Analysis

We show the PASCAL-Context visual results in Figure 6. LaU labels the missing parts in EncNet and corrects the misclassifications. More examples on ADE20K and the visual comparison between the offset-guided loss and the regression loss are shown in the appendix (Figure 11). To further illustrate the effect of the location-aware mechanism, we compare the segmentation results before and after LaU in Figure 7. LaU first produces a coarse segmentation of objects with rough boundaries (“Before LaU” in Figure 7), then refines the boundaries using the offsets (“After LaU” in Figure 7). The whole segmentation process of LaU is from coarse to fine.

Method Backbone pixAcc% mIoU%
EncNet [50] ResNet-50 79.73 41.11
EncNet+LaU ResNet-50 80.53 42.78
EncNet+LaU ResNet-50 80.28 42.41
HRNet [39] HRNetV2-W48 43.20
PSPNet [51] ResNet-152 81.38 43.51
EncNet [50] ResNet-101 81.69 44.65
PSPNet [51] ResNet-269 81.69 44.94
EncNet+LaU ResNet-101 81.21 44.55
EncNet+LaU ResNet-101 81.49 45.02
Table 5: ADE20K val set results with top-performing models (using multi-scale testing).
Team Model Final Score mIoU%
PSPNet [51] ResNet-269 0.5538
WinterISComing 0.5544
EncNet [50] ResNet-101 0.5567
JPU [46] ResNet-101 0.5584 38.39
EncNet+LaU ResNet-101 0.5641 39.04
EncNet+LaU ResNet-101 0.5632 39.14
Table 6: Results on ADE20K test set. Best entries in COCO-Place challenge 2017 are listed.

4.3 Applying to Decoder

Recent state-of-the-art methods [7, 8] utilize upsampling modules in decoders to combine high-resolution features with low-resolution ones. To show the general applicability of the proposed method, we replace the bilinear upsampling in DeepLabv3 [8] with LaU. The performance of ResNet-50+LaU (78.55 mIoU in Table 7) approaches the original ResNet-101 backbone results. Even with the strong ResNet-101 baseline, LaU still boosts the performance by more than one point.

\thesubsubfigure Image
\thesubsubfigure GT
\thesubsubfigure Before LaU
\thesubsubfigure After LaU
Figure 7: Illustrations of the effect of predicted offsets generated by LaU. We visualize the segmentation results of EncNet+LaU before and after LaU.
Method Upsampling Backbone mIoU%
DeepLabv3 Bilinear ResNet-50 76.52
LaU ResNet-50 78.55
Bilinear ResNet-101 78.85
LaU ResNet-101 80.16
Table 7: mIoU over Pascal VOC 2012 val set using DeepLabv3 [8].

5 Conclusion

In this paper, we provide a new perspective for optimizing the semantic segmentation problem. We decompose the original pixel-level classification problem into offsets prediction and classification, which introduces the idea of location prediction into semantic segmentation. Based on this fresh viewpoint, we propose the location-aware upsampling and location-aware losses. Our models achieve promising performance compared with various baseline methods. The effectiveness of learning offsets is also verified through qualitative and quantitative results.


  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39 (12), pp. 2481–2495. External Links: Link, Document Cited by: §1.
  • [2] G. Bertasius, L. Torresani, S. X. Yu, and J. Shi (2017) Convolutional random walk networks for semantic image segmentation. See DBLP:conf/cvpr/2017, pp. 6137–6145. External Links: Link, Document Cited by: §1, §2.
  • [3] S. Chandra and I. Kokkinos (2016) Fast, exact and multi-scale inference for semantic image segmentation with deep gaussian crfs. See DBLP:conf/eccv/2016-7, pp. 402–418. External Links: Link, Document Cited by: §2.
  • [4] S. Chandra, N. Usunier, and I. Kokkinos (2017) Dense and low-rank gaussian crfs using deep embeddings. See DBLP:conf/iccv/2017, pp. 5113–5122. External Links: Link, Document Cited by: §2.
  • [5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. See DBLP:conf/iclr/2015, External Links: Link Cited by: §2.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), pp. 834–848. External Links: Link, Document Cited by: §1, §2.
  • [7] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587. External Links: Link, 1706.05587 Cited by: Table 8, §B.1, §1, §3, §4.1, §4.2.1, §4.3, 3(a), Table 4.
  • [8] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. See DBLP:conf/eccv/2018-7, pp. 833–851. External Links: Link, Document Cited by: §1, §4.1, §4.1, §4.3, Table 7.
  • [9] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. See DBLP:conf/iccv/2017, pp. 764–773. External Links: Link, Document Cited by: §B.2, Figure 2, §2, §4.2.1, Table 4.
  • [10] H. Ding, X. Jiang, A. Q. Liu, N. M. Thalmann, and G. Wang (2019) Boundary-aware feature propagation for scene segmentation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: 3(b).
  • [11] H. Ding, X. Jiang, B. Shuai, A. Q. Liu, and G. Wang (2018) Context contrasted feature and gated multi-scale aggregation for scene segmentation. See DBLP:conf/cvpr/2018, pp. 2393–2402. External Links: Link, Document Cited by: 3(b).
  • [12] C. Dong, C. C. Loy, K. He, and X. Tang (2016) Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38 (2), pp. 295–307. External Links: Link, Document Cited by: §1.
  • [13] C. Dong, C. C. Loy, and X. Tang (2016) Accelerating the super-resolution convolutional neural network. See DBLP:conf/eccv/2016-2, pp. 391–407. External Links: Link, Document Cited by: §2.
  • [14] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman (2010) The pascal visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. External Links: Link, Document Cited by: §4.
  • [15] J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In 2019 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-21, 2019, pp. 1–8. Cited by: §4.1, §4.1, 3(b).
  • [16] H. Gao, H. Yuan, Z. Wang, and S. Ji (2019) Pixel transposed convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. External Links: Document, ISSN 0162-8828 Cited by: §2, §2.
  • [17] R. B. Girshick (2015) Fast R-CNN. See DBLP:conf/iccv/2015, pp. 1440–1448. External Links: Link, Document Cited by: §3.1.2.
  • [18] B. Hariharan, P. Arbelaez, L. D. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. See DBLP:conf/iccv/2011, pp. 991–998. External Links: Link, Document Cited by: §4.1.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. See DBLP:conf/cvpr/2016, pp. 770–778. External Links: Link, Document Cited by: §1, §4.1.
  • [20] X. He, Z. Mo, P. Wang, Y. Liu, M. Yang, and J. Cheng (2019) ODE-inspired network design for single image super-resolution. See DBLP:conf/cvpr/2019, pp. 1732–1741. External Links: Link Cited by: §B.1, §2, §4.2.1.
  • [21] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu (2015) Spatial transformer networks. See DBLP:conf/nips/2015, pp. 2017–2025. External Links: Link Cited by: §2.
  • [22] P. Jiang, F. Gu, Y. Wang, C. Tu, and B. Chen (2018) DifNet: semantic segmentation by diffusion networks. See DBLP:conf/nips/2018, pp. 1637–1646. External Links: Link Cited by: §2.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. See DBLP:conf/nips/2012, pp. 1106–1114. External Links: Link Cited by: §1.
  • [24] V. S. Lempitsky, A. Vedaldi, and A. Zisserman (2011) Pylon model for semantic segmentation. See DBLP:conf/nips/2011, pp. 1485–1493. External Links: Link Cited by: §1.
  • [25] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §B.1.
  • [26] G. Lin, A. Milan, C. Shen, and I. D. Reid (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. See DBLP:conf/cvpr/2017, pp. 5168–5177. External Links: Link, Document Cited by: §4.1.
  • [27] G. Lin, C. Shen, A. van den Hengel, and I. D. Reid (2016) Efficient piecewise training of deep structured models for semantic segmentation. See DBLP:conf/cvpr/2016, pp. 3194–3203. External Links: Link, Document Cited by: §1, §2, §4.2.1.
  • [28] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang (2015) Semantic image segmentation via deep parsing network. See DBLP:conf/iccv/2015, pp. 1377–1385. External Links: Link, Document Cited by: §2.
  • [29] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. See DBLP:conf/cvpr/2015, pp. 3431–3440. External Links: Link, Document Cited by: §1, §2.
  • [30] A. Makhzani and B. J. Frey (2015) Winner-take-all autoencoders. See DBLP:conf/nips/2015, pp. 2791–2799. External Links: Link Cited by: §2.
  • [31] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. L. Yuille (2014) The role of context for object detection and semantic segmentation in the wild. See DBLP:conf/cvpr/2014, pp. 891–898. External Links: Link, Document Cited by: §4.1, §4.
  • [32] H. Noh, S. Hong, and B. Han (2015) Learning deconvolution network for semantic segmentation. See DBLP:conf/iccv/2015, pp. 1520–1528. External Links: Link, Document Cited by: §2.
  • [33] A. Odena, V. Dumoulin, and C. Olah (2016) Deconvolution and checkerboard artifacts. Distill. External Links: Link, Document Cited by: §2.
  • [34] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: Appendix B.
  • [35] A. Radford, L. Metz, and S. Chintala (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. See DBLP:conf/iclr/2016, External Links: Link Cited by: §2.
  • [36] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. See DBLP:conf/nips/2015, pp. 91–99. External Links: Link Cited by: §1, §3.1.2, §3.3.
  • [37] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. See DBLP:conf/miccai/2015-3, pp. 234–241. External Links: Link, Document Cited by: §1, §2.
  • [38] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. See DBLP:conf/cvpr/2016, pp. 1874–1883. External Links: Link, Document Cited by: §B.1, Figure 1, §2, §2, §4.2.1, §4.2.1.
  • [39] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang (2019) High-resolution representations for labeling pixels and regions. CoRR abs/1904.04514. External Links: Link, 1904.04514 Cited by: 3(b), Table 5.
  • [40] Z. Tian, T. He, C. Shen, and Y. Yan (2019) Decoders matter for semantic segmentation: data-dependent decoding enables flexible feature aggregation. In 2019 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-21, 2019, pp. 1–8. Cited by: Figure 1, §2, §4.1, §4.1, §4.2.1, 3(b).
  • [41] A. Toshev and C. Szegedy (2014) DeepPose: human pose estimation via deep neural networks. See DBLP:conf/cvpr/2014, pp. 1653–1660. External Links: Link, Document Cited by: §1.
  • [42] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves (2016) Conditional image generation with pixelcnn decoders. See DBLP:conf/nips/2016, pp. 4790–4798. External Links: Link Cited by: §2.
  • [43] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel recurrent neural networks. See DBLP:conf/icml/2016, pp. 1747–1756. External Links: Link Cited by: §2.
  • [44] A. Vedaldi and K. Lenc (2015) MatConvNet: convolutional neural networks for MATLAB. See DBLP:conf/mm/2015, pp. 689–692. External Links: Link, Document Cited by: §2.
  • [45] R. Vemulapalli, O. Tuzel, M. Liu, and R. Chellappa (2016) Gaussian conditional random field network for semantic segmentation. See DBLP:conf/cvpr/2016, pp. 3224–3233. External Links: Link, Document Cited by: §1, §2.
  • [46] H. Wu, J. Zhang, K. Huang, K. Liang, and Y. Yu (2019) FastFCN: rethinking dilated convolution in the backbone for semantic segmentation. In 2019 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-21, 2019, pp. 1–8. Cited by: §4.1, §4.1, 3(b), Table 6.
  • [47] F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. See DBLP:conf/iclr/2016, External Links: Link Cited by: §1.
  • [48] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus (2010) Deconvolutional networks. See DBLP:conf/cvpr/2010, pp. 2528–2535. External Links: Link, Document Cited by: §2.
  • [49] M. D. Zeiler, G. W. Taylor, and R. Fergus (2011) Adaptive deconvolutional networks for mid and high level feature learning. See DBLP:conf/iccv/2011, pp. 2018–2025. External Links: Link, Document Cited by: §2.
  • [50] H. Zhang, K. J. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal (2018) Context encoding for semantic segmentation. See DBLP:conf/cvpr/2018, pp. 7151–7160. External Links: Link, Document Cited by: Table 8, Figure 9, §B.1, Appendix B, §1, §3, Figure 6, §4.1, §4.1, §4.1, §4.1, §4.2.1, Table 1, 3(a), 3(b), Table 4, Table 5, Table 6.
  • [51] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. See DBLP:conf/cvpr/2017, pp. 6230–6239. External Links: Link, Document Cited by: §3, Table 5, Table 6.
  • [52] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. See DBLP:conf/cvpr/2017, pp. 6230–6239. External Links: Link, Document Cited by: §1.
  • [53] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr (2015) Conditional random fields as recurrent neural networks. See DBLP:conf/iccv/2015, pp. 1529–1537. External Links: Link, Document Cited by: §2.
  • [54] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ADE20K dataset. See DBLP:conf/cvpr/2017, pp. 5122–5130. External Links: Link, Document Cited by: §4.1, §4.
  • [55] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets V2: more deformable, better results. See DBLP:conf/cvpr/2019, pp. 9308–9316. External Links: Link Cited by: Appendix B, §2.

Appendix A Run-time Speed

In this section, we employ frames per second (FPS) as the evaluation metric to measure the time efficiency of the proposed method. Compared with traditional bilinear upsampling, LaU consistently improves performance at the cost of negligible extra time complexity. We report the FPS averaged over 100 iterations. As shown in Table 8, LaU is competitive with the bilinear upsampling network in actual run-time speed.

Head         OS  Backbone    Upsampling  FPS
EncNet [50]   8  ResNet-50   Bilinear    18.60
EncNet [50]   8  ResNet-50   LaU         18.39
EncNet [50]  32  ResNet-50   Bilinear    70.80
EncNet [50]  32  ResNet-50   LaU         69.84
EncNet [50]   8  ResNet-101  Bilinear    11.66
EncNet [50]   8  ResNet-101  LaU         11.49
EncNet [50]  32  ResNet-101  Bilinear    47.37
EncNet [50]  32  ResNet-101  LaU         46.46
ASPP [7]      8  ResNet-50   Bilinear    15.93
ASPP [7]      8  ResNet-50   LaU         15.72
ASPP [7]      8  ResNet-101  Bilinear    10.48
ASPP [7]      8  ResNet-101  LaU         10.32
Table 8: Actual inference time on a GTX 1080Ti GPU with input size. “OS” refers to the output stride.
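A timing protocol like the one reported in Table 8 (FPS averaged over repeated forward passes) can be sketched as follows. The warm-up length and synchronization points are our assumptions, not the paper's exact benchmarking code.

```python
import time
import torch
import torch.nn as nn

@torch.no_grad()
def measure_fps(model: nn.Module, x: torch.Tensor,
                iters: int = 100, warmup: int = 10) -> float:
    """Average frames per second over `iters` forward passes."""
    model.eval()
    for _ in range(warmup):           # exclude one-time startup costs
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()      # wait for pending GPU kernels
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if x.is_cuda:
        torch.cuda.synchronize()      # ensure all iterations finished
    elapsed = time.perf_counter() - start
    return x.shape[0] * iters / elapsed
```

Synchronizing before starting and stopping the timer matters on GPU, since CUDA kernels are launched asynchronously and wall-clock timing would otherwise undercount the actual work.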

Appendix B Implementation Details

Our experiments are based on the PyTorch framework [34] and the pre-trained models provided by EncNet [50]. We use the official open-source code [55] to conduct the deformable convolution experiments. In this section, we employ ResNet-50 as the backbone without multi-scale testing. The results are reported on the PASCAL Context val set. For a fair comparison, we follow the same settings as in our main paper.

Figure 8: PixelShuffle segmentation results. (a) Image; (b) ASPP Head; (c) EncNet Head. Figures (b) and (c) show the checkerboard artifacts. Although the last convolution layer may smooth the outputs, the results still exhibit a checkerboard pattern of artifacts.

B.1 PixelShuffle

We assume that the number of classes is 59. Following the popular setting in super-resolution [25, 38, 20], the architecture of the PixelShuffle-based upsampling is as follows:

where the convolution layer has 236 filters.

The intuitive idea is that, in the worst case, PixelShuffle should degrade into nearest-neighbor upsampling (the same feature map and weights for each shuffle group). However, the experimental results in Table 3a demonstrate that PixelShuffle damages the performance. In this subsection, we show segmentation results to reveal the checkerboard problem in PixelShuffle. As illustrated in Figure 8, both the EncNet [50] head and the ASPP [7] head suffer from checkerboard artifacts.
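The PixelShuffle head described above can be sketched as follows: a 3x3 convolution expands the channels to `num_classes * r^2` (236 for 59 classes and r = 2), then `nn.PixelShuffle` rearranges them into an r-times larger map. The function name and kernel size are assumptions, not the paper's exact code.

```python
import torch
import torch.nn as nn

def pixelshuffle_head(in_channels: int, num_classes: int = 59,
                      scale: int = 2) -> nn.Sequential:
    """Sketch of a PixelShuffle-based upsampling head."""
    return nn.Sequential(
        # 236 filters for 59 classes and scale 2 (59 * 2**2)
        nn.Conv2d(in_channels, num_classes * scale ** 2,
                  kernel_size=3, padding=1),
        # (N, 236, H, W) -> (N, 59, 2H, 2W)
        nn.PixelShuffle(scale),
    )
```

If every shuffle group learned identical weights, this would reduce to nearest-neighbor upsampling; in practice the independently learned groups diverge, which is one way to understand the checkerboard artifacts in Figure 8.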

Figure 9: Visualization results using deformable bilinear upsampling with EncNet-ResNet50 [50]. (a) Image; (b) GT; (c) Deformable.

B.2 Deformable Convolution

We replace the traditional bilinear upsampling in encoder-decoder networks with a “Deformable Convolution [9] + Bilinear” module. Since the reversed “Bilinear + Deformable Convolution” order would greatly increase the extra computing cost, we only conduct experiments under the first setting.

Although deformable convolution has been shown to improve segmentation when properly used in feature extraction, we show that “deformable upsampling” is probably not appropriate for the final prediction. As illustrated in Figure 9, deformable bilinear upsampling tends to focus on the main body of objects yet misses the details, which is consistent with the design of deformable convolutional networks.

Appendix C Visualization Results

In this section, we show visual results on the ADE20K dataset. For qualitative analysis, we compare the outputs generated by LaU and bilinear upsampling in Figure 10. Since the offset-guided loss and the regression loss achieve similar mIoU and pixel accuracy, we visualize the segmentation results on the val set in Figure 11 for a qualitative evaluation. The regression loss performs slightly better than the offset-guided loss, which is consistent with the mIoU results.

Figure 10: Visual improvements on ADE20K with the ResNet-101 backbone. (a) Image; (b) GT; (c) EncNet; (d) LaU. “EncNet” uses the standard bilinear upsampling; “LaU” refers to our location-aware upsampling.
Figure 11: Visual comparison on the ADE20K val set. (a) Image; (b) GT; (c) offset-guided loss; (d) regression loss.