Location-aware Upsampling for Semantic Segmentation
Many successful learning targets such as dice loss and cross-entropy loss have enabled unprecedented breakthroughs in segmentation tasks. Beyond semantic supervision, this paper aims to introduce location prediction into semantic segmentation from a new viewpoint: let pixels determine their own coordinates. Based on this idea, we present a Location-aware Upsampling (LaU) that adaptively refines the interpolating coordinates with trainable offsets. Then, location-aware losses are established by encouraging pixels to move towards well-classified locations. An LaU is offset prediction coupled with interpolation, trained end-to-end to generate confidence scores at each position from coarse to fine. Guided by location-aware losses, the new module can replace its plain counterpart (e.g., bilinear upsampling) in a plug-and-play manner to further boost the leading encoder-decoder approaches. Extensive experiments validate the consistent improvement over state-of-the-art methods on benchmark datasets.
Recent advances in deep learning have empowered a wide range of applications [23, 36, 19, 41, 12], including semantic segmentation [52, 47, 29, 6]. The pioneering Fully Convolutional Network (FCN)  with deconvolutional operations  dramatically surpasses classical methods relying on hand-crafted features. Following FCNs, a series of CNN-based methods further enhance the state-of-the-art by introducing dilated (atrous) convolutions [47, 7], shortcut connections  and CRFs [45, 2, 27]. Latest semantic segmentation methods [50, 8] exploit fully trainable encoder-decoder architectures with simple bilinear upsamplings to extract high-resolution predictions.
Although recent works have exhibited state-of-the-art performance, we suggest that the location information hidden in the label map has been overlooked. To introduce location prediction into segmentation models, we revisit the idea of semantic segmentation: the task of assigning each pixel of a photograph to a semantic class label , then decompose this target into two relatively simple sub-tasks: pixel-level localization and classification. Instead of following previous methods to simply predict class labels, we propose to predict locations simultaneously, more precisely, to learn offsets with additional supervision.
This fresh viewpoint leads directly to the Location-aware Upsampling (LaU), which upsamples classification results while accounting for the effect of offsets. Beyond that, LaU addresses two key challenges in semantic segmentation: First, CNN-based segmentation methods extract high-level semantic features at the cost of losing detailed pixel-level spatial information. Second, the feature distribution of objects is imbalanced: interior points enjoy rich knowledge after high-ratio upsampling, yet pixels around edges suffer from indiscriminative features due to bilinear interpolation. A natural solution is to borrow the information from interior points to guide the categorization near the boundary, which exactly corresponds to our two sub-tasks: learning offsets and utilizing the information at the new position to make a prediction.
Our observation is that feature maps used for pixel-level classification, e.g., the second-to-last layer in decoders, can also be used for predicting locations. On top of these features, we construct LaUs by adding an offset prediction branch to produce offsets at each position in the high-resolution feature map (details in section 3.3). To facilitate the offset prediction of interior points and improve convergence, we first define the candidate locations and then use predicted offsets to regress fine-grained coordinates.
The proposed offset prediction makes it possible to further introduce location supervision implicitly/explicitly into the training process. In this paper, we present two location-aware losses that force the offsets to predict fine-grained coordinates with the correct class. Moreover, our location-aware losses avoid human labeling and produce “labels” online. In the experiments, we empirically show that the location-aware mechanism outperforms strong baseline methods, and that the rich knowledge in the design of location regression can be applied to the location-aware upsampling. We believe dense prediction tasks can widely benefit from the location-aware module, including but not limited to semantic segmentation.
2 Related Work
Feature map upsampling has been widely used in deep models, such as encoder-decoder architectures [32, 37], generative models [35, 30], super-resolution tasks [13, 38, 20] and semantic segmentation [29, 6]. Deconvolution operations [48, 49], also known as transposed convolutions, first bridge the gap between low-resolution feature maps and desired full-resolution predictions. Despite their successes, recent works [33, 16] show that deconvolutions can easily suffer from the checkerboard problem when the kernel size is not divisible by the stride.
Advances like PixelShuffle and DUpsampling turn to hand-crafted pixel rearrangement to reconstruct high-resolution features. However, this empirical design results in an inconsistent spatial relationship among adjacent pixels on the high-resolution feature map (as illustrated in Figure 1, pixels connected by blue lines are generated by different filters, yet the input of each filter corresponds to the same receptive field; both the filter and the receptive field differ for pixels connected by red lines). More recently, PixelDCL generates intermediate feature maps sequentially so that later stages depend on previously generated ones, similar to pixel-recurrent neural networks (PixelRNNs) and PixelCNNs. Yet pixel-wise sequential prediction is time-consuming, and there is no explicit temporal relation between adjacent pixels, i.e., previously generated pixels can still rely on later produced ones, which leads to a chicken-and-egg problem.
Conditional Random Fields (CRFs) form another group of methods, primarily used as a disjoint post-processing step. Follow-up works [27, 45, 28] further integrate CRFs into networks to model the semantic relationships between objects. To limit the computational complexity introduced by this integration, [3, 4] propose a Gaussian conditional random field that avoids iterative optimization, though it still suffers from heavy gradient computation. Similar to CRFs, [2, 22] focus on refining the output label map by random walks. However, the dense matrix inversion term and the two-branch model with cascaded random walks cannot easily be merged into leading encoder-decoder architectures.
Spatial Transformer Networks (STN)  is the first work to introduce spatial transformation capacity into neural networks, which enables global parametric transformation like 2D affine transformation. Since the pointwise location sampling in STN is weakly coupled with the input feature map at each position (only through global transformation parameters produced by the preceding feature maps), this greatly limits its capacity in pixel-wise predictions such as semantic segmentation. The module we present in this paper can be seen as a new transformation in a local and dense manner. The upsampled feature map can still enjoy the detailed information of low-resolution inputs by using pixel-wise offsets.
Deformable convolutional networks (DCN) [9, 55] achieve strong performance on object detection and semantic segmentation with trainable offsets. However, they are commonly used for feature extraction, and we empirically show that replacing the last bilinear upsampling with “DCN+Bilinear” damages the performance (details in Table 9 and appendix section B.2, Figure 9). As illustrated in Figure 2, DCN aims at augmenting the spatial sampling grid of filters, whereas our method directly refines the sampling coordinates with location-guided losses, which contributes to higher performance.
Data-independent upsampling such as bilinear interpolation is one of the basic techniques in computer vision and image processing. Despite its high efficiency, this module can be overly simple (illustrated in Figure 3) and fail to capture semantic information. In this section, we alleviate the existing problems in bilinear upsampling by introducing the location-aware mechanism.
Learning data-dependent offsets is the key step in LaU. This idea is based on the fact that adjacent boundary points on the high-resolution label map can degrade into the same pixel in the low-resolution feature map, since upsampling is commonly used in popular methods [7, 51, 50]. Instead of sampling from the local area, we propose to use the predicted offsets, generating the output map from the location-refined coordinates. We describe the formulation of a location-aware upsampling in the following sections.
3.1 Spatial Upsampling
To estimate the values in the output feature map, a sampler applies the sampling kernel to a particular grid in the input feature map. Each coordinate defines the spatial location in the output and determines the kernel function at a particular pixel in each channel. Formally,
where the integer coordinates index the output feature map, and the parameters of the generic sampling function can be preset or data-dependent. Note that existing upsamplings such as bilinear and PixelShuffle apply the same kernel to every channel of the input (i.e., data-independent), while LaU enables free changes of the kernel at each position. That is, the sampling kernel in LaU changes with the input data and coordinates.
In theory, is a (sub-)differentiable function of the coordinates and , along with the parameters , which can be in any form. For example, corresponds to the bilinear interpolation. Here, we show that PixelShuffle is another data-independent case, which may lead to sub-optimal results.
PixelShuffle can also be reformulated in the form of Eq.(1). Specifically, in PixelShuffle upsampling is
where the kernel is the Kronecker delta function. Note that the kernel changes along with the channel, yet remains fixed with respect to the input data. From this view, the sampling kernel equates to data-independent methods like bilinear interpolation. Furthermore, the output of the Kronecker delta is either 0 or 1, which partly pinpoints the root of the missing spatial relationships. That is, the mapping generated by PixelShuffle can be considered a bijection. There is no direct relationship between adjacent elements in its codomain (output feature map), unless the source elements in its domain (input feature map) are strongly correlated. For example, neighboring pixels connected by red lines in Figure 1 belong to different receptive fields and are generated by different filters; they are likely to look different (i.e., uncorrelated) unless the filter weights are similar or the same. Therefore, the network has to consider the high-resolution spatial relationships when producing low-resolution intermediate feature maps.
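To make the data-independence concrete, PixelShuffle can be written as a fixed index rearrangement: every output pixel copies exactly one input element (the Kronecker delta), regardless of the input values. A minimal NumPy sketch of this mapping (ours, matching the standard channel-to-space layout):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) -> (C, H*r, W*r).
    out[c, y, x] = in[c*r*r + (y % r)*r + (x % r), y // r, x // r],
    i.e., a fixed bijection that never depends on the input data."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)     # split channels into sub-pixel offsets (i, j)
    x = x.transpose(0, 3, 1, 4, 2)   # interleave: (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

# Each output value is copied verbatim from one input channel; adjacent outputs
# come from different filters, hence the inconsistent spatial relationship.
x = np.arange(4 * 2 * 2).reshape(4, 2, 2)
y = pixel_shuffle(x, 2)
```

Since the mapping is a pure permutation, adjacent high-resolution pixels are correlated only if the low-resolution filters that produced them happen to agree.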
3.1.2 Location-aware Upsampling
Mathematically, bilinear upsampling is a non-injective surjective function (not a bijection), denoted as . Neighboring points in sampled from the same grid in are strongly correlated, since they are generated from the same elements. Compared with PixelShuffle, this scheme avoids checkerboard artifacts. As illustrated in Figure 2, location-aware upsampling is a simple yet effective improvement to the bilinear interpolation with the following sampling kernel
where is the upsampling ratio. Inspired by the anchor-based methods in object detection [17, 36], we introduce as a candidate location then regress the detailed location with predicted offsets. To encourage the network to jump out of the local grid, we add no constraint on .
To learn the offsets in the upsampling module, we utilize the gradients with respect to the offsets. This allows backpropagation of the loss function through the (sub-)differentiable location-aware upsampling module. For upsampling kernel (2), the partial derivatives are
With the gradients to offsets, we can further define the offset prediction branch to allow the end-to-end training.
3.2 Location-aware Loss
To fully exploit the effect of predicted offsets , we introduce (semi-)supervised location-aware losses into the training process. Since offsets could point to any input location, as long as it contains the correct category, we first consider the offset-guided loss without hard constraints. Inspired by the bounding box regression, we further propose the offset regression loss to guide the offset prediction directly.
3.2.1 Offset-guided Loss
In the common training process of semantic segmentation, the network is optimized by a per-pixel cross-entropy loss. Instead of learning from isolated per-pixel supervision, we consider the loss pair generated by bilinear upsampling and LaU (shown in Figure 4a). Formally, we introduce the auxiliary loss (with no gradient) and the target loss,
where is the input image, is the model parameters and refers to the predicted offsets. The per-pixel offset-guided loss
with a loss weight that forces the network to “take a step”, since the trivial solution (i.e., zero offsets) will be punished by the extra cost. Besides, the loss demonstrates that the moved pixel should have a smaller cross-entropy loss (i.e., a higher confidence score) than the original point. Therefore, (3) encourages the pixel to move towards a better position, which potentially contains the correct class.
3.2.2 Regression Loss
Since there is no ground-truth label for the offsets, we could intuitively search for the closest well-classified point among the neighbouring pixels to generate the offsets. However, brute-force search is time-consuming for online label generation. Hence, we empirically constrain the search space to the integral coordinate points (shown in Figure 4b).
Following the previous subsection, we further introduce the auxiliary losses, e.g., with the sampling kernels
where is the “indicator function”, that is and . Then, we define the loss set
and the coordinate set
Combining the loss set and the coordinate set, we obtain the optimal candidate coordinates
where the summation stops at . Given (illustrated in Figure 4c), we formulate the per-pixel offset regression loss as
where controls the strength of the regression loss. In our experiments, we set to 0.1 without fine-tuning. Note that the auxiliary losses will be excluded from the backward propagation (with no gradient) and omitted at inference time.
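The online “label” generation above can be sketched as a search over the four surrounding integer coordinates for a correctly classified point, followed by an L1 regression of the predicted offsets toward it. This is our simplification (tie-breaking and the exact loss form may differ in the paper):

```python
import numpy as np

def offset_target(prob, gt_class, ys, xs):
    """Among the four integer corners around the candidate coordinate
    (ys, xs), pick the closest one whose predicted class matches gt_class;
    return the offsets to it, or None if no corner is well-classified."""
    C, H, W = prob.shape
    y0, x0 = int(np.floor(ys)), int(np.floor(xs))
    best = None
    for yy in (y0, min(y0 + 1, H - 1)):
        for xx in (x0, min(x0 + 1, W - 1)):
            if prob[:, yy, xx].argmax() == gt_class:
                d = (yy - ys) ** 2 + (xx - xs) ** 2
                if best is None or d < best[0]:
                    best = (d, yy - ys, xx - xs)
    return None if best is None else (best[1], best[2])

def regression_loss(dy, dx, target, beta=0.1):
    """L1 regression toward the searched offsets, weighted by beta
    (0.1 in the paper); skipped when no valid candidate exists."""
    if target is None:
        return 0.0
    ty, tx = target
    return beta * (abs(dy - ty) + abs(dx - tx))
```

As in the paper, the search only runs during training to produce targets; at inference time the auxiliary computation is omitted entirely.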
3.3 Location-aware Network
The combination of predicted offsets and an interpolation operation forms a location-aware upsampling (Figure 5). Note that the offset prediction branch can take any form in practice, such as a fully-connected network like the RPN in Faster R-CNN. In this paper, we use “Conv+LReLU+Conv+PixelShuffle” to illustrate the basic framework of the location-aware mechanism in semantic segmentation.
The first convolutional layer reduces the channel number of the input feature map. Then, the last convolutional layer produces the offset tensor. The per-channel multiplicity can be set to the number of classes to allow per-element shifts. However, the extra computing cost becomes unacceptable if it is too large. When we set it to 1, the output feature map shares offsets across different channels. Compared with the per-class setting, this saves over  G FLOPs on the ADE20K dataset.
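A possible PyTorch sketch of this branch; the channel widths and kernel sizes here are our assumptions, only the “Conv+LReLU+Conv+PixelShuffle” structure follows the text:

```python
import torch
import torch.nn as nn

class OffsetBranch(nn.Module):
    """Sketch of the offset prediction branch. With k=1 the two offset
    channels (dy, dx) are shared across all classes; with k=num_classes
    each class gets its own offsets at a higher FLOP cost."""
    def __init__(self, in_ch, mid_ch, r, k=1):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1),          # reduce channels
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(mid_ch, 2 * k * r * r, 3, padding=1),  # offsets per sub-pixel
            nn.PixelShuffle(r),                               # low-res -> high-res
        )

    def forward(self, x):
        # x: (B, in_ch, H, W) -> offsets: (B, 2*k, H*r, W*r)
        return self.branch(x)
```

The branch consumes the same decoder features as the classifier, so the localization sub-task adds only a small head on top of the existing network.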
An encoder-decoder architecture with LaU tends to first predict the general location and a rough boundary, then reconstruct the detailed information using predicted offsets. As shown in Figure 7, this two-step scheme decomposes the original problem into two sub-tasks: pixel-level localization and classification, which correspond to the two branches in the location-aware network.
We evaluate the proposed method on the PASCAL VOC 2012 semantic segmentation benchmark , PASCAL Context  and ADE20K  datasets. The standard evaluation metrics are pixel accuracy (pixAcc) and mean Intersection of Union (mIoU). In this section, we first introduce datasets used in this paper as well as the implementation details. We then apply LaU to the top of networks and the upsampler in decoders to show the effectiveness of the location-aware mechanism. Moreover, the visualization results demonstrate the superiority of optimizing the proposed location-aware segmentation (shown in Figure 6).
4.1 Experimental Settings
The PASCAL Context dataset provides pixel-wise semantic annotations for whole scenes, containing 4998 images for training and 5105 images for testing. Following [26, 15, 46, 50], we use the 59 most frequent object categories plus background (60 classes in total) as the ground-truth labels. PASCAL VOC 2012 consists of 1464 (train), 1449 (val) and 1456 (test) pixel-level annotated images, involving 20 foreground object classes and one background class. Following previous works [8, 18], we augment the dataset with extra annotations, resulting in 10582 (trainaug) training images. The mIoU is averaged across the 21 classes. The ADE20K scene parsing benchmark consists of 150 categories, with 20K images for training, 2K for validation and 3K for testing. The results on the test set are provided by the ADE20K online evaluation server.
For the experiments on PASCAL Context, we adopt the same settings as described in [50, 46]. The learning rate starts at 0.001 and gradually decreases to 0 using the “poly” strategy (power=0.9). The networks are trained for 80 epochs with SGD; the momentum is set to 0.9 and the weight decay to 0.0001. To encourage the LaU to “look wider”, we set the weight decay of the convolutional layers in LaU to 0. For data augmentation, we randomly scale (from 0.5 to 2.0) and horizontally flip the input images, then apply a random crop. All the experiments are conducted on 4 GPUs with batch size 16. Since ImageNet pre-trained backbones can significantly contribute to higher performance, we follow the practice of [50, 46, 15, 40] and use the standard ResNet-50 and ResNet-101.
For training on PASCAL VOC, we follow the settings in [40, 8]. The initial learning rate is set to 0.007 with the “poly” policy, and the weight decay is 0.0001. The total number of iterations for the ResNet-50 experiments uses batch size 48, while additional iterations are required for training ResNet-101 with an initial learning rate of 0.001. The scaling-with-flipping data augmentation and LaU settings remain the same as for PASCAL Context.
For the experiments on ADE20K, as described in EncNet , we train our model on the training set for 180 epochs with learning rate 0.004 and evaluate it on the validation set using multi-scale testing. We then fine-tune our models on the train-val set for another 30 epochs with learning rate and submit results to the test server. We use a base size with crop for training. The backbone networks are the same as previous experiments.
Ablation Study for LaU
To evaluate LaU, the key parameter is the upsampling factor. As listed in Table 1, upsampling works better than other settings with negligible additional time and space complexity. With proposed LaU, the results consistently outperform standard bilinear upsampling and the best setting exceeds baseline by +1.2 mIoU with only 0.16M extra parameters. Since loss weight of the original auxiliary loss in EncNet is 0.2, in this section, we intuitively set to 0.3 to enlarge the gradient of location-aware loss.
“MS” means multi-scale testing.
Ablation Study for Location-aware Loss
We experiment with loss weights between 0 and 0.4 and show the results in Table 2. Note that LaU without additional supervision still produces a performance gain, and a moderate weight yields the best performance. Both the location-aware upsampling and the location-aware loss contribute to higher performance than the bilinear baseline (similar results are reported in Table 9). Due to limited computing resources, we directly apply the best settings to the regression loss experiments and empirically set the weight to 0.1 without grid search.
4.2 Applying to Label Prediction
In this section, we apply LaU to the top of the decoders (as shown in Figure 5), which corresponds to the prediction of label maps.
4.2.1 Quantitative Analysis
To estimate the generalization ability of LaU, we conduct experiments on two popular methods Encoding  and ASPP  with different output strides. Table 3(a) shows that our methods consistently outperform the original bilinear methods. Following the setting in super-resolution tasks [27, 20, 38], one may decompose a single upsampling into three cascaded upsamplings. To simplify the experimental settings, we leave this scheme as future work.
Compared with PixelShuffle (Table 3(a)) and deformable bilinear upsampling (Table 9), LaU notably outperforms both methods with and without the location-aware loss. Note that PixelShuffle and the deformable variant damage the performance; we show the visualization results and give a brief discussion in appendix section 2.
Table 3(b) further shows that single-scale LaUs surpass the recent upsampling method DUpsampling even with its strong Xception-71 backbone (e.g., the ImageNet Top-1 accuracy of Xception-65 is higher than that of ResNet-101). Our multi-scale result improves the mIoU over EncNet, and LaU remains compatible with attention mechanisms and stronger encoder-decoder architectures to further enhance the leading performance.
As shown in Table 5, our methods achieve similar results on the ADE20K dataset. With a ResNet-50 backbone, LaUs outperform the EncNet baseline in mIoU. The regression loss seems to perform worse than the offset-guided loss. We attribute this to the hand-crafted design of the optimal candidate coordinates and the selection of the loss weight. The current search space contains only four points, which can be suboptimal. We will explore a larger search space in future work.
Although our methods achieve slightly worse performance than EncNet on val set, LaUs notably surpass EncNet and all entries in COCO-Place challenge 2017 on the final test score (see results in Table 6).
4.2.2 Qualitative Analysis
We show the PASCAL Context visual results in Figure 6. LaU recovers the parts missed by EncNet and corrects its misclassifications. More examples on ADE20K and the visual comparison between the offset-guided loss and the regression loss are shown in the appendix (Figure 11). To further illustrate the effect of the location-aware mechanism, we compare the segmentation results before and after LaU in Figure 7. LaU first produces a coarse segmentation of objects with rough boundaries, then refines the boundaries using offsets. The whole segmentation process of LaU goes from coarse to fine.
4.3 Applying to Decoder
Recent state-of-the-art methods [7, 8] utilize upsampling modules in decoders to combine high-resolution features with low-resolution ones. To show the general applicability of the proposed method, we replace the bilinear upsampling in DeepLabv3 with LaU. The performance of ResNet-50+LaU (mIoU in Table 7) approaches that of the original ResNet-101 backbone. Even with the strong ResNet-101 baseline, LaU can still boost the performance by one point.
In this paper, we provide a new perspective for optimizing the semantic segmentation problem. We decompose the original pixel-level classification problem into offset prediction and classification, which introduces the idea of location prediction into semantic segmentation. Based on this fresh viewpoint, we propose the location-aware upsampling and location-aware losses. Our models achieve promising performance compared with various baseline methods. The effectiveness of learning offsets is also verified through qualitative and quantitative results.
- (2017) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), pp. 2481–2495.
- (2017) Convolutional random walk networks for semantic image segmentation. CVPR, pp. 6137–6145.
- (2016) Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. ECCV, pp. 402–418.
- (2017) Dense and low-rank Gaussian CRFs using deep embeddings. ICCV, pp. 5113–5122.
- (2015) Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICLR.
- (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), pp. 834–848.
- (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587.
- (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. ECCV, pp. 833–851.
- (2017) Deformable convolutional networks. ICCV, pp. 764–773.
- (2019) Boundary-aware feature propagation for scene segmentation. ICCV.
- (2018) Context contrasted feature and gated multi-scale aggregation for scene segmentation. CVPR, pp. 2393–2402.
- (2016) Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), pp. 295–307.
- (2016) Accelerating the super-resolution convolutional neural network. ECCV, pp. 391–407.
- (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88(2), pp. 303–338.
- (2019) Dual attention network for scene segmentation. CVPR.
- (2019) Pixel transposed convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell.
- (2015) Fast R-CNN. ICCV, pp. 1440–1448.
- (2011) Semantic contours from inverse detectors. ICCV, pp. 991–998.
- (2016) Deep residual learning for image recognition. CVPR, pp. 770–778.
- (2019) ODE-inspired network design for single image super-resolution. CVPR, pp. 1732–1741.
- (2015) Spatial transformer networks. NeurIPS, pp. 2017–2025.
- (2018) DifNet: semantic segmentation by diffusion networks. NeurIPS, pp. 1637–1646.
- (2012) ImageNet classification with deep convolutional neural networks. NeurIPS, pp. 1106–1114.
- (2011) Pylon model for semantic segmentation. NeurIPS, pp. 1485–1493.
- (2017) Enhanced deep residual networks for single image super-resolution. CVPR Workshops.
- (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. CVPR, pp. 5168–5177.
- (2016) Efficient piecewise training of deep structured models for semantic segmentation. CVPR, pp. 3194–3203.
- (2015) Semantic image segmentation via deep parsing network. ICCV, pp. 1377–1385.
- (2015) Fully convolutional networks for semantic segmentation. CVPR, pp. 3431–3440.
- (2015) Winner-take-all autoencoders. NeurIPS, pp. 2791–2799.
- (2014) The role of context for object detection and semantic segmentation in the wild. CVPR, pp. 891–898.
- (2015) Learning deconvolution network for semantic segmentation. ICCV, pp. 1520–1528.
- (2016) Deconvolution and checkerboard artifacts. Distill.
- (2017) Automatic differentiation in PyTorch. NIPS Autodiff Workshop.
- (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. NeurIPS, pp. 91–99.
- (2015) U-Net: convolutional networks for biomedical image segmentation. MICCAI, pp. 234–241.
- (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. CVPR, pp. 1874–1883.
- (2019) High-resolution representations for labeling pixels and regions. CoRR abs/1904.04514.
- (2019) Decoders matter for semantic segmentation: data-dependent decoding enables flexible feature aggregation. CVPR.
- (2014) DeepPose: human pose estimation via deep neural networks. CVPR, pp. 1653–1660.
- (2016) Conditional image generation with PixelCNN decoders. NeurIPS, pp. 4790–4798.
- (2016) Pixel recurrent neural networks. ICML, pp. 1747–1756.
- (2015) MatConvNet: convolutional neural networks for MATLAB. ACM Multimedia, pp. 689–692.
- (2016) Gaussian conditional random field network for semantic segmentation. CVPR, pp. 3224–3233.
- (2019) FastFCN: rethinking dilated convolution in the backbone for semantic segmentation. CVPR.
- (2016) Multi-scale context aggregation by dilated convolutions. ICLR.
- (2010) Deconvolutional networks. CVPR, pp. 2528–2535.
- (2011) Adaptive deconvolutional networks for mid and high level feature learning. ICCV, pp. 2018–2025.
- (2018) Context encoding for semantic segmentation. CVPR, pp. 7151–7160.
- (2017) Pyramid scene parsing network. CVPR, pp. 6230–6239.
- (2015) Conditional random fields as recurrent neural networks. ICCV, pp. 1529–1537.
- (2017) Scene parsing through ADE20K dataset. CVPR, pp. 5122–5130.
- (2019) Deformable ConvNets V2: more deformable, better results. CVPR, pp. 9308–9316.
Appendix A Run-time Speed
In this section, we employ frames per second (FPS) as the evaluation metric to show the time efficiency of the proposed method. Compared with traditional bilinear upsampling, LaU consistently improves the performance at the cost of negligible extra time complexity. We report the FPS averaged over 100 iterations. As shown in Table 8, LaU is competitive with the bilinear upsampling network in actual run-time speed.
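The averaging protocol can be sketched as follows (our helper; the paper does not give its exact measurement code, and `run_forward` stands for any single-inference callable):

```python
import time

def measure_fps(run_forward, iters=100, warmup=10):
    """Average FPS over `iters` forward passes after a short warm-up
    (warm-up absorbs lazy initialization and cache effects)."""
    for _ in range(warmup):
        run_forward()
    start = time.perf_counter()
    for _ in range(iters):
        run_forward()
    return iters / (time.perf_counter() - start)
```

On GPU, one would additionally synchronize the device before reading the clock so that asynchronous kernel launches are fully accounted for.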
Appendix B Implementation Details
Our experiments are based on the PyTorch framework and pre-trained models provided by EncNet. We use the official open source code to conduct the deformable convolution experiments. In this section, we employ ResNet-50 as the backbone without multi-scale testing. The results are reported on the PASCAL Context val set. For a fair comparison, we follow the same settings as in the main paper.
where refers to the convolution layer with 236 filters.
The intuitive idea is that, in the worst case, PixelShuffle should degrade into nearest-neighbor upsampling (same feature map and weights for each shuffle group). However, the experimental results in Table 3a demonstrate that PixelShuffle damages the performance. In this subsection, we show segmentation results that reveal the checkerboard problem in PixelShuffle. As illustrated in Figure 8, both the EncNet head and the ASPP head suffer from checkerboard artifacts.
B.2 Deformable Convolution
We replace the traditional bilinear upsampling in encoder-decoder networks with a “Deformable Convolution + Bilinear” module. Since “Bilinear + Deformable Convolution” would hugely increase the extra computing cost, we only conduct experiments under the first setting.
Although deformable convolution has been shown to improve segmentation when properly used in feature extraction, we show that this “deformable upsampling” is probably not appropriate for the final prediction. As illustrated in Figure 9, deformable bilinear upsampling tends to focus on the main body of objects yet misses details, which is consistent with the design of deformable convolutional networks.
Appendix C Visualization Results
In this section, we show the visual results from ADE20K dataset. For the qualitative analysis, we compare the outputs generated by LaU and bilinear upsampling, shown in Figure 10. Since offset-guided loss and regression loss achieve similar mIoU and pixel accuracy, we visualize the segmentation results on val set in Figure 11 and make a qualitative evaluation. Regression loss performs slightly better than offset-guided loss, which is consistent with the mIoU results.