Dual Graph Convolutional Network for Semantic Segmentation

Abstract

Exploiting long-range contextual information is key for pixel-wise prediction tasks such as semantic segmentation. In contrast to previous work that uses multi-scale feature fusion or dilated convolutions, we propose a novel graph-convolutional network (GCN) to address this problem. Our Dual Graph Convolutional Network (DGCNet) models the global context of the input feature by modelling two orthogonal graphs in a single framework. The first component models spatial relationships between pixels in the image, whilst the second models interdependencies along the channel dimensions of the network’s feature map. This is done efficiently by projecting the feature into a new, lower-dimensional space where all pairwise interactions can be modelled, before reprojecting into the original space. Our simple method provides substantial benefits over a strong baseline and achieves state-of-the-art results on both Cityscapes (82.0% mean IoU) and Pascal Context (53.7% mean IoU) datasets.

Li Zhang* (lz@robots.ox.ac.uk)¹, Xiangtai Li* (lxtpku@pku.edu.cn)², Anurag Arnab (aarnab@robots.ox.ac.uk)¹, Kuiyuan Yang (kuiyuanyang@deepmotion.ai)³, Yunhai Tong (yhtong@pku.edu.cn)², Philip H.S. Torr (phst@robots.ox.ac.uk)¹

¹ Torr Vision Group, Department of Engineering Science, University of Oxford
² Key Laboratory of Machine Perception, School of EECS, Peking University
³ DeepMotion AI Research

1 Introduction

Semantic segmentation is a fundamental problem in computer vision, and aims to assign an object class label to each pixel in an image. It has numerous applications including autonomous driving, augmented- and virtual reality and medical diagnosis.

An inherent challenge in semantic segmentation is that pixels are difficult to classify when considered in isolation, as local image evidence is ambiguous and noisy. Therefore, segmentation systems must be able to effectively capture contextual information in order to reason about occlusions, small objects and model object co-occurrences in a scene.

Current state-of-the-art methods are all based on deep learning using fully convolutional networks (FCNs) [Long et al.(2015)Long, Shelhamer, and Darrell]. However, the receptive field of an FCN grows slowly (only linearly) with increasing depth in the network, and its limited receptive field is not able to capture longer-range relationships between pixels in an image. Dilated convolutions [Chen et al.(2015)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Yu and Koltun(2016)] have been proposed to remedy this. However, the resulting feature representation is dominated by large objects in the image, and consequently, performance on small objects is poor. Another direction has been to fuse multiscale features within the network [Zhao et al.(2017)Zhao, Shi, Qi, Wang, and Jia, Liu et al.(2015)Liu, Rabinovich, and Berg] or to use LSTMs to propagate information spatially [Byeon et al.(2015)Byeon, Breuel, Raue, and Liwicki, Shuai et al.(2018)Shuai, Zuo, Wang, and Wang]. Recently, several methods based on self-attention [Wang et al.(2018b)Wang, Girshick, Gupta, and He, Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu, Yuan and Wang(2018), Li et al.(2019)Li, Zhang, You, Yang, Yang, and Tong] have also been used to learn an affinity map at each spatial position that propagates information to its neighbours. However, the memory requirements of the large affinity matrix render these methods unsuitable for high-resolution imagery (such as the Cityscapes dataset [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele]).

Figure 1: Our proposed DGCNet exploits contextual information across the whole image by proposing a graph convolutional network to efficiently propagate information along both the spatial and channel dimensions of a convolutional feature map.

In this paper, we use graph-convolutional networks (GCNs) [Kipf and Welling(2017)] to effectively and efficiently model contextual information for semantic segmentation. GCNs have recently been applied to scene understanding tasks [Li and Gupta(2018), Chen et al.(2018b)Chen, Rohrbach, Yan, Yan, Feng, and Kalantidis, Liang et al.(2018)Liang, Hu, Zhang, Lin, and Xing, Zhang et al.(2019)Zhang, Xu, Arnab, and Torr], as they are able to globally propagate information through the whole image in a manner that is conditional on the input. This provides greater representational power than methods based on Conditional Random Fields [Zheng et al.(2015)Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, and Torr, Chen et al.(2015)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Arnab et al.(2018)Arnab, Zheng, Jayasumana, Romera-Paredes, Larsson, Kirillov, Savchynskyy, Rother, Kahl, and Torr] which have historically been employed for semantic segmentation [He et al.(2004)He, Zemel, and Carreira-Perpiñán, Shotton et al.(2006)Shotton, Winn, Rother, and Criminisi].

As shown in Fig. 1, our proposed method consists of two primary components: the coordinate space GCN explicitly models the spatial relationships between pixels in the image, enabling our network to produce coherent predictions that consider all objects in the image, whilst the feature space GCN models interdependencies along the channel dimensions of the network’s feature map. Assuming that filters in later layers of a CNN are responsive to object parts and other high-level features [Zeiler and Fergus(2014)], the feature space GCN captures correlations between more abstract features in the image, such as object parts. After reasoning, these two complementary relation-aware features are distributed back to the original coordinate space and added to the original feature.

Using our proposed approach, we obtain state-of-the-art results on the Cityscapes [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele] and Pascal Context [Mottaghi et al.(2014)Mottaghi, Chen, Liu, Cho, Lee, Fidler, Urtasun, and Yuille] datasets. Furthermore, to encourage reproducibility, we have publicly released our training code and models.

2 Related Work

Following the success of deep neural networks for image classification [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Simonyan and Zisserman(2015), He et al.(2016)He, Zhang, Ren, and Sun], recent works in semantic segmentation all leverage fully-convolutional networks (FCNs) [Long et al.(2015)Long, Shelhamer, and Darrell]. A limitation of standard FCNs is their small receptive field, which prevents them from taking all the contextual information in the scene into account. The DeepLab series of papers [Chen et al.(2015)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Chen et al.(2018)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Chen et al.(2017)Chen, Papandreou, Schroff, and Adam] proposed atrous (dilated) convolutions and atrous spatial pyramid pooling (ASPP) to increase the effective receptive field. DenseASPP [Yang et al.(2018)Yang, Yu, Zhang, Li, and Yang] improved on [Chen et al.(2018)Chen, Papandreou, Kokkinos, Murphy, and Yuille] by densely connecting convolutional layers with different dilation rates. PSPNet [Zhao et al.(2017)Zhao, Shi, Qi, Wang, and Jia], on the other hand, used a pyramid pooling module to fuse convolutional features from multiple scales. Similarly, encoder-decoder network structures [Noh et al.(2015)Noh, Hong, and Han, Ronneberger et al.(2015)Ronneberger, Fischer, and Brox, Lin et al.(2017b)Lin, Dollár, Girshick, He, Hariharan, and Belongie] combine features from early and late stages in the network to fuse mid-level and high-level semantic features. Deeplab-v3+ [Chen et al.(2018a)Chen, Zhu, Papandreou, Schroff, and Adam] also followed this approach by fusing lower-level features into its decoder. Lin et al. [Lin et al.(2018)Lin, Ji, Lischinski, Cohen-Or, and Huang] recursively and locally fused the feature maps of every two adjacent levels of a feature pyramid into one.

Another approach has been proposed to more explicitly account for the relations between all pixels in the image. DAG-RNN [Shuai et al.(2018)Shuai, Zuo, Wang, and Wang] models a directed acyclic graph with a recurrent network that propagates information. PSANet [Zhao et al.(2018a)Zhao, Zhang, Liu, Shi, Change Loy, Lin, and Jia] captures pixel-to-pixel relations using an attention module that takes the relative location of each pixel into account. On the other hand, EncNet [Zhang et al.(2018)Zhang, Dana, Shi, Zhang, Wang, Tyagi, and Agrawal] and DFN [Yu et al.(2018b)Yu, Wang, Peng, Gao, Yu, and Sang] use attention along the channel dimension of the convolutional feature map to account for global context such as the co-occurrences of different classes in the scene.

Following on from these approaches, graph neural networks [Kipf and Welling(2017)] have also been used to model long-range context in the scene. Non-local networks [Wang et al.(2018b)Wang, Girshick, Gupta, and He] applied this to video understanding and object detection by learning an affinity map between all pixels in the image or video frames. This allowed the network to effectively increase its receptive field to the whole image. The non-local operator has been applied to segmentation by OCNet [Yuan and Wang(2018)] and DANet [Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu] recently. However, these methods have a (sometimes prohibitively) high memory cost as the affinity matrix grows quadratically with the number of pixels in the image. To bypass this problem, several works [Chen et al.(2018b)Chen, Rohrbach, Yan, Yan, Feng, and Kalantidis, Liang et al.(2018)Liang, Hu, Zhang, Lin, and Xing, Li and Gupta(2018)] have modelled the dependencies between regions of the image rather than individual pixels. This is done by aggregating features from the original “co-ordinate space” to a lower-dimensional intermediate representation, performing graph convolutions in this space, and then reprojecting back onto the original co-ordinate space.

However, differently from these recent GCN methods, we propose the Dual Graph Convolutional Network (DGCNet) to model the global context of the input feature by considering two orthogonal graphs in a single general framework. Specifically, with different mapping strategies, we first project the feature into a new coordinate space and a non-coordinate (feature) space where global relational reasoning can be computed efficiently. After reasoning, two complementary relation-aware features are distributed back to the original coordinate space and added to the original feature. The refined feature thus contains rich global contextual information and can be further provided to the following layers to learn better task-specific representations.

3 Methodology

In this section, we first revisit the graph convolutional network in Sec. 3.1 and then introduce the formulation of our proposed DGCNet in Sec. 3.2 and  3.3. Finally, we detail the resulting network architecture in Sec. 3.4.

3.1 Preliminaries

Revisiting the graph convolution. Assume an input tensor $X \in \mathbb{R}^{N \times D}$, where $D$ is the feature dimension and $N = H \times W$ is the number of locations defined on the regular grid coordinates $\Omega = \{1, \ldots, H\} \times \{1, \ldots, W\}$. In standard convolution, information is only exchanged among positions in a small neighborhood defined by the filter size (e.g. typically $3 \times 3$). In order to create a large receptive field and capture long range dependencies, one needs to stack numerous layers after each other, as done in common architectures [Simonyan and Zisserman(2015), He et al.(2016)He, Zhang, Ren, and Sun]. Graph convolution [Kipf and Welling(2017)] is a highly efficient, effective and differentiable module that generalises the neighborhood definition used in standard convolution, and allows long-range information exchange in a single layer. This is done by defining edges among nodes in a graph $\mathcal{G}$. Formally, following [Kipf and Welling(2017)], graph convolution is defined as

$\tilde{X} = \sigma(A X W),$   (1)

where $\sigma(\cdot)$ is the non-linear activation function, $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix characterising the neighbourhood relations of the graph and $W \in \mathbb{R}^{D \times D}$ is the weight matrix. Clearly, the graph definition and structure play a key role in determining the information propagation. Our proposed framework is motivated by building orthogonal graph spaces via different graph projection strategies to learn a better task-specific representation. As summarised in Fig. 2, we now describe how we propagate information in the coordinate space in Sec. 3.2, and in the feature space in Sec. 3.3.
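
To make Eq. 1 concrete, the following is a minimal PyTorch sketch of a single graph convolution layer over $N$ nodes with $D$-dimensional features. The batched layout, the weight initialisation and the use of ReLU as $\sigma$ are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of Eq. 1: Z = sigma(A X W) for N nodes with D-dim features.
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.empty(dim, dim))   # weight matrix W
        nn.init.kaiming_uniform_(self.W)

    def forward(self, x, adj):
        # x: (B, N, D) node features, adj: (N, N) adjacency matrix A
        return torch.relu(adj @ x @ self.W)            # Eq. 1 with sigma = ReLU


nodes, dim = 16, 64
adj = torch.eye(nodes)                                 # trivial graph for the demo
x = torch.randn(2, nodes, dim)
print(GraphConv(dim)(x, adj).shape)                    # torch.Size([2, 16, 64])
```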

Figure 2: Illustration of our proposed DGCNet. Our method consists of two branches, which each consist of a graph convolutional network (GCN) to model contextual information in the spatial and channel dimensions of a convolutional feature map $X$.

3.2 Graph convolution in coordinate space

Coordinate space projection. We first project the input feature $X \in \mathbb{R}^{D \times H \times W}$ into a new coordinate space $\tilde{\Omega}$. In general, we adopt a spatial downsampling operation $\varphi$ to transform the input feature $X$ to a new feature $X_s$ in the new coordinate space $\tilde{\Omega}$, where $d$ denotes the downsample rate,

$X_s = \varphi(X) \in \mathbb{R}^{D \times \frac{H}{d} \times \frac{W}{d}}.$   (2)

We consider two different operations for $\varphi$: (1) Parameter-free operation. We take the downsampling operation to be average pooling, which requires no additional learnable parameters. (2) Parameterised operation. For efficiency, a downsampling rate of $d$ is achieved by chaining depth-wise convolution layers, each with a stride of 2 and a kernel size of $3 \times 3$.
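
The two projection choices above can be sketched as follows; the use of $\log_2 d$ stride-2 layers, the $3 \times 3$ depth-wise kernels and the channel count are assumptions for illustration rather than details taken from the paper.

```python
# Sketch of the two coordinate-space projection choices, for a downsample
# rate d that is a power of two.
import math
import torch
import torch.nn as nn


def avg_pool_projection(d: int) -> nn.Module:
    # Parameter-free: average pooling with stride d.
    return nn.AvgPool2d(kernel_size=d, stride=d)


def strided_dw_projection(channels: int, d: int) -> nn.Module:
    # Parameterised: chain log2(d) depth-wise convolutions, each with stride 2.
    layers = [
        nn.Conv2d(channels, channels, kernel_size=3, stride=2,
                  padding=1, groups=channels, bias=False)
        for _ in range(int(math.log2(d)))
    ]
    return nn.Sequential(*layers)


x = torch.randn(1, 512, 96, 96)
print(avg_pool_projection(8)(x).shape)         # torch.Size([1, 512, 12, 12])
print(strided_dw_projection(512, 8)(x).shape)  # torch.Size([1, 512, 12, 12])
```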

Coordinate graph convolution. After projecting the features into the new coordinate space $\tilde{\Omega}$, we can build a lightweight fully-connected graph with adjacency matrix $A_s$ for diffusing information across nodes. Note that the nodes of the graph aggregate information from a “cluster” of pixels, and the edges measure the similarity between these clusters.

The global relational reasoning is performed on the downsampled feature $X_s$ to model the interaction between the features of the corresponding nodes. In particular, we adopt three learnable linear transformations ($\theta$, $\phi$, $\nu$) on the feature $X_s$ to produce the message,

$\tilde{X}_s = \big(\theta(X_s) \otimes \phi(X_s)^{\top}\big) \otimes \nu(X_s),$   (3)

where $A_s = \theta(X_s) \otimes \phi(X_s)^{\top}$ gives the adjacency matrix in Eq. 1, and $\otimes$ is the dot-product (matrix multiplication) operation.

Reprojection. After the reasoning, we map the new features back into the original coordinate space $\Omega$ to be compatible with a regular CNN. Opposite to the downsampling operation used in the graph projection, we simply perform upsampling as the reprojection operation. In particular, nearest neighbour interpolation, $\psi$, is adopted to resize $\tilde{X}_s$ to the original spatial input size $H \times W$. Hence, the output map is computed as $X_{\mathcal{C}} = W_{\mathcal{C}}\,\psi(\tilde{X}_s)$, where $W_{\mathcal{C}}$ is a convolution that transforms $\psi(\tilde{X}_s)$ into the channel dimension $D$.

Discussion. Our coordinate GCN is built on a coarser spatial grid whose size is determined by the downsampling rate $d$ (the choice of $d$ and its effect are analysed in Section 4.2.1). Compared to the Non-local operator [Wang et al.(2018b)Wang, Girshick, Gupta, and He], which needs to build a large fully-connected graph with an $HW \times HW$ adjacency matrix, our method is significantly more efficient. Moreover, by re-ordering Eq. 3 to $\theta(X_s) \otimes \big(\phi(X_s)^{\top} \otimes \nu(X_s)\big)$ (following the associative rule), we obtain large savings in memory and computation, since the cost becomes linear rather than quadratic in the number of graph nodes.
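
A hedged sketch of the whole coordinate-space branch is given below, using the form of Eq. 3 above and the associative re-ordering just discussed. Implementing $\theta$, $\phi$, $\nu$ as $1 \times 1$ convolutions, the reduced dimension $D/2$ and the scaling by the number of nodes are assumptions for illustration, not details from the paper.

```python
# Sketch of the coordinate-space branch: downsample, message passing via the
# re-ordered Eq. 3, then upsample and project back to D channels.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoordinateGCN(nn.Module):
    def __init__(self, dim: int, d: int = 8):
        super().__init__()
        self.d = d
        mid = dim // 2                       # reduced dimension (assumption)
        self.theta = nn.Conv2d(dim, mid, 1)
        self.phi = nn.Conv2d(dim, mid, 1)
        self.nu = nn.Conv2d(dim, mid, 1)
        self.out = nn.Conv2d(mid, dim, 1)    # W_C: back to D channels

    def forward(self, x):                    # x: (B, D, H, W)
        h, w = x.shape[2:]
        xs = F.avg_pool2d(x, self.d)         # parameter-free projection to the coarse grid
        b, _, hs, ws = xs.shape
        theta = self.theta(xs).flatten(2).transpose(1, 2)   # (B, Ns, C')
        phi = self.phi(xs).flatten(2)                        # (B, C', Ns)
        nu = self.nu(xs).flatten(2).transpose(1, 2)          # (B, Ns, C')
        # Associative re-ordering: theta @ (phi @ nu) never forms the Ns x Ns matrix.
        msg = theta @ (phi @ nu) / (hs * ws)                 # (B, Ns, C'), scaled for stability
        msg = msg.transpose(1, 2).reshape(b, -1, hs, ws)     # back to (B, C', H/d, W/d)
        msg = F.interpolate(msg, size=(h, w), mode="nearest")  # nearest-neighbour reprojection
        return self.out(msg)


x = torch.randn(2, 512, 96, 96)
print(CoordinateGCN(512)(x).shape)           # torch.Size([2, 512, 96, 96])
```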

3.3 Graph convolution in feature space

Given that the coordinate space GCN explicitly models the spatial relationships between pixels in the image, we now consider projecting the input feature into the feature space $\mathcal{F}$. The coordinate space GCN enables our network to produce coherent predictions that consider all objects in the image, whilst the feature space GCN models interdependencies along the channel dimensions of the network’s feature map. Assuming that filters in later layers of an FCN are responsive to object parts and other high-level features [Zeiler and Fergus(2014)], the feature space GCN captures correlations between more abstract features in the image, such as object parts.

Feature space projection. In practice, we first reduce the dimension of the input feature $X$ with the function $\varphi(X; W_\varphi)$ and formulate the projection function $\theta(X; W_\theta)$ as a linear combination of the input $X$, such that the new features can aggregate information from multiple regions.

Formally, the input feature $X$ is projected to a new feature $V$ in the feature space $\mathcal{F}$ via the projection function $\theta$. Thus we have

$V = \theta(X; W_\theta)\,\varphi(X; W_\varphi)^{\top},$   (4)

where both functions $\theta$ and $\varphi$ are implemented with a $1 \times 1$ convolutional layer. This results in a new feature $V \in \mathbb{R}^{K \times C}$, which consists of $K$ nodes, each with a feature descriptor of dimension $C$.

Feature graph convolution. After projection, we can build a fully-connected graph with adjacency matrix $A_f \in \mathbb{R}^{K \times K}$ in the feature space $\mathcal{F}$, where each node contains a feature descriptor. Following Eq. 1, we have

$\tilde{V} = \sigma\big((I - A_f)\,V\,W_f\big),$   (5)

where $W_f$ denotes the layer-specific trainable weights. We consider Laplacian smoothing [Li et al.(2018)Li, Han, and Wu, Chen et al.(2018b)Chen, Rohrbach, Yan, Yan, Feng, and Kalantidis] by updating the adjacency matrix to $(I - A_f)$ to propagate the node features over the graph. The identity matrix $I$ serves as a residual sum connection in our implementation that alleviates optimisation difficulties. Both the adjacency matrix $A_f$ and the weights $W_f$ are randomly initialised and optimised by gradient descent during training in an end-to-end fashion.

Reprojection. As in Sec. 3.2, after the reasoning we map the new features back into the original coordinate space, with output $X_{\mathcal{F}} \in \mathbb{R}^{D \times H \times W}$, to be compatible with regular convolutional neural networks,

$X_{\mathcal{F}} = W_{\mathcal{F}}\,\big(\theta(X; W_\theta)^{\top}\,\tilde{V}\big).$   (6)

This is done by first reusing the projection matrix $\theta(X; W_\theta)$ and then performing a linear projection (e.g. a $1 \times 1$ convolution layer, $W_{\mathcal{F}}$) to transform the result into the original coordinate space. As a result, we have the feature $X_{\mathcal{F}}$ with a feature dimension of $D$ at each grid coordinate.
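
The feature-space branch (Eqs. 4–6) can likewise be sketched as below. The number of nodes $K$, the reduced dimension $C$, the $1 \times 1$-convolution parameterisation of $\varphi$ and $\theta$, and the small random initialisation are illustrative assumptions.

```python
# Sketch of the feature-space branch: project to K nodes with theta(X), run a
# graph convolution with the Laplacian-smoothed adjacency (I - A_f), and
# reproject by reusing theta(X).
import torch
import torch.nn as nn


class FeatureGCN(nn.Module):
    def __init__(self, dim: int, nodes: int = 64, mid: int = 256):
        super().__init__()
        self.phi = nn.Conv2d(dim, mid, 1)        # dimension reduction, varphi(X)
        self.theta = nn.Conv2d(dim, nodes, 1)    # projection weights, theta(X)
        # A_f and W_f are randomly initialised and learned end-to-end.
        self.A = nn.Parameter(1e-2 * torch.randn(nodes, nodes))
        self.W = nn.Parameter(torch.eye(mid) + 1e-2 * torch.randn(mid, mid))
        self.out = nn.Conv2d(mid, dim, 1)        # W_F: back to D channels

    def forward(self, x):                         # x: (B, D, H, W)
        b, _, h, w = x.shape
        phi = self.phi(x).flatten(2)              # (B, C, HW)
        theta = self.theta(x).flatten(2)          # (B, K, HW)
        v = theta @ phi.transpose(1, 2)           # Eq. 4: (B, K, C) node features
        eye = torch.eye(self.A.size(0), device=x.device)
        v = torch.relu((eye - self.A) @ v @ self.W)   # Eq. 5: Laplacian-smoothed graph conv
        y = theta.transpose(1, 2) @ v             # Eq. 6: reproject with theta(X)^T
        y = y.transpose(1, 2).reshape(b, -1, h, w)
        return self.out(y)                        # linear projection back to D channels


x = torch.randn(2, 512, 96, 96)
print(FeatureGCN(512)(x).shape)                   # torch.Size([2, 512, 96, 96])
```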

3.4 DGCNet

The final refined feature is computed as $\hat{X} = X + X_{\mathcal{C}} + X_{\mathcal{F}}$, where “+” denotes point-wise summation. Since the module outputs a feature map of the same dimensions as its input, we can easily incorporate it into existing backbone CNN architectures (e.g. ResNet-101). Figure 2 shows a schematic illustration of our proposed DGCNet.

Implementation of DGCNet. We insert our proposed module between two convolution layers (both of which output the same number of channels) appended at the end of a fully convolutional network (FCN) for the task of semantic segmentation. Specifically, we use an ImageNet-pretrained ResNet-101 as the backbone network, removing the last pooling and FC layers. Our proposed module is then randomly initialised. Dilated convolution and multi-grid strategies [Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu] are adopted in the last two stages of the backbone. The remaining hyperparameters of our module are fixed to the same values in all experiments. We add a synchronised batch normalisation (BN) layer and ReLU non-linearity after each convolution layer in our proposed module, except for the convolution layers in the coordinate space GCN (i.e. there are no BN or ReLU operations in the coordinate space GCN defined in Sec. 3.2).
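
Putting the pieces together, a possible segmentation head following this description (and reusing the CoordinateGCN and FeatureGCN sketches from Secs. 3.2 and 3.3 above) could look as follows. The channel counts, the $3 \times 3$ convolutions and the plain BatchNorm2d (standing in for synchronised BN) are assumptions.

```python
# Sketch of a DGCNet-style head: the refined feature is the input plus the two
# branch outputs, wrapped between two convolution layers before classification.
import torch
import torch.nn as nn


class DGCHead(nn.Module):
    def __init__(self, in_dim: int = 2048, dim: int = 512, classes: int = 19):
        super().__init__()
        self.pre = nn.Sequential(nn.Conv2d(in_dim, dim, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(dim), nn.ReLU(inplace=True))
        self.coord_gcn = CoordinateGCN(dim)      # spatial branch (Sec. 3.2 sketch)
        self.feat_gcn = FeatureGCN(dim)          # channel branch (Sec. 3.3 sketch)
        self.post = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(dim), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(dim, classes, 1)

    def forward(self, x):                        # x: backbone feature map (B, in_dim, h, w)
        x = self.pre(x)
        x = x + self.coord_gcn(x) + self.feat_gcn(x)   # point-wise summation of both branches
        return self.classifier(self.post(x))


feats = torch.randn(1, 2048, 96, 96)             # stand-in for ResNet-101 stage-5 features
print(DGCHead()(feats).shape)                    # torch.Size([1, 19, 96, 96])
```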

4 Experiments

To evaluate our proposed method, we carry out comprehensive experiments on the Cityscapes [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele] and PASCAL Context [Mottaghi et al.(2014)Mottaghi, Chen, Liu, Cho, Lee, Fidler, Urtasun, and Yuille] datasets, where we achieve state-of-the-art performance. We describe our experimental setup in Sec. 4.1, before presenting experiments on the Cityscapes dataset (on which we also perform ablation studies) in Sec. 4.2 and finally Pascal Context in Sec. 4.3.

4.1 Experimental setup

Cityscapes: This dataset [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele] densely annotates 19 object categories in urban scenes captured from cameras mounted on a car. It contains 5000 finely annotated images, split into 2975, 500 and 1525 for training, validation and testing respectively. The images are all captured at a high resolution of 2048 × 1024.

PASCAL Context: This dataset [Mottaghi et al.(2014)Mottaghi, Chen, Liu, Cho, Lee, Fidler, Urtasun, and Yuille] provides detailed semantic labels for whole scenes (both “thing” and “stuff” classes [Forsyth et al.(1996)Forsyth, Malik, Fleck, Greenspan, Leung, Belongie, Carson, and Bregler]), and contains 4998 images for training and 5105 images for validation. Following previous works which we compare to, we evaluate on the most frequent 59 classes along with one background category (60 classes in total) [Mottaghi et al.(2014)Mottaghi, Chen, Liu, Cho, Lee, Fidler, Urtasun, and Yuille].

Implementation details: We implement our method using PyTorch. Following [Zhao et al.(2017)Zhao, Shi, Qi, Wang, and Jia], we use SGD with momentum and adopt a polynomial learning rate decay schedule, where the initial learning rate is multiplied by $(1 - \frac{iter}{max\_iter})^{0.9}$. The initial learning rate is set to 0.01 for Cityscapes and 0.001 for Pascal Context. The momentum and weight decay coefficients are set to 0.9 and 0.0001 respectively. For data augmentation, we apply random cropping (crop size 769 × 769) and random left-right flipping during training for Cityscapes. For the Pascal Context dataset, our crop size is 480 × 480. We also use synchronised batch normalisation for better estimation of the batch statistics.
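
For reference, a sketch of this optimisation setup is given below; the decay power of 0.9 and the number of iterations are assumptions commonly paired with the “poly” schedule rather than values stated in the paper.

```python
# Sketch of SGD with momentum 0.9, weight decay 1e-4, and "poly" LR decay.
import torch


def poly_lr(base_lr: float, it: int, max_it: int, power: float = 0.9) -> float:
    return base_lr * (1.0 - it / max_it) ** power


model = torch.nn.Conv2d(3, 19, 1)                  # stand-in for the full network
optimiser = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

max_iter = 60000                                    # illustrative value
for it in [0, 15000, 30000, 59999]:
    for group in optimiser.param_groups:
        group["lr"] = poly_lr(0.01, it, max_iter)   # set before each optimiser.step()
    print(it, round(group["lr"], 5))                # learning rate decays towards 0
```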

(a) Comparison of different graph modules:
Method | Backbone | Coord. GCN | Feature GCN | mIoU (%)
Dilated FCN | ResNet-101 | | | 75.2
GCN | ResNet-101 | ✓ | | 78.8
GCN | ResNet-101 | | ✓ | 79.3
DGCNet | ResNet-101 | ✓ | ✓ | 80.5

(b) Additional training and inference strategies:
Method | OHEM | Multi-grid | MS | mIoU (%)
DGCNet | | | | 79.5
DGCNet | ✓ | | | 79.8
DGCNet | ✓ | ✓ | | 80.5
DGCNet | ✓ | ✓ | ✓ | 81.8
Table 1: Ablation studies on (a) the proposed components of our network and (b) additional training and inference strategies. All methods use the ResNet-101 backbone, and are evaluated using the mean IoU on the Cityscapes validation set. Refer to Sec. 4.2.1 for additional details.
(a) Comparison of computational cost:
Module | GFLOPs | #Params | mIoU (%)
DA module [Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu] | 24.87 | 1,496,224 | 79.8
DGC module (Ours) | 14.15 | 1,240,704 | 80.5

(b) Ablation on graph projection strategies:
Downsample rate | d=4 | d=8 | d=16
Avg. pooling | 80.2 | 80.5 | 80.5
Stride conv. | 80.0 | 80.5 | 80.5
Table 2: Ablation studies on (a) the computational cost of the context module (measured for a fixed input feature size) and (b) the graph projection strategy. All methods are evaluated using the mean IoU at a single scale, using the ResNet-101 backbone on the Cityscapes validation set.

4.2 Experiments on Cityscapes

4.2.1 Ablation Studies

Effect of proposed modules: As shown in Table 1a, our proposed GCN modules substantially improve performance. Compared to our baseline FCN with dilated convolutions (with the ResNet-101 backbone), appending the feature (channel) GCN module obtains a mean IoU of 79.3%, an improvement of 4.1%. Similarly, the coordinate (spatial) GCN module on its own improves the baseline by 3.6%. The best results are obtained by combining the two modules, resulting in a mean IoU of 80.5%. The effect of our modules is visualised in Fig. 3: the contextual information captured by our graph modules improves the consistency of our results, leading to fewer artifacts in the prediction.

Effect of additional training and inference strategies: It is common to use additional “tricks” to improve results for semantic segmentation benchmarks [Chen et al.(2017)Chen, Papandreou, Schroff, and Adam, Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu, Yuan and Wang(2018), Zhao et al.(2017)Zhao, Shi, Qi, Wang, and Jia, Shrivastava et al.(2016)Shrivastava, Gupta, and Girshick]. Table 1b shows that our proposed GCN module is complementary to them, and incorporating these strategies improves our results too.

Specifically, we consider: 1) Online Hard Example Mining (OHEM) [Shrivastava et al.(2016)Shrivastava, Gupta, and Girshick, Pohlen et al.(2017)Pohlen, Hermans, Mathias, and Leibe, Li et al.(2017)Li, Arnab, and Torr, Yuan and Wang(2018)], where the loss is only computed on the pixels with the highest loss in the image. Following [Yuan and Wang(2018)], we apply this within each cropped training image (a minimal sketch is given after this paragraph). 2) Multi-Grid [Chen et al.(2017)Chen, Papandreou, Schroff, and Adam, Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu, Zhu et al.(2019)Zhu, Sapra, Reda, Shih, Newsam, Tao, and Catanzar], which employs a hierarchy of convolutional filters with different dilation rates (4, 8 and 16) in the last ResNet block. 3) Multi-Scale (MS) ensembling, which is commonly used at inference time to improve results [Chen et al.(2015)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Chen et al.(2018a)Chen, Zhu, Papandreou, Schroff, and Adam, Zhao et al.(2017)Zhao, Shi, Qi, Wang, and Jia, Yuan and Wang(2018), Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu, Zhao et al.(2018a)Zhao, Zhang, Liu, Shi, Change Loy, Lin, and Jia, Dai et al.(2015)Dai, He, and Sun]; we average the segmentation probability maps from 6 image scales {0.75, 1, 1.25, 1.5, 1.75, 2} during inference.
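
A minimal sketch of the pixel-wise OHEM loss from strategy 1); the fraction of pixels kept is an illustrative assumption, not the threshold used in the paper.

```python
# OHEM cross-entropy: average the loss only over the hardest pixels.
import torch
import torch.nn.functional as F


def ohem_ce_loss(logits, target, keep_frac=0.25, ignore_index=255):
    # logits: (B, C, H, W), target: (B, H, W)
    loss = F.cross_entropy(logits, target, ignore_index=ignore_index,
                           reduction="none").flatten()
    valid = loss[target.flatten() != ignore_index]
    k = max(1, int(keep_frac * valid.numel()))
    return valid.topk(k).values.mean()         # hardest pixels only


logits = torch.randn(2, 19, 65, 65, requires_grad=True)
target = torch.randint(0, 19, (2, 65, 65))
print(ohem_ce_loss(logits, target))
```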

As shown in Table 1b, each of these strategies provides consistent improvements to our overall results. Using these strategies, we compare to the state-of-the-art in the following subsection.

Figure 3: Cityscapes results compared with the Dilated FCN ResNet-101 baseline [Yu and Koltun(2016)]. Red boxes highlight regions where our method resolves inconsistent predictions within the same object. Best viewed in colour. More results in Section 6.
Method Backbone mIoU (%)
Trained on train-fine only:
PSPNet [Zhao et al.(2017)Zhao, Shi, Qi, Wang, and Jia] ResNet-101 78.4
PSANet [Zhao et al.(2018a)Zhao, Zhang, Liu, Shi, Change Loy, Lin, and Jia] ResNet-101 78.6
OCNet [Yuan and Wang(2018)] ResNet-101 80.1
DGCNet (Ours) ResNet-101 80.9
Trained on train-fine and val-fine:
SAC [Zhang et al.(2017)Zhang, Tang, Zhang, Li, and Yan] ResNet-101 78.1
AAF [Ke et al.(2018)Ke, Hwang, Liu, and Yu] ResNet-101 79.1
BiSeNet [Yu et al.(2018a)Yu, Wang, Peng, Gao, Yu, and Sang] ResNet-101 78.9
PSANet [Zhao et al.(2018a)Zhao, Zhang, Liu, Shi, Change Loy, Lin, and Jia] ResNet-101 80.1
DFN [Yu et al.(2018b)Yu, Wang, Peng, Gao, Yu, and Sang] ResNet-101 79.3
DepthSeg [Kong and Fowlkes(2018)] ResNet-101 78.2
DenseASPP [Yang et al.(2018)Yang, Yu, Zhang, Li, and Yang] ResNet-101 80.6
GloRe [Chen et al.(2018b)Chen, Rohrbach, Yan, Yan, Feng, and Kalantidis] ResNet-101 80.9
DANet [Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu] ResNet-101 81.5
OCNet [Yuan and Wang(2018)] ResNet-101 81.7
DGCNet (Ours) ResNet-50 80.8
DGCNet (Ours) ResNet-101 82.0

Table 3: State-of-the-art comparison on the Cityscapes test set.
Method Backbone mIoU (%)
FCN8-s [Long et al.(2015)Long, Shelhamer, and Darrell] VGG-16 37.8
HO CRF [Arnab et al.(2016)Arnab, Jayasumana, Zheng, and Torr] VGG-16 41.3
Piecewise [Lin et al.(2016)Lin, Shen, van den Hengel, and Reid] VGG-16 43.3
DeepLab-v2 (COCO) [Chen et al.(2018)Chen, Papandreou, Kokkinos, Murphy, and Yuille] ResNet-101 45.7
RefineNet [Lin et al.(2017a)Lin, Milan, Shen, and Reid] ResNet-101 47.3
PSPNet [Zhao et al.(2017)Zhao, Shi, Qi, Wang, and Jia] ResNet-101 47.8
Ding et al. [Ding et al.(2018)Ding, Jiang, Shuai, Qun Liu, and Wang] ResNet-101 51.6
EncNet [Zhang et al.(2018)Zhang, Dana, Shi, Zhang, Wang, Tyagi, and Agrawal] (SS) ResNet-50 49.0
EncNet [Zhang et al.(2018)Zhang, Dana, Shi, Zhang, Wang, Tyagi, and Agrawal] (MS) ResNet-101 51.7
SGR [Liang et al.(2018)Liang, Hu, Zhang, Lin, and Xing] ResNet-101 52.5
DANet [Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu] ResNet-50 50.1
DANet [Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu] ResNet-101 52.6
Dilated FCN baseline ResNet-50 44.3
DGCNet (SS) ResNet-50 50.1
DGCNet (SS) ResNet-101 53.0
DGCNet (MS) ResNet-101 53.7

SS: single-scale inference. MS: multi-scale inference.

Table 4: Comparison to other methods on the Pascal Context [Mottaghi et al.(2014)Mottaghi, Chen, Liu, Cho, Lee, Fidler, Urtasun, and Yuille] dataset.
Figure 4: Comparison of our results on Pascal Context with the state-of-the-art EncNet [Zhang et al.(2018)Zhang, Dana, Shi, Zhang, Wang, Tyagi, and Agrawal] method. Note that our results are more consistent and have fewer artifacts. Best viewed in colour. More results in Section 6.

Computational cost analysis: Table 2a shows that our proposed module requires significantly fewer floating point operations than the related DA module of DANet [Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu], yet achieves higher performance.

Effect of mapping strategies: As mentioned in Sec. 3.2, different mapping strategies are possible for building the coordinate space graph. We present the effect of two strategies – average pooling and strided convolution – for different downsampling rates $d$ in Tab. 2b. It is interesting to observe that the model with average pooling achieves similar performance to strided convolution, even slightly outperforming it for $d=4$. One reason may be that the parameter-free mapping strategy is less prone to overfitting than the strided convolution operation. Note that both strategies are robust to the choice of $d$, and similar performance is obtained for $d=4$, $d=8$ and $d=16$. The final model that we use for comparing to the state-of-the-art uses strided convolutions.

4.2.2 Comparisons with state-of-the-art

Table 3 compares our approach with existing methods on the Cityscapes test set. Following common practice towards obtaining the highest performance, we use the inference strategies described in the previous section. For a fair comparison, Tab. 3 only shows methods that are trained using the fine annotations from Cityscapes and evaluated on the evaluation server. We achieve a mean IoU of 80.9% when only using the training set, thus outperforming PSANet [Zhao et al.(2018a)Zhao, Zhang, Liu, Shi, Change Loy, Lin, and Jia] by 2.3% and OCNet [Yuan and Wang(2018)] by 0.8%. Training with both the train-fine and val-fine sets achieves an IoU of 82.0%. In both scenarios, we obtain state-of-the-art results. Detailed per-class results are provided in Section 6, which shows that our method achieves the highest IoU in 16 out of the 19 classes.

4.3 Experiments on Pascal Context

Table 4 shows our results on Pascal Context. We follow prior work [Lin et al.(2017a)Lin, Milan, Shen, and Reid, Zhang et al.(2018)Zhang, Dana, Shi, Zhang, Wang, Tyagi, and Agrawal, Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu] in using the semantic labels of the most frequent 59 object categories plus background (60 classes in total). The Dilated FCN baseline achieves a mean IoU of 44.3% with the ResNet-50 backbone. Our proposed DGCNet significantly improves this baseline, achieving an IoU of 50.1% with the same ResNet-50 backbone under single-scale evaluation (SS), which outperforms previous work using the same backbone (49.0%) [Zhang et al.(2018)Zhang, Dana, Shi, Zhang, Wang, Tyagi, and Agrawal]. With the ResNet-101 backbone, DGCNet achieves an IoU of 53.0%. Moreover, our performance further improves to 53.7% when multi-scale inference (MS) is adopted, surpassing the previous state-of-the-art of 52.6% [Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu] by 1.1%.

5 Conclusion

We proposed a graph-convolutional module to model the contextual relationships in an image, which is critical for dense prediction tasks such as semantic segmentation. Our method consists of two branches, one to capture context along the spatial dimensions, and another along the channel dimensions, of a convolutional feature map. Our proposed approach provides significant improvements over a strong baseline, and achieves state-of-the-art results on the Cityscapes and Pascal Context datasets. Future work will address other dense prediction tasks such as instance segmentation and depth estimation.

Acknowledgments

This work was supported by EPSRC Programme Grant Seebibyte EP/M013774/1, ERC grant ERC-2012-AdG 321162-HELIOS and EPSRC/MURI grant EP/N019474/1. We gratefully acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work. We would also like to acknowledge the Royal Academy of Engineering, FiveAI and the support of DeepMotion AI Research for providing the computing resources in carrying out this research.

6 Appendix

Method | Mean IoU | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle
DeepLab-v2 [Chen et al.(2018)Chen, Papandreou, Kokkinos, Murphy, and Yuille] 70.4 97.9 81.3 90.3 48.8 47.4 49.6 57.9 67.3 91.9 69.4 94.2 79.8 59.8 93.7 56.5 67.5 57.5 57.7 68.8
RefineNet [Lin et al.(2017a)Lin, Milan, Shen, and Reid] 73.6 98.2 83.3 91.3 47.8 50.4 56.1 66.9 71.3 92.3 70.3 94.8 80.9 63.3 94.5 64.6 76.1 64.3 62.2 70
GCN [Peng et al.(2017)Peng, Zhang, Yu, Luo, and Sun] 76.9 - - - - - - - - - - - - - - - - - - -
DUC [Wang et al.(2018a)Wang, Chen, Yuan, Liu, Huang, Hou, and Cottrell] 77.6 98.5 85.5 92.8 58.6 55.5 65 73.5 77.9 93.3 72 95.2 84.8 68.5 95.4 70.9 78.8 68.7 65.9 73.8
ResNet-38 [Wu et al.(2019)Wu, Shen, and Van Den Hengel] 78.4 98.5 85.7 93.1 55.5 59.1 67.1 74.8 78.7 93.7 72.6 95.5 86.6 69.2 95.7 64.5 78.8 74.1 69 76.7
PSPNet [Zhao et al.(2017)Zhao, Shi, Qi, Wang, and Jia] 78.4 - - - - - - - - - - - - - - - - - - -
BiSeNet [Yu et al.(2018c)Yu, Wang, Peng, Gao, Yu, and Sang] 78.9 - - - - - - - - - - - - - - - - - - -
PSANet [Zhao et al.(2018b)Zhao, Zhang, Liu, Shi, Loy, Lin, and Jia] 80.1 - - - - - - - - - - - - - - - - - - -
DenseASPP [Yang et al.(2018)Yang, Yu, Zhang, Li, and Yang] 80.6 98.7 87.1 93.4 60.7 62.7 65.6 74.6 78.5 93.6 72.5 95.4 86.2 71.9 96.0 78.0 90.3 80.7 69.7 76.8
GloRe [Chen et al.(2018b)Chen, Rohrbach, Yan, Yan, Feng, and Kalantidis] 80.9 - - - - - - - - - - - - - - - - - - -
DANet [Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu] 81.5 98.6 86.1 93.5 56.1 63.3 69.7 77.3 81.3 93.9 72.9 95.7 87.3 72.9 96.2 76.8 89.4 86.5 72.2 78.2
Ours 82.0 98.7 87.4 93.9 62.4 63.4 70.8 78.7 81.3 94.0 73.3 95.8 87.8 73.7 96.4 76.0 91.6 81.6 71.5 78.2
Table 5: Per-class results on the Cityscapes test set. Our method outperforms existing approaches, achieving 82.0% mean IoU and the highest IoU in 16 out of the 19 classes.
Figure 5: Cityscapes results compared with the Dilated FCN ResNet-101 baseline [Yu and Koltun(2016)]. Best viewed in colour.
Figure 6: Comparison of our results on Pascal Context with the state-of-the-art EncNet [Zhang et al.(2018)Zhang, Dana, Shi, Zhang, Wang, Tyagi, and Agrawal] method. Note how our results are more consistent and have fewer artifacts. Best viewed in colour.

References

  • [Arnab et al.(2016)Arnab, Jayasumana, Zheng, and Torr] Anurag Arnab, Sadeep Jayasumana, Shuai Zheng, and Philip HS Torr. Higher order conditional random fields in deep neural networks. In European Conference on Computer Vision, 2016.
  • [Arnab et al.(2018)Arnab, Zheng, Jayasumana, Romera-Paredes, Larsson, Kirillov, Savchynskyy, Rother, Kahl, and Torr] Anurag Arnab, Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Måns Larsson, Alexander Kirillov, Bogdan Savchynskyy, Carsten Rother, Fredrik Kahl, and Philip HS Torr. Conditional random fields meet deep neural networks for semantic segmentation: Combining probabilistic graphical models with deep learning for structured prediction. IEEE Signal Processing Magazine, 2018.
  • [Byeon et al.(2015)Byeon, Breuel, Raue, and Liwicki] Wonmin Byeon, Thomas M. Breuel, Federico Raue, and Marcus Liwicki. Scene labeling with lstm recurrent neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [Chen et al.(2015)Chen, Papandreou, Kokkinos, Murphy, and Yuille] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. International Conference on Learning Representations, 2015.
  • [Chen et al.(2017)Chen, Papandreou, Schroff, and Adam] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [Chen et al.(2018)Chen, Papandreou, Kokkinos, Murphy, and Yuille] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [Chen et al.(2018a)Chen, Zhu, Papandreou, Schroff, and Adam] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision, 2018a.
  • [Chen et al.(2018b)Chen, Rohrbach, Yan, Yan, Feng, and Kalantidis] Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Shuicheng Yan, Jiashi Feng, and Yannis Kalantidis. Graph-based global reasoning networks. arXiv preprint arXiv:1811.12814, 2018b.
  • [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [Dai et al.(2015)Dai, He, and Sun] Jifeng Dai, Kaiming He, and Jian Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1635–1643, 2015.
  • [Ding et al.(2018)Ding, Jiang, Shuai, Qun Liu, and Wang] Henghui Ding, Xudong Jiang, Bing Shuai, Ai Qun Liu, and Gang Wang. Context contrasted feature and gated multi-scale aggregation for scene segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 2010.
  • [Forsyth et al.(1996)Forsyth, Malik, Fleck, Greenspan, Leung, Belongie, Carson, and Bregler] David A Forsyth, Jitendra Malik, Margaret M Fleck, Hayit Greenspan, Thomas Leung, Serge Belongie, Chad Carson, and Chris Bregler. Finding pictures of objects in large collections of images. In International workshop on object representation in computer vision, pages 335–360. Springer, 1996.
  • [Fu et al.(2018)Fu, Liu, Tian, Fang, and Lu] Jun Fu, Jing Liu, Haijie Tian, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983, 2018.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [He et al.(2004)He, Zemel, and Carreira-Perpiñán] Xuming He, Richard S Zemel, and Miguel Á Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.
  • [Ke et al.(2018)Ke, Hwang, Liu, and Yu] Tsung-Wei Ke, Jyh-Jing Hwang, Ziwei Liu, and Stella X. Yu. Adaptive affinity fields for semantic segmentation. In European Conference on Computer Vision, 2018.
  • [Kipf and Welling(2017)] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
  • [Kong and Fowlkes(2018)] Shu Kong and Charless C. Fowlkes. Recurrent scene parsing with perspective understanding in the loop. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
  • [Li et al.(2018)Li, Han, and Wu] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI Conference on Artificial Intelligence, 2018.
  • [Li et al.(2017)Li, Arnab, and Torr] Qizhu Li, Anurag Arnab, and Philip HS Torr. Holistic, instance-level human parsing. In British Machine Vision Conference, 2017.
  • [Li et al.(2019)Li, Zhang, You, Yang, Yang, and Tong] Xiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, and Yunhai Tong. Global aggregation then local distribution in fully convolutional networks. In British Machine Vision Conference, 2019.
  • [Li and Gupta(2018)] Yin Li and Abhinav Gupta. Beyond grids: Learning graph representations for visual recognition. In Advances in Neural Information Processing Systems, 2018.
  • [Liang et al.(2018)Liang, Hu, Zhang, Lin, and Xing] Xiaodan Liang, Zhiting Hu, Hao Zhang, Liang Lin, and Eric P Xing. Symbolic graph reasoning meets convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, 2018.
  • [Lin et al.(2018)Lin, Ji, Lischinski, Cohen-Or, and Huang] Di Lin, Yuanfeng Ji, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Multi-scale context intertwining for semantic segmentation. In European Conference on Computer Vision, 2018.
  • [Lin et al.(2016)Lin, Shen, van den Hengel, and Reid] Guosheng Lin, Chunhua Shen, Anton van den Hengel, and Ian D. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [Lin et al.(2017a)Lin, Milan, Shen, and Reid] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian D. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017a.
  • [Lin et al.(2017b)Lin, Dollár, Girshick, He, Hariharan, and Belongie] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017b.
  • [Liu et al.(2015)Liu, Rabinovich, and Berg] Wei Liu, Andrew Rabinovich, and Alexander C Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
  • [Long et al.(2015)Long, Shelhamer, and Darrell] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [Mottaghi et al.(2014)Mottaghi, Chen, Liu, Cho, Lee, Fidler, Urtasun, and Yuille] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [Noh et al.(2015)Noh, Hong, and Han] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In IEEE International Conference on Computer Vision, 2015.
  • [Peng et al.(2017)Peng, Zhang, Yu, Luo, and Sun] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [Pohlen et al.(2017)Pohlen, Hermans, Mathias, and Leibe] Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4151–4160, 2017.
  • [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer Assisted Intervention, 2015.
  • [Shotton et al.(2006)Shotton, Winn, Rother, and Criminisi] Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi. Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European Conference on Computer Vision, 2006.
  • [Shrivastava et al.(2016)Shrivastava, Gupta, and Girshick] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [Shuai et al.(2018)Shuai, Zuo, Wang, and Wang] Bing Shuai, Zhen Zuo, Bing Wang, and Gang Wang. Scene segmentation with dag-recurrent neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [Simonyan and Zisserman(2015)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.
  • [Wang et al.(2018a)Wang, Chen, Yuan, Liu, Huang, Hou, and Cottrell] Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. Understanding convolution for semantic segmentation. In IEEE Winter Conference on Applications of Computer Vision, 2018a.
  • [Wang et al.(2018b)Wang, Girshick, Gupta, and He] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018b.
  • [Wu et al.(2019)Wu, Shen, and Van Den Hengel] Zifeng Wu, Chunhua Shen, and Anton Van Den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recognition, 2019.
  • [Yang et al.(2018)Yang, Yu, Zhang, Li, and Yang] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [Yu et al.(2018a)Yu, Wang, Peng, Gao, Yu, and Sang] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In European Conference on Computer Vision, 2018a.
  • [Yu et al.(2018b)Yu, Wang, Peng, Gao, Yu, and Sang] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018b.
  • [Yu et al.(2018c)Yu, Wang, Peng, Gao, Yu, and Sang] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In European Conference on Computer Vision, 2018c.
  • [Yu and Koltun(2016)] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. International Conference on Learning Representations, 2016.
  • [Yuan and Wang(2018)] Yuhui Yuan and Jingdong Wang. Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916, 2018.
  • [Zeiler and Fergus(2014)] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 2014.
  • [Zhang et al.(2018)Zhang, Dana, Shi, Zhang, Wang, Tyagi, and Agrawal] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [Zhang et al.(2019)Zhang, Xu, Arnab, and Torr] Li Zhang, Dan Xu, Anurag Arnab, and Philip HS Torr. Dynamic graph message passing network. arXiv preprint arXiv:1908.06959, 2019.
  • [Zhang et al.(2017)Zhang, Tang, Zhang, Li, and Yan] Rui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng Yan. Scale-adaptive convolutions for scene parsing. In IEEE International Conference on Computer Vision, 2017.
  • [Zhao et al.(2017)Zhao, Shi, Qi, Wang, and Jia] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [Zhao et al.(2018a)Zhao, Zhang, Liu, Shi, Change Loy, Lin, and Jia] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet: Point-wise spatial attention network for scene parsing. In European Conference on Computer Vision, 2018a.
  • [Zhao et al.(2018b)Zhao, Zhang, Liu, Shi, Loy, Lin, and Jia] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. PSANet: Point-wise spatial attention network for scene parsing. In European Conference on Computer Vision, 2018b.
  • [Zheng et al.(2015)Zheng, Jayasumana, Romera-Paredes, Vineet, Su, Du, Huang, and Torr] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. In IEEE International Conference on Computer Vision, 2015.
  • [Zhou et al.(2016)Zhou, Zhao, Puig, Fidler, Barriuso, and Torralba] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442, 2016.
  • [Zhu et al.(2019)Zhu, Sapra, Reda, Shih, Newsam, Tao, and Catanzar] Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, and Bryan Catanzar. Improving Semantic Segmentation via Video Propagation and Label Relaxation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.