ContextNet: Exploring Context and Detail for Semantic Segmentation in Real-time


Abstract

Modern deep learning architectures produce highly accurate results on many challenging semantic segmentation datasets. State-of-the-art methods are, however, not directly transferable to real-time applications or embedded devices, since naïve adaptation of such systems to reduce computational cost (speed, memory and energy) causes a significant drop in accuracy. We propose ContextNet, a new deep neural network architecture which builds on factorized convolution, network compression and pyramid representations to produce competitive semantic segmentation in real-time with low memory requirements. ContextNet combines a deep branch at low resolution that captures global context information efficiently with a shallow branch that focuses on high-resolution segmentation details. We analyse our network in a thorough ablation study and present results on the Cityscapes dataset, achieving 66.1% accuracy at 18.3 frames per second at full (1024×2048) resolution.

Rudra P K Poudel (rudra.poudel@gmail.com), Ujwal Bonde (ujwal.bonde@gmail.com), Stephan Liwicki (stephan.liwicki@gmail.com), Christopher Zach (christopher.m.zach@gmail.com)
Toshiba Research Europe, Cambridge, UK

1 Introduction

Semantic segmentation provides detailed pixel-level classification of images, which is particularly suited to autonomous vehicles and driver assistance, as these applications often require accurate road-boundary and obstacle detection [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele, Menze and Geiger(2015), Brostow et al.(2009)Brostow, Fauqueur, and Cipolla]. Modern systems produce highly accurate segmentation results, but often at the cost of computational efficiency. In this paper, we propose ContextNet to address competitive semantic segmentation for autonomous driving tasks, which require real-time processing and memory efficiency.

Deep neural networks (DNNs) have become the preferred approach for semantic image segmentation in recent years. High-performance segmentation methods adopt state-of-the-art classification architectures using fully convolutional network (FCN) [Shelhamer et al.(2016)Shelhamer, Long, and Darrell] or encoder-decoder techniques [Badrinarayanan et al.(2017)Badrinarayanan, Kendall, and Cipolla]. In particular, DeepLab [Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille] employs an increased number of layers to extract complex and abstract features, leading to increased accuracy. PSPNet [Zhao et al.(2017b)Zhao, Shi, Qi, Wang, and Jia] combines multiple levels of information through context aggregation from multiple resolutions, and benchmarks as one of the most accurate DNNs. The accuracy of these architectures comes at a high computational price: semantic segmentation of a single image requires more than a second, even on modern high-end GPUs (e.g. Nvidia Titan X), which hinders their deployment in driverless cars.

Convolution Factorization: Ordinarily convolutions map cross-channel and spatial correlations simultaneously. In contrast, convolution factorization employs multiple sub-operations to reduce computation cost and memory. Examples include Inception [Szegedy et al.(2016)Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna], Xception [Chollet(2016)] and MobileNet [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam, Sandler et al.(2018)Sandler, Howard, Zhu, Zhmoginov, and Chen]. In particular, MobileNet [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam] decomposes a standard convolution into a depth-wise convolution (also known as spatial or channel-wise convolution) and a point-wise convolution. MobileNetV2 [Sandler et al.(2018)Sandler, Howard, Zhu, Zhmoginov, and Chen] further improves this framework by introducing a bottleneck block and residual connections [He et al.(2015)He, Zhang, Ren, and Sun].

Network Compression: Network compression is orthogonal to convolution factorization. Network hashing or pruning is applied to reduce the size of a pre-trained network, resulting in faster test-time execution and a smaller parameter set and memory footprint [Han et al.(2016)Han, Mao, and Dally, Li et al.(2017)Li, Kadav, Durdanovic, Samet, and Graf].

Network Quantization: The runtime of a network can be further reduced using quantization techniques [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi, Wu et al.(2018)Wu, Li, Chen, and Shi]. These techniques reduce the size/memory requirements of a network by encoding parameters in low-bit representations. Moreover, runtime is further improved if binary weights and activation functions are employed since efficient XNOR and bit-count operations replace costly floating point operations of standard DNNs.

Despite these known techniques, notably only ENet [Paszke et al.(2016)Paszke, Chaurasia, Kim, and Culurciello] implements a network that runs in real-time on the full-resolution images of the popular Cityscapes dataset [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele], and it does so at significantly reduced accuracy in comparison to the state-of-the-art.

1.1 Contributions

Figure 1: ContextNet combines a deep network at small resolution with a shallow network at full resolution to achieve accurate and real-time semantic segmentation.

In this paper we introduce ContextNet, a competitive network for semantic segmentation running in real-time with low memory footprint (Figure 1). ContextNet builds on the following two observations of previous work:

  1. An increased number of layers helps to learn more complex and abstract features, leading to increased accuracy but also increased running time [He et al.(2015)He, Zhang, Ren, and Sun, Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Zhao et al.(2017b)Zhao, Shi, Qi, Wang, and Jia].

  2. Aggregation of context information from multiple resolutions is beneficial, since it combines multiple levels of information for improved performance [Shelhamer et al.(2016)Shelhamer, Long, and Darrell, Zhao et al.(2017b)Zhao, Shi, Qi, Wang, and Jia].

Consequently, we propose a network architecture with branches at two resolutions. In order to obtain real-time performance, the network at low resolution is deep, while the network at high resolution is rather shallow. As such, our network perceives context at low resolution and refines it for high resolution results. The architecture is designed to ensure the low resolution branch is mainly responsible for providing the correct label, whereas the high-resolution branch refines the segmentation boundaries.

Our design is related to pyramid representations [Burt and Adelson(1987)] which have been employed in many recent DNNs. RefineNet [Lin et al.(2017)Lin, Milan, Shen, and Reid] uses multiple paths over which information from different resolutions is carefully combined to obtain high-resolution segmentation results. Ghiasi and Fowlkes [Ghiasi and Fowlkes(2016)] use a class specific multi-level Laplacian pyramid reconstruction technique for the segmentation task. However, neither achieve real-time performance.

ICNet [Zhao et al.(2017a)Zhao, Qi, Shen, Shi, and Jia] employs a subset of ResNet [He et al.(2015)He, Zhang, Ren, and Sun] at three resolution levels (full, half and one fourth of the original input resolution) which are later combined to provide semantic segmentation results. Here we emphasize that, while our implementation of ICNet confirms the accuracy reported in [Zhao et al.(2017a)Zhao, Qi, Shen, Shi, and Jia], the reported runtime is achieved only on half-resolution images of Cityscapes [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele]. In contrast, we experimentally show that it is possible to capture global context (semantically rich features) with only a single deeper network on a smaller input size, and local context (detailed spatial features) with a shallow network on the full resolution.

We employ efficient bottleneck residual blocks [Sandler et al.(2018)Sandler, Howard, Zhu, Zhmoginov, and Chen] and network pruning [Han et al.(2016)Han, Mao, and Dally, Li et al.(2017)Li, Kadav, Durdanovic, Samet, and Graf] to present a novel DNN architecture for real-time semantic segmentation with full floating point operations. Network quantization is not explored and is left for future work. In the following sections, we provide design choices of our proposed ContextNet and show our detailed ablation analysis and experiments on Cityscapes [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele].

2 Proposed Context Network (ContextNet)

The proposed ContextNet is visualized in Figure 1. ContextNet produces cost-efficient, accurate segmentation at low resolution, which is then combined with a sub-network at high resolution to provide detailed segmentation results.

2.1 Motivation

Combining different levels of context information is advantageous for the semantic segmentation task [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox, Shelhamer et al.(2016)Shelhamer, Long, and Darrell, Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille]. PSPNet [Zhao et al.(2017b)Zhao, Shi, Qi, Wang, and Jia] employs an explicit pyramid pooling module to improve performance by capturing global and local context at different input sizes.

Another noticeable trend is that state-of-the-art DNNs have grown deeper (e.g. [He et al.(2015)He, Zhang, Ren, and Sun, Zhao et al.(2017b)Zhao, Shi, Qi, Wang, and Jia, Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille]), since this captures more complex and abstract features and increases the receptive field. Unfortunately, a higher number of layers ultimately increases runtime and memory requirements.

ContextNet combines both ideas: deep networks and multi-resolution architectures. In order to achieve fast runtime, we restrict our multi-scale input to two branches, where global information at low resolution is refined by a shallow network at high resolution to produce the final segmentation results in real-time.

2.2 Network Design

We now describe the main building blocks, the overall architecture and discuss our design.

2.2.1 Depth-wise Convolution to Improve Run-time

| Input | Operator | Output |
| --- | --- | --- |
| h × w × c | Conv2d 1/1, f | h × w × tc |
| h × w × tc | DWConv 3/s, f | (h/s) × (w/s) × tc |
| (h/s) × (w/s) × tc | Conv2d 1/1, − | (h/s) × (w/s) × c′ |

Table 1: Bottleneck residual block transferring the input from c to c′ channels, with height h, width w, expansion factor t, convolution type kernel-size/stride, and non-linear function f.

Depth-wise separable convolutions factorize standard convolution (Conv2d) into a depth-wise convolution (DWConv), also known as spatial or channel-wise convolution, followed by a point-wise convolution layer [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam]. Cross-channel and spatial correlation is therefore computed independently, which drastically reduces the number of parameters, resulting in fewer floating point operations and fast execution time.

ContextNet utilizes DWConv, as we design its two main building blocks accordingly. The sub-network with down-sampled input uses bottleneck residual blocks with DWConv [Sandler et al.(2018)Sandler, Howard, Zhu, Zhmoginov, and Chen] (Table 1). In the sub-network at full resolution depth-wise separable convolutions are directly employed. We omit the nonlinearity between depth-wise and point-wise convolutions in our full resolution branch, since it had limited impact on accuracy in our initial experiments.
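To make this concrete, the following minimal tf.keras sketch (not the authors' released code) shows a depth-wise separable convolution as used in our full-resolution branch and the bottleneck residual block of Table 1; the 'same' padding, batch-normalization placement and helper names are assumptions for illustration.

```python
from tensorflow.keras import layers


def depthwise_separable_conv(x, out_channels, stride=1):
    """Depth-wise separable convolution: a per-channel 3x3 DWConv followed by
    a 1x1 point-wise convolution. As in our full-resolution branch, no
    non-linearity is inserted between the two steps."""
    x = layers.DepthwiseConv2D(3, strides=stride, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(out_channels, 1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(max_value=6.0)(x)


def bottleneck_block(x, out_channels, expansion, stride):
    """Bottleneck residual block of Table 1: 1x1 expansion (ReLU6), 3x3
    depth-wise convolution with stride s (ReLU6), linear 1x1 projection, plus
    a residual connection when input and output shapes match."""
    in_channels = x.shape[-1]
    h = layers.Conv2D(expansion * in_channels, 1, use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)
    h = layers.DepthwiseConv2D(3, strides=stride, padding='same', use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)
    h = layers.Conv2D(out_channels, 1, use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])
    return h
```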

2.2.2 Capturing Global and Local Context

ContextNet has two branches, one for the full resolution (h × w) and one for a lower resolution (e.g. h/4 × w/4), where h and w denote the height and width of the input image (Figure 1). Each branch has different responsibilities: the latter captures the global context of the image, and the former provides the detailed information for the high-resolution segmentation. In particular, our design choices are motivated as follows:

  1. For fast feature extractions, semantically rich features are extracted from the lowest possible resolution only.

  2. Features for local context are extracted separately from full resolution input by a very shallow branch, and are then combined with low-resolution results.

Hence, significantly faster computation of image segmentations is possible in ContextNet.

| Input | Operator | Expansion Factor | Output Channels | Repeats | Stride |
| --- | --- | --- | --- | --- | --- |
| 256 × 512 × 3 | Conv2d | - | 32 | 1 | 2 |
| 128 × 256 × 32 | bottleneck | 1 | 32 | 1 | 1 |
| 128 × 256 × 32 | bottleneck | 6 | 32 | 1 | 1 |
| 128 × 256 × 32 | bottleneck | 6 | 48 | 3 | 2 |
| 64 × 128 × 48 | bottleneck | 6 | 64 | 3 | 2 |
| 32 × 64 × 64 | bottleneck | 6 | 96 | 2 | 1 |
| 32 × 64 × 96 | bottleneck | 6 | 128 | 2 | 1 |
| 32 × 64 × 128 | Conv2d | - | 128 | 1 | 1 |

Table 2: Branch-4 for the compressed (quarter-resolution) input. Repeated blocks use stride 1 after the first block/layer.

Capturing context: The detailed structure of the lower-resolution branch is shown in Table 2. This sub-network consists of two convolution layers and 12 bottleneck residual blocks. Similar to MobileNetV2 [Sandler et al.(2018)Sandler, Howard, Zhu, Zhmoginov, and Chen], we employ residual connections for bottleneck residual blocks when input and output are of the same size. While the low-resolution branch of ICNet [Zhao et al.(2017a)Zhao, Qi, Shen, Shi, and Jia] requires 50 costly layers of ResNet [He et al.(2015)He, Zhang, Ren, and Sun], ContextNet uses a total of only 38 highly efficient layers to describe the global context.
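As an illustration, the following sketch builds this branch from the bottleneck_block helper of the previous sketch; the 3×3 kernels and 'same' padding of the first and last standard convolutions are assumptions, while the expansion factors, channels, repeats and strides follow Table 2.

```python
from tensorflow.keras import layers

# reuses bottleneck_block from the sketch in Section 2.2.1

def context_branch(x):
    """Low-resolution branch of Table 2: one standard convolution,
    12 bottleneck residual blocks and a final convolution."""
    x = layers.Conv2D(32, 3, strides=2, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)
    # (expansion factor, output channels, repeats, stride) per table row
    rows = [(1, 32, 1, 1), (6, 32, 1, 1), (6, 48, 3, 2),
            (6, 64, 3, 2), (6, 96, 2, 1), (6, 128, 2, 1)]
    for expansion, channels, repeats, stride in rows:
        for i in range(repeats):
            # repeated blocks use stride 1 after the first block
            x = bottleneck_block(x, channels, expansion, stride if i == 0 else 1)
    x = layers.Conv2D(128, 3, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(max_value=6.0)(x)
```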

Spatial detail: The sub-network of the full-resolution branch is kept as shallow as possible and consists of only four layers. Its objective is to refine the results with local context. In particular, the numbers of feature maps are 32, 64, 128 and 128, respectively. The first layer uses a standard convolution and all others use depth-wise separable convolutions; the stride is 2 for all but the last layer, where it is 1.
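A corresponding sketch of the four-layer detail branch, reusing depthwise_separable_conv from the earlier sketch (kernel sizes and padding again assumed):

```python
from tensorflow.keras import layers

# reuses depthwise_separable_conv from the sketch in Section 2.2.1

def detail_branch(x):
    """Full-resolution branch: four layers with 32, 64, 128 and 128 feature
    maps; the first is a standard convolution, the rest are depth-wise
    separable; stride 2 everywhere except the last layer."""
    x = layers.Conv2D(32, 3, strides=2, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(max_value=6.0)(x)
    x = depthwise_separable_conv(x, 64, stride=2)
    x = depthwise_separable_conv(x, 128, stride=2)
    return depthwise_separable_conv(x, 128, stride=1)
```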

| Branch-1 (full resolution) | Branch-4 (quarter resolution) |
| --- | --- |
| - | Upsample × 4 |
| - | DWConv (dilation 4) 3/1, f |
| Conv2d 1/1, − | Conv2d 1/1, − |
| add, f | |

Table 3: Feature fusion unit of ContextNet.

We use the fusion unit shown in Table 3 to merge the features from both branches. Since runtime is a concern, we use feature addition instead of concatenation. Finally, we use a simple convolution layer for the final soft-max based classification results.
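The sketch below assembles a fusion unit along the lines of Table 3 and wires the two branches together, reusing detail_branch and context_branch from the previous sketches. The bilinear interpolation, the 3×3 kernel of the dilated depth-wise convolution, the 128-channel projections and the average-pooling stand-in for image down-sampling are our assumptions, not details confirmed above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# reuses detail_branch and context_branch from the sketches above

def fusion_unit(detail_features, context_features, num_classes=19):
    """Fusion along the lines of Table 3: Branch-4 features are upsampled by
    4, passed through a dilated depth-wise convolution and a linear 1x1
    convolution; Branch-1 features go through a linear 1x1 convolution; both
    are added and followed by the non-linearity."""
    c = layers.UpSampling2D(size=4, interpolation='bilinear')(context_features)
    c = layers.DepthwiseConv2D(3, dilation_rate=4, padding='same', use_bias=False)(c)
    c = layers.BatchNormalization()(c)
    c = layers.ReLU(max_value=6.0)(c)
    c = layers.Conv2D(128, 1, use_bias=False)(c)
    c = layers.BatchNormalization()(c)
    d = layers.Conv2D(128, 1, use_bias=False)(detail_features)
    d = layers.BatchNormalization()(d)
    x = layers.ReLU(max_value=6.0)(layers.Add()([c, d]))
    # simple convolution layer for per-class scores (soft-max applied in the loss)
    return layers.Conv2D(num_classes, 1)(x)


def contextnet(input_shape=(1024, 2048, 3), num_classes=19):
    image = tf.keras.Input(shape=input_shape)
    # 4x average pooling stands in for the image down-sampling of Branch-4
    small = layers.AveragePooling2D(pool_size=4)(image)
    logits = fusion_unit(detail_branch(image), context_branch(small), num_classes)
    # upsample the 1/8-resolution logits back to the input size (assumed bilinear)
    logits = layers.UpSampling2D(size=8, interpolation='bilinear')(logits)
    return tf.keras.Model(image, logits)
```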

2.2.3 Design Choices

We conducted several initial experiments before deciding on the final ContextNet model, using the training and validation sets of the Cityscapes dataset. We found that the use of a pyramid pooling module [Zhao et al.(2017b)Zhao, Shi, Qi, Wang, and Jia] after the low-resolution branch increases accuracy. Also, learning global context from down-sampled input is more efficient than learning it with asymmetric convolutions (e.g. an n×1 kernel followed by a 1×n kernel) on higher-resolution inputs. The class weight balancing technique did not help once we increased the batch size to 16.

Empirically, we found that an auxiliary loss for the low-resolution branch is beneficial. We believe the auxiliary loss ensures that meaningful features for semantic segmentation are extracted by the global context branch, and that they are learned independently from the other branch. Following [Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille, Zhao et al.(2017b)Zhao, Shi, Qi, Wang, and Jia], a cross-entropy loss is employed as the auxiliary and final loss of ContextNet.
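A minimal sketch of such a combined objective; the auxiliary weight of 0.4 is a commonly used value, not one stated above.

```python
import tensorflow as tf

def total_loss(labels, final_logits, aux_logits, aux_weight=0.4):
    """Cross-entropy on the fused output plus a weighted auxiliary
    cross-entropy on the context-branch output; both logit tensors are
    assumed to be upsampled to the label resolution beforehand."""
    ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    return ce(labels, final_logits) + aux_weight * ce(labels, aux_logits)
```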

3 Experiments

In our evaluation, we present a detailed ablation study of ContextNet on the validation set of the Cityscapes dataset [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele], and report its performance on the Cityscapes benchmark.

3.1 Implementation Details

All our experiments are performed on a workstation with Nvidia Titan X (Maxwell, 3072 CUDA cores), CUDA 8.0 and cuDNN V6. We use ReLU6 as nonlinearity function due to its robustness when used with low-precision computations [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam]. During training, batch normalization is used at all layers and dropout is used before soft-max layers only. During inference, parameters of batch normalization are merged with the weights and bias of parent layers. In the depth-wise convolution layers, we found that regularization is not necessary, which is consistent with the findings in [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam]. Since labelled training data is limited, we apply standard data augmentation techniques in all experiments: random scale 0.5 to 2, horizontal flip, varied hue, saturation, brightness and contrast.
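The batch-normalization merge mentioned above is a standard transformation; a small numpy sketch (with TensorFlow weight layout and a hypothetical helper name) of how the statistics fold into the parent layer's weights and bias:

```python
import numpy as np

def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-3):
    """Merge batch-norm statistics into the preceding convolution:
    gamma * (conv(x) + bias - mean) / sqrt(var + eps) + beta is equivalent to
    a convolution with per-output-channel scaled weights and shifted bias.
    Weights are assumed in TensorFlow layout (kh, kw, c_in, c_out), so the
    per-channel scale broadcasts over the last axis."""
    scale = gamma / np.sqrt(var + eps)
    return weights * scale, (bias - mean) * scale + beta
```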

The models of ContextNet are trained with TensorFlow [Abadi and et. al.(2015)] using RMSprop [Tieleman and Hinton(2012)] with weight decay 0.9, momentum 0.9 and epsilon 1. Additionally, we apply a poly learning rate [Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille] with base rate 0.045 and power 0.98. The maximum number of epochs is set to 1000, as no pre-training is used.
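For reference, the poly schedule computes the learning rate as base_rate · (1 − iteration / max_iterations)^power; a one-line sketch with the stated hyper-parameters (0.045 and 0.98):

```python
def poly_learning_rate(iteration, max_iterations, base_rate=0.045, power=0.98):
    """Poly schedule: decays the learning rate from base_rate towards zero
    as training progresses."""
    return base_rate * (1.0 - iteration / float(max_iterations)) ** power
```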

Results are reported as mean intersection-over-union (mIoU) [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele]. Runtime is measured with a single CPU thread and sequential CPU-to-GPU memory transfer, kernel execution, and GPU-to-CPU memory transfer.
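As a reminder of the metric, a small numpy sketch of how mIoU is typically computed from a per-class confusion matrix (illustrative only; not the evaluation code used for the reported numbers):

```python
import numpy as np

def mean_iou(confusion):
    """Mean intersection-over-union from a (num_classes, num_classes)
    confusion matrix with ground-truth rows and prediction columns,
    accumulated over the whole evaluation set."""
    tp = np.diag(confusion).astype(np.float64)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - tp
    valid = union > 0  # ignore classes absent from both prediction and ground truth
    return float((tp[valid] / union[valid]).mean())
```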

3.2 Cityscapes Dataset

Cityscapes is a large-scale dataset for semantic segmentation that contains a diverse set of images in street scenes from 50 different cities in Germany [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele]. In total, it consists of 25,000 annotated images of which 5,000 have labels at high pixel accuracy and 20,000 are weakly annotated. In our experiments we only use the 5,000 images with high label quality: a training set of 2,975 images, validation set of 500 images and 1,525 test images which can be evaluated on the Cityscapes evaluation server. No pre-training is employed.

3.2.1 Ablation Study

In our ablation study weights are learned solely from the Cityscapes training set, and we report the performance on the validation set. In particular, we present effects on different resolution factors of the low resolution branch, introducing multiple branches, and modifications on the number of parameters. Finally, we analyse the two branches in detail.

|  | cn18 | cn14 | cn12 | cn124 | cn14-500 | cn14-160 |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy (mIoU in %) | 60.1 | 65.9 | 68.7 | 67.3 | 62.1 | 57.7 |
| Frames per Second | 21.0 | 18.3 | 11.4 | 7.6 | 20.6 | 22.0 |
| Number of Parameters (in millions) | 0.85 | 0.85 | 0.85 | 0.93 | 0.49 | 0.16 |

Table 4: ContextNet (cn14) compared to versions with half-resolution (cn12) and eighth-resolution (cn18) input at the low-resolution branch, and a version with multiple branches at quarter, half and full resolution (cn124), on the Cityscapes validation set. Implementations with a smaller memory footprint are also shown (cn14-500 and cn14-160).

Input resolution: The input image resolution is the most critical factor for the computation time. Our low-resolution branch takes images at a quarter of the original size (256 × 512 for Cityscapes), denoted cn14. Alternatively, half (cn12) or one-eighth (cn18) resolution could be used. Table 4 shows how the different options affect the results. Overall, a larger resolution in the deeper context branch produces better segmentation results. However, these improvements come at the cost of running time.

Table 5 lists the IoU in more detail. As expected, accuracy on small-size classes (i.e. fence, pole, traffic light and traffic sign), classes with fine detail (i.e. person, rider, motorcycle and bicycle) and rare classes (i.e. bus, train and truck) benefits from increased resolution at the global context branch. Other classes are often of larger size and can therefore be captured at lower resolution. We conclude that cn14 is fairly balanced at 18.3 frames per second (fps) and 65.9% mIoU.

| Class IoU (%) | cn18 | cn14 | cn12 | cn124 |
| --- | --- | --- | --- | --- |
| road | 96.2 | 96.8 | 97.2 | 97.4 |
| sidewalk | 72.4 | 76.6 | 78.9 | 79.6 |
| building | 87.4 | 88.6 | 89.2 | 89.5 |
| wall | 45.9 | 46.4 | 47.2 | 44.1 |
| fence | 44.1 | 49.7 | 54.4 | 49.8 |
| pole | 32.1 | 38.0 | 39.5 | 45.5 |
| traffic light | 38.4 | 45.3 | 55.3 | 50.6 |
| traffic sign | 54.3 | 60.5 | 63.8 | 64.6 |
| vegetation | 87.9 | 89.0 | 89.4 | 90.2 |
| terrain | 54.1 | 59.3 | 59.8 | 59.4 |
| sky | 91.0 | 91.4 | 91.5 | 93.4 |
| person | 62.3 | 67.5 | 70.2 | 70.9 |
| rider | 36.1 | 41.7 | 47.5 | 43.1 |
| car | 88.3 | 90.0 | 91.1 | 91.8 |
| truck | 57.6 | 63.5 | 70.2 | 65.2 |
| bus | 59.1 | 71.7 | 76.2 | 71.9 |
| train | 34.2 | 57.1 | 63.7 | 64.5 |
| motorcycle | 41.2 | 41.5 | 51.36 | 41.95 |
| bicycle | 58.4 | 64.6 | 67.8 | 66.1 |

Table 5: Detailed IoU of ContextNet (cn14) compared to versions with half (cn12) and eighth (cn18) resolution, and its multi-level version (cn124). Small-size classes (fence, pole, traffic light, traffic sign), classes with fine detail (person, rider, motorcycle, bicycle), and classes with very few samples (truck, bus, train) benefit from high resolution.

Multiple Branches: We designed ContextNet with only two branches under runtime considerations. However, branches at multiple resolutions could be employed. In addition to the previous results, Table 4 and Table 5 also include a version of ContextNet with two shallow branches at half and full resolution (cn124). In comparison to cn14, cn124 improves accuracy by 1.4%, which confirms earlier results on multi-scale feature fusion [Lin et al.(2017)Lin, Milan, Shen, and Reid, Ghiasi and Fowlkes(2016), Zhao et al.(2017a)Zhao, Qi, Shen, Shi, and Jia]. Runtime, however, drops by more than half, from 18.3 fps to 7.6 fps. Furthermore, we note that cn12, which has a deep sub-network at half resolution, outperforms cn124 in terms of accuracy and speed. These results show that the number of layers has a positive effect on accuracy and a negative effect on runtime. We therefore conclude that a two-branch architecture is beneficial for ContextNet to run in real-time.

Number of Parameters: Apart from runtime, memory footprint is an important consideration for implementations on embedded devices. Table 4 includes two versions of ContextNet with a drastically reduced number of parameters, denoted cn14-500 and cn14-160. Surprisingly, ContextNet with only 159,387 and 490,459 parameters achieves 57.7% and 62.1% mIoU on the Cityscapes validation set, respectively. These results show that the ContextNet design is flexible enough to adapt to the computation time and memory requirements of a given system.

ContextNet vs Ensemble Nets: The global context branch (at one-fourth resolution) and the detail branch, trained independently, obtain 63.3% and 25.1% mIoU respectively. As expected, we find that the context branch does not perform well on small-size classes, and that the receptive field of the detail branch is not large enough for reasonable performance. An ensemble of both branches obtains 60.3% mIoU, which is 6.6% less than cn14 and 3% less than the context branch alone. This provides further evidence that the ContextNet architecture is better suited for multi-scale feature fusion.

Context Branch Analysis: We zero out the output of the detail branch and of the global context branch to observe their individual contributions. The results are shown in Figure 2. In the first column, we can see that the global context branch detects the larger objects, for example clouds and trees, correctly, but misses boundaries and thin regions. In contrast, the detail branch detects object boundaries correctly but misclassifies the centre regions of objects. (The detail branch does detect the centre regions of the trees correctly; we attribute this to the discriminative nature of the tree texture.) Finally, the last row shows that ContextNet correctly combines both kinds of information to produce improved results. Similarly, in the middle column the segmentation of the persons is refined by ContextNet over the global context branch output. In the last column, even though the detail branch detects poles and traffic signs, ContextNet fails to combine those effectively with the global context branch. Overall, we observe that the context branch and the detail branch learn complementary information.

Figure 2: Visual comparison on Cityscape validation set [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele]. First row: input RGB images; second row: ground truths; third row: context branch outputs; fourth row: detail branch outputs; and last row: ContextNet outputs using both context and detail branches.

3.2.2 Cityscapes Benchmark Results

We evaluate ContextNet on the withheld test set of Cityscapes [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele]. Table 6 shows the results in comparison to current state-of-the-art real-time segmentation networks (SegNet [Badrinarayanan et al.(2017)Badrinarayanan, Kendall, and Cipolla], ENet [Paszke et al.(2016)Paszke, Chaurasia, Kim, and Culurciello], ICNet [Zhao et al.(2017a)Zhao, Qi, Shen, Shi, and Jia] and ERFNet [Romera et al.(2018)Romera, Álvarez, Bergasa, and Arroyo]), and offline methods (PSPNet [Zhao et al.(2017b)Zhao, Shi, Qi, Wang, and Jia] and DeepLab-v2 [Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille]). Table 7 compares the runtime at full, half and quarter resolution on Cityscapes images. ContextNet achieves 64.2% mIoU before and 66.1% mIoU after pruning, and runs at 18.3 fps in a single CPU thread of TensorFlow [Abadi and et. al.(2015)]; the pruning method is explained below. ENet [Paszke et al.(2016)Paszke, Chaurasia, Kim, and Culurciello] tests at a similar speed but reaches only 58.3% accuracy. ICNet [Zhao et al.(2017a)Zhao, Qi, Shen, Shi, and Jia] and ERFNet [Romera et al.(2018)Romera, Álvarez, Bergasa, and Arroyo] achieve 69.5% and 68.0% mIoU respectively, but are considerably slower than ContextNet. (Although our implementation of ICNet [Zhao et al.(2017a)Zhao, Qi, Shen, Shi, and Jia] achieves similar accuracy, we do not achieve the timing reported by the authors; this might be caused by differences in software environment and the employed testing protocols.)

| Method | Class mIoU (in %) | Category mIoU (in %) | Parameters (in millions) |
| --- | --- | --- | --- |
| DeepLab-v2 [Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille]* | 70.4 | 86.4 | 44 |
| PSPNet [Zhao et al.(2017b)Zhao, Shi, Qi, Wang, and Jia]* | 78.4 | 90.6 | 65.7 |
| SegNet [Badrinarayanan et al.(2017)Badrinarayanan, Kendall, and Cipolla] | 56.1 | 79.8 | 29.46 |
| ENet [Paszke et al.(2016)Paszke, Chaurasia, Kim, and Culurciello] | 58.3 | 80.4 | 0.37 |
| ICNet [Zhao et al.(2017a)Zhao, Qi, Shen, Shi, and Jia]* | 69.5 | - | 6.68 |
| ERFNet [Romera et al.(2018)Romera, Álvarez, Bergasa, and Arroyo] | 68.0 | 86.5 | 2.1 |
| ContextNet (Ours) | 66.1 | 82.7 | 0.85 |

Table 6: Cityscapes benchmark results for the proposed ContextNet and similar networks. DeepLab-v2 [Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille] and PSPNet [Zhao et al.(2017b)Zhao, Shi, Qi, Wang, and Jia] are considered offline approaches. Runtime of the other methods is shown in Table 7. (Methods marked '*' are pre-trained on ImageNet [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei].)
| Method | 1024×2048 (fps) | 512×1024 (fps) | 256×512 (fps) |
| --- | --- | --- | --- |
| SegNet [Badrinarayanan et al.(2017)Badrinarayanan, Kendall, and Cipolla]* | 1.6 | - | - |
| ENet [Paszke et al.(2016)Paszke, Chaurasia, Kim, and Culurciello]* | 20.4 | 76.9 | 142.9 |
| ICNet [Zhao et al.(2017a)Zhao, Qi, Shen, Shi, and Jia] | 14.2 | 46.3 | 83.2 |
| ERFNet [Romera et al.(2018)Romera, Álvarez, Bergasa, and Arroyo]* | 11.2 | 41.7 | 125.0 |
| ContextNet (Ours) | 18.3 | 65.5 | 139.2 |
| ContextNet (Ours)† | 23.2 | 84.9 | 150.1 |

Table 7: Runtime in frames per second at full, half and quarter resolution on Nvidia Titan X (Maxwell, 3,072 CUDA cores) with TensorFlow [Abadi and et. al.(2015)], including sequential CPU/GPU memory transfer and kernel execution. (Results marked '*' are taken from the existing literature; it is not known whether memory transfer is considered. Our measurement marked '†' denotes kernel execution time alone.)

We emphasize that our runtime evaluation includes the complete CPU and GPU pipeline, including memory transfers. If parallel memory transfer and kernel execution are employed, our runtime improves to 23.2 fps. Finally, in Table 7 we observe that ContextNet scales well to smaller input resolutions, and can therefore be tuned for the task at hand and the available resources. The results of ContextNet are displayed in Figure 2 for qualitative analysis. ContextNet is able to segment even small objects at far distances adequately.

Network Pruning: Network pruning is usually employed to reduce network parameters [Li et al.(2017)Li, Kadav, Durdanovic, Samet, and Graf, Zhao et al.(2017a)Zhao, Qi, Shen, Shi, and Jia]. Instead, we use pruning of feature maps to increase the accuracy of our model.

First, we train our network with double the number of feature maps to improve results. We then decrease the parameters progressively by pruning to 1.5, 1.25 and 1 times the original size. Following [Li et al.(2017)Li, Kadav, Durdanovic, Samet, and Graf], we prune the filters with the lowest sum of absolute weights. Through pruning, we improve mIoU from 64.2% to 66.1% on the Cityscapes test set.
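A sketch of the per-layer filter ranking of [Li et al.(2017)Li, Kadav, Durdanovic, Samet, and Graf], keeping the filters with the largest sum of absolute kernel weights; the helper name and weight layout are assumptions, and the fine-tuning between pruning steps is omitted.

```python
import numpy as np

def prune_filters(weights, keep_fraction):
    """Rank the filters of one convolution layer by the sum of absolute
    kernel weights and keep the highest-ranked fraction.
    weights: array of shape (kh, kw, c_in, c_out), TensorFlow layout."""
    scores = np.abs(weights).sum(axis=(0, 1, 2))          # one score per filter
    n_keep = max(1, int(round(keep_fraction * weights.shape[-1])))
    keep = np.sort(np.argsort(scores)[-n_keep:])          # indices of kept filters
    return weights[..., keep], keep

# e.g. pruning from 2x to 1.5x feature maps corresponds to keep_fraction = 0.75
```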

4 Conclusions and Future Work

In this work we proposed a real-time semantic segmentation model called ContextNet, which extensively leverages depth-wise convolution and bottleneck residual blocks for memory and run-time efficiency.

Our ablation study shows that ContextNet effectively combines global and local context to achieve competitive results and outperforms other state-of-the-art real-time methods on high/full resolution images. We also show empirically that model pruning (in order to reach given targets in terms of network size and real-time performance) leads to improved segmentation accuracy. Demonstrating that the ContextNet architecture is beneficial for other tasks relevant to autonomous systems, such as single-image depth prediction, is part of future work.

References

  • [Abadi and et. al.(2015)] M. Abadi and et. al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/.
  • [Badrinarayanan et al.(2017)Badrinarayanan, Kendall, and Cipolla] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. TPAMI, 2017.
  • [Brostow et al.(2009)Brostow, Fauqueur, and Cipolla] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recogn. Lett., 30(2), 2009.
  • [Burt and Adelson(1987)] P. J. Burt and E. H. Adelson. The Laplacian Pyramid as a Compact Image Code. In Readings in Computer Vision, pages 671–679. 1987.
  • [Chen et al.(2016)Chen, Papandreou, Kokkinos, Murphy, and Yuille] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv:1606.00915 [cs], 2016.
  • [Chollet(2016)] F. Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv:1610.02357 [cs], 2016.
  • [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [Ghiasi and Fowlkes(2016)] G. Ghiasi and C. C. Fowlkes. Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation. arXiv:1605.02264 [cs], May 2016.
  • [Han et al.(2016)Han, Mao, and Dally] S. Han, H. Mao, and W. J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR, 2016.
  • [He et al.(2015)He, Zhang, Ren, and Sun] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs], 2015.
  • [Howard et al.(2017)Howard, Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, and Adam] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861 [cs], 2017.
  • [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks. In NIPS. 2016.
  • [Li et al.(2017)Li, Kadav, Durdanovic, Samet, and Graf] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning Filters for Efficient ConvNets. In ICLR, 2017.
  • [Lin et al.(2017)Lin, Milan, Shen, and Reid] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation. In CVPR, 2017.
  • [Menze and Geiger(2015)] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [Paszke et al.(2016)Paszke, Chaurasia, Kim, and Culurciello] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv:1606.02147 [cs], 2016.
  • [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In ECCV, 2016.
  • [Romera et al.(2018)Romera, Álvarez, Bergasa, and Arroyo] E. Romera, J. M. Álvarez, L. M. Bergasa, and R. Arroyo. ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation. IEEE Transactions on Intelligent Transportation Systems, 2018.
  • [Ronneberger et al.(2015)Ronneberger, Fischer, and Brox] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI, 2015.
  • [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
  • [Sandler et al.(2018)Sandler, Howard, Zhu, Zhmoginov, and Chen] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. arXiv:1801.04381 [cs], 2018.
  • [Shelhamer et al.(2016)Shelhamer, Long, and Darrell] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. PAMI, 2016.
  • [Szegedy et al.(2016)Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. In CVPR, 2016.
  • [Tieleman and Hinton(2012)] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop, coursera: Neural networks for machine learning. University of Toronto, Technical Report, 2012.
  • [Wu et al.(2018)Wu, Li, Chen, and Shi] S. Wu, G. Li, F. Chen, and L. Shi. Training and Inference with Integers in Deep Neural Networks. In ICLR, 2018.
  • [Zhao et al.(2017a)Zhao, Qi, Shen, Shi, and Jia] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. ICNet for Real-Time Semantic Segmentation on High-Resolution Images. arXiv:1704.08545 [cs], 2017a.
  • [Zhao et al.(2017b)Zhao, Shi, Qi, Wang, and Jia] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid Scene Parsing Network. In CVPR, 2017b.