Combining the Best of Convolutional Layers and Recurrent Layers: A Hybrid Network for Semantic Segmentation
State-of-the-art results of semantic segmentation are established by Fully Convolutional neural Networks (FCNs). FCNs rely on cascaded convolutional and pooling layers to gradually enlarge the receptive fields of neurons, resulting in an indirect way of modeling the distant contextual dependence. In this work, we advocate the use of spatially recurrent layers (i.e. ReNet layers) which directly capture global contexts and lead to improved feature representations. We demonstrate the effectiveness of ReNet layers by building a Naive deep ReNet (N-ReNet), which achieves competitive performance on Stanford Background dataset. Furthermore, we integrate ReNet layers with FCNs, and develop a novel Hybrid deep ReNet (H-ReNet). It enjoys a few remarkable properties, including full-image receptive fields, end-to-end training, and efficient network execution. On the PASCAL VOC 2012 benchmark, the H-ReNet improves the results of state-of-the-art approaches Piecewise , CRFasRNN  and DeepParsing  by , and , respectively, and achieves the highest IoUs for 13 out of the 20 object classes.¹
¹ A part of this work was done when Zhicheng Yan interned at Google Brain.
Keywords: semantic segmentation, CNN, RNN, CRF
Convolutional Neural Networks (CNNs) have achieved notable successes in a variety of visual recognition tasks, such as image classification [4, 5] and object detection . By replacing fully-connected layers with convolutional layers, classification CNNs can be effortlessly transformed into Fully Convolutional Networks (FCNs) , which take an input image of arbitrary size and predict a semantic label map. Though FCNs have achieved state-of-the-art results in semantic segmentation tasks [2, 3], they suffer from a few limitations in modeling distant contextual regions. Specifically, the receptive field of a neuron in a convolutional layer of an FCN usually corresponds to a local area of the input image. However, in semantic segmentation, contextual evidence from distant areas of the image is usually crucial for reasoning and prediction. For example, when labeling the middle area of an image, seeing the pattern of the sea at the top of the image and the pattern of a hill at the bottom increases our confidence in making the correct prediction “beach”, yet the limited size of the local receptive fields prevents the FCN from capturing such spatially long-range dependence across different local areas. Although one can artificially adjust the size of the receptive field to cover the entire image, this implicit modeling is usually an ineffective way of encoding long-range context. How long-range dependences propagate across filters, and whether they complement each other to help the reasoning, remains unclear. These limitations drive us to seek a more explicit modeling of global context and long-range dependence.
Recurrent Neural Networks (RNNs) have demonstrated strong capabilities of modeling long-term contextual dependence in speech recognition and language understanding [8, 9]. In this paper, we introduce the spatially recurrent layer (i.e. ReNet layer) to address the aforementioned limitations. In a ReNet layer, RNNs sweep vertically and horizontally across the image. With gating and memory units to adaptively forget, memorize and expose the memory contents at each running step, the RNN directly propagates spatially long-range information throughout its hidden units, and generates image representations that better capture global context. We stack ReNet layers to build a naive deep spatially recurrent network (N-ReNet), and achieve performance comparable to state-of-the-art approaches. We visualize the intermediate feature maps produced by N-ReNet, and observe a form of hierarchical feature representations, similar to those generated by deep CNNs. To boost the performance, we further integrate ReNet layers with existing FCNs to form a deep hybrid network (H-ReNet), in which the convolutional and pooling layers extract local features, and the recurrent layers perform spatial long-range information propagation. The H-ReNet exhibits a few favorable properties. First, by employing recurrent layers to facilitate the long-range dependence propagation, H-ReNet directly supports full-image receptive fields. Second, incorporating ReNet layers to capture global context improves the learned feature representations, with compelling performance on both region recognition and boundary localization. Third, H-ReNet is end-to-end trainable with efficient forward and backward executions. The computations in its recurrent layers can be easily parallelized, which enables H-ReNet to fully exploit the computational power of modern GPUs (in contrast to the graphical models used in FCN-based models).
Overall, recurrent layers substantially improve the results at a negligible increase of computational costs.
To conclude, the contributions of our work are two-fold. First, we introduce spatially recurrent layers (i.e. ReNet layers) for semantic segmentation, and show that by simply stacking ReNet layers, the resulting N-ReNet achieves competitive performance on the Stanford Background dataset. Second, we construct a hybrid network (i.e. H-ReNet) by appending recurrent layers on top of FCNs. We extensively evaluate the H-ReNet on the PASCAL VOC 2012 benchmark with both internal ablation studies and external comparisons. We show that H-ReNet yields improved feature representations over FCNs and achieves state-of-the-art results.
2 Related Work
Nonparametric Methods. Nonparametric methods have achieved remarkable performance in semantic segmentation [10, 11, 12, 13, 14]. The core idea is retrieving similar patches from a database of fully annotated images, and transferring the labels from the annotated images to the query image. Specifically, the query image is matched against the annotated database using both holistic image representations as well as superpixels. Probabilistic graphical models (e.g. MRF, CRF) are then introduced to model the semantic context and obtain a spatially coherent semantic label map [15, 16, 17, 18]. Nonparametric methods divide the segmentation task into individual steps. Each step requires a careful design, and the entire process is not amenable to joint optimization.
Parametric Methods. Parametric methods have been dominated by FCN-based models, which can be classified into two lines. In the first line, the FCN takes input of bounding boxes which encompass image regions with high objectness , and outputs a segmentation mask for each bounding box. The final segmentation map is obtained by merging individual ones for the bounding boxes [20, 21, 22, 23, 24, 25]. By contrast, in the second line, the whole image is directly fed into the segmentation net, and a complete segmentation mask is generated at once . Due to the pooling layers of CNNs, the output mask is usually not sufficiently sharp, and region boundaries are not clearly localized. An additional graphical model layer (e.g. MRFs and CRFs) is thus introduced to capture pixel interactions and respect region boundaries. The graphical model can either be applied as a separate post-processing step  or be plugged into a deep neural net with joint optimization [2, 28], both at a high cost of extra computations. Besides FCN-based methods, Mostajabi et al.  propose to label the superpixels using zoom-out features, which include pixel-level, region-level and global features extracted from a deep neural network.
RNNs for Visual Recognition. Exploiting recurrent neural networks for visual recognition is an active field of research. Byeon et al. propose a cascaded structure consisting of alternating 2D Long Short Term Memory (LSTM) and convolutional layers, and report comparable results to state-of-the-art on both Stanford Background and SiftFlow datasets . Bell et al. develop the IRNN layer for object detection to generate features that are not limited to the bounding box of an object proposal . The recently proposed ReNet architecture is a scalable alternative to CNNs for image recognition . We build our models upon ReNet layers, to capture the global contexts as well as enjoy its property of efficient parallelization. In contrast to the IRNN layer where a naive ReLU RNN is implemented, we employ sophisticated LSTM with various gating units to adaptively forget, memorize and expose the memory contents. Empirically, we observe better performance by the ReNet LSTM layer for our task.
3 Spatially Recurrent Layer Group
We first introduce the spatially recurrent layer, which is a novel component of our work. Following the naming convention in , we refer to a spatially recurrent layer as a ReNet layer. Specifically, a ReNet layer receives either an input image or an input feature map of size , and divides it into a grid of patches with patch size where and . It has two 1D RNNs with independent weights sweeping across the grid vertically (or horizontally) in opposite directions, one forward and the other backward. We choose the LSTM unit described in  as our basic RNN implementation because it mitigates the vanishing gradient problem, but note that other RNN variants, such as Gated Recurrent Units , might also be suitable. A 1D LSTM unit includes the forget gate , the input gate , the cell input , the cell memory , the output gate and the hidden state , which enable the LSTM unit to adaptively forget, memorize and expose its memory content at each running step .
Formally, the ReNet layer takes a 2D map as input, sweeps across the grids in opposite directions, and updates the cell memory and the hidden state of its LSTM units at pixel positions as
where we use the superscripts and to denote forward and backward directions. As the scanning of the two LSTMs is independent of each other, we can easily parallelize their computations for additional speedup. By concatenating hidden states and , each of which has hidden units, we can obtain a composite feature map of size , with a receptive field comprised of all the patches within the same column. By stacking two ReNet layers with orthogonal sweeping directions (e.g. horizontal and vertical), we can obtain an output feature map fully covering the input image, as shown in Figure 1. We refer to two ReNet layers with orthogonal sweeping directions as a recurrent layer group.
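The bidirectional sweep and the layer group described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the Caffe implementation used in this work: every patch is taken to be a single pixel, the LSTM uses the standard gate equations without peephole connections, and the names `init_lstm`, `lstm_sweep`, `renet_layer` and `renet_group` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(in_dim, hid_dim, rng, scale=0.1):
    """Hypothetical parameter container: one stacked weight matrix for all four gates."""
    return {
        "W": rng.uniform(-scale, scale, (in_dim + hid_dim, 4 * hid_dim)),
        "b": np.zeros(4 * hid_dim),
    }

def lstm_sweep(seq, params):
    """Run a 1D LSTM along axis 0 of seq: (T, B, in_dim) -> (T, B, hid_dim)."""
    T, B, _ = seq.shape
    d = params["b"].size // 4
    h = np.zeros((B, d))
    c = np.zeros((B, d))
    out = np.empty((T, B, d))
    for t in range(T):
        z = np.concatenate([seq[t], h], axis=1) @ params["W"] + params["b"]
        f = sigmoid(z[:, :d])            # forget gate
        i = sigmoid(z[:, d:2 * d])       # input gate
        g = np.tanh(z[:, 2 * d:3 * d])   # cell input
        o = sigmoid(z[:, 3 * d:])        # output gate
        c = f * c + i * g                # cell memory
        h = o * np.tanh(c)               # hidden state
        out[t] = h
    return out

def renet_layer(x, fwd, bwd, axis):
    """Bidirectional sweep over an (H, W, C) map; axis 0 is vertical, axis 1 horizontal."""
    seq = x if axis == 0 else x.transpose(1, 0, 2)   # put the sweep axis first
    h_fwd = lstm_sweep(seq, fwd)
    h_bwd = lstm_sweep(seq[::-1], bwd)[::-1]         # sweep in the opposite direction
    out = np.concatenate([h_fwd, h_bwd], axis=2)     # composite map with 2 * hid_dim channels
    return out if axis == 0 else out.transpose(1, 0, 2)

def renet_group(x, params_v, params_h):
    """Recurrent layer group: a vertical ReNet layer followed by a horizontal one."""
    y = renet_layer(x, *params_v, axis=0)
    return renet_layer(y, *params_h, axis=1)
```

In this sketch a single vertical layer only mixes information within a column, while the full group lets every output position depend on every input position, which is the full-image receptive field discussed above.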
We construct a Naive deep ReNet (N-ReNet) by stacking multiple layer groups on top of each other. The first group takes raw pixels as input and the output of the last group is passed through a softmax layer to produce dense predictions. To align the channel number of output feature maps with the number of semantic labels, we append an auxiliary convolutional layer on top of the last ReNet layer, with the number of convolutional kernels identical to the number of labels.
Similarly, we can further integrate the ReNet layers with FCNs by appending a recurrent layer group at the end of a pretrained FCN, such as VGG-16 net  and GoogLeNet  pretrained on ImageNet for image classification. It enables us to exploit both the advantages of FCNs in capturing local context and the capability of ReNet layers in modeling distant and global context. We refer to this hybrid network as an H-ReNet. Both N-ReNet and H-ReNet will be evaluated in our experiments.
We first train an N-ReNet from scratch and evaluate it on the Stanford Background dataset. We compare the N-ReNet to other competing methods that also do not use pretrained models. Our focus in this part is not establishing new state-of-the-art results. Instead, we aim to demonstrate the effectiveness of the proposed recurrent layer group. After confirming the capability of the recurrent layer group, we then conduct experiments for the H-ReNet, where ReNet layers are appended at the end of a pretrained FCN. We evaluate the H-ReNet on the PASCAL VOC 2012 (VOC12) dataset.
We implement our models based on Caffe . All experiments are conducted on a single NVIDIA K40c GPU. In particular, the ReNet LSTM layer has been efficiently implemented on GPU. The vertical and horizontal sweepings in a recurrent layer group are parallelized to improve efficiency.
Training. Both N-ReNet and H-ReNet require a minimal input image size of . Reflection padding is applied if an input image is smaller than . At the training stage, randomly cropped patches of size with random horizontal flipping are fed into the model in mini-batches of size 10. We set to be and on the Stanford Background and VOC12 datasets, respectively.
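The preprocessing described above (reflection padding, random cropping and random horizontal flipping) can be sketched as follows. The crop size is passed in as a parameter rather than fixed, padding on the bottom and right edges is an assumption, and padding the label map by reflection as well is also an assumption; the function names are hypothetical.

```python
import numpy as np

def reflect_pad_to(img, min_size):
    """Reflection-pad an (H, W, C) array on the bottom/right so that H, W >= min_size."""
    h, w = img.shape[:2]
    pad_h = max(0, min_size - h)
    pad_w = max(0, min_size - w)
    return np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)), mode="reflect")

def random_crop_flip(img, label, crop, rng):
    """Return a random crop x crop patch of the image and its label map, flipped half the time."""
    img = reflect_pad_to(img, crop)
    label = reflect_pad_to(label[..., None], crop)[..., 0]   # assumption: labels padded the same way
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    img = img[top:top + crop, left:left + crop]
    label = label[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                                    # random horizontal flip
        img, label = img[:, ::-1], label[:, ::-1]
    return img, label
```

Applying the same padding, crop window and flip to the image and the label map keeps the two pixel-aligned.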
Since ReNet layers are differentiable, both N-ReNet and H-ReNet are end-to-end trainable by stochastic gradient descent. Specifically, we employ the pixelwise multinomial cross entropy  with equal weights for all semantic labels as the loss function, and train the model parameters by standard backpropagation.
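For concreteness, the pixelwise multinomial cross entropy with equal class weights can be written as below. This is a minimal NumPy sketch of the loss value only (no gradient), and the function name and array layout are assumptions made for illustration.

```python
import numpy as np

def pixelwise_cross_entropy(scores, labels):
    """Mean cross entropy over all pixels.

    scores: (H, W, K) unnormalized class scores; labels: (H, W) integers in [0, K).
    """
    shifted = scores - scores.max(axis=-1, keepdims=True)            # numerical stability
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    picked = log_prob[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -picked.mean()                                            # every pixel and class weighted equally
```

With uniform scores over K classes the loss equals log K, which is a convenient sanity check at the start of training.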
Testing. At the testing stage, the network can take an image at its original size, as both convolutional layers and ReNet LSTM layers can handle inputs of variable size. It produces dense predictions at the original resolution of the test image.
4.1 Evaluations of N-ReNet
We evaluate the N-ReNet on Stanford Background dataset, which contains 715 images of outdoor scenes with 8 labels. We randomly and evenly divide the images into 5 sets, and report 5-fold cross validation results.
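The even 5-fold split can be sketched as follows; the random seed and the function name are assumptions, and only the image indices are handled here.

```python
import numpy as np

def five_fold_split(n_images=715, n_folds=5, seed=0):
    """Randomly and evenly partition image indices into n_folds disjoint sets."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_images), n_folds)

folds = five_fold_split()
for k, test_idx in enumerate(folds):
    # Train on the other four folds, evaluate on fold k.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
```

With 715 images, each fold holds 143 images, so every round trains on 572 images and tests on 143.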
We first configure an N-ReNet consisting of 3 ReNet layer groups with increasing numbers of neurons, as shown in Figure 2. The first layer group sweeps over patches while all other groups scan patches (i.e. pixels), so that the spatial resolution is reduced by a factor of 4. An auxiliary convolutional layer with 8 kernels is placed on top of the ReNet layers to align the number of feature maps with the number of semantic labels. Finally, an upsampling layer is appended to restore the label map to the original spatial resolution via bilinear interpolation. Notably, neurons in the last convolutional layer have a receptive field covering the entire input image.
In the rest of this section, we first show that the N-ReNet learns hierarchical feature representations similar to those learned by deep CNNs. Then we demonstrate that the N-ReNet achieves comparable accuracy on the Stanford Background dataset with outstanding running efficiency.
4.1.1 Hierarchical Feature Representations
Deep CNNs tend to learn hierarchical feature representations . It is intriguing to find out whether the N-ReNet has similar capabilities. In Figure 2, we visualize the first 4 feature maps from both an early ReNet layer, renet1(V), and a later ReNet layer, renet3(H). In its early layer, the N-ReNet extracts low-level features that still preserve fine-scale image details (e.g. windows and doors of the buildings). As such details become less important for semantic region inference, the N-ReNet learns to gradually smooth them out and transform them into high-level discriminative features in its deep layer, resulting in a form of hierarchical feature representations.
4.1.2 Results on Stanford Background
Our N-ReNet, which is comprised of three recurrent layer groups, achieves competitive results on Stanford Background, as shown in Table 1. We compare the N-ReNet to other methods from three classes.
First, the N-ReNet improves the pixel accuracy and class accuracy of all nonparametric methods [13, 11, 14] by at least and while executing at least faster with an efficient GPU implementation. Second, compared to those built on top of CNNs, the N-ReNet outperforms the methods in [38, 26, 39]. Recursive context propagation  achieves higher labeling accuracy than ours. It employs multiple CNNs with tied weights to extract multi-scale local features, while the N-ReNet only uses single-scale features. Notably, it relies on superpixels to reduce the computational cost during context propagation, yet the superpixel segmentation turns out to be its computational bottleneck, accounting for second out of a total of second. The zoom-out approach in  achieves the highest pixel accuracy by combining both hand-crafted and CNN-based superpixel features. Although the computational cost is not reported in , this method is likely to be computationally expensive as it involves superpixel segmentation, heavy feature extraction (e.g. pixel-level, region-level and global features) and a multi-layer perceptron. Overall, CNN-based methods run at least slower than our N-ReNet.
Third, compared with another network built on the basis of 2D LSTMs , N-ReNet improves the pixel and class accuracy by and , respectively. The 2D LSTM in  follows the sequential scan-line order and cannot be easily accelerated on GPU. By contrast, with parallel 1D LSTM computations, the N-ReNet runs more than faster. Overall, our evaluations confirm the effectiveness of the ReNet layer as a novel alternative to FCNs for semantic segmentation.
|Method||Pixel acc., class acc. (%)||Time in seconds (device)|
|Nonparametric parsing||74.1, 62.2||20 (CPU)|
|Superparsing||77.5, N/A||4.0 (CPU)|
|Nonparametric parsing II||75.3, 66.5||16.6 (CPU)|
|Single-scale ConvNet||66.0, 56.5||0.35 (CPU)|
|Multi-scale ConvNet||78.8, 72.4||0.6 (CPU)|
|Recurrent CNN||76.2, 67.2||1.1 (CPU)|
|Multi-CNN + rCPN Fast||80.9, 78.8||0.37 (GPU)|
|Augmented CNN||76.4, 68.5||N/A|
|Deep 2D LSTM||78.6, 68.8||1.3 (CPU)|
4.2 Evaluations of H-ReNet
Given the evidence that the ReNet layer alone is effective for inferring semantic regions, we are interested in answering a more ambitious question: can we improve the feature representations of FCNs by adding a ReNet layer group? To answer it, we build an H-ReNet consisting of convolutional, pooling and ReNet layers, and evaluate it on VOC12. We use the augmented dataset from  with , and images in the training set, validation set and test set, respectively. We evaluate on the validation set when only the training set is used for training. We also report results on the test set when both the training set and the validation set are used for training. We follow the comp6 evaluation protocol and employ mean Intersection over Union (IoU) as our evaluation metric .
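As a reminder of how the metric is computed, the sketch below accumulates a confusion matrix and derives the mean IoU from it. It is an illustration rather than the official evaluation code; the function names are hypothetical, and treating label 255 as an ignored "void" label is an assumption based on the usual VOC annotation convention.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_label=255):
    """conf[i, j] counts pixels with ground-truth class i predicted as class j."""
    mask = gt != ignore_label                      # assumption: 255 marks void pixels
    idx = gt[mask].astype(np.int64) * num_classes + pred[mask]
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf):
    """Average of per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    valid = union > 0
    return (tp[valid] / union[valid]).mean()
```

In practice the confusion matrices of all images are summed before `mean_iou` is applied, so that the IoU is computed over the whole evaluation set rather than per image.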
We also consider the setting that uses extra annotations from Microsoft COCO dataset . This dataset contains images in semantic labels, among which images have overlapped labels with VOC12 and more than pixels annotated by these labels in the image.
In the following we introduce the model structures we use for the experiments, including a baseline FCN architecture adapted from pretrained classification CNNs, and our H-ReNet architecture with multiple variants.
4.2.1 Baseline FCN Architecture
We follow the practice in  and adapt the pretrained VGG-16 network into a baseline FCN with the architecture shown in Table 2. Specifically, we reuse the first 10 layers of VGG-16, which include 5 groups of convolutional layers (conv1-conv5) and 5 max-pooling layers (pool1-pool5). Each max-pooling layer halves the size of the feature map. Overall, the size of the feature map is reduced by a factor of 32. Although reducing the resolution of feature maps may not compromise the accuracy in image classification tasks, high-resolution feature maps are important in semantic segmentation to reconstruct accurate region boundaries in the final semantic map . Therefore, we reduce the strides of the layers pool4 and pool5 from 2 to 1, so that the overall downsampling factor decreases from 32 to 8. This technique is known as the hole algorithm in  and as dilated convolution in .
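The downsampling arithmetic above can be checked in a few lines. The dictionary below simply records the pooling strides named in the text; the variable and function names are hypothetical.

```python
from math import prod

# Stride of each max-pooling layer in the original VGG-16.
vgg_pool_strides = {"pool1": 2, "pool2": 2, "pool3": 2, "pool4": 2, "pool5": 2}

def downsampling_factor(strides):
    """Overall reduction in feature-map size: the product of all layer strides."""
    return prod(strides.values())

# Baseline FCN: pool4 and pool5 use stride 1 instead of 2.
fcn_pool_strides = dict(vgg_pool_strides, pool4=1, pool5=1)

print(downsampling_factor(vgg_pool_strides))   # 32
print(downsampling_factor(fcn_pool_strides))   # 8
```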
|layer ID||baseline FCN conv1-conv7||renet1||conv8||upsample|
We further append 3 convolutional layers (conv6, conv7 and conv8) on top of the last max-pooling layer pool5. They play roles similar to those of the three fully-connected layers in VGG-16 net. To increase the size of receptive fields, we dilate the convolutional kernels in layer conv6 to have a stride of 12 , which leads to a receptive field of with padding. Finally, the spatial resolution is restored by upsampling the feature maps from layer conv8 using bilinear interpolation. Note our baseline FCN is akin to the DeepLab-LargeFOV in , but achieves higher accuracy ( vs. ).
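To illustrate how dilation enlarges the area a kernel covers without adding parameters, the sketch below lists the sampling offsets of a dilated kernel along one axis and its extent on the layer's input grid. The kernel size of 3 used in the example is an assumption, since the kernel size of conv6 is not stated here, and the function names are hypothetical.

```python
def tap_offsets(kernel_size, dilation):
    """Input offsets sampled by an odd-sized dilated kernel along one axis."""
    half = kernel_size // 2
    return [dilation * (k - half) for k in range(kernel_size)]

def effective_extent(kernel_size, dilation):
    """Span covered on the layer's input grid: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

# Assumed 3-tap kernel with a dilation (input stride) of 12.
print(tap_offsets(3, 12))        # [-12, 0, 12]
print(effective_extent(3, 12))   # 25
```

The extent is measured on the feature map that feeds the layer; the receptive field on the input image is larger, because every position of that feature map already aggregates a window of the image.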
4.2.2 H-ReNet Architecture
The architecture of the H-ReNet is derived by inserting a recurrent layer group into the baseline FCN. An example of an H-ReNet is shown in Table 3. We insert one ReNet layer group renet1, which includes two ReNet layers with orthogonal scanning directions, between layers conv7 and conv8. Compared to the baseline FCN where the input to conv8 is the output of another convolutional layer conv7, in the H-ReNet, the input to conv8 is from a ReNet layer group. This provides an ideal testbed for us to verify the effectiveness of the ReNet layer group.
4.2.3 Multi-layer Feature Combination
Combining features from multiple layers yields more discriminative features for semantic segmentation and object detection [22, 30]. We therefore concatenate feature maps from pool4, pool5 and conv7, as shown in Figure 3. As the magnitude of the feature values from different layers may vary substantially, we further normalize the feature maps before the concatenation in case the final combination is dominated by one or two feature maps. We experiment with both L2 normalization and batch normalization . Assume that the size of the feature map is where and denote minibatch size, channel number, height and width, respectively. For L2 normalization, we normalize each map of size to have unit L2 norm and then uniformly scale it by a constant , chosen by a grid search from on a held-out set. For batch normalization, we collect a number of samples and normalize them to have zero mean and unit variance.
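The two normalization options and the channel-wise concatenation can be sketched as follows. Several details are assumptions made for illustration: feature maps are laid out as (N, C, H, W), the L2 norm is taken over the spatial axes of each channel map, the batch normalization shown has no learned scale or shift, and the function names are hypothetical.

```python
import numpy as np

def l2_normalize(feat, scale, axes=(2, 3), eps=1e-12):
    """Give each map unit L2 norm over `axes`, then multiply by a constant scale."""
    norm = np.sqrt((feat ** 2).sum(axis=axes, keepdims=True))
    return scale * feat / (norm + eps)

def batch_normalize(feat, eps=1e-5):
    """Zero mean and unit variance per channel, computed over the mini-batch."""
    mean = feat.mean(axis=(0, 2, 3), keepdims=True)
    var = feat.var(axis=(0, 2, 3), keepdims=True)
    return (feat - mean) / np.sqrt(var + eps)

def concat_features(feats, scale):
    """L2-normalize each layer's feature maps, then concatenate along channels."""
    return np.concatenate([l2_normalize(f, scale) for f in feats], axis=1)
```

Concatenation along the channel axis requires the maps to share a spatial size; with the strides of pool4 and pool5 set to 1, the outputs of pool4, pool5 and conv7 are all at 1/8 of the input resolution.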
In the rest of this section, we first introduce the experiment settings we use to evaluate the H-ReNet. Then, we conduct ablation studies on the VOC12 dataset by comparing the baseline FCN, the H-ReNet and its variants. Finally, we compare the H-ReNet to other state-of-the-art methods, and demonstrate its superiority.
4.2.4 Experimental Settings
For both the baseline FCN and the H-ReNet, the parameters from layer conv1 to layer conv5 are initialized from the pretrained VGG-16 model. Parameters in other convolutional layers are randomly initialized by sampling from a Gaussian distribution (, ). The parameters of ReNet layers are randomly initialized by sampling from a uniform distribution over . We empirically observe that a full-pass backpropagation is unnecessary during training. Skipping layers conv1 and conv2 during backpropagation has negligible impact on the final results but speeds up the network training. For the baseline FCN, we find training iterations are sufficient. For the H-ReNet, we train for iterations when the multi-layer feature combination is disabled, and iterations when enabled. The initial learning rates for both models are set to and decreased by a factor of 10 in the middle of the training.
When extra training data from Microsoft COCO are added, we first pretrain the H-ReNet for iterations, using a combination of the training set and the validation set from both VOC12 and Microsoft COCO. The initial learning rate is decreased by a factor of 10 for every iterations. As the annotation masks from the COCO dataset are coarser than those from VOC12, finetuning on VOC12 data for another iterations is necessary. Evaluation is performed on the test set.
4.2.5 Results on PASCAL VOC 2012
We compare H-ReNet and its variants to the baseline FCN on the validation set of VOC12 in Table 4. The baseline FCN achieves a mean IoU . By inserting one recurrent layer group between conv7 and conv8, with the rest of the structure unchanged, we obtain a substantial improvement of on the mean IoU ( vs. ). The improvement is entirely attributable to the recurrent layer group, which uses two ReNet layers with orthogonal sweeping directions to capture distant context across different local areas. By adding batch normalization , we further improve the mean IoU by ( vs. ). This demonstrates that the ReNet layer can also benefit from normalization techniques. If we further concatenate feature maps from pool4, pool5 and conv7 before feeding them into renet1, we improve the mean IoU by another ( vs. ), which demonstrates that ReNet layers can also be complemented by multi-scale feature combination . We also experiment with L2 normalization for feature concatenation, but observe no improvement.
4.2.6 Improving region recognition and boundary localization
To see how the H-ReNet improves the region recognition and boundary localization, we perform qualitative comparisons in Figure 4. In the top-left example, the H-ReNet produces a more coherent prediction than the baseline FCN, and in the top-right example the H-ReNet successfully recognizes the chair while the baseline FCN fails. In both examples, global context is critical for the model to correctly recognize and separate out the object. The baseline FCN has to cascade convolutional and pooling layers to progressively increase the size of its receptive fields, and how well the global context can be encoded by this implicit way of modeling is still an open question. By contrast, the H-ReNet encodes global context by employing ReNet layers to explicitly propagate information across the entire image. In both models, the spatial resolution of feature maps is greatly reduced. However, compared to the baseline FCN, the H-ReNet can better localize and restore the boundaries, as evidenced in the bottom two examples.
4.2.7 LSTM vs. IRNN
The RNN unit in a ReNet layer can also be instantiated by the IRNN layer . Compared to LSTM units, IRNN units lack gating functions and memory units to overcome the vanishing gradient issue, and they rely heavily on identity initialization to support dependence propagation. We implement an IRNN-based ReNet layer with the same number of parameters and compare it to our original H-ReNet. It achieves a mean IoU , which is lower than the results of H-ReNet with 1D LSTM units.
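For comparison with the gated LSTM update, an IRNN sweep reduces to a ReLU recurrence whose recurrent matrix starts at the identity. The sketch below is a minimal illustration; the parameter container, the input-weight scale and the function names are assumptions.

```python
import numpy as np

def init_irnn(in_dim, hid_dim, rng, scale=0.01):
    """Hypothetical IRNN parameters: identity recurrent matrix and zero bias at initialization."""
    return {
        "W": rng.normal(0.0, scale, (in_dim, hid_dim)),   # input-to-hidden weights
        "U": np.eye(hid_dim),                             # hidden-to-hidden weights, identity init
        "b": np.zeros(hid_dim),
    }

def irnn_sweep(seq, params):
    """ReLU RNN along axis 0 of seq: (T, B, in_dim) -> (T, B, hid_dim); no gates, no cell memory."""
    T, B, _ = seq.shape
    h = np.zeros((B, params["b"].size))
    out = []
    for t in range(T):
        h = np.maximum(0.0, seq[t] @ params["W"] + h @ params["U"] + params["b"])
        out.append(h)
    return np.stack(out)
```

At initialization the identity matrix copies the previous hidden state forward unchanged, which is what lets information travel along the sweep; unlike the LSTM, nothing in the unit can learn to selectively forget or expose that state.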
4.2.8 Multi-Layer Feature Combination
Before concatenating features from different layers, we need to first identify a set of layers with high-quality and complementary features. To investigate the “quality” of the features from each individual layer, we build an H-ReNet where the recurrent layer group is fed feature maps solely from the layer under investigation. For example, to investigate the feature map from pool4, we connect pool4, renet1 and conv8, and discard the intermediate layers (conv5 to conv7), to obtain an H-ReNet. We choose a candidate set of layers for investigation. As shown in the left side of Table 5, the feature maps from deeper layers give better results than those from shallower layers, as deeper layers generally extract high-level features more helpful for reasoning. We are also interested in whether features from different layers are complementary to each other. As shown in the right side of Table 5, as we gradually add more feature maps from lower layers pool5 and pool4, the accuracy is improved by ( vs. ). However, the improvement diminishes as more feature maps from other lower layers are concatenated.
4.2.9 Training Image Size
We experiment with different sizes of training images where , , . For the baseline FCN, the mean IoUs are , and , respectively. We conclude that FCNs are not sensitive to the input size. A plausible explanation is that the effective size of receptive fields in the last convolutional layer of FCNs is no larger than the smallest option , thus enlarging the training crop may not help the FCN learn better feature representations. On the other hand, the H-ReNet benefits from large training crops, and achieves mean IoUs of , and , respectively. When a larger training image is fed into the H-ReNet, the ReNet layer has to scan more image patches during every single pass, and the contextual information is also propagated over a longer distance, resulting in better feature representations.
4.2.10 Incorporating CRFs
While the H-ReNet can improve boundary localization over FCNs, it still produces blurry boundaries due to the reduced resolution of feature maps. To refine the boundaries, we adopt the post-processing step DenseCRF in , i.e. we add a fully-connected CRF layer whose inference is performed by the mean-field algorithm. The CRF consists of both unary and pairwise potentials, where the unary potential is defined as the negative log-probability predicted by the network and the pairwise potential is a composition of a spatial kernel and a bilateral kernel. We implement the DenseCRF on GPU, and our experiments show that it can improve the results of both the baseline FCN and the H-ReNet by and , respectively (Table 4). The DenseCRF models the interactions between every pair of pixels using low-level features (e.g. pixel position, intensity) as input. By contrast, ReNet layers use high-level features from convolutional layers to model the contextual dependence between different local areas of the image. Empirically, we observe that post-processing with CRFs complements ReNet layers for dependence propagation.
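The two kinds of potentials can be made concrete with a small sketch. It assumes the commonly used fully-connected CRF form with Gaussian kernels, in which the pairwise term is a weighted sum of a bilateral kernel (position and color) and a spatial kernel (position only); the weights, bandwidths and function names are hypothetical parameters, not values used in this work.

```python
import numpy as np

def unary_potential(prob):
    """Negative log-probability of a label as predicted by the network."""
    return -np.log(prob)

def pairwise_kernel(p_i, p_j, I_i, I_j, w_bilateral, w_spatial,
                    theta_alpha, theta_beta, theta_gamma):
    """Kernel value for a pair of pixels with positions p and intensities I.

    In a Potts-style model this value is the penalty paid when the two
    pixels are assigned different labels.
    """
    d_pos = np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
    d_col = np.sum((np.asarray(I_i, float) - np.asarray(I_j, float)) ** 2)
    bilateral = np.exp(-d_pos / (2 * theta_alpha ** 2) - d_col / (2 * theta_beta ** 2))
    spatial = np.exp(-d_pos / (2 * theta_gamma ** 2))
    return w_bilateral * bilateral + w_spatial * spatial
```

Nearby pixels with similar colors get a large kernel value and are therefore encouraged to share a label, which is how the CRF sharpens boundaries along intensity edges.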
4.2.11 Comparisons with Other Approaches
In Table 6, we compare the H-ReNet with other methods on the validation set of VOC12. DeepLab-LargeFOV, with the same base FCN as ours, runs slightly faster but underperforms the H-ReNet by in IoU. The DeepParsing  network underperforms the H-ReNet by in IoU and runs slower. The Piecewise network  slightly outperforms the H-ReNet by , at a cost of more prediction time. By comparing the H-ReNet to our baseline FCN, we find that adding one recurrent layer group increases the prediction time by merely 0.03 second (0.24 vs. 0.21). Incorporating DenseCRF into the H-ReNet further improves IoU by with execution time. Therefore, DenseCRF post-processing represents a trade-off between segmentation accuracy and execution efficiency.
In Table 7, we compare the H-ReNet with other approaches on the test set of VOC12. The FCN-8s net  exploits multi-scale features from various layers by combining predictions from different skip connections. In contrast, we concatenate features from different layers and use them to generate a single prediction. We improve on FCN-8s by ( vs. ). The Zoomed-out approach  extracts both hand-crafted and CNN-based features at multiple levels. Those features are concatenated and fed into an MLP network to produce superpixel-wise predictions. The H-ReNet does not rely on a superpixel prior, but still outperforms Zoomed-out by ( vs. ). The CRFasRNN  approach appends recurrent layers with mean-field solvers after the FCN-8s net, and performs end-to-end training for all layers. Both Piecewise  and DeepParsing  incorporate more sophisticated graphical models into an FCN, and perform end-to-end training. The H-ReNet alone outperforms all the aforementioned approaches except DeepParsing. With DenseCRF integrated, the H-ReNet slightly improves on the result of DeepParsing by , and establishes new state-of-the-art results. The H-ReNet achieves the highest IoUs on 13 out of 20 object classes. For the bike class, DeepParsing attains substantially better results ( vs. ) than ours by correctly predicting the interior regions of bike wheels as background, which the H-ReNet fails to do. This advantage is attributed to the modeling of high-order label relations by DeepParsing. Note that our H-ReNet can also incorporate the sophisticated graphical models introduced in DeepParsing and CRFasRNN, and jointly optimize the model parameters with preceding layers. We expect this can further improve the results of the H-ReNet, and leave the integration of trainable CRF-based post-processing with the H-ReNet as future work.
|DeepLab-CRF-LargeFOV||2.13 (GPU+CPU)²||67.6|
|DeepParsing||0.30 (GPU)³||67.8|
|baseline FCN||0.21 (GPU)||63.4|
|H-ReNet||0.24 (GPU)⁴||71.1|
|H-ReNet + DenseCRF (2 iterations)||0.46 (GPU)||72.6|
² The FCN and mean field inference are implemented on GPU and CPU, respectively.
³ In , the authors only report the time cost of the last 4 layers in their 15-layer net, which is 0.075 seconds. By building a net with exactly the same 11 layers as the DPN net, we estimate the time cost of the first 11 layers to be 0.22 seconds.
⁴ Time cost is measured on an image of size .
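The overheads implied by the time costs above can be read off with a short calculation. This is only an illustrative sketch: the dictionary and the helper `overhead` are hypothetical names, the values are copied from the table in seconds, and it assumes each pair of rows differs only in the added component.

```python
# Time costs in seconds, copied from the table above.
time_cost = {
    "baseline FCN": 0.21,
    "H-ReNet": 0.24,
    "H-ReNet + DenseCRF (2 iterations)": 0.46,
}


def overhead(slow, fast):
    """Return (extra seconds, extra time as a percentage of the faster model)."""
    extra = time_cost[slow] - time_cost[fast]
    return round(extra, 2), round(100.0 * extra / time_cost[fast], 1)


# Cost attributable to the ReNet layers on top of the baseline FCN.
print(overhead("H-ReNet", "baseline FCN"))
# Cost attributable to the 2 DenseCRF iterations on top of the H-ReNet.
print(overhead("H-ReNet + DenseCRF (2 iterations)", "H-ReNet"))
```

Under these assumptions, the ReNet layers add about 0.03 seconds (roughly 14% over the baseline FCN), while the two DenseCRF iterations add about 0.22 seconds.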
When extra training data from MS COCO are used, DeepParsing marginally outperforms the H-ReNet ( vs. ). A plausible explanation is that the mixture of label contexts and high-order relations modeled by DeepParsing are better captured when a sufficiently large amount of training data is available.
Qualitative comparisons. In Figure 5, we qualitatively compare the H-ReNet to two leading approaches, CRFasRNN  and DeepParsing , both of which incorporate graphical models in an end-to-end manner. In the first example, a part of the airplane is visually similar to the background wall at the bottom. DeepParsing mistakenly labels the background wall as airplane, while CRFasRNN incorrectly classifies the airplane part as background. In contrast, the H-ReNet recognizes a more complete body of the airplane due to its explicit modeling of contextual dependence. In the second example, the part of the bird torso in the lower left corner is recognized only by the H-ReNet. In the third and fourth examples, the H-ReNet labels more complete foreground regions. In general, by seeing broader areas, the H-ReNet is more capable of resolving ambiguous regions.
(a) Per-class results on VOC12 test set
(b) Per-class results on VOC12 test set with the extra MS COCO training data
In this work, we propose the use of the spatially recurrent layer (ReNet layer) for semantic segmentation, and demonstrate its effectiveness by constructing a naive deep ReNet (N-ReNet) and evaluating it on the Stanford Background dataset. Furthermore, we integrate ReNet layers with FCNs to develop a hybrid network (H-ReNet), which achieves state-of-the-art results on the PASCAL VOC 2012 dataset.
-  Lin, G., Shen, C., Reid, I., et al.: Efficient piecewise training of deep structured models for semantic segmentation. arXiv preprint arXiv:1504.01013 (2015)
-  Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.: Conditional random fields as recurrent neural networks. In: ICCV. (2015)
-  Liu, Z., Li, X., Luo, P., Loy, C.C., Tang, X.: Semantic image segmentation via deep parsing network. In: ICCV. (2015)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)
-  Yan, Z., Zhang, H., Piramuthu, R., Jagadeesh, V., DeCoste, D., Di, W., Yu, Y.: HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition. In: ICCV. (2015)
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. (2015) 91–99
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. (2015)
-  Graves, A.: Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013)
-  Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14). (2014) 1764–1772
-  Tighe, J., Lazebnik, S.: Finding things: Image parsing with regions and per-exemplar detectors. In: CVPR. (2013) 3001–3008
-  Tighe, J., Lazebnik, S.: Superparsing: Scalable nonparametric image parsing with superpixels. In: ECCV. (2010)
-  Liu, C., Yuen, J., Torralba, A.: Nonparametric scene parsing via label transfer. TPAMI (2011)
-  Singh, G., Kosecka, J.: Nonparametric scene parsing with adaptive feature relevance and semantic context. In: CVPR. (2013)
-  Eigen, D., Fergus, R.: Nonparametric image parsing using adaptive neighbor sets. In: CVPR. (2012)
-  Russell, C., Kohli, P., Torr, P.: Associative hierarchical CRFs for object class image segmentation. In: ICCV. (2009)
-  Kohli, P., Torr, P.H.: Robust higher order potentials for enforcing label consistency. IJCV (2009)
-  Li, Y., Tarlow, D., Zemel, R.: Exploring compositional high order pattern potentials for structured output learning. In: CVPR. (2013)
-  Ramalingam, S., Kohli, P., Alahari, K., Torr, P.H.: Exact inference in multi-label CRFs with higher order cliques. In: CVPR. (2008)
-  Zitnick, C.L., Dollár, P.: Edge boxes: Locating object proposals from edges. In: Computer Vision–ECCV 2014. Springer (2014) 391–405
-  Dai, J., He, K., Sun, J.: Convolutional feature masking for joint object and stuff segmentation. In: CVPR. (2015)
-  Dai, J., He, K., Sun, J.: BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In: ICCV. (2015)
-  Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: CVPR. (2015)
-  Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Simultaneous detection and segmentation. In: ECCV. Springer (2014) 297–312
-  Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: ICCV. (2015)
-  Girshick, R., Donahue, J., Darrell, T., Malik, J.: Region-based convolutional networks for accurate object detection and segmentation. TPAMI (2015)
-  Pinheiro, P.H., Collobert, R.: Recurrent convolutional neural networks for scene parsing. JMLR (2013)
-  Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICLR (2015)
-  Schwing, A.G., Urtasun, R.: Fully connected deep structured networks. arXiv preprint arXiv:1503.02351 (2015)
-  Mostajabi, M., Yadollahpour, P., Shakhnarovich, G.: Feedforward semantic segmentation with zoom-out features. In: CVPR. (2015)
-  Bell, S., Zitnick, C.L., Bala, K., Girshick, R.: Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. arXiv preprint arXiv:1512.04143 (2015)
-  Visin, F., Kastner, K., Cho, K., Matteucci, M., Courville, A., Bengio, Y.: ReNet: A recurrent neural network based alternative to convolutional networks. arXiv preprint arXiv:1505.00393 (2015)
-  Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014)
-  Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Gated feedback recurrent neural networks. arXiv preprint arXiv:1502.02367 (2015)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR (2014)
-  Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. arXiv preprint arXiv:1409.4842 (2014)
-  Jia, Y.: Caffe: An open source convolutional architecture for fast feature embedding (2013)
-  Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV. (2014)
-  Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Scene parsing with multiscale feature learning, purity trees, and optimal covers. In: ICML. (2012)
-  Kekeç, T., Emonet, R., Fromont, E., Trémeau, A., Wolf, C.: Contextually constrained deep networks for scene labeling. In: BMVC. (2014)
-  Sharma, A., Tuzel, O., Liu, M.: Recursive context propagation network for semantic scene labeling. In: NIPS. (2014)
-  Byeon, W., Breuel, T.M., Raue, F., Liwicki, M.: Scene labeling with LSTM recurrent neural networks. In: CVPR. (2015)
-  Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. TPAMI (2011)
-  Hariharan, B., Arbeláez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV. (2011)
-  Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. IJCV (2010)
-  Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. (2014)
-  Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
-  Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
-  Papandreou, G., Chen, L.C., Murphy, K., Yuille, A.L.: Weakly-and semi-supervised learning of a DCNN for semantic image segmentation. arXiv preprint arXiv:1502.02734 (2015)