Improving Semantic Segmentation via Video Propagation and Label Relaxation


Yi Zhu  Karan Sapra  Fitsum A. Reda  Kevin J. Shih  Shawn Newsam
 Andrew Tao  Bryan Catanzaro
University of California at Merced  Nvidia Corporation
yzhu25,snewsam@ucmerced.edu  ksapra,freda,kshih,atao,bcatanzaro@nvidia.com
Abstract

Semantic segmentation requires large amounts of pixel-wise annotations to learn accurate models. In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks. We exploit video prediction models’ ability to predict future frames in order to also predict future labels. A joint propagation strategy is also proposed to alleviate mis-alignments in synthesized samples. We demonstrate that training segmentation models on datasets augmented by the synthesized samples leads to significant improvements in accuracy. Furthermore, we introduce a novel boundary label relaxation technique that makes training robust to annotation noise and propagation artifacts along object boundaries. Our proposed methods achieve state-of-the-art mIoUs of on Cityscapes and on CamVid. Our single model, without model ensembles, achieves mIoU on the KITTI semantic segmentation test set, which surpasses the winning entry of the ROB challenge 2018. Our code and videos can be found at https://nv-adlr.github.io/publication/2018-Segmentation. indicates equal contribution.

1 Introduction

Semantic segmentation is the task of dense per pixel predictions of semantic labels. Large improvements in model accuracy have been made in recent literature [44, 14, 10], in part due to the introduction of Convolutional Neural Networks (CNNs) for feature learning, the task’s utility for self-driving cars, and the availability of larger and richer training datasets (e.g., Cityscapes [15] and Mapillary Vista [32]). While these models rely on large amounts of training data to achieve their full potential, the dense nature of semantic segmentation entails a prohibitively expensive dataset annotation process. For instance, annotating all pixels in a Cityscapes image takes on average hours [15]. Annotation quality plays an important role for training better models. While coarsely annotating large contiguous regions can be performed quickly using annotation toolkits, finely labeling pixels along object boundaries is extremely challenging and often involves inherently ambiguous pixels.

Figure 1: Framework overview. We propose joint image-label propagation to scale up training sets for robust semantic segmentation. The green dashed box includes manually labelled samples, and the red dashed box includes our propagated samples. The propagation is performed by a transformation function learned by the video prediction models. We also propose boundary label relaxation to mitigate label noise during training. Our framework can be used with most semantic segmentation and video prediction models.

Many alternatives have been proposed to augment training processes with additional data. For example, Cordts et al. [15] provided 20K coarsely annotated images to help train deep CNNs, a cost-effective annotation alternative used by all top performers on the Cityscapes benchmark. Nevertheless, coarse labeling still takes, on average, minutes per image. An even cheaper way to obtain more labeled samples is to generate synthetic data [35, 36, 18, 47, 45]. However, model accuracy on the synthetic data often does not generalize to real data due to the domain gap between synthetic and real images. Luc et al. [28] use a state-of-the-art image segmentation method [42] as a teacher to generate extra annotations for unlabelled images. However, their performance is bounded by that of the teacher method. Another approach exploits the fact that many semantic segmentation datasets are based on continuous video frame sequences sparsely labeled at regular intervals. As such, several works [2, 9, 31, 16, 33] propose to use temporal consistency constraints, such as optical flow, to propagate ground truth labels from labeled to unlabeled frames. However, these methods all have different drawbacks, which we describe in Sec. 2.

In this work, we propose to utilize video prediction models to efficiently create more training samples (image-label pairs), as shown in Fig. 1. Given a sequence of video frames having labels for only a subset of the frames, we exploit the prediction models’ ability to predict future frames in order to also predict future labels (new labels for unlabelled frames). Specifically, we propose leveraging such models in two ways. 1) Label Propagation (LP): we create new training samples by pairing a propagated label with the original future frame. 2) Joint image-label Propagation (JP): we create a new training sample by pairing a propagated label with the corresponding propagated image. In approach (2), because both past labels and frames are jointly propagated using the same prediction model, the resulting image-label pairs exhibit a higher degree of alignment. As we will show in later sections, we separately apply each approach for multiple future steps to scale up the training dataset.
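To make the distinction concrete, here is a minimal, purely illustrative sketch; the `warp` argument is a hypothetical stand-in for the learned propagation described in Sec. 3, and none of these helper names come from the actual implementation:

```python
def label_propagation(label_t, frame_t1, warp):
    """LP: pair the propagated label with the ORIGINAL future frame."""
    label_t1_hat = warp(label_t)      # propagate the label forward
    return frame_t1, label_t1_hat     # may be misaligned where warp errs

def joint_propagation(frame_t, label_t, warp):
    """JP: pair the propagated label with the PROPAGATED frame."""
    frame_t1_hat = warp(frame_t)      # frame and label are transformed by
    label_t1_hat = warp(label_t)      # the same model, so the pair stays aligned
    return frame_t1_hat, label_t1_hat
```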

While great progress has been made in video prediction, it is still prone to producing unnatural distortions along object boundaries. For synthesized training examples, this means that the propagated labels along object boundaries should be trusted less than those within an object’s interior. Here, we present a novel boundary label relaxation technique that can make training more robust to such errors. We demonstrate that by maximizing the likelihood of the union of neighboring class labels along the boundary, the trained models not only achieve better accuracy, but are also able to benefit from longer-range propagation.

As we will show in our experiments, training segmentation models on datasets augmented by our synthesized samples leads to improvements on several popular datasets. Furthermore, by performing training with our proposed boundary label relaxation technique, we achieve even higher accuracy and training robustness, producing state-of-the-art results on the Cityscapes, CamVid, and KITTI semantic segmentation benchmarks. Our contributions are summarized below:

  • We propose to utilize video prediction models to propagate labels to immediate neighbor frames.

  • We introduce joint image-label propagation to alleviate the mis-alignment problem.

  • We propose to relax one-hot label training by maximizing the likelihood of the union of class probabilities along the boundary. This results in more accurate models and allows us to perform longer-range propagation.

  • We compare our video prediction-based approach to standard optical flow-based ones in terms of segmentation performance.

2 Related Work

Here, we discuss additional work related to ours, focusing mainly on the differences.

Label propagation There are two main approaches to propagating labels: patch matching [2, 9] and optical flow [31, 16, 33]. Patch matching-based methods, however, tend to be sensitive to patch size and threshold values, and, in some cases, they assume prior knowledge of class statistics. Optical flow-based methods rely on very accurate optical flow estimation, which is difficult to achieve. Erroneous flow estimation can result in propagated labels that are misaligned with their corresponding frames.

Our work falls in this line of research but has two major differences. First, we use motion vectors learned from video prediction models to perform propagation. The learned motion vectors can handle occlusion while also being class agnostic. Unlike optical flow estimation, video prediction models are typically trained through self-supervision. The second major difference is that we conduct joint image-label propagation to greatly reduce the mis-alignments.

Boundary handling Some prior works [12, 29] explicitly incorporate edge cues as constraints to handle boundary pixels. Although the idea is straightforward, this approach has at least two drawbacks. One is the potential error propagation from edge estimation, and the other is that fitting extremely hard boundary cases may lead to over-fitting at test time. There is also literature focusing on structure modeling to obtain better boundary localization, such as affinity fields [21], random walks [5], relaxation labelling [37], boundary neural fields [4], etc. However, none of these methods deals directly with boundary pixels; instead, they attempt to model the interactions between segments along object boundaries. The work most similar to ours is [22], which proposes to incorporate uncertainty reasoning inside a Bayesian framework. The authors enforce a Gaussian distribution over the logits to attenuate the loss when uncertainty is large. Instead, we propose a modification to the class label space that allows us to predict multiple classes at a boundary pixel. Experimental results demonstrate higher model accuracy and increased training robustness.

Figure 2: Motivation of joint image-label propagation. Row 1: original frames. Row 2: propagated labels. Row 3: propagated frames. The red and green boxes are two zoomed-in regions which demonstrate the mis-alignment problem. Note how the propagated frames align perfectly with propagated labels as compared to the original frames. The black areas in the labels represent a void class. (Image brightness has been adjusted for better visualization.)

3 Methodology

We present an approach for training data synthesis from sparsely annotated video frame sequences. Given an input video $\mathbf{I} = \{I_1, \dots, I_n\}$ and semantic labels $L_i$ available only for a sparse subset of the frames, we synthesize new training samples (image-label pairs) using video prediction models, where $k$ is the length of propagation applied to each annotated image-label pair $(I_i, L_i)$. We will first describe how we use video prediction models for label synthesis.

3.1 Video Prediction

Video prediction is the task of generating future frames from a sequence of past frames. It can be modeled as the process of direct pixel synthesis or learning to transform past pixels. In this work, we use a simple and yet effective vector-based approach [34] that predicts a motion vector to translate each pixel to its future coordinate. The predicted future frame $\tilde{I}_{t+1}$ is given by,

$\tilde{I}_{t+1} = \mathcal{T}\big(\mathcal{G}(I_{1:t}, F_{2:t}),\ I_t\big)$   (1)

where $\mathcal{G}$ is a 3D CNN that predicts motion vectors $(u, v)$ conditioned on input frames $I_{1:t}$ and estimated optical flows $F_{2:t}$ between successive input frames $I_i$ and $I_{i-1}$. $\mathcal{T}$ is an operation that bilinearly samples from the most recent input $I_t$ using the predicted motion vectors $(u, v)$.

Note that the motion vectors $(u, v)$ predicted by $\mathcal{G}$ are not equivalent to the optical flow vectors $F$. Optical flow vectors are undefined for pixels that are visible in the current frame but not visible in the previous frame. Thus, performing past frame sampling using optical flow vectors will duplicate foreground objects, create undefined holes or stretch image borders. The learned motion vectors, however, account for disocclusion and attempt to accurately predict future frames. We will demonstrate the advantage of learned motion vectors over optical flow in Sec. 4.
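For illustration, a sketch of the bilinear sampling operation $\mathcal{T}$ using PyTorch’s `grid_sample` is shown below. This is an approximation of the warping in [34] rather than the authors’ exact code; `sampling_grid` and `warp_frame` are names introduced here for clarity.

```python
import torch
import torch.nn.functional as F

def sampling_grid(motion):
    """Convert a dense motion field (N, 2, H, W), in pixels, into a
    grid_sample grid in normalized [-1, 1] coordinates."""
    n, _, h, w = motion.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=motion.device),
        torch.linspace(-1, 1, w, device=motion.device),
        indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, h, w, 2)
    offset = torch.stack((motion[:, 0] / ((w - 1) / 2),    # x displacement
                          motion[:, 1] / ((h - 1) / 2)),   # y displacement
                         dim=-1)
    return base + offset

def warp_frame(frame, motion):
    """Bilinearly sample the most recent frame I_t with the predicted motion
    vectors, approximating the operation T in equation (1)."""
    return F.grid_sample(frame, sampling_grid(motion),
                         mode="bilinear", align_corners=True)
```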

In this work, we propose to reuse the predicted motion vectors to also synthesize future labels $\tilde{L}_{t+1}$. Specifically:

$\tilde{L}_{t+1} = \mathcal{T}\big(\mathcal{G}(I_{1:t}, F_{2:t}),\ L_t\big)$   (2)

where the sampling operation $\mathcal{T}$ is applied to a past label $L_t$. $\mathcal{G}$ in equation (2) is the same as in equation (1) and is pre-trained on the underlying video frame sequences for the task of accurately predicting future frames.
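Reusing the motion vectors for the label map then only changes the sampling mode. In the sketch below (same caveats as above; `sampling_grid` is the helper from the previous snippet), nearest-neighbor sampling keeps the propagated labels discrete:

```python
import torch.nn.functional as F

def warp_label(label, motion):
    """Propagate an integer label map (N, 1, H, W) with the same motion
    vectors used for the frame, approximating equation (2). Nearest-neighbor
    sampling avoids blending class indices."""
    warped = F.grid_sample(label.float(), sampling_grid(motion),
                           mode="nearest", align_corners=True)
    return warped.long()
```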

3.2 Joint Image-Label Propagation

Standard label propagation techniques create new training samples by pairing a propagated label with the original future frame as $(I_{t+k}, \tilde{L}_{t+k})$, with $k$ being the propagation length. For regions where the frame-to-frame correspondence estimation is not accurate, we will encounter mis-alignment between $I_{t+k}$ and $\tilde{L}_{t+k}$. For example, as we see in Fig. 2, most regions in the propagated label (row 2) correlate well with the corresponding original video frames (row 1). However, certain regions, like the pole (red) and the leg of the pedestrian (green), do not align with the original frames due to erroneously estimated motion vectors.

To alleviate this mis-alignment issue, we propose a joint image-label propagation strategy; i.e., we jointly propagate both the video frame and the label. Specifically, we apply equation (2) to each input training sample $(I_t, L_t)$ for $k$ future steps to create new training samples by pairing a predicted frame with a predicted label as $(\tilde{I}_{t+k}, \tilde{L}_{t+k})$. As we can see in Fig. 2, the propagated frames (row 3) correspond well to the propagated labels (row 2); the pole and the leg experience the same distortion. Since semantic segmentation is a dense per-pixel estimation problem, such good alignment is crucial for learning an accurate model.

Our joint propagation approach can be thought of as a special type of data augmentation because both the frame and the label are synthesized by transforming a past frame and the corresponding label using the same learned transformation. In this respect it is similar to standard data augmentation techniques, such as random rotation, random scaling or random flipping. However, joint propagation uses a more fundamental transformation, one trained for the task of accurate future frame prediction.

In order to create more training samples, we also perform reversed frame prediction; that is, we equivalently apply joint propagation in the backward direction to create additional new training samples $(\tilde{I}_{t-k}, \tilde{L}_{t-k})$. In total, this scales the training dataset by a multiple of the propagation length. In our study, we propagate in both directions, where $+k$ indicates a forward propagation and $-k$ a backward propagation.
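A rough sketch of the resulting augmentation loop, assuming the `warp_frame`/`warp_label` helpers above and a hypothetical `predict_motion` wrapper around the trained prediction (or reconstruction) model:

```python
def jointly_propagate(frames, t, label_t, k, predict_motion):
    """Expand one annotated frame t into 2k synthetic (image, label) pairs by
    propagating k steps forward and k steps backward. The non-accumulated
    training variant discussed in Sec. 4 keeps only the +/-k pairs."""
    samples = [(frames[t], label_t)]                 # the ground-truth pair
    for direction in (+1, -1):                       # forward, then backward
        img, lbl = frames[t], label_t
        for step in range(1, k + 1):
            idx = t + direction * step
            motion = predict_motion(frames, idx, direction)  # hypothetical call
            img = warp_frame(img, motion)            # propagate the frame...
            lbl = warp_label(lbl, motion)            # ...and its label jointly
            samples.append((img, lbl))
    return samples
```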

We would like to point out that our proposed joint propagation has broader applications. It could also find application in datasets where both the raw frames and the corresponding labels are scarce. This is different from label propagation alone for synthesizing new training samples for typical video datasets, for instance Cityscapes [15], where raw video frames are abundant but only a subset of the frames have human annotated labels.

Figure 3: Motivation of boundary label relaxation. For the entropy image, the lighter the pixel value, the larger the entropy. We find that object boundaries often have large entropy, due to ambiguous annotations or propagation distortions. The green boxes are zoomed-in regions showing such distortions.

3.3 Video Reconstruction

Since, in our problem, we know the actual future frames, we can perform not just video prediction but video reconstruction to synthesize new training examples. More specifically, we can condition the prediction models on both the past and future frames to more accurately reconstruct “future” frames. The motivation behind this reformulation is that because future frames are observed by video reconstruction models, they are, in general, expected to produce better transformation parameters than video prediction models, which observe only past frames.

Mathematically, a reconstructed future frame $\tilde{I}_{t+1}$ is given by,

$\tilde{I}_{t+1} = \mathcal{T}\big(\mathcal{G}(I_{1:t+1}, F_{2:t+1}),\ I_t\big)$   (3)

In a similar way to equation (2), we also apply $\mathcal{G}$ from equation (3) (which is learned for the task of accurate future frame reconstruction) to generate a future label $\tilde{L}_{t+1}$.

Method mIoU ()
Baseline
+ Mapillary Pre-training
+ Class Uniform Sampling (Fine + Coarse)
Table 1: Effectiveness of Mapillary pre-training and class uniform sampling on both fine and coarse annotations.

3.4 Boundary Label Relaxation

Most of the hardest pixels to classify lie on the boundary between object classes [25]. Specifically, it is difficult to classify the center pixel of a receptive field when potentially half or more of the input context could be from a different class. This problem is further compounded by the fact that the annotations are nowhere near pixel-perfect along the edges.

We propose a modification to the class label space, applied exclusively during training, that allows us to predict multiple classes at a boundary pixel. We define a boundary pixel as any pixel that has a differently labeled neighbor. Suppose, for simplicity, we are classifying a pixel along the boundary of classes $A$ and $B$. Instead of maximizing the likelihood of the target label as provided by the annotation, we propose to maximize the likelihood of $P(A \cup B)$. Because classes $A$ and $B$ are mutually exclusive, we aim to maximize the union of $A$ and $B$:

$P(A \cup B) = P(A) + P(B)$   (4)

where $P(\cdot)$ is the softmax probability of each class. Specifically, let $\mathcal{N}$ be the set of classes within a 3×3 window of a pixel. We define our loss as:

$\mathcal{L}_{boundary} = -\log \sum_{C \in \mathcal{N}} P(C)$   (5)

Note that for $|\mathcal{N}| = 1$, this loss reduces to the standard one-hot label cross-entropy loss.
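One possible realization of this loss (a hedged sketch in PyTorch, not necessarily the implementation used in the paper) builds a multi-hot boundary target by max-pooling the one-hot label map over a 3×3 window and then penalizes the negative log of the summed softmax probabilities of the classes present in that window:

```python
import torch
import torch.nn.functional as F

def boundary_relaxed_loss(logits, labels, num_classes, ignore_index=255):
    """Boundary label relaxation, cf. equation (5).
    logits: (N, C, H, W); labels: (N, H, W) LongTensor of class indices.
    At interior pixels the 3x3 neighborhood contains one class and the loss
    reduces to standard cross-entropy."""
    valid = labels != ignore_index
    safe_labels = labels.clone()
    safe_labels[~valid] = 0                      # void pixels are masked later
    # One-hot encode: (N, H, W) -> (N, C, H, W).
    one_hot = F.one_hot(safe_labels, num_classes).permute(0, 3, 1, 2).float()
    # A class belongs to N(p) if any pixel in the 3x3 window around p has it.
    border_target = F.max_pool2d(one_hot, kernel_size=3, stride=1, padding=1)
    probs = F.softmax(logits, dim=1)
    # Union of mutually exclusive classes = sum of their probabilities, Eq. (4).
    union_prob = (border_target * probs).sum(dim=1).clamp(min=1e-8)
    return (-torch.log(union_prob))[valid].mean()
```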

One can see that the loss over the modified label space is minimized when $\sum_{C \in \mathcal{N}} P(C) = 1$, without any constraints on the relative values of the individual class probabilities. We demonstrate that this relaxation not only makes our training robust to the aforementioned annotation errors, but also to distortions resulting from our joint propagation procedure. As can be seen in Fig. 3, the propagated label (three frames away from the ground truth) distorts along the moving car’s boundary and the pole. Further, we can see how much the model struggles with these pixels by visualizing the model’s entropy over the class label distribution. As the high entropy would suggest, the border pixel confusion contributes a large amount of the training loss. In our experiments, we show that by relaxing the boundary labels, our training is more robust to accumulated propagation artifacts, allowing us to benefit from longer-range training data propagation.

4 Experiments

In this section, we evaluate our proposed method on three widely adopted semantic segmentation datasets, including Cityscapes [15], CamVid [7] and KITTI [1]. For all three datasets, we use the standard mean Intersection over Union (mIoU) metric to report segmentation accuracy.

0
VPred + LP
VPred + JP
VRec + JP
Table 2: Comparison between (1) label propagation (LP) and joint propagation (JP); (2) video prediction (VPred) and video reconstruction (VRec). Using the proposed video reconstruction and joint propagation techniques, we improve over the baseline by mIoU ().

4.1 Implementation Details

For the video prediction/reconstruction models, the training details are described in the supplementary materials. For semantic segmentation, we use an SGD optimizer and employ a polynomial learning rate policy [27, 13], where the initial learning rate is multiplied by $(1 - \frac{\text{epoch}}{\text{max epoch}})^{\text{power}}$. We set the initial learning rate to and the power to . Momentum and weight decay are set to and respectively. We use synchronized batch normalization (batch statistics synchronized across GPUs) [44, 43] with a batch size of 16 distributed over 8 V100 GPUs. The number of training epochs is set to for Cityscapes, for CamVid and for KITTI. The crop size is for Cityscapes, for CamVid and for KITTI due to the different image resolutions. For data augmentation, we randomly scale the input images (from 0.5 to 2.0), and apply horizontal flipping, Gaussian blur and color jittering during training. Our network architecture is based on DeepLabV3+ [14]. For the network backbone, we use ResNeXt50 [39] for the ablation studies, and WideResNet38 [38] for the final test submissions. In addition, we adopt the following two effective strategies.
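For reference, the polynomial learning-rate policy above corresponds to the small sketch below (the hyper-parameter values shown are placeholders for illustration, not the settings used in the paper); the two strategies are described next.

```python
import torch

def poly_lr_scheduler(optimizer, max_epochs, power):
    """Polynomial decay: lr = base_lr * (1 - epoch / max_epochs) ** power."""
    return torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=lambda epoch: max(0.0, 1 - epoch / max_epochs) ** power)

# Placeholder model and hyper-parameters, for illustration only.
model = torch.nn.Conv2d(3, 19, kernel_size=1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=1e-4)
scheduler = poly_lr_scheduler(optimizer, max_epochs=180, power=1.0)
```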

Mapillary Pre-Training Instead of using ImageNet pre-trained weights for model initialization, we pre-train our model on Mapillary Vistas [32]. This dataset contains street-level scenes annotated for autonomous driving, whose domain is close to that of Cityscapes. Furthermore, it has a larger training set (i.e., K images) and more classes (i.e., classes).

Class Uniform Sampling We introduce a data sampling strategy similar to [10]. The idea is to make sure that all classes are approximately uniformly chosen during training. We first record the centroids of areas containing each class of interest. During training, we take half of the samples from the standard randomly cropped images and the other half from the centroids to make sure the training crops for all classes are approximately uniform per epoch. In this case, we are actually oversampling the underrepresented categories. For Cityscapes, we also utilize coarse annotations based on class uniform sampling. We compute the class centroids for all 20K samples, but we can choose which data to use. For example, classes such as fence, rider, train are underrepresented. Hence, we only augment these classes by providing extra coarse samples to balance the training.
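A simplified sketch of the sampling idea follows (hypothetical data structures and helper names; the actual procedure follows [10]): half of each epoch’s crops are drawn at random, the other half are centered on precomputed class centroids so that rare classes are oversampled.

```python
import random

def build_epoch_crops(images, class_centroids, crops_per_epoch, crop_fn):
    """class_centroids: dict mapping class_id -> list of (image_id, (x, y))
    centroids of regions containing that class. crop_fn(image_id, center=None)
    is a hypothetical helper that returns one training crop."""
    crops = []
    # First half: standard random crops.
    for _ in range(crops_per_epoch // 2):
        crops.append(crop_fn(random.choice(images)))
    # Second half: crops centered on class centroids, classes drawn uniformly.
    classes = list(class_centroids)
    for _ in range(crops_per_epoch - crops_per_epoch // 2):
        cls = random.choice(classes)                     # uniform over classes
        image_id, center = random.choice(class_centroids[cls])
        crops.append(crop_fn(image_id, center=center))
    random.shuffle(crops)
    return crops
```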

4.2 Cityscapes

Cityscapes is a challenging dataset containing high quality pixel-level annotations for images. The standard dataset split is , , and for the training, validation, and test sets respectively. There are also K coarsely annotated images. All images are of size 1024×2048. Cityscapes defines semantic labels containing both objects and stuff, and a void class for do-not-care regions. We perform several ablation studies below on the validation set to justify our framework design.

Stronger Baseline

First, we demonstrate the effectiveness of Mapillary pre-training and class uniform sampling. As shown in Table 1, Mapillary pre-training is highly beneficial and improves mIoU by over the baseline (). This makes sense because the Mapillary Vistas dataset is close to Cityscapes in terms of domain similarity, and thus provides better initialization than ImageNet. We also show that class uniform sampling is an effective data sampling strategy to handle class imbalance problems. It brings an additional improvement (). We use this recipe as our baseline.

Figure 4: Boundary label relaxation leads to higher mIoU at all propagation lengths. The longer the propagation, the bigger the gap between the solid (with label relaxation) and dashed (without relaxation) lines. The black dashed line represents our baseline (). An x-axis value of 0 indicates that no augmented samples are used. For each experiment, we perform three runs and report the mean and sample standard deviation as the error bar [8].

Label Propagation versus Joint Propagation

Next, we show the advantage of our proposed joint propagation over label propagation. For both settings, we use the motion vectors predicted by the video prediction model to perform propagation. The comparison results are shown in Table 2. Column 0 in Table 2 indicates the baseline ground-truth-only training (no augmentation with synthesized data). Columns 1 to 5 indicate augmentation with synthesized data from timesteps $\pm k$, not including intermediate synthesized data from timesteps $\pm 1$ through $\pm(k-1)$. In other words, a propagation length of $k$ means we use the $+k$, $-k$ and ground truth samples, but not the intermediate offsets. Note that we also tried the accumulated case, where the intermediate synthesized data are also included in the training set. However, we observed a slight performance drop. We suspect this is because the cumulative case significantly decreases the probability of sampling a hand-annotated training example within each epoch, ultimately placing too much weight on the synthesized ones and their imperfections. Comparisons between the non-accumulated and accumulated cases can be found in the supplementary materials.
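To make the two augmentation schemes concrete, here is a small sketch (assuming a hypothetical `propagated[i]` store of synthesized pairs at offset i) of how the non-accumulated training list differs from the accumulated one at propagation length k:

```python
def training_set(ground_truth, propagated, k, accumulate=False):
    """ground_truth: list of annotated (image, label) pairs.
    propagated[i]: synthesized pairs at offset i (negative means backward).
    The non-accumulated case uses only offsets -k and +k; the accumulated case
    uses every offset in [-k, k], which dilutes the hand-annotated samples."""
    offsets = range(-k, k + 1) if accumulate else (-k, +k)
    samples = list(ground_truth)
    for i in offsets:
        if i != 0:
            samples += propagated[i]
    return samples
```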

As we can see in Table 2 (top two rows), joint propagation works better than label propagation at all propagation lengths. Both achieve the highest mIoU for $\pm 1$, which essentially uses information from just the previous and next frames. Joint propagation improves by mIoU over the baseline (), while label propagation only improves by (). This clearly demonstrates the usefulness of joint propagation. We believe this is because label noise from mis-alignment is outweighed by the additional dataset diversity obtained from the augmented training samples. Hence, we adopt joint propagation in subsequent experiments.

Method split road swalk build. wall fence pole tlight tsign veg. terrain sky person rider car truck bus train mcycle bicycle mIoU
Baseline val
+ VRec with JP val
+ Label Relaxation val
ResNet38 [38] test
PSPNet [44] test
InPlaceABN [10] test
DeepLabV3+ [14] test
DRN-CRL [46] test
Ours test
Table 3: Per-class mIoU results on Cityscapes. Top: our ablation improvements on the validation set. Bottom: comparison with top-performing models on the test set.

Video Prediction versus Video Reconstruction

Recall from Sec. 3.1 that we have two methods for learning the motion vectors to generate new training samples through propagation: video prediction and video reconstruction. We experiment with both models in Table 2.

As shown in Table 2 (bottom two rows), video reconstruction works better than video prediction at all propagation lengths, which agrees with our expectations. We also find that achieves the best result. Starting from , the model accuracy starts to drop. This indicates that the quality of the augmented samples becomes lower as we propagate further. Compared to the baseline, we obtain an absolute improvement of (). Hence, we use the motion vectors produced by the video reconstruction model in the following experiments.

(a) Top: learned motion vectors (MVs); Bottom: optical flow
(b) Performance vs. propagation length
Figure 5: Our learned motion vectors from video reconstruction are better than optical flow (FlowNet2). (a) Qualitative result: the learned motion vectors are better in terms of occlusion handling. (b) Quantitative result: the learned motion vectors are better at all propagation lengths in terms of mIoU.

Effectiveness of Boundary Label Relaxation

Theoretically, we can propagate the labels in an auto-regressive manner for as long as we want. The longer the propagation, the more diverse information we will get. However, due to abrupt scene changes and propagation artifacts, longer propagation will generate low quality labels as shown in Fig. 2. Here, we will demonstrate how the proposed boundary label relaxation technique can help to train a better model by utilizing longer propagated samples.

We use boundary label relaxation on datasets created by video prediction (red) and video reconstruction (blue) in Fig. 4. As we can see, adopting boundary label relaxation leads to higher mIoU at all propagation lengths for both models. Take the video reconstruction model for example. Without label relaxation (dashed lines), the best performance is achieved at . After incorporating relaxation (solid lines), the best performance is achieved at with an improvement of mIoU (). The gap between the solid and dashed lines becomes larger as we propagate longer. The same trend can be observed for the video prediction models. This demonstrates that our boundary label relaxation is effective at handling border artifacts. It helps our model obtain more diverse information from longer propagation while, at the same time, reducing the impact of the label noise that long propagation brings. Hence, we use boundary label relaxation for the rest of the experiments.

Note that even for no propagation (x-axis equal to 0) in Fig. 4, boundary label relaxation improves performance by a large margin (). This indicates that our boundary label relaxation is versatile. Its use is not limited to reducing distortion artifacts in label propagation, but it can also be used in normal image segmentation tasks to handle ambiguous boundary labels.

Learned Motion Vectors versus Optical Flow

Here, we perform a comparison between the learned motion vectors from the video reconstruction model and optical flow, to show why optical flow is not preferred. For optical flow, we use the state-of-the-art CNN flow estimator FlowNet2 [20] because it can generate sharp object boundaries and generalize well to both small and large motions.

First, we show a qualitative comparison between the learned motion vectors and the FlowNet2 optical flow. As we can see in Fig. 5(a), FlowNet2 suffers from serious doubling effects caused by occlusion, for example the dragging car (left) and the doubled rider (right). In contrast, our learned motion vectors handle occlusion quite well; the propagated labels have only minor artifacts along the object borders, which can be remedied by boundary label relaxation. Next, we show a quantitative comparison between the learned motion vectors and FlowNet2. As we can see in Fig. 5(b), the learned motion vectors (blue) perform significantly better than FlowNet2 (red) at all propagation lengths. As we propagate longer, the gap between them becomes larger, which indicates the low quality of the FlowNet2-augmented samples. Note that when the propagation length is and , the performance of FlowNet2 is even lower than the baseline.

Figure 6: Visual comparisons on Cityscapes. The images are cropped for better visualization. We demonstrate that our proposed techniques lead to more accurate segmentation than our baseline, especially for thin and rare classes like street light and bicycle (row 1), signs (row 2), and person and poles (row 3). Our observation corresponds well to the class mIoU improvements in Table 3.
Figure 7: Visual examples on Cityscapes. From left to right: image, GT, prediction and their differences. We demonstrate that our model can handle situations with multiple cars (row 1), dense crowds (row 2) and thin objects (row 3). The bottom two rows show failure cases. We mis-classify a reflection in the mirror (row 4) and a model inside the building (row 5) as person (red boxes).
Figure 8: Visual comparison between our results and those of the winning entry [10] of ROB challenge 2018 on KITTI. From left to right: image, prediction from [10] and ours. Boxes indicate regions in which we perform better than [10]. Our model can predict semantic objects as a whole (bus), detect thin objects (poles and person) and distinguish confusing classes (sidewalk and road, building and sky).

Comparison to State-of-the-Art

As shown in Table 3 (top), our proposed video reconstruction-based data synthesis together with joint propagation improves by mIoU over the baseline. Incorporating label relaxation brings another mIoU improvement. We observe that the largest improvements come from small/thin object classes, such as pole, street light/sign, person, rider and bicycle. This can be explained by the fact that our augmented samples result in more variation for these classes and help with model generalization. We show several visual comparisons in Fig. 6.

For the test submission, we train our model using the best recipe suggested above, and replace the network backbone with WideResNet38 [38]. We adopt a multi-scale strategy [44, 14] to perform inference on multi-scaled (0.5, 1.0 and 2.0), left-right flipped and overlapping-tiled images, and compute the final class probabilities after averaging the logits per inference. More details can be found in the supplementary materials. As shown in Table 3 (bottom), we achieve an mIoU of , outperforming all prior methods. We obtain the highest IoU on 17 out of the 19 classes, all except wall and truck. In addition, we show several visual examples in Fig. 7. We demonstrate that our model can handle situations with multiple cars (row 1), dense crowds (row 2) and thin objects (row 3). We also show two interesting failure cases in Fig. 7: our model mis-classifies a reflection in the mirror (row 4) and a model inside the building (row 5) as person (red boxes). However, in terms of appearance and without reasoning about context, our predictions are correct. More visual examples can be found in the supplementary materials.
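A hedged sketch of this inference scheme is given below (averaging per-pixel logits over scales and horizontal flips; the overlapping-tile step is omitted for brevity, and the scale values simply mirror those listed above):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multiscale_flip_inference(model, image, scales=(0.5, 1.0, 2.0)):
    """image: (N, 3, H, W). Returns per-pixel class predictions after averaging
    logits over scales and left-right flips."""
    n, _, h, w = image.shape
    total = 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = model(inp)                          # (N, C, h', w')
            if flip:
                logits = torch.flip(logits, dims=[3])    # undo the flip
            total = total + F.interpolate(logits, size=(h, w), mode="bilinear",
                                          align_corners=False)
    return total.argmax(dim=1)
```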

Method Pre-train Encoder mIoU ()
SegNet [3] ImageNet VGG16
RTA [19] ImageNet VGG16
Dilate8 [42] ImageNet Dilate
BiSeNet [41] ImageNet ResNet18
PSPNet [44] ImageNet ResNet50
DenseDecoder [6] ImageNet ResNeXt101
VideoGCRF [11] Cityscapes ResNet101
Ours (baseline) Cityscapes WideResNet38
Ours Cityscapes WideResNet38
Table 4: Results on the CamVid test set. Pre-train indicates the source dataset on which the model is trained.

4.3 CamVid

CamVid is one of the first datasets focusing on semantic segmentation for driving scenarios. It is composed of densely annotated images with size from five video sequences. We follow the standard protocol proposed in [3] to split the dataset into training, validation and test images. A total of classes are provided. However, most literature only focuses on due to the rare occurrence of the remaining classes. To create the augmented samples, we directly use the video reconstruction model trained on Cityscapes without fine tuning on CamVid. The training strategy is similar to Cityscapes. We compare our method to recent literature in Table 4. For fair comparison, we only report single-scale evaluation scores. As can be seen in Table 4, we achieve an mIoU of , outperforming all prior methods by a large margin. Furthermore, our multi-scale evaluation score is . Per-class breakdown can be seen in the supplementary materials.

One may argue that our encoder is more powerful than prior methods. To demonstrate the effectiveness of our proposed techniques, we perform training under the same settings without using the augmented samples and boundary label relaxation. The performance of this configuration on the test set is , a significant IoU drop of .

4.4 KITTI

The KITTI Vision Benchmark Suite [17] was introduced in 2012 and updated with semantic segmentation ground truth [1] in 2018. The data format and metrics conform with Cityscapes, but with a different image resolution of . The dataset consists of training and test images. Since the dataset is quite small, we perform -split cross-validation fine-tuning on the training images. Eventually, we determine the best model in terms of mIoU on the whole training set, because KITTI only allows one submission per algorithm. For the test images, we run multi-scale inference by averaging over scales (, and ). We compare our method to recent literature in Table 5. We achieve significantly better performance than prior methods on all four evaluation metrics. In terms of mIoU, we outperform the previous state-of-the-art [10] by . Note that [10], the winning entry of the Robust Vision Challenge 2018, was achieved with an ensemble of five models, whereas we use only one. We show two visual comparisons between ours and [10] in Fig. 8.

Method IoU class iIoU class IoU category iIoU category
APMoEseg [23]
SegStereo [40]
AHiSS [30]
LDN2 [24]
MapillaryAI [10]
Ours
Table 5: Results on KITTI test set.

5 Conclusion

We propose an effective video prediction-based data synthesis method to scale up training sets for semantic segmentation. We also introduce a joint propagation strategy to alleviate mis-alignments in synthesized samples. Furthermore, we present a novel boundary relaxation technique to mitigate label noise. The label relaxation strategy can also be used for human annotated labels and not just synthesized labels. We achieve state-of-the-art mIoUs of on Cityscapes, on CamVid, and on KITTI. The superior performance demonstrates the effectiveness of our proposed methods.

We hope our approach inspires other ways to perform data augmentation, such as GANs [26], to enable cheap dataset collection and achieve improved accuracy on target tasks. For future work, we would like to explore soft label relaxation using the learned kernels in [34] for better uncertainty reasoning. Our state-of-the-art implementation will be made publicly available to the research community.

Acknowledgements

We would like to thank Saad Godil, Matthieu Le, Ming-Yu Liu and Guilin Liu for suggestions and discussions.

References

  • [1] H. Alhaija, S. Mustikovela, L. Mescheder, A. Geiger, and C. Rother. Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes. International Journal of Computer Vision (IJCV), 2018.
  • [2] V. Badrinarayanan, F. Galasso, and R. Cipolla. Label Propagation in Video Sequences. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
  • [3] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2017.
  • [4] G. Bertasius, J. Shi, and L. Torresani. Semantic Segmentation with Boundary Neural Fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [5] G. Bertasius, L. Torresani, S. X. Yu, and J. Shi. Convolutional Random Walk Networks for Semantic Image Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [6] P. Bilinski and V. Prisacariu. Dense Decoder Shortcut Connections for Single-Pass Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [7] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and Recognition Using Structure from Motion Point Clouds. In European Conference on Computer Vision (ECCV), 2008.
  • [8] G. W. Brown. Standard Deviation, Standard Error: Which ‘Standard’ Should We Use? American Journal of Diseases of Children, 1982.
  • [9] I. Budvytis, P. Sauer, T. Roddick, K. Breen, and R. Cipolla. Large Scale Labelled Video Data Augmentation for Semantic Segmentation in Driving Scenarios. In International Conference on Computer Vision (ICCV) Workshop, 2017.
  • [10] S. R. Bulò, L. Porzi, and P. Kontschieder. In-Place Activated BatchNorm for Memory-Optimized Training of DNNs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [11] S. Chandra, C. Couprie, and I. Kokkinos. Deep Spatio-Temporal Random Fields for Efficient Video Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [12] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic Image Segmentation with Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [13] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018.
  • [14] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In European Conference on Computer Vision (ECCV), 2018.
  • [15] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [16] R. Gadde, V. Jampani, and P. V. Gehler. Semantic Video CNNs Through Representation Warping. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [17] A. Geiger, P. Lenz, and R. Urtasun. Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [18] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. CyCADA: Cycle-Consistent Adversarial Domain Adaptation. In International Conference on Machine Learning (ICML), 2018.
  • [19] P.-Y. Huang, W.-T. Hsu, C.-Y. Chiu, T.-F. Wu, and M. Sun. Efficient Uncertainty Estimation for Semantic Segmentation in Videos. In European Conference on Computer Vision (ECCV), 2018.
  • [20] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [21] T.-W. Ke, J.-J. Hwang, Z. Liu, and S. X. Yu. Adaptive Affinity Fields for Semantic Segmentation. In European Conference on Computer Vision (ECCV), 2018.
  • [22] A. Kendall and Y. Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In Conference on Neural Information Processing Systems (NIPS), 2017.
  • [23] S. Kong and C. Fowlkes. Pixel-wise Attentional Gating for Parsimonious Pixel Labeling. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2019.
  • [24] I. Krešo, J. Krapac, and S. Šegvić. Ladder-style DenseNets for Semantic Segmentation of Large Natural Images. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [25] X. Li, Z. Liu, P. Luo, C. C. Loy, and X. Tang. Not All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer Cascade. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [26] S. Liu, J. Zhang, Y. Chen, Y. Liu, Z. Qin, and T. Wan. Pixel Level Data Augmentation for Semantic Image Segmentation using Generative Adversarial Networks. arXiv preprint arXiv:1811.00174, 2018.
  • [27] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking Wider to See Better. In International Conference on Learning Representations (ICLR), 2016.
  • [28] P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun. Predicting Deeper into the Future of Semantic Segmentation. In International Conference on Computer Vision (ICCV), 2017.
  • [29] D. Marmanis, K. Schindler, J. D. Wegner, S. Galliani, M. Datcu, and U. Stilla. Classification With an Edge: Improving Semantic Image Segmentation with Boundary Detection. ISPRS Journal of Photogrammetry and Remote Sensing, 2018.
  • [30] P. Meletis and G. Dubbelman. Training of Convolutional Networks on Multiple Heterogeneous Datasets for Street Scene Semantic Segmentation. arXiv preprint arXiv:1803.05675, 2018.
  • [31] S. K. Mustikovela, M. Y. Yang, and C. Rother. Can Ground Truth Label Propagation from Video help Semantic Segmentation? In European Conference on Computer Vision (ECCV) Workshop, 2016.
  • [32] G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder. The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. In International Conference on Computer Vision (ICCV), 2017.
  • [33] D. Nilsson and C. Sminchisescu. Semantic Video Segmentation by Gated Recurrent Flow Propagation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [34] F. A. Reda, G. Liu, K. J. Shih, R. Kirby, J. Barker, D. Tarjan, A. Tao, and B. Catanzaro. SDC-Net: Video Prediction using Spatially-Displaced Convolution. In European Conference on Computer Vision (ECCV), 2018.
  • [35] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [36] S. Sankaranarayanan, Y. Balaji, A. Jain, S. N. Lim, and R. Chellappa. Learning from Synthetic Data: Addressing Domain Shift for Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [37] R. Vieux, J. Benois-Pineau, J.-P. Domenger, and A. Braquelaire. Segmentation-based Multi-class Semantic Object Detection. Multimedia Tools and Applications, 2012.
  • [38] Z. Wu, C. Shen, and A. van den Hengel. Wider or Deeper: Revisiting the ResNet Model for Visual Recognition. arXiv:1611.10080, 2016.
  • [39] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated Residual Transformations for Deep Neural Networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [40] G. Yang, H. Zhao, J. Shi, Z. Deng, and J. Jia. SegStereo: Exploiting Semantic Information for Disparity Estimation. In European Conference on Computer Vision (ECCV), 2018.
  • [41] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation. In European Conference on Computer Vision (ECCV), 2018.
  • [42] F. Yu and V. Koltun. Multi-Scale Context Aggregation by Dilated Convolutions. In International Conference on Learning Representations (ICLR), 2016.
  • [43] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context Encoding for Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [44] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid Scene Parsing Network. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [45] X. Zhu, H. Zhou, C. Yang, J. Shi, and D. Lin. Penalizing Top Performers: Conservative Loss for Semantic Segmentation Adaptation. In European Conference on Computer Vision (ECCV), 2018.
  • [46] Y. Zhuang, F. Yang, L. Tao, C. Ma, Z. Zhang, Y. Li, H. Jia, X. Xie, and W. Gao. Dense Relation Network: Learning Consistent and Context-Aware Representation For Semantic Image Segmentation. In IEEE International Conference on Image Processing (ICIP), 2018.
  • [47] A. Zlateski, R. Jaroensri, P. Sharma, and F. Durand. On the Importance of Label Quality for Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Appendices

Appendix A Implementation Details of Our Video Prediction/Reconstruction Models

In this section, we first describe the network architecture of our video prediction model and then illustrate the training details. The network architecture and training details of our video reconstruction model are similar, except that the input is different.

Recalling equation (1) from the main submission, the future frame $\tilde{I}_{t+1}$ is given by,

$\tilde{I}_{t+1} = \mathcal{T}\big(\mathcal{G}(I_{1:t}, F_{2:t}),\ I_t\big)$

where $\mathcal{G}$ is a general CNN that predicts the motion vectors $(u, v)$ conditioned on the input frames $I_{1:t}$ and the estimated optical flows $F_{2:t}$ between successive input frames $I_i$ and $I_{i-1}$. $\mathcal{T}$ is an operation that bilinearly samples from the most recent input $I_t$ using the predicted motion vectors $(u, v)$.

In our implementation, we use the vector-based architecture described in [34]. $\mathcal{G}$ is a fully convolutional U-Net, complete with an encoder, a decoder and skip connections between encoder/decoder layers of the same output dimensions. Each of the encoder layers is composed of a convolution operation followed by a Leaky ReLU. The decoder layers are composed of a deconvolution operation followed by a Leaky ReLU. The output of the decoder is fed into one last convolutional layer to generate the motion vector predictions. The input to $\mathcal{G}$ is the two most recent frames together with the estimated optical flow between them (8 channels), and the output is the predicted 2-channel motion vectors that can best warp $I_t$ to $I_{t+1}$. For the video reconstruction model, we simply add the future frame $I_{t+1}$ and the estimated flow between $I_t$ and $I_{t+1}$ to the input, and change the number of channels in the first convolutional layer accordingly (13 instead of 8, given 3-channel frames and 2-channel flows).
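To illustrate the input difference between the two variants, the sketch below stacks the channels as described above (assuming 3-channel RGB frames and 2-channel flows; the U-Net body from [34] is omitted):

```python
import torch

def prediction_input(i_prev, i_curr, flow_prev_to_curr):
    """Video prediction: two most recent frames plus their flow = 8 channels."""
    return torch.cat([i_prev, i_curr, flow_prev_to_curr], dim=1)   # (N, 8, H, W)

def reconstruction_input(i_prev, i_curr, i_next,
                         flow_prev_to_curr, flow_curr_to_next):
    """Video reconstruction: additionally feed the observed future frame and its
    flow, so the first convolution sees 13 input channels instead of 8."""
    return torch.cat([i_prev, i_curr, i_next,
                      flow_prev_to_curr, flow_curr_to_next], dim=1)  # (N, 13, H, W)
```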

We train our video prediction model using frames extracted from short sequences in the Cityscapes dataset. We use the Adam optimizer with , , and a weight decay of . The frames are randomly cropped to with no extra data augmentation. We set the batch size to 128 over 8 V100 GPUs. The initial learning rate is set to and the number of epochs is . We refer interested readers to [34] for more details.

Appendix B Non-Accumulated and Accumulated Comparison

Recalling Sec. 4.1 from the main submission, we have two ways to augment the dataset. The first is the non-accumulated case, where we simply use synthesized data from timesteps $\pm k$, excluding intermediate synthesized data from timesteps $\pm 1$ through $\pm(k-1)$. For the accumulated case, we include all the synthesized data from timesteps $\pm 1$ through $\pm k$, which makes the augmented dataset several times larger than the original training set.

We showed that we achieved the best performance at , so we use here. We compare three configurations:

  1. Baseline: using the ground truth dataset only.

  2. Non-accumulated case: using the union of the ground truth dataset and the synthesized samples at timesteps $\pm k$;

  3. Accumulated case: using the union of the ground truth dataset and all synthesized samples from timesteps $\pm 1$ through $\pm k$.

For these experiments, we use boundary label relaxation and joint propagation. We report segmentation accuracy on the Cityscapes validation set.

Method Baseline Non-accumulated Accumulated
mIoU ()
Table 6: Accumulated and non-accumulated comparison. The numbers in brackets are the sample standard deviations.

We have two observations from Table 6. First, using the augmented dataset always improves segmentation quality as quantified by mIoU. Second, the non-accumulated case performs better than the accumulated case. We suspect this is because the cumulative case significantly decreases the probability of sampling a hand-annotated training example within each epoch, ultimately placing too much weight on the synthesized ones and their imperfections.

Appendix C Cityscapes

C.1 More Training Details

We perform 3-split cross-validation to evaluate our algorithms, split by city. The three validation splits are {cv0: munster, lindau, frankfurt}, {cv1: darmstadt, dusseldorf, erfurt} and {cv2: monchengladbach, strasbourg, stuttgart}; the remaining cities form the corresponding training set in each case. cv0 is the standard validation split. We found that models trained on the cv2 split lead to higher performance on the test set, so we adopt the cv2 split for our final test submission. Using our best model, we perform multi-scale inference on the ‘stuttgart00’ sequence and generate a demo video. The video is composed of both video frames and predicted semantic labels, with alpha blending.
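For reference, the splits above expressed as a small configuration sketch (city names taken directly from the text; the helper is illustrative only):

```python
# Cityscapes 3-split cross-validation by city (cv0 is the standard val split).
CV_SPLITS = {
    "cv0": ["munster", "lindau", "frankfurt"],
    "cv1": ["darmstadt", "dusseldorf", "erfurt"],
    "cv2": ["monchengladbach", "strasbourg", "stuttgart"],
}

def train_cities(all_cities, split):
    """Training cities for a split: everything not held out for validation."""
    held_out = set(CV_SPLITS[split])
    return [c for c in all_cities if c not in held_out]
```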

C.2 Failure Cases

We show several more failure cases in Fig. 9. First, we show four challenging scenarios of class confusion. From rows (a) to (d), our model has difficulty in segmenting: (a) car and truck, (b) person and rider, (c) wall and fence, and (d) terrain and vegetation.

Furthermore, we show three cases where it could be challenging even for a human to label. In Fig. 9 (e), it is very hard to tell whether it is a bus or train when the object is far away. In Fig. 9 (f), it is also hard to predict whether it is a car or bus under such strong occlusion (more than of the object is occluded). In Fig. 9 (g), there is a bicycle hanging on the back of a car. The model needs to know whether the bicycle is part of the car or a painting on the car, or whether they are two separate objects, in order to make the correct decision.

Finally, we show two training samples where the annotation might be wrong. In Fig. 9 (h), the rider should be on a motorcycle, not a bicycle. In Fig. 9 (i), there should be a fence in front of the building; however, the whole region was labelled as building by a human annotator. In both cases, our model predicts the correct semantic labels.

C.3 More Synthesized Training Samples

We show synthesized training samples in the demo video to give readers a better understanding. Each is a -frame video clip, in which only the th frame is the ground truth. The neighboring frames are generated using the video reconstruction model. We also show the comparison to using the video prediction model and FlowNet2 [20]. In general, the video reconstruction model gives us the best propagated frames/labels in terms of visualization. It also works the best in our experiments in terms of segmentation accuracy. Since the Cityscapes dataset is recorded at 17Hz [15], the motion between frames is very large. Hence, propagation artifacts can be clearly observed, especially at the image borders.

Appendix D CamVid

D.1 Class Breakdown

We show the per-class mIoU results in Table 7. Our model has the highest mIoU on 8 out of the 11 classes (all classes but tree, sky and sidewalk). This is expected because our synthesized training samples help more on classes with small/thin structures. Overall, our method significantly outperforms the previous state-of-the-art in mIoU.

Method Build. Tree Sky Car Sign Road Pedes. Fence Pole Swalk Cyclist mIoU
RTA [19]
Dilate8 [42]
BiSeNet [41]
VideoGCRF [11]
Ours (SS)
Ours (MS)
Table 7: Per-class mIoU results on CamVid, compared with recent top-performing models on the test set. ‘SS’ indicates single-scale inference, ‘MS’ indicates multi-scale inference. Our model achieves the highest mIoU on 8 out of the 11 classes (all classes but tree, sky and sidewalk). This is expected because our synthesized training samples help more on classes with small/thin structures.

D.2 More Synthesized Training Samples

For CamVid, we show two demo videos of synthesized training samples. One is on the validation sequence ‘006E15’, which is manually annotated every other frame. The other is on the training sequence ‘0001TP’, which has manually annotated labels for every 30th frame. For ‘006E15’, we do one step of forward propagation to generate a label for the unlabeled intermediate frame. For ‘0001TP’, we do steps of forward propagation and steps of backward propagation to label the unlabeled frames in between. For both videos, the synthesized samples are generated using the video reconstruction model trained on Cityscapes, without fine-tuning on CamVid. This demonstrates the great generalization ability of our video reconstruction model.

Figure 9: Failure cases (in yellow boxes). From left to right: image, ground truth, prediction and their difference. Green boxes are zoomed-in regions for better visualization. Rows (a) to (d) show class confusion problems; our model has difficulty in segmenting: (a) car and truck, (b) person and rider, (c) wall and fence, and (d) terrain and vegetation. Rows (e) to (g) show challenging cases in which the object is far away, strongly occluded, or overlaps other objects. The last two rows show two training samples with wrong annotations: (h) a motorcycle mislabeled as bicycle and (i) a fence mislabeled as building.

Appendix E Demo Video

We present all the video clips mentioned above at https://nv-adlr.github.io/publication/2018-Segmentation.
