Coherent Loss: A Generic Framework for Stable Video Segmentation



Video segmentation approaches are of great importance for numerous vision tasks, especially in video manipulation for entertainment. Because high-quality per-frame segmentation annotations and large video datasets covering diverse environments are difficult to acquire at scale, learning-based approaches show higher overall accuracy on test datasets but lack strict temporal constraints to self-correct jittering artifacts in most practical applications. We investigate how this jittering artifact degrades the visual quality of video segmentation results and propose a temporal stability metric to evaluate it numerically. In particular, we propose a Coherent Loss with a generic framework to make a neural network more robust against jittering artifacts, combining high accuracy with high consistency. Equipped with our method, existing video object and semantic segmentation approaches achieve a significant improvement in visual quality on a video human dataset, which we provide for further research in this field, and also on DAVIS and Cityscapes.


1 Dalian University of Technology, 2 Department of Computer Vision Technology (VIS), Baidu Inc., fuyi02, tanxiao01, liyingying05, wenshilei,,


As a dense pixel-level prediction task, video segmentation aims either to label all pixels in a video sequence (semantic segmentation) or to differentiate individual instances and assign a consistent object ID to each instance over the sequence (instance segmentation). It remains a foundational building block in many autonomous driving and interactive entertainment products. For example, many interactive applications, e.g., background transition, clothes changing and auto clipping, require a fast and accurate semantic and/or instance segmentation method to target the object or human of interest. Unfortunately, most off-the-shelf methods fail to provide satisfactory video segmentation results: the results may vary considerably even when the scene undergoes only a tiny change, as shown in Figure 1. This phenomenon can lead to poor alignment and temporal jittering of object boundaries in videos, which significantly undermines visual quality and hence degrades the interactive experience. In addition, per-pixel accuracy metrics such as mIoU measure the static segmentation performance over all test samples, but jittering artifacts in scattered frames affect the perception of the whole video and produce perceptually jarring results, so mIoU cannot reflect the real viewing experience.

Figure 1: Segmentation results from the official PSPNet Song et al. (2018) on four consecutive frames (t to t+3). The second row gives a close-up view of the details. Note that inconsistencies appear on the more discriminative regions, such as the billboard.

Recent segmentation methods succeed in detecting object identity better and with clearer edges, but fail to alleviate the jittering artifacts and misaligned distortions along object boundaries when processing frames sequentially. Moreover, although temporal coherency has been valued in many methods Voigtlaender et al. (2019); Hu et al. (2020), they still face the following unavoidable risks. Camera motion inevitably leads to motion blur, which confuses the predictions and causes jittering artifacts. Even worse, the “ground truth” boundaries labeled by annotators are in fact not that accurate Hoebel et al. (2019): manual annotations are known to involve inherently uncertain boundary areas even in a single image, let alone perfectly aligned annotations across a video sequence. As a result, jittering is observed when we apply a predictor trained on independent per-frame annotations, and the segmented masks cannot adhere well to a completely defined object throughout the sequence, even if the masks are the annotations themselves. Besides, a metric like mIoU, which compares the accuracy of results against per-frame annotations, cannot reflect temporal coherency when the predictor’s bias is small but its variance is large, especially along the boundary. Therefore, we propose to evaluate the jittering artifact as an important indicator and to fundamentally alleviate the jitter in actual results, to which video manipulation applications are more sensitive.

Note that perfect coherency can be achieved trivially at the expense of accuracy by assigning the same label to all pixels in every frame; a combination of high accuracy and high coherency, however, is not easy to achieve. Instead of relying purely on mIoU for evaluation, we first emphasize temporal coherency and introduce a simple metric, the stability rate, to measure the overall prediction consistency in a sequence. We then introduce a Coherent Loss with a generic framework that acts as a neural network training strategy and exploits unlabeled videos to enhance temporal coherency. Because it is orthogonal to the network architecture, it can be combined with state-of-the-art segmentation methods to alleviate jittering artifacts without adding extra prediction overhead.

Unfortunately, most current datasets for segmentation, such as DAVIS-16 Perazzi et al. (2016) and Cityscapes Cordts et al. (2016), are small and contain only a few labeled videos. Using such datasets for video segmentation may result in poor performance and is not suitable for evaluating stability. Moreover, humans are an essential segmentation category, and human data is easily accessible owing to strong demand in practical products. We therefore collect a large dataset for video human segmentation, which contains about 100,000 labeled images and 130 unlabeled videos, for future research. Our contributions are as follows:

  • We expose a new dimension for evaluating video segmentation methods, temporal coherency, which plays an important role in video segmentation tasks, and introduce a new numeric metric, the stability rate, to explicitly measure the coherency of video segmentation results.

  • We introduce a Coherent Loss with a generic framework, a training strategy that exploits unlabeled videos to enhance temporal coherency in video segmentation. Because it is orthogonal to the network architecture, it can be combined with many state-of-the-art neural-network-based segmentation methods without adding extra prediction overhead.

  • We collect a large video human segmentation dataset for research, show improved visual stability on this dataset, and also achieve improvements on the public DAVIS and Cityscapes datasets.

Related Work

Segmentation Methods

Many video object segmentation methods adapt the neural network to better propagate temporal information. Huang et al. (2020) propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance. Nilsson and Sminchisescu (2018) propose a Spatio-Temporal Transformer GRU module that temporally propagates labeling information using FlowNet Ilg et al. (2017), adaptively gated based on its locally estimated uncertainty. Wang et al. (2019) fuse the features and masks of the previous frame with the current frame and supervise the result with the current annotation to enhance temporal consistency indirectly. Though these methods achieve high mIoU, running them sequentially on each frame of a video usually leads to jittering and unaligned distortions along object boundaries. Our method differs from these works in that we strictly constrain the coherency between tracked pixels with the proposed loss function rather than relying on defective labels for per-frame supervision. In addition, since the video jitter described above cannot be avoided by per-frame supervision (jitter-free manual annotation is difficult to obtain), we use unlabeled videos for unsupervised learning and improve the temporal stability of sequential results directly by minimizing pixel-level mismatch errors between them.

Figure 2: Framework overview. The per-pixel supervision utilizes labeled samples for better segmentation while the proposed coherent supervision utilizes unlabeled videos for self-correcting unstable details.

Methods about Coherency

It is difficult to annotate uncertain areas along object boundaries consistently across frames, so some methods take completely different approaches. Kundu et al. (2016) employ a regularization to optimize the mapping of pixels to a Euclidean feature space so as to minimize distances between corresponding points. It significantly enhances temporal coherency after post-processing but is very time-consuming. In Dong et al. (2018), supervision-by-registration is proposed to reduce the jitter of key points in facial landmark detectors on both images and video, which is a valuable reference for our work. Besides, it has been questioned whether mIoU, which evaluates the quality of segmentation against the annotation, can measure the coherency of video segmentation. Hoebel et al. (2019) assess several measures of uncertainty for segmentation in medical imaging and their correlation with segmentation quality, while Hendrycks and Dietterich (2019) discuss the robustness of models under different noise disturbances and provide metrics on synthetic datasets. Although the consistency metric in Kundu et al. (2016) and Nilsson and Sminchisescu (2018) measures whether all pixels along a track are assigned the same label, it incurs larger errors when the positive or negative regions are much larger. We propose a new metric to explicitly measure the coherency of video segmentation, which is based on the consistent predictions of corresponding pixels and can selectively reflect the coherency of global or local segmentation in each video, and we also present a novel Coherent Loss to enhance temporal coherency.



Widely used segmentation loss functions involve a Cross-Entropy term for per-pixel prediction, which measures the similarity between each prediction and the corresponding ground truth and may not be good at maintaining temporal coherency. We hence introduce our Coherent Loss to alleviate the visual jittering artifact in video segmentation; an overview of our framework is shown in Figure 2. Given a trained network that provides satisfactory video segmentation results, we propose an online matching strategy named coherent supervision to enhance coherency on top of per-pixel supervision. Ideally, we would train with fully labeled video sequences; however, (1) it is difficult to annotate with perfect alignment across video frames, and (2) annotating video frames is very costly, especially at higher accuracy. Therefore, the proposed Coherent Loss exploits unlabeled videos to learn to predict consistently in corresponding regions between successive frames. Because it is orthogonal to the network, this paradigm can be combined with many successful methods for better performance.

Per-Pixel Loss

Many segmentation networks take an image as input and output a predicted softmax map. Typically, they apply a Cross-Entropy loss between the predicted map and the ground truth, i.e., a loss that compares the predicted categorical probability at each pixel with its ground-truth label. Note that when training with different kinds of networks, we can still use the original loss functions and settings described in their methods.
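As a minimal sketch of the per-pixel Cross-Entropy described above, the following NumPy snippet averages the negative log-probability of the ground-truth class over all pixels. The function name, the array layout (height, width, classes) and the choice of averaging rather than summing are assumptions made for illustration.

```python
import numpy as np

def per_pixel_cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy between a softmax map and integer labels.

    probs:  (H, W, C) per-pixel class probabilities (assumed layout).
    labels: (H, W) integer ground-truth class indices.
    """
    # Pick, for every pixel, the probability predicted for its true class.
    picked = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
    # eps guards against log(0); the mean over pixels is an assumption.
    return float(-np.log(picked + eps).mean())

# Tiny 1x2 image with two classes.
probs = np.array([[[0.9, 0.1], [0.2, 0.8]]])
labels = np.array([[0, 1]])
loss = per_pixel_cross_entropy(probs, labels)
```

For this toy input the loss is the average of -log(0.9) and -log(0.8).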

Coherent Loss

Since temporal coherency requires corresponding pixels in successive frames to have the same label, we use optical flow to find those corresponding pixels in each image pair. The Coherent Loss performs online matching on unlabeled videos: it applies a loss function directly across these corresponding points to compute mismatched prediction errors and back-propagates the gradient for better alignment during training. Specifically, the Coherent Loss (shown in Figure 3) addresses two main issues to obtain more temporally coherent results. The first is misalignment around the boundary region caused by training with inconsistent annotations Hoebel et al. (2019); Boundary Coherency is designed to improve the temporal alignment of the sensitive boundary areas between adjacent frames. The second is sudden mis-segmentation due to undetectable appearance differences between targets, changes in lighting, or deformation caused by camera jitter; Global Coherency is designed to minimize such mis-segmentation over the whole sequence.

Finding Corresponding Pixels. Motivated by Dong et al. (2018), who design a differentiable Lucas-Kanade operation to track facial landmarks, we design an iterative algorithm to calculate the optical flow of the whole image. We keep two sets of locations between two frames: the set of all tracked points, and the subset of tracked points that pass the dual matching of a forward-backward check Kalal et al. (2010) for better accuracy. For simplicity, we describe warping as a mapping that carries the labels at either set of locations along the optical flow, i.e., each location is displaced by its flow offsets.
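The forward-backward check can be illustrated with a short NumPy sketch. It assumes dense forward and backward flow fields are already available (the iterative flow algorithm itself is not reproduced here), and the function name, the nearest-pixel rounding and the tolerance `tol` are hypothetical choices rather than the paper's settings.

```python
import numpy as np

def forward_backward_mask(flow_fw, flow_bw, tol=1.0):
    """Return (tracked, reliable) boolean masks for a dense flow pair.

    flow_fw: (H, W, 2) offsets (dx, dy) from frame t to frame t+1.
    flow_bw: (H, W, 2) offsets (dx, dy) from frame t+1 back to frame t.
    tol:     maximum round-trip error in pixels (hypothetical default).
    """
    h, w, _ = flow_fw.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Follow the forward flow and round to the nearest pixel.
    x1 = np.rint(xs + flow_fw[..., 0]).astype(int)
    y1 = np.rint(ys + flow_fw[..., 1]).astype(int)
    # Tracked points are those that land inside the next frame.
    tracked = (x1 >= 0) & (x1 < w) & (y1 >= 0) & (y1 < h)
    # Look up the backward flow at the (clipped) landing position.
    x1c, y1c = np.clip(x1, 0, w - 1), np.clip(y1, 0, h - 1)
    back = flow_bw[y1c, x1c]
    # A reliable point returns close to where it started.
    err = np.hypot(x1 + back[..., 0] - xs, y1 + back[..., 1] - ys)
    reliable = tracked & (err <= tol)
    return tracked, reliable

# Uniform shift of one pixel to the right, with one corrupted backward vector.
h, w = 4, 4
flow_fw = np.zeros((h, w, 2)); flow_fw[..., 0] = 1.0
flow_bw = np.zeros((h, w, 2)); flow_bw[..., 0] = -1.0
flow_bw[0, 1] = (3.0, 0.0)
tracked, reliable = forward_backward_mask(flow_fw, flow_bw)
```

In this toy case the last column leaves the image and is not tracked, and the pixel whose backward vector was corrupted is tracked but rejected by the check.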

Figure 3: The overview of Coherent loss.

Boundary Coherency. As mentioned above, per-frame supervision with imprecise annotations along uncertain boundary areas leads to incoherent segmentation boundaries. Boundary Coherency treats all tracked corresponding pixels in the boundary areas as robust and encourages them to have the same label, which enforces the coherency of the segmented boundary areas.

We first extract a single boundary line from the network output. Treating this line as the center of the uncertain boundary area, we extend it to a band of a certain width that covers the potential areas of uncertainty. We set the width to several pixels in accordance with general conditions; the resulting uncertain-boundary-area mask is a binary map. Then, the predictions at all positions in the uncertain area of one frame are regarded as hypothetical ground truth to supervise the corresponding predictions warped from the other frame along the optical flow. We use the standard Lovasz-Softmax loss to optimize the intersection-over-union within the uncertain area only. The backward direction is processed in the same way, so the Boundary Coherency loss combines a forward and a backward term, each using the hypothetical ground-truth labels taken from one frame's prediction at every position of the uncertain area.
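The construction of the uncertain boundary band can be sketched as follows. This is only an illustration under stated assumptions: the boundary line is taken as every pixel with a differently labeled 4-neighbor, the band is grown with a simple 4-connected dilation, and the function name is hypothetical. The Lovasz-Softmax loss that would then be evaluated inside the band is not reproduced here.

```python
import numpy as np

def boundary_band(mask, width):
    """Binary band about `width` pixels wide, centered on the mask boundary.

    The initial line is two pixels thick (one pixel on each side of the
    label change), so the band is only approximately `width` wide.
    """
    m = mask.astype(bool)
    # Edge padding keeps the image border from counting as a boundary.
    p = np.pad(m, 1, mode="edge")
    c = p[1:-1, 1:-1]
    # A pixel is on the boundary line if any 4-neighbor has a different label.
    line = ((c != p[:-2, 1:-1]) | (c != p[2:, 1:-1]) |
            (c != p[1:-1, :-2]) | (c != p[1:-1, 2:]))
    band = line.copy()
    # Grow the line by one pixel per step (4-connected dilation).
    for _ in range(width // 2):
        q = np.pad(band, 1, mode="constant")
        band = (q[1:-1, 1:-1] | q[:-2, 1:-1] | q[2:, 1:-1] |
                q[1:-1, :-2] | q[1:-1, 2:])
    return band

# Foreground occupies the right half of an 8x8 prediction.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[:, 4:] = 1
band = boundary_band(mask, width=2)
```

Here the band covers columns 2 to 5, i.e., the label change between columns 3 and 4 plus one pixel of margin on each side.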

Global Coherency. As mentioned above, mis-segmentation due to sudden changes in the external environment causes temporal jittering at the global scale. We handle it by learning temporal information that keeps predictions consistent across frames from a global perspective. The motivation of Global Coherency is therefore that the prediction for one frame serves as a reference that the prediction for the adjacent frame learns to approach.

As shown in Figure 3, from frame t to frame t+1, we use the precomputed optical flow to warp the prediction for frame t to the next time step. Since Global Coherency is unsupervised, training may suffer from misalignment if labels from the warped prediction are used directly for supervision. However, labels with higher confidence scores should be trusted more than those with lower confidence scores. We therefore adopt a compromise: using an appropriate threshold, we select all points whose top-1 score in one prediction is higher than that of the aligned point in the other.

More precisely, we restrict the loss to these tracked points with higher confidence scores and use a Soft Cross-Entropy loss De Boer et al. (2005) to reduce the mismatch errors at these tracked pixels. We likewise apply Global Coherency from frame t+1 to frame t, so the loss has a forward and a backward term. In each term, the target softmax scores are taken at the selected positions in the prediction for frame t or frame t+1, and the predicted scores are taken from the corresponding warped prediction.
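One direction of the Global Coherency term can be sketched in NumPy as below. The sketch assumes the two softmax maps are already aligned by the flow, interprets the threshold as a margin between top-1 scores (an assumption), and averages the Soft Cross-Entropy over the selected pixels; the function and argument names are hypothetical.

```python
import numpy as np

def global_coherency_term(target, warped, tracked, tau=0.05, eps=1e-12):
    """Soft cross-entropy on tracked pixels where the target is more confident.

    target:  (H, W, C) softmax scores used as the soft target.
    warped:  (H, W, C) softmax scores of the other frame, aligned by the flow.
    tracked: (H, W) boolean mask of tracked points.
    tau:     confidence margin (assumed interpretation of the threshold).
    """
    conf_t = target.max(axis=-1)
    conf_w = warped.max(axis=-1)
    # Keep only tracked points whose target top-1 score wins by the margin.
    selected = tracked & (conf_t - conf_w > tau)
    if not selected.any():
        return 0.0, selected
    # Soft cross-entropy: the target distribution weights the log-prediction.
    soft_ce = -(target * np.log(warped + eps)).sum(axis=-1)
    return float(soft_ce[selected].mean()), selected

target = np.array([[[0.9, 0.1], [0.6, 0.4]]])
warped = np.array([[[0.6, 0.4], [0.7, 0.3]]])
tracked = np.ones((1, 2), dtype=bool)
loss_gc, selected = global_coherency_term(target, warped, tracked)
```

Only the first pixel is selected in this example, because the target is more confident there (0.9 against 0.6), while the second pixel is skipped. The backward direction would call the same function with the roles of the two frames swapped.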

| Method | Stable videos | Score | Mean score | Stability rate (global) | Stability rate (boundary) | mIoU(syn) | mIoU(test) |
|---|---|---|---|---|---|---|---|
| Method1 | 7 / 30 | 68 | 2.27 | 99.17 | 95.46 | 91.20 | 95.14 |
| Method2 | 8 / 30 | 68 | 2.27 | 99.18 | 95.54 | 93.22 | 96.20 |
| Method3 | 17 / 30 | 76 | 2.53 | 99.22 | 95.94 | 93.80 | 95.37 |
| Method4 | 27 / 30 | 91 | 3.03 | 99.28 | 96.54 | 94.28 | 94.28 |
| Method5 | 26 / 30 | 95 | 3.17 | 99.29 | 96.79 | 94.31 | 95.91 |

Table 1: User study on real human videos. For simplicity, Method1 and Method2 are the base model of Li et al. (2019) with ResNet18 and ResNet50 backbones; Method3, Method4 and Method5 are fine-tuned from the trained Method2 model with the Coherent Loss under different parameters, using unlabeled videos.

Total Loss

The complete loss function contains two components: the per-pixel segmentation loss for spatial supervision, and the Boundary Coherency and Global Coherency losses for coherent supervision. We use a balanced combination of the segmentation loss and the coherent losses, controlled by two weights, one for the Boundary Coherency loss and one for the Global Coherency loss.
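As a sketch of this weighted combination, and with symbol names that are assumptions chosen for illustration ($\mathcal{L}_{seg}$ for the per-pixel segmentation loss, $\mathcal{L}_{bc}$ and $\mathcal{L}_{gc}$ for the Boundary and Global Coherency losses, $\lambda_{bc}$ and $\lambda_{gc}$ for their weights, and an assumed unit weight on the segmentation loss), the total loss can be written as:

```latex
\mathcal{L}_{total} \;=\; \mathcal{L}_{seg} \;+\; \lambda_{bc}\,\mathcal{L}_{bc} \;+\; \lambda_{gc}\,\mathcal{L}_{gc}
```

With the weights reported for the human segmentation experiments (1 for boundary coherency and 5e-5 for global coherency), this would read $\mathcal{L}_{seg} + \mathcal{L}_{bc} + 5\times10^{-5}\,\mathcal{L}_{gc}$.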


We propose a large video human segmentation dataset (VHS) consisting of labeled images and unlabeled videos. The labeled images are collected from public sources: 5,168 multi-person images from LIP Gong et al. (2017), chosen for their completely annotated human instances, 17,706 images from ATR Liang et al. (2015), 34,426 images from AISegment AISegment (2019), 38,781 images from AIC Wu et al. (2019) and 5,200 images from the Supervisely Person Dataset Hackernoon (2018). We relabeled all static images as binary masks. The unlabeled videos are selfie videos that we collected from different people, each lasting from 30 seconds to 1 minute. Details are described in the Supplementary Material.


We propose a new metric for evaluating the temporal coherency of each method, and apply it to different models.


A more satisfactory video segmentation means that, for a particular pixel in the video (except for points that leave the image), the model predicts the same, accurate label across all frames. We propose a new metric, the stability rate, which tracks each particular pixel and measures the coherency of its predictions over the whole sequence. However, because it is too difficult to track the offset of a pixel through annotations, we take finely labeled images and synthesize video sequences from them for evaluation.

Following Hendrycks and Dietterich (2019), we choose 4 common perturbations (translation, rotation, scaling and occlusion) and 1 corruption (Gaussian noise with fixed parameters) to generate the synthetic test dataset. Each perturbation type is set to six levels of severity to simulate real-world perturbations. For each severity, we take the labeled image as the first frame and randomly apply one of the perturbations to generate the second frame; each following frame in the sequence is likewise a perturbation of the previous one, with slight Gaussian noise applied throughout. We set the sequence length to 11 (including the first image) so that repeated application of a perturbation does not bring the image far out of distribution.
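The synthesis procedure can be illustrated for the translation perturbation with a small NumPy sketch. It is a simplification under stated assumptions: the shift wraps around the image border via `np.roll` instead of producing occluded regions, the noise level is arbitrary, and all names are hypothetical. The point is that the per-step flow is known by construction.

```python
import numpy as np

def translate_sequence(image, label, step, length=11, noise_std=0.01, seed=0):
    """Build a synthetic sequence by repeatedly shifting the previous frame.

    step: (dx, dy) integer shift applied at every step.
    Returns the frames, the matching labels, and the known per-step flow.
    """
    rng = np.random.default_rng(seed)
    dx, dy = step
    frames, labels = [image.astype(float)], [label]
    for _ in range(length - 1):
        # np.roll wraps around the border; a simplification of a real shift.
        shifted = np.roll(frames[-1], shift=(dy, dx), axis=(0, 1))
        # Slight Gaussian noise is added to every generated frame.
        frames.append(shifted + rng.normal(0.0, noise_std, size=shifted.shape))
        labels.append(np.roll(labels[-1], shift=(dy, dx), axis=(0, 1)))
    return frames, labels, (dx, dy)

image = np.zeros((6, 6))
label = np.zeros((6, 6), dtype=int)
label[2, 1] = 1
frames, labels, flow = translate_sequence(image, label, step=(1, 0))
```

Because every frame is derived from the previous one, the labeled pixel can be followed through all 11 frames without any annotation of the later frames.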

Since each frame is generated from the previous one, the matching pairs, i.e., the optical flow, are known for all pixels. Because the ground-truth flow is given, we can calculate the stability in any specific region, or over the whole image, by using part or all of the tracked locations; all un-occluded pixels can therefore be used for evaluation. For this purpose, the stability rate is the proportion of pixels with consistent predictions among all un-occluded pixels in the specified region. To be more realistic, we use generic optical flow to limit the scope of the evaluation. Over these tracked points, the stability rate is the proportion of pixels whose predictions remain consistent when tracked from frame t to frame t+1, for every two successive frames. For notational brevity, we assume there is only one video or sequence with N frames, where N is the length of each video sequence in the synthetic test dataset. The metric compares the predicted label at each tracked position in frame t with the predicted label at its corresponding position in frame t+1.
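A minimal NumPy sketch of the stability rate follows. It assumes integer flow offsets, averages the per-pair consistency over the N-1 successive frame pairs (the exact aggregation is an assumption), and accepts an optional region mask so that the same function can serve the global and boundary variants; all names are hypothetical.

```python
import numpy as np

def stability_rate(preds, flows, region=None):
    """Percentage of tracked pixels whose label is unchanged between
    successive frames, averaged over the N-1 frame pairs.

    preds:  list of N (H, W) integer label maps.
    flows:  list of N-1 (H, W, 2) integer offsets (dx, dy), frame t -> t+1.
    region: optional (H, W) boolean mask restricting the evaluation.
    """
    h, w = preds[0].shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    rates = []
    for t, flow in enumerate(flows):
        x1 = xs + flow[..., 0]
        y1 = ys + flow[..., 1]
        # Pixels that leave the image are excluded from the evaluation.
        valid = (x1 >= 0) & (x1 < w) & (y1 >= 0) & (y1 < h)
        if region is not None:
            valid = valid & region
        if not valid.any():
            continue
        same = preds[t][ys[valid], xs[valid]] == preds[t + 1][y1[valid], x1[valid]]
        rates.append(same.mean())
    return 100.0 * float(np.mean(rates)) if rates else float("nan")

# Three static frames; one pixel flips between the first two predictions.
a = np.zeros((2, 4), dtype=int)
b = a.copy()
b[0, 0] = 1
zero_flow = np.zeros((2, 4, 2), dtype=int)
sr = stability_rate([a, b, b], [zero_flow, zero_flow])
```

In this toy sequence 7 of 8 pixels are consistent in the first pair and all 8 in the second, so the result is the average of 87.5 and 100.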

To verify that the proposed stability rate is a valid measure of coherency and is consistent with human perception, we carry out a user study on VHS, which is more common and suitable for observation and evaluation. We recruited 30 students to rank 5 methods based on results from 30 videos, with different values of the global stability rate (computed over the whole region) and the boundary stability rate (computed over a 15-pixel-wide area around the ground-truth edge). For each segmented video, we set four evaluation scores: 1 point for an inaccurate and unstable segmented video, 2 points for accurate but unstable, 3 points for inaccurate but stable, and 4 points for accurate and stable. All ratings are subjective evaluations, and we recorded the number of videos considered stable, the total score over all videos, and the average score.

We report the global and boundary stability rates and mIoU(syn) on the synthetic test dataset, and mIoU(test) on the test set for reference. Table 1 shows the results. For Method1, with a boundary stability rate of 95.46, only 7 videos are regarded as stably segmented, and the average score is 2.27, which means that users consider the method accurate but quite unstable. As the stability rate of the model rises, more videos are regarded as stably segmented. When the boundary stability rate rises to 96.79, the mean score reaches its best value of 3.17, with about 87% of videos regarded as stably segmented. One may be concerned that accuracy is what provides the foundation for stability; however, even though Method2 improves mIoU(syn) by 2% over Method1, its mean score is still 2.27, i.e., users do not perceive a significant improvement. Therefore, we believe that the stability metric reflects intuitive human perception, which supports the validity of the metric.
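The summary figures quoted from Table 1 can be rechecked with a few lines of Python. The snippet only uses the stable-video counts and total scores from the table and assumes, as the numbers suggest, that the mean score is the total score divided by the 30 videos; the variable names are illustrative.

```python
# (stable videos, total score) per method, copied from Table 1.
num_videos = 30
results = {
    "Method1": (7, 68),
    "Method2": (8, 68),
    "Method3": (17, 76),
    "Method4": (27, 91),
    "Method5": (26, 95),
}

# Percentage of stable videos and mean score per video for each method.
summary = {
    name: (round(100 * stable / num_videos), round(total / num_videos, 2))
    for name, (stable, total) in results.items()
}
```

Under this assumption the computed mean scores match the table, and Method5 gives 26 of 30 stable videos, i.e., about 87%.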





Figure 4: Qualitative video results comparing SCHP with and without CL on VHS. The results are significantly more stable across adjacent frames.

Video Human Segmentation

Dataset. The video human segmentation dataset (VHS) provides 101,281 labeled human body images and 130 unlabeled videos. We use 91,281 images and 100 videos for training, and 10,000 images and 30 videos for testing.

Experiment Settings. We use the SCHP implementation by Li et al. (2019) as our baseline and apply our Coherent Loss to it. SCHP is designed to enhance boundary accuracy in order to improve the accuracy of the whole mask in person-part parsing, and is a state-of-the-art method that ranked 1st in the CVPR 2019 LIP Challenge. We adapt this network by using ResNet50 as the backbone and replacing the output layer for binary segmentation.

The training procedure is divided into two steps. First, we initialize the network with ImageNet pre-trained weights and train the baseline on labeled images with a balanced Cross-Entropy and Lovasz-Softmax loss. The input labeled images are resized with data augmentation, and the batch size is 24 per GPU, for 200 epochs on Tesla P40. We use the SGD optimizer with an initial learning rate of 7e-3; momentum and weight decay are set to 0.9 and 5e-4, respectively. We make sure the segmentation loss has converged before activating the Coherent Loss. Second, we use unlabeled videos to fine-tune the network with the proposed loss function. We sample each video at equal intervals to generate about 30 image pairs of two consecutive frames per video, and all images are resized. The training batch size is 24 for a total of 10 epochs, and for each batch we use 12 labeled images for full supervision and 6 random image pairs for coherent supervision. We use the SGD optimizer with an initial learning rate of 1e-4; momentum and weight decay are again 0.9 and 5e-4. In equation 1, the width of the uncertain boundary areas is set to 15, and the threshold for global consistency is 5e-2 according to our experiments. In equation 3, the weight for the boundary coherency loss (bc) is set to 1 and the weight for the global coherency loss (gc) is set to 5e-5, which balances the supervised loss and the Coherent Loss.
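For reference, the two training stages can be collected into a small configuration sketch. The dictionary keys are hypothetical names chosen for readability; the values are the ones stated above, and the input resolution is left out because it is not given.

```python
# Stage 1: baseline training on labeled images (key names are hypothetical).
baseline_cfg = {
    "init": "ImageNet pre-trained",
    "loss": ["balanced cross-entropy", "lovasz-softmax"],
    "optimizer": "SGD",
    "lr": 7e-3,
    "momentum": 0.9,
    "weight_decay": 5e-4,
    "batch_size": 24,
    "epochs": 200,
}

# Stage 2: fine-tuning with the Coherent Loss on unlabeled video pairs.
finetune_cfg = {
    "optimizer": "SGD",
    "lr": 1e-4,
    "momentum": 0.9,
    "weight_decay": 5e-4,
    "batch_size": 24,
    "epochs": 10,
    "labeled_images_per_batch": 12,
    "image_pairs_per_batch": 6,
    "boundary_width": 15,
    "confidence_threshold": 5e-2,
    "weight_bc": 1.0,
    "weight_gc": 5e-5,
}

# Each image pair contributes two frames, so 12 + 2 * 6 frames fill a batch.
frames_per_batch = (finetune_cfg["labeled_images_per_batch"]
                    + 2 * finetune_cfg["image_pairs_per_batch"])
```

The last line shows how the stated batch composition (12 labeled images and 6 pairs) adds up to the batch size of 24.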

The evaluation metrics are the mIoU on the VHS test set, the global stability rate calculated between successive whole masks, and the boundary stability rate calculated along the boundary areas (15-pixel width around the ground-truth edge) on the synthetic test set. The first frame of each synthetic sequence is randomly collected from the human segmentation videos in the DAVIS dataset for better annotations. There are 228 synthetic sequences in total, with 11 frames per sequence.

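| Method | mIoU(test) | mIoU(syn) | Stability rate (global) | Stability rate (boundary) |
|---|---|---|---|---|
| SCHP(ResNet18) | 95.14 | 91.20 | 99.17 | 95.46 |
| SCHP(ResNet18)+CL | 95.06 | 91.86 | 99.17 | 96.37 |
| SCHP(ResNet50) | 96.20 | 93.22 | 99.18 | 95.54 |
| SCHP(ResNet50)+bc | 94.28 | 94.28 | 99.28 | 96.54 |
| SCHP(ResNet50)+gc | 95.37 | 93.80 | 99.22 | 95.59 |
| SCHP(ResNet50)+CL | 95.91 | 94.31 | 99.29 | 96.94 |

Table 2: Comparison of mIoU on the VHS test set, and of mIoU and stability rate on the synthetic test set.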
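
| Method | mIoU(val), offline | mIoU(val), online | mIoU(syn), offline | Stability rate (global), offline | Stability rate (boundary), offline |
|---|---|---|---|---|---|
| CRN | 73.38 | 82.32 | 81.30 | 99.07 | 93.87 |
| CRN+CL | 75.01 | 83.03 | 82.03 | 99.21 | 94.52 |
| SCHP | 70.66 | 78.99 | 66.77 | 98.31 | 93.42 |
| SCHP+CL | 75.51 | 81.59 | 69.52 | 98.79 | 94.25 |

Table 3: Comparison of mIoU on the DAVIS val set, and of mIoU and stability rate on the synthetic test set.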







Figure 5: Qualitative video results comparing CRN with and without CL on DAVIS. The results are significantly more stable across adjacent frames.

Results. As shown in Table 2, for the baseline SCHP trained with different backbones, our Coherent Loss (+CL) greatly improves the stability rate, especially along the object boundary, where CL improves SCHP(ResNet50) by 1.4%. Both Boundary Coherency (+bc) and Global Coherency (+gc) improve the baseline in stability rate and mIoU on the synthetic set, which demonstrates that our Coherent Loss is effective at enhancing temporal coherency. More importantly, Figure 8 shows that with the Coherent Loss the misalignment along the boundary is significantly reduced, the visual results become smoother and more satisfactory, and mis-segmentations are suppressed. Videos in the Supplementary Materials show more intuitive results.

Video Object Segmentation

Dataset. We also evaluate our method on the public DAVIS-2016 dataset, which contains 50 video sequences in total, with 30 in the train set and 20 in the val set, and provides binary segmentation ground-truth masks for all 3455 frames.

Experiment Settings. We use the adapted SCHP network and CRN Hu et al. (2018) as separate baselines, on which we apply our Coherent Loss. CRN is also a standard network that takes a coarse segmentation as guidance to generate an accurate segmentation on DAVIS.

In the first step, for SCHP we train the baseline on the training set with a strategy similar to that used on VHS, and for CRN we use the official checkpoint as the baseline. When applying the Coherent Loss, we make sure the segmentation loss has converged before activating the coherent supervision. In the second step, we directly use two consecutive frames from the training set as an image pair. The fine-tuning strategy for SCHP is similar to that on VHS; for CRN we follow the original settings described in Hu et al. (2018), but balance the ratio of labeled images to unlabeled pairs in each batch. We fine-tune SCHP and CRN for 10 epochs with the Coherent Loss. The width is set to 15 and the threshold to 5e-2 according to our experiments. The boundary and global coherency weights are set to 0.1 and 5e-6 for CRN, and to 1 and 5e-5 for SCHP, which balances the supervised and unsupervised losses.

The evaluation metrics are the mIoU on the DAVIS-2016 val set and the stability rate on the synthetic test set, which is built from the first frame of each video in the DAVIS val set.

Results. As shown in Table 3, in terms of the stability rate, our Coherent Loss improves CRN on both the global and boundary areas, by 0.2% and 0.7% respectively, and thus produces more stable segmentation results, as shown in Figure 5. Although we focus on more coherent segmentation and better visual performance, the Coherent Loss, which exploits unlabeled videos to enhance temporal coherency, may also provide useful temporal information that helps the network produce more accurate results. Both CRN and SCHP are improved by the Coherent Loss in mIoU and stability rate with offline inference. This result demonstrates the flexibility and effectiveness of our method: it can be combined with many state-of-the-art neural-network-based segmentation methods for better visual performance without adding extra prediction overhead. We show more results and videos in the Supplementary Materials.

Video Semantic Segmentation

We also evaluate our method on Cityscapes, a challenging dataset containing high-quality pixel-level annotations for 5000 images. The standard dataset split is 2975, 500, and 1525 images for the training, validation, and test sets, respectively. Since the dataset is quite different from video object segmentation, we apply our Coherent Loss to the standard PSPNet and fine-tune it for 20 epochs with settings similar to Song et al. (2018). Of the 3 provided unlabeled videos with successive frames, we use 2 for training and 1 for testing. The width is set to 15 and the threshold to 0.3; the boundary and global coherency weights are set to 2e-2 and 0.3, which balances the supervised and unsupervised losses. Finally, we select the best model in terms of mIoU on the whole val set and calculate the stability rate on the synthetic set, in which the first frame is randomly chosen from the val set.

We compare our method with the official baseline PSPNet with a ResNet50 backbone in Table 4, and we achieve a significant improvement in both accuracy and stability. From the visual examples in Figure 6, we can see that although the baseline achieves a high mIoU on the whole val set, many visual jitters are overlooked by that evaluation. Equipped with the Coherent Loss, the baseline achieves significantly more stable results.



Figure 6: Visual examples cropped from the original frames on Cityscapes, shown for two groups of three consecutive frames (t, t+1, t+2). Our method further reduces the misalignment along the boundary and reduces mis-segmentations.


We evaluate our method on three different video segmentation datasets, and the results show that it significantly enhances temporal coherency on all of them, especially on object boundaries. The visual results in Figure 8 demonstrate that our method is more stable, with fewer mis-segmentations and unaligned boundaries. One may object that the stability rate can be very high even when the network performs poorly over the whole sequence; we think it makes more sense to discuss stability on the basis of a certain level of segmentation accuracy.

In addition, although our Coherent Loss is designed to help a given baseline provide satisfactory video segmentation results in terms of temporal coherency, the results in Table 3 demonstrate that our method not only improves the stability rate but also helps improve segmentation accuracy in terms of mIoU. We believe this is probably because more coherent results are also more accurate, especially when the object to segment is prominent in the video. Equipped with our method, many state-of-the-art segmentation approaches achieve a significant improvement in visual quality on VHS and DAVIS. However, our method may not bring a noticeable increase in accuracy on Cityscapes, probably because our Coherent Loss forces the network to behave coherently over the whole image, so small trembling objects are smoothed out for more stable results.

Method      mIoU(val)
PSPNet      77.02      93.06   90.22
PSPNet+CL   77.88      93.08   91.63
Table 4: Comparison of mIoU on Cityscape val set, on the synthetic set.


Inspired by unstable phenomena observed in applications, we expose temporal coherency as a new dimension for measuring the quality of video segmentation. To evaluate it explicitly, we introduce a new numeric metric, the stability rate, which is verified by a user study. Furthermore, we propose Coherent Loss with a generic framework that serves as a neural network training strategy to enhance temporal coherency in video segmentation, and that is capable of exploiting unlabeled videos to learn temporal boundary consistency and global consistency. Because it is orthogonal to the network architecture, it can be combined with different segmentation models. Comprehensive experiments on a variety of datasets verify that our method is an effective framework for providing better results in terms of both segmentation accuracy and temporal coherency. A large labeled video human segmentation dataset is collected for future study in this field and will be made available to the public.







Figure 7: Images and labels on the video human segmentation (panels: image/label pairs).

Figure 8: Videos on the video human segmentation (panels: consecutive frames).

The labeled images in the proposed video human segmentation dataset consist of 5 parts. (1) The ATR part contains 17,706 full-body images of a single person. (2) The AISegment part contains 34,426 half-body images of a single person. (3) The AIC part contains 38,781 pose pictures of one to multiple people in different scenarios. (4) The LIP part contains 5,168 images of multiple people. (5) The Supervisely Person Dataset part contains 5,200 high-quality media images in single scenes.

As there are no labels in the AIC part, we invited 30 students to annotate the images and guaranteed that the boundary error does not exceed 3 pixels at an image size of 512. The AISegment part originally came with rough matting annotations; we mapped all matting values larger than 0.1 * 255 to the foreground and the others to the background, yielding binary annotations. The images in the LIP part are all multi-person images because, in single-person images, the original labels marked details such as the neck and watch as background; we therefore manually selected 5,168 pictures in which the neck and other parts were not marked as background, and mapped all body-part labels to the foreground. The processing of the ATR images was similar to that of LIP. The images in the Supervisely Person Dataset (SPD) have instance-level human body annotations, which we converted into binary labels. Together, these 5 image parts make up about 100,000 labeled images, ensuring rich segmentation scenes with different postures and backgrounds, and half-body and full-body views of single or multiple people.
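The matting-to-binary and instance-to-binary conversions described above can be sketched as follows. This is a minimal illustration only, assuming the alpha matte is stored as an 8-bit array in [0, 255]; the function names `binarize_matte` and `instances_to_binary` are hypothetical and not taken from the authors' code, and treating instance ID 0 as background is likewise an assumption.

```python
import numpy as np

# Threshold from the text: matting values larger than 0.1 * 255 become foreground.
ALPHA_THRESHOLD = 0.1 * 255


def binarize_matte(alpha: np.ndarray) -> np.ndarray:
    """Turn an 8-bit alpha matte into a binary mask (1 = foreground, 0 = background)."""
    return (alpha > ALPHA_THRESHOLD).astype(np.uint8)


def instances_to_binary(instance_map: np.ndarray) -> np.ndarray:
    """Merge instance-level person labels into a single binary mask.

    Assumption: ID 0 is background and every non-zero ID is a person.
    """
    return (instance_map != 0).astype(np.uint8)


# Toy 2x3 alpha matte; 25 falls below the 25.5 threshold, 26 falls above it.
alpha = np.array([[0, 25, 26], [128, 255, 10]], dtype=np.uint8)
mask = binarize_matte(alpha)

# Toy instance map with two people (IDs 1 and 2).
instances = np.array([[0, 1, 1], [2, 0, 2]], dtype=np.int32)
person_mask = instances_to_binary(instances)
```

Because the threshold is low, semi-transparent boundary pixels such as hair are mostly kept as foreground, which matches the goal of obtaining a binary annotation from a rough matte.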

The videos were collected by a public team of more than 100 people. All of them were taken with phone cameras, and each had to contain no fewer than 6 actions with different camera perspectives, in any life scene containing about 1 to 5 individuals. The video part thus contains 130 videos lasting from 30 seconds to 1 minute, and each video was sampled to about 300 to 600 frames.
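The figures above are consistent with keeping roughly 10 frames per second (about 300 frames from 30 seconds, about 600 frames from 1 minute). The text does not state the sampling procedure, so the sketch below shows only one way such uniform sampling could be done; the function name `sample_frame_indices`, the 30 fps native rate, and the 10 fps target are all assumptions.

```python
def sample_frame_indices(total_frames: int, native_fps: float, target_fps: float = 10.0) -> list:
    """Pick evenly spaced frame indices so that about target_fps frames are kept per second."""
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Never step by less than one frame, even if target_fps exceeds native_fps.
    step = max(native_fps / target_fps, 1.0)
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step
    return indices


# A 30-second and a 60-second clip recorded at an assumed 30 fps.
short_clip = sample_frame_indices(total_frames=30 * 30, native_fps=30.0)
long_clip = sample_frame_indices(total_frames=60 * 30, native_fps=30.0)
```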

Results on videos

Intuitive video performance can be found at˙VOS

Method     mIoU(val)  MAE(val)  mIoU(val)  MAE(val)
           offline    offline   online     online
CRN        73.38      2.11      82.32      0.803
CRN+CL     75.01      1.59      83.03      0.802
SCHP       70.66      2.89      78.99      1.97
SCHP+CL    75.51      2.33      81.59      1.77

           offline    offline   online     online
CRN        99.07      93.87     99.45      95.21
CRN+CL     99.21      94.52     99.27      95.35
SCHP       98.31      93.42     99.57      95.37
SCHP+CL    98.79      94.25     99.46      95.73
Table 5: More comparison results on DAVIS set.
Method      mIoU(val)  mAcc(val)  aAcc(ss,val)
PSPNet      77.02      84.17      95.90         93.06   90.22
PSPNet+CL   77.88      85.43      96.04         93.08   91.63
Table 6: More comparison on the Cityscapes val set. mIoU, mAcc and aAcc stand for mean IoU, mean per-class accuracy and all-pixel accuracy, respectively. ss denotes single-scale testing.
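The three accuracy measures named in the caption above follow standard definitions and can be computed from a confusion matrix. The sketch below is a minimal NumPy illustration, not the evaluation code used for the table. The function names are hypothetical, and it assumes every label lies in [0, num_classes), so an ignore label such as the one used in Cityscapes would have to be masked out beforehand.

```python
import numpy as np


def confusion_matrix(pred: np.ndarray, label: np.ndarray, num_classes: int) -> np.ndarray:
    """Count pixels per (ground-truth class, predicted class) pair; rows are ground truth."""
    idx = label.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


def segmentation_scores(cm: np.ndarray) -> dict:
    """Compute mean IoU, mean per-class accuracy and all-pixel accuracy."""
    tp = np.diag(cm).astype(np.float64)
    gt_pixels = cm.sum(axis=1)
    pred_pixels = cm.sum(axis=0)
    union = gt_pixels + pred_pixels - tp
    # Assumption: classes that never occur are left out of the averages.
    valid_iou = union > 0
    valid_acc = gt_pixels > 0
    return {
        "mIoU": float(np.mean(tp[valid_iou] / union[valid_iou])),
        "mAcc": float(np.mean(tp[valid_acc] / gt_pixels[valid_acc])),
        "aAcc": float(tp.sum() / cm.sum()),
    }


# Toy two-class example on a 2x3 image.
label = np.array([[0, 0, 1], [1, 1, 1]])
pred = np.array([[0, 1, 1], [1, 1, 0]])
scores = segmentation_scores(confusion_matrix(pred, label, num_classes=2))
```

In this toy case the per-class IoUs are 1/3 and 3/5, which shows how a frequent class with good overlap can pull aAcc above mIoU.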


  1. Matting human datasets.
  2. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
  3. A tutorial on the cross-entropy method. Annals of Operations Research 134 (1), pp. 19–67.
  4. Supervision-by-registration: an unsupervised approach to improve the precision of facial landmark detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 360–368.
  5. Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 932–940.
  6. Supervisely person dataset.
  7. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.
  8. Give me (un) certainty–an exploration of parameters that affect segmentation uncertainty. arXiv preprint arXiv:1911.06357.
  9. Temporally distributed networks for fast video semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8818–8827.
  10. Motion-guided cascaded refinement network for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1400–1409.
  11. Fast video object segmentation with temporal aggregation network and dynamic template matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8879–8889.
  12. Flownet 2.0: evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2462–2470.
  13. Forward-backward error: automatic detection of tracking failures. In 2010 20th International Conference on Pattern Recognition, pp. 2756–2759.
  14. Feature space optimization for semantic video segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3168–3175.
  15. Self-correction for human parsing. arXiv preprint arXiv:1910.09777.
  16. Human parsing with contextualized convolutional neural network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1386–1394.
  17. Semantic video segmentation by gated recurrent flow propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6819–6828.
  18. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–732.
  19. Pyramid dilated deeper convlstm for video salient object detection. In Proceedings of the European Conference on Computer Vision, pp. 715–731.
  20. Feelvos: fast end-to-end embedding learning for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9481–9490.
  21. Ranet: ranking attention network for fast video object segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3978–3987.
  22. Large-scale datasets for going deeper in image understanding. In 2019 IEEE International Conference on Multimedia and Expo, pp. 1480–1485.