Time-Dynamic Estimates of the Reliability of Deep Semantic Segmentation Networks

Kira Maag, Matthias Rottmann, Hanno Gottschalk
University of Wuppertal, School of Mathematics and Natural Sciences, Germany, email: {kmaag, rottmann, hgottsch}@uni-wuppertal.de
Abstract

In the semantic segmentation of street scenes, the reliability of a prediction is of highest interest. The assessment of neural networks by means of uncertainties is a common approach to prevent safety issues. Since in online applications like automated driving a video stream of images is available, we present a time-dynamical approach to investigate uncertainties and assess the prediction quality of neural networks. To this end, we track segments over time and gather aggregated metrics per segment, e.g. mean dispersion metrics derived from the softmax output and segment sizes. By identifying segments over consecutive frames, we obtain time series of metrics from which we assess prediction quality. We do so by either classifying between intersection over union $\mathit{IoU} = 0$ and $\mathit{IoU} > 0$ (meta classification) or predicting the IoU directly (meta regression). In our tests, we analyze the influence of the length of the time series on the predictive power of metrics and study different models for meta classification and regression. We use two publicly available DeepLabv3+ networks as well as two street scene datasets, i.e., VIPER as a synthetic one and KITTI based on real data. We achieve classification accuracies of up to and AUROC values of up to for the task of meta classification. For meta regression we obtain $R^2$ values of up to . We show that these results yield improvements compared to other approaches.


1 Introduction

Semantic segmentation, i.e., the pixel-wise classification of image content, is an important tool for scene understanding. In recent years, neural networks have demonstrated outstanding performance for this task. In safety-relevant applications like automated driving [Huang2018] and medical imaging [Wickstrom2018], the reliability of predictions and thus uncertainty quantification is of highest interest. While most works focus on uncertainty quantification for single frames, video data is often available. In this work, we investigate uncertainties taking the temporal information into account. To this end, we track objects over time and construct metrics that express the model’s uncertainty.

Uncertainty measures.   A very important type of uncertainty is the model uncertainty resulting from the fact that the ideal parameters are unknown and have to be estimated from data. Bayesian models are one possibility to consider these uncertainties [Mackay1992]. To this end, different frameworks based on variational approximations for Bayesian inference exist [Attias2000, Duvenaud2016]. Recently, Monte-Carlo (MC) Dropout [Gal2016] as an approximation to Bayesian inference has attracted a lot of interest. In classification tasks, the uncertainty score can be determined directly from the network’s output [Gal2016]. Threshold values for the highest softmax probability or for the entropy of the classification distributions (softmax output) are common approaches for the detection of false predictions (false positives) of neural networks, see e.g. [Hendrycks2016, Liang2017]. Uncertainty metrics like classification entropy or the highest softmax probability are usually combined with model uncertainty (MC Dropout inference) or input uncertainty, cf. [Gal2016] and [Liang2017], respectively. Alternatively, gradient-based uncertainty metrics are proposed in [Oberdiek2018], and an alternative to Bayesian neural networks is introduced in [Lakshminarayanan2017], where the idea of ensemble learning is used to consider uncertainties. These uncertainty measures have proven to be practically efficient for detecting uncertainty, and some of them have also been transferred to semantic segmentation tasks, such as MC Dropout, which also achieves performance improvements in terms of segmentation accuracy, see [Kendall2015]. The works presented in [Kampffmeyer2016] and [Wickstrom2018] also make use of MC Dropout to model the uncertainty and filter out predictions with low reliability. This line of research is further developed in [Huang2018] to detect spatial and temporal uncertainty in the semantic segmentation of videos.
In semantic segmentation tasks, the concept of meta classification and meta regression is introduced in [Rottmann2018]. Meta classification refers to the task of predicting whether a predicted segment intersects with the ground truth or not. To this end, the intersection over union (IoU, also known as Jaccard index [Jaccard1912]), a commonly used performance measure for semantic segmentation, is considered. The IoU quantifies the degree of overlap of prediction and ground truth; it is equal to zero if and only if the predicted segment does not intersect with the ground truth. The meta classification task corresponds to (meta-)classifying between $\mathit{IoU} = 0$ and $\mathit{IoU} > 0$ for every predicted segment. Meta regression is the task of predicting the IoU (e.g. via linear regression) directly. The main aim of both tasks is to have a model that is able to reliably assess the quality of a semantic segmentation obtained from a neural network. The predicted IoU therefore also serves as a performance estimate. As input, both methods use segment-wise metrics extracted from the segmentation network’s softmax output. The same tasks are pursued in [DeVries2018, Huang2016] for images containing only a single object; instead of metrics, they utilize additional CNNs. In [Schubert2019] the work of [Rottmann2018] is extended by adding resolution dependent uncertainty and further metrics. In [Erdem2004] performance measures for the segmentation of videos are introduced; these measures are also based on image statistics and can be calculated without ground truth.

Visual object tracking.   Object tracking is an essential task in video applications, such as automated driving, robot navigation and many others. The task of object tracking consists of detecting the objects and then tracking them in consecutive frames, eventually studying their behavior [Yilmaz2006]. In most works, the target object is represented as an axis-aligned bounding box [Wu2013] or rotated bounding box [Kristan2015]. Labeling objects with bounding boxes keeps annotation costs low and allows a fast and simple initialization of the target object. The approaches described in the following work with bounding boxes. A popular strategy for object tracking is the tracking-by-detection approach [Babenko2009]. A discriminative classifier is trained online while performing the tracking, to separate the object from the background only by means of the information where the object is located in the first frame. Another approach for tracking-by-detection uses adaptive correlation filters that model the target’s appearance; the tracking is then performed via convolution with the filters [Bolme2010]. In [Danelljan2015] and [Valmadre2017], the trackers based on correlation filters are improved with spatial constraints and deep features, respectively. Another object tracking algorithm [Mu2016] combines Kalman filters and adaptive least squares to predict occluded objects where the detector shows deficits. In contrast to online learning, there are also tracking algorithms that learn the tracking task offline and perform tracking as inference only. These methods differ greatly from the tracking-by-detection procedure. The idea behind these approaches [Held2016, Bertinetto2016] is to train a similarity function offline on pairs of video frames instead of training a discriminative classifier online.
In [Bertinetto2016] a fully-convolutional Siamese network is used, and this approach is improved by making use of region proposals ([Li2018]), angle estimation and spatial masking ([He2018]) as well as memory networks ([Yang2018]). Another approach for object tracking with bounding boxes is presented in [Yao2019], where semantic information is used for tracking. Most algorithms, including the ones described here, use bounding boxes, mostly for initializing and predicting the position of an object in the subsequent frames. In contrast, [Comaniciu2000] uses coarse binary masks of target objects instead of rectangles. There are other procedures that initialize and/or track an object without bounding boxes, since a rectangular box does not necessarily capture the shape of every object well. In [Jilani2019] a temporal quad-tree algorithm is applied, where the objects are divided into successively smaller squares. Other approaches use semantic image segmentation such as [Hariharakrishnan2005], where the initialization includes a segmentation for predicting object boundaries. Segmentation-based tracking algorithms are presented in [Aeschliman2010] and [Duffner2013] based on a pixel-level probability model and an adaptive model, respectively. In the latter case, co-training takes place between detector and segmentation. The approaches presented in [Belagiannis2012] and [Son2015] are also based on segmentation and use particle filters for the tracking process. There are also superpixel-based approaches, see e.g. [Yeo2017], and a fully-convolutional Siamese approach [Wang2018] that creates binary masks and starts from a bounding box initialization.

Our contribution.   In this work we elaborate on the meta classification and regression approach from [Rottmann2018] that provides a framework for post processing a semantic segmentation. This method generates uncertainty heat maps from the softmax output of the semantic segmentation network, such as pixel-wise entropy, probability margin or variation ratio.

Figure 1: Segmentation predicted by a neural network (top) and heat map (bottom).

In fig. 1 a visualization of the segment-wise variation ratio is given. In addition to these segment-wise metrics, further quantities derived from the predicted segments are used, for instance various measures corresponding to the segment’s geometry. This set of metrics, yielding a structured dataset where each row corresponds to a predicted segment, is presented to a meta classifier/regressor to either classify between $\mathit{IoU} = 0$ and $\mathit{IoU} > 0$ or predict the IoU directly. In contrast to [Rottmann2018], we use the additional metrics proposed in [Schubert2019]. In this paper, we extend the work presented in [Rottmann2018] by taking time-dynamics into account. A core assumption is that a semantic segmentation network and a video stream of input data are available. We present a light-weight approach that tracks semantic segments over time. The segments are matched according to their overlap in multiple frames, and we improve these measures by shifting segments according to their expected location in the next frame. We gather time series of metrics that are presented as input to meta classifiers and regressors. For the latter we study different types of models and their dependence on the length of the time series.

In our tests, we employ two publicly available DeepLabv3+ networks [Chen2018] and we perform all tests on the VIsual PERception (VIPER) dataset [Richter2017] as well as on the KITTI dataset [Geiger2013]. For the synthetic VIPER dataset we train a DeepLabv3+ network and demonstrate that the additional information from our time-dynamical approach improves over its single frame counterpart w.r.t. meta classification and regression (meta tasks). Furthermore, the different methods for classification and regression improve the prediction accuracy of the IoU. For the task of meta regression we obtain an $R^2$ value of up to and for the meta classification AUROC values of up to as well as classification accuracies of up to . For the VIPER dataset there are labeled ground truth images for each frame, while for the KITTI dataset only a few frames are labeled with ground truth. For the KITTI dataset we use alternative sources of useful information besides the real ground truth to train the meta tasks and we employ both networks for investigations. For meta regression we achieve $R^2$ values of up to and for the meta classification AUROC values of up to as well as classification accuracies of up to . We also show that these results yield significant improvements compared to the results obtained by the predecessor method introduced in [Rottmann2018].

Related work.   Most works [Wu2013, Kristan2015, Babenko2009, Bolme2010, Danelljan2015, Valmadre2017, Mu2016, Xiang2015, Held2016, Bertinetto2016, Li2018, He2018, Yang2018, Yao2019] in the field of object tracking make use of bounding boxes, while our approach is based on semantic segmentation. There are some approaches that make use of segmentation masks. However, only a coarse binary mask is used in [Comaniciu2000], and in [Hariharakrishnan2005] the segmentation is only used for initialization. In [Aeschliman2010, Duffner2013] not only is information from the semantic segmentation included in the tracking algorithm, but segmentation and tracking are also executed depending on each other. In our procedure, a segmentation is inferred first and tracking is performed afterwards. In addition to the different forms of object representations, there are various algorithms for object tracking. In the tracking-by-detection methods a classifier for the difference between object and background is trained, for which only information about the location of the object in the first frame is given [Babenko2009, Bolme2010, Danelljan2015, Valmadre2017]. We do not train classifiers, as this information is contained in the inferred segmentations. Another approach is to learn a similarity function offline [Held2016, Bertinetto2016, Li2018, He2018, Yang2018]. The works of [Aeschliman2010, Duffner2013, Belagiannis2012, Son2015, Wang2018] are based on segmentation and use different tracking methods, like probability models, particle filters and fully-convolutional Siamese networks, respectively. Our algorithm is solely based on the degree of overlap of predicted segments.

With respect to uncertainty quantification, MC dropout (many forward passes under dropout at inference time) is widely used, cf. [Gal2016, Kendall2015, Kampffmeyer2016, Wickstrom2018]. Whenever dropout is used in a segmentation network, the resulting heat map can be processed by our framework. However, in this work we do not include MC dropout. There are alternative measures of uncertainty like gradient-based ones [Oberdiek2018] or measures based on spatial and temporal differences between the colors and movements of the objects [Erdem2004]. We construct metrics based on aggregated dispersion measures from the softmax output of a neural network at segment level. The works [DeVries2018, Huang2016] closest to ours are constructed to work with one object per image; instead of hand-crafted metrics they are based on post-processing CNNs. We extend the work of [Rottmann2018] by a temporal component and further investigate methods for the meta classification and regression, e.g. gradient boosting and neural networks.

Outline.   The remainder of this work is organized as follows. In section 2 we introduce a tracking algorithm for semantic segmentation. This is followed by the construction of segment-wise metrics using uncertainty and geometry information in section 3. In section 4 we describe the meta regression and classification methods including the construction of their inputs consisting of time series of metrics. Finally, we present numerical results in section 5. We study the influence of time-dynamics on meta classification and regression as well as the incorporation of various classification and regression methods.

2 Tracking Segments over Time

In this section we introduce a light-weight tracking method for the case where a semantic segmentation is available in each frame of a video. Semantic image segmentation aims at segmenting objects in an image; to this end, it can be defined as a pixel-wise classification of image content (cf. top panel of fig. 1). To obtain a semantic segmentation, the goal is to assign to each image pixel of an input image a label within a prescribed label space . Here, this task is performed by a neural network that provides for each pixel a probability distribution over the class labels , given learned weights and an input image . The predicted class for each pixel is obtained by

(1)
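
As a minimal sketch of this prediction rule, assume the softmax output is stored as a nested list of per-pixel class probabilities. The function name `predict_labels` and the data layout are illustrative assumptions, not taken from the text. The predicted class of each pixel is then the index of its largest probability:

```python
def predict_labels(probs):
    """Return the index of the most probable class for every pixel.

    probs[i][j] is the list of class probabilities (softmax output)
    for the pixel in row i and column j (assumed data layout).
    """
    return [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]


# Toy 1x2 "image" with three classes.
softmax_output = [[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]]
labels = predict_labels(softmax_output)
```

Ties are broken in favour of the lowest class index, which is a choice of this sketch.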

Let denote the predicted segmentation and the set of predicted segments. The idea of our algorithm is to match segments of the same class according to their overlap in consecutive frames. We denote by the image sequence with a length of and corresponds to the image. Furthermore, we formulate the overlap of a segment with a segment through

(2)

To account for the motion of objects, we also register geometric centers of predicted segments. The geometric center of a segment in frame is defined as

(3)

where is given by its vertical and horizontal coordinates of pixel .
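
The overlap and the geometric center can be sketched for segments stored as sets of (vertical, horizontal) pixel coordinates. Normalising the overlap by the size of the first segment is an assumption made here for illustration, and the function names are hypothetical:

```python
def overlap(j, k):
    """Share of the pixels of segment j that also belong to segment k.

    The normalisation by len(j) is an assumption of this sketch.
    """
    return len(j & k) / len(j)


def geometric_center(segment):
    """Mean vertical and horizontal coordinate of the pixels of a segment."""
    n = len(segment)
    return (sum(v for v, _ in segment) / n, sum(h for _, h in segment) / n)


j = {(0, 0), (0, 1), (1, 0), (1, 1)}
k = {(1, 1), (1, 2)}
o_jk = overlap(j, k)           # one of the four pixels of j lies in k
o_kj = overlap(k, j)           # one of the two pixels of k lies in j
center_j = geometric_center(j)
```

With this normalisation the overlap is not symmetric, so a matching rule may test it in both directions.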

Our tracking algorithm is applied sequentially to each frame , , and we aim at tracking all segments present in at least one of the frames. To give the segments different priorities for matching, the segments of each frame are sorted by size and treated in descending order. Once a segment in frame has been matched with a segment from previous frames, it is ignored in further steps; matched segments are assigned an id. Within the description of the matching procedure, we introduce parameters , , and , whose numerical choices are given in section 5. More formally, our algorithm consists of the following five steps:

Step 1 (aggregation of segments).   The minimum distance between segment and all of the same class is calculated. If the distance is less than a constant , the segments are so close to each other that they are regarded as one segment and receive a common id.

Step 2 (shift).   If the algorithm was applied to at least two previous frames, the geometric centers and of segment are computed. The segment from frame is shifted by the vector and the overlap with each segment from frame is determined. If or , the segments and are matched and receive the same id. If there is no match found for segment during this procedure, the quantity

(4)

is calculated for each available and both segments are matched if . This allows for matching segments that are closer to than expected. If segment exists in frame , but not in , then step 2 is simplified: only the distance between the geometric center of and is computed and the segments are matched if the distance is smaller than .

Step 3 (overlap).   If , the overlap of the segments and of two consecutive frames is calculated. If or , the segments and are matched.

Step 4 (regression).   In order to account for flashing predicted segments, either due to false prediction or occlusion, we implement a linear regression and match segments that are more than one, but at most , frames apart in temporal direction. If the id of segment , , in frame has not yet been assigned and , i.e., three frames have already been processed, then the geometric centers of segment are computed in frames to (in case exists in all these frames). If at least two geometric centers are available, a linear regression is performed to predict the geometric center . If the distance between the predicted geometric center and the calculated geometric center of the segment is less than a constant value , and are matched. If no match was found for segment , segment is shifted by the vector , where denotes the frame where contains the maximum number of pixels. If or applies to the resulting overlap, and are matched.

Step 5 (new ids).   All segments that have not yet received an id are assigned with a new one.
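
The following simplified sketch illustrates the flavour of the matching procedure; it is not the full algorithm described above. It assumes a one-to-one best match per id, covers only the shift and overlap criteria and the assignment of new ids, and uses a hypothetical threshold `c_over` whose default value is a placeholder. The helper `extrapolate_center` shows a least-squares extrapolation of geometric centers in the spirit of step 4. All names are illustrative assumptions.

```python
from itertools import count


def overlap(j, k):
    """Share of the pixels of j that also belong to k (assumed normalisation)."""
    return len(j & k) / len(j)


def shifted(pixels, vector):
    """Translate a pixel set by an integer-rounded (vertical, horizontal) vector."""
    dv, dh = round(vector[0]), round(vector[1])
    return {(v + dv, h + dh) for v, h in pixels}


def extrapolate_center(centers, t):
    """Least-squares line through known centers (dict: frame -> (v, h)), evaluated at frame t.

    Requires centers from at least two different frames.
    """
    frames = sorted(centers)
    mean_f = sum(frames) / len(frames)
    var_f = sum((f - mean_f) ** 2 for f in frames)
    prediction = []
    for axis in (0, 1):
        mean_c = sum(centers[f][axis] for f in frames) / len(frames)
        slope = sum((f - mean_f) * (centers[f][axis] - mean_c) for f in frames) / var_f
        prediction.append(mean_c + slope * (t - mean_f))
    return tuple(prediction)


def match_frame(prev, curr, ids, c_over=0.35, motion=None):
    """Assign ids to the segments of the current frame.

    prev   -- dict: id -> (class label, pixel set) for the previous frame
    curr   -- list of (class label, pixel set) for the current frame
    ids    -- iterator yielding fresh ids, e.g. itertools.count()
    c_over -- hypothetical overlap threshold (placeholder value)
    motion -- optional dict: id -> expected displacement between the frames
    """
    motion = motion or {}
    result, free = {}, list(range(len(curr)))
    # Larger segments are treated first, i.e. they get priority.
    for seg_id, (label, pixels) in sorted(prev.items(), key=lambda item: -len(item[1][1])):
        moved = shifted(pixels, motion.get(seg_id, (0, 0)))
        best, best_score = None, 0.0
        for idx in free:
            c_label, c_pixels = curr[idx]
            if c_label != label:
                continue  # only segments of the same class are matched
            score = max(overlap(moved, c_pixels), overlap(c_pixels, moved))
            if score >= c_over and score > best_score:
                best, best_score = idx, score
        if best is not None:
            result[seg_id] = curr[best]
            free.remove(best)
    # Segments without a match start a new track.
    for idx in free:
        result[next(ids)] = curr[idx]
    return result


ids = count(2)  # ids 0 and 1 are already in use
prev = {0: ("car", {(0, 0), (0, 1), (1, 0), (1, 1)}), 1: ("person", {(5, 5)})}
curr = [("car", {(0, 1), (0, 2), (1, 1), (1, 2)}), ("person", {(9, 9)})]
tracked = match_frame(prev, curr, ids, motion={0: (0, 1)})
predicted = extrapolate_center({1: (0, 0), 2: (1, 2), 3: (2, 4)}, 5)
```

In the toy example the car keeps its id because the shifted segment coincides with its successor, while the person segment has no overlapping candidate and therefore starts a new track.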

3 Segment-wise Metrics and Time Series

In the previous section, we presented the semantic segmentation and the resulting probability distribution for an image , pixel and weights . The degree of randomness in is quantified by dispersion measures. In this work, we consider three pixel-wise dispersion measures: the entropy , the variation ratio and the probability margin . The measures are given by

(5)
(6)

and

(7)

and a visualization of segment-wise variation ratio is shown in fig. 1. Note that also other heat maps (like MC Dropout variance) can be processed. At segment level, we define for each the interior where a pixel is an element of if all eight neighbouring pixels are an element of , the boundary and the following metrics (see [Rottmann2018, Schubert2019]):

  • the segment sizes , ,

  • the mean dispersions , , defined as

    where

  • the relative segment sizes ,

  • the relative mean dispersions , where

  • the geometric center defined in equation (3)

  • the mean class probabilities for each class

Additionally we define the set of metrics by

The separate treatment of interior and boundary in all dispersion measures is motivated by typically large values of for . In addition, we find that poor or false predictions are often accompanied by fractal segment shapes (which have a relatively large amount of boundary pixels, measurable by and ) and/or high dispersions on the segment’s interior.
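
The pixel-wise dispersion measures and their aggregation over interior and boundary can be sketched as follows. The concrete forms used here are assumptions of this sketch: the entropy is scaled by the logarithm of the number of classes, the variation ratio is one minus the highest probability, and the probability margin is one minus the gap between the two highest probabilities. Segments are stored as sets of pixel coordinates and all function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy of a probability vector, scaled to [0, 1] (assumed normalisation)."""
    return -sum(q * math.log(q) for q in p if q > 0) / math.log(len(p))


def variation_ratio(p):
    """One minus the highest class probability."""
    return 1.0 - max(p)


def probability_margin(p):
    """One minus the gap between the two highest class probabilities (assumed form)."""
    first, second = sorted(p, reverse=True)[:2]
    return 1.0 - first + second


def interior(segment):
    """Pixels of the segment whose eight neighbours all belong to the segment."""
    offsets = [(dv, dh) for dv in (-1, 0, 1) for dh in (-1, 0, 1) if (dv, dh) != (0, 0)]
    return {(v, h) for v, h in segment
            if all((v + dv, h + dh) in segment for dv, dh in offsets)}


def mean_dispersion(heat_map, pixels):
    """Average of a pixel-wise heat map (dict: pixel -> value) over a non-empty pixel set."""
    return sum(heat_map[z] for z in pixels) / len(pixels)


# Toy 3x3 segment with the same softmax output at every pixel.
segment = {(v, h) for v in range(3) for h in range(3)}
heat_map = {z: variation_ratio([0.7, 0.2, 0.1]) for z in segment}
inner = interior(segment)
boundary = segment - inner
mean_inner = mean_dispersion(heat_map, inner)
mean_boundary = mean_dispersion(heat_map, boundary)
```

For the 3x3 square only the central pixel is interior, so small or thin segments consist mostly of boundary pixels.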

4 Prediction of IoU from Time Series

A measure to determine the prediction accuracy of the segmentation network with respect to the ground truth is the IoU. To define it, we introduce the set of connected components in the ground truth , analogously to (the set of connected components in the predicted segmentation ). For let . For each the IoU is given by

(8)
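
A segment-wise IoU can be sketched under the following assumption: the predicted segment is compared with the union of all ground truth components of the same class that intersect it. Segments are sets of pixel coordinates and the function name `segment_iou` is hypothetical.

```python
def segment_iou(k, gt_components):
    """IoU of a predicted segment k with the ground truth components touching it.

    k             -- pixel set of the predicted segment
    gt_components -- list of pixel sets, ground truth components of the same class
    """
    touching = [g for g in gt_components if g & k]
    if not touching:
        return 0.0  # no intersection with the ground truth
    union_gt = set().union(*touching)
    return len(k & union_gt) / len(k | union_gt)


k = {(0, 0), (0, 1)}
ground_truth = [{(0, 1), (0, 2)}, {(5, 5)}]
iou = segment_iou(k, ground_truth)      # 1 shared pixel, 3 pixels in the union
iou_false_positive = segment_iou({(9, 9)}, ground_truth)
```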

In our tests we use a slight modification, i.e., the adjusted IoU

(9)

with proposed in [Rottmann2018]. In this work, we make segment-wise predictions of the $\mathit{IoU}_{\mathrm{adj}}$ via different regression approaches and classify between $\mathit{IoU}_{\mathrm{adj}} = 0$ and $\mathit{IoU}_{\mathrm{adj}} > 0$ (meta classification) for every predicted segment. These prediction tasks are performed by means of the metrics introduced in section 3. Note that these metrics can be computed without knowledge of the ground truth. Our aim is to analyze to what extent they are suitable for the meta tasks and what influence the temporal information has. For each segment in frame we have the metrics and further measures from the previous frames due to the segment matching. For meta regression and classification we make use of these metrics , , where describes the number of considered frames. For meta classification we define and we want to predict this value by three different methods. One method used is the least absolute shrinkage and selection operator (LASSO [Tibshirani1996]). The LASSO method makes use of $\ell_1$-penalization and investigates the predictive power of different combinations of input variables. Here, we study the influence of the various metrics of previous frames. The two other methods we apply to meta classification are gradient boosting [Friedman2002] and a shallow neural network, which contains only one hidden layer with neurons. For the meta regression we follow a similar procedure and compare six methods for this task. These include simple linear regression as well as linear regression with $\ell_1$- and $\ell_2$-penalization. Furthermore, we use gradient boosting and two shallow neural networks, one with $\ell_1$-penalization and another one with $\ell_2$-penalization.
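
The construction of the time series input can be sketched as a concatenation of the metrics of a tracked segment over the current and a number of previous frames. How frames in which the segment does not exist are treated is not specified here; filling them with zeros is an assumption of this sketch, as are the function name and the data layout.

```python
def time_series_features(history, seg_id, t, n_prev, n_metrics):
    """Concatenate the metrics of one segment for frames t, t-1, ..., t-n_prev.

    history   -- dict: frame index -> dict: segment id -> list of metrics
    n_metrics -- number of metrics per frame, used for zero filling (assumption)
    """
    features = []
    for frame in range(t, t - n_prev - 1, -1):
        metrics = history.get(frame, {}).get(seg_id)
        features.extend(metrics if metrics is not None else [0.0] * n_metrics)
    return features


# Segment 7 exists in frames 0 and 1 with two metrics per frame.
history = {0: {7: [0.1, 10.0]}, 1: {7: [0.2, 12.0]}}
single_frame = time_series_features(history, 7, t=1, n_prev=0, n_metrics=2)
with_history = time_series_features(history, 7, t=1, n_prev=2, n_metrics=2)
```

The single-frame input is the special case `n_prev=0`; longer time series extend the same feature vector.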

methods
LR      linear regression
LR L1   linear / logistic regression with $\ell_1$-penalization
LR L2   linear regression with $\ell_2$-penalization
GB      gradient boosting
NN L1   neural network with $\ell_1$-penalization
NN L2   neural network with $\ell_2$-penalization
Table 1: Overview of meta classification and regression methods. For classification we consider LR L1, GB, NN L2 and for regression all of them.

An overview of the different methods for regression and classification is given in table 1.

5 Numerical Results

Table 2: Results for meta classification and regression for the different methods. The superscript denotes the number of frames where the best performance, and in particular the given values, are reached. The best classification and regression results are highlighted. Meta classification columns: ACC and AUROC for each of LR L1, GB and NN L2. Meta regression columns: LR, LR L1, LR L2, GB, NN L1 and NN L2.
Figure 2: (a): Predicted $\mathit{IoU}_{\mathrm{adj}}$ vs. $\mathit{IoU}_{\mathrm{adj}}$ for all non-empty segments. The dot size is proportional to the segment size. (b): Segment lifetime (time series length) vs. mean interior segment size, both on log scale.
Figure 3: Ground truth image (bottom left), prediction obtained by a neural network (bottom right), a visualization of the true segment-wise $\mathit{IoU}_{\mathrm{adj}}$ of prediction and ground truth (top left) and its prediction obtained from meta regression (top right). Green color corresponds to high values and red color to low ones. For the white regions there is no ground truth available; these regions are not included in the statistical evaluation.
Figure 4: A selection of results for meta classification AUROC and regression $R^2$ as functions of the number of frames and for different compositions of training data (cf. table 3). (a): meta classification via a neural network with $\ell_2$-penalization, (b): meta classification via gradient boosting, (c): meta regression via gradient boosting.

In this section, we investigate the properties of the metrics defined in the previous sections, the influence of the length of the time series considered and of different meta classification and regression methods. We perform our tests on two different datasets for the semantic segmentation of street scenes where videos are also available: the synthetic VIPER dataset [Richter2017] obtained from the computer game GTA V and the KITTI dataset [Geiger2013] with real street scene images from Karlsruhe, Germany. In all our tests we consider two different DeepLabv3+ networks [Chen2018] for semantic segmentation, for which we use a reference implementation in TensorFlow [Abadi2015]. The DeepLabv3+ implementation and weights are available for two network backbones. First, there is the Xception65 network, a modified version of Xception [Chollet2017], which is a powerful structure for server-side deployment. Second, MobilenetV2 [Sandler2018] is a fast structure designed for mobile devices. Primarily we use Xception65 for VIPER and MobilenetV2 for KITTI; for the latter we also use Xception65 as a reference network to generate pseudo ground truth for the meta classification and regression tasks. For tests with KITTI we use the publicly available weights for both networks.

For tracking segments with our procedure, we assign the parameters defined in section 2 with the following values: , , and . We study the predictive power of our metrics and segment-wise averaged class probabilities per segment and frame. From our tracking algorithm we get these metrics additionally from previous frames for every segment.

VIPER dataset.   The VIPER dataset consists of more than K high-resolution video frames and for all frames there is ground truth available for classes. We trained an Xception65 network starting from the weights for ImageNet [Russakovsky2015]. We choose an output stride of and the input image is evaluated within the framework only on its original scale (DeepLab allows for evaluation on different scales and averaging the results). For a detailed explanation of the chosen parameters we refer to [Chen2018]. We retrain the Xception65 net on the VIPER dataset on training images and validation images. We only use images from the day category (i.e., bright images, no rain) for training and further processing. We achieve a mean IoU of . If we take out the classes with a mean IoU below , the total mean rises to . This case applies to the three classes mobile barrier, chair and van, classes that are also underrepresented in the dataset. For meta classification and regression we use only video sequences consisting of images in total. From these images we obtain roughly segments (not yet matched over time) of which have non-empty interior. The latter are used in all numerical tests. We investigate the influence of time-dynamics on meta classification and regression, i.e., we first present only the segment-wise metrics of a single frame to the meta classifier/regressor, and then extend the metrics to time series with a length of up to previous time steps , . In summary, we obtain different inputs for the meta classification and regression models. The presented results are averaged over runs obtained by random sampling of the train/validation/test splitting. In tables and figures, the corresponding standard deviations are given in brackets and by shades, respectively. Out of the segments with non-empty interior, have an . We start with the detection of the segments with $\mathit{IoU}_{\mathrm{adj}} = 0$, i.e., we perform meta classification to detect false positive segments.
To this end, we use segments that are not presented to the segmentation network during training and apply a train/validation/test splitting of 70%/10%/20%. To evaluate the performance of different models for meta classification, we consider classification accuracy and AUROC values. The AUROC is obtained by varying the decision threshold in a binary classification problem, here for the decision between $\mathit{IoU}_{\mathrm{adj}} = 0$ and $\mathit{IoU}_{\mathrm{adj}} > 0$. We achieve test AUROC values of up to and accuracies of up to . Table 2 shows the best results for three different meta classification methods, i.e., logistic regression, a neural network and gradient boosting, cf. table 1. The superscript denotes the number of frames where the best performance, and in particular the given values, are reached. On the one hand, we observe that the best results are achieved when considering more than one frame. On the other hand, significant differences between the methods for meta classification can be observed: gradient boosting shows the best performance with respect to classification accuracy and AUROC.
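
The two evaluation measures for meta classification can be sketched in a few lines. The AUROC is computed here through its equivalent rank formulation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. Encoding segments that intersect the ground truth as label 1, the fixed accuracy threshold of 0.5 and the function names are assumptions of this sketch.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of correct decisions when scores at or above the threshold count as class 1."""
    hits = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return hits / len(labels)


labels = [0, 0, 1, 1]            # 1: segment intersects the ground truth (assumed encoding)
scores = [0.1, 0.4, 0.35, 0.8]   # meta classifier outputs
area = auroc(labels, scores)
acc = accuracy(labels, scores)
```

Unlike the accuracy, the AUROC does not depend on a particular threshold, which makes it less sensitive to the class imbalance between the two cases.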

In the next step, we predict $\mathit{IoU}_{\mathrm{adj}}$ values via meta regression to get an uncertainty measure. For this task we report the resulting standard deviations and $R^2$ values. We achieve $R^2$ values of up to . This value is obtained by gradient boosting incorporating previous frames. For this particular study, the relationship between the calculated and predicted $\mathit{IoU}_{\mathrm{adj}}$ is shown in fig. 2 (a); an illustration of the resulting uncertainty measure is given in fig. 3. We also provide video sequences that visualize the prediction and the segment tracking, see https://youtu.be/TQaV5ONCV-Y. Results for meta regression are also summarized in table 2; the findings are analogous to those for meta classification. Gradient boosting performs best, and more frames yield better results than a single one. Figure 2 (b) depicts the time series length vs. the mean interior segment size. On average, a predicted segment exists for frames; however, when we consider only segments that contain at least 1,000 interior pixels, the average lifetime increases to frames.
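
The evaluation of a meta regressor can be sketched as follows, assuming the reported quantities are the standard deviation of the residuals and the coefficient of determination $R^2$. The function names and the toy values are illustrative only.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: one minus residual over total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def residual_std(y_true, y_pred):
    """Root mean squared residual (assumed definition of the standard deviation)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


true_iou = [0.0, 0.5, 1.0]        # toy targets
predicted_iou = [0.1, 0.5, 0.9]   # toy meta regression outputs
r2 = r_squared(true_iou, predicted_iou)
sigma = residual_std(true_iou, predicted_iou)
```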

KITTI dataset.   For the KITTI dataset, we use both DeepLabv3+ networks (pre-trained on the Cityscapes dataset [Cordts2016], available on GitHub). As parameters for the Xception65 network we choose an output stride of , a decoder output stride of and an evaluation of the input on scales of , and (averaging the results). For the MobilenetV2 we use an output stride of and the input image is evaluated within the framework only on its original scale. We use both nets to generate the output probabilities on the KITTI dataset. In our tests we use street scene videos consisting of images with a resolution of . Of these images, only are labeled. An evaluation of meta regression and classification requires a train/validation/test splitting, for which the small number of labeled images seems almost insufficient. Hence, we acquire alternative sources of useful information besides the (real) ground truth. First, we utilize the Xception65 net with high predictive performance; we term its predicted segmentations pseudo ground truth. We generate pseudo ground truth for all images where no ground truth is available. The mean IoU performance of the Xception65 net for the 142 labeled images is roughly 65% (and for the MobilenetV2 the mean IoU is about ). In addition, to augment the structured dataset of metrics, we apply a variant of SMOTE for continuous target variables for data augmentation (see [Chawla2002, Torgo2013]).
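
The idea behind SMOTE-style augmentation for a continuous target can be sketched as an interpolation between a sample and one of its nearest neighbours, applied to both the metrics and the target. This is a simplified illustration and not necessarily the variant used here: the neighbour count `k`, the use of one common interpolation factor for features and target, and all names are assumptions.

```python
import random


def smote_regression(X, y, n_new, k=2, seed=0):
    """Create n_new synthetic (features, target) pairs by neighbour interpolation.

    X -- list of feature vectors, y -- list of continuous targets.
    """
    rng = random.Random(seed)
    new_X, new_y = [], []
    for _ in range(n_new):
        i = rng.randrange(len(X))
        # Indices of the other samples, sorted by squared Euclidean distance to sample i.
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
        j = rng.choice(order[:k])
        lam = rng.random()  # interpolation factor in [0, 1)
        new_X.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
        new_y.append(y[i] + lam * (y[j] - y[i]))
    return new_X, new_y


X = [[0.0], [1.0], [2.0], [10.0]]   # toy one-dimensional metrics
y = [0.0, 0.1, 0.2, 1.0]            # toy regression targets
synthetic_X, synthetic_y = smote_regression(X, y, n_new=5)
```

Synthetic samples always lie on the line segment between two existing samples, so they stay within the range of the observed data.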

splitting | composition | types of data / annotation | no. of segments
train | R | real | 3,400
train | RA | real and augmented | 27,000
train | RAP | real, augmented and pseudo | 27,000
train | RP | real and pseudo | 27,000
train | P | pseudo | 27,000
val | | real | 500
test | | real | 1,000
Table 3: Train/val/test splitting, different compositions of training data and their approximate number of segments.

An overview of the different compositions of training data and the train/val/test splitting is given in table 3. The train/val/test splitting of the data with ground truth available is the same as for the VIPER dataset, i.e., 70%/10%/20%. The shorthand “augmented” refers to data obtained from SMOTE, “pseudo” refers to pseudo ground truth obtained from the Xception65 net and “real” refers to ground truth obtained from a human annotator. The augmented and pseudo data are only used during training. We utilize the Xception65 network exclusively for the generation of pseudo ground truth; all tests are performed using the MobilenetV2. The KITTI dataset consists of classes ( classes less than VIPER), thus we have metrics in total.
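The splitting and the training compositions described above can be sketched as follows. The sketch assumes a random split over samples (the paper may split at the image or video level) and that each data source is available as an array of per-segment metric vectors; `split_indices`, `COMPOSITIONS` and `compose_training_set` are hypothetical names.

```python
import numpy as np


def split_indices(n, fractions=(0.7, 0.1, 0.2), seed=0):
    """Randomly split n sample indices into train/val/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]


# Which data sources enter each training composition of table 3.
COMPOSITIONS = {
    "R": ("real",),
    "RA": ("real", "augmented"),
    "RAP": ("real", "augmented", "pseudo"),
    "RP": ("real", "pseudo"),
    "P": ("pseudo",),
}


def compose_training_set(pools, name):
    """Stack the metric arrays of all sources used by composition `name`."""
    parts = [pools[src] for src in COMPOSITIONS[name]]
    return np.concatenate(parts, axis=0)
```

Validation and test data are not composed this way, since they consist of real ground truth only.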

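Meta Classification
composition | LR L1 (ACC, AUROC) | GB (ACC, AUROC) | NN L2 (ACC, AUROC)
R | | |
RA | | |
RAP | | |
RP | | |
P | | |
Meta Regression
composition | LR | LR L1 | LR L2 | GB | NN L1 | NN L2
R | | | | | |
RA | | | | | |
RAP | | | | | |
RP | | | | | |
P | | | | | |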
Table 4: Results for meta classification and regression for different compositions of training data and methods. The superscript denotes the number of frames where the best performance and thus the given value is reached. The best results for each data composition are highlighted.

From the chosen images, we obtain segments of which have non-empty interior. Of these segments, have an . A selection of results for meta classification AUROC and regression as functions of the number of frames, i.e., the maximum time series length, is given in fig. 4. The meta classification results for neural networks presented in subfigure (a) indeed show that an increasing length of time series has a positive effect on meta classification. On the other hand, the results in subfigure (b) show that gradient boosting does not benefit as much from time series. In both cases augmentation and pseudo ground truth do not improve the models’ performance on the test set, and although the neural network benefits a lot from time series, its best performance is still about below that of gradient boosting. With respect to the influence of time series length, the results for meta regression with gradient boosting in subfigure (c) are qualitatively similar to those in subfigure (b). However, we observe in this case that the incorporation of pseudo ground truth slightly increases the performance. Notably, gradient boosting trained with real ground truth and gradient boosting trained only with pseudo ground truth perform almost equally well. This shows that meta regression can be learned when no ground truth but a strong reference model is available. Note that this (except for the data augmentation part) is in accordance with our findings for the VIPER dataset. Results for a wider range of tests (including those previously discussed) are summarized in table 4. Again we provide video sequences that visualize the prediction and the segment tracking, see https://youtu.be/YcQ-i9cHjLk. For meta classification, we achieve accuracies of up to and AUROC values of up to ; for meta regression we achieve values of up to . As the 142 labeled images only yield 4,877 segments, we observe overfitting in our tests for all models when increasing the length of the time series. This might explain why, in some cases, time series do not increase performance. In particular, we observe overfitting in our tests when using gradient boosting; this holds for both datasets, KITTI as well as VIPER. It is indeed well-known that gradient boosting requires plenty of data.
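The growth of the input dimension with the time series length, which is one plausible reason for the overfitting noted above, can be seen in a minimal sketch of how per-frame metrics of a tracked segment might be stacked into one input vector. The sketch assumes that missing earlier frames are padded with a constant; how the published code handles segments with shorter histories is not stated here, and `time_series_features` is a hypothetical name.

```python
import numpy as np


def time_series_features(history, n_prev, fill=0.0):
    """Stack the metrics of the latest frame and up to n_prev earlier frames.

    history: list of metric vectors of one tracked segment, oldest first.
    Assumption: frames that do not exist (segment tracked for fewer than
    n_prev + 1 frames) are padded with the constant `fill`.
    """
    history = [np.asarray(m, dtype=float) for m in history]
    n_metrics = history[-1].shape[0]
    frames = history[::-1][: n_prev + 1]          # newest frame first
    padding = [np.full(n_metrics, fill)] * (n_prev + 1 - len(frames))
    return np.concatenate(frames + padding)
```

With m metrics per frame and n previous frames, every segment is described by m(n + 1) inputs, while the number of labeled segments stays fixed.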

6 Conclusion and Outlook

In this work we extended the approach presented in [Rottmann2018] by incorporating time series as input for meta classification and regression. To this end, we introduced a light-weight tracking algorithm for semantic segmentation. From matched segments we generated time series of metrics and used these as inputs for the meta tasks. In our tests we studied the influence of the time series length on different models for the meta tasks, i.e., gradient boosting, neural networks and linear models. Our results show significant improvements in comparison to those presented in [Rottmann2018]. More precisely, in contrast to the single frame approach using only linear models, we increase the accuracy by pp and the AUROC by pp. The value for meta regression is increased by pp. As a further improvement, we plan to develop additional time-dynamical metrics, as the presented metrics are still single-frame based. In addition, we plan to further investigate and improve the tracking algorithm by using autoregressive time series and comparing it with approaches based on bounding boxes. Another interesting direction could be to jointly perform segmentation and tracking. The source code of our method is publicly available at https://github.com/kmaag/Time-Dynamic-Prediction-Reliability.

References
