Uncertainty Measures and Prediction Quality Rating for the Semantic Segmentation of Nested Multi Resolution Street Scene Images
Abstract
In the semantic segmentation of street scenes, the reliability of a prediction, and therefore uncertainty measures, are of highest interest. We present a method that generates for each input image a hierarchy of nested crops around the image center and presents these, all rescaled to the same size, to a neural network for semantic segmentation. The resulting softmax outputs are then post-processed such that we can investigate the mean and variance over all image crops as well as the mean and variance of uncertainty heat maps obtained from pixelwise uncertainty measures, like the entropy, applied to each crop’s softmax output. In our tests, we use the publicly available DeepLabv3+ MobilenetV2 network (trained on the Cityscapes dataset) and demonstrate that the incorporation of crops improves the quality of the prediction and that we obtain more reliable uncertainty measures. These are then aggregated over predicted segments for either classifying between IoU = 0 and IoU > 0 (meta classification) or predicting the IoU via linear regression (meta regression). The latter yields reliable performance estimates for segmentation networks, which is particularly useful in the absence of ground truth. For the task of meta classification we obtain a classification accuracy of and an AUROC of . For meta regression we obtain an R² value of . These results yield significant improvements compared to other approaches.
1 Introduction
In recent years, deep learning has outperformed other classes of predictive models in many applications. In some of these, e.g. autonomous driving or diagnostic medicine, the reliability of a prediction is of highest interest. In classification tasks, thresholding on the highest softmax probability or on the entropy of the classification distribution (softmax output) are commonly used approaches to detect false predictions of neural networks, see e.g. [DBLP:journals/corr/HendrycksG16c, DBLP:journals/corr/LiangLS17]. Metrics like the classification entropy or the highest softmax probability are also combined with model uncertainty (Monte Carlo (MC) dropout inference) or input uncertainty, cf. [Gal:2016:DBA:3045390.3045502] and [DBLP:journals/corr/LiangLS17], respectively. See [oberdiek2018] for further uncertainty metrics. These approaches have proven to be practically efficient for detecting uncertainty and some of them have also been transferred to semantic segmentation tasks. The work presented in [DBLP:journals/corr/KendallBC15] makes use of MC dropout to model the uncertainty of segmentation networks and also shows performance improvements in terms of segmentation accuracy. This approach was used in other works to model the uncertainty and filter out predictions with low reliability, cf. e.g. [Kampffmeyer2016SemanticSO, DBLP:journals/corr/abs180710584]. In [huang2018efficient] this line of research was further developed to detect spatial and temporal uncertainty in the semantic segmentation of videos. In [rottmann18] the concept of meta classification in semantic segmentation, i.e., the task of predicting whether a predicted segment intersects with the ground truth or not, was introduced. This can be formulated as the task of classifying between IoU = 0 and IoU > 0 for every predicted segment, where IoU denotes the intersection over union (also known as Jaccard index [Jaccard12similarityCoefficient]). Furthermore, a framework for the prediction of the IoU via linear regression (meta regression) was proposed.
The prediction of the IoU can be seen as a performance estimate which, after training a model, can be computed in the absence of ground truth. Both predictors use segmentwise metrics extracted from the segmentation network’s softmax output as their input. A visualization of a segmentwise rating is given in fig. 1. Apart from the discussed uncertainty related methods, there are also works based on input image statistics. For instance, in [rejectFP] a method for the rejection of false positive predictions is introduced. Performance measures for the segmentation of videos, also based on image statistics and boundary shapes, are introduced in [perf_measure_video].
In this work we elaborate on the uncertainty based approach from [rottmann18], a method that consists of three simple steps. First, the segmentation network’s softmax output is used to generate uncertainty heat maps, e.g. the pixelwise entropy (cf. fig. 3). In the second step, these uncertainty heat maps are aggregated over the predicted segments and combined with other quantities derived from the predicted segments, e.g. the number of pixels per segment. From this we obtain segmentwise metrics. In the third step, these metrics are inputs for either a meta classification (between IoU = 0 and IoU > 0) or a meta regression for predicting the IoU. In this paper, we perform the same prediction tasks; however, we improve the method in all three of its steps.
In many scenarios, the camera system in use provides images with very high resolution which are coarsened before presenting them to the segmentation network. Thus we lose information, especially for objects further away from the camera. Therefore we propose a method that constructs a hierarchy of nested image crops where all images have a common center point, see fig. 2. All crops are then resized to the input size expected by the segmentation network such that we obtain an equally sized batch of input images. This can be processed by the neural network in a data parallel batch mode. Most neural network libraries, such as TensorFlow [tensorflow2015whitepaper], are well vectorized over the input batch. Thus the increase in execution time should be sublinear in the number of crops. The outputs of the segmentation network are then scaled back to their original sizes. In addition, we add kernel functions to let the crops smoothly fade into the combination of all larger crops, which have been merged with their predecessors recursively in the same way. We do this in order to avoid boundary effects. From this procedure we obtain a batch of probability distributions that are inputs to uncertainty measures, e.g. the entropy, probability margin and variation ratio. These are applied pixelwise and yield heat maps for each probability distribution. A mean and a variance over all image crop heat maps give us additional uncertainty information compared to the uncertainty information used in [rottmann18].
Furthermore we elaborate on the approach from [rottmann18] by introducing additional metrics that are derived from each segment’s uncertainty and geometry information. In summary we end up with metrics (plus predicted class probabilities averaged over the predicted segments) in contrast to the metrics (plus class probabilities) introduced in [rottmann18]. In addition to that, we study the incorporation of neural networks in meta classification and regression.
In our tests, we employ the publicly available DeepLabv3+ MobilenetV2 network [deeplab, mobilenet] that was trained on the Cityscapes dataset [cityscapes]. We perform all tests on the Cityscapes validation set. We demonstrate that the mean probability distribution over all crops provides improved IoU values and that the additional uncertainty heat maps, respectively their mean and variance, yield improved uncertainty information which results in better inputs for meta classification and regression. For the task of meta classification we obtain a classification accuracy of and an AUROC of . For meta regression we obtain an R² value of . We also show that these results yield significant improvements compared to baseline approaches and the results obtained by the predecessor method introduced in [rottmann18].
The remainder of this work is structured as follows: In section 2 we introduce the construction of the nested image crops, the aggregation of their softmax outputs and the resulting uncertainty heat maps. This is followed by the construction of segmentwise metrics using uncertainty and geometry information in section 3. Afterwards we present numerical results. First, we study the segmentation performance for different numbers of image crops. Then, we study how useful our segmentwise metrics are for meta classification and regression. This also includes a variable/metric selection study. Afterwards, we compare the meta classification and regression performance of our approach with baseline approaches and previous ones. Lastly, we study the incorporation of neural networks in meta classification and regression.
2 Nested Image Crops and Uncertainty Measures
Let denote an RGB input image. For a chosen crop distance of we define a restriction operator that removes the top and bottom rows as well as the left and right most pixels from , i.e.,
(1) 
In order to rescale a cropped image to a desired resolution, we define an interpolation operator which performs a bilinear interpolation for such that
and  (2) 
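The restriction and interpolation operators can be illustrated with a minimal sketch in plain Python. The function names, the single-channel toy image, the crop distances and the output size are assumptions chosen for illustration only; they are not the values used in our experiments.

```python
def restrict(img, d):
    """Remove d rows at the top and bottom and d columns at the left and right."""
    if d == 0:
        return [row[:] for row in img]
    return [row[d:-d] for row in img[d:-d]]


def resize_bilinear(img, out_h, out_w):
    """Bilinear interpolation of a 2D list to out_h x out_w (corners aligned)."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = (1 - wx) * img[y0][x0] + wx * img[y0][x1]
            bottom = (1 - wx) * img[y1][x0] + wx * img[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


def nested_batch(img, distances, size):
    """Crop the image at each distance and rescale all crops to one common size."""
    return [resize_bilinear(restrict(img, d), size[0], size[1]) for d in distances]


# Toy 8 x 8 single-channel image and three hypothetical crop distances.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
batch = nested_batch(image, distances=[0, 1, 2], size=(4, 4))
```

All elements of `batch` share the same shape, which is what allows the network to process them in a single batched forward pass.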
A segmentation network with a softmax output layer can be seen as a statistical model that provides for each pixel of the image a probability distribution on the class labels .
(3) 
for . Note that, due to eq. 2, i.e., all inputs being equally shaped, the ’s can be computed in batches which allows for efficient parallelization. In order to combine the probabilities to a common probability distribution we reshape them to their original size via
(4) 
We could now stack , , in a pyramid fashion, sum them up and normalize the results such that we get a new probability distribution. However, this distribution would suffer from artifacts on the boundary of each . To avoid this, we proceed as follows: Let define a zero padding operator such that and is centered in while all other entries are zero. In order to construct a smooth mean probability distribution, we introduce a kernel function that is zero where is zero and equal to one where the next nested crop is not equal to zero. In between these two regions, interpolates linearly. We can now recursively define our set of probability distributions, which we will use for further investigation, by
(5) 
for . Each of the probability distributions can be viewed as a smooth merge of the current crop and the combination of all previously merged crops, due to their recursive definition being merged smoothly as well.
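To make the role of the kernel function concrete, the following one-dimensional sketch shows one plausible reading of the smooth merge: a ramp that is zero outside the current crop, one inside the next nested crop and linear in between, used as the weight of a convex combination. The convex blend, the function names and the toy values are assumptions; the exact recursion is the one given in eq. 5.

```python
def ramp_kernel(n, d_cur, d_next):
    """1D kernel on n pixels: 0 outside the current crop (distance d_cur),
    1 inside the next nested crop (distance d_next), linear in between."""
    kernel = []
    for z in range(n):
        dist = min(z, n - 1 - z)  # distance to the nearest image border
        if dist < d_cur:
            kernel.append(0.0)
        elif dist >= d_next:
            kernel.append(1.0)
        else:
            kernel.append((dist - d_cur) / (d_next - d_cur))
    return kernel


def blend(previous, current_padded, kernel):
    """Convex combination of the previously merged values and the padded crop."""
    return [(1 - k) * p + k * c
            for p, c, k in zip(previous, current_padded, kernel)]


# Toy example: probability of one class along a single image row.
n = 10
previous = [0.2] * n                               # merge of all larger crops
current = [0.0, 0.0] + [0.8] * 6 + [0.0, 0.0]      # zero-padded current crop
kernel = ramp_kernel(n, d_cur=2, d_next=4)
merged = blend(previous, current, kernel)
```

Because the weights of the two terms sum to one at every pixel, a blend of this form maps two probability distributions to a probability distribution again, without a separate normalization step.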
In the following we generate uncertainty heat maps for each by defining pixelwise dispersion measures. Let
(6) 
denote the predicted class. For each pixel we define the entropy (also known as Shannon information [shannon1948]) , the probability margin and the variation ratio by
(7)  
(8)  
(9) 
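For a single pixel, the three dispersion measures can be sketched as follows. The normalization of the entropy by the logarithm of the number of classes and the "one minus" convention for the probability margin and the variation ratio (so that large values indicate high uncertainty) are common choices and are assumptions here; the exact form is given by eqs. 7 to 9.

```python
import math


def entropy(p):
    """Shannon entropy of a distribution p, normalized to [0, 1] by log(len(p))."""
    return -sum(v * math.log(v) for v in p if v > 0) / math.log(len(p))


def probability_margin(p):
    """One minus the gap between the largest and the second largest probability."""
    top, second = sorted(p, reverse=True)[:2]
    return 1.0 - top + second


def variation_ratio(p):
    """One minus the probability of the predicted (most likely) class."""
    return 1.0 - max(p)


# Toy softmax outputs over four classes (hypothetical values).
uniform = [0.25, 0.25, 0.25, 0.25]
confident = [1.0, 0.0, 0.0, 0.0]
```

With these conventions, all three measures are zero for a one-hot distribution and attain their maximum for the uniform distribution.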
For each of these uncertainty measures we define a mean and a variance over the number of crops
and  (10) 
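The pixelwise mean and variance over the crops can be sketched as below. Heat maps are represented as nested lists, and the variance is computed with the population convention (division by the number of crops); both choices are assumptions made for illustration.

```python
def mean_and_variance(heatmaps):
    """Pixelwise mean and (population) variance over a list of equally sized heat maps."""
    n = len(heatmaps)
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    mean = [[sum(h[i][j] for h in heatmaps) / n for j in range(width)]
            for i in range(height)]
    var = [[sum((h[i][j] - mean[i][j]) ** 2 for h in heatmaps) / n
            for j in range(width)]
           for i in range(height)]
    return mean, var


# Two toy 1 x 2 heat maps, e.g. entropy maps for two different crops.
maps = [[[0.2, 0.4]], [[0.4, 0.4]]]
mean_map, var_map = mean_and_variance(maps)
```

A pixel on which all crops agree, such as the second pixel in the toy example, has zero variance regardless of how uncertain each individual crop is.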
Furthermore, we also consider a symmetrized version of the Kullback-Leibler divergence of the mean probabilities and the original probabilities without incorporation of additional crops, i.e.,
(11) 
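A symmetrized Kullback-Leibler divergence for a single pixel can be sketched as follows. The symmetrization by averaging both directions and the small constant `eps` guarding against zero probabilities are assumptions; eq. 11 fixes the exact form used in this work.

```python
import math


def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) for two discrete distributions."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p)."""
    return 0.5 * (kl(p, q) + kl(q, p))


# Toy example: mean probabilities over all crops vs. the original prediction.
p_mean = [0.6, 0.3, 0.1]
p_orig = [0.8, 0.1, 0.1]
divergence = symmetric_kl(p_mean, p_orig)
```

The resulting value is zero when both distributions coincide and grows as the crop-averaged prediction departs from the prediction for the original image.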
A visualization of and is given in fig. 3. The heat maps and are subject to segmentwise investigation.
3 Metrics Aggregated over Segments
For a given image we define the set of connected components (segments) in the predicted segmentation by . Analogously we denote by the set of connected components in the ground truth . For each , we define the following quantities:

the interior where a pixel is an element of if all eight neighbouring pixels are an element of

the boundary

the intersection over union : let be the set of all that have nontrivial intersection with and whose class label equals the predicted class for , then

adjusted : let , as in [rottmann18] we use in our tests

the pixel sizes , ,

the mean dispersion , , defined as
where

the relative sizes ,

the relative mean dispersions ,

the geometric center where and are the vertical and horizontal coordinates of the pixel in , respectively

the mean class probabilities for each class

sets of metrics
for and as well as
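Table 1: IoU per class for 1, 2, 4, 8 and 16 crops, evaluated on all pixels (first five value columns) and on the center section (last five value columns).
number of crops  1  2  4  8  16  1  2  4  8  16 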
0: road  95.94%  96.00%  96.04%  96.10%  96.23%  95.00%  95.05%  95.13%  95.25%  95.52% 
1: sidewalk  71.83%  72.08%  72.31%  72.63%  73.26%  62.58%  62.88%  63.27%  63.91%  65.30% 
2: building  84.83%  85.01%  85.15%  85.32%  85.58%  76.79%  77.07%  77.33%  77.70%  78.43% 
3: wall  34.41%  34.48%  34.40%  34.22%  33.92%  32.55%  32.97%  32.98%  33.12%  32.97% 
4: fence  49.23%  49.92%  49.96%  50.33%  50.49%  41.07%  41.48%  41.47%  42.24%  42.90% 
5: pole  28.97%  29.45%  29.89%  30.55%  31.70%  22.06%  22.50%  22.90%  23.72%  25.59% 
6: traffic light  41.70%  42.35%  42.72%  43.28%  44.23%  26.40%  27.56%  28.10%  29.00%  30.85% 
7: traffic sign  50.59%  50.94%  51.45%  52.08%  53.27%  39.54%  40.08%  40.95%  41.88%  44.03% 
8: vegetation  84.43%  84.58%  84.72%  84.90%  85.23%  77.39%  77.65%  77.92%  78.31%  79.07% 
9: terrain  52.88%  53.25%  53.43%  53.44%  53.69%  43.88%  44.46%  45.08%  45.49%  46.25% 
10: sky  82.82%  82.91%  82.98%  83.16%  83.40%  64.91%  65.07%  65.25%  65.83%  67.20% 
11: person  63.40%  63.85%  64.21%  64.93%  66.11%  63.25%  63.74%  64.20%  65.06%  66.69% 
12: rider  43.63%  43.90%  44.08%  44.50%  45.41%  42.53%  42.85%  43.15%  44.01%  45.41% 
13: car  85.06%  85.20%  85.40%  85.69%  86.23%  79.38%  79.58%  79.87%  80.37%  81.37% 
14: truck  66.64%  66.49%  66.41%  65.82%  64.16%  66.97%  67.54%  67.44%  67.56%  67.00% 
15: bus  70.47%  70.56%  70.56%  70.38%  70.22%  70.95%  71.17%  71.46%  71.60%  71.85% 
16: train  58.44%  59.63%  59.92%  58.87%  57.63%  58.44%  59.46%  60.51%  60.00%  61.15% 
17: motorcycle  48.16%  48.37%  48.63%  49.32%  50.21%  45.21%  45.49%  46.43%  47.28%  48.57% 
18: bicycle  61.74%  62.09%  62.44%  63.01%  63.94%  55.22%  55.73%  56.28%  57.09%  58.65% 
mean IoU  61.85%  62.16%  62.35%  62.55%  62.89%  56.01%  56.44%  56.83%  57.34%  58.36% 
Typically, is large for . This motivates the separate treatment of interior and boundary in all dispersion measures. Furthermore, we observe that bad or wrong predictions often come with fractal segment shapes (which have a relatively large amount of boundary pixels, measurable by and ) and/or high dispersions on the segment’s interior. With the exception of the IoU and the adjusted IoU, all scalar quantities defined above can be computed without knowledge of the ground truth. Our aim is to analyze to which extent they are suited for the tasks of meta classification and meta regression for the .
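Several of the segmentwise quantities of this section can be sketched in a few lines. Segments are represented as sets of (row, column) pixel coordinates; the function names, the dictionary keys and the toy data are assumptions, and the IoU follows the verbal definition above (intersection and union with all ground truth components of the predicted class that intersect the segment).

```python
def interior(segment):
    """Pixels of the segment whose eight neighbours all belong to the segment."""
    return {(i, j) for (i, j) in segment
            if all((i + di, j + dj) in segment
                   for di in (-1, 0, 1) for dj in (-1, 0, 1))}


def segment_metrics(segment, heatmap):
    """Pixel sizes, mean dispersions and geometric center of one segment."""
    inner = interior(segment)
    boundary = segment - inner

    def mean(pixels):
        return sum(heatmap[i][j] for i, j in pixels) / len(pixels) if pixels else 0.0

    return {
        "S": len(segment), "S_in": len(inner), "S_bd": len(boundary),
        "D": mean(segment), "D_in": mean(inner), "D_bd": mean(boundary),
        "center": (sum(i for i, _ in segment) / len(segment),
                   sum(j for _, j in segment) / len(segment)),
    }


def segment_iou(segment, predicted_class, gt_components):
    """IoU with the union of all intersecting ground truth components of the same class."""
    matched = set()
    for label, pixels in gt_components:
        if label == predicted_class and segment & pixels:
            matched |= pixels
    union = segment | matched
    return len(segment & matched) / len(union)


# Toy data: a 3 x 3 segment in a 5 x 5 heat map, high dispersion at its center.
segment = {(i, j) for i in range(1, 4) for j in range(1, 4)}
heatmap = [[0.5] * 5 for _ in range(5)]
heatmap[2][2] = 1.0
metrics = segment_metrics(segment, heatmap)
gt = [(1, {(i, j) for i in range(2, 5) for j in range(2, 5)}),
      (2, {(1, 1)})]
iou = segment_iou(segment, predicted_class=1, gt_components=gt)
```

Only `segment_iou` requires the ground truth; all entries of `metrics` are available at inference time.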
4 Numerical Experiments: Street Scenes
Table 5: Meta classification (ACC, AUROC) and meta regression results for the metric subsets entropy, probability margin, class probabilities, variation ratio and segment sizes, as well as for all metrics with and without variances.
Table 6: Greedy metric selection: meta classification accuracy and meta regression R² as functions of the number of selected metrics.
number of metrics  1  2  3  4  5  6  7  8  9  10  11  12  61 
classification accuracy (in %)  77.25  78.01  78.45  78.84  78.89  79.01  79.18  79.28  79.33  79.38  79.41  79.44  79.58 
added metric  all  
regression R² (in %)  71.95  75.01  77.76  79.29  80.00  80.23  80.59  80.86  81.01  81.07  81.12  81.17  81.71 
added metric  all 
In this section we investigate the properties of the nested crops and the metrics defined in the previous sections for the example of a semantic segmentation of street scenes. To this end, we consider the DeepLabv3+ network [deeplab] with MobilenetV2 [mobilenet] encoder for which we use a reference implementation in TensorFlow [tensorflow2015whitepaper] as well as weights pretrained on the Cityscapes dataset [cityscapes] (available on GitHub). As parameters for the DeepLabv3+ framework we use an output stride of , the input image is evaluated within the framework only on its original scale. These parameters result in a mean IoU of 61.85% on the Cityscapes validation set (cf. the single-crop column of table 1); here, mean IoU refers to the mean over all classes. We refer to [deeplab] for a detailed explanation of the chosen parameters.
For our tests we produced 16 crops, i.e., we have nested images for each original image. The Cityscapes validation dataset contains 500 images with a resolution of 2048 × 1024 pixels. Each crop is obtained from the previous one by removing the leftmost and the rightmost columns as well as the top and the bottom rows. In all tests we only consider segments with nonempty interior. For the combined prediction using all 16 crops, MobilenetV2 predicts segments of which have nonempty interior. From those segments with nonempty interior, have . This gives a meta classification accuracy baseline of if we predict that each segment has . Note that, when only using the prediction of the original image, we obtain components, with nonempty interior of which have (resulting in meta classification baseline accuracy). Thus, meta classification results for different numbers of crops are not directly comparable. Hence, we focus on results for 16 crops in the following studies.
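The meta classification baseline mentioned above is the accuracy of a constant predictor. A minimal sketch, with hypothetical segment labels, reads:

```python
def constant_prediction_accuracy(labels, predicted_label):
    """Accuracy obtained when every segment receives the same predicted label."""
    return sum(1 for label in labels if label == predicted_label) / len(labels)


# Hypothetical labels for five segments: True means the segment overlaps
# the ground truth, False means it does not.
labels = [False, True, True, False, True]
baseline = constant_prediction_accuracy(labels, predicted_label=True)
```

Since this baseline depends on the class balance of the predicted segments, it changes with the number of crops, which is why accuracies for different numbers of crops should not be compared directly.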
All results, if not stated otherwise, were computed from repeated runs where training and validation sets (both of the same size) were resampled. We give mean results as well as corresponding standard deviations in brackets.
Performance depending on the number of crops.
Table 1 contains the values for the classical IoU over the whole Cityscapes validation dataset for the different classes as a function of the number of crops (1, 2, 4, 8, 16), for the entire image (2048 × 1024 pixels) as well as for the center pixels. In both cases the mean IoU increases steadily when adding further crops. For the whole image the mean IoU increases from 61.85% to 62.89% (i.e., by 1.04 percentage points (pp)) and for the center section from 56.01% to 58.36% (by 2.35 pp). This demonstrates that our crop based method indeed has the desired effect on smaller objects further away from the ego car. For classes of particular interest, like person, rider and traffic sign, we observe improvements in the center section of 2.88 pp (for rider) to 4.49 pp (for traffic sign). We make these observations even though the original image is presented to the segmentation network at full resolution and the zoomed crops do not contain any additional information. In summary, these results already justify the deployment of our approach, which can be parallelized well over the data batch. In addition, we obtain further uncertainty information which we investigate in the subsequent paragraphs.
Table 7: Meta classification (ACC, AUROC) and meta regression results on training and validation data for all metrics, the metrics from [rottmann18] and the entropy baseline.
Correlation of segmentwise metrics with the .
Figure 5 contains the Pearson correlation coefficients for all segmentwise metrics for all 16 available image crops constructed in section 3. We observe strong correlations for the measures and where and for the relative size measures and . All other size measures as well as for also show increased correlation coefficients. The variances and the Kullback-Leibler measures seem to play a minor role, however they might contribute additional information for a model that predicts the .
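For reference, the Pearson correlation coefficient between one segmentwise metric and the regression target can be computed as in the following sketch; the toy values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical example: mean entropy per segment vs. segment quality.
mean_entropy = [0.1, 0.2, 0.3, 0.4]
quality = [0.9, 0.7, 0.5, 0.3]
rho = pearson(mean_entropy, quality)
```

A strongly negative coefficient, as in this toy example, indicates that segments with high dispersion tend to be of low quality.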
Metric selection for meta classification and meta regression.
In table 5 we compare different subsets of metrics. For the task of meta classification, we do so in terms of meta classification accuracy ( vs. ) and in terms of the area under the receiver operating characteristic curve (AUROC, see [DBLP:conf/icml/DavisG06]). The receiver operating characteristic curve is obtained by varying the decision threshold of the classification output for deciding whether or . For the task of meta regression we state the resulting standard deviations of the linear regression fit’s residual as well as R² values. We observe that the probability margin heat map yields the most predictive set of metrics, closely followed by the variation ratio. Altogether, all heat maps yield fairly similar results, and the segment sizes also yield a strongly predictive set. The mean class probabilities by themselves are not predictive enough, at least for the linear and logistic regression models used here. In all cases we observe a significant performance increase when incorporating the variance based heat maps; the geometric center also yields valuable extra information. When using all metrics together, another significant increase in all performance measures can be observed. Notably, we obtain AUROC values of up to for meta classification and R² values of up to for meta regression, which demonstrates the predictive power of our metrics. When omitting the variance based metrics, the performance cannot be maintained entirely, i.e., we observe a slight decrease of to pp in all accuracy measures. A visual demonstration of the meta regression performance can be found in fig. 6.
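The two performance measures used here can be sketched in plain Python. The AUROC is computed via its rank formulation (the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, counting ties as one half), which equals the area under the curve obtained by sweeping the decision threshold. The function names and toy values are assumptions.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def r_squared(targets, predictions):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


# Hypothetical meta classifier scores and labels for four segments.
auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# Hypothetical regression targets and predictions.
r2 = r_squared([0.2, 0.5, 0.8], [0.25, 0.5, 0.75])
```

The pairwise formulation is quadratic in the number of segments and is meant for illustration; for large numbers of segments a sort based implementation is preferable.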
In order to further analyze the different subsets of metrics, we perform a greedy heuristic. We start with an empty set of metrics and iteratively add the single metric that improves the meta prediction performance maximally. We perform this test twice, once for the meta classification accuracy and once for the meta regression R². Figure 5 depicts both performance measures as functions of the number of metrics. In both cases the curves stagnate quite quickly, indicating that a small set of metrics might be sufficient for a good model. This is confirmed by the results stated in table 6. For the meta regression, four of the first six metrics are variants of the probability margin. Combined with the geometric center and the relative segment size , this set obtains an R² of . Adding the rest of the metrics to this set only results in an increase of pp to the final R² of 81.71%. For the meta classification we start with at 77.25% classification accuracy, which is only 2.33 pp below the accuracy for all metrics. Six out of the first ten added metrics are class probabilities, and already after the seventh metric we obtain a classification accuracy of 79.18%. In both cases, for meta classification and regression, a small subset of metrics can be determined such that the corresponding performance is close to the performance for the full set of metrics. Also, in both cases the variation ratio heat map is not required.
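The greedy heuristic amounts to a forward selection, sketched below. The scoring function stands in for training and evaluating a meta model on a given subset of metrics; the metric names and the additive toy score are purely hypothetical.

```python
def greedy_selection(candidates, score, max_metrics):
    """Forward selection: repeatedly add the metric that maximizes the score."""
    selected, history = [], []
    remaining = list(candidates)
    while remaining and len(selected) < max_metrics:
        best = max(remaining, key=lambda m: score(selected + [m]))
        selected.append(best)
        remaining.remove(best)
        history.append((best, score(selected)))
    return history


# Toy score: each metric contributes a fixed, hypothetical gain.
gains = {"margin_mean": 0.5, "relative_size": 0.3, "entropy_mean": 0.1}


def toy_score(subset):
    return sum(gains[m] for m in subset)


history = greedy_selection(gains.keys(), toy_score, max_metrics=3)
order = [name for name, _ in history]
```

In practice `score` would fit a logistic or linear regression on the selected metrics and return the validation accuracy or R², so each greedy step requires one model fit per remaining candidate.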
Comparison with baseline approaches and others.
In table 7 we compare our results for all metrics with the set of metrics introduced in [rottmann18] (cf. fig. 5) and an entropy baseline where only a single entropy metric is employed. We do so as the entropy is a very commonly used uncertainty measure. In terms of AUROC we obtain an improvement of pp and in terms of R² of pp. When comparing the full set of metrics with the entropy baseline we obtain very pronounced gaps, pp in AUROC and pp in R². In all three cases training and validation accuracies are close, i.e., we do not observe any overfitting issues.
Table 8: Classification (ACC, AUROC) and regression results for neural networks (training and validation) compared with linear models (validation).
Meta classification and regression with neural networks.
We repeat the tests from table 7 for all metrics; however, this time we use neural networks for meta classification and regression. Our neural networks are equipped with two hidden layers containing neurons each and we employ regularization with . Results are stated in table 8. The difference between training and validation accuracies indicates that the neural network is slightly overfitting. When deploying neural networks instead of linear models, the validation accuracy increases by pp and the validation AUROC by pp. For the meta regression, the standard deviation is reduced by and the R² value is increased significantly by pp. Note that the results for may lack interpretability when using a neural network, just as the whole model trades transparency for performance.
5 Conclusion and Outlook
In this paper we extend the approach presented in [rottmann18]. Firstly, we introduce an approach that generates a batch of nested image crops that are presented to the segmentation network and yield a batch of probability distributions. The aggregated probabilities show improved IoU values, especially with respect to the far range section in the center of the input image. Secondly, we add segmentwise metrics constructed from the variation ratio, the Kullback-Leibler divergence, the geometric center and crop variance based metrics. Thirdly, for the meta classification and meta regression, we replace the linear model with neural networks. All three aspects contribute to a significant improvement over the approach presented in [rottmann18]. More precisely, we obtain an increase in meta classification accuracy of pp and an increase of AUROC of pp. The R² for meta regression is increased by pp. Currently we are working on time-dynamic meta classification and regression approaches which make predictions from time series of metrics. As we only presented an approach for false positive detection, we also plan to combine this with approaches for false negative detection, see e.g. [chan19_1]. Combining these approaches might eventually result in improved segmentation performance, at least with respect to certain classes. The source code of our method is publicly available at https://github.com/mrottmann/MetaSeg/tree/nested_metaseg.
Acknowledgements.
We thank Hanno Gottschalk, Peter Schlicht and Fabian Hüger for discussion and useful advice.