Towards Dependability Metrics for Neural Networks

Chih-Hong Cheng¹, Georg Nührenberg¹, Chung-Hao Huang¹ and Hirotoshi Yasuoka²
¹fortiss - Research Institute of the Free State of Bavaria
Email: {cheng,nuehrenberg,huang}

Neural networks and other data-engineered models are instrumental in developing automated driving components such as perception or intention prediction. The safety-critical nature of this domain makes the dependability of neural networks a central concern for long-lived systems. Hence, it is of great importance to support the development team in evaluating important dependability attributes of the machine learning artifacts during their development process. So far, there is no systematic framework available in which a neural network can be evaluated against these important attributes. In this paper, we address this challenge by proposing eight metrics that characterize the robustness, interpretability, completeness, and correctness of machine learning artifacts, enabling the development team to efficiently identify dependability issues.

I Introduction

For state-of-the-art autonomous driving systems, the use of neural networks has been dominant in designing vision-based perception components such as object detection. These components, together with the underlying engineering process, should be carefully evaluated such that the resulting system meets the desired safety goals.

Overall, we believe that the dependability of an engineered neural network can be reflected by evaluating the following criteria, hereinafter referred to as the “RICC criteria”, which extract the underlying methodological principles in creating safety standards such as ISO 26262:

  • Robustness of a neural network against various effects such as distortion or adversarial perturbation (which is closely related to security).

  • Interpretability in terms of understanding what a neural network has really learned.

  • Completeness in terms of ensuring that the data used in training has covered all important scenarios, if possible.

  • Correctness in terms of a neural network being able to perform the perception task without errors.

To the best of our knowledge, there is currently no systematic framework available in which the dependability aspects of a neural network can be evaluated in a consistent and repeatable manner. This can lead to unreliable systems that may cause life-threatening situations.

In this paper, we address this lack of systematic analysis of neural networks by presenting a set of metrics that help the development team to identify issues in the engineering process. The metrics are applied either to the data or to the created neural network, as a basic validation mechanism and as a means to enable discussions among the team.

In the remainder of this paper, we outline the metrics and their underlying rationale, supplemented by possible engineering methods to improve the computed metrics. Fig. 1 summarizes the name of each metric and its required inputs. We have implemented a tool to compute some of the proposed metrics, but a complete validation of the metrics using a large set of benchmark data sets and production-ready neural networks is left as future work.

Fig. 1: Summary of computed metrics and their required inputs

II Quality Metrics

II-A Scenario coverage metric

Similar to the class imbalance problem [1] when training classifiers in machine learning, one needs to account for the presence of all relevant scenarios in training datasets for neural networks for autonomous driving. A scenario over a list $\langle C_1, \dots, C_n \rangle$ of operating conditions (e.g., weather and road condition) is given by a valuation of each condition. E.g., let $C_1$ represent the weather condition, $C_2$ represent the road surfacing, and $C_3$ represent the incoming road orientation. Then any two distinct valuations $(c_1, c_2, c_3)$ and $(c_1', c_2', c_3')$, with $c_i, c_i' \in C_i$, constitute two possible scenarios.

Since checking the coverage of all scenarios is infeasible for realistic specifications of operating conditions due to combinatorial explosion, our proposed scenario coverage metric is based on the concept of 2-projection and is tightly connected to the existing work on combinatorial testing, covering arrays, and their quantitative extensions [2, 3, 4].

Fig. 2: Computing the scenario coverage metric via a 2-projection table


Computing the scenario coverage metric requires that the dataset is semantically labeled according to the specified operating conditions, such that for each data point it can be determined whether it belongs to a certain scenario.


The computation starts by preparing a table recording all possible value pairs for every pair of operating conditions, followed by iterating over each data point to mark the cells it occupies. Lastly, the ratio between the number of occupied cells and the total number of cells is computed. Eq. 1 summarizes the formula, and an illustration with a dataset of two data points can be found in Fig. 2.

$$M_{\mathrm{scene}} = \frac{\text{number of cells occupied by the dataset}}{\text{total number of cells in the 2-projection table}} \qquad (1)$$


Provided that for each operating condition $C_i$ ($i = 1, \dots, n$), the size of $C_i$ is bounded by a constant $\alpha$ (i.e., the categorization is finite and discrete), the denominator can be at most $\binom{n}{2}\alpha^2$, i.e., the number of data points required for full coverage is polynomially bounded.
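The table-based computation can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' tool; it assumes that every data point carries exactly one label per operating condition, and the condition names and values below are hypothetical.

```python
from itertools import combinations


def scenario_coverage(conditions, dataset):
    """Ratio of occupied cells to total cells in the 2-projection table.

    conditions: dict mapping a condition name to its list of possible values
    dataset:    list of dicts mapping a condition name to the observed value
    """
    names = sorted(conditions)
    # One cell per value pair, for every pair of operating conditions.
    total = sum(len(conditions[a]) * len(conditions[b])
                for a, b in combinations(names, 2))
    occupied = set()
    for point in dataset:
        for a, b in combinations(names, 2):
            occupied.add((a, point[a], b, point[b]))
    return len(occupied) / total


# Hypothetical operating conditions and labels, for illustration only.
conditions = {
    "weather": ["sunny", "cloudy", "rainy"],
    "surface": ["stone", "mud", "tarmac"],
    "orientation": ["straight", "curvy"],
}
dataset = [
    {"weather": "sunny", "surface": "stone", "orientation": "straight"},
    {"weather": "rainy", "surface": "tarmac", "orientation": "curvy"},
]
coverage = scenario_coverage(conditions, dataset)  # 6 occupied cells out of 21
```

Each data point occupies at most one cell per pair of conditions, so adding a data point whose value pairs are all unseen raises the numerator by the number of condition pairs.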

Relations to RICC & improving

The metric reflects the completeness and correctness attributes of RICC. To improve the metric, one needs to discover new scenarios. For the example in Fig. 2, an image satisfying a scenario whose value pairs are not yet occupied in the table can efficiently increase the metric.

II-B Neuron $k$-activation metric

The previously described input space partitioning may also be observed through the activation of neurons. By considering ReLU activation as an indicator of successfully detecting a feature, for close-to-output layers where high-level features are captured, the combination of neuron activations in the same layer also forms scenarios (which are independent of the specified operating conditions). Again, we encounter combinatorial explosion, e.g., for a layer of $c$ neurons, there is a total of $2^c$ scenarios to be covered. Therefore, similar to the 2-projection in the scenario coverage metric, this metric only monitors whether the input set has enabled all activation patterns for every neuron pair or triple in the same layer.


The user specifies an integer constant $k$ and a specific layer to be analyzed. Assume that the layer has $c$ neurons.


The computation starts by preparing a table recording all possible $k$-tuples of on-off activation for neurons in the layer being analyzed (similar to Fig. 2, with each entry now having only on and off status), followed by iterating over each data point to mark the cells it occupies. The denominator is given by the number of cells, which has value $\binom{c}{k} 2^k$.


Note that when $k = 1$, our defined neuron $k$-activation metric subsumes the commonly seen neuron coverage acting over a single layer [5, 6], where one analyzes the on-off cases for each individual neuron.
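A minimal Python sketch of this computation follows. It assumes that the post-ReLU outputs of the analyzed layer are already available as one list of values per input, and that a neuron counts as "on" when its output is strictly positive; the function name and the sample values are hypothetical.

```python
from itertools import combinations
from math import comb


def k_activation_coverage(activations, k):
    """Fraction of occupied cells among all k-tuples of on-off activation.

    activations: list of per-input lists holding the (post-ReLU) outputs
                 of the neurons in the analyzed layer
    k:           size of the neuron tuples to monitor
    """
    c = len(activations[0])
    total = comb(c, k) * 2 ** k  # number of cells in the table
    occupied = set()
    for row in activations:
        on_off = tuple(value > 0 for value in row)  # assumed "on" criterion
        for idx in combinations(range(c), k):
            occupied.add((idx, tuple(on_off[i] for i in idx)))
    return len(occupied) / total


# Two hypothetical inputs observed on a layer with three neurons.
acts = [[0.0, 1.2, 0.3],
        [0.5, 0.0, 0.1]]
pair_coverage = k_activation_coverage(acts, 2)    # 6 of 12 cells
single_coverage = k_activation_coverage(acts, 1)  # 5 of 6 cells
```

With k set to 1 the function reduces to per-neuron on-off coverage of the layer.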

Relations to RICC & improving

The metric reflects the completeness and correctness attributes of RICC. To improve the metric, one needs to provide inputs that enable different neuron activation patterns.

II-C Neuron activation pattern metric

While the $k$-activation metric captures completeness in the face of combinatorial explosion, our neuron activation pattern metric is used to understand the distribution of activation. For inputs within the same scenario, the activation patterns should intuitively be similar, implying that the number of activated neurons should be similar.


The user provides an input set $\mathsf{In}$, where all images belong to the same scenario, and specifies a layer of the neural network (with $c$ neurons) to be analyzed. Furthermore, the user chooses the number of groups $\gamma$ for a partition of $\mathsf{In}$ into $\gamma$ groups $G_1, \dots, G_\gamma$, where for group $G_i$, $i \in \{1, \dots, \gamma\}$, the number of activated neurons in the specified layer lies within the $i$-th range of activation counts for each input in this group.


Let $G_j$ be the largest set among $G_1, \dots, G_\gamma$. Then the metric is evaluated by considering all inputs whose activation pattern, aggregated using the number of neurons being activated, significantly deviates from the majority.
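The grouping idea can be illustrated with the following Python sketch. Two points are assumptions made here for illustration and are not taken from the text: the groups are equal-width ranges of the activation count over a layer of `c` neurons, and an input "significantly deviates" when it falls neither into the largest group nor into one of its two neighbouring groups. All names are hypothetical.

```python
def activation_pattern_metric(active_counts, c, gamma):
    """Share of inputs whose activation count deviates from the majority.

    active_counts: number of activated neurons per input (same scenario)
    c:             number of neurons in the analyzed layer
    gamma:         number of groups used to partition the inputs
    """
    sizes = [0] * gamma
    for count in active_counts:
        # Assumption: equal-width ranges over [0, c]; count == c is put
        # into the last group.
        index = min(count * gamma // c, gamma - 1)
        sizes[index] += 1
    largest = max(range(gamma), key=lambda i: sizes[i])
    # Assumption: deviating inputs are those outside the largest group
    # and its immediate neighbours.
    deviating = sum(size for i, size in enumerate(sizes)
                    if abs(i - largest) > 1)
    return deviating / len(active_counts)


# Hypothetical activation counts for six inputs on a 100-neuron layer.
counts = [50, 52, 55, 48, 5, 95]
deviation = activation_pattern_metric(counts, c=100, gamma=10)  # 2 of 6 inputs
```

A value close to zero indicates that inputs of the same scenario activate a similar number of neurons, while a larger value points to inputs worth inspecting individually.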


Relations to RICC & improving

This metric reflects the robustness and completeness attributes of RICC, as well as interpretability. To improve the metric, one needs to carefully examine the reasons for the diversity of activation patterns under the same scenario.

II-D Adversarial confidence loss metric

Vulnerability w.r.t. adversarial inputs [7] is an important quality attribute of neural networks, which are used for image processing and designed to be used in safety-critical systems. As providing a formally provable guarantee against all possible adversarial inputs is hard, our proposed adversarial confidence loss metric is useful in providing engineers an estimate of how robust a neural network is.


Computing the metric requires a list of input transformers $\langle T_1, \dots, T_m \rangle$ where for each $T_j$, given a parameter $\epsilon$ specifying the allowed perturbation, one derives a new input $T_j(\mathsf{in}, \epsilon)$ by transforming input $\mathsf{in}$. Each $T_j$ is one of the known image perturbation techniques, ranging from simple rotation and distortion to advanced techniques such as FGSM [8] or DeepFool [9].


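Given a test set $\mathsf{In}$, a predefined perturbation bound $\epsilon$, and the list of input transformers $\langle T_1, \dots, T_m \rangle$, let $nn(\mathsf{in})$, where $\mathsf{in} \in \mathsf{In}$, be the output of the neural network being analyzed, with a larger value being better. (Here the formulation also assumes that there exists a single output for the neural network, but the formulation can be easily extended to incorporate multi-output scenarios.) The following equation computes the adversarial confidence loss metric.

$$M_{\mathrm{adv}} = \frac{1}{|\mathsf{In}|} \sum_{\mathsf{in} \in \mathsf{In}} \min_{j \in \{1, \dots, m\}} \big( nn(T_j(\mathsf{in}, \epsilon)) - nn(\mathsf{in}) \big)$$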


Intuitively, the metric analyzes the change of the output value for each input $\mathsf{in}$ due to a perturbation, and selects the perturbation that leads to the largest performance drop among all perturbation techniques, i.e., the one that makes the computed difference most negative. A real example is shown in Fig. 3, where the FGSM attack yields the largest classification performance drop among three perturbation techniques, lowering the predicted probability of the class car. The (negative) probability difference caused by FGSM is therefore the value selected for this image. Lastly, the selected values are averaged over all inputs being analyzed.
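The selection of the worst perturbation per input followed by averaging can be sketched in Python as below. The network and the transformers are toy stand-ins invented for this illustration (inputs are plain floats and the "network" is a clipped identity); a real setup would plug in an actual model and perturbation techniques such as rotation or FGSM.

```python
def adversarial_confidence_loss(nn, inputs, transformers, eps):
    """Average, over all inputs, of the most negative output change
    caused by any of the given input transformers."""
    total = 0.0
    for x in inputs:
        base = nn(x)
        # Most negative difference = largest performance drop.
        total += min(nn(t(x, eps)) - base for t in transformers)
    return total / len(inputs)


# Toy stand-ins (hypothetical): a single-output "network" on float inputs.
def toy_nn(x):
    return max(0.0, min(1.0, x))


def shift_down(x, eps):
    return x - eps


def shift_up(x, eps):
    return x + eps


def scale(x, eps):
    return x * (1 - eps)


loss = adversarial_confidence_loss(
    toy_nn, [0.9, 0.5], [shift_down, shift_up, scale], eps=0.1)
```

In this toy setting `shift_down` causes the largest drop for both inputs, so the result is close to -0.1; a value near zero would indicate that none of the transformers reduces the output noticeably.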

Fig. 3: A vehicle image and three perturbed images. The largest classification performance drop is achieved by the FGSM technique.

Relations to RICC & improving

The metric has a clear impact on robustness and correctness. To improve the metric, one needs to introduce perturbed images into the training set, or apply alternative training techniques with provable bounds [10].

II-E Scenario based performance degradation metric

Here we omit details, but for commonly seen performance metrics such as validation accuracy, or even quantitative statistical measures such as MTBF, one may perform a detailed analysis either by considering each scenario separately or by discounting the value due to missing input scenarios (the discount factor can be taken from the computed scenario coverage metric).
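Since the details are omitted, the following Python sketch only illustrates the two options under explicit assumptions: per-scenario analysis is shown for accuracy, and the discount is assumed to be a simple multiplication of the performance value by the scenario coverage. Both function names and the sample records are hypothetical.

```python
def per_scenario_accuracy(records):
    """Accuracy broken down by scenario.

    records: iterable of (scenario, is_correct) pairs
    """
    hits, counts = {}, {}
    for scenario, is_correct in records:
        counts[scenario] = counts.get(scenario, 0) + 1
        hits[scenario] = hits.get(scenario, 0) + (1 if is_correct else 0)
    return {s: hits[s] / counts[s] for s in counts}


def discounted_performance(value, coverage):
    # Assumption: multiplicative discount by the scenario coverage metric.
    return value * coverage


# Hypothetical validation results labeled with their scenario.
records = [("sunny", True), ("sunny", False), ("rainy", True)]
by_scenario = per_scenario_accuracy(records)
discounted = discounted_performance(0.9, 0.5)
```

The per-scenario breakdown exposes scenarios in which the network performs poorly even when the overall accuracy looks acceptable.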

II-F Interpretation precision metric

The interpretation precision metric is intended to judge whether a neural network for image classification or object detection bases its decision on the correct part of the image. E.g., the metric can reveal that a certain class of objects is mostly identified by its surroundings, possibly because it only appears in similar surroundings in the training and validation data. In this case, engineers should test whether this class of object can also be detected in different contexts.


For computing this metric for image classification (or object detection), we need a validation set that has image segmentation ground truth in addition to the ground truth classes (and bounding boxes), e.g., as in the VOC2012 data set [11].


Here we describe how the metric can be computed for a single detected object; the computation can be extended to a set of images by applying average or min/max operators. A real example demonstrating the exact computation is shown in Fig. 5.

  1. Run the neural network on the image to classify an object with probability $p$ (and obtain a bounding box in the case of object detection).

  2. Compute an occlusion sensitivity heatmap $H$, where each pixel $h$ of the heatmap maps to a position of the occlusion on the image [12]. The value of $h$ is given by the probability of the original class for the occluded image. For object detection, we take the maximum probability of the correct class over all detected boxes that have a significant Jaccard similarity with the ground truth bounding box.

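  3. For a given probability threshold $\rho$ that defines the set of hot pixels as $P_{\mathrm{hot}} = \{h \in H \mid h < \rho\}$, and the set of pixels that partly occlude the segmentation ground truth, denoted by $P_{\mathrm{occluding}}$, the metric is computed as follows: $M_{\mathrm{interpret}} = \frac{|P_{\mathrm{hot}} \cap P_{\mathrm{occluding}}|}{|P_{\mathrm{hot}}|}$.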

Fig. 4: An illustrative example of a heatmap for a pedestrian. There are a total of nine hot pixels in orange, i.e., $|P_{\mathrm{hot}}| = 9$, five hot pixels belong to the group of occluding pixels, i.e., $|P_{\mathrm{hot}} \cap P_{\mathrm{occluding}}| = 5$, and the total number of occluding pixels is 30, i.e., $|P_{\mathrm{occluding}}| = 30$.

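An illustrative example of computing the interpretation precision metric $M_{\mathrm{interpret}}$ can be found in Fig. 4, where for the human figure only five out of nine hot pixels intersect the region of the human body. Thus $M_{\mathrm{interpret}} = \frac{5}{9}$. The set of thirty pixels constituting the human forms $P_{\mathrm{occluding}}$.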
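The counting step can be sketched in Python as follows. The sketch assumes that the heatmap is given as a 2D list of class probabilities under occlusion, that a pixel is "hot" when its probability falls below the chosen threshold, and that pixel sets are represented as sets of (row, column) positions. The example pixel layout is hypothetical and only reproduces the counts of Fig. 4 (nine hot pixels, five of them on the object).

```python
def hot_pixels(heatmap, rho):
    """Positions whose occluded-class probability is below the threshold."""
    return {(r, c)
            for r, row in enumerate(heatmap)
            for c, prob in enumerate(row)
            if prob < rho}  # assumed definition of a hot pixel


def interpretation_precision(hot, occluding):
    """Fraction of hot pixels that intersect the object region."""
    if not hot:
        raise ValueError("no hot pixels: the metric is undefined")
    return len(hot & occluding) / len(hot)


# Thresholding a tiny hypothetical heatmap.
example_hot = hot_pixels([[0.9, 0.2], [0.4, 0.95]], rho=0.5)

# Hypothetical layout with the same counts as Fig. 4.
occluding = {(r, c) for r in range(6) for c in range(5)}   # 30 pixels
on_object = {(0, c) for c in range(5)}                     # 5 hot pixels
off_object = {(0, 5), (0, 6), (1, 5), (1, 6)}              # 4 hot pixels
precision = interpretation_precision(on_object | off_object, occluding)
```

For this layout the result is 5/9, matching the value derived from Fig. 4.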

Relations to RICC & improving

The interpretation precision metric contributes to the interpretability and correctness of the RICC criteria. It may reveal that a neural network uses a lot of context to detect some objects, e.g., regions surrounding the object or background of the image. In this case, adding images where these objects appear in different surroundings can improve the metric.

Fig. 5: Computing the interpretation precision metric for the red car and the right person in front of the red car. Panels: (a) result of object detection; (b) heatmap for the red car (bottom left); (c) hot pixels for the red car; (d) heatmap for the right person; (e) hot pixels for the right person. The metric shows that the red car is mostly identified by the correct areas. On the other hand, for the person there are a lot of hot pixels in incorrect regions.

II-G Occlusion sensitivity covering metric

This metric measures the fraction of the object that is sensitive to occlusion. Generally speaking, it is undesirable to have a significant probability drop if only a small part of the object is occluded.

Furthermore, attention should be paid to the location of the occlusion-sensitive area. If a certain part of an object class is occlusion sensitive in many cases (e.g., the head of a dog), it should be tested whether the object can still be detected when this part is occluded (e.g., the head of a dog is behind a sign post). The occlusion sensitivity covering metric is computed in a similar way and based on the same inputs as the interpretation precision metric:

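  1. Perform steps 1) and 2) and determine $P_{\mathrm{hot}}$ and $P_{\mathrm{occluding}}$ as for the interpretation precision metric.

  2. Derive $M_{\mathrm{OccSen}} = \frac{|P_{\mathrm{hot}} \cap P_{\mathrm{occluding}}|}{|P_{\mathrm{occluding}}|}$.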

A high value indicates that many positions of small occlusions can lead to a detection error. A low value indicates that there is a greater chance of still detecting the object when it is partly occluded. An illustrative example of computing the occlusion sensitivity covering metric can be found in Fig. 4, where for the human figure the heatmap only contains five hot pixels intersecting the human body (the head). As there are 30 pixels intersecting the region of the human, the metric evaluates to $\frac{5}{30}$.
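A short Python sketch of this ratio is given below. It assumes that hot pixels and occluding pixels are available as sets of (row, column) positions; the pixel layout is hypothetical and only reproduces the counts of Fig. 4 (five hot pixels on the object, 30 occluding pixels).

```python
def occlusion_sensitivity(hot, occluding):
    """Fraction of the occluding pixels that are hot."""
    if not occluding:
        raise ValueError("no occluding pixels: the metric is undefined")
    return len(hot & occluding) / len(occluding)


# Hypothetical layout with the same counts as Fig. 4.
occluding_px = {(r, c) for r in range(6) for c in range(5)}        # 30 pixels
hot_px = {(0, c) for c in range(5)} | {(0, 5), (0, 6), (1, 5), (1, 6)}
sensitivity = occlusion_sensitivity(hot_px, occluding_px)          # 5 of 30
```

Hot pixels outside the object region do not affect this ratio, since only the intersection with the occluding pixels is counted.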

Relations to RICC & improving

The occlusion sensitivity covering metric addresses the robustness and interpretability attributes of RICC. If the metric values are too high for certain kinds of objects, one approach to improving them is to augment the training set with more images where these objects are only partly visible.

II-H Weighted accuracy/confusion metric

In object classification, not all errors have the same severity, e.g., mistaking a pedestrian for a tree is more critical than the opposite. Apart from pure accuracy measures, one may employ fine-grained analysis such as specifying penalty terms as weights to capture different classification misses.

As such a technique is standard in performance evaluation of machine learning algorithms, the distinguishing aspect is how the weights of confusion are determined. Table I provides a summary of penalties to be applied in traffic scenarios, reflecting the safety aspect. Misclassifying a pedestrian (or bicycle) as background (i.e., no object exists) should be assigned the highest penalty, as pedestrians are unprotected and such an error may easily lead to life-threatening situations.

A is classified to B | B (pedestrian) | B (vehicle)    | B (background)
A (pedestrian)       | n.a. (correct) |                |
A (vehicle)          |                | n.a. (correct) |
A (background)       |                |                | n.a. (correct)
TABLE I: Qualitative severity of safety to be reflected as weights
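How such weights enter an evaluation can be sketched in Python as below. The numeric weights are hypothetical placeholders chosen only so that confusing a pedestrian with background carries the highest penalty; they are not values from Table I, and the confusion counts are invented for illustration.

```python
def weighted_confusion_penalty(confusion, weights):
    """Average penalty per sample.

    confusion[a][b]: number of samples of true class a classified as b
    weights[a][b]:   penalty for classifying a as b (only needed for a != b)
    """
    total = sum(sum(row.values()) for row in confusion.values())
    penalty = sum(weights[a][b] * count
                  for a, row in confusion.items()
                  for b, count in row.items()
                  if a != b)  # correct classifications carry no penalty
    return penalty / total


# Hypothetical confusion counts (rows: true class, columns: predicted class).
confusion = {
    "pedestrian": {"pedestrian": 8, "vehicle": 1, "background": 1},
    "vehicle": {"pedestrian": 0, "vehicle": 9, "background": 1},
    "background": {"pedestrian": 2, "vehicle": 0, "background": 8},
}
# Hypothetical penalty weights; pedestrian -> background is the most severe.
weights = {
    "pedestrian": {"vehicle": 5, "background": 10},
    "vehicle": {"pedestrian": 2, "background": 8},
    "background": {"pedestrian": 1, "vehicle": 1},
}
penalty = weighted_confusion_penalty(confusion, weights)  # 25 / 30
```

Two classifiers with the same plain accuracy can differ substantially in this penalty when one of them makes more of the severe errors.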

Relations to RICC & improving

The metric is a fine-grained indicator of correctness. To improve the metric, one either trains the network with more examples or modifies the loss function such that it is aligned with the weighted confusion, e.g., by setting a higher penalty term when misclassifying a “pedestrian” as “background”.


  • [1] N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study,” Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.
  • [2] J. Lawrence, R. N. Kacker, Y. Lei, D. R. Kuhn, and M. Forbes, “A survey of binary covering arrays,” The Electronic Journal of Combinatorics, vol. 18, no. 1, p. 84, 2011.
  • [3] C. Nie and H. Leung, “A survey of combinatorial testing,” ACM Computing Surveys (CSUR), vol. 43, no. 2, p. 11, 2011.
  • [4] C.-H. Cheng, C.-H. Huang, and H. Yasuoka, “Quantitative projection coverage for testing ML-enabled autonomous systems,” arXiv preprint arXiv:1805.04333, 2018.
  • [5] “Functional safety beyond ISO 26262 for neural networks in highly automated driving,” accessed: 2018-06-01.
  • [6] K. Pei, Y. Cao, J. Yang, and S. Jana, “DeepXplore: Automated whitebox testing of deep learning systems,” in SOSP. ACM, 2017, pp. 1–18.
  • [7] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
  • [8] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
  • [9] S. M. Moosavi Dezfooli, A. Fawzi, and P. Frossard, “DeepFool: a simple and accurate method to fool deep neural networks,” in CVPR, no. EPFL-CONF-218057, 2016.
  • [10] J. Z. Kolter and E. Wong, “Provable defenses against adversarial examples via the convex outer adversarial polytope,” arXiv preprint arXiv:1711.00851, 2017.
  • [11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,”
  • [12] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV. Springer, 2014, pp. 818–833.