Towards Dependability Metrics for Neural Networks
Abstract
Neural networks and other data-engineered models are instrumental in developing automated driving components such as perception or intention prediction. The safety-critical nature of this domain makes the dependability of neural networks a central concern for long-lived systems. Hence, it is of great importance to support the development team in evaluating important dependability attributes of machine learning artifacts during the development process. So far, there is no systematic framework available in which a neural network can be evaluated against these important attributes. In this paper, we address this challenge by proposing eight metrics that characterize the robustness, interpretability, completeness, and correctness of machine learning artifacts, enabling the development team to efficiently identify dependability issues.
I Introduction
For state-of-the-art autonomous driving systems, the use of neural networks has been dominant in designing vision-based perception components such as object detection. These components, together with the underlying engineering process, should be carefully evaluated so that the resulting system meets the desired safety goals.
Overall, we believe that the dependability of an engineered neural network can be reflected by evaluating the following criteria, hereinafter referred to as the "RICC criteria", which extract the underlying methodological principles behind safety standards such as ISO 26262:

Robustness of a neural network against various effects such as distortion or adversarial perturbation (which is closely related to security).

Interpretability in terms of understanding what a neural network has really learned.

Completeness in terms of ensuring that the data used in training has covered all important scenarios, if possible.

Correctness in terms of a neural network being able to perform the perception task without errors.
To the best of our knowledge, there is currently no systematic framework available in which the dependability aspects of a neural network can be evaluated in a consistent and repeatable manner. This can lead to unreliable systems that may cause life-threatening situations.
In this paper, we address this omission of systematic analysis of neural networks by presenting a set of metrics that helps the development team to identify issues in the engineering process. The metrics are applied either on the data or on the created neural network, as a basic validation mechanism and as a means to enable discussions among the team.
In the remainder of this paper, we outline the metrics and their underlying rationale, supplemented by possible engineering methods to improve the computed metrics. Fig. 1 summarizes the name of each metric and its required inputs. We have implemented a tool to compute some of the proposed metrics, but a complete validation of the metrics using a large set of benchmark data sets and production-ready neural networks is left as future work.
II Quality Metrics
II-A Scenario coverage metric
Similar to the class imbalance problem [1] when training classifiers in machine learning, one needs to account for the presence of all relevant scenarios in training datasets for neural networks for autonomous driving. A scenario over a list of operating conditions (e.g., weather and road condition) is given by a valuation of each condition. E.g., let $C_1$ represent the weather condition, $C_2$ represent the road surfacing, and $C_3$ represent the incoming road orientation. Then (sunny, paved, straight) and (rainy, paved, curved) constitute two possible scenarios.
Since for realistic specifications of operating conditions, checking the coverage of all scenarios is infeasible due to combinatorial explosion, our proposed scenario coverage metric is based on the concept of 2-projection and is tightly connected to the existing work on combinatorial testing, covering arrays, and their quantitative extensions [2, 3, 4].
Assumption
Computing the scenario coverage metric requires that the dataset is semantically labeled according to the specified operating conditions, such that for each data point it can be determined whether it belongs to a certain scenario.
Computing
The metric starts by preparing a table recording all possible pairs of operating-condition valuations, followed by iterating over each data point to update the table with occupancy. Lastly, compute the ratio between occupied cells and the total number of cells. Eq. 1 summarizes the formula, and an illustration can be found in Fig. 2, where a dataset of two data points achieves partial coverage.
$M_{\mathrm{scov}} = \dfrac{\text{number of occupied cells}}{\text{total number of cells}}$   (1)
Provided that for each operating condition $C_i$, the size of its domain is bounded by a constant $c$ (i.e., the categorization is finite and discrete), the denominator for $k$ operating conditions can be at most $\binom{k}{2} c^2$, i.e., the number of data points required for full coverage is polynomially bounded.
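The 2-projection table computation above can be sketched in a few lines; the operating conditions, their domains, and the two data points below are illustrative assumptions, not taken from the paper:

```python
from itertools import combinations

def scenario_coverage(data, domains):
    """2-projection scenario coverage: the fraction of all pairwise
    operating-condition valuations occupied by at least one data point."""
    conds = sorted(domains)
    covered = set()
    total = 0
    for c1, c2 in combinations(conds, 2):
        total += len(domains[c1]) * len(domains[c2])  # cells for this pair
        for point in data:
            covered.add((c1, point[c1], c2, point[c2]))
    return len(covered) / total

# hypothetical operating conditions and a two-point dataset
domains = {"weather": ["sunny", "rainy"],
           "road": ["paved", "gravel"],
           "orientation": ["straight", "curved"]}
data = [{"weather": "sunny", "road": "paved", "orientation": "straight"},
        {"weather": "rainy", "road": "paved", "orientation": "curved"}]
print(scenario_coverage(data, domains))  # 6 of 12 pair-cells occupied -> 0.5
```

Here each of the $\binom{3}{2} = 3$ condition pairs contributes $2 \times 2$ cells, so the denominator is 12, matching the $\binom{k}{2} c^2$ bound.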
Relations to RICC & improving
The metric reflects the completeness and correctness attributes of RICC. To improve the metric, one needs to discover new scenarios. For the example in Fig. 2, an image satisfying a previously uncovered scenario efficiently increases the metric.
II-B Neuron activation metric
The previously described input space partitioning may also be observed via the activation of neurons. By considering ReLU activation as an indicator of successfully detecting a feature, for close-to-output layers where high-level features are captured, the combination of neuron activations in the same layer also forms scenarios (which are independent of the specified operating conditions). Again, we encounter combinatorial explosion: e.g., for a layer of $n$ neurons, there is a total of $2^n$ scenarios to be covered. Therefore, similar to the 2-projection in the scenario coverage metric, this metric only monitors whether the input set has enabled all activation patterns for every neuron pair or triple in the same layer.
Assumption
The user specifies an integer constant $k$ (the size of the monitored neuron tuples, e.g., $k = 2$ for pairs or $k = 3$ for triples) and a specific layer to be analyzed. Assume that the layer has $n$ neurons.
Computing
The metric starts by preparing a table recording all possible on-off activation patterns for every $k$-tuple of neurons in the layer being analyzed (similar to Fig. 2, with each condition now having only on and off status), followed by iterating over each data point to update the table with occupancy. The denominator is given by the number of cells, which has value $\binom{n}{k}\, 2^k$.
$M_{\mathrm{neuron}} = \dfrac{\text{number of occupied cells}}{\binom{n}{k}\, 2^{k}}$   (2)
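A minimal sketch of this occupancy computation, assuming the layer's activations have already been binarized (ReLU output > 0); the function name and the toy activation vectors are our own:

```python
from itertools import combinations
from math import comb

def neuron_activation_coverage(activations, k):
    """Fraction of on-off patterns covered over all k-tuples of neurons.
    activations: one boolean vector per input (True = neuron fired)."""
    n = len(activations[0])
    covered = set()
    for act in activations:
        for idx in combinations(range(n), k):
            covered.add((idx, tuple(act[i] for i in idx)))
    # denominator: C(n, k) tuples, each with 2^k on-off patterns
    return len(covered) / (comb(n, k) * 2 ** k)

acts = [[True, False, True],
        [False, False, True]]
print(neuron_activation_coverage(acts, k=2))  # 5 of 12 patterns -> ~0.417
```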
Relations to RICC & improving
The metric reflects the completeness and correctness attributes of RICC. To improve the metric, one needs to provide inputs that enable different neuron activation patterns.
II-C Neuron activation pattern metric
While the neuron activation metric captures completeness in the face of combinatorial explosion, our neuron activation pattern metric is used to understand the distribution of activations. For inputs within the same scenario, intuitively the activation patterns should be similar, implying that the number of activated neurons should be similar.
Assumption
The user provides an input set In, where all images belong to the same scenario, and specifies a layer of the neural network (with $n$ neurons) to be analyzed. Furthermore, the user chooses the number of groups $\gamma$, for a partition of In into groups $G_1, \dots, G_\gamma$, where for group $G_i$, $i \in \{1, \dots, \gamma\}$, the number of activated neurons in the specified layer is within the range $[\,(i-1)\,n/\gamma,\ i\,n/\gamma\,)$ for each input in this group.
Computing
Let $G_{\max}$ be the largest set among $G_1, \dots, G_\gamma$. Then the metric is evaluated by considering all inputs whose activation pattern, aggregated via the number of activated neurons, significantly deviates from the majority.
$M_{\mathrm{pattern}} = \dfrac{|\mathrm{In}| - |G_{\max}|}{|\mathrm{In}|}$   (3)
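Assuming equal-width bins over the activation counts (one plausible reading of the partition into groups), the computation can be sketched as:

```python
def activation_pattern_metric(counts, n, gamma):
    """Fraction of inputs whose activation count falls outside the
    largest group.  counts: number of activated neurons per input in
    one scenario; n: layer size; gamma: number of groups."""
    groups = [0] * gamma
    for c in counts:
        i = min(c * gamma // n, gamma - 1)  # equal-width bin index
        groups[i] += 1
    return 1 - max(groups) / len(counts)

# three inputs activate 5-6 of 10 neurons; one deviates with only 1
print(activation_pattern_metric([5, 6, 5, 1], n=10, gamma=2))  # -> 0.25
```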
Relations to RICC & improving
This metric reflects the robustness and completeness attributes of RICC, as well as interpretability. To improve the metric, one requires a careful examination of the reasons for diversity in the activation patterns under the same scenario.
II-D Adversarial confidence loss metric
Vulnerability w.r.t. adversarial inputs [7] is an important quality attribute of neural networks that are used for image processing and intended for safety-critical systems. As providing a formally provable guarantee against all possible adversarial inputs is hard, our proposed adversarial confidence loss metric is useful in providing engineers with an estimate of how robust a neural network is.
Assumption
Computing the metric requires that there exists a list of input transformers $T_1, \dots, T_m$, where for each $T_j$, given a parameter $\epsilon$ specifying the allowed perturbation, one derives a new input $T_j(\mathrm{in}, \epsilon)$ by transforming the input in. Each $T_j$ is one of the known image perturbation techniques, ranging from simple rotation or distortion to advanced techniques such as FGSM [8] or DeepFool [9].
Computing
Given a test set In, a predefined perturbation bound $\epsilon$, and the list of input transformers, let $f(\mathrm{in})$, where $\mathrm{in} \in \mathrm{In}$, be the output of the neural network being analyzed, with larger values being better (here the formulation assumes a single output for the neural network, but it can easily be extended to incorporate multi-output scenarios). The following equation computes the adversarial confidence loss metric.
$M_{\mathrm{adv}} = \dfrac{1}{|\mathrm{In}|} \sum_{\mathrm{in} \in \mathrm{In}} \min_{j \in \{1, \dots, m\}} \big( f(T_j(\mathrm{in}, \epsilon)) - f(\mathrm{in}) \big)$   (4)
Intuitively, the metric analyzes the change of the output value for input in due to a perturbation $T_j(\mathrm{in}, \epsilon)$, and selects the perturbation which leads to the largest performance drop among all perturbation techniques, i.e., the one that makes the computed value most negative. A real example is shown in Fig. 3, where the FGSM attack yields the largest classification performance drop among three perturbation techniques, substantially lowering the probability of the class car. Lastly, the computed value is averaged over all inputs being analyzed.
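Eq. 4 can be sketched with a toy scalar "network" and two simple perturbation functions standing in for real transformers such as FGSM; all names and values here are illustrative:

```python
def adversarial_confidence_loss(model, inputs, transformers, eps):
    """Average over the test set of the worst (most negative) output
    change caused by any of the given perturbation techniques."""
    total = 0.0
    for x in inputs:
        base = model(x)
        total += min(model(t(x, eps)) - base for t in transformers)
    return total / len(inputs)

model = lambda x: 1.0 / (1.0 + abs(x))  # hypothetical confidence score
shift = lambda x, e: x + e              # stand-in perturbations
scale = lambda x, e: x * (1 + e)
loss = adversarial_confidence_loss(model, [0.0, 1.0], [shift, scale], eps=0.5)
print(loss)  # negative: perturbations lower the confidence on average
```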
Relations to RICC & improving
The metric has a clear impact on robustness and correctness. To improve the metric, one needs to introduce perturbed images into the training set, or apply alternative training techniques with provable bounds [10].
II-E Scenario-based performance degradation metric
Here we omit details, but for commonly seen performance metrics such as validation accuracy, or even quantitative statistical measures such as MTBF, one may perform a detailed analysis either by considering each scenario separately, or by discounting the value due to missing input scenarios (the discount factor can be taken from the computed scenario coverage metric).
II-F Interpretation precision metric
The interpretation precision metric is intended to judge if a neural network for image classification or object detection makes its decision on the correct part of the image. E.g., the metric can reveal that a certain class of objects is mostly identified by its surroundings, maybe because it only exists in similar surroundings in the training and validation data. In this case, engineers should test whether this class of object can also be detected in different contexts.
Assumption
For computing this metric for image classification (or object detection), we need a validation set that has image segmentation ground truth in addition to the ground truth classes (and bounding boxes), e.g., as in VOC2012 data set [11].
Computing
Here we describe how the metric can be computed for a single detected object; the computation can be extended to a set of images by applying average or min/max operators. A real example demonstrating the exact computation is shown in Fig. 5.

Run the neural network on the image to classify an object, obtaining probability $p$ (and a bounding box in the case of object detection).

Compute an occlusion sensitivity heatmap $H$, where each pixel $H(x, y)$ of the heatmap maps to a position of the occlusion on the image [12]. The value of $H(x, y)$ is given by the probability of the original class for the occluded image. For object detection, we take the maximum probability of the correct class over all detected boxes that have a significant Jaccard similarity with the ground-truth bounding box.

For a given probability threshold $t$ that defines the set of hot pixels as $P_{\mathrm{hot}} = \{(x, y) \mid H(x, y) < t\}$, and the set of pixels that partly occlude the segmentation ground truth, denoted by $P_{\mathrm{gt}}$, the metric is computed as follows:
$M_{\mathrm{ip}} = \dfrac{|P_{\mathrm{hot}} \cap P_{\mathrm{gt}}|}{|P_{\mathrm{hot}}|}$   (5)
An illustrative example of computing $M_{\mathrm{ip}}$ can be found in Fig. 4, where for the human figure only five out of nine hot pixels intersect the region of the human body. Thus $M_{\mathrm{ip}} = 5/9$. The set of thirty pixels that partly occlude the human forms $P_{\mathrm{gt}}$.
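Treating the occlusion heatmap and the segmentation ground truth as small 2D arrays, the three steps above reduce to a set computation; the data below is illustrative:

```python
def interpretation_precision(heatmap, gt_mask, threshold):
    """Fraction of hot pixels (occlusion drops the class probability
    below the threshold) that lie on the segmentation ground truth."""
    hot = {(r, c) for r, row in enumerate(heatmap)
           for c, v in enumerate(row) if v < threshold}
    gt = {(r, c) for r, row in enumerate(gt_mask)
          for c, v in enumerate(row) if v}
    return len(hot & gt) / len(hot)

heatmap = [[0.1, 0.2],    # class probability under occlusion per position
           [0.9, 0.3]]
gt_mask = [[True, True],  # object occupies the top row
           [False, False]]
print(interpretation_precision(heatmap, gt_mask, threshold=0.5))  # 2/3
```

Two of the three occlusion-sensitive positions fall on the object, so the decision is mostly, but not entirely, based on the correct image region.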
Relations to RICC & improving
The interpretation precision metric contributes to the interpretability and correctness of the RICC criteria. It may reveal that a neural network uses a lot of context to detect some objects, e.g., regions surrounding the object or background of the image. In this case, adding images where these objects appear in different surroundings can improve the metric.
II-G Occlusion sensitivity covering metric
This metric measures the fraction of the object that is sensitive to occlusion. Generally speaking, it is undesirable to have a significant probability drop if only a small part of the object is occluded.
Furthermore, care should be taken about the location of the occlusion-sensitive area. If a certain part of an object class is occlusion-sensitive in many cases (e.g., the head of a dog), it should be tested whether the object can still be detected when this part is occluded (e.g., the head of a dog is behind a sign post). $M_{\mathrm{osc}}$ is computed in a similar way and based on the same inputs as $M_{\mathrm{ip}}$:

Perform steps 1) and 2) and determine $H$, $P_{\mathrm{hot}}$, and $P_{\mathrm{gt}}$ as for $M_{\mathrm{ip}}$.

Derive $M_{\mathrm{osc}} = |P_{\mathrm{hot}} \cap P_{\mathrm{gt}}| \,/\, |P_{\mathrm{gt}}|$.
If the value is high, it indicates that many positions of small occlusions can lead to a detection error. A low value indicates that there is a greater chance of still detecting the object when it is partly occluded. An illustrative example of computing $M_{\mathrm{osc}}$ can be found in Fig. 4, where for the human figure the heatmap only contains five hot pixels intersecting the human body (the head). As there are thirty pixels intersecting the region of the human, we have $M_{\mathrm{osc}} = 5/30$.
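The covering metric reuses the same hot-pixel and ground-truth sets as the interpretation precision metric, changing only the denominator; a sketch with the same illustrative data layout:

```python
def occlusion_sensitivity_covering(heatmap, gt_mask, threshold):
    """Fraction of the object's ground-truth pixels whose occlusion
    drops the class probability below the threshold."""
    hot = {(r, c) for r, row in enumerate(heatmap)
           for c, v in enumerate(row) if v < threshold}
    gt = {(r, c) for r, row in enumerate(gt_mask)
          for c, v in enumerate(row) if v}
    return len(hot & gt) / len(gt)

heatmap = [[0.1, 0.2],
           [0.9, 0.3]]
gt_mask = [[True, True],
           [True, False]]  # object covers three pixels, one insensitive
print(occlusion_sensitivity_covering(heatmap, gt_mask, 0.5))  # 2/3
```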
Relations to RICC & improving
Occlusion sensitivity coverage covers the robustness and interpretability of RICC. If the metric values are too high for certain kinds of objects, an approach to improve it is to augment the training set with more images where these objects are only partly visible.
II-H Weighted accuracy/confusion metric
In object classification, not all errors have the same severity; e.g., confusing a pedestrian for a tree is more critical than the opposite. Apart from pure accuracy measures, one may employ fine-grained analysis, such as specifying penalty terms as weights to capture different classification misses.
As such a technique is standard in the performance evaluation of machine learning algorithms, the key question is how the confusion weights are determined. Table I provides a summary of penalties to be applied in traffic scenarios, reflecting the safety aspect. Misclassifying a pedestrian (or bicycle) as background (i.e., no object exists) should be given the highest penalty, as pedestrians are unprotected and such a miss may easily lead to life-threatening situations.
A is classified as B | B (pedestrian) | B (vehicle)    | B (background)
A (pedestrian)       | n.a. (correct) | penalty weight | penalty weight (highest)
A (vehicle)          | penalty weight | n.a. (correct) | penalty weight
A (background)       | penalty weight | penalty weight | n.a. (correct)
Relations to RICC & improving
The metric is a fine-grained indicator of correctness. To improve the metric, one either trains the network with more examples, or modifies the loss function such that it is aligned with the weighted confusion, e.g., by setting a higher penalty term for misclassifying a "pedestrian" as "background".
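A sketch of the weighted confusion computation; the penalty weights below are illustrative assumptions chosen to reflect the ordering argued above (missed pedestrians worst), not values from the paper:

```python
def weighted_confusion(results, penalties):
    """Average penalty over (predicted, actual) class pairs; correct
    predictions contribute zero."""
    total = sum(penalties.get((actual, pred), 0.0)
                for pred, actual in results if pred != actual)
    return total / len(results)

# hypothetical penalties: (actual, predicted) -> weight
penalties = {("pedestrian", "background"): 10.0,  # missed pedestrian: worst
             ("pedestrian", "vehicle"): 5.0,
             ("vehicle", "background"): 3.0,
             ("background", "pedestrian"): 1.0}   # false alarm: mild
results = [("pedestrian", "pedestrian"),  # (predicted, actual)
           ("background", "pedestrian"),  # pedestrian missed entirely
           ("vehicle", "vehicle")]
print(weighted_confusion(results, penalties))  # 10.0 spread over 3 inputs
```

A loss function aligned with such weights, as suggested above, would penalize the pedestrian-to-background cell most strongly during training.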
References
 [1] N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study,” Intelligent data analysis, vol. 6, no. 5, pp. 429–449, 2002.
 [2] J. Lawrence, R. N. Kacker, Y. Lei, D. R. Kuhn, and M. Forbes, "A survey of binary covering arrays," The Electronic Journal of Combinatorics, vol. 18, no. 1, p. 84, 2011.
 [3] C. Nie and H. Leung, “A survey of combinatorial testing,” ACM Computing Surveys (CSUR), vol. 43, no. 2, p. 11, 2011.
 [4] C.-H. Cheng, C.-H. Huang, and H. Yasuoka, "Quantitative projection coverage for testing ML-enabled autonomous systems," arXiv preprint arXiv:1805.04333, 2018.
 [5] "Functional safety beyond ISO 26262 for neural networks in highly automated driving," http://autonomous-driving.org/wp-content/uploads/2018/04/Functional_Safety_beyond_ISO26262_for_Neural_Networks__Exida__Florian_ADM5.pdf, accessed: 2018-06-01.
 [6] K. Pei, Y. Cao, J. Yang, and S. Jana, "Deepxplore: Automated white-box testing of deep learning systems," in SOSP. ACM, 2017, pp. 1–18.
 [7] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
 [8] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
 [9] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: a simple and accurate method to fool deep neural networks," in CVPR, no. EPFL-CONF-218057, 2016.
 [10] J. Z. Kolter and E. Wong, “Provable defenses against adversarial examples via the convex outer adversarial polytope,” arXiv preprint arXiv:1711.00851, 2017.
 [11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results," http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
 [12] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV. Springer, 2014, pp. 818–833.