HarDNN: Feature Map Vulnerability Evaluation in Convolutional Neural Networks
CNNs have seen a recent surge in usage across many application domains ranging from High Performance Computing (HPC) to safety-critical systems such as autonomous vehicles and medical devices. We have also seen a rise in the use of efficient platforms that accelerate CNN executions such as GPUs and domain-specific accelerators such as the one deployed in Tesla’s Full Self-Driving (FSD) System Sze et al. (2017); NVIDIA (); Sean Hollister (2019). As CNNs continue to permeate the fabric of everyday life with increasing utilization in safety-critical applications, it is important that they are resilient to transient hardware errors (also known as soft errors).
Studies have shown that hardware errors could have severe unintended consequences unless the system is designed to detect these errors Yoshida (2013); Safety Research and Strategies, Inc. (2013). For example, following a series of unintended acceleration events by Toyota vehicles, a taskforce following up on a NASA investigation showed that, “as little as a single bit flip … could make a car run out of control.” To mitigate such scenarios, hardware in safety-critical systems must fulfill high integrity requirements, such as those outlined in the ISO-26262 standard International Organization for Standardization (2011).
While processors deployed in safety-critical systems will employ ECC/parity to protect large storage structures (storing weights and intermediate data), the level of protection they offer will likely not be sufficient to meet the stringent requirements set by standards such as ISO-26262. Conventional reliability solutions, such as full duplication through hardware or software, suffer from high overheads in cost, area, power, and/or performance Iturbe et al. (2016); Bartlett & Spainhower (2004); Shye et al. (2009), yet are still commonly used in practice to ensure high resilience. For example, despite the tight power and area constraints of real-time systems, Tesla’s FSD system deploys two fully redundant FSD chips along with accompanying redundant control logic, power, and peripheral packaging on the board for reliability.
With the goal of developing a reliability solution that is much lower cost than full duplication, we seek to understand the underlying vulnerability characteristics of CNNs. Instead of simply approaching a CNN as a single, large computational block, we explore its vulnerability at finer granularities (i.e., neurons, feature maps, and layers). We hypothesize that not all sub-components of a CNN contribute equally to the overall network vulnerability, and develop methodologies to quantify vulnerability at a finer granularity. Our results show that errors in some feature maps or layers are more likely to corrupt the output of a CNN. Furthermore, we recognize that feature maps are robust to translation effects in the input, maintaining higher-level information required by the CNN for inference, while a technique that operates at neuron-level will not have this benefit. Based on this advantage and the fact that we can compose the vulnerability estimates at layer or network level using fmap-level analysis, we focus on feature map-level granularity in this work.
One technique commonly used to quantify an application’s vulnerability to transient errors is error injection experiments. An exhaustive study, where one error simulation is performed for every possible hardware error in the application state to quantify the effect on the output, is often intractable. Instead, resilience analyses typically employ a statistical technique to limit the number of error simulation runs, while preserving the quality of results. We leverage a similar approach to quantify the vulnerability of a CNN’s components when subjected to various transient error models by analyzing the likelihood of a Top-1 misclassification for (classification) models.
In an error injection run, the output of a CNN can be corrupted but the classification might still be correct. To capture the severity of output corruption in each injection run, we propose an alternative metric based on the average change in cross-entropy loss (loss). We find that capturing this fine-grained severity metric (instead of the binary classification result) can produce vulnerability estimates that are comparable to those based on whether the classification changes, but several times faster (e.g., 10x for ResNet50). For even faster estimation, we explore six heuristics that do not perform error injections to estimate vulnerability. We leverage activation values and gradient information during inference and back-propagation. These heuristics provide an additional tradeoff between accuracy and speed for vulnerability assessment.
We also study the tradeoff between resilience improvement and increase in computation for added resilience offered by protecting highly vulnerable feature maps. Results show that a fraction of feature maps typically account for a disproportionately large percentage of output corruptions (on average, 30% of feature maps account for 76% of output corruptions for the studied CNNs). Since each feature map is a convolution of the input based on a given filter, a low-cost mitigation technique can be selective filter (or feature map) duplication.
In summary, this paper presents HarDNN, a highly tunable technique to identify and harden the most vulnerable components of a CNN for hardware-error-resilient inferences. The following are the main contributions.
We compare various granularities for protection to avoid full CNN duplication. We identify that feature maps (fmaps) provide a “sweet spot” for their robustness to translation effects of inputs, and their composability for high-level (e.g., layer-level) protection.
We study the sensitivity of error models on the contribution of an fmap towards the total vulnerability, which we call relative vulnerability. Results show that the relative vulnerabilities of fmaps do not change much with the studied error models.
We introduce loss as an accurate metric to measure vulnerability. Loss captures fine-grained perturbations in inference output and converges to the relative vulnerability estimate per fmap with far fewer injections compared to the baseline classification-based criterion.
We evaluate multiple non-injection based heuristics for vulnerability estimation, and compare their accuracy and speed to loss.
We study the tradeoff between resilience improvement and increase in computation for the resilience offered by protecting highly vulnerable fmaps. Results show that HarDNN improves resilience of SqueezeNet, for example, by 10x with just 30% additional computations.
There are two main approaches for resiliency analysis: statistical error injection and analytical error propagation modeling. Error injection emulates a hardware error by perturbing internal program state, and then executing the program to completion to evaluate the effect of the error Lu et al. (2015); Hari et al. (2017); Chang et al. (2018); Venkatagiri et al. (2016); Mahmoud et al. (2019); Fang et al. (2016); Schirmeier et al. (2015); Kaliorakis et al. (2017); Hari et al. (2012); Li et al. (2007); Sridharan & Kaeli (2009). Because a program can consist of billions of operations and there are a plurality of errors possible for each operation, an error injection campaign can take a large amount of time and resources to completely characterize the resilience of an application.
Analytical error models attempt to reduce the resource intensity of error injection by estimating the vulnerability of different operations through higher-level models that take into account architecture or domain knowledge Feng et al. (2010); Laguna et al. (2016); Li et al. (2018). In this paper, we investigate both of these resiliency analysis approaches for CNNs through injection-based and non-injection-based feature map vulnerability assessment schemes.
2.2 CNN Background
The most fundamental computational unit in a CNN is a neuron (or activation value). A neuron is the result of a dot product between a filter and an equal sized portion of the input. An output feature map (or fmap for short) is a plane of many neurons, and is obtained by convolving a filter over an input fmap. Mathematically, a convolution is comprised of many dot products, where each dot product is composed of many multiply-and-accumulate (MAC) operations.
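As an illustration of this hierarchy (MACs form dot products, dot products form neurons, neurons form an fmap), the following minimal sketch slides one filter over a single-channel input and counts the MACs it performs. The function name `conv2d_single_filter`, the single-channel input, and the stride-1, no-padding setting are assumptions made only for this example; like most deep learning frameworks, it does not flip the filter.

```python
def conv2d_single_filter(image, kernel):
    """Slide one 2-D filter over a 2-D input (stride 1, no padding).

    Returns the output fmap and the number of MAC operations performed.
    Illustrative only: real CNN layers also sum over input channels.
    """
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1

    fmap, macs = [], 0
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0  # one neuron is the result of one dot product
            for i in range(k_h):
                for j in range(k_w):
                    acc += image[r + i][c + j] * kernel[i][j]  # one MAC
                    macs += 1
            row.append(acc)
        fmap.append(row)
    return fmap, macs


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
fmap, macs = conv2d_single_filter(image, kernel)
print(fmap, macs)
```

Here a 3x3 input and a 2x2 filter give a 2x2 fmap of four neurons, each costing four MACs.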
A CNN is hierarchically composed of many convolutional layers, which are themselves formed of many filters. During an inference, filters are convolved with input fmaps to produce output fmaps, where the output fmaps of a layer correspond one-to-one with the filters in that layer. A CNN consists of a series of convolutional layers followed by some fully-connected layers. The final layer in the network is typically a softmax layer, which provides a probability distribution over the possible classifications the network is trained to predict. The class with the highest probability (the Top-1 class) is the chosen prediction of the network during an inference.
3 HarDNN: Design Overview
HarDNN is a software-directed resiliency analysis technique which identifies and selectively hardens vulnerable computations of CNN inferences. HarDNN takes as input a pre-trained model, estimates the relative vulnerability at a target granularity, and hardens the network before deployment using a chosen selective protection method (e.g., low-level duplication) by protecting the most vulnerable components. Effectively, HarDNN transforms the original CNN model into a transient hardware-error-resilient model without any loss of classification accuracy. Figure 1 depicts the high-level overview of HarDNN.
In order to effectively analyze and protect a CNN from errors, there are three fundamental questions that need to be addressed: (1) On what granularity (i.e., neurons, feature maps, layers) should the hardening focus? (2) Which subset of components at the target granularity needs duplication? (3) How should the selective protection be implemented? The rest of this section addresses these questions.
3.1 Target Granularity
As described in Section 2.2, a CNN is composed in a hierarchical manner with neurons, feature maps, and layers building up to a full network. While full network duplication provides high resilience, it can incur 2x runtime or power overheads and halves the throughput offered by the system. Such high overheads are often prohibitive for safety-critical systems that demand high compute throughput and resilience. Moreover, full duplication might provide unnecessary over-protection, as it does not consider the contribution of different subcomponents to the total system reliability. Targeting finer granularities for duplication allows for effectively allocating resources for reliability, rather than indiscriminate redundancy in time or space. However, it is critical that the target granularity also effectively measures vulnerability. For example, although neuron-level resiliency analysis may provide the most fine-grained control for selective hardening, this level of granularity suffers from a fundamental issue that the neurons are not immune to translational effects of the input image. Changes to the image orientation or scaled images can affect the vulnerability contribution of a neuron.
Fmaps, on the other hand, do not suffer from this issue. As long as the CNN is trained to correctly infer images with such variations in the input (as is typical in training highly accurate networks), the same fmap is expected to be important for similar images. Hence, HarDNN focuses on quantifying vulnerability and hardening at the fmap level. A benefit of targeting the fmap-level is that the results can be composed to perform layer- and network-level vulnerability analysis.
3.2 Vulnerability Estimation
Vulnerability of an fmap is defined as the probability of the model’s output corruption given a transient hardware error during an inference. We refer to this quantity as $V_{fmap}$. We can estimate $V_{fmap}$ in two steps using Equation 1. The first step estimates the probability of an error manifestation at the fmap level, given a transient hardware error, and the second step estimates the probability of error propagation of the fmap corruption to the output of the CNN. We refer to these two quantities as origination probability or $OrigP$ and propagation probability or $PropP$, respectively.
$$V_{fmap} = OrigP_{fmap} \times PropP_{fmap} \qquad (1)$$
$OrigP$ depends on the implementation of the convolution and architecture on which it is being run. Assuming that the major storage structures are protected in the target hardware platform, most of the errors originate from the unprotected computations. Given that MAC operations are used to perform a convolution and produce an fmap, we assume that the origination probability is directly proportional to the number of MACs in a convolution, without loss of generality. In this work, we compute $OrigP$ as the ratio of the number of MACs in a convolution to the number of MACs in the entire CNN. Refining this quantity based on the hardware platform and implementation optimizations is part of our future work.
$PropP$ depends on how the low-level hardware error manifests at the fmap-level and how this manifestation propagates through the network to the output. Considering a computation-based error model, we assume that an error in a MAC will corrupt a single neuron’s output in an fmap. We estimate the probability of single neuron corruptions propagating to the CNN’s output using two classes of techniques – injection- and non-injection-based. The first class aims to obtain high accuracy vulnerability estimates. The second class aims to estimate the vulnerability fast with zero error injections. We describe the techniques in detail in Section 4. We also study the sensitivity of using different neuron-level error manifestation models on $PropP$, as described in Section 5.1.
HarDNN aims to estimate the relative vulnerability of each of the fmaps in a CNN (i.e., the contribution of an fmap towards the total CNN vulnerability) to address which fmaps are the most vulnerable and require protection. The total CNN vulnerability, $V_{CNN}$, is the sum of the fmap vulnerabilities (Equation 2). The quotient of Equations 1 and 2 can be used to measure the relative vulnerability of an fmap, as shown in Equation 3.
$$V_{CNN} = \sum_{fmap \in CNN} V_{fmap} \qquad (2)$$
$$\text{Relative Vulnerability}_{fmap} = \frac{V_{fmap}}{V_{CNN}} \qquad (3)$$
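The two-step estimate and its normalization can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `relative_vulnerability` and the example MAC counts and propagation probabilities are hypothetical, and the origination probability is taken to be each fmap's share of the network's MACs, as assumed in the text.

```python
def relative_vulnerability(macs_per_fmap, prop_p_per_fmap):
    """Combine origination and propagation probabilities per fmap.

    macs_per_fmap:   MAC count of the convolution producing each fmap
    prop_p_per_fmap: estimated propagation probability of each fmap
    Returns each fmap's share of the total CNN vulnerability.
    """
    total_macs = sum(macs_per_fmap)
    # Origination probability: fraction of all MACs spent on this fmap.
    orig_p = [m / total_macs for m in macs_per_fmap]
    # Fmap vulnerability: origination times propagation.
    vul = [o * p for o, p in zip(orig_p, prop_p_per_fmap)]
    total_vul = sum(vul)
    if total_vul == 0:
        return [0.0] * len(vul)
    return [v / total_vul for v in vul]


# Hypothetical numbers: a small fmap whose errors propagate often can
# matter as much as a large fmap whose errors are mostly masked.
macs = [100, 300, 600]
prop_p = [0.30, 0.10, 0.05]
print(relative_vulnerability(macs, prop_p))
```

In this example all three fmaps end up with roughly equal relative vulnerability, because the MAC share and the propagation probability offset each other.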
3.3 Selective Protection
Once the relative vulnerabilities of feature maps are gathered, HarDNN employs selective duplication to harden the computations of the most vulnerable feature maps. HarDNN addresses the how of selective protection by assuming that the filters which correspond to the highly vulnerable fmaps can be duplicated. Filter duplication results in two copies of the same logical fmap, where any mismatches between the two copies are used to detect errors during inference and trigger a higher-level system response. The duplicated fmaps need to be dropped before the inference proceeds to the next layer. The comparison of the two duplicate feature maps can be performed lazily to remove it from the critical path, allowing subsequent layer execution to proceed before the output is verified. HarDNN’s highly tunable software-directed selective protection approach allows the designer to control the resiliency versus overhead trade-off based on the varying resiliency requirements of the system.
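A minimal sketch of filter duplication with comparison is shown below. It is an illustration under simplifying assumptions rather than HarDNN's implementation: filters and input patches are flat lists, the comparison is done eagerly instead of lazily, and the `fault` argument is a hypothetical hook that perturbs one neuron of the primary copy to stand in for a transient error.

```python
def dot(filt, patch):
    """One neuron: a dot product between a filter and an input patch."""
    return sum(w * x for w, x in zip(filt, patch))


def run_layer_with_duplication(filters, patches, protected, fault=None):
    """Compute one fmap per filter; recompute protected fmaps and compare.

    protected: set of filter indices chosen for duplication
    fault:     optional (filter_index, neuron_index, delta) applied to the
               primary copy only (hypothetical error-injection hook)
    Returns (fmaps, detected), where detected lists the filter indices
    whose two copies disagree.
    """
    fmaps, detected = [], []
    for f_idx, filt in enumerate(filters):
        primary = [dot(filt, p) for p in patches]
        if fault is not None and fault[0] == f_idx:
            primary[fault[1]] += fault[2]
        if f_idx in protected:
            duplicate = [dot(filt, p) for p in patches]  # redundant copy
            if duplicate != primary:
                detected.append(f_idx)
            # The duplicate is dropped here; only one copy moves on.
        fmaps.append(primary)
    return fmaps, detected


filters = [[1, 0], [0, 1], [1, 1]]
patches = [[1, 2], [3, 4]]
print(run_layer_with_duplication(filters, patches, protected={2}))
print(run_layer_with_duplication(filters, patches, {2}, fault=(2, 0, 5)))
```

An error in a protected fmap is flagged, while the same error in an unprotected fmap passes silently, which is why the choice of protected fmaps matters.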
4 HarDNN Vulnerability Estimation Techniques
Accurately measuring PropP exhaustively would require observing every possible error in every MAC unit of an fmap, and aggregating the observed error propagation outcome for each error per fmap. This is infeasible, due to the intractably large number of possible error sites. Instead, we use statistical error injections to accurately obtain the vulnerability of fmaps (Section 4.1). Although tractable, this may still be slow since statistical significance might require a large number of error injections. Section 4.1.2 introduces loss as an injection-based metric which converges relatively quickly to an estimate of vulnerability. We then study how non-injection based heuristics can approximate injection-based vulnerability in terms of accuracy, but with much less runtime (Section 4.2). Table 1 summarizes all HarDNN techniques explored for estimating fmap vulnerability.
4.1 Injection-Based Vulnerability Quantification
The first two metrics used for assessing vulnerability of an fmap are based on injecting an error into a CNN during inference, and comparing the outcome of the injection execution with the error-free execution outcome. The first metric is a binary observation of whether the injection resulted in an output misclassification (compared to the golden, error free inference); this is referred to as a mismatch. The second metric uses the average delta cross-entropy loss (loss for shorthand) to measure the vulnerability of an error injection.
We quantify PropP of a specific fmap as the fraction of injections that result in a classification mismatch over all error injections performed on the fmap. An injection run is categorized as a mismatch if the network misclassifies the input image, as compared to the reference label of the image provided by the dataset. This is analogous to the Top-1 accuracy metric typically used in machine learning to assess a neural network’s accuracy. Error injections that do not alter the Top-1 classification are considered masked, as their manifestation does not change the expected classification of the network.
We deem the mismatch metric to be the most accurate from a resiliency standpoint for computing an fmap’s vulnerability, since the Top-1 category is typically used to interpret the classification of an input image in an application. Thus, mismatch forms our “oracle” metric, which we use to assess the accuracy of all other metrics’ relative fmap vulnerability.
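The mismatch-based estimate of PropP amounts to a simple ratio, sketched below. The helper names `top1` and `mismatch_prop_p` and the example outputs are hypothetical; the sketch assumes each injection run yields a list of class probabilities.

```python
def top1(probs):
    """Index of the class with the highest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def mismatch_prop_p(injected_outputs, reference_labels):
    """Fraction of injection runs whose Top-1 class differs from the label.

    injected_outputs: one class-probability list per error-injection run
    reference_labels: the dataset label for each corresponding run
    """
    mismatches = sum(
        1 for probs, label in zip(injected_outputs, reference_labels)
        if top1(probs) != label
    )
    return mismatches / len(injected_outputs)


# Four hypothetical injection runs on images labeled class 0.
outputs = [[0.8, 0.2], [0.6, 0.4], [0.4, 0.6], [0.9, 0.1]]
labels = [0, 0, 0, 0]
print(mismatch_prop_p(outputs, labels))
```

Only the third run changes the Top-1 class, so three of the four injections count as masked even though two of them visibly perturb the output.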
Loss: Average Delta Cross Entropy Loss
Although measuring relative vulnerability using mismatches is the gold standard for image classification networks, one primary drawback is that mismatches are relatively rare. Thus, it may take too many injections to obtain fine-grained differences between fmap vulnerabilities. A primary insight in this work is to replace the binary view of error propagation (represented by mismatch) with a continuous view. To that end, we motivate using the average delta cross entropy loss for vulnerability estimation.
Cross entropy loss is typically used to train DNNs during backpropagation to improve the prediction accuracy of the network. More generally, it is used in information theory to measure the entropy between two distributions, the true distribution and the estimated distribution, by penalizing low confidence in predictions as well as wrong predictions. Adapting cross-entropy loss to reliability, we calculate the absolute difference between the cross entropy loss observed during an error-free inference, and the cross entropy loss observed during an error-injected inference. This can be expressed as:
$$\text{loss} = \frac{1}{N} \sum_{i=1}^{N} \left| \mathcal{L}_{i} - \mathcal{L}_{golden} \right| \qquad (4)$$
where $\mathcal{L}$ is the cross-entropy loss for an inference ($\mathcal{L}_{i}$ is the loss of the $i$-th error-injected inference, and $\mathcal{L}_{golden}$ represents the golden loss for an error-free inference) across $N$ total error injections. We use the absolute difference of the loss values to capture the magnitude of the relative loss observed due to an error injection as a measure of the vulnerability. The larger the loss, the more vulnerable the fmap is estimated to be.
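The average absolute change in cross-entropy loss can be sketched in a few lines. The function names and the example probabilities are hypothetical, and the small clamp inside `cross_entropy` is an assumption added only to avoid taking the logarithm of zero.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy loss of one inference for the given true label."""
    return -math.log(max(probs[label], 1e-12))  # clamp avoids log(0)


def avg_delta_loss(golden_probs, injected_probs_list, label):
    """Average absolute difference between injected and golden losses."""
    golden = cross_entropy(golden_probs, label)
    deltas = [abs(cross_entropy(p, label) - golden)
              for p in injected_probs_list]
    return sum(deltas) / len(deltas)


golden = [0.8, 0.2]                              # error-free softmax output
injected = [[0.8, 0.2], [0.6, 0.4], [0.4, 0.6]]  # three injection runs
print(avg_delta_loss(golden, injected, label=0))
```

In this example only the third run would register as a mismatch, yet the second run also contributes to the metric because it lowers the confidence in the correct class. This is the continuous view that allows the metric to separate fmaps with fewer injections.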
|Technique||Error injection||Description|
|Mismatch||Yes||Top-1 misclassification due to error in fmap|
|Loss||Yes||Average delta cross entropy loss of fmap|
|MaxNeuron||No||Max neuron value observed for fmap|
|FmapRange||No||Range of neuron values observed for fmap|
|L2||No||Average L2-norm value of fmap|
|Gradient||No||Average magnitude of gradients for fmap|
|Gain||No||Analytical model of Top-1 class change for variation in fmap Sakr & Shanbhag (2018)|
|Mod-Gain||No||Alternative formulation of Gain analytical model|
4.2 Non-Injection Based Heuristics
Non-error injection based heuristics present alternative methods to estimate vulnerability which can be gathered relatively quickly compared to error injections. These methods rely on information from a set of error-free inferences to estimate the vulnerability of an fmap. Our non-injection based heuristics fall under two general categories: (1) obtaining fmap-level information using observations from the forward pass during an inference, and (2) performing an additional backward pass (a back-propagation) to provide additional information via differentiation for vulnerability estimation. We study a total of 6 non-injection based heuristics.
Max Neuron Value: This simple forward-pass technique assigns an fmap the value of the maximal observed neuron value across many sample inputs. Thus, effectively, it assumes that errors in feature maps where the activation values can be high are more likely to affect the outcome.
Feature Map Range: This technique assigns each fmap the value computed by finding the difference between the largest and smallest activation value across many sample inputs. This ranking scheme takes into consideration that networks will typically be quantized before deployment Sakr et al. (2018, 2017); Sakr & Shanbhag (2018), constraining their dynamic range and, in effect, also reducing the possible observable corruptions in the neurons in the feature maps. Thus, it models the maximal range of error values which may be observed during inference.
Average L2: The L2-norm calculates the distance of the vector coordinate from the origin of the vector space. Specifically, it is calculated as the square root of the sum of the squared vector values. We compute the L2-norm of an fmap (vector) averaged across multiple input samples to assign this value to the fmap as an estimate of relative vulnerability.
Gradient: One of the key components of CNN training is the gradient descent algorithm used to update a network’s weights. During training, a CNN performs back-propagation to adjust weights in order to minimize a loss function (a typical loss function is cross-entropy loss, discussed above). This is done by obtaining the gradient value at each weight, and adjusting the weights incrementally during each training epoch by using the gradient value. We use a similar technique but adapted to neurons (rather than weights). As neurons are differentiable, they too have gradient values which can be used to predict vulnerability. For this technique, we perform a backward pass which only computes the gradients for each neuron and does not modify the network parameters, unlike the backward pass used in the training phase. We compute the gradients with respect to the cross-entropy loss at the output. We use the absolute value of neuron gradients obtained, and average them per fmap across many samples from the dataset.
|Neural Network||Dataset Name||Convolutional Layers||Total Feature Maps||Total Neurons||Average Neurons/Fmap||Floating Point Top-1 Accuracy||INT8 Quantized Top-1 Accuracy|
Gain: Recent work by Sakr et al. (2017) and Sakr & Shanbhag (2018) proposed an analytical model which bounds mismatch probability in the context of network quantization. The analysis is based on estimating how much a noise source, at a set of neurons for instance, affects the accuracy of a network. If a set of neurons is corrupted element-wise by a noise source of variance then:
where is the variance of the noise source, is the label predicted by the noise-free network, are the soft outputs, and is the number of classes.
The set of neurons can be defined in a flexible manner and could denote anything from all of the neurons in the network, or the set of neurons in a given layer, and most relevantly, the set of neurons within an fmap. Thus, we define the noise gain of an fmap as
where, from (5), we have
The expectation in (6) can be obtained by taking averages over the training set or even a subset of it with statistically significant size as discussed in Sakr et al. (2017). Furthermore, computing the noise gain in (6) requires derivatives of outputs with respect to neurons, which are readily available thanks to the automatic differentiation packages utilized by deep learning frameworks implementing the back-propagation algorithm. In addition, (6) assumes a pre-trained network with frozen parameters and need only be computed once for all fmaps.
Thus, a natural mechanism for fmap vulnerability estimation is simply to measure their noise gains. Indeed, if an fmap has a large noise gain , then (7) shows that corrupting with noise will have a large impact on . On the other hand, if is small, then corrupting with noise will have a small impact on . In our results, this ranking procedure is referred to as Gain.
Modified Gain: The analytical results of Gain are derived assuming corruption of neurons by independent noise sources. In our work, we also consider the case where a neuron is replaced by a random scalar belonging to the fmap’s dynamic range as discussed later in Section 5.1. Such a setup violates the independence assumption between neuron and noise. However, we may still leverage the above theory. Indeed, it can be shown that (7) still applies in the context of neuron replacement provided the definition of the noise gain is updated as follows:
This analysis yields a homologous technique to Gain which we refer to as Mod-Gain in our evaluation.
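The three forward-pass heuristics described in this section (Max Neuron Value, Feature Map Range, and Average L2) need only the activation values of an fmap observed over a set of error-free inferences. A minimal sketch follows; the function name and the flattened-list representation of an fmap are assumptions made for illustration.

```python
import math


def forward_pass_heuristics(fmap_samples):
    """Score one fmap from its activations over several sample inputs.

    fmap_samples: one flattened list of neuron values per sample input
    """
    all_values = [v for sample in fmap_samples for v in sample]
    max_neuron = max(all_values)                    # Max Neuron Value
    fmap_range = max_neuron - min(all_values)       # Feature Map Range
    avg_l2 = sum(math.sqrt(sum(v * v for v in sample))
                 for sample in fmap_samples) / len(fmap_samples)
    return {"MaxNeuron": max_neuron, "FmapRange": fmap_range, "L2": avg_l2}


# Two hypothetical sample inputs for an fmap with two neurons.
samples = [[3.0, 4.0], [0.0, -1.0]]
print(forward_pass_heuristics(samples))
```

Each score is computed per fmap; ranking the fmaps by a score and normalizing the scores gives that heuristic's relative vulnerability estimate.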
5 Evaluation Methodology
We evaluate HarDNN on 11 CNNs across three datasets. Table 2 lists each CNN, along with the number of layers, fmaps, and neurons, and accuracy on the respective dataset. We use the PyTorch v1.1 framework Paszke et al. (2017) for evaluation, and obtained pretrained models for CNNs trained on ImageNet Russakovsky et al. (2015) from the PyTorch TorchVision repository PyTorch (2019), and CNNs trained on CIFAR10/100 from GitHub Yang (2017). All experiments were run on an Amazon EC2 p3.2xlarge instance Miller et al. (2010), which has an Intel Xeon E5-2686 v4 server processor, 64GB of memory, and an NVIDIA V100 GPU with 16GB of memory NVIDIA (2018).
|Technique||Analytical runtime model|
|Mismatch||samples * forward * fmaps * inj/fmap|
|Loss||samples * forward * fmaps * inj/fmap|
|MaxValue||samples * forward|
|FmapRange||samples * forward|
|Average L2||samples * forward|
|Gradient||samples * (forward + backward)|
|Gain||samples * (forward + (backward * (classes - 1)))|
|Mod-Gain||samples * (forward + (backward * (classes - 1)))|
|samples:||Total number of images used in the ranking set|
|forward:||Average runtime of a single inference for a CNN|
|backward:||Average runtime of a single back-propagation for a CNN|
|fmaps:||Total number of feature maps in a network|
|inj/fmap:||The number of error injection experiments per feature map|
|classes:||The total number of output classes|
5.1 Error Models
We model neuron-level manifestations of transient hardware errors that occur during an inference. As described in Section 3.2, we assume that a particle strike to a flip-flop used during a MAC operation during a convolution will corrupt a single neuron’s value. Such low-level errors can manifest as single or multiple bit-flips. Additionally, highly optimized inference systems typically employ quantization prior to deploying CNNs. Such models run significantly faster with hardware support for reduced-precision operations, which is prevalent in GPUs and CPUs. These benefits often come with a small but acceptable loss in classification accuracy (reflected in Column 8 of Table 2). Based on these considerations, we evaluate using the following three error models.
FP-Rand represents a random (possibly multi-bit) error during the computation of a neuron, where the computations are performed on floating point values. We model this by choosing a random neuron in an fmap, and substituting the original value with a random value in the range [-max, max], where max is the maximum observed neuron value in the fmap across the training set. FP-Rand limits the effect of an error by bounding it within a range. Previous work found that inference is highly sensitive to errors in the sign and exponent bits and a simple output fmap-level range detector can mitigate many of the most severe corruptions Li et al. (2017).
FxP-Rand considers 8-bit integer (INT8) fixed-point quantization, which quantizes the CNN based on the range of neuron values observed during training. Error injections are performed by choosing a random neuron from an fmap, and substituting the original value with a randomly selected INT8 value. Thus, FxP-Rand is similar to FP-Rand, but for quantized models.
FxP-Flip models the effect of a single bit-flip on a fixed-point quantized neuron, simulating the effect of a particle strike on a flip-flop storing the neuron’s output.
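The three error models can be sketched as small injection routines that corrupt one randomly chosen neuron of a flattened fmap. This is an illustration only: the function names are hypothetical, and the sketch assumes a signed two's-complement INT8 representation, which the text does not specify.

```python
import random


def fp_rand(fmap, max_abs, rng):
    """FP-Rand: replace one random neuron with a value in [-max, max]."""
    out = list(fmap)
    idx = rng.randrange(len(out))
    out[idx] = rng.uniform(-max_abs, max_abs)
    return out, idx


def fxp_rand(fmap_int8, rng):
    """FxP-Rand: replace one random neuron with a random INT8 value."""
    out = list(fmap_int8)
    idx = rng.randrange(len(out))
    out[idx] = rng.randint(-128, 127)  # signed INT8 range (assumption)
    return out, idx


def flip_bit_int8(value, bit):
    """Flip one bit of a signed 8-bit integer (two's complement)."""
    raw = (value & 0xFF) ^ (1 << bit)
    return raw - 256 if raw >= 128 else raw


def fxp_flip(fmap_int8, rng):
    """FxP-Flip: flip one random bit of one random quantized neuron."""
    out = list(fmap_int8)
    idx = rng.randrange(len(out))
    out[idx] = flip_bit_int8(out[idx], rng.randrange(8))
    return out, idx


rng = random.Random(0)  # seeded for repeatable experiments
print(fp_rand([0.5, -1.0, 2.0], max_abs=2.0, rng=rng))
print(fxp_flip([10, -20, 30], rng))
```

Flipping the most significant bit changes the sign and moves the value by 128, while flipping the least significant bit moves it by one, so even a single-bit model spans a wide range of corruption magnitudes.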
5.2 Evaluating Vulnerability Estimation Techniques
For vulnerability analysis, we exclude images that are incorrectly classified by the error-free network since our focus is on analyzing the resilience of the network during correct execution. After removing incorrectly classified images from the dataset, we randomize and partition the correctly classified images into two non-overlapping subsets: an estimation set (ES) and a test set (TS), using an 80-20 split. We use the TS to perform error injections and generate baseline vulnerability estimations using the mismatch- and loss-based metrics. The ES is used on all metrics (including mismatch and loss), to compare the fmap vulnerability estimates to the TS.
To quantitatively compare two metrics, we sort the fmaps from most to least vulnerable by the respective metric. We compute the cumulative vulnerability of the fmaps in this sorted order. This gives us a cumulative vulnerability vs. fmap curve for a given metric, from which we can compute the Manhattan distance between the two curves for each fmap point. We use the average Manhattan distance as a measure of the difference between the metrics, where a distance close to zero indicates very high correlation.
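One way to read this comparison is sketched below. The text leaves the exact ordering convention open, so the sketch makes an explicit assumption: both curves are accumulated in the order given by the reference metric, and each metric's values are first normalized to sum to one. The function names are hypothetical.

```python
def cumulative_curve(values, order):
    """Cumulative share of the total, accumulated in the given fmap order."""
    total = sum(values)
    curve, running = [], 0.0
    for idx in order:
        running += values[idx] / total
        curve.append(running)
    return curve


def avg_manhattan_distance(reference, candidate):
    """Average point-wise gap between two cumulative vulnerability curves.

    Both curves use the reference metric's most-to-least-vulnerable order
    (an assumption of this sketch).
    """
    order = sorted(range(len(reference)),
                   key=lambda i: reference[i], reverse=True)
    ref_curve = cumulative_curve(reference, order)
    cand_curve = cumulative_curve(candidate, order)
    gaps = [abs(a - b) for a, b in zip(ref_curve, cand_curve)]
    return sum(gaps) / len(gaps)


reference = [0.5, 0.3, 0.2]   # e.g. injection-based estimates
candidate = [0.4, 0.4, 0.2]   # e.g. a heuristic's estimates
print(avg_manhattan_distance(reference, candidate))
```

Because each metric is normalized before accumulation, metrics on different scales (for example, gradient magnitudes versus mismatch probabilities) can be compared directly.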
5.3 Error Coverage vs. Computational Overhead
For a given set of feature maps, F, that are duplicated, we define coverage as the cumulative relative vulnerability of those fmaps. From a developer’s point of view, depending on the metric used, this coverage is the cumulative vulnerability estimate given by that metric on the test set. We refer to this as the predicted coverage. The actual coverage on the evaluation set, however, could be different. We experimentally measure this coverage by determining the coverage of F on the test set using the loss metric. We validate the predicted coverage against the actual coverage to see how similar they are.
To target a specific coverage value, we use a greedy algorithm. We sort all fmaps from higher to lower vulnerability (based on the metric being considered) and choose the first several fmaps whose relative vulnerability adds up to the targeted coverage. To assess the relative overhead tradeoff, we measure the total number of MAC operations in those selected fmaps as a fraction of the total MAC operations in all fmaps. We use MACs as a reasonable proxy to the actual overhead, while providing some abstraction for the actual hardware used.
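The greedy selection and its MAC-based overhead estimate can be sketched as follows. The function name and the example values are hypothetical.

```python
def select_fmaps(rel_vul, macs, target_coverage):
    """Greedily pick the most vulnerable fmaps until coverage is reached.

    rel_vul: relative vulnerability per fmap (sums to 1)
    macs:    MAC operations needed to compute each fmap
    Returns (chosen fmap indices, achieved coverage, MAC overhead fraction).
    """
    order = sorted(range(len(rel_vul)),
                   key=lambda i: rel_vul[i], reverse=True)
    chosen, coverage = [], 0.0
    for idx in order:
        if coverage >= target_coverage:
            break
        chosen.append(idx)
        coverage += rel_vul[idx]
    # Duplicating a filter repeats its MACs, so the overhead is the
    # selected fmaps' share of all MACs.
    overhead = sum(macs[i] for i in chosen) / sum(macs)
    return chosen, coverage, overhead


rel_vul = [0.10, 0.50, 0.15, 0.25]
macs = [400, 100, 300, 200]
print(select_fmaps(rel_vul, macs, target_coverage=0.7))
```

In this example, two of the four fmaps reach 75% coverage while adding only 30% more MACs, since the most vulnerable fmaps here happen to be cheap to compute.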
5.4 Heuristic Analysis
One of the primary motives of using heuristics is to estimate vulnerability much faster compared to error injections. We provide analytical models to compare the different expected runtimes of each vulnerability estimation technique in Table 3. The analytical models predict the runtime based on the number of samples explored, runtime of a forward/backward pass, and the number of inj/fmap (for the injection-based techniques). The forward-pass-based heuristics (MaxValue, FmapRange, AverageL2) are expected to perform the fastest, followed by the backprop-based heuristics (gradient, gain, mod-gain), and the slowest are injection-based techniques. In practice, most of these schemes can be parallelized via batching multiple images together on a single device and across multiple devices, which can provide additional speedups (not included in the model).
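The analytical runtime models of Table 3 translate directly into code. In the sketch below the formulas follow the table, while the function name and the example timings are hypothetical and chosen only to show the expected ordering.

```python
def estimated_runtime(technique, samples, forward, backward=0.0,
                      fmaps=0, inj_per_fmap=0, classes=0):
    """Expected runtime of a vulnerability estimation technique (Table 3)."""
    if technique in ("Mismatch", "Loss"):
        return samples * forward * fmaps * inj_per_fmap
    if technique in ("MaxValue", "FmapRange", "Average L2"):
        return samples * forward
    if technique == "Gradient":
        return samples * (forward + backward)
    if technique in ("Gain", "Mod-Gain"):
        return samples * (forward + backward * (classes - 1))
    raise ValueError(f"unknown technique: {technique}")


# Hypothetical setting: 1000 images, 10 ms forward pass, 20 ms backward
# pass, 100 fmaps, 64 injections per fmap, 10 output classes.
params = dict(samples=1000, forward=0.01, backward=0.02,
              fmaps=100, inj_per_fmap=64, classes=10)
for name in ("MaxValue", "Gradient", "Gain", "Loss"):
    print(name, estimated_runtime(name, **params))
```

With these numbers the forward-pass heuristics take about 10 seconds, Gradient about 30 seconds, Gain about 190 seconds, and the injection-based metrics about 64,000 seconds.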
To assess the accuracy of the heuristics, i.e., how well they estimate relative vulnerabilities of fmaps, we compare the cumulative vulnerability estimates of each heuristic on the ES, and measure the Manhattan distance to an error-injection estimation provided by the TS. Section 5.3 provides additional detail on quantitatively comparing the accuracy of a heuristic.
6.1 Error Models Comparison
We begin our evaluation by comparing the relative vulnerabilities of fmaps across different error models. Figure 2 shows the cumulative relative vulnerability of the fmaps in AlexNet-ImageNet, where the x-axis is sorted in descending order of relative vulnerability as measured by the oracle mismatch metric using 12,288 injections per fmap (shorthand: inj/fmap). We use the same ordering on the x-axis (based on the FxP-Flip vulnerability ordering) for comparison, and normalize the relative vulnerabilities of fmaps within their respective error model.
Comparing the relative vulnerabilities of fmaps shows that, regardless of error model, an fmap’s contribution towards the total network vulnerability is nearly the same, with a small average Manhattan distance between the different error models for AlexNet. Further, we find that this occurs despite the fact that the absolute total vulnerability differs across error models. FP-Rand and FxP-Rand show similar absolute total vulnerability, since both error models have the same dynamic range and multi-bit perturbation probabilities. FxP-Flip shows lower absolute total vulnerability, attributed to the less egregious error model of a single-bit perturbation. We find that, for the other networks as well, the relative vulnerabilities of fmaps are nearly the same under FP-Rand and FxP-Flip. Table 4 shows that the distances between the results for the two error models are small across different CNNs, despite FxP-Flip displaying fewer errors overall compared to FP-Rand. Given that the relative vulnerability contribution of fmaps is similar across error models, we focus the remaining analysis in this paper on FxP-Flip.
6.2 Mismatch vs. Loss
We next evaluate the efficacy of using loss-based vulnerability estimation in place of the mismatch metric to measure relative vulnerability. Using AlexNet as a case study, we perform a large injection campaign sweeping the number of injections per fmap from 64 to 12,288, and measure relative fmap vulnerability using both mismatch and loss. We compare the relative vulnerability at each point to the oracle vulnerability, defined as mismatch at 12,288 inj/fmap, and compute the Manhattan distance between the cumulative relative vulnerabilities. Figure 2(a) shows the result for AlexNet across different datasets, and Figure 2(b) shows the result for all ImageNet networks, with loss compared to mismatch at up to 2048 inj/fmap.
Although the mismatch metric may be considered the gold standard for CNN reliability analysis, we find that loss and mismatch both converge as the total number of injections per fmap increases. Larger models, such as AlexNet trained for ImageNet rather than CIFAR, take longer to converge for mismatch and start much further away from the final result (Manhattan distance of 0.12). The reason for this is that the binary nature of the mismatch-based metric (i.e., it must observe a Top-1 misclassification to differentiate between fmaps) means that more injections are required as the space of errors grows. Thus, the more total neurons in the CNN, the larger the space of possible errors, which in turn translates to requiring more error injection experiments if measured by mismatch.
However, loss does not suffer from this phenomenon, as it incorporates small changes in the output resulting from an error even if the Top-1 class does not change. In other words, loss can extract information from Masked errors, and use that to quickly converge to the fmap’s final vulnerability estimate. We find this trend across all networks studied for ImageNet in Figure 2(b), where loss asymptotically approaches its final value quickly (note the log scale on the x-axis). For example, for ResNet50, the relative vulnerability measured by loss at 256 inj/fmap is at a distance of less than .002 from the relative vulnerability measured by loss at 2048 inj/fmap, indicating a potential runtime improvement of the injection campaign by 10× with negligible loss of quality.
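The difference between the two metrics can be illustrated on a single injection outcome. The sketch below is a toy example under stated assumptions: it takes the loss-based metric to be the absolute change in cross-entropy loss between the error-free (golden) and faulty outputs, and the function names and logit values are hypothetical rather than taken from HarDNN.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy loss of one sample, using a numerically stable softmax."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def top1(logits: Sequence[float]) -> int:
    """Index of the highest-scoring class."""
    return max(range(len(logits)), key=lambda i: logits[i])


def mismatch(golden: Sequence[float], faulty: Sequence[float]) -> int:
    """Binary metric: 1 only if the error changes the Top-1 class."""
    return int(top1(golden) != top1(faulty))


def delta_loss(golden: Sequence[float], faulty: Sequence[float], label: int) -> float:
    """Continuous metric (assumed form): absolute change in cross-entropy loss."""
    return abs(cross_entropy(faulty, label) - cross_entropy(golden, label))


if __name__ == "__main__":
    golden = [2.0, 1.0, 0.0]        # error-free logits, true label 0
    masked = [1.6, 1.5, 0.0]        # perturbed output, Top-1 unchanged
    corrupted = [1.0, 2.0, 0.0]     # perturbed output, Top-1 flips to class 1

    for name, faulty in (("masked", masked), ("corrupted", corrupted)):
        print(name, mismatch(golden, faulty), round(delta_loss(golden, faulty, 0), 3))
```

The masked case contributes nothing to a mismatch count but still yields a nonzero loss change, which is why a loss-based estimate can separate fmaps with far fewer injections.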
As the distance computed in Figure 2(b) is with respect to mismatch at 2048 inj/fmap, we find that networks with more neurons/fmap, such as VGG (see column 6 of Table 2), display a larger distance between loss and mismatch. This can be attributed to the baseline mismatch metric not yet having converged at 2048 inj/fmap. From Figure 2(a), we can expect that as the mismatch metric is given many more inj/fmap, the two will close the gap. Thus, loss is in fact a better fine-grained proxy for estimating fmap vulnerability, since it requires fewer inj/fmap to converge while still maintaining high accuracy compared to mismatch at higher inj/fmap.
*Implementation required batch size = 1 for Gain/Mod-Gain
6.3 Error Coverage vs. Computational Overhead
HarDNN provides the developer with a technique to estimate vulnerability and to tune error coverage against computational overhead. Since the developer may not always have the true relative vulnerability of each fmap, they require an accurate tool to make an informed decision regarding the coverage vs. overhead tradeoff. Figure 3(a) depicts this tradeoff, where the x-axis shows the relative vulnerability estimates using loss, plotted against the additional percentage of MAC operations required to obtain the corresponding coverage. The coverage here is measured using the PropP value of loss from the ES for vulnerability estimation, as a prediction of the actual coverage, which is provided by the mismatch metric from the TS.
Based on this view, we find that the computational overhead is always sublinear in coverage, indicating that selective protection is in fact advantageous over full duplication, and can even provide large benefits. For example, covering 90% of errors in SqueezeNet requires less than 30% additional MACs. Even networks that don’t display as large an advantage as SqueezeNet exhibit sublinear overhead with respect to coverage. The reasoning behind this is that the most vulnerable fmaps have a high PropP and/or a high OriginP. For networks which have less uniform fmap size distributions, such as SqueezeNet, MobileNet, and ShuffleNet, we find a knee at the end of the curve which captures large features (OriginP) with low PropP. Other networks which do not have such a large discrepancy between fmaps across layers show relative vulnerability with a less pronounced tradeoff between coverage and overhead.
We compare the predicted coverage by loss to the actual coverage in Figure 3(b) (as described in Section 5.3), and we find that, not only is loss relatively quick at vulnerability estimation (from Section 6.2), it is also representative of the actual vulnerability as measured by mismatches in the TS. Thus, the prediction provided to the developer is very accurate, with loss providing an excellent proxy for error coverage. This goes back to the fact that loss values actually capture the sensitivity of the network, and the mismatch-based metric is a specific by-product of how sensitive an fmap is to errors.
6.4 Heuristic Analysis
Last but not least, we evaluate the accuracy and runtime of all heuristics, measured against the vulnerability estimate of loss as a baseline, since loss experimentally proves to be a superior metric to mismatch. Figure 5 shows the heuristic results for different CNNs, where the x-axis is ordered based on the relative vulnerability estimate for each respective heuristic using the ES, and the baseline for loss is provided by the TS. The relative vulnerability is obtained for each heuristic by extracting the PropP of each fmap from the mismatch-based TS, and the cumulative sum is shown on the y-axis. Table 5 shows the measured runtimes for all the techniques on our evaluation infrastructure.
We find that Loss is the highest performing metric (average distance of .004 from baseline) followed by mismatch (average distance = .006), both from the ES. This validates that the error injection techniques have high accuracy relative to the TS. The non-injection based techniques however vary widely, but we find that the techniques which leverage backprop are generally more accurate, with Gradient performing the best overall (average distance = .067). This result follows from the fact that the magnitude of the gradient helps inform the overall sensitivity of the fmap to a perturbation; a gradient of zero means the fmap is very stable, and won’t change too much, translating to a smaller effect on the output. Larger gradients on the other hand may play a stronger role in the final classification for an image, and as such the backprop-based metrics leverage this information.
One additional insight we find is that, on average, a small number of fmaps (less than a third) account for a large percentage of the network’s relative vulnerability (average of 76% cumulative vulnerability). However, this does not directly map to the overhead associated with the highly vulnerable feature maps (as it takes into account only the number of fmaps, not their size), but it indicates that without incorporating OriginP, the relative vulnerability of fmaps may be biased.
For runtime analysis of the heuristics, we leveraged image batching when available, and still found the runtime trends of the analytical models (Section 5.4) and the measured runtimes to be similar. However, gain and mod-gain underperformed due to an implementation limitation: the backprop algorithm did not support batching with different differentiation values, as required by the gain formulation. Overall, loss provides the highest accuracy and can be sped up by using fewer inj/fmap, while gradient provides an acceptable tradeoff between runtime and accuracy, which can be used for quick profiling by the developer during vulnerability estimation.
7 Related Work
Pruning-Based Techniques: CNN model pruning techniques aim to remove redundant and less-useful parameters from a model to improve execution efficiency Cun et al. (1990). These techniques often reduce accuracy by a small amount. HarDNN focuses on identifying vulnerable feature maps, which it then proceeds to duplicate to improve reliability, with no effect on classification accuracy. There are many similarities between pruning and hardening. (1) Pruning is typically a two-phased process. The first phase identifies a filter to remove, and the second phase (called a fine-tuning phase) removes the filter and retrains the network. (2) Recent work found that pruning full filters (rather than individual weights in a filter) can have minimal effects on accuracy, while improving the pruning speed Li et al. (2016). This is analogous to our fmap target granularity and protection strategy. (3) Pruning techniques rank filters using heuristics to identify candidates to prune Li et al. (2016); Molchanov et al. (2016). We also explore similar heuristics to estimate the vulnerability of fmaps. The objective of a pruning technique is to zero out a filter, removing it from the model. In contrast, for the resiliency analysis, we assume various error models which change a single neuron.
DNN Reliability: Recent work has explored DNN-specific reliability due to the rise of DNN usage in safety-critical applications. Previous methods targeted neuron-level vulnerability Schorn et al. (2018), but more recently have also gravitated toward fmap-level analysis Schorn et al. (2019). HarDNN differs from Schorn et al. in that their focus is on redistributing error across a CNN, whereas HarDNN aims to provide selective protection of the vulnerable fmaps. BinFI Chen et al. (2019) proposes an orthogonal binary search technique to reduce the error injection space for ML reliability, which can be generally used to speed up error injection campaigns. We introduce loss as a different metric for measuring vulnerability, which, as shown, can also speed up injection campaigns.
This paper presents HarDNN, a software-directed technique to identify vulnerable computations in CNNs and selectively protect them. HarDNN operates at feature map granularity, introduces loss as an accurate error-injection-based metric for vulnerability estimation, and explores different heuristics for fast vulnerability assessment. Additionally, we analyze the tradeoff between error coverage and computation overhead for selective protection. Results show that the relative vulnerability of an fmap is similar across the three error models studied, and that the improvement in resilience for the added computation is superlinear with HarDNN. For example, HarDNN can improve SqueezeNet’s resilience by 10× with just 30% computational overhead. For future work, we plan to extend HarDNN to include other applications of neural networks.
This material is based upon work supported in part by the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA. A portion of this work was performed while Abdulrahman Mahmoud interned at NVIDIA.
- Bartlett, W. and Spainhower, L. Commercial Fault Tolerance: A Tale of Two Systems. IEEE Transactions on Dependable and Secure Computing (TDSC), pp. 87–96, January 2004.
- Chang, C.-K., Lym, S., Kelly, N., Sullivan, M. B., and Erez, M. Hamartia: A fast and accurate error injection framework. In Proceedings of the International Conference on Dependable Systems and Networks Workshops (DSN-W), pp. 101–108. IEEE, 2018.
- Chen, Z., Li, G., Pattabiraman, K., and DeBardeleben, N. BinFI: An efficient fault injector for safety-critical machine learning systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19, 2019.
- Cun, Y. L., Denker, J. S., and Solla, S. A. Advances in neural information processing systems 2. chapter Optimal Brain Damage, pp. 598–605. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990. ISBN 1-55860-100-7. URL http://dl.acm.org/citation.cfm?id=109230.109298.
- Fang, B., Lu, Q., Pattabiraman, K., Ripeanu, M., and Gurumurthi, S. ePVF: An Enhanced Program Vulnerability Factor Methodology for Cross-Layer Resilience Analysis. In 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 168–179, June 2016.
- Feng, S., Gupta, S., Ansari, A., and Mahlke, S. Shoestring: Probabilistic soft error reliability on the cheap. In Proceedings of the International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 385–396, 2010.
- Hari, S. K. S., Adve, S. V., Naeimi, H., and Ramachandran, P. Relyzer: Exploiting Application-Level Fault Equivalence to Analyze Application Resiliency to Transient Faults. In Proc. of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.
- Hari, S. K. S., Tsai, T., Stephenson, M., Keckler, S. W., and Emer, J. Sassifi: An architecture-level fault injection tool for gpu application resilience evaluation. In 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 249–258. IEEE, 2017.
- International Organization for Standardization. Road vehicles – Functional safety. Website, 2011. https://www.iso.org/standard/43464.html.
- Iturbe, X., Venu, B., Ozer, E., and Das, S. A Triple Core Lock-Step (TCLS) ARM® Cortex®-R5 Processor for Safety-Critical and Ultra-Reliable Applications. In 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W), pp. 246–249, June 2016. doi: 10.1109/DSN-W.2016.57.
- Kaliorakis, M., Gizopoulos, D., Canal, R., and Gonzalez, A. MeRLiN: Exploiting Dynamic Instruction Behavior for Fast and Accurate Microarchitecture Level Reliability Assessment. In Proc. of International Symposium on Computer Architecture (ISCA), 2017.
- Laguna, I., Schulz, M., Richards, D. F., Calhoun, J., and Olson, L. Ipas: Intelligent protection against silent output corruption in scientific applications. In Proceedings of the International Symposium on Code Generation and Optimization (CGO), pp. 227–238. IEEE, 2016.
- Li, G., Hari, S. K. S., Sullivan, M., Tsai, T., Pattabiraman, K., Emer, J., and Keckler, S. W. Understanding error propagation in deep learning neural network (dnn) accelerators and applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’17, pp. 8:1–8:12, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-5114-0. doi: 10.1145/3126908.3126964. URL http://doi.acm.org/10.1145/3126908.3126964.
- Li, G., Pattabiraman, K., Hari, S. K. S., Sullivan, M., and Tsai, T. Modeling soft-error propagation in programs. In Proceedings of the International Conference on Dependable Systems and Networks (DSN), 2018.
- Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016. URL http://arxiv.org/abs/1608.08710.
- Li, X., Adve, S., Bose, P., and Rivers, J. Online Estimation of Architectural Vulnerability Factor for Soft Errors. In Proc. of International Symposium on Computer Architecture (ISCA), pp. 341–352, 2007.
- Lu, Q., Farahani, M., Wei, J., Thomas, A., and Pattabiraman, K. Llfi: An intermediate code-level fault injection tool for hardware faults. In 2015 IEEE International Conference on Software Quality, Reliability and Security, pp. 11–16. IEEE, 2015.
- Mahmoud, A., Venkatagiri, R., Ahmed, K., Misailovic, S., Marinov, D., Fletcher, C. W., and Adve, S. V. Minotaur: Adapting software testing techniques for hardware errors. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’19, pp. 1087–1103, New York, NY, USA, 2019. ACM. ISBN 978-1-4503-6240-5. doi: 10.1145/3297858.3304050. URL http://doi.acm.org/10.1145/3297858.3304050.
- Miller, F. P., Vandome, A. F., and McBrewster, J. Amazon Web Services. Alpha Press, 2010. ISBN 6131788367, 9786131788369.
- Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning convolutional neural networks for resource efficient transfer learning. CoRR, abs/1611.06440, 2016. URL http://arxiv.org/abs/1611.06440.
- NVIDIA. Self-Driving Car Hardware — NVIDIA DRIVE. https://www.nvidia.com/en-us/self-driving-cars/drive-platform/hardware/.
- NVIDIA. Nvidia tesla v100 gpu accelerator. Website, 2018. URL https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet-letter-fnl-web.pdf.
- Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NIPS-W, 2017.
- PyTorch. Pytorch classification models. Website, 2019. URL https://pytorch.org/docs/stable/torchvision/models.html.
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- Safety Research and Strategies, Inc. Toyota unintended acceleration and the big bowl of ’spaghetti’ code. Website, 2013. URL http://www.safetyresearch.net/blog/articles/toyota-unintended-acceleration-and-big-bowl-%E2%80%9Cspaghetti%E2%80%9D-code.
- Sakr, C. and Shanbhag, N. R. An analytical method to determine minimum per-layer precision of deep neural networks. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1090–1094, 2018.
- Sakr, C., Kim, Y., and Shanbhag, N. Analytical guarantees on numerical precision of deep neural networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3007–3016, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/sakr17a.html.
- Sakr, C., Choi, J., Wang, Z., Gopalakrishnan, K., and Shanbhag, N. True gradient-based training of deep binary activated neural networks via continuous binarization. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2346–2350, 04 2018. doi: 10.1109/ICASSP.2018.8461456.
- Schirmeier, H., Hoffmann, M., Dietrich, C., Lenz, M., Lohmann, D., and Spinczyk, O. FAIL*: An Open and Versatile Fault-Injection Framework for the Assessment of Software-Implemented Hardware Fault Tolerance. In European Dependable Computing Conference (EDCC), pp. 245–255, 2015.
- Schorn, C., Guntoro, A., and Ascheid, G. Accurate neuron resilience prediction for a flexible reliability management in neural network accelerators. In 2018 Design, Automation Test in Europe Conference Exhibition (DATE), pp. 979–984, March 2018. doi: 10.23919/DATE.2018.8342151.
- Schorn, C., Guntoro, A., and Ascheid, G. An efficient bit-flip resilience optimization method for deep neural networks. In Design, Automation & Test in Europe Conference & Exhibition, DATE 2019, Florence, Italy, March 25-29, 2019, pp. 1507–1512, 2019. doi: 10.23919/DATE.2019.8714885. URL https://doi.org/10.23919/DATE.2019.8714885.
- Sean Hollister. Tesla’s new self-driving chip is here, and this is your best look yet. 2019. URL https://www.theverge.com/2019/4/22/18511594/tesla-new-self-driving-chip-is-here-and-this-is-your-best-look-yet.
- Shye, A., Blomstedt, J., Moseley, T., Reddi, V. J., and Connors, D. A. PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures. IEEE Transactions on Dependable and Secure Computing (TDSC), 6(2):135–148, April 2009. ISSN 1545-5971.
- Sridharan, V. and Kaeli, D. R. Eliminating Microarchitectural Dependency from Architectural Vulnerability. In Proc. of International Symposium on High Performance Computer Architecture (HPCA), 2009.
- Sze, V., Chen, Y., Yang, T., and Emer, J. S. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, Dec 2017. ISSN 0018-9219. doi: 10.1109/JPROC.2017.2761740.
- Venkatagiri, R., Mahmoud, A., Hari, S. K. S., and Adve, S. V. Approxilyzer: Towards a Systematic Framework for Instruction-level Approximate Computing and its Application to Hardware Resiliency. In Proc. of International Symposium on Microarchitecture (MICRO), pp. 1–14, 2016.
- Yang, W. Pytorch-classification. Website, 2017. URL https://github.com/bearpaw/pytorch-classification.
- Yoshida, J. Toyota case: Single bit flip that killed. Website, 2013. URL https://www.eetimes.com/document.asp?doc_id=1319903#.