Selective Brain Damage: Measuring the Disparate Impact of Model Pruning
Neural network pruning techniques have demonstrated that it is possible to remove the majority of weights in a network with surprisingly little degradation to test set accuracy. However, this measure of performance conceals significant differences in how different classes and images are impacted by pruning. We find that certain examples and classes are systematically more impacted by the introduction of sparsity; we term the most impacted images pruning identified exemplars (PIEs). Removing PIE images from the test set greatly improves top-1 accuracy for both pruned and non-pruned models. These hard-to-generalize-to images tend to be mislabelled, of lower image quality, depict multiple objects or require fine-grained classification. These findings shed light on previously unknown trade-offs, and suggest that a high degree of caution should be exercised before pruning is used in sensitive domains.
Between infancy and adulthood, the number of synapses in our brain first multiplies and then falls. Synaptic pruning improves efficiency by removing redundant neurons and strengthening the synaptic connections that are most useful for the environment. Despite shedding a large fraction of all synapses between age two and ten, the brain continues to function [42, 62]. The phrase “use it or lose it” is frequently used to describe the environmental influence of the learning process on synaptic pruning; however, there is little scientific consensus on what exactly is lost.
†† Code associated with this paper is available at: https://bit.ly/2C8GriD.
In this work, we ask what is lost when we prune a deep neural network. In 1990, a popular paper was published titled “Optimal Brain Damage”. The paper was among the first [27, 57, 68] to propose that deep neural networks could be pruned of “excess capacity” in a similar fashion to synaptic pruning. At face value, pruning appears to promise you can (almost) have it all: deep neural networks are remarkably tolerant of high levels of pruning, with an almost negligible loss to top-1 accuracy [24, 66, 50, 51, 7, 47]. For example, Gale et al. show that removing the large majority of all weights in a ResNet-50 network trained on ImageNet results in only a small absolute decrease in top-1 test set accuracy. These more compact networks are frequently favored in resource-constrained settings; pruned models require less memory and energy consumption and have lower inference latency [59, 6, 65, 39, 67].
The ability to prune networks with seemingly so little degradation to generalization performance is puzzling. The cost to top-1 accuracy appears minimal if it is spread uniformly across all classes, but what if the cost is concentrated in only a few classes? Are certain types of examples or classes disproportionately impacted by pruning? An understanding of these trade-offs is critical for sensitive tasks such as hiring [15, 25], health care diagnostics [71, 19] and self-driving cars, where the introduction of pruning may be at odds with fairness objectives to treat protected attributes uniformly and/or the need to guarantee a certain level of recall for certain classes. Pruning is already commonly used in these domains, often driven by the resource constraints of deploying models to mobile phones or embedded devices [17, 60].
[Figure: example images where pruned and non-pruned model predictions diverge, with true labels.]
| true | cloak | ladle | espresso | mashed potato |
| baseline model | gasmask | ladle | espresso | mashed potato |
| pruned model | breastplate | perfume | red wine | ice cream |
| true | stretcher | sewing machine | bathtub | crutch |
| baseline model | folding chair | sewing machine | bathtub | crutch |
| pruned model | barrow | polaroid camera | cucumber | apron |
| true | butternut squash | petri dish | parallel bars | stretcher |
| baseline model | cucumber | espresso | parallel bars | plunger |
| pruned model | cabbage head | petri dish | pool table | broom |
In this work we propose a formal methodology to evaluate the impact of pruning on a class and exemplar level (Sections 2.2 and 2.3). The measures we propose identify classes and images where there is a high level of disagreement, or difference in generalization performance, between pruned and non-pruned models. Our results are surprising and suggest that a reliance on top-line metrics such as top-1 or top-5 test-set accuracy hides critical details in the ways that pruning impacts model generalization. The primary findings of our work can be summarized as follows:
1. Pruning in deep neural networks is better described as “selective brain damage.” Pruning has a non-uniform impact across classes; a fraction of classes are disproportionately and systematically impacted by the introduction of sparsity.
2. The examples most impacted by pruning, which we term Pruning Identified Exemplars (PIEs), are more challenging for both pruned and non-pruned models to classify.
3. We conduct a small-scale human study and find that PIEs tend to overindex on images with an incorrect ground truth label, images that involve fine-grained classification tasks, or images that depict multiple objects.
4. Pruning significantly reduces robustness to image corruptions and adversarial attacks.
For (1) and (2), we establish consistent findings for different standard architectures on CIFAR-10  and ImageNet. Toward finding (4), we measure changes to model sensitivity to both common image corruptions and natural adversarial examples using two open source robustness benchmarks: ImageNet-C  and ImageNet-A .
The over-indexing of poorly structured data (multi-object or incorrectly labelled data) in PIE hints that the explosive growth in the number of parameters in deep neural networks may be solving a problem that is better addressed in the data cleaning pipeline. More broadly, our findings provide important insights about when pruned models are qualified to make decisions on real-world inputs. Our PIE methodology identifies a tractable subset of images which are more challenging for pruned and non-pruned models. PIE could be used to surface atypical examples for further human inspection, to abstain from classifying certain examples when the model is uncertain [2, 10, 11, 9], or to aid interpretability as a case-based reasoning tool to explain model behavior [41, 23, 4, 33].
2 Methodology and Experiment Framework
We consider a supervised classification problem where a deep neural network is trained to approximate the function $f$ that maps an input variable $x$ to an output variable $y$, formally $f : x \mapsto y$. The model is trained on a training set of $N$ images, and at test time makes a prediction $y^*_i$ for each image $i$ in the test set. The true labels are each assumed to be one of $C$ categories or classes, such that $y \in \{1, \dots, C\}$.
A reasonable response to our desire for more compact representations is to simply train a network with fewer weights. However, as of yet, starting out with a compact dense model has not yielded competitive test-set performance. Instead, current research centers on training strategies where models are initialized with “excess capacity” that is subsequently removed through pruning. A pruning method identifies the subset of weights to remove (i.e., set to zero). A pruned model function, $f_t$, is one where a fraction $t$ of all model weights are set to zero. Setting a weight to zero effectively removes its contribution, as multiplication with the input no longer contributes to the activation. A non-pruned model function, $f_0$, is one where all weights are trainable ($t = 0$). At times, we interchangeably refer to $f_t$ and $f_0$ as sparse and non-sparse model functions (where the level of sparsity is indicated by $t$).
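To make the distinction concrete, the following sketch (our own illustration in NumPy, not the paper's training code) shows how a binary mask zeroes out a fraction $t$ of a layer's weights so they no longer contribute to the activation:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))   # dense layer weights (non-pruned model, t = 0)
x = rng.normal(size=4)        # input activation

# A pruned model keeps the same architecture but zeroes out a fraction t of
# weights via a binary mask; masked weights no longer contribute to W @ x.
t = 0.5
mask = np.ones(W.size)
mask[: int(t * W.size)] = 0.0               # zero out a fraction t of positions
mask = rng.permutation(mask).reshape(W.shape)

W_pruned = W * mask
sparsity = 1.0 - np.count_nonzero(W_pruned) / W_pruned.size
print(sparsity)  # -> 0.5 (half of all weights are exactly zero)
```

In practice the mask is applied over the course of training, so the surviving weights can adapt to the removed capacity.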
2.2 Class Level Measure of Impact
Comparing only top-1 model accuracy between a baseline and a pruned model amounts to assuming that each class's accuracy maintains its relative relationship to the top-1 model accuracy before and after pruning. In this work, we consider whether this is a valid assumption. Is relative performance unaltered by pruning or are some classes impacted more than others?
For a given model, we compute the class accuracy $acc_c^t$ for class $c$ at sparsity $t$. We compute overall model accuracy from the set of class metrics:

$$acc^t = \sum_{c=1}^{C} \frac{n_c}{N} \, acc_c^t$$

where $n_c$ is the number of examples in class $c$ and $N$ is the total number of examples in the data set. If the impact of pruning were uniform, we would expect each class accuracy to shift by the same number of percentage points as the difference in top-1 accuracy between the pruned and non-pruned model. This forms our null hypothesis ($H_0$) – the shift in accuracy for class $c$ before and after pruning is the same as the shift in top-1 accuracy. For each class we consider whether to reject $H_0$ and accept the alternate hypothesis ($H_1$) that pruning disparately affects the class's accuracy in either a positive or negative direction:

$$H_0 : acc_c^t - acc^t = acc_c^0 - acc^0, \qquad H_1 : acc_c^t - acc^t \neq acc_c^0 - acc^0$$
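The example-weighted average of per-class accuracies can be sketched as follows, with hypothetical class counts and accuracies:

```python
# Overall accuracy as the example-weighted average of per-class accuracies,
# acc = sum_c (n_c / N) * acc_c  (hypothetical class counts and accuracies).
class_counts = [50, 30, 20]          # n_c: examples per class
class_accs   = [0.90, 0.80, 0.60]    # acc_c: per-class (recall) accuracy

N = sum(class_counts)
overall = sum(n * a for n, a in zip(class_counts, class_accs)) / N
print(overall)  # ~0.81
```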
Evaluating whether the difference between samples of mean-shifted class accuracies from pruned and non-pruned models is “real” amounts to determining whether two data samples are drawn from the same underlying distribution, which is the subject of a large body of goodness-of-fit literature [14, 1, 36]. Neural network training is most often independent and non-deterministic, and we consider each model in its population of models to be a sample of some underlying distribution. Given a class $c$ and a population of models trained at sparsity $t$, we construct a set of samples of the mean-shifted class accuracy as $S_c^t = \{acc_{c,i}^t - acc_i^t\}_i$, where $i$ indexes the models in the population. In this work, we use a two-sample, two-tailed, independent Welch's t-test to determine whether the means of the samples $S_c^t$ and $S_c^0$ differ significantly. If the two samples were drawn from distributions with different means with 95% or greater probability ($p$-value $\leq 0.05$), then we reject the null hypothesis and consider the class to be disparately affected by $t$-sparsity pruning relative to the baseline.
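A minimal sketch of the per-class test, using made-up mean-shifted accuracy samples for two populations of 30 models (in practice scipy.stats.ttest_ind(a, b, equal_var=False) computes the same statistic; this standard-library version is for illustration only):

```python
import statistics

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (ma - mb) / se2 ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical mean-shifted class accuracies (acc_c - acc) for 30 pruned
# and 30 non-pruned models of one class.
pruned     = [-0.06 + 0.001 * i for i in range(30)]
non_pruned = [-0.01 + 0.001 * i for i in range(30)]

t, df = welch_t(pruned, non_pruned)
# For df near 58, |t| > ~2.0 corresponds to p < 0.05 (two-tailed), so here we
# would reject H0 and flag the class as disparately impacted.
print(t, df)
```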
After finding the subset of classes for a given sparsity $t$ that show a statistically significant change relative to the baseline, we can quantify the degree of deviation, which we refer to as the normalized recall difference, by comparing the average $t$-pruned and baseline class accuracies after normalizing for their respective average model accuracies:

$$\Delta_c^t = (acc_c^t - acc^t) - (acc_c^0 - acc^0)$$
2.3 Image Level Measure of Impact
How does pruning impact model performance on individual images? A natural extension of the hypothesis testing in the prior section is to consider whether to reject or retain the null hypothesis that the output probabilities of dense and pruned models for a given image are equal. However, recent work has highlighted that deep neural networks produce output probabilities that are uncalibrated [20, 40, 45] and thus cannot be interpreted as a measure of certainty. Deep neural networks do not know what they do not know, and often ascribe high probabilities to out-of-distribution data points or are overly sensitive to adversarially perturbed inputs [30, 55].
Table 1: classes with a statistically significant change in class accuracy after pruning, with the largest normalized (norm) and absolute (abs) recall differences.
| Sparsity | Top-1 Δ | Significant # incr. | Significant # decr. | Largest increase | norm | abs | Largest decrease | norm | abs |
| 0.5 | -0.8 | 91 | 54 | petri dish | 3.41 | 2.6 | frying pan | -4.66 | -5.46 |
| 0.7 | -1.7 | 189 | 128 | cd player | 4.99 | 3.33 | tow truck | -6.94 | -8.6 |
We are interested in how model predictive behavior changes through the pruning process. Given the limitations of uncalibrated probabilities in deep neural networks, we focus on the level of disagreement between the predictions of pruned and non-pruned networks on a given image. Let $y^*_{i,t}$ be the prediction of the $i$th $t$-pruned model of its population for image $x$, where $t = 0$ denotes a non-pruned model, and let $Y^*_t = \{y^*_{i,t}\}_i$ be the set of predictions of the $t$-pruned model population on exemplar $x$. For the set $Y^*_t$ we find the modal label, i.e. the class predicted most frequently by the $t$-pruned model population for exemplar $x$, which we denote $y^M_t$. Exemplar $x$ is classified as a pruning identified exemplar if and only if the modal label differs between the set of $t$-pruned models and the non-pruned models:

$$\mathrm{PIE}_t(x) = \begin{cases} 1 & \text{if } y^M_t \neq y^M_0 \\ 0 & \text{otherwise} \end{cases}$$
We note that there is no constraint that the non-pruned predictions for PIEs match the true label, thus the detection of PIEs is an unsupervised protocol that could in principle be performed at test time.
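The PIE criterion can be sketched directly from the definition (hypothetical label predictions; the modal-label helper is our own illustration):

```python
from collections import Counter

def modal_label(predictions):
    """Most frequently predicted class across a population of models."""
    return Counter(predictions).most_common(1)[0][0]

def is_pie(pruned_preds, non_pruned_preds):
    """Exemplar is a PIE iff the modal labels of the two populations differ."""
    return modal_label(pruned_preds) != modal_label(non_pruned_preds)

# Hypothetical predictions for one image from populations of 5 models each.
non_pruned = ["ladle", "ladle", "ladle", "spatula", "ladle"]
pruned     = ["perfume", "perfume", "ladle", "perfume", "perfume"]
print(is_pie(pruned, non_pruned))  # -> True
```

Because the criterion compares only model predictions, no ground truth label is needed, matching the unsupervised nature of the protocol described above.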
3 Experiment Setup and Results
3.1 Experiment Setup
We consider two classification tasks and models: a wide ResNet model trained on CIFAR-10 and a ResNet-50 model trained on ImageNet. Both networks are trained with batch normalization. A key goal of our analysis is to produce findings that are not anecdotal, as would be the case when analyzing a single trained model of each type. Instead, we independently train a population of 30 models for each experimental setting, using a fixed number of training steps and a fixed batch size for each dataset. We prune over the course of training to obtain a target end sparsity level $t$; for example, $t = 0.9$ indicates that 90% of model weights are removed by pruning, leaving a maximum of 10% of weights non-zero. Figure 2 shows the distributions of model accuracy across the model populations for the non-pruned and pruned models on ImageNet and CIFAR-10.
Across all experiments, we use magnitude pruning as proposed by Zhu and Gupta to identify the weights to remove. Magnitude pruning is a simple rule-based method that sets to zero those weights whose absolute magnitude falls below a threshold. It has been shown to outperform more sophisticated Bayesian pruning methods and is considered state-of-the-art across both computer vision and language models. The choice of magnitude pruning also allowed us to specify and precisely vary the final model sparsity for the purposes of our analysis, unlike regularizer approaches that allow the optimization process itself to determine the final level of sparsity [50, 51, 7, 70, 68, 57]. Although the ability to precisely vary sparsity is required for this experimental framework, we note that our methodology can be extended to other methods. In order to encourage replication of our results using additional pruning methods, we have open sourced our code for all experiments.
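A minimal sketch of the thresholding step at the core of magnitude pruning (our own one-shot illustration; the method of Zhu and Gupta raises sparsity gradually over training according to a schedule):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest magnitude.

    Note: ties at the threshold may remove slightly more weights than the
    exact target fraction.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)            # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

W = np.array([[0.05, -1.2, 0.3], [-0.01, 0.8, -0.4]])
W_pruned = magnitude_prune(W, sparsity=0.5)
print(W_pruned)  # the three smallest-magnitude weights are set to zero
```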
3.2 Impact of Sparsity on Class Level Performance
We now return to our initial question about class level impact – is relative performance unaltered by pruning or are some classes impacted more than others? We compute the normalized recall class difference (introduced in 2.2) for each class in ImageNet. We find that the impact of magnitude pruning on ImageNet classification is disparate across classes and amplified as sparsity increases. For example, at a low level of sparsity only 51 of 1,000 classes in the ImageNet test set exhibit a statistically significant change in class accuracy; at a high level of sparsity, however, accuracy is impacted in a statistically significant way for 582 classes.
The directionality and magnitude of the impact is nuanced and surprising. Our results show that certain classes are relatively robust to the overall degradation experienced by the model whereas others degrade in performance far more than the model itself. This amounts to “selective brain damage,” with performance on certain classes evidencing far more sensitivity to the removal of model capacity. Table 1 shows that more classes show a significant relative increase in accuracy than a decrease at every level, even though the overall model accuracy decreases at every pruning level, indicating that the magnitude of class decreases must be larger in order to pull the model accuracy lower. The model appears to cannibalize performance on a small subset of classes in order to preserve overall performance (and even improve relative performance on a small number of classes). Figure 4 visualizes the magnitude of the normalized recall differences at two levels of pruning and highlights the degree to which the classes spread from the model average.
We performed the same analysis on the CIFAR-10 models and found that while pruning has a non-uniform impact, fewer classes are affected in a statistically significant way: at each sparsity level we evaluate, only one or two classes out of ten were significantly impacted. We suspect that we found less disparate impact for CIFAR-10 because, while the pruned model has less capacity, the number of remaining weights is still sufficient to model the limited number of classes in this lower-dimensional dataset.
3.3 Impact of Sparsity on Individual Exemplars
We now turn to the impact of pruning at an exemplar level. Using the PIE methodology introduced in 2.3, we identify a subset of PIE images at every level of sparsity (for both CIFAR-10 and ImageNet); in each case, PIEs constitute a small fraction of the test set. Why does pruning introduce a high level of prediction disagreement for some images but not others? We now turn our attention to understanding what makes PIEs different from non-PIEs.
PIEs are more difficult for both pruned and non-pruned models to classify. In Fig. 5, we compare the test-set performance of a fully parameterized non-pruned model on a fixed number of randomly selected (1) PIE images, (2) non-PIE images and (3) a random sample of the test set. The results are consistent across both CIFAR-10 and ImageNet datasets; removing PIE images from the test set improves top-1 accuracy for both pruned and non-pruned models relative to a random sample. Inference restricted to only PIE images significantly degrades top-1 accuracy. In the appendix, we include additional plots that show that while all models perform far worse on PIE images, the degradation to performance is amplified as model sparsity increases.
The most challenging PIEs are identified at low levels of sparsity. In Figure 5, the lowest test-set accuracy for both pruned and non-pruned models occurs when inference is restricted to PIEs identified at the lowest level of sparsity. Test-set accuracy steadily increases for PIEs identified at higher levels of sparsity. This suggests that the introduction of sparsity first erodes performance on the images that the model finds most challenging.
Why are PIEs harder to classify? A qualitative inspection of PIEs (Figure 7) suggests that these hard-to-generalize-to images tend to be of lower image quality, mislabelled, entail abstract representations, require fine-grained classification or depict atypical class examples. We conducted a limited human study (involving volunteers who work at an industry research lab) to label a random sample of PIE and non-PIE ImageNet images. We broadly group the properties we codify as indicative of 1) the exemplar being challenging or 2) the task being ill-specified. We introduce these groupings below (after each bucket we report the percentage of PIEs and non-PIEs in each category, as a fraction of total PIEs and non-PIEs codified):
Poorly specified task
ground truth label incorrect or inadequate – images where there is not sufficient information for a human to arrive at the correct ground truth label. For example in Fig. 7, the image of the plate of food with the label restaurant is cropped such that it is impossible to tell whether the food is in a restaurant or in a different setting. [ of non-PIEs, of PIEs]
multiple-object image – images depicting multiple objects where a human may consider several labels to be appropriate (e.g., an image which depicts both a paddle and canoe, desktop computer consisting of a screen, mouse and monitor, a barber chair in a barber shop). [ of non-PIE, of PIEs]
fine grained classification – involves classifying an object that is semantically close to various other class categories present in the data set (e.g., rock crab and fiddler crab, bassinet and cradle, cuirass and breastplate). [ of non-PIEs, of PIEs]
image corruptions – images exhibiting common corruptions such as motion blur, contrast and pixelation. We also include in this category images with super-imposed text, an artificial frame, and images that are black and white rather than the typical RGB color images in ImageNet. [ of non-PIE, of PIE]
abstract representations – the surfaced exemplar depicts a class object in an abstract form such as a cartoon, painting, or sculpted incarnation of the object. [ of non-PIE, of PIE]
We find that the number of image corruptions and abstract representations surfaced by PIE appears similar to their overall representation in the ImageNet dataset. However, we find that PIEs appear to heavily overindex relative to non-PIEs on certain properties, such as having an incorrect ground truth label, involving a fine-grained classification task or depicting multiple objects. This suggests that the task itself is often incorrectly specified. Both ImageNet and CIFAR-10 are single-label image classification tasks; however, a far larger fraction of PIEs than non-PIEs were codified by humans as multi-object images where multiple labels could be considered reasonable. The over-indexing of incorrectly structured data in PIE hints that the explosive growth in the number of parameters in deep neural networks may be solving a problem better addressed in the data cleaning pipeline.
3.4 The Role of Additional Capacity
The PIE procedure surfaces exemplars that are harder for both pruned and non-pruned models to classify. Given that PIE surfaces data points where there is the greatest divergence in behavior between pruned and non-pruned models, it is useful to understand the directionality of some of the properties described in the previous section. For example, many PIEs are often atypical or unusual class examples. We have already noted that model degradation when restricted to inference on PIEs is amplified as sparsity increases. Does this measure of model brittleness mirror other open source robustness benchmarks?
ImageNet-C. ImageNet-C is an open source data set that consists of algorithmically generated corruptions (blur, noise) applied to the ImageNet test set. We compare top-1 accuracy given inputs with corruptions of different severity. Following the methodology of Hendrycks and Dietterich, we compute the corruption error for each type of corruption by measuring model error across five corruption severity levels (in our implementation, we normalize the per-corruption error by the performance of the pruned model on the clean ImageNet dataset). ImageNet-C corruption substantially degrades the mean top-1 accuracy of non-pruned models (Fig. 8). This sensitivity is amplified at high levels of sparsity, where there is a further steep decline in top-1 accuracy. Sensitivity to different corruptions is remarkably varied, with certain corruptions such as Gaussian, shot and impulse noise consistently causing more degradation.
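The corruption-error computation described above can be sketched as follows (hypothetical error rates; the normalization by clean performance follows the in-text description rather than an official implementation):

```python
def corruption_error(err_by_severity, clean_err):
    """Mean error over the five severity levels, normalized by clean error."""
    return sum(err_by_severity) / len(err_by_severity) / clean_err

# Hypothetical top-1 error rates for one corruption type at severities 1-5.
gaussian_noise = [0.35, 0.45, 0.60, 0.75, 0.85]
clean = 0.24   # hypothetical error of the same model on clean ImageNet

ce = corruption_error(gaussian_noise, clean)
print(round(ce, 2))  # -> 2.5, i.e. 2.5x the clean error under this corruption
```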
ImageNet-A. ImageNet-A is a curated test set of natural adversarial images designed to produce drastically low test accuracy. We find that the sensitivity of pruned models to ImageNet-A mirrors the patterns of degradation on ImageNet-C and on sets of PIEs. As sparsity increases, top-1 and top-5 accuracy erode further, suggesting that pruned models are more brittle to natural adversarial examples.
4 Related Work
Model compression is diverse and includes research directions such as reducing the precision or bit size per model weight (quantization) [12, 35, 22], efforts to start with a network that is more compact with fewer parameters, layers or computations (architecture design) [34, 37, 44], student networks with fewer parameters that learn from a larger teacher model (model distillation), and finally pruning, which sets a subset of weights or filters to zero [51, 70, 13, 26, 64, 27, 73, 61, 54]. Articulating the trade-offs of compression has overwhelmingly centered on change to overall accuracy. Our contribution, while limited in scope to model compression techniques that prune deep neural networks, is to our knowledge the first work to propose a formal methodology for evaluating the impact of pruning in deep neural networks at a class and exemplar level, and to show that this impact is non-uniform.
We also consider how pruning impacts robustness to natural adversarial examples and image corruptions. We note that recent work by [29, 31] considers a complementary variant of this question by benchmarking ImageNet-A and ImageNet-C robustness across a limited set of dense non-pruned architectures with different numbers of parameters (for example, ResNet-50 vs. ResNet-101). While our work is focused on understanding the impact of sparsity at an exemplar and class level, one of our key findings is that PIEs are far more challenging to classify for both pruned and non-pruned models. Leveraging this subset of data points for interpretability purposes or to clean up the dataset fits into a broader, non-overlapping body of literature that aims to classify input data points as prototypes – “most typical” examples of a class – [3, 63] or as outside of the training distribution (OOD) [30, 46, 49, 52], as well as work on calibrating deep neural network predictions [45, 20, 40].
5 Conclusion
We propose a formal methodology to evaluate the impact of pruning at a class and exemplar level. We show that deep neural networks pruned to different levels of sparsity “forget” certain classes and examples more than others. While a subset of classes is systematically impacted, the direction of this impact is surprising and nuanced. Our results show that certain classes are relatively impervious to the reduction in model capacity while others bear the brunt of the degradation in performance. Pruning identified exemplars are a subset of exemplars where there is a high level of disagreement between pruned and non-pruned models. We show that this subset is universally challenging to classify for models at all levels of sparsity. Our results shed light on previously unknown trade-offs, and suggest that caution should be exercised before deploying pruned models in sensitive domains where human welfare can be adversely impacted.
We thank the generosity of our peers for valuable input on earlier versions of this work. In particular, we would like to acknowledge the input of Jonas Kemp, Simon Kornblith, Julius Adebayo, Hugo Larochelle, Dumitru Erhan, Nicolas Papernot, Catherine Olsson, Cliff Young, Martin Wattenberg, Utku Evci, James Wexler, Trevor Gale, Melissa Fabros, Prajit Ramachandran, Pieter Kindermans, Erich Elsen and Moustapha Cisse. We thank the institutional support and encouragement of Dan Nanas, Rita Ruiz, Sally Jesmonth and Alexander Popper.
References
- (1954) A test of goodness of fit. Journal of the American Statistical Association 49 (268), pp. 765–769.
- (2008) Classification with a reject option using a hinge loss. J. Mach. Learn. Res. 9, pp. 1823–1840.
- (2019) Prototypical examples in deep learning: metrics, characteristics, and utility.
- (2000) Case-based explanation for artificial neural nets. In Artificial Neural Networks in Medicine and Biology, H. Malmgren, M. Borga, and L. Niklasson (Eds.), London, pp. 303–308.
- (2000) Structural and functional brain development and its relation to cognitive development. Biological Psychology 54 (1), pp. 241–257.
- (2016) Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 367–379.
- (2014) Memory bounded deep convolutional networks. CoRR abs/1412.1442.
- (2017) Online learning with abstention. arXiv:1703.03478.
- (2016) Boosting with abstention. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 1660–1668.
- (2016) Learning with rejection. In ALT.
- (2014) Training deep neural networks with low precision multiplications. arXiv:1412.7024.
- (1990) Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
- R. B. D'Agostino and M. A. Stephens (Eds.) (1986) Goodness-of-fit techniques. Marcel Dekker, Inc., New York, NY, USA.
- (2018) Amazon scraps secret AI recruiting tool that showed bias against women. Reuters.
- (2009) ImageNet: a large-scale hierarchical image database. In CVPR 2009.
- (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542.
- (2019) The state of sparsity in deep neural networks. CoRR abs/1902.09574.
- (2018) 3D deep learning for detecting pulmonary nodules in CT scans. Journal of the American Medical Informatics Association 25 (10), pp. 1301–1310.
- (2017) On calibration of modern neural networks. arXiv:1706.04599.
- (2016) Dynamic network surgery for efficient DNNs. CoRR abs/1608.04493.
- (2015) Deep learning with limited numerical precision. CoRR abs/1502.02551.
- (2017) Efficient data representation by selecting prototypes with importance weights. arXiv:1707.01212.
- (2015) Learning both weights and connections for efficient neural network. In NIPS, pp. 1135–1143.
- (2019) A face-scanning algorithm increasingly decides whether you deserve the job. The Washington Post.
- (1993) Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293–299.
- (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems 5, pp. 164–171.
- (2015) Deep residual learning for image recognition. arXiv e-prints.
- (2019) Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations.
- (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv:1610.02136.
- (2019) Natural adversarial examples. arXiv:1907.07174.
- (2015) Distilling the knowledge in a neural network. arXiv:1503.02531.
- (2019) A benchmark for interpretability methods in deep neural networks. In NeurIPS 2019.
- (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv e-prints.
- (2016) Quantized neural networks: training neural networks with low precision weights and activations. CoRR abs/1609.07061.
- (2002) Goodness-of-fit tests and model validity. Birkhäuser Boston.
- (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv e-prints.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167.
- (2018) Efficient neural audio synthesis. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, pp. 2415–2424.
- (2017) What uncertainties do we need in Bayesian deep learning for computer vision?. In Advances in Neural Information Processing Systems 30, pp. 5574–5584.
- (2016) Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems 29, pp. 2280–2288.
- (2009) Fundamentals of human neuropsychology. A series of books in psychology, Worth Publishers.
- (2012) Learning multiple layers of features from tiny images. University of Toronto.
- (2017) Resource-efficient machine learning in 2 KB RAM for the internet of things. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, Sydney, Australia, pp. 1935–1944.
- (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), pp. 6405–6416.
- (2018) Training confidence-calibrated classifiers for detecting out-of-distribution samples. In International Conference on Learning Representations.
- (2018) SNIP: single-shot network pruning based on connection sensitivity. CoRR abs/1810.02340.
- (2017) Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports 7.
- (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations.
- (2017) Learning efficient convolutional networks through network slimming. arXiv e-prints.
- (2017) Learning sparse neural networks through $L_0$ regularization. arXiv e-prints.
- (2018) Metric learning for novelty and anomaly detection. arXiv:1808.05492.
-  (2018) Scalable Training of Artificial Neural Networks with Adaptive Sparse Connectivity Inspired by Network Science. Nature Communications. Cited by: §6.1.
-  (2017-04) Exploring Sparsity in Recurrent Neural Networks. arXiv e-prints, pp. arXiv:1704.05119. External Links: Cited by: §4.
-  (2014-12) Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. arXiv e-prints, pp. arXiv:1412.1897. External Links: Cited by: §2.3.
-  (2017-01) Technical report, U.S. Department of Transportation, National Highway Traffic, Tesla Crash Preliminary Evaluation Report Safety Administration. PE 16-007. Cited by: §1.
-  (1992) Simplifying neural networks by soft weight-sharing. Neural Computation 4 (4), pp. 473–493. External Links: Cited by: §1, §3.1.
-  (1994) Synaptic development of the cerebral cortex: implications for learning, memory, and mental illness. In The Self-Organizing Brain: From Growth Cones to Functional Networks, J. V. Pelt, M.A. Corner, H.B.M. Uylings, and F.H. L. D. Silva (Eds.), Progress in Brain Research, Vol. 102, pp. 227 – 243. External Links: Cited by: §1.
-  (2016-06) Minerva: enabling low-power, highly-accurate deep neural network accelerators. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Vol. , pp. 267–278. External Links: Cited by: §1.
-  (2018-05) Evolutionary pruning of transfer learned deep convolutional neural network for breast cancer diagnosis in digital breast tomosynthesis. Physics in Medicine & Biology 63 (9), pp. 095005. External Links: Cited by: §1.
-  (2016-06) Compression of Neural Machine Translation Models via Pruning. arXiv e-prints, pp. arXiv:1606.09274. External Links: Cited by: §4.
-  (2004) Longitudinal mapping of cortical thickness and brain growth in normal children. Journal of Neuroscience 24 (38), pp. 8223–8231. External Links: Cited by: §1.
-  (2017-11) ConvNets and ImageNet Beyond Accuracy: Understanding Mistakes and Uncovering Biases. arXiv e-prints, pp. arXiv:1711.11443. External Links: Cited by: §4.
-  (1997) Sparse connection and pruning in large dynamic artificial neural networks. Cited by: §4.
-  (2018) Faster gaze prediction with dense networks and Fisher pruning. CoRR abs/1801.05787. External Links: Cited by: §1.
-  (2017) Soft Weight-Sharing for Neural Network Compression. CoRR abs/1702.04008. Cited by: §1.
-  (2018) LPCNet: Improving Neural Speech Synthesis Through Linear Prediction. CoRR abs/1810.11846. External Links: Cited by: §1.
-  (1991) Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky (Eds.), pp. 875–882. Cited by: §1, §3.1.
-  (1947) The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34, pp. 28–35. External Links: Cited by: §2.2.
-  (2016-08) Learning Structured Sparsity in Deep Neural Networks. ArXiv e-prints. External Links: Cited by: §3.1, §4.
-  (2019) Automated pulmonary nodule detection in ct images using deep convolutional neural networks. Pattern Recognition 85, pp. 109 – 119. External Links: Cited by: §1.
-  (2016) Wide residual networks. CoRR abs/1605.07146. External Links: Cited by: §3.1.
-  (2017-10) To prune, or not to prune: exploring the efficacy of pruning for model compression. ArXiv e-prints. External Links: Cited by: §4.
-  (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. CoRR abs/1710.01878. External Links: Cited by: §3.1, §6.1, §6.1.
6.1 Magnitude Pruning
Many pruning methodologies use the absolute value of a weight as a way to rank its importance, removing from the network any weights that fall below a user-specified threshold. Pruning is typically applied over the course of training: training is punctuated at certain pruning steps, at which a fraction of the weights are set to zero. The many magnitude pruning methods that have been proposed [8, 21, 74] largely differ in whether weights are removed permanently or can “recover” by continuing to receive gradient updates, which allows incorrectly pruned weights to become non-zero again. While magnitude pruning is most often used as a criterion for removing individual weights, it can be adapted to remove entire neurons or filters by extending the ranking criterion to a set of weights and setting the threshold appropriately. Recent work on evolutionary strategies has also leveraged an iterative version of magnitude pruning.
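The core ranking-and-thresholding step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the exact procedure of any cited method; the function name and threshold-selection strategy are our own.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, fraction: float) -> np.ndarray:
    """Return a binary mask that zeroes out the `fraction` of weights
    with smallest absolute value; weights where mask == 0 are removed."""
    k = int(fraction * weights.size)
    if k == 0:
        return np.ones_like(weights)
    # The threshold is the k-th smallest |w| across the flattened tensor.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

w = np.array([[0.05, -0.9], [1.2, -0.01]])
mask = magnitude_prune(w, 0.5)   # keeps the two largest-magnitude weights
w_pruned = w * mask              # small-magnitude weights become zero
```

Whether the zeroed entries of `mask` stay fixed for the rest of training, or are recomputed at the next pruning step so that weights can recover, is exactly the design axis on which the methods above differ.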
In this work, we use the magnitude pruning methodology proposed by . Pruning is introduced over the course of training and removed weights continue to receive gradient updates after being pruned. For ImageNet, each model trains for a total of 32,000 steps. We prune every 500 steps between 1,000 and 9,000 steps. For CIFAR-10, we train the model for 80,000 steps. We prune every 2,000 steps between 1,000 and 20,000 steps. These hyperparameter choices were based upon a limited grid search which suggested that these particular settings minimized degradation to test-set accuracy across all sparsity levels. At the end of training, the final pruned mask is fixed and during inference only the remaining weights contribute to the model prediction.
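The ImageNet schedule described above can be sketched as follows. The linear ramp toward the final sparsity is illustrative only: gradual-pruning schedules in the literature often use a cubic rather than linear interpolation, and the helper names here are our own.

```python
def pruning_steps(start=1_000, end=9_000, every=500):
    """Training steps at which pruning is applied (ImageNet settings)."""
    return list(range(start, end + 1, every))

def target_sparsity(step, final_sparsity=0.9, start=1_000, end=9_000):
    """Sparsity targeted at a given step, using a simple linear ramp.

    Before `start` no weights are pruned; after `end` the mask is held
    fixed at the final sparsity for the remainder of training.
    """
    if step < start:
        return 0.0
    if step >= end:
        return final_sparsity
    return final_sparsity * (step - start) / (end - start)

# 17 pruning events between steps 1,000 and 9,000, every 500 steps.
print(len(pruning_steps()))
```

Between pruning events, all weights (including those currently masked) continue to receive gradient updates, matching the "recoverable" variant of magnitude pruning used in this work.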
6.2 Additional class-level results
The class-level analysis summary for CIFAR-10 is in Table 4. Relative to ImageNet, the percentage of classes significantly impacted at 90% pruning is small: 20% for CIFAR-10 versus 58% for ImageNet. As discussed in the main body, we suspect this is due to CIFAR-10 being a simpler task and the network we started from having much more capacity than necessary for the task.
[Table 4 columns — ImageNet: Fraction Pruned, Top-1, Top-5, # significantly impacted classes, # PIEs. CIFAR-10: Fraction Pruned, Top-1, # significantly impacted classes, # PIEs. Values omitted.]
Figure 4 in the main body of the paper visualizes the relative increases and decreases in performance across classes at 70% and 90% pruning on ImageNet. Those figures were shrunk for space; Figure 9 shows the full chart at 90% sparsity for clarity.
[Table columns: # significantly increased and # significantly decreased classes; largest increase and largest decrease, each reported as class, normalized change, and absolute change. Values omitted.]
6.3 Additional PIE Results
In the body of the paper we showed the performance of the unpruned model on PIE images identified at varying levels of sparsity: performance is worst for the images we suspect are most difficult, and remains poor, though somewhat better, for PIEs identified at larger pruning fractions. In Figure 11 we plot the performance of the pruned models on PIEs identified at different levels of sparsity and show that the pruned models track the behavior of the non-pruned model in this regard.
6.4 Additional Corruption and Adversarial Results
Sparse models are less robust to natural adversarial examples. At high levels of sparsity, models are also more brittle to common image corruptions. We include the raw ImageNet-C results in Figure 5.
6.5 Human Study
A balanced sample of PIE and non-PIE images was selected at random and shuffled. The classification as PIE or non-PIE was not known or available to the human labelers. We include an image of the data collection interface in Figure 12. The following questions were codified for every image considered:
Does label 1 accurately label an object in the image? (0/1)
Does this image depict a single object? (0/1)
Would you consider labels 1, 2, and 3 to be semantically very close to each other? (i.e., does this image require fine-grained classification?) (0/1)
Do you consider the object in the image to be a typical exemplar for the class indicated by label 1? (0/1)
Is the image quality corrupted (by common image corruptions such as overlaid text, brightness, contrast, filters, defocus blur, fog, JPEG compression, pixelation, shot noise, zoom blur, or black-and-white rather than RGB)? (0/1)
Is the object in the image an abstract representation of the class indicated by label 1? (An abstract representation is an object in an abstract form, such as a painting, drawing, or rendering using a different material.) (0/1)
[Table: ImageNet robustness to ImageNet-C corruptions, by level of sparsity. Columns: corruption type, pruning fraction, top-1, top-5, top-1 relative, top-5 relative. Values omitted.]