Attribution Preservation in Network Compression for Reliable Network Interpretation
Neural networks embedded in safety-sensitive applications such as self-driving cars and wearable health monitors rely on two important techniques: input attribution for hindsight analysis and network compression to reduce their size for edge computing. In this paper, we show that these seemingly unrelated techniques conflict with each other, as network compression deforms the produced attributions, which could lead to dire consequences for mission-critical applications. This phenomenon arises because conventional network compression methods only preserve the predictions of the network while ignoring the quality of the attributions. To combat the attribution inconsistency problem, we present a framework that can preserve the attributions while compressing a network. By employing the Weighted Collapsed Attribution Matching regularizer, we match the attribution maps of the network being compressed to those of its pre-compression former self. We demonstrate the effectiveness of our algorithm both quantitatively and qualitatively on diverse compression methods.
1 Introduction
Riding on the recent success of deep learning in numerous fields, there is an emergent trend to utilize deep neural networks (DNNs) even for safety-critical applications such as self-driving cars and wearable health monitors. Due to the inherent nature of such devices, it is of paramount importance that the utilized DNNs be reliable and trustworthy to human users.
For a system to be reliable, perpetual service must be rendered and the integrity of the system must hold even under unexpected circumstances. For most commercially deployed DNNs, this condition is hardly met as they are often operated in the cloud due to their heavy computational requirements. However, this dependence on clouds acts as a critical weakness in safety-sensitive settings as intermittent communication failures to the cloud may cause difficulties in reacting to situations immediately, or even worse, the device’s connection to the cloud may be severed indefinitely. Thus, to guarantee reliable service, the DNNs must be embedded on the edge device. To this end, network compression techniques such as pruning Han et al. (2015a); Li et al. (2016) and distillation Hinton et al. (2015); Zagoruyko and Komodakis (2017) are commonly employed - as a compressed network would require less computational time and memory but maintain its prediction performance to a certain acceptable margin, effectively substituting the original network for edge computation.
At the same time, for a system to be trustworthy, the system must be transparent enough for humans to understand its workings and the reasons for its outputs. An example would be when a health monitor predicts the onset of a disease Xu et al. (2019) - then the clinician would require an acceptable explanation for the device output. However, the black-box nature of deep neural networks complicates this goal, impeding their advance in safety-critical areas. For DNNs to gain trustworthiness, the ability to explain why the network makes such decisions is essential. This field of interest - eXplainable AI (XAI) - has emerged as one of the important frontiers in the field of deep learning. Among numerous XAI methods, the most commonly used are attribution methods Selvaraju et al. (2017), which weigh the parts of the input data according to how much they ‘contributed’ to producing the output prediction. Such attribution methods are beginning to be applied in safety-critical fields Liang et al. (2020).
To ensure the safety of the system, the two aforementioned conditions should be simultaneously satisfied - the embedded DNNs must be equipped with both compression and attribution. However, we show for the first time that these seemingly unrelated techniques conflict with each other: compressing a network causes deformations in the produced attributions, even if the predictions of the network stay the same before and after compression (see Figure 1). This is a potentially severe crack in the integrity of the compressed network, as the premise under which a compressed network is acceptable in safety-critical fields is that the compressed network is as reliable as its former self. This implies that the compressed network must behave almost identically to the pre-compression network while being smaller in size. Moreover, the attributions of the compressed network are not only different from their past counterparts but also degraded with respect to their segmentation ground truths, as shown in Figure 1 and Table 1. These attribution distortions directly cause incorrect interpretations, which could lead to dire consequences for safety-critical systems. Such a problem arises from the pitfall of existing network compression approaches: they only aim to maintain the prediction quality of the network while reducing the size of the network.
| For samples with correct pred. |
| KD (w/ Ours) | 0.29M | 88.06 | 79.12 |
Compressing a network forces the network to cram its necessary decision procedures and information inside a smaller space. This space restriction forces the network to abandon its standard decision procedures and resort to using shortcuts and hints that are seemingly indecipherable to humans. Thus, its decision procedures would become harder to interpret, which is reflected in its production of deformed attribution maps.
To resolve this newfound unintended issue, we propose a novel attribution-aware compression framework to ensure both the reliability and trustworthiness of the compressed model. One way to tackle this problem is to inject the attribution information to the now-compressing network by employing a matching regularizer to match the attributions to a ground truth signal (e.g. ground truth segmentation data). However, these kinds of signals are very rare as they require extensive human labor. To bypass this problem, we concentrate on the observation that the attributions of the pre-network (teacher) are closer to the ground truth signal compared to the post-network (student), as shown in Table 1. Thus, in the absence of ground truth signals, the attributions of the teacher can serve as a proxy. In this sense, we propose a regularizer that matches the attribution maps of the now-compressing network to its attribution maps before compression, transferring the attributional power of the pre-network to the post-network. Our work sheds new light on transfer learning techniques from the perspective of XAI, as they can be re-interpreted and subsumed under our framework.
Our contributions are as follows:
We show for the first time that compressing networks via pruning or distillation distorts the attributions of the network (i.e. compressed networks classify correctly but pay attention to wrong places), hence the well-calibrated explainability of the original model can be completely destroyed even with matched performance.
We propose a matching technique to efficiently preserve diverse levels of attribution maps while compressing the networks, by regularizing the differences between the sampled attribution maps of the teacher and the student.
Through extensive experiments, we validate the effectiveness of our framework and show that our attribution matching not only maintains the interpretation of the model but also yields significant performance gains.
2 Related Work
Recent advances in producing human-understandable explanations for predictions of DNNs have gained much attention throughout the machine learning community. Among a variety of approaches towards this goal, one widely adopted method of interpretation is input attribution. Attribution approaches try to explain deep neural networks by producing visual explanations about the decisions of the network. By examining how the network’s output reacts to change in the input, the contributions of each input variable are calculated. In computer vision, these contributions are displayed in a 2-D manner, forming an attribution map. Attribution maps identify the spatial locations of the parts of the image the network deems significant in producing such a decision. Early works toward this direction use the gradient of the network output with respect to the input pixels to represent the sensitivity and significance of specific input pixels Zhou et al. (2015); Simonyan et al. (2013); Erhan et al. (2009). More recent studies such as Guided Backprop Springenberg et al. (2014), Grad-Cam Selvaraju et al. (2017) or integrated gradients Sundararajan et al. (2017) proposed to process and combine these gradient signals in more careful ways. Another line of works proposed to propagate relevance values in a way that their total amount is preserved for a single layer. These relevance scores are backpropagated through the network from the output layer to the input layer. Several studies such as EBP Zhang et al. (2016), LRP Montavon et al. (2018) proposed to define novel relevance scores differing from vanilla gradients and backpropagate these values according to a set of novel backpropagation rules.
Commonly used deep neural networks are heavy in computation and memory by design. Their resource requirement is the main impediment in operating these networks on resource-constrained platforms. To alleviate this constraint, many branches of work have been proposed to reduce the size of an existing neural network. The most commonly employed approach is to reduce the number of weights, neurons, or layers in a network while maintaining approximately the same performance. This approach on deep neural networks was first explored in early works such as LeCun et al. (1990) and Hassibi et al. (1994). Recent studies conducted by Han et al. (2015b, a) have brought popularity to this line of work with a simple unstructured pruning method that reduces the size of the network by pruning unimportant connections within the network. However, unstructured pruning has an inherent weakness as it produces large sparse weight matrices that are computationally inefficient unless paired with specially designed hardware. To resolve this issue, structured pruning methods were proposed Li et al. (2016); Hu et al. (2016); Wen et al. (2016) where entire channels are pruned simultaneously to ensure the denseness of the weights.
Network distillation, another branch of network compression initially proposed by Hinton et al. (2015), attempts to reduce the size of the network by transferring the knowledge of the full network to a student network of smaller size. By employing a loss function that teaches the student network to mimic the outputs of the teacher network, a smaller network with similar performance can be obtained. Advanced methods of distillation have succeeded in achieving much more effective transfer by transferring not only the output logits but also the information of the intermediate activations, as in Zagoruyko and Komodakis (2017); Romero et al. (2014); Jang et al. (2019); Ahn et al. (2019).
3 Attribution-Preserving Compression
Network compression. Throughout the paper, we use the term network compression to refer to any activity that reduces the size of the network while maintaining the predictive performance of the network within a certain acceptable margin (pruning, distillation, quantization, and more). The general compression framework is composed of the following stages: network pre-training, reduction, and fine-tuning. First, a full-size network $f_T$ (or teacher network) is trained. Next, the network is reduced in size. In the reduction phase, parts of the network (be it weights, channels, or information) are discarded, producing a network $f_S$ (or student network) that is smaller in size. For example, in the case of network pruning, the connections of the full network are severed with a pruning criterion and the corresponding weights are discarded, shrinking the number of parameters in the network. Finally, the network is fine-tuned on the same dataset to sufficiently recover from the performance degradation caused by the reduction phase, producing the final compressed network. For certain kinds of algorithms such as network sparsification Wen et al. (2016), steps two and three can be executed simultaneously.
Attribution maps. For a neural network $f$, an attribution for an input data point $x$ at a certain layer $l$ is a multidimensional tensor containing the importance values of each input or neuron at that layer which the network considers important in making its prediction. These attribution values are calculated based on the magnitude of the point and its sensitivity to change of value. Most attribution algorithms leverage the activation value (for magnitude) and gradient (for sensitivity) to determine the importance.
This definition can be readily applied to Convolutional Neural Networks (CNNs). Consider a convolutional layer with kernel $K \in \mathbb{R}^{k_h \times k_w \times C_{in} \times C_{out}}$ ($k_h$ and $k_w$ are the height and width of the kernel respectively, and $C_{in}$ and $C_{out}$ are the numbers of input and output channels), and output activation $A \in \mathbb{R}^{C_{out} \times H \times W}$. Then, the attribution of this layer is a 3-dimensional tensor $T \in \mathbb{R}^{C_{out} \times H \times W}$. However, due to the spatio-local nature of CNNs, the attributions are often summed and collapsed along their channel dimension to produce a spatial attribution map to enhance human-interpretability. Specifically, suppose we have an original 3-dimensional attribution $T$ that is the concatenation of 2-dimensional (in $\mathbb{R}^{H \times W}$) attributions $T_1, \dots, T_{C_{out}}$. Then, the collapsed version is computed as $\bar{T} = \sum_{c=1}^{C_{out}} T_c$.
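The channel collapse described above can be illustrated with a minimal NumPy sketch. The function name `collapse_channels` and the channel-first `(C, H, W)` layout are assumptions made for this illustration only.

```python
import numpy as np

def collapse_channels(attribution):
    """Sum a (C, H, W) attribution tensor over its channel axis.

    The (C, H, W) layout is an assumption of this sketch.
    Returns an (H, W) spatial attribution map.
    """
    attribution = np.asarray(attribution, dtype=float)
    if attribution.ndim != 3:
        raise ValueError("expected a 3-dimensional (C, H, W) tensor")
    return attribution.sum(axis=0)

# Two channels of size 2 x 3: values 0..5 and 6..11.
example = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
collapsed = collapse_channels(example)  # shape (2, 3)
```

The result keeps only spatial information: every channel contributes equally, which is exactly the limitation that the weighted variant in the next subsection addresses.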
3.2 Weighted Collapsed Attribution Matching Framework
We now present our framework, Weighted Collapsed Attribution Matching, which preserves the attributions in a compressed network by transferring the attributional power of its past self to the current self. To this end, we employ a matching loss that matches the attribution map of the compressed (student) network $f_S$ to that of the full (teacher) network $f_T$ in the fine-tuning stage of compression.
The key ingredient of our framework is the way attribution maps are computed. Beyond naively collapsing the 3-dimensional attribution to a 2-dimensional matrix, our framework allows us to consider the importance of each channel when creating an attribution map. For the $l$-th layer of a CNN, the attribution map based on importance-aware collapsing is produced in the following way:
$$M^{(l)} = h\Big(\sum_{c} \alpha^{(l)}_c \, g\big(A^{(l)}_c\big)\Big), \qquad (1)$$
where $A^{(l)}_c$ is the output activation of channel $c$, $g$ is a function of choice, $\alpha^{(l)}_c$ is the importance of channel $c$ given by an importance calculation function $\psi$, and $h$ is an optional post-processing function. When it is clear from the context, we suppress the layer index $(l)$ for clarity.
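As a rough illustration of importance-aware collapsing, the sketch below applies a per-channel function, scales each channel by its importance weight, sums over channels, and post-processes the result. The function name `weighted_collapsed_map`, the `(C, H, W)` layout, and the identity defaults for the two functions are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def weighted_collapsed_map(activations, importance, g=lambda a: a, h=lambda m: m):
    """Importance-weighted channel collapse.

    activations: (C, H, W) output activations of one layer (assumed layout).
    importance:  (C,) per-channel importance weights.
    g: element-wise function applied to activations before weighting.
    h: optional post-processing applied to the collapsed (H, W) map.
    """
    activations = np.asarray(activations, dtype=float)
    importance = np.asarray(importance, dtype=float)
    if activations.ndim != 3 or importance.shape != (activations.shape[0],):
        raise ValueError("expected (C, H, W) activations and (C,) importance")
    weighted = importance[:, None, None] * g(activations)
    return h(weighted.sum(axis=0))

acts = np.full((3, 2, 2), 2.0)
plain = weighted_collapsed_map(acts, [1.0, 2.0, 3.0])
squared = weighted_collapsed_map(acts, [1.0, 2.0, 3.0], g=np.square)
clipped = weighted_collapsed_map(acts, [-1.0, -1.0, -1.0],
                                 h=lambda m: np.maximum(m, 0.0))
```

Different choices of the weights and of the two functions recover the specific regularizers discussed in the remainder of this section.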
Given the weighted collapsed map in (1), we consider the following objective in the fine-tuning stage, which tries to reduce the (normalized) difference between the student map $M_S^{(l)}$ and the teacher map $M_T^{(l)}$:
$$\mathcal{L} = \mathcal{L}_{\text{sup}}(f_S) + \beta \sum_{l \in \Lambda} \left\| \frac{M_S^{(l)}}{\|M_S^{(l)}\|_2} - \frac{M_T^{(l)}}{\|M_T^{(l)}\|_2} \right\|_2, \qquad (2)$$
where $\mathcal{L}_{\text{sup}}$ is the supervised learning loss for the compressed network $f_S$, $\beta$ is a tunable hyperparameter and $\Lambda$ is the set of layers to match. Note here that we use $\|\cdot\|_2$ to denote the element-wise $\ell_2$ norm (or Frobenius norm for matrix input). The overall schematic of our framework is depicted in Figure 2. For any kind of attribution algorithm that is end-to-end differentiable, we can directly apply and minimize our weighted collapsed attribution matching regularizer via stochastic gradient descent. The framework in (2) can be applied to any compression method that involves a fine-tuning phase - pruning, distillation, quantization, etc.
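The fine-tuning objective can be sketched numerically as a supervised loss plus a weighted sum of normalized map differences over the matched layers. The function names, the small `eps` guard against division by zero, and the use of an unsquared Frobenius norm are assumptions of this sketch; in practice the same computation would live inside an automatic differentiation framework so that it can be minimized by gradient descent.

```python
import numpy as np

def normalize_map(m, eps=1e-12):
    """Scale a map to unit Frobenius norm (eps is a numerical guard)."""
    m = np.asarray(m, dtype=float)
    return m / (np.linalg.norm(m) + eps)

def attribution_matching_loss(student_maps, teacher_maps):
    """Sum over matched layers of the distance between normalized maps."""
    if len(student_maps) != len(teacher_maps):
        raise ValueError("one teacher map is needed per student map")
    return float(sum(
        np.linalg.norm(normalize_map(s) - normalize_map(t))
        for s, t in zip(student_maps, teacher_maps)
    ))

def total_objective(supervised_loss, student_maps, teacher_maps, beta):
    """Supervised loss plus beta times the attribution matching term."""
    return supervised_loss + beta * attribution_matching_loss(student_maps, teacher_maps)

same_direction = attribution_matching_loss([np.array([[2.0, 0.0]])],
                                           [np.array([[1.0, 0.0]])])
orthogonal = attribution_matching_loss([np.array([[1.0, 0.0]])],
                                       [np.array([[0.0, 1.0]])])
objective = total_objective(1.0, [np.array([[1.0, 0.0]])],
                            [np.array([[0.0, 1.0]])], beta=0.5)
```

Because both maps are normalized first, two maps that differ only by a positive scale incur (numerically) zero penalty, so the regularizer constrains where the network looks rather than how large the raw map values are.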
Equally weighted collapsed activation map matching
A simple form of (1) is to naively assign equal importance weights to the channels and collapse them along the channel dimension. Setting $\alpha_c = 1$ for all $c$, and $h$ and $g$ as the element-wise identity and square function respectively, we have
$$M = \sum_{c} A_c^{\circ 2}, \qquad (3)$$
where $A_c^{\circ 2}$ represents the Hadamard power (or element-wise power) of $A_c$. This regularizer was proposed in a prior work on transfer learning (Zagoruyko and Komodakis, 2017) to boost knowledge transfer from a teacher network to a student network, solely to improve performance. Our framework views this regularizer in the new light of XAI: it matches a label-aggregated, channel-wise equally weighted attribution map. From our experiments below, we confirm that this regularizer is partially effective in preserving attribution maps in compression. However, this form of attribution map does not contain label-specific attribution information since all activation values are equally weighted and aggregated. In other words, this regularizer may teach the student how to look and distinguish objects, but does not pass on the information of ‘what’ and ‘why’ it should look at a certain region.
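A minimal sketch of the equally weighted map: square every activation element-wise and sum over channels. The function name `ewa_map` and the `(C, H, W)` layout are assumptions of this sketch.

```python
import numpy as np

def ewa_map(activations):
    """Equally weighted collapsed activation map.

    activations: (C, H, W) array (assumed layout).
    Every channel has weight 1 and activations are squared element-wise,
    so the map carries no class-specific information.
    """
    activations = np.asarray(activations, dtype=float)
    return np.square(activations).sum(axis=0)

acts = np.array([[[1.0, -2.0]],
                 [[3.0, 0.0]]])  # shape (2, 1, 2)
m = ewa_map(acts)
```

Note that no gradient or label enters the computation, which is why the same map is produced regardless of the class being explained.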
Sensitivity-weighted activation map matching
As a practical showcase of our framework, we demonstrate a simple sensitivity-weighted matching regularizer. We elaborate on the flow of our framework using Grad-Cam, a simple yet effective and widely used attribution method. Grad-Cam produces an attribution map by aggregating the activation maps with a linear combination of activations, where each activation map is weighted by the sensitivity of the channel, that is, the label-specific pooled gradient of an activation map. Motivated by this, we define $\alpha_c$ in (1) as
$$\alpha_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \frac{\partial y^{k}}{\partial A_{c,ij}}, \qquad (4)$$
where $y^{k}$ is the output logit generated by $f$ for some target class $k$. We set $h$ as ReLU to remove negative regions Selvaraju et al. (2017). We also set $g$ as identity so that
$$M = \mathrm{ReLU}\Big(\sum_{c} \alpha_c A_c\Big).$$
Unlike the collapsed activation map in (3), the activation maps are weighted by the pooled gradient values taken with respect to the output prediction. Since Grad-Cam is end-to-end differentiable, this form of regularization can be easily implemented within the conventional automatic differentiation framework. Since separate attribution maps can be created for each class label, we can match the attribution maps for all classes. However, to reduce the computational overhead, matching only the attribution maps of high-scoring classes is more practical.
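The sensitivity-weighted map can be sketched as follows, assuming the gradients of the target-class logit with respect to the layer activations have already been obtained (for example from an automatic differentiation library) and are passed in as a plain array. The function names and the `(C, H, W)` layout are assumptions of this sketch.

```python
import numpy as np

def sensitivity_weights(gradients):
    """Spatially average the gradients of the class logit per channel.

    gradients: (C, H, W) array of d(logit) / d(activation), assumed to be
    precomputed elsewhere. Returns a (C,) vector of channel weights.
    """
    return np.asarray(gradients, dtype=float).mean(axis=(1, 2))

def swa_map(activations, gradients):
    """Grad-Cam style map: ReLU of the gradient-weighted channel sum."""
    activations = np.asarray(activations, dtype=float)
    alpha = sensitivity_weights(gradients)
    weighted_sum = np.einsum("c,chw->hw", alpha, activations)
    return np.maximum(weighted_sum, 0.0)

acts = np.array([[[4.0, 1.0]],
                 [[1.0, 3.0]]])    # (2, 1, 2)
grads = np.array([[[0.5, 1.5]],
                  [[-1.0, -1.0]]])  # channel weights become [1.0, -1.0]
alpha = sensitivity_weights(grads)
cam = swa_map(acts, grads)
```

Channels whose activation decreases the class logit receive negative weights, and locations dominated by such channels are zeroed by the final ReLU.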
Another interesting family in framework (2) is the one leveraging stochasticity in computing the importance weight $\alpha_c$. Stochasticity injection in deep learning has been shown to exhibit generalization benefits Srivastava et al. (2014); Kingma et al. (2015), and recent works are starting to utilize this concept to boost the performance of knowledge transfer between a teacher and a student Saito et al. (2017); Lee et al. (2017). Inspired by these works, we formulate a stochastic matching regularizer to facilitate relevant information transfer and prevent overfitting in which the student network only learns to superficially imitate the attribution maps of the teacher. For this purpose, we impose a probability distribution on the importance weights, $\alpha_c \sim p_{\psi}(\alpha_c)$, where $p_{\psi}$ is a probability distribution of the importance generating function $\psi$. In the fine-tuning phase, importance weights are sampled from the distribution and a perturbed attribution map is created.
A simple and applicable formulation is to impose a Bernoulli distribution on the importance weights. Specifically, similar to dropout, we draw i.i.d. samples $m_c$ from a Bernoulli distribution and mask the importance weights before summing the attributions. This is equivalent to dropping randomly selected channel-wise attributions before collapsing them. Given the calculated channel-wise importance weights $\alpha_c$ and drop probability $p$, the stochastic matching regularizer using Bernoulli masks in framework (2) is formulated as follows:
$$M = h\Big(\sum_{c} m_c \, \alpha_c \, g(A_c)\Big), \qquad m_c \sim \mathrm{Bernoulli}(1 - p),$$
where $g$ is a function of choice. In this way, we expect that diverse levels of attribution maps of the teacher network are transferred into the student network in the training process. Further diverse strategies can be explored under this setting, such as sharing the drop mask between the teacher network and the compressed network according to their similarity.
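The Bernoulli-masked variant can be sketched by dropping each channel's importance weight independently before the weighted sum. The function name `stochastic_weighted_map`, the choice of ReLU post-processing with identity activations, and the absence of any rescaling by the keep probability are assumptions of this sketch; how the mask is used across teacher and student is left to the caller.

```python
import numpy as np

def stochastic_weighted_map(activations, importance, drop_prob, rng):
    """Collapse channels after randomly masking importance weights.

    activations: (C, H, W) array; importance: (C,) channel weights.
    drop_prob:   probability of zeroing each channel weight independently.
    rng:         a numpy.random.Generator, passed in for reproducibility.
    Returns the (H, W) map and the boolean keep mask that was sampled.
    """
    activations = np.asarray(activations, dtype=float)
    importance = np.asarray(importance, dtype=float)
    keep_mask = rng.random(importance.shape[0]) >= drop_prob
    masked_weights = importance * keep_mask
    collapsed = np.einsum("c,chw->hw", masked_weights, activations)
    return np.maximum(collapsed, 0.0), keep_mask

rng = np.random.default_rng(0)
acts = np.array([[[1.0, 2.0]],
                 [[3.0, 4.0]]])
weights = np.array([1.0, 1.0])
full_map, full_mask = stochastic_weighted_map(acts, weights, 0.0, rng)
empty_map, empty_mask = stochastic_weighted_map(acts, weights, 1.0, rng)
```

With a drop probability of 0 the deterministic map is recovered, and with 1 every channel is dropped; intermediate values yield a different random subset of channels at each training step.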
4 Experiments
In this section, we evaluate the performance of our framework on three distinct methods of compression: unstructured pruning, structured pruning, and knowledge distillation. For the choice of attribution algorithm to evaluate the interpretability of the models, we use Grad-Cam Selvaraju et al. (2017) to generate the attribution map for a given data point, not only due to its simplicity and popularity but also due to its ability to detect important regions that reflect a model’s decision process (Appendix E). For each compression method, we compare the following four methods: naive fine-tuning, equally weighted collapsed activation map matching (3) (denoted as ‘EWA’), sensitivity-weighted activation map matching (4) (denoted as ‘SWA’) and its stochastic version (denoted as ‘SSWA’). We apply our matching regularizers on the last convolutional layer of a network. This is justified in the sense that the last convolutional layer conveys the most class-distinctive information. For sensitivity-weighted matching and its stochastic variant, we match only the attribution map generated from the top-1 prediction of the full network. This is due to the computational cost of calculating the Jacobian matrix with contemporary automatic differentiation libraries, which require a separate backpropagation step for each row of the Jacobian matrix. For the settings described above, we conduct extensive experiments on the Pascal VOC 2012 Everingham et al. (2012) multi-label classification dataset. Further details on experimental settings and evaluation metrics are provided in Appendix C.
Evaluating attribution maps
To the best of our knowledge, there is no commonly agreed metric to measure the deviation of one attribution map from another, due to the subjectiveness of attribution algorithms. To assess this as objectively as possible, we measure the degree of deformation in attribution maps with cosine similarity, a widespread metric for the similarity between two vectors. However, cosine similarity can only measure the directional similarity between two vectors, so the difference in intensity between two attribution maps is not captured. For this reason, we also measure the normalized distance between the attribution maps to capture the difference in intensities.
Since samples for which the model’s prediction is wrong are not ‘understood’ by the model, their attribution maps are likely to break down. Thus, if we evaluate the attribution performance on the entire test set, models with low predictive performance are naturally at a disadvantage. To compensate for this effect and compare the attributions of all models on the same ground, we only consider the samples that each model correctly predicted.
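The two deformation measures described above can be sketched for a pair of attribution maps as follows. The text does not spell out how the distance is normalized, so this sketch uses a plain Euclidean distance between flattened maps as an assumption; the function names and the `eps` guard are likewise illustrative.

```python
import numpy as np

def cosine_similarity(map_a, map_b, eps=1e-12):
    """Directional similarity between two maps, flattened to vectors."""
    a = np.ravel(np.asarray(map_a, dtype=float))
    b = np.ravel(np.asarray(map_b, dtype=float))
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def l2_distance(map_a, map_b):
    """Euclidean distance between two maps; sensitive to intensity."""
    a = np.ravel(np.asarray(map_a, dtype=float))
    b = np.ravel(np.asarray(map_b, dtype=float))
    return float(np.linalg.norm(a - b))

teacher_map = np.array([[1.0, 0.0], [0.0, 0.0]])
student_map = np.array([[3.0, 0.0], [0.0, 0.0]])  # same direction, stronger
similarity = cosine_similarity(teacher_map, student_map)
distance = l2_distance(teacher_map, student_map)
```

The example shows why both measures are reported: the two maps point in the same direction, so cosine similarity is (numerically) 1, while the distance still registers the difference in intensity.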
Comparison with ground truth segmentation labels
We also evaluate the effectiveness of our framework by comparing the absolute quality of the attribution maps. To this end, we evaluate the localization capability of the attribution maps by comparing them to ground truth segmentation labels, which is a widely used method to measure the soundness of attribution methods. We utilize the held-out 1,449 images with segmentation masks in the PASCAL VOC 2012 dataset. However, the segmentation maps lack the intensity information present in attribution maps. Thus, the heatmaps must be thresholded to be compared. Since the ground truth segmentation labels are imbalanced, the performance of localization is affected by the intensity threshold at which we create the weak localization maps. Thus we compute the ROC-AUC by varying the intensity threshold. We separate the segmentation masks associated with the ground truth labels, calculate the ROC-AUC value for each label, and calculate their average. Moreover, to evaluate how many samples are broken due to compression, we utilize the Point Accuracy Petsiuk et al. (2018), which counts whether the max value of the heatmap is inside the segmentation map.
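Both localization scores can be sketched for a single heatmap and binary segmentation mask. The ROC-AUC below is computed through its pairwise-ranking form, which equals the area under the curve traced by sweeping the intensity threshold; the quadratic-memory comparison is only suitable for small maps, and the function names are assumptions of this sketch.

```python
import numpy as np

def roc_auc(heatmap, mask):
    """ROC-AUC of pixel intensities against a binary segmentation mask.

    Equals the probability that a pixel inside the mask has a higher
    intensity than a pixel outside it, counting ties as one half.
    """
    scores = np.ravel(np.asarray(heatmap, dtype=float))
    labels = np.ravel(np.asarray(mask)).astype(bool)
    inside, outside = scores[labels], scores[~labels]
    if inside.size == 0 or outside.size == 0:
        raise ValueError("mask must contain both object and background pixels")
    diff = inside[:, None] - outside[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)

def point_hit(heatmap, mask):
    """True if the maximum of the heatmap falls inside the mask."""
    heatmap = np.asarray(heatmap, dtype=float)
    peak = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return bool(np.asarray(mask)[peak])

heat = np.array([[0.9, 0.1],
                 [0.8, 0.2]])
object_left = np.array([[1, 0], [1, 0]])
object_right = np.array([[0, 1], [0, 1]])
auc_left, hit_left = roc_auc(heat, object_left), point_hit(heat, object_left)
auc_right, hit_right = roc_auc(heat, object_right), point_hit(heat, object_right)
```

Point accuracy over a dataset would then be the fraction of samples for which `point_hit` is true, and the reported AUC the average of `roc_auc` over labels and samples.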
| Network | Method | mAP (Prediction Performance) | F1 Score (Prediction Performance) | AUC (Attribution Score) | Point Acc (Attribution Score) |
| VGG16 / 2 | KD | 83.75 | 65.92 | 82.53 | 72.01 |
| VGG16 / 4 | KD | 81.31 | 62.50 | 80.61 | 68.86 |
| VGG16 / 8 | KD | 76.91 | 52.51 | 78.74 | 67.26 |
| Network | Method | Cosine Similarity | Distance |
| VGG16 / 2 | KD | 0.705 | 29.84 |
| VGG16 / 4 | KD | 0.650 | 35.52 |
| VGG16 / 8 | KD | 0.563 | 44.10 |
4.1 Knowledge Distillation
For our experiments on knowledge distillation, we use the standard network distillation technique introduced in Hinton et al. (2015): we train a smaller student model using a linear combination of the typical cross-entropy loss with the ground truth label and the KL divergence between the teacher and student output logits. We use the VGG16 network Simonyan et al. (2013) and create smaller student versions of the VGG16 network by maintaining the overall architecture but reducing the number of channels for all layers. We prepare 3 students: one-half (VGG16/2), one-quarter (VGG16/4), and one-eighth (VGG16/8). The teacher network is first initialized with off-the-shelf ImageNet pretrained weights, then trained with the PASCAL VOC 2012 dataset. When the teacher’s training is complete, a randomly initialized student is trained with knowledge distillation. In Table 2 and Table 3, we list the results of knowledge distillation experiments. We observe that the network trained with our framework not only effectively preserves the attribution maps, but also consistently outperforms the network distilled without our method in terms of prediction performance, which is measured in mean average precision (mAP) and F1 score. This result is partly expected from the work of Zagoruyko and Komodakis (2017). We also observe that matching the sensitivity-weighted activation map outperforms the equally weighted one. We suspect that this gain is caused by the channel weighting scheme and by matching an activation map that is conditioned on a class rather than matching a class-degenerated map.
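The distillation loss described above can be sketched as a weighted sum of a cross-entropy term and a temperature-softened KL term. This sketch assumes a single-label softmax setting for simplicity, whereas the experiments use a multi-label dataset; the mixing weight `lam`, the temperature, and the function names are hypothetical values chosen for illustration, not the settings used in the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      lam=0.5, temperature=4.0, eps=1e-12):
    """(1 - lam) * cross-entropy + lam * T^2 * KL(teacher || student)."""
    labels = np.asarray(labels)
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(labels.size), labels] + eps).mean()
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = (q_teacher * (np.log(q_teacher + eps) - np.log(q_student + eps))).sum(axis=-1).mean()
    return float((1.0 - lam) * ce + lam * temperature ** 2 * kl)

labels = [0]
agree = distillation_loss([[0.0, 0.0]], [[0.0, 0.0]], labels)
disagree = distillation_loss([[0.0, 0.0]], [[5.0, -5.0]], labels)
```

When student and teacher logits agree, the KL term vanishes and only the scaled cross-entropy remains; the attribution matching regularizer would be added on top of this loss during student training.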
| Method | mAP (Prediction Performance) | F1 Score (Prediction Performance) | AUC (Attribution Score) | Point Acc (Attribution Score) |
4.2 Unstructured Pruning
We evaluate the performance of our method on networks pruned in an unstructured fashion. Unstructured pruning severs individual connections in the network to reduce the number of parameters, resulting in sparse weight matrices. We use the unstructured pruning method proposed in Han et al. (2015b) for network pruning. First, we initialize the full VGG16 network with off-the-shelf weights pretrained on ImageNet. Then a full network is trained on the PASCAL VOC 2012 dataset.
| Method | mAP (Prediction Performance) | F1 Score (Prediction Performance) | AUC (Attribution Score) | Point Acc (Attribution Score) |
After the training is complete, the weights of the network are sorted according to their magnitude and a desired amount of weights are pruned. We use pruning rate . After pruning is complete, the remaining sparse network is fine-tuned for 30 epochs on the same dataset. The whole process is then iterated 16 times to produce the final compressed network with pruning rate . Our matching regularizer was employed at all pruning iterations.
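One round of magnitude-based pruning, and the way a per-round rate compounds over repeated rounds, can be sketched as below. The rate of 0.5 and the two rounds in the example are hypothetical and are not the settings used in the experiments; fine-tuning between rounds is omitted, and the function names are assumptions of this sketch.

```python
import numpy as np

def magnitude_prune(weights, rate, mask=None):
    """Zero the `rate` fraction of surviving weights with smallest magnitude.

    mask marks weights that survived earlier rounds (all True if None).
    Returns the pruned weights and the updated boolean mask.
    """
    w = np.asarray(weights, dtype=float)
    mask = np.ones(w.shape, dtype=bool) if mask is None else np.array(mask, dtype=bool)
    alive = np.flatnonzero(mask)
    n_prune = int(round(rate * alive.size))
    if n_prune > 0:
        order = np.argsort(np.abs(w.flat[alive]), kind="stable")
        mask.flat[alive[order[:n_prune]]] = False
    return np.where(mask, w, 0.0), mask

def cumulative_sparsity(rate, iterations):
    """Fraction of weights removed after repeating a per-round rate."""
    return 1.0 - (1.0 - rate) ** iterations

w = np.array([0.1, -0.5, 0.3, -0.05])
round1, mask1 = magnitude_prune(w, 0.5)            # hypothetical rate
round2, mask2 = magnitude_prune(round1, 0.5, mask1)
final_sparsity = cumulative_sparsity(0.5, 2)
```

In the iterative procedure, each round would be followed by fine-tuning with the mask held fixed, which is where the matching regularizer enters.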
4.3 Structured Pruning
For structured pruning, we use the method proposed in Li et al. (2016), in which whole filters are pruned according to the magnitude of each filter’s norm. The general flow of the experiment is similar to other methods. We use the same ImageNet-initialized VGG16 to train the full network. We use channel pruning rate . For structured pruning, we do not iterate the pruning cycle but execute the process a single time (one-shot pruning), so we set a higher pruning rate. The results of structured pruning experiments are summarized in Table 6 and Table 7. We observe similar tendencies.
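One-shot filter pruning by norm can be sketched as ranking the output filters of a convolutional kernel and keeping the largest ones. The use of the $\ell_1$ norm, the `(C_out, C_in, kH, kW)` kernel layout, the function name, and the rate of 0.5 are assumptions of this sketch.

```python
import numpy as np

def prune_filters(kernel, rate):
    """Remove the `rate` fraction of output filters with smallest L1 norm.

    kernel: (C_out, C_in, kH, kW) convolution weights (assumed layout).
    Returns the smaller dense kernel and the indices of the kept filters.
    """
    kernel = np.asarray(kernel, dtype=float)
    c_out = kernel.shape[0]
    norms = np.abs(kernel).reshape(c_out, -1).sum(axis=1)
    n_keep = c_out - int(round(rate * c_out))
    keep = np.sort(np.argsort(-norms, kind="stable")[:n_keep])
    return kernel[keep], keep

kernel = np.array([1.0, -4.0, 0.5, 3.0]).reshape(4, 1, 1, 1)
pruned, kept = prune_filters(kernel, 0.5)  # hypothetical rate
```

Unlike unstructured pruning, the result is a smaller dense tensor. In a full network, the matching input channels of the following layer would also have to be removed using the returned indices.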
4.4 Qualitative Evaluation of Attribution Maps
Aside from the quantitative assessment done in previous sections, we also conduct a qualitative assessment of the attribution maps. We draw and examine the attribution maps produced by structure-pruned networks trained with naive fine-tuning and SSWA with respect to the map of the full network. To this end, we select images from the samples for which all methods predicted the correct label. The images are shown in Figure 4. We observe that even though the predictions of the networks are all correct, the quality of attribution maps produced by the compressed networks with respect to the full network varies. We see that the attribution maps produced by our method most resemble the maps of the teacher network.
4.5 Effects on Other Attribution Methods
| AUC / Point Acc | Grad-Cam | Excitation BP | RAP |
In the sections above, we observed the effectiveness of our method using Grad-Cam. In this section, in addition to Grad-Cam, we observe how maps produced by other attribution methods are deformed by compression and remedied by our method. We calculate the ROC-AUC and point accuracy of other attribution maps including Excitation Backprop Zhang et al. (2016), LRP Montavon et al. (2018), and RAP Nam et al. (2019) for knowledge distillation with VGG16/8. The experimental setting is identical to that of Section 4.1. As described in Table 8, we observe that the maps of the three attribution methods are indeed deformed when compression is performed, and exhibit inferior point accuracy and ROC-AUC performance compared to the network before compression. Moreover, we observe that even though SSWA utilizes gradient-based attribution maps akin to Grad-Cam, employing this regularizer helps to preserve other attribution methods, including non-differentiable ones Zhang et al. (2016); Montavon et al. (2018); Nam et al. (2019). This is partly expected as the decision-critical regions of an input are indeed reflected in Grad-Cam maps (Appendix E). Thus, if any other attribution method is indeed trying to reveal the decisive regions, it is bound to show regions similar to those of Grad-Cam.
5 Conclusion
In this work, we raise the problem of attribution preservation in compressed deep neural networks, based on the observation that compression techniques significantly alter the generated attributions. To this end, we propose our attribution map matching framework, which effectively and efficiently enforces the attribution maps of the compressed networks to be the same as those of the full networks. We validate our method through extensive experiments on benchmark datasets. The results show that our framework not only preserves the interpretation of the original networks but also yields significant performance gains over the model without attribution preservation.
Broader Impact
In this paper, we brought up the attribution deformation problem in compressed networks and proposed a novel method to combat this issue. As discussed in Section 1, we believe that people trying to deploy deep learning models in safety-critical fields must be aware of this finding to ensure the reliability and trustworthiness of the system. To this end, we may think of a possible scenario. Suppose that a CNN classifier vision module trained with our matching regularizer is utilized in a self-driving system. In case of an accident, we may inspect the records of the deep learning module to learn the decision that caused the accident. In this situation, the model trained with our regularizer will provide more accurate attribution, leading to a cleaner and more just assessment.
However, the sense of attributional safety presented by our method can give a false sense of security and blind trust towards the system and its interpretations, while the system is by no means flawless. For example, a wearable health monitor might predict a person to be healthy, and provide its supporting explanations. If these explanations are blindly trusted while they are wrong underneath the surface, the user might take reactive measures that are ultimately harmful to themselves.
Acknowledgments
This work was supported by the National Research Foundation of Korea (NRF) grants (No.2018R1A5A1059921, No.2019R1C1C1009192), Institute of Information & Communications Technology Planning & Evaluation (IITP) grants (No.2017-0-01779, A machine learning and statistical inference framework for explainable artificial intelligence, No.2019-0-01371, Development of brain-inspired AI with human-like intelligence, No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)) funded by the Korea government (MSIT). This work is also supported by Samsung Advanced Institute of Technology (SAIT).
Supplementary Material: Attribution Preservation in Network Compression for Reliable Network Interpretation
Appendix A Deformation of Other Attribution Methods
Here, we observe the deformation of attribution methods other than Grad-Cam under several compression methods. As in the main paper, we calculate the ROC-AUC and the localization accuracy (point accuracy) of attribution maps, including Excitation Backprop, LRP, and RAP. The AUC denotes the degree of overlap between the ground truth segmentation and the attribution map. Point accuracy measures whether the maximum value of the heatmap lies inside the segmentation map. For a fair evaluation, only the samples that the network predicted correctly are counted. As shown in Table 9 and Table 10, all attribution methods are deformed when compression is performed, and both point accuracy and ROC-AUC degrade relative to the scores before compression.
In the main paper, we showed that our attribution matching regularizer partially preserves other non-differentiable attribution maps even though the matching is performed only on differentiable maps such as Grad-Cam. We leave the task of fully preserving various attribution methods for future work.
[Table: ROC-AUC of attribution maps. Columns: Params, Grad-Cam, Excitation Bp, RAP.]
[Table: Point accuracy of attribution maps. Columns: Params, Grad-Cam, Excitation Bp, RAP.]
Appendix B Experiments on ImageNet
In addition to the PASCAL VOC 2012 experiments in Section 4 of the main text, we report the results of similar experiments on the ImageNet dataset. The general outline of the experiments is kept identical to the PASCAL VOC 2012 experiments except for a few modifications. Since several prior works report that knowledge distillation for the ImageNet-1000 classification task is notoriously difficult [2, 32], we omit the distillation experiment and evaluate our framework on two compression methods: unstructured pruning and structured pruning. In Section 4, we measured the ROC-AUC of the attribution maps with respect to ground truth segmentation labels. For the following ImageNet experiments, we use the segmentation labels provided by the segmentation propagation work on ImageNet (2012). This data provides ground truth segmentation labels for 4,276 images extracted from ImageNet. However, the classification labels of these images are drawn from the full set of ImageNet classes rather than from the ImageNet-1000 task, so the class labels are unusable. Thus, we cannot exclude the scores of samples that the models misclassified. Instead, we generate the attribution map of the model's top-1 prediction for every sample and compare it to the ground truth segmentation labels.
B.1 Unstructured Pruning
We conduct experiments on unstructured pruning. For this experiment, we use a one-shot pruning pipeline instead of iterative pruning due to the computational cost of repeatedly fine-tuning on ImageNet. In the fine-tuning phase, the pruned network is fine-tuned for 10 epochs with batch size 180. We report results for two pruning rates. In both cases, our method preserves the attribution maps better than the naively compressed network (Table 11). However, the gaps on all metrics are smaller than in the PASCAL VOC 2012 experiment. We suspect that this is because localization is relatively easy on ImageNet. In most ImageNet samples, a single main object is centered in the image, so the network usually only has to focus on the center of the image. Maintaining that focus under compression is a relatively easy task.
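The one-shot step described above can be illustrated with a minimal sketch. It assumes magnitude-based pruning (the smallest-magnitude weights are zeroed once, after which the network would be fine-tuned); the function name `magnitude_prune`, the flat weight list, and the rate of 0.5 are hypothetical choices for illustration and are not taken from our experiments.

```python
def magnitude_prune(weights, rate):
    """Zero out the `rate` fraction of weights with the smallest magnitude.

    `weights` is a flat list of floats. Returns (pruned_weights, mask),
    where mask[i] == 0 marks a pruned weight. Illustrative sketch only.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    n_prune = int(len(weights) * rate)
    # Indices sorted by ascending absolute value; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    mask = [0 if i in pruned else 1 for i in range(len(weights))]
    pruned_weights = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned_weights, mask


# Hypothetical weights and pruning rate, chosen only for the example.
weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.4]
pruned_weights, mask = magnitude_prune(weights, rate=0.5)
print(pruned_weights)
print(mask)
```

In a one-shot pipeline the mask is computed once and held fixed during fine-tuning, whereas an iterative pipeline would repeat the prune and fine-tune cycle several times.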
B.2 Structured Pruning
We conduct experiments on structured pruning on ImageNet. For these experiments, we use ResNet34 instead of VGG16 due to computational constraints. The channel pruning rate is chosen in view of the difficulty of the ImageNet classification task. After pruning, the network is fine-tuned for 20 epochs. We observe the same tendencies in the results (Table 12): our method outperforms naive compression in terms of maintaining the attribution maps.
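As a rough illustration of channel-level pruning, the sketch below ranks convolutional filters by the L1 norm of their weights and drops the lowest-ranked ones. The L1 criterion is an assumption made for this example, and the function name `select_filters_to_keep`, the toy filters, and the rate of 0.5 are hypothetical.

```python
def select_filters_to_keep(filters, prune_rate):
    """Return the indices of filters kept after dropping the lowest-L1-norm ones.

    `filters` is a list of flat weight lists, one per output channel.
    Illustrative sketch only; the ranking criterion is an assumption.
    """
    if not 0.0 <= prune_rate <= 1.0:
        raise ValueError("prune_rate must lie in [0, 1]")
    n_prune = int(len(filters) * prune_rate)
    l1_norms = [sum(abs(w) for w in f) for f in filters]
    # Filters sorted by ascending L1 norm; the first n_prune are dropped.
    ranked = sorted(range(len(filters)), key=lambda i: l1_norms[i])
    dropped = set(ranked[:n_prune])
    return [i for i in range(len(filters)) if i not in dropped]


# Four hypothetical filters with three weights each.
filters = [
    [0.5, -0.5, 0.2],     # L1 norm 1.2
    [0.01, 0.02, -0.01],  # L1 norm 0.04
    [1.0, -1.0, 1.0],     # L1 norm 3.0
    [0.1, 0.0, -0.1],     # L1 norm 0.2
]
kept = select_filters_to_keep(filters, prune_rate=0.5)
print(kept)
```

Unlike unstructured pruning, removing whole filters shrinks the layer itself, so the next layer's corresponding input channels would be removed as well.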
Appendix C Experimental Details For the PASCAL VOC 2012 Experiments
We used the PASCAL VOC 2012 multi-label classification dataset, which consists of 5,717 training and 5,823 validation high-resolution images. Among the validation samples, we use the 1,449 held-out images with segmentation masks for localization evaluation. The dataset can be downloaded from the following link: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/ .
We normalize the input by its mean and standard deviation. For data augmentation, we use the random resized crop and random horizontal flip provided by Torchvision and PyTorch.
For the CNN implementation, we used the vgg16_bn implementation provided by Torchvision. To train the full network (teacher), we used stochastic gradient descent (SGD) with learning rate 0.1 and momentum 0.9, with weight decay applied. We trained the model with batch size 128 for 250 epochs. For distillation experiments, we used SGD with learning rate 0.1 and momentum 0.9, again with weight decay. We trained the models for 350 epochs with batch size 64. For unstructured pruning, we used SGD with momentum and weight decay. We trained the models for 16 pruning iterations, where a single iteration consists of 30 epochs, with a batch size of 64. For structured pruning, a one-shot pruning scheme of 60 epochs was used. The optimizer hyperparameters and batch size are identical to those for unstructured pruning. We used a regularizer strength of 100 for EWA and 50 for SWA and SSWA across all compression methods.
Apparatus and Runtime.
Our experiments on PASCAL took around 100 seconds per epoch on a single machine equipped with two Intel(R) Xeon(R) E5-2630 v4 CPUs and four NVIDIA GeForce TITAN Xp graphics cards.
Given a pair of attribution maps from before and after compression, the cosine similarity is computed as the inner product of the two flattened maps divided by the product of their L2 norms.
The normalized distance between the same pair of attribution maps is evaluated as well.
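The two map-to-map metrics can be sketched as follows. The cosine similarity follows its standard definition. For the normalized distance we assume, for illustration only, the L2 distance between the two maps after each is scaled to unit L2 norm; the function names and the toy maps are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def flatten(attribution_map):
    """Flatten a 2-D map (list of rows) into a single list."""
    return [v for row in attribution_map for v in row]


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def cosine_similarity(map_before, map_after):
    """Inner product of the flattened maps divided by the product of their L2 norms."""
    a, b = flatten(map_before), flatten(map_after)
    denom = l2_norm(a) * l2_norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero map")
    return sum(x * y for x, y in zip(a, b)) / denom


def normalized_l2_distance(map_before, map_after):
    """L2 distance between unit-normalized maps (an assumed definition)."""
    a, b = flatten(map_before), flatten(map_after)
    norm_a, norm_b = l2_norm(a), l2_norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot normalize an all-zero map")
    return l2_norm([x / norm_a - y / norm_b for x, y in zip(a, b)])


# Toy 2x2 maps: `scaled` is a rescaled copy of `before`, `shifted` is orthogonal to it.
before = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[2.0, 0.0], [0.0, 2.0]]
shifted = [[0.0, 1.0], [1.0, 0.0]]

print(cosine_similarity(before, scaled), normalized_l2_distance(before, scaled))
print(cosine_similarity(before, shifted), normalized_l2_distance(before, shifted))
```

Under this assumed definition the two metrics are tied together: for unit-normalized maps, the squared distance equals 2 minus twice the cosine similarity, so both are invariant to a positive rescaling of either map.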
To evaluate against ground truth segmentation labels, we use ROC-AUC and the point accuracy provided by the pointing game. Since segmentation labels are given as 0’s and 1’s, the quality of an attribution map can be evaluated as a binary classification task. To this end, we normalize the attribution maps to take values within the interval [0, 1] and apply a decision threshold to obtain pixel-wise predictions. Repeating this process with different thresholds produces a ROC curve, and we report the area under this curve (AUC). The pointing game accuracy is measured in the following manner: if the spatial location of the maximum value of an attribution map lies within the segmentation mask, it is a hit; otherwise, it is a miss. This process is repeated and averaged over the test samples.
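The evaluation against segmentation masks can be sketched as below. Min-max scaling is assumed for the normalization step, the ROC curve is traced by using every distinct score as a threshold, and ties for the maximum in the pointing game are resolved by taking the first occurrence; all function names and the toy data are hypothetical.

```python
def flatten(grid):
    return [v for row in grid for v in row]


def min_max_normalize(values):
    """Scale values into [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def roc_auc(scores, labels):
    """Area under the ROC curve, sweeping one threshold per distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    # Trapezoidal integration over consecutive ROC points.
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def pointing_game_hit(attribution_map, mask):
    """True if the location of the maximum attribution lies inside the mask."""
    best_val, best_rc = None, None
    for r, row in enumerate(attribution_map):
        for c, v in enumerate(row):
            if best_val is None or v > best_val:
                best_val, best_rc = v, (r, c)
    return mask[best_rc[0]][best_rc[1]] == 1


# Toy 3x3 attribution map and binary segmentation mask.
attribution = [[0.1, 0.2, 0.0], [0.3, 0.9, 0.4], [0.0, 0.5, 0.1]]
mask = [[0, 0, 0], [0, 1, 1], [0, 1, 0]]

auc = roc_auc(min_max_normalize(flatten(attribution)), flatten(mask))
hit = pointing_game_hit(attribution, mask)
print(auc, hit)
```

The pointing game accuracy over a test set is then the fraction of samples for which `pointing_game_hit` returns true.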
Appendix D More Examples
Below, we provide visualizations of attribution maps for additional samples for extended qualitative assessment.
Appendix E Validation of Grad-Cam Maps as a Means to Measure Attribution Quality
Here, we conduct additional experiments to ascertain Grad-Cam’s capability to extract regions that the model deems important. We additionally measure a perturbation metric, RemOve-And-Retrain (ROAR), to evaluate how well the attribution maps from compressed networks explain the model behavior. To measure ROAR, attribution maps for the entire training data are extracted from the network under test. Then, the top-ranked pixels of each image, as ranked by the attribution map, are removed. Finally, a separate classifier is retrained on this perturbed dataset. If the attribution map accurately represents the importance of the pixels, the retrained classifier should exhibit lower predictive performance. We measure this metric on the full network, a naively distilled network, and a network trained with our method, with random attribution as a baseline. (a) As shown in Figure 7, all Grad-Cam perturbations (from the different models) lowered the F1 score more than random perturbations, which verifies that Grad-Cam indeed reflects a model’s decision-making process. (b) The student trained with our method scored almost on par with the full network. This indicates that the attributions (which reflect a model’s decision process) are indeed preserved by our method.
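The pixel-removal step of ROAR can be sketched as follows; the retraining step is omitted. The removed fraction, the fill value of 0.0, and the function name `remove_top_pixels` are assumptions made for illustration, and a single-channel image stored as nested lists stands in for a real input tensor.

```python
def remove_top_pixels(image, attribution_map, fraction, fill_value=0.0):
    """Replace the `fraction` of pixels with the highest attribution by `fill_value`.

    `image` and `attribution_map` are 2-D lists of the same shape.
    Illustrative sketch only; the fill value is an assumption.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    height, width = len(image), len(image[0])
    coords = [(r, c) for r in range(height) for c in range(width)]
    n_remove = int(len(coords) * fraction)
    # Pixel coordinates sorted by descending attribution value.
    ranked = sorted(coords, key=lambda rc: attribution_map[rc[0]][rc[1]], reverse=True)
    removed = set(ranked[:n_remove])
    return [[fill_value if (r, c) in removed else image[r][c]
             for c in range(width)]
            for r in range(height)]


# Toy 2x2 image and attribution map.
image = [[10, 20], [30, 40]]
attribution = [[0.1, 0.9], [0.5, 0.2]]
perturbed = remove_top_pixels(image, attribution, fraction=0.5)
print(perturbed)
```

Applying this to every training image yields the perturbed dataset on which the separate classifier is retrained; the random baseline corresponds to passing a random map in place of `attribution`.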
References
- (2019) Variational information distillation for knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9163–9171.
- (2019) On the efficacy of knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4794–4802.
- (2009) Visualizing higher-layer features of a deep network. University of Montreal.
- (2012) The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Note: http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
- (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations.
- (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
- (1994) Optimal brain surgeon: extensions and performance comparisons. In Advances in Neural Information Processing Systems, pp. 263–270.
- (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- (2019) A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems, pp. 9737–9748.
- (2016) Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250.
- (2019) Learning what and where to transfer. In Proceedings of the 36th International Conference on Machine Learning.
- (2015) Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583.
- (2012) Segmentation propagation in ImageNet. In European Conference on Computer Vision, pp. 459–473.
- (1990) Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
- (2017) Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pp. 4652–4662.
- (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
- (2020) Prediction of radiation pneumonitis with dose distribution: a convolutional neural network (CNN) based model. Frontiers in Oncology 9, pp. 1500.
- (2018) Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73, pp. 1–15.
- (2019) Relative attributing propagation: interpreting the comparative contributions of individual units in deep neural networks.
- (2017) Automatic differentiation in PyTorch.
- (2018) RISE: randomized input sampling for explanation of black-box models. In Proceedings of the British Machine Vision Conference (BMVC).
- (2014) FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550.
- (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
- (2017) Adversarial dropout regularization. arXiv preprint arXiv:1711.01575.
- (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
- (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
- (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
- (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3319–3328.
- (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082.
- (2019) Current status and future trends of clinical diagnoses via image-based deep learning. Theranostics 9 (25), pp. 7556.
- (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations.
- (2016) Top-down neural attention by excitation backprop. In European Conference on Computer Vision.
- (2015) Learning deep features for discriminative localization. CoRR abs/1512.04150.