Understanding Individual Neuron Importance Using Information Theory

This work was supported by the German Federal Ministry of Education and Research in the framework of the Alexander von Humboldt-Professorship. The work of Bernhard C. Geiger has partly been funded by the Erwin Schrödinger Fellowship J 3765 of the Austrian Science Fund. The Know-Center is funded within the Austrian COMET Program - Competence Centers for Excellent Technologies - under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Digital and Economic Affairs, and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.
In this work, we investigate the use of three information-theoretic quantities – entropy, mutual information with the class variable, and a class selectivity measure based on Kullback-Leibler divergence – to study the behavior of trained fully-connected feed-forward neural networks. We analyze the connection between these information-theoretic quantities and classification performance on the test set by cumulatively ablating neurons in networks trained on MNIST, FashionMNIST, and CIFAR-10. Our results parallel those recently published by Morcos et al., indicating that class selectivity is not a good indicator for classification performance. However, looking at individual layers separately, both mutual information and class selectivity are positively correlated with classification performance, at least for networks with ReLU activation functions. We provide explanations for this phenomenon and conclude that it is ill-advised to compare the proposed information-theoretic quantities across layers. Finally, we briefly discuss future prospects of employing information-theoretic quantities for different purposes, including neuron pruning and studying the effect that different regularizers and architectures have on the trained neural network. We also draw connections to the information bottleneck theory of neural networks.
Recent years have seen an increased effort in explaining the success of deep neural networks (NNs) along the lines of several, sometimes controversial, hypotheses. One of these hypotheses suggests that NNs with good generalization performance do not rely on single directions, i.e., the removal of individual neurons has little effect on classification error, and that highly class-selective neurons (especially in shallow layers) may even harm generalization performance .
The claim that class-selectivity is a poor indicator for classification performance has been questioned since its introduction (see Section 2). For example,  grants that the effect of class-selective neurons on overall classification performance is minor, but shows that it can be very large for individual classes. The authors of  showed that some neuron outputs represent features relevant for multiple classes while others represent class-specific features, that ablating both types of neurons can lead to drops in overall classification performance, that certain layers appear to be particularly important for certain classes, and that some layers exhibit redundancy. Finally,  considered feature- rather than class-selectivity and showed that ablating highly orientation-selective neurons from shallow layers harms classification performance. The interplay between individual neurons, redundancy, class specificity, and classification performance thus seems to be more intricate than expected.
Our work complements [22, 11, 19] and provides yet another perspective on the results of . We propose information-theoretic quantities to measure the importance of individual neurons for the classification task. Specifically, we investigate how the variability, class selectivity, and class information of a neuron output (Section 4) connect with classification performance when said neuron is ablated (Section 5). Our experiments rely on fully-connected feed-forward NNs trained on the MNIST, FashionMNIST, and CIFAR-10 datasets, as these datasets have evolved into benchmarks for which the results are easy to understand intuitively. We show that neither class selectivity nor class information are good performance indicators when ablating neurons across layers, thus confirming the results in . However, we observe 1) that class information and class selectivity values differ greatly from layer to layer and 2) that for NNs with ReLU activation functions, class information and class selectivity are positively correlated with classification performance when cumulative ablation is performed separately for each layer. Our results are not in contrast with those in , but complement them and can be reconciled with them with reference to Simpson’s paradox. In Section 6, we briefly discuss the implications of our findings on neuron pruning, a recent trend for reducing the complexity of NNs [7, 17, 10, 8, 12].
Of course, quantities computed from individual neuron outputs are not capable of drawing a complete picture. In Section 6 we briefly discuss scenarios in which such a picture is greatly misleading and outline ideas for how this shortcoming can be removed. Specifically, we believe that partial information decomposition [20, 14], a decomposition of mutual information into unique, redundant, and synergistic contributions, can be used to consolidate this work and  with the works in the spirit of the information bottleneck principle [16, 15, 1].
2 Related Work
Morcos et al.  studied the dependence of NN classification performance on the output of individual neurons (“single direction”) via ablation analyses. They showed that NNs trained on permuted labels are more sensitive to cumulative ablation, indicating a stronger reliance on individual neurons. In contrast, NNs with better generalization performance are more robust against ablation. Computing the class selectivity of each neuron, they showed that there is little correlation between this quantity and the performance drop of the NN when said neuron is ablated. They therefore conclude that class selectivity (and the mutual information between the neuron output and the class variable) is a poor indicator of NN performance.
The authors of  provide a different perspective on the results in  by showing that, although the effect of ablating individual neurons on overall classification performance is small, the effect on the classification of individual classes can be large. For example, ablating the ten most informative neurons from the fifth convolutional layer of AlexNet for a given class of the Places dataset makes its detection probability drop by more than 40%, on average. Similar results were shown in , where the authors performed individual and pairwise ablations in a NN with two hidden layers trained on MNIST. They showed that some neurons encode general features, affecting overall classification accuracy strongly, while some neurons encode class-specific features, the ablation of which has less (but still noticeable) effect on the classification performance. Both  and  observed that ablating neurons with class-specific features can have a positive effect on the detection of an unrelated class, suggesting implications for targeted weight pruning. Meyes et al.  further discovered that pairwise ablation often has a stronger effect than the summed effect of ablating individual neurons, indicating that intermediate layers exhibit redundancy in some cases. Ablating a certain fraction of filters in different layers of a VGG-19 trained on ImageNet showed that different layers have different sensitivity to ablation, and that this sensitivity is also class-dependent, i.e., some classes suffer more from ablating filters in a given layer than others . The authors of  performed ablation analysis in LSTM neural language models and LSTMs trained for machine translation. They ranked neurons via their linguistic importance by training a logistic classifier on neuron outputs and observing the weight the classifier places on a given neuron.
They discovered that certain linguistic properties are represented by few neurons, while other properties are highly distributed, and that ablating a fixed fraction of linguistically important neurons harms part-of-speech tagging and semantic tagging tasks more strongly than ablating the same fraction of linguistically unimportant neurons.
That class-selectivity increases towards deeper layers was observed in [13, 21]. More specifically, the authors of  claim that deeper layers in a CNN are specific for a single class. This allows distilling so-called critical paths by retaining only these class-specific neurons in the deeper layers to obtain a CNN for a one-vs-all classification task. In contrast to class-selectivity, orientation-selectivity, a special kind of feature-selectivity in convolutional NNs, appears to occur in layers at different depths . Ablating these orientation-selective filters in shallow layers harms classification performance more than ablating unselective filters, and ablating filters in deeper layers has little overall effect [19, Fig. 5].
3 Setup and Preliminaries
We consider classification via fully-connected feed-forward NNs, i.e., the task of assigning a data sample $x \in \mathcal{X}$ to a class in $\mathcal{Y} = \{1, \dots, K\}$, $K \ge 2$. We assume that the parameters of the NN have been learned from the labeled training set. We moreover assume that we have access to a labeled validation set that was left out during training. We denote this dataset by $\mathcal{V} = \{(x_n, c_n)\}_{n=1}^{N}$, in which $x_n$ is the $n$-th data sample and $c_n$ the corresponding class label. We assume that $N$ is large.
Let $y_i^{(\ell)}(x)$ denote the output of the $i$-th neuron in the $\ell$-th layer of the NN if $x$ is the data sample at the input. With $w_{i,j}^{(\ell)}$ denoting the weight connecting the $j$-th neuron in the $(\ell-1)$-th layer to the $i$-th neuron in the $\ell$-th layer, $b_i^{(\ell)}$ denoting the bias term of the $i$-th neuron in the $\ell$-th layer, and $\sigma$ denoting an activation function, we obtain $y_i^{(\ell)}(x)$ by setting
$$ y_i^{(\ell)}(x) = \sigma\bigg( b_i^{(\ell)} + \sum_{j} w_{i,j}^{(\ell)}\, y_j^{(\ell-1)}(x) \bigg) \qquad (1)$$
and by setting $y_i^{(0)}(x)$ to the $i$-th coordinate of $x$. The output of the network is a softmax layer with $K$ neurons, each corresponding to one of the $K$ classes.
We assume that the readers are familiar with information-theoretic quantities such as entropy, mutual information and Kullback-Leibler (KL) divergence, cf. [3, Ch. 2]. To be able to use such quantities to measure the importance of individual neurons in the NN, we treat class labels, data samples, and neuron outputs as random variables (RVs). To this end, let $Q\colon \mathbb{R} \to \mathcal{Q}$ be a quantizer that maps neuron outputs to a finite set $\mathcal{Q}$. Now let $C$ be a RV over the set $\mathcal{Y}$ of classes and $\hat{Y}_i^{(\ell)}$ a RV over $\mathcal{Q}$, corresponding to the quantized output of the $i$-th neuron in the $\ell$-th layer. We define the joint distribution of $\hat{Y}_i^{(\ell)}$ and $C$ via the joint frequencies of $\big(Q(y_i^{(\ell)}(x_n)), c_n\big)$ in the validation set, i.e.,
$$ P\big(\hat{Y}_i^{(\ell)} = q, C = c\big) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\big[Q(y_i^{(\ell)}(x_n)) = q\big]\, \mathbb{1}[c_n = c] \qquad (2)$$
where $\mathbb{1}[\cdot]$ is the indicator function. The assumptions that $N$ is large and that $|\mathcal{Q}|$ is small obviate the need for more sophisticated estimators for the distribution of $\big(\hat{Y}_i^{(\ell)}, C\big)$, such as Laplacian smoothing.
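This frequency estimator amounts to a single counting pass over the validation set. A minimal sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def joint_distribution(quantized_outputs, labels, num_bins, num_classes):
    """Estimate the joint distribution of a quantized neuron output and the
    class variable by joint relative frequencies on the validation set."""
    joint = np.zeros((num_bins, num_classes))
    for q, c in zip(quantized_outputs, labels):
        joint[q, c] += 1.0
    return joint / len(labels)

# Toy check: four samples, two quantization bins, two classes.
P = joint_distribution([0, 1, 1, 0], [0, 1, 0, 0], num_bins=2, num_classes=2)
# P sums to one and matches the joint frequencies of the four samples.
```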
4 Information-Theoretic Quantities for Measuring Individual Neuron Importance
In this section we propose information-theoretic quantities as candidate importance measures for neurons in a NN; as we show in Appendix F.7, each of these measures can be computed from the validation set with a complexity of . A selection of additional information-theoretic candidate measures is available in Appendix F, together with a discussion of relationships between them.
Entropy. Entropy quantifies the uncertainty of a RV. In the context of a NN, the entropy
$$ H\big(\hat{Y}_i^{(\ell)}\big) = -\sum_{q \in \mathcal{Q}} P\big(\hat{Y}_i^{(\ell)} = q\big) \log P\big(\hat{Y}_i^{(\ell)} = q\big) \qquad (3)$$
has been proposed as an importance function for pruning in . Specifically, the entropy indicates whether the neuron output varies enough to fall into different quantization bins for different data samples.
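As a sketch, the entropy can be computed directly from the empirical marginal distribution of the quantized output (implementation ours):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector p, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A neuron whose one-bit quantized output is constant carries zero entropy;
# one that falls into both bins equally often carries one bit.
```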
Mutual Information. While small or zero entropy of a neuron output suggests that it has little influence on classification performance, the converse is not true, i.e., a large value of $H\big(\hat{Y}_i^{(\ell)}\big)$ does not imply that the neuron is important for classification. Indeed, the neuron may capture a highly varying feature of the input that is irrelevant for classification. As a remedy, we consider the mutual information between the neuron output and the class variable, i.e.,
$$ I\big(\hat{Y}_i^{(\ell)}; C\big) = \sum_{q \in \mathcal{Q}} \sum_{c \in \mathcal{Y}} P\big(\hat{Y}_i^{(\ell)} = q, C = c\big) \log \frac{P\big(\hat{Y}_i^{(\ell)} = q, C = c\big)}{P\big(\hat{Y}_i^{(\ell)} = q\big) P(C = c)}. \qquad (4)$$
This quantity measures how the knowledge of $\hat{Y}_i^{(\ell)}$ helps predicting $C$, was used to characterize neuron importance in, e.g., , and appears in corresponding classification error bounds . It can be shown that neurons with small $H\big(\hat{Y}_i^{(\ell)}\big)$ also have small $I\big(\hat{Y}_i^{(\ell)}; C\big)$, cf. [3, Th. 2.6.5].
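A corresponding sketch that evaluates the mutual information from the empirical joint table (implementation ours):

```python
import numpy as np

def mutual_information(joint):
    """Mutual information in bits between the quantized neuron output and the
    class variable, from a joint probability table of shape (bins, classes)."""
    joint = np.asarray(joint, dtype=float)
    py = joint.sum(axis=1, keepdims=True)   # marginal of the quantized output
    pc = joint.sum(axis=0, keepdims=True)   # marginal of the class variable
    mask = joint > 0                        # skip zero entries (0 log 0 := 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (py @ pc)[mask])))

# A one-bit neuron perfectly aligned with a binary class carries one bit;
# a neuron independent of the class carries zero bits.
```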
Kullback-Leibler Selectivity. It has been observed that, especially at deeper layers, the activity of individual neurons may distinguish one class from all others. Mathematically, for such a neuron there exists a class $c$ such that the class-conditional distribution $P_{\hat{Y}_i^{(\ell)}|C=c}$ differs significantly from the marginal distribution $P_{\hat{Y}_i^{(\ell)}}$, i.e., the specific information (cf. ) $D\big(P_{\hat{Y}_i^{(\ell)}|C=c} \,\big\|\, P_{\hat{Y}_i^{(\ell)}}\big)$ is large. Neurons with a large specific information for at least one class may be useful for classification, but may nevertheless be characterized by low entropy and low mutual information, especially if the number of classes is large. We therefore propose the maximum specific information over all classes as a measure of neuron importance:
$$ \max_{c \in \mathcal{Y}} D\big(P_{\hat{Y}_i^{(\ell)}|C=c} \,\big\|\, P_{\hat{Y}_i^{(\ell)}}\big). \qquad (5)$$
This quantity is high for neurons that activate differently for exactly one class and can thus be seen as an information-theoretic counterpart of the selectivity measure used in . We thus call the quantity defined in (5) Kullback-Leibler selectivity. Specifically, KL selectivity is maximized if all data samples of a specific class label are mapped to one value of $\hat{Y}_i^{(\ell)}$ and all the other data samples (corresponding to other class labels) are mapped to other values of $\hat{Y}_i^{(\ell)}$. In this case, $\hat{Y}_i^{(\ell)}$ can be used to distinguish this class label from the rest. KL selectivity is an upper bound on mutual information and is zero if and only if the mutual information is zero (see Appendix F.3).
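A sketch of this quantity, together with a toy case illustrating the point above that a neuron can be highly selective for one of many classes while carrying comparatively little mutual information (implementation and example ours):

```python
import numpy as np

def kl_selectivity(joint):
    """KL selectivity in bits: the maximum over classes c of the divergence
    between P(Y_hat | C = c) and the marginal P(Y_hat)."""
    joint = np.asarray(joint, dtype=float)
    py = joint.sum(axis=1)                  # marginal of the quantized output
    pc = joint.sum(axis=0)                  # marginal of the class variable
    divergences = []
    for c in range(joint.shape[1]):
        cond = joint[:, c] / pc[c]          # P(Y_hat | C = c)
        mask = cond > 0
        divergences.append(float(np.sum(cond[mask] * np.log2(cond[mask] / py[mask]))))
    return max(divergences)

# Ten equiprobable classes; a one-bit neuron fires only for class 0.
# Its KL selectivity is log2(10), about 3.32 bits, while its mutual
# information with the class variable stays well below one bit.
joint = np.zeros((2, 10))
joint[1, 0] = 0.1
joint[0, 1:] = 0.1
```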
5 Understanding Neuron Importance via Cumulative Ablation
We trained three different fully-connected NNs on three different datasets:
A NN trained on MNIST,
a NN trained on FashionMNIST, and
a NN trained on CIFAR-10.
Each of the networks was trained with both ReLU and sigmoid activation functions; we considered training without regularization, with -regularization (weight decay ), with dropout in hidden layers (dropout probabilities: for MNIST and FashionMNIST, for CIFAR-10), and with dropout and batch normalization in the hidden layers. For more details on the datasets (train/test/validation split, preprocessing, etc.) and the training parameters (cost function, optimizer, learning rate, etc.) we refer the reader to Appendix A. All of the results shown in this paper are obtained by averaging over NNs trained using the same setting but with different random initializations for MNIST and FashionMNIST, and over NNs for CIFAR-10, respectively.
Note that our goal for training was not to achieve state-of-the-art performance for fully-connected NNs for the considered datasets. Rather, our aim was to avoid factors such as overfitting, data augmentation, or linear bottleneck layers, that may confound our findings, while still achieving decent classification performance indicating that training was successful. During initial exploratory experiments, we observed that our qualitative results continue to hold for NNs whose classification performance varies within a reasonable range. Moreover, the effects discussed below appear to be independent of the choice of optimizer and the optimizer parameters.
We computed the information-theoretic measures (3), (4), and (5) for each neuron in each NN using the validation sets. Designing the quantizers for estimating information-theoretic quantities is challenging in general (cf. recent discussions in [15, 16]) but appears to be unproblematic in our case. We observed that using more than two quantization bins did not yield significantly different results (see Appendix B); we therefore discuss results for one-bit quantization, i.e., $|\mathcal{Q}| = 2$. Specifically, we selected the quantizer thresholds to lie at 0.5 and 0 for sigmoid and ReLU activation functions, respectively.
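The one-bit quantization described above is a simple threshold; a sketch (names ours):

```python
import numpy as np

def one_bit_quantize(activations, activation="relu"):
    """One-bit quantizer as in the text: threshold 0 for ReLU outputs
    (active vs. inactive) and 0.5 for sigmoid outputs."""
    threshold = 0.0 if activation == "relu" else 0.5
    return (np.asarray(activations) > threshold).astype(int)
```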
To connect the proposed information-theoretic measures to the classification performance of the trained NNs, we performed cumulative ablation analysis. Specifically, using the computed measures we rank the neurons of each layer or of the NN as a whole. We subsequently cumulatively ablate the lowest- or the highest-ranking neurons and compute the classification error on the test dataset. If cumulatively ablating neurons with low (high) values leads to small (large) drops in classification performance, then the information-theoretic measure with which the ranking was obtained can be taken as an indicator that the ablated neurons are important for good classification performance. Obtaining such validated importance measures is not only relevant for understanding NNs, but may also be useful for reducing the complexity of NNs by neuron pruning and for studying the effects of different regularizers and architectures on what the latent representations learn (cf. Section 6).
We chose cumulative ablation over the ablation of single neurons because most NNs used in practice (and in our experiments) are highly overparameterized and hence often exhibit high levels of redundancy. Ablating single neurons therefore has often only negligible effect on classification performance and hence fails to yield meaningful insights about the relation between the importance of the neuron for classification and its properties such as class selectivity. Cumulative ablation often has a greater impact on classification performance than the summed impact of single neuron ablation, as was also observed in  for pairs of neurons. We thus believe that cumulative ablation is a more powerful tool to study the connection between class selectivity/information and neuron importance for classification.
Following , we perform ablation to zero, i.e., we replace the output of each ablated neuron by zero. In a few cases, we observed that ablation to the mean, i.e., replacing the output of each ablated neuron by its mean value on the training set, led to slightly increased robustness against cumulative ablation. We briefly discuss a selection of these cases in Appendix E.
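The ranking-and-ablation procedure can be sketched as follows. This is an illustration only: the function name and the `evaluate` callback are ours, and a real evaluation would re-run the remaining layers of the network on the modified activations.

```python
import numpy as np

def cumulative_ablation_curve(layer_outputs, scores, evaluate, ablate_lowest=True):
    """Cumulatively ablate the neurons of one layer to zero, in order of an
    importance score, recording the test error after each ablation step.

    layer_outputs: array of shape (num_samples, num_neurons)
    scores:        per-neuron importance values (e.g. mutual information)
    evaluate:      callable mapping the modified activations to a test error
    """
    order = np.argsort(scores)            # lowest score first
    if not ablate_lowest:
        order = order[::-1]               # highest score first
    ablated = np.array(layer_outputs, dtype=float)
    errors = []
    for neuron in order:
        ablated[:, neuron] = 0.0          # ablation to zero, as in the text
        errors.append(evaluate(ablated))
    return errors
```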
5.1 Neurons Become more Variable, Informative, and Selective Towards Deeper Layers
Fig. 1 shows the empirical distribution of the proposed information-theoretic quantities for different layers of the trained NNs. It can be observed that, in general, all quantities increase towards deeper layers, which is in agreement with the observations in [13, Figs. A2 & A4.a] and  that shallow layers are general, i.e., not related to a specific class, whereas features in deeper layers are more and more specific to the class variable. For example, that the mutual information terms $I\big(\hat{Y}_i^{(\ell)}; C\big)$, corresponding to the quantized outputs of individual neurons, increase towards deeper layers suggests that neurons become informative about the class variable individually rather than collectively; i.e., while in the shallow layers class information can only be obtained from the joint observation of multiple neurons encoding general features, in deeper layers individual neurons may represent features that are informative about a set of classes.
We have confirmed the same trend for all information-theoretic quantities and all combinations of activation functions and regularization techniques for MNIST, and for entropy and mutual information for FashionMNIST, respectively. KL selectivity also appears to increase for NNs trained on FashionMNIST, but this increase can only be seen with a finer quantizer resolution (see Appendix B). The behavior for the NN trained on CIFAR-10 is less obvious, cf. Fig 1(c); however, one still observes a small increase of the relevant quantities from the second to the fourth hidden layer. We believe that the behavior for CIFAR-10 can be explained by many neurons in these layers being inactive or uninformative, respectively. We will touch upon this issue again in Section 5.3.
The behavior of the mutual information in Fig. 1 appears to be in contrast with the behavior of the mutual information between the class variable and the complete layer, i.e., with $I\big(Y^{(\ell)}; C\big)$, where $Y^{(\ell)}$ collects the outputs of all neurons in the $\ell$-th layer. The data processing inequality (cf. [3, Th. 2.8.1]) dictates that this latter quantity cannot increase towards deeper layers; proper training reduces this decrease, as empirically observed in [16, 15]. That the individual mutual information terms $I\big(\hat{Y}_i^{(\ell)}; C\big)$ increase towards deeper layers thus suggests that neurons in deeper layers exhibit a higher degree of redundancy. We will revisit this statement in Sections 5.3 and 6.
5.2 Whole-Network Cumulative Ablation Analysis
The authors of  concluded that neither mutual information nor class selectivity are correlated with classification performance, and that highly selective neurons may actually harm classification. To review this claim, we rank all neurons in the NN with ReLU activation functions trained on MNIST w.r.t. their information-theoretic quantities and ablate to zero those neurons with lowest ranks (i.e., we perform cumulative ablation analysis across both layers simultaneously). The results are shown in Fig. 2. For both regularization and dropout, it can be seen that ablating neurons with small entropy or mutual information values performs worse than ablating neurons randomly. While the effect is only mild for the NN trained with regularization (Fig. 2(a)), the effect is severe when dropout was used during training (Fig. 2(b)), indicating that mutual information is at best weakly positively or even negatively correlated with classification performance, respectively. The latter effect was also observed in [13, Fig. A4.a], where the authors claimed that neurons with large mutual information have adverse effects on classification performance. Judging from Fig. 2, it appears that ablating neurons with low KL selectivity outperforms random ablations. The effect is only mild and not present in experiments with FashionMNIST and MNIST with sigmoid activation functions, for example. The results are thus in line with those of  and suggest that it is not the neurons with largest KL selectivity or mutual information values within the entire NN that are important for classification.
We now provide a new interpretation in the light of the results of Section 5.1. With reference to Figs. 1(a) and 1(b), ablating neurons with the lowest mutual information mostly ablates neurons in the first hidden layer. These neurons extract general features that are combined in the second layer into features specific to the class variable, leading to higher mutual information and KL selectivity in the second layer. Ablation removes these general features, so deeper layers can no longer extract class-specific features and classification suffers. The effect is most strongly pronounced for dropout, where most neurons in the first layer have smaller mutual information than any neuron in the second layer (see Fig. 1(a)). Neurons are thus ablated almost exclusively from the first layer, which explains why classification performance drops so quickly in Fig. 2(b) when the number of ablated neurons approaches the size of the first hidden layer. The same effect also holds for entropy and KL selectivity and, more generally, for most experiments with the MNIST and FashionMNIST datasets. We thus conclude that, due to the large differences between the information-theoretic quantities of different layers, cumulatively ablating neurons from multiple layers simultaneously cannot be used to decide whether any of the proposed importance measures is a good indicator for classification performance.
5.3 Layer-Wise Cumulative Ablation Analysis
Since the conclusion that neurons with large mutual information adversely affect classification performance appears counterintuitive, we next perform ablation analysis in each layer separately. The results are shown in Figs. 3 and 4. First of all, it can be seen that ablating neurons in the shallow layers has stronger negative effects than ablating neurons in deeper layers. For example, in Fig. 3, ablating the 50 lowest-ranked neurons in the second hidden layer has negligible effect on classification performance. We believe that this is because many neurons in the second layer are redundant; the alternative explanation that many neurons in the second layer are inactive or irrelevant for classification can be ruled out because of the large entropy, mutual information, and KL selectivity values, cf. Fig. 1(b).
Most importantly, one can see in Fig. 3 that ablating neurons with low (high) ranks leads to better (worse) classification performance than ablating neurons randomly, at least in the first hidden layer; this holds for every considered quantity, although the effect is less pronounced for entropy. Ablating neurons with large mutual information or KL selectivity from the second layer harms classification performance the most, while it seems advisable to ablate neurons with low mutual information values. Ablating neurons with low KL selectivity from the second layer performs worse than ablating neurons randomly.
Very similar results can be seen for NNs trained on MNIST and CIFAR10 in Fig. 4: In general, mutual information and KL selectivity appear to be positively correlated with classification performance in the sense that ablating neurons with small (large) such values causes small (large) drops in classification accuracy. We have observed the same general picture for NNs with ReLU activation functions trained with other combinations of datasets and regularizers; the results for NNs with sigmoid activation functions and for NNs trained with batch normalization are less clear (see Appendices C and D).
Looking at Fig. 4 in more detail, one can further see that the differences between random ablation and ablation according to rank are small for the second layer of the NN trained on MNIST (Fig. 4(a)) and for the first layer of the NN trained on CIFAR-10 (Fig. 4(a)). We believe that this is caused by high redundancy in the former case, and by high synergy – i.e., neurons are informative about classes not individually, but only jointly – in the latter case. This belief is supported by the fact that the neurons in the second layer in Fig. 4(a) have large mutual information (see Fig. 1(a)), while the neurons in the first layer of Fig. 4(a) have large entropy but small mutual information compared to the entropy values (see Fig. 1(c)). The results for ablating neurons in the second, third, and fourth hidden layers of the NN trained on CIFAR-10 in Fig. 4(a) furthermore suggest that many neurons in these layers are inactive, which is again confirmed by the distribution of entropy values in Fig. 1(c). In the third hidden layer, there appear to be approximately 70 active neurons, characterized by positive mutual information and/or KL selectivity values. Ablating these 70 neurons completely destroys the classification ability of the NN; ablating all but these 70 neurons has negligible effect on classification performance. In the fourth layer of the same NN, approximately 250 neurons with the highest information-theoretic quantities have to be ablated to strip the NN of its classification capabilities; conversely, retaining only 100 of those neurons suffices to achieve full classification performance, indicating a high degree of redundancy in this layer. All this suggests that our proposed quantities are not adequate for measuring neuron importance in layers that create primarily synergistic or redundant representations. We touch upon this issue again in Section 6.
The general conclusion of this section remains that, for NNs with ReLU activation functions, the mutual information and KL selectivity of a neuron output are correlated with that neuron's importance in the classification task. On the surface, this seems to contradict the claims in . This contradiction is resolved by combining, e.g., the observations from Figs. 1 and 4(a): The mutual information values of neurons in the second layer of the NN trained on MNIST are large compared to those of the first layer; simultaneously, neurons in the second layer seem to be highly redundant, i.e., they can be removed without affecting classification performance. Combining these facts, it follows that ablating neurons with large mutual information values affects classification performance less than ablating neurons with small mutual information values, leading to the negative correlation reported in . We have argued in this section, however, that the correlation between mutual information and the impact on classification performance becomes positive if it is evaluated layer by layer. Thus, the apparent conflict between our results and those in  is an instance of Simpson's paradox and can be resolved by recognizing that it is ill-advised to compare mutual information and (KL) selectivity values across different layers.
6 Discussion, Limitations, and Outlook
By considering validation data and neuron outputs as realizations of RVs, we defined and calculated information-theoretic quantities to measure the importance of individual neurons for classification. Using cumulative ablation analyses, we arrived at the following main findings:
In deeper layers, ablation has smaller effects on classification performance. This may be explained by an increased redundancy (cf. Section 5.3).
The correlation between the considered quantities and neuron importance depends on the activation function. For example, the correlation is strongly positive for NNs with ReLU activation functions trained with regularization and with dropout (cf. Section 5.3). In contrast, the correlation is weak for NNs with sigmoid activation functions (Appendix C) and for NNs trained with batch normalization (Appendix D).
The correlation between the considered quantities and neuron importance depends on the depth and structure of the considered layer, and on the NN architecture as a whole. For example, wide layers may have many inactive neurons (leading to stronger correlation, Fig. 4(a)), while deep layers may have many redundant neurons (leading to weaker correlation, Fig. 4(a)).
On the one hand, our experiments are currently restricted to fully-connected NNs. The question whether our qualitative claims hold more generally, e.g., for convolutional NNs and deeper NNs, shall be answered in future work. On the other hand, the considered datasets are well-understood benchmarks for NNs and the experiments are performed for an array of settings including different activation functions and regularizers. We thus believe that our results are a valid and interesting complement to the available body of literature [13, 22, 11, 19] devoted to understanding the importance of individual neurons for classification.
We close by discussing connections between our work and two related fields of research. First, we believe that our findings have implications for neuron pruning, a recent trend aimed at reducing the computational complexity of large NNs. For example, the authors of  proposed pruning neurons based on their output entropy or on the magnitude of incoming and outgoing weights. They achieved satisfactory performance only after retraining the NN. Retraining is also necessary in [8, 12], which suggest pruning filters from convolutional NNs. Rather than pruning neurons, the authors of [17, 10] suggest merging neurons that behave similarly in a well-defined sense. Our conclusion from Figs. 2, 3, and 4 is that, if the neurons to be pruned are selected based on information-theoretic quantities, pruning has to be performed layer by layer rather than in the entire NN at once. We believe this observation may be true for other quantities as well; e.g., for -based weight pruning, the distribution of weight magnitudes may differ from layer to layer. We further observed that our information-theoretic quantities not only differ greatly between layers (see Fig. 1) but also have different meanings (see Section 4 and Appendix F.3). This suggests that it may be useful to employ different quantities when pruning different layers. Finally, our discussion in Section 5.3 indicates that deeper layers have more redundancy and hence can be pruned more severely without impacting the performance significantly. To operationalize this, however, one requires measures of neuron redundancy, preferably going beyond a simple comparison of in- and outgoing weights.
Second, our work connects with the recently proposed information bottleneck theory of neural networks (but see also [15, 1, 5] for critical assessments), which is based on the mutual information between an entire layer and the class variable. By the data processing inequality, this quantity cannot increase when going towards deeper layers, which needs to be reconciled with our result that the mutual information between individual neurons and the class variable increases when going towards deeper layers. We believe that the connection between these two superficially conflicting results is an information-theoretic measure of neuron redundancy: candidate measures exist within the framework of partial information decomposition [20, 14] and have been used in the analysis of NNs. For example, two distinct phases have been discovered during the training of a NN with a single hidden layer, characterized by large amounts of redundant and unique information, respectively.
The information bottleneck theory of NNs represents the information an entire layer contains about the class variable by a single number, and thus allows only limited insight into the behavior of the NN. Our conclusions, together with the connections of our work to neuron pruning and the information bottleneck theory, suggest that information-theoretic quantities derived from individual neuron outputs are not sufficient either. Consider the following three examples:
Suppose that there are more than two classes and that class k is particularly easy to predict from the neuron outputs in the ℓ-th layer. Suppose further that we use KL selectivity to measure neuron importance. It may happen that k is the maximizer of (5) for all neurons in the layer. Thus, neuron importance is evaluated only based on the ability to distinguish class k from the rest, which ignores separating the remaining classes. Ablating neurons based on KL selectivity may thus result in a NN unable to correctly classify classes other than k.
Suppose the i-th and j-th neuron in the ℓ-th layer have the same output for every input, i.e., Y_i = Y_j, and that the mutual information I(Y_i; C) with the class variable C is large. If we use mutual information for ablation, then both neurons will be given high importance. Still, the neurons are redundant, since one can always replace the other by adjusting the outgoing weights accordingly.
Suppose that there are two classes and that we use mutual information to measure neuron importance. Suppose further that the i-th and j-th neuron in the ℓ-th layer are binary. It may happen that both Y_i and Y_j are independent of the class variable C, but that C equals the exclusive or of Y_i and Y_j. Thus, I(Y_i; C) = I(Y_j; C) = 0, although both neurons jointly determine the class, i.e., I(Y_i, Y_j; C) = H(C).
In the first example, the neurons are individually informative, but KL selectivity may declare a set of neurons as important that is redundant (in the sense of determining class k) but insufficient (for determining the other classes). The second example presents a similar situation, but with mutual information. In the third example, the neurons are individually uninformative, but jointly informative.
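The third example can be reproduced numerically. The following sketch (with `mutual_information` a hypothetical helper estimating empirical mutual information in bits) shows two binary neuron outputs that are individually almost independent of the class, yet jointly determine it:

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in bits) between two discrete arrays."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_xy = np.mean((x == a) & (y == b))
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (np.mean(x == a) * np.mean(y == b)))
    return mi

rng = np.random.default_rng(0)
y_i = rng.integers(0, 2, 10_000)   # i-th binary neuron output
y_j = rng.integers(0, 2, 10_000)   # j-th binary neuron output
c = y_i ^ y_j                      # class variable: exclusive or of both

mi_i = mutual_information(y_i, c)               # individually ~ 0 bits
mi_j = mutual_information(y_j, c)               # individually ~ 0 bits
mi_joint = mutual_information(2 * y_i + y_j, c)  # jointly ~ 1 bit
```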
The first two examples can possibly be accounted for by introducing quantities that take the redundancy of a layer into account, such as those proposed by [17, 2]. For the first example, another option is to replace the KL selectivity of a neuron by its specific information spectrum, i.e., the collection of specific information values over all classes. The resulting spectra, evaluated for every neuron in a given layer, could allow selecting a subset of neurons such that each class is represented. More generally, all three examples can be treated by investigating how the mutual information between a complete layer and the class variable splits into redundant, unique, and synergistic parts. Neurons that contain only redundant information shall be assigned little importance; in deeper layers, unique information may be given higher value than synergistic information, whereas the contrary may be true for shallow layers. This line of thinking suggests that, in addition to the measures proposed so far, partial information decomposition [20, 14] may be used to shed more light on the behavior of neural networks. Future work shall be devoted to this analysis.
A second line of future work shall extend the observations in this paper and in Appendix D. Specifically, we aim to study in detail the effect of different training schemes, including different regularizers and batch normalization, on the information-theoretic quantities of the neuron outputs. Similarly, the effect of different NN architectures (underparameterized NNs that underfit, bottleneck layers, particularly wide layers, slim but deep NNs, etc.) shall be analyzed in detail. While a significant amount of insight can already be gained with the quantities proposed in Section 4, these analyses will also benefit from the more complete picture offered by partial information decomposition.
-  Amjad, R.A., Geiger, B.C.: Learning representations for neural network-based classification using the information bottleneck principle (2019), accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence; preprint available: arXiv:1802.09766 [cs.LG]
-  Babaeizadeh, M., Smaragdis, P., Campbell, R.H.: NoiseOut: A simple way to prune neural networks. arXiv:1611.06211 (2016)
-  Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley & Sons, Inc., New York, NY, 1 edn. (1991)
-  Dalvi, F., Durrani, N., Sajjad, H., Belinkov, Y., Bau, A., Glass, J.: What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models. In: Proc. AAAI Conf. on Artificial Intelligence (AAAI). Honolulu (Jan 2019)
-  Goldfeld, Z., van den Berg, E., Greenewald, K.H., Melnyk, I., Nguyen, N., Kingsbury, B., Polyanskiy, Y.: Estimating information flow in neural networks. arXiv:1810.05728v3 [cs.LG] (Nov 2018)
-  Han, T.S., Verdú, S.: Generalizing the Fano inequality. IEEE Transactions on Information Theory 40(4), 1247–1251 (Jul 1994)
-  He, T., Fan, Y., Qian, Y., Tan, T., Yu, K.: Reshaping deep neural network for fast decoding by node-pruning. In: Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). pp. 245–249 (2014)
-  Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. In: Proc. Int. Conf. on Learning Representations (ICLR). Toulon (Apr 2017), arXiv:1608.08710 [cs.CV]
-  Lin, J.: Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37(1), 145–151 (Jan 1991)
-  Mariet, Z., Sra, S.: Diversity networks. In: Proc. Int. Conf. on Learning Representations (ICLR). San Juan (May 2016), arXiv:1511.05077v6 [cs.LG]
-  Meyes, R., Lu, M., de Puiseau, C.W., Meisen, T.: Ablation of a robot's brain: Neural networks under a knife. arXiv:1901.08644 [cs.NE] (Dec 2018)
-  Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient inference. In: Proc. Int. Conf. on Learning Representations (ICLR). Toulon (Apr 2017), arXiv:1611.06440v2 [cs.LG]
-  Morcos, A.S., Barrett, D.G., Rabinowitz, N.C., Botvinick, M.: On the importance of single directions for generalization. In: Proc. Int. Conf. on Learning Representations (ICLR). Vancouver (May 2018)
-  Rauh, J., Banerjee, P., Olbrich, E., Jost, J., Bertschinger, N.: On extractable shared information. Entropy 19(7), 328 (Jul 2017)
-  Saxe, A.M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B.D., Cox, D.D.: On the information bottleneck theory of deep learning. In: Proc. Int. Conf. on Learning Representations (ICLR). Vancouver (May 2018)
-  Shwartz-Ziv, R., Tishby, N.: Opening the black box of deep neural networks via information. arXiv:1703.00810 [cs.LG] (Mar 2017)
-  Srinivas, S., Babu, R.V.: Data-free parameter pruning for deep neural networks. In: Proc. British Machine Vision Conf. (BMVC). pp. 31.1–31.12. Swansea (Sep 2015)
-  Tax, T., Mediano, P., Shanahan, M.: The partial information decomposition of generative neural network models. Entropy 19(9), 474 (Sep 2017)
-  Ukita, J.: Causal importance of orientation selectivity for generalization in image recognition. openreview.net/forum?id=Bkx_Dj09tQ
-  Williams, P.L., Beer, R.D.: Nonnegative decomposition of multivariate information. arXiv:1004.2515 [cs.IT] (Apr 2010)
-  Yu, F., Qin, Z., Chen, X.: Distilling critical paths in convolutional neural networks. In: NeurIPS Workshop on Compact Deep Neural Networks with industrial applications. Montreal (2018)
-  Zhou, B., Sun, Y., Bau, D., Torralba, A.: Revisiting the importance of individual units in CNNs via ablation. arXiv:1806.02891 [cs.CV] (Jun 2018)
Appendix A Dataset Description and Training Parameters
MNIST: The data samples are 784-dimensional vectors, each entry a grayscale value of a 28×28 image representing a handwritten digit. The dataset is divided into 60,000 training samples and 10,000 test samples. We further split off part of the training samples to obtain a labeled training set and a labeled validation set.
FashionMNIST: The dataset is meant to serve as a drop-in replacement for MNIST with the same number of samples and the same dimension of the images. Each data sample is a labeled image of one of the ten clothing items in the dataset.
CIFAR-10: The data samples are 3072-dimensional vectors, each representing a 32×32 color image of one of the ten object categories in the dataset. The dataset is divided into 50,000 training samples and 10,000 test samples. We further split off part of the training samples to obtain a labeled training set and a labeled validation set.
The networks are trained to minimize the cross-entropy loss plus the specified regularizer (if any). We train the NN using the RMSProp optimizer with a learning rate of , a momentum of ( for CIFAR-10), and a batch size of . The number of training epochs was chosen individually for each dataset and regularizer setting, ranging between and epochs, to achieve good classification performance while avoiding overfitting. The models are implemented in Python using PyTorch. (The source code of our experiments can be downloaded from https://firstname.lastname@example.org/raa2463/neuron-importance-via-information.git.)
Appendix B Effect of Quantizer Resolution
Figs. 5 and 6 show the effect the quantizer resolution has on cumulative ablation. Specifically, we cumulatively ablated neurons with low importance measures from all layers of the NNs. As can be seen, the performance for different quantizer resolutions is similar, which suggests that higher resolutions do not strongly affect the ranking of neurons. As an extreme example, consider the results in Fig. 7, obtained for a NN with sigmoid activation functions trained on FashionMNIST with regularization. It can be seen that a one-bit quantizer is insufficient to detect the increase in KL selectivity towards the second hidden layer; two- or three-bit quantization suffices in this case. Nevertheless, the neuron rankings within each layer are sufficiently stable w.r.t. the quantizer resolution that the ablation curves do not differ greatly and yield the same qualitative and quantitative results.
We therefore decided to focus on one-bit quantization in our experiments. First, this minimizes the computational complexity of computing the proposed importance measures (Appendix F.7). Second, it guarantees that , which justifies using (2) as an estimate of the joint distribution of and . And finally, such a coarse quantization ensures that the neuron output can be interpreted by a linear separation; i.e., the fact that mutual information is invariant under bijections is unproblematic in this case, cf. [1, Sec. 4.3].
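The quantization step can be sketched as follows (the exact binning used in our implementation may differ; this is a sketch with uniform bins over the observed output range):

```python
import numpy as np

def quantize(outputs, bits=1):
    """Map neuron outputs on a validation set into 2**bits uniform bins.

    With bits=1 and ReLU activations this essentially records whether the
    neuron is active for a given sample.
    """
    lo, hi = float(outputs.min()), float(outputs.max())
    if hi == lo:                           # constant neuron: a single bin
        return np.zeros(len(outputs), dtype=int)
    edges = np.linspace(lo, hi, 2**bits + 1)[1:-1]  # inner bin boundaries
    return np.digitize(outputs, edges)

z = np.array([0.0, 0.0, 0.3, 1.2, 0.0, 2.5])  # hypothetical ReLU outputs
bins = quantize(z, bits=1)   # → array([0, 0, 0, 0, 0, 1])
```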
Appendix C Results for NNs with Sigmoid Activation Functions
While we have observed clear results for NNs with ReLU activation functions trained without regularization, with regularization, and with dropout, our results are inconclusive for NNs with sigmoid activation functions. For these, it appears that the correlation between the proposed neuron importance measures and classification performance is very weak. To illustrate this, we have provided a selection of results in Fig. 8.
Appendix D Results for NNs trained with Batch Normalization
Similar to the results with sigmoid activation functions, our results for NNs trained using batch normalization and dropout are inconclusive, irrespective of the activation function. While the performance of these NNs on the classification task competes with that of NNs trained differently, their robustness against cumulative ablation seems not to be linked to the proposed importance measures. For example, comparing Figs. 9 and 10 with Figs. 3 and 4 suggests that NNs trained with batch normalization are more robust against random ablation, but that selectively ablating neurons has little effect on classification performance. Inspecting the distribution of importance measures in each layer reveals that batch normalization causes these measures to be higher than under different regularization techniques, especially in NNs with ReLU activation functions. For example, in our experiments with CIFAR-10 we observed that in deeper layers more neurons are active and have higher mutual information and KL selectivity values, i.e., are individually more informative about the class. One possible explanation is that batch normalization combined with dropout increases the redundancy in deeper layers, which subsequently causes increased robustness against cumulative ablation, irrespective of neuron importance. An in-depth analysis of these effects, including investigations of the effect of regularizers on neuron importance measures, is deferred to future work.
Appendix E Ablation Strategies
For NNs with ReLU activation functions, it was shown in [13, Fig. A1] that ablation to the empirical mean performs worse than ablation to zero. (Note that ablating the neuron output to a constant value is equivalent to removing said neuron and adapting the bias terms of the neurons in the subsequent layer. We consider two options for adapting the bias terms: leaving them unchanged, i.e., ablation to zero, or replacing them according to (6), i.e., ablation to the mean.) We were not able to reproduce this effect in general in our experiments, as the performance of the ablation strategy appears to depend on the choice of the activation function, on the regularization used during training, and on the order in which neurons are ablated (see Fig. 11). More specifically, we observed that for NNs with ReLU activation functions trained on FashionMNIST and MNIST using dropout, ablation to zero is indeed preferable, with large effects present in the second layer. For NNs trained using regularization, the situation is less clear, and sometimes ablation to the mean leads to greater robustness against cumulative ablation. The general picture for NNs trained on FashionMNIST and MNIST suggests that ablation to the mean is sometimes preferable in the first layer, while ablation to zero is preferable in the second layer most of the time. Moreover, since the difference between the two ablation strategies is very small in the first layer and larger in the second, we chose ablation to zero for our experiments. As an additional benefit, this makes our results comparable with those in [13].
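In terms of the weights, the two strategies can be sketched as follows (a minimal illustration for one fully-connected layer; variable and function names are ours):

```python
import numpy as np

def ablate(W_next, b_next, idx, mode="zero", mean_out=0.0):
    """Ablate neuron `idx` by clamping its output to a constant.

    mode="zero": clamp to 0 and leave the next layer's biases unchanged.
    mode="mean": clamp to the neuron's empirical mean output `mean_out`,
    which is absorbed into the next layer's biases.
    Returns the adapted weights and biases of the next layer.
    """
    W_next, b_next = W_next.copy(), b_next.copy()
    if mode == "mean":
        b_next += W_next[:, idx] * mean_out
    W_next[:, idx] = 0.0   # the neuron no longer influences the next layer
    return W_next, b_next

W = np.array([[1.0, 2.0], [3.0, -1.0]])
b = np.array([0.0, 0.0])
W0, b0 = ablate(W, b, idx=0, mode="zero")
Wm, bm = ablate(W, b, idx=0, mode="mean", mean_out=0.5)
# b0 stays [0, 0]; bm becomes [0.5, 1.5]
```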
Appendix F An Overview of Information-Theoretic Neuron Importance Measures
In this appendix, we review information-theoretic candidate measures for neuron importance. This appendix is self-contained, as it also includes those measures already introduced in the main part of this document.
F.1 Entropy

Entropy quantifies the uncertainty of a RV. In the context of a NN, the entropy

H(Y) = −∑_y P(Y = y) log P(Y = y)

has been proposed as an importance function for pruning (for one-bit quantization). Specifically, the entropy indicates whether the neuron output varies enough to fall into different quantization bins for different data samples. In our hypothetical classification task, a neuron will be assigned maximum importance if half of the data samples cause its output to fall into one quantization bin and the other half cause it to fall into the other quantization bin. In contrast, a neuron whose outputs fall into the same quantization bin for all data samples will have least importance, corresponding to zero entropy. Assuming sigmoid activation functions and saturated neuron outputs, the former case corresponds to each saturation region being active for half of the data samples, while the latter case corresponds to only one saturation region being active. In the latter case, the neuron is uninformative about the data sample and the class.
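For quantized outputs, this importance measure reduces to a plug-in estimate over bin frequencies (a sketch; variable names are ours):

```python
import numpy as np

def neuron_entropy(bins):
    """Plug-in entropy estimate (in bits) of a quantized neuron output."""
    p = np.bincount(bins) / len(bins)
    p = p[p > 0]                       # 0 log 0 is taken to be 0
    return float(-(p * np.log2(p)).sum())

h_active = neuron_entropy(np.array([0, 1, 0, 1]))  # maximal: 1.0 bit
h_dead = neuron_entropy(np.array([0, 0, 0, 0]))    # minimal: 0 bits
```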
F.2 Mutual Information
While a small or zero entropy of a neuron output suggests that it has little influence on classification performance, the converse is not true, i.e., a large value of H(Y) does not imply that the neuron is important for classification. Indeed, the neuron may capture a highly varying feature of the input that is irrelevant for classification. As a remedy, we consider the mutual information between the neuron output Y and the class variable C, i.e.,

I(Y; C) = H(Y) − H(Y | C). (4)
This quantity measures how much knowing Y helps in predicting C and appears in corresponding classification error bounds. In our hypothetical classification task with saturated sigmoid activation functions, a neuron will be assigned maximum importance if its output is in each saturation region for half of the data samples (which maximizes the first term in (4)) and the class label determines the saturation region (which minimizes the second term in (4)). In contrast, mutual information assigns the least importance to a neuron output that falls into the saturation regions independently of the class label. In this case, knowing the value of Y does not help in predicting C. It can further be shown that neurons with small H(Y) also have small I(Y; C), cf. [3, Th. 2.6.5].
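Using the decomposition in (4), a plug-in estimate can be sketched as H(Y) − H(Y | C) (variable names are ours):

```python
import numpy as np

def entropy_bits(bins):
    """Plug-in entropy (bits) of a discrete array."""
    p = np.bincount(bins) / len(bins)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def neuron_mi(bins, labels):
    """I(Y; C) = H(Y) - H(Y | C) for a quantized neuron output Y."""
    h_cond = sum(np.mean(labels == c) * entropy_bits(bins[labels == c])
                 for c in np.unique(labels))
    return entropy_bits(bins) - h_cond

labels = np.array([0, 0, 1, 1])
mi_informative = neuron_mi(np.array([0, 0, 1, 1]), labels)  # 1.0 bit
mi_useless = neuron_mi(np.array([0, 1, 0, 1]), labels)      # 0.0 bits
```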
F.3 Kullback-Leibler Selectivity
It has been observed that, especially at deeper layers, the activity of individual neurons admits distinguishing one class from all others. Mathematically, for such a neuron there exists a class c such that the class-conditional distribution P(Y | C = c) differs significantly from the marginal distribution P(Y), i.e., the specific information D_KL(P(Y | C = c) ‖ P(Y)) is large. Neurons with large specific information for at least one class may be useful for the classification task (see Section 5.3 below), but may nevertheless be characterized by low entropy and low mutual information, especially if the number of classes is large. We therefore propose the maximum specific information over all classes as a measure of neuron importance:

max_c D_KL(P(Y | C = c) ‖ P(Y)). (5)
This quantity assigns high importance to neurons that activate differently for exactly one class and can thus be seen as an information-theoretic counterpart of the selectivity measure used in [13]. We thus call the quantity defined in (5) Kullback-Leibler (KL) selectivity. Specifically, KL selectivity is maximized if all data samples of a specific class are mapped to one value of Y and all other data samples (corresponding to other classes) are mapped to other values of Y. In this case, Y can be used to distinguish this class from the rest. In contrast, KL selectivity is zero if and only if the mutual information is zero.
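A plug-in estimate of (5) can be sketched as follows (variable names are ours; note that the support of each class-conditional distribution is contained in that of the marginal, so the ratio inside the logarithm is well defined):

```python
import numpy as np

def kl_selectivity(bins, labels):
    """max over classes c of D_KL(P(Y | C = c) || P(Y)), in bits."""
    n_bins = int(bins.max()) + 1
    p = np.bincount(bins, minlength=n_bins) / len(bins)   # marginal P(Y)
    best = 0.0
    for c in np.unique(labels):
        mask = labels == c
        q = np.bincount(bins[mask], minlength=n_bins) / mask.sum()
        s = q > 0   # supp P(Y | C = c) is a subset of supp P(Y)
        best = max(best, float((q[s] * np.log2(q[s] / p[s])).sum()))
    return best

# A neuron that singles out class 2 from classes {0, 1, 3}:
labels = np.array([0, 1, 2, 3])
sel = kl_selectivity(np.array([0, 0, 1, 0]), labels)  # D_KL([0,1] || [3/4,1/4]) = 2 bits
```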
KL selectivity is not only large if a neuron output helps distinguish one class from the rest, but also if it helps distinguish a set of classes from its complement. This is the main conclusion of the following lemma.
Note that the distribution P(Y | C ∈ A) is a convex combination of the distributions P(Y | C = c), c ∈ A. Indeed,

P(Y | C ∈ A) = ∑_{c ∈ A} [P(C = c) / P(C ∈ A)] P(Y | C = c).

Since KL divergence is convex [3, Th. 2.7.2], we thus have

D_KL(P(Y | C ∈ A) ‖ P(Y)) ≤ ∑_{c ∈ A} [P(C = c) / P(C ∈ A)] D_KL(P(Y | C = c) ‖ P(Y)) ≤ max_c D_KL(P(Y | C = c) ‖ P(Y)),

with equality if A consists of a single element. Thus, maximizing over all sets A is equivalent to maximizing over all class labels. This completes the proof.
The second result relates mutual information with KL selectivity.
I(Y; C) ≤ max_c D_KL(P(Y | C = c) ‖ P(Y)),

with equality if and only if the specific information D_KL(P(Y | C = c) ‖ P(Y)) is the same for all classes c.
Note that mutual information can be written as a convex combination of KL divergences, i.e.,

I(Y; C) = ∑_c P(C = c) D_KL(P(Y | C = c) ‖ P(Y)).

Since a convex combination is always bounded from above by its maximum term, the inequality is proved. Finally, if I(Y; C) = 0, then so is the convex combination on the right-hand side. Therefore, all terms need to be identically zero, i.e., D_KL(P(Y | C = c) ‖ P(Y)) = 0 for all c. It thus follows that KL selectivity is zero if mutual information is zero.
F.4 Jensen-Shannon Subset Separation
A consequence of Lemma 2 is that a large mutual information implies that, at least for some class c, the conditional distribution P(Y | C = c) differs from the marginal distribution P(Y). More generally, there needs to be a set A of classes such that P(Y | C ∈ A) differs from P(Y | C ∉ A). We measure the difference between these two distributions using the Jensen-Shannon (JS) divergence. Specifically, the JS divergence between two distributions P_0 and P_1 on the same finite alphabet, with weight π ∈ [0, 1], is defined as

JS_π(P_0; P_1) = π D_KL(P_0 ‖ M) + (1 − π) D_KL(P_1 ‖ M),

where M = π P_0 + (1 − π) P_1. JS divergence is nonnegative, symmetric, bounded, and zero if P_0 = P_1. JS divergence can be used to bound the Bayesian binary classification error from above and below, where π and 1 − π are the class priors and where P_0 and P_1 are the conditional probabilities of the observation given the respective classes (see Theorems 4 and 5 in [9]).
Evaluating the JS divergence between P(Y | C ∈ A) and P(Y | C ∉ A), with weights P(C ∈ A) and P(C ∉ A), respectively, is thus connected to the binary classification problem of deciding from the neuron output whether or not the class belongs to the subset A. If there is at least one nontrivial set A such that the JS divergence is large, then the neuron output is useful in separating data samples from classes in A from those from classes outside A. Hence, we consider the following importance measure:

max over nontrivial sets A of JS_{P(C ∈ A)}(P(Y | C ∈ A); P(Y | C ∉ A)). (11)
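A brute-force evaluation of this measure (exponential in the number of classes) can be sketched as follows, with the weighted JS divergence computed via its equivalent entropy form H(M) − πH(P_0) − (1 − π)H(P_1); variable names are ours:

```python
import numpy as np
from itertools import combinations

def entropy_bits(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def js_subset_separation(bins, labels):
    """max over nontrivial class subsets A of the weighted JS divergence
    between P(Y | C in A) and P(Y | C not in A)."""
    classes = list(np.unique(labels))
    n_bins = int(bins.max()) + 1
    best = 0.0
    # By the symmetry A <-> complement, subsets up to half the classes suffice.
    for r in range(1, len(classes) // 2 + 1):
        for subset in combinations(classes, r):
            in_a = np.isin(labels, subset)
            pi = float(in_a.mean())
            p0 = np.bincount(bins[in_a], minlength=n_bins) / in_a.sum()
            p1 = np.bincount(bins[~in_a], minlength=n_bins) / (~in_a).sum()
            mix = pi * p0 + (1 - pi) * p1
            js = entropy_bits(mix) - pi * entropy_bits(p0) - (1 - pi) * entropy_bits(p1)
            best = max(best, js)
    return best

# A neuron separating classes {0, 1} from {2, 3} scores a full bit:
labels = np.array([0, 1, 2, 3])
score = js_subset_separation(np.array([0, 0, 1, 1]), labels)  # 1.0
```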
The following lemma gives a clearer picture of this importance function by showing that the JS divergence between these distributions coincides with the mutual information the neuron output shares with an indicator function on a subset of class labels. The connection between JS divergence and mutual information is known; we reproduce the proof for the convenience of the reader.
In essence, Lemma 3 shows that our importance function (11) can be interpreted both as a divergence between two distributions and as the amount of information the neuron output shares with an indicator variable on a subset of class labels. Hence, this importance function measures the ability of the neuron output to separate class subsets.
Note further that, by (4), we have

JS_{P(C ∈ A)}(P(Y | C ∈ A); P(Y | C ∉ A)) = I(Y; 1[C ∈ A]) ≤ H(1[C ∈ A]).

In case all class labels occur equally often (i.e., the class variable is uniformly distributed), the right-hand side of the above inequality achieves its maximum for sets A that contain half of the class labels. Thus, Jensen-Shannon subset separation tends to give higher importance to neurons that separate the set of classes into equally sized subsets rather than to neurons that separate it in an unbalanced manner.
F.5 Labeled Mutual Information
The maximization in (11) has a computational complexity that grows exponentially with the number of classes, which makes it impractical for datasets with many classes. Instead, one can maximize over individual classes rather than subsets of classes, which yields the definition of labeled mutual information:

max_c I(Y; 1[C = c]). (12)
With reference to Lemma 2, one can show that

I(Y; 1[C = c]) = P(C = c) D_KL(P(Y | C = c) ‖ P(Y)) + P(C ≠ c) D_KL(P(Y | C ≠ c) ‖ P(Y)),

i.e., labeled mutual information contains the same specific information that we used in the definition of KL selectivity. Note, however, that, except in certain corner cases, the maximum in (12) may be achieved for a different class than the maximum in (5). Nevertheless, labeled mutual information and KL selectivity tend to give similar results.
By similar arguments as in the discussion of JS subset separation, we have

I(Y; 1[C = c]) ≤ H(1[C = c]).

Therefore, labeled mutual information in general decreases with the number of possible class labels: the more classes there are, the smaller P(C = c) and hence H(1[C = c]) tend to be. This is not the case for KL selectivity.
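Restricting the maximization to single classes gives a sketch that is linear rather than exponential in the number of classes (variable names are ours):

```python
import numpy as np

def entropy_bits(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def labeled_mi(bins, labels):
    """max over classes c of I(Y; 1[C = c]) for a quantized output Y."""
    n_bins = int(bins.max()) + 1
    best = 0.0
    for c in np.unique(labels):
        mask = labels == c
        pi = float(mask.mean())
        p1 = np.bincount(bins[mask], minlength=n_bins) / mask.sum()
        p0 = np.bincount(bins[~mask], minlength=n_bins) / (~mask).sum()
        mix = pi * p1 + (1 - pi) * p0
        best = max(best, entropy_bits(mix) - pi * entropy_bits(p1)
                   - (1 - pi) * entropy_bits(p0))
    return best

# Even a neuron that perfectly singles out one class is capped at
# H(1[C = c]), which shrinks as the number of classes grows:
labels = np.array([0, 1, 2, 3])
lmi = labeled_mi(np.array([0, 0, 0, 1]), labels)  # H(1[C = 3]) ≈ 0.8113 bits
```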
F.6 Ordering Between Importance Measures
First, we have

H(Y) ≥ I(Y; C) ≥ max over sets A of I(Y; 1[C ∈ A]) ≥ max_c I(Y; 1[C = c]),

where the first inequality follows from (8) and the nonnegativity of entropy, the second from the data processing inequality, and the third from the fact that the maximization in (12) is performed over a smaller set than in (11). Second, Lemma 2 shows that KL selectivity is an upper bound on mutual information. Finally, there is no ordering between KL selectivity and entropy.
F.7 Complexity of Computing Importance Functions
We assume that the validation set is, in any case, run through the NN, i.e., we ignore the computational complexity of computing the neuron outputs. Assuming that the number of classes and the number of quantization bins are both small compared to the number N of validation samples, the most complex step is computing the distribution of the quantized neuron output from the data set; this can be done with a complexity of O(N). Similarly, the class distribution and the set of class-conditional distributions can be computed with a complexity of O(N).
Entropy can then be computed from the output distribution with a complexity linear in the number of quantization bins; mutual information from the joint distribution with a complexity linear in the number of bins times the number of classes; KL selectivity and labeled mutual information from the set of class-conditional distributions with the same complexity; and JS subset separation with a complexity that is additionally exponential in the number of classes. With the exception of JS subset separation, these computations have a complexity negligible compared to O(N).