Gradient-Adjusted Neuron Activation Profiles for Comprehensive Introspection of Convolutional Speech Recognition Models

Deep Learning based Automatic Speech Recognition (ASR) models are very successful, but hard to interpret. To gain better understanding of how Artificial Neural Networks (ANNs) accomplish their tasks, introspection methods have been proposed. Adapting such techniques from computer vision to speech recognition is not straight-forward, because speech data is more complex and less interpretable than image data. In this work, we introduce Gradient-adjusted Neuron Activation Profiles (GradNAPs) as means to interpret features and representations in Deep Neural Networks. GradNAPs are characteristic responses of ANNs to particular groups of inputs, which incorporate the relevance of neurons for prediction. We show how to utilize GradNAPs to gain insight about how data is processed in ANNs. This includes different ways of visualizing features and clustering of GradNAPs to compare embeddings of different groups of inputs in any layer of a given network. We demonstrate our proposed techniques using a fully-convolutional ASR model.


Andreas Krug, Sebastian Stober
Otto von Guericke University Magdeburg, Germany

Keywords: speech recognition, convolutional neural networks, model introspection, feature visualization

1 Introduction

Artificial Neural Networks have become a very popular tool for solving challenging tasks across various fields of application. Performance gains are often achieved through increasing their complexity in terms of types of architectures or the number of neurons [22]. At the same time, larger computational models become harder to interpret [23]. This complicates detecting erroneous behavior and thus can be risky in critical applications. Introspection techniques have been proposed to get insight into ANNs [24, 21]. However, these methods are often designed for certain applications or architectures. In particular, many introspection techniques focus on images, as features are easy to interpret visually.

The complexity of ANNs is approaching that of real brains, which have been studied in neuroscience for over 50 years. Well-established methods in this field can be adapted to analyze ANNs [9]. Our work is inspired by a popular technique from neuroscience, the Event-Related Potential (ERP). The ERP technique is used for analyzing brain activity through Electroencephalography (EEG) [14]. ERPs aim to measure brain activity for a particular fixed event (stimulus). As the event is consistent across all EEG measurements, aligning the data at this stimulus and averaging the signals yields event-specific information [13]. This way, in ERPs, variations in brain activity that are unrelated to the event are averaged out. We analyze ANNs similarly, but as their responses are deterministic, we average out variations in the data instead. For example, in a speech recognition model, activity can be observed for a particular phoneme in audio recordings of different speakers and articulations.

In our work, we present Gradient-adjusted Neuron Activation Profiles as an ERP-inspired analysis of ANNs, which combines and extends our previous work. GradNAPs allow for a comprehensive analysis of features and representations in any layer of the network, as well as identification of neurons which respond to a particular group of inputs. We demonstrate multiple ways to examine network responses of a fully-convolutional Automatic Speech Recognition (ASR) model using GradNAPs.

Figure 1: (A) Alignment procedure and (B) GradNAP computation from aligned activations for a layer.

2 Background

2.1 Convolutional speech recognition

Convolutional Neural Networks (CNNs) are not uncommon in ASR [1]. Here, we demonstrate our method using a simple, fully-convolutional architecture based on Wav2Letter [5]. This architecture is useful for low-resource ASR model training and transfer learning [11]. Moreover, introspection methods from computer vision can easily be adapted to it [10]. For comparability, we use a pre-trained model from our previous work [8]. The 11-layer 1D-convolutional network predicts graphemes from spectrograms. The model was trained on z-normalized spectrograms, which were scaled to 128 mel-frequency bins. Whole-sequence audio recordings from the LibriSpeech corpus [19] were used as training data. The acoustic model predicts sequences of graphemes, which are decoded by a Connectionist Temporal Classification (CTC) beam search decoder.

2.2 Model introspection for deep neural networks

Introspection describes the process of analyzing or visualizing internal structures or processes of computational models. This is of particular interest in Deep Learning (DL) models, as these work as black boxes [23]. Several introspection techniques have been proposed, mostly in the field of computer vision [24, 21, 4]. A common way of explaining ANNs is to visualize learned features by optimizing the input to maximally activate certain neurons or sets of neurons [23, 6, 16]. Optimal inputs do not always look natural. This problem can be tackled by regularizing the optimization. Another typical introspection strategy is to determine the parts of the input that are relevant for a certain prediction [24, 21, 6]. Those techniques visualize saliency maps on top of the input, which are easy to interpret. However, as those methods work on single examples, it is hard to assess the model comprehensively. Moreover, one has to choose such introspection techniques carefully, as some can be misleading [2]. More comprehensive insight into ANNs is provided by analyzing representations of different classes using the complete data set. This can be done by training linear classifiers on intermediate representations [3], through Canonical Correlation Analysis (CCA) of representations or by clustering class-specific neuron activations [18]. In speech, the latter type of analysis was conducted for Multi-Layer Perceptrons for speech-to-phoneme prediction [18, 17] and for convolutional ASR [10, 8].

3 Methods

3.1 Model & data set

For comparability with our previous work [8], we use the same model and data set. The architecture is based on Wav2Letter [5] and was trained on the LibriSpeech corpus [19]. This data set does not contain phoneme mappings. Therefore, in our earlier work, we obtained them through a grapheme-to-phoneme translation model using an attention-based encoder-decoder architecture, trained on the CMU Pronunciation Dictionary (CMUDict) [12].

3.2 Gradient-adjusted Neuron Activation Profiles

We introduce GradNAPs as a way to compute characteristic neuron responses of an ANN to groups of inputs. GradNAPs are an adaptation of the ERP technique to ANNs. Our method combines and extends two of our previously described introspection methods for ASR: Normalized Averaging of Aligned Inputs (NAvAI) [10] and Neuron Activation Profiles (NAPs) [8]. We first describe how GradNAP analysis differs from our previous work. Afterwards, we explain our technique in detail.

As in our previous work, we adopt normalized averaging to obtain group-specific network responses. To preserve more information than NAPs, we do not create time-independence by sorting on the time axis, but by sensitivity-based alignment as in NAvAI. Unlike NAvAI, we incorporate the activation strength into the alignment, as predictions can be highly sensitive to changes in inactive neurons. On top of that, we mask out activations of low relevance for the prediction. As these improvements utilize gradients, we call our method Gradient-adjusted NAPs (GradNAPs).

To properly apply an averaging approach, the different recordings need to be temporally aligned, similar to time-locked data in ERP analysis. To achieve this, we first center each layer’s activations and the spectrogram frames at the time of highest importance for the prediction. We refer to this step as “alignment” (Figure 1A). We compute neuron activations and sensitivity values in every layer for each spectrogram frame. Sensitivity is the gradient of a one-hot vector for the predicted grapheme with respect to each layer’s activations. We identify importance for a prediction by strong activation combined with a high absolute sensitivity value. Hence, we center activations at the time point where this importance is maximal. As zeros in z-normalized spectrograms do not represent absence of the corresponding frequency, we center spectrogram inputs at the time point that maximizes a correspondingly adapted importance measure. The centering is implemented by cropping. In the same way, we center the gradients, so they remain aligned to the activations.
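As a minimal sketch, the alignment of a single recording in one layer could look as follows. The importance measure used here (activation times absolute sensitivity, summed over neurons), the zero-padding at the borders and the names align_by_importance and half_width are assumptions made for illustration, not the exact formulation used in this work.

```python
import numpy as np


def align_by_importance(activations, gradients, half_width):
    """Crop one layer's activations and gradients around the most important frame.

    activations, gradients: arrays of shape (neurons, time) for a single input.
    Assumption: per-frame importance is the sum over neurons of
    activation * |gradient|; the source does not spell out this formula.
    """
    importance = (activations * np.abs(gradients)).sum(axis=0)
    center = int(np.argmax(importance))

    # Assumption: zero-pad on the time axis so that every cropped window has
    # the same width, even if the important frame is close to a border.
    pad = ((0, 0), (half_width, half_width))
    act_padded = np.pad(activations, pad)
    grad_padded = np.pad(gradients, pad)

    # After padding, the important frame sits at index center + half_width,
    # so this slice is centered on it.
    window = slice(center, center + 2 * half_width + 1)
    return act_padded[:, window], grad_padded[:, window]
```

Cropping activations and gradients with the same window keeps the two arrays aligned, which the later masking step relies on.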

Figure 1B visualizes how to obtain a GradNAP in a layer. We average aligned activations and gradients over a group to obtain a group-specific profile. As some neurons show baseline activations and some information is common to all inputs, we normalize activations by subtracting the average over the complete data set. We do not normalize gradients this way, because zero gradients would lose their meaning. Instead, we rescale them to a fixed range. Finally, we apply the resulting gradient mask to the normalized averaged activations to obtain a GradNAP. Input layer GradNAPs are computed using spectrogram frames instead of activations.
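The computation in Figure 1B can be sketched as below. The function name grad_nap and the mask definition are assumptions: the mask is taken as the absolute averaged gradient divided by its maximum, giving values in [0, 1], which may differ from the scaling actually used.

```python
import numpy as np


def grad_nap(group_acts, group_grads, dataset_mean_act):
    """Sketch of a GradNAP for one group in one layer.

    group_acts, group_grads: aligned arrays of shape (examples, neurons, time).
    dataset_mean_act: array of shape (neurons, time), the average aligned
    activation over the complete data set.
    """
    mean_act = group_acts.mean(axis=0)
    mean_grad = group_grads.mean(axis=0)

    # Normalize activations by removing what is common to all inputs.
    normalized = mean_act - dataset_mean_act

    # Assumed mask: |mean gradient| rescaled to [0, 1]. Zero gradients stay
    # zero, so they keep their meaning of "not relevant".
    magnitude = np.abs(mean_grad)
    peak = magnitude.max()
    mask = magnitude / peak if peak > 0 else np.zeros_like(magnitude)

    return normalized * mask
```

For the input layer, the same function would be called with aligned spectrogram frames in place of activations.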

3.3 Visualization of group-specific features

Here, we visualize GradNAPs as line plots, inspired by typical action potential plots of real neurons. We compute group-responsiveness of neurons as the neuron-wise sum of absolute values in the corresponding GradNAP. A neuron can be positively or negatively responsive to a group. Hence, we multiply by the sign of the sum of GradNAP values (Equation 1).


We obtain the 5 most responsive neurons in terms of this group-responsiveness and compute a common optimal input for them. The optimization target is the joint pre-activation of those neurons at a single time point. Pre-activations of positively and negatively responsive neurons are maximized and minimized, respectively (Equation 2). We did not optimize for each responsive neuron separately, as class-specificity is distributed across multiple neurons [15].
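The group-responsiveness and the selection of the most responsive neurons can be illustrated with a short sketch. The function names are hypothetical, and ranking neurons by the absolute value of the responsiveness is an assumption; the text only states that both positively and negatively responsive neurons are considered.

```python
import numpy as np


def responsiveness(gradnap):
    """Signed group-responsiveness per neuron for a GradNAP of shape (neurons, time).

    Sum of absolute GradNAP values, multiplied by the sign of the plain sum.
    """
    return np.sign(gradnap.sum(axis=1)) * np.abs(gradnap).sum(axis=1)


def most_responsive(gradnap, k=5):
    """Return indices and signs of the k most responsive neurons.

    Assumption: neurons are ranked by absolute responsiveness.
    """
    scores = responsiveness(gradnap)
    order = np.argsort(-np.abs(scores), kind="stable")[:k]
    return order, np.sign(scores[order])
```

The returned signs indicate whether a neuron's pre-activation would be maximized (+1) or minimized (-1) in the subsequent input optimization.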


We apply L1 and L2 regularization on the input values, with both regularization strengths scaled depending on the receptive field size in the respective layer. This dependence on the receptive field size prevents the regularization from becoming stronger for deeper layers. Optimization is performed using Adam [7] for 16 steps, initializing the input with random values drawn from a normal distribution.
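To illustrate the idea of regularized input optimization, the following toy sketch runs 16 Adam steps on a single linear layer, where the gradient can be written by hand. All numeric defaults (learning rate, regularization strengths, initialization scale) are placeholders chosen for illustration, the layer-dependent scaling of the regularization is not reproduced, and a real convolutional model would require automatic differentiation.

```python
import numpy as np


def optimize_input(weights, signs, steps=16, lr=0.1, l1=0.01, l2=0.01, seed=0):
    """Toy activation maximization for a linear layer z = weights @ x + bias.

    weights: (neurons, input_size) rows of the selected neurons.
    signs: (neurons,) with +1 for neurons to maximize and -1 to minimize.
    Objective (maximized): signs @ z - l1 * sum(|x|) - l2 * sum(x ** 2).
    All default hyperparameters are hypothetical.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 0.01, size=weights.shape[1])

    # Adam state and standard default coefficients.
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    beta1, beta2, eps = 0.9, 0.999, 1e-8

    # Gradient of the signed pre-activation sum; the bias does not depend on x.
    target_grad = signs @ weights

    for step in range(1, steps + 1):
        grad = target_grad - l1 * np.sign(x) - 2.0 * l2 * x
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
        # Gradient ascent step, since the objective is maximized.
        x = x + lr * m_hat / (np.sqrt(v_hat) + eps)
    return x
```

Optimizing the signed sum jointly, rather than one neuron at a time, mirrors the choice of a common optimal input for the set of responsive neurons.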

3.4 Representation power of layers for different groups

We apply hierarchical clustering with Euclidean distance and complete linkage to GradNAPs of graphemes and phonemes, as in our previous work [8]. In contrast to that work, instead of using a fixed distance threshold for the emergence of clusters, we apply different distance thresholds using the percentiles 75% to 95% in steps of 5%. We evaluate the resulting clusterings by computing their Silhouette score [20]. This score compares the distances within a cluster to the distances to the nearest other cluster. Moreover, we average Silhouette scores over those 5 thresholds in each layer. This ensures that the findings are not specific to a particular parameter choice. We compare those results between graphemes and phonemes.
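The evaluation for one layer can be sketched in plain NumPy as follows. It assumes that each group's GradNAP has been flattened into one row of `profiles` and that the thresholds are percentiles of all pairwise distances; both are assumptions, and the helper names are hypothetical. In practice, library routines for hierarchical clustering and Silhouette scores would likely be used instead of these small reference implementations.

```python
import numpy as np


def pairwise_distances(profiles):
    """Euclidean distance matrix for rows of profiles (groups, features)."""
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def complete_linkage_labels(dist, threshold):
    """Merge clusters while their complete-linkage distance is <= threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: largest distance between any two members.
                d = dist[np.ix_(clusters[a], clusters[b])].max()
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(dist), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


def silhouette(dist, labels):
    """Mean Silhouette score; NaN if there are fewer than two clusters."""
    unique = np.unique(labels)
    if len(unique) < 2:
        return float("nan")
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton clusters score 0 by convention
        a = dist[i, own].sum() / (own.sum() - 1)
        b = min(dist[i, labels == other].mean()
                for other in unique if other != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())


def mean_silhouette(profiles, percentiles=(75, 80, 85, 90, 95)):
    """Average Silhouette score over percentile-based distance thresholds."""
    dist = pairwise_distances(profiles)
    pair_dists = dist[np.triu_indices(len(dist), k=1)]
    scores = [
        silhouette(dist, complete_linkage_labels(dist, np.percentile(pair_dists, p)))
        for p in percentiles
    ]
    return float(np.mean(scores))
```

Calling mean_silhouette once per layer, separately for grapheme and phoneme profiles, would yield the per-layer averages that are compared between the two groupings.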

4 Results & Discussion

4.1 Per-layer GradNAPs

Figure 2: NAvAI patterns compared to input layer GradNAPs.

GradNAPs in the input layer are an improvement of NAvAI [10]. Examples of GradNAPs in the input layer compared to exemplary NAvAI results are shown in Figure 2. GradNAPs in the input layer are directly interpretable. They show how the intensity of frequencies differs from the average over the complete data set. The gradient-based masking ensures that the GradNAP only shows regions that are important for the prediction. This advantage over NAvAI is demonstrated for /T/ in Figure 2 (right). While NAvAI shows a pattern over the whole receptive field, the corresponding GradNAP additionally identifies the prediction-relevant parts of it.

We observe phoneme-typical patterns in input layer GradNAPs (Figure 2 bottom). Phonemes /AA/ and /AE/ share a high-intensity formant at around 700 Hz. A second formant is identified at around 1200 Hz and 1900 Hz for /AA/ and /AE/, respectively. The input pattern for /T/ (right) shows a change from high to low intensities of all frequencies at the alignment time. Those patterns match the expectation well. However, the identified formants spread over a wider range of frequencies. This is probably due to speaker variation. The grapheme-specific NAvAI result for a (as in [10]) is most similar to the input GradNAP for /AE/. This indicates that grapheme a was pronounced as /AE/ in the majority of the data.

In deeper layers, neuron order does not have a meaning. Therefore, corresponding GradNAPs cannot be interpreted by visual inspection. An example can be seen in Figure 1B (rightmost). Instead, we visualize features by optimizing inputs for the most responsive neurons. Those results are shown in Section 4.2.

In all layers, we observed that GradNAP values become smaller with increasing distance from the alignment time and eventually drop to 0. This indicates that the model did not use the complete receptive field for prediction. Thus, it should be possible to compress the model by choosing smaller kernel sizes, fewer filters or fewer layers.

4.2 Visualizing group-specific features

Again inspired by neuroscience, we visualize GradNAPs as neuron action potentials. Figure 3 shows GradNAPs of exemplary phonemes /AE/ and /T/ in the 2nd layer as action potential plots.

Figure 3: Neuron action potentials and feature visualization in the second layer for phonemes /AE/ and /T/.

The plots show phoneme-specific neuron activations for all neurons in the same layer superimposed. The 5 most responsive neurons are highlighted with different colors (those do not represent the identity of the neuron). We observed that neuron responses to both the vowel phoneme /AE/ and the plosive /T/ are close to the center. This indicates that the network focuses on acoustic features of the phoneme, instead of correlating features in their context. Next to each action potential plot in Figure 3, the optimal input to the set of 5 most responsive neurons is shown. The optimal input for neurons which are responsive to /AE/ shows high intensities for frequencies around 700 Hz and 1900 Hz. This corresponds to the /AE/-typical formants (also in agreement with Figure 2). The visualized features for /T/-responsive neurons show several quick transitions from high to low intensities of most frequencies. As no neuron peaks twice, it is likely that the repeated occurrence of the plosive pattern is related to detecting it in different contexts. Because optimal inputs are not aligned, it is reasonable that features occur at more than one time point. This also causes repetitive patterns in optimal inputs for deeper layers, which are distinguishable but not natural. We omit them here, because unnatural feature visualization is not easily interpretable. This problem could be tackled with stronger regularization, which, however, could also lead to misleading interpretations.

4.3 Analysis of grapheme and phoneme encoding

We analyze which layers represent graphemes and phonemes best by clustering GradNAPs. In each layer, we compute Silhouette scores for cluster assignments using different distance thresholds. Higher scores correspond to more distinct clusters, indicating better representation of the respective group. Figure 4 (top) shows Silhouette scores for graphemes (left), phonemes (center) and the averages over distance thresholds for both groupings (right). Higher percentiles mostly lead to higher Silhouette scores. This is reasonable, as we expect a hierarchy of similar phonemes rather than large clusters. Surprisingly, representation quality does not consistently increase from lower to deeper layers. Silhouette scores even decrease for phonemes from the input layer to the 5th layer. Deeper layers of the network show better clustering for phonemes than for graphemes over all distance thresholds. The highest Silhouette score can be observed for phonemes in the 9th layer. However, the corresponding clusters are large and do not separate phonemic categories. Layers 10 and 11 have a much larger number of neurons than the others. This results in differently distributed distance matrices, which probably causes the drop in cluster quality from layer 9 to layer 10.

Figure 4: Silhouette scores at different distance thresholds (A) and 75th percentile clustering of GradNAPs in layer 10 (B).

Silhouette scores indicate that in higher layers, phonemes are better represented than graphemes. However, they are not suitable for detecting the exact layer where clusters of meaningful phonemic categories emerge. This indicates that phoneme similarity is not the strongest factor for distinguishing neuron responses. Nevertheless, we observe similar phonemic categories in clusterings from the 10th layer on, which is shown in Figure 4 (bottom). In an earlier work, we performed clustering analysis of NAPs [8]. We confirm the prior finding that phonemic categories are represented well from the 10th layer on and that the phoneme clustering identifies more sub-categories. However, the differences between grapheme and phoneme clustering are smaller than in our earlier work. Most likely, this is an effect of gradient masking, which scales down a lot of prediction-irrelevant values.

5 Conclusion

GradNAPs are a promising tool to gain insight into ANNs. We combined strengths of existing introspection techniques, extended them and applied more comprehensive analyses. With our method, introspection is not limited to the predicted classes, but can be performed for any grouping of inputs. Moreover, model introspection is possible for different parts of the network (inputs, any layer, subsets of neurons). We presented per-layer clustering of GradNAPs for different groups and action potentials with feature visualization on the individual-neuron level. Our method is generally applicable to any type of data and is not limited to CNNs. If there are too many groups, the clustering overview can become cluttered. This can be circumvented by choosing higher-level groups or only a subset of interest. Future work will utilize our method to analyze the network during training. This could shed light on when and how the network learns to detect features for particular groups.

6 Acknowledgements

This research has been funded by the Federal Ministry of Education and Research of Germany (BMBF) and supported by the donation of a GeForce GTX Titan X graphics card from the NVIDIA Corporation.


References

  1. O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn and D. Yu (2014) Convolutional neural networks for speech recognition. IEEE/ACM Transactions on audio, speech, and language processing 22 (10), pp. 1533–1545.
  2. J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt and B. Kim (2018) Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536.
  3. G. Alain and Y. Bengio (2016) Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
  4. S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller and W. Samek (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10 (7), pp. e0130140.
  5. R. Collobert, C. Puhrsch and G. Synnaeve (2016) Wav2Letter: an end-to-end convnet-based speech recognition system. arXiv preprint arXiv:1609.03193.
  6. D. Erhan, Y. Bengio, A. Courville and P. Vincent (2009) Visualizing higher-layer features of a deep network. University of Montreal 1341 (3), pp. 1.
  7. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  8. A. Krug, R. Knaebel and S. Stober (2018) Neuron activation profiles for interpreting convolutional speech recognition models. In Proceedings of the 2018 NeurIPS Workshop IRASL: Interpretability and Robustness for Audio, Speech, and Language.
  9. A. Krug and S. Stober (2017) Adaptation of the event-related potential technique for analyzing artificial neural networks. In Cognitive Computational Neuroscience (CCN).
  10. A. Krug and S. Stober (2018) Introspection for convolutional automatic speech recognition. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 187–199.
  11. J. Kunze, L. Kirsch, I. Kurenkov, A. Krug, J. Johannsmeier and S. Stober (2017) Transfer learning for speech recognition on a budget. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 168–177.
  12. K. Lenzo (2007) The CMU pronouncing dictionary. Carnegie Mellon University.
  13. S. J. Luck (2005) An Introduction to the Event-Related Potential Technique. Monographs of the Society for Research in Child Development 78 (3), pp. 388.
  14. S. Makeig and J. Onton (2009) ERP features and EEG dynamics: an ICA perspective. In Oxford handbook of event-related potential components, pp. 51–87.
  15. A. S. Morcos, D. G. Barrett, N. C. Rabinowitz and M. Botvinick (2018) On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959.
  16. A. Mordvintsev, C. Olah and M. Tyka (2015) Inceptionism: going deeper into neural networks. Google Research Blog. Retrieved June 20 (14), pp. 5.
  17. T. Nagamine and N. Mesgarani (2017) Understanding the representation and computation of multilayer perceptrons: a case study in speech recognition. In International Conference on Machine Learning, pp. 2564–2573.
  18. T. Nagamine, M. L. Seltzer and N. Mesgarani (2015) Exploring how deep neural networks form phonemic categories. In Sixteenth Annual Conference of the International Speech Communication Association.
  19. V. Panayotov, G. Chen, D. Povey and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 5206–5210.
  20. P. J. Rousseeuw (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20, pp. 53–65.
  21. R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh and D. Batra (2016) Grad-CAM: visual explanations from deep networks via gradient-based localization. arXiv preprint arXiv:1610.02391.
  22. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9.
  23. J. Yosinski, J. Clune, A. Nguyen, T. Fuchs and H. Lipson (2015) Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579.
  24. M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833.