# Learning Explanations from Language Data

###### Abstract

PatternAttribution is a recent method, introduced in the vision domain, that explains classifications of deep neural networks. We demonstrate that it also generates meaningful interpretations in the language domain.

2018 EMNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP)

## 1 Introduction

In the last decade, deep neural classifiers have achieved state-of-the-art results in many domains, among them vision and language. Due to the complexity of a deep neural model, however, it is difficult to explain its decisions. Understanding its decision process potentially allows us to improve the model and may reveal new knowledge about the input.

Recently, Kindermans et al. (2018) claimed that “popular explanation approaches for neural networks (…) do not provide the correct explanation, even for a simple linear model.” They show that in a linear model, the weights serve to cancel noise in the input data and thus the weights show how to extract the signal but not what the signal is. This is why explanation methods need to move beyond the weights, the authors explain, and they propose the methods “PatternNet” and “PatternAttribution” that learn explanations from data. We test their approach in the language domain and point to room for improvement in the new framework.

## 2 Methods

Kindermans et al. (2018) assume that the data $x$ passed to a linear model $y = w^\top x$ is composed of signal $s$ and noise $d$ (from distraction), $x = s + d$. Furthermore, they also assume that there is a linear relation between signal and target, $s = a\, y$, where $a$ is a so-called signal base vector, which is in fact the "pattern" that PatternNet finds for us. As mentioned in the introduction, the authors show that in the model above, $w$ serves to cancel the noise such that

$$w^\top d = 0 \quad \text{and thus} \quad w^\top s = w^\top a \, y = y. \tag{1}$$

They go on to explain that a good signal estimator $S(x)$ should comply with the conditions in Eqs. 1, but that these alone form an ill-posed quality criterion, since $S_a(x) = a\, w^\top x$ already satisfies them for any $a$ for which $w^\top a = 1$. To address this issue they introduce another quality criterion over a batch of data $X$:

$$\rho(S) = 1 - \max_{v} \operatorname{corr}\!\left(w^\top X,\; v^\top \left(X - S(X)\right)\right) \tag{2}$$

and point out that Eq. 2 yields maximum values for signal estimators that remove most of the information about the output $y$ from the noise.
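As a sanity check on this decomposition, the toy simulation below (our own illustrative construction, not code from Kindermans et al.) builds data $x = s + d$ with a weight vector that cancels the noise, verifies both conditions of Eqs. 1, and confirms the vanishing correlation that Eq. 2 rewards for the true signal estimator; probing the residual along the known noise direction stands in for the maximization over $v$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Toy two-dimensional setup (illustrative values, not from the paper):
w = np.array([1.0, -1.0])      # model weights, y = w^T x
a = np.array([0.5, -0.5])      # signal base vector ("pattern"), w^T a = 1
d_dir = np.array([1.0, 1.0])   # distractor direction, w^T d_dir = 0

y = rng.normal(size=n)                    # outputs
s = np.outer(y, a)                        # signal: s = a * y
d = np.outer(rng.normal(size=n), d_dir)   # noise, orthogonal to w
x = s + d

# Eqs. 1: the weights cancel the noise and recover y from the signal.
print(np.allclose(d @ w, 0.0), np.allclose(x @ w, y))  # True True

# Eq. 2: for the true signal estimator S(x) = s, the residual x - S(x) = d
# carries (almost) no correlation with the output, so rho is maximal.
corr = np.corrcoef(x @ w, (x - s) @ d_dir)[0, 1]
print(abs(corr) < 0.05)  # True
```

With two dimensions the single noise direction is known, which is why one fixed projection suffices here; in higher dimensions the maximization over $v$ in Eq. 2 matters.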

We argue that Eq. 2 is still not exhaustive. Consider the artificial estimator

$$S_\epsilon(x) = s + \epsilon d,$$

which arguably is a bad signal estimator for large $\epsilon$ as its estimate contains scaled noise, $\epsilon d$. Nevertheless, it still satisfies Eqs. 1 and yields maximum values for Eq. 2, since

$$x - S_\epsilon(x) = (1 - \epsilon)\, d$$

is again just scaled noise and thus does not correlate with the output $y$. To solve this issue, we propose the following criterion:

$$\mu(S) = \left(1 - \max_{u,\, v} \operatorname{corr}\!\left(u^\top S(X),\; v^\top \left(X - S(X)\right)\right)\right) - \max_{v} \operatorname{corr}\!\left(w^\top X,\; v^\top \left(X - S(X)\right)\right) \tag{3}$$

The minuend measures how much noise is left in the signal estimate; the subtrahend measures how much signal is left in the noise. Good signal estimators separate signal and noise well and thus yield large values for this criterion. We leave it to future research to evaluate existing signal estimators against our new criterion.
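The flaw can be replayed numerically. The sketch below (again our own toy construction, with a fixed noise-direction probe standing in for the maximization over projections) shows that the artificial estimator leaves a residual that is uncorrelated with the output, so it passes the Eq. 2 check, while its estimate correlates almost perfectly with that residual noise, which is exactly what a criterion penalizing noise left in the signal detects:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Same toy setup as before (illustrative values).
w = np.array([1.0, -1.0])
a = np.array([0.5, -0.5])       # pattern, w^T a = 1
d_dir = np.array([1.0, 1.0])    # noise direction, w^T d_dir = 0
y = rng.normal(size=n)
s = np.outer(y, a)
d = np.outer(rng.normal(size=n), d_dir)
x = s + d

def abs_corr(u, v):
    # Stand-in for maximizing over projection vectors: we probe the
    # known noise direction directly.
    return abs(np.corrcoef(u, v)[0, 1])

eps = 100.0
S = s + eps * d     # artificial estimator: signal plus heavily scaled noise
residual = x - S    # = (1 - eps) * d, again just scaled noise

# Passes the Eq. 2 check: the residual tells us nothing about the output.
print(abs_corr(x @ w, residual @ d_dir) < 0.05)      # True

# ...but the "signal" estimate is dominated by noise: it correlates almost
# perfectly with the residual noise, which our criterion penalizes.
print(abs_corr(S @ d_dir, residual @ d_dir) > 0.99)  # True
```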

For our experiments, the authors equip us with expressions for the signal base vectors of simple linear layers and ReLU layers. For the simple linear model, for instance, it turns out that $a = \operatorname{cov}(x, y) / \sigma_y^2$. To retrieve contributions with PatternAttribution, the authors replace the weights $w$ by $w \odot a$ in the backward pass.
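To make the linear-layer case concrete, the sketch below (our own toy example, not the authors' code) estimates the pattern $a = \operatorname{cov}(x, y)/\sigma_y^2$ from data and forms per-dimension contributions $(w \odot a)\, y$ for one input; these sum back to the model output because $w^\top a = 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Toy data as above: x = a_true * y + noise orthogonal to w (illustrative).
w = np.array([1.0, -1.0])
a_true = np.array([0.5, -0.5])
d_dir = np.array([1.0, 1.0])
y = rng.normal(size=n)
x = np.outer(y, a_true) + np.outer(rng.normal(size=n), d_dir)

y_hat = x @ w  # linear model outputs

# Pattern of a linear neuron: a = cov(x, y) / sigma_y^2, dimension-wise.
cov_xy = ((x - x.mean(axis=0)) * (y_hat - y_hat.mean())[:, None]).mean(axis=0)
a_est = cov_xy / y_hat.var()

print(np.allclose(a_est, a_true, atol=0.05))  # True: recovers the pattern

# PatternAttribution for this layer: replace w by w ⊙ a in the backward
# pass, giving per-dimension contributions r = (w ⊙ a) * y for one input.
r = w * a_est * y_hat[0]
print(np.isclose(r.sum(), y_hat[0], rtol=0.05, atol=1e-3))  # True
```

The conservation property in the last line (contributions summing to the output) follows directly from $w^\top a = 1$ and holds up to sampling error in the estimated pattern.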

## 3 Experiments

To test PatternAttribution in the NLP domain, we trained a CNN text classifier (Kim, 2014) on a subset of the Amazon review polarity data set (Zhang et al., 2015). We used 150 bigram filters, dropout regularization, and a dense fully connected projection with 128 neurons. Our classifier achieves an F score of 0.875 on a fixed test split. We then used PatternAttribution (Kindermans et al., 2018) to retrieve neuron-wise signal contributions in the input vector space. (Our experiments are available at https://github.com/DFKI-NLP/language-attributions.)
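For reference, a single bigram filter of such a classifier can be sketched as follows (a minimal NumPy toy with invented dimensions, not our actual training code; the full model stacks 150 such filters with dropout and a dense projection on top):

```python
import numpy as np

rng = np.random.default_rng(2)

seq_len, emb_dim = 6, 8
sentence = rng.normal(size=(seq_len, emb_dim))  # toy word embeddings
filt = rng.normal(size=(2, emb_dim))            # one bigram filter (width 2)
bias = 0.1

# Slide the filter over all bigram positions, apply ReLU, max-pool over time.
acts = np.array([np.sum(sentence[i:i + 2] * filt) + bias
                 for i in range(seq_len - 1)])
feature = np.maximum(acts, 0.0).max()  # one pooled feature for this filter

print(acts.shape == (5,), feature >= 0.0)  # True True
```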

To align these contributions with plain text, we summed up the contribution scores over the word vector dimensions for each word and used the accumulated scores to scale RGB values for word highlights in the plain text space. Positive scores are highlighted in red, negative scores in blue. This approach is inspired by Arras et al. (2017a). Example contributions are shown in Figs. 1 and 2.
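The alignment step itself is straightforward; the sketch below uses hypothetical contribution scores (the words and all numbers are invented for illustration) and maps each word's summed score to a red or blue highlight:

```python
import numpy as np

# Hypothetical per-dimension contributions for a 4-word review snippet,
# shape (num_words, embedding_dim); all values invented for illustration.
words = ["the", "movie", "was", "boring"]
contrib = np.array([
    [0.02, 0.01, 0.03, 0.00],
    [0.05, 0.04, 0.02, 0.03],
    [0.01, 0.00, 0.02, 0.01],
    [-0.20, -0.25, -0.15, -0.22],
])

# Sum contributions over the word vector dimensions: one score per word.
scores = contrib.sum(axis=1)

def rgb(score, max_abs):
    """Scale a score to an RGB highlight: positive -> red, negative -> blue."""
    t = int(255 * abs(score) / max_abs)
    return (255, 255 - t, 255 - t) if score >= 0 else (255 - t, 255 - t, 255)

max_abs = np.abs(scores).max()
for word, score in zip(words, scores):
    print(f"{word}: {score:+.2f} {rgb(score, max_abs)}")
```

Normalizing by the maximum absolute score per document keeps the most influential word fully saturated while weak contributions fade toward white.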

## 4 Results

We observe that bigrams are highlighted; in particular, no highlighted token stands isolated. Bigrams with clear positive or negative sentiment contribute heavily to the sentiment classification. In contrast, stop words and uninformative bigrams make little to no contribution. We consider these meaningful explanations of the sentiment classifications.

## 5 Related Work

Many of the approaches used to explain and interpret models in NLP mirror methods originally developed in the vision domain, such as the recent approaches by Li et al. (2016), Arras et al. (2017a), and Arras et al. (2017b). In this paper we implemented a similar strategy.

Following Kindermans et al. (2018), however, our approach improves upon the latter methods for the reasons outlined above. Furthermore, PatternAttribution is related to Montavon et al. (2017) who make use of Taylor decompositions to explain deep models. PatternAttribution reveals a good root point for the decomposition, the authors explain.

## 6 Conclusion

We successfully transferred a new explanation method to the NLP domain. We were able to demonstrate that PatternAttribution can be used to identify meaningful signal contributions in text inputs. Our method should be extended to other popular models in NLP. Furthermore, we introduced an improved quality criterion for signal estimators. In the future, estimators can be deduced from and tested against our new criterion.
\* Co-first authorship.

This research was partially supported by the German Federal Ministry of Education and Research through the projects DEEPLEE (01IW17001) and BBDC (01IS14013E).

## References

- Arras et al. (2017a) Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. 2017a. “What is relevant in a text document?”: An interpretable machine learning approach. PLOS ONE, 12(8).
- Arras et al. (2017b) Leila Arras, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. 2017b. Explaining recurrent neural network predictions in sentiment analysis. In Proceedings of the 8th EMNLP Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 159–168.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.
- Kindermans et al. (2018) Pieter-Jan Kindermans, Kristof T. Schütt, Maximilian Alber, Klaus-Robert Müller, Dumitru Erhan, Been Kim, and Sven Dähne. 2018. Learning how to explain neural networks: PatternNet and PatternAttribution. In International Conference on Learning Representations (ICLR).
- Li et al. (2016) Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2016. Visualizing and understanding neural models in NLP. In Proceedings of NAACL-HLT, pages 681–691.
- Montavon et al. (2017) Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. 2017. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems (NIPS), pages 649–657.