Self-explainability as an alternative to interpretability for judging the trustworthiness of artificial intelligences


The ability to explain decisions made by AI systems is highly sought after, especially in domains where human lives are at stake, such as medicine or autonomous vehicles. While it is always possible to approximate the input-output relations of deep neural networks with human-understandable rules, the discovery of the double descent phenomenon suggests that no such approximation will ever map onto the actual functioning of deep neural networks. Double descent indicates that deep neural networks typically operate by smoothly interpolating between data points rather than by extracting a few high level rules. As a result, neural networks trained on complex real world data are inherently hard to interpret and prone to failure if used outside their domain of applicability. To show how we might be able to trust AI despite these problems, we introduce the concept of self-explaining AI. Self-explaining AIs are capable of providing a human-understandable explanation of each decision along with confidence levels for both the decision and explanation. Some difficulties with this approach, along with possible solutions, are sketched. Finally, we argue it is also important that AI systems warn their user when they are asked to perform outside their domain of applicability.

Interpretability · explainable artificial intelligence · trust · deep learning

1 Introduction

There is growing interest in developing methods to explain deep neural network function, especially in high-risk areas such as medicine and driverless cars. Such explanations would be useful for ensuring that deep neural networks follow known rules and for troubleshooting failures. Despite much work in the area of model interpretation, the techniques that have been developed all have major flaws, often leading to much confusion regarding their use [31, 22]. Even more troubling, though, is an emerging understanding that deep neural networks function through the interpolation of data points, rather than extrapolation [16]. This calls into question long-held narratives about deep neural networks “extracting” high level features and rules, and suggests that current methods of explanation will always fall short of explaining how deep neural networks actually work.

In response to the difficulties raised by explaining black box models, Rudin argues for developing better interpretable models instead, contending that the “interpretability-accuracy” trade-off is a myth [31]. While it is true that the notion of such a trade-off is not rigorously grounded, empirically in many domains the state-of-the-art systems are all deep neural networks. For instance, most state-of-the-art AI systems for computer vision are not interpretable in the sense required by Rudin. Even highly distilled and/or compressed models which achieve good performance on ImageNet require at least 100,000 free parameters [20]. Moreover, the human brain also appears to be an overfit “black box” which performs interpolation, which means that how we understand brain function also needs to change [16]. If evolution settled on a model (the brain) which is uninterpretable, then we expect advanced AIs to also be of that type. Interestingly, although the human brain is a “black box”, we are able to trust each other. Part of this trust comes from our ability to “explain” our decision making in terms which make sense to us. Crucially, for trust to occur we must believe that a person is not being deliberately deceptive, and that their verbal explanations actually map onto the processes used in their brain to arrive at their decisions.

Motivated by how trust works between humans, in this work we explore the idea of self-explaining AIs. Self-explaining AIs yield two outputs - the decision and an explanation of that decision. This idea is not new, and it is something which was pursued in expert systems research in the 1980s [36]. More recently Kulesza et al. introduced a model which offers explanations and studied how such models allow for “explainable debugging” and iterative refinement [17]. However, in their work they restrict themselves to a simple interpretable model (a multinomial naive Bayes classifier). In this work we explore how to create trustworthy self-explaining AI for deep neural networks of arbitrary complexity.

After defining key terms, we discuss the challenge of interpreting deep neural networks raised by recent studies on generalization in deep learning. Then, we discuss how self-explaining AIs might be built. We argue that they should include at least three components - a measure of mutual information between the explanation and the decision, an uncertainty on both the explanation and decision, and a “warning system” which warns the user when the decision falls outside the domain of applicability of the system. We hope this work will inspire further work in this area which will ultimately lead to more trustworthy AI.
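
As a rough illustration of the components listed above, the output of such a system could be bundled into a single record. The sketch below is hypothetical: the class name `SelfExplainingOutput`, its field names, and the `render` helper are assumptions made for illustration and do not describe any existing system.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SelfExplainingOutput:
    """One prediction from a hypothetical self-explaining model."""
    decision: str                   # e.g. "malignant"
    decision_confidence: float      # confidence in the decision, in [0, 1]
    explanation: Dict[str, float]   # attribute name -> predicted score
    explanation_confidence: float   # confidence in the explanation, in [0, 1]
    relatedness: Dict[str, float]   # attribute name -> relatedness to the decision
    in_domain: bool                 # False if the input is outside the applicability domain


def render(output: SelfExplainingOutput) -> str:
    """Format a prediction for the user, leading with a warning when needed."""
    lines = []
    if not output.in_domain:
        lines.append("WARNING: input lies outside the model's domain of applicability.")
    lines.append(f"Decision: {output.decision} (confidence {output.decision_confidence:.2f})")
    lines.append(f"Explanation confidence: {output.explanation_confidence:.2f}")
    for name, score in output.explanation.items():
        rel = output.relatedness.get(name, float("nan"))
        lines.append(f"  {name}: score {score:.1f}, relatedness {rel:.2f}")
    return "\n".join(lines)


example = SelfExplainingOutput(
    decision="malignant",
    decision_confidence=0.90,
    explanation={"spiculation": 4.0},
    explanation_confidence=0.80,
    relatedness={"spiculation": 0.50},
    in_domain=False,
)
print(render(example))
```

Keeping the warning flag in the same record as the decision means a caller cannot read the decision without also having the domain check available.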

1.1 Interpretation, explanation, and self-explanation

As has been discussed at length elsewhere, different practitioners understand the term “interpretability” in different ways, leading to much confusion on the subject (for detailed reviews, see [22], [1], or [25]). The related term “explainability” is typically used in a synonymous fashion [31], although some have tried to draw a distinction between the two terms [18]. For the purpose of this paper, we take the two terms to be synonymous. Murdoch et al. argue that interpretable methods must be descriptively accurate and relevant [25]. By “accurate” they mean that the interpretation reproduces or explains a large amount of the input-output relations of the model, without attempting to explain how the model works internally. Any explanation will be an approximation, and the degree of approximation which is deemed acceptable may vary depending on application. This framing of explanation in terms of input-output mapping is in contrast to other (often older) notions of interpretability which attempt to explain mechanistically how a model processes data in ways that are understandable (i.e., without heavy use of math), such as visualizing feature maps in CNNs. Regarding “relevance”, what counts as a “relevant explanation” is domain specific – it must be cast in terminology that is both understandable and relevant to users. For deep neural networks, the two desiderata of accuracy and relevance appear to be in tension - as we try to accurately explain the details of how a deep neural network interpolates, we move further from what may be considered relevant to the user.

The oft-used term “black box” also warrants some discussion. The term is technically a misnomer since the precise workings of deep networks are fully transparent from their source code and network weights. A further point is that even if we did not have access to the source code or weights (for instance for intellectual property reasons, or because the relevant technical expertise is missing), it is likely that a large amount of information about the network’s function could be gleaned through careful study of its input-output relations. Developing mathematically rigorous techniques for “shining lights” into “black boxes” was a popular topic in early cybernetics research [2], and this subject is attracting renewed interest in the era of deep learning. As an example of what is achievable, recently it has been shown that weights can be inferred for ReLU networks through careful analysis of input-output relations [30]. One way of framing the problem of “self-explaining AI” is that we would like the AI algorithm to be capable of probing its own input-output relations so it can warn the user when it may be making an error and (ideally) also distill its functioning in a human-understandable way.

1.2 Why deep neural networks are inherently non-interpretable

Many methods for interpretation of deep neural networks have been developed, such as sensitivity analysis (also called “saliency maps”), iterative mapping [6], “distilling” a neural network into a simpler model [9], exploring failure modes and adversarial examples [11, 13], visualizing filters in CNNs [39], Shapley values [23], Local Interpretable Model-agnostic Explanations (LIME) [28], DeepLIFT [34], and layerwise relevance propagation [3]. Yet all of these methods capture only particular aspects of neural network function, and the outputs of these methods are very easy to misinterpret [31, 19, 38]. Many of these methods are also unstable and not robust to small changes [8, 38]. Nevertheless, deep neural networks are here to stay, and we expect them to become even more complex and inscrutable as time goes on. As explained in detail by Lillicrap & Kording [20], attempts to compress deep neural networks into simpler interpretable models with equivalent accuracy are doomed to fail when working with complex real world data such as images or human language. If the world is messy and complex, then neural networks trained on real world data will also be messy and complex.

On top of these issues, there is a more fundamental issue when it comes to giving explanations for deep neural network function. For some years now it has been noted that deep neural networks have enormous capacity and seem to be vastly underdetermined, yet they still generalize. This was shown very starkly in 2016 when Zhang et al. showed how deep neural networks can memorize random labels on ImageNet images [40]. More recently it has been shown that deep neural networks operate in a regime where the bias-variance trade-off no longer applies [4]. As network capacity increases, test error first bottoms out and then starts to increase, but then (surprisingly) starts to decrease after a particular capacity threshold is reached. Belkin et al. call this the “double descent phenomenon” [4], and it was also noted in an earlier paper by Spigler et al. [35], who argue the phenomenon is analogous to the “jamming transition” found in the physics of granular materials. The phenomenon of “double descent” appears to be universal to all machine learning [4, 5], although its presence can be masked by common practices such as early stopping [4, 26], which may explain why it took so long to be discovered.

In the regime where deep neural networks operate, they not only interpolate each training data point, but do so in a “direct” or “robust” way [16]. This means that the interpolation does not exhibit the overshoot or undershoot which is typical of overfit models; rather, it is a smooth, almost piecewise interpolation. Interpolation also brings with it a corollary: these networks cannot extrapolate. The fact that deep neural networks cannot extrapolate calls into question popular ideas that deep neural networks “extract” high level features and “discover” regularities in the world. Actually, deep neural networks are “dumb” - any regularities that they appear to have captured internally are solely due to the data that was fed to them, rather than a self-directed “regularity extraction” process. Richard P. Feynman said “What I cannot create, I do not understand”. Since we can create deep neural networks, we arguably can understand them, in the sense of Feynman. We can understand neural networks in terms of four components necessary for creating them - data, network architecture, learning rules, and objective functions [29]. Future theories of deep neural network function will be successful to the extent that they focus on these key ingredients, rather than the internal workings and weights of a trained model.

1.3 How can we trust a self-explaining AI’s explanation?

In his landmark 2014 book Superintelligence: Paths, Dangers, Strategies, Nick Bostrom notes that highly advanced AIs may be incentivized to deceive their creators until a point where they exhibit a “treacherous turn” against them [7]. In the case of superintelligent or otherwise highly advanced AI, the possibility of deception appears to be a highly non-trivial concern. Here however, we suggest some methods by which we can trust the explanations given by present day deep neural networks, such as typical convolutional neural networks or transformer language models. Whether these methods will still have utility when it comes to future AI systems is an open question.

To show how we might create trust, we focus on an explicit and relatively simple example. Shen et al. [33] and later LaLonde et al. [18] have both proposed deep neural networks for lung nodule classification which offer “explanations”. Both make use of a dataset where clinicians have labeled lung nodules not only by severity (cancerous vs non-cancerous) but also scored them (on a scale of 1-5) on six visual attributes which are deemed relevant for diagnosis (subtlety, sphericity, margin, lobulation, spiculation, and texture). While the details of the proposed networks vary greatly, both output predictions for severity and scores for each of the visual attributes. Both sets of authors claim that the visual attribute predictions “explain” the diagnostic prediction, since the diagnostic branch and visual attribute prediction branch(es) are connected near the base of the network. However, no evidence is presented that the visual attribute prediction is in any way related to the diagnosis prediction.¹ (The output activations of the last layer shared by both branches could be computed in a largely independent manner.) Additionally, even if the visual attributes were used, no weights are provided for the importance of each attribute to the prediction, and there may be other attributes/features of equal or greater importance that are used but not among those outputted.²

Figure 1: Sketch of a simple self-explaining AI system. Optional components are shown with dashed lines.

We would like to show that the attributes used in the explanation and the diagnosis output are related. This may be done by looking at the layer where the diagnosis and explanation branches diverge. There are many ways of quantifying the relatedness of two variables, the Pearson correlation being one of the simplest, but also one of the least useful in this context since it is only sensitive to linear relationships. A measure which is sensitive to non-linear relationships and which has a nice theoretical interpretation is the mutual information. For two random variables $X$ and $Y$ it is defined as:

$$\mathrm{MI}(X,Y) = H(X) + H(Y) - H(X,Y)$$

where $H$ is the Shannon entropy. One can also define a mutual information correlation coefficient [21]:

$$r(X,Y) = \sqrt{1 - e^{-2\,\mathrm{MI}(X,Y)}}$$

This coefficient has the nice property that it reduces to the magnitude of the Pearson correlation in the case that $p(x,y)$ is a Gaussian function with non-zero covariance. The chief difficulty of applying mutual information is that the underlying probability distributions $p(x)$, $p(y)$, and $p(x,y)$ all have to be estimated. Various techniques exist for doing this, however, such as kernel density estimation with Parzen windows [37].³
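
The Gaussian case mentioned above can be checked in two lines. The sketch below takes the coefficient in Linfoot's form [21], $r = \sqrt{1 - e^{-2\,\mathrm{MI}}}$ with the mutual information measured in nats, and uses the standard closed-form mutual information of a bivariate Gaussian with correlation $\rho$ (the symbol $\rho$ is introduced here for illustration):

$$\mathrm{MI}(X,Y) = -\tfrac{1}{2}\ln\left(1 - \rho^{2}\right)$$

$$r(X,Y) = \sqrt{1 - e^{-2\,\mathrm{MI}(X,Y)}} = \sqrt{1 - \left(1 - \rho^{2}\right)} = |\rho|$$

The coefficient therefore recovers the strength of a linear Gaussian relationship but not its sign, and it vanishes only when $\rho = 0$.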

Suppose the latent vector is denoted by $L$ and has length $N$. Denote the diagnosis of the network as $D$ and the vector of attributes as $A$. Then for a particular attribute $A_j$ in our explanation word set we calculate the following to obtain a “relatedness” score between the two:

$$R(A_j) = \sum_{i=1}^{N} \mathrm{MI}(L_i, D)\,\mathrm{MI}(L_i, A_j)$$

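A relatedness score of this kind can be sketched on synthetic data. The example assumes the score takes the form $R(A_j) = \sum_i \mathrm{MI}(L_i, D)\,\mathrm{MI}(L_i, A_j)$, summed over latent units, and it swaps in a simple histogram (plug-in) estimator of mutual information in place of the Parzen-window estimator cited above. The latent vectors, diagnosis, and attributes are all made up for illustration.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys, bins=12):
    """Histogram (plug-in) estimate of MI(X, Y) in nats from paired samples."""
    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant inputs
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(xs), discretize(ys)
    n = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # MI = sum over occupied cells of p(x,y) * log( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def relatedness(latent, diagnosis, attribute):
    """Assumed score: sum over latent units i of MI(L_i, D) * MI(L_i, A_j)."""
    score = 0.0
    for i in range(len(latent[0])):
        unit = [row[i] for row in latent]
        score += mutual_information(unit, diagnosis) * mutual_information(unit, attribute)
    return score


# Synthetic stand-in for a shared latent layer with four units.
rng = random.Random(0)
latent = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5000)]
diagnosis = [1.0 if row[0] > 0 else 0.0 for row in latent]        # driven by unit 0
attr_used = [row[0] + 0.3 * rng.gauss(0, 1) for row in latent]    # also driven by unit 0
attr_unused = [row[3] + 0.3 * rng.gauss(0, 1) for row in latent]  # driven by unit 3 only

print("attribute sharing a unit with the diagnosis:", relatedness(latent, diagnosis, attr_used))
print("attribute on an unrelated unit:", relatedness(latent, diagnosis, attr_unused))
```

The attribute that shares a latent unit with the diagnosis should score far higher than the one computed from an unrelated unit. Histogram estimates carry a small positive bias, so the second score is small rather than exactly zero.
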
An alternative (and perhaps complementary) method would be to train a surrogate (“post-hoc”) model to try to predict the diagnosis from the attributes (also shown in Figure 1). We can learn two things from this surrogate model. First, if the surrogate is not as accurate as the diagnosis branch of the main model, then we know the main model is using additional features. Second, we can change or scramble a particular attribute and see whether the output of the surrogate changes, on average.
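
The scrambling test can be sketched with a small surrogate. The example below is illustrative only: the attribute scores and the "main model" diagnosis are synthetic, the surrogate is assumed to be a plain logistic regression, and the helper names (`logit`, `fit_logistic`, `flip_rate`) are hypothetical.

```python
import math
import random

rng = random.Random(0)
MIDPOINT = 3.0  # centre of the 1-5 rating scale, used to centre the inputs

# Synthetic stand-in data: three attribute scores per nodule and the main model's
# binary diagnosis, which by construction depends on attributes 0 and 1 only.
attrs = [[rng.uniform(1, 5) for _ in range(3)] for _ in range(600)]
diagnosis = [1 if a[0] + 0.5 * a[1] > 4.5 else 0 for a in attrs]


def logit(weights, bias, x):
    z = bias + sum(w * (v - MIDPOINT) for w, v in zip(weights, x))
    return max(-30.0, min(30.0, z))  # clamp so exp() stays well behaved


def predict(weights, bias, x):
    return 1 if logit(weights, bias, x) > 0 else 0


def fit_logistic(xs, ys, epochs=100, lr=0.1):
    """Fit the surrogate with plain stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = 1.0 / (1.0 + math.exp(-logit(weights, bias, x))) - y
            weights = [w - lr * err * (v - MIDPOINT) for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def flip_rate(weights, bias, xs, column, rng):
    """Fraction of surrogate predictions that change when one attribute is shuffled."""
    shuffled = [x[column] for x in xs]
    rng.shuffle(shuffled)
    changed = 0
    for x, new_value in zip(xs, shuffled):
        scrambled = list(x)
        scrambled[column] = new_value
        changed += predict(weights, bias, x) != predict(weights, bias, scrambled)
    return changed / len(xs)


weights, bias = fit_logistic(attrs, diagnosis)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(attrs, diagnosis)) / len(attrs)
print(f"surrogate agreement with main model: {accuracy:.2f}")
for j in range(3):
    print(f"attribute {j}: flip rate {flip_rate(weights, bias, attrs, j, rng):.2f}")
```

A high flip rate suggests the surrogate relies on that attribute, and a rate near zero suggests it does not. A large gap between the surrogate's agreement and perfect agreement would point to features used by the main model that are missing from the explanation.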

1.4 Ensuring robustness through applicability domains and uncertainty analysis

The concept of an “applicability domain”, or the domain where a model makes good predictions, is well known in the area of molecular modeling known as quantitative structure property relationships (QSPR), and a number of techniques have been developed (for a review, see [32] or [27]). However, the practice of quantifying the applicability domain of models has not become widespread in other areas where machine learning is applied. A simple way of defining the applicability domain is to calculate the convex hull of the latent vectors for all training data points. If the latent vector of a test data point falls on or outside the convex hull, then the model should send an alert saying that the test point falls outside the domain it was trained for.
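
The convex hull check can be illustrated in two dimensions. The sketch below is a simplification: it assumes 2-D latent vectors so that the hull can be built with Andrew's monotone chain algorithm, whereas real latent vectors are high-dimensional and membership would usually be tested another way, for example with a linear program. The function names and the sample points are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_applicability_domain(hull, point):
    """True only if the point is strictly inside the hull (on or outside -> alert)."""
    return all(cross(hull[i], hull[(i + 1) % len(hull)], point) > 0
               for i in range(len(hull)))


# Hypothetical 2-D latent vectors of the training set.
train_latents = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 1.0)]
hull = convex_hull(train_latents)

for test_point in [(1.0, 1.0), (4.0, 1.5), (5.0, 5.0)]:
    ok = in_applicability_domain(hull, test_point)
    print(test_point, "ok" if ok else "ALERT: outside the training domain")
```

The strict inequality implements the "on or outside" rule from the text: a test point lying exactly on the hull boundary also triggers the alert.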

Finally, models should contain measures of uncertainty for both their decisions and their explanations. Ideally, this would be performed in a fully Bayesian way using a Bayesian neural network [24]. A cheaper approximation is also available: it has been shown that random dropout during inference can be used to estimate uncertainties at little extra computational cost [10].
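
Dropout-based uncertainty estimation can be sketched with a toy network. The example below is a minimal illustration, not the procedure of [10] in full: the network is a single hidden layer with made-up, untrained weights, and the names `forward` and `mc_dropout_predict` are hypothetical. The idea it shows is to keep dropout switched on at inference, repeat the forward pass many times, and report the spread of the outputs.

```python
import math
import random
import statistics


def forward(x, w_hidden, w_out, rng, drop_prob):
    """One stochastic forward pass with dropout left on at inference time."""
    hidden = []
    for weights in w_hidden:
        activation = max(0.0, sum(w * v for w, v in zip(weights, x)))  # ReLU unit
        keep = rng.random() >= drop_prob
        # Inverted dropout: rescale kept units so the expected activation is unchanged.
        hidden.append(activation / (1.0 - drop_prob) if keep else 0.0)
    z = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid output, e.g. P(malignant)


def mc_dropout_predict(x, w_hidden, w_out, n_passes=200, drop_prob=0.5, seed=0):
    """Mean and standard deviation of the output over repeated dropout passes."""
    rng = random.Random(seed)
    samples = [forward(x, w_hidden, w_out, rng, drop_prob) for _ in range(n_passes)]
    return statistics.mean(samples), statistics.stdev(samples)


# Made-up weights standing in for a trained network (3 inputs, 16 hidden units).
weight_rng = random.Random(1)
w_hidden = [[weight_rng.gauss(0, 1) for _ in range(3)] for _ in range(16)]
w_out = [weight_rng.gauss(0, 0.5) for _ in range(16)]

mean, std = mc_dropout_predict([0.2, -1.0, 0.7], w_hidden, w_out)
print(f"prediction {mean:.2f} +/- {std:.2f}")
```

The same procedure can be applied to each output of the explanation branch to obtain an uncertainty for every attribute score.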

1.5 Conclusion

We argued that deep neural networks trained on complex real world data are very difficult to interpret because their power arises from brute-force interpolation over big data rather than from the extraction of high level generalizable rules. Motivated by this and by the need for trust in AI systems, we introduced the concept of self-explaining AI. We described how a simple self-explaining AI would function for diagnosing medical images such as chest X-rays or CT scans. To build trust, we showed how a mutual information metric can be used to verify that the explanation given is related to the diagnostic output. Crucially, in addition to an explanation, self-explaining AI outputs confidence levels for both the decision and explanation, further aiding our ability to gauge the trustworthiness of any given diagnosis. Finally, we argued that an applicability domain analysis should be done for AI systems where robustness and trust are important, and that systems should alert the user if they are asked to do work outside their domain of applicability.

2 Funding & disclaimer

No funding sources were used in the creation of this work. The author (Dr. Daniel C. Elton) wrote this article in his personal capacity. The views expressed are his own and do not necessarily represent the views of the National Institutes of Health or the United States Government.


  1. We note that while it may seem intuitive that the two output branches must be related, this must be rigorously shown for trustworthiness to hold. Non-intuitive behaviours have repeatedly been demonstrated in deep neural networks; for instance, it has been shown that commonly used networks based on rectified linear units (ReLUs) contain large “linear regions” in which the power of network non-linearity is not utilized [15, 14]. Indeed, even state-of-the-art models likely contain many unused or redundant connections, as evidenced by the repeated success of model compression techniques when applied to state-of-the-art image classification models.
  2. This weakness is acknowledged by Shen et al., who point out there are a multitude of other features known to be relevant which are not outputted, most notably location in the body, which is strongly associated with malignancy [33].
  3. Note that this sort of approach should not be taken as quantifying “information flows” in the network. In fact, since the output of units is continuous, the amount of information which can flow through the network is infinite (for discussion of this and how to recover the concept of “information flow” in neural networks see [12]). What we are doing is measuring the mutual information over the particular data distribution used.

References


  1. Ahmad, M.A., Eckert, C., Teredesai, A.: Interpretable machine learning in healthcare. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB ’18. ACM Press (2018).
  2. Ashby, W.R.: An introduction to cybernetics. Chapman & Hall, London (1956)
  3. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10(7), e0130140 (Jul 2015).
  4. Belkin, M., Hsu, D., Ma, S., Mandal, S.: Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116(32), 15849–15854 (Jul 2019).
  5. Belkin, M., Hsu, D., Xu, J.: Two models of double descent for weak features. arXiv preprint: 1903.07571 (2019)
  6. Bordes, F., Berthier, T., Jorio, L.D., Vincent, P., Bengio, Y.: Iteratively unveiling new regions of interest in deep learning models. In: Medical Imaging with Deep Learning (MIDL) (2018)
  7. Bostrom, N.: Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Inc., USA, 1st edn. (2014)
  8. Dombrowski, A.K., Alber, M., Anders, C.J., Ackermann, M., Müller, K.R., Kessel, P.: Explanations can be manipulated and geometry is to blame (2019)
  9. Frosst, N., Hinton, G.: Distilling a neural network into a soft decision tree. arXiv preprint: 1711.09784 (2017)
  10. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of The 33rd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 48, pp. 1050–1059. PMLR, New York, New York, USA (20–22 Jun 2016)
  11. Goertzel, B.: Are there deep reasons underlying the pathologies of today’s deep learning algorithms? In: Artificial General Intelligence, pp. 70–79. Springer International Publishing (2015)
  12. Goldfeld, Z., Van Den Berg, E., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., Polyanskiy, Y.: Estimating information flow in deep neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 2299–2308. PMLR, Long Beach, California, USA (09–15 Jun 2019)
  13. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint: 1412.6572 (2014)
  14. Hanin, B., Rolnick, D.: Complexity of linear regions in deep networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 2596–2604. PMLR, Long Beach, California, USA (09–15 Jun 2019)
  15. Hanin, B., Rolnick, D.: Deep ReLU networks have surprisingly few activation patterns. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 359–368. Curran Associates, Inc. (2019)
  16. Hasson, U., Nastase, S.A., Goldstein, A.: Direct-fit to nature: an evolutionary perspective on biological (and artificial) neural networks (Sep 2019).
  17. Kulesza, T., Burnett, M., Wong, W.K., Stumpf, S.: Principles of explanatory debugging to personalize interactive machine learning. In: Proceedings of the 20th International Conference on Intelligent User Interfaces - IUI ’15. ACM Press (2015)
  18. LaLonde, R., Torigian, D., Bagci, U.: Encoding Visual Attributes in Capsules for Explainable Medical Diagnoses. arXiv preprint: 1909.05926 (2019)
  19. Lie, C.: Relevance in the eye of the beholder: Diagnosing classifications based on visualised layerwise relevance propagation. Master’s thesis, Lund University, Sweden (2019)
  20. Lillicrap, T.P., Kording, K.P.: What does it mean to understand a neural network? arXiv preprint: 1907.06374 (2019)
  21. Linfoot, E.: An informational measure of correlation. Information and Control 1(1), 85 – 89 (1957).
  22. Lipton, Z.C.: The mythos of model interpretability. arXiv preprint: 1606.03490 (2016)
  23. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 4765–4774. Curran Associates, Inc. (2017)
  24. McClure, P., Rho, N., Lee, J.A., Kaczmarzyk, J.R., Zheng, C.Y., Ghosh, S.S., Nielson, D.M., Thomas, A.G., Bandettini, P., Pereira, F.: Knowing what you know in brain segmentation using Bayesian deep neural networks. Frontiers in Neuroinformatics 13 (Oct 2019).
  25. Murdoch, W.J., Singh, C., Kumbier, K., Abbasi-Asl, R., Yu, B.: Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences 116(44), 22071–22080 (Oct 2019).
  26. Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: Where bigger models and more data hurt. arXiv preprint: 1912.02292 (2019)
  27. Netzeva, T.I., Worth, A.P., Aldenberg, T., Benigni, R., Cronin, M.T., Gramatica, P., Jaworska, J.S., Kahn, S., Klopman, G., Marchant, C.A., Myatt, G., Nikolova-Jeliazkova, N., Patlewicz, G.Y., Perkins, R., Roberts, D.W., Schultz, T.W., Stanton, D.T., van de Sandt, J.J., Tong, W., Veith, G., Yang, C.: Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. Alternatives to Laboratory Animals 33(2), 155–173 (Apr 2005).
  28. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?”: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD. ACM Press (2016).
  29. Richards, B.A., Lillicrap, T.P., Beaudoin, P., Bengio, Y., Bogacz, R., Christensen, A., Clopath, C., Costa, R.P., de Berker, A., Ganguli, S., Gillon, C.J., Hafner, D., Kepecs, A., Kriegeskorte, N., Latham, P., Lindsay, G.W., Miller, K.D., Naud, R., Pack, C.C., Poirazi, P., Roelfsema, P., Sacramento, J., Saxe, A., Scellier, B., Schapiro, A.C., Senn, W., Wayne, G., Yamins, D., Zenke, F., Zylberberg, J., Therien, D., Kording, K.P.: A deep learning framework for neuroscience. Nature Neuroscience 22(11), 1761–1770 (Oct 2019).
  30. Rolnick, D., Kording, K.P.: Identifying weights and architectures of unknown ReLU networks. arXiv preprint: 1910.00744 (2019)
  31. Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1(5), 206–215 (May 2019).
  32. Sahigara, F., Mansouri, K., Ballabio, D., Mauri, A., Consonni, V., Todeschini, R.: Comparison of different approaches to define the applicability domain of QSAR models. Molecules 17(5), 4791–4810 (Apr 2012).
  33. Shen, S., Han, S.X., Aberle, D.R., Bui, A.A., Hsu, W.: An interpretable deep hierarchical semantic convolutional neural network for lung nodule malignancy classification. Expert Systems with Applications 128, 84–95 (Aug 2019).
  34. Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. arXiv preprint: 1704.02685 (2017)
  35. Spigler, S., Geiger, M., d’Ascoli, S., Sagun, L., Biroli, G., Wyart, M.: A jamming transition from under- to over-parametrization affects generalization in deep learning. Journal of Physics A: Mathematical and Theoretical 52(47), 474001 (Oct 2019).
  36. Swartout, W.R.: XPLAIN: a system for creating and explaining expert consulting programs. Artificial Intelligence 21(3), 285–325 (Sep 1983).
  37. Torkkola, K.: Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research 3, 1415–1438 (2003)
  38. Yeh, C.K., Hsieh, C.Y., Suggala, A.S., Inouye, D.I., Ravikumar, P.: On the (in)fidelity and sensitivity for explanations. arXiv preprint: 1901.09392 (2019)
  39. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Computer Vision – ECCV 2014, pp. 818–833. Springer International Publishing (2014)
  40. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. arXiv preprint: 1611.03530 (2016)