Self-explainability as an alternative to interpretability for judging the trustworthiness of artificial intelligences
The ability to explain decisions made by AI systems is highly sought after, especially in domains where human lives are at stake, such as medicine or autonomous vehicles. While it is always possible to approximate the input-output relations of deep neural networks with human-understandable rules, the discovery of the double descent phenomenon suggests that no such approximation will ever map onto the actual functioning of deep neural networks. Double descent indicates that deep neural networks typically operate by smoothly interpolating between data points rather than by extracting a few high-level rules. As a result, neural networks trained on complex real-world data are inherently hard to interpret and prone to failure if used outside their domain of applicability. To show how we might be able to trust AI despite these problems, we introduce the concept of self-explaining AI. Self-explaining AIs are capable of providing a human-understandable explanation of each decision along with confidence levels for both the decision and explanation. We sketch some difficulties with this approach along with possible solutions. Finally, we argue it is also important that AI systems warn their users when they are asked to perform outside their domain of applicability.
Keywords: interpretability · explainable artificial intelligence · trust · deep learning
There is growing interest in developing methods to explain deep neural network function, especially in high-risk areas such as medicine and driverless cars. Such explanations would be useful for ensuring that deep neural networks follow known rules and for troubleshooting failures. Despite much work in the area of model interpretation, the techniques that have been developed all have major flaws, often leading to much confusion regarding their use [35, 26]. Even more troubling, though, is the emerging understanding that deep neural networks function through the interpolation of data points, rather than extrapolation. This calls into question long-held narratives about deep neural networks “extracting” high-level features and rules, and suggests that current methods of explanation will always fall short of explaining how deep neural networks actually work.
In response to the difficulties raised by explaining black box models, Rudin argues for developing better interpretable models instead, contending that the “interpretability-accuracy” trade-off is a myth. While it is true that the notion of such a trade-off is not rigorously grounded, empirically in many domains the state-of-the-art systems are all deep neural networks. For instance, most state-of-the-art AI systems for computer vision are not interpretable in the sense required by Rudin. Even highly distilled and/or compressed models which achieve good performance on ImageNet require at least 100,000 free parameters. Moreover, the human brain also appears to be an overfit “black box” which performs interpolation, which means that how we understand brain function also needs to change. If evolution settled on a model (the brain) which is uninterpretable, then we should expect advanced AIs to also be of that type. Interestingly, although the human brain is a “black box”, we are able to trust each other. Part of this trust comes from our ability to “explain” our decision making in terms which make sense to us. Crucially, for trust to occur we must believe that a person is not being deliberately deceptive, and that their verbal explanations actually map onto the processes used in their brain to arrive at their decisions.
Motivated by how trust works between humans, in this work we explore the idea of self-explaining AIs. Self-explaining AIs yield two outputs - the decision and an explanation of that decision. This idea is not new, and it is something which was pursued in expert systems research in the 1980s. More recently, Kulesza et al. introduced a model which offers explanations and studied how such models allow for “explanatory debugging” and iterative refinement. However, in their work they restrict themselves to a simple interpretable model (a multinomial naive Bayes classifier). In this work we explore how to create trustworthy self-explaining AI for deep neural networks of arbitrary complexity.
After defining key terms, we discuss the challenge of interpreting deep neural networks raised by recent studies on generalization in deep learning. Then, we discuss how self-explaining AIs might be built. We argue that they should include at least three components - a measure of mutual information between the explanation and the decision, an uncertainty on both the explanation and decision, and a “warning system” which warns the user when the decision falls outside the domain of applicability of the system. We hope this work will inspire further work in this area which will ultimately lead to more trustworthy AI.
1.1 Interpretation, explanation, and self-explanation
As has been discussed at length elsewhere, different practitioners understand the term “interpretability” in different ways, leading to much confusion on the subject (for detailed reviews, see , , or ). The related term “explainability” is typically used in a synonymous fashion, although some have tried to draw a distinction between the two terms. Here we take explanation/explainability and interpretation/interpretability to be synonymous. Murdoch et al. define an explanation as a verbal account of neural network function which is descriptively accurate and relevant. By “descriptively accurate” they mean that the interpretation reproduces a large number of the input-output mappings of the model. The explanation may or may not map onto how the model works internally. Additionally, any explanation will be an approximation, and the degree of approximation which is deemed acceptable may vary by application. By “relevant” they mean that what counts as a relevant explanation is domain specific - it must be cast in terminology that is both understandable and relevant to users. For deep neural networks, the two desiderata of accuracy and relevance appear to be in tension - as we try to accurately explain the details of how a deep neural network interpolates, we move further from what may be considered relevant to the user.
This definition of explanation in terms of input-output mapping contrasts with a second meaning of the term explanation which we may call mechanistic explanation. Mechanistic explanations abstract faithfully (but approximately) the actual data transformations occurring in the model. To consider why mechanistic explanations can be useful, consider a deep learning model we trained recently to segment the L1 vertebra. The way a radiologist identifies the L1 vertebra is by scanning down from the top of the body and finding the last vertebra that has ribs attached to it, which is T12. L1 is directly below T12. In our experience, our models for identifying L1 tend to be brittle, indicating they probably use a different approach. For instance, they may do something like “locate the bright object in the middle of the image” or “locate the bright object which is just above the kidneys”. These techniques would not be as robust as the technique used by radiologists. If a self-explaining AI or AGI could describe human anatomy at a high level as a radiologist can, this would go a long way towards engendering trust.
There is another type of explanation we wish to discuss which we may call meta-level explanation. Richard P. Feynman said “What I cannot create, I do not understand”. Since we can create deep neural networks, we do understand them, in the sense of Feynman, and therefore we can explain them in terms of how we build them. More specifically, we can explain neural network function in terms of four components necessary for creating them - data, network architecture, learning rules, and objective functions. The way one explains deep neural network function from data, architecture, and training is analogous to how one explains animal behaviour using the theory of evolution. The evolution of architectures by “graduate student descent” and the explicit addition of inductive biases mirrors the evolution of organisms. Similarly, the training of architectures mirrors classical conditioning of animals as they get older. The explanation of animal behaviour in terms of meta-level theories like evolution and classical conditioning has proven to be enormously successful and stands in contrast to attempts to seek detailed mechanistic accounts.
Finally, the oft-used term black box also warrants discussion. The term is technically a misnomer since the precise workings of deep networks are fully transparent from their source code and network weights, and therefore for the sake of rigor it should not be used. A further point is that even if we did not have access to the source code or weights (for instance for intellectual property reasons, or because the relevant technical expertise is missing), it is likely that a large amount of information about the network’s function could be gleaned through careful study of its input-output relations. Developing mathematically rigorous techniques for “shining lights” into “black boxes” was a popular topic in early cybernetics research, and this subject is attracting renewed interest in the era of deep learning. As an example of what is achievable, it has recently been shown that weights can be inferred for ReLU networks through careful analysis of input-output relations. One way of designing a “self-explaining AI” would be to imbue the AI with the power to probe its own input-output relations so it can warn its user when it may be making an error and (ideally) also distill its functioning into a human-understandable format.
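As a toy illustration of what input-output probing alone can reveal (everything here is synthetic - a small random ReLU network stands in for the black box, and the probe points are arbitrary), one can detect the piecewise-linear structure of a ReLU network purely from queries: within a single linear region the finite-difference gradient is constant, so nearby probe points usually report numerically identical gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "black box": pretend we can only query it, as if weights were hidden.
W1 = rng.normal(size=(2, 16)); b1 = rng.normal(size=16)
W2 = rng.normal(size=(16, 1))

def black_box(x):
    return np.maximum(x @ W1 + b1, 0.0) @ W2   # one-hidden-layer ReLU net

def local_gradient(x, eps=1e-6):
    """Central finite-difference gradient of the black box at input x."""
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = eps
        g[i] = (black_box(x + e) - black_box(x - e)).item() / (2 * eps)
    return g

# Inside one linear region of a ReLU network the gradient is constant, so
# two nearby probes typically agree to numerical precision (they disagree
# sharply if a region boundary happens to lie between them).
x0 = np.array([0.5, -0.2])
g_a = local_gradient(x0)
g_b = local_gradient(x0 + 1e-4)
assert np.allclose(g_a, g_b, atol=1e-3)
```

Repeating such probes over many inputs maps out the network's linear regions, which is the flavor of analysis behind weight-recovery results for ReLU networks.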
1.2 Why deep neural networks are inherently non-interpretable
Many methods for interpretation of deep neural networks have been developed, such as sensitivity analysis (also called “saliency maps”), iterative mapping, “distilling” a neural network into a simpler model, exploring failure modes and adversarial examples [13, 15], visualizing filters in CNNs, activation maximization based visualizations, influence functions, Shapley values, Local Interpretable Model-agnostic Explanations (LIME), DeepLIFT, explanatory graphs, and layerwise relevance propagation. Yet all of these methods capture only particular aspects of neural network function, and their outputs are easy to misinterpret [35, 23, 42]. Many of these methods are also unstable and not robust to small changes [8, 42]. Still, deep neural networks are here to stay, and we expect them to become even more complex and inscrutable as time goes on. As explained in detail by Lillicrap & Kording, attempts to compress deep neural networks into simpler interpretable models with equivalent accuracy are doomed to fail when working with complex real-world data such as images or human language. If the world is messy and complex, then neural networks trained on real-world data will also be messy and complex.
On top of these issues, there is a more fundamental problem when it comes to explaining deep neural network function. For some years now it has been noted that deep neural networks have enormous capacity and seem to be vastly underdetermined, yet they still generalize. This was shown very starkly in 2016 when Zhang et al. showed how deep neural networks can memorize random labels on ImageNet images. More recently it has been shown that deep neural networks operate in a regime where the bias-variance trade-off no longer applies. As network capacity increases, test error first bottoms out and then starts to increase, but then (surprisingly) starts to decrease again after a particular capacity threshold is reached. Belkin et al. call this the “double descent phenomenon”, and it was also noted in an earlier paper by Spigler et al., who argue that the phenomenon is analogous to the “jamming transition” found in the physics of granular materials. The phenomenon of double descent appears to be universal to all machine learning [4, 5], although its presence can be masked by common practices such as early stopping [4, 30], which may explain why it took so long to be discovered.
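The double descent curve is easy to reproduce in miniature. The sketch below is illustrative, not taken from the studies cited above: random ReLU features with minimum-norm least squares stand in for a trained network, and a noisy sine curve stands in for real data. Test error typically peaks near the interpolation threshold (features = training points) and falls again in the heavily overparameterized regime.

```python
import numpy as np

def trial(widths, n_train=20, n_test=200, noise=0.3, seed=0):
    """Min-norm least-squares fits on random ReLU features of growing width."""
    rng = np.random.default_rng(seed)
    x_tr = rng.uniform(-1, 1, size=(n_train, 1))
    x_te = rng.uniform(-1, 1, size=(n_test, 1))
    target = lambda x: np.sin(3 * x)
    y_tr = target(x_tr) + noise * rng.normal(size=x_tr.shape)
    y_te = target(x_te)
    errs = []
    for m in widths:
        W = rng.normal(size=(1, m)); b = rng.normal(size=m)
        phi = lambda x: np.maximum(x @ W + b, 0.0)   # random ReLU features
        # np.linalg.lstsq returns the minimum-norm solution when m > n_train
        c, *_ = np.linalg.lstsq(phi(x_tr), y_tr, rcond=None)
        errs.append(float(np.mean((phi(x_te) @ c - y_te) ** 2)))
    return errs

widths = [5, 20, 2000]   # under-, at-, and far past the interpolation threshold
errs = np.mean([trial(widths, seed=s) for s in range(10)], axis=0)
# Averaged over seeds, error near the threshold exceeds the wide-model error.
assert errs[1] > errs[2]
```

The second descent here comes purely from the implicit regularization of the minimum-norm solution, which is one proposed explanation for why heavily overparameterized networks interpolate smoothly rather than wildly.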
In the regime where deep neural networks operate, they not only interpolate each training data point, but do so in a “direct” or “robust” way. This means that the interpolation does not exhibit the overshoot or undershoot typical of overfit models; rather, it is a smooth, almost piecewise interpolation between data points. Interpolation also brings with it a corollary - such models cannot extrapolate. The fact that deep neural networks cannot extrapolate calls into question popular ideas that deep neural networks “extract” high-level features and “discover” regularities in the world. Actually, deep neural networks are “dumb” - any regularities that they appear to have captured internally are due solely to the data that was fed to them, rather than to a self-directed “regularity extraction” process.
1.3 How can we trust a self-explaining AI’s explanation?
In his landmark 2014 book Superintelligence: Paths, Dangers, Strategies, Nick Bostrom notes that highly advanced AIs may be incentivized to deceive their creators until a point where they exhibit a “treacherous turn” against them. In the case of superintelligent or otherwise highly advanced AI, the possibility of deception appears to be a highly non-trivial concern. Here, however, we suggest some methods by which we can trust the explanations given by present day deep neural networks, such as typical convolutional neural networks or transformer language models. Whether these methods will still have utility when it comes to future AI systems is an open question.
To show how we might create trust, we focus on an explicit and relatively simple example. Shen et al. and later LaLonde et al. have both proposed deep neural networks for lung nodule classification which offer “explanations”. Both make use of a dataset in which clinicians have labeled lung nodules not only by severity (cancerous vs. non-cancerous) but have also scored them (on a scale of 1-5) on six visual attributes deemed relevant for diagnosis (subtlety, sphericity, margin, lobulation, spiculation, and texture). While the details of the proposed networks vary greatly, both output predictions for severity and scores for each of the visual attributes. Both authors claim that the visual attribute predictions “explain” the diagnostic prediction, since the diagnostic branch and visual attribute prediction branch(es) are connected near the base of the network. However, no evidence is presented that the visual attribute predictions are in any way related to the diagnosis prediction.
We would like to show that the attributes used in the explanation and the diagnosis output are related. This may be done by looking at the layer where the diagnosis and explanation branches diverge. There are many ways of quantifying the relatedness of two variables, the Pearson correlation being one of the simplest, but also one of the least useful in this context since it is only sensitive to linear relationships. A measure which is sensitive to non-linear relationships and which has a nice theoretical interpretation is the mutual information. For two random variables $X$ and $Y$ it is defined as:

$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$

where $H(\cdot)$ is the Shannon entropy. One can also define a mutual information correlation coefficient:

$$r_{MI}(X,Y) = \sqrt{1 - e^{-2I(X;Y)}}$$

This coefficient has the nice property that it reduces to the Pearson correlation coefficient in the case that $P(X,Y)$ is a Gaussian with non-zero covariance. The chief difficulty of applying mutual information is that the underlying probability distributions $P(X)$, $P(Y)$, and $P(X,Y)$ all have to be estimated. Various techniques exist for doing this, however, such as kernel density estimation with Parzen windows.
Suppose the latent vector at the branching layer is denoted by $Z$ and has length $N$. Denote the diagnosis output of the network as $d$ and the vector of predicted attributes as $\boldsymbol{a}$. Then for a particular attribute $a_i$ in our explanation word set we calculate the following to obtain a “relatedness” score between the two:

$$r_i = r_{MI}(a_i, d)$$

where the mutual information is estimated over the training data distribution.
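As a rough numerical sketch of these quantities, the mutual information and the corresponding Linfoot correlation coefficient can be estimated from paired samples. A simple histogram (plug-in) estimator is used here purely for illustration - the text suggests Parzen-window kernel density estimation, which behaves better for small samples - and all data are synthetic stand-ins for attribute and diagnosis outputs.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram (plug-in) estimate of I(X;Y) in nats from paired samples."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                 # joint distribution P(X, Y)
    px = pxy.sum(axis=1)                  # marginal P(X)
    py = pxy.sum(axis=0)                  # marginal P(Y)
    nz = pxy > 0                          # avoid log(0) on empty bins
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

def r_mi(x, y, bins=16):
    """Mutual-information correlation coefficient, r = sqrt(1 - e^(-2I))."""
    return float(np.sqrt(1.0 - np.exp(-2.0 * mutual_information(x, y, bins))))

# Toy check: a noisy nonlinear dependence scores higher than unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=5000)                      # stand-in attribute score
d = np.tanh(a) + 0.1 * rng.normal(size=5000)   # diagnosis depends on it
noise = rng.normal(size=5000)                  # unrelated variable
assert r_mi(a, d) > r_mi(noise, d)
```

Note that the plug-in estimator is biased upward for independent variables, so in practice one would compare scores against a shuffled baseline rather than against zero.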
An alternative (and perhaps complementary) method would be to train a surrogate (“post-hoc”) model to try to predict the diagnosis from the attributes (also shown in figure 1). We can learn two things from this surrogate model. First, if the model is not as accurate as the diagnosis branch of the main model, then we know the main model is using additional features. Second, we can change or scramble a particular attribute and see if the output of this model changes, on average.
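A minimal sketch of this surrogate approach follows. The data are synthetic (six attribute scores and a scalar "diagnosis" that depends on only two of them), and a plain linear least-squares fit stands in for whatever post-hoc model one actually chooses; the scrambling test then reveals which attributes the surrogate relies on.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in data: 6 attribute scores per case, and a "diagnosis" produced by
# a hypothetical main model that depends on attributes 0 and 2 only.
attrs = rng.normal(size=(1000, 6))
diagnosis = 2.0 * attrs[:, 0] - 1.0 * attrs[:, 2] + 0.1 * rng.normal(size=1000)

# Surrogate: linear least-squares map from attributes to diagnosis.
w, *_ = np.linalg.lstsq(attrs, diagnosis, rcond=None)
pred = attrs @ w

def scramble_effect(i):
    """Mean absolute change in surrogate output when attribute i is permuted."""
    scrambled = attrs.copy()
    scrambled[:, i] = rng.permutation(scrambled[:, i])
    return float(np.mean(np.abs(scrambled @ w - pred)))

effects = [scramble_effect(i) for i in range(6)]
# Attributes the surrogate relies on show large effects; unused ones do not.
assert effects[0] > effects[1] and effects[2] > effects[1]
```

Comparing the surrogate's accuracy to the main model's diagnosis branch then gives the first check described above: a large gap indicates the main model is drawing on features beyond the stated attributes.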
1.4 Ensuring robustness through applicability domains and uncertainty analysis
The concept of an “applicability domain” - the domain where a model makes good predictions - is well known in the area of molecular modeling known as quantitative structure-property relationships (QSPR), and a number of techniques have been developed for quantifying it (for a review, see  or ). However, the practice of quantifying the applicability domain of models has not become widespread in other areas where machine learning is applied. A simple way of defining the applicability domain is to calculate the convex hull of the latent vectors of all training data points. If the latent vector of a test data point falls on or outside the convex hull, then the model should send an alert saying that the test point falls outside the domain it was trained for.
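A minimal sketch of this convex-hull check, using `scipy.spatial.Delaunay` on two-dimensional stand-in latent vectors (in practice the latent dimension would likely need to be reduced first, since exact convex-hull tests scale poorly with dimension):

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)

# Latent vectors of the training set (2-D stand-ins for tractability).
train_latents = rng.normal(size=(500, 2))
hull = Delaunay(train_latents)   # triangulation covering the training hull

def in_applicability_domain(z):
    """True if latent vector z falls inside the training convex hull."""
    # find_simplex returns -1 for points outside the triangulation.
    return hull.find_simplex(np.atleast_2d(z))[0] >= 0

assert in_applicability_domain([0.0, 0.0])        # near the bulk of training data
assert not in_applicability_domain([10.0, 10.0])  # far outside: warn the user
```

At inference time the model would compute the test point's latent vector, run this check, and attach a warning to its output whenever the check fails.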
Finally, models should contain measures of uncertainty for both their decisions and their explanations. Ideally, this should be done in a Bayesian way using a Bayesian neural network. With the continued progress of Moore’s law, training Bayesian CNNs is now becoming feasible, and in our view this is a worthwhile use of additional CPU/GPU cycles. There are also approximate methods - for instance, it has been shown that random dropout during inference can be used to estimate uncertainties at little extra computational cost. Just as including experimental error bars is standard throughout science, and just as we would not trust a doctor who could not also give a confidence level for their diagnosis, uncertainty quantification should be standard practice in AI research.
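The dropout-based approximation can be sketched in a few lines of numpy (the tiny network and its weights here are random stand-ins for a trained model): dropout is left active at inference, the forward pass is repeated many times, and the spread of the outputs serves as an uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny one-hidden-layer network; the weights are random stand-ins.
W1 = rng.normal(size=(1, 64)); b1 = np.zeros(64)
W2 = rng.normal(size=(64, 1)); b2 = np.zeros(1)

def forward(x, p_drop=0.5):
    """One stochastic forward pass with dropout left ON at inference."""
    h = np.maximum(x @ W1 + b1, 0.0)       # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop    # random dropout mask
    h = h * mask / (1.0 - p_drop)          # inverted-dropout scaling
    return h @ W2 + b2

x = np.array([[0.3]])
samples = np.stack([forward(x) for _ in range(200)])
mean, std = samples.mean(), samples.std()
# `mean` is the prediction; `std` is a cheap uncertainty estimate on it.
assert std > 0.0
```

A self-explaining AI would report such a spread alongside both its decision and its explanation, flagging cases where either is unreliable.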
We argued that deep neural networks trained on complex real-world data are very difficult to interpret because their power arises from brute-force interpolation over big data rather than from the extraction of high-level generalizable rules. Motivated by this and by the need for trust in AI systems, we introduced the concept of self-explaining AI. We described how a simple self-explaining AI would function for diagnosing medical images such as chest X-rays or CT scans. To build trust, we showed how a mutual information metric can be used to verify that the explanation given is related to the diagnostic output. Crucially, in addition to an explanation, a self-explaining AI outputs confidence levels for both the decision and the explanation, further aiding our ability to gauge the trustworthiness of any given diagnosis. Finally, we argued that an applicability domain analysis should be done for AI systems where robustness and trust are important, and that systems should alert the user when they are asked to do work outside their domain of applicability.
2 Funding & disclaimer
No funding sources were used in the creation of this work. The author (Dr. Daniel C. Elton) wrote this article in his personal capacity. The views expressed are his own and do not necessarily represent the views of the National Institutes of Health or the United States Government.
- We note that while it may seem intuitive that the two output branches must be related, this must be rigorously shown for trustworthiness to hold. Non-intuitive behaviours have repeatedly been demonstrated in deep neural networks; for instance, it has been shown that commonly used networks based on rectified linear units (ReLUs) contain large “linear regions” in which the power of network non-linearity is not utilized [17, 16]. Indeed, even state-of-the-art models likely contain many unused or redundant connections, as evidenced by the repeated success of model compression techniques when applied to state-of-the-art image classification models.
- This weakness is acknowledged by Shen et al., who point out there are a multitude of other features known to be relevant which are not outputted, most notably location in the body, which is strongly associated with malignancy.
- Note that this sort of approach should not be taken as quantifying “information flows” in the network. In fact, since the output of units is continuous, the amount of information which can flow through the network is infinite (for discussion of this and how to recover the concept of “information flow” in neural networks see ). What we are doing is measuring the mutual information over the particular data distribution used.
- Ahmad, M.A., Eckert, C., Teredesai, A.: Interpretable machine learning in healthcare. In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB ’18. ACM Press (2018). https://doi.org/10.1145/3233547.3233667
- Ashby, W.R.: An introduction to cybernetics. London : Chapman & Hall (1956)
- Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10(7), e0130140 (Jul 2015). https://doi.org/10.1371/journal.pone.0130140
- Belkin, M., Hsu, D., Ma, S., Mandal, S.: Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116(32), 15849–15854 (Jul 2019). https://doi.org/10.1073/pnas.1903070116
- Belkin, M., Hsu, D., Xu, J.: Two models of double descent for weak features. arXiv preprint: 1903.07571 (2019)
- Bordes, F., Berthier, T., Jorio, L.D., Vincent, P., Bengio, Y.: Iteratively unveiling new regions of interest in deep learning models. In: Medical Imaging with Deep Learning (MIDL) (2018)
- Bostrom, N.: Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Inc., USA, 1st edn. (2014)
- Dombrowski, A.K., Alber, M., Anders, C.J., Ackermann, M., Müller, K.R., Kessel, P.: Explanations can be manipulated and geometry is to blame (2019)
- Elton, D.C., Sandfort, V., Pickhardt, P.J., Summers, R.M.: Accurately identifying vertebral levels in large datasets. In: In Proceedings of the SPIE: Medical Imaging. (2020)
- Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. Tech. Rep. 1341, University of Montreal (2009), also presented at the ICML 2009 Workshop on Learning Feature Hierarchies, Montréal, Canada.
- Frosst, N., Hinton, G.: Distilling a neural network into a soft decision tree. arXiv preprint: 1711.09784 (2017)
- Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of The 33rd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 48, pp. 1050–1059. PMLR, New York, New York, USA (20–22 Jun 2016)
- Goertzel, B.: Are there deep reasons underlying the pathologies of today’s deep learning algorithms? In: Artificial General Intelligence, pp. 70–79. Springer International Publishing (2015)
- Goldfeld, Z., Van Den Berg, E., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., Polyanskiy, Y.: Estimating information flow in deep neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 2299–2308. PMLR, Long Beach, California, USA (09–15 Jun 2019)
- Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint: 1412.6572 (2014)
- Hanin, B., Rolnick, D.: Complexity of linear regions in deep networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 2596–2604. PMLR, Long Beach, California, USA (09–15 Jun 2019)
- Hanin, B., Rolnick, D.: Deep ReLU networks have surprisingly few activation patterns. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 359–368. Curran Associates, Inc. (2019)
- Hasson, U., Nastase, S.A., Goldstein, A.: Direct-fit to nature: an evolutionary perspective on biological (and artificial) neural networks (Sep 2019). https://doi.org/10.1101/764258
- Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? In: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 5580–5590. NIPS'17, Curran Associates Inc., Red Hook, NY, USA (2017)
- Koh, P.W., Liang, P.: Understanding black-box predictions via influence functions. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70. pp. 1885–1894. ICML'17, JMLR.org (2017)
- Kulesza, T., Burnett, M., Wong, W.K., Stumpf, S.: Principles of explanatory debugging to personalize interactive machine learning. In: Proceedings of the 20th International Conference on Intelligent User Interfaces - IUI ’15. ACM Press (2015)
- LaLonde, R., Torigian, D., Bagci, U.: Encoding visual attributes in capsules for explainable medical diagnoses. arXiv preprint: 1909.05926 (Sep 2019)
- Lie, C.: Relevance in the eye of the beholder: Diagnosing classifications based on visualised layerwise relevance propagation. Master’s thesis, Lund University, Sweden (2019)
- Lillicrap, T.P., Kording, K.P.: What does it mean to understand a neural network? arXiv preprint: 1907.06374 (2019)
- Linfoot, E.: An informational measure of correlation. Information and Control 1(1), 85 – 89 (1957). https://doi.org/10.1016/S0019-9958(57)90116-X
- Lipton, Z.C.: The mythos of model interpretability. CoRR abs/1606.03490 (2016)
- Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 4765–4774. Curran Associates, Inc. (2017)
- McClure, P., Rho, N., Lee, J.A., Kaczmarzyk, J.R., Zheng, C.Y., Ghosh, S.S., Nielson, D.M., Thomas, A.G., Bandettini, P., Pereira, F.: Knowing what you know in brain segmentation using Bayesian deep neural networks. Frontiers in Neuroinformatics 13 (Oct 2019). https://doi.org/10.3389/fninf.2019.00067
- Murdoch, W.J., Singh, C., Kumbier, K., Abbasi-Asl, R., Yu, B.: Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences 116(44), 22071–22080 (Oct 2019). https://doi.org/10.1073/pnas.1900654116
- Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: Where bigger models and more data hurt. arXiv preprint: 1912.02292 (2019)
- Netzeva, T.I., Worth, A.P., Aldenberg, T., Benigni, R., Cronin, M.T., Gramatica, P., Jaworska, J.S., Kahn, S., Klopman, G., Marchant, C.A., Myatt, G., Nikolova-Jeliazkova, N., Patlewicz, G.Y., Perkins, R., Roberts, D.W., Schultz, T.W., Stanton, D.T., van de Sandt, J.J., Tong, W., Veith, G., Yang, C.: Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. Alternatives to Laboratory Animals 33(2), 155–173 (Apr 2005). https://doi.org/10.1177/026119290503300209
- Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?”: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD. ACM Press (2016). https://doi.org/10.1145/2939672.2939778
- Richards, B.A., Lillicrap, T.P., Beaudoin, P., Bengio, Y., Bogacz, R., Christensen, A., Clopath, C., Costa, R.P., de Berker, A., Ganguli, S., Gillon, C.J., Hafner, D., Kepecs, A., Kriegeskorte, N., Latham, P., Lindsay, G.W., Miller, K.D., Naud, R., Pack, C.C., Poirazi, P., Roelfsema, P., Sacramento, J., Saxe, A., Scellier, B., Schapiro, A.C., Senn, W., Wayne, G., Yamins, D., Zenke, F., Zylberberg, J., Therien, D., Kording, K.P.: A deep learning framework for neuroscience. Nature Neuroscience 22(11), 1761–1770 (Oct 2019). https://doi.org/10.1038/s41593-019-0520-2
- Rolnick, D., Kording, K.P.: Identifying weights and architectures of unknown ReLU networks. arXiv preprint: 1910.00744 (2019)
- Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1(5), 206–215 (May 2019). https://doi.org/10.1038/s42256-019-0048-x
- Sahigara, F., Mansouri, K., Ballabio, D., Mauri, A., Consonni, V., Todeschini, R.: Comparison of different approaches to define the applicability domain of QSAR models. Molecules 17(5), 4791–4810 (Apr 2012). https://doi.org/10.3390/molecules17054791
- Shen, S., Han, S.X., Aberle, D.R., Bui, A.A., Hsu, W.: An interpretable deep hierarchical semantic convolutional neural network for lung nodule malignancy classification. Expert Systems with Applications 128, 84–95 (Aug 2019). https://doi.org/10.1016/j.eswa.2019.01.048
- Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. arXiv preprint: 1704.02685 (2017)
- Spigler, S., Geiger, M., d’Ascoli, S., Sagun, L., Biroli, G., Wyart, M.: A jamming transition from under- to over-parametrization affects generalization in deep learning. Journal of Physics A: Mathematical and Theoretical 52(47), 474001 (Oct 2019). https://doi.org/10.1088/1751-8121/ab4c8b
- Swartout, W.R.: XPLAIN: a system for creating and explaining expert consulting programs. Artificial Intelligence 21(3), 285–325 (Sep 1983). https://doi.org/10.1016/s0004-3702(83)80014-9
- Torkkola, K.: Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research 3, 1415–1438 (2003)
- Yeh, C.K., Hsieh, C.Y., Suggala, A.S., Inouye, D.I., Ravikumar, P.: On the (in)fidelity and sensitivity for explanations. arXiv preprint: 1901.09392 (2019)
- Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Computer Vision – ECCV 2014, pp. 818–833. Springer International Publishing (2014)
- Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. arXiv preprint: 1611.03530 (2016)
- Zhang, Q., Cao, R., Shi, F., Wu, Y.N., Zhu, S.C.: Interpreting cnn knowledge via an explanatory graph. In: McIlraith, S.A., Weinberger, K.Q. (eds.) AAAI. pp. 4454–4463. AAAI Press (2018)