Explanation Methods in Deep Learning: Users, Values, Concerns and Challenges
Issues regarding explainable AI involve four components: users, laws & regulations, explanations and algorithms. Together these components provide a context in which explanation methods can be evaluated regarding their adequacy. The goal of this chapter is to bridge the gap between expert users and lay users. Different kinds of users are identified and their concerns revealed, relevant statements from the General Data Protection Regulation are analyzed in the context of Deep Neural Networks (DNNs), a taxonomy for the classification of existing explanation methods is introduced, and finally, the various classes of explanation methods are analyzed to assess whether user concerns are justified. Overall, it is clear that (visual) explanations can be given about various aspects of the influence of the input on the output. However, explanation methods and interfaces aimed at lay users are still missing, and we speculate about the criteria such methods and interfaces should satisfy. Finally, it is noted that two important concerns are difficult to address with explanation methods: the concern about bias in datasets that leads to biased DNNs, and the suspicion of unfair outcomes.
Increasingly, Artificial Intelligence (AI) is used in order to derive actionable outcomes from data (e.g. categorizations, predictions, decisions). The overall goal of this chapter is to bridge the gap between expert users and lay users, highlighting the explanation needs of both sides and analyzing the current state of explainability. We do this by taking a more detailed look at each component mentioned above and in Figure 1. Finally we address some concerns in the context of DNNs.
1.1 The components of explainability
Issues regarding explainable AI (XAI) involve (at least) four components: users, laws and regulations, explanations and algorithms. Together these components provide a context in which explanation methods can be evaluated regarding their adequacy. We briefly discuss each of these components, which are shown in Figure 1.
1.2 Users and laws
AI has a serious impact on society, due to the large-scale adoption of digital automation techniques that involve information processing and prediction. Deep Neural Networks (DNNs) belong to a class of automation techniques that is used increasingly because of its capability to learn from raw information. DNNs are fed vast amounts of digital information that is easily collected from users. Currently there is much debate regarding the safety of and trust in data processes in general, leading to investigations regarding the explainability of AI-supported decision making. The level of concern about these topics is reflected by official regulations such as the General Data Protection Regulation (GDPR), also mentioned in [, ], incentives to promote the field of explainability [] and institutional initiatives to ensure the safe development of AI such as OpenAI. As the technology becomes more widespread, DNNs in particular, the dependency on this technology increases and trust in DNN technology becomes a necessity. Current DNNs achieve unparalleled performance in areas of Computer Vision (CV) and Natural Language Processing (NLP), and are used in real-world applications such as bone age assessment [] in medical imagery, critical vision-based applications in Tesla cars [], and legal technology that assists lawyers.
1.3 Explanation and DNNs
The challenge with DNNs in particular lies in providing insight into the processes leading to their outcomes, and thereby helping to clarify under which circumstances they can be trusted to perform as intended and when they cannot. Unlike other methods in Machine Learning (ML), such as decision trees or Bayesian networks, an explanation for a certain decision made by a DNN cannot be retrieved by simply looking at the internal process. The inner representation and the flow of information are complicated: 1) As architectures get deeper, the number of learnable parameters increases. It is not uncommon to have networks with millions of parameters. 2) As architectures get more complex, often consisting of various types of components (unit type, activation functions, regularization techniques, connections between units, memory mechanisms, cost functions), the net result of the interaction between these components is oftentimes unknown. And finally 3) more complex architectures lead to a more complex information flow. Because of these complications, DNNs are often called black-box models, as opposed to glass-box models []. Fortunately, these problems have not escaped the attention of the ML/Deep Learning (DL) community [, , , , , , , ]. For as long as Artificial Neural Networks (ANNs) have existed, research has been done on how to interpret and explain the decision process of the ANN by developing explanation methods []. The objective of explanation methods is to make specific aspects of the DNN information representation and information flow interpretable by humans.
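The scale of the first complication is easy to underestimate. As a minimal illustration (a hypothetical fully connected network, not any particular model discussed in this chapter), the number of learnable parameters can be counted directly from the layer sizes:

```python
# Parameter count of a fully connected network: a layer with n_in inputs
# and n_out units has n_in * n_out weights plus n_out biases.
def count_parameters(layer_sizes):
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# A modest image classifier: 224x224 RGB input, three hidden layers, 1000 classes.
# (Hypothetical layer sizes, chosen only for illustration.)
sizes = [224 * 224 * 3, 4096, 4096, 1024, 1000]
print(count_parameters(sizes))  # well over 600 million parameters
```

Even this modest sketch yields hundreds of millions of parameters, far more than a human could ever inspect directly, which is precisely why dedicated explanation methods are needed.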
2 Users and their concerns
Various types of DNN users can be distinguished. Users entertain certain values; these include ethical values such as fairness, neutrality, lawfulness, autonomy, privacy or safety, or functional values such as accuracy, usability, speed or predictability. Out of these values certain concerns regarding DNNs may arise, e.g. apprehensions about discrimination or accuracy. These concerns get translated into questions about the system, e.g. did the factor “race” influence the outcome of the system, or how reliable was the data used? In this section we identify at least two general types of users: the expert users and the lay users, and in total six specific kinds of users. Note that there could be (and there regularly is) overlap between the users described below, such that a particular user can be classified as belonging to more than one of the categories.
Expert users are the system builders and/or modifiers and often come in direct contact with the source code. Two types of experts can be identified:
DNN engineers are generally researchers involved in extending the field and have detailed knowledge about the mathematical theories and principles of DNNs. DNN engineers are interested in explanations of a functional nature e.g. the effects of various hyperparameters on the performance of the network or methods that can be used for model debugging.
DNN developers are generally application builders who make software solutions that can be used by lay people. DNN developers often make use of off-the-shelf DNNs, often re-training the DNN along with tuning certain hyperparameters and integrating them with various software components resulting in a functional application. The DNN developer is concerned with the goals of the overall application and assesses whether they have been met by the DNN solution. DNN developers are interested in explanation methods that allow them to understand the behavior of the DNN in the various use contexts of the integrated software application.
Lay users do not and need not have knowledge of how the DNN was implemented and the underlying mathematical principles, nor knowledge of how the DNN was integrated with other software components resulting in a final functional application. At least four lay users are identified:
The owner of the software application in which the DNN is embedded. The owner is usually an entity that acquires the application for possible commercial, practical or personal use. For example, an owner can be an organization (e.g. a hospital or a car manufacturer) that purchases the application for end users (e.g. employees (doctors) or clients (car buyers)), but the owner can also be a consumer that purchases the application for personal use. In the latter case the category of owner fully overlaps with the next category of users, the end users. The owner is concerned with explainability questions about the capabilities of the application, e.g. justification of a prediction given the input data, and aspects of accountability, e.g. to what extent can application malfunction be attributed to the DNN component?
The end user is the person for whom the application is intended. The end user uses the application as part of their profession or for personal use. The end user is concerned with explainability about the capabilities of the application, e.g. justification of a prediction given the input data, and explainability regarding the behavior of the application, e.g. why does the application not do what it was advertised to do?
The data subject is the entity whose information is being processed by the application or the entity which is directly affected by the application outcome. An outcome is the output of the application in the context of the use case. Sometimes the data subject is the same entity as the end user, for example in the case that the application is meant for personal use. The data subject is mostly concerned with the ethical and moral aspects that result from the actionable outcomes. An actionable outcome is an outcome that has consequences or an outcome on which important decisions are based.
Stakeholders are people or organizations without a direct connection to either the development, use or outcome of the application and who can reasonably claim an interest in the process, for instance when its use runs counter to particular values they protect. Governmental and non-governmental organizations may put forward legitimate information requests regarding the operations and consequences of DNNs. Stakeholders are often interested in the ethical and legal concerns raised in any phase of the process.
2.1 Case: self-driving car
In this section the different users are presented in the context of a self-driving car.
The DNN engineer creates a DL solution to the problem of object segmentation and object classification by experimenting with various types of networks. Given raw video input the DL solution gives the output of the type of object and the location of the object in the video.
The DNN developer creates a planning system which integrates the output of the DL solution with other components in the system. The planning system decides which actions the car will take.
The owner acquires the planning system and produces a car in which the planning system is operational.
The end user purchases the car and uses the car to go from point A to point B.
The data subjects are all the entities from which information is captured along the route from point A to point B: pedestrians, private property such as houses, other cars.
The stakeholders are governmental institutions which formulate laws regulating the use of autonomous vehicles, or insurance companies that have to assess risk levels and their consequences.
3 Laws and regulations
An important initiative within the European Union is the General Data Protection Regulation (GDPR).
The GDPR focuses in part on profiling: “any form of automated processing of personal data consisting of the use of personal data to evaluate certain personal aspects relating to a natural person, in particular to analyse or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behaviour, location or movements” (Article 4, Definitions, Paragraph 4). According to articles 13, 14 and 15, when personal data is collected from a data subject for automated decision-making, the data subject has the right to access, and the data controller is obliged to provide, “meaningful information about the logic involved.” Article 12 stipulates that the provision of information to data subjects should be in “concise, transparent, intelligible and easily accessible form, using clear and plain language.”
The right to meaningful information translates into the demand that actionable outcomes of DNNs need to be explained, i.e. be made transparent, interpretable or comprehensible to humans. Transparency refers to the extent to which an explanation makes a particular outcome understandable to a particular (group of) users. Understanding, in this context, amounts to a person grasping how a particular outcome was reached by the DNN. Note that this need not imply agreeing with the conclusion, i.e. accepting the outcome as valid or justified. In general, transparency may be considered as recommendable, leading to e.g. a greater (societal) sense of control and acceptance of ML applications. Transparency is normally also a precondition for accountability: i.e. the extent to which the responsibility for the actionable outcome can be attributed to legally (or morally) relevant agents (governments, companies, expert or lay users, etc.). However, transparency may also have negative consequences, e.g. regarding privacy or by creating possibilities for manipulation (of data, processing or training).
In relation to the (perceived) need for explanation, two reasons for investigation stand out in particular. First, a DNN may appear to dysfunction, i.e. fail to operate as intended, e.g. through bugs in the code (process malfunction). Second, it may misfunction, e.g. by producing unintended or undesired (side-)effects [, ] that are deemed to be societally or ethically unacceptable (outcome malfunction). Related to dysfunction is a first category of explanations. This category is based on the information necessary in order to understand the system’s basic processes, e.g. to assess whether it is functioning properly, as intended, or whether it dysfunctions (e.g. suboptimal or erroneous results). This type of explanation is normally required by DNN developers and expert users. The information is used to interpret, predict, monitor, diagnose, improve, debug or repair the functioning of a system [].
Once an application is made available to non-expert users, normally certain guarantees regarding the system’s proper functioning are in place. Generally speaking, owners, end users, data subjects and stakeholders are more interested in a second category of explanations, where suspicions about a DNN’s misfunctioning (undesired outcomes) lead to requests for “local explanations”. Users may request information about how a particular outcome was reached by the DNN: which aspects of the input data, which learning factors or other parameters of the system influenced its decision or prediction. This information is then used to assess the appropriateness of the outcome in relation to the concerns and values of users [, , , ]. The aim of local explanations is to strengthen the confidence and trust of users that the system is not (or will not be) conflicting with their values, i.e. that it does not violate fairness or neutrality. Note that this implies that the offered explanations should match (within certain limits) the particular user’s capacity for understanding [], as indicated by the GDPR.
5 Explanation methods
So far the users, the GDPR, and the role of explanations have been discussed. To bridge the gap from that area to the more technical area of explanation methods, we need to be able to evaluate the capabilities of existing methods in the context of the users and their needs. We bridge the gap in three ways. First, we identify, on a high level, desirable properties of explanation methods. These properties are based on the desired characteristics for explainers by []: interpretable, local fidelity, model-agnostic and global perspective. We break down interpretable into two aspects: clarity and parsimony. Local fidelity is captured in the overall term fidelity. Model-agnostic is captured in the term generalizability. It is not clear what [] meant by global perspective. Second, we introduce a taxonomy to categorize all types of explanation methods, and third, we assess the presence of the desirable properties in the categories of our taxonomy.
5.1 Desirable properties
High Fidelity The degree to which the interpretation method resembles the input-output mapping of the DNN. This term appears in [, , , , , , ]. Fidelity is arguably the most important property that an explanation model should possess. If an explanation method is not faithful to the original model then it cannot give valid explanations because the input-output mapping is incorrect. In general, local methods are more faithful than global methods.
High Parsimony This refers to the complexity of the resulting explanation. An explanation that is parsimonious is a simple explanation. This concept is generally related to Occam’s razor and in the case of explaining DNNs the principle is also of importance. The degree of parsimony can in part be dependent on the user’s capabilities.
High Generalizability The range of architectures to which the explanation method can be applied. This increases the usefulness of the explanation method. Methods that are model-agnostic are the highest in generalizability.
High Explanatory Power In this context this means how many phenomena the method can explain. This roughly translates to how many types of questions the method can answer. Previously in Section 2 we have identified a number of questions that users may have.
5.2 Introducing a taxonomy for explanation methods
Over a relatively short period of time a plethora of explanation methods and strategies have come into existence, driven by the need of expert users to analyze and debug their DNNs. However, apart from a non-exhaustive overview of existing methods [] and classification schemes for purely visual methods [, , , ], little is known about efforts to rigorously categorize the whole set of explanation methods and to find the underlying patterns that guide explanation methods. In this section an attempt at a coherent taxonomy for explanation methods is proposed. Three main classes of explanation methods are identified and their features described. The taxonomy was derived by analyzing the historical and contemporary trends surrounding the topic of interpretation of DNNs and explainable AI. We realize that we cannot foresee the future developments of DNNs and their explainability methods. As such it is possible that in the future the taxonomy needs to be extended with more classes. We propose the following taxonomy:
- Rule-extraction methods
Extract rules that approximate the decision-making process in a DNN by utilizing the input and output of the DNN.
- Attribution methods
Measure the importance of a component by making changes to the input or internal components and recording how much the changes affect the model performance. Methods known by other names that fall in this category are occlusion, perturbation, erasure, ablation and influence. Attribution methods are often visualized and are sometimes referred to as visualization methods.
- Intrinsic methods
Aim to improve the interpretability of internal representations with methods that are part of the DNN architecture. Intrinsic methods increase fidelity, clarity and parsimony in attribution methods.
In the following subsections we will describe the main features of each class and give examples from current research.
5.3 Rule-extraction methods
Rule-extraction methods extract human-interpretable rules that approximate the decision-making process in a DNN. Older genetic-algorithm-based rule extraction methods for ANNs (not DNNs) can be found in [, , ]. [] specify three categories of rule extraction methods:
- Decompositional approach
Decompositional refers to the taking apart of the network. In other words, this means to break down the network into smaller individual parts. For the decompositional approach, the architecture of the network and/or its outputs are used in the process. [] uses a decompositional algorithm that extracts rules for each layer in the DNN. These rules are merged together in a final merging step to produce a set of rules that describe the network behaviour by means of its inputs. [] succeeded in extracting rules from an LSTM by applying a decompositional approach.
- Pedagogical approach
Introduced by [] and named by [] the pedagogical approach involves “viewing rule extraction as a learning task where the target concept is the function computed by the network and the input features are simply the network’s input features.” []. The pedagogical approach has the advantage that it is inherently model-agnostic. Recent examples are found in [, ].
- Eclectic approach
According to [] “membership in this category is assigned to techniques which utilize knowledge about the internal architecture and/or weight vectors in the trained artificial neural network to complement a symbolic learning algorithm.”
In terms of fidelity, local explanations are more faithful than global explanations. For rule extraction this means that rules that govern the result of a specific input, or a neighborhood of inputs, are more faithful than rules that govern all possible inputs. Rule extraction is arguably the most interpretable category of methods in our taxonomy, considering that the resulting set of rules can be unambiguously interpreted by a human being as a kind of formal language. Therefore we can say that it has a high degree of clarity. In terms of parsimony we can say that if the ruleset is “small enough” the parsimony is higher than when the ruleset is “too large”. What determines “small enough” and “too large” is difficult to quantify formally and is also dependent on the user: expert vs. lay. In terms of generalizability it can go both ways: if a decompositional approach is used it is likely that the method is not generalizable, while if a pedagogical approach is used the method is highly generalizable. In terms of explanatory power, rule-extraction methods can 1) validate whether the network is working as expected in terms of overall logic flow, and 2) explain which aspects of the input data led to the specific output.
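To make the pedagogical approach concrete, the following sketch treats a model purely as an input-to-label function, labels a grid of sample points with it, and searches for the single-threshold rule that best mimics its behavior. The `black_box` function is a hypothetical stand-in for a trained DNN; real rule-extraction methods search far richer rule languages, but the query-and-imitate structure is the same:

```python
import itertools

def black_box(x1, x2):
    # Hypothetical stand-in for a trained network's decision function.
    return 1 if 0.8 * x1 + 0.3 * x2 > 0.5 else 0

def extract_rule(model, samples):
    # Pedagogical approach: only the model's inputs and outputs are used.
    labeled = [((x1, x2), model(x1, x2)) for x1, x2 in samples]
    best = None
    for feat in (0, 1):                  # candidate feature to threshold on
        for point, _ in labeled:         # candidate threshold values
            thr = point[feat]
            acc = sum((p[feat] > thr) == bool(y) for p, y in labeled) / len(labeled)
            if best is None or acc > best[0]:
                best = (acc, feat, thr)
    acc, feat, thr = best
    rule = f"IF x{feat + 1} > {thr:.2f} THEN class 1 ELSE class 0"
    return rule, acc                     # acc is the rule's fidelity to the model

grid = [(i / 10, j / 10) for i, j in itertools.product(range(11), repeat=2)]
rule, fidelity = extract_rule(black_box, grid)
print(rule)
print(f"fidelity on the sample: {fidelity:.2f}")
```

A small ruleset like this scores high on clarity and parsimony, but the fidelity score makes the trade-off discussed above visible even in this toy setting: the simple rule only approximates the model.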
5.4 Attribution methods
Attribution, a term introduced by [], also referred to as relevance [, , , ], contribution [], class saliency [] or influence [, , ], aims to reveal components of high importance in the input to the DNN and their effect as the input is propagated through the network. Because of this property we can assign the following methods to the attribution category: occlusion [], erasure [], perturbation [], adversarial examples [] and prediction difference analysis []. Other methods that belong to this category: [, , ]. It is worth mentioning that attribution methods apply not only to image input but also to other forms of input, such as text processing by LSTMs []. The definition of attribution methods in this chapter is similar to that of saliency methods [], but more general than the definition of attribution methods in [] and akin to the definition in [].
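A minimal occlusion-style attribution can be sketched as follows: each input element is replaced with a baseline value in turn, and the drop in the model's output score is recorded as that element's importance. The linear `model_score` is a hypothetical stand-in; a real DNN would be probed through exactly the same input-output interface:

```python
def model_score(x):
    # Hypothetical stand-in for a DNN's output score for input x.
    weights = [0.1, 0.05, 0.9, 0.02, 0.7]  # the "model" relies mostly on x[2] and x[4]
    return sum(w * v for w, v in zip(weights, x))

def occlusion_attribution(model, x, baseline=0.0):
    reference = model(x)
    attributions = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline             # occlude one input element
        attributions.append(reference - model(occluded))
    return attributions

attr = occlusion_attribution(model_score, [1.0, 1.0, 1.0, 1.0, 1.0])
print([round(a, 2) for a in attr])         # -> [0.1, 0.05, 0.9, 0.02, 0.7]
```

The large drops mark the inputs the model depends on; for image models the same loop slides an occluding patch over pixel regions instead of single elements.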
Visualization of attribution
The majority of explanation methods for DNNs visualize the information obtained by attribution methods. Visualization methods were popularized by [, , ] in recent years and are concerned with how the important features are visualized. [] identifies that current methods focus on three aspects of visualization: feature visualization, relationship visualization and process visualization. Overall, visualization methods are intuitive ways to gain a variety of insights into a DNN's decision process on many levels, including architecture assessment, model quality assessment and even user feedback integration; e.g. [] creates intuitive visualization interfaces for image processing DNNs.
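Even without graphics, the idea behind attribution visualization can be sketched: map each attribution score to a bar whose length reflects its magnitude, so the influential inputs stand out at a glance. The labels and scores below are made up purely for illustration:

```python
def render_attributions(labels, scores, width=20):
    # Scale bars relative to the largest absolute attribution.
    peak = max(abs(s) for s in scores) or 1.0
    lines = []
    for label, score in zip(labels, scores):
        bar = "#" * round(width * abs(score) / peak)
        lines.append(f"{label:>8} | {bar} {score:+.2f}")
    return "\n".join(lines)

print(render_attributions(["pixel_0", "pixel_1", "pixel_2"], [0.12, -0.03, 0.91]))
```

Real visualization methods overlay such scores on the input itself, e.g. as heatmaps over images, which is what makes them intuitive, but also, as discussed below, open to multiple interpretations.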
[] has shown recently that attribution methods “lack reliability when the explanation is sensitive to factors that do not contribute to the model prediction”. Furthermore, they introduce the notion of input invariance as a prerequisite for accurate attribution. In other words, if the attribution method does not satisfy input invariance, we can consider it to have low fidelity. In terms of clarity, there is a degree of ambiguity that is inherent to these methods, because visual explanations can be interpreted in multiple ways by different users, even by users in the same user category. In contrast to the precise results of rule-extraction methods, the information that results from attribution methods has less structure. In addition, the degree of clarity is dependent on the degree of fidelity of the method: low fidelity can cause incorrect attribution, resulting in noisy output with distracting attributions that increase ambiguity. The degree of parsimony depends on the method of visualization itself. Methods that visualize only the significant attributions exhibit a higher degree of parsimony. The degree of generalizability depends on which components are used to determine attribution. Methods that only use the input and output are inherently model-agnostic, resulting in the highest degree of generalizability. Following this logic, methods that make use of internal components are generalizable to the degree that other models share these components. For example, deconvolutional networks [] can be applied to models that make use of convolutions to extract features from input images. In terms of explanatory power, this class of methods can show intuitively, through visual explanations, which factors in the input had a significant impact on the output of the DNN. However, these methods do not explain why a particular factor is important.
5.5 Intrinsic methods
The previous categories are designed to make explainable some aspects of a DNN in a process separate from training the DNN. In contrast, this category aims to improve the interpretability of internal representations with methods that are part of the DNN architecture, e.g. as part of the loss function [, ], modules that add additional capabilities [, ], or as part of the architecture structure, in terms of operations between layers [, , , ]. [] provides an interpretive loss function to increase the visual fidelity of the learned features. More importantly, [] shows that by training DNNs with adversarial data and a consistent loss, we can trace back errors made by the DNN to individual neurons and identify whether the data was adversarial. [] gives a DNN the ability to answer relational reasoning questions about a specific environment, by introducing a relational reasoning module that learns a relational function, which can be applied to any DNN. [] builds on work by [] and introduces a recurrent relational network which can take the temporal component into account. [] introduces an explicit structure to DNNs for visual recognition by building an AND-OR grammar directly into the network structure. This leads to better interpretation of the information flow in the network, hence increased parsimony in attribution methods. [] make use of generative neural networks to perform causal inference and [] use generative neural networks to learn functional causal models. Intrinsic methods do not explicitly explain anything by themselves; instead they increase fidelity, clarity and parsimony in attribution methods. This class of methods is different from attribution methods because it tries to make the DNN inherently more interpretable by changing the architecture of the DNN, whereas attribution methods use what is already there and only transform aspects of the representation into something meaningful after the network is trained.
Because of the nature of this category of methods, the desirable properties cannot be attributed to them.
6 Addressing general concerns
As indicated in Figure 1, users have certain values that, in relation to a particular technology, may lead to concerns, which in relation to particular applications can lead to specific questions. [, ] distinguish various concerns that users may have. However, the types of concerns they discuss focus to a large extent on the inconclusiveness, inscrutability or misguidedness of used evidence. That is, they concern to a significant extent the reliability and accessibility of used data (data mining, properly speaking). Although such concerns are important in relation to DNNs as well, we will not discuss them here. Instead we will focus on the types of concerns that can reasonably be thought to be especially salient in relation to DNNs. In addition to apprehensions about data, there are concerns that involve aspects of the processing itself, e.g. the inferential validity of an algorithm. Also, questions may be raised about the validity of a training process (e.g. requiring information about how exactly a DNN is trained). The following concerns and questions are taken from [] and are adapted to fit the context of the DNN process. These concerns and questions are not specific to one particular user category. Given the broad application context of the DNN process and the available explanation methods, we analyze whether each concern is justified and whether the accompanying questions could be answered. Note that the list of concerns and questions is incomplete and serves as an example of the possible concerns and questions that can exist.
- Inconclusive process:
statistical inferences produce inconclusive knowledge: they can identify significant correlations, but rarely causal relations. Users may feel that more certainty regarding an outcome is required than the inconclusive process allows. This concern is justified considering that it is often difficult to say with certainty that the relationship between various variables is causal, at least for DNNs. DNNs learn correlations between input and output by discovering meaningful features. Note that what is meaningful to the DNN may not be meaningful to humans, and vice versa. Explanation methods do not resolve this concern because they do not reveal causal relations either. As stated by [], explanation methods may make DNNs more comprehensible. However, this does not resolve the above stated concern.
- Inscrutable data process:
data used, or aspects of the data's scope, origin and quality may be unknown to the user. In addition, the exact use of the data by the DNN may be opaque. Users may worry about what (part of the) data exactly has led to the outcome. The first concern is justified since oftentimes the data and the details of the data used are unknown to the user. The second concern is resolved in certain cases but remains justified in others. For DNNs that process image data there exist methods that help us visualize the features that the DNN learns, see Section 5.2. With visualizations of the learned features we can reason about how the DNN came to certain conclusions. For sentiment analysis on bodies of text, explanation methods can also highlight which part of the text led to certain predictions. In both of these cases, the result of the explanation method is intuitive for human interpretation. In other cases, primarily regarding the use of data that is not easily visualized, it is more difficult to understand which features are being learned. It is not a lost cause, however, since attribution methods are adept at pointing out the various flows of influence going from input to outcome. The drawback is that the results of such explanation methods are often difficult to interpret, especially for lay users, because interpretation requires knowledge that lay users most likely do not possess.
- Misguided process:
the data collection process itself may have been biased in certain ways, affecting the conclusions based on them (“garbage in, garbage out”). Users can be apprehensive about the data acquisition process. This concern is justified because the process of collecting and annotating data inherently imposes bias on the data in one way or another. Since DNNs are trained on datasets, they cannot escape this bias. By bias we do not mean bias in the statistical sense, but rather that particular outcomes are supported as an inherent property of the process: the data does not represent all situations in which the DNN can be applied, but is instead a small and unrepresentative sample of situations that can occur in real life. This can lead to the trained DNN not being able to generalize properly.
- Unfair outcomes:
users may feel that the outcome of the DNN is somehow unfair in relation to the particular values they hold, e.g. violating fairness or privacy. This concern is justified when DNNs are not used responsibly, in the sense that one or more of the users misuses the DNN. At a low level this can happen when experts fail to provide important information about the DNN, e.g. how and on what kind of data the DNN was trained, and under which circumstances it is meant to be used. At a higher level it happens when lay users apply the DNN to an inappropriate task. This concern arises out of miscommunication and a lack of knowledge, not out of the need for an explanation of some kind.
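The attribution idea mentioned above can be illustrated with a minimal, hypothetical sketch: occlusion-style attribution replaces each input feature with a baseline value and measures how much the model's output changes, thereby tracing the influence of input on outcome. The toy linear `model` below is an assumption standing in for a trained DNN; real attribution methods (perturbation- or gradient-based) operate on the actual network.

```python
def model(x):
    """Hypothetical toy scorer standing in for a trained DNN's output."""
    weights = [0.1, 0.9, -0.3, 0.5]
    return sum(w * v for w, v in zip(weights, x))

def occlusion_attribution(model, x, baseline=0.0):
    """Attribute the prediction to each input feature by replacing it
    with a baseline value and measuring how much the output drops."""
    original = model(x)
    return [original - model(x[:i] + [baseline] + x[i + 1:])
            for i in range(len(x))]

x = [1.0, 1.0, 1.0, 1.0]
scores = occlusion_attribution(model, x)
# Feature 1 (weight 0.9) receives the largest attribution score.
```

For a lay user, the raw scores would still need to be rendered as, e.g., a heat map over the input, which is exactly the kind of interpretation gap discussed above.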
In this chapter we have tried to analyze the question “What can be explained?” given the users and their needs, laws and regulations, and existing explanation methods. Explicitly, we looked at the capabilities of explanation methods and analyzed which questions and concerns about explainability in the DNN context these methods address. Overall, it is clear that (visual) explanations can be given about various aspects of the influence of the input on the output (e.g. given the input data, which aspects of the data lead to the output?), by making use of both rule-extraction and attribution methods. Also, when used in combination with attribution methods, intrinsic methods lead to more explainable DNNs. It is likely that in the future we will see the rise of a new category of explanation methods, the combination methods, defined as combinations of rule-extraction, attribution and intrinsic methods, which answer specific questions in simple, human-interpretable language. Furthermore, it is clear that current explanation methods are tailored to expert users, since interpreting the results requires knowledge of the DNN process. As far as we are aware, explanation methods for lay users, e.g. intuitive explanation interfaces, do not exist. Ideally, such explanation methods should be able to answer, in simple human language, questions about every operation that the application performs. This is not an easy task, since the number of conceivable questions one could ask about the working of an application is incredibly large. Two particular concerns are difficult to address with explanation methods: the concern about bias in datasets that leads to biased DNNs, and the suspicion of unfair outcomes. Can we indicate that the DNN is biased, and if so, can we remove the bias? Has the DNN been applied responsibly? These are not problems that are directly solvable with explanation methods.
However, explanation methods alleviate the first problem to the extent that learned features can be visualized (using attribution methods) and then analyzed for bias using methods that are not themselves explanation methods. For the second problem, more general measures, such as laws and regulations, have been or still need to be developed.
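As a rough illustration of the rule-extraction idea referred to above, a surrogate rule can be fitted to a black-box model by querying it on sampled inputs and selecting the simple rule that best reproduces its decisions. The `black_box` function below is a hypothetical stand-in for a DNN, and the single-threshold rule is the simplest possible surrogate; real rule-extraction methods recover much richer rule sets.

```python
def black_box(x):
    """Hypothetical opaque model, standing in for a trained DNN."""
    return 1 if x > 3.7 else 0

def extract_threshold_rule(model, samples):
    """Query the black box on the samples and pick the threshold rule
    'x > t' that best reproduces its decisions (a one-rule surrogate)."""
    labels = [model(x) for x in samples]
    best_t, best_acc = None, -1.0
    for t in samples:  # candidate thresholds drawn from the sampled inputs
        acc = sum((x > t) == bool(y)
                  for x, y in zip(samples, labels)) / len(samples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

samples = [i / 10 for i in range(100)]  # inputs 0.0 .. 9.9
threshold, accuracy = extract_threshold_rule(black_box, samples)
# The surrogate recovers a rule equivalent to the hidden "x > 3.7".
```

A rule of this form ("the outcome flips when x exceeds 3.7") is exactly the kind of statement that could be presented to a lay user in plain language, which is why combining rule extraction with attribution is a promising direction.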
- This article will appear as a chapter in Explainable and Interpretable Models in Computer Vision and Machine Learning, a Springer series on Challenges in Machine Learning