Explainable Machine Learning in Deployment
Explainable machine learning seeks to provide various stakeholders with insights into model behavior via feature importance scores, counterfactual explanations, and influential samples, among other techniques. Recent advances in this line of work, however, have gone without surveys of how organizations are using these techniques in practice. This study explores how organizations view and use explainability for stakeholder consumption. We find that the majority of deployments are not for end users affected by the model but for machine learning engineers, who use explainability to debug the model itself. There is a gap between explainability in practice and the goal of public transparency, since explanations primarily serve internal stakeholders rather than external ones. Our study synthesizes the limitations of current explainability techniques that hamper their use for end users. To facilitate end-user interaction, we develop a framework for establishing clear goals for explainability, including a focus on normative desiderata.
Machine learning (ML) systems are being increasingly embedded into many aspects of daily life, such as healthcare (de2018clinically), finance (heaton2016deep), and social media (alvarado2018towards). In an effort to design ML systems worthy of human trust, research has proposed a variety of techniques for explaining ML models to stakeholders. Deemed “explainability,” this body of previous work attempts to illuminate the reasoning used by ML models. “Explainability” loosely refers to any technique that helps the user of an ML model understand why the model behaves the way it does. Explanations can come in many forms: from telling patients which symptoms were indicative of a particular diagnosis (lundberg2018explainable) to helping factory workers analyze inefficiencies in a production pipeline (dhurandhar2018improving).
Explainability has been touted as a way to enhance transparency of ML systems, particularly for end users. Often releasing (or forcing organizations to release) the data that models were trained on or the accompanying code is challenging due to user privacy issues and incentives to preserve trade secrecy. Moreover, end users are generally not equipped to be able to understand how raw data and code translate into benefits or harms that might affect them individually. By providing an explanation for how the model made a decision, explainability techniques seek to provide transparency directly targeted to human users, often with the goal of improving user trust. The importance of explainability as a concept has been reflected in legal and ethical guidelines for data and ML. In cases of automated decision-making, Articles 13-15 of the European General Data Protection Regulation (GDPR) require that data subjects have access to “meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject” (gdpr). In addition, technology companies have released artificial intelligence (AI) principles that include transparency as a core value, including notions of explainability, interpretability, or intelligibility (ibm2019; msft2019).
This growing interest in “peering under the hood” of ML models and being able to provide explanations to human users has made explainability an important subfield of ML. Despite this growing literature, there has been little work characterizing how explanations have been deployed by organizations in the real world.
In this paper, we attempt to understand how organizations have deployed local explainability techniques so that we can observe which techniques work best in deployment, report on shortcomings of particular techniques, and better guide future research. We focus specifically on local explainability techniques. These techniques explain individual predictions, which makes them more relevant for providing transparency for end users.
Our interview study synthesizes interviews of roughly fifty people from approximately thirty organizations to understand which explainability techniques are used and how. We report trends from two sets of interviews and provide suggestions for future research directions that combine explainability with privacy, fairness, and causality. To the best of our knowledge, we are the first to conduct a study of how explainability techniques are used by organizations that use ML models in their workflows.
For the sake of clarity, we define various terms based on the context in which they appear in the forthcoming prose.
Predictor refers to a trained ML model.
Explainability refers to attempts to provide insights into the predictor’s behavior.
Stakeholders are the people who either want the model to be “explainable”, will consume the explanation, or are affected by the model itself.
Practice refers to the real-world context in which the predictor has been deployed.
Local Explainability aims to explain the predictor’s behavior at a specific input.
Global Explainability attempts to understand the high-level concepts and reasoning used by a predictor.
The rest of this paper is organized as follows:
We discuss the methodology of our survey, describing the interviews and introducing notation in Section 2.
We summarize our overall findings in Section 3.
We detail how local explainability techniques are used at various organizations and discuss technique-specific takeaways in Section 4.
We develop a framework for establishing clear goals when deploying local explainability in Section 5.1.
We discuss ethical desiderata for explainability in Section 5.2.
We conclude in Section 6.
2.1. Interview Structure
In the spirit of holstein2019improving, we study how industry practitioners look at and deploy explainable ML. Specifically, we study how particular organizations deploy explainability algorithms, including who consumes the explanation and how it is evaluated for the intended stakeholder. We conduct two sets of interviews: Group 1 looked at how data scientists who are not currently using explainable machine learning hope to leverage various explainability tools, while Group 2, the crux of this paper, looked at how explainable machine learning has been deployed in practice.
For Group 1, Fiddler Labs led a set of around twenty interviews to assess explainability needs across various organizations in technology and financial services. We specifically focused on teams that do not currently employ explainability technology. These semi-structured, hour-long interviews included the following questions:
What are your ML use cases?
What is your current model development workflow?
What explainability tools do you use?
What are your pain points in deploying ML models?
Do you think explainability will help address those points?
Group 2 spanned roughly thirty people across approximately twenty different organizations, both for-profit and non-profit. Most of these organizations are members of the Partnership on AI, which is a global multistakeholder non-profit established to study and formulate best practices on AI technologies for the benefit of society. With each individual, we held a thirty-minute to two-hour semi-structured interview to understand the state of explainability in their organization, their motivation for using explanations, and the benefits and shortcomings of the methods used. Some organizations asked to stay anonymous, not to be referred to explicitly in the prose, or not to be included in the acknowledgements.
Of the people we spoke with in Group 2, around one-third represented non-profit organizations (academics, civil society organizations, and think tanks), while the rest worked for for-profit organizations (corporations, industrial research labs, and start-ups). Broken down by organization, around half were for-profit and half were academic / non-profit. Around one-third of the interviewees were executives at their organization, around half were research scientists or engineers, and the remainder comprised professors at academic institutions, who commented on the consulting they have done with industry leaders to commercialize their research. The questions we asked Group 2 included, but were not limited to, the following:
Does your organization use ML model explanations?
What type of explanations have you used (e.g., feature-based, sample-based, counterfactual, or natural language)?
Who is the audience for the model explanation (e.g., research scientists, product managers, domain experts, or users)?
In what context have you deployed the explanations (e.g., informing the development process, informing human decision-makers about the model, or informing the end user on how actions were taken based on the model’s output)?
How does your organization decide when and where to use model explanations?
2.2. Technical Details
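Let $\mathcal{F}$ be a family of black box predictors. Let $\mathcal{X}$ and $\mathcal{Y}$ denote the input space and output space, respectively. A black box predictor $f \in \mathcal{F}$ maps an input $x \in \mathcal{X}$ to an output $f(x) \in \mathcal{Y}$, $f: \mathcal{X} \to \mathcal{Y}$. When we assume $f$ has a parametric form, we denote that parametric black box predictor as $f_\theta$ where $\theta \in \Theta$. We denote $\mathcal{D} = \{z_i\}_{i=1}^{n}$ to be a training dataset, where $z_i = (x_i, y_i)$ is an input-output pair, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$, and $X$ denotes all inputs $\{x_i\}_{i=1}^{n}$.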
Each organization we spoke with has deployed an ML model $f$. They hope to explain a data point $x$ using an explanation function $g$. Local explainability refers to an explanation for why $f$ predicted $f(x)$ for a fixed point $x$. The local explanation methods we discuss come in one of the following forms: Which feature $x_i$ of $x$ was most important for prediction with $f$? Which training datapoint $z$ was most important to $f(x)$? What is the minimal change to the input $x$ required to change the output $f(x)$?
In this paper, we deliberately decide to focus on the more popularly deployed local explainability techniques instead of global explainability techniques. Global explainability refers to techniques that attempt to explain the model as a whole. These techniques attempt to characterize the concepts learned by the model (kim2017interpretability), simpler models learned from the representation of complex models (dhurandhar2018improving), prototypical samples from a particular model output (kim2016MMD), or the topology of the data itself (dumouchel2002data). None of our interviewees reported deploying global explainability techniques, though some studied these techniques in research settings.
3. Summary of Findings
3.1. Explainability Needs
This subsection provides an overview of explainability needs that were uncovered with Group 1: data scientists from organizations that do not currently deploy explainability techniques. These scientists were asked to describe their top “pain points” in building and deploying ML models, and how they hope to use explainability.
Model performance debugging: Most data scientists struggle with debugging poor model performance. They wish to identify causes for why the model performs poorly on certain inputs, and also to identify regions of the input space with below average performance. In addition, they seek guidance on how to engineer new features, drop redundant features, and gather more data to improve model performance. For instance, one data scientist said: “If I have 60 features, maybe it’s equally effective if I just have 5 features.” Dealing with feature interactions was also a concern, as the data scientist continued, “Feature A will impact feature B, [since] feature A might negatively affect feature B - how do I attribute [importance in the presence of] correlations?” Others mentioned explainability as a debugging solution, helping to “narrow down where things are broken.”
Model monitoring: Several data scientists worry about drift in the feature and prediction distributions after deployment. Ideally, they would like to be alerted when there is a significant drift relative to the training distribution (amodei2016concrete; pinto2019automatic). One organization would like explanations for how drift in feature distributions would impact model outcomes and feature contribution to the model: “We can compute how much each feature is drifting, but we want to cross-reference with which features are impacting the model a lot.”
Model transparency: Organizations that deploy models to make decisions that directly affect end users seek explanations for model predictions. The explanations are meant to increase model transparency and comply with current or forthcoming regulation. In general, data scientists believe that explanations can also help communicate predictions to a broader external audience of other business teams and customers. One company stressed the need to “show your work” to provide reasons on underwriting decisions to customers, and another company needed explanations to respond to customer complaints.
Model audit: In financial organizations, due to regulatory requirements, all deployed ML models must go through an internal audit. Data scientists building these models need to have them reviewed by internal risk and legal teams. One of the goals of the model audit is to conduct various kinds of tests provided by regulations like SR 11-7 (sr11-7). An effective model validation framework should include: (1) evaluation of conceptual soundness of the model, (2) ongoing monitoring, including benchmarking, and (3) outcomes analysis, including back-testing. Explainability is viewed as a tool for evaluating the soundness of the model on various data points. Financial institutions would like to conduct sensitivity analyses, checking the impact of small changes to inputs on model outputs. Unexpectedly large changes in outputs can indicate an unstable model.
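The sensitivity analysis mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a procedure taken from SR 11-7 or from any interviewed organization: the toy `score` function, the perturbation size `eps`, the flagging threshold, and the helper name `sensitivity` are all hypothetical.

```python
def sensitivity(predict, x, eps=0.01):
    """Return, per feature, the change in output when that feature is nudged by eps."""
    base = predict(x)
    deltas = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] += eps
        deltas.append(predict(perturbed) - base)
    return deltas


# Hypothetical scoring model over (income, debt), used only for illustration.
def score(x):
    income, debt = x
    return 0.5 * income - 2.0 * debt


changes = sensitivity(score, [3.0, 1.0], eps=0.1)
# Flag features whose output change exceeds the size of the perturbation itself
# (an arbitrary threshold chosen for this sketch).
flagged = [i for i, c in enumerate(changes) if abs(c) > 0.1]
```

Here only the second feature (debt) is flagged, since a nudge of 0.1 moves the output by about 0.2. In a real audit the threshold would presumably be set by the risk team rather than hard-coded.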
3.2. Explainability Usage
In Table 1, we aggregate some of the explainability use cases that we received from different organizations in Group 2. For each use case, we define the domain of use (i.e., the industry in which the model is deployed), the purpose of the model, the explainability technique used, the stakeholder consuming the explanation, and how the explanation is evaluated. Evaluation criteria denote how the organization compares the success of various explanation functions for the chosen technique (e.g., after selecting feature importance as the technique, an organization can compare LIME (ribeiro2016should) and SHAP (shap) explanations via the faithfulness criterion (yeh2019sensitive)).
In our study, feature importance was the most common explainability technique, and Shapley values were the most common type of feature importance explanation. The most common stakeholders were ML Engineers (or Research Scientists), followed by domain experts (Loan Officers and Content Moderators). Section 4 provides definitions for each technique and further details on how these techniques were used at Group 2 organizations.
|Domain|Model Purpose|Explainability Technique|Stakeholders|Evaluation Criteria|
|---|---|---|---|---|
|Finance|Loan Repayment|Feature Importance|Loan Officers|Completeness (shap)|
|Content Moderation|Malicious Reviews|Feature Importance|Content Moderators|Completeness (shap)|
|Finance|Cash Distribution|Feature Importance|ML Engineers|Sensitivity (yeh2019sensitive)|
|Facial Recognition|Smile Detection|Feature Importance|ML Engineers|Faithfulness (ancona2018towards)|
|Content Moderation|Sentiment Analysis|Feature Importance|QA ML Engineers|norm|
|Healthcare|Medicare access|Counterfactual Explanations|ML Engineers|normalized norm|
| |Object Detection|Adversarial Perturbation|QA ML Engineers|norm|
Most organizations in Group 2 deploy explainability atop their existing ML workflow for one of the following stakeholders:
Executives: These individuals deem explainability necessary to align with the company’s internal AI principles. One research scientist felt that “explainability was strongly advised and marketed by higher-ups,” though sometimes explainability simply became a checkbox.
ML Engineers: These individuals (including data scientists and researchers) train ML models at their organization and use explainability techniques to understand how the trained model works: do the most important features, most similar samples, and nearest training point(s) in the opposite class make sense? Using explainability to debug what the model has learned, this group of individuals were the most common explanation consumers in our study.
End Users: This is the most intuitive consumer of an explanation. The person consuming the output of an ML model or making a decision based on the model output is the end user. Explainability shows the end user why the model behaved the way it did, which is important for showing the model is trustworthy and also providing greater transparency.
Other Stakeholders: There are many other possible stakeholders for explainability. One such group is regulators, who may mandate that certain algorithmic decision-making systems provide explanations either for affected populations or for their own regulatory activities. It is important that this group understands how explanations are deployed based on existing research, what techniques are feasible, and how the techniques can align with the desired explanation from a model. Another group is domain experts, who are often tasked with auditing the model’s behavior and ensuring it aligns with expert intuition. For many organizations, minimizing the divergence between the expert’s intuition and the explanation used by the model is key to successfully implementing explainability.
We found that local explainability techniques are overwhelmingly consumed by ML engineers and data scientists to audit models before deployment rather than to provide explanations to end users. Our interviews reveal factors that prevent organizations from showing explanations to end users or those affected by decisions made from ML model outputs.
3.4. Beyond Deep Learning
Though deep learning has gained popularity in recent years, many organizations in Group 2 still use classical ML techniques (e.g., logistic regression, support vector machines, and GP regression), likely due to a need for simpler, interpretable models (rudin2019stop).
A subset of the explainability community has focused on interpreting black-box deep learning models, even though practitioners overwhelmingly feel that there is a dearth of model-specific techniques to understand traditional ML models. For example, one research scientist noted that, “Many [financial institutions] use kernel-based methods on tabular data.” As a result, there is a desire to translate explainability techniques for kernel support vector machines for genomics (shrikumar2018gkmexplain) to models trained on tabular data.
Model-agnostic techniques like SHAP (shap) can be used for traditional models, but are “likely overkill” for explaining kernel-based ML models, according to one research scientist, since model-agnostic methods can be computationally expensive and lead to poorly approximated explanations.
3.5. Key Takeaways
This subsection summarizes some key takeaways from Group 2 that shed light on the reasons for the limited deployment of explainability techniques and their use primarily as sanity checks for ML engineers. Organizations generally still consider the judgments of domain experts to be the implicit ground truth for explanations. Since explanations produced by current techniques often deviate from the understanding of domain experts, some organizations still use human experts to evaluate the explanation before it is presented to users. Part of this deviation stems from the potential for ML explanations to reflect spurious correlations, which result from models detecting patterns in the data that lack causal underpinnings. As a result, organizations find explainability techniques useful for helping their ML engineers identify inconsistencies between the model’s explanations and their intuition or that of domain experts, rather than for directly providing explanations to end users.
In addition, there are technical limitations that make it difficult for organizations to show end users explanations in real-time. The non-convexity of certain models makes certain explanations (e.g., providing the most influential datapoints) hard to compute quickly. Moreover, providing certain explanations can raise privacy concerns by running the risk of model inversion.
More broadly, organizations lack frameworks for deciding why they want an explanation, and current research fails to capture the objective of an explanation. For example, large gradients, representing the direction of maximal variation with respect to the output manifold, do not necessarily “explain” anything to end users. At best, gradient-based explanations provide an interpretation of how the model behaves upon an infinitesimal perturbation (not necessarily a feasible one (hooker2019please)), but do not “explain” whether the model captures the underlying causal mechanism in the data.
4. Deploying Local Explainability
In this section, we dive into how local explainability techniques are used at various organizations (Group 2). We start by defining each local explainability technique, then discuss organizations’ use cases, and finally report takeaways for the technique in question.
4.1. Feature Importance
Feature importance was by far the most popular technique we found across our study. It is used across domains (finance, healthcare, facial recognition, content moderation). Also known as feature-level interpretations, feature attributions, or saliency maps, it is the most widely used and most well-studied explainability technique (baehrens2010explain; gilpin2018explaining).
Feature importance defines an explanation functional $g$ that takes in a predictor $f$ and a point of interest $x \in \mathbb{R}^d$ and returns importance scores $g(f, x) \in \mathbb{R}^d$ for all features, where $g(f, x)_i$ (simplified to $g_i$ in context) is the importance of (or attribution for) feature $x_i$ of $x$.
These explanation functionals roughly fall into two categories: perturbation-based techniques (vstrumbelj2014explaining; ribeiro2016should; shap; chen2018shapley; meaningful_pert; dasp) and gradient-based techniques (smilkov2017smoothgrad; shrikumar2017learning; sundararajan2017axiomatic; ancona2018towards; dtd; second). Note that gradient-based techniques can be seen as a special case of a perturbation-based technique with an infinitesimal perturbation size. Heatmaps are also a type of feature-level explanation that denotes how important a region, or collection of features, is (meaningful_pert; adel2018excitation). A prominent class of perturbation-based methods is based on Shapley values from cooperative game theory (shapley52). Shapley values are a fair way to distribute the gains from a cooperative game to its players. In applying the method to explaining a model prediction, a cooperative game is defined between the features with the model prediction as the gain. The highlight of Shapley values is that they enjoy axiomatic uniqueness guarantees. Additional details about Shapley value explanations can be found in shap, sundararajan2017axiomatic, and aas2019explaining.
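To make the cooperative-game view concrete, the following is a minimal sketch of exact Shapley values for a tiny model. It assumes that a feature outside a coalition is replaced by a fixed baseline value, which is only one of several possible conventions. The function names, the toy `model`, and the all-zeros `baseline` are hypothetical and are not taken from any library or interviewed organization.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition are set to `baseline`."""
    n = len(x)

    def value(coalition):
        # Gain of a coalition: model output with only those features "present".
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical model with an interaction between the last two features.
def model(x):
    return 2.0 * x[0] + x[1] * x[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
```

For this toy model the attributions sum to `model(x) - model(baseline)`, and the interaction term is split equally between the two interacting features. The enumeration evaluates every coalition, so its cost grows exponentially with the number of features; practical implementations rely on sampling or model-specific shortcuts.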
4.1.2. Shapley Values in Practice
Organization A works with financial institutions and helps explain models for credit risk analysis. To integrate into the existing ML workflow of these institutions, Organization A proceeds as follows. They let data scientists train a model to the desired accuracy. Note that Organization A focuses mostly on models trained on tabular data, though they are beginning to venture into unstructured data (i.e., language and images). During model validation, risk analysts conduct stress tests before deploying the model to loan officers and other decision-makers. After decision-makers vet the model outputs as a sanity check and decide whether or not to override the model output, Organization A generates Shapley value explanations.
Before launching the model, risk analysts are asked to review the Shapley value explanations to ensure that the model exhibits expected behavior (i.e., the model uses the same features that a human would for the same task). Notably, the customer support team at these institutions can also use these explanations to tell individuals who inquire as to what went into the decision-making process for their loan approval or cash distribution decision. They are shown the percentage contribution to the model output (the positive norm of the Shapley value explanation along with the sign of contribution). This means that the explanation would be along the lines of, “55% of the decision was decided by your age, which positively correlated with the predicted outcome.”
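One plausible reading of the percentage contribution described above is to normalize the absolute Shapley values so they sum to 100 and to report each sign separately. The sketch below follows that reading. The normalization, the helper name `percentage_contributions`, and the example attributions are assumptions made for illustration, not Organization A's actual procedure.

```python
def percentage_contributions(names, shapley):
    """Map each feature to (percent of total absolute attribution, sign)."""
    total = sum(abs(v) for v in shapley)
    if total == 0:
        # No feature moved the prediction away from the baseline.
        return {name: (0.0, "+") for name in names}
    return {
        name: (100.0 * abs(v) / total, "+" if v >= 0 else "-")
        for name, v in zip(names, shapley)
    }


# Hypothetical attributions for one loan applicant.
report = percentage_contributions(["age", "income", "debt"], [1.1, 0.5, -0.4])
```

With these made-up numbers, age accounts for about 55% of the total absolute attribution with a positive sign, which mirrors the phrasing of the example explanation in the text.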
When comparing Shapley value explanations to other popular feature importance techniques, Organization A found that in practice LIME explanations (ribeiro2016should) give unexpected explanations that do not align with human intuition. Recent work (badLIME) shows that the fragility of LIME explanations can be traced to the sampling variance when explaining a single data point and to the explanation sensitivity to sample size and sampling proximity.
Though decision-makers have access to the feature-importance explanations, end users are still not shown these explanations as reasoning for model output. Organization A aspires to eventually expose this “explanation” to end users.
For gradient-based language models, Organization A uses Integrated Gradients, a path integral variant of Shapley Values (sundararajan2017axiomatic; shap), to flag malicious reviews and moderate content at the aforementioned institutions. This information can be highlighted to ensure the trustworthiness and transparency of the model to the decision-maker (the hired content moderator), since they can now see which word was most important to flag the content as malicious.
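Integrated Gradients attributes a prediction by averaging the model's gradient along the straight line from a baseline input to the actual input and scaling by the input difference. The sketch below approximates that integral with a midpoint Riemann sum. To stay self-contained it uses central-difference gradients instead of automatic differentiation, and the two-feature logistic `model`, the zero baseline, and the step count are hypothetical choices for illustration only.

```python
import math


def numerical_gradient(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def integrated_gradients(f, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(numerical_gradient(f, point)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Hypothetical differentiable "model": a logistic score over two word counts.
def model(x):
    return 1.0 / (1.0 + math.exp(-(1.5 * x[0] - 0.5 * x[1])))


x, baseline = [2.0, 1.0], [0.0, 0.0]
attributions = integrated_gradients(model, x, baseline)
```

The attributions approximately sum to `model(x) - model(baseline)`, the completeness property of the method. In this toy case the first word receives a positive attribution and the second a negative one.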
Going forward, Organization A intends to use a global variant of the Shapley value explanations by exposing how Shapley value explanations work on average for datapoints of a particular predicted class (e.g., on average someone who was denied a loan had their age matter most for the prediction). This global explanation would help risk analysts get a bird's-eye view of how a model behaves and whether it aligns with their expectations.
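A class-conditional average of local attributions could look like the sketch below. It assumes that the aggregate is the mean absolute attribution per feature, which is one reasonable choice among several; the text does not specify the aggregation Organization A has in mind. The function name and the example attributions are hypothetical.

```python
from collections import defaultdict


def mean_abs_attribution_by_class(attributions, predicted):
    """Average |attribution| per feature over datapoints sharing a predicted class."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(attributions, predicted):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, v in enumerate(row):
            acc[i] += abs(v)
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


# Hypothetical local attributions for features (age, income) on four applicants.
local = [[0.75, 0.25], [0.25, 0.25], [0.25, 0.75], [-0.25, 0.5]]
predicted = ["denied", "denied", "approved", "approved"]
summary = mean_abs_attribution_by_class(local, predicted)
```

In this made-up data, age dominates on average among denied applicants while income dominates among approved ones, which is the kind of summary a risk analyst could compare against expectations.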
4.1.3. Heatmaps in Transportation
Organization B looks to detect facial expressions from video feeds of users driving. They hope to use explainability to identify the actions a user is performing while the user drives. Organization B has tried feature visualization and activation visualization techniques that get attributions by backpropagating gradients to regions of interest (zhang2018top; adel2018excitation). Specifically, they use these probabilistic Winner-Take-All techniques (variants of existing gradient-based feature importance techniques (shrikumar2017learning; sundararajan2017axiomatic)) to localize the region of importance in the input space for a particular classification task. For example, when detecting a smile, they expect the mouth of the driver to be important.
Though none of these desired techniques have been deployed for the end user (the driver in this case), ML engineers at Organization B found these techniques useful for qualitative review. On tiny datasets, engineers can figure out which scenarios have false positives (videos falsely detected to contain smiles) and why. They can also identify if true positives are paying attention to the right place or if there is a problem with spurious artifacts.
However, while trying to understand why the model erred by analyzing similarities in false positives, they have struggled to scale this local technique across heatmaps in aggregate across multiple videos. They are able to qualitatively evaluate a sequence of heatmaps for one video, but doing so across 100M frames simultaneously is far more difficult. Paraphrasing the VP of AI at Organization B, aggregating saliency maps across videos is moot and contains little information. Note that an individual heatmap is an example of a local explainability technique, but an aggregate heatmap for 100M frames would be a global technique. Unlike aggregating Shapley values for tabular data as done at Organization A, taking an expectation over heatmaps (in the statistical sense) does not work, since aggregating pixel attributions is meaningless. One option Organization B discussed would be to cluster low-dimensional representations of the heatmaps and then tag each cluster based on what the model is focusing on; unfortunately, humans would still have to manually label the clusters of important regions.
4.1.4. Spurious Correlations
Related to model monitoring for feature drift detection discussed in Section 3.1, Organization B has encountered issues with spurious correlations in their smile detection models. Their Vice President of AI noted that “[ML engineers] must know to what extent you want ML to leverage highly correlated data to make classifications.” Explainability can help identify models that focus on that correlation and can find ways to have models ignore it. For example, there may be a side effect of a correlated facial expression or co-occurrence: cheek raising, for example, co-occurs with smiling. In a cheek-raise detector trained on the same dataset as a smile detector but with different labels, the model still focused on the mouth instead of the cheeks. Both models were fixated on a prevalent co-occurrence. Attending to the mouth was undesirable in the cheek-raise detector but allowed in the smile detector.
One way Organization B combats this is by using simpler models on top of complex feature engineering. For example, they use black box deep learning models for building good descriptors that are robust across camera viewpoints and will detect different features that subject matter experts deem important for drowsiness. There is one model per important descriptor (i.e., one model for eyes closed, one for yawns, etc.). Then, they fit a simple model on the extracted descriptors such that the important descriptors are obvious for the final prediction of drowsiness. Ideally, if Organization B had guarantees about the disentanglement of data generating factors (adel2018discovering), they would be able to understand which factors (descriptors) play a role in downstream classification.
4.1.5. Feature Importance - Takeaways
Not only do Shapley values come with nice axiomatic guarantees, they are also simple to deploy for decision-makers to sanity check the models they have built.
Feature importance is not used directly for end users, and instead explanations require looping in decision-makers who are acting based on the original model outputs.
Heatmaps are hard to aggregate over, which makes it hard to do false positive detection at scale.
Spurious correlations can be detected with simple gradient-based techniques.
4.2. Counterfactual Explanations
Counterfactual explanations are techniques that explain individual predictions by providing a means for recourse. While some open-source implementations of counterfactual explanations exist (ustun2019actionable; wexler2019if), they either work only for specific model types or are not black-box in nature. In this section, we discuss the formulation for counterfactual explanations and describe one solution for each deployed technique.
Counterfactual explanations are points close to the input for which the decision of the classifier changes. For example, for a person who was rejected for a loan by an ML model, a counterfactual explanation would possibly suggest: “Had your income been greater by $5000, the loan would have been granted.”
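Given an input $x$, a classifier $f$, and a distance $d(\cdot, \cdot)$ on the input space, a counterfactual explanation $c$ can be found by solving the optimization problem:

$$\min_{c} \; d(x, c) \quad \text{subject to} \quad f(c) \neq f(x).$$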
While the term counterfactual has a well understood meaning in the causality literature (rubin; pearl2000causality), counterfactual explanations for ML were introduced by wachter2017counterfactual. sharma2019certifai provide details on existing techniques.
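The idea of searching for the closest point whose predicted label differs from that of the input can be illustrated with a deliberately simple brute-force search. The sketch below only changes one feature at a time on a fixed grid and measures closeness with Euclidean distance, so it is a restricted stand-in for the general problem and not the method of wachter2017counterfactual or sharma2019certifai. The loan `classify` rule, the step sizes, and the function name are hypothetical.

```python
import math


def single_feature_counterfactual(classify, x, step, max_steps=100):
    """Closest point to x, changing one feature at a time, whose label differs from x's."""
    original = classify(x)
    best, best_dist = None, math.inf
    for i in range(len(x)):
        for direction in (1, -1):
            for k in range(1, max_steps + 1):
                candidate = list(x)
                candidate[i] += direction * k * step[i]
                if classify(candidate) != original:
                    dist = math.dist(x, candidate)
                    if dist < best_dist:
                        best, best_dist = candidate, dist
                    break  # larger moves along this direction are only farther away
    return best, best_dist


# Hypothetical loan classifier over (income in $1000s, debt in $1000s).
def classify(x):
    income, debt = x
    return "granted" if income - 2 * debt >= 40 else "rejected"


applicant = [45.0, 5.0]
counterfactual, distance = single_feature_counterfactual(
    classify, applicant, step=[1.0, 1.0]
)
```

For this toy applicant the search returns a counterfactual with debt reduced by 3 (in $1000s), which is closer than the alternative of raising income by 5. The search treats the classifier purely as a black box, but it makes no attempt to ensure that the suggested change is feasible for the person.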
4.2.2. Counterfactual Explanations in Healthcare
Organization C uses a faster version of the formulation in sharma2019certifai to find counterfactual explanations for projects in healthcare. When people apply for Medicare, Organization C hopes to flag if a user’s application has errors and to provide explanations on how to correct the errors. Moreover, ML engineers can use the robustness score to compare different models trained using this data: this robustness score is effectively the distance between the counterfactual and original point in Euclidean space. The original formulation makes use of a slower genetic algorithm, so they optimized the counterfactual explanation generation process. They are currently developing a first-of-its-kind application that can directly take in any black-box model and data and return a robustness score, fairness measure, and counterfactual explanation, all from a single underlying algorithm.
The use of this approach has several advantages: it can be applied to black-box models, works for any input data type, and generates multiple explanations in a single run of the algorithm. However, there are some shortcomings that Organization C is trying to address. One challenge of counterfactual models is that the counterfactual might not be feasible. Organization C plans to address this by using the training data to guide the counterfactual generation process, ensuring that the counterfactuals are feasible given the training distribution. In addition, the flexibility of the counterfactual approach comes with a drawback that is common among explanations for black-box models: because the search is heuristic, there is no guarantee that the explanation it returns is optimal.
Through the creation of a deployed solution for this method, the organization realized that clients would ideally want an explainability score, along with the measure of fairness and robustness. They are currently developing an explainability score that seeks to measure how explainable different models are. However, since explanations are subjective, it is crucial to see how such a measure and the produced explanations are received by clients.
4.2.3. Counterfactual Explanations - Takeaways
Counterfactual explanation solutions yield client interest, since the underlying method is flexible and such explanations are easy for end users to understand.
Since the method is heuristic, it is hard to say that the explanation produced is optimal. In general, counterfactual explanations are difficult to evaluate.
4.3. Adversarial Training
To ensure that a deployed predictor is robust to adversaries and behaves as intended, many organizations use adversarial training so that the predictor fits the desired, robust, and human-interpretable features. Interestingly, this is a use case for explainability techniques that is not for enhancing transparency so much as protecting the integrity of the algorithmic decision-making process.
Recent works have also explored the intersection between adversarial robustness and model interpretations (yeh2019sensitive; adv2int; second; ghorbani2017interpretation; dombrowski2019explanations). In particular, adversarially trained models have been shown not only to be robust but also to provide sharper and clearer feature importance scores. The claim of one of these works is that the closest adversarial example should perturb the robust features (indicative of a particular class) and not fit to spurious non-robust features (ilyas2019adversarial). The robustness of a model to adversarial attacks depends on how well the feature importance (saliency) map aligns with the input. The setup of feature importance in second is as follows:
Here the explanation consists of the top-$k$ feature importance scores of the input $x$. This is similar to the adversarial example setup, which is usually written in the same manner (without the norm that limits the number of features changed). It is also interesting to note that the formulation to find counterfactual explanations above matches the formulation for finding adversarial examples. sharma2019certifai use this connection to generate adversarial examples and define a black-box model robustness score.
4.3.2. Image Content Moderation
Organization D moderates user-generated content (UGC) on several public platforms. Specifically, the R&D team at Organization D developed several models to detect adult and violent content from users’ uploaded images. Their quality assurance (QA) team measures model robustness to improve content detection accuracy under the threat of adversarial examples.
The robustness of a content moderation model is measured by the minimum perturbation required for an image to evade detection. Consider a gradient-based image classification model $f$ over $K$ classes, and assume $f(x) = \arg\max_i Z(x)_i$, where $Z(x) \in \mathbb{R}^K$ is the final (logit) layer output and $Z(x)_i$ is the prediction score for the $i$-th class. Finding the minimum perturbation can be formulated as an optimization problem whose objective combines two terms: a distance measure $d(\cdot, \cdot)$ between the original and the perturbed image, which Organization D chooses to be the $\ell_2$ distance in Euclidean space, and the cross-entropy loss $\mathcal{L}$ of the model on the perturbed image, weighted by a balancing factor $\lambda$.
As is common in the adversarial literature, Organization D applies Projected Gradient Descent (PGD) to search for the minimum perturbation from the set of allowable perturbations $S$ (madry2017towards). The search process can be formulated as
$$x^{t+1} = \Pi_{x+S}\left(x^{t} + \alpha \, \mathrm{sign}\left(\nabla_{x} \mathcal{L}(x^{t}, y)\right)\right),$$
where $\Pi_{x+S}$ is the projection onto the allowable set around $x$, $\alpha$ is the step size, and $\mathcal{L}(x^{t}, y)$ is the cross-entropy loss of the model on $x^{t}$ with respect to the label $y$ of $x$. The update is repeated until $x^{t}$ is misclassified by the detection model. ML engineers on the QA team are shown an $\ell_2$-norm perturbation distance averaged over test images randomly sampled from the test dataset. The larger the average perturbation, the more robust the model is, as it takes greater effort for an attacker to evade detection. The average perturbation required is also widely used as a metric when comparing different candidate models and different versions of a given model.
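The averaged perturbation metric can be illustrated with a small sketch. It assumes a hypothetical linear detector with an analytic input gradient and uses $\ell_2$-normalized gradient steps with projection onto an $\ell_2$ ball, which is one common PGD variant. The weights, step size, and radius are illustrative choices, not Organization D's settings.

```python
import math

# Hypothetical linear "detector": p(flagged | x) = sigmoid(w . x + b).
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = -0.5

def predict_proba(x):
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(x, y):
    """Gradient of the cross-entropy loss w.r.t. the input, for true label y."""
    p = predict_proba(x)
    return [(p - y) * w for w in WEIGHTS]

def pgd_min_perturbation(x, y, step=0.05, eps=5.0, max_iter=1000):
    """Ascend the loss in small l2-normalized steps, projecting back onto the
    l2 ball of radius eps around x, until the predicted label flips.
    Returns the l2 norm of the perturbation, or None if no flip is found."""
    x_adv = list(x)
    for _ in range(max_iter):
        if int(predict_proba(x_adv) >= 0.5) != y:
            return math.dist(x, x_adv)
        grad = loss_gradient(x_adv, y)
        norm = math.sqrt(sum(g * g for g in grad)) or 1.0
        x_adv = [xa + step * g / norm for xa, g in zip(x_adv, grad)]
        # Projection step: rescale the perturbation if it left the eps-ball.
        delta = [xa - xi for xa, xi in zip(x_adv, x)]
        delta_norm = math.sqrt(sum(d * d for d in delta))
        if delta_norm > eps:
            x_adv = [xi + d * eps / delta_norm for xi, d in zip(x, delta)]
    return None

def average_perturbation(samples):
    """Mean l2 perturbation over the (x, y) pairs for which an evasion was found."""
    dists = [pgd_min_perturbation(x, y) for x, y in samples]
    found = [d for d in dists if d is not None]
    return sum(found) / len(found) if found else float("nan")
```

With a fixed step size, the reported distance overestimates the true minimum by at most one step, so smaller steps give a tighter robustness estimate at the cost of more iterations.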
Organization D finds that more robust models have more convincing gradient-based explanations, i.e., the gradient of the output with respect to the input shows that the model is focusing on relevant portions of the images, confirming recent research (tsipras2018robustness; adv2int; ilyas2019adversarial).
4.3.3. Text Content Moderation
Organization E uses text content moderation algorithms on its UGC platforms, such as forums. Its QA team is responsible for the reliability and robustness of a sentiment analysis model, which labels posts as positive or negative, trained on UGC. The QA team seeks to find the minimum perturbation required to change the classification of a post. In particular, they want to know how to take misclassified posts (e.g., negative ones classified as positive) and change them to the correct class.
Given a sentiment analysis model $f: X \to Y$, which maps from the feature space $X$ to a set of classes $Y$, an adversary aims to generate an adversarial post $x_{\text{adv}}$ from the original post $x \in X$ whose ground truth label is $y \in Y$ so that $f(x_{\text{adv}}) \neq y$. The QA team tries to minimize $d(x, x_{\text{adv}})$ for a domain-specific distance function $d$. Organization E uses the distance in the embedding space, but it is equally valid to use the edit distance (niu2018word). Note that the perturbation technique changes accordingly.
In practice, to find the minimum distance in the embedding space, Organization E chooses to iteratively modify the words in the original post, starting from the words with the highest importance. Here importance is defined as the gradient of the model output with respect to a particular word. ML engineers compute the Jacobian matrix of a given post $x = (x_1, x_2, \ldots, x_N)$, where $x_i$ is the $i$-th word. The Jacobian matrix is as follows:
$$J_f(x) = \frac{\partial f(x)}{\partial x} = \left[\frac{\partial f_j(x)}{\partial x_i}\right]_{i \in 1..N,\; j \in 1..K},$$
where $K$ represents the number of classes (in this case $K = 2$), and $f_j(\cdot)$ represents the confidence value of the $j$-th class. The importance of word $x_i$ is defined as
$$C_{x_i} = J_f(x)_{i,\hat{y}} = \frac{\partial f_{\hat{y}}(x)}{\partial x_i},$$
i.e., the partial derivative of the confidence value of the predicted class $\hat{y}$ with respect to the input word $x_i$. This procedure ranks the words by their impact on the sentiment analysis results. The QA team then applies a set of transformations/perturbations to the most important words to find the minimum number of important words that must be perturbed in order to flip a sentiment analysis API result.
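The ranking step can be sketched as follows, assuming a hypothetical bag-of-words sentiment model in which each word is represented by a presence indicator, so that the partial derivatives are available in closed form. The vocabulary, the weights, and the function names are made up for illustration. A deployed model would instead differentiate through word embeddings.

```python
import math

# Hypothetical model: P(positive) = sigmoid(b + sum_i w[word_i] * x_i),
# where x_i = 1.0 marks the presence of the i-th word of the post.
WORD_WEIGHTS = {"great": 2.0, "terrible": -2.5, "not": -1.0, "movie": 0.1, "the": 0.0}
BIAS = 0.0

def class_confidences(words, x):
    z = BIAS + sum(WORD_WEIGHTS.get(w, 0.0) * xi for w, xi in zip(words, x))
    p_pos = 1.0 / (1.0 + math.exp(-z))
    return [1.0 - p_pos, p_pos]  # [f_0 (negative), f_1 (positive)]

def jacobian(words):
    """J[i][j] = d f_j / d x_i, evaluated at x = (1, ..., 1)."""
    p_neg, p_pos = class_confidences(words, [1.0] * len(words))
    slope = p_pos * p_neg  # derivative of the sigmoid at the current logit
    rows = []
    for w in words:
        d_pos = slope * WORD_WEIGHTS.get(w, 0.0)
        rows.append([-d_pos, d_pos])  # the two confidences sum to one
    return rows

def rank_words(words):
    """Rank words by the partial derivative of the predicted class's confidence."""
    conf = class_confidences(words, [1.0] * len(words))
    predicted_class = conf.index(max(conf))
    J = jacobian(words)
    importance = [(J[i][predicted_class], w) for i, w in enumerate(words)]
    return predicted_class, sorted(importance, reverse=True)

post = ["the", "movie", "not", "great", "terrible"]
predicted, ranking = rank_words(post)
```

Words at the top of `ranking` push the prediction toward the predicted class most strongly, so they are the first candidates for perturbation.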
4.3.4. Adversarial Training - Takeaways
There is a relation between model robustness and explainability. Model robustness improves the quality of feature importances (specifically saliency maps), confirming recent research findings (adv2int).
Feature importance helps find minimal adversarial perturbations for language models in practice.
4.4. Influential Samples
This technique asks the question: Which data point in the training dataset is most influential to the predictor’s output for a test point $z_{\text{test}}$? Statisticians have used measures like Cook’s distance (cook1977detection) which measure the effect of deleting a data point on the model output. However, such measures require an exhaustive search and hence do not scale well for larger datasets.
For over half of the organizations, influence functions have been the tool of choice for explaining which training points are influential to the predictor’s output for a point $x$ (koh2017understanding) (though only one organization actually deployed the technique). We let $L(z, \theta)$ be the predictor’s loss for point $z = (x, y)$ under parameters $\theta$, so the empirical risk minimizer over $n$ training points is given by $\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta)$. Note that $f_{\hat{\theta}}(x)$ is the predicted output at $x$ with the trained risk minimizer. koh2017understanding defines the most influential data point $z$ to a fixed point $z_{\text{test}}$ as that which maximizes the following:
$$\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_{\theta} L(z, \hat{\theta}),$$
where $H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta})$ is the Hessian of the empirical risk.
This quantity measures the effect of upweighting a training datapoint $z$ on the loss at $z_{\text{test}}$. The goal of sample importance is to uncover which training examples, when perturbed, would have the largest effect (positive or negative) on the loss of a test point.
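To make the influence computation concrete, the following sketch assumes a one-parameter least-squares model, for which the Hessian is a scalar and no matrix inversion is needed. The data, the function names, and the choice to rank by absolute influence are illustrative assumptions. Realistic models require Hessian-vector product approximations instead.

```python
def fit(train):
    """Least-squares slope for y ~ theta * x (the empirical risk minimizer)."""
    return sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def grad_loss(z, theta):
    """Gradient in theta of L(z, theta) = 0.5 * (theta * x - y) ** 2."""
    x, y = z
    return (theta * x - y) * x

def influence_on_loss(z, z_test, train, theta):
    """-grad L(z_test) * H^{-1} * grad L(z), with a scalar Hessian H."""
    hessian = sum(x * x for x, _ in train) / len(train)
    return -grad_loss(z_test, theta) * grad_loss(z, theta) / hessian

# Toy training set; the last point lies far from the trend of the others.
train = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 9.0)]
theta_hat = fit(train)
z_test = (3.0, 3.0)

scores = [influence_on_loss(z, z_test, train, theta_hat) for z in train]
# Rank by magnitude, since influence can be positive or negative.
most_influential = max(zip(scores, train), key=lambda pair: abs(pair[0]))[1]
```

In this toy example the point that deviates most from the others receives the largest score.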
4.4.2. Influence Functions in Insurance
Organization F uses influence functions to explain risk models in the insurance industry. They hope to identify which customers might see an increase in their premiums based on their driving history in the past. The organization hopes to divulge to the end user how the premiums for drivers similar to them are priced. In other words, they hope to identify the influential training data points (koh2017understanding) to understand which past drivers had the greatest influence on the prediction for the observed driver. Unfortunately, Organization F has struggled to expose this information to end users because the high latency of the Hessian computation makes doing so impractical.
More pressingly, even when Organization F lets the influence function procedure run, they find that many influential data points are simply outliers that are important for all drivers since those anomalous drivers are far out of distribution. As a result, instead of identifying which drivers are most similar to a given driver, the influential sample explanation identifies drivers that are very different from any driver (i.e., outliers). While this could in theory be useful for outlier detection, it prevents the explanations from being used at deployment.
4.4.3. Influential Samples - Takeaways
Influence functions can be intractable for large datasets; as such, a significant effort is needed to improve these methods to make them easy to deploy in practice.
Influence functions can be sensitive to outliers in the data, such that they might be more useful for outlier detection than for providing end users explanations.
This section provides recommendations for future work, based on the technique-specific takeaways in Section 4 and the key takeaways in Section 3.5. In order to address the challenges organizations face when striving to provide explanations to end users, we recommend a framework for establishing clear desiderata in explainability, including how to approach normative concerns.
5.1. Establish Clear Desiderata
Most organizations we spoke to solely deploy explainability techniques for internal engineers and scientists, as a debugging mechanism or as a sanity check. At the same time, these organizations also affirmed the importance of understanding the stakeholder, and hope to be able to explain a model prediction to the end user. Once the target population of the explanation is understood, organizations can devise and deploy explainability techniques accordingly. We propose the following 3 steps for establishing clear desiderata and improving decision making around explainability. These include: clearly identifying the target population, understanding their needs, and clarifying the intention of the explanation.
Identify your target population (aka your stakeholder). That is, who is your desired explanation consumer? Ideally this person is also affected by or is shown output based on the model.
Engage with the stakeholder. Ask them some variant of “What would you need the model to explain to you in order to understand, trust, or contest the model prediction?” and “How would an explanation help you?”
If the explanation would not be helpful: Better understand the use case of the model and how it is being deployed.
If the explanation would be helpful: Follow up with understanding how the explanation will better inform the target population.
Understand the intention of the explanation. Once the context of the explanation and the helpfulness of the explanation are established, examine what will be done with the explanation.
Static Consumption: Will the explanation be used as a sanity check for a data scientist or shown to an end user as reasoning for a particular prediction?
Dynamic Model Updates: Will the explanation be used to garner feedback from the end user as to how the model ought to be updated to better align with their intuition? That is, how does the user interact with the model after viewing the explanation?
Once the desiderata are clarified, domain experts can be shown the explanations to ensure that they exhibit expected behavior. Having clearer desiderata is vital since the current literature lacks a clear direction for why explanations are desired and how explanations would be helpful in practice.
5.2. Important Normative Desiderata
In this subsection, we discuss a few normative desiderata that companies should consider when deploying explainability techniques. These desiderata (with the exception of causality) were not explicitly mentioned in our interviews, and have been consciously included here in order to highlight important AI ethics concerns.
5.2.1. Fairness Guided By Explainability
As organizations consider deploying explainability techniques, it is important to reflect on fairness as a key desideratum. Explanations can help expose fairness violations by providing insights into possible biases in a model. For example, work on counterfactual explanations (sharma2019certifai; ustun2019actionable) has demonstrated how explanations can be used to examine predictor fairness. We now define how bias can potentially be detected using two explainability approaches.
Approach 1: Consider a binary predictor $f$, an input $x$, and a feature importance explanation function $g$ that returns importance scores for all features, where $g(f, x)_j$ (simplified to $g_j$ in context) is the importance of (or attribution for) feature $x_j$ of $x$. Let $x_k$ be a protected attribute. Then, if $|g_k| > \tau$, the predictor indicates potential discrimination at a local level, where $\tau$ is a fairness-sensitive parameter decided by the ML engineer.
Approach 1 implies that if a protected attribute, such as race or gender, has an importance beyond a certain level for determining the prediction for an individual, this is indicative of potential bias, and the ML engineer should take steps to examine such decision-making. This approach can be extended to a global level by taking an expectation of the importance of the protected attribute over the different groups to find the importance of the protected attribute for every group. Note that the applicability of this approach for fairness is based on the assumption that features are independent, and other methods should be considered if this is not the case.
In addition, it is important to note that this is not meant to be a definition of “bias” itself but a potential indicator for bias. Given that much of the bias literature in ML relies on the use of protected attributes to mitigate or correct for bias (barocas-hardt-narayanan), the fact that a protected class variable is considered important for a prediction does not necessarily imply the model exhibits algorithmic bias. Instead, the goal of this approach is to provide a sanity check. If a protected class variable plays a key role in the algorithm’s decision, the ML engineer should ensure that the use of this variable is appropriate.
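The sanity check in Approach 1 can be sketched in a few lines. This sketch assumes that importance scores have already been produced by some feature importance method and that the threshold (called `tau` here) is compared against the absolute value of the score. The feature names, scores, and group labels are hypothetical.

```python
def flag_protected_importance(importances, protected, tau):
    """Local check: protected features whose |importance| exceeds tau."""
    return {name: score for name, score in importances.items()
            if name in protected and abs(score) > tau}

def mean_importance_by_group(records, feature):
    """Global variant: average |importance| of `feature` within each group.
    `records` is a list of (group_label, importances_dict) pairs."""
    totals, counts = {}, {}
    for group, importances in records:
        totals[group] = totals.get(group, 0.0) + abs(importances[feature])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical local attributions for one applicant.
attributions = {"income": 0.42, "debt": -0.31, "gender": 0.18, "zip_code": 0.05}
flags = flag_protected_importance(attributions, protected={"gender"}, tau=0.10)

# Hypothetical attributions for several individuals, keyed by group.
records = [("A", {"gender": 0.18}), ("A", {"gender": 0.22}), ("B", {"gender": 0.02})]
by_group = mean_importance_by_group(records, "gender")
```

A non-empty `flags` dictionary is a prompt for the ML engineer to examine the decision, not evidence of bias in itself.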
Approach 2: Consider a set of points $S$ that represent the most influential samples responsible for the prediction of input $x$. Let $x_k$ be a protected attribute. If a higher percentage of points in $S$ have the same value for $x_k$ as the input $x$, this is potentially indicative of prejudice towards or against the group having that value of $x_k$.
Approach 2 implies that a model might be biased when individuals having the same protected attribute value are most responsible for the prediction for a new individual who belongs to the same protected attribute group. This might be indicative of bias arising due to a group being treated similarly (positively or negatively) historically (which is reflected in the data and hence the model), or a lack of sample diversity for particular groups. This might also reflect a reliance on the protected attribute in a model’s decision, which, as discussed above, can be undesirable if not done to specifically correct for bias. Again, this is not meant to be a definition of bias, but rather a sanity check for ML engineers to check for bias.
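A minimal sketch of the check in Approach 2 follows. The text leaves open what the percentage should be compared against. This sketch assumes a comparison with the group's base rate in the training data plus a margin, and both the comparison and the names used are assumptions made for illustration.

```python
def share_same_group(influential, query, attribute):
    """Fraction of influential samples sharing the query's protected value."""
    same = sum(1 for sample in influential if sample[attribute] == query[attribute])
    return same / len(influential)

def flag_group_reliance(influential, query, attribute, base_rate, margin=0.2):
    """Flag when the share exceeds the group's base rate by more than `margin`.
    Returns (flagged, share)."""
    share = share_same_group(influential, query, attribute)
    return share - base_rate > margin, share

# Hypothetical top influential training samples for one query individual.
influential = [{"gender": "F"}, {"gender": "F"}, {"gender": "F"}, {"gender": "M"}]
query = {"gender": "F"}
flagged, share = flag_group_reliance(influential, query, "gender", base_rate=0.5)
```

As with Approach 1, a flag indicates that the prediction deserves a closer look rather than establishing that the model is biased.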
The two approaches above intuitively follow from the respective explanation regimes (i.e., feature importance and influence functions), and we suggest how both approaches generate explanations that might be useful for detecting biases.
5.2.2. Explainability and Privacy
Another important desideratum for organizations to consider when deploying explainability techniques is privacy, since model explanations can be used to reconstruct the model (milli2019model) or to do transparency-based membership inference (shokri2019privacy). tramer2016stealing shows how models can be replicated with access to only the model APIs. Providing black-box explanations could potentially mitigate this issue. preece2018asking and sokol2019counterfactual discuss how explainability could compromise on privacy. We describe an example case and then provide possible suggestions to practitioners on how to address these issues.
Consider the use of counterfactual explanations to explain a prediction to an end user. An end user could not only use such tools to change their data maliciously to fool a model, but by querying the model multiple times using a random set of inputs, a user can possibly learn the approximate decision boundary that is being used for classification. Moreover, methods such as the what-if tool for producing counterfactual explanations pose a risk to data privacy since they provide points that actually belong to the training dataset (wexler2019if). Providing the most relevant data points using sample importance unintentionally provides data that may belong to prior users, which infringes privacy. Where those risks are present, industry practitioners may need to develop methods to avoid the harmful use of explainability tools. Below is one possible way to approach such issues, applying a framework analogous to differential privacy.
Approach: Given a model $f$ and an input to the model $x$, consider an explanation represented by $g(f, x)$, where $g(f, x)$ could be in the form of a counterfactual explanation point or a sample important to the input’s prediction. Then, the private explanation $\tilde{g}(f, x)$ should be such that $\tilde{g}(f, x) = g(f, x) + \eta$, where $\eta$ is zero-centered Laplace noise with scale $b$.
The approach above implies that publicly released training data points or counterfactual explanations need to have noise added to them so that data rights or model privacy are not compromised. Global explainability methods need to investigate ways to provide explanations about the model without providing details on model weights (directly or via global level feature importance scores) (milli2019model).
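The noise-addition step can be sketched as follows, assuming the explanation is a numeric point (for example, a counterfactual or an influential training sample) and that independent Laplace noise is added to each coordinate. The function names and the example values are hypothetical, and choosing a scale that yields a formal privacy guarantee is left open here.

```python
import random

def laplace_noise(scale, rng):
    """Zero-centered Laplace sample, drawn as the difference of two
    independent exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize_explanation(point, scale, seed=None):
    """Return a copy of `point` with Laplace noise added to every coordinate."""
    rng = random.Random(seed)
    return [value + laplace_noise(scale, rng) for value in point]

# Hypothetical counterfactual point (income and debt in $1000s).
explanation = [45.0, 37.0]
private_explanation = privatize_explanation(explanation, scale=0.5, seed=1)
```

A larger scale hides more about the underlying point but also makes the released explanation less faithful, so the scale encodes a trade-off between privacy and usefulness.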
5.2.3. Explainability and Causality
One chief scientist told us that “Figuring out causal factors is the holy grail of explainability.” However, causal explanations are largely lacking in the literature, other than preliminary work on causal attribution for deep learning models (causal). Though non-causal explanations can still provide valid and useful interpretations of how the model works (miller2018explanation), many organizations posit that the lack of use of local explanation techniques for end users stems from a lack of causal interpretations.
The relationship between causality and explainability in ML relates to a broader dichotomy. In distinguishing ML techniques from those in econometrics or statistics, a key component is prediction vs. inference (athey2017state; athey2017; predPolicy). ML focuses on prediction, such that understanding the underlying process is generally considered less important than predictive accuracy. In fact, the power of ML tools is often presented as a trade-off between accuracy and explainability, with simpler, more interpretable models often performing worse on accuracy measures (rudin2019stop). That said, there has been a growing demand from end users, civil society groups, and lawmakers for ML engineers not only to justify the quality of their predictions through accuracy metrics but also to provide explanations for how the predictions were arrived at. Explanations, however, are inherently inferential, so these trends imply a growing need for an inferential approach to ML. While econometrics provides many tools for causal inference on linear models (athey2017state), more work needs to be done to connect statistical inference techniques to the context of more complex ML models.
5.2.4. Unintended Consequences from Explainability Research
As ML is increasingly being deployed in high-stakes situations, including in finance, criminal justice, and content moderation, the ethical implications of how explainability techniques are used are an important concern. In order for explainability techniques to facilitate greater accountability for end users, ethical desiderata must include broader societal considerations for how the ML is being used.
For example, several organizations we talked with have begun to make extensive use of natural language processing and image recognition models for content moderation in response to business incentives, regulatory requirements, and sociopolitical pressure. In some instances, explainability techniques have become part of the workflows and development of those content moderation processes and are making them more effective.
Though there are aspects of these use cases that are clearly in users’ interest, there are others where that is much less clear, with potential for adverse effects both when these systems work correctly and when they err (gillespie2018custodians; transparency_content; marwick2016best). ML researchers may not always be in a position to set the objectives and criteria for how their technology is applied, which makes it difficult to propose best practices for ethical approaches to such work. The ML research community should continue to be mindful about the potential for both constructive and unintended consequences of its work in this and other sensitive domains.
In this study, we critically examine how explanation techniques are used in practice and illuminate the gaps between current techniques and normative desiderata. We are the first, to our knowledge, to interview various organizations on how they deploy explainability in their ML workflows, concluding with salient directions for future research. We found that while ML engineers are increasingly using explainability techniques as sanity checks during the development process, there are still significant limitations to current techniques that prevent their use to directly inform end users. These limitations include the need for domain experts to evaluate explanations, the risk of spurious correlations reflected in model explanations, the lack of causal intuition, and the latency in computing and showing explanations in real-time. Future research should seek to address these limitations.
We also highlighted the need for organizations to establish clear desiderata for their explanation techniques and to incorporate ethics-related desiderata, taking into account issues such as fairness, privacy, causality, and the potential for adverse unintended consequences. Through this analysis, we take a step towards explaining explainability deployment and hope that future research builds trustworthy explainability solutions.
The authors would like to thank the following individuals for their advice, contributions, and/or support: Karina Alexanyan (Partnership on AI), Gagan Bansal (University of Washington), Rich Caruana (Microsoft), Amit Dhurandhar (IBM), Krishna Gade (Fiddler Labs), Jette Henderson (CognitiveScale), Yannis Kalantidis (Facebook), Bahador Kaleghi (Element AI), Been Kim (Google), Hima Lakkaraju (Harvard University), Katherine Lewis (Partnership on AI), Peter Lo (Partnership on AI), Terah Lyons (Partnership on AI), Saayeli Mukherji (Partnership on AI), Erik Pazos (QuantumBlack), Inioluwa Deborah Raji (Partnership on AI), Francesca Rossi (IBM), Jay Turcot (Affectiva), Kush Varshney (IBM), Dennis Wei (IBM), Edward Zhong (Baidu), Gabi Zijderveld (Affectiva), and around ten other anonymous individuals.