# Manipulating and Measuring Model Interpretability

## Abstract

Despite a growing body of research focused on creating interpretable machine learning methods, there have been few empirical studies verifying whether interpretable methods achieve their intended effects on end users. We present a framework for assessing the effects of model interpretability on users via pre-registered experiments in which participants are shown functionally identical models that vary in factors thought to influence interpretability. Using this framework, we ran a sequence of large-scale randomized experiments, varying two putative drivers of interpretability: the number of features and the model transparency (clear or black-box). We measured how these factors impact trust in model predictions, the ability to simulate a model, and the ability to detect a model’s mistakes. We found that participants who were shown a clear model with a small number of features were better able to simulate the model’s predictions. However, we found no difference in multiple measures of trust and found that clear models did not improve the ability to correct mistakes. These findings suggest that interpretability research could benefit from more emphasis on empirically verifying that interpretable models achieve all their intended effects.

## 1 Introduction

Machine learning is increasingly used to make decisions that affect people’s lives in critical domains like criminal justice, credit, lending, and medicine. Machine learning models are often evaluated based on their predictive performance on held-out data sets, measured, for example, in terms of accuracy, precision, or recall. However, good performance on held-out data may not be sufficient to convince decision makers that a model is trustworthy or reliable in the wild.

To address this problem, a new line of research has emerged that focuses on developing interpretable machine learning methods. There are two common approaches. The first is to employ models in which the impact of each feature on the model’s prediction is easy to understand. Examples include generalized additive models (Lou et al., 2012, 2013; Caruana et al., 2015) and point systems (Jung et al., 2017; Ustun and Rudin, 2016). The second is to provide post-hoc explanations for (potentially complex) models. One thread of research in this direction looks at how to explain individual predictions by learning simple local approximations of the model around particular data points (Ribeiro et al., 2016; Lundberg and Lee, 2017) or estimating the influence of training examples (Koh and Liang, 2017), while another focuses on visualizing model output (Wattenberg et al., 2016).

Despite the flurry of activity and innovation in this area, there is still no consensus about how to define, quantify, or measure the interpretability of a machine learning model (Doshi-Velez and Kim, 2017). Indeed, different notions of interpretability, such as simulatability, trustworthiness, and simplicity, are often conflated (Lipton, 2016). This problem is exacerbated by the fact that there are different types of users of machine learning systems and these users may have different needs in different scenarios. For example, the approach that works best for a regulator who wants to understand why a particular person was denied a loan may be different from the approach that works best for a data scientist trying to debug a machine learning model.

We take the perspective that the difficulty of defining interpretability stems from the fact that interpretability is not something that can be directly manipulated or measured. Rather, interpretability is a latent property that can be influenced by different manipulable factors (such as the number of features, the complexity of the model, the transparency of the model, or even the user interface) and that impacts different measurable outcomes (such as an end user’s ability to simulate, trust, or debug the model). Different factors may influence these outcomes in different ways. As such, we argue that to understand interpretability, it is necessary to directly manipulate and measure the influence that different factors have on real people’s abilities to complete tasks.

This endeavor goes beyond the realm of typical machine learning research. While the factors that influence interpretability are properties of the system design, the outcomes that we would ultimately like to measure are properties of human behavior. Because of this, building interpretable machine learning models is not a purely computational problem. In other words, what is or is not “interpretable” is defined by people, not algorithms. We therefore take an interdisciplinary approach, building on decades of psychology and social science research on human trust in models (e.g., Önkal et al., 2009; Dietvorst et al., 2015; Logg, 2017). The general approach used in this literature is to run randomized human-subject experiments in order to isolate and measure the influence of different manipulable factors on trust. Our goal is to apply this approach in order to understand interpretability—i.e., the relationships between properties of the system design and properties of human behavior.

We present a sequence of large-scale randomized human-subject experiments, in which we varied factors that are thought to make models more or less interpretable (Glass et al., 2008; Lipton, 2016) and measured how these changes impacted people’s decision making. We focus on two factors that are often assumed to influence interpretability, but rarely studied formally: the number of features and the model transparency, i.e., whether the model internals are clear or a black box. We focus on laypeople as opposed to domain experts, and ask which factors help them simulate a model’s predictions, gain trust in a model, and understand when a model will make mistakes. While others have used human-subject experiments to validate or evaluate particular machine learning innovations in the context of interpretability (e.g., Ribeiro et al., 2016; Lim et al., 2009), we attempt to isolate and measure the influence of different factors in a more systematic way by taking an experimental approach.

In each of our pre-registered experiments, participants were asked to predict the prices of apartments in a single neighborhood in New York City with the help of a machine learning model. Each apartment was represented in terms of eight features: number of bedrooms, number of bathrooms, square footage, total rooms, days on the market, maintenance fee, distance from the subway, and distance from a school. All participants saw the same set of apartments (i.e., the same feature values) and, crucially, the same model prediction for each apartment, which came from a linear regression model. What varied between the experimental conditions was only the presentation of the model. As a result, any observed differences in the participants’ behavior between the conditions could be attributed entirely to the model presentation.

In our first experiment (Section 2), we hypothesized that participants who were shown a clear model with a small number of features would be better able to simulate the model’s predictions and more likely to trust (and thus follow) the model’s predictions. We also hypothesized that participants in different conditions would exhibit varying abilities to correct the model’s inaccurate predictions on unusual examples. As predicted, we found that participants who were shown a clear model with a small number of features were better able to simulate the model’s predictions; however, we did not find that they were more likely to trust the model’s predictions and instead found no difference in trust between the conditions. We also found that participants who were shown a clear model were less able to correct inaccurate predictions.

In our second experiment (Section 3), we scaled down the apartment prices and maintenance fees to match median housing prices in the U.S. in order to determine whether the findings from our first experiment were merely an artifact of New York City’s high prices. Even with scaled-down prices and fees, the findings from our first experiment replicated.

In our third experiment (Section 4), we dug deeper into our finding that there was no difference in trust between the conditions. To make sure that this finding was not simply due to our measures of trust, we instead used the weight of advice measure frequently used in the literature on advice-taking (Yaniv, 2004; Gino and Moore, 2007) and subsequently used in the context of algorithmic predictions by Logg (2017). We hypothesized that participants would give greater weight to the predictions of a clear model with a small number of features than the predictions of a black-box model with a large number of features, and update their own predictions accordingly. We also hypothesized that participants’ behavior might differ if they were told that the predictions were made by a “human expert” instead of a black-box model with a large number of features. Even with the weight of advice measure, we again found no difference in trust between the conditions. We also found no difference in the participants’ behavior when they were told that the predictions were made by a human expert.

We view these experiments as a first step toward a larger agenda aimed at quantifying and measuring the impact of different manipulable factors that influence interpretability.

## 2 Experiment 1: Predicting apartment prices

Our first experiment was designed to measure the influence of the number of features and the model transparency on three properties of human behavior that are commonly associated with interpretability: laypeople’s abilities to simulate a model’s predictions, gain trust in a model, and understand when a model will make mistakes. Before running the experiment, we posited and pre-registered three hypotheses:1

• Simulation. A clear model with a small number of features will be easiest for participants to simulate.

• Trust. Participants will be more likely to trust (and thus follow) the predictions of a clear model with a small number of features than the predictions of a black-box model with a large number of features.

• Detection of mistakes. Participants in different conditions will exhibit varying abilities to correct the model’s inaccurate predictions on unusual examples.

For the unusual examples, we intentionally did not pre-register any hypotheses about which conditions would make participants more or less able to correct inaccurate predictions. On the one hand, if a participant understands the model better, she may be better equipped to correct examples on which the model makes mistakes. On the other hand, a participant may place greater trust in a model she understands well, leading her to closely follow its predictions.

Prediction error. Finally, we pre-registered our intent to analyze participants’ prediction error in each condition, but intentionally did not pre-register any directional hypotheses.

### 2.1 Experimental design

As explained in the previous section, we asked participants to predict apartment prices with the help of a machine learning model. We showed all participants the same set of apartments and the same model prediction for each apartment. What varied between the experimental conditions was only the presentation of the model. We considered four primary experimental conditions in a 2 × 2 design:

• Some participants saw a model that uses only two features (number of bathrooms and square footage—the two most predictive features), while some saw a model that uses all eight features. (Note that all eight feature values were visible to participants in all conditions.)

• Some participants saw the model internals (i.e., a linear regression model with visible coefficients), while some were presented with the model as a black box.

Screenshots from each of the four primary experimental conditions are shown in Figure 5. We additionally considered a baseline condition in which there was no model available.

We ran the experiment on Amazon Mechanical Turk using psiTurk (Gureckis et al., 2016), an open-source platform for designing online experiments. The experiment was IRB-approved. We recruited 1,250 participants, all located in the U.S. and with high Mechanical Turk approval ratings. The participants were randomly assigned to the five conditions (clear-2, clear-8, bb-2, bb-8, and no-model). Each participant received a flat payment of $2.50.

Participants were first shown detailed instructions (including, in the clear conditions, a simple English description of the corresponding two- or eight-feature linear regression model) before proceeding with the experiment in two phases. In the training phase, participants were shown ten apartments in a random order. In the four primary experimental conditions, participants were shown the model’s prediction of each apartment’s price, asked to make their own prediction, and then shown the apartment’s actual price. In the baseline condition, participants were asked to predict the price of each apartment and were then shown the actual price.

In the testing phase, participants were shown another twelve apartments. The order of the first ten was randomized, while the remaining two always appeared last, for reasons described below. In the four primary experimental conditions, participants were asked to guess what the model would predict for each apartment (i.e., to simulate the model) and to indicate how confident they were in this guess on a five-point scale (Figure 8). They were then shown the model’s prediction and asked to indicate how confident they were that the model was correct. Finally, they were asked to make their own prediction of the apartment’s price and to indicate how confident they were in this prediction (Figure 8). In the baseline condition, participants were asked to predict the price of each apartment and to indicate their confidence.

The apartments shown to participants were selected from a data set of actual Upper West Side apartments taken from StreetEasy,2 a popular and reliable New York City real estate website, between 2013 and 2015. To create the models for the four primary experimental conditions, we first trained a two-feature linear regression model on our data set using ordinary least squares with Python’s scikit-learn library (Pedregosa et al., 2011), rounding coefficients to “nice” numbers within a safe range.3 To keep the models as similar as possible, we fixed the coefficients for number of bathrooms and square footage and the intercept of the eight-feature model to match those of the two-feature model, and then trained a linear regression model with the remaining six features, following the same rounding procedure to obtain “nice” numbers. The resulting coefficients are shown in Figure 5. When presenting the model predictions to participants, we rounded predictions to the nearest $100,000.
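The two-stage fitting procedure can be sketched as follows. This is a simplified illustration, not the authors’ code: the paper used scikit-learn’s ordinary least squares, whereas this sketch uses NumPy’s `lstsq` to stay dependency-free, and `round_nice` is an assumed stand-in for the paper’s exact rounding rule.

```python
import numpy as np

def round_nice(x):
    """Round to one significant digit -- an assumed stand-in for the
    paper's "nice numbers within a safe range" rule."""
    if x == 0:
        return 0.0
    power = 10.0 ** np.floor(np.log10(abs(x)))
    return power * round(x / power)

def fit_two_stage(X, y):
    """X: (n, 8) array with number of bathrooms and square footage
    in the first two columns (column order is an assumption)."""
    n = len(y)
    # Stage 1: ordinary least squares on the two most predictive features.
    A2 = np.column_stack([X[:, :2], np.ones(n)])
    b2, *_ = np.linalg.lstsq(A2, y, rcond=None)
    coef2 = [round_nice(c) for c in b2[:2]]
    intercept = round_nice(b2[2])
    # Stage 2: hold those coefficients and the intercept fixed, then fit
    # the remaining six features to the residual (no intercept this time).
    residual = y - X[:, :2] @ coef2 - intercept
    b6, *_ = np.linalg.lstsq(X[:, 2:], residual, rcond=None)
    coef6 = [round_nice(c) for c in b6]
    return coef2 + coef6, intercept
```

Fixing the shared coefficients in stage 2 is what keeps the two- and eight-feature models as similar as possible, so that differences between conditions come from presentation rather than the models themselves.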

To enable comparisons across experimental conditions, the ten apartments used in the training phase and the first ten apartments used in the testing phase were selected from those apartments in our data set for which the rounded predictions of the two- and eight-feature models agreed and chosen to cover a wide range of deviations between the models’ predictions and the apartments’ actual prices. By selecting only apartments for which the two- and eight-feature models agreed, we were able to ensure that what varied between the experimental conditions was only the presentation of the model. As a result, any observed differences in the participants’ behavior between the conditions could be attributed entirely to the model presentation.
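The selection criterion can be expressed as a simple filter. The function and parameter names here are illustrative, not from the paper; predictions are rounded to the nearest $100,000 as in the experiment:

```python
def round_prediction(p, step=100_000):
    # Round a predicted price to the nearest $100,000, as shown to participants.
    return step * round(p / step)

def agreeing_apartments(apartments, predict2, predict8):
    """Keep only apartments where the rounded predictions of the
    two- and eight-feature models coincide (names are illustrative)."""
    return [a for a in apartments
            if round_prediction(predict2(a)) == round_prediction(predict8(a))]
```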

## 4 Experiment 3: Alternative measure of trust

In our first two experiments, we found that participants were no more likely to trust the predictions of a clear model with a small number of features than the predictions of a black-box model with a large number of features, as indicated by the deviation of their own predictions from the model’s prediction. However, perhaps another measure of trust would reveal differences between the conditions. In this section, we therefore present our third experiment, which was designed to allow us to compare participants’ trust across the conditions using an alternative measure of trust: the weight of advice measure frequently used in the literature on advice-taking (Yaniv, 2004; Gino and Moore, 2007; Logg, 2017).

Weight of advice quantifies the degree to which people update their beliefs (e.g., predictions made before seeing the model’s predictions) toward advice they are given (e.g., the model’s predictions). In the context of our experiment, it is defined as (f − i) / (m − i), where m is the model’s prediction, i is the participant’s initial prediction of the apartment’s price before seeing m, and f is the participant’s final prediction of the apartment’s price after seeing m. It is equal to 1 if the participant’s final prediction matches the model’s prediction and equal to 0.5 if the participant averages their initial prediction and the model’s prediction.
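As a sketch, weight of advice might be computed as follows; the function name and the explicit handling of the undefined case are our own:

```python
def weight_of_advice(initial, model, final):
    """Degree to which a final prediction moves toward the model's advice.
    Returns None when the initial prediction already matches the model's,
    since the measure is undefined in that case."""
    if initial == model:
        return None
    return (final - initial) / (model - initial)

# Fully adopting the advice gives 1; averaging the two predictions gives 0.5;
# ignoring the advice entirely gives 0.
```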

To understand the benefits of comparing weight of advice across the conditions, consider the scenario in which a participant’s final prediction is close to the model’s prediction. There are different reasons why this might happen. On the one hand, it could be the case that the participant’s initial prediction was far from the model’s prediction and the participant made a significant update based on the model. On the other hand, it could be the case that the initial prediction was already close to the model’s prediction and the participant did not update it at all. These two scenarios are indistinguishable in terms of the participant’s deviation from the model’s prediction. In contrast, weight of advice would be high in the first case and low in the second.

We additionally used this experiment as a chance to see whether participants’ behavior would differ if they were told that the predictions were made by a “human expert” instead of a model. Previous studies have examined this question from different perspectives with differing results (e.g., Önkal et al., 2009; Dietvorst et al., 2015). Most closely related to our experiment, Logg (2017) found that when people were presented with predictions from either an algorithm or a human expert, they updated their own predictions toward predictions from an algorithm more than they did toward predictions from a human expert in a variety of domains. We were interested to see whether this finding would replicate.

We pre-registered four hypotheses:

• Trust (deviation). Participants’ predictions will deviate less from the predictions of a clear model with a small number of features than the predictions of a black-box model with a large number of features.

• Trust (weight of advice). Weight of advice will be higher for participants who see a clear model with a small number of features than for those who see a black-box model with a large number of features.

• Humans vs. machines. Participants will trust a human expert and a black-box model to differing extents. As a result, their deviation from the model’s predictions and their weight of advice will also differ.

• Detection of mistakes. Participants in different conditions will exhibit varying abilities to correct the model’s inaccurate predictions on unusual examples.

The first two hypotheses are variations on H2 from our first experiment, the third is new, and the last hypothesis is identical to H3.

### 4.1 Experimental design

We considered the same four primary experimental conditions as in the first two experiments plus a new condition, expert, in which participants saw the same information as in bb-8, but with the black-box model labeled as “Human Expert” instead of “Model.” We did not include a baseline condition because the most natural baseline would have been to simply ask participants to predict apartment prices (i.e., the first step of the testing phase described below).

We again ran the experiment on Amazon Mechanical Turk. We excluded people who had participated in our first two experiments and recruited 1,000 new participants, all of whom satisfied the selection criteria from our first two experiments. The participants were randomly assigned to the five conditions (clear-2, clear-8, bb-2, bb-8, and expert). Each participant received a flat payment. We excluded data from one participant who reported technical difficulties.

We asked participants to predict apartment prices for the same set of apartments used in the first two experiments. However, in order to calculate weight of advice, we modified the experiment design so that participants were asked for two predictions for each apartment during the testing phase: an initial prediction before being shown the model’s prediction and a final prediction after being shown the model’s prediction. To ensure that participants’ initial predictions were the same across the conditions, we asked for their initial predictions for all twelve apartments before introducing them to the model or human expert and before informing them that they would be able to update their predictions. This design has the added benefit of potentially reducing the amount of anchoring on the model or expert’s predictions.

Participants were first shown detailed instructions (which intentionally did not include any information about the corresponding model or human expert), before proceeding with the experiment in two phases. In the (short) training phase, participants were shown three apartments, asked to predict each apartment’s price, and shown the apartment’s actual price. The testing phase consisted of two steps. In the first step, participants were shown another twelve apartments. The order of all twelve apartments was randomized. Participants were asked to predict the price of each apartment. In the second step, participants were introduced to the model or human expert before revisiting the twelve apartments. As in the first two experiments, the order of the first ten apartments was randomized, while the remaining two (apartments 11 and 12) always appeared last. For each apartment, participants were first reminded of their initial prediction, then shown the model or expert’s prediction, and then asked to make their final prediction of the apartment’s price.5

### 4.2 Results

H7. Trust (deviation). Contrary to our first hypothesis, and in line with the findings from our first two experiments, we found no significant difference in participants’ deviation from the model between clear-2 and bb-8 (see Figure 25).

H8. Trust (weight of advice). Weight of advice is not well defined when a participant’s initial prediction matches the model’s prediction. For each condition, we therefore calculated the mean weight of advice over all participant–apartment pairs for which the participant’s initial prediction did not match the model’s prediction.6 This calculation can be viewed as computing the mean conditioned on there being initial disagreement between the participant and the model. Contrary to our second hypothesis, and in line with the findings for the measures of trust in our first two experiments, we did not find a significant difference in participants’ weight of advice between the clear-2 and bb-8 conditions (see Figure 25).
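The conditional mean described above might be computed as in this sketch (the function name and input format are illustrative):

```python
def mean_woa(triples):
    """Mean weight of advice over (initial, model, final) triples,
    conditioned on initial disagreement with the model; pairs where
    the participant already matched the model are dropped."""
    scores = [(f - i) / (m - i) for i, m, f in triples if i != m]
    return sum(scores) / len(scores) if scores else None
```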

H9. Humans vs. machines. Contrary to our third hypothesis, we did not find a significant difference in participants’ trust, as indicated by either the deviation of their predictions from the model or expert’s prediction or by their weight of advice, between the bb-8 and expert conditions.

H10. Detection of mistakes. In contrast to our first two experiments, we did not find that participants in the clear conditions were less able to correct inaccurate predictions.

## 5 Discussion and future work

We investigated how two factors that are thought to influence model interpretability—the number of features and the model transparency—impact laypeople’s abilities to simulate a model’s predictions, gain trust in a model, and understand when a model will make mistakes. Although we found that a clear model with a small number of features was easier for participants to simulate, we found no difference in trust. We also found that participants were less able to correct inaccurate predictions when they were shown a clear model instead of a black box. These findings suggest that one should not take for granted that a “simple” or “transparent” model always leads to higher trust. However, we caution readers against jumping to the conclusion that interpretable models are not valuable. Our experiments focused on just one model, presented to one specific subpopulation, for only a subset of the scenarios in which interpretability might play an important role. Instead, we see this work as the first of many steps towards a larger goal of rigorously quantifying and measuring when and why interpretability matters.

The general experimental approach that we introduced—i.e., presenting people with models that make identical predictions but varying the presentation of these models in order to isolate and measure the impact of different factors on people’s abilities to perform well-defined tasks—could be applied in a wide range of different contexts and may lead to different conclusions in each. For example, instead of a linear regression model, one could examine decision trees or rule lists in a classification setting. Or our experiments could be repeated with participants who are domain experts, data scientists, or researchers in lieu of laypeople recruited on Amazon Mechanical Turk. Likewise, there are many other scenarios to be explored, such as debugging a poorly performing model, assessing bias in a model’s predictions, or explaining why an individual prediction was made. We hope that our work can serve as a useful template for examining the importance of interpretability in these and other contexts.

### Footnotes

1. Links to the pre-registration documents on aspredicted.org for each experiment are omitted to preserve author anonymity.
2. https://streeteasy.com/
3. In particular, for each coefficient, we found a value that was divisible by the largest possible power of ten and fell within a safe range, defined as the fitted coefficient value plus or minus a fixed tolerance.
4. We follow standard notation, where, e.g., the result of a t-test with d degrees of freedom is reported as t(d) = x, p = y, where x is the value of the test statistic and y is the corresponding p-value.
5. We initially considered an alternative design in which participants were asked to predict each apartment’s price, shown the model’s prediction, and then asked to update their own prediction before moving on to the next apartment. During pilots, it appeared that participants changed their initial predictions in response to the model. To verify this, we ran a larger version of this experiment, hypothesizing that participants’ initial predictions would deviate less from the model’s predictions in the clear-2 condition. As predicted, this was indeed the case. The amount by which participants’ initial predictions change based on the model they see could be viewed as another measure of trust.
6. We found no significant difference in the fraction of times that participants’ initial predictions matched the model’s predictions.

### References

1. Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015.
2. Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2015.
3. Berkeley J. Dietvorst, Joseph P. Simmons, and Cade Massey. Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General, 144(1):114–126, 2015.
4. Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
5. Francesca Gino and Don A. Moore. Effects of task difficulty on use of advice. Journal of Behavioral Decision Making, 20(1):21–35, 2007.
6. Alyssa Glass, Deborah L McGuinness, and Michael Wolverton. Toward establishing trust in adaptive agents. In Proceedings of the 13th International Conference on Intelligent User Interfaces (IUI), 2008.
7. Todd M Gureckis, Jay Martin, John McDonnell, Alexander S Rich, Doug Markant, Anna Coenen, David Halpern, Jessica B Hamrick, and Patricia Chan. psiTurk: An open-source framework for conducting replicable behavioral experiments online. Behavior Research Methods, 48(3):829–842, 2016.
8. Jongbin Jung, Connor Concannon, Ravi Shro, Sharad Goel, and Daniel G. Goldstein. Simple rules for complex decisions. arXiv preprint arXiv:1702.04690, 2017.
9. Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
10. Brian Y Lim, Anind K Dey, and Daniel Avrahami. Why and why not explanations improve the intelligibility of context-aware intelligent systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2009.
11. Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
12. Jennifer M. Logg. Theory of machine: When do people rely on algorithms? Harvard Business School NOM Unit Working Paper No. 17-086, 2017.
13. Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2012.
14. Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2013.
15. Scott Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (NIPS), 2017.
16. Dilek Önkal, Paul Goodwin, Mary Thomson, and Sinan Gönül. The relative influence of advice from human experts and statistical methods on forecast adjustments. Journal of Behavioral Decision Making, 22:390–409, 2009.
17. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
18. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016.
19. Berk Ustun and Cynthia Rudin. Supersparse linear integer models for optimized medical scoring systems. Machine Learning Journal, 102(3):349–391, 2016.
20. Martin Wattenberg, Fernanda Viégas, and Moritz Hardt. Attacking discrimination with smarter machine learning. Accessed at https://research.google.com/bigpicture/attacking-discrimination-in-ml/, 2016.
21. Ilan Yaniv. Receiving other people’s advice: Influence and benefit. Organizational Behavior and Human Decision Processes, 93:1–13, 2004.