Designing Interfaces to Help Stakeholders Comprehend, Navigate, and Manage Algorithmic Trade-Offs
Artificial intelligence algorithms have been applied to a wide variety of tasks, including assisting human decision making in high-stakes contexts. However, these algorithms are complex and have trade-offs, notably between prediction accuracy and fairness to population subgroups. This makes it hard for domain stakeholders to understand algorithms and deploy them in a way that respects their goals and values. We created an interactive interface to help stakeholders explore algorithms, visualize their trade-offs, and select algorithms with trade-offs consistent with their values. We evaluated our interface on the problem of predicting criminal defendants’ likelihood to re-offend through (i) a large-scale Amazon Mechanical Turk experiment, and (ii) in-depth interviews with domain experts. The interface proved effective at communicating algorithm trade-offs and supporting human decision making and also affected participants’ trust in algorithmically aided decision making. Our results have implications for the deployment of intelligent algorithms and suggest important directions for future work.
CHI’20, April 25–30, 2020, Honolulu, HI, USA. Copyright 2020, ACM licensed. DOI: https://doi.org/10.1145/3313831.XXXXXXX. ISBN: 978-1-4503-6708-0/20/04. Price: $15.00.
Human-centered computing Interactive systems and tools
On social media sites such as Facebook, algorithms are used to identify and censor trolls, fake news, terrorism, and racist and sexist ads. In Wikipedia, the largest peer-produced knowledge repository in the world, a number of automated systems and semi-automated tools are used to assess the quality of edits and determine whether they should be retained or reverted [halfaker2012bots].
These algorithms typically are produced via machine learning techniques. However, a large body of recent work has identified serious trade-offs between different algorithmic criteria. First, there is a well-documented trade-off between false positives (a prediction that a given condition exists, but it does not) and false negatives (a prediction that a given condition does not exist, but in fact it does) [klinkman1998false, dove2017ux]. This is important because both types of errors can have harmful effects both on the people directly affected by a prediction and society as a whole. For example, when predicting criminal defendants’ likelihood to re-offend, a false positive means detaining someone who will not re-offend, and a false negative means releasing someone who will re-offend, both obviously harmful. Second, other research has shown a trade-off between prediction accuracy and fairness [kleinberg2018algorithmic, agarwal2018reductions, menon2018cost, kearns2019empirical, dwork2018decoupled], and trade-offs between different fairness notions [KMR16, Chouldechova17]. Specifically, improving fairness – such as minimizing differences in false-positive rates between different racial or gender groups – can lead to a decrease in overall prediction accuracy. To complicate things even more, many desirable notions of fairness also are incompatible [KMR16]. Therefore, managing accuracy and fairness trade-offs is both important and complex.
Stakeholders must be able to understand these algorithms, their general behavior, and trade-offs to deploy them in a way consistent with their goals and values. However, to the best of our knowledge, few studies have investigated techniques for explaining these algorithms and their trade-offs to stakeholders. Our research takes on this challenge.
We created interactive interfaces to let users visualize and explore prediction algorithms and their trade-offs, and help them select specific models with trade-offs consistent with their values. We conducted a study in the context of recidivism prediction (predicting whether or not a defendant will re-offend) to evaluate the effectiveness of our interfaces. We chose this context because: (i) as noted above, this is a high-stakes decision, and (ii) the machine learning community has intensely studied trade-offs between different accuracy and fairness notions in this context [andrews2006recent, singh2014international, skeem2016risk].
We conducted two studies: (i) a large-scale Amazon Mechanical Turk experiment, and (ii) in-depth interview sessions with domain experts. We found that (i) our interfaces are effective at communicating algorithm trade-offs and significantly improve non-expert participants’ understanding of algorithmic trade-offs; (ii) our interfaces let participants navigate between a wide range of machine learning models and select a model with the most acceptable trade-offs; however, we also observed great diversity in the model selection among our participants; and (iii) our interfaces also affected participants’ trust in algorithmic decision making. Almost 50% of participants changed their self-reported trust in prediction algorithms after using our interfaces, with some increasing their trust and others decreasing it. Our results suggest important directions for future work.
2 Related Work and Research Questions
2.1 Explanations of Machine Learning Algorithms
More recently, the use of machine learning algorithms has been extended to high-stakes domains, such as decision-making assistance for school admissions [cheng2019explaining] and prediction of recidivism [corbett2017algorithmic, dressel2018accuracy]. The use of increasingly complex and non-transparent algorithms in these domains has led to a surge of interest in explainable artificial intelligence (XAI) (see [biran2017explanation] for a review). XAI aims to provide users with explanations of algorithms’ decisions in some level of detail, in order to ensure algorithmic fairness, identify potential biases or problems, and ensure that the algorithms perform as expected [gilpin2018explaining].
Researchers have made progress on transforming complex models, such as neural networks, into simple ones, such as linear models or decision trees, through approximation of the entire model [craven1999rule] or local approximation [ribeiro2016model]. Visualization techniques have been developed to explain different types of machine learning models. Examples include traditional machine learning models such as linear models [ribeiro2016should], decision trees [lakkaraju2016interpretable], and ensemble classifiers [wyner2017explaining], and more complicated deep neural networks [gilpin2018explaining].
However, critics note that this line of research is built on the intuitions of researchers and is neither grounded in a deep understanding of actual users nor evaluated with actual users or stakeholders [miller2018explanation]. Some recent work attempts to draw principles and methods from Human-Computer Interaction (HCI) to improve the usability of explanation interfaces and to perform empirical studies for evaluating the efficacy of these interfaces. For example, Cheng et al. conducted human-centered design and empirical evaluation of parallel interface prototypes to explore the effectiveness of different strategies (e.g., “black-box” versus “white-box”, and “interactive” versus “static”) to help non-expert stakeholders understand algorithmic decision making [cheng2019explaining]. They conducted an online experiment to evaluate the effectiveness of the interfaces on users’ objective and self-reported understanding of the algorithm. In this paper, we take the human-centered approach to design and evaluate explanation interfaces for machine learning algorithms, focusing on improving people’s understanding of trade-offs in machine learning algorithms.
2.2 Trade-Offs in Machine Learning Algorithms
Machine learning algorithms have inherent trade-offs between different system criteria. The trade-off between false positives and false negatives is a fundamental problem in many machine learning-based applications. It is closely related to the trade-off between precision and recall [buckland1994relationship]. In some contexts such as recommendation systems, practitioners often prioritize minimizing false positives over minimizing false negatives [dove2017ux], by focusing on avoiding recommending items that a user may not like. But for high-stakes applications like the prediction of recidivism, both types of errors could impact people’s lives.
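To make the link between the two error types and precision and recall concrete, the following minimal sketch derives both measures from confusion-matrix counts. The function name and the example counts are illustrative assumptions, not values from our study.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    # Precision drops as false positives grow; recall drops as false negatives grow.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=60, fp=20, fn=40)
print(f"precision={p:.2f}, recall={r:.2f}")
```

Lowering false positives at the expense of false negatives raises precision and lowers recall, and vice versa.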
A growing body of literature demonstrates that there are trade-offs between fairness and accuracy, and trade-offs between different fairness notions. In particular, much of the fairness-aware machine learning research aims to formulate fairness notions as algorithmic constraints and build predictive models that satisfy fairness notions, including statistical parity [dwork2018decoupled], equalized opportunity [hardt2016equality], and calibration [pleiss2017fairness]. For many of these fairness measures, prior research has identified a range of trade-offs between fairness and accuracy [agarwal2018reductions, kearns2019empirical, menon2018cost, dwork2018decoupled]. Recent studies indicate that different desirable notions of fairness are not only incompatible with each other [pleiss2017fairness], but often mutually exclusive [KMR16, Chouldechova17].
2.3 Human-Centered & Stakeholder-Driven Algorithm Design
The broad goal of this research is to contribute to building trustworthy AI systems that can tackle challenges facing society. In order to meet the design needs of building trustworthy and acceptable AI systems, prior work suggests the importance of engaging relevant stakeholders throughout the AI design process [lee2019procedural, zhu2018value]. Researchers have introduced and evaluated human-centered and stakeholder-driven approaches to algorithmic system design in the domains of on-demand donation matching [lee2019procedural], editor recruitment in the Wikipedia community [zhu2018value], and collective community design for web accessibility [salehi2018hive]. However, one important challenge of engaging stakeholders is the gap in technical literacy between stakeholders and algorithm developers [burrell2016machine]. This literacy gap means that it is often difficult to explain the logic behind algorithmic decisions to non-expert stakeholders, let alone to allow them to provide feedback to influence the algorithmic design. Our research attempts to address this challenge by developing interface design approaches that improve not only the explainability of opaque machine learning technologies but also the interpretability of the expected outcomes.
2.4 Research Questions
We now highlight four research questions we want to answer in our paper.
RQ1 How can we design interfaces to help users comprehend trade-offs in algorithmic decision making?
RQ2 How can we design interfaces to help users navigate and manage trade-offs?
RQ3 Will the interfaces influence users’ trust of the algorithmic decision-making?
RQ4 How can the stakeholders use the interfaces in the real world?
3 Design Requirement
We used recidivism prediction (predicting whether the defendants will re-offend or not) as the context for exploring the general research problem of helping people understand intelligent algorithms and their trade-offs. Moreover, we identified the following necessary design elements.
(1) Help people understand two types of algorithmic trade-offs. Based on the prior literature, we decided that our interfaces will focus on two types of trade-offs:
Trade-offs between false positives and false negatives. In the context of recidivism prediction, false positives are cases where the prediction indicates that the defendant will re-offend when the defendant does not re-offend; false negatives are cases where the prediction indicates that the defendant will not re-offend when the defendant in fact re-offends. Both types of errors are detrimental in high-stakes decision-making contexts like the prediction of recidivism: we want neither to falsely detain defendants who will not re-offend nor to release defendants who will re-offend.
Trade-offs between overall prediction errors and fairness. Specifically, we define (un)fairness as the disparity in the numbers of false positives and false negatives across different groups. In the context of recidivism prediction, prior analysis has shown that, without considering fairness, African American defendants were nearly twice as likely to be misclassified as high risk to re-offend compared to their white counterparts [MachineBias, HowWeAnalyzedCOMPAS]. However, equalizing the false positives or false negatives between the protected and unprotected groups might increase the overall errors, which is undesirable [agarwal2018reductions, kearns2019empirical].
(2) Allow trade-off navigation and model selection. It is important to allow users to compare a set of prediction models with a spectrum of trade-offs between false negatives, false positives, overall prediction errors, and fairness measures, in order to explore and navigate trade-offs. Since our tool is designed for stakeholders without a technical background, our design has to communicate the technical perspective of the model effectively to a non-technical audience.
4.1 Data and Algorithm
We used a data set from ProPublica, an independent, non-profit newsroom that produces investigative journalism in the public interest (https://www.propublica.org). The data set is known as COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), and originally contains information on 11,757 defendants including their prior criminal history, jail and prison time, and demographics (such as race, gender, and age) [dieterich2016compas]. We followed the literature to formulate the problem as binary classification, where the labels are whether the defendants commit “a new misdemeanor or felony offense within two years of the COMPAS administration date” [HowWeAnalyzedCOMPAS].
We removed defendants whose records were not complete and those who were charged only with traffic offenses or municipal ordinance violations. We focused on two protected attributes, race and gender, and constrained them to be binary: African American and White defendants, and female and male defendants. To better illustrate trade-offs, we created two balanced data sets for race and gender separately. This resulted in a data set of 3,000 defendants (1,500 White defendants and 1,500 African American defendants) and a data set of 1,600 defendants (800 male defendants and 800 female defendants). We ran logistic regression on the two data sets. The prediction accuracy for the two data sets is 0.715 and 0.721, respectively, with a random 70%-30% split into train and test data, which is consistent with the results of previous studies.
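The balancing and splitting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our preprocessing code: the function name, the record field names, the seed, and the toy records are all hypothetical.

```python
import random


def balanced_split(records, group_key, n_per_group, train_frac=0.7, seed=0):
    """Sample n_per_group records from each group, then split train/test."""
    rng = random.Random(seed)
    by_group = {}
    for rec in records:
        by_group.setdefault(rec[group_key], []).append(rec)
    sample = []
    for group, members in by_group.items():
        if len(members) < n_per_group:
            raise ValueError(f"group {group!r} has only {len(members)} records")
        # Sampling without replacement gives each group equal representation.
        sample.extend(rng.sample(members, n_per_group))
    rng.shuffle(sample)
    cut = int(round(train_frac * len(sample)))
    return sample[:cut], sample[cut:]


# Hypothetical toy records; "race" values "A" and "B" are placeholders.
toy = [{"id": i, "race": "A" if i % 3 else "B"} for i in range(90)]
train, test = balanced_split(toy, "race", n_per_group=20)
print(len(train), len(test))
```

With `n_per_group=1500` on the race attribute and `train_frac=0.7`, the same routine would yield a 3,000-record balanced sample split 70%-30%.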
4.2 Capture Trade-Offs in Recidivism Prediction
Below we will discuss how we generated a set of models, with different trade-offs across a variety of system criteria.
4.2.1 Trade-Offs between False Positives and False Negatives
To capture trade-offs between false positives and false negatives, we varied the threshold. In order to map a probability to a binary category, we need to define a classification threshold: a value above that threshold indicates “re-offend”; a value below indicates “not re-offend”. Figure 1 shows the relationship between false positives and false negatives when we varied the threshold in our data.
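As a minimal sketch of this thresholding step, the code below counts false positives and false negatives at several thresholds. The probabilities and labels are hypothetical toy values, and treating a score equal to the threshold as “re-offend” is an assumption.

```python
def error_counts(probs, labels, threshold):
    """Count (false positives, false negatives) at a classification threshold."""
    # Assumption: a score exactly at the threshold is classified as "re-offend".
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return fp, fn


# Hypothetical predicted probabilities and true labels (1 = re-offended).
probs = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

for t in (0.2, 0.5, 0.8):
    fp, fn = error_counts(probs, labels, t)
    print(f"threshold={t}: false positives={fp}, false negatives={fn}")
```

Raising the threshold trades false positives for false negatives, which is the relationship the control bar in our interface exposes.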
4.2.2 Trade-Offs between Overall Errors and Disparity (Unfairness).
We followed the prevalent statistical fairness approach in the fair machine learning literature. For this approach, we fixed a small number of groups specified by sensitive attributes, such as race and gender, and then asked for approximate equality of certain statistics of the predictor, such as false-positive rates and false-negative rates, among these groups. In our study, we considered two groups, specified by either race or gender, and formulated our (un)fairness measure as the disparity in the numbers of false positives and false negatives between the two groups (henceforth, disparity). Note that the false-positive (or false-negative) rate is defined as the ratio between the number of false positives (or negatives) and the number of negatives (or positives). We chose to use counts instead of ratios since counts are easier to explain to non-expert stakeholders.
To capture trade-offs between overall accuracy and disparity, we adapted the algorithms from [agarwal2018reductions, KearnsNRW18] to generate the set of Pareto-optimal predictive models, for which it is impossible to improve either criterion without worsening the other. Figure 2 shows a Pareto curve of prediction errors and disparity for African American and White defendants. Models on the right side of Figure 2 prioritize “minimizing overall errors”, while models on the left side prioritize “minimizing disparity”. We can see that reducing the disparity between the two racial groups from 158 to 21 increases the overall error count from 1253 to 1651. The technical details on how we generated the Pareto curves are included in the supplementary materials.
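One plausible way to turn this count-based description into a number is sketched below. The specific form, the gap in false-positive counts plus the gap in false-negative counts, is an assumption made for illustration and may differ from the exact measure used in our study; the group counts are hypothetical.

```python
def disparity(counts_a, counts_b):
    """Assumed disparity: |FP gap| + |FN gap| between two groups (counts, not rates)."""
    fp_gap = abs(counts_a["fp"] - counts_b["fp"])
    fn_gap = abs(counts_a["fn"] - counts_b["fn"])
    return fp_gap + fn_gap


# Hypothetical per-group error counts, for illustration only.
group_a = {"fp": 120, "fn": 80}
group_b = {"fp": 70, "fn": 110}
print(disparity(group_a, group_b))
```

Because the measure uses counts, it is directly comparable across groups only when the groups are the same size, as in the balanced data sets described earlier.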
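The notion of Pareto optimality used here can be illustrated with a simple dominance filter. This sketch only selects non-dominated models from a given candidate set; it does not reproduce the model-generation algorithms adapted from prior work. The two endpoint models use the error and disparity values quoted above, and the two intermediate models are hypothetical.

```python
def pareto_front(models):
    """Keep models not dominated on (errors, disparity); lower is better for both."""
    front = []
    for m in models:
        dominated = any(
            o["errors"] <= m["errors"]
            and o["disparity"] <= m["disparity"]
            and (o["errors"] < m["errors"] or o["disparity"] < m["disparity"])
            for o in models
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=lambda m: m["disparity"])


candidates = [
    {"errors": 1253, "disparity": 158},  # endpoint quoted in the text
    {"errors": 1651, "disparity": 21},   # endpoint quoted in the text
    {"errors": 1400, "disparity": 80},   # hypothetical intermediate model
    {"errors": 1500, "disparity": 120},  # hypothetical; dominated by the model above
]
for m in pareto_front(candidates):
    print(m)
```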
4.3 Creating User Interfaces to Explain Trade-Offs
We followed the human-centered design process for interface design [holtzblatt2004rapid]. Specifically, our design process involved a number of iterations. We started with a brainstorming session to ideate different design directions and features based on our design requirements. Next we synthesized and clustered the ideas. We then incorporated the ideas into the creation of low-fidelity prototypes, and conducted informal qualitative analysis and pilot studies to evaluate and improve the prototypes. Each step in this process provided rich insights from users’ perceptions and helped to shape our final design and implementation.
4.4 Interface Designs
Our final interface design consists of a two-part layout: (i) a control panel that allows users to select models, and (ii) a result panel that shows the report of the selected model. We explored two design options in the result panel: a visualization view (see Figure 3) and a text view (see Figure 4). Here we describe them in detail.
4.4.1 Control Panel
Since we had two types of trade-offs to communicate to users, we designed two separate controls.
For trade-offs between false positives and false negatives (see the upper section of the control panel in Figure 3), we designed a control bar, which allows users to adjust the threshold.
For trade-offs between errors and disparity (see the lower section of the control panel in Figure 3), once a protected attribute (gender or race) is selected, we presented the Pareto curve for the given threshold. We allowed users to select any particular model shown on the Pareto curves.
4.4.2 Result Panel
Prior work has compared the use of visualizations and texts for communicating various statistical aspects of algorithms [gleicher2011visual]. Research shows that visualizations are more effective at grabbing user attention, but studies also reported that some users may prefer text over visual content. Therefore, we decided to design two different views in the result panel.
Visualization View. Inspired by the idea of confusion matrices in machine learning [talbot2009ensemblematrix], we created four separate quadrants, with each dot representing one classified defendant in one of the four prediction categories (true positive, false positive, false negative, and true negative) (see the result panel in Figure 3). We also displayed the total number of defendants in each category. To distinguish correct and incorrect predictions, we applied two colors, blue and red. When a protected attribute (race or gender) is selected, the dots in each quadrant are split into two different colors representing the two groups under the selected protected attribute, such as African American and White defendants. As users move the control bars, the interface displays the changes in prediction outcomes accordingly.
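The data behind such a view amounts to a tally of defendants per quadrant and group. The sketch below shows one way to compute it; the function name and the toy inputs are hypothetical, and this is not the code of our interface.

```python
from collections import Counter


def quadrant_counts(preds, labels, groups):
    """Tally defendants by (quadrant, group); 1 means (predicted to) re-offend."""
    counts = Counter()
    for pred, label, group in zip(preds, labels, groups):
        if pred == 1:
            quadrant = "TP" if label == 1 else "FP"
        else:
            quadrant = "FN" if label == 1 else "TN"
        counts[(quadrant, group)] += 1
    return counts


# Hypothetical predictions, outcomes, and group membership.
preds = [1, 1, 0, 0, 1, 0]
labels = [1, 0, 1, 0, 0, 0]
groups = ["A", "B", "A", "B", "A", "B"]

for key, n in sorted(quadrant_counts(preds, labels, groups).items()):
    print(key, n)
```

Each entry corresponds to the number of dots of one color in one quadrant; recomputing the tally whenever a control changes is enough to refresh the display.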
In addition, the interface provides the summary of the key metrics (e.g., prediction errors and disparity) on the top of the result panel. Explanations about the metrics will show up when users hover over the question marks next to the metrics.
Text View. The text view displays the same set of information as the visualization view does, but in text. It describes the four prediction categories in plain text and the number of defendants in each category. We followed a natural logic by grouping the prediction outcomes by incorrect or correct predictions (see the result panel in Figure 4). When a protected attribute (race or gender) is selected, we show the number for each group. Information about prediction errors and disparity is also described in text.
5 Evaluation Overview
We evaluated our interface on helping people comprehend, navigate, and manage trade-offs through (i) a large-scale Amazon Mechanical Turk experiment, and (ii) six in-depth interviews with domain experts.
The goal of the Amazon Mechanical Turk experiment is to answer the first three research questions: RQ1 - can our interfaces help non-expert participants comprehend trade-offs in algorithmic decision making? RQ2 - can our interfaces help participants navigate trade-offs and select models? RQ3 - how do the interfaces influence participants’ perceived trust of the algorithmic decision-making system?
The goal of the in-depth interviews is to answer the last question: RQ4 - how will stakeholders involved in the domain problem (judges, lawyers, defendants, and the public) use the interfaces in real-world settings?
6 Evaluation Study 1: Amazon MTurk Experiment
6.1 Experimental Design
We conducted a randomized between-subjects experiment with three conditions: visualization view condition where participants used our visualization view interface (Figure 3), text view where participants used our text view interface (Figure 4), and baseline condition where participants were not provided any interface and proceeded directly to the questionnaire. We included the baseline condition to assess people’s baseline understandings of the trade-offs in machine learning. Participants were randomly assigned to one of the three conditions, and they were allowed to spend as much time as needed to finish our evaluation questionnaires with or without the help of our interface.
6.2 Participant Recruitment
We recruited 301 participants from Amazon Mechanical Turk (AMT) in August 2019 for our study. To ensure the quality of survey responses, we recruited participants who had finished more than 100 tasks (HITs) with a task (HIT) approval rate of 95% or above. We also ensured that the participants were aged 18 or above and resided in the U.S., so that they had a higher chance of having background knowledge about the U.S. judicial system. The average time for completing the survey was 28.6 minutes. Each participant received a base payment of $4 and an additional bonus (up to $1.2) based on the number of correct answers they gave to the objective comprehension questions (each correct answer contributed an extra bonus payment of $0.2). To encourage participants to answer honestly rather than guess at random, we provided an “I don’t know” option for each question with a $0.05 bonus payment. On average, each participant received a payment equivalent to $8.70 per hour, which is higher than the minimum wage in the U.S. ($7.25 per hour at the time of writing).
In total, 301 participants finished the tasks on AMT, but 15 of them failed the attention check we set, so we did not include their data. Our analysis included 107 participants in the baseline condition, 93 participants in the visualization view condition, and 86 participants in the text view condition. The demographic information, including age, gender, race, and education level, was comparable across the three conditions (differences not significant).
6.3 Study Procedure
Participants were informed that the purpose of the study was to help users understand intelligent algorithms that support people in making important decisions. We also designed quiz questions to make sure participants understood the study context, such as the definition of recidivism and prediction algorithms. Participants had to answer the quiz questions correctly before they could proceed. We included the description of study context and quiz questions in the supplementary materials.
Participants were given as much time as they wanted to explore the interface and complete a set of questions (details will be described below in “Evaluation Metrics” section). We also inserted an instructed-response question for the attention check, which directed respondents to choose a specific answer in order to detect careless responses [meade2012identifying, cheng2019explaining].
6.4 Evaluation Metrics
We designed a set of questions to assess participants’ objective comprehension, subjective understanding, trust, and model selection, along with participants’ demographic information and technical literacy. We include all the questions we use in the supplementary materials.
Objective Comprehension. Participants answered six multiple-choice questions with objectively correct answers to evaluate their understanding of the algorithms: four of them focused on the understanding of basic concepts in machine learning such as false positives and false negatives, while two of them assessed people’s understanding of algorithmic trade-offs. For example, we asked participants “Suppose you adjust the algorithm to reduce one type of error cases - false negatives (the number of defendants it predicts not to re-offend but actually re-offend). How will this affect the other type of error cases - false positives (the number of defendants it predicts to re-offend but actually do not re-offend)?” with the answer options “The false positives will increase.”, “The false positives will not change.”, “The false positives will decrease.”, and “I don’t know.”
Subjective Evaluation. Participants responded to a Likert scale to self-report how well they understood the algorithmic trade-offs.
Model Selection. In the two interface conditions, participants were instructed to adjust the interface controls to find a model consistent with their values, and were asked questions about whether and why the selected model was most consistent with their values.
Trust. We also measured participants’ perceived trust in the algorithm’s predictions of recidivism. We adapted the questions from prior literature that measured trust in human-machine systems on a 7-point Likert scale [lee1992trust, corritore2003line, cheng2019explaining]. Examples include “I trust the prediction results of defendants’ recidivism produced by the algorithm”.
We asked the same set of questions about trust twice. For participants in the two interface conditions, we asked these questions about trust before and after they explored our interfaces and answered the objective comprehension questions; for participants in the baseline condition, we asked the questions about trust before and after they answered the objective comprehension questions. At the end, we asked participants to explain in an open-ended question why they trusted or distrusted the algorithms.
In addition, we asked questions about participants’ technical literacy and their demographic information. We also asked participants about their legal literacy, defined as their familiarity with the U.S. legal system. However, legal literacy could be linearly predicted from the other variables to a substantial degree (VIF of legal literacy = 11.6), so we did not include it in the analysis.
Technical Literacy. We used a 7-point Likert scale question to assess participants’ familiarity with AI-powered systems such as email spam filters and product recommender systems. This question was adapted from [wilkinson2009measurement, cheng2019explaining].
Demographic Information. We asked participants for their basic demographic information including race, age range, gender, and education level. We operationalized education level as whether or not the participants had completed a bachelor’s degree, and operationalized age as whether or not the participant was above 34 (the median number).
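To interpret the variance inflation factor (VIF) reported for legal literacy, recall the standard definition VIF = 1 / (1 - R^2), where R^2 comes from regressing the variable in question on the remaining predictors. The small sketch below inverts this relation; the function name is illustrative.

```python
def r_squared_from_vif(vif):
    """Invert VIF = 1 / (1 - R^2) to recover R^2 of the auxiliary regression."""
    if vif < 1:
        raise ValueError("VIF is at least 1 by definition")
    return 1.0 - 1.0 / vif


# VIF value reported for legal literacy.
print(round(r_squared_from_vif(11.6), 3))
```

A VIF of 11.6 therefore corresponds to the other variables explaining roughly 91% of the variance in legal literacy.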
Our main findings include the following:
Both the visualization view and text view interfaces improved participants’ comprehension of trade-offs in algorithmic decision-making. We also found that our interfaces had differential impacts on participants. For example, participants with higher education levels benefited more from the visualization view than those with lower education levels, and participants who were above 34 years old benefited more from using the text view.
Our interfaces allowed participants to select models that most represent their values. We also observed great diversity in the models they selected: some tended to balance trade-offs, while others concentrated only on one aim.
Our interfaces swayed 47.4% of participants’ trust in the algorithm: 22.3% trusted the algorithm more while 25.1% trusted the algorithm less.
Next we will describe the details of the analyses that lead to these conclusions.
6.5.1 Interfaces Improved Objective Comprehension of Trade-Offs
We created a set of linear regression models (see Table 1) with Objective Comprehension of the algorithmic concepts and trade-offs and Subjective Evaluation as dependent variables, and with the experimental conditions and the interaction terms between the experimental conditions and participants’ personal characteristics (age, gender, race, education level, and technical literacy) as independent variables. Models 1, 3, and 5 in Table 1 examined the differences between the three experimental conditions (IsVizView vs. baseline, and IsTextView vs. baseline). Models 2, 4, and 6 examined the interpersonal differences in using the interfaces.
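Table 1: Linear regression results, reported as coefficient (Coef.) and standard error (SE) for each model. Dependent variables: objective comprehension of basic concepts (Models 1 and 2), objective comprehension of trade-offs (Models 3 and 4), and subjective evaluation (Models 5 and 6). † p < 0.1, * p < 0.05, ** p < 0.01.

| Variable | Model 1 Coef. | Model 1 SE | Model 2 Coef. | Model 2 SE | Model 3 Coef. | Model 3 SE | Model 4 Coef. | Model 4 SE | Model 5 Coef. | Model 5 SE | Model 6 Coef. | Model 6 SE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | 0.632 ** | 0.031 | 0.632 ** | 0.031 | 0.367 ** | 0.036 | 0.367 ** | 0.035 | 5.264 ** | 0.129 | 5.264 ** | 0.126 |
| IsVizView | 0.077 † | 0.045 | -0.052 | 0.153 | 0.191 ** | 0.052 | -0.095 | 0.176 | 0.209 | 0.189 | -0.236 | 0.731 |
| IsTextView | 0.115 * | 0.046 | 0.101 | 0.148 | 0.289 ** | 0.053 | 0.148 | 0.170 | 0.236 | 0.193 | -1.255 | 0.786 |
| IsOlder * IsVizView | | | 0.093 | 0.067 | | | 0.134 † | 0.076 | | | -0.246 * | 0.123 |
| IsMale * IsVizView | | | -0.006 | 0.067 | | | 0.111 | 0.077 | | | -0.071 | 0.277 |
| IsWhite * IsVizView | | | 0.038 | 0.082 | | | 0.140 | 0.094 | | | 0.396 | 0.336 |
| EducationLevel * IsVizView | | | 0.092 | 0.090 | | | 0.190 † | 0.103 | | | -0.285 | 0.369 |
| AISystemFamiliarity * IsVizView | | | 0.009 | 0.024 | | | 0.004 | 0.027 | | | 0.250 ** | 0.098 |
| IsOlder * IsTextView | | | 0.033 | 0.072 | | | 0.182 * | 0.082 | | | -0.041 | 0.133 |
| IsMale * IsTextView | | | -0.049 | 0.071 | | | 0.069 | 0.082 | | | 0.276 | 0.294 |
| IsWhite * IsTextView | | | -0.0244 | 0.079 | | | -0.057 | 0.091 | | | -0.030 | 0.327 |
| EducationLevel * IsTextView | | | -0.040 | 0.110 | | | 0.023 | 0.127 | | | -0.204 | 0.452 |
| AISystemFamiliarity * IsTextView | | | 0.003 | 0.024 | | | 0.012 | 0.028 | | | 0.319 ** | 0.099 |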
We found that participants, including those without interface support in the baseline condition, had some basic understanding of machine learning concepts such as false positives and false negatives. According to Model 1 in Table 1, participants in the baseline condition on average answered 63.2% of the objective comprehension questions on basic concepts correctly. The interfaces further improved participants’ comprehension of these concepts. Participants in the visualization view answered 70.9% of the questions correctly; the difference between the visualization view and the baseline condition is marginally significant (coef. = 0.077, p < 0.1). Participants in the text view answered 74.7% of the questions correctly; the difference between the text view and the baseline condition is significant (coef. = 0.115, p < 0.05). A t-test did not reveal a statistically significant difference between participants’ performance in the two interface conditions.
Our results suggest that our interfaces significantly improved participants’ comprehension of algorithmic trade-offs. According to Model 3, participants in the baseline condition on average answered 36.7% of the objective comprehension questions on algorithmic trade-offs correctly. Participants in the visualization view answered 55.8% of the questions correctly; the difference between the visualization view and the baseline is significant (coef. = 0.191, p < 0.01). Participants in the text view answered 65.6% of the questions correctly; the difference between the text view and the baseline is significant (coef. = 0.289, p < 0.01). A t-test between the two interface conditions shows marginal significance (p < 0.1).
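Because the condition indicators are dummy-coded against the baseline, the percentages quoted above follow from the Table 1 coefficients: the intercept is the baseline mean, and each interface condition's mean is the intercept plus that condition's coefficient. The short Python sketch below reproduces this arithmetic for Models 1 and 3. The function name `condition_means` and the dictionary keys are illustrative choices, not part of the study's analysis code.

```python
def condition_means(intercept, coefs):
    """Return the mean outcome per condition under dummy coding.

    The baseline mean equals the intercept; every other condition's mean
    is the intercept plus that condition's coefficient.
    """
    means = {"baseline": intercept}
    for condition, coef in coefs.items():
        means[condition] = round(intercept + coef, 3)
    return means


# Coefficients copied from Table 1 (Model 1: basic concepts, Model 3: trade-offs).
model_1 = condition_means(0.632, {"viz_view": 0.077, "text_view": 0.115})
model_3 = condition_means(0.367, {"viz_view": 0.191, "text_view": 0.289})

print(model_1)
print(model_3)
```

The same reading does not carry over directly to Models 2, 4, and 6, where the condition coefficients are conditional on the interaction terms.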
Interestingly, according to Model 4, we found that the visualization view tended to provide more gains to participants with high self-reported education levels than to participants with lower education levels (coef. = 0.195, p < 0.1), while the text view provided more benefits to participants who were above 34 than to participants who were 34 years old or younger (coef. = 0.095, p < 0.05).
In Models 5 and 6, we examined the impact of using the interfaces on participants’ subjective evaluation of their understanding of algorithmic trade-offs. Model 5 shows that there are no significant differences among participants in different conditions. Model 6 suggests that participants with higher familiarity with AI systems self-reported that they understood the algorithms better with both the visualization view and the text view (coef. = 0.250, p < 0.01 and coef. = 0.319, p < 0.01). Participants who were above 34 showed smaller gains in self-reported understanding of algorithmic trade-offs with the visualization view than participants who were 34 years old or younger (coef. = -0.246, p < 0.05).
6.5.2 Interfaces Enabled Participants to Select Models
Both interfaces (visualization view and text view) allowed participants to select models on a spectrum of trade-offs. We explicitly asked participants whether they thought the interfaces helped them identify models that represent their values. Participants using both types of interfaces reported high ratings. The average ratings for the visualization view and the text view are 5.54 and 5.64 on a 7-point Likert scale, respectively (the difference is not statistically significant).
Figures 5 and 6 show the distribution of participants’ model selections with respect to trade-offs between false positives and false negatives (Figure 5) and with respect to trade-offs between overall prediction errors and disparity (Figure 6).
Figure 5 suggests that, with respect to trade-offs between false positives and false negatives, many participants tended to balance the two types of errors in their model selections. 29.4% of participants selected the model in the middle, which minimized the overall errors. Furthermore, among those who did not select a “balanced” model, more participants selected models on the “reducing false negatives” side than models on the “reducing false positives” side.
Figure 6 shows a different pattern. With respect to trade-offs between overall errors and disparity, many participants showed a single aim: they either selected a model that minimizes the disparity (21.8%), or selected a model that minimizes the overall errors (16.8%). Among those who tried to balance the two goals, more participants tended to prefer “reducing disparity” over “reducing overall errors”.
In sum, the results showed great diversity in people’s model selection. This suggests intriguing opportunities and future work on how to aggregate different individuals’ opinions on “what the best model is”, which we will discuss later.
6.5.3 Interfaces Swayed Participants’ Trust of the Algorithm
We also measured the change in participants’ perceived trust in the algorithm’s predictions before and after they explored the algorithmic trade-offs, in all three conditions. We observed that 47.4% of participants in the two interface conditions changed their perceived trust, with a roughly even split between those who became more trusting and those who became less trusting. In the baseline condition, participants were not provided with interfaces but proceeded directly to the questionnaires. Simply answering the questions about algorithmic trade-offs changed the perceived trust of 30.0% of participants in the baseline condition (the difference between the interface conditions and the baseline condition is significant, p < 0.01).
Participants explained why they changed their perceived trust in the algorithm’s predictions in an open-ended follow-up question. Our analysis of their responses revealed some insights about this swing.
Many participants gained trust because our interfaces educated them about the algorithm itself and the inherent trade-offs in the algorithms. As one participant explicitly expressed:
“Using the tool helped me understand the algorithm and the results of changing the aggressiveness and disparity parameters…”
Our interface also made it easier to see the prediction results. As one participant commented, “the visual aid makes it much easier to see how many false predictions the model can make”.
On the other hand, having the ability to tune the algorithms and see different prediction outcomes led some participants to doubt the reliability of the algorithm, which reduced their trust. As one participant said, “it seems that the parameters can be adjusted to create almost any type of results desired by researchers”.
Participants with concerns made some further suggestions for improvement. For example, one participant wished to know more about the data used by the algorithm, remaining “skeptical without seeing what data it is using and what the weighting is”.
Another participant mentioned the importance of having human input in the decision making process.
“I think using an algorithm in a combination of input from a judge or court official of some sort would be best. I would not trust the algorithm by itself. ”
Another interesting observation is that participants had very different expectations about algorithm accuracy, and these expectations affected their perceived trust. For example, one participant’s trust increased because “an accuracy of around 70% is fairly good” from their perspective. In contrast, another participant found the algorithm “less trustful as this is a large error rate”.
Overall, our evaluation demonstrated that our interfaces were effective at helping people comprehend and navigate trade-offs. In addition, our results suggested that people have varying levels of understanding of algorithms, heterogeneous preferences over fairness-accuracy trade-offs, and diverse perspectives on the trustworthiness of AI systems. This opens up new challenges and opportunities in building holistic solutions that take into account the human heterogeneity when interacting with decision-support systems.
7 Evaluation 2: Expert Study
To understand how our interfaces could help stakeholders in real-world settings (e.g., judges, lawyers, defendants, and the general public), we recruited domain experts on recidivism prediction and conducted in-depth interviews with them.
7.1 Recruitment and Study Protocol
We recruited 6 researchers who have studied the use of algorithms in the criminal justice system. We began our recruitment at a conference on fairness in machine learning and used a snowball sampling technique to identify more participants. We conducted semi-structured interviews, each lasting 30 minutes on average. Because our participants were remotely located, we used zoom.us and asked participants to share their screens during the interview. This allowed us to observe how participants interacted with our interface in real time and to ask follow-up questions.
We explained the context and goals of the study to participants, asked for consent to record, and then gave participants time to explore the interface. After some exploration and clarification questions, participants were asked to think aloud and complete tasks such as describing the trade-offs indicated by the interface and identifying a model with a specific desired property. Participants were given both interfaces (visualization view and text view). In the interviews, we focused on understanding how our expert participants envisioned the use of our interfaces in real-world environments.
Participation was voluntary and uncompensated. Participants were 2 females and 4 males based in the U.S. Our participants had diverse backgrounds, including computer science, statistics, criminal law, and justice.
7.2 Analysis and Results
Overall, all participants quickly learned how to use the interface, accurately identified the trade-offs, and completed the small tasks without difficulty.
We applied a theme-based coding approach [hsieh2005three, lune2016qualitative] to analyze the interview transcripts and organized our findings into the following two topics: stakeholders’ vision of our interface in practice and suggested features for the interface.
7.2.1 Roles of the Interfaces in Practice
Our expert participants believed that our interface is a great tool for revealing trade-offs and for encouraging real stakeholders to think about them.
“I think you’re onto something in that, it’s useful to have.. most people would say ‘well, you know, here’s option A and here’s option B, and see? They have these different trade-offs and consequences.’ ” (P1)
However, our interface cannot help with the final decision making. As P1 said, “the interface can make that trade-offs clear, but it cannot help with the normative questions”.
Ultimately, it is the stakeholders’ job and responsibility to make decisions and identify the acceptable trade-offs. As P2 said, in practice, they (statisticians) will set the threshold based on stakeholders’ requirements.
Some expert participants suggested that our interface can be useful at the final stage of model selection in the development of the algorithmic decision-making system:
“…it would be like, people had already decided that one… the final thing of this would be to pick a threshold, or to access its predictive accuracy is suitable for deployment at all…” (P3)
In addition, some experts believed this is a good tool for educating the public. P3 commented that they could less readily imagine it being used by “policy makers sitting in a room” than by members of the general public interested in learning about this process, for whom it could serve as “an instructive tool”.
7.2.2 Suggested Features for the Interface
Some participants wanted to see a more flexible interface with more possibilities. For example, P2 suggested including more system criteria or metrics in the interface, in order to provide more dimensions along which stakeholders can negotiate model trade-offs and make decisions:
“…you have actually five more comparisons you need to make when you address fairness, not one, and depending on the stakeholders, they may think one of those outcomes is more important than the other.”
The five comparisons he referred to were other fairness definitions in addition to the statistical parity notion we used. Those fairness definitions include overall accuracy equality, conditional procedure accuracy equality, conditional use accuracy equality, treatment equality, and total fairness (see details in [berk2018fairness]). P2 also suggested introducing another parameter that adjusts the weights placed on the two types of prediction errors in the cost function (e.g., a cost ratio between false positives and false negatives).
In addition to introducing new system criteria, P6 suggested allowing participants to play with the features (e.g., input factors) of the prediction algorithm and see different outcomes.
At the same time, participants also recognized that adding more features would inevitably introduce complexity and cognitive overload. As P5 said, “the more you increase the complexity of the tool, the less intuitive it would have become”.
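As a rough illustration of the two quantities discussed here, the Python sketch below scores a set of binary predictions with a cost that weights false positives against false negatives by a cost ratio, and computes a statistical parity gap between groups. The toy labels, the function names, and the choice to measure the gap as the difference between the highest and lowest positive-prediction rates are assumptions made for illustration; they do not describe how the interface or its underlying models are implemented.

```python
def confusion_counts(y_true, y_pred):
    """Count false positives and false negatives for binary labels (0/1)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn


def weighted_cost(y_true, y_pred, fp_to_fn_cost_ratio=1.0):
    """Total cost when one false positive costs `fp_to_fn_cost_ratio`
    times as much as one false negative (hypothetical parameter name)."""
    fp, fn = confusion_counts(y_true, y_pred)
    return fp_to_fn_cost_ratio * fp + fn


def statistical_parity_gap(y_pred, groups):
    """Difference between the highest and lowest positive-prediction
    rates across groups (0 means equal rates)."""
    rates = []
    for g in sorted(set(groups)):
        preds = [p for p, member in zip(y_pred, groups) if member == g]
        rates.append(sum(preds) / len(preds))
    return max(rates) - min(rates)


# Toy data, invented purely for illustration.
y_true = [1, 0, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(weighted_cost(y_true, y_pred))        # equal error costs
print(weighted_cost(y_true, y_pred, 2.0))   # false positives cost twice as much
print(statistical_parity_gap(y_pred, groups))
```

Raising the cost ratio penalizes false positives more heavily, so a model selected to minimize this cost would shift toward fewer false positives at the expense of more false negatives.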
Both the visualization view and the text view significantly improved participants’ comprehension of trade-offs between different system criteria in machine learning-based algorithms. However, our results suggested that the design of the interfaces might create differential benefits. For example, participants with at least a bachelor’s degree tended to benefit more from using visualization view than participants without a bachelor’s degree; participants who were above 34 tended to benefit more from using text view. Future work is needed to identify reasons behind these differential impacts. For example, one speculation is that our visualization view might be open to interpretation while the text view is more direct. In future work, we want to adopt the principles from universal design [rose2000universal]; specifically, rather than creating “one-size-fits-all” solutions, we will develop alternative designs for people with different educational levels, age, backgrounds, and needs.
Our study showed that people have heterogeneous preferences over fairness-accuracy related trade-offs in their model selections. While researchers have developed sophisticated methods for generating predictive models with a wide range of fairness-accuracy trade-offs [agarwal2018reductions, KearnsNRW18], few studies have investigated how to help stakeholders with different preferences negotiate and select a final model. One promising technical approach is to draw techniques from social choice theory and to develop mechanisms that elicit preferences from individual stakeholders and select models based on scoring rules (e.g. the Borda count [borda, Lee2018WeBuildAI]). An alternative approach is to develop social mechanisms and user interfaces to facilitate discussions among groups of stakeholders, enabling them to reason about different fairness and accuracy measures, express priorities and acceptable trade-offs, and negotiate with each other and find appropriate models.
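To make the scoring-rule idea concrete, the following Python sketch applies a standard Borda count to stakeholders' rankings of candidate models. It is a minimal sketch under stated assumptions: the model labels and the three example rankings are hypothetical, every stakeholder is assumed to rank all candidates, and ties are simply returned as a list rather than broken.

```python
from collections import defaultdict


def borda_scores(rankings):
    """Borda count: in a ranking of n candidates, the top choice earns
    n - 1 points, the next n - 2, and so on down to 0 for the last."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return dict(scores)


def borda_winners(rankings):
    """Return all candidates tied for the highest Borda score."""
    scores = borda_scores(rankings)
    best = max(scores.values())
    return sorted(c for c, s in scores.items() if s == best)


# Hypothetical rankings (most to least preferred) from three stakeholders.
rankings = [
    ["low_error", "balanced", "low_disparity"],
    ["low_disparity", "balanced", "low_error"],
    ["balanced", "low_disparity", "low_error"],
]

scores = borda_scores(rankings)
winners = borda_winners(rankings)
print(scores)
print(winners)
```

In this toy example the compromise model wins even though only one stakeholder ranks it first, which illustrates why positional scoring rules are attractive when preferences are as polarized as those observed in Figure 6.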
Our results also showed that our interfaces swayed roughly half of the participants’ trust in the algorithmic systems. According to Simmel [simmel1950sociology], there is no need to trust if people have total knowledge of the other party (algorithms in our context). Trust is also not a rational choice if people do not have any knowledge of the other party. Trust exists when people have some knowledge about the other party [simmel1950sociology]. One way to interpret our finding is that the knowledge our tool offers helped people make informed decisions on whether they should trust the algorithms, which is critical in high-stake contexts. We believe this lays the foundation for building more trustworthy algorithmic decision-making systems.
9 Limitations and Future Work
One limitation is the choice of conducting experiments on Amazon Mechanical Turk (AMT) and interviewing researchers who work on the related domains. While running the large-scale experiments on AMT allowed us to draw causal conclusions and interviewing researchers allowed us to envision the use of tools, these choices limited our ability to observe how real decision-makers and people who are affected by the decisions actually interact with the explanation tools. In future work, we will organize interviews and workshops with real stakeholders such as local judges, lawyers, and defendants to observe how they actually use explanation interfaces in real world settings and how they negotiate the trade-offs, which could complement our current findings.
We developed and evaluated our interfaces in the context of predicting recidivism. Future work is needed to verify the design and replicate our findings in other decision-making contexts. As mentioned in the introduction, algorithms have been used in many high-stake decision-making contexts, such as child protection, resume screening, student admission, content moderation on social media, etc. In future work, one direction we are pursuing is to create an “authoring tool” to allow stakeholders (who may not have programming and algorithm development experience) to create visualizations of trade-offs between different accuracy and fairness measures for the algorithmic systems that they are interested in.
Lastly, our current interfaces focus on providing a global view on the algorithm’s predictions on the entire population. While it is aligned with many statistical fairness notions, the interface does not capture notions of fairness that are concerned with algorithms’ decisions on an individual level. Two prominent examples are the notion of Lipschitz fairness [DworkHPRZ12] that requires that two similar individuals should be treated similarly by the algorithm, and the notion of meritocratic fairness [JosephKMR16, KearnsRW17] that requires that more qualified individuals should have higher chances to be selected by the algorithm. In future work, we will develop interfaces that can explain and communicate these individual fairness criteria, and provide a more microscopic view on algorithmic decision-making processes.
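As a small illustration of what an individual-level check could look like, the Python sketch below tests the Lipschitz condition on a handful of individuals: for each pair, the difference between their predicted probabilities of a positive outcome must not exceed a task-specific distance between them. For a binary outcome, the total variation distance between two individuals' outcome distributions equals the absolute difference of these probabilities. The one-dimensional feature, the distance function, and the scores are hypothetical, and choosing a meaningful similarity metric is itself a hard problem that this sketch does not address.

```python
from itertools import combinations


def lipschitz_violations(scores, distance):
    """Return index pairs (i, j) whose score difference exceeds the
    distance between the two individuals, i.e. pairs that violate
    |scores[i] - scores[j]| <= distance(i, j)."""
    violations = []
    for i, j in combinations(range(len(scores)), 2):
        if abs(scores[i] - scores[j]) > distance(i, j):
            violations.append((i, j))
    return violations


# Hypothetical one-dimensional feature per individual and predicted scores.
features = [0.10, 0.15, 0.80]
scores = [0.30, 0.60, 0.90]


def feature_distance(i, j):
    """Assumed similarity metric: absolute difference of the feature."""
    return abs(features[i] - features[j])


violations = lipschitz_violations(scores, feature_distance)
print(violations)
```

Here individuals 0 and 1 are nearly identical under the assumed metric yet receive very different scores, so that pair is flagged; an interface could surface such pairs to complement the population-level view.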
In this study, we developed interactive interfaces to explain trade-offs in the machine learning algorithm that predicts criminal defendants’ likelihoods to re-offend. We evaluated the interfaces through a large-scale online experiment and in-depth interviews with domain experts. Our results suggest our interfaces are promising in helping non-expert stakeholders comprehend, navigate, and manage trade-offs.