Leveraging Rationales to Improve Human Task Performance

Abstract.

Machine learning (ML) systems across many application areas are increasingly demonstrating performance that is beyond that of humans. In response to the proliferation of such models, the field of Explainable AI (XAI) has sought to develop techniques that enhance the transparency and interpretability of machine learning methods. In this work, we consider a question not previously explored within the XAI and ML communities: Given a computational system whose performance exceeds that of its human user, can explainable AI capabilities be leveraged to improve the performance of the human? We study this question in the context of the game of Chess, for which computational game engines that surpass the performance of the average player are widely available. We introduce the Rationale-Generating Algorithm, an automated technique for generating rationales for utility-based computational methods, which we evaluate with a multi-day user study against two baselines. The results show that our approach produces rationales that lead to statistically significant improvement in human task performance, demonstrating that rationales automatically generated from an AI’s internal task model can be used not only to explain what the system is doing, but also to instruct the user and ultimately improve their task performance.

Keywords: Explainable AI, Machine Learning

1. Introduction

Machine learning (ML) systems across many application areas are increasingly demonstrating performance that is beyond that of humans. From games, such as Chess (Hsu, 1999) and Atari (Montfort and Bogost, 2009), to high-risk applications, such as autonomous driving (Teichmann et al., 2018) and medical diagnosis (Kermany et al., 2018), human users are becoming increasingly surpassed by, and reliant on, autonomous systems. In response to the proliferation of such models, the field of Explainable AI (XAI) has sought to develop techniques that enhance the transparency and interpretability of machine learning methods. XAI approaches have included taking advantage of intrinsically interpretable models (e.g., decision sets) (Lakkaraju et al., 2016; Letham et al., 2015), as well as developing interpretable approximations to explain the behavior of non-interpretable black-box models (e.g., decision tree approximations of deep neural networks) (Ribeiro et al., 2016; Koh and Liang, 2017). The vast majority of XAI research has focused on expert users, for example, medical personnel evaluating the decision-making capability of an automated diagnosis system (Adadi and Berrada, 2018; Ribeiro et al., 2016).

In this work, we consider a question not previously explored within the XAI and ML communities: Given a computational system whose performance exceeds that of its human user, can explainable AI capabilities be leveraged to improve the performance of the human? In other words, can the increasingly powerful machine learning systems that we develop be used to, in turn, further human capabilities? Within the context of XAI, we seek not to explain to the user the inner workings of some algorithm, but to instead communicate the rationale behind a given decision or choice. The distinction between explanation and rationale has been previously introduced by Ehsan et al. (Ehsan et al., 2019), who defined explanations as a way to expose the inner workings of a computational model through any communication modality (e.g., visual heatmap (Fong and Vedaldi, 2017)), often in a way that is accessible only to experts. Rationales, on the other hand, are defined as natural language explanations that do not literally expose the inner workings of an intelligent system, but instead provide contextually appropriate natural language reasons. These natural language reasons are accessible and intuitive to non-experts (e.g., “I had to go forward to avoid the red vehicle.”), facilitating understanding and communicative effectiveness. Within the context of their work, Ehsan et al. introduced a computational method for automatically generating rationales and validated a set of human factors that influence human perception and preferences (i.e., contextual accuracy, intelligibility, awareness, reliability, strategic detail)  (Ehsan et al., 2019).

In this work, we explore whether providing human users with a rationale of an intelligent system’s behavior can lead to improvement in the user’s performance. We study this question in the context of the game of Chess, for which computational game engines that surpass the performance of the average player are widely available. Our work makes the following contributions:

  • We introduce the Rationale-Generating Algorithm (RGA), an automated technique for generating rationales for utility-based computational methods.

  • We study two variants of our approach, one that takes advantage only of the system’s knowledge (RGA), and a second (RGA+) that additionally incorporates human domain expert knowledge.

We evaluate both techniques in a multi-day user study, comparing against two baselines to measure human learning performance. The results demonstrate that our approach produces rationales that lead to improvement in human task performance in the context of chess endgames. Using winning percentage and the percentile rank of player moves as measures of performance, we observe that the inclusion of domain expert knowledge (RGA+) significantly improves human task performance over both non-rationale baselines. Additionally, users' self-reported performance ratings show that rationale-based interfaces lead to greater perceived user understanding of the task domain than non-rationale baselines. In summary, our approach is the first to demonstrate that rationales automatically generated from an AI's internal task model can be used not only to explain what the system is doing, but also to instruct users in a manner that ultimately improves their own task performance in the absence of rationales.

2. Related Work

Traditionally, in the XAI community, an interpretable model can be described as one from which an AI expert or user can deduce model performance based on given inputs (Ehsan et al., 2019). The methods of interpretability vary with the complexity of a model, and a wide range of survey papers summarize the different XAI models currently developed (Adadi and Berrada, 2018; Zhang and Zhu, 2018; Ribeiro et al., 2016). While some existing models are inherently interpretable, making them suitable for model-intrinsic explanation (Letham et al., 2015; Caruana et al., 2015), other, more complex models require model-agnostic approaches to interpretability (Guidotti et al., 2018; Zhang et al., 2018, 2019; Wu et al., 2018). Despite these differing approaches, XAI models share the common motivation of improving human understanding of AI systems and building trust in them (Gunning and Aha, 2019; Gunning, 2017).

One existing method for developing interpretability is to use intrinsic models in which interpretability is embedded inside the model. In (Letham et al., 2015), a Bayesian Rule List (BRL) is created to produce posterior distributions over permutations of 'if, then, else' rules for a single prediction classification. Since these decision lists are intrinsically understandable, using them as the basis of BRL succeeds in providing interpretability. However, the accuracy of BRL depends highly on a Bayesian prior favoring concise decision lists and a small number of total rules, limiting the applicability of this approach. To extend the applicability of model-intrinsic methods beyond the scope of concise decision lists, a generalized additive model (GAM) was created to provide both high accuracy (better than random forests, LogitBoost and SVMs) and high interpretability for a single prediction classification (Caruana et al., 2015). To achieve model understanding, the GAM creates a visual representation of the pairwise factors that result in a prediction and provides the ability for modular testing. (Caruana et al., 2015) refers to modular testing as allowing model experts to easily remove and insert individual or pairwise factors to examine their effects on a prediction. While methods such as GAMs provide interpretability for regression and some classification models, their intrinsic nature makes it hard to provide interpretability for black-box models, which instead require model-agnostic implementations of interpretability.

An alternative to the model-intrinsic approach described above is a model-agnostic method for interpretability. Model-agnostic methods are a significant focus within the XAI community because many high-performing models do not allow for inherent decoding and visualization of their predictions. Most model-agnostic methods produce interpretability by developing surrogate models: post-hoc implementations of interpretability derived from inherently interpretable models. In (Guidotti et al., 2018), an ad hoc genetic algorithm is used to generate neighboring instances around a specific local instance; these neighbors are then used to train a decision tree classifier that generates a logic rule. This logic rule represents the path in the decision tree that explains the factors leading to a prediction. Additionally, (Zhang et al., 2019) uses inherently explainable decision tree models to learn CNN layers in order to provide CNN rationales at the semantic level, defining a CNN rationale as the ability to quantify, from prediction scores, the set of objects that contribute to a CNN prediction. (Zhang et al., 2018) also uses explanatory graphs to visually reveal the hierarchy of knowledge inside CNNs, in which each node summarizes the knowledge in the feature maps of a corresponding conv-layer. However, the decision tree rationales and explanatory graphs only provide quantitative distributions and visualizations to domain experts, and are not interpretable to general users without domain knowledge. Moreover, the complexity of these decision trees varies greatly with the complexity of the model, sometimes becoming uninterpretable due to a large node space. Thus, (Wu et al., 2018) creates a tree regularization penalty function that helps produce interpretability for complex models by using moderately-sized decision trees. Altogether, though, both the intrinsic and agnostic methodologies remain largely focused on giving insight into the inner workings of AI systems, rather than providing humanly understandable rationales that extend beyond domain-expert usability.

The HCI community acknowledges the gap between XAI and the human usability of XAI, noting that explanations from the AI and ML communities have not yet demonstrated large-scale efficacy with human users (Abdul et al., 2018). To address the importance of user-friendly explanations for any user, not only domain experts, researchers in HCI have developed context-aware rules that focus on easy-to-use interfaces aware of their environment and context of use (Dey, 2018). However, many of these existing context-aware rules are developed as frameworks to aid decision-making in domain-specific applications such as smart homes (Costanza et al., 2014; Bourgeois et al., 2014) and the office (Cheverst et al., 2005), rather than adopting the model-agnostic approach encouraged by the AI/ML community. In addition to the context-aware rules above, educators have focused on intelligent tutoring systems (ITS) (Mahdi et al., 2016; Boyer et al., 2011; Hilles and Naser, 2017) to generate explanations for learning. While we similarly seek to improve the performance of a non-expert user, our work differs from tutoring systems in that the rationales produced by our system are generated from, and reflect, an AI's internal computational model rather than a prescribed curriculum.

In the context of exploring effective methods of representing AI model predictions to human users, (Feng and Boyd-Graber, 2019) investigates the helpfulness of different factors that can aid in interpreting an AI decision. The authors focus on methods such as presenting evidence examples, highlighting important input features, and visualizing uncertainty in the context of the trivia game Quizbowl. Human performance improvement is analyzed by the time taken to respond to a question and the accuracy of an answer. However, since improvement is always measured in the presence of interpretations, their work does not provide insight into whether long-term human knowledge and task performance increase in the absence of the assistance provided by the interface.

To address a more generic solution for generating human-understandable explanations from AI/ML models outside of a specified curriculum, (Ehsan et al., 2019) develops human-understandable rationales in the context of the game Frogger. The rationale generation uses natural language processing to contextualize state and action representations for a reinforcement learning system, and studies the human factors that influence explanation preferences. The metrics for measuring interpretability of the rationales are primarily focused on perceived understanding of the rationale. In contrast, we generate rationales and measure task performance and self-perceived task performance improvement, as opposed to surveying understanding of a given rationale.

3. Rationale Generating Algorithm

Figure 1. System overview diagram showing the interaction between a user, AI agent and RGA, in which s represents the state, a represents the action, U represents the utility function, D represents domain knowledge, and r represents the rationale.

XAI systems can be used to aid users in a decision-making process, such as medical diagnosis (Kermany et al., 2018). In this work, we introduce the Rationale Generating Algorithm (RGA), which provides the user with rationales designed to aid the user's decision-making process while also increasing their understanding of the underlying task domain. Figure 1 presents an overview of the RGA pipeline. For a given task, we assume an expert AI Engine is available that takes in the current task state s and outputs a recommended action a and its associated utility function U[1]. In the context of this work, a utility function is defined as a weighted sum of all variables used to decide an output. Within RGA, the utility function U is decomposed to identify the most significant factors contributing to the selection of a over alternate actions. These factors take the form of variables, typically not very interpretable in their original form (e.g., 'PassedPawns'). Additional factors that support decision making may also be obtained from provided expert domain knowledge D (see below). While a utility function may be made up of dozens of variables, RGA selects only the top k factors, which are then used to generate a human-readable rationale r. Note that (Fox et al., 2017) has shown that, in the context of an explainable planner, generating justifications to explain both good and bad action choices leads to a more robust explainable system. Motivated by this, RGA is able to generate rationales for both positively and negatively contributing actions.

Input: U - utility function, a - selected action, D - domain knowledge (optional)
Output: r - rationale

1:  F_U = [name:{}, weight:{}]
2:  F_D = [name:{}, weight:{}]
3:  F_U = decomposeUtility(U, a)
4:  if D is not Ø then
5:      F_D = decomposeDomainKnowledge(D, a)
6:  end if
7:  F = F_U ∪ F_D
8:  F.sortByWeight()
9:  for f in F[0:k] do
10:     if isPositive(f) then
11:         r.append(genPos(f.name, a))
12:     else
13:         r.append(genNeg(f.name, a))
14:     end if
15: end for
16: return r
Algorithm 1 generateExp(U, a, D)

Algorithm 1 further details the manner in which RGA is implemented and a final rationale is chosen. Variables F_U and F_D (lines 1-2) store the name of each decision factor contributing to the selection of a given action and its associated weight, for utility-based (F_U) and domain-based (F_D) knowledge, respectively. Given the utility function U used to select the given action a, RGA first decomposes U to obtain the list of all factors involved in the determination of the action, along with their contributing normalized weights. The resulting list is stored in F_U (line 3).

RGA next optionally processes the domain knowledge D, if it is available (lines 4-6). We define domain knowledge as any externally available data, typically encoded a priori by a domain expert (Sutcliffe et al., 2016). The domain knowledge is encoded as a supplemental utility function and used to supply RGA with information that is not encoded in U but might be helpful to include in a rationale to aid human understanding. For example, within the chess domain we found that the default utility function generated by our game engine did not include a variable for 'Checkmate' (an important winning condition in chess), whereas providing information about a checkmate within a rationale would likely be helpful to the user. As a result, we include the concept of domain knowledge, and in our experimental section compare RGA performance with and without it. If domain knowledge is present, RGA decomposes it, similarly to the utility function, into a list of factor names and their associated weights, stored in F_D (line 5).

Once decision factors from the utility function and the domain knowledge are identified, the weighted lists for both sets of factors are merged into F and sorted by weight in descending order (lines 7-8). The top k factors are then used to generate a rationale, with the wording depending on whether each factor positively or negatively contributes to the task objective. For each factor that aids in task completion, a positive rationale is scripted (e.g., 'A boat can help cross the river because it floats on water'). If a top factor represents a negative contribution to the objective, a negative rationale is scripted (e.g., 'A car cannot help cross the river because it does not float on water').
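The listing below is a minimal Python sketch of Algorithm 1, assuming the utility and domain-knowledge decompositions (lines 3 and 5) have already produced lists of (name, weight) factors. The Factor class, template sentences, example factor names, and the choice to rank by weight magnitude (so that strongly negative factors can also surface) are illustrative assumptions rather than the authors' implementation.

# Minimal sketch of Algorithm 1 (generateExp); names and templates are placeholders.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Factor:
    name: str      # human-readable decision factor, e.g. "Mobility"
    weight: float  # normalized contribution to the selected action

def generate_exp(utility_factors: List[Factor],
                 action: str,
                 domain_factors: Optional[List[Factor]] = None,
                 k: int = 2) -> List[str]:
    """Build a rationale from the top-k factors behind a selected action."""
    # Lines 1-6: factor lists decomposed from the utility function and,
    # optionally, from expert domain knowledge.
    factors = list(utility_factors)
    if domain_factors:
        factors += list(domain_factors)
    # Lines 7-8: merge and sort all factors, here by weight magnitude.
    factors.sort(key=lambda f: abs(f.weight), reverse=True)
    # Lines 9-15: script a positive or negative sentence for each top factor.
    rationale = []
    for f in factors[:k]:
        if f.weight >= 0:
            rationale.append(f"{action} is recommended because it improves {f.name}.")
        else:
            rationale.append(f"{action} is risky because it hurts {f.name}.")
    # Line 16: return the assembled rationale sentences.
    return rationale

# Example with hypothetical factors for a hypothetical recommended move.
print(generate_exp([Factor("Mobility", 1.8), Factor("KingDanger", -0.4)],
                   action="Rook to e7",
                   domain_factors=[Factor("CheckmateSoon", 10.0)]))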

Once a rationale r is returned by RGA, we display it to the user along with the recommended action a selected by the AI Engine. The user may then leverage the rationale, and their own knowledge of the domain, to decide whether to perform a or to select a different action. As shown in Figure 1, the user's final selected action, which may be the same as or different from a, is then applied back to the task domain.

4. RGA and Chess

To measure the effect of human-understandable rationales on task performance, we apply RGA to the game of chess, specifically focusing our attention on endgame configurations, defined here as positions with no more than 12 pieces on the board. Chess is a good application for RGA because its utility function is complex, involving many parameters, and the game environment continuously changes over time. We focus specifically on endgame scenarios, in which decision making is crucial to the outcome and a relatively small number of moves remain. With endgame scenarios, we are able to analyze the effects of RGA through a controlled experimental design.

For this research, we use the utility function from the open-source chess AI engine Stockfish (Stockfish) and utilize RGA to generate human-understandable rationales for both optimal and non-optimal moves. This section discusses RGA within the chess domain.

Figure 2. An overview of the application of RGA to the domain of chess, highlighting the two variants of RGA.
(a)
(b)
Figure 3. Example rationales generated for a particular board configuration where (a) represents a best-move explanation and (b) represents a non-optimal move explanation.

4.1. Chess Utility Function

The utility function from Stockfish includes over 100 utility factors that are combined in a multi-level hierarchical fashion. While all of these factors can be crucial to explaining different moves within a chess board configuration, only a subset of these main factors are useful for explaining moves in context of endgame configurations (Bain and Muggleton, 1994). The relevant endgame utility factors are determined after sorting all factors by maximum weight (line 8 of Algorithm 1). They include but are not limited to: ’Mobility’, ’KingDanger’, ’King’, ’HangingPiece’, ’PawnPromotion’, and ’Passed’.

As a brief summary, 'Mobility' refers to the number of legal moves a player has for a given position: the more choices a player has, the stronger their position on the board. In endgames, the king is a powerful piece and is best kept near the center of the board due to its limited range; 'KingDanger' is the highest-weighted feature affecting the 'King' score. Additionally, 'Threats' play a large role in the outcome of an endgame and are primarily represented by crucial attacking pieces such as rooks and kings, and by 'HangingPiece', which refers to weak enemy pieces not defended by the opponent. 'Passed' refers to the concept of 'Passed Pawns' and checks whether a pawn can continue to safely advance to the other side of the board. 'Passed' includes the feature 'PawnPromotion', which applies when a pawn has reached the opponent's back rank, allowing the player to exchange the pawn for a queen, rook, bishop or knight. These are the utility factors from the Stockfish utility function that we deemed appropriate, based on their weights, for justifying an optimal or non-optimal move choice.

4.2. Data Pre-Processing

To meet RGA’s requirement of having a normalized utility function input, we standardize the Stockfish utility function using Z-score scaling. For Z-score scaling, we perform a heuristic ad-hoc analysis by collecting average standard deviations and means for each relevant factor over several game configurations. Specifically, we collect 120 game configurations using a random FEN generator. To ensure completeness, we represent an equal distribution of game configurations within the range of 2-32 pieces. We started with 30 random game configurations, and doubled them twice, until the change in both average standard deviation and mean for each factor was negligible.
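The snippet below sketches this standardization step. The per-factor means and standard deviations shown are hypothetical placeholders standing in for the statistics collected offline from the sampled board configurations.

# Sketch of the Z-score standardization of raw Stockfish factor values.
def zscore_normalize(raw_factors, factor_stats):
    """Map raw factor values to Z-scores using precomputed (mean, std) stats."""
    normalized = {}
    for name, value in raw_factors.items():
        mean, std = factor_stats[name]
        normalized[name] = (value - mean) / std if std else 0.0
    return normalized

# Hypothetical statistics from the ~120 randomly generated FEN configurations.
FACTOR_STATS = {"Mobility": (12.4, 5.1), "KingDanger": (0.8, 0.6)}
print(zscore_normalize({"Mobility": 18.0, "KingDanger": 1.5}, FACTOR_STATS))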

4.3. RGA In Chess

In the context of chess, RGA generates human-understandable rationales with the objective of checkmating the opponent. As portrayed in Figure 1, the Stockfish AI engine generates a recommended action a for the current board state s, along with its associated utility. In specific relation to chess, Figure 2 depicts the overall interaction between the Chess Engine, RGA, and the user. For each given board configuration, all relevant factors, actions and utilities are updated before generating a rationale. These utilities are sorted to find the highest utility and its corresponding optimal move, which RGA uses to generate a rationale justifying the optimal move the user should take. Additionally, we detect non-optimal moves by comparing the user's proposed action to the set of all possible actions for the given board configuration; if the proposed action falls in the bottom 1/3 of this set, then we use the factors in F (line 7 of Algorithm 1) to generate a cautionary rationale justifying why the user should not make the proposed action. The user can ultimately decide upon the final action, considering or disregarding the rationales produced by RGA. Figure 3 provides an example of a best-move and a non-optimal-move explanation, each using both a possible Stockfish factor and a domain-knowledge factor.
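The following sketch illustrates the bottom-third rule described above; the move identifiers, ratings, and the is_non_optimal helper are hypothetical stand-ins for Stockfish evaluations of every legal move in a position.

# Sketch of flagging a proposed move that ranks in the bottom third of legal moves.
def is_non_optimal(proposed_move, scored_moves):
    """Return True if the proposed move falls in the bottom 1/3 of legal moves."""
    ranked = sorted(scored_moves, key=scored_moves.get, reverse=True)  # best first
    cutoff = len(ranked) - len(ranked) // 3   # index where the bottom third begins
    return proposed_move in ranked[cutoff:]

# Example with four hypothetical legal moves and Stockfish-style ratings.
scores = {"Ke2": 1.2, "Ra7": 0.9, "h4": -0.3, "Kd1": -1.5}
print(is_non_optimal("Kd1", scores))  # True: worst-ranked move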

4.4. Domain Knowledge Factors

In the application of chess, we use D to add three important criteria that the Stockfish utility function does not explicitly represent: (1) explicit piece capture on the next move, (2) check on the next move, (3) checkmate on the next or subsequent move. Considering the objective of chess, we weigh these additional domain-knowledge factors higher than those from the utility function. As shown in Figure 2, we define RGA+ as a superset of RGA: RGA+ denotes rationales reasoned from both domain knowledge and the chess utility function, whereas RGA requires only a utility function for rationale generation, leaving domain-knowledge inputs as an optional contribution from domain experts.
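As a concrete illustration, the snippet below shows one possible encoding of these three criteria as supplemental factors consumable by Algorithm 1. The factor names and weights are hypothetical, chosen only to reflect the higher weighting of domain-knowledge factors described above.

# Illustrative encoding of the three domain-knowledge criteria as supplemental factors.
DOMAIN_FACTORS = {
    "CaptureNextMove": 5.0,   # (1) explicit piece capture on the next move
    "CheckNextMove": 6.0,     # (2) check on the next move
    "CheckmateSoon": 10.0,    # (3) checkmate on the next or subsequent move
}

def decompose_domain_knowledge(detected_criteria):
    """Return (name, weight) pairs for the criteria detected on the current board."""
    return [(name, DOMAIN_FACTORS[name])
            for name in detected_criteria if name in DOMAIN_FACTORS]

print(decompose_domain_knowledge(["CheckmateSoon"]))  # [('CheckmateSoon', 10.0)]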

5. Experimental Design

To evaluate the effectiveness of RGA in improving task performance, we conducted a four-way between-subjects user study in which participants took part in chess gaming sessions over three consecutive days. We selected a multi-day study design to ensure observation of longer-term learning effects. Given the complexity of chess, which requires an estimated 10 years (Simon and Chase, 1988) or 5000 hours (Charness et al., 2005) to master, we conducted our study using only simplified game scenarios consisting of end games.

The study design consisted of the following four study conditions, which determine what guidance was provided to the participant:

  • None (baseline): The player receives no hints or rationales. This condition is equivalent to practicing chess independently with no guidance. (Figure 4(a))

  • Hints (baseline): The player receives a visual hint highlighting in color the best currently available move, as determined by the game engine utility function. No textual rationales are provided beyond the highlighting. This condition is equivalent to the hints system available in the vast majority of online and app-based chess programs. (Figure 4(b))

  • RGA: The player receives a visual hint highlighting in color the best currently available move (as in Hints). Additionally, the system displays a textual rationale based on the Stockfish utility function only. (Figure 4(c))

  • RGA+: The player receives a visual hint highlighting in color the best currently available move (as in Hints). Additionally, the system displays a textual rationale based on both the Stockfish utility function and domain knowledge. (Figure 4(c))

(a) None
(b) Hints
(c) RGA (green) and RGA+ (blue)
Figure 4. Given a set game configuration, the three chessboards show the different user interfaces for the four experimental conditions where (a) represents the ’None’ cohort, (b) represents ’Hints’ and (c) represents the ’RGA+ and RGA’ cohort.

The sections below further detail our evaluation metrics, study hypotheses, participant recruitment method, and study design.

5.1. Metrics

Each chess session consisted of diagnostic games, during which participants were evaluated on their performance and received no suggestions or hints, and instructional games, during which participants received guidance according to their assigned study condition. During diagnostic games, the following metrics were used to evaluate performance:

  • Win Percentage (Win%): a metric commonly used in sports that takes into account the number of wins, losses and ties. In our domain, we additionally account for games in which the maximum number of allowed moves has been reached[2], maxmoves, which are weighted the same as ties. The final win percentage is calculated as:

    (1)  Win% = 100 × (wins + 0.5 × (ties + maxmoves)) / (wins + losses + ties + maxmoves)

  • Percentile Rank (Percentile): a metric that measures the percentage of scores in a distribution that fall below a given score. In our domain, the distribution of scores consists of the move ratings of all possible moves for a given board configuration, as calculated by Stockfish. For each board configuration, we use the following formula to calculate the percentile rank of a chosen move m, with M corresponding to the set of all possible moves (a computational sketch of both metrics follows this list):

    (2)  Percentile(m) = 100 × (|{m' ∈ M : rating(m') < rating(m)}| + 0.5 × |{m' ∈ M : rating(m') = rating(m)}|) / |M|
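The sketch below computes both metrics as reconstructed in Equations (1) and (2); the exact weighting of tie and move-limit games is inferred from the description above, and the example counts and ratings are hypothetical.

# Sketch of the two diagnostic metrics (Win% and Percentile Rank).
def win_percentage(wins, losses, ties, maxmoves):
    """Win% with games that hit the move limit weighted the same as ties (Eq. 1)."""
    total = wins + losses + ties + maxmoves
    return 100.0 * (wins + 0.5 * (ties + maxmoves)) / total

def percentile_rank(chosen_rating, all_ratings):
    """Percentile rank of the chosen move among all legal moves' ratings (Eq. 2)."""
    below = sum(r < chosen_rating for r in all_ratings)
    equal = sum(r == chosen_rating for r in all_ratings)
    return 100.0 * (below + 0.5 * equal) / len(all_ratings)

# Example: 4 wins, 2 losses, 1 tie, 2 move-limit games; a chosen move rated 0.9
# among five hypothetical Stockfish move ratings.
print(win_percentage(4, 2, 1, 2))                         # ~61.1
print(percentile_rank(0.9, [1.2, 0.9, 0.4, -0.3, -1.5]))  # 70.0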

Additionally, at the end of each study day, participants were given a short post-session questionnaire from which we obtain the following metric:

  • Perceived Performance (SelfEval): a metric that seeks to capture the participants' self-reported perceived progress toward learning chess. Perceived performance is measured using a 5-point Likert scale rating based on the question 'Do you believe your performance improved this session?' (1 = Strongly disagree, 5 = Strongly agree).

5.2. Hypotheses

We formulate the following hypotheses on the ability of interpretable rationales to improve participants’ chess performance defined by the Win% and Percentile Rank metrics:

  • H1a: Participants who received only utility-based rationales (RGA) will perform better than those who received no guidance (None).

  • H1b: Participants who received only utility-based rationales (RGA) will perform better than participants who received only suggestions (Hints).

  • H1c: Participants who received rationales that incorporate domain knowledge (RGA+) will perform better than participants who received no guidance (None).

  • H1d: Participants who received rationales that incorporate domain knowledge (RGA+) will perform better than participants who received only suggestions (Hints).

  • H1e: Participants who received rationales that incorporate domain knowledge (RGA+) will perform better than participants who received only utility-based rationales (RGA).

Additionally, we hypothesize that participants’ measure of their own perceived performance, evaluated by the SelfEval metric, will follow similar trends as above. Specifically:

  • H2a: Participants who received only utility-based rationales (RGA) will have higher perceived performance ratings than those who received no guidance (None).

  • H2b: Participants who received only utility-based rationales (RGA) will have higher perceived performance ratings than those who received only suggestions (Hints).

  • H2c: Participants who received rationales that incorporate domain knowledge (RGA+) will have higher perceived performance than participants who received no guidance (None)

  • H2d: Participants who received rationales that incorporate domain knowledge (RGA+) will have higher perceived performance than those who received only suggestions (Hints).

  • H2e: Participants who received rationales that incorporate domain knowledge (RGA+) will have higher perceived performance ratings than those who received only utility-based rationales (RGA).

5.3. Participants

We recruited 68 participants from Amazon's Mechanical Turk. Participants were required to demonstrate basic knowledge of chess by passing a short test verifying the rules of the game (e.g., 'Which piece can move straight forward, but captures diagonally?'). Participants were also required to participate three days in a row, and to not already be expert players. Eight players were removed from the study for either not participating for three days or for winning every game (suggesting that they were experts to begin with). The final set of participants included 60 individuals (44 males and 16 females), who ranged in age from 18 to 54 (6 between 18-24 years, 31 between 25-34 years, 18 between 35-44, and 3 between 45-54 years). Participants were randomly assigned to one of the four study conditions. Each daily session took approximately 15-20 minutes, and participants were compensated $2.00, $4.00, and $6.00 on days 1, 2 and 3, respectively.

5.4. Study Design

The study consisted of three sessions performed on three consecutive days. In each session, participants played 9 games of chess: three diagnostic games, followed by three instructional games, followed by three more diagnostic games. The use of diagnostic games at the beginning and end of each session enabled us to study participant performance both across and within sessions[3].

The participant always played white, and the opponent's black pieces were controlled by the Stockfish engine, which always played optimally. For each board, the optimal number of player moves needed to win was determined by the Stockfish engine, and participants were limited to 10 moves per game. Starting board configurations were obtained from a popular online learning website (https://lichess.org/), selected according to the optimal number of moves needed to win. All participants received the same boards in the same order to ensure uniformity; each board was unique and did not repeat. Players were allowed to make any legal chess move, and each game consisted of an average of 6 moves (SD=2.62). As a result, participants in the Hints and RGA conditions received approximately 18 move suggestions per day on average. Furthermore, participants in the RGA condition received one rationale per move suggestion, i.e., approximately 18 rationales per day. Participants in the RGA+ condition occasionally received an additional rationale per move suggestion to denote a possible checkmate in fewer than three moves.
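For concreteness, the sketch below shows how one such game could be driven programmatically, assuming the python-chess package as the interface to Stockfish; the paper does not specify its tooling, and the function names, search depth, and callback structure are illustrative choices.

# Sketch of one study game: the participant plays White, Stockfish replies as
# Black, and play stops after a win, draw, or the 10-move participant limit.
import chess
import chess.engine

MOVE_LIMIT = 10  # prevents novices from shuffling pieces indefinitely

def play_endgame(start_fen, get_user_move, engine_path="stockfish"):
    board = chess.Board(start_fen)
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        for _ in range(MOVE_LIMIT):
            board.push(get_user_move(board))          # participant move (White)
            if board.is_game_over():
                break
            reply = engine.play(board, chess.engine.Limit(depth=20))
            board.push(reply.move)                    # engine reply (Black)
            if board.is_game_over():
                break
    finally:
        engine.quit()
    return board.result(claim_draw=True)  # "1-0", "0-1", "1/2-1/2", or "*"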

Figure 5. Average Win% of the experimental conditions.
Figure 6. Average PercentileRank ratings of moves made by participants across all days.

6. Results

The participant performance data followed a normal distribution; as a result, we used ANOVA with a Tukey HSD post-hoc test to evaluate statistical significance across the experimental conditions with respect to the Win% and PercentileRank metrics. Additionally, we conducted a Mann-Whitney U test to analyze the Likert scale data for the SelfEval metric.
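A sketch of this analysis pipeline using SciPy and statsmodels is shown below; the condition labels and data layout are placeholders for the collected Win%, PercentileRank, and SelfEval data.

# Sketch of the statistical analysis: one-way ANOVA with a Tukey HSD post-hoc
# test for the performance metrics, and Mann-Whitney U tests for Likert ratings.
import numpy as np
from scipy.stats import f_oneway, mannwhitneyu
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def analyze_performance(scores_by_condition):
    """scores_by_condition: dict mapping condition name -> list of Win% values."""
    groups = list(scores_by_condition.values())
    f_stat, p_value = f_oneway(*groups)            # omnibus ANOVA across conditions
    values = np.concatenate(groups)
    labels = np.concatenate([[name] * len(vals)
                             for name, vals in scores_by_condition.items()])
    tukey = pairwise_tukeyhsd(values, labels)      # pairwise post-hoc comparisons
    return f_stat, p_value, tukey

def compare_self_eval(ratings_a, ratings_b):
    """Mann-Whitney U test between two conditions' Likert-scale SelfEval ratings."""
    return mannwhitneyu(ratings_a, ratings_b, alternative="two-sided")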

6.1. Participant Performance

Figure 5 presents the average participant win percentage (Win%) for each study condition over the three days. We observe that while no statistical differences appear between conditions on the first day, the differences in performance grow on subsequent days. In particular, we see the greatest rate of overall task improvement from Day 1 to Day 3 for the RGA+ condition. Our results indicate a correlation between the amount of explanation participants are given and their Win%, with more justification leading to more wins. This is further supported by the results in Figure 6, which presents the percentile rank (PercentileRank) of each participant's average move, where a percentile rank of 100 denotes the most optimal move and a percentile rank of 0 denotes the least optimal move. Similar to Figure 5, we observe a correlation between the PercentileRank of move ratings and the amount of justification provided, with the upper quartile for PercentileRank being higher with more explanation.

Our statistical analysis shows that for both Win% and PercentileRank, participants in 'RGA' did not perform statistically better than those in 'None', not validating H1a, nor statistically better than those in 'Hints', not validating H1b. We do observe that 'RGA+' participants performed significantly better than 'None' participants, validating H1c. However, with respect to the 'Hints' condition, 'RGA+' had statistically better Win% performance but not PercentileRank, therefore only partially validating H1d. Finally, we see no significant difference between the 'RGA+' and 'RGA' conditions in this study, not validating H1e.

In summary, these results indicate that humanly interpretable rationales can improve task performance, provided the rationales draw on a sufficiently complete representation of the domain. Our results show that combining utility-based features with additional domain-knowledge features (not represented by the utility function) can achieve this completeness. It is also important to note that, for task improvement, a decision must not only be interpretable, but must be made humanly interpretable through accompanying rationales. The 'Hints' condition also provided interpretability by highlighting the best move, but the lack of accompanying rationales may explain why no significant difference was observed between 'Hints' and 'None'. Furthermore, the inclusion of domain knowledge in RGA+ significantly improved participant Win% over the baseline conditions, whereas RGA did not, underscoring the need for a more complete domain representation. It is nevertheless interesting that no significant Win% difference was observed between 'RGA+' and 'RGA'; given the trend of increasing difference between the two conditions, we suspect that a significant difference would emerge over a longer period of time.

6.2. Participant Perceived Performance

Figure 7 presents the perceived performance rating (SelfEval) of participants in each experimental condition. The Likert scale data show that participant groups that received more justification ('RGA+', 'RGA') gave more 'Agree' and 'Strongly Agree' ratings than participant groups that received little to no justification ('Hints', 'None'), showing the value justifications had for SelfEval. Furthermore, participant groups in the middle of the justification spectrum ('RGA', 'Hints') gave more 'Neutral' ratings than participant groups at the extreme ends of the spectrum ('RGA+', 'None'), showing higher levels of uncertainty in their SelfEval.

Figure 7. Survey data of participants’ SelfEval across the experimental conditions.

The Mann-Whitney U tests in Table 1 further detail the significance of differences between each pair of experimental conditions. As seen in Table 1, the SelfEval ratings of 'RGA' participants were not significantly stronger than those of 'None' participants, not validating H2a, nor significantly stronger than 'Hints' participant ratings, not validating H2b. However, 'RGA+' participants did report a statistically higher SelfEval than 'None' participants and 'Hints' participants on all three days, validating H2c and H2d, respectively. Additionally, from day two onward, 'RGA+' participants rated their perceived performance higher than those in 'RGA', validating H2e.

Overall, the SelfEval metric data align with the performance analysis from Section 6.1, showing that perceived performance ratings are significantly stronger in the presence of humanly interpretable rationales that are representative of the domain. The results above also reveal additional significance not seen with the Win% and PercentileRank metrics: unlike the trend in Section 6.1, SelfEval does show a significant difference in performance rating between 'RGA+' and 'RGA', reiterating the importance of holistic domain representation. Interestingly, similar to the analysis of Win% and PercentileRank, SelfEval does not show a significant difference between 'RGA' and 'Hints', implying that rationales from the utility function alone were not sufficiently different from 'Hints'. In Figure 7, we see an increasing gap between 'Disagree' and 'Agree' ratings from Day 1 to Day 3 between 'RGA' and 'Hints', implying that over a longer period of observation, a significant difference could emerge.

7. Discussion and Conclusions

In this work, we are the first to explore whether human-interpretable rationales automatically generated from an AI's internal task representation can be used not just to explain the AI's reasoning, but also to enable end users to better understand the task itself, thereby leading to improved user task performance. Our work introduces the Rationale-Generating Algorithm, which utilizes utility-based computational methods to produce rationales understandable beyond the scope of domain-expert users. To validate RGA, we applied it to the domain of chess and measured human task performance using both qualitative user self-reported data (self-perceived performance ratings) and quantitative performance measures (winning percentages and percentile ranks of the strength of moves played by each participant).

Conditions       Day 1    Day 2    Day 3
RGA+ vs. RGA     NS       sig.     sig.
RGA+ vs. Hints   sig.     sig.     sig.
RGA+ vs. None    sig.     sig.     sig.
RGA vs. Hints    NS       NS       NS
RGA vs. None     NS       NS       NS
Hints vs. None   NS       NS       NS
Table 1. Mann-Whitney U test results for the SelfEval metric (NS = not significant; sig. = statistically significant).

Our results show that rationales from RGA are effective in improving performance when information from the AI’s utility function is combined with additional domain knowledge from an expert. The resulting system was able to statistically significantly improve user performance in chess compared to study participants who practiced the same number of games but did not receive rationales. Simply showing participants the optimal action without an accompanying rationale did not produce the same results, indicating the importance of interpretable rationales in elucidating the task.

The presented approach is the first study of how rationales can affect learning. While it contributes a number of important insights, it is also limited in several ways. First, RGA is limited to utility-based methods and cannot be applied to arbitrary machine learning methods. Future work should explore generating rationales for alternate ML representations, such as the reinforcement learning setting discussed in (Ehsan et al., 2019); such methodologies could lead to a rationale-generating system that is model agnostic. Second, our work does not compare the many different ways in which rationales can be phrased or structured. It would be beneficial to investigate if and how non-verbal and dynamic forms of explanation can provide better humanly understandable explanations. For example, it would be interesting to see whether visual animations of future chess moves could make RGA rationales as effective as RGA+ rationales, as well as what effect visual animations alone have when applied to the Hints and None conditions, without RGA. Additional research is also needed to evaluate how to present rationales in the most accessible and interpretable manner based on individual needs. Currently, RGA is designed to help beginner chess players improve their task performance, but it is worth exploring how to build a learned RGA that can tailor its explanations to varying levels of expertise. Another important area to investigate is the long-term effects of RGA. While RGA has been shown to be valuable in improving human task performance over a short period of time, it would be beneficial to see whether these trends hold over a longer time frame. Learning when the effects of RGA are minimal and when they are maximal can help establish its best time of usage, as well as measure its broader impact in improving human learning and task performance.

Acknowledgements.
This material is based upon work supported by the NSF Graduate Research Fellowship under Grant No. DGE-1650044, and also in part by NSF IIS 1564080 and ONR N000141612835.

Footnotes

  1. The presented variant of RGA requires a normalized utility function to generate rationales. Our objective in this work is to evaluate the effect of rationales on user performance; thus we did not focus on developing a fully model-agnostic rationale generation technique. In general, approaches such as (Ehsan et al., 2019) can be used to generate rationales for models that do not incorporate a utility function.
  2. We limit the total number of moves to prevent novice players from moving around the board indefinitely.
  3. Our analysis showed no significant trends for within-session performance differences, likely due to the short duration of the sessions. However, we do observe significant learning effects across sessions, as discussed in Section 6.

References

  1. Trends and trajectories for explainable, accountable and intelligible systems: an hci research agenda. In Proceedings of the 2018 CHI conference on human factors in computing systems, pp. 582. Cited by: §2.
  2. Peeking inside the black-box: a survey on explainable artificial intelligence (xai). IEEE Access 6, pp. 52138–52160. Cited by: §1, §2.
  3. Learning optimal chess strategies. In Machine intelligence 13, pp. 291–309. Cited by: §4.1.
  4. Conversations with my washing machine: an in-the-wild study of demand shifting with self-generated energy. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 459–470. Cited by: §2.
  5. Investigating the relationship between dialogue structure and tutoring effectiveness: a hidden markov modeling approach. International Journal of Artificial Intelligence in Education 21 (1-2), pp. 65–81. Cited by: §2.
  6. Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. Cited by: §2, §2.
  7. The role of deliberate practice in chess expertise. Applied Cognitive Psychology 19 (2), pp. 151–165. Cited by: §5.
  8. Exploring issues of user model transparency and proactive behaviour in an office environment control system. User Modeling and User-Adapted Interaction 15 (3-4), pp. 235–273. Cited by: §2.
  9. Doing the laundry with agents: a field trial of a future smart energy system in the home. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 813–822. Cited by: §2.
  10. Context-aware computing. In Ubiquitous computing fundamentals, pp. 335–366. Cited by: §2.
  11. Automated rationale generation: a technique for explainable ai and its effects on human perceptions. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pp. 263–274. Cited by: §1, §2, §2, §7, footnote 1.
  12. What can ai do for me?: evaluating machine learning interpretations in cooperative play. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pp. 229–239. Cited by: §2.
  13. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437. Cited by: §1.
  14. Explainable planning. arXiv preprint arXiv:1709.10256. Cited by: §3.
  15. A survey of methods for explaining black box models. ACM Comput. Surv. 51 (5). External Links: ISSN 0360-0300, Link, Document Cited by: §2, §2.
  16. DARPA’s explainable artificial intelligence program. AI Magazine 40 (2), pp. 44–58. Cited by: §2.
  17. Explainable artificial intelligence (xai). Defense Advanced Research Projects Agency (DARPA), nd Web 2. Cited by: §2.
  18. Knowledge-based intelligent tutoring system for teaching mongo database. Cited by: §2.
  19. IBM’s deep blue chess grandmaster chips. IEEE Micro 19 (2), pp. 70–81. Cited by: §1.
  20. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 (5), pp. 1122–1131. Cited by: §1, §3.
  21. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 1885–1894. Cited by: §1.
  22. Interpretable decision sets: a joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1675–1684. Cited by: §1.
  23. Interpretable classifiers using rules and bayesian analysis: building a better stroke prediction model. The Annals of Applied Statistics 9 (3), pp. 1350–1371. Cited by: §1, §2, §2.
  24. An intelligent tutoring system for teaching advanced topics in information security. Cited by: §2.
  25. Racing the beam: the atari video computer system. Mit Press. Cited by: §1.
  26. Why should i trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §1, §2.
  27. Skill in chess. In Computer chess compendium, pp. 175–188. Cited by: §5.
  28. Stockfish. http://stockfishchess.org/. Cited by: §4.
  29. Domain knowledge for interactive system design. Springer. Cited by: §3.
  30. Multinet: real-time joint semantic reasoning for autonomous driving. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1013–1020. Cited by: §1.
  31. Beyond sparsity: tree regularization of deep models for interpretability. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2, §2.
  32. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19 (1), pp. 27–39. Cited by: §2.
  33. Interpreting cnn knowledge via an explanatory graph. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2, §2.
  34. Interpreting cnns via decision trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6261–6270. Cited by: §2, §2.