Challenge AI’s Mind: A Crowd System for Proactive AI Testing

Abstract

Artificial Intelligence (AI) has permeated many aspects of our lives; however, without appropriate testing, deployed AI systems are often criticized for failing in critical and embarrassing cases. Existing testing approaches mainly depend on fixed, pre-defined datasets, which provide limited testing coverage. In this paper, we propose the concept of proactive testing, which dynamically generates testing data to evaluate the performance of AI systems. We further introduce Challenge.AI, a new crowd system that integrates crowdsourcing and machine learning techniques in the process of error generation, error validation, error categorization, and error analysis. We present experiences and insights from a participatory design with AI developers. The evaluation shows that the crowd workflow is more effective with the help of machine learning techniques. AI developers found that our system helped them discover unknown errors made by their AI models and engaged them in the process of proactive testing.

Keywords: Crowdsourcing, AI, Proactive testing

5.2 Metrics

The general statistics of each job are displayed in Table 1. Total trials, denoted as $N$, include all sentences that the crowd crafted using our system. Crowd workers generated sentences for the “Subtle sentiment cues” and “Mixed-sentiment” conditions (per-condition counts are listed in Table 1). Since we do not limit the number of sentences the crowd can craft in the generation process, the number of trials varies across conditions. To obtain the ground-truth sentiment label for each sentence, we validate the sentences marked as having failed the model during the validation process. Validated trials, denoted as $N_{valid}$, are the number of sentences that successfully fail the model according to the ground truth. In addition, we count the number of distinct crowd workers for each condition (# workers). Accordingly, we propose the following metrics to evaluate the performance of each crowd worker, using lowercase $n$ instead of $N$ to indicate that a statistic corresponds to a single crowd worker.

Average time per trial ($\bar{t}$) measures how much time a worker needs to craft a trial on average. We assume that no single trial exceeds five minutes, so the recorded time of each trial is capped at five minutes, i.e., $t_i = \min(\tau_i, 300\,\text{s})$, where $\tau_i$ is the raw recorded time of the $i$-th trial. We then obtain $\bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i$, where $n$ is the number of all trials made by that worker. $\bar{t}$ measures how efficiently a worker can craft a sentence.

Success rate ($r$) is measured as $r = n_{valid} / n$, the fraction of a worker’s trials that are validated to fail the model. This value reflects how easily a worker can generate samples that fail the model. The success rate is useful for measuring the effectiveness of prompts, as well as for analyzing the vulnerability of a model.
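
To make the two metrics concrete, below is a minimal sketch of how they could be computed from per-trial logs. The log schema (worker id, raw seconds, validation outcome) is an assumption for illustration, not the system’s actual data format.

```python
# Minimal sketch: average time per trial and success rate, per worker.
# The (worker_id, raw_seconds, failed_model) log schema is an assumption.
from collections import defaultdict

CAP_SECONDS = 5 * 60  # we assume no single trial exceeds five minutes

def per_worker_metrics(trials):
    """trials: iterable of (worker_id, raw_seconds, failed_model) tuples."""
    by_worker = defaultdict(list)
    for worker_id, raw_seconds, failed in trials:
        # cap each trial's recorded time at five minutes
        by_worker[worker_id].append((min(raw_seconds, CAP_SECONDS), failed))

    metrics = {}
    for worker_id, records in by_worker.items():
        n = len(records)                                      # trials by this worker
        avg_time = sum(t for t, _ in records) / n             # \bar{t}
        success_rate = sum(1 for _, ok in records if ok) / n  # r = n_valid / n
        metrics[worker_id] = {"avg_time": avg_time, "success_rate": success_rate}
    return metrics
```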

5.3 Analysis of crowd performance

On average, crowd workers spent 56.4 seconds (SD=18.2) crafting a sentence with the enhanced prompt (LIME+SP) and 65.4 seconds (SD=26.4) without it. Figure 6(a) shows the average time per trial ($\bar{t}$) under each condition. We found a significant effect (t = 1.9977, p<0.05) of the enhanced prompt in reducing $\bar{t}$: the crowd used about 13.8% less time crafting a sentence with the enhanced prompt than without it. The reason may be that editing text in the input area requires less time than crafting a new sentence from scratch.
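
The reported t-statistic suggests a two-sample t-test over per-worker times. A sketch of such a comparison follows; the numbers are made-up stand-ins, not the study’s actual measurements.

```python
# Sketch of a two-sample t-test over per-worker average times (seconds).
# The data below are illustrative placeholders only.
from scipy import stats

with_prompt = [52.1, 58.3, 49.7, 61.0, 55.2]     # hypothetical LIME+SP times
without_prompt = [63.5, 70.2, 59.8, 68.1, 66.0]  # hypothetical baseline times

t_stat, p_value = stats.ttest_ind(without_prompt, with_prompt)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")  # significant if p < 0.05
```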

Figure 6(b) shows that crowd workers achieved nearly identical success rates ($r$) under the two conditions (38.2% vs. 37.7%). This result might be because crowd workers are not familiar with the color encoding of the LIME explanation (Figure 3), and randomly sampled starting sentences do not help the crowd craft sentences that fail the model.

Besides the quantitative assessment, we received positive feedback from the crowd regarding the error generation tasks. For example, one worker commented, “This was fun, I sure hope my answers were good. If not please dont pay me, I enjoyed the task and want to be able to try some more in the future.”

To conclude, the enhanced prompts help the crowd spend less time crafting a sentence that can fail the model. To understand how the two key factors, i.e., accountability and starting point, interact with each other, we plan to recruit more crowd workers for error generation and perform a two-way ANOVA in future research.
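
As a sketch of the planned analysis, a two-way ANOVA over the two factors could look like the following; the table and column names are hypothetical placeholders, not collected data.

```python
# Sketch of the planned two-way ANOVA (accountability x starting point).
# The data frame below is hypothetical; values are illustrative only.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "accountability": (["yes"] * 10 + ["no"] * 10),
    "starting_point": (["yes", "no"] * 10),
    "avg_time": [55, 62, 53, 64, 57, 61, 54, 65, 56, 63,
                 58, 67, 60, 66, 59, 68, 58, 69, 52, 67],
})

model = ols("avg_time ~ C(accountability) * C(starting_point)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # main effects and their interaction
```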

6 Evaluation with AI developers

To investigate how Challenge.AI helps AI developers understand and diagnose a model, we worked with the five AI developers we had collaborated with during the formative study, and organized two rounds of semi-structured interview sessions to evaluate the effectiveness and usefulness of Challenge.AI.

6.1 Process

We followed the architecture of Challenge.AI (Figure 1) to evaluate the entire system. Before error generation, we held the first sessions with AI developers to obtain an initial categorization of errors. Based on the categories proposed by the AI developers, we used Challenge.AI to generate errors belonging to these categories, and conducted validation and categorization of the crafted sentences. Finally, we organized the second interview sessions (error analysis) to understand the usefulness and limitations of Challenge.AI from the perspective of AI developers. Throughout the evaluation, we used a sentiment analysis model built by D1 as the target model. The model takes a sentence as input and outputs a sentiment label together with a probability.
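
For clarity, the target model’s interface can be pictured as below. D1’s actual model is not public, so the class and its internal scorer are hypothetical stand-ins.

```python
# Hypothetical stand-in for D1's target model: a sentence in,
# a sentiment label plus its probability out.
from typing import List, Tuple

LABELS = ["negative", "neutral", "positive"]

class SentimentModel:
    def predict(self, sentence: str) -> Tuple[str, float]:
        """Return the predicted sentiment label and its probability."""
        probs = self._predict_proba(sentence)
        best = max(range(len(LABELS)), key=lambda i: probs[i])
        return LABELS[best], probs[best]

    def _predict_proba(self, sentence: str) -> List[float]:
        # stand-in for the actual scoring logic of D1's model
        raise NotImplementedError
```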

First sessions

The goal of the first sessions was to obtain the target categories of errors for testing the model. To begin with, we tested the performance of the model using a public sentiment dataset [Rosenthal2017] in which all 12,284 sentences are collected from Twitter and labeled with negative, neutral, or positive sentiment. After obtaining all misclassified sentences, we randomly sampled a subset and stored them in a table (CSV file format) with four columns: a ‘Text’ column, a ‘Human_Label’ column showing the ground truth, an ‘AI_Label’ column displaying the results calculated by the model, and an empty column titled ‘Category’ allowing AI developers to label a potential category for each sentence.
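
A minimal sketch of assembling that four-column CSV is shown below; ‘model’ follows the predict() interface sketched earlier, and the dataset is assumed to be a list of (text, human_label) pairs.

```python
# Sketch of building the four-column CSV of sampled misclassifications.
# The dataset format and 'model' interface are assumptions.
import random
import pandas as pd

def build_error_table(dataset, model, sample_size, path="errors.csv"):
    rows = []
    for text, human_label in dataset:
        ai_label, _prob = model.predict(text)
        if ai_label != human_label:  # keep only misclassified sentences
            rows.append({"Text": text, "Human_Label": human_label,
                         "AI_Label": ai_label, "Category": ""})  # to be filled in
    sampled = random.sample(rows, min(sample_size, len(rows)))
    pd.DataFrame(sampled).to_csv(path, index=False)
```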

Each interview started with an introduction to the dataset. After that, we presented the dataset to the AI developers and asked them to identify patterns among the misclassified samples and name new categories for them. AI developers were allowed to discard sentences that were hard to categorize. An interview took about minutes. We encouraged the developers to express findings and thoughts using a think-aloud protocol and took notes on their feedback for further analysis.

Some AI developers had more experience in identifying error patterns. For example, when noticing a sentence whose benchmark label is positive but which the model misclassified as negative, i.e., “Marissa Miller of Google makes shout out to the Khan Academy and the great things they’re doing for education. #fmsignal #sxsw (cc @mention”, D2 said, “I think the model made a wrong prediction because it does not understand what ‘shout out’ means.” From her experience, D2 further commented that the model may not understand sentiment indications that are domain-specific or context-dependent. Besides summarizing patterns in the dataset, D3 asked for sentences containing both positive and negative indicators: “Do any of them have opposite sentiment words, like, I am happy, but… something like that?” The participant further explained, “Some models are designed to handle targeted sentiment, but determining relevant sentiment in mixed sentiment texts is challenging.” Finally, we derived two categories of errors for model testing. One is “Subtle Sentiment Cues”, meaning that a sentence is either positive or negative but its sentiment indications are subtle. The other is “Mixed-sentiment”, which refers to sentences containing both positive and negative indicators. Further, we included three more categories for error categorization: a “Questions” category was added based on D1’s comments, an “Others” category was included to be more general, and a “No majority” category was assigned after categorization if human annotators could not reach a consensus on the category of a sample.

Running Challenge.AI

After obtaining the categorization, we tested the model by walking through three main components of Challenge.AI, i.e., error generation, validation, and categorization. As mentioned above, we focused on the two categories “Subtle Sentiment Cues” and “Mixed-sentiment” in error generation, while we used all five categories for error categorization. The results and analysis of crowd performance are described in Section 5.

Second interview sessions

We organized the second interview sessions to evaluate how Challenge.AI helps AI developers understand the performance of the model.

After running Challenge.AI, we obtained the samples that crowd workers had generated and that were validated to have successfully failed the model, categorized as either “Subtle Sentiment Cues” or “Mixed-sentiment”. During the interviews, we presented the data at three levels of granularity using the interface shown in Figure 5.

Each interview took about minutes. We first presented the goal of Challenge.AI to the AI developers, followed by a detailed introduction to the data and the interface. The AI developers then freely explored the interface, and we helped them resolve any questions they encountered. Next, the participants went through the interface and described how they understood the performance of the model. They further identified new categories of errors by investigating detailed samples using the interface. Finally, a post-interview discussion was conducted to collect their feedback on the strengths and weaknesses of Challenge.AI. During the interviews, AI developers were instructed to think aloud, and we took notes on their feedback. We recorded the whole sessions for later analysis and report the results in the remainder of this section.

6.2 Value of proactive testing

Thorough testing is important for AI models before deployment. However, the current practice of testing is limited in coverage, as D3 commented, “When doing the testing, we assume that the testing dataset and training dataset are in the same feature space.” Traditional testing approaches are far from sufficient for deploying a model in the wild, which indicates the potential value of proactive testing in evaluating models for production. To reduce critical and embarrassing errors, AI developers can identify corner cases to test, and Challenge.AI collects external data belonging to the specified categories. In addition, by investigating the external data, AI developers can discover unseen errors. For example, our participants identified two categories distinct from those found in the first interview session, i.e., bias in pronouns such as ‘He’ and ‘She’, and reversed sentiment containing words like ‘However’, ‘Though’, and ‘But’. Detailed discussions are reported below.

6.3 Getting a gist

First of all, AI developers were interested in the overall patterns of misclassified samples. The Statistics View (Figure 5(a)) provides a big picture of the entire dataset. From the stacked bar chart, D5 noticed that high-, middle-, and low-severity errors are roughly evenly distributed for most bars. However, the samples belonging to “Question” attracted her attention because high-severity errors account for the majority in this category. “The model could be improved (in the ‘Question’ category) for sure.” D5 further explained how the model could be improved: “In some of the supervised learning models, we need to use human heuristics to do the feature engineering (extraction) from the raw dataset. The quality of the features extracted largely impacts the final performance.” The participant took the “Question” category as an example, “If a model has a high probability of making severe errors for question sentences, we may specify a feature in feature engineering to detect whether a sentence is a question or a statement. So this feature could hopefully help the model make decisions.”
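
As a toy illustration of D5’s suggestion, a hand-crafted “is this a question?” feature could look like the sketch below; the heuristics are assumptions, not features of D1’s actual model.

```python
# Toy sketch of a hand-crafted question-detection feature that could be
# appended to a model's feature vector; the heuristics are assumptions.
QUESTION_STARTERS = ("what", "why", "how", "who", "when", "where",
                     "is", "are", "do", "does", "can", "will")

def is_question(sentence: str) -> bool:
    s = sentence.strip().lower()
    if not s:
        return False
    return s.endswith("?") or s.split()[0] in QUESTION_STARTERS
```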

From our observation of the first sessions, all AI developers read through only about a dozen misclassified sentences because the process of error analysis requires great mental effort. Displaying the errors at different levels of granularity relieves AI developers when analyzing a large number of errors. As D2 commented, “I like the overview which gives me the impression of the entire dataset. You know, reading through two hundred errors is time-consuming and impossible (during the first interview session), and I did not do a good job last time.”

6.4 Examining errors by words

After examining the Statistics View, D4 switched his focus to the Cloud View, which shows sentiment words as a tag cloud (Figure 5(b)). The participant noticed that the word “I” has the biggest font size while “Good” is the second biggest. “Typically in sentiment analysis, you will not expect ‘I’ to be particularly positive or negative. ‘Good’ is the second one. It makes more sense, but ‘I’, ‘is’, ‘was’, ‘he’, ‘me’, ‘my’, ‘she’, among the first line, are not sentiment words.” However, the participant changed his mind after investigating sentences containing “He” and “She”. He first clicked “She” and the Table View updated. The participant noticed that the word contributes a lot to neutral sentences, and contributes once each to negative and positive. Similarly, the participant examined sentences containing the word “He”, and noticed that four out of eight are negative, with “He” contributing to the negative sentiment. “Well, it is interesting to see the difference between ‘She’ and ‘He’. I guess the model tends to regard ‘He’ as a negative word.” He added, “I think that it is necessary to examine the training data (of the model) to see whether the stop words are equal in distribution for each sentiment.”
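
The per-word breakdown behind the Cloud and Table views can be sketched as follows, assuming each error is stored with its predicted label and LIME word weights; the record layout is an assumption for illustration.

```python
# Sketch: count, per word, how often it contributed toward each predicted
# label. The error-record layout here is an assumption.
from collections import Counter, defaultdict

def word_sentiment_counts(errors):
    """errors: iterable of dicts like
    {"label": "negative", "lime_weights": [("he", 0.31), ("bad", 0.12)]}."""
    counts = defaultdict(Counter)
    for err in errors:
        for word, weight in err["lime_weights"]:
            if weight > 0:  # word pushed the prediction toward err["label"]
                counts[word.lower()][err["label"]] += 1
    return counts
```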

Before using Challenge.AI, some AI developers (D1, D4, and D5) found it hard to identify patterns and categorize sentences. For example, during the first interview sessions, D4 could not tell the reason for some of the predictions. The participant pointed to one question sentence and commented, “There is no reason to label this question negative or positive, because it apparently contains none of the words with any sentiment.” D4 and D5 noted that they did not agree with some ground-truth labels. As D4 said, “I would recommend you have a category for mis-labeled because it is subjective.” The participant further pointed to a sentence whose benchmark label is neutral, and added, “Now here is one, ‘Social Is Too Important For Google To Screw Up A Big Launch Circus’. It sounds kind of negative to me, which is how the model classified it.” By borrowing LIME [Ribeiro2016] to extract sentiment words, Challenge.AI provides explanations of errors at the word level, allowing AI developers to find potential bias in the training data.
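
The word-level extraction could be done as sketched below with the LIME library [Ribeiro2016]; ‘predict_proba’ is an assumed wrapper around the target model that maps a list of sentences to a probability array.

```python
# Sketch of extracting word-level contributions with LIME.
# 'predict_proba' is an assumed stand-in for the target model's wrapper;
# it must return an (n_samples, n_classes) probability array.
import numpy as np
from lime.lime_text import LimeTextExplainer

LABELS = ["negative", "neutral", "positive"]

def predict_proba(sentences):
    # placeholder: a real wrapper would call the sentiment model here
    return np.full((len(sentences), len(LABELS)), 1.0 / len(LABELS))

explainer = LimeTextExplainer(class_names=LABELS)
exp = explainer.explain_instance("I can run longer now", predict_proba,
                                 num_features=5)
print(exp.as_list())  # [(word, weight), ...] for the explained class
```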

D1 showed great interest in exploring samples in the “Mixed-sentiment” category. He clicked the dark red bars under this category and read through these severe errors in the Table View. The participant noted, “Some sentences in this category are reversed sentiment.” Pointing to a sample, he added, “Like in this case, it has the word ‘but’. All content after ‘but’ is the content that the speaker wants to emphasize. The former part is like warm up. So the later part highlights the whole meaning of the sentence. In this case, I will not say it is a mixed sentiment. It is reversed.” The participant then used the search box to find all sentences containing “however” but found no sample in the table. He commented, “I would like to test the model with sentences using reversing words, like ‘but’, ‘however’, ‘although’, etc. The model may not do a good job.”

During the first interview sessions, we realized that not all errors are worth investigating. When looking at the errors, D5 commented, “A lot of these are difficult for humans. For those which are less obvious, you may ask three different people and get three different answers.” The participant further added, “Since sentiment analysis is subjective, if an error is ambiguous to humans, I do not think the model made a severe mistake.” Therefore, the definition of severity helps AI developers focus on errors that are important to examine.

7 Design Implications

Proactive testing is a promising direction that helps AI developers gain more insight into their models. Challenge.AI is the first prototype that supports proactive testing using crowd intelligence, and we suggest the following aspects for future research to explore.

First, include all the data generated by the crowd, both the samples that fail the model and those that do not. Misclassified samples alone are not enough to help AI developers understand how the model performs in certain cases. For example, D2 found two sentences containing the word “Trump” by filtering. However, the participant could not conclude whether the model is biased with respect to the word “Trump”. D2 commented, “I am only looking at the errors. It is hard to tell (whether the model is biased to ‘Trump’). I mean, these errors could be 99% of the instances in which case the model is doing very poorly. But this could be less than 1% of the instances in which case the model is doing fantastic.”

Second, apply better explanation techniques. In this study, we chose the LIME algorithm [Ribeiro2016] to identify and highlight sentiment words related to the prediction. However, our participants found some of the highlighted sentiment words confusing. For example, D4 found a positive sentence that the AI labeled negative, “I can run longer now”, in which the word “can” is highlighted in green (positive) and “longer” in blue (neutral). He commented, “The AI label is negative. However, it is weird that no words are marked as negative.” Such issues may be resolved as more advanced explanation techniques are developed in the future.

Third, enhance the generation component for word-level categories. Challenge.AI proved effective in collecting samples belonging to concept-level categories such as “Mixed-sentiment” and “Subtle Sentiment Cues”. However, AI developers may sometimes seek to test the model with samples containing certain words, such as “Trump”. Intuitively, collecting samples with certain words could be more cost- and time-efficient using information retrieval techniques, as sketched below. We plan to study how various information retrieval techniques can help collect samples of different categories.
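
A minimal sketch of this retrieval-based alternative follows: filter a candidate corpus for a target word instead of issuing a crowd generation task. Where the corpus comes from is an open design question.

```python
# Minimal sketch of word-level sample collection via retrieval rather than
# crowd generation; the corpus source is an assumption.
def retrieve_samples(corpus, target_word, limit=50):
    target = target_word.lower()
    hits = [s for s in corpus if target in s.lower().split()]
    return hits[:limit]

# e.g., retrieve_samples(tweets, "Trump") gathers candidate sentences
# containing the target word for focused testing.
```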

Fourth, provide real-time feedback for proactive testing. The main process of sample collection, i.e., generation, validation, and categorization, takes a long time, so AI developers cannot test the model in real time. One possible solution is to borrow workflows from real-time crowdsourcing [Lasecki2011, Cruz2015, Lundgard2018, Liao2018] to reduce the delay in obtaining testing results. Another solution is to augment the error analysis interface, as suggested by D2: “Since the model is already trained. Maybe you can (embed the model in the backend and) add an input box for real-time testing so that I can test some of the sentences in my mind.”
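
D2’s suggestion could be realized by exposing the trained model behind a small web endpoint, as in the sketch below; the stub model stands in for D1’s actual model, and the route name is illustrative.

```python
# Sketch of a real-time testing endpoint; StubModel is a placeholder
# for the actual trained sentiment model.
from flask import Flask, request, jsonify

class StubModel:
    def predict(self, sentence):
        return "neutral", 0.34  # placeholder for the real trained model

app = Flask(__name__)
model = StubModel()

@app.route("/predict", methods=["POST"])
def predict():
    sentence = request.get_json()["sentence"]
    label, prob = model.predict(sentence)
    return jsonify({"label": label, "probability": prob})

if __name__ == "__main__":
    app.run(port=5000)
```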

Fifth, augment error analysis with advanced analytical methods. Our system borrows knowledge from AI developers to identify new patterns to test. However, the process is time-consuming and does not scale. It would be beneficial to incorporate automatic analytical methods, such as text classification or clustering, to assist AI developers in summarizing patterns among errors.
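
For instance, misclassified sentences could be clustered to surface recurring error patterns automatically, as in the sketch below; the number of clusters and the TF-IDF features are assumptions.

```python
# Sketch of clustering misclassified sentences to surface error patterns;
# k and the feature choice are assumptions.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_errors(sentences, k=5):
    vectors = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # one cluster id per sentence, for developers to inspect group by group
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
```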

8 Generalizability and Future Work

Although our study grounds the exploration in the context of sentiment analysis, our system can be generalized to other text classification domains for crowd-based proactive testing, such as part-of-speech analysis. In addition, we found that explanations help the crowd craft samples that fail the model. This idea can be borrowed for generating adversarial samples with crowd intelligence in other fields, such as computer vision, for adversarial learning.

There are a number of promising future directions. First, after error categorization, we obtained a testing dataset in which each sample is labeled with a ground-truth category and sentiment. We plan to release the dataset to the public to help more AI developers test sentiment analysis models. Second, we seek a comprehensive understanding of the crowd-crafted dataset by analyzing it from different perspectives. We plan to establish metrics to compare the generated dataset with open-source ones along dimensions such as the distribution of sentence length, topic coverage, syntactic structure, and unigram distribution.
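
Two of these comparison metrics are straightforward to compute; a sketch follows, with simple whitespace tokenization as a simplifying assumption.

```python
# Sketch of two proposed dataset-comparison metrics: sentence-length and
# unigram distributions. Whitespace tokenization is an assumption.
from collections import Counter

def length_distribution(sentences):
    return Counter(len(s.split()) for s in sentences)

def unigram_distribution(sentences):
    counts = Counter(w.lower() for s in sentences for w in s.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```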

9 Conclusion

In this paper, we designed and built Challenge.AI, a new crowd system for AI developers to proactively test their models. Challenge.AI consists of four components: error generation, validation, categorization, and analysis. Our system features an explanation-based error generation component that incorporates crowd intelligence and machine learning to help the crowd craft errors that fail a sentiment analysis model. We conducted a crowd user study to quantitatively evaluate the effectiveness of this component, and found that the explanation-based error generation technique saved crowd workers 13.8% of the time when crafting sentences to fail the model. We also evaluated Challenge.AI with five AI developers; the study showed that the system helped participants identify error categories that had not been discovered before. We believe the proactive testing architecture developed in this work offers new opportunities and tools to reshape the AI testing process.

References
