Riposte! A Large Corpus of Counter-Arguments
Constructive feedback is an effective method for improving critical thinking skills. Counter-arguments (CAs), one form of constructive feedback, have been shown to be useful for improving critical thinking skills. However, little work has been done on constructing a large-scale corpus of CAs that can drive research on the automatic generation of CAs for fallacious micro-level arguments (i.e. a single claim and premise pair). In this work, we cast providing constructive feedback as a natural language processing task and create Riposte!, a corpus of CAs, towards this goal. Produced by crowdworkers, Riposte! contains over 18k CAs. We instruct workers to first identify common fallacy types and then produce a CA which identifies the fallacy. We analyze how workers create CAs and construct a baseline model based on our analysis.
Critical thinking is a crucial skill necessary for valid reasoning, especially for students in a pedagogical context. Towards improving critical thinking skills for students, educators have evaluated the contents of a work and provided constructive feedback (i.e. criticism) to the student. Although such methods are effective, they require educators to articulately evaluate the contents of an essay, which can be time-consuming and varies depending on an educator’s critical thinking skills.
Table 1: Fallacy types, their descriptions, and the fill-in-the-blank templates shown to workers.
|Fallacy type||Description||Template|
|Begging the Question||The truth of the premise is already assumed by the claim.||“If [something] is assumed to be true, then [something else] is already assumed to be true”.|
|Hasty Generalization||Someone assumes something is generally always the case based on a few instances.||“It’s too hasty to assume that [text]”.|
|Questionable Cause||The cause of an effect is questionable.||“There is a questionable cause in the argument because [questionable cause] does/will not cause [effect]”.|
|Red Herring||Someone diverts attention away from the original claim by changing the topic.||“The topic being discussed is [first topic], but it is being changed to [second topic]”.|
In the field of educational research, the usefulness of identifying fallacies and counter-arguments, henceforth CAs, as constructive feedback has been emphasized de2008constructive; oktavia2014analysis; indah2015fallacies; song2013teaching, as both can help writers produce high-quality arguments while simultaneously improving their critical thinking skills. Figure 1 shows an example of an argument with a fallacy (i.e. an error in the logical reasoning of the argument) and its CAs (i.e. attacks on the argument). In the field of NLP, previous work has addressed fallacy identification HABERNAL18.494, CA retrieval wachsmuth2017computational, and CA generation for macro-level arguments hua2018neural, and essay criteria such as thesis clarity persing2013modeling, argument strength persing2015modeling, and stance persing2016modeling have been evaluated. However, in the pedagogical context, a macro-level argument (e.g., an essay) may consist of several micro-level arguments (i.e. one claim/premise pair each), and each of these can contain multiple fallacies. To bridge this gap, we create CAs for micro-level arguments which can be useful for automatic constructive feedback generation.
Several challenges exist for creating a corpus of CAs for constructive feedback. First, the corpus must contain a variety of different topics and arguments to both train and evaluate a model for unseen topics. Second, an argument can have many different fallacies which are not easily identifiable oktavia2014analysis; indah2015fallacies; el2017logical. Third, producing CAs is costly and time-consuming.
In this work, we design a task for automatic constructive feedback and create Riposte!, a large-scale corpus of CAs via crowdsourcing. Workers are first instructed to identify common fallacy types (begging the question, hasty generalization, questionable cause, and red herring) in educational research de2008constructive; oktavia2014analysis; indah2015fallacies; song2013teaching and create a CA for micro-level arguments. In total, we collect 18,887 CAs (see Figure 1 for examples of CAs in Riposte!). We then cast automatic constructive feedback as a text generation task and create a baseline model.
2 The Riposte! corpus
In this section, we determine if training data can easily be created. To the best of our knowledge, this is the first research that addresses corpus construction for automatic constructive feedback.
2.1 Counter-arguments as an NLP task
When designing a task for automatic constructive feedback, one must take into account real-world situations. In the pedagogical context, educators can choose the same topic for students annually. With automatic constructive feedback, educators may choose to use a pretrained, supervised model for a single topic with editable background knowledge (i.e., educators can choose which knowledge is necessary to automatically construct feedback). On the other hand, educators may choose a new topic each year, and thus a conditioned model for multiple topics may also be considered. The input to a model should be a topic and several claim and premise argument pairs, and the output would be a set of CAs useful for improving the argument.
2.2 Existing corpus of arguments
When training a model for constructive feedback, the data should consist of many CAs for a wide variety of topics. We use the Argument Reasoning Comprehension (ARC) dataset Habernal.et.al.2018.NAACL.ARCT, a corpus of 1,263 unique topic-claim-premise pairs (172 unique topics and 264 unique claims). We assume the arguments in ARC contain many fallacies because they were created by non-expert crowdworkers (i.e., workers are not experts in the field of argumentation).
2.3 Riposte! creation
For creating Riposte!, we use the crowdsourcing platform Amazon Mechanical Turk (https://www.mturk.com/).
One challenge for collecting training data for automatic constructive feedback is that the CAs should be useful for improving an argument. To assist with collecting such CAs, we adopt the protocol of reisert2019annotation for collecting CAs using crowdsourcing, with several modifications for our data collection (see Appendix). We create 4 separate crowdsourcing tasks (i.e., one for each fallacy type). For each of the 1,263 arguments in ARC, we ask 5 workers to produce a CA. For each fallacy type, we assist workers by providing them with a “fill-in-the-blank” template, where workers are instructed to fill in text boxes for a given pattern. The fallacy types and templates are shown in Table 1.
2.4 Riposte! statistics
The statistics of Riposte! are shown in Table 2. 11,076 of the CAs are fallacy-specific (i.e. workers first identified a fallacy and then created the CA), and 7,811 CAs were created when a worker did not believe the specified fallacy existed in the argument. 6,373 instances were labeled as unsure (i.e. the worker was unsure about the fallacy type).
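The reported counts can be cross-checked against the collection setup in Section 2.3. The short sketch below assumes that every collected response is counted exactly once, either as a CA or as an unsure instance; this is our reading of the figures and is not stated explicitly in the text.

```python
# Figures reported in the text.
num_arguments = 1263       # arguments taken from ARC
num_fallacy_tasks = 4      # one crowdsourcing task per fallacy type
workers_per_argument = 5   # workers asked per argument in each task

fallacy_specific_cas = 11076
other_cas = 7811           # worker did not believe the specified fallacy existed
unsure_instances = 6373

total_cas = fallacy_specific_cas + other_cas
total_responses = num_arguments * num_fallacy_tasks * workers_per_argument

print(total_cas)                     # 18887
print(total_cas + unsure_instances)  # 25260
print(total_responses)               # 25260
```

Under this assumption, the CAs and unsure instances together account for all 1,263 x 4 x 5 responses.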
3 How did workers create CAs?
When creating training data for automatic constructive feedback, CAs should be useful and diverse. We determine how workers create CAs by calculating the similarity between i) a CA and argument and ii) CAs for single arguments.
How similar is one CA to the premise-claim?
In order to determine how annotators created their CAs, we calculate the BLEU papineni2002bleu score of each CA and the argument (e.g., premise/claim). The distribution in Figure 2 indicates that workers copied keywords directly from the original argument in some cases.
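As an illustration of this overlap measure, the following is a minimal sketch of a simplified sentence-level BLEU (clipped n-gram precision up to bigrams, with a brevity penalty and no smoothing). It is not the exact scorer used for Figure 2. Treating the argument as the reference and the CA as the hypothesis, the whitespace tokenization, and the example CA are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=2):
    """Simplified BLEU: clipped n-gram precision (n <= max_n) with a brevity penalty."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # The brevity penalty lowers the score of hypotheses shorter than the reference.
    if len(hypothesis) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Assumption: the argument is the reference and the CA is the hypothesis.
argument = "all children should be able to participate in sports".split()
ca = "all children should play sports".split()  # made-up CA for illustration
score = sentence_bleu(argument, ca)
print(f"BLEU = {score:.3f}")
```

A CA that copies many keywords from the argument receives a higher score, while a CA that shares no words with the argument scores 0.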
How similar are the CAs across annotators?
One design decision when building Riposte! was that with more annotators, we could collect a wide variety of diverse CAs for a single argument regardless of the fallacy type. We first calculate the similarity of the CAs across annotators for a single argument. We tokenize the corpus using spaCy (https://spacy.io/) and remove stop words and punctuation. We then calculate the average Jaccard similarity score for all combinations of CAs per unique argument and average over all arguments. The results (see Table 3) indicate that the CAs are diverse.
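The averaging procedure described above can be sketched as follows. This is a minimal illustration, not the original analysis script: the whitespace tokenizer, the lowercasing, and the tiny stop-word list stand in for spaCy and are assumptions, as are the example CAs.

```python
from itertools import combinations
from statistics import mean

# Tiny illustrative stop-word list; the paper uses spaCy's tokenizer and stop words.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "that", "it", "in", "be", "will", "not"}


def content_tokens(text):
    """Lowercase, split on whitespace, strip punctuation, and drop stop words."""
    tokens = (tok.strip(".,;:!?\"'()[]") for tok in text.lower().split())
    return {tok for tok in tokens if tok and tok not in STOP_WORDS}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(cas_by_argument):
    """Average the pairwise Jaccard scores per argument, then average over arguments."""
    per_argument = []
    for cas in cas_by_argument.values():
        token_sets = [content_tokens(ca) for ca in cas]
        pairs = list(combinations(token_sets, 2))
        if pairs:
            per_argument.append(mean(jaccard(a, b) for a, b in pairs))
    return mean(per_argument) if per_argument else 0.0


# Made-up CAs: identical for the first argument, disjoint for the second.
example = {
    "arg1": ["Sports help children.", "Sports help children."],
    "arg2": ["Cats sleep.", "Dogs bark."],
}
print(mean_pairwise_jaccard(example))  # 0.5
```

A lower average indicates that annotators used different content words for the same argument, which is the sense in which the CAs are called diverse.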
4.1 Experimental design
In Section 3, we observed that workers copied keywords from the argument when creating a CA. Based on this observation, we experiment with different input settings to the model to better understand which parts of the argument annotators used to create their argument (e.g., topic (T) only, premise (P) only, claim (C) only, and so forth). We cast the task of automatic constructive feedback as a generation task and experiment with such settings.
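To make the input settings concrete, the sketch below builds a source string for settings such as T, P, C, or P+C. The `Argument` class, the `<sep>` separator token, the order of the parts, and the example topic string are hypothetical choices for illustration; the text does not specify how the parts are concatenated.

```python
from dataclasses import dataclass


@dataclass
class Argument:
    topic: str
    claim: str
    premise: str


# Hypothetical separator token placed between argument parts.
SEP = "<sep>"

# Maps the setting letters used in the text to Argument fields.
FIELDS = {"T": "topic", "P": "premise", "C": "claim"}


def build_source(argument, setting):
    """Build the model input for a setting such as 'T', 'P', 'C', or 'P+C'."""
    parts = [getattr(argument, FIELDS[key]) for key in setting.split("+")]
    return f" {SEP} ".join(parts)


arg = Argument(
    topic="home-schoolers and school sports",  # made-up topic string
    claim="home-schoolers should play for high school teams",
    premise="all children should be able to participate in sports",
)
print(build_source(arg, "P+C"))
```

Comparing the scores obtained under each setting then indicates which parts of the argument carry the words that annotators reused.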
Since both new and existing essay topics can be used and introduced by educators, we consider two possible settings: i) in-domain (i.e. topics are shared between splits) and ii) out-of-domain (i.e. topics are not shared).
For our generation model, we use gold fallacy type information. This allows us to understand how well the model can generate CAs when correct fallacy types are predicted. (We also built an LSTM-encoder multi-label classifier; its 4-way classification result was a 36.02% F1 score, indicating that more sophisticated features such as background knowledge and reasoning are necessary.)
4.2 Data preparation
We filter out all unsure instances. We use majority vote for selecting CAs and their fallacy types. We split the data into 80% train, 10% test, and 10% dev. In each setting, we ensure that no unique claim-premise pairs are shared across splits.
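One way to realize such splits is to assign whole groups of examples to a split, so that a group never spans two splits. The sketch below is an assumption about the procedure, not the original script: grouping by topic gives an out-of-domain split, and grouping by claim-premise pair gives a split in which no pair is shared. Note that the ratios are applied to groups here, so the example-level proportions are only approximately 80/10/10. The field names and toy data are hypothetical.

```python
import random
from collections import defaultdict


def grouped_split(examples, key, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split into train/test/dev so that no group (given by `key`) spans two splits."""
    groups = defaultdict(list)
    for example in examples:
        groups[key(example)].append(example)
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    n_train = int(ratios[0] * len(group_ids))
    n_test = int(ratios[1] * len(group_ids))
    buckets = {
        "train": group_ids[:n_train],
        "test": group_ids[n_train:n_train + n_test],
        "dev": group_ids[n_train + n_test:],
    }
    return {name: [ex for g in ids for ex in groups[g]] for name, ids in buckets.items()}


# Toy data: 10 topics, 2 arguments per topic, 3 CAs per argument.
examples = [
    {"topic": f"t{i}", "claim": f"c{i}-{j}", "premise": f"p{i}-{j}", "ca": f"ca{k}"}
    for i in range(10) for j in range(2) for k in range(3)
]

# Out-of-domain: topics are not shared between splits.
out_of_domain = grouped_split(examples, key=lambda ex: ex["topic"])
# No claim-premise pair is shared between splits; topics may be shared.
by_argument = grouped_split(examples, key=lambda ex: (ex["claim"], ex["premise"]))
print({name: len(split) for name, split in out_of_domain.items()})
```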
For each experiment, we tokenize using spaCy and lowercase all tokens. For hasty generalization CAs, we replace the template with a special token (i.e. hg). For all other CAs, we discard the original template and add a special token between slot-fillers. This allows our baseline model to focus more on the content words found in the original argument.
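The template handling can be sketched as follows, assuming that the `hg` token marks CAs written with the single-slot hasty generalization template and that the text typed into each slot is available separately. The `<sep>` token, the position of `hg` before the slot text, the fallacy labels, and the slot fillers in the example are hypothetical; whitespace normalization stands in for spaCy tokenization.

```python
# Hypothetical separator token between slot fillers; only "hg" appears in the text.
SLOT_SEP = "<sep>"


def build_target(fallacy, slot_fillers):
    """Turn the text a worker typed into the template slots into a lowercased target."""
    fillers = [" ".join(filler.lower().split()) for filler in slot_fillers]
    if fallacy == "hasty_generalization":
        # Single-slot template: the template text is replaced by the special token.
        return "hg " + fillers[0]
    # Two-slot templates: the template text is dropped and the fillers are separated.
    return f" {SLOT_SEP} ".join(fillers)


# Made-up slot fillers for illustration.
print(build_target("hasty_generalization", ["All home-schoolers want to play sports"]))
print(build_target("questionable_cause", ["The embargo", "economic harm"]))
```

Dropping the fixed template words leaves targets that consist mostly of content words, many of which also occur in the original argument.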
Based on our observations in Section 4.1, we create a baseline for determining which parts of an argument annotators used to create CAs and how well a model can generate a CA.
Simple Overlap (SO)
We calculate simple BLEU overlap for each setting against the CA as a baseline. In order to directly compare the results to our seq2seq baseline model, we calculate the BLEU scores for the preprocessed data from our seq2seq baseline model with unknown words.
We preprocess and train our model using fairseq ott2019fairseq. We use pre-trained word embeddings (300-dimensional GloVe embeddings pennington2014glove) which are useful for generation tasks qi2018. We create two models (seq2seq-i and seq2seq-o) for in-domain and out-of-domain settings, respectively. (For seq2seq-i and seq2seq-o, we use the best hyperparameters from seq2seq-i (P+C) and seq2seq-o (P+C) across all settings, respectively.)
We evaluate the results of our baselines using BLEU (see Table 4). Our SO results indicate that workers mainly used the premise and claim when creating CAs. We observe that seq2seq-o’s performance is low, indicating a simple model is not sufficient when unknown topics are introduced.
For evaluation, we would also like to compare the quality of gold CAs against generated CAs. We conduct an annotation study using AMT (3 workers per CA) and evaluate CA quality along 3 dimensions: Strength, Persuasiveness, and Relevance (we use the guidelines of carlile2018give, slightly modified for CAs; please see the Appendix for our criteria). In total, we show 50 arguments and their gold/generated CAs, where each argument is annotated by 3 workers (the 50 generated CAs are from seq2seq-i (P+C)). The results are shown in Table 5 (we convert from a 5-point to a 3-point scale for score calculation). We observed that workers found generated CAs more relevant, but judged them weaker and less persuasive. Examples of the generated output for our best model (seq2seq-i P+C) are shown in Table 6.
|home - schoolers should play for high school teams because all children should be able to participate in sports .||all children are to play in sports even home - schoolers will be playing sports .||all children should be able to participate in sports home - schoolers should play for high school teams .|
|the u.s . should lift sanctions with cuba because the embargo hurts our own economy .||the u.s . the embargo .||us sanctions our own economy .|
5 Conclusion and future work
In this work, we construct Riposte!, a large corpus of 18,887 crowdworker-produced CAs. Our analysis on Riposte! reveals that non-expert crowdworkers can produce reasonably diverse CAs. We cast automatic constructive feedback as a text generation task and create a baseline model.
In our future work, we will explore injecting background knowledge and reasoning into our model to generate CAs for unknown topics and provide detailed information to students about how to improve their original argumentation.
Appendix A Annotation Interface and Guidelines
We show the annotation interface used in our full-fledged crowdsourcing experiment in Figure 3. The conditions shown to workers for 3 fallacy types are shown in Figure 4. The interface for the remaining fallacy type is shown in Figure 5.
The guidelines shown to workers are shown in Figure 6.
Appendix B Crowdsourcing settings
For our full-fledged experiment, we use the following settings: workers were required to have at least 100 approved Human Intelligence Tasks (HITs) and a HIT Approval Rate of at least 96%. For each HIT, workers were rewarded with $0.20 (in the case of hasty generalization, workers were rewarded with $0.10). An example of the guidelines for one fallacy type (questionable cause) is shown in Figure 6. For each of our experiments, the settings are as follows. If workers selected no or unsure, they were required to provide a CA or reason, respectively. We informed workers that their work would be rejected if one or more of the following conditions was met: the CA is i) blank, ii) not a sentence, iii) a direct copy-paste of the original argument in the text box or a copy-paste of the guidelines, or iv) not written in English. We manually rejected responses that met these criteria.
Appendix C Model Hyperparameters
|dropout||0.1, 0.2, 0.3, 0.4, 0.5|
|hidden layers||128, 256, 512, 1024|
|learning rate||0.1, 0.01, 0.001|
For seq2seq-i (P+C) and seq2seq-o (P+C), we experiment with the hyperparameters shown in Table 7. The best hyperparameters for our experiment are as follows. For seq2seq-i, we use the following settings. The dropout is set to 0.4. We use SGD as an optimizer with a learning rate of 0.01. The number of encoder/decoder layers is set to 1, and the encoder/decoder hidden size is 256.
For seq2seq-o, we use the following settings. The dropout is set to 0.2. We use SGD as an optimizer with a learning rate of 0.01. The number of encoder/decoder layers is set to 1, and the encoder/decoder hidden size is 256.
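For reference, the search space in Table 7 can be enumerated as below. The sketch assumes an exhaustive grid over all combinations, which the text does not state, and it interprets the "hidden layers" row as the encoder/decoder hidden size, consistent with the reported best value of 256. The dictionary keys are our own names, not fairseq options.

```python
from itertools import product

# Candidate values from Table 7.
search_space = {
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
    "hidden_size": [128, 256, 512, 1024],  # listed as "hidden layers" in Table 7
    "learning_rate": [0.1, 0.01, 0.001],
}

# Every combination of the candidate values (an exhaustive grid is an assumption).
configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]
print(len(configs))  # 60

# Best settings reported in the text for the two models.
best_seq2seq_i = {"dropout": 0.4, "hidden_size": 256, "learning_rate": 0.01}
best_seq2seq_o = {"dropout": 0.2, "hidden_size": 256, "learning_rate": 0.01}
```

The two reported best settings differ only in dropout.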
|Relevant||Anyone can see how the counter-argument attacks the argument. The relationship between the two components is either explicit or extremely easy to infer. The relationship is thoroughly explained in the text because the two components contain the same words or exhibit coreference.|
|Persuasive||A very strong, clear counter-argument. It would persuade most readers and is devoid of errors that might detract from its strength or make it difficult to understand.|
|Strength||A very strong counter-argument with no fallacies. Not much can be improved in order to attack the argument better.|
Appendix D Annotation Criteria and Examples
The guidelines shown to crowdworkers when annotating the quality of CAs are shown in Table 8. We show the description for strong dimensions (i.e., score of 5).
Examples of CAs for one argument are shown in Figure 7.