What Gets Echoed? Understanding the “Pointers” in Explanations of Persuasive Arguments

David Atkinson    Kumar Bhargav Srinivasan    Chenhao Tan
Department of Computer Science
University of Colorado Boulder
Boulder, CO
david.i.atkinson, kumar.srinivasan, chenhao.tan@colorado.edu

Explanations are central to everyday life, and are a topic of growing interest in the AI community. To investigate the process of providing natural language explanations, we leverage the dynamics of the /r/ChangeMyView subreddit to build a dataset with 36K naturally occurring explanations of why an argument is persuasive. We propose a novel word-level prediction task to investigate how explanations selectively reuse, or echo, information from what is being explained (henceforth, explanandum). We develop features to capture the properties of a word in the explanandum, and show that our proposed features not only have relatively strong predictive power on the echoing of a word in an explanation, but also enhance neural methods of generating explanations. In particular, while the non-contextual properties of a word itself are more valuable for stopwords, the interaction between the constituent parts of an explanandum is crucial in predicting the echoing of content words. We also find intriguing patterns of a word being echoed. For example, although nouns are generally less likely to be echoed, subjects and objects can, depending on their source, be more likely to be echoed in the explanations.

1 Introduction

Explanations are essential for understanding and learning (Keil, 2006). They can take many forms, ranging from everyday explanations for questions such as why one likes Star Wars, to sophisticated formalization in the philosophy of science (Salmon, 2006), to simply highlighting features in recent work on interpretable machine learning (Ribeiro et al., 2016).

Original post (OP): CMV: most hit music artists today are bad musicians
Now I know, music is art and art has no rules, but this is only so true. Movies are art too but I think most of us can agree the emoji movie was objectively bad. That aside: I really feel like once you remove the persona and performances of the artists from the ”top 40” songs and listen to them as just a song, most are objectively bad. They’re super repetitive, the lyrics and painfully generic, and there’s hardly ever anything new or challenging. And from what I understand most of these artists don’t even write their own songs. Of course there are exceptions but I find them to be extremely rare. It seems to me they’re only popular become of who they are and how they look/perform. I realize this is probably a very snobbish view which is why I want to be enlightened, so can anyone convince me otherwise? Are they actually good musicians or just good performers? [one more paragraph …]
Persuasive comment (PC): Music appreciation is a skill, and it’s all about pattern recognition.
When we’re children, we need songs that are really simple, repetitive and with easy to recognize patterns. The younger we are, the simpler the songs. Toddlers like nursery rhymes, lullabies, jingles. Teens like pop music. And teens spend more on music than anyone else. [four more paragraphs …]
Lastly, you have to consider that music can be listened to in different ways and for different purposes. You can listen to it alone on headphones, and think about what it means and how it makes you feel. Or you can dance to it with your friends. Or maybe you need something on in the background during a dinner party, or a house party, or while you study, or are trying to fall asleep, or work out. Pop music is really good in some of these situations, really bad in others. But it serves a definite purpose and isn’t bad in any essential way.
Explanation: I guess I never really looked at it as music serving different purposes. I can see how pop music fills a certain purpose, and I guess the artist does n’t necessarily have to be the one to write the song (although I appreciate it when they do).
Table 1: An illustration of the pointers in an example explanation from /r/ChangeMyView. We color the words in the explanation based on whether they are used in the original post (e.g., artist), in the persuasive comment (e.g., purpose), or both (e.g., music). We stem all the words before matching and do not color stopwords for readability.

Although everyday explanations are mostly encoded in natural language, natural language explanations remain understudied in NLP, partly due to a lack of appropriate datasets and problem formulations. To address these challenges, we leverage /r/ChangeMyView, a community dedicated to sharing counterarguments to controversial views on Reddit, to build a sizable dataset of naturally-occurring explanations. Specifically, in /r/ChangeMyView, an original poster (OP) first delineates the rationales for a (controversial) opinion (e.g., in Table 1, “most hit music artists today are bad musicians”). Members of /r/ChangeMyView are invited to provide counterarguments. If a counterargument changes the OP’s view, the OP awards a ∆ (delta) to indicate the change and is required to explain why the counterargument is persuasive. In this work, we refer to what is being explained, including both the original post and the persuasive comment, as the explanandum.[1]

[1] The plural of explanandum is explananda.

An important advantage of explanations in /r/ChangeMyView is that the explanandum contains most of the required information to provide its explanation. These explanations often select key counterarguments in the persuasive comment and connect them with the original post. As shown in Table 1, the explanation naturally points to, or echoes, part of the explanandum (including both the persuasive comment and the original post) and in this case highlights the argument of “music serving different purposes.”

These naturally-occurring explanations thus enable us to computationally investigate the selective nature of explanations: “people rarely, if ever, expect an explanation that consists of an actual and complete cause of an event. Humans are adept at selecting one or two causes from a sometimes infinite number of causes to be the explanation” (Miller, 2018). To understand the selective process of providing explanations, we formulate a word-level task to predict whether a word in an explanandum will be echoed in its explanation.

Inspired by the observation that words that are likely to be echoed are either frequent or rare, we propose a variety of features to capture how a word is used in the explanandum as well as its non-contextual properties in Section 4. We find that a word’s usage in the original post and in the persuasive argument are similarly related to being echoed, except in part-of-speech tags and grammatical relations. For instance, verbs in the original post are less likely to be echoed, while the relationship is reversed in the persuasive argument.

We further demonstrate that these features can significantly outperform a random baseline and even a neural model with significantly more knowledge of a word’s context. Predicting whether content words (i.e., non-stopwords) are echoed is much more difficult than predicting whether stopwords are echoed;[2] among content words, adjectives are the most difficult and nouns are relatively the easiest. This observation highlights the important role of nouns in explanations. We also find that the relationship between a word’s usage in the original post and in the persuasive comment is crucial for predicting the echoing of content words. Our proposed features can also improve the performance of pointer generator networks with coverage in generating explanations (See et al., 2017).

[2] We use the stopword list in NLTK.

To summarize, our main contributions are:

  • We highlight the importance of computationally characterizing human explanations and formulate a concrete problem of predicting how information is selected from explananda to form explanations, including building a novel dataset of naturally-occurring explanations.

  • We provide a computational characterization of natural language explanations and demonstrate the U-shape in which words get echoed.

  • We identify interesting patterns in what gets echoed through a novel word-level classification task, including the importance of nouns in shaping explanations and the importance of contextual properties of both the original post and persuasive comment in predicting the echoing of content words.

  • We show that vanilla LSTMs fail to learn some of the features we develop and that the proposed features can even improve performance in generating explanations with pointer networks.

Our code and dataset are available at https://chenhaot.com/papers/explanation-pointers.html.

2 Related Work

To provide background for our study, we first present a brief overview of explanations for the NLP community, and then discuss the connection of our study with pointer networks, linguistic accommodation, and argumentation mining.

The most developed discussion of explanations is in the philosophy of science. Extensive studies aim to develop formal models of explanations (e.g., the deductive-nomological model in Hempel and Oppenheim (1948), see Salmon (2006) and Woodward (2005) for a review). In this view, explanations are like proofs in logic. On the other hand, psychology and cognitive sciences examine “everyday explanations” (Keil, 2006; Lombrozo, 2006). These explanations tend to be selective, are typically encoded in natural language, and shape our understanding and learning in life despite the absence of “axioms.” Please refer to Wilson and Keil (1998) for a detailed comparison of these two modes of explanation.

Although explanations have attracted significant interest from the AI community thanks to the growing interest in interpretable machine learning (Doshi-Velez and Kim, 2017; Lipton, 2016; Guidotti et al., 2019), such studies seldom refer to prior work in social sciences (Miller, 2018). Recent studies also show that explanations such as highlighting important features induce limited improvement on human performance in detecting deceptive reviews and media biases (Lai and Tan, 2019; Horne et al., 2019). Therefore, we believe that developing a computational understanding of everyday explanations is crucial for explainable AI. Here we provide a data-driven study of everyday explanations in the context of persuasion.

In particular, we investigate the “pointers” in explanations, inspired by recent work on pointer networks (Vinyals et al., 2015). Copying mechanisms allow a decoder to generate a token by copying from the source, and have been shown to be effective in generation tasks ranging from summarization to program synthesis (See et al., 2017; Ling et al., 2016; Gu et al., 2016). To the best of our knowledge, our work is the first to investigate the phenomenon of pointers in explanations.

Linguistic accommodation and studies on quotations also examine the phenomenon of reusing words (Danescu-Niculescu-Mizil et al., 2011; Giles and Ogay, 2007; Leskovec et al., 2009; Simmons et al., 2011). For instance, Danescu-Niculescu-Mizil et al. show that power differences are reflected in the echoing of function words; Tan et al. (2018) find that news media prefer to quote locally distinct sentences in political debates. In comparison, our word-level formulation presents a fine-grained view of echoing words, and puts a stronger emphasis on content words than work on linguistic accommodation.

Finally, our work is concerned with an especially challenging problem in social interaction: persuasion. A battery of studies has sought to enhance our understanding of persuasive arguments (Wang et al., 2017; Zhang et al., 2016; Habernal and Gurevych, 2016; Lukin et al., 2017; Durmus and Cardie, 2018), and the area of argumentation mining specifically investigates the structure of arguments (Lippi and Torroni, 2016; Walker et al., 2012; Somasundaran and Wiebe, 2009). We build on previous work by Tan et al. (2016) and leverage the dynamics of /r/ChangeMyView. Although our findings are certainly related to the persuasion process, we focus on understanding the self-described reasons for persuasion, instead of the structure of arguments or the factors that drive effective persuasion.

3 Dataset

(a) Length correlations.
(b) Fraction of words that are echoed from the explanandum.
(c) Word-level echoing probability vs. document frequency.
Figure 1: Figure 1(a) shows the pairwise Pearson correlation coefficient between the lengths of OPs, PCs, and explanations (all values are statistically significant). Figure 1(b) shows the average fraction of words in an explanation that are in its OP or PC, and the fraction of words in a PC that are in its OP. In Figure 1(c), the y-axis represents the probability of a word in an OP or PC being echoed in the explanation, while the x-axis shows the inverse document frequency of that word in the training data. For each document frequency decile, we calculate the probability of a word in that decile being echoed, and plot those probabilities with the red line. In Figure 1(b) and Figure 1(c), the (small) error bars represent standard errors.

Our dataset is derived from the /r/ChangeMyView subreddit, which has more than 720K subscribers (Tan et al., 2016). /r/ChangeMyView hosts conversations where someone expresses a view and others then try to change that person’s mind. Despite being fundamentally based on argument, /r/ChangeMyView has a reputation for being remarkably civil and productive (CMV moderators, 2019a); e.g., a journalist wrote “In a culture of brittle talking points that we guard with our lives, Change My View is a source of motion and surprise” (Heffernan, 2018).

The delta mechanism in /r/ChangeMyView allows members to acknowledge opinion changes and enables us to identify explanations for opinion changes (CMV moderators, 2019b). Specifically, it requires “Any user, whether they’re the OP or not, should reply to a comment that changed their view with a delta symbol and an explanation of the change.” As a result, we have access to tens of thousands of naturally-occurring explanations and associated explananda. In this work, we focus on the opinion changes of the original posters.

Throughout this paper, we use the following terminology:

  • An original post (OP) is an initial post where the original poster justifies his or her opinion. We also use OP to refer to the original poster.

  • A persuasive comment (PC) is a comment that directly leads to an opinion change on the part of the OP (i.e., winning a ∆).

  • A top-level comment is a comment that directly replies to an OP, and /r/ChangeMyView requires the top-level comment to “challenge at least one aspect of OP’s stated view (however minor), unless they are asking a clarifying question.”

  • An explanation is a comment where an OP acknowledges a change in his or her view and provides an explanation of the change. As shown in Table 1, the explanation not only provides a rationale, it can also include other discourse acts, such as expressing gratitude.

Using https://pushshift.io, we collect the posts and comments in /r/ChangeMyView from January 17th, 2013 to January 31st, 2019, and extract tuples of (OP, PC, explanation). We use the tuples from the final six months of our dataset as the test set, those from the six months before that as the validation set, and the remaining tuples as the training set. The sets contain 5,270, 5,831, and 26,617 tuples respectively. Note that there is no overlap in time between the three sets and the test set can therefore be used to assess generalization including potential changes in community norms and world events.
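The time-based split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the exact cutoff dates and the `created` field name are assumptions, since the text only specifies "the final six months" and "the six months before that".

```python
from datetime import datetime

# Assumed cutoffs, derived from a collection window ending January 31st, 2019.
VALIDATION_START = datetime(2018, 2, 1)
TEST_START = datetime(2018, 8, 1)


def assign_split(created):
    """Return 'train', 'validation', or 'test' for a tuple's timestamp."""
    if created >= TEST_START:
        return "test"
    if created >= VALIDATION_START:
        return "validation"
    return "train"


def split_tuples(records):
    """Group (OP, PC, explanation) records into non-overlapping time splits.

    Each record is a dict with a hypothetical 'created' datetime field.
    """
    splits = {"train": [], "validation": [], "test": []}
    for record in records:
        splits[assign_split(record["created"])].append(record)
    return splits


example = [
    {"id": "a", "created": datetime(2015, 5, 1)},
    {"id": "b", "created": datetime(2018, 3, 1)},
    {"id": "c", "created": datetime(2019, 1, 15)},
]
example_splits = split_tuples(example)
```

Because each record falls into exactly one interval, no tuple can appear in more than one set.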

Preprocessing. We perform a number of preprocessing steps, such as converting blockquotes in Markdown to quotes, filtering explicit edits made by authors, mapping all URLs to a special @url@ token, and replacing hyperlinks with the link text. We ignore all triples that contain any deleted comments or posts. We use spaCy for tokenization and tagging (Honnibal and Montani, 2017). We also use the NLTK implementation of the Porter stemming algorithm to store the stemmed version of each word, for later use in our prediction task (Loper and Bird, 2002; Porter, 1980). Refer to the supplementary material for more information on preprocessing.
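As a rough illustration of three of these steps (blockquote conversion, link-text replacement, and URL mapping), the sketch below uses regular expressions. The patterns are assumptions made for this example and are likely simpler than the actual rules, which are described in the supplementary material.

```python
import re

# Markdown inline link "[text](url)": keep only the link text.
MD_LINK = re.compile(r"\[([^\]]+)\]\([^)\s]+\)")
# Any remaining bare URL is mapped to the special token.
BARE_URL = re.compile(r"https?://\S+")
# A Markdown blockquote line starts with '>'.
BLOCKQUOTE = re.compile(r"^>[ \t]?(.*)$", re.MULTILINE)


def preprocess(text):
    """Apply simplified versions of the preprocessing steps to raw Markdown."""
    text = MD_LINK.sub(r"\1", text)          # hyperlink -> link text
    text = BARE_URL.sub("@url@", text)       # URL -> @url@
    text = BLOCKQUOTE.sub(lambda m: '"' + m.group(1).strip() + '"', text)
    return text


cleaned_links = preprocess("See [this study](https://example.com/a) and https://example.org/b")
cleaned_quote = preprocess("> music is art\nI disagree")
```

Links are rewritten before bare URLs so that the URL inside a Markdown link is not turned into a token first.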

Data statistics. Table 2 provides basic statistics of the training tuples and how they compare to other comments. We highlight the fact that PCs are on average longer than top-level comments, suggesting that PCs contain substantial counterarguments that directly contribute to opinion change. Therefore, we simplify the problem by focusing on the (OP, PC, explanation) tuples and ignore any other exchanges between an OP and a commenter.

Below, we highlight some notable features of explanations as they appear in our dataset.

The length of explanations shows stronger correlation with that of OPs and PCs than between OPs and PCs (Figure 1(a)). This observation suggests that explanations are more closely related to OPs and PCs than PCs are to OPs in terms of language use. A possible reason is that the explainer combines their natural tendency towards length with accommodating the PC.

                                        count     #sentences   #words
Tuples of (OP, PC, explanation)
  Original posts                        26.3K     16.8         298.8
  Persuasive comments                   26.3K     12.6         218.3
  Explanations                          26.3K     5.3          79.8
All of /r/ChangeMyView during the training period
  Original posts                        93.4K     13.2         172.6
  Top-level comments                    681.6K    9.1          147.4
  All comments                          3.6M      6.5          98.9
Table 2: Basic statistics of the training dataset.

Explanations have a greater fraction of “pointers” than do persuasive comments (Figure 1(b)). We measure the likelihood of a word in an explanation being copied from either its OP or PC and provide a similar probability for a PC for copying from its OP. As we discussed in Section 1, the words in an explanation are much more likely to come from the existing discussion than are the words in a PC (59.8% vs 39.0%). This phenomenon holds even if we restrict ourselves to considering words outside quotations, which removes the effect of quoting other parts of the discussion, and if we focus only on content words, which removes the effect of “reusing” stopwords.
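A minimal sketch of this measurement is given below. It assumes the texts have already been tokenized and stemmed, and that the fraction is computed over tokens rather than unique words; both choices are assumptions, and the example stems are made up for illustration.

```python
def pointer_fraction(target_stems, source_stems):
    """Fraction of tokens in the target that also appear in the source."""
    if not target_stems:
        return 0.0
    source_vocab = set(source_stems)
    hits = sum(1 for stem in target_stems if stem in source_vocab)
    return hits / len(target_stems)


# Toy, pre-stemmed inputs loosely inspired by Table 1.
op = ["music", "artist", "bad"]
pc = ["music", "purpos", "pop"]
explanation = ["music", "serv", "purpos", "artist"]

explanation_fraction = pointer_fraction(explanation, op + pc)  # explanation vs. OP or PC
pc_fraction = pointer_fraction(pc, op)                         # PC vs. its OP
```

Restricting the inputs to content words, or to tokens outside quotations, yields the two robustness checks mentioned above.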

Relation between a word being echoed and its document frequency (Figure 1(c)). Finally, as a preview of our main results, the document frequency of a word from the explanandum is related to the probability of being echoed in the explanation. Although the average likelihood declines as the document frequency gets lower, we observe an intriguing U-shape in the scatter plot.[3] In other words, the words that are most likely to be echoed are either unusually frequent or unusually rare, while most words in the middle show a moderate likelihood of being echoed.

[3] A similar U-shape exists if we examine the probability of a PC echoing its OP, but does not show up if we compare an OP echoing a different, randomly chosen OP. It is worth noting that PCs can also be viewed as explaining why the OP is problematic. However, constructing a PC involves selecting from a large number of possible counter perspectives (all of which are unobservable). See the supplementary material for a detailed discussion.
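The per-decile echoing probability can be sketched as below. The IDF variant log(N / df) and the equal-sized binning are assumptions for illustration; the text does not state which variants were used.

```python
import math


def inverse_document_frequency(documents):
    """documents: list of sets of stems. Returns stem -> log(N / df)."""
    n_docs = len(documents)
    df = {}
    for doc in documents:
        for stem in set(doc):
            df[stem] = df.get(stem, 0) + 1
    return {stem: math.log(n_docs / count) for stem, count in df.items()}


def echo_rate_by_decile(instances, n_bins=10):
    """instances: list of (idf, echoed) pairs with echoed in {0, 1}.

    Returns one echoing probability per bin, from lowest to highest IDF.
    """
    ordered = sorted(instances, key=lambda pair: pair[0])
    n = len(ordered)
    rates = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            rates.append(sum(echoed for _, echoed in chunk) / len(chunk))
        else:
            rates.append(float("nan"))
    return rates


toy_idf = inverse_document_frequency([{"the", "music"}, {"the", "art"}])
# Synthetic U-shape: only the most and least frequent words are echoed.
toy_instances = [(float(i), 1 if i < 2 or i >= 18 else 0) for i in range(20)]
toy_rates = echo_rate_by_decile(toy_instances)
```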

Feature group | Features and intuitions | Echoed?

Non-contextual properties
  • Inverse document frequency. As shown in Figure 1(c), although document frequency can have non-linear relationships with being copied, the average echoing probability is greater for more common words.
  • Number of characters. Longer words tend to be more complicated, and may be more likely to be echoed as part of the core argument.
  • Wordnet depth. Similar to number of characters, the depth in Wordnet can indicate the complexity of a word and we expect words with greater depth to be echoed.
  • Echoing likelihood. We also compute the general tendency of a word being echoed in the training data. We expect the feature to be positively correlated with the label.

How a word is used in an OP or PC (OP/PC usage)
  • Part-of-speech (POS) tags. We compute the percentage of times that the surface forms of a stemmed word appear as different part-of-speech tags. We expect nouns and verbs to be more likely to be echoed. Results are reported separately for verbs and nouns in an OP and in a PC.
  • Subjects and objects from dependency labels. We compute the percentage of times that the word appears as subjects, objects, and others. We expect subjects and objects to be more likely to be echoed. Results are reported separately for subjects, objects, and others in an OP and in a PC.
  • (Normalized) term frequency. We expect frequent terms to be echoed.
  • #surface forms. We expect words that have diverse surface forms to be echoed.
  • Location. For words that never show up in an OP or PC, the default value is 0.5. We expect later words to be echoed. Results are reported separately for location in an OP (not significant in stopwords) and location in a PC.
  • In quotes. We expect words in quotes to be echoed as they are already emphasized.
  • Entity. We expect entities to be echoed.

How a word connects an OP and PC (OP-PC relation)
  • Occurs both in an OP and PC.
  • #Surface forms in an OP but not in the PC.
  • #Surface forms in a PC but not in the OP.
  • Jensen-Shannon (JS) distance between the OP and PC POS tag distributions of the word.
  • JS distance between subjects/objects distributions of the word in an OP and PC.

General OP/PC properties
  • OP length.
  • PC length.
  • Difference in #words.
  • Difference in average #characters in words.
  • Part-of-speech tags distributional differences between an OP and PC.
  • Depth of the PC in the thread.

Table 3: Features to capture the properties of a word in the context of an explanandum. The last column shows t-test results after Bonferroni correction. An upward arrow indicates that words that are echoed have a greater value in the feature, while a downward arrow indicates the reverse. The number of arrows indicates the level of the p-value. Additional markers indicate that the direction is flipped in content words and stopwords, respectively. We show significance testing results in a condensed format for space reasons. Refer to the supplementary material for the complete testing results.

4 Understanding the Pointers in Explanations

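To further investigate how explanations select words from the explanandum, we formulate a word-level prediction task to predict whether words in an OP or PC are echoed in its explanation. Formally, given a tuple of (OP, PC, explanation), we extract the unique stemmed words as $\mathcal{V}_{\text{OP}}$, $\mathcal{V}_{\text{PC}}$, and $\mathcal{V}_{\text{EXP}}$. We then define the label for each word in the OP or PC, $w \in \mathcal{V}_{\text{OP}} \cup \mathcal{V}_{\text{PC}}$, based on the explanation as follows:

$$y_w = \begin{cases} 1 & \text{if } w \in \mathcal{V}_{\text{EXP}}, \\ 0 & \text{otherwise.} \end{cases}$$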
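In code, this labeling amounts to a set-membership test over unique stems. The sketch below assumes stemming has already been applied; the function name and the toy stems are illustrative only.

```python
def echo_labels(op_stems, pc_stems, explanation_stems):
    """Label each unique stem in the OP or PC: 1 if echoed in the explanation, else 0."""
    explanation_vocab = set(explanation_stems)
    candidates = set(op_stems) | set(pc_stems)
    return {stem: int(stem in explanation_vocab) for stem in sorted(candidates)}


toy_labels = echo_labels(
    op_stems=["music", "artist", "bad", "music"],
    pc_stems=["music", "purpos", "pattern"],
    explanation_stems=["guess", "music", "purpos", "artist"],
)
```

Words that appear only in the explanation (here, "guess") receive no label, since the task is defined over words in the explanandum.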

Our prediction task is thus a straightforward binary classification task at the word level. We develop the following five groups of features to capture properties of how a word is used in the explanandum (see Table 3 for the full list):

  • Non-contextual properties of a word. These features are derived directly from the word and capture the general tendency of a word being echoed in explanations.

  • Word usage in an OP or PC (two groups). These features capture how a word is used in an OP or PC. As a result, for each feature, we have two values for the OP and PC respectively.

  • How a word connects an OP and PC. These features look at the difference between word usage in the OP and PC. We expect this group to be the most important in our task.

  • General OP/PC properties. These features capture the general properties of a conversation. They can be used to characterize the background distribution of echoing.
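One of the OP-PC relation features, the Jensen-Shannon distance between a word's POS tag distributions in the OP and the PC, can be sketched as follows. Taking the square root of the base-2 JS divergence is an assumption here, as is the requirement that the word occurs at least once on each side; the handling of words missing from one side is not specified in the text.

```python
import math


def js_distance(p_counts, q_counts, base=2):
    """Jensen-Shannon distance between two count (or probability) dictionaries.

    Both inputs must have a positive total. With base 2 the result is in [0, 1].
    """
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    p = {k: p_counts.get(k, 0.0) / p_total for k in keys}
    q = {k: q_counts.get(k, 0.0) / q_total for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}

    def kl(a, b):
        # Terms with a[k] == 0 contribute nothing; b[k] > 0 whenever a[k] > 0.
        return sum(a[k] * math.log(a[k] / b[k], base) for k in keys if a[k] > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


same_usage = js_distance({"NOUN": 3, "VERB": 1}, {"NOUN": 6, "VERB": 2})
disjoint_usage = js_distance({"NOUN": 2}, {"VERB": 5})
```

A distance near 0 means the word plays the same grammatical role in both texts, while a distance near 1 means its roles do not overlap.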

Table 3 further shows the intuition for including each feature, and condensed t-test results after Bonferroni correction. Specifically, we test whether the words that were echoed in explanations have different feature values from those that were not echoed. In addition to considering all words, we also separately consider stopwords and content words in light of Figure 1(c). Here, we highlight a few observations:

  • Although we expect more complicated words (#characters) to be echoed more often, this is not the case on average. We also observe an interesting example of Simpson’s paradox in the results for Wordnet depth (Blyth, 1972): shallower words are more likely to be echoed across all words, but deeper words are more likely to be echoed in content words and stopwords.

  • OPs and PCs generally exhibit similar behavior for most features, except for part-of-speech and grammatical relation (subject, object, and other). For instance, verbs in an OP are less likely to be echoed, while verbs in a PC are more likely to be echoed.

  • Although nouns from both OPs and PCs are less likely to be echoed, within content words, subjects and objects from an OP are more likely to be echoed. Surprisingly, subjects and objects in a PC are less likely to be echoed, which suggests that the original poster tends to refer back to their own subjects and objects, or introduce new ones, when providing explanations.

  • Later words in OPs and PCs are more likely to be echoed, especially in OPs. This could relate to OPs summarizing their rationales at the end of their post and PCs putting their strongest points last.

  • Although the number of surface forms in an OP or PC is positively correlated with being echoed, the differences in surface forms show reverse trends: the more surface forms of a word that show up only in the PC (i.e., not in the OP), the more likely a word is to be echoed. However, the reverse is true for the number of surface forms in only the OP. Such contrast echoes Tan et al. (2016), in which dissimilarity in word usage between the OP and PC was a predictive feature of successful persuasion.
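The surface-form features discussed in the last observation can be computed from (surface form, stem) pairs, as in the sketch below. The function names, the lowercasing of surface forms, and the toy stems are assumptions made for illustration.

```python
from collections import defaultdict


def surface_forms_by_stem(tokens):
    """tokens: iterable of (surface_form, stem) pairs. Returns stem -> set of forms."""
    forms = defaultdict(set)
    for surface, stem in tokens:
        forms[stem].add(surface.lower())
    return forms


def op_pc_relation(stem, op_forms, pc_forms):
    """Three OP-PC relation features for one stem."""
    in_op = op_forms.get(stem, set())
    in_pc = pc_forms.get(stem, set())
    return {
        "in_both": int(bool(in_op) and bool(in_pc)),
        "forms_only_in_op": len(in_op - in_pc),
        "forms_only_in_pc": len(in_pc - in_op),
    }


op_forms = surface_forms_by_stem([("Music", "music"), ("artists", "artist")])
pc_forms = surface_forms_by_stem(
    [("music", "music"), ("musical", "music"), ("purpose", "purpos"), ("purposes", "purpos")]
)
music_features = op_pc_relation("music", op_forms, pc_forms)
purpose_features = op_pc_relation("purpos", op_forms, pc_forms)
```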

5 Predicting Pointers

We further examine the effectiveness of our proposed features in a predictive setting. These features achieve strong performance in the word-level classification task, and can enhance neural models in both the word-level task and generating explanations. However, the word-level task remains challenging, especially for content words.

(a) Overall Performance comparison between models.
(b) Feature importance of an ablated model with OP-PC relation.
(c) Performance vs. word source.
Figure 2: Figure 2(a) presents the performance of different models. We evaluate the performance of each model on the subset of stopwords and content words. Our features with XGBoost and logistic regression outperform the vanilla LSTM model, and adding our features to the vanilla LSTM model achieves similar performance as XGBoost. Figure 2(b) shows the normalized total gain of the classifier only based on features in OP-PC relation, while Figure 2(c) further breaks down the performance based on where the words come from.

5.1 Experiment setup

We consider two classifiers for our word-level classification task: logistic regression and gradient boosted trees (XGBoost) (Chen and Guestrin, 2016). We hypothesized that XGBoost would outperform logistic regression because our problem is non-linear, as shown in Figure 1(c).

To examine the utility of our features in a neural framework, we further adapt our word-level task as a tagging task, and use an LSTM as a baseline. Specifically, we concatenate an OP and PC with a special token as the separator so that an LSTM model can potentially distinguish the OP from the PC, and then tag each word based on the label of its stemmed version. We use GloVe embeddings to initialize the word embeddings (Pennington et al., 2014). To incorporate our proposed features, we concatenate the features of the corresponding stemmed word to the word embedding; the difference in performance relative to the vanilla LSTM demonstrates the utility of our proposed features. We scale all features to a common range before fitting the models. As introduced in Section 3, we split our tuples of (OP, PC, explanation) into training, validation, and test sets, and use the validation set for hyperparameter tuning. Refer to the supplementary material for additional details in the experiment.

Evaluation metric. Since our problem is imbalanced, we use the F1 score as our evaluation metric. For the tagging approach, we average the labels of words with the same stemmed version to obtain a single prediction for the stemmed word. To establish a baseline, we consider a random method that predicts the positive label with 0.15 probability (the base rate of positive instances).
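The evaluation steps above can be sketched as follows: token-level predictions are averaged per stem, F1 is computed over stems, and a rough reference value is given for the random baseline. The 0.5 decision threshold and the closed-form approximation (which plugs in the expected precision and recall rather than computing the expectation of F1) are assumptions of this sketch.

```python
from collections import defaultdict


def aggregate_by_stem(token_stems, token_predictions, threshold=0.5):
    """Average token-level predictions per stem and threshold the mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for stem, prediction in zip(token_stems, token_predictions):
        sums[stem] += prediction
        counts[stem] += 1
    return {stem: int(sums[stem] / counts[stem] >= threshold) for stem in sums}


def f1_score(gold, predicted):
    """F1 of the positive class over stems; both arguments map stem -> 0/1."""
    tp = sum(1 for s, y in gold.items() if y == 1 and predicted.get(s, 0) == 1)
    fp = sum(1 for s, y in gold.items() if y == 0 and predicted.get(s, 0) == 1)
    fn = sum(1 for s, y in gold.items() if y == 1 and predicted.get(s, 0) == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def approx_random_f1(base_rate, positive_prob=0.15):
    """Approximate F1 of a predictor that outputs 1 with a fixed probability.

    Expected precision is the base rate; expected recall is positive_prob.
    """
    return 2 * base_rate * positive_prob / (base_rate + positive_prob)


stem_predictions = aggregate_by_stem(
    ["music", "music", "artist", "purpos"], [1, 0, 1, 0]
)
toy_f1 = f1_score({"music": 1, "artist": 0, "purpos": 1}, stem_predictions)
```

Since the approximation depends on the base rate, the random baseline takes different values on subsets such as content words and stopwords.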

5.2 Prediction Performance

Overall performance (Figure 2(a)). Although our word-level task is heavily imbalanced, all of our models outperform the random baseline by a wide margin. As expected, content words are much more difficult to predict than stopwords, but the best F1 score in content words more than doubles that of the random baseline (0.286 vs. 0.116). Notably, although we strongly improve on our random baseline, even our best F1 scores are relatively low, and this holds true regardless of the model used. Despite involving more tokens than standard tagging tasks (e.g., Marcus et al. (1994) and Plank et al. (2016)), predicting whether a word is going to be echoed in explanations remains a challenging problem.

Although the vanilla LSTM model incorporates additional knowledge (in the form of word embeddings), the feature-based XGBoost and logistic regression models both outperform the vanilla LSTM model. Concatenating our proposed features with word embeddings leads to improved performance from the LSTM model, which becomes comparable to XGBoost. This suggests that our proposed features can be difficult to learn with an LSTM alone.

Despite the non-linearity observed in Figure 1(c), XGBoost only outperforms logistic regression by a small margin. In the rest of this section, we use XGBoost to further examine the effectiveness of different groups of features, and model performance in different conditions.

                          content   stop
all features              0.286     0.600
random                    0.116     0.205

                          forward             backward
                          content   stop      content   stop
Non-contextual prop.      0.177     0.582     0.285     0.561
OP usage                  0.191     0.527     0.281     0.599
PC usage                  0.233     0.520     0.275     0.598
OP-PC relation            0.280     0.542     0.289     0.600
General OP/PC prop.       0.153     0.266     0.285     0.598
Table 4: Ablation performance with XGBoost on content words and stopwords (each ablated model is tuned based on performance on all words). “forward” refers to only using a group of features, while “backward” refers to only removing a group of features.
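The forward and backward configurations in Table 4 can be enumerated mechanically from the feature groups, as in the sketch below. The group and feature names are hypothetical placeholders, and model training is omitted.

```python
def ablation_feature_sets(groups):
    """Build the feature lists for the full model and each ablation.

    groups: dict mapping group name -> list of feature names.
    'forward:<g>' keeps only group g; 'backward:<g>' removes only group g.
    """
    all_features = [f for features in groups.values() for f in features]
    configs = {"all": all_features}
    for name, features in groups.items():
        removed = set(features)
        configs["forward:" + name] = list(features)
        configs["backward:" + name] = [f for f in all_features if f not in removed]
    return configs


# Hypothetical, heavily abbreviated feature lists.
EXAMPLE_GROUPS = {
    "non_contextual": ["idf", "n_chars"],
    "op_usage": ["op_term_frequency"],
    "pc_usage": ["pc_term_frequency"],
    "op_pc_relation": ["in_both"],
    "general": ["op_length"],
}
example_configs = ablation_feature_sets(EXAMPLE_GROUPS)
```

Each configuration would then be used to train and tune a separate classifier, which gives one row of forward and backward scores per group.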

Ablation performance (Table 4). First, if we only consider a single group of features, as we hypothesized, the relation between OP and PC is crucial and leads to almost as strong performance in content words as using all features. To further understand the strong performance of OP-PC relation, Figure 2(b) shows the feature importance in the ablated model, measured by the normalized total gain (see the supplementary material for feature importance in the full model). A word’s occurrence in both the OP and PC is clearly the most important feature, with distance between its POS tag distributions as the second most important. Recall that in Table 3 we show that words that have similar POS behavior between the OP and PC are more likely to be echoed in the explanation.

Overall, it seems that word-level properties contribute the most valuable signals for predicting stopwords. If we restrict ourselves to only information in either an OP or PC, how a word is used in a PC is much more predictive of content word echoing (0.233 vs 0.191). This observation suggests that, for content words, the PC captures more valuable information than the OP. This finding is somewhat surprising given that the original poster sets the topic of discussion and writes the explanation.

As for the effects of removing a group of features, we can see that there is little change in the performance on content words. This can be explained by the strong performance of the OP-PC relation on its own, and the possibility of the OP-PC relation being approximated by OP and PC usage. Again, word-level properties are valuable for strong performance in stopwords.

Performance vs. word source (Figure 1(c)). We further break down the performance by where a word is from. We can group a word based on whether it shows up only in an OP, a PC, or both OP and PC, as shown in Table 1. There is a striking difference between the performance in the three categories (e.g., for all words, 0.63 in OP & PC vs. 0.271 in PC only). The strong performance on words in both the OP and PC applies to stopwords and content words, even accounting for the shift in the random baseline, and recalls the importance of occurring both in OP and PC as a feature.
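The breakdown by word source can be sketched as follows. This is a minimal illustration, assuming that each word is represented by its stem and that gold labels and predictions are available as booleans; the names `word_source` and `f1_by_source` and the tuple layout are hypothetical and not taken from the released code.

```python
from collections import defaultdict


def word_source(stem, op_stems, pc_stems):
    """Assign a stem to one of the three source categories."""
    in_op, in_pc = stem in op_stems, stem in pc_stems
    if in_op and in_pc:
        return "OP & PC"
    if in_op:
        return "OP only"
    if in_pc:
        return "PC only"
    raise ValueError(f"{stem!r} appears in neither the OP nor the PC")


def f1_by_source(examples):
    """Compute F1 per source category.

    `examples` is an iterable of (source, echoed, predicted) tuples,
    where `echoed` is the gold label and `predicted` the model output.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # tp, fp, fn per category
    for source, echoed, predicted in examples:
        c = counts[source]
        if predicted and echoed:
            c[0] += 1
        elif predicted:
            c[1] += 1
        elif echoed:
            c[2] += 1
    return {
        source: (2 * tp / (2 * tp + fp + fn) if tp else 0.0)
        for source, (tp, fp, fn) in counts.items()
    }
```

Reporting F1 separately per category, alongside a random baseline for that category, keeps the comparison fair when the echo rate differs across categories.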

Furthermore, the echoing of words from the PC is harder to predict (0.271) than from the OP (0.347) despite the fact that words only in PCs are more likely to be echoed than words only in OPs (13.5% vs. 8.6%). The performance difference is driven by stopwords, suggesting that our overall model is better at capturing signals for stopwords used in OPs. This might relate to the fact that the OP and the explanation are written by the same author; prior studies have demonstrated the important role of stopwords for authorship attribution (Raghavan et al., 2010).

Nouns are the most reliably predicted part-of-speech tag within content words (Table 5). Next, we break down the performance by part-of-speech tags. We focus on the part-of-speech tags that are semantically important, namely, nouns, proper nouns, verbs, adverbs, and adjectives.

Prediction performance can be seen as a proxy for how reliably a part-of-speech tag is reused when providing explanations. Consistent with our expectations for the importance of nouns and verbs, our models achieve the best performance on nouns within content words. Verbs are more challenging, but become the least difficult tag to predict when we consider all words, likely due to stopwords such as “have.” Adjectives turn out to be the most challenging category, suggesting that adjectival choice is perhaps more arbitrary than other parts of speech, and therefore less central to the process of constructing an explanation. The important role of nouns in shaping explanations resonates with the high recall rate of nouns in memory tasks (Reynolds and Flagg, 1976).

POS tag content all random
noun 0.354 0.361 0.130
adverb 0.342 0.411 0.127
verb 0.306 0.466 0.122
proper noun 0.280 0.336 0.109
adjective 0.237 0.289 0.111
Table 5: Performance on five non-function part-of-speech tags (sorted by performance within content words). As a comparison, we also show the performance of the random baseline on content words, which is relatively stable across part-of-speech tags.

5.3 The Effect on Generating Explanations

ROUGE-1 ROUGE-2 ROUGE-L
w/o features 18.91 4.12 17.05
with features 22.01 3.93 19.02
Table 6: ROUGE scores (F1) on the test dataset (Lin, 2004). The differences in ROUGE-1 and ROUGE-L are statistically significant.

One way to measure the ultimate success of understanding pointers in explanations is to be able to generate explanations. We use the pointer generator network with coverage as our starting point (See et al., 2017; Klein et al., 2017) (see the supplementary material for details). We investigate whether concatenating our proposed features with word embeddings can improve generation performance, as measured by ROUGE scores.

Consistent with results in sequence tagging for word-level echoing prediction, our proposed features can enhance a neural model with copying mechanisms (see Table 6). Specifically, their use leads to statistically significant improvement in ROUGE-1 and ROUGE-L, while slightly hurting the performance in ROUGE-2 (the difference is not statistically significant). We also find that our features can increase the likelihood of copying: an average of 17.59 unique words get copied to the generated explanation with our features, compared to 14.17 unique words without our features. For comparison, target explanations have an average of 34.81 unique words. We emphasize that generating explanations is a very challenging task (evidenced by the low ROUGE scores and examples in the supplementary material), and that fully solving the generation task requires more work.
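To make the two quantities above concrete, the following sketch computes a simplified ROUGE-1 F1 (clipped unigram overlap) and the number of unique words shared between a source and a generated explanation. It is an illustration only: the reported scores come from the ROUGE package (Lin, 2004), which this sketch does not replicate exactly, and counting "copied" words by surface-token overlap is an assumption about how the statistic could be computed. Both function names are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1 over two token lists (clipped unigram overlap)."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def unique_copied_words(source, generated):
    """Number of distinct tokens in `generated` that also occur in `source`."""
    return len(set(source) & set(generated))
```

For example, `unique_copied_words(op_tokens + pc_tokens, generated_tokens)` would give the per-instance count that is averaged over the test set, under the overlap assumption stated above.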

6 Concluding Discussions

In this work, we conduct the first large-scale empirical study of everyday explanations in the context of persuasion. We assemble a novel dataset and formulate a word-level prediction task to understand the selective nature of explanations. Our results suggest that the relation between an OP and PC plays an important role in predicting the echoing of content words, while a word’s non-contextual properties matter for stopwords. We show that vanilla LSTMs fail to learn some of the features we develop and that our proposed features can improve the performance in generating explanations using pointer networks. We also demonstrate the important role of nouns in shaping explanations.

Although our approach strongly outperforms random baselines, the relatively low F1 scores indicate that predicting which word is echoed in explanations is a very challenging task. It follows that we are only able to derive a limited understanding of how people choose to echo words in explanations. The extent to which explanation construction is fundamentally random (Nisbett and Wilson, 1977), or whether there exist other unidentified patterns, is of course an open question. We hope that our study and the resources that we release encourage further work in understanding the pragmatics of explanations.

There are many promising research directions for future work in advancing the computational understanding of explanations. First, although /r/ChangeMyView has the useful property that its explanations are closely connected to its explananda, it is important to further investigate the extent to which our findings generalize beyond /r/ChangeMyView and Reddit and establish universal properties of explanations. Second, it is important to connect the words in explanations that we investigate here to the structure of explanations in psychology (Lombrozo, 2006). Third, in addition to understanding what goes into an explanation, we need to understand what makes an explanation effective. A better understanding of explanations not only helps develop explainable AI, but also informs the process of collecting explanations that machine learning systems learn from (Hancock et al., 2018; Rajani et al., 2019; Camburu et al., 2018).


We thank Kimberley Buchan, anonymous reviewers, and members of the NLP+CSS research group at CU Boulder for their insightful comments and discussions, and Jason Baumgartner for sharing the dataset that enabled this research.


  • C. R. Blyth (1972) On Simpson’s paradox and the sure-thing principle. Journal of the American Statistical Association 67 (338), pp. 364–366.
  • O. Camburu, T. Rocktäschel, T. Lukasiewicz, and P. Blunsom (2018) e-SNLI: natural language inference with natural language explanations. In Proceedings of NeurIPS.
  • Center for Language and Information Research (2016) ClearNLP Guidelines. https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md [Online; accessed 21-May-2019].
  • T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In Proceedings of KDD.
  • CMV moderators (2019a) CMV media coverage. https://changemyview.net/subreddit/#media-coverage [Online; accessed 27-Apr-2019].
  • CMV moderators (2019b) The Delta System. https://www.reddit.com/r/changemyview/wiki/deltasystem [Online; accessed 27-Apr-2019].
  • C. Danescu-Niculescu-Mizil, M. Gamon, and S. Dumais (2011) Mark my words!: linguistic style accommodation in social media. In Proceedings of WWW.
  • C. Danescu-Niculescu-Mizil, L. Lee, B. Pang, and J. Kleinberg (2012) Echoes of power: language effects and power differences in social interaction. In Proceedings of WWW.
  • F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
  • E. Durmus and C. Cardie (2018) Exploring the role of prior beliefs for argument persuasion. In Proceedings of NAACL.
  • H. Giles and T. Ogay (2007) Communication accommodation theory. Explaining communication: Contemporary theories and exemplars, pp. 293–310.
  • J. Gu, Z. Lu, H. Li, and V. O. K. Li (2016) Incorporating Copying Mechanism in Sequence-to-Sequence Learning. In Proceedings of ACL.
  • R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi (2019) A survey of methods for explaining black box models. ACM Computing Surveys (CSUR) 51 (5), pp. 93.
  • I. Habernal and I. Gurevych (2016) What makes a convincing argument? Empirical analysis and detecting attributes of convincingness in web argumentation. In Proceedings of EMNLP.
  • B. Hancock, P. Varma, S. Wang, M. Bringmann, P. Liang, and C. Ré (2018) Training Classifiers with Natural Language Explanations. In Proceedings of ACL.
  • V. Heffernan (2018) Our best hope for civil discourse online is on … Reddit. Wired.
  • C. G. Hempel and P. Oppenheim (1948) Studies in the logic of explanation. Philosophy of Science 15 (2), pp. 135–175.
  • M. Honnibal and I. Montani (2017) spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
  • B. D. Horne, D. Nevo, J. O’Donovan, J. Cho, and S. Adali (2019) Rating reliability and bias in news articles: does AI assistance help everyone?. In Proceedings of ICWSM.
  • F. C. Keil (2006) Explanation and understanding. Annu. Rev. Psychol. 57, pp. 227–254.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of ICLR.
  • G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. In Proceedings of ACL.
  • V. Lai and C. Tan (2019) On human predictions with explanations and predictions of machine learning models: a case study on deception detection. In Proceedings of FAT*.
  • J. Leskovec, L. Backstrom, and J. Kleinberg (2009) Meme-tracking and the dynamics of the news cycle. In Proceedings of KDD.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, pp. 74–81.
  • W. Ling, P. Blunsom, E. Grefenstette, K. M. Hermann, T. Kočiský, F. Wang, and A. Senior (2016) Latent predictor networks for code generation. In Proceedings of ACL, Berlin, Germany, pp. 599–609.
  • M. Lippi and P. Torroni (2016) Argumentation mining: state of the art and emerging trends. ACM Transactions on Internet Technology (TOIT) 16 (2), pp. 10.
  • Z. C. Lipton (2016) The mythos of model interpretability. arXiv preprint arXiv:1606.03490.
  • T. Lombrozo (2006) The structure and function of explanations. Trends in Cognitive Sciences 10 (10), pp. 464–470.
  • E. Loper and S. Bird (2002) NLTK: the natural language toolkit. arXiv preprint cs/0205028.
  • S. M. Lukin, P. Anand, M. Walker, and S. Whittaker (2017) Argument strength is in the eye of the beholder: audience effects in persuasion. In Proceedings of EACL.
  • M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz (1994) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19, pp. 313–330.
  • T. Miller (2018) Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence.
  • R. E. Nisbett and T. D. Wilson (1977) Telling more than we can know: verbal reports on mental processes. Psychological Review 84 (3), pp. 231.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP, Doha, Qatar, pp. 1532–1543.
  • B. Plank, A. Søgaard, and Y. Goldberg (2016) Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of ACL (short papers).
  • M. F. Porter (1980) An algorithm for suffix stripping. Program 14 (2), pp. 130–137.
  • S. Raghavan, A. Kovashka, and R. Mooney (2010) Authorship attribution using probabilistic context-free grammars. In Proceedings of ACL (short papers), pp. 38–42.
  • N. F. Rajani, B. McCann, C. Xiong, and R. Socher (2019) Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of ACL.
  • A. G. Reynolds and P. W. Flagg (1976) Recognition memory for elements of sentences. Memory & Cognition 4 (4), pp. 422–432.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) Why should I trust you?: explaining the predictions of any classifier. In Proceedings of KDD.
  • W. C. Salmon (2006) Four decades of scientific explanation. University of Pittsburgh Press.
  • S. Schuster and C. Manning (2016) Enhanced English Universal Dependencies: an Improved Representation for Natural Language Understanding Tasks. LREC 2016.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of ACL.
  • M. P. Simmons, L. A. Adamic, and E. Adar (2011) Memes online: extracted, subtracted, injected, and recollected. In Proceedings of ICWSM.
  • S. Somasundaran and J. Wiebe (2009) Recognizing stances in online debates. In Proceedings of ACL.
  • C. Tan, V. Niculae, C. Danescu-Niculescu-Mizil, and L. Lee (2016) Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions. In Proceedings of WWW, Montréal, Québec, Canada, pp. 613–624.
  • C. Tan, H. Peng, and N. A. Smith (2018) You are no Jack Kennedy: on media selection of highlights from presidential debates. In Proceedings of WWW.
  • O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In Proceedings of NeurIPS, pp. 2692–2700.
  • M. A. Walker, P. Anand, R. Abbott, and R. Grant (2012) Stance classification using dialogic properties of persuasion. In Proceedings of NAACL.
  • L. Wang, N. Beauchamp, S. Shugars, and K. Qin (2017) Winning on the merits: the joint effects of content and style on debate outcomes. Transactions of the Association for Computational Linguistics.
  • R. A. Wilson and F. Keil (1998) The Shadows and Shallows of Explanation. Minds and Machines 8 (1), pp. 137–159.
  • J. Woodward (2005) Making things happen: a theory of causal explanation. Oxford University Press.
  • J. Zhang, R. Kumar, S. Ravi, and C. Danescu-Niculescu-Mizil (2016) Conversational flow in Oxford-style debates. In Proceedings of NAACL.

Appendix A Supplemental Material

A.1 Preprocessing

Before tokenizing, we pass each OP, PC, and explanation through a preprocessing pipeline, with the following steps:

  1. Occasionally, /r/ChangeMyView’s moderators will edit comments, prefixing their edits with “Hello, users of CMV” or “This is a footnote” (see Table 7). We remove this, and any text that follows on the same line.

  2. We replace URLs with a “@url@” token, defining a URL to be any string which matches the following regular expression: (https?://[^\s)]*).

  3. We replace “∆” symbols and their analogues (such as “”, “&#8710;”, and “!delta”) with the word “delta”. We also remove the word “delta” from explanations, if the explanation starts with delta.

  4. Reddit-specific prefixes, such as “u/” (denoting a user) and “r/” (denoting a subreddit), are removed, as we observed that they often interfered with spaCy’s ability to correctly parse its inputs.

  5. We remove any text matching the regular expression EDIT(.*?):.* from the beginning of the match to the end of that line, as well as variations, such as Edit(.*?):.*.

  6. Reddit allows users to insert blockquoted text. We extract any blockquotes and surround them with standard quotation marks.

  7. We replace all contiguous whitespace with a single space. We also do this with tab characters and carriage returns, and with two or more hyphens, asterisks, or underscores.
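The steps above can be sketched with Python's `re` module. This is an approximation under stated assumptions rather than the pipeline used in the paper: the exact delta characters, the handling of a leading slash in “/r/”, and the use of a leading “>” to mark blockquotes are all assumptions, and the function name `preprocess` is hypothetical.

```python
import re

# Step 1: moderator footnotes, removed through the end of the line.
FOOTNOTE_RE = re.compile(r"(Hello, users of CMV|This is a footnote).*")
# Step 2: URL pattern as given in the text.
URL_RE = re.compile(r"(https?://[^\s)]*)")
# Step 3: delta markers (the exact symbol set is an assumption).
DELTA_RE = re.compile(r"∆|Δ|&#8710;|!delta")
# Step 4: "u/" and "r/" prefixes, with an optional leading slash (assumption).
PREFIX_RE = re.compile(r"(?<!\w)/?[ur]/(?=\w)")
# Step 5: edit notes, removed through the end of the line.
EDIT_RE = re.compile(r"(EDIT|Edit)(.*?):.*")
# Step 6: blockquotes, assumed to be lines starting with ">".
QUOTE_RE = re.compile(r"^>[ \t]*(.+)$", re.MULTILINE)
# Step 7: runs of two or more hyphens, asterisks, or underscores.
RULE_RE = re.compile(r"[-*_]{2,}")
SPACE_RE = re.compile(r"\s+")


def preprocess(text, is_explanation=False):
    """Apply the seven cleaning steps in order and return the cleaned text."""
    text = FOOTNOTE_RE.sub("", text)
    text = URL_RE.sub("@url@", text)
    text = DELTA_RE.sub("delta", text)
    text = PREFIX_RE.sub("", text)
    text = EDIT_RE.sub("", text)
    text = QUOTE_RE.sub(r'"\1"', text)
    text = RULE_RE.sub(" ", text)
    text = SPACE_RE.sub(" ", text).strip()
    if is_explanation:
        # Drop a leading "delta" token from explanations.
        text = re.sub(r"^delta\b\s*", "", text)
    return text
```

The line-based steps (footnotes, edits, blockquotes) run before whitespace is collapsed, since collapsing first would erase the line boundaries they rely on.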

Sample footnote: Hello, users of CMV! This is a footnote from your moderators. We’d just like to remind you of a couple of things. Firstly, please remember to read through our rules. If you see a comment that has broken one, it is more effective to report it than downvote it. Speaking of which, *downvotes don’t change views**! If you are thinking about submitting a CMV yourself, please have a look through our **popular topics wiki first. Any questions or concerns? Feel free to message us**. Happy CMVing!*
Sample subreddit reference: r/ideasforcmv, /r/nba
Sample URL : https://www.quora.com/profile/
Sample user reference: u/Ansuz07
Sample edit: EDIT for clarification: This isn’t to suggest that you have to remain financially independent to vote
Table 7: Sample data that were affected by preprocessing.

Tokenizing the data. After passing text through our preprocessing pipeline, we use the default spaCy pipeline to extract part-of-speech tags, dependency tags, and entity details for each token (Honnibal and Montani, 2017); we ignore all tokens tagged as “SPACE” by the part-of-speech tagger. In addition, we use NLTK to stem words (Loper and Bird, 2002). This output is used to compute all word-level features discussed in Section 4 of the main paper.

A.2 PC Echoing OP

Figure 2(b) shows a similar U-shape in the probability of a word being echoed in PC. However, visually, we can see that rare words seem more likely to have high echoing probability in explanations, while that probability is higher for words with moderate frequency in PCs. As PCs tend to be longer than explanations, we also used the echoing probability of the most frequent words to normalize the probability of other words so that they are comparable. We indeed observed a higher likelihood of echoing the rare words, but lower likelihood of echoing words with moderate frequency in explanations than in PCs.

(a) Echoing probability between explananda and explanations.
(b) Echoing probability between OPs and their PCs.
(c) Echoing probability between OPs and other, randomly chosen OPs.
Figure 3: The U-shape exists both in Figure 2(a) and Figure 2(b), but not in Figure 2(c).

A.3 Feature Calculation

Given an OP, PC, and explanation, we calculate a 66–dimensional vector for each unique stem in the concatenated OP and PC. Here, we describe the process of calculating each feature.

  1. Inverse document frequency: for a stem $s$, the inverse document frequency is given by $\log \frac{N}{\mathrm{df}_s}$, where $N$ is the total number of documents (here, OPs and PCs) in the training set, and $\mathrm{df}_s$ is the number of documents in the training data whose set of stemmed words contains $s$.

  2. Stem length: the number of characters in the stem.

  3. Wordnet depth (min): starting with the stem, this is the length of the minimum hypernym path to the synset root.

  4. Wordnet depth (max): similarly, this is the length of the maximum hypernym path.

  5. Stem transfer probability: the percentage of times in which a stem seen in the explanandum is also seen in the explanation. If, during validation or testing, a stem is encountered for the first time, we set this to be the mean probability of transfer over all stems seen in the training data.

  6-21. OP part-of-speech tags: a stem can represent multiple parts of speech. For example, both “traditions” and “traditional” will be stemmed to “tradit.” We count the percentage of times the given stem appears as each part-of-speech tag, following the Universal Dependencies scheme (Schuster and Manning, 2016). Note that, for English, spaCy does not use the SCONJ tag. If the stem does not appear in the OP, each part-of-speech feature will be .

  22-24. OP subject, object, and other: given a stem $s$, we calculate the percentage of times that $s$’s surface forms in the OP are classified as subjects, objects, or something else by spaCy. We follow the CLEAR guidelines (Center for Language and Information Research, 2016) and use the following tags to indicate a subject: nsubj, nsubjpass, csubj, csubjpass, agent, and expl. Objects are identified using these tags: dobj, dative, attr, oprd. If $s$ does not appear at all in the OP, we let subject, object, and other each equal .

  25. OP term frequency: the number of times any surface form of a stem appears in the list of tokens that make up the OP.

  26. OP normalized term frequency: the percentage of the OP’s tokens which are a surface form of the given stem.

  27. OP # of surface forms: the number of different surface forms for the given stem.

  28. OP location: the average location of each surface form of the given stem which appears in the OP, where the location of a surface form is defined as the percentage of tokens which appear after that surface form. If the stem does not appear at all in the OP, this value is .

  29. OP is in quotes: the number of times the stem appears in the OP surrounded by quotation marks.

  30. OP is entity: the percentage of tokens in the OP that are both a surface form for the given stem, and are tagged by spaCy as one of the following entities: PERSON, NORP, FAC, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, and LANGUAGE.

  31-55. PC equivalents of features 6-30.

  56. In both OP and PC: 1, if one of the stem’s surface forms appears in both the OP and PC. 0 otherwise.

  57. # of unique surface forms in OP: for the given stem, the number of surface forms that appear in the OP, but not in the PC.

  58. # of unique surface forms in PC: for the given stem, the number of surface forms that appear in the PC, but not in the OP.

  59. Stem part-of-speech distribution difference: we consider the concatenation of features 6-21, along with the concatenation of features 31-46, as two distributions, and calculate the Jensen–Shannon divergence between them.

  60. Stem dependency distribution difference: similarly, we consider the concatenation of features 22-24 (OP dependency labels), and the concatenation of features 47-49 (PC dependency labels), as two distributions, and calculate the Jensen–Shannon divergence between them.

  61. OP length: the number of tokens in the OP.

  62. PC length: the number of tokens in the PC.

  63. Length difference: the absolute value of the difference between OP length and PC length.

  64. Avg. word length difference: the difference between the average number of characters per token in the OP and the average number of characters per token in the PC.

  65. OP/PC part-of-speech tag distribution difference: the Jensen–Shannon divergence between the part-of-speech tag distributions of the OP on the one hand, and the PC on the other.

  66. Depth of the PC in the thread: since there can be many back-and-forth replies before a user awards a delta, we number each comment in a thread, starting at 0 for the OP, and incrementing for each new comment before the PC appears.
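A few of the features above can be sketched with the standard library alone. This is a minimal illustration, not the released implementation: the logarithm bases (natural log for the inverse document frequency, base 2 for the Jensen–Shannon divergence) are assumptions, as are the function names, and the default values used when a stem is absent from a document are not reproduced here.

```python
import math


def inverse_document_frequency(stem, documents):
    """log(N / df) over a list of stem sets; assumes the stem occurs at least once."""
    df = sum(1 for doc in documents if stem in doc)
    return math.log(len(documents) / df)


def average_location(stem, doc_stems):
    """Mean fraction of tokens that appear after each occurrence of the stem."""
    n = len(doc_stems)
    positions = [i for i, s in enumerate(doc_stems) if s == stem]
    if not positions:
        return None  # the default for absent stems is left to the caller
    return sum((n - 1 - i) / n for i in positions) / len(positions)


def _kl(p, q):
    """Kullback-Leibler divergence in bits, skipping zero-probability terms of p."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence between two distributions of equal length."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support), which makes the two distribution-difference features easy to compare across stems.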

Feature all content words stopwords
Inverse document frequency
Stem length
Wordnet depth (min)
Wordnet depth (max)
Stem transfer probability
OP subject
OP object
OP other
OP term frequency
OP normalized term frequency
OP # of surface forms
OP location ——
OP in quotes
OP is entity
PC subject
PC object
PC other
PC term frequency
PC normalized term frequency
PC # of surface forms
PC location
PC in quotes
PC is entity
In both OP and PC
# of unique surface forms in OP
# of unique surface forms in PC
Stem POS distribution difference
Stem dependency distribution difference
OP length
PC length
Length difference
Avg. word length difference
OP/PC POS distribution difference
Depth of the PC in the thread
Table 8: Full testing results after Bonferroni correction.

Feature Total Gain (%)
Inverse document frequency 16.97
Stem length 0.15
Wordnet depth (min) 0.12
Wordnet depth (max) 0.1
Stem transfer probability 46.7
OP ADP 0.02
OP X 0.01
OP DET 0.02
OP ADJ 0.01
OP VERB 0.04
OP PART 0.01
OP INTJ 0.01
OP NOUN 0.04
OP NUM 0.01
OP ADV 0.15
OP SYM 0.0
OP AUX 0.0
OP subject 0.53
OP object 0.01
OP other 0.02
OP term frequency 3.23
OP normalized term frequency 0.26
OP # of surface forms 0.01
OP location 0.15
OP in quotes 0.01
OP is entity 0.02
PC ADP 0.02
PC PRON 0.09
PC X 0.81
PC DET 0.05
PC ADJ 0.01
PC VERB 0.01
PC PART 0.02
PC NOUN 0.04
PC NUM 0.70
PC ADV 0.02
PC SYM 0.2
PC AUX 0.0
PC subject 0.01
PC object 0.01
PC other 0.02
PC term frequency 3.33
PC normalized term frequency 2.92
PC # of surface forms 0.02
PC location 0.24
PC in quotes 0.04
PC is entity 0.02
In both OP and PC 4.88
# of unique surface forms in OP 0.01
# of unique surface forms in PC 0.03
Stem POS distribution difference 0.29
Stem dependency distribution difference 0.28
OP length 3.63
PC length 3.47
Length difference 2.59
Avg. word length difference 2.65
OP/PC POS distribution difference 3.15
Depth of the PC in the thread 1.4
Table 9: Feature importance for the full XGBoost model, as measured by total gain.

Parameter Without features With features
encoder type brnn brnn
glove vector dimension 300 300
rnn size 512 512
dropout 0.2 0.1
optim adagrad adam
learning rate 0.15 0.001
beam size 10 10
Table 10: Parameters tuned on validation dataset containing 5k instances.

A.4 Word-level Prediction Task

For each non–LSTM classifier, we train 11 models: one full model, and forward and backward models for each of the five feature groups. To train, we fit on the training set and use the validation set for hyperparameter tuning.
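The eleven configurations can be enumerated as in the sketch below. The group names and the assignment of feature indices to groups are assumptions that follow the order of the feature list in the feature calculation section; the helper `ablation_configs` is hypothetical.

```python
# Assumed mapping from feature group to 1-based feature indices.
FEATURE_GROUPS = {
    "word_level": range(1, 6),
    "op_usage": range(6, 31),
    "pc_usage": range(31, 56),
    "op_pc_relation": range(56, 61),
    "general_op_pc": range(61, 67),
}


def ablation_configs(groups):
    """Return a dict mapping each model name to the feature indices it uses."""
    all_features = sorted(i for indices in groups.values() for i in indices)
    configs = {"full": all_features}
    for name, indices in groups.items():
        keep = set(indices)
        # Forward: use only this group of features.
        configs[f"forward:{name}"] = sorted(keep)
        # Backward: use every feature except this group.
        configs[f"backward:{name}"] = [i for i in all_features if i not in keep]
    return configs
```

Each entry would then be trained on the training set and tuned on the validation set, as described above.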

For the random model, since the echo rate of the training set is 15%, we simply predict 1 with 15% probability, and 0 otherwise.
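As a rough check on what such a baseline yields, the sketch below compares an approximate closed form with a simulation on synthetic labels. It assumes predictions are drawn independently of the labels, in which case precision is approximately the echo rate and recall approximately the prediction rate; the function names and the synthetic setup are illustrative only.

```python
import random


def expected_f1(echo_rate, predict_rate):
    """Approximate F1 of a random predictor that is independent of the labels."""
    precision, recall = echo_rate, predict_rate
    return 2 * precision * recall / (precision + recall)


def simulate_f1(echo_rate, predict_rate, n=200_000, seed=0):
    """Estimate the same quantity on n synthetic (label, prediction) pairs."""
    rng = random.Random(seed)
    tp = fp = fn = 0
    for _ in range(n):
        echoed = rng.random() < echo_rate
        predicted = rng.random() < predict_rate
        if predicted and echoed:
            tp += 1
        elif predicted:
            fp += 1
        elif echoed:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn)
```

When both rates are 15%, the approximation gives an F1 of about 0.15; on a subset of words with a different echo rate, the same predictor would score differently.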

For logistic regression, we use the lbfgs solver. To tune hyperparameters, we perform an exhaustive grid search, with taking values from , and the respective weights of the negative and positive classes taking values from .

We also train XGBoost models. Here, we use a learning rate of , estimator trees, and no subsampling. We perform an exhaustive grid search to tune hyperparameters, with the max tree depth equaling 5, 7, or 9, the minimum weight of a child equaling 3, 5, or 7, and the weight of a positive class instance equaling 3, 4, or 5.
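The grid described above has 27 combinations, which can be enumerated as follows. The parameter names follow XGBoost's scikit-learn wrapper (`max_depth`, `min_child_weight`, `scale_pos_weight`, `subsample`); the learning rate and number of trees are omitted because their values are not given here, and `validation_score` stands for any user-supplied function that trains a model with a configuration and returns its validation performance.

```python
from itertools import product

# Values searched for each hyperparameter, as listed in the text.
GRID = {
    "max_depth": [5, 7, 9],
    "min_child_weight": [3, 5, 7],
    "scale_pos_weight": [3, 4, 5],
}
# No subsampling corresponds to using all rows for every tree.
FIXED = {"subsample": 1.0}


def iter_configs(grid, fixed):
    """Yield one parameter dict per point of the exhaustive grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield {**fixed, **dict(zip(keys, values))}


def select_best(configs, validation_score):
    """Return the configuration with the highest validation score."""
    return max(configs, key=validation_score)
```

A call such as `select_best(iter_configs(GRID, FIXED), validation_score)` would return the configuration to retrain or evaluate on the test set.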

Finally, we train two LSTM models, each with a single 300–dimensional hidden layer. Due to efficiency considerations, we eschewed a full search of the parameter space, but experimented with different values of dropout, learning rate, positive class weight, and batch size. We ultimately trained each model for five epochs with a batch size of 32 and a learning rate of 0.001, using the Adam optimizer (Kingma and Ba, 2015). We also weight positive instances four times more highly than negative instances.

A.5 Generating Explanations

We formulate an abstractive summarization task using an OP concatenated with the PC as a source, and the explanation as target. We train two models, one with the features described above, and one without. A shared vocabulary of 50k words is constructed from the training set by setting the maximum encoding length to 500 words. We set the maximum decoding length to 100. We use a pointer generator network with coverage for generating explanations, using a bidirectional LSTM as an encoder and a unidirectional LSTM as a decoder. Both use a 256-dimensional hidden state. The parameters of this network are tuned using a validation set of five thousand instances. We constrain the batch size to 16 and train the network for 20k steps, using the parameters described in Table 10.

Original Post:I keep seeing this point when people bitch about escort quests . But I ’ve been thinking about it and like , consider the alternatives : 1 ) The NPC moves at your walking speed . Clearly this is a terrible option . Nobody has ever willingly moved at their walking speed in a video game unless they were trying to finesse something or sneak . a walking speed escort quest would be terrible . The fact people even mention this point when talking about NPCs is insane . The actual complaint is ” NPC is slower than my run speed ” . If the NPC exclusively moved your walk speed it would be 10000 times worse . 2 ) The NPC moves at your run speed . This seems better at first … but it means that you ca n’t pull ahead of the NPC if you want to , or catch up to them if they ever get ahead of you because you stopped to do anything . They ’re moving at 100 % of your max speed . Monsters up ahead ? That ’s a fucking shame because you do n’t have time to run up and pull aggro on them if the NPC is behind you and you are n’t going to be able to intercept them in time at all if the NPC ’s ahead because they ’ll always get there first . 3 ) The NPC moves at your exact speed behind you following your pathfinding and dynamically navigating traps / moving parts to keep a uniform distance from you . This renders the escort quest pointless . This is a solution to a different problem ( that escort quests are just terrible ) . If having to escort and NPC does n’t have any effect on your gameplay decisions they ’re a pointless inclusion . The NPC is generally SUPPOSED to require your attention . After all , the only reason you care that they move slower than you is because you have to watch over them . In games where you do n’t you just run ahead and let them be slow and it ’s no problem . If you need to watch over them , then they need to act in a way where you ca n’t just ignore them . Like , if you want to say escort quests are just terrible in general then I ’m on board . 
Escort quests suck . But if you have to have one with an NPC that has their own movement and pathfinding then they need to move close enough to your run speed that you are n’t walking , but far enough from it that you can control your distance from them to some degree while ahead and can catch up when behind . Of all the options available for NPC movement speeds ” about 75 - 85 % of PC run speed ” is the best for escort missions both in terms of being least annoying for the player and most able to create the gameplay changes the devs want to create with escort quests . .
persuasive comment: I think the best solution is allowing the player to select a speed equal to the escort . The frustration does n’t stem from having to move slower than you would without an escort . It ’s that there ’s no convenient way to move without running off and leaving them . In real life , it ’s simple to adjust your movement speed to a slower person . It ’s not about what pace you ’re moving , it ’s about being able to match pace .
Reference Explanation: Hmm … maybe ? I was initially 100% convinced by thinking on it , I dunno . I feel that the annoying thing is just that they ’re slow. Like , the fact you have to run laps on them when things are going well FEELS annoying but I think the ACTUAL annoying thing is just that it ’s slow and because them moving slow causes that you fixate on it . Running at pace with them would be equally annoying because you ’d still know you COULD go faster … That said you could be right and I ’m convinced enough that I think it ’s worth a delta
Generated Explanation w/o features: This is a very good point . I had n’t thought about it that way . Thank you for your time . I did n’t think of it that way
Generated Explanation with features: I ’m going to give you a delta because you did n’t change my view , but you ’ve convinced me that there is a difference between escort and escort ..
Table 11: Random generation from Open-NMT Pointer generator network with and without features. The model with features copies words such as “escort” from the OP, but neither model constructs a coherent, human-like sentence that addresses the explanatory context.
Original Post: Hi cmv , This post is not about whether or not abortion is morally permissible or ought to be legal . Rather , it ’s a meta - view about the way the abortion debate is structured . Often , those on either side of the debate invoke the circumstances of the pregnancy to support their arguments . Speaking broadly , pro - choice advocates often point to sexual violence or lack of consent as a trump example . Pro - life advocates tend to argue that sex is a responsibility and that women who engage in casual sex are obligated to see a pregnancy through based on that decision . Logically , however , I ca n’t see how the circumstances of a pregnancy hold bearing on whether an abortion is morally justifiable . Once a pregnancy has occurred , via any course of action , the moral quandary is the same - does the mother ’s right to bodily autonomy take precedence over the fetus ’ right to life ? Pick your favorite set of hypothetical circumstances , but at the end of the day the decision at hand is the same , and the logic that brings you to your conclusion ought to apply universally . While I understand the gut instinct to bring up rape and promiscuity when discussing this sensitive issue , I fail to see what bearing they hold to the core question , and believe they only serve to cloud and emotionally charge the debate . I ’m interested in well - structured arguments that connect the circumstances of pregnancy to the abortion debate . CMV !
persuasive comment: A major part of the abortion debate on the Pro - Life side is about legitimacy of the action . To someone that is Pro - Life abortion is the killing of a human being , but the killing of a human being is not by default murder . It is only murder if that killing is both unjustifiable and illegal . For those that want abortion to be classified as murder that means that they have to determine if the act is justifiable or not , and for many it is not an all or nothing proposition . Many who are Pro - Life consider abortion due to medical risk to the mother , severe disability of the child , or the child being the result of rape to be legitimate reasons to get an abortion and thus justifiable and in turn action that should be legal . It is the use of an abortion simply because you do not want a child , or because you forgot to use protection that they consider murder . This distinction in kind of human killing is also why killing in war , self defense , defense of another , and an actual pure accident not due to negligence are not crimes as well as why killing as an accident due to negligence is classified as manslaughter and not murder.
Reference Explanation: To someone that is Pro - Life abortion is the killing of a human being , but the killing of a human being is not by default murder . It is only murder if that killing is both unjustifiable and illegal. For those that want abortion to be classified as murder that means that they have to determine if the act is justifiable or not , and for many it is not an all or nothing proposition . You get a delta for this . The distinction between ending a life and murder is valid , and when operating with the latter definition , it makes sense that you ’d have to consider the extraneous factors . I may be limiting the debate too much by interpreting the pro - life position to be against ending a life rather than against murder .
Generated Explanation w/o features: “It is only about whether or not abortion is morally permissible or ought to be legal”. Rather , it ’s a meta - view about the way the decision at hand is the same , and the killing of a human , or the child being the result of rape to be legitimate reasons to get an abortion and thus justifiable and in turn that they have to determine if the act is morally . Once , pro - choice advocates tend to argue that sex is a responsibility and that women who engage in casual sex are
Generated Explanation with features: “It is only murder if that killing is both unjustifiable and illegal . For those that want abortion to be classified as murder that means that they have to determine if the act is justifiable or not , and for many it is not an all or nothing proposition”. !delta. Many who are Pro - Life consider abortion due to medical risk that they consider murder . This distinction in kind of human killing is also why killing as an accident due to the mother , or the child being the result of rape to be legitimate
Table 12: Random generation from Open-NMT Pointer generator network with and without features. Both generated explanations copy entire passages from the explanandum, which suggests that the models behave like extractive summarizers.
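The copying behavior visible in this example can be quantified with a small sketch. The code below is illustrative only and is not the evaluation used in the paper: the function names `echoed_fraction` and `longest_copied_span` are hypothetical, and it assumes whitespace-tokenized, lowercased text like the examples in these tables. It reports what share of an explanation's tokens also occur in the explanandum and the length of the longest contiguous run copied from it.

```python
def echoed_fraction(explanandum: str, explanation: str) -> float:
    """Share of explanation tokens that also occur in the explanandum."""
    source = set(explanandum.lower().split())
    target = explanation.lower().split()
    if not target:
        return 0.0
    return sum(tok in source for tok in target) / len(target)


def longest_copied_span(explanandum: str, explanation: str) -> int:
    """Length (in tokens) of the longest contiguous run shared by both texts."""
    src = explanandum.lower().split()
    tgt = explanation.lower().split()
    best = 0
    # prev[j] holds the length of the common run ending at the previous
    # source token and at target token j (1-indexed).
    prev = [0] * (len(tgt) + 1)
    for s in src:
        cur = [0] * (len(tgt) + 1)
        for j, t in enumerate(tgt, start=1):
            if s == t:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

A long copied span relative to the explanation length points to extractive behavior, whereas a templated reply such as "That 's a good point" would show a short span even if many of its stopwords happen to appear in the explanandum.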
Original Post: When people use adblockers , they are hurting both consumers and producers . Adblockers take away the primary source of income for websites . Enough people use adblockers that this can seriously jeopardize the finances of a website . These sites include wikis , local newspapers , and many other valuable online resources . If the situation gets bad enough , it forces the producer to do one of 2 things . 1 . Shut down . OR 2 . Move to some sort of paid subscription service . Either way , the world just lost some valuable free information . This hurts the consumers . The benefits of adblockers are small compared to these consequences . Most people justify their use of adblockers by saying they want to avoid viruses / scams and/or intrusive / page - blocking / annoying ads If you are tech savvy enough to get an adblocker , you are probably tech savvy enough to understand what websites you should avoid . Plus you probably have an anti virus anyways . If you ’re bothered by intrusive ads , just do n’t visit the damn website . Shitty ads are the price you pay for going to some websites . If you are n’t willing to pay that price , do n’t go to those websites . That simple . That ’s all I have to say I guess . i ’ve just seen too many good websites go down the drain because of this .
persuasive comment: ” Most people justify their use of adblockers by saying they want to avoid viruses / scams and/or intrusive / page - blocking / annoying ads ” ” If you are tech savvy enough to get an adblocker , you are probably tech savvy enough to understand what websites you should avoid . Plus you probably have an anti virus anyways . ” Why avoid the website when you can neuter it with an adblocker ? It still has the content you were looking for after all . You ’re also discounting the massive resource savings adblocking can cause . One university deployed an adblocker and saw their traffic go down 30 % . That ’s huge , and that s just network resources , how many cpu cycles get wasted every second running poorly written javascript ads ? How much of your battery goes towards rendering ads ?
Reference Explanation: That ’s a good point about resource wasting . Not sure I 100% agree that that makes adblockers worth it but I think that ’s a valid reason for using one . I ’ll give you a delta
Generated Explanation w/o features: That ’s a good point . I had n’t thought about it that way , but I ’ll give you a delta for making me realize that it would be better. Thank you for changing my view
Generated Explanation with features: I ’m going to give you a delta because I did n’t really think of it in a way that makes sense to me . i ’m just going to give you a delta.
Table 13: Random generation from Open-NMT Pointer generator network with and without features. Here, both generated explanations fail to mention any concept specific to the explanandum; each instead produces a generic template for acknowledging a view change.
Original Post: People ’s main argument is that the poor will have to play money , but they would only have to pay very little because they make so little . It would make everyone feel that they are accomplishmisg something for the nation . Also I am also saying that the rich will also pay their fair ammounts . I forgot where but it was calcuated if everyone would pay a 24 % tax it would work out for America . Also I would argue for a flat tax because it completely eliminate the need for the Internal Revenue Service . Flat taxes are especially popular in the business and investment communities , where it is argued that because income from dividends , capital gains and distributions is untaxed , freeing up money that would have gone to taxes , investments and savings are thereby encouraged . But that is just my view , so feel free to change it . :)
persuasive comment: I think it ’s worth examining what you mean by fair . My read of your flat tax is that you think equal treatment gives equal / fair footing . Since you mentioned elsewhere that you think minimum wage should not be a permanent career choice , I assume you think people should start at minimum wage but take a better job as soon as possible . So , here are some practical questions : 1 . How is a person supposed to go on job interviews if they are already not making enough money to support them self ? 2 . How is a person supposed to spend money economically if they never have enough of it on a reliable basis to take advantage of sales , bulk purchases , payment plans , etc . ? 3 . How is a person supposed to got to school if they do n’t have money to pay for their education ? And what responsible lender would loan someone that money knowing how little income they have available to make payments ? On the other side of the spectrum : 1 . How is a millionaire harmed by paying a higher marginal rate , given that he already has all of his basic need met , and then some ? 2 . What is the benefit to a millionaire putting his money into savings ? Does n’t that just make less money available to the rest of the economy ? 3 . Is it ’ fair ’ to allow a millionaire to invest in and profit in a business , while paying those whose labor produces the profits for that business make less than a living wage ?
Reference Explanation: 1 they start with a minimum wage job and work themselves up , that s how its always been 2 Plenty of ppl do nt have enough money right now , I am not saying that the system is perfect 3 There are plenty of people that give loans to students that do nt even have a job 4 He is harmed by losing money ? 5 They will use it to buy things thus boosting the econ . 6 Yea , because that s how they set it up and the gov . allows it to happen If anything boost the min . wage delta
Generated Explanation w/o features: That ’s a good point .Thanks for changing my view. delta
Generated Explanation with features: That ’s a good point . I was n’t aware of it that way . I still think that it should be a permanent career , but that ’s a good point .
Table 14: Random generation from Open-NMT Pointer generator network with and without features.