Learning from Fact-checkers: Analysis and Generation of Fact-checking Language
In fighting against fake news, many fact-checking systems comprised of human-based fact-checking sites (e.g., snopes.com and politifact.com) and automatic detection systems have been developed in recent years. However, online users still keep sharing fake news even when it has been debunked. This means that early fake news detection may be insufficient and that we need a complementary approach to mitigate the spread of misinformation. In this paper, we introduce a novel application of text generation for combating fake news. In particular, we (1) leverage online users named fact-checkers, who cite fact-checking sites as credible evidence to fact-check information in public discourse; (2) analyze linguistic characteristics of fact-checking tweets; and (3) propose and build a deep learning framework to generate responses with fact-checking intention to increase the fact-checkers’ engagement in fact-checking activities. Our analysis reveals that the fact-checkers tend to refute misinformation and use formal language (e.g. few swear words and little Internet slang). Our framework successfully generates relevant responses, and outperforms competing models by achieving up to 30% improvements. Our qualitative study also confirms the superiority of our generated responses compared with responses generated from the existing models.
Our media landscape has been flooded by a large volume of falsified information, overstated statements, false claims, fauxtography and fake videos (https://cnnmon.ie/2AWCCix), perhaps due to the popularity, impact and rapid information dissemination of online social networks. The unprecedented amount of disinformation posed severe threats to our society, degraded trustworthiness of cyberspace, and influenced the physical world. For example, $139 billion was wiped out when the Associated Press (AP)’s hacked Twitter account posted fake news about an explosion at the White House that injured Barack Obama.
To fight against fake news, many fact-checking systems ranging from human-based systems (e.g. Snopes.com) and classical machine learning frameworks (Kwon et al., 2013; Popat et al., 2016; Nguyen et al., 2018) to deep learning models (Ma et al., 2016; Wang, 2017; Wang et al., 2018b; Popat et al., 2018) were developed to determine the credibility of online news and information. However, falsified news is still disseminated like wildfire (Maddock et al., 2015; Zhao et al., 2015) despite the dramatic rise of fact-checking sites worldwide (Lab, 2018). Furthermore, recent work showed that individuals tend to selectively consume news that has ideologies similar to what they believe while disregarding contradicting arguments (Ecker et al., 2010; Nyhan and Reifler, 2010). These problems indicate that using only fact-checking systems to debunk fake news is insufficient, and complementary approaches are necessary to combat fake news.
Therefore, in this paper, we focus on online users named fact-checkers, who directly engage with other users in public dialogues and convey verified information to them. Figure 1 shows a real-life conversation between a user, named original poster, and a fact-checker. In Figure 1, the original poster posts a false claim related to General Pershing. The fact-checker refutes the misinformation by replying to the original poster and provides a fact-checking article as supporting evidence. We call such a reply a fact-checking tweet (FC-tweet). Recent work (Vo and Lee, 2018) showed that fact-checkers often fact-checked original tweets within a day of their being posted and that their FC-tweets could reach hundreds of millions of followers. Additionally, Friggeri et al. (2014) showed that the likelihood of deleting shares of fake news increased by four times when there was a fact-checking URL in users’ comments. In our analysis, we also observe that after receiving FC-tweets, 7% of original tweets were no longer accessible because of account suspension, tweet deletion, or a switch to private mode.
Due to the fact-checkers’ activeness and high impact on dissemination of fact-checked content, in this paper, our goal is to further support them in fact-checking activities toward complementing existing fact-checking systems and combating fake news. In particular, we aim to build a text generation framework to generate responses with fact-checking intention when original tweets are given (we use the terms “fact-checking tweets (FC-tweets)”, “fact-checking responses”, and “fact-checking replies” interchangeably). The fact-checking intention means either confirming or refuting content of an original tweet by providing credible evidence. We assume that fact-checkers choose the fact-checking URLs by themselves based on their interests (e.g., https://t.co/7Vyi5AoeG1 in Figure 1). Therefore, we focus on generating responses without automatically choosing specific fact-checking URLs, which is beyond the scope of this paper.
To achieve the goal, we have to solve the following research problems: (P1) how can we obtain a dataset consisting of original tweets and associated fact-checking replies (i.e., replies which exhibit fact-checking intention)?; (P2) how can we analyze how fact-checkers communicate fact-checking content to original posters?; and (P3) how can we automatically generate fact-checking responses when given content of original tweets?
To tackle the first problem (P1), we may use already available datasets (Jiang and Wilson, 2018; Vosoughi et al., 2018; Vo and Lee, 2018). However, the dataset in (Jiang and Wilson, 2018) contains a relatively small number of original tweets (5,000) and many FC-tweets (170K). Since the FC-tweet generation process depends on the content of original tweets, such a small number of original tweets may reduce the diversity of generated responses. The dataset in (Vosoughi et al., 2018) is large but fully anonymized, and the dataset in (Vo and Lee, 2018) does not contain original tweets. Therefore, we collected our own dataset consisting of 64,110 original tweets and 73,203 FC-tweets (i.e., each original tweet receives 1.14 FC-tweets on average) by using the Hoaxy system (Shao et al., 2016) and FC-tweets in (Vo and Lee, 2018).
To understand how fact-checkers convey credible information to original posters and other users in online discussions (P2), we conducted a data-driven analysis of FC-tweets and found that fact-checkers tend to refute misinformation and employ more impersonal pronouns. Their FC-tweets were generally more formal and contained few swear words and little Internet slang. These analytical results are important since they suggest that we can reduce the likelihood of generating racist tweets (Insider, 2016), hate speech (Davidson et al., 2017) and trolling (Cheng et al., 2017).
To address the third problem (P3), we propose a deep learning framework to automatically generate fact-checking responses for fact-checkers. In particular, we build the framework based on Seq2Seq (Sutskever et al., 2014) with attention mechanisms (Luong et al., 2015).
Our contributions are as follows:
To the best of our knowledge, we are the first to propose a novel application of text generation for supporting fact-checkers and increasing their engagement in fact-checking activities.
We conduct a data-driven analysis of linguistic dimensions, lexical usage and semantic frames of fact-checking tweets.
We propose and build a deep learning framework to generate responses with fact-checking intention. Experimental results show that our models outperformed competing baselines quantitatively and qualitatively.
We publicly release our collected dataset and source code to stimulate further research in fake news intervention (https://github.com/nguyenvo09/LearningFromFactCheckers).
2. Related work
In this section, we briefly cover related works about (1) misinformation and fact-checking, and (2) applications of text generation.
2.1. Misinformation and Fact-checking
Fake news has recently emerged as a major threat to the credibility of information in cyberspace. Since human-based fact-checking sites could not fact-check every piece of falsified news, many automated fact-checking systems were developed to detect fake news in its early stage by using different feature sets (Qazvinian et al., 2011; Gupta et al., 2013; Zhao et al., 2015; Vosoughi et al., 2018; Jiang and Wilson, 2018), knowledge graphs (Shiralkar et al., 2017) and crowd signals (Nguyen et al., 2018; Kim et al., 2018, 2019), and by using deep learning models (Ma et al., 2015; Wang et al., 2018b; Popat et al., 2018). In addition, other researchers studied how to fact-check political statements (Wang, 2017), mutated claims from Wikipedia (Thorne et al., 2018) and answers in Q&A sites (Mihaylova et al., 2018).
Other researchers studied the intention behind spreading fake news (e.g. misleading readers, inciting clicks for revenue and manipulating public opinion) and different types of misinformation (e.g. hoaxes, clickbait, satire and disinformation) (Volkova et al., 2017; Rashkin et al., 2017). Linguistic patterns of political fact-checking webpages and fake news articles (Rashkin et al., 2017; Horne and Adali, 2017) were also analyzed. Since our work utilizes FC-tweets, studies analyzing users’ replies (Qazvinian et al., 2011; Friggeri et al., 2014; Vosoughi et al., 2018; Jiang and Wilson, 2018) are closely related to ours. However, the prior works paid limited attention to analyzing how fact-checkers convey fact-checked content to original posters in public discourse.
Additionally, researchers investigated topical interests and temporal behavior of fact-checkers (Vo and Lee, 2018), the relationship between fact-checkers and original posters (Hannak et al., 2014), how fake news disseminated when fact-checking evidence appeared (Friggeri et al., 2014), and whether users were aware of fact-checked information when it was available (Jiang and Wilson, 2018). Our work is different from these prior works since we focus on linguistic dimensions of FC-tweets, and propose and build a response generation framework to support fact-checkers.
2.2. Applications of Text Generation
Text generation has been used for language modeling (Mikolov et al., 2010), question answering (Hermann et al., 2015), machine translation (Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015), dialogue generation (Serban et al., 2015; Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2017; Wang et al., 2018a), and so on. Recently, it has been employed to build chat bots for patients with depression (https://read.bi/2QZ0ZPn), customer assistants in commercial sites, teen chat bots (Insider, 2016), and supporting tools for teachers (Chen et al., 2018). Text generation has also been used to detect fake reviews (Yao et al., 2017), clickbait headlines (Shu et al., 2018), and fake news (Qian et al., 2018). Our study is the first work that generates responses based on FC-tweets as a supporting tool for fact-checkers. Our work is closely related to dialog generation, in which there are three main technical directions: (1) deterministic models (Serban et al., 2015; Shang et al., 2015; Vinyals and Le, 2015; Wang et al., 2018a), (2) Variational Auto-Encoders (VAEs) (Serban et al., 2017), and (3) Generative Adversarial Networks (GANs) (Li et al., 2017). Although VAEs and GANs have recently shown promising results, deterministic models are still dominant in the literature since they are easier to train than VAEs and GANs, and achieve competitive results (Le et al., 2018). Thus, we propose a response generation framework based on Seq2Seq and an attention mechanism (Luong et al., 2015).
In this section, we describe our data collection and preprocessing process. We utilized the dataset in (Vo and Lee, 2018) and the Hoaxy system (Shao et al., 2016) to collect FC-tweets, which contained fact-checking URLs from two popular fact-checking sites: snopes.com and politifact.com. In total, we collected 247,436 distinct fact-checking tweets posted between May 16, 2016 and May 26, 2018.
Similar to (Vo and Lee, 2018; Hannak et al., 2014), we removed non-English FC-tweets, and FC-tweets containing fact-checking URLs linked to non-article pages such as the main page and about page of a fact-checking site. Then, for each remaining FC-tweet, if its corresponding original tweet was deleted or was not accessible via Twitter APIs because of suspension of the original poster, we further filtered out the fact-checking tweet. As a result, 190,158 FC-tweets and 164,477 distinct original tweets remained.
To further ensure that each of the remaining FC-tweets reflected fact-checking intention and to build a high-quality dataset, we only kept a fact-checking tweet whose fact-checking article was rated as true or false. Our manual verification of 100 random samples confirmed that fact-checking tweets citing fact-checking articles with a true or false label contained clearer fact-checking intention than fact-checking tweets with other labels such as half true or mixture. In other words, FC-tweets associated with mixed labels were discarded. After the pre-processing steps, our final dataset consisted of 73,203 FC-tweets and 64,110 original tweets posted by 41,732 distinct fact-checkers, and 44,411 distinct original posters, respectively. We use this dataset in the following sections.
4. Linguistic Characteristics of Fact-checking Tweets
Since our goal is to automatically generate responses with fact-checking intention, it is necessary to analyze what kind of linguistic characteristics FC-tweets have, and verify whether FC-tweets in our dataset have the fact-checking intention.
To highlight linguistic characteristics of FC-tweets, we compare FC-tweets with Normal Replies, which are direct responses to the same 64,110 original tweets without including fact-checking URLs, and Random Replies, which do not share any common original tweets with FC-tweets. Initially, we collected 262,148 English Normal Replies and 97M English Random Replies posted in the same period as the FC-tweets. Then, we sampled 73,203 Normal Replies and 73,203 Random Replies from the initial collection to balance the data with our FC-tweets. All of the FC-tweets, Normal Replies and Random Replies were first preprocessed by replacing URLs with url and mentions with @user, and by removing special characters. They were tokenized by the NLTK tokenizer. Then, we answer the following research questions. Note that we sampled 73,203 Random Replies and 73,203 Normal Replies four more times, and our analysis was consistent with the following results.
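The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the regular expressions and the helper name `preprocess` are assumptions, and a plain whitespace split stands in for the NLTK tokenizer that the paper uses.

```python
import re

# Assumed patterns; the paper does not specify the exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Keep letters, digits, whitespace, apostrophes, and the "@" of the @user placeholder.
SPECIAL_RE = re.compile(r"[^A-Za-z0-9@'\s]")


def preprocess(tweet: str) -> list[str]:
    """Replace URLs with 'url', mentions with '@user', drop special characters."""
    text = URL_RE.sub(" url ", tweet)        # URLs first, so '@' inside a URL is not treated as a mention
    text = MENTION_RE.sub(" @user ", text)   # any @handle becomes the placeholder
    text = SPECIAL_RE.sub(" ", text)         # remove remaining special characters
    return text.split()                      # whitespace split as a stand-in for NLTK


tokens = preprocess("@JohnDoe This has been debunked: https://t.co/abc123 !!")
```

Replacing URLs before mentions is a deliberate ordering choice in this sketch, so that a handle-like substring inside a link is not rewritten twice.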
[Table 1: Topics (Topic 1 to Topic 5) with associated keywords, extracted by LDA for each of the three types of replies.]
Q1: What are underlying themes in FC-tweets?
To answer this question, we applied the standard LDA algorithm to each of the three types of replies, so we built three independent LDA models. Table 1 shows 5 topics extracted from each of the three LDA topic models with associated keywords. Firstly, FC-tweets exhibit clear fact-checking intention with keywords such as debunked, snopes, read, stop, check, and lie. Secondly, keywords of Normal Replies show awareness of misinformation. However, fact-checking intention is not clear compared with FC-tweets. The keywords of Random Replies are commonly used in daily conversations. Based on the analysis, we conclude that the main themes of FC-tweets are about fact-checking information in the original tweets.
Q2: What are the psycholinguistic characteristics of FC-tweets?
We employed LIWC 2015 (Pennebaker et al., 2015), a standard approach for mapping text to 73 psychologically-meaningful categories, to understand psychological characteristics of FC-tweets. For each FC-tweet, we counted how many of its words belonged to each LIWC category. Then, we computed a normalized score for the category by dividing the count by the number of words in the FC-tweet. Finally, we report the average scores and variances for each LIWC category based on FC-tweets. The same process was applied to Normal Replies and Random Replies. We examined all LIWC categories and report only the most significant results.
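The per-category scoring described above can be sketched as follows. The lexicon here is a hypothetical toy stand-in (the real LIWC 2015 dictionary is proprietary and much larger), the function names are illustrative, and the use of population variance is an assumption since the paper does not state which variance it reports.

```python
from statistics import mean, pvariance

# Hypothetical toy lexicon, NOT the LIWC 2015 dictionary.
TOY_LEXICON = {
    "ipron": {"it", "this", "that"},
    "negate": {"no", "not", "never"},
}


def category_scores(tokens, lexicon=TOY_LEXICON):
    """Normalized score per category: matching words / total words in the tweet."""
    n = len(tokens)
    if n == 0:
        return {cat: 0.0 for cat in lexicon}
    return {
        cat: sum(tok.lower() in words for tok in tokens) / n
        for cat, words in lexicon.items()
    }


def summarize(tweets, lexicon=TOY_LEXICON):
    """Average score and (population) variance of each category over all tweets."""
    per_tweet = [category_scores(tokens, lexicon) for tokens in tweets]
    return {
        cat: (mean([s[cat] for s in per_tweet]), pvariance([s[cat] for s in per_tweet]))
        for cat in lexicon
    }


stats = summarize([["it", "never", "happened", "url"], ["that", "is", "it", "url"]])
```

Applying `summarize` separately to FC-tweets, Normal Replies and Random Replies would give the three sets of averages that the analysis compares.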
(A) FC-tweets have the highest usage of impersonal pronouns and the least utilization of personal pronouns. In Figure 2(a), we can see that FC-tweets exhibit the highest usage of impersonal pronouns (e.g. it, this, that) in comparison with Normal Replies and Random Replies. This observation is statistically significant under a one-sided Mann-Whitney U-test. Examples of FC-tweets containing impersonal pronouns (called iprons in LIWC) are (i) @user This has been debunked repeatedly - url please stop spreading the lie, thanks!, and (ii) @user it is a wonderful quote but Lewis never said it : url and url.
In contrast, Normal Replies show the highest mean score in 2nd person pronouns (named the you category) in comparison to FC-tweets and Random Replies. Note that you in this context may refer to the original posters. In 1st person pronouns (named the I category), Random Replies have the highest score because they contain daily personal conversations between online users. Finally, FC-tweets have the smallest usage of we and they among the three groups of replies.
(B) FC-tweets have a tendency to refute content of original tweets. Figure 2(b) shows the mean scores of the adj, negate, differ and certain categories. Specifically, FC-tweets exhibit the highest mean score in the adjectives category in comparison to Normal Replies and Random Replies. Prevalent adjectives in FC-tweets are fake, wrong, dump, false, and untrue. FC-tweets also tend to refute information in original tweets. Their mean score in the negate category is about two times higher than the mean score of Normal Replies. FC-tweets also have the highest usage of words in the differ category (e.g. actually, but, except) among the three groups. In the certain category (e.g. never, ever, nothing, always), FC-tweets’ mean score is also significantly higher, about double the average score of Normal Replies. Examples of FC-tweets are: (i) @user wrong. never happened. url, (ii) @user except he didn’t. that tweet has been proven fake: url, and (iii) @user I sure hope you’re joking. url.
(C) FC-tweets are usually more formal and have low usage of swear words. In Figure 2(c), FC-tweets have a lower mean score in the informal category than Normal Replies and Random Replies. FC-tweets also use the fewest swear words among the three groups. In terms of the netspeak category (i.e. Internet slang), FC-tweets generally have a smaller average score than Random Replies. Furthermore, FC-tweets do not contain many words in the assent category (e.g. OK, yup, okey) compared with Random Replies. Regarding the formality of FC-tweets, we conjecture that fact-checkers try to persuade original posters to stop spreading fake news, leading to more formal language and less usage of swear words. An example of FC-tweets is @user url I’m sure you’ll still say it’s true- but it simply isn’t. Google for facts and debunks please.
(D) FC-tweets emphasize what happened in the past whereas Normal Replies and Random Replies focus on the present and future. In Figure 2(d), FC-tweets usually employ verbs in the past tense to mention past stories that support their factual corrections. Thus, the average score of the focuspast category of FC-tweets is the highest among the three groups of replies, whereas Normal Replies and Random Replies emphasize the present and future. In particular, FC-tweets have the lowest scores in the focuspresent and focusfuture categories. An example of FC-tweets is @user yeah, she merely said something that was only slightly less absurd. url.
Q3: How are semantic frames represented in FC-tweets?
So far, we examined every word independently without considering its dependencies on other words (e.g. surrounding words), which are helpful in understanding its meaning. Thus, we now employ the SEMAFOR model (Das et al., 2010), trained on FrameNet data (https://framenet.icsi.berkeley.edu/fndrupal/frameIndex), to extract rich structures called semantic frames based on syntactic trees of sentences. For example, a frame Statement consists of a noun and a verb where the noun indicates a speaker and the verb implies the act of conveying a message. We measured the distribution of semantic frames of FC-tweets by first counting the number of occurrences of every frame type across all FC-tweets, and then normalizing it by the total number of detected frames in all FC-tweets. The same process was applied to Normal Replies and Random Replies. Figure 3 shows the percentage of different types of frames detected by SEMAFOR. We have the following observations:
(A) FC-tweets display high usage of Artificiality, Statement and Existence. In Figure 3, FC-tweets have the highest utilization of Artificiality (e.g. wrong, lie, fake, false, genuine, phoney) among the three groups of replies (according to a one-sided z-test). This frame accounts for 5.71% of detected frames in FC-tweets compared with Normal Replies (0.90%) and Random Replies (0.11%). FC-tweets also have the highest proportion of the frame Statement among the three groups of replies. Words that evoke the frame Statement in FC-tweets are said, says, claims, report, told, talk, and mention. Examples of FC-tweets are (i) @user You’re the one who has no clue. She never said this: url, and (ii) @user Snopes reports this rumor as false. url. To refer to verified information, FC-tweets employed the frame Existence (1.38%) more often than Normal Replies (0.71%) and Random Replies (0.70%). The most popular phrases evoking the frame Existence were real, there is, there are, exist, there were, and there have been. Examples of FC-tweets are: (i) @user There is no trucker strike in Puerto Rico url, (ii) @user That town doesnt exist url.
(B) FC-tweets exhibit the highest Morality_evaluation and have less usage of Desirability. As shown in Figure 3, FC-tweets contain the highest proportion of the frame Morality_evaluation (1.06%) among the three groups. The most popular words in the frame Morality_evaluation are wrong, evil, dishonest, despicable, unethical, and immoral. Further supporting evidence for this observation is the lower usage of the frame Desirability (e.g. good, better, bad, great, okay, cool) in FC-tweets (1.19%) than in Normal Replies (1.89%) and Random Replies (2.5%). An example of FC-tweets is: @user you’re such an evil, despicable creature. url
(C) FC-tweets do not use much Temporal_collocation. FC-tweets show lower usage of Temporal_collocation (1.08%) than Normal Replies and Random Replies. The most common words that evoke this frame in FC-tweets are when, now, then, today, current, recently, and future. These words are mainly about the present and the future. This result supports the earlier observation that FC-tweets tend to focus on the past.
Q4: Do fact-checkers include details of fact-checking articles? We firstly derived latent representations of FC-tweets and articles by training two Doc2Vec models (Le and Mikolov, 2014) - one for FC-tweets and the other one for fact-checking articles. The embedding size is 50. Then, we measured cosine similarity between a FC-tweet and the fact-checking article embedded in the FC-tweet as shown in Figure 4(a). Interestingly, most FC-tweets do not have high similarity with FC-articles, suggesting that fact-checkers rarely include details from fact-checking articles in FC-tweets. However, there were several enthusiastic fact-checkers who extracted information from fact-checking articles to make FC-tweets more persuasive, as shown in two tails of the curve in Figure 4(a).
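The similarity measure used for Q4 can be illustrated with a small, dependency-free sketch. The vectors below are made-up stand-ins for the 50-dimensional Doc2Vec embeddings; the function name and the zero-vector convention are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: a zero vector is treated as unrelated
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional stand-ins for an FC-tweet and its cited article.
fc_tweet_vec = [0.2, -0.1, 0.4]
article_vec = [0.1, 0.3, 0.5]
similarity = cosine_similarity(fc_tweet_vec, article_vec)
```

Values near 1 would indicate an FC-tweet whose embedding points in nearly the same direction as that of its article, while values near 0 correspond to the bulk of the distribution described above.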
Q5: Is there any connection between tokens of FC-tweets and shares? Since sharing FC-tweets by retweets and quotes is important for increasing the visibility of credible information on online social networks, we examined the correlation between the number of tokens of FC-tweets and their shares. We only focus on tokens because this could help us decide the length of a generated response. Figure 4(b) shows a scatter plot of FC-tweets’ tokens and shares (i.e., quotes and retweets). Generally, most FC-tweets had shares=0. However, FC-tweets with tokens usually received more attention. To verify this, we created two lists (one containing shares of FC-tweets with tokens and another one for shares of FC-tweets with tokens), and then conducted a one-sided Mann-Whitney U-test. We found that the latter one had significantly larger numbers than the former one. We conclude that very short FC-tweets may not be informative enough to draw readers’ attention, and lengthy FC-tweets may be too time-consuming to read, leading to a small number of shares. Therefore, a reasonable length of FC-tweets is preferable when we generate a response.
Q6: Is there any signal suggesting a positive effect of FC-tweets? We examined what happened to original tweets after receiving fact-checking tweets. In Oct 2018 (i.e., five months after collecting our dataset), we re-collected original tweets via Twitter APIs to see if all of the original tweets were retrievable. Interestingly, 4,516 (7%) of the original tweets were not retrievable. There are three reasons: (i) User Suspension, (ii) Not Found Status (i.e. deleted status), and (iii) Not Authorized (i.e. original tweets are in the private mode).
In Figure 4(c), User Suspension accounted for 58.30% of the irretrievable original tweets. Although there may be many factors that potentially explain suspension (e.g. original posters may have other abusive behaviors that triggered the Twitter security system), one obvious observation is that fact-checkers tended to target bad users (e.g. content polluters (Lee et al., 2010)), who usually exhibit abusive behavior on social platforms. This suggests that fact-checkers are enthusiastic about checking the credibility of information on social networks. Regarding the two reasons Not Authorized and Not Found Status, perhaps original posters were either aware of the wrong information they posted or were under pressure due to criticism they received from other users, leading them to delete or hide their original tweets.
In summary, our analysis reveals that fact-checkers refuted content of original tweets, and that their FC-tweets were more formal than Normal Replies and Random Replies. To provide supporting evidence, FC-tweets utilized the semantic frames Existence and Statement. These results confirm that FC-tweets exhibit clear fact-checking intention.
5. Response Generation Framework
In the previous section, we analyzed common topics, lexical usages, and distinguishing linguistic dimensions of FC-tweets compared with Normal Replies and Random Replies. Our analysis revealed that FC-tweets indeed exhibited clear fact-checking intention, which is the property that we desired. Now, we turn our attention to proposing and building our framework, named Fact-checking Response Generator (FCRG), in order to generate responses with fact-checking intention. The generated responses are used to support fact-checkers and increase their engagement.
Formally, given a pair of an original tweet and a FC-tweet, the original tweet $x$ is a sequence of words $x = (x_1, x_2, \dots, x_N)$ and the FC-tweet $y$ is another sequence of words $y = (y_1, y_2, \dots, y_M)$, where $N$ and $M$ are the length of the original tweet and the length of the FC-tweet, respectively. We inserted a special token <s> as a starting token into every FC-tweet. Drawing inspiration from (Luong et al., 2015), we propose and build a framework as shown in Figure 5 that consists of three main components: (i) the shared word embedding layer, (ii) the encoder to capture representation of the original tweet and (iii) the decoder to generate a FC-tweet. Their details are as follows:
5.1. Shared Word Embedding Layer
For every word $x_i$ in the original tweet $x$, we represent it as a one-hot encoding vector $\mathbf{o}^{x}_i \in \{0,1\}^{V}$ and embed it into a D-dimensional vector $\mathbf{x}_i \in \mathbb{R}^{D}$ as follows: $\mathbf{x}_i = \mathbf{W}_e \mathbf{o}^{x}_i$, where $\mathbf{W}_e \in \mathbb{R}^{D \times V}$ is an embedding matrix and $V$ is the vocabulary size. We use the same word embedding matrix $\mathbf{W}_e$ for the FC-tweet. In particular, for every word $y_j$ (represented as one-hot vector $\mathbf{o}^{y}_j$) in the FC-tweet $y$, we embed it into a vector $\mathbf{y}_j = \mathbf{W}_e \mathbf{o}^{y}_j$. The embedding matrix $\mathbf{W}_e$ is a learned parameter and could be initialized by either pre-trained word vectors (e.g. Glove vectors) or random initialization. Since our model is designed specifically for the fact-checking domain, we initialized $\mathbf{W}_e$ with a Normal Distribution and trained it from scratch. By using a shared $\mathbf{W}_e$, we could reduce the number of learned parameters significantly compared with (Luong et al., 2015). This is helpful in reducing overfitting.
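5.2. Encoder
The encoder is used to learn a latent representation of the original tweet $x$. We adopt a Recurrent Neural Network (RNN) to represent the encoder due to its large capacity to condition each word on all previous words in the original tweet $x$. To overcome the vanishing or exploding gradient problem of RNN, we choose Gated Recurrent Unit (GRU) (Cho et al., 2014). Formally, we compute the hidden state $\mathbf{h}_i$ at time-step $i$ in the encoder as follows:

$$\mathbf{h}_i = \mathrm{GRU}(\mathbf{x}_i, \mathbf{h}_{i-1}) \tag{1}$$

where the GRU is defined by the following equations:

$$\begin{aligned}
\mathbf{z}_i &= \sigma(\mathbf{W}_z \mathbf{x}_i + \mathbf{U}_z \mathbf{h}_{i-1} + \mathbf{b}_z) \\
\mathbf{r}_i &= \sigma(\mathbf{W}_r \mathbf{x}_i + \mathbf{U}_r \mathbf{h}_{i-1} + \mathbf{b}_r) \\
\tilde{\mathbf{h}}_i &= \tanh(\mathbf{W}_h \mathbf{x}_i + \mathbf{U}_h (\mathbf{r}_i \odot \mathbf{h}_{i-1}) + \mathbf{b}_h) \\
\mathbf{h}_i &= (1 - \mathbf{z}_i) \odot \tilde{\mathbf{h}}_i + \mathbf{z}_i \odot \mathbf{h}_{i-1}
\end{aligned} \tag{2}$$

where $\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h, \mathbf{b}_z, \mathbf{b}_r, \mathbf{b}_h$ are learned parameters. $\tilde{\mathbf{h}}_i$ is the new updated hidden state, $\mathbf{z}_i$ is the update gate, $\mathbf{r}_i$ is the reset gate, $\sigma(\cdot)$ is the sigmoid function, $\odot$ is the element-wise product, and $\mathbf{h}_0 = \mathbf{0}$. After going through every word of the original tweet $x$, we have hidden states for every time-step $\mathbf{H} = \mathbf{h}_1 \oplus \mathbf{h}_2 \oplus \dots \oplus \mathbf{h}_N$, where $\oplus$ denotes concatenation of hidden states. We use the last hidden state $\mathbf{h}_N$ as features of the original tweet $x$.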
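The encoder recurrence described above can be sketched in plain Python. This is a minimal illustration of a standard GRU step in the style of Cho et al. (2014), not the authors' implementation: the parameter names (`Wz`, `Uz`, `bz`, and so on), the dictionary layout, and the convention that the update gate weights the previous state are all assumptions.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(*vectors):
    """Element-wise sum of several equal-length vectors."""
    return [sum(xs) for xs in zip(*vectors)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def gru_step(x, h_prev, p):
    """One GRU update; p is a dict of (hypothetically named) learned parameters."""
    z = sigmoid(add(matvec(p["Wz"], x), matvec(p["Uz"], h_prev), p["bz"]))  # update gate
    r = sigmoid(add(matvec(p["Wr"], x), matvec(p["Ur"], h_prev), p["br"]))  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]                          # r ⊙ h_prev
    h_tilde = [math.tanh(a) for a in
               add(matvec(p["Wh"], x), matvec(p["Uh"], gated), p["bh"])]    # candidate state
    # Interpolate between the candidate state and the previous state.
    return [(1.0 - zi) * ht + zi * hp for zi, ht, hp in zip(z, h_tilde, h_prev)]


def encode(embedded_words, p, hidden_size):
    """Run the GRU over all word embeddings, starting from a zero hidden state."""
    h = [0.0] * hidden_size
    states = []
    for x in embedded_words:
        h = gru_step(x, h, p)
        states.append(h)
    return states  # states[-1] plays the role of the tweet-level features
```

In practice a deep learning library would provide this recurrence; the sketch only makes the role of the two gates explicit.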
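5.3. Decoder
The decoder takes the features of the original tweet as the input to start the generation of a FC-tweet. We use another GRU to represent the decoder to generate a sequence of tokens $y = (y_1, y_2, \dots, y_M)$. At each time-step $j$, the hidden state $\mathbf{s}_j$ is computed by another GRU: $\mathbf{s}_j = \mathrm{GRU}(\mathbf{y}_{j-1}, \mathbf{s}_{j-1})$, where $\mathbf{y}_{j-1}$ is the embedding of the previous token (the starting token <s> at the first step) and the initial hidden state is $\mathbf{s}_0 = \mathbf{h}_N$. To provide additional context information when generating word $y_j$, we apply an attention mechanism to learn a weighted interpolation context vector $\mathbf{c}_j$ dependent on all of the hidden states output from all time-steps of the encoder. We compute $\mathbf{c}_j = \sum_{i=1}^{N} a_{ji} \mathbf{h}_i$, where each component $a_{ji}$ of $\mathbf{a}_j$ is the alignment score between the $j$-th word in the FC-tweet and the output $\mathbf{h}_i$ from the encoder. In this study, $\mathbf{a}_j$ is computed by one of the following ways:

$$\mathbf{a}_j = \begin{cases}
\mathrm{softmax}\big([\mathbf{h}_1^{\top}\mathbf{s}_j, \dots, \mathbf{h}_N^{\top}\mathbf{s}_j]\big) & \text{(dot attention)} \\
\mathrm{softmax}\big([\mathbf{h}_1^{\top}\mathbf{W}_a\mathbf{s}_j, \dots, \mathbf{h}_N^{\top}\mathbf{W}_a\mathbf{s}_j]\big) & \text{(bilinear attention)}
\end{cases} \tag{3}$$

where softmax(.) is a softmax activation function and $\mathbf{W}_a$ is a learned weight matrix. Note that we tried to employ other attention mechanisms including additive attention (Bahdanau et al., 2015) and concat attention (Luong et al., 2015) but the above attention mechanisms in Eq. 3 produced better results. After computing the context vector $\mathbf{c}_j$, we concatenate $\mathbf{c}_j$ with $\mathbf{s}_j$ to obtain a richer representation. The word at time-step $j$ is predicted by a softmax classifier:

$$\hat{\mathbf{y}}_j = \mathrm{softmax}\big(\mathbf{W}_s \tanh(\mathbf{W}_c [\mathbf{c}_j \oplus \mathbf{s}_j])\big) \tag{4}$$

where $\mathbf{W}_c$, and $\mathbf{W}_s \in \mathbb{R}^{V \times O}$ are weight matrices of a two-layer feedforward neural network and $O$ is the output size. $\hat{\mathbf{y}}_j \in \mathbb{R}^{V}$ is a probability distribution over the vocabulary. The probability of choosing word $v_k$ in the vocabulary as output is:

$$p(y_j = v_k \mid y_{j-1}, \dots, y_1, x) = \hat{y}_{jk} \tag{5}$$

Therefore, the overall probability of generating the FC-tweet $y$ given the original tweet $x$ is computed as follows:

$$p(y \mid x) = \prod_{j=1}^{M} p(y_j \mid y_{j-1}, \dots, y_1, x) \tag{6}$$

Since the entire architecture is differentiable, we jointly train the whole network with Teacher Forcing via the Adam optimizer (Kingma and Ba, 2014) by minimizing the negative conditional log-likelihood for pairs of the original tweet $x$ and the FC-tweet $y$ as follows:

$$\mathcal{L}(\theta_e, \theta_d) = -\sum_{(x, y)} \log p(y \mid x; \theta_e, \theta_d) \tag{7}$$

where $\theta_e$ and $\theta_d$ are the parameters of the encoder and the decoder, respectively. At test time, we used beam search to select the top K generated responses. The generation process of a FC-tweet ends when an end-of-sentence token (e.g. </s>) is emitted.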
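The two attention variants referred to above (dot and bilinear) can be sketched as follows. This is a small pure-Python illustration under assumptions: the function and argument names are hypothetical, the bilinear score is taken to be the encoder state times a learned matrix times the decoder state, and the toy vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, v):
    return [dot(row, v) for row in W]


def attention(encoder_states, decoder_state, W_a=None):
    """Return (alignment weights, context vector).

    W_a=None gives dot attention; passing a square matrix gives bilinear attention.
    """
    query = decoder_state if W_a is None else matvec(W_a, decoder_state)
    weights = softmax([dot(h, query) for h in encoder_states])
    dim = len(encoder_states[0])
    # Context vector: alignment-weighted sum of the encoder hidden states.
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context


# Toy example with two encoder states of size 2.
H = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 0.0]
dot_weights, dot_context = attention(H, s)
bil_weights, bil_context = attention(H, s, W_a=[[0.0, 1.0], [1.0, 0.0]])
```

With the identity matrix as `W_a`, bilinear attention reduces to dot attention, which is one way to see that the bilinear form only adds a learned re-weighting of the decoder state.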
|Constraints||Model||BLEU-2||BLEU-3||BLEU-4||ROUGE-L||METEOR||Greedy Mat.||Vector Ext.||Avg. Rank|
|No constraints||SeqAttB||7.148 (4)||4.050 (4)||3.261 (3)||26.474 (3)||17.659 (3)||43.566 (3)||15.837 (4)||3.43|
|No constraints||HRED||7.301 (3)||4.073 (3)||3.248 (4)||26.222 (4)||17.545 (4)||42.734 (4)||18.929 (3)||3.57|
|No constraints||our FCRG-BL||7.678 (1)||4.270 (2)||3.406 (2)||27.142 (2)||17.871 (1)||43.714 (2)||20.244 (1)||1.57|
|No constraints||our FCRG-DT||7.641 (2)||4.303 (1)||3.500 (1)||27.352 (1)||17.750 (2)||45.302 (1)||19.993 (2)||1.43|
|Minimum tokens (second setting)||SeqAttB||7.470 (4)||4.085 (4)||3.175 (4)||26.169 (4)||17.719 (3)||41.038 (4)||14.686 (4)||3.86|
|Minimum tokens (second setting)||HRED||7.631 (3)||4.155 (3)||3.227 (3)||26.241 (3)||17.617 (4)||41.930 (3)||18.850 (3)||3.14|
|Minimum tokens (second setting)||our FCRG-BL||7.925 (2)||4.285 (2)||3.295 (2)||26.953 (2)||17.885 (1)||42.899 (2)||20.052 (1)||1.71|
|Minimum tokens (second setting)||our FCRG-DT||8.043 (1)||4.373 (1)||3.409 (1)||27.020 (1)||17.770 (2)||44.379 (1)||19.441 (2)||1.29|
|At least 10 tokens||SeqAttB||6.398 (4)||3.319 (4)||2.434 (4)||22.250 (4)||16.568 (4)||36.298 (4)||10.198 (4)||4.00|
|At least 10 tokens||HRED||6.540 (3)||3.373 (3)||2.462 (3)||22.980 (3)||17.106 (3)||37.513 (3)||15.537 (3)||3.00|
|At least 10 tokens||our FCRG-BL||7.576 (2)||3.780 (2)||2.660 (2)||25.086 (1)||17.832 (1)||39.809 (1)||17.605 (1)||1.43|
|At least 10 tokens||our FCRG-DT||7.955 (1)||3.914 (1)||2.751 (1)||24.635 (2)||17.662 (2)||39.374 (2)||16.081 (2)||1.57|
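The Avg. Rank column can be reproduced from the per-metric ranks shown in parentheses. The short sketch below does this for the first block of the table (no constraints on tokens); the helper name `average_rank` is ours, introduced only for illustration.

```python
def average_rank(ranks):
    """Mean of the per-metric ranks, rounded to two decimals."""
    return round(sum(ranks) / len(ranks), 2)

# Ranks over BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR,
# Greedy Matching, and Vector Extrema (first block of the table).
ranks_no_constraint = {
    "SeqAttB": [4, 4, 3, 3, 3, 3, 4],
    "HRED": [3, 3, 4, 4, 4, 4, 3],
    "FCRG-BL": [1, 2, 2, 2, 1, 2, 1],
    "FCRG-DT": [2, 1, 1, 1, 2, 1, 2],
}
avg = {model: average_rank(r) for model, r in ranks_no_constraint.items()}
```

A lower average rank indicates a model that ranks well consistently across the seven metrics.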
In this section, we thoroughly evaluate our models, namely FCRG-DT (based on dot attention in Eq. 3) and FCRG-BL (based on bilinear attention in Eq. 3), both quantitatively and qualitatively. We seek to answer the following questions:
RQ1: How do our models and the baselines perform on word overlap-based metrics (i.e., metrics measuring syntactic similarity between a ground-truth FC-tweet and a generated one)?
RQ2: How do our models perform compared with the baselines on embedding metrics (i.e., metrics measuring semantic similarity between a ground-truth FC-tweet and a generated one)?
RQ3: How does the number of tokens in generated responses affect the performance of our models?
RQ4: Are our generated responses better than those generated by the baselines in a qualitative evaluation?
RQ5: Which word embeddings learned by our model are close to each other in the semantic space?
6.1. Baselines and Our Models
Since our methods are deterministic models, we compare them with state-of-the-art baselines in this direction.
HRED: HRED (Serban et al., 2015) employs hierarchical RNNs to capture information in a long context. It is a competitive method and a commonly used baseline for dialog generation systems.
our FCRG-BL: This model uses the bilinear attention.
our FCRG-DT: This model uses the dot attention.
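The difference between the two variants can be sketched as follows. This is an illustration in the style of Luong et al. (2015), not a reproduction of Eq. 3: we assume the dot score is the inner product of a decoder state and an encoder state, and the bilinear score inserts a learned matrix `W` between them. All function names and toy vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(decoder_state, encoder_states, W=None):
    """Return (weights, context).

    W=None gives dot attention (no extra parameters); passing a matrix W
    gives bilinear attention with score h_t^T W h_s.
    """
    scores = []
    for h_s in encoder_states:
        key = h_s if W is None else matvec(W, h_s)
        scores.append(dot(decoder_state, key))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0]]
decoder_state = [1.0, 0.0]
w_dot, c_dot = attention(decoder_state, encoder_states)
# A toy W that swaps the two dimensions, so the preference flips.
w_bl, c_bl = attention(decoder_state, encoder_states, W=[[0.0, 1.0], [1.0, 0.0]])
```

The sketch also shows why the dot variant has fewer parameters: it needs no matrix `W` at all.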
6.2. Experimental Settings
Data Processing. Similar to prior work on text generation (Serban et al., 2015), we replaced numbers with <number> and personal names with <person>. Words that appeared fewer than three times were replaced by the <unk> token to further mitigate the sparsity issue. Our vocabulary size was 15,321. The minimum, maximum, and mean numbers of tokens in the original tweets were 1, 89, and 19.1, respectively. For FC-tweets, they were 3, 64, and 12.3, respectively. Only 791 original tweets contained a single token, which was mostly a URL.
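A minimal sketch of the number and rare-word replacement steps is given below. It assumes whitespace-tokenized text and a simple regular expression for numbers; replacing personal names with <person> would additionally require a named-entity recognizer, which is omitted here. The helper names and the toy corpus are hypothetical.

```python
import re
from collections import Counter

# Assumed number pattern: digits with optional "," or "." groups.
NUMBER_RE = re.compile(r"^\d+([.,]\d+)*$")

def normalize(tokens):
    """Replace numeric tokens with the <number> placeholder."""
    return ["<number>" if NUMBER_RE.match(tok) else tok for tok in tokens]

def build_vocab(corpus, min_count=3):
    """Keep words that appear at least min_count times in the corpus."""
    counts = Counter(tok for tokens in corpus for tok in tokens)
    return {tok for tok, c in counts.items() if c >= min_count}

def replace_rare(tokens, vocab):
    """Map out-of-vocabulary words to the <unk> token."""
    return [tok if tok in vocab else "<unk>" for tok in tokens]

texts = ["this is fake 100", "this is false 2,000",
         "this is wrong 3.5", "snopes says no"]
corpus = [normalize(t.split()) for t in texts]
vocab = build_vocab(corpus, min_count=3)
first = replace_rare(corpus[0], vocab)
```

In this toy corpus only this, is, and <number> reach the frequency threshold, so every other word becomes <unk>.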
Experimental Design. We randomly divided the 73,203 pairs of original tweets and FC-tweets into training, validation, and test sets with a ratio of 80%/10%/10%, respectively. The validation set was used for hyperparameter tuning and early stopping. At test time, we used beam search to generate 15 responses per original tweet (beam size = 15) and report the average results. To select the best hyperparameters, we conducted a standard grid search over the hidden size and the output size. We set the word embedding size to 300 unless explicitly stated otherwise. The lengths of the original tweets and FC-tweets were set to their respective maximum values. The dropout rate was 0.2. We used the Adam optimizer with a fixed learning rate and batch size, and applied gradient clipping at 0.25 to avoid exploding gradients. The same settings were applied to all models for a fair comparison.
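The random 80%/10%/10% split can be sketched as follows. The seed, the use of integer arithmetic for the set sizes, and the function name `split_pairs` are assumptions made for illustration; the exact sizes of our sets may differ by a pair or two depending on rounding.

```python
import random

def split_pairs(pairs, seed=0):
    """Shuffle the pairs and split them 80%/10%/10%."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * 8 // 10
    n_valid = n // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test

# Integer ids stand in for the 73,203 (original tweet, FC-tweet) pairs.
train, valid, test = split_pairs(range(73203))
```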
A well known problem of the RNN-based decoder is that it tends to generate short responses. In our domain, examples of commonly generated responses were fake news url., you lie url., and wrong url. Because a very short response may be less interesting and has less power to be shared (as we learned in Section 4), we forced the beam search to generate responses with at least tokens. Since 92.4% of FC-tweets had tokens, and 60% FC-tweets had tokens, we chose . Moreover, as shown in Figure 4(b), FC-tweets with tokens usually had more shares than FC-tweets with tokens . In practice, fact-checkers can choose their preferred tokens of generated responses by varying .
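One simple way to enforce a minimum response length is to block the end-of-sentence token until enough tokens have been generated. The sketch below illustrates this idea with a toy next-token model that strongly prefers to stop early. It is an assumption about how such a constraint can be implemented, not a description of our exact decoder, and `beam_search`, `toy_model`, and the probabilities are hypothetical.

```python
import math

def beam_search(next_log_probs, beam_size, min_tokens, max_tokens, eos="</s>"):
    """next_log_probs(prefix) returns a dict mapping token -> log-probability."""
    beams = [(0.0, [])]  # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_tokens):
        candidates = []
        for score, prefix in beams:
            for token, logp in next_log_probs(prefix).items():
                if token == eos:
                    # Accept the end token only once the response is long enough.
                    if len(prefix) >= min_tokens:
                        finished.append((score + logp, prefix))
                else:
                    candidates.append((score + logp, prefix + [token]))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:beam_size]

def toy_model(prefix):
    # Hypothetical distribution that favors ending the response immediately.
    return {"</s>": math.log(0.6), "fake": math.log(0.3), "url": math.log(0.1)}

unconstrained = beam_search(toy_model, beam_size=2, min_tokens=0, max_tokens=6)
constrained = beam_search(toy_model, beam_size=2, min_tokens=3, max_tokens=6)
```

Without the constraint the top hypothesis is the empty response, whereas with a minimum of three tokens every returned hypothesis has at least three tokens.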
Evaluation Metrics. To measure the performance of our models and the baselines, we adopted several syntactic and semantic evaluation metrics used in prior work. In particular, we used word overlap-based metrics, namely BLEU scores (Papineni et al., 2002), ROUGE-L (Lin, 2004), and METEOR (Banerjee and Lavie, 2005). These metrics evaluate the amount of word overlap between a generated response and a ground-truth FC-tweet. A higher score indicates that the generated response is syntactically closer to the ground-truth FC-tweet, i.e., the two share a large number of overlapping words. Additionally, we used embedding metrics (i.e., Greedy Matching and Vector Extrema) (Liu et al., 2016). These metrics usually estimate sentence-level vectors by using some heuristic to combine the individual word vectors in a sentence. The sentence-level vectors of a generated response and the ground-truth FC-tweet are then compared by a measure such as cosine similarity. A higher value means the response and the FC-tweet are semantically more similar.
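For readers unfamiliar with the two embedding metrics, the sketch below follows their common definitions as summarized by Liu et al. (2016): Greedy Matching averages, in both directions, the best cosine match of each word, while Vector Extrema keeps the most extreme value in every embedding dimension before comparing sentences. The two-dimensional word vectors are made up for illustration, and scaling (e.g., reporting scores as percentages) is left out.

```python
import math

def cosine(u, v):
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def greedy_matching(ref_vecs, gen_vecs):
    """Average of the two directed greedy word-to-word matching scores."""
    def one_way(src, tgt):
        return sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)
    return 0.5 * (one_way(ref_vecs, gen_vecs) + one_way(gen_vecs, ref_vecs))

def extrema(vectors):
    """Per dimension, keep the value with the largest magnitude."""
    result = []
    for d in range(len(vectors[0])):
        column = [vec[d] for vec in vectors]
        hi, lo = max(column), min(column)
        result.append(hi if hi > abs(lo) else lo)
    return result

def vector_extrema(ref_vecs, gen_vecs):
    return cosine(extrema(ref_vecs), extrema(gen_vecs))

# Made-up word vectors for a ground-truth FC-tweet and a generated response.
emb = {"fake": [1.0, 0.0], "false": [0.9, 0.1], "url": [0.0, 1.0]}
reference = [emb["fake"], emb["url"]]
generated = [emb["false"], emb["url"]]
gm = greedy_matching(reference, generated)
ve = vector_extrema(reference, generated)
```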
6.3. RQ1 & RQ3: Quantitative Results based on Word Overlap-based Metrics
In this experiment, we quantitatively measure the performance of all models using BLEU, ROUGE-L, and METEOR. Table 2 shows the results on the test set. Firstly, our FCRG-DT and FCRG-BL performed comparably and outperformed the baselines, SeqAttB and HRED. In practice, FCRG-DT is preferable because it has fewer parameters than FCRG-BL. Overall, our models outperformed SeqAttB, perhaps because fusing the global scheme (i.e., the last hidden state of the encoder) with the output hidden state of every time-step in the encoder may be less effective than using only the latter to compute the context vector. The HRED model utilized only global context without using a context vector in generating responses, leading to suboptimal results compared with our models.
Under no constraints on the number of tokens in generated responses, our FCRG-DT achieved a 6.24% improvement over SeqAttB on BLEU-3 according to the one-sided Wilcoxon test. On BLEU-4, FCRG-DT improved over SeqAttB by 7.32% and over HRED by 7.76%. On ROUGE-L, FCRG-DT improved over SeqAttB and HRED by 3.32% and 4.31%, respectively. On METEOR, our FCRG-DT and FCRG-BL achieved performance comparable to the baselines.
Under the second setting in Table 2, which imposes a minimum number of tokens, our models achieved even better results. The improvements of FCRG-DT over SeqAttB were 7.05% on BLEU-3, 7.37% on BLEU-4, and 3.25% on ROUGE-L. In comparison with HRED, the improvements of FCRG-DT were 5.25% on BLEU-3, 5.64% on BLEU-4, and 2.97% on ROUGE-L. Again, FCRG-DT was comparable with SeqAttB and HRED on METEOR.
When responses were required to have at least 10 tokens, there was a decreasing trend across metrics, as shown in Table 2. This makes sense because generating a longer response that is similar to a ground-truth FC-tweet is a much harder problem. This is also why, in practice, the Android messaging service recommends very short replies (e.g., okay, yes, I am indeed) to reduce the risk of inaccurate suggestions. Despite the decreasing trend, our FCRG-DT and FCRG-BL improved over the baselines by a larger margin. In particular, on BLEU-3, FCRG-DT outperformed SeqAttB and HRED by 17.9% and 16.0%, respectively. On BLEU-4, the improvements of FCRG-DT over SeqAttB and HRED were 13.02% and 11.74%, respectively. We observed consistent improvements over the baselines on ROUGE-L and METEOR.
Overall, our models outperformed the baselines in terms of all of the word overlap-based metrics.
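The relative improvements quoted in this subsection are ratios of the scores in Table 2. As a worked example, the sketch below recomputes the gains of FCRG-DT over SeqAttB from the second block of the table. Because the table reports rounded scores, a recomputed percentage can occasionally differ from the reported one in the last digit. The helper name `relative_improvement` is ours.

```python
def relative_improvement(ours, baseline):
    """Relative gain of our score over the baseline score, in percent."""
    return 100.0 * (ours - baseline) / baseline

# Scores from the second block of Table 2.
fcrg_dt = {"BLEU-3": 4.373, "BLEU-4": 3.409, "ROUGE-L": 27.020}
seqattb = {"BLEU-3": 4.085, "BLEU-4": 3.175, "ROUGE-L": 26.169}

gains = {metric: round(relative_improvement(fcrg_dt[metric], seqattb[metric]), 2)
         for metric in fcrg_dt}
# gains == {"BLEU-3": 7.05, "BLEU-4": 7.37, "ROUGE-L": 3.25}
```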
|our FCRG-DT vs. SeqAttB||40%||28%||32%||0.725|
|our FCRG-DT vs. HRED||40%||36%||24%||0.592|
|Pairs of the original tweet (OT) and ground-truth FC-tweet||Generated responses of our FCRG-DT and two baselines|
6.4. RQ2 & RQ3: Quantitative Results based on Embedding Metrics
We adopted two embedding metrics to measure semantic similarity between generated responses and ground-truth FC-tweets (Liu et al., 2016). Again, we tested all models under the three settings shown in Table 2. Under no constraints, our models performed best in both embedding metrics: FCRG-DT ranked first in Greedy Matching and FCRG-BL ranked first in Vector Extrema. Specifically, in Greedy Matching, FCRG-DT outperformed SeqAttB by 3.98% and HRED by 6.00%. In Vector Extrema, FCRG-DT’s improvements over SeqAttB and HRED were 26.24% and 5.62%, respectively. Under the second setting, our FCRG-DT also outperformed the baselines in both Greedy Matching and Vector Extrema. Under the third setting, our models achieved better performance than the baselines in all the embedding metrics; in particular, FCRG-BL performed best and FCRG-DT was the runner-up. To sum up, FCRG-DT and FCRG-BL outperformed the baselines on embedding metrics.
6.5. RQ4: Qualitative Evaluation
Next, we conducted another experiment to compare our FCRG-DT with the baselines qualitatively. We chose FCRG-DT instead of FCRG-BL since it does not require any additional parameters and performed comparably to FCRG-BL. We also required all models to generate responses with at least 10 tokens, since lengthy responses are more interesting and informative, although they are harder to generate.
Human Evaluation. Similar to (Shang et al., 2015), we randomly selected 50 original tweets from the test set. For each original tweet, each of FCRG-DT, SeqAttB, and HRED generated 15 responses, and the response with the highest probability per model was selected. We chose a pairwise comparison instead of a listwise comparison to make it easy for human evaluators to decide which response is better. We therefore created 100 triplets (original tweet, response, response), where one response was generated by our FCRG-DT and the other by a baseline. We employed three crowd-evaluators to evaluate each triplet, and the name of the model behind each response was hidden from them. Given each triplet, the evaluators independently judged which response was better, and each judgment was recorded as one of the following outcomes: (i) win (the FCRG-DT response is better), (ii) loss (the baseline response is better), or (iii) tie (equally good or bad). Before labeling, they were trained with a few examples to comprehend the following criteria: (1) the response should fact-check information in the original tweet; (2) it should be human-readable and free of any fluency or grammatical errors; (3) the response may depend on a specific case or may be general, but should not contradict the first two criteria. Majority voting was employed to judge which response is better. If the annotators rated a triplet with three different answers, we viewed the triplet as a tie. Table 3 shows the human evaluation results. The Kappa values show moderate agreement among the evaluators. We conclude that FCRG-DT outperforms SeqAttB and HRED qualitatively.
Case Studies. Table 4 presents examples of original tweets, ground-truth FC-tweets, and generated responses of the three models. Our FCRG-DT generated more relevant responses with clear fact-checking intention. For example, in the first example, FCRG-DT captured the word uranium in the original tweet and generated a relevant response. We observed that SeqAttB usually generated irrelevant content. Responses generated by FCRG-DT were more formal than those generated by the baselines.
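The aggregation rule described above (majority voting, with three different answers counted as a tie) can be written in a few lines. The function name `majority_label` and the example votes are hypothetical.

```python
from collections import Counter

def majority_label(votes):
    """Aggregate three evaluator votes from {"win", "loss", "tie"}.

    A label chosen by at least two evaluators wins; if all three
    evaluators disagree, the triplet is treated as a tie.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else "tie"

examples = [
    ["win", "win", "loss"],   # two evaluators prefer our response
    ["win", "loss", "tie"],   # three different answers
    ["loss", "tie", "loss"],  # two evaluators prefer the baseline
]
labels = [majority_label(v) for v in examples]
```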
6.6. RQ5: Similar Words in the Semantic Space
As our word embeddings were trained from scratch, we investigate whether our model can identify words that are semantically close to each other. This analysis helps us gain more insights about our dataset. After training our FCRG-DT, we extracted the embedding vectors from the shared embedding layer shown in Figure 5. We selected three keywords: obamacare, politifact, and anti-trump. For each keyword, we found the top 10 most similar words based on cosine similarity between the extracted embedding vectors and used t-SNE to project these vectors into 2D space, as shown in Figure 6. Firstly, the keyword obamacare, associated with the health care policies of the Obama administration, was similar to elder, AMT (Alternative Minimum Tax), immunity, and checking (Figure 6(a)). Next, politifact was close to debunks, there, anecdote, DNC (Democratic National Committee), and answered (Figure 6(b)). Finally, for anti-trump, our model identified obama-clinton, dictator, clickbait, and inability (Figure 6(c)). Based on this analysis, we conclude that the embedding vectors learned by our model can capture similar words in the semantic space.
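The nearest-neighbor lookup used in this analysis can be sketched as below. The two-dimensional vectors are made up for illustration (our learned embeddings have 300 dimensions), and `most_similar` is a hypothetical helper name. The t-SNE projection step is not shown.

```python
import math

def cosine(u, v):
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def most_similar(keyword, embeddings, k=10):
    """Return the k words whose vectors are closest to the keyword's vector."""
    query = embeddings[keyword]
    scored = [(word, cosine(query, vec))
              for word, vec in embeddings.items() if word != keyword]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy embedding table standing in for the trained embedding layer.
toy_embeddings = {
    "politifact": [1.0, 0.0],
    "debunks": [0.9, 0.1],
    "answered": [0.7, 0.3],
    "weather": [0.0, 1.0],
}
neighbors = [word for word, _ in most_similar("politifact", toy_embeddings, k=2)]
```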
Although our proposed models successfully generated responses with fact-checking intention and performed better than the baselines, there are a few limitations in our work. Firstly, we assumed that fact-checkers freely choose the articles they prefer and then insert the corresponding fact-checking URLs into our generated responses. This means we achieved only partial automation of the whole fact-checking process. In future work, we are interested in automating the selection of a fact-checking article based on the content of original tweets in order to fully support fact-checkers and automate the whole process. Secondly, our framework is based on word-based RNNs, leading to a common issue: rare words are less likely to be generated. A feasible solution is to use character-level RNNs (Kim et al., 2016) so that we do not need to replace rare words with the <unk> token. In future work, we will investigate whether character-based RNN models work well on our dataset. Thirdly, we only used pairs of an original tweet and an FC-tweet without utilizing other data sources such as previous messages in online dialogues. As we showed in Figure 4(a), FC-tweets often did not contain content of fact-checking articles, leading to difficulties in using this data source. We tried to use the content of fact-checking articles, but it did not improve the performance of our models. We plan to explore other ways to utilize these data sources in the future. Finally, there are many original tweets containing URLs pointing to fake news sources (e.g. breitbart.com), but we did not consider them when generating responses. We leave this for future exploration.
In this paper, we introduced a novel application of text generation in combating fake news. We found that FC-tweets had distinguishing linguistic features compared with Normal and Random Replies. Notably, fact-checkers tended to refute information in original tweets and referred to evidence to support their factual corrections. Their FC-tweets were usually more formal and contained fewer swear words and less Internet slang. These findings indicate that fact-checkers sought to persuade original posters to stop spreading fake news by using persuasive and formal language. In addition, we observed that when FC-tweets were posted, 7% of original tweets were removed, deleted, or hidden from the public. After analyzing FC-tweets, we built a deep learning framework to generate responses with fact-checking intention. Our model FCRG-DT was able to generate such responses and outperformed the baselines quantitatively and qualitatively. Our work has opened a new research direction to combat fake news by supporting fact-checkers in online social systems.
This work was supported in part by NSF grant CNS-1755536, AWS Cloud Credits for Research, and Google Cloud. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsors.
- Neural machine translation by jointly learning to align and translate. ICLR.
- METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In ACL.
- LearningQ: a large-scale dataset for educational question generation. In ICWSM, Université de Fribourg.
- Anyone can become a troll: causes of trolling behavior in online discussions. In CSCW.
- Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Probabilistic frame-semantic parsing. In NAACL, pp. 948–956.
- Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009.
- Explicit warnings reduce but do not eliminate the continued influence of misinformation. Memory & cognition 38 (8), pp. 1087–1100.
- Rumor cascades. In ICWSM.
- Faking sandy: characterizing and identifying fake images on twitter during hurricane sandy. In Proceedings of the 22nd international conference on World Wide Web, pp. 729–736.
- Get back! you don’t know me like that: the social mediation of fact checking interventions in twitter conversations. In ICWSM.
- Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1693–1701.
- This just in: fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. NECO Workshop.
- Microsoft is deleting its ai chatbot’s incredibly racist tweets. Note: https://read.bi/2DgeRkN
- Linguistic signals under misinformation and fact-checking: evidence from user comments on social media. HCI.
- Homogeneity-based transmissive process to model true and false news in social networks. In WSDM.
- Leveraging the crowd to detect and reduce the spread of fake news and misinformation. In WSDM.
- Character-aware neural language models. In 30th AAAI Conference on Artificial Intelligence, AAAI 2016.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Aspects of rumor spreading on a microblog network. In International Conference on Social Informatics, pp. 299–308.
- Fact-checking triples over four years. Note: https://reporterslab.org/fact-checking-triples-over-four-years/
- Variational memory encoder-decoder. In NIPS.
- Distributed representations of sentences and documents. In ICML.
- Uncovering social spammers: social honeypots+ machine learning. In SIGIR.
- Adversarial learning for neural dialogue generation. In EMNLP.
- Rouge: a package for automatic evaluation of summaries. Text Summarization Branches Out.
- How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2122–2132.
- Effective approaches to attention-based neural machine translation. In EMNLP.
- Detecting rumors from microblogs with recurrent neural networks. In IJCAI, pp. 3818–3824.
- Detect rumors using time series of social context information on microblogging websites. In CIKM.
- Characterizing online rumoring behavior using multi-dimensional signatures. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 228–241.
- Fact checking in community forums. In AAAI.
- Recurrent neural network based language model. In ISCA.
- An interpretable joint graphical model for fact-checking from crowds. In AAAI.
- When corrections fail: the persistence of political misperceptions. Political Behavior 32 (2), pp. 303–330.
- BLEU: a method for automatic evaluation of machine translation. In ACL.
- The development and psychometric properties of liwc2015. Technical report.
- Credibility assessment of textual claims on the web. In CIKM.
- DeClarE: debunking fake news and false claims using evidence-aware deep learning. In EMNLP.
- Rumor has it: identifying misinformation in microblogs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1589–1599.
- Neural user response generator: fake news detection with collective user intelligence. In IJCAI, pp. 3834–3840.
- Truth of varying shades: analyzing language in fake news and political fact-checking. In EMNLP.
- Building end-to-end dialogue systems using generative hierarchical neural network models. arXiv preprint arXiv:1507.04808.
- A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI.
- Neural responding machine for short-text conversation. In ACL.
- Hoaxy: a platform for tracking online misinformation. In Proceedings of the 25th International Conference Companion on World Wide Web, pp. 745–750.
- Finding streams in knowledge graphs to support fact checking. In ICDM.
- Deep headline generation for clickbait detection. In ICDM.
- Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112.
- FEVER: a large-scale dataset for fact extraction and verification. In EMNLP.
- A neural conversational model. arXiv preprint arXiv:1506.05869.
- The rise of guardians: fact-checking url recommendation to combat fake news. In SIGIR.
- Separating facts from fiction: linguistic models to classify suspicious and trusted news posts on twitter. In ACL.
- The spread of true and false news online. Science 359 (6380), pp. 1146–1151.
- Chat more: deepening and widening the chatting topic via a deep model. In SIGIR.
- “Liar, liar pants on fire”: a new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.
- EANN: event adversarial neural networks for multi-modal fake news detection. In KDD.
- Automated crowdturfing attacks and defenses in online review systems. In SIGSAC.
- Enquiring minds: early detection of rumors in social media from enquiry posts. In WWW.