“Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection
Automatic fake news detection is a challenging problem in deception detection, and it has tremendous real-world political and social impacts. However, statistical approaches to combating fake news have been dramatically limited by the lack of labeled benchmark datasets. In this paper, we present liar: a new, publicly available dataset for fake news detection. We collected 12.8K manually labeled short statements, spanning a decade and a variety of contexts, from PolitiFact.com, which provides a detailed analysis report and links to source documents for each case. This dataset can be used for fact-checking research as well. Notably, this new dataset is an order of magnitude larger than the previously largest public fake news datasets of a similar type. Empirically, we investigate automatic fake news detection based on surface-level linguistic patterns. We have designed a novel, hybrid convolutional neural network to integrate meta-data with text. We show that this hybrid approach can improve a text-only deep learning model.
In this past election cycle for the 45th President of the United States, the world has witnessed a growing epidemic of fake news. The plague of fake news not only poses serious threats to the integrity of journalism, but has also created turmoil in the political world. The worst real-world impact is that fake news seems to create real-life fears: last year, a man carrying an AR-15 rifle walked into a Washington DC pizzeria because he had recently read online that “this pizzeria was harboring young children as sex slaves as part of a child-abuse ring led by Hillary Clinton”.
The broadly related problem of deception detection [?] is not new to the natural language processing community. A relatively early study by Ott et al. focuses on detecting deceptive review opinions in sentiment analysis, using a crowdsourcing approach to create training data for the positive class and then combining it with truthful opinions from TripAdvisor. Recent studies have also proposed stylometric [?], semi-supervised learning [?], and linguistic approaches [?] to detect deceptive text on crowdsourced datasets. Even though crowdsourcing is an important approach to creating labeled training data, there is a mismatch between training and testing. When testing on real-world review datasets, the results could be suboptimal, since the positive training data was created on a completely different, simulated platform.
The problem of fake news detection is more challenging than detecting deceptive reviews, since the political language in TV interviews and in posts on Facebook and Twitter consists mostly of short statements. However, the lack of manually labeled fake news datasets is still a bottleneck for advancing computationally intensive, broad-coverage models in this direction. Vlachos and Riedel are the first to release a public fake news detection and fact-checking dataset, but it only includes 221 statements, which does not permit machine learning based assessments.
To address these issues, we introduce the liar dataset, which includes 12,836 short statements labeled for truthfulness, subject, context/venue, speaker, state, party, and prior history. With such volume and a time span of a decade, liar is an order of magnitude larger than the currently available resources [?] of a similar type. Additionally, in contrast to crowdsourced datasets, the instances in liar are collected in a grounded, more natural context, such as political debates, TV ads, Facebook posts, tweets, interviews, news releases, etc. In each case, the labeler provides a lengthy analysis report to ground each judgment, and the links to all supporting documents are also provided.
Empirically, we have evaluated several popular learning based methods on this dataset. The baselines include logistic regression, support vector machines, long short-term memory networks [?], and a convolutional neural network model [?]. We further introduce a neural network architecture to integrate text and meta-data. Our experiments suggest that this approach improves the performance of a strong text-only convolutional neural network baseline.
2 liar: a New Benchmark Dataset
The major resources for deception detection in reviews are crowdsourced datasets [?]. They are very useful for studying deception detection, but the positive training data are collected from a simulated environment. More importantly, these datasets are not suitable for fake statement detection, since the fake news on TV and social media is much shorter than customer reviews.
Vlachos and Riedel are the first to construct fake news and fact-checking datasets. They obtained 221 statements from Channel 4
|Dataset statistic|Value|
|---|---|
|Training set size|10,269|
|Validation set size|1,284|
|Testing set size|1,283|
|Avg. statement length (tokens)|17.9|
|Top-3 speaker affiliations| |
|None (e.g., FB posts)|2,185|
We show some random snippets from our dataset in Figure . The liar dataset
The speakers in the liar dataset include a mix of Democrats and Republicans, as well as a significant number of posts from online social media. We include a rich set of meta-data for each speaker: in addition to party affiliation, the current job, home state, and credit history are also provided. In particular, the credit history includes the historical counts of inaccurate statements for each speaker. For example, Mitt Romney's credit history vector records his counts of “pants on fire”, “false”, “barely true”, “half true”, and “mostly true” for historical statements. Since this vector also includes the count for the current statement, it is important to subtract the current label from the credit history when using this meta-data vector in prediction experiments.
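The correction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the label strings, the `CREDIT_LABELS` ordering, and the function name `history_without_current` are assumptions, and the counts in the usage line are invented.

```python
# Order of the five credit-history counts, following the text above.
# The exact label strings are illustrative assumptions.
CREDIT_LABELS = ["pants-fire", "false", "barely-true", "half-true", "mostly-true"]


def history_without_current(history, current_label):
    """Return a copy of the credit history with the current statement removed.

    history: list of five non-negative counts, ordered as CREDIT_LABELS.
    current_label: label of the statement being predicted. A label with no
    slot in the history (assumed here to be "true") leaves the vector as is.
    """
    if len(history) != len(CREDIT_LABELS):
        raise ValueError("expected one count per credit-history label")
    adjusted = list(history)  # copy so the caller's list is not mutated
    if current_label in CREDIT_LABELS:
        idx = CREDIT_LABELS.index(current_label)
        # Clamp at zero in case the stored counts are inconsistent.
        adjusted[idx] = max(adjusted[idx] - 1, 0)
    return adjusted


# Hypothetical counts, not taken from the dataset.
print(history_without_current([3, 10, 7, 12, 9], "false"))
```

Without this step, the feature vector would carry information about the very label the model is asked to predict.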
These statements are sampled from a variety of contexts/venues, and the top categories include news releases, TV/radio interviews, campaign speeches, TV ads, tweets, debates, Facebook posts, etc. To ensure broad coverage of topics, there is also a diverse set of subjects discussed by the speakers. The top-10 most discussed subjects in the dataset are economy, health-care, taxes, federal-budget, education, jobs, state-budget, candidates-biography, elections, and immigration.
3 Automatic Fake News Detection
One of the most obvious applications of our dataset is to facilitate the development of machine learning models for automatic fake news detection. We frame this task as a 6-way multiclass text classification problem. The research questions are:
Based on surface-level linguistic realizations only, how well can machine learning algorithms classify a short statement into a fine-grained category of fakeness?
Can we design a deep neural network architecture to integrate speaker related meta-data with text to enhance the performance of fake news detection?
Since convolutional neural network architectures (CNNs) [?] have obtained state-of-the-art results on many text classification datasets, we build our neural network model based on a recently proposed CNN model [?]. Figure 1 shows the overview of our hybrid convolutional neural network for integrating text and meta-data.
We randomly initialize a matrix of embedding vectors to encode the meta-data. We use a convolutional layer to capture the dependency between the meta-data vector(s). Then, a standard max-pooling operation is performed on the latent space, followed by a bi-directional LSTM layer. We then concatenate the max-pooled text representations with the meta-data representation from the bi-directional LSTM, and feed them to a fully connected layer with a softmax activation function to generate the final prediction.
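The final fusion step described above (max-pooling, concatenation, and a fully connected softmax layer) can be sketched in plain Python. This is only an illustration under stated assumptions: the convolution and the bi-directional LSTM are replaced by random stand-in values, all vector sizes and weights are hypothetical, and the helper names are not from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_over_time(feature_maps):
    """Max-pool each convolutional feature map down to a single value."""
    return [max(fmap) for fmap in feature_maps]


def fuse_and_classify(text_vec, meta_vec, weights, biases):
    """Concatenate both representations and apply a dense softmax layer."""
    joint = text_vec + meta_vec  # list concatenation
    scores = [
        sum(w * x for w, x in zip(row, joint)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


NUM_CLASSES = 6  # the 6-way task described in the text

rng = random.Random(0)
# Stand-ins for learned representations (hypothetical sizes and values).
text_feature_maps = [[rng.uniform(-1, 1) for _ in range(10)] for _ in range(4)]
text_vec = max_over_time(text_feature_maps)  # one value per feature map
meta_vec = [rng.uniform(-1, 1) for _ in range(3)]  # stands in for Bi-LSTM output
weights = [
    [rng.uniform(-0.5, 0.5) for _ in range(len(text_vec) + len(meta_vec))]
    for _ in range(NUM_CLASSES)
]
biases = [0.0] * NUM_CLASSES

probs = fuse_and_classify(text_vec, meta_vec, weights, biases)
print(probs)  # six class probabilities that sum to 1
```

In a trained model the weights would be learned jointly with the text and meta-data encoders; here they are random, so the output distribution carries no meaning.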
4 liar: Benchmark Evaluation
In this section, we first describe the experimental setup and the baselines. Then, we present the empirical results and compare the various models.
We used five baselines: a majority baseline, a regularized logistic regression classifier (LR), a support vector machine classifier (SVM) [?], a bi-directional long short-term memory network model (Bi-LSTMs) [?], and a convolutional neural network model (CNNs) [?]. For LR and SVM, we used the LibShortText toolkit
We used grid search to tune the hyperparameters for the LR and SVM models. We chose accuracy as the evaluation metric, since we found that the accuracy results from various models were equivalent to F-measures on this balanced dataset.
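To make the metric and the simplest baseline concrete, here is a small sketch of accuracy and a majority-class predictor. The function names and the toy labels are invented for illustration and do not come from the dataset or the authors' code.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda statements: [most_common_label] * len(statements)


# Toy labels, invented for illustration only.
train = ["half-true", "false", "half-true", "mostly-true", "half-true"]
test_gold = ["false", "half-true", "true", "half-true"]

predict = majority_baseline(train)
print(accuracy(test_gold, predict(test_gold)))
```

On a six-class dataset with roughly balanced labels, such a predictor scores only slightly above one in six, which gives a floor against which the learned models can be compared.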
|Model input|Validation|Test|
|---|---|---|
|Text + Subject|0.263|0.235|
|Text + Speaker|0.277|0.248|
|Text + Job|0.270|0.258|
|Text + State|0.246|0.256|
|Text + Party|0.259|0.248|
|Text + Context|0.251|0.243|
|Text + History|0.246|0.241|
|Text + All|0.247|0.274|
We outline our empirical results in Table 2. First, we compare various models using text features only. We see that the majority baseline on this dataset gives about 0.204 and 0.208 accuracy on the validation and test sets respectively. Standard text classifiers such as SVMs and LR models obtained significant improvements. Due to overfitting, the Bi-LSTMs did not perform well. The CNNs outperformed all other text-only models, resulting in an accuracy of 0.270 on the held-out test set. We compared the predictions from the CNN model with those from the SVMs via a two-tailed paired t-test, and the CNN was significantly better. When considering all meta-data and text, the model achieved the best result on the test data.
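For readers unfamiliar with the paired t-test mentioned above, the statistic can be computed as in the sketch below. The text does not say which paired quantity was used, so per-example correctness (1 for correct, 0 for incorrect) is an assumption here, and the sample values are invented. The two-tailed p-value would then be read from a Student t distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Invented per-example correctness (1 = correct) for two models.
cnn_correct = [1, 0, 1, 1, 0, 1, 1, 0]
svm_correct = [0, 0, 1, 0, 0, 1, 0, 0]
t, df = paired_t_statistic(cnn_correct, svm_correct)
print(round(t, 3), df)
```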
We introduced liar, a new dataset for automatic fake news detection. Compared to prior datasets, liar is an order of magnitude larger, which enables the development of statistical and computational approaches to fake news detection. liar’s authentic, real-world short statements from various contexts with diverse speakers also make research on developing broad-coverage fake news detectors possible. We show that when combining meta-data with text, significant improvements can be achieved for fine-grained fake news detection. Given the detailed analysis reports and links to source documents in this dataset, it is also possible to explore the task of automatic fact-checking over knowledge bases in the future. Our corpus can also be used for stance classification, argument mining, topic modeling, rumor detection, and political NLP research.