Transformers Are Better Than Humans at Identifying Generated Text
Fake information spread via the internet and social media influences public opinion and user activity. Generative models enable fake content to be produced faster and more cheaply than was previously possible. This paper examines the problem of identifying fake content generated by lightweight deep learning models. A dataset containing human- and machine-generated headlines was created, and a user study indicated that humans were able to identify the fake headlines in only 45.3% of cases. However, the most accurate automatic approach, transformers, achieved a far higher accuracy, indicating that content generated from language models can be filtered out accurately.
In recent years, fake content has been spreading across the internet and social media with great speed, misinforming users and shaping their opinions Kumar and Shah (2018). While much of this fake content is created by paid writers Luca and Zervas (2013), the amount generated by automated systems is rising: for example, malicious users can employ generative models to mass-produce positive reviews for their products on shopping platforms. There is therefore a need for models that can distinguish between human- and computer-generated text, so that deceptive content can be filtered out before it reaches a wider audience.
While a lot of recent work has showcased the strengths of text generation models Dathathri et al. (2019); Subramanian et al. (2018), little research has been conducted on methods for detecting automatically generated text. Fortunately, generative models have several shortcomings, and their output has characteristics that set it apart from human-written text, such as lower variance and a smaller vocabulary (Holtzman et al. (2019); Gehrmann et al. (2019)). Pattern recognition models can exploit these differences to tell real and generated text apart. In this paper we test this hypothesis by training classifiers to detect headlines generated by RNN-based models.
The work described in this paper is split into two parts: the creation of a dataset containing headlines written by both humans and machines (Section 3), and the training of classifiers to distinguish between them (Section 4). The dataset is created using real headlines from the Reuters Corpus Kulkarni (2018) and headlines generated by neural language models. The training and development sets consist of headlines from 2015, while the testing set consists of 2016 and 2017 headlines. For the classifiers, a series of baselines and deep learning models were tested, including transfer learning and transformer architectures. Neural methods were found to greatly outperform humans, with transformers being substantially more accurate.
This work highlights how difficult it is for humans to identify fake content even when it is generated by simpler and faster models, but also shows that the problem can ultimately be tackled using automated approaches. This suggests that automatic methods for content analysis could play an important role in helping readers assess the veracity of content. The main contributions of this work are the development of a novel fake content identification task based on news headlines, and an evaluation of both human and automatic detection performance on that task.
2 Related Work
Kumar and Shah (2018) compiled a survey on fake content on the internet, which serves as an overview of how false information targets users and how automatic detection models operate. The sharing of false information is boosted by humans' natural susceptibility to believe it. Pérez-Rosas et al. (2018); Ott et al. (2011) showed that humans are able to identify fake content with an accuracy of 50-75%. Information that is well presented, using long text with limited errors, was shown to deceive the majority of readers. Yao et al. (2017) examined the generation of fake reviews for online shopping platforms, building an RNN-based model trained on a dataset of Yelp reviews.
In Zellers et al. (2019), neural fake news detection and generation are jointly examined in an adversarial setting. Their Grover model achieves an accuracy of 92% when distinguishing real from generated news articles, which is approximately the accuracy of the models examined in this work. However, human evaluation is lacking, and Grover cannot realistically be used by malicious users for the mass production of fake content, due to the computational resources it requires.
Holtzman et al. (2019) investigated the pitfalls of text generation, showing that sampling methods such as beam search can produce low-quality and repetitive text. Gehrmann et al. (2019) showed that models generate text from a more limited vocabulary than humans, who choose low-probability words more often than computers do. This means that text written by humans is more varied than that written by models. Lavoie and Krishnamoorthy (2010) employed a feature-based classification system to detect fake scientific papers produced by SCIgen.
3.1 Dataset Development
The dataset was created using Reuters headlines from 2015, 2016 and 2017 Kulkarni (2018), training models on each individual year to generate new headlines. The approach to headline generation was based on the method described by Graves (2013). Multiple RNNs (GRUs Cho et al. (2014) and LSTMs Hochreiter and Schmidhuber (1997)) were trained to predict the next word given some context. We generated text by random sampling with temperature, continuously feeding the sampled words back into the model. The output headlines were then filtered on their perplexity score.
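The sampling loop can be sketched as follows. This is a minimal illustration, not the paper's exact implementation; `model_probs` is a hypothetical stand-in for the trained RNN's next-word distribution, and the special `<s>`/`</s>` tokens and default temperature are assumptions.

```python
import random

def sample_with_temperature(probs, temperature=0.8, rng=random):
    """Sample an index from a probability distribution after rescaling
    with a softmax temperature: < 1 sharpens, > 1 flattens.

    probs: list of probabilities over the vocabulary (sums to 1).
    """
    # Re-weight each probability by 1/temperature, then renormalise.
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Draw one index according to the re-weighted distribution.
    r = rng.random()
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r < cumulative:
            return i
    return len(weights) - 1

def generate_headline(model_probs, vocab, max_len=12, temperature=0.8):
    """Extend a headline by repeatedly sampling the next word and
    re-feeding the growing context into the language model.

    model_probs(context) -> probability list over vocab (assumed API).
    """
    context = ["<s>"]
    while len(context) < max_len:
        probs = model_probs(context)
        word = vocab[sample_with_temperature(probs, temperature)]
        if word == "</s>":  # end-of-headline token
            break
        context.append(word)
    return " ".join(context[1:])
```

The re-feeding step is what lets a word-level model produce an arbitrarily long headline from a single start token.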
The real and generated headlines for each year were then merged. Real headlines for each set were chosen randomly, with duplicates removed. These three sets (one for each year) were used to train and evaluate the generated text classifier.
The 2015 set contains generated headlines and real ones; the 2016 set generated and real; in the 2017 set there are generated and real. In total, there are generated headlines and real.
3.2 Dataset Analysis
The generated headlines show significant similarity to the real headlines, as shown below. This indicates that the language models are indeed able to capture patterns in the original data. Even though the vocabulary of the generated headlines is bounded by the vocabulary learned by the corresponding language model, the distribution of words is similar across real and generated headlines. As an illustration, Figures 1 and 2 show the 15 most frequent words in the real and generated headlines respectively.
On average, the real headlines are slightly longer than the generated ones, with and words respectively.
Lastly, POS tag frequencies are shown in Table 1 for the top tags in each set. In real headlines, nouns and adjectives are used more often, whereas in generated headlines the distribution is smoother, consistent with the findings in Gehrmann et al. (2019).
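The word-frequency side of this analysis amounts to a simple token count over each set. A minimal sketch follows; the two headline lists are toy stand-ins for the real and generated sets, not data from the paper.

```python
from collections import Counter

def top_words(headlines, n=15):
    """Return the n most frequent (lower-cased) tokens across headlines."""
    counts = Counter()
    for headline in headlines:
        counts.update(headline.lower().split())
    return counts.most_common(n)

# Toy stand-ins for the real and generated headline sets.
real = ["Oil prices rise on supply fears", "Oil exports fall in March"]
generated = ["Oil prices fall after data data", "Dollar rises on oil outlook"]

print(top_words(real, 3))
print(top_words(generated, 3))
```

Running the same count over both sets and plotting the top 15 tokens of each yields comparisons like those in Figures 1 and 2.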
A crowd-sourced survey was conducted to measure how well humans can distinguish real from generated headlines.
In total, there were 4174 answers to the ‘real or generated’ questions, of which 2244 (53.8%) were correct. When presented with a computer-generated headline, participants answered correctly in 45.3% of 2702 responses. In total, 57 generated headlines were shown and, based on the average response, 25 of them were identified as computer-generated. This is an indication that our models can indeed generate realistic-looking headlines. When presented with real headlines, participants answered correctly in 66.7% of 1338 responses. In total, 37 real headlines were shown and, based on the average response, 30 were correctly identified as real.
Of the 57 generated headlines, 6 were marked as real by over 90% of the participants, while 3 out of 31 real headlines reached that threshold (although several fell just short). Some of these headlines, both real and generated, follow in no particular order.
Inside the Great Hall - China’s Party Congress
Defense chief to continue to support Trump policy
Africa’s central bank keeps rate unchanged at 8.25 pct
Mallinckrodt to pay $100 million to settle U.S. probe on drug pricing
Britain’s FTSE steadies after Q1 sales surge
At the other end of the spectrum, there were five generated headlines that over 80% of the participants correctly identified as computer-generated:
Copper raises 2017 outlook by China; Vietnam drilling weighs
Tax Keep CEO promises to 28 pct last year board
FOREX-Dollar rebounds on Dollar after US jobs data data
Lira lower bond index seen nearly high after year US jobs
Names CEO of options for smaller shift
All of these examples contain grammatical errors, particularly incorrect use of prepositions. The third headline also exhibits repetition (“Dollar … Dollar”, “data data”). It is worth noting that participants appeared more likely to identify headlines containing grammatical errors as fake than ones exhibiting semantic inconsistency.
For our classifier experiments, we used the three sets of data (2015, 2016 and 2017) we had previously compiled. Specifically, we used only the 2015 set for training, while the 2016 and 2017 sets were used for testing. Splitting the training and testing data by year of publication ensures that there is no overlap between the sets and that there is some variability in the content of the headlines (for example, different topics and authors). We can therefore be confident that the classifiers generalize to unseen examples.
Furthermore, for hyperparameter tuning, the 2015 data was randomly split into training and development sets on a ratio. In total, for training there are headlines, for evaluation there are and for testing there are .
Four types of classifiers were explored: baselines (Logistic Regression, Elastic Net), deep learning (CNN, Bi-LSTM, Bi-LSTM with Attention), transfer learning via ULMFit Howard and Ruder (2018) and Transformers (BERT Devlin et al. (2019), DistilBERT Sanh et al. (2019)).
All classifiers are trained on a dataset containing real headlines from Reuters 2015 and headlines generated by an LSTM model trained on Reuters 2015. The architecture and training details can be found in Appendix B.
Each model was run three times and the results were averaged. Results are shown in Table 2. Overall accuracy is the percentage of all headlines (real and generated) classified correctly, while precision and recall are calculated over the generated headlines: precision is the percentage of correct classifications among all headlines classified as generated, while recall is the percentage of actual generated headlines that the model classified correctly. A high recall score indicates that a model reliably identifies generated headlines, while a low precision score shows that a model labels most headlines as generated.
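The metrics above, with "generated" as the positive class, can be computed directly. A minimal sketch:

```python
def detection_metrics(y_true, y_pred):
    """Accuracy, precision and recall with 'generated' as the positive class.

    y_true / y_pred: sequences of labels, each 'real' or 'generated'.
    """
    pairs = list(zip(y_true, y_pred))
    # True positives: generated headlines correctly flagged as generated.
    tp = sum(1 for t, p in pairs if t == p == "generated")
    # False positives: real headlines wrongly flagged as generated.
    fp = sum(1 for t, p in pairs if t == "real" and p == "generated")
    # False negatives: generated headlines missed by the model.
    fn = sum(1 for t, p in pairs if t == "generated" and p == "real")
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

For example, a model that flags every headline as generated would score perfect recall but only chance-level precision, which is why both metrics are reported.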
We can observe from the results table that humans are less effective overall than all models, including the baselines, scoring the lowest accuracy. They are also the least accurate on generated headlines, achieving the lowest recall. In general, human predictions are little better than random.
Deep learning models scored consistently higher than the baselines, especially on precision, while transfer learning outperformed all previous models in overall accuracy. Transformer architectures, however, performed best overall, with high recall and very high precision, resulting in strong accuracy across the board. BERT, the highest-scoring model, scored higher than humans on all metrics.
Since the training and testing data are separate, this indicates that there are traits in generated text that are absent from human text, and that transformers are able to pick up on these traits to make highly accurate classifications. For example, generated text shows lower variance than human text Gehrmann et al. (2019), so text containing few rare words is more likely to have been generated than to have been written by a human.
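As a toy illustration of how the lower-variance trait could be exploited, one could score a headline by the fraction of its tokens that fall outside a set of high-frequency words from a reference corpus. This heuristic, the word set, and the threshold are illustrative assumptions, not the classifiers used in this paper.

```python
def rare_word_fraction(text, common_words):
    """Fraction of tokens that are not among the corpus's common words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in common_words) / len(tokens)

def looks_generated(text, common_words, threshold=0.2):
    """Flag text as machine-generated when it uses too few rare words.

    Generated text tends to stay within a high-frequency vocabulary
    (Gehrmann et al., 2019), so a low rare-word fraction is suspicious.
    The 0.2 threshold is an illustrative assumption.
    """
    return rare_word_fraction(text, common_words) < threshold
```

A trained transformer learns far richer cues than this single statistic, but the sketch shows why vocabulary usage alone already carries signal.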
4.2 Error Analysis
The following two headlines are indicative examples of those misclassified by BERT:
fiat chrysler not to stop self-driving cars -
justice dept finds indian in case linked to zika threat
The first headline is not only grammatically awkward, but also ends in a dash, an obvious indicator that the headline is fake. It is likely that the model put more weight on the plausible connection between “fiat chrysler” and “self-driving cars”, leading to a ‘real’ classification.
In the second headline the phrases “justice dept”, “in case” and “threat” are connected by their appearance in similar contexts. BERT appears to have put emphasis on these connections, but didn’t pick up that Zika (a virus) is unlikely to be connected to an Indian under investigation.
In both cases, it is obvious that the headlines are machine-generated, either through a grammatical or a semantic error. Despite that, BERT classified them as real, quite possibly because there are strong connections between some tokens in the samples, even though overall the headlines are not coherent.
This paper examined methods to detect headlines generated by lightweight models. A dataset was created using headlines from Reuters, and a survey was conducted asking participants to distinguish between real and generated headlines. Real headlines were identified as real in 66.7% of responses, while generated ones were identified at a rate of only 45.3%. The dataset was used to train a range of models, all of which were better at identifying fake headlines than humans, with BERT scoring highest, a large improvement over human accuracy.
For future work it would be interesting to explore how these methods generalise to different text types, such as reviews or tweets.
Appendix A Language Model Details
For the generation of headlines, two RNN types were used: GRUs and LSTMs.
Language model vocabulary sizes ranged from 1000 to 3500, RNN units from 50 to 150, and embedding sizes from 150 to 300; models were trained for between 25 and 40 epochs. We experimented with stacking two LSTM layers, but the results were not satisfactory. Among the remaining models, generated headline quality seemed indistinguishable, although no systematic inter-model comparison was conducted.
All the models were trained on www.kaggle.com, using the default GPU. Running time was 8 hours.
Appendix B Classifier Details
ULMFit and the Transformers require their own special tokenizers, but the rest of the models use the same method, a simple indexing over the most frequent tokens. No pretrained word vectors (for example, GloVe) were used for the Deep Learning models.
ULMFit uses pre-trained weights from the AWD-LSTM model Merity et al. (2018). For fine-tuning, we first updated the LSTM weights with a learning rate of for a single epoch. Then, we unfroze all the layers and trained the model with a learning rate of - for an additional epoch. Finally, we trained the classifier head on its own for one more epoch with a learning rate of .
For the Transformers, we loaded pre-trained weights which we fine-tuned for a single epoch with a learning rate of -. Specifically, the models we used were base-BERT (12 layers, 110m parameters) and DistilBERT (6 layers, 66m parameters).
The CNN has two convolutional layers on top of each other with filter sizes 8 and 4 respectively, and kernel size of 3 for both. Embeddings have 75 dimensions and the model is trained for epochs.
The LSTM-based models have one recurrent layer with 35 units, while the embeddings have 100 dimensions. Bidirectionality is used alongside a spatial dropout of 0.33. After the recurrent layer, we concatenate average pooling and max pooling layers. We also experiment with a Bi-LSTM with self-attention Vaswani et al. (2017). These models are trained for epochs.
All the models were trained on www.kaggle.com on the default GPU. Running times for the deep learning models and Transformers were around 8 hours.
- Data and code are available here.
- Calculated using a Part-of-Speech statistical model as extracted from the trigram POS tag probabilities of the training dataset.
- Participants were students and staff members in a mailing list from the University of Sheffield. Alongside this main survey, 12 PhD students (from the LMU) familiar with neural language models were polled, with similar results.
- In order, the headlines above are: real, generated, generated, real, generated.
- Cho et al. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734.
- Dathathri et al. (2019). Plug and play language models: a simple approach to controlled text generation.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- Gehrmann et al. (2019). GLTR: statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Florence, Italy, pp. 111–116.
- Graves (2013). Generating sequences with recurrent neural networks. CoRR abs/1308.0850.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
- Holtzman et al. (2019). The curious case of neural text degeneration. CoRR abs/1904.09751.
- Howard and Ruder (2018). Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 328–339.
- Kulkarni (2018). The Historical Reuters News-Wire. Harvard Dataverse.
- Kumar and Shah (2018). False information on web and social media: a survey. CoRR abs/1804.08559.
- Lavoie and Krishnamoorthy (2010). Algorithmic detection of computer generated text.
- Luca and Zervas (2013). Fake it till you make it: reputation, competition, and Yelp review fraud. SSRN Electronic Journal.
- Merity et al. (2018). Regularizing and optimizing LSTM language models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, Conference Track Proceedings.
- Engineering a Tool to Detect Automatically Generated Papers. In BIR 2016 Bibliometric-enhanced Information Retrieval, Padova, Italy.
- Ott et al. (2011). Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 309–319.
- Pérez-Rosas et al. (2018). Automatic detection of fake news. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 3391–3401.
- Sanh et al. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
- Subramanian et al. (2018). Towards text generation with adversarially learned neural outlines. In Advances in Neural Information Processing Systems 31, pp. 7551–7563.
- Vaswani et al. (2017). Attention is all you need. CoRR abs/1706.03762.
- Yao et al. (2017). Automated crowdturfing attacks and defenses in online review systems. CoRR abs/1708.08151.
- Zellers et al. (2019). Defending against neural fake news. In Advances in Neural Information Processing Systems 32, Curran Associates, Inc.