Transformers Are Better Than Humans at Identifying Generated Text


Fake information spread via the internet and social media influences public opinion and user activity. Generative models enable fake content to be produced faster and more cheaply than was previously possible. This paper examines the problem of identifying fake content generated by lightweight deep learning models. A dataset containing human- and machine-generated headlines was created, and a user study indicated that humans were able to identify the generated headlines in only 45.3% of cases. However, the most accurate automatic approach, transformers, achieved an accuracy of 93.8%, indicating that content generated by language models can be filtered out accurately.


1 Introduction

In recent years fake content has been spreading across the internet and social media with great speed, misinforming users and affecting their opinions Kumar and Shah (2018). While much of this fake content is created by paid writers Luca and Zervas (2013), the amount generated by automated systems is rising; for example, malicious users can use language models to generate positive reviews for their own products on shopping platforms. There is therefore a need for models that can distinguish between human- and computer-generated text, so that deceptive content can be filtered out before it reaches a wider audience.

While a lot of work has been done recently showcasing the strengths of text generation models Dathathri et al. (2019); Subramanian et al. (2018), little research has been conducted on methods to detect automatically generated text. Thankfully, generative models have several shortcomings and their output text has some characteristics that set it apart from human-written text, like lower variance and smaller vocabulary (Holtzman et al. (2019); Gehrmann et al. (2019)). These differences between real and generated text can be used by pattern recognition models to differentiate between the two. In this paper we test this hypothesis by training classifiers to detect headlines generated by RNN-based models.

The work described in this paper is split into two parts: the creation of a dataset containing headlines written by both humans and machines (Section 3) and the training of classifiers to distinguish between them (Section 4). The dataset is created using real headlines from the Reuters Corpus Kulkarni (2018) and headlines generated by neural language models. The training and development sets consist of headlines from 2015 while the testing set consists of 2016 and 2017 headlines. For the classifiers, a series of baselines and deep learning models were tested, including transfer learning and transformer architectures. Neural methods were found to greatly outperform humans, with transformers being around 40 percentage points more accurate overall.

This work highlights how difficult it is for humans to identify fake content even when it is generated from simpler and faster models, but that the problem can ultimately be tackled using automated approaches. This suggests that automatic methods for content analysis could have an important role in supporting readers to understand the veracity of content. The main contributions of this work are the development of a novel fake content identification task based on news headlines1 and analysis of human and automatic approaches to the problem.

2 Relevant Work

Kumar and Shah (2018) compiled a survey on fake content on the internet, which serves as an overview of how false information targets users and how automatic detection models operate. The sharing of false information is boosted by the natural susceptibility of humans to believe such information. Pérez-Rosas et al. (2018); Ott et al. (2011) showed that humans are able to identify fake content with an accuracy of 50-75%. Information that is well presented, using long text with limited errors, was shown to deceive the majority of readers. Yao et al. (2017) examined the generation of fake reviews for online shopping platforms, building an RNN-based model trained on a dataset of Yelp reviews.

In Zellers et al. (2019), neural fake news detection and generation are jointly examined in an adversarial setting. The Grover model achieves an accuracy of 92% when distinguishing real from generated news articles, comparable to the accuracy of the models examined in this work. However, human evaluation is lacking, and Grover cannot realistically be used by malicious actors for mass production of fake content due to the computational resources required.

Holtzman et al. (2019) investigated the pitfalls of text generation, showing that decoding methods such as beam search can produce low-quality and repetitive text. Gehrmann et al. (2019) showed that models generate text from a more limited vocabulary than humans, who choose low-probability words more often. This means that text written by humans is more varied than that written by models. Lavoie and Krishnamoorthy (2010) employed a feature-based classification system to detect fake scientific papers from SCIgen2, using 200 papers and leave-one-out validation. A similar study was performed by Nguyen and Labbe (2016), where 200 generated papers and 10,000 genuine ones were used for classification.

3 Dataset

3.1 Dataset Development

The dataset was created using Reuters headlines from 2015, 2016 and 2017 Kulkarni (2018), training models on each individual year to generate new headlines. The approach to headline generation was based on the method described by Graves (2013). Multiple RNNs (GRUs Cho et al. (2014) and LSTMs Hochreiter and Schmidhuber (1997)) were trained to predict the next word given some context. We generated text by using random sampling with temperature and continuously re-feeding sampled words into the model. The output headlines were filtered by their perplexity score3, and headlines with fewer than five words were removed. Multiple models were used for headline generation to make sure the classifier could generalize to different setups. For these models, we experimented with different RNN types and parameters. Details can be found in Appendix A.
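The sampling loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual code: `step_fn` is a hypothetical stand-in for the trained language model's next-word distribution, and token indices stand in for words.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from a next-word distribution after temperature scaling.

    Assumes strictly positive probabilities. Lower temperatures sharpen
    the distribution towards the most likely word.
    """
    scaled = [math.log(p) / temperature for p in probs]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    cum = 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r <= cum:
            return i
    return len(exps) - 1

def generate(step_fn, seed_tokens, max_len=12, temperature=0.8):
    # Repeatedly sample the next token and re-feed it into the model,
    # mirroring the generation loop described in Section 3.1.
    tokens = list(seed_tokens)
    while len(tokens) < max_len:
        probs = step_fn(tokens)   # model's distribution over the vocabulary
        tokens.append(sample_with_temperature(probs, temperature))
    return tokens
```

At very low temperatures this reduces to greedy decoding; higher temperatures inject the randomness needed to produce varied headlines from the same model.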

The real and generated headlines for each year were then merged. Real headlines for each set were chosen randomly, with duplicates removed. These three sets (one for each year) were used to train and evaluate the generated text classifier.

The 2015 set contains generated headlines and real ones; the 2016 set generated and real; in the 2017 set there are generated and real. In total, there are generated headlines and real.

3.2 Dataset Analysis

The generated headlines show significant similarity to the real headlines, as shown below. This indicates that the language models are indeed able to capture patterns in the original data. Even though the length of the generated headlines is bounded by the maximum number of words learned by the corresponding language model, the distribution of words is similar across real and generated headlines. Figures 1 and 2 show the 15 most frequent words in the real and generated headlines respectively.

Figure 1: Top 15 Words for real headlines
Figure 2: Top 15 Words for generated headlines

On average, the real headlines are slightly longer than the generated ones, with and words respectively.

Lastly, POS tag frequencies are shown in Table 1 for the top tags in each set. In real headlines, nouns and adjectives are used more often, whereas in generated headlines the distribution is smoother, consistent with the findings in Gehrmann et al. (2019).

Real Generated
POS freq POS freq
NN 0.334 NN 0.280
JJ 0.145 NNS 0.107
NNS 0.111 JJ 0.103
IN 0.090 IN 0.082
: 0.041 CD 0.045
CD 0.041 TO 0.033
VBZ 0.029 VB 0.030
VB 0.027 VBZ 0.025
TO 0.026 : 0.018
CC 0.023 POS 0.012
Table 1: Frequencies for the top 10 part-of-speech tags in real and generated headlines
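Frequencies like those in Table 1 can be reproduced with a simple count over tagged headlines. The sketch below assumes input in the `(token, tag)` format produced by a tagger such as `nltk.pos_tag`; the example data is illustrative, not from the dataset.

```python
from collections import Counter

def pos_frequencies(tagged_headlines):
    """Normalized POS-tag frequencies over a collection of headlines.

    `tagged_headlines` is a list of [(token, tag), ...] sequences.
    Returns a dict mapping each tag to its share of all tokens,
    ordered from most to least frequent.
    """
    counts = Counter(tag for headline in tagged_headlines
                     for _, tag in headline)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.most_common()}
```

Computing this separately over the real and generated sets gives the two columns of Table 1, making the flatter distribution of the generated headlines directly visible.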

3.3 Survey

A crowd-sourced survey4 was conducted to determine how realistic the generated text is. Participants were presented with headlines in a random order and asked to judge whether they were real or computer generated. Reuters 2017 data was used for the survey. The real headlines are headlines selected at random from Reuters 2017, while the generated headlines come from models trained on Reuters 2017. Only generated headlines with low POS perplexity were chosen to ensure the selection process was objective.
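The POS perplexity used for filtering is described in footnote 3 as being derived from trigram POS tag probabilities of the training data. A minimal sketch of such a filter follows; the add-alpha smoothing is an assumption on our part, since the paper does not specify its smoothing scheme.

```python
import math
from collections import Counter

def train_trigram_model(tag_sequences, alpha=0.1):
    """Add-alpha smoothed trigram model over POS tag sequences."""
    tri, bi = Counter(), Counter()
    vocab = set()
    for tags in tag_sequences:
        padded = ["<s>", "<s>"] + list(tags)
        vocab.update(tags)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1   # trigram counts
            bi[tuple(padded[i - 2:i])] += 1        # context counts
    V = len(vocab) + 1
    def prob(t1, t2, t3):
        return (tri[(t1, t2, t3)] + alpha) / (bi[(t1, t2)] + alpha * V)
    return prob

def pos_perplexity(tags, prob):
    # Perplexity of a tag sequence under the trigram model; headlines
    # whose score exceeds a chosen threshold would be discarded.
    padded = ["<s>", "<s>"] + list(tags)
    logp = sum(math.log(prob(*padded[i - 2:i + 1]))
               for i in range(2, len(padded)))
    return math.exp(-logp / len(tags))
```

Tag sequences that resemble the training data score a low perplexity and pass the filter, while ungrammatical sequences score high and are rejected.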

In total, there were 4174 answers to the ‘real or generated’ questions, of which 2244 (53.8%) were correct. When presented with a computer-generated headline, participants answered correctly in 45.3% of 2702 responses. Fifty-seven generated headlines were shown and, based on the average response, 25 of them were identified as computer-generated, an indication that our models can indeed generate realistic-looking headlines. When presented with actual headlines, participants answered correctly in 66.7% of 1338 responses; of the 37 real headlines presented, 30 were correctly identified as real (based on the average response).

Of the 57 generated headlines, 6 were marked as real by over 90% of the participants, while 3 out of 31 real headlines reached that threshold (although several fell just short). Some of these headlines, both real and generated, follow in no particular order.

Inside the Great Hall - China’s Party Congress

Defense chief to continue to support Trump policy

Africa’s central bank keeps rate unchanged at 8.25 pct

Mallinckrodt to pay $100 million to settle U.S. probe on drug pricing

Britain’s FTSE steadies after Q1 sales surge 5

At the other end of the spectrum, there were five generated headlines that over 80% of the participants correctly identified as computer-generated:

Copper raises 2017 outlook by China; Vietnam drilling weighs

Tax Keep CEO promises to 28 pct last year board

FOREX-Dollar rebounds on Dollar after US jobs data data

Lira lower bond index seen nearly high after year US jobs

Names CEO of options for smaller shift

All of these examples contain grammatical errors, particularly incorrect use of prepositions. The third headline also exhibits repetition (“Dollar … dollar”, “data data”). It is worth noting that participants appeared more likely to identify headlines containing grammatical errors as fake news than ones exhibiting semantic inconsistency.

4 Classification

For our classifier experiments, we used the three sets of data (2015, 2016 and 2017) we had previously compiled. Specifically, for training we only used the 2015 set, while the 2016 and 2017 sets were used for testing. Splitting the train and test data by the year of publication ensures that there is no overlap between the sets and there is some variability between the content of the headlines (for example, different topics/authors). Therefore, we can be confident that the classifiers generalize to unknown examples.

Furthermore, for hyperparameter tuning, the 2015 data was randomly split into training and development sets.

4.1 Experiments

Four types of classifiers were explored: baselines (Logistic Regression, Elastic Net), deep learning (CNN, Bi-LSTM, Bi-LSTM with Attention), transfer learning via ULMFit Howard and Ruder (2018) and Transformers (BERT Devlin et al. (2019), DistilBERT Sanh et al. (2019)).

All classifiers are trained on a dataset containing real headlines from Reuters 2015 and headlines generated by an LSTM model trained on Reuters 2015. The architecture and training details can be found in Appendix B.

Each model was run three times and the results averaged. Results are shown in Table 2. Overall accuracy is the percentage of correct classifications over all headlines (real and generated), while precision and recall are calculated over the generated headlines. Precision is the percentage of correct classifications out of all headlines classified as generated, while recall is the percentage of actual generated headlines that the model classified correctly. High recall indicates that a model identifies most of the generated headlines, while low precision indicates that it over-predicts the generated class.
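With ‘generated’ as the positive class, the metrics reported in Table 2 correspond to the following straightforward computation (a restatement of the definitions above, not the paper's evaluation code):

```python
def classification_metrics(y_true, y_pred, positive="generated"):
    """Overall accuracy plus precision/recall over the generated class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),                    # all headlines
        "precision": tp / (tp + fp) if tp + fp else 0.0,      # of predicted generated
        "recall": tp / (tp + fn) if tp + fn else 0.0,         # of actual generated
    }
```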

We can observe from the results table that humans are less effective than all of the models, including the baselines, scoring the lowest overall accuracy. They are also the least accurate on generated headlines, achieving the lowest recall. Overall, human predictions are little better than random.

Deep learning models scored consistently higher than the baselines, especially on precision, while transfer learning outperformed all previous models, reaching an overall accuracy of around 90%. Transformer architectures perform best overall, with recall around 97% and very high precision, resulting in strong accuracy across the board. BERT, the highest-scoring model, scores at least 33 percentage points higher than humans on every metric.

Method Ovr. Acc. Precision Recall
Human 53.8 56.9 54.6
Log. Reg. 61.6 43.1 66.6
Elastic Net 59.2 49.6 60.0
CNN 72.6 67.3 74.3
BiLSTM 75.3 72.4 75.4
BiLSTM/Att. 74.9 69.4 77.0
ULMFit 90.0 86.6 92.6
BERT 93.8 90.1 97.0
DistilBERT 93.1 88.8 96.9
Table 2: Experiment Results

Since the training and testing data are separate, this indicates that generated text has traits that are absent from human text, and that transformers are able to pick up on these traits to make highly accurate classifications. For example, generated text shows lower variance than human text Gehrmann et al. (2019), which means that text without rarer words is more likely to have been generated than to have been written by a human.
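One such trait can be quantified with a simple type-token ratio. This toy sketch (ours, not from the paper) illustrates the lower-vocabulary-variance signal that a classifier could exploit:

```python
def type_token_ratio(headlines):
    """Distinct tokens divided by total tokens over a set of headlines.

    Lower values mean a smaller effective vocabulary and more repetition,
    as reported for generated text by Gehrmann et al. (2019).
    """
    tokens = [tok.lower() for h in headlines for tok in h.split()]
    return len(set(tokens)) / len(tokens)
```

A transformer learns far richer features than this, of course, but even a one-number statistic like this separates repetitive generated headlines from varied human ones in aggregate.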

4.2 Error Analysis

The following two headlines are indicative examples of those misclassified by BERT:

fiat chrysler not to stop self-driving cars -

justice dept finds indian in case linked to zika threat

The first headline is not only grammatically awkward, but also ends in a dash, which is an obvious indicator that the headline is fake. It is likely that the model puts more weight on the connection between “fiat chrysler” and “self-driving cars”, leading to a real classification.

In the second headline the phrases “justice dept”, “in case” and “threat” are connected by their appearance in similar contexts. BERT appears to have put emphasis on these connections, but didn’t pick up that Zika (a virus) is unlikely to be connected to an Indian under investigation.

In both cases, it is obvious that the headlines are machine-generated, either through a grammatical or a semantic error. Despite that, BERT classified them as real, quite possibly because there are strong connections between some tokens in the samples, even though overall the headlines are not coherent.

5 Conclusion

This paper examined methods to detect headlines generated by lightweight models. A dataset was created using headlines from Reuters, and a survey was conducted asking participants to distinguish between real and generated headlines. Real headlines were identified as real in 66.7% of responses, while generated ones were identified at a rate of only 45.3%. The dataset was used to train a range of models, all of which identified fake headlines more accurately than humans. BERT scored 93.8%, an improvement of 40 percentage points over human accuracy.

For future work it would be interesting to explore how these methods generalise to different text types, such as reviews or tweets.

Appendix A Language Model Details

For the generation of headlines, two RNN types were used: GRUs and LSTMs.

Language model vocabulary sizes ranged from 1000 to 3500, RNN units from 50 to 150, and embedding sizes from 150 to 300. Models were trained for 25-40 epochs. We experimented with stacking two LSTM layers, but the results were not satisfactory. Across the remaining models, generated headline quality appeared indistinguishable, although no formal inter-model comparison was conducted.

All the models were trained using the default GPU. Running time was 8 hours.

Appendix B Classifier Details

ULMFit and the Transformers require their own special tokenizers, but the rest of the models use the same method, a simple indexing over the most frequent tokens. No pretrained word vectors (for example, GloVe) were used for the Deep Learning models.

ULMFit uses pre-trained weights from the AWD-LSTM model Merity et al. (2018). For fine-tuning, we first updated the LSTM weights with a learning rate of for a single epoch. Then, we unfroze all the layers and trained the model with a learning rate of - for an additional epoch. Finally, we trained the classifier head on its own for one more epoch with a learning rate of .

For the Transformers, we loaded pre-trained weights which we fine-tuned for a single epoch with a learning rate of -. Specifically, the models we used were base-BERT (12 layers, 110m parameters) and DistilBERT (6 layers, 66m parameters).

The CNN has two convolutional layers on top of each other with filter sizes 8 and 4 respectively, and kernel size of 3 for both. Embeddings have 75 dimensions and the model is trained for epochs.

The LSTM-based models have one recurrent layer with 35 units, and embeddings with 100 dimensions. Bidirectionality is used alongside a spatial dropout of 0.33. After the recurrent layer, we concatenate average pooling and max pooling layers. We also experiment with a Bi-LSTM with self-attention Vaswani et al. (2017). These models are trained for epochs.

All the models were trained on the default GPU. Running time for the deep learning models and Transformers was 8 hours.


  1. Data and code are available here.
  3. Calculated using a Part-of-Speech statistical model as extracted from the trigram POS tag probabilities of the training dataset.
  4. Participants were students and staff members in a mailing list from the University of Sheffield. Alongside this main survey, 12 PhD students (from the LMU) familiar with neural language models were polled, with similar results.
  5. real - generated - generated - real - generated


  1. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734. External Links: Link, Document Cited by: §3.1.
  2. Plug and play language models: a simple approach to controlled text generation. External Links: 1912.02164 Cited by: §1.
  3. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §4.1.
  4. GLTR: statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Florence, Italy, pp. 111–116. External Links: Link, Document Cited by: §1, §2, §3.2, §4.1.
  5. Generating sequences with recurrent neural networks. CoRR abs/1308.0850. External Links: Link, 1308.0850 Cited by: §3.1.
  6. Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. External Links: Document, Link, Cited by: §3.1.
  7. The curious case of neural text degeneration. CoRR abs/1904.09751. External Links: Link, 1904.09751 Cited by: §1, §2.
  8. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 328–339. External Links: Link, Document Cited by: §4.1.
  9. The Historical Reuters News-Wire. Harvard Dataverse. External Links: Document, Link Cited by: §1, §3.1.
  10. False information on web and social media: A survey. CoRR abs/1804.08559. External Links: Link, 1804.08559 Cited by: §1, §2.
  11. Algorithmic detection of computer generated text. External Links: 1008.0706 Cited by: §2.
  12. Fake it till you make it: reputation, competition, and yelp review fraud. SSRN Electronic Journal, pp. . External Links: Document Cited by: §1.
  13. Regularizing and optimizing LSTM language models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: Appendix B.
  14. Engineering a Tool to Detect Automatically Generated Papers. In BIR 2016 Bibliometric-enhanced Information Retrieval, Padova, Italy. External Links: Link Cited by: §2.
  15. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 309–319. External Links: Link Cited by: §2.
  16. Automatic detection of fake news. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 3391–3401. External Links: Link Cited by: §2.
  17. DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. External Links: 1910.01108 Cited by: §4.1.
  18. Towards text generation with adversarially learned neural outlines. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett (Eds.), pp. 7551–7563. External Links: Link Cited by: §1.
  19. Attention is all you need. CoRR abs/1706.03762. External Links: Link, 1706.03762 Cited by: Appendix B.
  20. Automated crowdturfing attacks and defenses in online review systems. CoRR abs/1708.08151. External Links: Link, 1708.08151 Cited by: §2.
  21. Defending against neural fake news. Curran Associates, Inc.. External Links: Link Cited by: §2.