On the Benefit of Combining Neural, Statistical and External Features for Fake News Identification
Identifying the veracity of a news article is an interesting problem, and automating this process is a challenging task. Detecting whether a news article is fake remains an open question, as it is contingent on many factors that current state-of-the-art models fail to incorporate. In this paper, we explore a subtask of fake news identification: stance detection. Given a news article, the task is to determine the relevance of the body to its claim. We present a novel idea that combines neural, statistical and external features to provide an efficient solution to this problem. We compute the neural embedding from a deep recurrent model, statistical features from a weighted n-gram bag-of-words model, and hand-crafted external features with the help of feature engineering heuristics. Finally, all the features are combined using a deep neural layer, which classifies the headline-body news pair as agree, disagree, discuss, or unrelated. We compare our proposed technique with the current state-of-the-art models on the Fake News Challenge dataset. Through extensive experiments, we find that the proposed model outperforms all the state-of-the-art techniques, including the submissions to the Fake News Challenge.
Keywords: External features, Statistical features, Word embeddings, Fake news, Deep learning
Fake news, a potential threat to journalism and public discourse, has created a buzz across the internet. With the advent of social media platforms such as Facebook and Twitter, it has become easier to propagate any information to the masses within minutes. While the propagation of information has grown with social media, the authenticity of these news articles has deteriorated. It has become much easier to mislead the masses with a single fake post on Facebook or Twitter. For instance, in the US presidential election of 2016, fake news was cited as the foremost contributing factor that affected the outcome.
|Headline||”Robert Plant Ripped up $800M Led Zeppelin Reunion Contract”||Stance|
|Body 1||Led Zeppelin’s Robert Plant turned down 500 MILLION to reform supergroup.||Agree|
|Body 2||No, Robert Plant did not rip up an $800 million deal to get Led Zeppelin back together.||Disagree|
|Body 3||Robert Plant reportedly tore up an $800 million Led Zeppelin reunion deal.||Discuss|
|Body 4||Richard Branson’s Virgin Galactic is set to launch SpaceShipTwo today.||Unrelated|
The root cause of this problem lies in the fact that none of the social networking sites uses an automatic system that can identify the veracity of news flowing across these platforms. A possible reason for this failure is the open-domain nature of the problem, which adds to its intricacy. The recently organized Fake News Challenge (FNC-1) is an initiative in this direction. The aim of this challenge is to build an automatic system that can identify whether a news article is fake or not. More specifically, given a news article, the task is to evaluate the relatedness of the news body to its headline. The relatedness, or stance, is the relative perspective of a news article towards a given claim (shown in Table 1).
The idea behind building a countermeasure for fake news is to use machine learning and natural language processing (NLP) tools that can compute the semantic and contextual similarity between the headline and the body, and classify the pairs into one of four categories. Deep learning models have been effective in solving many NLP problems that share similarities with fake news detection, including, but not limited to, computing semantic similarity between sentences [1, 18] and community-based question answering [31, 32]. The basic building blocks of all deep models are recurrent networks, such as recurrent neural networks (RNN), long short-term memory networks (LSTM) and gated recurrent units (GRU), and convolutional networks, such as convolutional neural networks (CNN). A deep architecture encodes the given sequence of words into a fixed-length vector representation, which can be used to score the relevance of two textual entities; in our case, the relevance of each headline-body pair.
Statistical information related to text can be encoded into vectors using the traditional bag-of-words (BOW) approach. BOW approaches are often combined with term frequency (TF), inverse document frequency (IDF) and n-grams, which help to encode more information about the text [28, 12]. These approaches, however simple, have been used to improve the performance of deep models in complex NLP problems such as community question answering and answer sentence selection. Sometimes it is beneficial to combine feature engineering heuristics with statistical approaches. The feature engineering heuristics, or external features, are used to help the learning model converge successfully to a global solution [31, 32, 34]. The external features include common observations such as the number of n-grams, the number of words matched between the headline and the body, the cosine similarity between the headline and body vectors, etc. The FNC-1 baseline also includes a combination of feature engineering heuristics that alone achieves competitive performance, even outperforming several widely used deep learning architectures. In this paper, we combine the external features introduced in the baseline with further heuristics that have been shown to be successful in other NLP tasks.
It is now common to use pre-trained word embeddings such as Word2vec and GloVe along with deep models for NLP tasks. Similar to word embeddings, recurrent models have been used to encode an entire sentence into a vector. Some of the widely used sentence-to-vector models include doc2vec, paragraph2vec and skip-thought vectors. These deep recurrent models help to capture the semantic and contextual information of the textual pairs; in our case, the body and its claim. In our work, we use skip-thought vectors to encode the headline and the body, and combine them with external features and statistical approaches.
Finally, the main contributions of the paper can be summarized as follows:
- We propose an approach based on the combination of statistical, neural and feature engineering heuristics, which achieves state-of-the-art performance on the task of fake news identification.
- We evaluate the proposed approach on the FNC-1 challenge and compare our results with the top-4 submissions to the challenge. We also analyze the applicability of several state-of-the-art deep models to the FNC-1 dataset.
The rest of the paper is organized as follows. In Section 2, we briefly review the previous work on which our work builds, followed in Section 3 by the applicability of state-of-the-art deep architectures to the problem of stance detection. In Section 4, we describe the proposed approach in detail, followed by the experimental setup in Section 5, which includes the dataset description, training parameters, evaluation metrics and results. Finally, we conclude in Section 6.
2 Related Work
In this section, we discuss previous work related to fake news identification, such as rumor detection in news articles and hoax news identification. We also discuss the deep learning architectures used by researchers whose work shares some similarity with ours.
Fake news. From an NLP perspective, researchers have studied numerous aspects of the credibility of online information. For example, Castillo et al. applied a time-sensitive supervised approach that relies on tweet content to assess the credibility of a tweet in different situations. Chen et al. [2017a] used LSTMs for the similar problem of early rumor detection. In another work, Chen et al. [2017b] aimed at detecting the stance of tweets and determining the veracity of a given rumor with convolutional neural networks. A submission by Augenstein et al. to the SemEval 2016 Twitter Stance Detection task focuses on creating a bag-of-words autoencoder and training it over the tokenized tweets.
FNC-1 submissions. In their work,  achieved a preliminary score of 0.8080, slightly above the competition baseline of 0.7950. They experimented with four basic models, on which the final result was evaluated: bag-of-words (BOW), basic LSTM, LSTM with attention, and conditional encoding LSTM with attention (CEA LSTM). In our work, instead of using the models separately, we combine the best of these models.
Another team, , combined multiple models in an ensemble, taking a 50/50 weighted average between a deep convolutional neural network and gradient-boosted decision trees. Though this work seems similar to ours, the difference lies in the construction of the ensemble of classifiers. In a similar attempt, a team  concatenated various feature vectors and passed them through an MLP model.
The work by  focuses on generating lexical and similarity features using TF-IDF representations of bag-of-words (BOW), which are then fed through a multi-layer perceptron (MLP) with one hidden layer. In their work,  divided the problem into two groups: unrelated and related. They achieved 90% accuracy on the related/unrelated task by finding the maximum and average Jaccard similarity scores across all sentences in the article and choosing appropriate threshold values. A similar split of the problem into two subproblems (related and unrelated) is also performed by . The work by  focuses on the use of recurrent models for fake news stance detection.
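The related/unrelated split by sentence-level Jaccard similarity can be illustrated with a minimal sketch. This is not the cited team's implementation: the tokenizer, the sentence splitter, the function names and the threshold value `max_threshold=0.2` are all assumptions made for illustration, and a real threshold would be tuned on held-out data.

```python
import re


def tokens(text):
    """Lowercase word tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_features(headline, body):
    """Maximum and average headline-to-sentence Jaccard score over the body."""
    head = tokens(headline)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", body) if s.strip()]
    scores = [jaccard(head, tokens(s)) for s in sentences] or [0.0]
    return max(scores), sum(scores) / len(scores)


def is_related(headline, body, max_threshold=0.2):
    """Call a pair related if its best sentence overlaps enough with the headline.

    max_threshold is a hypothetical value chosen only for this example.
    """
    max_score, _ = jaccard_features(headline, body)
    return max_score >= max_threshold
```

With the Table 1 example, the "Disagree" body shares five of twenty distinct tokens with the headline and passes the assumed threshold, while the "Unrelated" body shares none.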
3 Technique Used
3.1 Deep Learning Architectures
To predict the stance for a given sample in the FNC-1 dataset, a multi-channel deep neural network can be used to encode a given headline-body pair, which can then be classified into one of the four stances. This is achieved by using a multi-channel convolutional neural network with a softmax layer at the output (shown in Figure 1). Similarly, instead of the convolution and pooling layers, LSTMs and GRUs can be used to encode the headline-body pairs. LSTMs and GRUs encode the given sequence of words into a fixed-length vector representation, which can be used to score the relevance of the headline-body pair. However, for long sequences, such as the body of a news article (which typically contains hundreds of words), RNN models fail to encode the entire information into a fixed-length vector. A solution to this problem is the attention mechanism, which computes a weighted sum of all the encoder units that is passed on to the decoder. The decoder is learned in such a way that it gives importance to only some of the words. The attention mechanism also alleviates the bottleneck of encoding input sequences into a fixed-length vector and has been shown to outperform other RNN-based encoder-decoder models on longer sequences. To alleviate the problem of limited memory, we use the attention mechanism as described in .
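The weighted sum computed by an attention mechanism can be sketched in a few lines. The sketch below uses plain dot-product scoring as a simplification; it is not the additive scoring function of the attention model referenced above, and the names `attend`, `encoder_states` and `query` are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(encoder_states, query):
    """Return (context, weights) for simple dot-product attention.

    encoder_states: list of T hidden vectors, one per input word
    query: a single vector, e.g. the current decoder state
    """
    # One relevance score per encoder state.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)
    ]
    return context, weights
```

Because the context vector is recomputed for every query, the model is not forced to squeeze a body of several hundred words into one fixed-length encoding.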
4 Proposed Idea
The headline-body pairs in the FNC-1 dataset are created by randomly assigning a news body to a given headline. This type of data augmentation has been used successfully in NLP problems such as non-factoid question answering, where it results in reasonable performance by deep learning models [31, 19]. However, in the case of the FNC-1 challenge, the agree, disagree, and discuss headline-body pairs are far fewer than those with the unrelated stance. This bias leads to an uneven distribution of the dataset across the four classes, with the unrelated category being the least interesting. The interestingness of a headline-body pair is evaluated in terms of the information it contains: it is easier to evaluate an unrelated pair, while the other three are contingent on exploring the contextual relationship between the headline and its body, and are considered more interesting.
The uneven distribution of the FNC-1 dataset thwarts the performance of the deep learning architectures introduced in Section 3. Moreover, news articles are heavily influenced by certain words that are generally used to describe polarity. For example, words such as fraud and hoax (both part of the FNC-1 baseline lexicon of highly polarized words) are often used with a negative connotation. If such words are present in both the news headline and the body, or are present in one while absent from the other, then it is easier to identify such a pair as agree or disagree. Deep learning models depend on a huge training corpus (a few million headline-body pairs) in order to identify such nuances in patterns. The FNC-1 dataset, though the largest publicly available dataset on stance detection, does not satisfy this criterion. For this reason, we introduce a much simpler strategy that relies heavily on feature engineering. We leverage several widely used state-of-the-art features from natural language processing and use a feed-forward deep neural network that aggregates all the individual features and computes a score for each headline-body pair.
4.1 Neural Embeddings
We use skip-thought vectors, which encode sentences into vector embeddings of length 4800 (shown in Figure 2).
Skip-thought is an encoder-decoder based recurrent model that computes the relative occurrence of sentences. In our work, we use the pre-trained skip-thought embedding, which is trained on BookCorpus. We use a pre-trained model because the FNC-1 dataset is much smaller than the dataset required to efficiently train a recurrent encoder-decoder model like skip-thought.
We follow the work of [18, 1] and compute two features from the skip-thought embeddings. These features have been shown to be effective in evaluating contextual similarity between sentences. The task of stance detection is analogous to computing the contextual similarity between two texts: the headline and its body. We therefore expect the features introduced by [18, 1] to be effective for stance detection as well. Given the skip-thought encodings of the news body and the headline as $u_{n}$ and $u_{h}$, we compute two features
$$f_{1} = u_{n} \odot u_{h}, \qquad f_{2} = |u_{n} - u_{h}|$$
where $f_{1}$ is the component-wise product and $f_{2}$ is the absolute difference between the skip-thought encodings of the news body and the headline. Each of these features is a 4800-dimensional vector.
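The two pair features, the component-wise product and the absolute difference of the body and headline encodings, can be sketched as follows. The function name `pair_features` is an assumption, and the toy three-dimensional vectors merely stand in for real 4800-dimensional skip-thought encodings, which would come from a pre-trained encoder that is not shown here.

```python
def pair_features(u_body, u_head):
    """Component-wise product and absolute difference of two encodings.

    Both inputs must have the same dimension; each output has that dimension too.
    """
    if len(u_body) != len(u_head):
        raise ValueError("encodings must have the same dimension")
    product = [b * h for b, h in zip(u_body, u_head)]
    abs_diff = [abs(b - h) for b, h in zip(u_body, u_head)]
    return product, abs_diff


# Toy 3-dimensional stand-ins for the 4800-dimensional encodings.
product, abs_diff = pair_features([1.0, -2.0, 0.5], [2.0, 3.0, 0.5])
```

Both outputs keep the dimensionality of the inputs, so real encodings would yield two 4800-dimensional vectors.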
4.2 Statistical Features
We encode the statistical information of the text into vectors with the help of BOW, TF-IDF and n-gram models. We follow the work of  and , and produce the following vectors for each headline-body pair:
1-gram TF vector of the headline.
1-gram TF vector of the body.
Each of these is a 5000-dimensional vector. We concatenate the two TF vectors and pass the result to an MLP layer (as shown in Figure 2).
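A minimal sketch of building the two unigram TF vectors and concatenating them is given below. The tokenizer, the choice of the most frequent unigrams as vocabulary, the length normalisation and all function names are assumptions for illustration; the paper does not specify these details here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens, in order of appearance."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(documents, max_size=5000):
    """Map the max_size most frequent unigrams to column indices."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_size))}


def tf_vector(text, vocab):
    """Term-frequency vector over vocab, normalised by text length (an assumption)."""
    toks = tokenize(text)
    vec = [0.0] * len(vocab)
    for tok in toks:
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return [v / len(toks) for v in vec] if toks else vec


def pair_tf(headline, body, vocab):
    """Concatenate the headline and body TF vectors into one input vector."""
    return tf_vector(headline, vocab) + tf_vector(body, vocab)
```

With a full 5000-word vocabulary, `pair_tf` would return a 10000-dimensional vector for each headline-body pair.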
4.3 External Features
The external features include feature engineering heuristics such as the number of similar words in the headline and body, the cosine similarity between the vector encodings of headline-body pairs, the number of n-grams matched between the pairs, etc. We take ideas for computing the external features from the baseline and add some extra features, which include
Number of character n-gram matches between the headline-body pair, where .
Number of word n-gram matches between the headline-body pair, where .
Weighted TF-IDF score between headline and its body using the approach mentioned in .
Sentiment difference between the headline-body pair, also termed polarity, which is computed using a lexicon-based approach.
N-gram refuting feature which is constructed using BOW on a lexicon of pre-defined words. It is similar to polarity based features with an addition of n-gram model.
All the external features add up to a 50-dimensional feature vector, which is passed to an MLP layer in the same way as the neural and statistical features.
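A few of the external features listed above (word overlap, word and character n-gram matches, and a refuting-word indicator) can be sketched as follows. This is an illustrative subset, not the 50-dimensional feature set used in the paper: the lexicon `REFUTING_WORDS`, the n-gram sizes in `word_ns` and `char_ns`, and the function names are all hypothetical.

```python
import re

# Hypothetical mini-lexicon; the lexicon used in the paper is not reproduced here.
REFUTING_WORDS = {"fake", "fraud", "hoax", "false", "deny", "denies", "not"}


def tokenize(text):
    """Lowercase word tokens, in order of appearance."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(items, n):
    """Set of contiguous n-grams of a sequence (a token list or a string)."""
    return {tuple(items[i:i + n]) for i in range(len(items) - n + 1)}


def external_features(headline, body, word_ns=(2, 3), char_ns=(4, 8)):
    """Return [overlap, word n-gram hits..., char n-gram hits..., refute flag]."""
    h_tok, b_tok = tokenize(headline), tokenize(body)
    feats = []
    # Word overlap: shared vocabulary divided by combined vocabulary.
    union = set(h_tok) | set(b_tok)
    feats.append(len(set(h_tok) & set(b_tok)) / len(union) if union else 0.0)
    # Number of distinct word n-grams that occur in both texts.
    for n in word_ns:
        feats.append(float(len(ngrams(h_tok, n) & ngrams(b_tok, n))))
    # Number of distinct character n-grams that occur in both texts.
    h_chr, b_chr = " ".join(h_tok), " ".join(b_tok)
    for n in char_ns:
        feats.append(float(len(ngrams(h_chr, n) & ngrams(b_chr, n))))
    # 1.0 when exactly one of the two texts contains a refuting word.
    h_ref = any(t in REFUTING_WORDS for t in h_tok)
    b_ref = any(t in REFUTING_WORDS for t in b_tok)
    feats.append(float(h_ref != b_ref))
    return feats
```

Each feature is a single number, so further heuristics (TF-IDF similarity, sentiment difference) could be appended to the same list before it is passed to the MLP layer.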
5.1 Dataset Description
We use the dataset provided in the FNC-1 challenge, which is derived from the Emergent dataset and provided by the Fake News Challenge administrators. It consists of 49972 tuples, each comprising a headline-body pair followed by a class label (stance) of agree, disagree, discuss or unrelated. Word counts range roughly from 8 to 40 for headlines and from 600 to 7000 for article bodies. The distribution of the FNC-1 dataset is shown in Table 2.
|News pairs||Unrelated||Discuss||Agree||Disagree|
|49972||73.13 %||17.83 %||7.36 %||1.68 %|
|Hyperparameter||Skip-thought||External Features||TF-IDF Vectors|
|MLP neurons||500 ; 100||50||500 ; 50|
|Dropout||0.2 ; -||-||0.4 ; -|
|Activation||sigmoid ; sigmoid||relu||relu ; relu|
|Regularization||L2 - 0.00000001 ; -||-||L2 - 0.00005 ; -|
The final results are evaluated on a test dataset of 25413 samples provided by the Fake News Challenge organizers.
5.2 Training parameters
As shown in Figure 2, the proposed model computes the feature vectors separately and then combines them with the help of an MLP layer. We use cross-entropy as the loss function to optimize our architecture, with a softmax layer at the output that classifies the given headline-body pair as agree, disagree, discuss, or unrelated. The hyper-parameter settings are shown in Table 3.
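The final combination step can be sketched as a concatenation of the three hidden representations followed by one affine layer, a softmax, and a cross-entropy loss. This is a simplified illustration under stated assumptions: the function names, the single output layer and the toy dimensions are hypothetical, and the per-branch MLP layers, dropout and regularization from Table 3 are omitted.

```python
import math

STANCES = ["agree", "disagree", "discuss", "unrelated"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, bias):
    """Affine layer; weights holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def classify(neural_h, statistical_h, external_h, weights, bias):
    """Concatenate the three hidden representations and return stance probabilities."""
    combined = neural_h + statistical_h + external_h
    return softmax(dense(combined, weights, bias))


def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold stance."""
    return -math.log(probs[gold_index])


# Toy example: 5 combined inputs, untrained (all-zero) output layer.
probs = classify([0.1, 0.2], [0.3], [0.4, 0.5], [[0.0] * 5 for _ in STANCES], [0.0] * 4)
loss = cross_entropy(probs, STANCES.index("discuss"))
```

With an untrained all-zero layer the four stances are equally likely, so the loss equals ln 4; training would lower it by adjusting the weights.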
5.3 Baselines and compared methods
The organizers of FNC-1 have provided a baseline model that consists of a gradient-boosting classifier over n-gram subsequences between the headline and the body, along with several external features such as word overlap and the occurrence of sentiment using a lexicon of highly polarized words (like fraud and hoax). With this simple yet elegant baseline, it is possible to outperform some of the widely used deep learning architectures that we have used in our work. Following the work of , we also introduce three new baselines for the FNC-1 dataset: a word2vec + external features baseline, a skip-thought baseline, and a TF-IDF baseline. These baselines measure the performance of the neural, statistical, and external features when used individually.
We compare our proposed approach with the submissions of the top 4 teams at FNC-1 (http://www.fakenewschallenge.org/, https://competitions.codalab.org/competitions/16843#results), which include the work by , ,  and . Apart from the top submissions at FNC-1, we also compare the proposed architecture with the four deep learning architectures introduced in Section 3, namely CNN, biLSTM, biLSTM + Attention and CNN + biLSTM.
5.4 Evaluation metrics
From Table 2 it is evident that the FNC-1 dataset shows a heavy bias towards unrelated headline-body pairs. Recognizing this data bias and the simpler nature of the related/unrelated classification problem, the organizers of FNC-1 introduced the following weighted accuracy score as their final evaluation metric:
$$\text{Score}_{FNC} = 0.25 \cdot \text{Acc}_{\text{related, unrelated}} + 0.75 \cdot \text{Acc}_{\text{agree, disagree, discuss}}$$
We use $\text{Score}_{FNC}$ (the FNC-1 score) as the main evaluation criterion when comparing the proposed model with other related techniques. We also use class-wise accuracy to further evaluate the performance of all the techniques.
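The weighted metric can be made concrete with a short scoring sketch. It follows the weighting scheme of the FNC-1 challenge as we understand it: 0.25 for a correct related/unrelated decision, plus 0.75 for the exact stance of a related pair, normalised by the best achievable score. The function name `fnc_score` is an assumption, and this is not the official scorer script.

```python
RELATED = {"agree", "disagree", "discuss"}


def fnc_score(gold, predicted):
    """Weighted FNC-1 score as a percentage of the best achievable score."""
    score, best = 0.0, 0.0
    for g, p in zip(gold, predicted):
        # A perfect prediction earns 1.0 on a related pair, 0.25 on an unrelated one.
        best += 1.0 if g in RELATED else 0.25
        # 0.25 for getting the related/unrelated decision right.
        if (g in RELATED) == (p in RELATED):
            score += 0.25
        # An additional 0.75 for the exact stance of a related pair.
        if g in RELATED and g == p:
            score += 0.75
    return 100.0 * score / best if best else 0.0


gold = ["agree", "disagree", "discuss", "unrelated"]
pred = ["agree", "discuss", "unrelated", "unrelated"]
result = fnc_score(gold, pred)
```

Under this weighting, a system that labels every pair unrelated scores poorly despite high plain accuracy on a dataset that is about 73% unrelated.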
The results on the FNC-1 test dataset are shown in Table 4. The first part of the table shows the performance of the baselines used in our work. The FNC-1 baseline achieves a score that is better than the performance of all the deep architectures introduced in Section 3. The FNC-1 baseline consists of a gradient-boosted tree classifier trained on the hand-crafted features (described in Section 4.3). Given the simplicity of this baseline, it is indeed remarkable that it achieves such a high score. The FNC-1 baseline achieves higher class-wise accuracy on the unrelated stance than the skip-thought baseline, whereas the latter receives a higher FNC-1 score. The skip-thought baseline achieves a higher accuracy on agree and discuss than the unrelated stance.
|Method||FNC-1 score||Agree||Disagree||Discuss||Unrelated||Overall|
|Word2vec + External Features||75.78||50.70||9.61||53.38||96.05||82.79|
|SOLAT in the SWEN ||82.05||58.50||1.86||76.18||98.70||89.08|
|UCL Machine Reading ||81.72||44.04||6.60||81.38||97.90||88.46|
|Chips Ahoy! ||80.12||55.96||0.28||70.29||98.98||88.01|
|biLSTM + Attention||63.17||58.74||0.03||63.48||77.49||73.27|
|CNN + biLSTM||64.95||74.09||2.46||57.85||74.87||72.89|
Since the interestingness of agree and discuss is higher than that of the unrelated stance, skip-thought achieves a higher FNC-1 score. This also explains the introduction of the new scoring criterion by the FNC organizers (see Section 5.4). Finally, the FNC-1 scores of the skip-thought, external features, and TF-IDF baselines are higher than that of the FNC-1 baseline. We therefore expect that combining these three baseline models will achieve a higher score on the evaluation metric. Moreover, all the baselines achieve a very low or zero score on the disagree stance. Therefore, apart from the FNC-1 score, the class-wise performance is worth considering as a performance criterion.
The performance of the top-4 teams that participated in FNC-1 is shown in the middle part of Table 4, with SOLAT in the SWEN winning the challenge with a score of 82.05. All the teams achieved a high score and high class-wise accuracy on all stances except for the disagree stance. This should be a concern, since the importance of disagree is equivalent to that of the agree and discuss stances. We observed that the news pairs in the disagree category are not only very few but also consist of divergent news articles. This is one of the reasons for the poor performance of most of the deep models, including the top teams, in identifying the disagree stance.
The lowest section of Table 4 shows the performance of the proposed model along with the other architectures used in our work. The proposed model achieves the highest FNC-1 score and the highest class-wise accuracy on the discuss stance, while achieving accuracy on the other stances that is comparable to the top submissions at FNC-1. From Table 5, it is evident that the overall accuracy achieved by the proposed model is slightly lower than , although the proposed model outperformed all the other techniques by a clear margin (in terms of FNC-1 score). A possible reason for this deviation is that  gives more focus to the classification of unrelated stances than to the rest, which is the reason for its highest overall accuracy. Since unrelated stances are of least interest to us, this results in a lower FNC-1 score.
Finally, a confusion matrix is given in Table 5 that provides in-detail analysis of the performance of our approach.
In this paper, we explored the benefit of incorporating neural, statistical and external features into deep neural networks for the task of fake news stance detection. We also presented an in-depth analysis of several state-of-the-art recurrent and convolutional architectures (shown in Figure 1). The presented idea leverages features extracted using skip-thought embeddings, n-gram TF vectors and several hand-crafted features.
We found that the uneven distribution of the FNC-1 dataset undermines the performance of most deep learning architectures. The small number of training samples aggravates this further. Creating a dataset for a complex NLP problem such as fake news identification is indeed a cumbersome task, and we appreciate the work of the FNC organizers; yet a more detailed and elaborate dataset would make this challenge more suitable for evaluation.
- Agirre et al.  E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, and W. Guo. *SEM 2013 shared task: Semantic textual similarity, including a pilot on typed-similarity. In *SEM 2013: The Second Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, 2013.
- Andreas Hanselowski and Caspelherr  B. S. Andreas Hanselowski, Avinesh PVS and F. Caspelherr. Team athene on the fake news challenge. 2017.
- Augenstein et al.  I. Augenstein, A. Vlachos, and K. Bontcheva. Usfd at semeval-2016 task 6: Any-target stance detection on twitter with autoencoders. In SemEval@ NAACL-HLT, pages 389–393, 2016.
- Bahdanau et al.  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL http://arxiv.org/abs/1409.0473.
- Castillo et al.  C. Castillo, M. Mendoza, and B. Poblete. Predicting information credibility in time-sensitive social media. Internet Research, 23(5):560–588, 2013.
- Chaudhry et al.  A. K. Chaudhry, D. Baker, and P. Thun-Hohenstein. Stance detection for the fake news challenge: Identifying textual relationships with deep neural nets. 2017.
- Chen et al. [2017a] T. Chen, L. Wu, X. Li, J. Zhang, H. Yin, and Y. Wang. Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection. arXiv preprint arXiv:1704.05973, 2017a.
- Chen et al. [2017b] Y.-C. Chen, Z.-Y. Liu, and H.-Y. Kao. Ikm at semeval-2017 task 8: Convolutional neural networks for stance detection and rumor verification. Proceedings of SemEval. ACL, 2017b.
- Cho et al.  K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.
- Chopra et al.  S. Chopra, S. Jain, and J. M. Sholar. Towards automatic identification of fake news: Headline-article stance detection with lstm attention models, 2017.
- Chung et al.  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
- Davis and Proctor  R. Davis and C. Proctor. Fake news, real consequences: Recruiting neural networks for the fight against fake news. 2017.
- Dean Pomerleau  D. R. Dean Pomerleau. Fake news challenge. 2017.
- Feng et al.  M. Feng, B. Xiang, M. R. Glass, L. Wang, and B. Zhou. Applying deep learning to answer selection: A study and an open task. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on, pages 813–820. IEEE, 2015.
- Ferreira and Vlachos  W. Ferreira and A. Vlachos. Emergent: a novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. ACL, 2016.
- Graves and Schmidhuber  A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
- He et al.  H. He, K. Gimpel, and J. J. Lin. Multi-perspective sentence similarity modeling with convolutional neural networks. In EMNLP, pages 1576–1586, 2015.
- Kiros et al.  R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302, 2015.
- Mihaylov and Nakov  T. Mihaylov and P. Nakov. Semanticz at semeval-2016 task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings. In SemEval@ NAACL-HLT, pages 879–886, 2016.
- Mikolov et al. [2013a] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.
- Mikolov et al. [2013b] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013b.
- Miller and Oswalt  K. Miller and A. Oswalt. Fake news headline classification using neural networks with attention. 2017.
- Neculoiu et al.  P. Neculoiu, M. Versteegh, M. Rotaru, and T. B. Amsterdam. Learning text similarity with siamese recurrent networks. ACL 2016, page 148, 2016.
- NYTimes  NYTimes. As fake news spreads lies, more readers shrug at the truth. 2016.
- Pennington et al.  J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
-  S. Pfohl, O. Triebe, and F. Legros. Stance detection for the fake news challenge with attention and conditional encoding.
- Řehůřek and Sojka  R. Řehůřek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
- Riedel et al.  B. Riedel, I. Augenstein, G. P. Spithourakis, and S. Riedel. A simple but tough-to-beat baseline for the fake news challenge stance detection task. arXiv preprint arXiv:1707.03264, 2017.
- Shang  J. Shang. Chips ahoy! at fake news challenge. 2017.
- Tan et al.  M. Tan, C. d. Santos, B. Xiang, and B. Zhou. Lstm-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108, 2015.
- Yang et al.  L. Yang, Q. Ai, D. Spina, R.-C. Chen, L. Pang, W. B. Croft, J. Guo, and F. Scholer. Beyond factoid qa: Effective methods for non-factoid answer sentence retrieval. In European Conference on Information Retrieval, pages 115–128. Springer, 2016.
- Yang et al.  Y. Yang, W.-t. Yih, and C. Meek. Wikiqa: A challenge dataset for open-domain question answering. In EMNLP, pages 2013–2018, 2015.
- Yu et al.  L. Yu, K. M. Hermann, P. Blunsom, and S. Pulman. Deep learning for answer sentence selection. arXiv preprint arXiv:1412.1632, 2014.
- Yuxi Pan  S. B. Yuxi Pan, Doug Sibley. Talos. http://blog.talosintelligence.com/2017/06/, 2017.
- Zhu et al.  Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724, 2015.