Ecol: Early Detection of CoVid Lies Using Content, Prior Knowledge and Source Information
Abstract
Social media platforms are vulnerable to fake news dissemination, which causes negative consequences such as panic and wrong medication in the healthcare domain. Therefore, it is important to detect fake news automatically at an early stage, before it spreads widely. This paper analyzes the impact of incorporating content information, prior knowledge, and the credibility of sources into models for the early detection of fake news. We propose a framework that models these features using the BERT language model and external sources, namely Simple English Wikipedia and source reliability tags. Experiments conducted on the CONSTRAINT dataset demonstrate the benefit of integrating these features for the early detection of fake news in the healthcare domain.
Keywords:
Fake news detection · Deep learning · Prior knowledge
1 Introduction
Social media is replacing traditional media as a source of information due to its ease of access, fast sharing, and the freedom to create content. However, social media is also responsible for spreading massive amounts of fake news [31]. Fake news propagation can manipulate significant events such as political elections or severely damage society during crises [14]. For example, a rumor that originated in a UK tabloid claimed that neat alcohol could cure COVID-19. As a consequence of the spread of this rumor, hundreds of Iranians lost their lives due to alcohol poisoning.
Since the U.S. presidential elections in 2016, tremendous efforts have been devoted by the research community to automating fake news detection. Most prior studies rely on leveraging propagation information, user engagement, and the content of news or social media posts [39, 26, 11]. However, methods relying on propagation information [37] and/or user engagement [16, 3, 2, 27] are not applicable for detecting fake news at an early stage, since those signals only become available once the news starts disseminating. Methods based solely on content (e.g., [2, 33]) can be misled by claims that require additional context for their interpretation. For instance, the post in Figure 1 may sound very plausible to readers who know the relationship between the politicians mentioned in the post. However, the source is a satirical website, which indicates that the post is fake.
Commonly, fake claims and news that partially share similar content appear in different sources over a relatively long time span. For example, in the healthcare domain, the most common fake claims are about unproven and alternative cures (e.g., using alcohol) against diseases [28, 34]. These types of fake news have also been observed during COVID-19 [18]. The assumption is that early published claims are fact-checked and can be employed to detect later ones. Therefore, investigating previously published claims and news can provide important information for determining the truthfulness of posts [12, 25]. Specifically, encoding previously published fake news in the healthcare domain as prior knowledge, together with content and source information, could help detect posts that would otherwise be missed by purely content-based methods.

In this paper, we investigate and analyse these intrinsic and extrinsic features to detect fake news in the healthcare domain. To this end, we propose a neural network model that integrates (1) a contextual representation of the news content, (2) a representation of the relationship to similar validated fake news, and (3) a source representation that embeds the reliability of the source. The main contributions of this paper are as follows:
- We introduce a classification framework that models prior knowledge, content, and source information using BERT [7] and reliability tags to predict the truthfulness of social media posts in the healthcare domain. We share our source code and trained models on GitHub.
- To evaluate the effectiveness of the proposed model, we conducted extensive experiments on the CONSTRAINT dataset. According to the obtained F1 score, ECOL ranked 14th among 167 submissions in the CONSTRAINT competition [21].
The rest of the paper is organized as follows. Section 2 briefly presents related work. Section 3 describes the proposed framework. Section 4 presents the dataset, baselines, ablation models and implementation details. Section 5 presents and discusses the results of the experiment. Finally, Section 6 concludes the paper and gives insights on future work.
2 Related Work
This section presents related work on textual content-based methods and on approaches that use external information to assess truthfulness.
2.1 Textual Content based Fake News Detection on Social Media
Textual content-based methods for fake news detection on social media [39, 26, 11] range from traditional machine learning models [3, 13, 38] to neural networks [23, 2, 7]. For instance, the methods in [3, 13] leverage features such as sentiment and metadata information (e.g., replies and likes of social media posts) in addition to statistics derived from both posts and metadata. Zhou et al. [38] investigate features derived from theories in social and forensic psychology. As examples of neural network models, CNNs [33], RNNs [4], and most recently context-aware language models such as ELMo [23, 2] and BERT [9, 7] have also been used. While CNN and RNN models ignore context information, BERT and ELMo can learn different meanings of words depending on the context. Among context-aware language models, BERT has shown state-of-the-art results in many NLP tasks [7]. Therefore, in our study, we encode content information with BERT.
2.2 Extrinsic Features for Determining Truthfulness of Claims
As an extrinsic feature, encoding the top relevant evidence pages retrieved by commercial search engines (e.g., Google) has been a widely preferred approach [29, 1, 15] for determining the truthfulness of claims or posts. However, Augenstein et al. [1] noted that this approach has the drawback of affecting veracity assessments when search results change over time. Firstly, this drawback hampers reproducibility. Secondly, the approach is not applicable to posts that cannot yet be supported or denied with evidence when they initially occur.
Another extrinsic feature is the information from previously analyzed claims. Similarity to previously fact-checked claims in the political domain has been studied as part of fact-checking systems [12, 6, 36] or as an information retrieval task [25]. Those studies aim to find claims or posts reporting on the same event. In contrast, we aim to learn the similarity of posts to previously detected fake news in the healthcare domain, not necessarily reporting on the same event.
Lastly, the credibility of user profiles [16, 37] and of source websites [17, 8, 20, 10] are also strong extrinsic features for determining truthfulness at an early stage [19]. To detect rumors and fake news on social media, Yuan et al. [37] used user credibility as a weak signal in their graph-based neural network model. Li et al. [16] combined the credibility of users with post features and post embeddings for rumor detection. Both studies require propagation and metadata information to encode the aforementioned features. Other studies [17, 20, 10] focus on determining the credibility of sources that mostly report political news; relying only on the credibility of political news sources would be limiting. Therefore, in our study we leverage the content of the Simple English Wikipedia and the source credibility information provided by Gruppi et al. [10].
In summary, ECOL utilizes the content of a post, its similarity to prior knowledge, and the credibility of the source URLs embedded in it to determine its truthfulness at an early stage.
3 Ecol Approach
The overall architecture of the ECOL framework is illustrated in Figure 1. Firstly, as a pre-processing step, the framework (1) lowercases all words, (2) fixes Unicode errors, (3) translates the text to the closest ASCII representation, and (4) replaces specific content with tags. Specifically, URLs are replaced with <URL>, emails with <EMAIL>, numbers with <NUMBER>, digits with <DIGIT>, phone numbers with <PHONE>, and currency symbols with <CUR>. Secondly, the framework encodes (1) content information from the post itself (Section 3.1), (2) the relation to the top 10 relevant fake news in the health domain (Section 3.2), and (3) the sources embedded within the post, by concatenating reliability tags and their Simple Wiki descriptions (Section 3.3). Next, it concatenates the encoded features and feeds them into a fully connected layer. Finally, a softmax layer classifies the features as fake or real to express their truthfulness.
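The exact pre-processing implementation is not reproduced here; the following regex-based helper is a minimal sketch of the steps listed above (lowercasing, ASCII folding, and tag substitution), with illustrative patterns rather than the framework's actual rules:

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Illustrative pre-processing sketch (not the framework's exact code)."""
    text = text.lower()
    # Replace currency symbols before the ASCII fold drops the non-ASCII ones
    text = re.sub(r"[$€£¥]", "<CUR>", text)
    # Fix Unicode errors and translate to the closest ASCII representation
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Replace specific content with placeholder tags
    text = re.sub(r"https?://\S+|www\.\S+", "<URL>", text)
    text = re.sub(r"\S+@\S+\.\S+", "<EMAIL>", text)
    text = re.sub(r"\+?\d[\d\s().-]{7,}\d", "<PHONE>", text)
    text = re.sub(r"\d{2,}(?:\.\d+)?", "<NUMBER>", text)
    text = re.sub(r"\d", "<DIGIT>", text)
    return text

print(preprocess("Drinking alcohol cures COVID-19! Read https://example.com or call +1 555 123 4567"))
```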
3.1 Content (C)
In this unit, ECOL tries to capture the writing style of fake and true posts. To learn the content information well, we encode the texts with BERT [7], a context-aware language model based on the transformer network [30] that has been pre-trained on a massive text corpus. BERT is adapted to specific NLP tasks by fine-tuning its pre-trained models. To obtain the post representations, we encode the input sequences of the post with the uncased base version of pre-trained BERT using the transformers library [35]. Uncased base BERT consists of 12 layers and 12 attention heads and outputs a 768-dimensional vector for each word in the post. The first token of each input sequence is the special [CLS] token, whose final hidden state is used as the content representation. We set the maximum sequence length to 128 and pad shorter texts.
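A minimal sketch of this content encoder with a recent version of the transformers library is shown below; during fine-tuning, gradients would flow through BERT rather than being disabled as in this inference-only example:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_post(text: str) -> torch.Tensor:
    """Return the final hidden state of the [CLS] token as the post
    representation (sketch; inference only)."""
    inputs = tokenizer(text, max_length=128, padding="max_length",
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # [CLS] vector, shape (1, 768)
```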
3.2 Prior Knowledge (PK)
To leverage the relation of a news event with previously published and confirmed fake news, we encode a post's relation to a set of similar fake news items disseminated before COVID-19. Given a post as a query, an ad-hoc semantic search engine retrieves the top 10 fake news items from a repository indexed with fake news in the health domain [5]. To obtain a fake news vector ($FN$), we encode each retrieved news item with BERT and take their average. Afterwards, we compute the cosine similarity between $FN$ and the post vector $P$ as $r = \cos(FN, P) = \frac{FN \cdot P}{\lVert FN \rVert \, \lVert P \rVert}$, where $r$ is a one-dimensional relatedness feature.
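A small sketch of this relatedness computation, with random tensors standing in for the BERT encodings of the post and of the retrieved fake news:

```python
import torch
import torch.nn.functional as F

def prior_knowledge_feature(post_vec: torch.Tensor,
                            retrieved_vecs: torch.Tensor) -> torch.Tensor:
    """Average the BERT vectors of the top-10 retrieved fake news (FN) and
    return the cosine similarity to the post vector P as the relatedness r."""
    fn = retrieved_vecs.mean(dim=0, keepdim=True)      # FN: shape (1, 768)
    return F.cosine_similarity(fn, post_vec, dim=1)    # r: shape (1,)

# Illustrative usage with random vectors standing in for BERT encodings
post_vec = torch.randn(1, 768)
retrieved_vecs = torch.randn(10, 768)  # top-10 retrieved fake news vectors
r = prior_knowledge_feature(post_vec, retrieved_vecs)
```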
3.2.1 Ad Hoc Search and Indexing
To obtain the similar fake news, we use Elasticsearch. We index the title and article body of each news item as text fields and, to add semantic retrieval capability to the search engine, we also encode the title and article with pre-trained Sentence-BERT [24]. The search engine retrieves the top 10 documents whose fields match the query and have high cosine similarity with it. If fewer than 10 documents are retrieved, the list of documents is padded with empty strings.
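A hedged sketch of how such a semantic index could be set up with the Python Elasticsearch client and Sentence-BERT; the index name, field names, Sentence-BERT checkpoint, and scoring script are illustrative assumptions, not the exact configuration used in ECOL:

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

# Sketch only: the "embedding" field must be mapped as a dense_vector
# (dims=768) when the index is created.
es = Elasticsearch("http://localhost:9200")
encoder = SentenceTransformer("bert-base-nli-mean-tokens")

def index_fake_news(doc_id: str, title: str, article: str) -> None:
    es.index(index="health_fake_news", id=doc_id, body={
        "title": title,
        "article": article,
        "embedding": encoder.encode(title + " " + article).tolist(),
    })

def retrieve_top10(post_text: str) -> list:
    query_vec = encoder.encode(post_text).tolist()
    body = {
        "size": 10,
        "query": {"script_score": {
            "query": {"match": {"article": post_text}},  # lexical field match
            "script": {  # re-score by cosine similarity of the embeddings
                "source": "cosineSimilarity(params.qv, 'embedding') + 1.0",
                "params": {"qv": query_vec},
            },
        }},
    }
    hits = es.search(index="health_fake_news", body=body)["hits"]["hits"]
    docs = [h["_source"]["article"] for h in hits]
    return docs + [""] * (10 - len(docs))  # pad with empty strings if < 10
```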
3.3 Source (S)
To encode the information of the sources, for each source we first unshorten any shortened links, such as the URL in Figure 1. Then, we extract the name of the source (e.g., thespoof). We then retrieve the reliability tag and the Simple Wiki description of the source and vectorize the source information by concatenating the retrieved information.
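A minimal sketch of this unshortening and source-name extraction step, assuming the requests and tldextract packages (the helper name is hypothetical):

```python
import requests
import tldextract

def source_name(url: str) -> str:
    """Resolve shortened links and return the registered domain name of the
    source, e.g. 'thespoof' (illustrative sketch)."""
    try:
        # Follow redirects to unshorten links such as bit.ly URLs
        resolved = requests.head(url, allow_redirects=True, timeout=5).url
    except requests.RequestException:
        resolved = url
    return tldextract.extract(resolved).domain
```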
3.3.1 Reliability
NELA-GT-2019 [10] is a dataset covering 260 news sources, proposed by Gruppi et al. The dataset contains source reliability labels from various assessment websites such as Media Bias/Fact Check (MBFC).
3.3.2 Simple Wiki Source Descriptions
The Simple English Wikipedia aims to provide an English encyclopedia for non-native English speakers and children. The entity descriptions in Simple English Wikipedia can be a clue to the trustworthiness of the sources. We downloaded the Simple English Wikipedia dump of February 2019.
4 Experiments
4.1 Dataset
The CONSTRAINT dataset [22] contains fake and real news about COVID-19 in the form of social media posts. Fake news samples were collected from various fact-checking websites and tools such as Politifact and the IFCN chatbot. Real news samples were collected from verified Twitter accounts and manually checked by the organizers. The dataset is split into train, dev, and test sets. Table 1 presents statistical details of the dataset.
| Split | Real (w) | Real (w/o) | Fake (w) | Fake (w/o) | Total |
|---|---|---|---|---|---|
| Train | 2321 | 1039 | 1002 | 2058 | 6420 |
| Dev | 780 | 340 | 327 | 693 | 2140 |
| Test | 779 | 341 | 319 | 701 | 2140 |
4.2 Baselines
We compare the ECOL framework against the baseline classifiers provided by the organizers: Support Vector Machine (SVM), Logistic Regression (LR), Gradient Boosting (GB), and Decision Tree (DT). The baseline classifiers are trained on term frequency-inverse document frequency (tf-idf) features.
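For illustration, a tf-idf baseline in the spirit of the organizers' classifiers can be sketched with scikit-learn (shown here with a linear SVM; the official baselines' exact hyper-parameters are not reproduced):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# tf-idf features followed by a linear classifier (illustrative sketch)
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])
# baseline.fit(train_texts, train_labels)
# predictions = baseline.predict(test_texts)
```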
4.3 Models
We compare and analyse the following variants of the ECOL framework:

- C uses solely content information as a feature.
- PK uses solely prior knowledge as a feature.
- C_PK uses the concatenation of content and prior knowledge as features.
- C_S uses the concatenation of content and source as features.
- C_PK_S uses the concatenation of content, source, and prior knowledge.
4.4 Implementation
We implemented the ECOL models using PyTorch Lightning.
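A minimal PyTorch sketch of the fusion head described in Section 3 is given below; the hidden size and the exact layer arrangement are assumptions and may differ from the released implementation:

```python
import torch
import torch.nn as nn

class EcolFusionHead(nn.Module):
    """Sketch of the fusion step: concatenate the content, prior-knowledge and
    source representations, pass them through a fully connected layer and
    classify the post as fake or real. Dimensions are illustrative assumptions
    (content: 768-d [CLS] vector, prior knowledge: 1-d relatedness, source:
    768-d source vector)."""

    def __init__(self, content_dim=768, pk_dim=1, source_dim=768, hidden_dim=256):
        super().__init__()
        self.fc = nn.Linear(content_dim + pk_dim + source_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, 2)  # two classes: fake, real

    def forward(self, content, prior_knowledge, source):
        features = torch.cat([content, prior_knowledge, source], dim=1)
        hidden = torch.relu(self.fc(features))
        return torch.softmax(self.out(hidden), dim=1)  # class probabilities
```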
5 Results and Discussion
We present the experimental results on the development and test sets in Table 2. We report Precision (P), Recall (R), and F1 scores per class, as well as accuracy and weighted P, R, and F1 scores as the models' overall performance. Table 2 contains two variants of each proposed model: an ensemble variant (C, PK, C_S, C_PK_S) that averages the predictions of models trained with random seeds 42, 36, and 0, and a single-model variant trained with random seed 42, which gave the highest F1 scores in our experiments. We entered the CONSTRAINT shared task with three entries: the average over the three C_PK_S models and the two single C_PK_S models with the highest F1 scores (random seeds 42 and 36). The best performing model, with random seed 42, ranked 14th among 167 submissions [21].
| Set | Model | Fake P | Fake R | Fake F1 | Real P | Real R | Real F1 | Acc | Overall P | Overall R | Overall F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dev | SVM | 92.07 | 94.41 | 93.22 | 94.79 | 92.59 | 93.68 | 93.46 | 93.48 | 93.46 | 93.46 |
| Dev | LR | 91.07 | 94.02 | 92.52 | 94.39 | 91.61 | 92.98 | 92.76 | 92.79 | 92.76 | 92.75 |
| Dev | GB | 83.41 | 90.20 | 86.67 | 90.36 | 83.66 | 86.88 | 86.78 | 87.03 | 86.78 | 86.77 |
| Dev | DT | 85.53 | 83.43 | 84.47 | 85.24 | 87.14 | 86.18 | 85.37 | 85.42 | 85.37 | 85.38 |
| Dev | C | 98.12 | 97.06 | 97.59 | 97.35 | 98.30 | 97.82 | 97.71 | 97.72 | 97.71 | 97.71 |
| Dev | PK | 97.69 | 95.20 | 96.43 | 95.72 | 97.95 | 96.82 | 96.63 | 96.67 | 96.64 | 96.64 |
| Dev | C_PK | 98.99 | 95.78 | 97.36 | 96.27 | 99.11 | 97.67 | 97.52 | 97.57 | 97.52 | 97.53 |
| Dev | C_S | 98.39 | 95.98 | 97.17 | 96.42 | 98.57 | 97.48 | 97.34 | 97.37 | 97.34 | 97.34 |
| Dev | C_PK_S | 98.51 | 97.25 | 97.88 | 97.53 | 98.66 | 98.09 | 97.99 | 98.00 | 97.99 | 97.99 |
| Dev | C | 99.28 | 95.20 | 97.20 | 95.78 | 99.38 | 97.55 | 97.38 | 97.47 | 97.38 | 97.39 |
| Dev | PK | 99.17 | 93.53 | 96.27 | 94.40 | 99.29 | 96.78 | 96.54 | 96.70 | 96.54 | 96.55 |
| Dev | C_PK | 99.49 | 94.71 | 97.04 | 95.38 | 99.55 | 97.42 | 97.24 | 97.35 | 97.24 | 97.24 |
| Dev | C_S | 99.49 | 95.10 | 97.24 | 95.71 | 99.55 | 97.59 | 97.43 | 97.52 | 97.43 | 97.43 |
| Dev | C_PK_S | 99.09 | 96.08 | 97.56 | 96.52 | 99.20 | 97.84 | 97.71 | 97.75 | 97.71 | 97.71 |
| Test | SVM | 92.20 | 93.92 | 93.05 | 94.37 | 92.77 | 93.56 | 93.32 | 93.33 | 93.32 | 93.32 |
| Test | LR | 90.08 | 93.43 | 91.72 | 93.81 | 90.62 | 92.19 | 91.96 | 92.01 | 91.96 | 91.96 |
| Test | GB | 83.39 | 90.59 | 86.84 | 90.70 | 83.57 | 86.99 | 86.92 | 87.20 | 86.92 | 86.91 |
| Test | DT | 85.39 | 84.22 | 84.80 | 85.80 | 86.88 | 86.34 | 85.61 | 85.62 | 85.61 | 85.61 |
| Test | C | 98.21 | 96.96 | 97.58 | 97.26 | 98.39 | 97.83 | 97.71 | 97.72 | 97.71 | 97.71 |
| Test | PK | 97.78 | 95.20 | 96.47 | 95.73 | 98.04 | 96.87 | 96.68 | 96.72 | 96.68 | 96.68 |
| Test | C_PK | 99.29 | 95.59 | 97.40 | 96.11 | 99.38 | 97.72 | 97.57 | 97.64 | 97.57 | 97.57 |
| Test | C_S | 98.69 | 96.08 | 97.37 | 96.51 | 98.84 | 97.66 | 97.52 | 97.56 | 97.52 | 97.53 |
| Test | C_PK_S | 99.10 | 96.96 | 98.02 | 97.29 | 99.20 | 98.23 | 98.13 | 98.15 | 98.13 | 98.13 |
| Test | C | 99.49 | 94.80 | 97.09 | 95.46 | 99.55 | 97.47 | 97.29 | 97.40 | 97.30 | 97.29 |
| Test | PK | 98.56 | 94.02 | 96.24 | 94.77 | 98.75 | 96.72 | 96.50 | 96.60 | 96.50 | 96.50 |
| Test | C_PK | 99.38 | 93.82 | 96.52 | 94.65 | 99.46 | 97.00 | 96.78 | 96.93 | 96.78 | 96.78 |
| Test | C_S | 99.79 | 94.90 | 97.29 | 95.56 | 99.82 | 97.64 | 97.48 | 97.59 | 97.48 | 97.48 |
| Test | C_PK_S | 99.59 | 95.29 | 97.39 | 95.88 | 99.64 | 97.72 | 97.57 | 97.66 | 97.57 | 97.57 |
[Figure 2: F1 scores of the models on posts grouped by the presence of links; (a) fake news posts, (b) real news posts.]
Applying a t-test at the 0.01 significance level shows that the proposed models significantly outperform the baselines. When we compare the content-based models (the two C variants) with the other proposed models, we first see that prior knowledge and content information (C_PK) complement the source information (C_S). Moreover, prior knowledge and content information help to identify fake news, whereas source information helps to identify real news. Therefore, among the proposed models, the C_PK_S variants achieve the highest F1 scores by balancing the predictions between real and fake news. However, the improvement is not significant: for instance, C misclassified only 49 samples, while C_PK_S classified 9 more samples correctly.
PK and C_PK, which incorporate prior knowledge, are the least successful among the proposed models. We observed that the indexing method of the retrieval unit yields false-positive predictions. Nevertheless, these models still outperform the official baselines, which implies that prior knowledge can be useful for fake news detection. For better retrieval in the healthcare domain, we plan to improve the indexing schema with semantic concepts that define health claim types, such as treatment and alternative medicine.
We also analyzed how the presence of links in posts changes the model predictions; Figure 2 reports the F1 scores of the models separately for posts with and without links. The presence of links in fake news drastically degrades the performance of the models (Figure 2a). For example, the F1 score of C_S drops from 98.77 to 93.22 when links are present. However, encoding source information (C_PK_S and C_S) improves the identification of fake news posts with links. When we analyse these links, we see that they point to Twitter accounts, medical websites, and deleted pages that appear neither in Simple English Wikipedia nor in the reliability dictionary. However, we also found samples in the test set that link to a fact-checking website (Politifact) but were annotated as fake, potentially yielding false predictions.
As seen in Figure 2b, content information is the decisive feature for identifying real news; prior knowledge and source information could not improve the prediction of real news posts with links. When we analyze the real news posts misclassified by the models, we see that although they are written in reporting language, they also contain judgemental language. For example, "Coronavirus: Donald Trump ignores COVID-19 rules with 'reckless and selfish' indoor rally [URL]" might confuse the models by combining the two language types.
6 Conclusion
In this paper, we presented a promising framework for the early detection of fake news on social media. The framework encodes the content of a post, prior knowledge, and the credibility of the sources referenced by URLs in the post. We analyzed the impact of each type of encoded information on detecting fake news in the healthcare domain and observed that combining the three perspectives leads to a more precise distinction between fake and real news. In future work, we will improve the source linking by using structured data such as Wikidata [32] in order to encode more source knowledge. For better retrieval in the healthcare domain, we also plan to index prior knowledge by categorizing it into semantic concepts such as cure, treatment, and symptoms.
Footnotes
- email: {ibaris,boukhers}@uni-koblenz.de
- https://www.independent.co.uk/news/world/middle-east/iran-coronavirus-methanol-drink-cure-deaths-fake-a9429956.html
- https://github.com/isspek/FakeNewsDetectionFramework
- docker.elastic.co/elasticsearch/elasticsearch:7.6.1
- https://www.healthnewsreview.org/
- https://mediabiasfactcheck.com/
- https://www.politifact.com/
- https://archive.org/details/simplewiki-20190201
- https://github.com/attardi/wikiextractor
- https://pytorch-lightning.readthedocs.io/en/0.7.1/introduction_guide.html
References
- [1] (2019) MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims. In EMNLP/IJCNLP (1), pp. 4684–4696.
- [2] (2019) CLEARumor at SemEval-2019 task 7: convolving ELMo against rumors. In SemEval@NAACL-HLT, pp. 1105–1109.
- [3] (2011) Information credibility on Twitter. In WWW, pp. 675–684.
- [4] (2018) Call attention to rumors: deep attention based recurrent neural networks for early rumor detection. In PAKDD (Workshops), Lecture Notes in Computer Science, Vol. 11154, pp. 40–52.
- [5] (2020) Ginger cannot cure cancer: battling fake health news with a comprehensive data repository. In ICWSM, pp. 853–862.
- [6] (2020) Linked credibility reviews for explainable misinformation detection. In ISWC (1), Lecture Notes in Computer Science, Vol. 12506, pp. 147–163.
- [7] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), pp. 4171–4186.
- [8] (2018) Belittling the source: trustworthiness indicators to obfuscate fake news on the web. CoRR abs/1809.00494.
- [9] (2019) BUT-FIT at SemEval-2019 task 7: determining the rumour stance with pre-trained deep bidirectional transformers. In SemEval@NAACL-HLT, pp. 1097–1104.
- [10] (2020) NELA-GT-2019: A large multi-labelled news dataset for the study of misinformation in news articles. CoRR abs/2003.08444.
- [11] (2020) The future of false information detection on social media: new perspectives and trends. ACM Comput. Surv. 53 (4), pp. 68:1–68:36.
- [12] (2017) ClaimBuster: the first-ever end-to-end fact-checking system. Proc. VLDB Endow. 10 (12), pp. 1945–1948.
- [13] (2013) Prominent features of rumor propagation in online social media. In ICDM, pp. 1103–1108.
- [14] (2018) The science of fake news. Science 359 (6380), pp. 1094–1096.
- [15] (2020) Connecting the dots between fact verification and fake news detection. In COLING, pp. 1820–1825.
- [16] (2019) Rumor detection by exploiting user credibility information, attention and multi-task learning. In ACL (1), pp. 1173–1179.
- [17] (2019) News source credibility in the eyes of different assessors. In TTO.
- [18] (2020) An exploration of how fake news is taking over social media and putting public health at risk. Health Information & Libraries Journal.
- [19] (2020) Can we spot the "fake news" before it was even written? CoRR abs/2008.04374.
- [20] (2019) NELA-GT-2018: A large multi-labelled news dataset for the study of misinformation in news articles. In ICWSM, pp. 630–638.
- [21] (2021) Overview of CONSTRAINT 2021 shared tasks: detecting English COVID-19 fake news and Hindi hostile posts. In Proceedings of the First Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation (CONSTRAINT).
- [22] (2020) Fighting an infodemic: COVID-19 fake news dataset. CoRR abs/2011.03327.
- [23] (2018) Deep contextualized word representations. In NAACL-HLT, pp. 2227–2237.
- [24] (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In EMNLP/IJCNLP (1), pp. 3980–3990.
- [25] (2020) That is a known lie: detecting previously fact-checked claims. In ACL, pp. 3607–3618.
- [26] (2017) Fake news detection on social media: a data mining perspective. SIGKDD Explor. 19 (1), pp. 22–36.
- [27] (2020) Leveraging multi-source weak social supervision for early detection of fake news. CoRR abs/2004.01732.
- [28] (2018) Oncology, "fake" news, and legal liability. The Lancet. Oncology 19 (9), pp. 1135.
- [29] (2018) The fact extraction and verification (FEVER) shared task. CoRR abs/1811.10971.
- [30] (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008.
- [31] (2018) The spread of true and false news online. Science 359 (6380), pp. 1146–1151.
- [32] (2014) Wikidata: a free collaborative knowledgebase. Communications of the ACM 57 (10), pp. 78–85.
- [33] (2017) "Liar, liar pants on fire": a new benchmark dataset for fake news detection. In ACL (2), pp. 422–426.
- [34] (2018) The spread of medical fake news in social media: the pilot quantitative study. Health Policy and Technology 7 (2), pp. 115–118.
- [35] (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45.
- [36] (2020) Untrue.news: a new search engine for fake stories. CoRR abs/2002.06585.
- [37] (2020) Early detection of fake news by utilizing the credibility of news, publishers, and users based on weakly supervised learning. In COLING, pp. 5444–5454.
- [38] (2020) Fake news early detection: a theory-driven model. Digital Threats: Research and Practice 1 (2), pp. 1–25.
- [39] (2020) A survey of fake news: fundamental theories, detection methods, and opportunities. ACM Comput. Surv. 53 (5).
