Characterizing Linguistic Attributes for Automatic Classification of Intent Based Racist/Radicalized Posts on Tumblr Micro-Blogging Website

Swati Agarwal
Indraprastha Institute of Information Technology
New Delhi, India
Email: swatia@iiitd.ac.in

Ashish Sureka
ABB Corporate Research
Bangalore, India
Email: ashish.sureka@in.abb.com
Abstract

Research shows that many like-minded people use popular microblogging websites to post hateful speech against various religions and races. Automatic identification of racist and hate-promoting posts is required for building social media intelligence and security informatics solutions. However, keyword-spotting techniques alone cannot accurately identify the intent of a post. In this paper, we address the ambiguity present in such posts by identifying the intent of the author. We conduct our study on the Tumblr microblogging website and develop a cascaded ensemble learning classifier for identifying posts with racist or radicalized intent. We train our model by identifying various semantic, sentiment and linguistic features from free-form text. Our experimental results show that the proposed approach is effective and that the emotion tone, social tendencies, language cues and personality traits of a narrative are discriminatory features for identifying the racist intent behind a post.

Intelligence and Security Informatics, Intent Classification, Machine Learning, Mining User Generated Content, Semantic Analysis, Sentiment and Tone Analysis, Social Media Analytics, Text Classification, Tumblr

I Introduction

Freedom of expression gives individuals the latitude to share their opinions and beliefs about anything. However, many like-minded people misuse this freedom to make offensive comments or to promote beliefs that can have a negative impact on society [1]. Research shows that these individuals or groups use popular microblogging websites (Twitter and Tumblr) for such activities [2][3]. We find that some users misuse freedom of speech to post abusive and aggressive comments about a targeted group of people, while others promote their beliefs about a certain religion or community. Existing literature shows that racism is not directed only at minority communities. There are users who post racist comments targeting existing like-minded groups, calling it reverse racism [4]. For example, anti-white bias groups post comments against white supremacy communities, while Islamophobic groups post hateful speech against Muslim communities. We use the terms blogger, author, user and narrative interchangeably. Based on our analysis, we broadly divide these groups into two categories: Religion and Race. Figures 1(a) and 1(b) show examples of two Tumblr posts in which bloggers mention the Islam religion. In Figure 1(a), the intention of the author is to provoke his Muslim followers towards Jihad and to develop a willingness to sacrifice themselves for their religion, whereas in Figure 1(b), the intention of the author is to raise awareness that Islamophobic and other hate groups should stop misunderstanding the Islam religion. This post was made on March , when the #StopIslam hashtag was trending on Twitter. Similarly, Figures 1(c) and 1(d) show examples of two different Tumblr posts in which authors talk about black communities. In Figure 1(c), the intent of the author is to make a hateful and offensive post targeting various communities, while in Figure 1(d) the author’s intent is to highlight the challenging life of Black people in America and to show support for them.

In this paper, we conduct our study on the Tumblr microblogging website and address the challenge of mining the intention of a narrative behind such posts.

(a) Topic- Religion, Intent-Yes
(b) Topic- Religion, Intent- No
(c) Topic- Race, Intent- Yes
(d) Topic- Race, Intent-No
Fig. 1: Concrete Examples of Tumblr Posts Showing the Different Topics of Racism and Different Intent of Bloggers

Intent mining from free-form social media text is a technically challenging problem due to the presence of multilingual scripts, incorrect grammar, misspelled words, short text, acronyms and abbreviations, sarcasm and opinion-based posts. We also find examples of ambiguous content that make a post difficult to classify even for human annotators. For example, in a Tumblr post “Yes I enjoyed and actually love CACW but I am so pissed off that the people who suffer / die In the end are a woman (Peggy dies) and two black men (T’challa’s father dies and Rhodes is paralyzed) Thank you marvel”, the author has posted a movie review with the apparent intention of highlighting racism and targeting a community of viewers who did not find it racist. However, when we manually inspected the blogger’s page, we found that the author is actively involved in bashing and using foul language against certain blogs supporting the MCU (Marvel Cinematic Universe); hence the intent of the author is not to support women or black people but to make negative posts about the MCU. Further, it is technically challenging to identify the intent of a post when a naive post has similar terms as a radicalized or racist post. For example, a post P1: “All types of Jihad is to establish peace for all & Sharia also promote peace so there is no need to fix anything @simafaysal @profdstone”, posted by an author with the screen name ‘Prisoner’, and another post P2: “This settles it? ‘Jihad is to establish peace’ ‘there is no need to fix anything’ spoken like a true ‘prisoner’” have similar content. Here, the intention of P1 is to show the author’s support for Jihad and terrorism, while the intention of P2 is to make a sarcastic comment on P1 and its author’s belief. Further, despite the presence of hateful comments in a post, the intention of the author can still be naive. For example, in January , Saudi Arabia released an official video on ‘how to properly beat Muslim women’ with an intention of targeting women communities. Recently, as the video was published worldwide, users on microblogging websites shared the video and posted hateful comments in order to oppose it, with no racist intent. In contrast, some users posted comments opposing the video while targeting the whole Muslim community with racist intentions, bringing ambiguity to their posts. Tumblr is popularly known for the use of gif images, where users share their opinions by embedding reaction gifs in their posts. It also allows users to share content from external sources such as news websites or blogs (wordpress). Users can disguise themselves by sharing only articles or external URLs in their posts. Therefore, automatic identification of a narrative’s intentions in such posts is a technically challenging problem.

The work presented in this paper is motivated by the need to develop a system for automatically identifying the intent of a racist and radicalized post. The specific research aims of the study presented in this paper are the following:

  1. To investigate the efficacy of natural language processing techniques on a microblogging dataset for topic and intent classification.

  2. To investigate the application of linguistic features such as taxonomy, emotions, language cues, personality traits and text semantics for classifying the intent of Tumblr posts.

  3. To conduct an empirical analysis on a real-world dataset and examine the effectiveness of the proposed one-class text classification approach, and to compute the relative influence of each linguistic feature for identifying posts having racist intent.

II Related Work

In this section, we discuss work closely related to the study presented in this paper. We conduct a literature survey in the area of intent mining on social media platforms and divide the related work into the following two categories:
Commercial Intent Classification: Wang et al. [5] present a graph-based semi-supervised learning technique to classify intent tweets. They combine keyword-based flagging (referred to as intent keywords) and a graph regularization method for classifying tweets into six categories. Purohit et al. [6] present a hybrid approach that combines knowledge-guided patterns and a bag-of-tokens model for intent classification of short text. They conduct a study on a Twitter dataset of crisis events and address the problems of ambiguity and sparsity in order to classify the intent of a narrative. Ding et al. [7] present a transfer learning based convolutional neural network model for identifying users’ buying or consumption intentions from Sina Weibo, a Chinese microblogging service (http://weibo.com/). Geetha et al. [8] present a lexicon (sentiment Wordnet dictionary) based bootstrapping method to measure the polarity of opinion in short text data. They conduct a study on Twitter data and compare their results for movie reviews, election results and product reviews. Wang et al. [9] present a graph-based ranking model to identify commercial intent from trending topics on microblogging platforms.
Racism/Radicalization Intent Classification: Smith et al. [2] conduct a quantitative content analysis on public documents to distinguish radical groups from non-radical groups. Prentice et al. [10] conduct a quantitative text analysis on documents originating from extremist websites. They present a ‘Conduct and Composition Analysis’ technique to classify the persuasion behavior of online extremist media, comparing documents posted before and after the Israeli activities in Gaza. Our literature survey reveals that there has been a lot of work on identifying commercial intent from free-form text, whereas automatic detection of racist posts on social media platforms such as Tumblr is a relatively unexplored area.

III Research Contributions

In contrast to the existing work, our paper makes the following novel contributions:

  1. To the best of our knowledge, the study presented in this paper is the first work on detecting racism and radicalization based on the intent of the narrative, unlike previous keyword-spotting methods.

  2. We apply natural language processing techniques on Tumblr posts to identify discriminatory features for intent classification.

  3. We publish the first semantically and sentiment-enriched dataset of Tumblr posts and make our data publicly available for benchmarking and extension (http://dx.doi.org/10.17632/hd3b6v659v.2) [11].

  4. The study presented in this paper is an extended version of our work Agarwal et al., accepted as a short paper at the European Intelligence and Security Informatics Conference (EISIC) [12]. Due to the small page limit for short papers (at most four pages) at EISIC (http://eisic.org/eisic2016/), several aspects, including results and details of the proposed approach, were not covered there. This paper presents the complete and detailed description of our work on intent-based classification of racist and radicalized posts made on the Tumblr microblogging website.

IV Problem Statement

Given a dataset of Tumblr Posts , , a set of topics and a target class ; identify the intent of when .

Based on the definition of freedom of expression by Joshua Cohen [1], we define a Tumblr post as a racist intent post if 1) the topic of the content relates to a race or a religion and 2) the post targets a community in an offensive or persuasive manner (in a recognizable way). In order to identify a racist or radicalized intent post, we propose the following two hypotheses:

  1. In the absence of topic related key-terms, natural language processing can be an efficient approach to identify hidden taxonomy of a Tumblr post.

  2. Sentiment and semantic enrichment of text can be two discriminatory features for identifying the language of narrative and classifying the intent posts.

V Experimental Setup

Data Collection: We conduct our experiments on an open-source, real-time dataset extracted from the Tumblr microblogging website. We perform a manual inspection and find the most popular Tumblr posts having racist and radicalized intent. We extract the list of unique tags associated with these posts and create a lexicon of the top K tags most commonly used by racist or radicalized groups, for example, #islamophobia, #islam is evil, #supremacy, #blacklivesmatter, #white racism, #jihad, #isis and #white genocide. We implement a bootstrapping method to create our dataset and use this lexicon as seed tags for the Tumblr Search API (https://www.tumblr.com/docs/en/api/v2). For each tag, we extract only textual posts (text and quote) and extend our lexicon with the other (unique) related tags associated with these posts. We repeat this process until we obtain the desired number of posts or the process converges (it starts extracting duplicate posts). Using the Tumblr Search API, we were able to extract a total of text posts made by unique bloggers consisting of unique tags. Table I shows the complete schema of the additional metadata extracted for each post and unique blogger. The aim of the study presented in this paper is to build a one-class text classifier for identifying racist and hate-promoting intent posts. Therefore, we conduct our experiments on the post content (referred to as description in Tumblr). Since Tumblr generates a new identification number for each post (re-blogged or posted), posts with unique Post IDs can still carry the same content; we therefore discard % () of the posts having similar or duplicate content to remove this bias from our data.
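The tag-driven bootstrapping loop described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: `fetch_posts` is a hypothetical callable standing in for the Tumblr API request, the post dictionaries and their keys are assumed, and convergence is approximated by the tag queue running empty.

```python
from collections import deque


def bootstrap_posts(seed_tags, fetch_posts, max_posts):
    """Grow a post collection and a tag lexicon from a list of seed tags.

    fetch_posts(tag) is a hypothetical stand-in for the API call; it is assumed
    to return dicts with the keys "id", "type", "description" and "tags".
    """
    queue = deque(seed_tags)
    lexicon = set(seed_tags)
    posts = {}
    while queue and len(posts) < max_posts:
        tag = queue.popleft()
        for post in fetch_posts(tag):
            # Keep only textual posts, as in the data collection step.
            if post["type"] not in ("text", "quote"):
                continue
            # A post that was already collected adds nothing new.
            if post["id"] in posts:
                continue
            posts[post["id"]] = post
            # Extend the lexicon with unseen tags attached to this post.
            for new_tag in post["tags"]:
                if new_tag not in lexicon:
                    lexicon.add(new_tag)
                    queue.append(new_tag)
            if len(posts) >= max_posts:
                break
    return list(posts.values()), lexicon


def drop_duplicate_content(posts):
    """Remove posts whose description repeats an earlier one (e.g. re-blogs)."""
    seen = set()
    unique = []
    for post in posts:
        # Normalise case and whitespace so trivial variations collapse together.
        key = " ".join(post["description"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique
```

The second helper mirrors the de-duplication step: re-blogged posts receive fresh identifiers, so duplicates have to be detected on content rather than on Post_ID. Exact matching after normalisation is an assumption here; detecting "similar" content would need a fuzzier comparison.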

Posts
Post_ID Timestamp GMT Blogger URL Type Tags Num_Tags Notes Re-Blogged_From Title Description
Blogger
Blogger_ID Ask Ask_Anon #Likes #Posts Title Description
TABLE I: Detailed Schema of Tumblr Database Consisting of Posts and Bloggers’ Metadata
Fig. 2: Basic Statistics of English Language Posts from Experimental Dataset
          | A2 Topic | A2 NA
A1 Topic  | 292      | 24
A1 NA     | 13       | 2127
(a) Topic Annotation

          | A2 Intent | A2 NA
A1 Intent | 103       | 2
A1 NA     | 12        | 175
(b) Intent Annotation

                       | Topic | Intent
Observed Agreement Po  | 0.98  | 0.95
Random Agreement Pr(e) | 0.77  | 0.51
Kappa Coefficient      | 0.91  | 0.95
(c) Cohen’s Kappa Coefficient
TABLE II: Inter-Annotator Agreement Results for Topic and Intent Labelling of Experimental Dataset (A1 and A2 denote the two annotators). Source: Agarwal et al. [12]
Fig. 3: A General Research Framework for the Experimental Setup and Proposed Methodology

The study presented in this paper focuses on intent mining on English language posts. We identify the language of each record by applying the Alchemy language detection API (http://www.alchemyapi.com/api/language-detection) on the post description. Figure 2 reveals that only % ( out of ) of the posts have English language content and posts were identified as non-English. The language of the remaining posts (% of the data) was identified as ‘unknown’ due to insufficient content in the post description, for example, posts containing only URLs. Figure 2 reveals that out of posts contain only URLs. We conduct our experiments on English language posts and discard the non-English and unknown language records. We apply various natural language processing techniques for semantic and sentiment enrichment of our data (discussed in Section VI-A). We enhance our data and make it publicly available so that our experiments can be used for benchmarking and comparison [11]. Our dataset is the first published dataset of Tumblr posts and bloggers labeled with various sentiment and semantic features and can be downloaded from Mendeley Data (https://data.mendeley.com/datasets/hd3b6v659v/2). Figure 2 summarizes the statistics of our experimental dataset. Despite being a microblogging website, Tumblr has no word or character limit and allows users to make long posts and to tag them with any number or length of keywords. We remove all noisy text from the post descriptions and tags, including special characters, emoticons and extra white spaces, and compute their length. Data statistics reveal that % of the posts have word length between and while posts have length greater than words. Similarly, % ( out of ) of unique tags have a word length between to while unique tags have a length between to words.

Data Annotation: We use English language posts for annotation, which span only % of the extracted data. Since we use a bootstrapping method to collect our data, it extracts a large number of noisy posts that do not belong to the defined topics (race and religion). Therefore, we first identify the topic-related posts and later label them as intent (racist/radicalized) or unknown (we do not know the intent of the author). To annotate these posts, we employ two annotators with to years of experience of using the Tumblr website. Each annotator first labels a post as topic or unknown (NA) based on the content description and the tags associated with the post. If a post is annotated as topic, the annotators further label it as intent or unknown (NA). To create ground truth for our data, we measure the inter-annotator agreement and compute Cohen’s Kappa coefficient between both annotations.

Table II shows the results of topic and intent annotation performed on posts. Table II(a) reveals that 2419 posts (292 topic and 2127 unknown) receive the same label from both annotators. We discard the remaining 37 posts with inconsistent annotation. Both annotators further label the topic posts as intent or unknown. Table II(b) reveals that the annotators agree on 278 posts (103 intent, 175 unknown) while the remaining 14 posts are labeled inconsistently. Table II(c) shows the value of Cohen’s Kappa coefficient between the annotators for both topic and intent annotation. Results reveal that the annotators agree more than % of the time. Figure 2 shows that the intent posts are only % of topic posts and only % of the complete experimental dataset, revealing that the labeled data is highly imbalanced. Since we use a tag-search-based bootstrapping method, we analyze all the tags extracted during the process and find that this imbalance arises from various limitations of user-generated tags: for example, the presence of noisy content (spelling errors), long text, multilingual tags, the use of featured tags, and tags that redirect to non-topic posts such as ‘vote’, ‘lol’, ‘media’, ‘news’, ‘life’ and ‘travel’.
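For reference, the agreement statistics of Table II can be derived from a 2x2 agreement table as sketched below. The function name and argument order are our own; the counts are the cell values of Table II(a) and II(b). Values recomputed from raw counts in this way need not coincide exactly with the rounded figures reported in Table II(c).

```python
def cohens_kappa(both_pos, a1_only, a2_only, both_neg):
    """Return (observed agreement, chance agreement, kappa) for two annotators.

    both_pos / both_neg: posts given the same label by both annotators.
    a1_only: posts labelled positive by annotator A1 only.
    a2_only: posts labelled positive by annotator A2 only.
    """
    total = both_pos + a1_only + a2_only + both_neg
    observed = (both_pos + both_neg) / total
    # Marginal probability that each annotator assigns the positive label.
    a1_pos = (both_pos + a1_only) / total
    a2_pos = (both_pos + a2_only) / total
    # Agreement expected if both annotators labelled independently at random.
    expected = a1_pos * a2_pos + (1 - a1_pos) * (1 - a2_pos)
    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa


# Cell counts taken from Table II(a) and II(b).
topic_agreement = cohens_kappa(292, 24, 13, 2127)
intent_agreement = cohens_kappa(103, 2, 12, 175)
```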

VI Proposed Approach

Figure 3 shows the high-level architecture of the proposed approach, which consists of three phases: Data Extraction, Feature Identification and Classification. Section V describes the bootstrapping method used for data collection and the inter-annotator agreement used for creating ground truth. We describe the remaining two phases in the following sections:

VI-A Features Identification

Based on the prior literature and our hypothesis design, we create our feature space by analyzing the linguistic features (semantic and sentiment tone) of Tumblr posts. We divide our feature set into three categories: Topic Modeling, Tone Analysis and Semantic Tagging. We also discuss other contextual metadata features that can be extracted from Tumblr posts but are not applicable to intent classification.

Topic Modeling: The existing literature shows that there has been a lot of work in the area of mining user-generated content on social media related to offensive speech [13], racism and radicalization [14][15]. However, our analysis and annotation reveal that a post can be an intent post even without topic-specific key-terms; for such posts, keyword-based classification methods do not work accurately and generate a large number of false alarms [3]. Therefore, we use statistical and natural language processing techniques to perform topic modeling on Tumblr posts. We use the Alchemy Taxonomy API (http://www.alchemyapi.com/products/alchemylanguage/taxonomy) to classify a post into the most likely topic and sub-topic categories. Alchemy API supports over categories broadly divided into topics. Sub-topic categories allow us to identify the more focused and targeted topic of a post (up to levels of hierarchy), for example, society/crime/personal offense/hate crime. We also use the Alchemy Concept Tagging API (http://www.alchemyapi.com/api/concept-tagging) to identify hidden concepts in the text in a manner similar to human annotation. Alchemy API learns about a post from linked data resources (http://www.alchemyapi.com/api/concept/ldata.html) such as freebase, dbpedia and yago, and tags the concepts that are highly likely to be related to the given text. For example, for a Tumblr post “If the Arabs put down their weapons today, there would be no more violence. If the Jews put down their weapons today, there would be no more Israel.”, Alchemy tags “Ashkenazi Jews”, “Palestinian people” and “Jewish ethnic divisions” with a confidence score of , and respectively. We use these concepts to perform the topic modeling of a text along with the taxonomy. The API returns a confidence score for each taxonomy label, conveying how likely the post is to belong to the derived category. We discard a category from the taxonomy and concept lists if its confidence score is below %.
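The confidence-based pruning of taxonomy and concept lists amounts to a simple filter, sketched below. The `(label, score)` pair representation and the function name are assumptions made for illustration, and the cut-off is left as a parameter rather than fixed to a particular value.

```python
def keep_confident(labels, min_confidence):
    """Drop (label, score) pairs scoring below the cut-off, highest score first.

    labels: iterable of (name, confidence) pairs, e.g. taxonomy categories or
    concepts returned for one post. The pair layout is assumed for this sketch.
    """
    kept = [(name, score) for name, score in labels if score >= min_confidence]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

For instance, with hypothetical scores, `keep_confident([("society/racism", 0.72), ("travel", 0.12)], 0.4)` retains only the first label.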

Sentiment and Tone Analysis: Inspired by the prior literature [10], we investigate the language of the narrative by analyzing various types of sentiments and personality traits in a post, such as document sentiment, social tone, writing tone and emotions. We use the Alchemy Document Sentiment API (http://www.alchemyapi.com/api/sentiment-analysis) to identify the document-level polarity of the overall sentiment of a post. We define five categories of sentiment polarity: strongly negative, negative, neutral, positive and strongly positive, and categorize each post based on its sentiment score. The sentiment of a document or post differs from the tone analysis of the content. Sentiment analysis can only identify the positive and negative polarity of a post, while tone analysis measures the level of three categories of tone: emotion, social and writing. We conduct a linguistic analysis on Tumblr posts using the IBM Watson Tone Analyzer API (https://tone-analyzer-demo.mybluemix.net).

Fig. 4: Example of Emotion, Social and Writing Tone Features Computed for a Tumblr Post, Topic: Race, Intent: No

Emotion tone analysis examines the text of a post and gives a distribution over the emotions joy, fear, sadness, anger and disgust. Social tendencies capture the personality traits expressed in the text, which include openness, conscientiousness, extraversion, agreeableness and emotional range of a narrative. Writing tone identifies the language cues of the author in the context of the content written in a Tumblr post. It includes analytical, confident and tentative styles of writing. The Tone Analyzer API analyzes the content of a post and computes two scores (document level and sentence level) for all three categories of tones. Since the text length of posts in our experimental dataset varies from to   words, we select only document-level measures of these tones. Similar to the sentiment score, we create a feature vector for each tone and categorize each post based on the confidence score: very low, low, medium, high and very high. Figure 4 shows a concrete example of a Tumblr post related to the Race topic and shows the level of emotion tone, language and personality traits of the author.
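The mapping from a numeric score to one of five ordered categories can be sketched as below. The cut points are not stated in the text, so equal-width bins are assumed here, as are the score ranges passed in by the caller (for example -1 to 1 for sentiment and 0 to 1 for tone scores).

```python
SENTIMENT_LEVELS = ("strongly negative", "negative", "neutral",
                    "positive", "strongly positive")
TONE_LEVELS = ("very low", "low", "medium", "high", "very high")


def to_level(score, low, high, labels):
    """Map a score in [low, high] onto ordered labels using equal-width bins.

    Equal-width binning is an assumption of this sketch, not a documented choice.
    """
    if not low <= score <= high:
        raise ValueError("score outside the expected range")
    width = (high - low) / len(labels)
    # The upper bound itself falls into the last bin.
    index = min(int((score - low) / width), len(labels) - 1)
    return labels[index]
```

A post is then represented by one categorical value per sentiment or tone dimension rather than by the raw score.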

Semantic Tagging: Semantic tagging of a post identifies the semantic role of each term present in the content. It also identifies the hidden phrases that play a major role in the post. We use the UCREL Semantic Analysis System (USAS) (http://ucrel.lancs.ac.uk/usas/) to semantically tag each post in our dataset. USAS contains a hierarchy-based lexicon of categories with major labels at the top of the hierarchy. Every semantic tag in a post is composed of a general or high-level label and a numeric value showing the division of that label in the lexicon. A numeric value after the decimal point shows a further sub-division of categories in the hierarchy. For example, the term “refugee” is tagged as “M1/S2mf”, where M1 denotes the tag ‘moving from one location to another’, S2 denotes ‘people’ and mf denotes the ‘gender’. We use USAS for semantic tagging because it tags not only each word of the document but also any multi-word units in the post. For example, the term “New York Times” is tagged as “New_Z3c[i4.3.1 York_Z3c[i4.3.2 Times_Z3c[i4.3.3”, where Z3 denotes the name of a company, c denotes an anaphora, i denotes a multi-word unit and the following numeric terms present the number of words present in a unit (). We remove all punctuation and special characters (tagged as PUNC) from the semantically tagged content and decode all remaining terms with their respective labels in the tag hierarchy (the list of all semantic tags published by USAS is available at http://ucrel.lancs.ac.uk/usas/semtags.txt). USAS tags a term as Z99 if the term is not identified and not present in the USAS database. However, we do not remove such terms from the tagged content, because USAS labels as Z99 various topic-specific terms that are important for intent identification, for example ‘Jihadist’, ‘racial’, ‘anti-white’, ‘pro-black’ and joined words such as ‘BlackLivesStillMatter’. The Z99 label also covers terms with hashtags, URLs, misspelled words, acronyms and abbreviations.
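The post-processing of tagged text can be sketched as follows, assuming tagged output in the `word_TAG` form used in the examples above. The regular expression and helper names are illustrative assumptions; they cover only the tag shapes shown in this section, not the full USAS output format.

```python
import re

# A category code is a capital letter followed by digits, optionally with
# dotted sub-divisions (for example "M1" or "A5.1"); anything after it
# (such as "mf" or "c") is treated as a suffix and ignored in this sketch.
TAG_PATTERN = re.compile(r"^([A-Z]\d+(?:\.\d+)*)(.*)$")


def parse_token(token):
    """Split one 'word_TAG' token into (word, category codes, multi-word marker)."""
    word, _, tag = token.rpartition("_")
    # A '[' introduces the multi-word unit marker, e.g. 'Z3c[i4.3.1'.
    tag, _, unit = tag.partition("[")
    codes = []
    for part in tag.split("/"):
        match = TAG_PATTERN.match(part)
        # Labels such as PUNC do not follow the letter-digit scheme; keep them whole.
        codes.append(match.group(1) if match else part)
    return word, codes, unit or None


def semantic_labels(tagged_text):
    """Return (word, codes) pairs, dropping punctuation but keeping Z99 terms."""
    result = []
    for token in tagged_text.split():
        word, codes, _ = parse_token(token)
        if "PUNC" in codes:
            continue
        result.append((word, codes))
    return result
```

Unrecognised terms tagged Z99 pass through unchanged, in line with the decision above to retain them.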

Contextual Metadata: The Tumblr API allows us to extract the following contextual metadata associated with each Tumblr post: number of tags, terms used in the tags, number of notes (reblog + like count) and links to multimedia content such as an image, video or audio attached to the post. By further mining the content of a post, we can extract the following contextual information: hashtags, URLs, emoticons and Internet slang. However, due to various limitations, we exclude this contextual metadata from our feature space. 1) As discussed in Section V, the length of unique tags in our dataset varies from to and the tags contain a large amount of noisy text (multilingual terms, misspelled words). Tags are user-generated content and a Tumblr post can have any number of tags (up to in our dataset) or no tags at all. Further, the presence of a comma in a long sentence splits a tag into two separate terms. Given the length of tags in our dataset, the number of tags cannot be a discriminatory feature. 2) For a given tag, the Tumblr API allows us to extract only most recently published posts. These posts naturally have fewer reblogs or likes (referred to as notes) than posts that contain featured or popular tags or that were uploaded before the current timestamp. Hence, the number of notes is not a valid feature for our experimental data. 3) Since we extract only textual posts for our analysis, our dataset does not contain any multimedia content such as an image, video or audio attached to the post description. 4) We conduct an exploratory data analysis on all topic-related posts and our data reveals that, for both intent and unknown posts, there are very few (up to a maximum of ) posts that contain any of hashtags (hashtags in Tumblr posts are not clickable and searchable), emoticons, Internet slang (usually present in tags rather than in the post content), @user mentions or external URLs.
We exclude contextual metadata from our feature space as it is not discriminatory for intent or topic classification.

VI-B Classification

The third phase of our proposed framework is a cascaded ensemble learning based classifier primarily consisting of two stages: topic classification and intent classification. We train our model from feature vectors created in Phase 2 and perform one-class classification on Tumblr posts.

Topic Classification: To identify the posts that belong to a defined topic (Race or Religion), we use the topic modeling linguistic features extracted using natural language processing. We take a random sample of posts out of , annotated as topic posts and extract their taxonomy and concepts from the feature space. We create two independent lexicons of these concepts and labeled topics that have a confidence score above . We manually filter the list of taxonomy labels and finalize the following labels that strictly belong to the topic of this study: religion and spirituality, society/unrest and war, society/racism, society/personal offense/hate crime, law, govt & politics/espionage and intelligence/terrorism and law, govt & politics/legal issues/human rights.
We use a look-up based method and check whether the post belongs to any of these taxonomies with a confidence score above . If so, we classify it as a topic post. However, if a post contains a wide range of taxonomies (>), we identify the top K concepts in the text and check whether they exist in the concept lexicon of labeled topic posts. This stage of the cascaded classifier is a one-class classifier that takes the complete experimental dataset as input and separates topic-related posts from unknown posts.
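One reading of this look-up rule is sketched below. The topic labels are those listed above; the thresholds (`min_confidence`, `max_taxonomies`, `top_k`), the prefix matching of sub-categories and the order in which the two checks are applied are assumptions made for illustration, since the text does not fix them.

```python
TOPIC_LABELS = (
    "religion and spirituality",
    "society/unrest and war",
    "society/racism",
    "society/personal offense/hate crime",
    "law, govt & politics/espionage and intelligence/terrorism",
    "law, govt & politics/legal issues/human rights",
)


def matches_topic(label, topic_labels=TOPIC_LABELS):
    """True if a taxonomy label equals a topic label or is one of its sub-categories."""
    label = label.strip("/").lower()
    return any(label == t or label.startswith(t + "/") for t in topic_labels)


def is_topic_post(taxonomy, concepts, concept_lexicon,
                  min_confidence, max_taxonomies, top_k):
    """Classify one post as topic (True) or unknown (False).

    taxonomy and concepts are lists of (name, confidence) pairs for the post;
    concept_lexicon is the set of concepts collected from labeled topic posts.
    """
    confident = [label for label, score in taxonomy if score > min_confidence]
    if len(confident) > max_taxonomies:
        # Too many candidate categories: fall back to the top-K concepts.
        ranked = sorted(concepts, key=lambda item: item[1], reverse=True)
        return any(name in concept_lexicon for name, _ in ranked[:top_k])
    return any(matches_topic(label) for label in confident)
```

Keeping the thresholds as parameters makes it straightforward to examine how sensitive the topic stage is to each of them.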

Code Grouped Features
F1 Document Sentiment
F2 Semantic Tagging
F3 Emotion {Anger, Fear, Joy, Disgust, Sadness}
F4 Writing {Analytical, Confident, Tentative}
F5 Social {Openness, Conscientiousness, Extraversion, Agreeableness and Emotional Range}
TABLE III: Feature Codes and Grouping of Similar Feature Vectors. Source: Agarwal et al. [12]
(a) Decision Tree
(b) Naive Bayes
(c) Random Forest
(d) Decision Tree
(e) Naive Bayes
(f) Random Forest
Fig. 5: Percentage Fall in Accuracy of One-Class Classifiers During Leave-P-Out Compilation (P=1- One Feature (Top), P=2- Two Features (Bottom)). Source: Agarwal et al. [12]

Intent Classification: The intent of a post (consisting of free-form text) cannot be fully determined by mining the keywords in the content alone. It also requires understanding and predicting the psychological tendency, sentiment tones and language of the narrative, as well as analyzing the semantic role of the topic-related keywords used in the post. We perform classification on Tumblr posts by training our model on sentiment, semantic and language-cue based features of a text. At a high level, we create a vector space over the feature set (F1 to F5), which is further categorized into unique vectors. Table III shows the list of all features extracted and grouped into feature vectors. We define intent classification as a one-class classification problem. Therefore, our training data contains only positive class (intent) posts. We implement three different one-class classifiers (Random Forest (RF), Naive Bayes (NB) and Decision Tree (DT)) and compare their accuracy on the posts classified as topic in Stage 1. We train our model for each classifier and perform fold cross validation. As discussed in Section V and shown in Figure 2, only % of the posts are labeled as intent posts, making our experimental dataset highly imbalanced. Further, the intent classifier takes as input only the posts classified as topic by the topic classifier, which is again a small subset of the whole dataset. Therefore, we select classification algorithms that work with small training data.
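The leave-P-out analysis over the feature groups of Table III (reported in Fig. 5) can be organised as in the sketch below. Here `evaluate` is a hypothetical callable that trains and scores a classifier on the given feature groups, and the percentage fall is computed relative to the all-features baseline, which is an assumption about how the fall is defined.

```python
from itertools import combinations

# Feature group codes from Table III.
FEATURE_GROUPS = ("F1", "F2", "F3", "F4", "F5")


def leave_p_out(groups, p):
    """Yield (left_out, kept) tuples for every way of removing p feature groups."""
    for left_out in combinations(groups, p):
        kept = tuple(g for g in groups if g not in left_out)
        yield left_out, kept


def ablation_report(evaluate, p, groups=FEATURE_GROUPS):
    """Percentage fall in the score for each set of left-out feature groups.

    evaluate(kept_groups) is a hypothetical scoring function returning, for
    example, the accuracy of a classifier trained on those groups only.
    """
    baseline = evaluate(tuple(groups))
    report = {}
    for left_out, kept in leave_p_out(groups, p):
        report[left_out] = 100.0 * (baseline - evaluate(kept)) / baseline
    return report
```

With five feature groups this gives 5 configurations for P=1 and 10 for P=2, matching the two rows of panels in Fig. 5.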

VII Performance Evaluation

As described in Section VI-B, the proposed method is a cascaded ensemble learning classifier in which the topic classifier takes the complete experimental dataset as input while the intent classifier takes its input from Stage 1. In this section, we present the accuracy results of each classifier and discuss the influence of the topic classifier’s accuracy on intent classification. Based on the inter-annotator agreement results, we evaluate the accuracy of our classifier by comparing the predicted labels against the actual labeled class. We conduct our experiments on the 2419 posts consistently labeled by both annotators. The proposed topic classifier classifies 346 posts as the target (topic) class and 2073 posts as unknown. Table IV reveals that there is a misclassification of % and % in identifying target and outlier (unknown) posts. Since the focus of our study is to identify all posts that have racist or radicalized intent, our aim is to achieve high precision as well as high recall. Our results reveal that for topic classification, we achieve a precision of % (253/(253+93)) and a recall of % (253/(253+39)).
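The precision and recall of the topic classifier follow directly from the confusion matrix counts in Table IV, as the short sketch below illustrates. The function name is our own; the counts are those reported in the table.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Counts from Table IV: TP=253, FP=93, FN=39.
topic_precision, topic_recall = precision_recall(tp=253, fp=93, fn=39)
```

Precision penalises unknown posts that slip through as topic posts, whereas recall penalises topic posts that are missed and therefore never reach the intent classifier.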

                    Predicted
                    Topic      Unknown
Actual   Topic      TP=253     FN=39
         Unknown    FP=93      TN=2034
TABLE IV: Confusion Matrix for Topic Classification

             Test-Data1           Test-Data2
             DT     RF     NB     DT     RF     NB
Recall       0.79   0.82   0.79   0.82   0.84   0.83
Precision    0.72   0.78   0.74   0.75   0.81   0.78
TABLE V: Performance Evaluation Metrics for Intent Classification. Source: Agarwal et al. [12]
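
The precision and recall of the topic classifier follow directly from the four counts in Table IV. The short Python sketch below reproduces that arithmetic; the function names are illustrative and not part of the original study, and the printed values are simply what the table counts imply.

```python
# Counts copied from Table IV (topic classification).
TP, FN = 253, 39    # actual topic posts: predicted topic / predicted unknown
FP, TN = 93, 2034   # actual unknown posts: predicted topic / predicted unknown


def precision(tp: int, fp: int) -> float:
    """Fraction of posts predicted as topic that are truly topic."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of truly topic posts that the classifier recovers."""
    return tp / (tp + fn)


total = TP + FN + FP + TN
print(f"posts in the confusion matrix: {total}")   # 2419
print(f"precision: {precision(TP, FP):.3f}")       # 0.731
print(f"recall:    {recall(TP, FN):.3f}")          # 0.866
```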

Given that our data is highly imbalanced and only a small percentage of the posts are labeled as target (intent) class, we execute each of our classifiers (RF, NB, and DT) using k-fold cross validation over the experimental dataset. Since accuracy measures are biased towards the majority class, we evaluate the performance of the intent classifier using two standard information retrieval metrics, i.e. precision and the Area Under the Receiver Operating Characteristic Curve (AUC). Due to the misclassification in topic modeling, we evaluate the performance of intent classification in two steps. We first execute our model on all posts (Test-Data1) classified as topic in the previous stage. In the second iteration, we evaluate the performance of the intent classifier on the Tumblr posts (Test-Data2) correctly classified as topic. Table V shows the accuracy metrics for the Random Forest (RF), Decision Tree (DT) and Naive Bayes (NB) algorithms.
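
The two evaluation sets can be sketched as simple filters over the output of the topic classification stage. The sketch below assumes that Test-Data1 denotes every post the topic classifier labels as topic and that Test-Data2 denotes the subset of those posts that annotators also labeled as topic; the `Post` record and its field names are hypothetical, and the four toy posts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    post_id: int
    true_topic: bool        # annotator label: is the post topic related?
    predicted_topic: bool   # output of the topic classification stage


def split_test_sets(posts):
    """Return (test_data1, test_data2) under the reading described above."""
    test_data1 = [p for p in posts if p.predicted_topic]   # TP + FP
    test_data2 = [p for p in test_data1 if p.true_topic]   # TP only
    return test_data1, test_data2


# Toy posts, one per confusion-matrix cell (not data from the study).
posts = [
    Post(1, True, True),     # true positive: appears in both sets
    Post(2, False, True),    # false positive: appears only in Test-Data1
    Post(3, True, False),    # false negative: never reaches the intent classifier
    Post(4, False, False),   # true negative: filtered out
]
td1, td2 = split_test_sets(posts)
print([p.post_id for p in td1])   # [1, 2]
print([p.post_id for p in td2])   # [1]
```

Under this reading, the counts in Table IV would correspond to 253 + 93 = 346 posts in Test-Data1 and 253 posts in Test-Data2.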

Fig. 6: ROC Curve for Test-Data1 (Right) and Test-Data2 (Left). Source: Agarwal et al. [12]

Our results reveal that the one-class intent classifier gives a high precision rate for Test-Data1 (refer to Table V). However, filtering non-topic posts from the dataset further improves the accuracy of intent classification. This is probably because unknown posts represent a broad range of sentiments and language cues. Table V reveals that Random Forest outperforms the Naive Bayes and Decision Tree algorithms and gives the maximum precision (0.78, 0.81) and recall (0.82, 0.84) for Test-Data1 and Test-Data2. Naive Bayes and Random Forest generate similar classification results for topic posts, with a difference of 0.01 to 0.03 on Test-Data2 (Table V). Our results also reveal that posts wrongly classified at the topic classification stage reduce the accuracy of intent classification: as shown in Table V, precision and recall for Test-Data2 are higher than for Test-Data1 for every classifier. Figure 6 shows the ROC curves generated for each classifier on both Test-Data1 and Test-Data2. The graphs in Figure 6 show that, given a set of posts, the Decision Tree based intent classifier has the highest probability of classifying them as the target class, while Random Forest and Naive Bayes have almost equal probabilities of classifying a post as intent or unknown. Figure 6 also reveals that if the taxonomy of a post is unknown (Test-Data1), all three algorithms have approximately the same probability of classifying it as an intent post.
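
As a reminder of what the area under a ROC curve measures, the sketch below computes AUC through its rank interpretation: the probability that a randomly chosen intent post receives a higher classifier score than a randomly chosen unknown post, with ties counted as one half. The scores are made up for illustration and are not outputs of the classifiers evaluated here.

```python
def auc(scores_pos, scores_neg):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical classifier scores (higher means "more likely intent").
intent_scores = [0.9, 0.8, 0.4]
unknown_scores = [0.7, 0.3, 0.2, 0.1]

# 11 of the 12 (intent, unknown) pairs are ranked correctly.
print(f"AUC = {auc(intent_scores, unknown_scores):.3f}")   # AUC = 0.917
```

An AUC of 0.5 corresponds to a classifier that ranks intent and unknown posts no better than chance, which is why AUC is less sensitive to class imbalance than plain accuracy.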

To evaluate the impact of each feature on intent classification performance, we perform leave-p-out cross validation over the feature vectors for both Test-Data1 and Test-Data2. Figure 5 illustrates the percentage fall in precision of each classifier for both datasets; negative values indicate an increase in precision. Figure 5(a) shows that in Test-Data1, removing F2 or F3 individually from the feature space has a negligible impact on the overall performance of Decision Tree, while removing the writing tone feature (F4) decreases the precision. For Test-Data2, removing the document sentiment vector from the feature space actually increases the performance of Decision Tree. This is possibly because emotion tone gives a detailed classification of emotions (anger, fear, joy, sadness and disgust), while the document sentiment feature gives the overall sentiment of a post, which can be biased in longer posts. Figure 5(b) reveals that for the Naive Bayes algorithm, removing any feature from Test-Data2 causes a reasonably high percentage fall in precision; in particular, removing feature F1 or F4 decreases the overall precision. Similarly, if the taxonomy of a post is unknown (Test-Data1), removing emotion tone (F3) or language tone (F4) decreases the precision. As with Naive Bayes, for the Random Forest algorithm (Figure 5(c)) the removal of any feature reduces the classifier’s performance. For any unknown post, emotion tone (F3) and the writing cues of the narrative (F4) are the most discriminatory features, as removing them decreases the performance of the algorithm the most.
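
The single-feature ablation described above can be sketched as follows. This is a minimal illustration, assuming that the "percentage fall in precision" is measured relative to the precision obtained with all five feature vectors; the paper does not state whether the fall is relative or in absolute percentage points. The function `toy_evaluate` is a hypothetical stand-in for training and cross-validating a classifier, and every number in it is made up.

```python
FEATURES = ("F1", "F2", "F3", "F4", "F5")


def toy_evaluate(kept):
    """Hypothetical stand-in for 'train, cross-validate, return precision'.

    The numbers are invented for illustration and are NOT results from the paper.
    """
    gain = {"F1": -0.02, "F2": 0.00, "F3": 0.08, "F4": 0.10, "F5": 0.04}
    return 0.60 + sum(gain[f] for f in kept)


def percentage_fall(baseline, ablated):
    """Relative drop in precision in percent; negative means precision rose."""
    return 100.0 * (baseline - ablated) / baseline


def ablation_report(evaluate, features):
    """Percentage fall in precision when each feature vector is left out alone."""
    full = frozenset(features)
    baseline = evaluate(full)
    return {f: percentage_fall(baseline, evaluate(full - {f})) for f in features}


for feature, fall in ablation_report(toy_evaluate, FEATURES).items():
    print(f"without {feature}: {fall:+.1f}% fall in precision")
```

In this toy setup, leaving out a vector with a negative contribution yields a negative fall, which matches the convention in Figure 5 that negative values denote an increase in precision.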

We also report the variation in classifier performance when a combination of two features is removed from the training model. Leaving out two features at once also reveals the relative influence of each vector in the feature space. Figure 5(d) reveals that F3 and F4 are the most discriminatory features: removing either of these vectors does not influence the contribution of the other features, but we observe a fall in the overall precision. For example, removing feature F1 (whose individual removal increases the precision of the Decision Tree algorithm) together with F4 decreases the precision for both Test-Data1 and Test-Data2. Similarly, leaving out F2 or F3 along with most of the other features (F2 and F4) decreases the performance for both datasets. However, for Test-Data1, leaving these features out along with F1 instead increases the performance. This reveals that in Decision Tree intent classification, feature F1 negatively impacts the performance of the other features. In Naive Bayes intent classification, we find that for Test-Data1, F2 is an important feature for identifying intent posts (Figure 5(e)). This is possibly because, if the taxonomy of a post is unknown, semantic tagging of the text can be an important feature for identifying topic related posts. Figure 5(e) also reveals that in the Naive Bayes classifier (Test-Data1), the social tone of a text (F5) degrades the performance of the other features. For example, removing F1 individually decreases the precision, while removing it together with F5 does not change the accuracy. Similarly, leaving out F3 or F4 individually causes a fall in overall performance, while removing either of them together with F5 instead improves the accuracy. This happens because posts that are not topic related might span a wide range of taxonomy, which affects the social tone of a narrative; the sparsity of the social tendency attributes then increases the number of false alarms. Unlike the Decision Tree and Naive Bayes algorithms, in Random Forest removing any combination of two feature vectors decreases the performance of the intent classifier for each dataset. Figure 5(f) reveals that removing any feature along with F3 or with F4 lowers the precision. Our results reveal that emotion tone (F3) and writing cues (F4) are the two most discriminatory features for identifying intent posts across all three classifiers and both datasets. Semantic tagging (F2) and social tendency of narratives (F5) are two other important features when a post covers a wide range of topics or emotions that make it ambiguous. The classification results support our hypothesis that the sentiment and semantics of text can be used to identify the language cues and personality traits of an author and to classify intent posts on microblogging platforms.
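
Removing two feature vectors at a time can be sketched by enumerating all pairs. The example below is illustrative only: `toy_evaluate` is a hypothetical stand-in for training and cross-validating a classifier, its numbers are made up, and the interference term between F1 and F5 is an invented device to show how removing a pair can behave differently from removing each vector alone.

```python
from itertools import combinations

FEATURES = ("F1", "F2", "F3", "F4", "F5")


def toy_evaluate(kept):
    """Hypothetical stand-in for 'train, cross-validate, return precision'.

    All numbers are invented for illustration and are NOT results from the paper.
    """
    gain = {"F1": 0.02, "F2": 0.01, "F3": 0.08, "F4": 0.10, "F5": 0.02}
    precision = 0.60 + sum(gain[f] for f in kept)
    if "F1" in kept and "F5" in kept:
        precision -= 0.03   # made-up interference between two vectors
    return precision


def pairwise_ablation(evaluate, features):
    """Percentage fall in precision for every pair of removed feature vectors."""
    full = frozenset(features)
    baseline = evaluate(full)
    falls = {}
    for pair in combinations(features, 2):
        ablated = evaluate(full - frozenset(pair))
        falls[pair] = 100.0 * (baseline - ablated) / baseline
    return falls


falls = pairwise_ablation(toy_evaluate, FEATURES)
for pair, fall in sorted(falls.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, f"{fall:+.2f}%")
```

With five feature vectors there are ten pairs to evaluate per classifier and dataset, so the full study of Figures 5(d) to 5(f) corresponds to repeating this loop for each of the three classifiers on both test sets.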

Limitations: In this paper, we conduct our analysis only on English language posts. Our proposed approach depends on the open source APIs used for feature extraction. If a post contains multi-lingual text (for example, Arabic and English), the APIs might not be able to extract the taxonomy or semantic features accurately. Our model is general and can be used to identify racist and radicalized intent in any given text. However, it might require some pre-processing and a large amount of training data for microposts, as topic modeling and tone analysis might not be fully accurate for very short text such as tweets.

VIII Conclusions and Future Work

In this paper, we study the problem of identifying racist and radicalized Tumblr posts based on the intent of the narrative. We formulate our problem as a cascaded ensemble learning problem and propose a two-stage one-class classification approach to solve it. Our results show that, unlike previous keyword based techniques, the proposed approach is effective for identifying intent posts. Our experimental results show that the emotion tone, writing cues and social personality traits of an author are discriminatory features for identifying the intent of a post. Further, topic classification of posts and filtering of non-topic (or noisy) posts improves the performance of the proposed intent classification.
Future work includes addressing the limitations of the present study and improving the accuracy of linguistic features. This includes identifying multi-lingual posts through sentence level language detection and enhancing the translated content for identifying intent posts. As mentioned in the previous sections, Tumblr is popularly known for the use of reaction GIF images. Therefore, our future work involves mining users’ reactions from attached external images and enriching the linguistic features of a post. The presence of long text in tags gives more information about the intent of an author as well as the content of the post. Future work therefore also includes sentence detection in tags and identifying linguistic features at the tag level.

References

  • [1] J. Cohen, “Freedom of expression,” Philosophy & Public Affairs, vol. 22, no. 3, pp. 207–263, 1993.
  • [2] A. G. Smith, P. Suedfeld, et al., “The language of violence: Distinguishing terrorist from nonterrorist groups by thematic content analysis,” Dynamics of Asymmetric Conflict, vol. 1, no. 2, pp. 142–163, 2008.
  • [3] S. Agarwal, A. Sureka, and V. Goyal, “Open source social media analytics for intelligence and security informatics applications,” in Big Data Analytics.   Springer, 2015, pp. 21–37.
  • [4] M. Norton and S. Sommers, “Whites see racism as a zero-sum game that they are now losing,” Perspectives on Psychological Science, 2011.
  • [5] J. Wang, G. Cong, et al., “Mining user intents in twitter: A semi-supervised approach to inferring intent categories for tweets,” in AAAI, 2015.
  • [6] H. Purohit, G. Dong, et al., “Intent classification of short-text on social media,” in SocialCom, 2015.
  • [7] X. Ding, T. Liu, J. Duan, and J.-Y. Nie, “Mining user consumption intention from social media using domain adaptive convolutional neural network.” in AAAI, 2015, pp. 2389–2395.
  • [8] P. Geetha, R. Chandresh, et al., “Feature selection framework for data analytics in microblogs,” in Emerging Research in Computing, Information, Communication and Applications (ERCICA), 2014.
  • [9] J. Wang, W. X. Zhao, H. Wei, H. Yan, and X. Li, “Mining new business opportunities: Identifying trend related products by leveraging commercial intents from microblogs.” in EMNLP, 2013, pp. 1337–1347.
  • [10] S. Prentice, P. J. Taylor, P. Rayson, A. Hoskins, and B. O’Loughlin, “Analyzing the semantic content and persuasive composition of extremist media: A case study of texts produced during the gaza conflict,” Information Systems Frontiers, vol. 13, no. 1, pp. 61–73, 2011.
  • [11] S. Agarwal and A. Sureka, “Semantically analyzed metadata of tumblr posts and bloggers, mendeley data, v1, http://dx.doi.org/10.17632/hd3b6v659v.1,” 2016.
  • [12] S. Agarwal and A. Sureka, “But i did not mean it!- intent classification of racist posts on tumblr,” in 6th IEEE European Intelligence & Security Informatics Conference (EISIC), Uppsala, Sweden.   IEEE, 2016.
  • [13] Y. Chen, Y. Zhou, S. Zhu, and H. Xu, “Detecting offensive language in social media to protect adolescent online safety,” in Privacy, Security, Risk and Trust (PASSAT), SocialCom.   IEEE, 2012, pp. 71–80.
  • [14] S. Agarwal and A. Sureka, “Spider and the flies: Focused crawling on tumblr to detect hate promoting communities,” arXiv preprint arXiv:1603.09164, 2016.
  • [15] P. Burnap and M. L. Williams, “Us and them: identifying cyber hate on twitter across multiple protected characteristics,” EPJ Data Science, vol. 5, no. 1, p. 1, 2016.