Embed2Detect: Temporally Clustered Embedded Words for Event Detection in Social Media

Abstract

Event detection in social media refers to the automatic identification of important information shared on social media platforms at a certain time. Considering the dynamic nature and high volume of data production in data streams, it is impractical to filter the events manually. Therefore, an automated mechanism to detect events is essential in order to utilise social media data effectively. An analysis of the available literature shows that most existing event detection methods focus only on statistical and syntactic features in data, even though the underlying semantics are also important for effective information retrieval from text, because they describe the connections between words and their meanings. In this paper, we propose a novel method termed Embed2Detect for event detection in social media, which combines the characteristics of prediction-based word embeddings and hierarchical agglomerative clustering. The adoption of prediction-based word embeddings incorporates the semantic features of the text, overcoming a major limitation of previous approaches. This method is experimented on two recent social media data sets which represent the sports and politics domains. The results obtained from the experiments reveal that our approach is capable of effective and efficient event detection, with significant improvements over the baselines. For the sports data set, Embed2Detect achieved a 27% higher F-measure than the best-performing baseline method, and for the political data set, the improvement was 29%.

Keywords:
Word embedding Hierarchical clustering Dendrogram Vocabulary Social media

1 Introduction

Social media services like Twitter, Facebook and Snapchat are becoming more popular day by day. A recent survey by Chaffey (2019) estimated the number of active social media users around the world in January 2019 at 3.484 billion, which is 45% of the total population. The average global increase in social media usage since January 2018 was found to be 9%. Another analysis, conducted on active social media users in July 2019, ranked the social media services by popularity (Clement, 2019). According to its results, the majority of services have user bases in the millions, with Facebook leading at 2,375 million users. About 473,400 tweets and 49,380 Instagram posts per minute were recorded in 2018 (James, 2018).

The data produced on social media contain different information such as opinions, breaking news, general status and personal updates. Also, social media facilitates the fast and wide spreading of information, because of its large user base which covers a vast geographical area (Castillo et al, 2011). People report nearby events instantly, which lets others on the same platform know about those events within a short period. In some cases, social media has been found to broadcast news faster than traditional news media; this was shown by an analysis which compared Twitter trending topics with CNN news headlines (Kwak et al, 2010). Due to the inclusion of diverse information and real-time propagation to a large group, nowadays there is a high tendency to consider social media as information networks which provide newsworthy contents. In 2017, the proportion of American adults who get news from social media was found to be 67% (Gottfried and Shearer, 2017). Considering this inclination, news services such as BBC and CNN also use social media actively to publish news to a huge user base instantly. Nonetheless, it is impractical to analyse social media data manually to extract important or newsworthy contents, because of its huge volume and dynamic nature. Therefore, in order to utilise social media data effectively, the requirement for an automated and accurate event detection method becomes crucial (Small and Medsker, 2014).

A language is mainly built on two phenomena, namely syntax and semantics (Sag and Pollard, 1987). Syntax defines the arrangement of words in word sequences, and semantics describes the connections between words and their meanings. Thus, a language can have multiple terms which express the same meaning, as well as polysemous terms which have multiple meanings depending on the context. Also, there can be different term orders which convey the same idea. Therefore, successful information retrieval from text requires the analysis of both the underlying syntax and semantics. Event detection from textual data in social media is also a sub-domain of information retrieval from text. In addition to considering the underlying syntax and semantics, event detection requires the incorporation of statistical features in text to measure the qualities of events, such as popularity. However, according to the available literature (Corney et al, 2014; Xie et al, 2016), most existing methods focus only on statistical and syntactic features in the text without considering the semantics.

Due to the diversity in social media users, it is common to use different terms and term sequences to describe the same idea. For example, consider the tweets:

‘There are 13 million people living in poverty in the UK. 13M!!! Yet some MPs will vote for the deal with NO impact assessments. That 13M could become 20M?!#VoteTheDealDown #PeoplesVoteMarch  #PeoplesVote #StopBrexit’

‘Luciana Berger - Steve Barclay confirmed that no economic analysis of the #BrexitDeal has been done… let that sink in. So how can we be expected to vote on a deal, that will affect this country for decades, today?  #VoteDownTheDeal #PeoplesVote’

which were posted during the Brexit Super Saturday 2019. Even though both tweets describe the same idea, there are no common words between them except for a few hashtags. In addition, different word phrases such as ‘impact assessments’ and ‘economic analysis’ were used to mention the same subject discussed in them. Without considering the underlying semantics, relationships between such terms and term sequences cannot be identified. Thus, a huge amount of valuable information will be lost by ignoring the semantics.

Considering the lack of semantic involvement in previous research and the importance of semantics for information extraction from text, this research proposes a novel event detection method termed Embed2Detect, which combines the characteristics of prediction-based word embeddings and hierarchical agglomerative clustering. We used the time-based sliding window model and considered the temporal variations between cluster changes and vocabulary changes to identify event occurrences. Since prediction-based word embeddings learn word representations based on contextual predictions, these vectors have a high capability of preserving syntactic and semantic relationships between words (Mikolov et al, 2013a). These embedding models also consider the statistics of the training corpus while learning the embeddings. Further, we utilise term frequencies in our method for the inclusion of statistical features. In summary, Embed2Detect considers all the important features in textual data (syntax, semantics and statistics) which are needed for effective event detection.

To evaluate the proposed method, two recent social media data sets which represent two diverse domains, sports (English Premier League 19/20 on 20 October 2019 between the teams Manchester United Football Club (FC) and Liverpool FC) and politics (Brexit Super Saturday 2019), are used. The data were collected from Twitter, because it is widely considered an information network rather than a social media platform (Adedoyin-Olowe et al, 2016; Kwak et al, 2010) and has limited restrictions with enough data coverage for this research. To measure the performance, we used the evaluation metrics recall, precision, F-measure and keyword recall, which are widely used to evaluate event detection methods. Further, we compared the effectiveness and efficiency of our method with three recently proposed event detection methods as baselines. We obtained promising results in the evaluation, with better performance from our method than from the baselines.

To the best of our knowledge, Embed2Detect is the only method which uses self-learned prediction-based word embeddings for event detection in social media. In summary, we list the contributions of this paper as follows:

  • the proposal of a novel method for event detection in social media which involves not only the statistical and syntactic features in the text, as in existing methods, but also the semantic features, using self-learned prediction-based word embeddings;

  • the application of unsupervised self-learning on the targeted corpus to capture domain-specific features for more effective and flexible event detection, independent of characteristics specific to the social media service or language;

  • the application and evaluation of the proposed method on recent and real data sets in different domains to prove the effectiveness and universality of the method, while comparing it with recent baseline methods;

  • the publication of recent social media data sets which represent the above domains (i.e. sports and politics) with ground truth event labels to support other research in the area of event detection; and

  • the release of the method implementation as an open-source project to support applications and research in the area of event detection.

The rest of this paper is organised as follows. Available methods for event detection in social media and their capabilities are discussed in Section 2. Section 3 describes the background details including the support of word embeddings and hierarchical clustering for this research. The problem addressed by this research is stated in Section 4 and the proposed approach is explained under Section 5. Following this, a comprehensive experimental study is available under Section 6. Finally, the paper is concluded with a discussion in Section 7.

2 Related work

Considering the importance of automatic event detection in social media, different methods were proposed by previous research with the association of different techniques and characteristics including graph theory, rule mining, clustering, tensor decomposition, burstiness and social aspect. These techniques were supported by different text representations including tokens, n-grams, vectors, etc.; and extracted keywords such as named entities, noun phrases and hashtags as further discussed below.

Following the successful application of graph theory in sociology and social network analysis, there was a tendency to use graph-based solutions for event detection in social media. Sayyadi et al (2009) proposed to transform a data stream into a KeyGraph, which represents keywords by nodes and connects nodes whose corresponding keywords co-occurred in a document, so that the communities in the graph represent events that occurred in the data stream. As keywords, noun phrases and named entities with high document frequency were considered. In this approach, the betweenness centrality score was used to extract the graph communities. As an improved version of the social stream graph, a later research suggested using posts in social media as graph nodes rather than keywords (Schinas et al, 2015). It used the Structural Clustering Algorithm for Networks (SCAN) to extract the communities in the graph. Unlike betweenness centrality-based cluster detection, SCAN has the ability to recognise bridges of clusters (hubs) and outliers, allowing the sharing of hubs between clusters and the recognition of outliers as noise (Xu et al, 2007). Also, this method used a supervised learning technique to identify posts which belong to the same community during graph generation. Therefore, this approach is unable to handle events which are unknown to the training data. Considering the high computational cost associated with graph generation, another recent research suggested a named entity-based method (Edouard et al, 2017). After identifying the entities, only the context around them was considered to extract nodes, edges and weights. Even though keyword-based methods speed up graph processing, they are less expandable due to the usage of language- or domain-specific features for keyword extraction.

A trend of applying rule mining techniques for event detection can be found in previous research. Based on Association Rule Mining (ARM), Adedoyin-Olowe et al (2013) proposed a method for temporal analysis of evolving concepts in Twitter, named Transaction-based Rule Change Mining (TRCM). To generate the association rules, hashtags in tweets were considered as keywords. This methodology was further evolved for event detection by showing that specific tweet change patterns, namely unexpected and emerging, have a high impact on describing underlying events (Adedoyin-Olowe et al, 2016). Having a fixed support value for Frequent Pattern Mining (FPM) was found to be inappropriate for dynamic data streams, and this was solved by the dynamic support calculation method proposed by Alkhamees and Fasli (2016). FPM considers all terms to have equal utility. However, due to the short length of social media documents, the frequency of a specific term related to an event can increase rapidly compared to other terms. Based on this finding, Choi and Park (2019) suggested High Utility Pattern Mining (HUPM), which finds not only the frequent but also high-utility item sets. In this research, the utility of terms was defined based on the growth rate in frequency. Even though the latter two approaches, based on FPM and HUPM, are not limited to the processing of special keywords such as hashtags, they are focused only on identifying the topics/events discussed at each time frame without recognising temporal event occurrence details.

By considering the dynamicity and unpredictability of social media data streams, there was a tendency to use unsupervised methods such as clustering and tensor decomposition for event detection. McCreadie et al (2013) and Nur’Aini et al (2015) showed that K-means clustering can be used successfully for event detection. In order to improve the efficiency and effectiveness, they clustered low-dimensional document vectors, which were generated using Locality Sensitive Hashing (LSH) and Singular Value Decomposition (SVD), respectively. Xie et al (2016) showed that the decomposition technique SVD can also be applied to word acceleration matrices to identify event words. Considering the requirement of predefining the number of events in K-means clustering and SVD, there was a motivation for hierarchical or incremental clustering approaches (Corney et al, 2014; Li et al, 2017; Nguyen et al, 2019; Morabia et al, 2019). Different data representations have been used with hierarchical clustering: Corney et al (2014) proposed clustering word n-grams, Li et al (2017) proposed clustering semantic classes and Morabia et al (2019) proposed clustering segments. As an improved clustering method, Nguyen et al (2019) suggested clustering term frequency-inverse document frequency (tf-idf) vectors after identifying candidate clusters using entity inverted indices. Among these data representations, both semantic class and entity detection were done using rule-based approaches, which are less expandable to other languages. Segments are text phrases which are meaningfully separated using a semantic resource such as Wikipedia; therefore, they are more informative and specific than word n-grams.

Burstiness is commonly used as a measure to identify events, because it expresses the changes occurring in data streams. In communication streams, a burst is defined as a transmission which involves a larger amount of data in a short time than usual. Van Oorschot et al (2012) suggested that occurrences of sport events in Twitter can be recognised by analysing the tweets at bursts in the data stream. However, events which do not cause any significant increase in data volume will be missed if only the data at peak volumes are considered. To overcome this limitation, another research proposed using bursts in word n-grams (Corney et al, 2014) to identify the important events. This research argues that even when the data volume is stable, there will be an increase in word phrases specific to a particular event. But frequency-based measures cannot differentiate events from general topics such as cars, music and food, because social media contains a large proportion of data relevant to these topics. Moreover, bursts in frequency appear only once an event becomes popular or trending. To overcome these issues, bursts in word acceleration were suggested by another research (Xie et al, 2016). Using acceleration, events could be identified more accurately at their early stages.
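The contrast between frequency-based bursts and acceleration-based bursts can be sketched as follows (our own minimal illustration, not the cited authors' implementations; the toy windows and tokens are hypothetical):

```python
from collections import Counter

def window_frequencies(windows):
    """Count term frequencies per time window (each window: list of token lists)."""
    return [Counter(tok for doc in w for tok in doc) for w in windows]

def acceleration(freqs, term):
    """Second difference of a term's frequency series; positive values flag
    terms whose growth itself is speeding up (early-stage events)."""
    series = [f[term] for f in freqs]
    velocity = [b - a for a, b in zip(series, series[1:])]
    return [b - a for a, b in zip(velocity, velocity[1:])]

# Hypothetical stream split into three time windows of tokenised posts.
windows = [
    [["match", "kick", "off"]],
    [["goal", "chance"]],
    [["goal", "goal", "goal"], ["goal", "what", "a", "goal"]],
]
freqs = window_frequencies(windows)
accel = acceleration(freqs, "goal")  # frequency series 0, 1, 5
```

A raw frequency threshold would only flag ‘goal’ in the last window, whereas its positive acceleration already signals an emerging event one window earlier.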

Recently, there has also been a focus on the social aspect, considering the impact of the user community on events. Guille and Favre (2015) proposed an approach which focuses on bursts in mentions to incorporate the social aspect of Twitter into event detection. Since mentions are links added intentionally to connect a user with a discussion, or dynamically during re-tweeting, the social aspect of data can be revealed using them. Proving the importance of the social aspect, this method outperformed methods based only on term frequency and content similarity (Benhardus and Kalita, 2013; Parikh and Karlapalem, 2013). Another recent research suggested an improved version of Twevent (Li et al, 2012) by integrating more user diversity-based measures, retweet count and follower count, with the segment burstiness calculation (Morabia et al, 2019). However, the measures which sense the social aspect are mostly specific to the social media platform, and incorporating them would require customising the main flow accordingly.

Considering the textual features used by the above-mentioned event detection approaches, it is clear that the majority of previous research works mainly focused on statistical features (e.g. term frequency, tf-idf or burstiness) and syntactic features (e.g. co-occurrence or locality sensitivity). However, as a sub-domain of information retrieval from text, effective event detection in social media also requires the proper inclusion of semantic features; we could find only a few methods which considered the underlying semantics, as described in Section 2.1.

2.1 Usage of semantics in event detection

When closely analysing how semantics was used for event detection in social media by previous research, we could find some rule-based and supervised learning-based approaches, as discussed below.

Li et al (2017) defined an event as a composition of answers to WH questions (i.e. who, what, when and where). Based on this definition, they considered only the terms which belong to the semantic classes proper noun, hashtag, location, mention, common noun and verb for their event detection method. Rule-based approaches were used for term extraction and categorisation. Likewise, another recent research (Nguyen et al, 2019) also used a rule-based approach to extract the named entities in the text in order to support their event detection method. Using the named entities, documents and clusters were represented as entity-document and entity-cluster inverted indices, which were used for candidate cluster generation. Both of these methods only categorised terms into semantic groups for the recognition of important event-related terms. Thus, neither method has the ability to identify the connections between words.

In contrast to the rule-based approaches, Chen et al (2017) suggested a deep neural network-based approach for event detection. To identify event-related tweets, a neural network model was used, and to input the data into the network, tweets were converted into fixed-length vectors using pretrained GloVe embeddings (Pennington et al, 2014), capturing the semantic and syntactic regularities in the text. However, it is not appropriate to use supervised learning techniques for real-time event detection, because they require prior knowledge of events, which can vary due to the dynamic nature of data streams and event-specific qualities.

In summary, based on the available literature, we could not find any event detection approach which significantly involves the semantics of the underlying text while facilitating real-time execution. We propose our approach with the intention of filling this gap for more effective event identification.

3 Background

Considering the limitations of available approaches for event detection, we adopt a word embedding-based approach in this research. More details about word embeddings and their capabilities are discussed in Section 3.1. Additionally, in order to facilitate unsupervised event detection, we utilise the characteristics associated with hierarchical clustering. We selected hierarchical clustering among the various clustering algorithms available, following the tendency of previous research and considering its advantages. Hierarchical clustering is further explained in Section 3.2.

3.1 Word embeddings

Word embeddings are numerical representations of text in vector space. Depending on the learning method, they are categorised into two main groups, namely frequency-based embeddings and prediction-based embeddings. Frequency-based word embeddings consider different measures of word frequency to represent text as vectors. Therefore, the main focus of these vectors is the statistical features of the text. Term frequency vectors and tf-idf vectors are examples of frequency-based word embeddings. Unlike them, prediction-based word embeddings mainly focus on both the syntactic and semantic features of the text, as further described in Section 3.1.1. Thus, this research aims at incorporating the characteristics of prediction-based word embeddings for effective event detection in social media.
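As a concrete illustration of the frequency-based family, a minimal tf-idf sketch (our own example; the toy documents are hypothetical):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one sparse tf-idf vector
    (a term -> weight dict) per document, using idf = log(N / df)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within the document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["goal", "rashford"], ["goal", "saved"], ["foul", "rashford"]]
vecs = tf_idf(docs)
```

Note how a term appearing in every document would receive weight log(1) = 0: such vectors encode statistics only, with no notion of meaning, which is the limitation prediction-based embeddings address.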

Among the available prediction-based algorithms, we focus on the Skip-gram algorithm in this research, considering its ability to learn high-quality word embeddings efficiently. More details of the Skip-gram architecture are described in Section 3.1.2. Following this theoretical exposition, Section 3.1.3 discusses the qualities of word embeddings obtained by training Skip-gram models on real data sets, which are useful for event detection.

3.1.1 Prediction-based word embeddings

Prediction-based word embeddings learn word representations based on contextual predictions. Learning from context allows these vectors to capture both syntactic and semantic relationships between words.

Different model architectures such as the Neural Network Language Model (NNLM) (Bengio et al, 2003) and the Recurrent Neural Network Language Model (RNNLM) (Mikolov et al, 2010) were proposed by previous research for the generation of word embeddings based on contextual predictions. However, considering the complexity associated with them, log-linear models known as Word2vec models (Mikolov et al, 2013a) were suggested, and they have recently been used successfully in many natural language processing (NLP) tasks such as news recommendation (Zhang et al, 2019), question classification (Yilmaz and Toklu, 2020) and community detection (Škrlj et al, 2020).

There are two architectures proposed under Word2vec models: (1) Continuous Bag-of-Words (CBOW) and (2) Continuous Skip-gram. CBOW predicts a word based on its context. In contrast, Skip-gram predicts the context of a given word. Both algorithms train a neural network with one hidden layer and use the adjusted weights between the input and hidden layers as word embeddings. According to the results obtained by model evaluations, Mikolov et al (2013a) showed that these vectors have a high capability of preserving syntactic and semantic relationships between words. Further, the ability of Word2vec models to organise concepts automatically while implicitly learning their relationships was revealed by the demonstration of the country-capital relationship (Mikolov et al, 2013b). In this experiment, vectors corresponding to countries and their capitals were located at similar distances in the vector space. Also, due to the simplicity of these model architectures, their computational complexity is much lower.

Among the Word2vec algorithms CBOW and Skip-gram, we focus on the Skip-gram model in this research, because it resulted in higher semantic accuracy than CBOW (Mikolov et al, 2013a, b). Also, in our initial experiments and analyses, Skip-gram outperformed the CBOW model.
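To make the Skip-gram training setup concrete, a minimal sketch of how (centre, context) training pairs are enumerated from a token sequence (our own illustration; the sentence and window size are hypothetical):

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate (centre, context) training pairs: under the Skip-gram
    objective, each word predicts its neighbours within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

pairs = skipgram_pairs(["rashford", "scores", "first", "goal"], window=1)
```

CBOW would invert each pair, predicting the centre word from its surrounding context instead.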

3.1.2 Skip-gram model

The Skip-gram model is a log-linear classifier composed of a 3-layer neural network whose objective is to predict the context (surrounding) words of a centre word, given a sequence of training words $w_1, w_2, \ldots, w_T$ (Mikolov et al, 2013b). More formally, it focuses on maximising the average log probability of the context words of the centre word, following the objective function in Equation 1. The length of the training context is represented by $c$.

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t) \tag{1}$$

The probability of a context word given the centre word, $p(w_O \mid w_I)$, is computed using the softmax function.

$$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)} \tag{2}$$

In Equation 2, $w_O$ and $w_I$ represent the output and input (i.e. context and centre) words respectively, and $V$ represents the length of the vocabulary. The input and output vectors of a word $w$ are represented by $v_w$ and $v'_w$. The input vectors of words are taken from the input-hidden layer weight matrix, which is sized $V \times N$, where $N$ is the number of neurons in the hidden layer. Likewise, the output vectors are taken from the hidden-output layer weight matrix, which is sized $N \times V$. The architecture of the Skip-gram model, including the weight matrices, is shown in Figure 1.

Figure 1: Architecture of Skip-gram model

Once the model converges, it obtains the ability to predict the probability distributions of context words with good accuracy. At that point, instead of using the model for its trained purpose, the adjusted weights between the input and hidden layers are extracted as word representations or embeddings. Thus, by changing the number of neurons in the hidden layer, the dimensionality of the vectors can be changed. During training, the model weights get adjusted by learning the connections between nearby words. Provided a sufficiently large data corpus, learning the connections between nearby words allows the model to capture the underlying syntax and semantics, with the capability of grouping similar words effectively.
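The two weight matrices and the softmax of Equation 2 can be sketched numerically as follows (a minimal illustration with assumed toy dimensions and random weights, not the actual Embed2Detect training code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3                       # toy vocabulary size and hidden-layer size
W_in = rng.normal(size=(V, N))    # input-hidden weights: row w is embedding v_w
W_out = rng.normal(size=(N, V))   # hidden-output weights: column w is v'_w

def context_probabilities(centre_idx):
    """Softmax of Equation 2: p(w_O | w_I) over the whole vocabulary."""
    scores = W_in[centre_idx] @ W_out   # v'_w . v_{w_I} for every w
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

p = context_probabilities(2)  # distribution over context words for word 2
embedding = W_in[2]           # after training, this row is the word's vector
```

After convergence only `W_in` is kept, so each vocabulary word is represented by one N-dimensional row.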

3.1.3 Skip-gram vector spaces learned on event data

An event discussed in a data stream will result in a collection of documents which describe that event using a set of words related to it. Due to learning based on contextual predictions, prediction-based word embeddings have the ability to locate the vectors of contextually close words in nearby vector space, or to group similar words. This characteristic allows the generation of nearby vectors for event-related words when the embeddings are learned on the corresponding document corpus.

Let us consider the sample events mentioned in Table 1. These events were extracted from the English Premier League 19/20 match on 20 October 2019 between the teams Manchester United FC and Liverpool FC, relating to the players Marcus Rashford and Roberto Firmino. Both events corresponding to Firmino are about missed attempts. Rashford has two different events, relating to a foul and a goal. By analysing the Twitter data posted during each minute, we could find a significant number of tweets discussing these events. In these tweets, foul-related words were used in the context of the word ‘Rashford’ at 16:52, and goal-related words at 17:06. Likewise, missed-attempt-related words were used in the context of ‘Firmino’ at 16:40 and 17:04.

Time   Event            Description
16:40  Attempt missed   Attempt by Roberto Firmino (Liverpool)
16:52  Foul             Foul by Marcus Rashford (Manchester United) on Virgil van Dijk (Liverpool)
17:04  Attempt saved    Attempt by Roberto Firmino (Liverpool)
17:06  Goal             First goal by Marcus Rashford (Manchester United)

Table 1: Sample events which occurred during English Premier League 19/20 on 20 October 2019 (Manchester United - Liverpool)

To analyse the word embedding distribution over the vector space and its temporal variations relating to these events, we trained separate Skip-gram models for each time window using Twitter data. In order to provide enough data for embedding learning, 2-minute time windows were used. Using the learned embeddings, the most similar words to the player names Rashford and Firmino were analysed during the time windows 16:52-16:54 and 17:06-17:08. To visualise the similar words in a two-dimensional plane, the T-distributed Stochastic Neighbor Embedding (t-SNE) algorithm (Maaten and Hinton, 2008) was used, and the resulting graphs are shown in Figures 2 and 3.

The similar-word visualisation during 16:52-16:54 (Figure 2) shows that the foul-related words are located close to the word ‘Rashford’ in the vector space. Also, 12 minutes afterwards, a few words related to the missed attempt at 16:40, such as ‘loses’ and ‘destruction’, can still be seen close to the word ‘Firmino’. During 17:06-17:08, however, we can see more words related to the saved attempt as nearby vectors to ‘Firmino’, because this event occurred 2 minutes earlier (Figure 3). Also, the goal scored during 17:06 can be clearly identified by the words close to ‘Rashford’. This time window has clearly separated nearby vector groups for ‘Firmino’ and ‘Rashford’ compared to the previous window 16:52-16:54, indicating that both events were actively discussed during this time because they happened recently.

These similar-word analyses show that nearby vector groups have the ability to represent events. Thus, the events described in a document corpus can be identified using the embeddings learned on it. Further, considering their capability in learning relationships between words, Skip-gram word embeddings locate both directly and indirectly event-related words in close vector groups. For example, the top 20 most similar words to ‘Rashford’ in the time window 17:06-17:08 (Figure 3) contain words such as ‘goal’, ‘1-0’, ‘mufc’ and ‘36’, which are directly related to the goal scored at the 36th minute. The similar words also include words such as ‘huge’ and ‘noise’, which relate indirectly to the event but describe it further. These characteristics of Skip-gram word embeddings can be utilised for effective event detection in social media data.
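A similar-word query like the ones behind Figures 2 and 3 reduces to ranking vocabulary terms by cosine similarity to the query word's vector; a minimal sketch with hypothetical 2-dimensional vectors (real Skip-gram embeddings would have far more dimensions and be learned, not hand-set):

```python
import numpy as np

def most_similar(word, embeddings, k=3):
    """Return the k vocabulary words whose vectors have the highest
    cosine similarity to `word`'s vector. embeddings: {word: np.ndarray}."""
    v = embeddings[word]
    sims = {
        w: float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        for w, u in embeddings.items() if w != word
    }
    return sorted(sims, key=sims.get, reverse=True)[:k]

# Hypothetical toy vectors, for illustration only.
emb = {
    "rashford": np.array([1.0, 0.1]),
    "goal":     np.array([0.9, 0.2]),
    "1-0":      np.array([0.8, 0.3]),
    "firmino":  np.array([-0.2, 1.0]),
}
top = most_similar("rashford", emb, k=2)
```

In the toy space, goal-related words rank closest to ‘rashford’ while ‘firmino’ sits in a different region, mirroring the separated vector groups seen in Figure 3.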

Figure 2: t-SNE visualisation of tokens closer to the words; ‘Rashford’ and ‘Firmino’

within time window 2019-10-20 16:52 - 16:54

Figure 3: t-SNE visualisation of tokens closer to the words; ‘Rashford’ and ‘Firmino’

within time window 2019-10-20 17:06 - 17:08

3.2 Hierarchical clustering

Even though flat clustering (e.g. K-means) is efficient compared to hierarchical clustering, it requires the number of clusters to be predefined. Considering the unpredictability associated with social media data, it is not practical to identify the number of events in advance. Therefore, hierarchical clustering is more appropriate for social media data streams. Another advantage of hierarchical clustering is that it outputs a hierarchy or structure of data points, known as a dendrogram, rather than just returning flat clusters. This hierarchy can be used to identify connections between data points. Considering these advantages, we decided to use hierarchical clustering for our event detection approach.

There exist two types of hierarchical clustering algorithms: bottom-up or agglomerative, and top-down or divisive (Manning et al, 2008a). In hierarchical agglomerative clustering (HAC), all data points are considered as separate clusters at the beginning and are then merged based on cluster distance using a linkage method. The commonly used linkage criteria are single, complete and average. Single linkage considers the maximum similarity between clusters, complete linkage the minimum similarity and average linkage the average of all pairwise similarities. In contrast to HAC, hierarchical divisive clustering (HDC) considers all data points as one cluster at the beginning and divides them until each data point is in its own cluster. For data division, HDC requires a flat clustering algorithm.
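The three linkage criteria can be sketched for two toy one-dimensional clusters (our own illustration of the definitions above, not a full HAC implementation):

```python
def linkage_distance(c1, c2, method="average"):
    """Distance between two clusters of 1-d points under the three
    common linkage criteria."""
    dists = [abs(a - b) for a in c1 for b in c2]
    if method == "single":     # closest pair, i.e. maximum similarity
        return min(dists)
    if method == "complete":   # farthest pair, i.e. minimum similarity
        return max(dists)
    return sum(dists) / len(dists)  # average of all pairwise distances

a, b = [0.0, 1.0], [4.0, 6.0]
single = linkage_distance(a, b, "single")      # 3.0
complete = linkage_distance(a, b, "complete")  # 6.0
average = linkage_distance(a, b, "average")    # 4.5
```

At each HAC step, the pair of clusters with the smallest linkage distance is merged, and the sequence of merges forms the dendrogram.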

Of the two types, the top-down approach (HDC) is more complex than HAC due to the requirement of a second, flat clustering algorithm. Therefore, when processing big data sets, it is advisable to use HDC only with stopping rules that avoid generating the complete dendrogram, in order to reduce complexity (Roux, 2018). Given the rate of data generation in social media, event detection requires processing big data sets, and this research needs the complete dendrogram as well as the clusters. Considering these requirements, we decided to use HAC for this research.
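As a concrete illustration, HAC with average linkage can be run over a handful of toy word vectors (hypothetical values) using SciPy. The returned linkage matrix encodes the complete dendrogram, and flat clusters can still be cut from it at any chosen distance:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy 2-dimensional word vectors for illustration only
words = ['rashford', 'goal', '1-0', 'firmino']
vectors = np.array([[0.9, 0.1],
                    [0.8, 0.2],
                    [0.7, 0.3],
                    [0.1, 0.9]])

# bottom-up clustering with average linkage on cosine distance;
# each row of Z records one merge (cluster a, cluster b, distance, size)
Z = linkage(vectors, method='average', metric='cosine')

# flat clusters can be derived afterwards by cutting the dendrogram
labels = fcluster(Z, t=0.5, criterion='distance')
```

Here the three football-related vectors fall into one cluster while ‘firmino’ remains separate, and the full merge hierarchy in `Z` stays available for further analysis.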

4 Problem definition

Previous research has described events using various definitions. Sayyadi et al (2009) defined an event as something newsworthy happening at a specific place and time. Another definition considered an event as an occurrence capable of creating an observable change in a particular context (Aldhaheri and Lee, 2017). Focusing on the content of events, other research described an event as a composition of answers to WH questions (i.e. who, what, when and where) (Li et al, 2017). Following the main idea behind these descriptions, this research refers to an incident or activity which happened at a certain time and was discussed or reported in social media as an event. Examples of such events include a goal scored during a football match and a speech delivered by a minister at a parliament session.

Consider a data stream: a continuous, chronological series of posts or documents in social media. Each document belonging to this stream carries the time at which it was generated together with a content which describes its idea. Given such a data stream, the aim of event detection is the automatic extraction of the events described in the document contents, together with the times at which they occurred.

4.1 Notations of terms

Since the proposed approach is time window-based, the notations T_t and T_{t+1} are used to denote two consecutive time windows at times t and t+1. All the notations commonly used throughout this paper are summarised in Table 2.

Notation        Description
T_t             window at time t
T_{t+1}         window at time t+1 (consecutive time window to T_t)
d_i             document i in a data stream
w_i             word/token i in a data corpus
emb_i           word embedding corresponding to the word/token w_i
V_t             vocabulary corresponding to the data at T_t
V_{t+1}         vocabulary corresponding to the data at T_{t+1}
|V|             length of the vocabulary V
l               dendrogram level
l_(w_i, w_j)    number of shared dendrogram levels between tokens w_i and w_j from the root
l_(root, n)     number of dendrogram levels from the root to node n
N               set of leaf nodes in a dendrogram
Table 2: Summary of notations used in the paper

5 Embed2Detect

Considering the applicability and features of prediction-based word embeddings, we propose a word embedding-based approach for event detection in social media, named Embed2Detect. The Embed2Detect system contains four main components: (1) stream chunker, (2) word embedding learner, (3) event window identifier and (4) event word extractor, as shown in Figure 4. Self-learned word embeddings are used during the event window identification and event word extraction phases. In order to evaluate the performance of this approach, an event mapper is used to map detected events to ground truth events during the experiments. Each of these components is further described in the following sections (Sections 5.1-5.4).

Figure 4: Overview of proposed method for event detection; Embed2Detect

5.1 Stream chunker

Data stream mining is mainly supported by three different time models: the landmark model, the tilted-window model and the sliding window model (Tsai, 2009). In the landmark model, all data from a specific time to the present is considered equally. Unlike this model, the tilted-window model treats recent data as more important than old data. The sliding window model splits the data stream into windows based on a fixed time period or number of transactions and performs data mining tasks on the data that belongs to each window.

Among these models, the time-based sliding window model has been widely used by previous research on event detection (Sayyadi et al, 2009; Alkhamees and Fasli, 2016; Adedoyin-Olowe et al, 2016; Choi and Park, 2019). Analysing the performance of previous methods and considering the requirement of temporal event identification, Embed2Detect also uses the sliding window model with a fixed time frame for event detection in social media data streams.

The stream chunker is the component which separates the data stream into windows. The length of the time frames can be adjusted depending on how quickly the targeted events evolve; smaller time frames are preferred for highly evolving events.
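A minimal sketch of such a chunker, assuming each post arrives as a (timestamp, text) pair; the real component operates on a live stream rather than a finished list:

```python
from datetime import datetime, timedelta

def chunk_stream(posts, window_minutes):
    """Split (timestamp, text) posts into fixed-length time windows."""
    width = timedelta(minutes=window_minutes)
    posts = sorted(posts, key=lambda p: p[0])   # ensure chronological order
    windows, current = [], []
    window_end = posts[0][0] + width
    for ts, text in posts:
        while ts >= window_end:                 # close finished windows
            windows.append(current)
            current = []
            window_end += width
        current.append(text)
    windows.append(current)
    return windows
```

For example, with a 2-minute frame, posts at 16:52, 16:53 and 16:55 fall into two windows: the first two posts together, the third alone.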

5.2 Word embedding learner

In order to incorporate statistical, syntactical and semantical features of text for event detection, prediction-based word embeddings are used. Rather than using pretrained word embeddings, this research learns embeddings on the targeted corpus to capture its unique characteristics. The word embedding learner transforms the text of the social media posts in a selected time window into a vector space. A separate vector space is learned for each time window to capture the variations between windows. Learned word embedding models are saved in data storage to facilitate event window identification and event word extraction.

Considering the high-quality vector representations produced by the Skip-gram algorithm, we used it to learn the embeddings in Embed2Detect. Due to the simplicity of this model architecture and the small size of the training corpora (chunks of a data stream), the time complexity of learning is not high enough to hinder real-time event detection.

5.3 Event window identifier

Given a chronological stream of time windows W, the event window identifier recognises the windows in which events have occurred. Since an event is an incident or activity which happened and was discussed, its occurrence should make a significant change in the data of the corresponding time window compared to the previous window. Based on this assumption, our method identifies windows whose change is higher than a predefined threshold (α) as event windows. Since normalised values are used to measure the change, α must lie between 0 and 1.

Before the change calculation phase, we preprocess the text in the social media documents for more effective results and efficient calculations. Apart from tokenisation, we do not conduct any preprocessing before learning the word embeddings, in order to preserve all valuable information and let the neural network model exploit it during learning. As preprocessing for the event window identification phase, punctuation marks and stop words are removed from the text, because they make no significant contribution to the idea described. Further, tokens with a frequency below a predefined threshold (β) are removed as outlier tokens (e.g. words which are incorrectly spelled or used to describe non-event information).

To calculate the textual data change between two consecutive time windows T_t and T_{t+1}, we considered two measures based on the word embeddings and vocabularies of these windows. An event occurrence can change the nearby words of a selected word or introduce new words to the vocabulary over time. For example, in a football match, if a goal is scored at T_t, ‘goal’ will be highly mentioned in the textual context of the scorer’s name. If that player then unexpectedly receives a yellow card in T_{t+1}, a new word, ‘yellow card’, will be added to the vocabulary and will appear in the context of the player’s name in place of the word ‘goal’. As described in Section 3.1.3, prediction-based word embeddings can effectively identify nearby word changes based on both syntactical and semantical aspects. Therefore, we propose a word embedding-based measure of nearby word changes, the cluster change calculation (Section 5.3.1). To measure the vocabulary changes, we propose the vocabulary change calculation (Section 5.3.3). The final value for the overall textual change between time windows is calculated by aggregating these two measures. As the aggregation method, we experimented with maximum and average value calculations (Section 6.6); of these, the best results were obtained using the maximum. An overview of the window change calculation is shown in Figure 5, and the complete flow of event window identification is summarised in Algorithm 1.

Figure 5: Overview of window change calculation
Result: E: time windows where events occurred
E = [];
α = predefined threshold for overall data change;
W = array of time windows;
for index = 1 to length(W) - 1 do
      T_t = W[index];
      T_{t+1} = W[index + 1];
      V_t = vocabulary at index;
      V_{t+1} = vocabulary at index + 1;
      /* Measure cluster change */
      V_common = common vocabulary for T_t and T_{t+1};
      n = length of V_common;
      M_t = similarity matrix at t using V_common;
      M_{t+1} = similarity matrix at t+1 using V_common;
      M_diff = |M_{t+1} - M_t|;
      /* Get average on upper triangular matrix */
      cluster_change = upper_triangular_average(M_diff);
      /* Measure vocabulary change */
      vocab_change = |V_{t+1} - V_t| / |V_{t+1}|;
      /* Measure overall change */
      change = max(cluster_change, vocab_change);
      if change > α then
            append T_{t+1} to E;
      end if
end for
Algorithm 1: Event Window Identification

Cluster change calculation

The cluster change calculation is proposed to measure changes of nearby words or word groups over time. To facilitate this calculation, a similarity matrix is generated for each time window T_t together with its following time window T_{t+1}. A similarity matrix is an n×n matrix, where n is the number of words in the vocabulary, and each cell (i, j) represents the similarity between the words w_i and w_j. Considering the requirement of calculating cluster similarity between words, we propose Dendrogram Level (DL) similarity (Section 5.3.2) as the similarity measure used during matrix generation. In order to compare the similarity matrices of two consecutive time windows, a common vocabulary needs to be used for matrix generation. Since we compare T_{t+1} against T_t, the preprocessed vocabulary at T_{t+1} is used as the common vocabulary for both windows.

After generating the similarity matrices at T_t and T_{t+1} using DL similarity between words, the element-wise absolute difference of the matrices is calculated. The average of these absolute differences is then taken as the value of the cluster change in T_{t+1} compared to T_t. During the average calculation, we only consider the values in the upper triangular matrix, excluding the diagonal, because the matrix is symmetric around the diagonal.
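The calculation above amounts to averaging the strict upper triangle of the element-wise absolute difference of the two matrices; a NumPy sketch:

```python
import numpy as np

def cluster_change(sim_t, sim_t1):
    """Average absolute change in pairwise word similarity between two
    windows. sim_t and sim_t1 are symmetric n x n similarity matrices
    built over the same common vocabulary."""
    diff = np.abs(np.asarray(sim_t1) - np.asarray(sim_t))
    iu = np.triu_indices_from(diff, k=1)   # upper triangle, diagonal excluded
    return float(diff[iu].mean())
```

For two words whose mutual similarity moves from 0.5 to 0.9 between windows, the cluster change is 0.4.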

Dendrogram level similarity

A dendrogram is a tree diagram which illustrates the relationships between objects; such diagrams are typically used to visualise hierarchical clustering. A sample dendrogram generated from a selected word set of tweets posted during the first goal of the English Premier League 19/20 match between Manchester United and Liverpool on 20 October 2019 is shown in Figure 6. Each merge, represented by a horizontal line, happens at the distance between the merged clusters or words. Merges between close groups, such as the name of the player who scored the goal, ‘rashford’, and the cluster containing the word ‘goal’, happen at low distances. In contrast, merges between distant groups, such as another player name, ‘firmino’, and the cluster of ‘goal’, happen at high distance values.

Figure 6: Sample dendrogram (y-coordinate denotes the cosine distance and x-coordinate denotes the selected words)

Focusing on these characteristics of dendrograms, we propose dendrogram level (DL) similarity to measure the similarity between words based on their cluster variations. Each horizontal line or merge represents a dendrogram level. Given a dendrogram, the similarity between a word pair w_i and w_j is calculated as the normalised number of levels they share from the root, as follows.

DL_similarity(w_i, w_j) = l_(w_i, w_j) / (max_{n ∈ N}(l_(root, n)) + 1)    (2)

The numerator of Equation 2 represents the number of shared dendrogram levels between w_i and w_j counted from the root. The denominator represents the maximum number of levels between the root and the leaf nodes, plus one: the leaf node level is counted as a separate level to ensure that only the similarity of a word with itself equals 1. For example, the maximum number of dendrogram levels from the root to the leaves in Figure 6 is 5; adding the leaf node level makes the maximum level count 6. The words ‘rashford’ and ‘goal’ share 4 levels, whereas ‘firmino’ and ‘goal’ share only 1 level, because they appear in distant clusters. Accordingly, the DL similarities between these words are DL_similarity(rashford, goal) = 4/6 ≈ 0.67 and DL_similarity(firmino, goal) = 1/6 ≈ 0.17.
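Equation 2 can be computed directly from a SciPy linkage matrix by tracing root-to-leaf paths and counting the shared prefix; a sketch (the tree-walking helper is our own illustration, not part of SciPy):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def root_paths(Z, n_leaves):
    """For each leaf of the dendrogram encoded by linkage matrix Z,
    list the internal (merge) nodes on the path from the root."""
    children = {n_leaves + i: (int(a), int(b))
                for i, (a, b, _dist, _count) in enumerate(Z)}
    paths = {}
    def walk(node, path):
        if node < n_leaves:
            paths[node] = path                 # leaf reached
        else:
            for child in children[node]:
                walk(child, path + [node])     # one dendrogram level deeper
    walk(n_leaves + len(Z) - 1, [])            # the last merge is the root
    return paths

def dl_similarity(paths, i, j):
    """Equation 2: shared levels from the root, normalised by the
    maximum root-to-leaf level count plus one (the leaf level)."""
    if i == j:
        return 1.0
    max_levels = max(len(p) for p in paths.values())
    shared = 0
    for a, b in zip(paths[i], paths[j]):
        if a != b:
            break
        shared += 1
    return shared / (max_levels + 1)

# usage on four toy 1-D points forming two tight pairs
vecs = np.array([[0.0], [0.1], [5.0], [5.1]])
Z = linkage(vecs, method='average')
paths = root_paths(Z, n_leaves=4)
```

With this toy dendrogram (two levels from root to leaves), points in the same pair share 2 levels, giving 2/3, while points from different pairs share only the root, giving 1/3.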

In order to compare DL similarities of words between time windows, a dendrogram needs to be generated per window. To generate the dendrograms, we applied HAC to the word embeddings learned for each window. As the linkage method, we used the average scheme, so that all cluster members are involved in the distance calculation. In average linkage, the distance between two clusters C_a and C_b is measured following Equation 3 (Müllner, 2011).

d(C_a, C_b) = (1 / (|C_a| · |C_b|)) · Σ_{x ∈ C_a} Σ_{y ∈ C_b} d(x, y)    (3)

where d(x, y) represents the distance between the cluster elements x and y, which belong to clusters C_a and C_b respectively. This distance is measured using cosine distance, which has proven effective for measurements on textual data (Mikolov et al, 2013a, b; Antoniak and Mimno, 2018). Since cosine distance is independent of vector magnitude, it is not biased by the frequency of words (Schakel and Wilson, 2015).
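Equation 3 with cosine distance can be written out directly as a mean over all cross-cluster element pairs; a minimal sketch:

```python
import numpy as np
from scipy.spatial.distance import cosine   # cosine distance: 1 - cos similarity

def average_linkage_distance(cluster_a, cluster_b):
    """Equation 3: mean pairwise cosine distance between all element
    pairs drawn from the two clusters (lists of vectors)."""
    return float(np.mean([cosine(x, y) for x in cluster_a for y in cluster_b]))
```

In practice SciPy's `linkage(..., method='average', metric='cosine')` performs this computation internally; the explicit form above just makes the formula concrete.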

Vocabulary change calculation

A vocabulary is a set of distinct words that belong to a particular language, person, corpus, etc. In this research, we consider the words that belong to the data corpus of each time window as a separate vocabulary. The vocabulary change calculation is proposed to measure the addition of new words to time windows over time; it also incorporates the statistical details of the data set. In order to have a value that is comparable over all time windows, we calculate a normalised vocabulary change for T_{t+1} compared to T_t following Equation 4.

vocabulary_change = |V_{t+1} − V_t| / |V_{t+1}|    (4)

The numerator of Equation 4 represents the number of new words that appear in the vocabulary of T_{t+1} compared to T_t, and the denominator represents the size of the vocabulary belonging to T_{t+1}.
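Treating each window's vocabulary as a set, Equation 4 reduces to a set difference; a one-function sketch:

```python
def vocabulary_change(vocab_t, vocab_t1):
    """Equation 4: fraction of the T_{t+1} vocabulary made up of words
    not seen in T_t. Both arguments are sets of tokens."""
    return len(vocab_t1 - vocab_t) / len(vocab_t1)
```

For example, if two of the four words at T_{t+1} (say ‘yellow’ and ‘card’) are new, the vocabulary change is 0.5.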

5.4 Event word extractor

After a time window is identified as an event-occurring window, the event word extractor extracts the words of that window which are related to the occurred events. Since events change the textual corpus, this component marks all words in a window which show a cluster change compared to the previous window as event words. Because a common vocabulary is used between consecutive windows during similarity matrix generation, the cluster change calculation also identifies words newly added to T_{t+1} as words with changes. All words belonging to word pairs whose change is above 0 are considered words with temporal cluster changes.
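This extraction rule can be sketched over the same pair of similarity matrices used for the cluster change: a word is an event word if any of its pairwise similarities changed between the windows.

```python
import numpy as np

def extract_event_words(vocab, sim_t, sim_t1):
    """Return every word involved in a word pair whose similarity
    changed between the windows (change above 0). `vocab` is the common
    vocabulary, indexed consistently with the matrices."""
    diff = np.abs(np.asarray(sim_t1, dtype=float) - np.asarray(sim_t, dtype=float))
    np.fill_diagonal(diff, 0.0)            # self-similarity is always 1
    changed = diff.max(axis=1) > 0         # any changed pair per word
    return [w for w, c in zip(vocab, changed) if c]
```

In practice a small tolerance rather than a strict 0 may be preferable to absorb floating-point noise; the strict form follows the description above.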

6 Experimental study

In this section, we present the main results of the experiments conducted on social media data sets. More details about the data sets are given in Section 6.1. To evaluate the results, we used the evaluation metrics described in Section 6.2. We also considered three recent event detection methods as baselines against which to compare the performance of the proposed system, Embed2Detect (Section 6.3).

We implemented a prototype of Embed2Detect in Python 3.7, which has been made publicly available on GitHub. All experiments were conducted on an Ubuntu 18.04 machine with a 2.40GHz 16-core CPU and 16GB RAM. We analysed the impact of different parameter settings (Section 6.4), preprocessing techniques (Section 6.5) and aggregation methods (Section 6.6) on the effectiveness of the Embed2Detect approach. Further, we evaluated the efficiency of Embed2Detect to examine its appropriateness for real-time event detection; the results are described in Section 6.7. Embed2Detect was also compared with the baseline methods considering both effectiveness and efficiency, and the obtained results show that it outperforms the baselines (Section 6.8). Additionally, we conducted experiments to suggest possible extensions to Embed2Detect using other word embedding models; the obtained results and suggestions are summarised in Section 6.9.

6.1 Data sets and preparation

To conduct the experiments and evaluations, we used real social media data sets. This section describes the details of data sets (Section 6.1.1), data collection methods (Section 6.1.2) and data cleaning methods (Section 6.1.3) used in this research.

Data sets

Embed2Detect was applied to real and recent social media data sets. For evaluation, a set of ground truth (GT) labels describing the events in the data sets needs to be defined. However, it is impractical to extract all events in a data stream due to its high volume; therefore, we selected data sets filtered from a data stream to facilitate GT generation. While selecting the data sets, we focused on two different domains, sports and politics, to demonstrate the domain independence of our method.

To generate the sports data set, the English Premier League 19/20 match between two popular teams, Manchester United and Liverpool, was selected. This match was held at Old Trafford, Manchester on 20 October 2019. Each team scored a single goal and the match ended as a draw. Starting from 16:30, the total duration of the match was 115 minutes including the half-time break. This data set will be referred to as ‘MUNLIV’ in the following sections.

To generate the political data set, the Brexit Super Saturday in 2019 was selected. This was a UK parliament session which exceptionally happened on Saturday, 19 October 2019, the first Saturday session in 37 years. Even though it was organised to hold a vote on a new Brexit deal, the vote was cancelled due to an amendment passed against the deal. The session started at 09:30 and lasted until around 16:30. This data set will be referred to as ‘BrexitVote’ in the following sections.

As GT, events found under the above-mentioned two topics were considered. These events were extracted by analysing the news media and social media data related to the particular events in the corresponding time periods. Each event was supported by a set of keywords taken from news and social media, to be compared with the identified event words. We have made these data sets, including the GT labels, publicly available.

Data collection

Even though the proposed method is applicable to any social media data set, considering the restrictions, support and coverage of data collection by social media services, we decided to use Twitter data sets for the experiments. Twitter developer Application Programming Interfaces (APIs) were used to extract the tweets posted during the selected time periods under the selected topics. Initially, data belonging to a particular topic was extracted using a trending hashtag. Then the hashtags found in the extracted data were ranked by popularity, and the popular hashtags were used for further data extraction.

For MUNLIV, we collected 118,700 tweets during the period 16:15-18:30. Among them, we used the 99,995 (84.2%) tweets posted during the match for the experiments, because GT events could only be extracted from news media for this period. For BrexitVote, we collected 276,448 tweets during the period 08:30-18:30, but only used the 174,835 (63.2%) tweets posted from the beginning of the parliament session until the vote on the amendment. Similar to MUNLIV, news media coverage was high until the vote, which allowed more accurate GT events to be extracted. Considering the evolution rate of events and the data volume required to learn word embeddings, 2-minute time windows were selected for the sports data set MUNLIV and 30-minute windows for the political data set BrexitVote. After separating the data into chunks, there were on average 1,724 and 14,530 tweets per time window in the sports and political data sets respectively.

Data Cleaning

To learn embeddings for separate tokens, embedding models need tokenised text. Since we focused on Twitter data sets, we used the TweetTokenizer available in the Natural Language Toolkit (NLTK) to tokenise the text in tweets. This tokeniser was designed to be flexible across new domains, with consideration for characteristics of social media text such as repeated characters and special tokens. It can shorten characters repeated more than three times, generalising the various word forms introduced by users: for example, both ‘goallll’ and ‘goalllll’ are reduced to the same form. Further, it correctly tokenises emoticons and words specific to the social media context (e.g. 1-0, c’mon, #LFC, :-)). We also lowercased the tokenised text, removing case sensitivity.

In addition to tokenising, retweet notations, links and hash symbols were removed from the text. Retweet notations and links were removed because they make no contribution to the idea described; hash symbols were removed so that hashtags and other words are treated similarly during embedding learning. These removals were automated using regular expression-based text pattern matching.
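A sketch of such regex-based cleaning; the exact patterns used by the authors are not specified, so the expressions below are illustrative assumptions:

```python
import re

def clean_tweet(text):
    """Remove retweet notation, links and hash symbols from a tweet."""
    text = re.sub(r'\bRT\s+@\w+:?', '', text)   # retweet notation (assumed form)
    text = re.sub(r'https?://\S+', '', text)    # links
    text = text.replace('#', '')                # keep hashtag words, drop symbol
    return ' '.join(text.split())               # normalise whitespace
```

Note that the hashtag word itself is preserved, consistent with treating hashtags like ordinary words during embedding learning.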

6.2 Evaluation metrics

In order to evaluate the performance of the proposed method and the baselines, event words are compared with GT event keywords using the following metrics. In the definitions below, the set of all event windows in the data set, the detected event windows and the relevant event windows among those detected are represented by E, E_d and E_r respectively.

  • Recall: Fraction of the number of relevant event windows detected among the total number of event windows that exist in the data set

  • Precision: Fraction of the number of relevant event windows detected among the total number of event windows detected

  • F-Measure (F1): Weighted harmonic mean of precision and recall

  • Keyword Recall: Fraction of the number of correctly identified words among the total number of keywords mentioned in the GT events (Aiello et al, 2013). To calculate a final value for a set of time windows, micro averaging (Manning et al, 2008b) is used.

    Here, T represents the event-occurring time frames, k represents the words/keywords and GT represents the set of ground truth events.

While calculating recall, precision and F-measure, a detected window is marked as a relevant event window if all the events that occurred during that time period are found among the event words identified for that window. A match between the event words and a GT event is established if at least one GT keyword corresponding to that event is found among the event words. Likewise, for the keyword recall calculation, if at least one word mentioned in a synonym (similar) word group in the GT is found, it is counted as a match; accordingly, the total number of GT keywords is calculated as the total number of synonym word groups.
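Under these matching rules, micro-averaged keyword recall can be sketched as follows (the data structures are assumptions for illustration: detected event words per window, and GT synonym groups per window):

```python
def keyword_recall(detected, ground_truth):
    """detected: {window: set of identified event words}.
    ground_truth: {window: list of synonym keyword groups (sets)}.
    Each synonym group counts as one GT keyword and is matched if any
    of its members was identified; micro averaging pools counts over
    all windows before dividing."""
    matched = total = 0
    for window, groups in ground_truth.items():
        words = detected.get(window, set())
        for group in groups:
            total += 1
            if words & group:
                matched += 1
    return matched / total
```

For instance, if one of two GT keyword groups in a window is matched, the keyword recall over that window is 0.5.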

6.3 Baseline methods

Since there is no standard data set for evaluating event detection performance, the available methods cannot be compared with each other directly to pick a single best baseline. Therefore, considering the requirements of event detection and the available competitive areas, we selected three recently proposed methods as baselines. The major requirements we focused on during this selection were effectiveness, efficiency and expandability. To make the baselines strong, we also ensured that they cover different competitive areas, which can be summarised as: incorporation of the social aspect, word acceleration over frequency, unsupervised learning (tensor decomposition and clustering) and segments over uni-grams. Similar to our approach, all these methods process the whole data stream to identify temporal events, rather than considering only selected keywords (e.g. hashtags). More details on the selected baseline methods are as follows.

  • MABED (Guille and Favre, 2015): Anomalous user mention-based statistical method
    Mention anomalies were taken into consideration in this research in order to incorporate social aspect of Twitter with event detection rather than only focusing on textual contents of tweets. User mentions are links added intentionally to connect a user with a discussion or dynamically during re-tweeting. Anomalous variations in mention creation frequency and their magnitudes were used for event detection. To extract the event words, co-occurrences of words and their temporal dynamics were used.

  • TopicSketch (Xie et al, 2016): Word acceleration-based tensor decomposition method
    Word acceleration was suggested by this research to support event detection because it can differentiate bursty topics (events) from general topics like cars, food or music. Events force people to discuss them intensively; this force can be expressed as acceleration, which the research proposed as a better measure than frequency for event detection. To extract the event words, a tensor decomposition method, singular value decomposition (SVD), was used.

  • SEDTWik (Morabia et al, 2019): Segment-based clustering method powered by Wikipedia page titles
    This is an extension of the Twevent system (Li et al, 2012). This research focuses on text segments because they are more meaningful and specific than uni-grams. Wikipedia page titles were used as a semantic resource during segment extraction to preserve the informativeness of the identified segments. To identify events, bursty segments were clustered using the Jarvis-Patrick algorithm; the burstiness of segments was calculated using both text statistics and user diversity-based measures.

6.4 Parameter selection

In Embed2Detect, the word embedding learner and the event window identifier require some hyper-parameters. Sections 6.4.1 and 6.4.2 describe the impact of different hyper-parameter settings and identify the hyper-parameters which need to be adjusted to the characteristics of the data set.

Parameters for word embedding learning

Word embedding learning mainly requires 3 hyper-parameters: minimum word count, context size and vector dimension. Given a minimum word count, the learning phase ignores all tokens whose total frequency is lower than it. The context size defines the number of words around the word of interest considered during learning. The vector dimension is the number of neurons in the hidden layer, which is also the dimensionality of the word embeddings.

Considering the limited amount of data available in a data window, we fixed the minimum word count to 1. However, we analysed how the effectiveness of event detection and its execution time vary with different context sizes and vector dimensions before selecting their values. Effectiveness was evaluated using the F-measure (F1). The results for both data sets are visualised in Figure 11. Based on the results, there was no significant change in F1 across different context sizes and vector dimensions, but execution time increased gradually as either hyper-parameter value increased. Considering the execution time and the need to provide sufficient context for learning, we selected 5 as the context size for both data sets. Likewise, considering the time and the size of the data corpora, we fixed the word embedding dimensionality at 100. Learning higher-dimensional word vectors is recommended only with much larger data sets, which provide enough data to adjust the weights of the neural network model properly (Mikolov et al, 2013a).

(a) F1 with different context sizes (with vector dimension=100)
(b) F1 with different vector dimensions (with context size=5)
(c) Time taken with different context sizes (with vector dimension=100)
(d) Time taken with different vector dimensions (with context size=5)
Figure 11: Analysis on F1 and execution time with different values for word embedding learning parameters; context size and vector dimension (Average time taken to execute the full process on single data window is used for time values in both data sets)

Parameters for event window identification

As described in Section 5.3, the event window identifier requires 2 hyper-parameters: β and α. β is used to remove outlier tokens, and α is used during the extraction of event windows. Variations in the effectiveness of event detection (F1) with different threshold values are visualised in Figure 14.

According to the results in Figure 14(a), there is a clear decline in F1 when β is increased, in both data sets. When a high β is used, tokens which describe the events can be removed along with the outliers; therefore, we propose using a low value for this threshold. We obtained the highest F1 for MUNLIV and BrexitVote at two different β values, because word usage varies depending on the nature of the domain.

According to the results in Figure 14(b), F1 declines after reaching its peak as α increases further. Similar to the scenario with β, we obtained these peaks at two different threshold values for MUNLIV and BrexitVote. The reason is the higher evolution rate of the sports data compared to the political data.

Based on these experimental results, we cannot use fixed values for β and α, because our data sets come from two domains where word usage and evolution differ. Therefore, for the following experiments, we used a range of values for both thresholds and report the best results. A similar strategy was used for the hyper-parameters of the baseline methods.

(a) F1 with different β values (with α=0.14)
(b) F1 with different α values (with β=10)
Figure 14: Analysis of F1 with different values of the event window identification parameters β and α

6.5 Impact by preprocessing

We experimented with the impact of different preprocessing techniques on the effectiveness of event detection in Embed2Detect. Only the cluster change calculation was used to identify events in these experiments, because it is the measure most influenced by changes in tokens.

According to the results we obtained (Table 3), the highest F1 for both the sports and political data sets is obtained for tokens without punctuation and stop-words. Even though preprocessing improves the performance measures, these results show that good measures can also be obtained without preprocessing. This ability is helpful in situations where direct preprocessing mechanisms cannot be integrated, such as stop-word removal for a less commonly used language or for a data set composed of more than one language. Since both data sets used in this research are mainly written in English, we used the tokens without punctuation and stop-words for the following experiments.

                                       MUNLIV                       BrexitVote
Method                                 Recall  Precision  F1        Recall  Precision  F1
all tokens                             0.826   0.463      0.594     1.000   0.800      0.889
without punctuation                    0.913   0.457      0.609     1.000   0.727      0.842
without punctuation and stop-words     0.696   0.552      0.615     1.000   0.800      0.889
Table 3: Evaluation results with different preprocessing techniques

6.6 Aggregation method

In this phase of the research, we experimented only with simple and commonly used aggregation techniques: average and maximum. The aggregation is applied to the values obtained by the cluster change calculation and the vocabulary change calculation (Section 5.3). The evaluation results on the selected data sets using these aggregation methods are shown in Table 4.

According to the results, for the MUNLIV data set there is only a slight change in F1 between the average and maximum calculations, but the recall and precision values are more balanced when the maximum is used. For BrexitVote, there is a clear change in F1, with the higher value obtained using the maximum calculation.

Based on the results we obtained using the data sets in two different domains, we decided to use maximum calculation as the aggregation method in Embed2Detect.

           MUNLIV                       BrexitVote
Method     Recall  Precision  F1        Recall  Precision  F1
average    0.696   0.615      0.653     1.000   0.727      0.842
maximum    0.652   0.652      0.652     1.000   0.800      0.889
Table 4: Evaluation results with different aggregation methods

6.7 Efficiency evaluation

Efficiency is a critical measure for real-time event detection. Therefore, we evaluated the efficiency of Embed2Detect by measuring execution time with increasing data size. As the data size, we used the number of documents (tweets) per time window, because this value can grow much larger in a full data stream. As the time, we measured the execution time of a single window. The obtained results are shown in Figure 15.

According to the results, Embed2Detect takes nearly 10 seconds to process 5,000 documents, and this increases to about 40 seconds for 25,000 documents when processed sequentially. To make the execution faster, we parallelised our implementation; the worker count can be adjusted depending on the resources available on the machine. With 8 workers, 25,000 documents could be processed within 20 seconds. This shows that Embed2Detect is efficient enough to be utilised for real-time event detection in social media data streams.
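The window-level parallelisation described above can be sketched with a standard worker pool. The per-window routine here is a placeholder; the actual pipeline preprocesses the window, learns embeddings and detects events.

```python
from multiprocessing import Pool

def process_window(window_tweets):
    """Placeholder for one window's pipeline (preprocess, learn
    embeddings, detect events); here it just returns the document
    count so the sketch stays self-contained."""
    return len(window_tweets)

def run_windows(windows, workers=8):
    """Process time windows in parallel with a configurable worker pool."""
    with Pool(processes=workers) as pool:
        return pool.map(process_window, windows)

if __name__ == "__main__":
    windows = [["tweet"] * n for n in (3, 5, 2)]
    print(run_windows(windows, workers=2))  # [3, 5, 2]
```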

Figure 15: Execution time on different data sizes including the effect by sequential and parallel processing

6.8 Comparison with baselines

We compared the effectiveness and efficiency of Embed2Detect with the selected baseline methods: MABED, TopicSketch and SEDTWik (Section 6.3). Effectiveness was measured using the evaluation metrics recall, precision, F1 and keyword recall (Section 6.2). To measure efficiency, we used the total time taken by each method to execute the complete process on the full data sets and the average time taken per time window. Because the data sets used in these experiments are from two different domains, different hyper-parameter settings were tried for each method to obtain the best results. The experiment results for MUNLIV and BrexitVote are given in Tables 5 and 6 respectively, and the corresponding parameter settings are summarised in Table 7. The TopicSketch method has more parameters; considering its high time complexity, we used the default values for those not mentioned here. For Embed2Detect, parallel processing with 8 workers is used; for the other methods, sequential processing is used, following the available implementations.
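The event-level metrics can be sketched as follows, under the simplifying assumption of this sketch that detected and ground-truth (GT) events are represented as sets of matchable identifiers; keyword recall would additionally compare the detected event words against the GT keywords.

```python
def evaluate(detected, ground_truth):
    """Event-level recall (over GT events), precision (over detected
    events) and their harmonic mean, F1."""
    tp = len(detected & ground_truth)  # correctly detected events
    recall = tp / len(ground_truth) if ground_truth else 0.0
    precision = tp / len(detected) if detected else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return recall, precision, f1

r, p, f1 = evaluate({"e1", "e2", "e4"}, {"e1", "e2", "e3"})
print(r, p, f1)  # each 2/3
```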

Method         Recall   Precision  F1      Keyword Recall   Execution Time (s)
                                                            Total     Average
MABED          0.478    0.193      0.275   0.348            168       2.947
TopicSketch    0.609    0.246      0.350   0.400            25492     447.228
SEDTWik        0.652    0.268      0.380   0.386            1290      22.632
Embed2Detect   0.652    0.652      0.652   0.843            202       3.544
Table 5: Performance comparison of Embed2Detect with baseline methods using MUNLIV data set

Method         Recall   Precision  F1      Keyword Recall   Execution Time (s)
                                                            Total     Average
MABED          0.625    0.455      0.526   0.403            532       48.364
TopicSketch    0.500    0.364      0.421   0.254            15887     1444.273
SEDTWik        0.750    0.500      0.600   0.426            702       63.818
Embed2Detect   1.000    0.800      0.889   0.985            310       28.182
Table 6: Performance comparison of Embed2Detect with baseline methods using BrexitVote data set
Method         Parameter Setting (MUNLIV)           Parameter Setting (BrexitVote)
MABED          k = 150                              k = 150
               min. absolute frequency = 10         min. absolute frequency = 10
               max. relative frequency = 0.4        max. relative frequency = 0.4
               p = 20                               p = 20
               = 0.7                                = 0.6
               = 0.5                                = 0.5
TopicSketch    detection threshold = 60             detection threshold = 35
               bucket size = 5000                   bucket size = 5000
SEDTWik        H = 3, M = 2, k = 6, = 0.7           H = 3, M = 2, k = 6, = 0.2
Embed2Detect   = 20, = 0.23                         = 10, = 0.16
Table 7: Parameter settings used by each method for the best results

Based on the results, Embed2Detect outperforms the baseline methods on both data sets, with an F1 of 0.652 on MUNLIV and 0.889 on BrexitVote. This shows that, compared with the available methods, our method can detect events effectively in two different domains, specifically sports and politics. Considering the total execution time, on MUNLIV, MABED took 168 seconds and Embed2Detect took 34 seconds more. But on BrexitVote, Embed2Detect completed the execution in 310 seconds, 222 seconds faster than MABED. In terms of average execution time per window, Embed2Detect took 3.544 seconds for a 2-minute window on the MUNLIV data set and 28.182 seconds for a 30-minute window on the BrexitVote data set. These time measures show that Embed2Detect is sufficiently fast for real-time event detection.

6.9 Extension to other word embedding models

Even though we used the Skip-gram model to prove our concept, other word embedding models can also be used with Embed2Detect. Since we implemented the word embedding learner as a separate module in the Embed2Detect architecture, different word embeddings can easily be plugged in. However, it is important to consider the learning time when selecting a word embedding model, because our goal is real-time event detection. We analysed the time taken by different architectures to learn word embeddings in order to assess their appropriateness; the obtained results are summarised in Table 8.

In this experiment, we compared the fastText, BERT and DistilBERT models with Skip-gram. FastText is an updated version of the Skip-gram model which considers subword information while learning word representations (Bojanowski et al, 2017). Both BERT and DistilBERT are transformer-based models. Following recent advances in NLP, transformers have gained success in many areas such as language generation (Devlin et al, 2019), named entity recognition (Liang et al, 2020) and question answering (Yang et al, 2019). BERT: Bidirectional Encoder Representations from Transformers (Devlin et al, 2019) is the first transformer model which gained wide attention. As a solution to the high data requirement of deep neural networks, this model is designed to pretrain on unlabelled text using the masked language modelling (MLM) objective and then fine-tune for a downstream task. DistilBERT is a distilled version of BERT which is lighter and faster (Sanh et al, 2019).

Both the Skip-gram and fastText models were trained from scratch using Twitter data, as suggested by this research. Following the idea presented with transformers, for both BERT and DistilBERT we retrained the available models using our data. We did not try training transformers from scratch, because time windows do not contain sufficient data to properly adjust the weights of a deep neural network. As the pretrained models, bert-base-uncased was selected for BERT and distilbert-base-uncased for DistilBERT.

According to the obtained results (Table 8), classic word embedding models (e.g. Skip-gram and fastText) learn the representations faster than transformer-based models (e.g. BERT and DistilBERT). Comparing fastText and Skip-gram, fastText takes more time because it processes subword information. However, the incorporation of subwords allows this model to capture connections between modified words. For example, consider the goal-related words found within the top 20 words with high cluster change during a goal score:

Skip-gram- goal, goalll, rashyyy, scores
fastText- goalll, goooaaalll, rashford, rashyyy, @marcusrashford, scored, scores

fastText captures more modified words than Skip-gram. However, we could not run a complete evaluation using fastText embeddings, because the GT keywords only contain words in their actual form, so such an evaluation would require a manual process.

Considering the time taken by the transformer-based models, it is far longer than that taken by both Skip-gram and fastText. Comparing BERT and DistilBERT, DistilBERT is faster than BERT. However, even DistilBERT's learning time is not fast enough for real-time processing, because it exceeds the tweet generation time. For example, to learn from the tweets posted during a 2-minute time window, it took approximately 7.2 minutes. According to these findings, we cannot achieve the efficiency required for real-time social media data processing using transformer-based models. Considering this limitation, we did not conduct further experiments using transformers.
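The feasibility argument above reduces to a simple check against the window length, illustrated here with the figures from Table 8 for the 2-minute (120 s) window:

```python
def real_time_feasible(learning_time_s, window_length_s):
    """Embedding learning keeps up with the stream only if it finishes
    within one time-window length."""
    return learning_time_s < window_length_s

# Learning times (s) for the 2 min. window with 1705 tweets (Table 8).
for model, secs in {"Skip-gram": 1, "fastText": 12,
                    "BERT": 646, "DistilBERT": 433}.items():
    print(model, real_time_feasible(secs, 120))
```

Only the classic models pass this check; DistilBERT's 433 s is roughly 3.6 windows behind the stream.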

Time Window Length   Tweet Count   Embedding Learning Time (s)
                                   Skip-gram   fastText   BERT    DistilBERT
2 min. (120 s)       1705          1           12         646     433
30 min. (1800 s)     20133         18          41         21442   11699
Table 8: Time taken to learn embeddings by different architectures

7 Conclusions and future work

In this paper, we proposed a novel event detection method coined Embed2Detect to identify event occurrences in social media data streams. Embed2Detect mainly combines the characteristics of prediction-based word embeddings and hierarchical agglomerative clustering. This method uses self-learned word embeddings to capture the features of the targeted corpus in order to facilitate domain-, platform- and language-independent event detection. Therefore, Embed2Detect can easily be applied to any social media data set in any language, even though the majority of available methods are limited to specific platforms (e.g. Twitter) and languages (e.g. English). Further, this approach is also applicable to multilingual data sets, an important requirement for processing social media data given a user base distributed all over the world.

In contrast with prior work, Embed2Detect not only considers the syntax and statistics of the underlying text but also incorporates semantics. The inclusion of semantics makes it possible to understand the relationships between words. Due to its huge and diverse user base, social media text contains different words and word sequences which describe the same idea. Knowing the relationships between words, similar ideas that are described differently, and the connections between them, can be extracted. Therefore, our approach is capable of reducing the information loss experienced by previous approaches due to the lack of semantic involvement.

According to the evaluations conducted, Embed2Detect performed significantly better than the recently suggested event detection methods, namely MABED, TopicSketch and SEDTWik, on both the MUNLIV and BrexitVote data sets from the domains of sports and politics. As evaluation metrics, we used recall, precision, F-measure and keyword recall to conduct a comprehensive evaluation. We also considered data from two contrasting domains, which have different word usage, audiences and evolution rates, to evaluate the universality of the methods. In addition to effectiveness, we measured the efficiency of Embed2Detect, because real-time event detection is a time-critical operation. Embed2Detect performed event detection on both data sets within a short time period and could handle increasing data volumes, indicating its appropriateness for real-time application. In summary, the results obtained from the experiments show that Embed2Detect can detect events in social media data effectively and efficiently without depending on domain-specific features.

As an extension to Embed2Detect, more advanced word embedding learning methods can be applied. However, considering the learning time, classic word embedding models such as Skip-gram are more suitable than advanced models such as BERT for preserving the efficiency of the method. Among classic word embeddings, we hope to further analyse the impact of subword and character-based models, which can be used to capture the connections between informal or modified text and its formal versions. Such an approach would be useful for understanding the informal text that is common in social media. Further, given the recent improvements to NLP brought by transformer-based models, their pretrained word embeddings could be used to generate more comprehensive event details, such as summaries built from the detected event words, in a future phase of this research. We also plan to extend our method to identify event evolution over time, to facilitate event detection and tracking together.

Compliance with ethical standards

Conflict of Interest: The authors declare that they have no conflict of interest.

Footnotes

  1. email: Hansi.Hettiarachchi@mail.bcu.ac.uk
  2. email: Mariam.Adedoyin-Olowe@bcu.ac.uk
  3. email: Jagdev.Bhogal@bcu.ac.uk
  4. email: Mohamed.Gaber@bcu.ac.uk
  17. Data sets are available on https://github.com/hhansi/twitter-event-data-2019
  18. Embed2Detect implementation is available on https://github.com/hhansi/embed2detect
  21. More details about Twitter developer service including its APIs are available at https://developer.twitter.com/
  22. For MUNLIV data collection, the hashtags #MUNLIV, #MUFC, #LFC, #Liverpool, #GGMU, #PL, #VAR and #YNWA were used; for BrexitVote data collection, the hashtags #BrexitVote, #SuperSaturday, #Brexit, #BrexitDeal, #FinalSay, #PeoplesVote and #PeoplesVoteMarch were used.
  23. NLTK documentation is available at https://www.nltk.org/

References

  1. Adedoyin-Olowe M, Gaber MM, Stahl F (2013) Trcm: a methodology for temporal analysis of evolving concepts in twitter. In: International Conference on Artificial Intelligence and Soft Computing, Springer, pp 135–145
  2. Adedoyin-Olowe M, Gaber MM, Dancausa CM, Stahl F, Gomes JB (2016) A rule dynamics approach to event detection in twitter with its application to sports and politics. Expert Systems with Applications 55:351–360
  3. Aiello LM, Petkos G, Martin C, Corney D, Papadopoulos S, Skraba R, Göker A, Kompatsiaris I, Jaimes A (2013) Sensing trending topics in twitter. IEEE Transactions on Multimedia 15(6):1268–1282
  4. Aldhaheri A, Lee J (2017) Event detection on large social media using temporal analysis. In: 2017 IEEE 7th Annual Computing and Communication Workshop and Conference (CCWC), IEEE, pp 1–6
  5. Alkhamees N, Fasli M (2016) Event detection from social network streams using frequent pattern mining with dynamic support values. In: 2016 IEEE International Conference on Big Data (Big Data), IEEE, pp 1670–1679
  6. Antoniak M, Mimno D (2018) Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics 6:107–119
  7. Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. Journal of machine learning research 3(Feb):1137–1155
  8. Benhardus J, Kalita J (2013) Streaming trend detection in twitter. International Journal of Web Based Communities 9(1):122–139
  9. Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146
  10. Castillo C, Mendoza M, Poblete B (2011) Information credibility on twitter. In: Proceedings of the 20th international conference on World wide web, ACM, pp 675–684
  11. Chaffey D (2019) Global social media research summary 2019 — smart insights. https://www.smartinsights.com/social-media-marketing/social-media-strategy/new-global-social-media-research/
  12. Chen G, Kong Q, Mao W (2017) Online event detection and tracking in social media based on neural similarity metric learning. In: 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), IEEE, pp 182–184
  13. Choi HJ, Park CH (2019) Emerging topic detection in twitter stream based on high utility pattern mining. Expert Systems with Applications 115:27–36
  14. Clement J (2019) Global social media ranking 2019 — statista. https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/
  15. Corney D, Martin C, Göker A (2014) Spot the ball: Detecting sports events on twitter. In: European Conference on Information Retrieval, Springer, pp 449–454
  16. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186, DOI 10.18653/v1/N19-1423, URL https://www.aclweb.org/anthology/N19-1423
  17. Edouard A, Cabrio E, Tonelli S, Le Thanh N (2017) Graph-based event extraction from twitter. In: RANLP17-Recent advances in natural language processing
  18. Gottfried JA, Shearer E (2017) News use across social media platforms 2017. https://www.journalism.org/2017/09/07/news-use-across-social-media-platforms-2017/
  19. Guille A, Favre C (2015) Event detection, tracking, and visualization in twitter: a mention-anomaly-based approach. Social Network Analysis and Mining 5(1):18
  20. James J (2018) Data never sleeps 6.0. 2018. https://www.domo.com/blog/data-never-sleeps-6/
  21. Kwak H, Lee C, Park H, Moon S (2010) What is twitter, a social network or a news media? In: Proceedings of the 19th international conference on World wide web, ACM, pp 591–600
  22. Li C, Sun A, Datta A (2012) Twevent: segment-based event detection from tweets. In: Proceedings of the 21st ACM international conference on Information and knowledge management, pp 155–164
  23. Li Q, Nourbakhsh A, Shah S, Liu X (2017) Real-time novel event detection from social media. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE), IEEE, pp 1129–1139
  24. Liang C, Yu Y, Jiang H, Er S, Wang R, Zhao T, Zhang C (2020) Bond: Bert-assisted open-domain named entity recognition with distant supervision. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp 1054–1064
  25. Maaten Lvd, Hinton G (2008) Visualizing data using t-sne. Journal of machine learning research 9(Nov):2579–2605
  26. Manning CD, Raghavan P, Schütze H (2008a) Introduction to information retrieval. Cambridge university press
  27. Manning CD, Raghavan P, Schütze H (2008b) Text classification and Naive Bayes, Cambridge University Press, p 234–265. DOI 10.1017/CBO9780511809071.014
  28. McCreadie R, Macdonald C, Ounis I, Osborne M, Petrovic S (2013) Scalable distributed event detection for twitter. In: 2013 IEEE international conference on big data, IEEE, pp 543–549
  29. Mikolov T, Karafiát M, Burget L, Černockỳ J, Khudanpur S (2010) Recurrent neural network based language model. In: Eleventh annual conference of the international speech communication association
  30. Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
  31. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
  32. Morabia K, Murthy NLB, Malapati A, Samant S (2019) Sedtwik: Segmentation-based event detection from tweets using wikipedia. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pp 77–85
  33. Müllner D (2011) Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378
  34. Nguyen S, Ngo B, Vo C, Cao T (2019) Hot topic detection on twitter data streams with incremental clustering using named entities and central centroids. In: 2019 IEEE-RIVF International Conference on Computing and Communication Technologies (RIVF), IEEE, pp 1–6
  35. Nur’Aini K, Najahaty I, Hidayati L, Murfi H, Nurrohmah S (2015) Combination of singular value decomposition and k-means clustering methods for topic detection on twitter. In: 2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS), IEEE, pp 123–128
  36. Parikh R, Karlapalem K (2013) Et: events from tweets. In: Proceedings of the 22nd international conference on world wide web, pp 613–620
  37. Pennington J, Socher R, Manning C (2014) Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
  38. Roux M (2018) A comparative study of divisive and agglomerative hierarchical clustering algorithms. Journal of Classification 35(2):345–366
  39. Sag IA, Pollard C (1987) Information-based syntax and semantics. CSLI lecture notes 13
  40. Sanh V, Debut L, Chaumond J, Wolf T (2019) Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108
  41. Sayyadi H, Hurst M, Maykov A (2009) Event detection and tracking in social streams. In: Third International AAAI Conference on Weblogs and Social Media
  42. Schakel AM, Wilson BJ (2015) Measuring word significance using distributed representations of words. arXiv preprint arXiv:1508.02297
  43. Schinas M, Papadopoulos S, Petkos G, Kompatsiaris Y, Mitkas PA (2015) Multimodal graph-based event detection and summarization in social media streams. In: Proceedings of the 23rd ACM international conference on Multimedia, ACM, pp 189–192
  44. Škrlj B, Kralj J, Lavrač N (2020) Embedding-based silhouette community detection. Machine Learning pp 1–33
  45. Small SG, Medsker L (2014) Review of information extraction technologies and applications. Neural computing and applications 25(3-4):533–548
  46. Tsai PS (2009) Mining frequent itemsets in data streams using the weighted sliding window model. Expert Systems with Applications 36(9):11617–11625
  47. Van Oorschot G, Van Erp M, Dijkshoorn C (2012) Automatic extraction of soccer game events from twitter. In: DeRiVE@ ISWC, pp 21–30
  48. Xie W, Zhu F, Jiang J, Lim EP, Wang K (2016) Topicsketch: Real-time bursty topic detection from twitter. IEEE Transactions on Knowledge and Data Engineering 28(8):2216–2229
  49. Xu X, Yuruk N, Feng Z, Schweiger TA (2007) Scan: a structural clustering algorithm for networks. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 824–833
  50. Yang W, Xie Y, Lin A, Li X, Tan L, Xiong K, Li M, Lin J (2019) End-to-end open-domain question answering with BERTserini. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Association for Computational Linguistics, Minneapolis, Minnesota, pp 72–77, DOI 10.18653/v1/N19-4013, URL https://www.aclweb.org/anthology/N19-4013
  51. Yilmaz S, Toklu S (2020) A deep learning analysis on question classification task using word2vec representations. Neural Computing and Applications pp 1–20
  52. Zhang L, Liu P, Gulla JA (2019) Dynamic attention-integrated neural network for session-based news recommendation. Machine Learning 108(10):1851–1875