Pachinko Prediction: A Bayesian method for event prediction from social media data
Abstract
The combination of large open data sources with machine learning approaches presents a potentially powerful way to predict events such as protest or social unrest. However, accounting for uncertainty in such models, particularly when using diverse, unstructured datasets such as social media, is essential to guarantee the appropriate use of such methods. Here we develop a Bayesian method for predicting social unrest events in Australia using social media data. This method uses machine learning methods to classify individual postings to social media as being relevant, and an empirical Bayesian approach to calculate posterior event probabilities. We use the method to predict events in Australian cities over a period in 2017/18.
Keywords— Bayesian statistics, social unrest, machine learning, prediction
1 Introduction
Automated methods that give advance warning of large gatherings of people, such as protests and social unrest events, are of interest to government agencies worldwide. With such events often being organised over online social media platforms, there exists the possibility to provide prior warning of large events solely through monitoring online data streams. Researchers have used open online data sources such as Twitter (Borge-Holthoefer et al., 2016; Agarwal and Sureka, 2016), Facebook, Tumblr (Xu et al., 2014), and Flickr (Alanyali et al., 2015) to characterise information propagation processes around protests, and have deployed machine learning methods on social media as well as blogs, news sources, and the dark web (Korkmaz et al., 2016) to predict civil unrest events. Twitter data in particular has been used broadly to monitor diverse large-scale trends such as stock behaviour (Bollen et al., 2011), public opinion polling around issues like climate change (Cody et al., 2015), and health characteristics (Alajajian et al., 2017). Recent studies have focussed in particular on Twitter’s role in mobilisation and discourse around protest action in the United States (Theocharis et al., 2015; Gallagher et al., 2018).
Of particular note, and a catalyst for much of the research in this area, the US IARPA (Intelligence Advanced Research Projects Activity) OSI (Open Source Indicators) program (https://www.iarpa.gov/index.php/researchprograms/osi) was started in 2011 to provide analysts with advance warning of civil unrest events in South America based on open data gathered from social media and other sources. This contest provided a “gold standard record” of event timings and locations to teams in order for them to develop (usually supervised) machine learning approaches to predict future events from OSI. At that time, sophisticated learning approaches employing the fusion of multiple model outputs into a single prediction were found to be particularly effective, with the EMBERS (Early Model Based Event Recognition using Surrogates) model (Ramakrishnan et al., 2014) ultimately providing the best predictions according to the evaluation metrics set for the contest. A key distinguishing feature of this model compared with other machine learning approaches was the use of multiple models combined via fusion (Hoegh et al., 2015), along with a novel method for suppressing spurious model outputs.
One general characteristic of the models forming inputs to EMBERS, as well as of many other models in the IARPA OSI program, was that they produced binary predictions of future days as having an event or not. Underlying each model in EMBERS was a “hard” (binary) classification of future day-location pairs into event/non-event, either from a rule-based scheme, e.g., the so-called “Planned Protest” model (Muthiah et al., 2015), or by converting probabilities from a GLM such as logistic regression to a binary output, e.g., the “Volume-based” (Korkmaz et al., 2016) or “Cascade” models (Cadena et al., 2015). This was likely guided by the evaluation methodology set out for the OSI program by IARPA. Teams’ model predictions were given an overall “quality score” comprising scores for date, location, event type, and population group, and each score only took “hard” classifications as inputs. For example, the “date score” DS for the competition was defined to be $DS = 1 - \min(|\text{predicted date} - \text{actual date}|, 7)/7$, with no ability to account for a model’s confidence in or uncertainty about the predicted event date.
In the context of the challenge this was likely a reasonable choice; however, it leaves open an important question: how confident was each model (or fusion of models) in each prediction? If all models in the OSI program tended to lend only slightly higher weight to particular event days, then in the presence of randomness the outcomes of the program could be largely a matter of chance. While this may not have been an issue with the OSI program (although it is difficult to know without access to the models), it nonetheless opens avenues for further investigation into prediction from unstructured, “noisy” sources such as social media. In particular, for analysts potentially using the outputs of forecasting systems such as EMBERS it is surely of interest to have access to a measure of the system’s confidence around a given prediction, rather than a simple binary classification.
Thinking further from the perspective of a potential end-user of an automated forecasting system suggests other desirable characteristics. Ideally one would be able to disentangle the components of a prediction coming from different data sources, or from different models in a fusion framework. For example, for a given predicted event day, knowing that the prediction was made because that day is a typical protest day (e.g., Labor Day, or Australia Day in Australia) is very different from knowing that it was due to a sudden spike in social media activity. Such a framework would potentially impart greater understanding, rather than mere predictions, of upcoming events.
In this paper, we develop a framework for predicting social unrest events from social media which incorporates these features. We adopt an empirical Bayesian approach, where a prior belief about the days and locations likely to see events is made explicit, and then evidence (in the form of social media postings) is used to update this prior. We focus on Twitter as an example of social media; however, our methodology could be transferred to other social media platforms as well. Each individual tweet forms our smallest unit of observation. As well as giving a probabilistic interpretation of predictions and enabling us to disentangle different components of a prediction (through the prior and likelihood), this framework supports a simple conceptual understanding of the model. As our algorithm involves sorting pieces of evidence into bins for different days and locations, we adopt an analogy of coloured marbles being sorted into jars, and hence the name “Pachinko Prediction”.
The structure of the paper is as follows. In Section 2, we describe the “Gold Standard Record” of events in Australia over 2017/18 as well as the Twitter dataset used for training the model. In Section 3, we describe the Pachinko Prediction model framework and detail the underlying empirical Bayesian method. Section 4 evaluates how the model performs at predicting events in Australian cities, and explores some qualitative features of the model. We conclude with a discussion in Section 5.
2 Datasets
2.1 Gold standard record
A key component of our method, or any social unrest prediction effort such as the OSI program, is a “Gold Standard Record” (GSR) of historical events in a region. Our project focuses on protests in Australia; therefore we use a custom-built GSR dataset specific to this region. The GSR dataset was generated by analysts who were employed on a casual basis to read major news websites daily and record articles on any civil unrest events within Australia. Each of these events is recorded with additional attributes such as event name, location, time, and whether the event was violent or non-violent. The dataset was manually verified to correct any errors made during gathering.
For this study we consider all the events for Adelaide, Brisbane, Canberra, Darwin, Hobart, Melbourne, Perth, and Sydney between 21 July 2017 and 14 February 2018 inclusive. Figure 1 gives a tile plot of each day for each city, with red tiles indicating events. We have ordered the cities by decreasing total number of events, and observe that Melbourne and Sydney, the two largest cities in Australia, had the most observed events over the period.
Initially, we aim to predict whether an event occurs on a given day or not. We define the random variable $E_{d,\ell}$ to be a Bernoulli random variable, equal to 1 if an event occurs on day $d$ in location $\ell$, and 0 if it does not.
We first tested whether the proportion of days that had an event differed significantly across each of the predictors month, weekday, and city. We found a significant association between month and events, and also between city and events (chi-squared tests in both cases). We did not find a significant association between weekday and events. To illustrate, Figure 2 shows the proportion of days having an event for each month (left), and the proportion of days having events for each city (right). There is a significant decrease in the proportion of days having events in December, January, and February compared to the other months. One possible explanation is that these are the summer months in Australia, and hot weather may decrease the number of events. The larger cities (Melbourne, Sydney, and Brisbane) have a significantly increased proportion of event days. This is intuitive, due to the larger pool of people who might be involved in protest events in these cities.
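The city association can be checked with a short Python sketch that computes a Pearson chi-squared statistic by hand. This is an illustration only: it uses the event and non-event day counts per city tabulated for the location strata in Section 4.1, it is not necessarily the exact table the authors tested, and the function name is hypothetical. The critical value 14.067 is the standard 5% point of the chi-squared distribution with 7 degrees of freedom.

```python
# (event days, non-event days) per city, transcribed from the location
# strata table in Section 4.1 of this paper.
counts = {
    "Adelaide": (12, 194),
    "Brisbane": (37, 172),
    "Canberra": (27, 179),
    "Darwin": (5, 203),
    "Hobart": (11, 196),
    "Melbourne": (57, 152),
    "Perth": (30, 179),
    "Sydney": (47, 162),
}


def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for an R x 2 table."""
    rows = list(table.values())
    n_cols = len(rows[0])
    col_totals = [sum(row[j] for row in rows) for j in range(n_cols)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for j, observed in enumerate(row):
            # Expected count under independence of city and event status.
            expected = row_total * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (n_cols - 1)
    return stat, dof


stat, dof = chi_squared_statistic(counts)
CRITICAL_5PCT_DF7 = 14.067  # upper 5% point of chi-squared with 7 d.o.f.
print(f"chi-squared = {stat:.2f} on {dof} d.o.f.; "
      f"reject independence at 5%: {stat > CRITICAL_5PCT_DF7}")
```

A statistic well above the critical value is consistent with the significant city effect reported above.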
2.2 Twitter
Our Twitter data was collected using the public API (https://developer.twitter.com/en/docs.html) between 21 July 2017 and 14 February 2018 inclusive. Roughly 50 million tweets were ingested per month, totalling approximately 350 million tweets. We applied three filters, on location, temporal, and relevance characteristics, to reduce this collection to the (much smaller) final dataset we used for making predictions. The location filter aimed to target tweets relevant to Australian capital cities, by including all tweets matching any of the following criteria:

the “Location” field of the tweet’s bio information contains the name of an Australian capital city, or

the tweet is geolocated to within a 25-mile radius of the centre of an Australian capital city, or

there is a mention of an Australian capital city within the tweet body.
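The three criteria above can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the tweet field names (`user_location`, `text`, `coordinates`), the substring matching, and the approximate city-centre coordinates are all assumptions, and only two cities are listed for brevity.

```python
import math

EARTH_RADIUS_MILES = 3958.8

# Approximate (latitude, longitude) of city centres; illustrative values only.
CITY_CENTRES = {
    "Melbourne": (-37.8136, 144.9631),
    "Sydney": (-33.8688, 151.2093),
}


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def matched_cities(tweet, radius_miles=25.0):
    """Cities for which the tweet satisfies at least one of the three criteria."""
    bio = (tweet.get("user_location") or "").lower()
    body = (tweet.get("text") or "").lower()
    coords = tweet.get("coordinates")  # (lat, lon) or None
    matches = set()
    for city, (lat, lon) in CITY_CENTRES.items():
        name = city.lower()
        if name in bio or name in body:
            # Criterion 1 (bio location) or criterion 3 (mention in body).
            matches.add(city)
        elif coords is not None and haversine_miles(coords[0], coords[1], lat, lon) <= radius_miles:
            # Criterion 2 (geolocated within the radius of the city centre).
            matches.add(city)
    return matches


print(matched_cities({"text": "Rally in Sydney on Friday"}))
```

Returning a set of cities, rather than a single boolean, lets one tweet be routed to more than one location if it matches several.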
The temporal filter selected only future-referencing tweets by scanning the body of the tweet and resolving any time references mentioned. For example, a tweet published on the 2nd of January 2018 containing the sentence “Let’s protest tomorrow at the University of Melbourne” would be resolved to the 3rd of January 2018, by resolving the “tomorrow” in the text. We made this choice because our primary interest is in prediction, and so we wish to utilise only tweets that reference events in the future, rather than events that have already occurred (e.g., news reports). This step is fundamental to our approach; experimentation with purely “volume-based” models using tweets resolved to the date they were authored (e.g., Korkmaz et al. (2016)) produced poor results, with news reports of previous events swamping any potential signal. We used Stanford NLP’s SUTime (Chang and Manning, 2012) to identify temporal mentions, along with HeidelTime (Strötgen and Gertz, 2013) for processing multilingual tweets. We applied the location and temporal filters simultaneously; doing so left us with 51,259 tweets. The numbers of tweets from each location are given in Table 1.
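Table 1: Number of tweets from each location after applying the location and temporal filters.

| City | Tweets |
|---|---|
| Adelaide | 2212 |
| Brisbane | 3327 |
| Canberra | 2370 |
| Darwin | 270 |
| Hobart | 565 |
| Melbourne | 19980 |
| Perth | 2602 |
| Sydney | 19933 |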
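The idea behind the temporal filter can be illustrated with a toy resolver in Python. This is a deliberately tiny stand-in for SUTime and HeidelTime, which handle far richer expressions; the keyword table and function names are hypothetical. Treating a same-day reference as admissible is also an assumption made here.

```python
import re
from datetime import date, timedelta

# Toy lookup of relative time words and their offset in days (an assumption;
# real temporal taggers resolve many more expression types).
RELATIVE_OFFSETS = {"today": 0, "tonight": 0, "tomorrow": 1, "yesterday": -1}


def resolve_dates(text, published):
    """Resolve relative time words in `text` against the publication date."""
    resolved = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in RELATIVE_OFFSETS:
            resolved.append(published + timedelta(days=RELATIVE_OFFSETS[word]))
    return resolved


def future_references(text, published):
    """Keep only resolved dates on or after the publication date."""
    return [d for d in resolve_dates(text, published) if d >= published]


tweet_text = "Let's protest tomorrow at the University of Melbourne"
print(future_references(tweet_text, date(2018, 1, 2)))
```

A tweet whose only time reference lies in the past (for example a news report of “yesterday”'s rally) yields an empty list and would be dropped by the filter.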
Our relevance filter was a custom civil unrest classifier used to select only tweets of interest. This classifier was created as follows. We manually labelled a random sample of 7898 tweets, containing 1504 positive examples linked to GSR events, and 6394 negative examples. This training set is made available along with this paper (Mitchell, 2018).
We then tokenised tweets using the CountVectorizer in the Python package scikit-learn (Pedregosa et al., 2011) to create 1- and 2-grams from the text. These tokens formed the features for the classifier. Model selection was performed between four models: Gaussian and Bernoulli Naive Bayes, and a linear SVM with $\ell_1$ and $\ell_2$ penalty functions, on the basis of highest F1 score using 5-fold cross-validation. The best-performing model was a linear SVM, with an F1 score of 0.94. All models were implemented using scikit-learn with default parameters.
Applying this classifier to the data left us with 51,259 tweets. We remark that while this is a relatively small dataset compared to typical modern “big data” approaches, it is the result of an extensive filtering procedure designed to leave us mostly with informative tweets regarding the events of interest. This Twitter dataset formed the input training data for our model, which we describe in the next section.
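To make the classifier's features and selection metric concrete, the following standard-library Python sketch mimics 1- and 2-gram counting and computes an F1 score. It is an approximation for illustration, not the scikit-learn implementation used by the authors: the token pattern copies CountVectorizer's documented default, and the function names are hypothetical.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern:
# lower-cased word tokens of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def ngram_counts(text, max_n=2):
    """Count all n-grams of length 1..max_n in a single tweet."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    counts = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            counts[gram] = counts.get(gram, 0) + 1
    return counts


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print(sorted(ngram_counts("Protest at the rally")))
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

With roughly one positive example in five in the labelled sample, F1 is a more informative selection criterion than raw accuracy, which a classifier could inflate by predicting “not relevant” for every tweet.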
3 Method
An overview of our method is given in Figure 3. This shows the stages of the prediction process: setting up appropriate data structures, then classifying individual social media postings as relevant for prediction or not, then making event predictions using a Bayesian classifier. The analogy we use throughout is that of a Bean machine (no relation to N. G. Bean, an author of this paper), alternately called a Galton Board or quincunx, which illustrates the central limit theorem by using a table of pegs to sort marbles into jars. Our method can be conceptualised as a pegboard arrangement (an algorithm) to sort marbles (tweets) into jars (day-location pairs) for the purposes of making predictions. These marbles can be either red or green (classified as being indicative of an event or not), and the method works by monitoring the ratio of red to green marbles. We will therefore refer to tweets as “marbles” when describing our method in this paper; this analogy was a useful device for communicating our methodology to prospective end-users of this tool. Historically the Bean machine is a precursor to the Japanese gambling game Pachinko, where a large number of marbles are randomly sorted into bins via a pegboard, with some bins worth prizes but most having no value. Since our method must likewise filter out a large volume of off-topic tweets in order to predict events, we refer to it as Pachinko Prediction (PP).
The data structure for the method can be conceptualised as a large table filled with a grid of jars, into which marbles will be sorted. First we label each jar with a date and a location. For example, we need jars for each of

Melbourne, 21 July 2017,

Melbourne, 22 July 2017,

...

Melbourne, 14 February 2018,

Adelaide, 21 July 2017,

...

Each of these jars represents a date-location combination that is of interest to the researcher.
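The grid of jars maps naturally onto a dictionary keyed by (city, date). The Python sketch below is one possible layout under that assumption; the function names and the red/green counter representation are illustrative rather than taken from the authors' implementation.

```python
from datetime import date, timedelta

CITIES = ["Adelaide", "Brisbane", "Canberra", "Darwin",
          "Hobart", "Melbourne", "Perth", "Sydney"]
START, END = date(2017, 7, 21), date(2018, 2, 14)  # study period, inclusive


def build_jars(cities, start, end):
    """One empty jar (red and green marble counts) per city and day."""
    n_days = (end - start).days + 1
    return {
        (city, start + timedelta(days=offset)): {"red": 0, "green": 0}
        for city in cities
        for offset in range(n_days)
    }


def drop_marble(jars, city, day, indicative):
    """Sort one tweet into its jar; return False if no such jar exists."""
    jar = jars.get((city, day))
    if jar is None:
        return False
    jar["red" if indicative else "green"] += 1
    return True


jars = build_jars(CITIES, START, END)
drop_marble(jars, "Melbourne", date(2018, 1, 3), indicative=True)
print(len(jars), jars[("Melbourne", date(2018, 1, 3))])
```

Tweets whose resolved date or location falls outside the grid are simply rejected, which mirrors marbles that fall off the board.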
3.1 Generative model
For each day $d$ we counted the number of tweets mentioning that day. We denote the number of tweets for day $d$ as $N_d$. Figure 4 plots the empirical cumulative distribution function (CDF) of $N_d$. We considered two possible models to describe the number of tweets per day. The first was a Poisson distribution, i.e.,

$$N_d \sim \mathrm{Poisson}(\lambda).$$

We estimated the parameter $\lambda$ using maximum likelihood estimation with the fitdistrplus package (Delignette-Muller et al., 2015) in R, and obtained an estimate of $\hat{\lambda} = 30.82$. The fitted CDF is given in Figure 4, and shows a large deviation from the empirical CDF. This is confirmed if we examine the observed mean and variance of $N_d$. The mean is 30.82, while the variance is 10199.21, indicating severe overdispersion.

To deal with this overdispersion, we used the negative binomial distribution, with parameters $\mu$, the mean number of observations, and $r$, used to deal with the overdispersion. There exist many different notations for the negative binomial distribution in the literature, so to be specific, we define the negative binomial probability mass function here as

$$P(N_d = n) = \frac{\Gamma(n + r)}{\Gamma(r)\, n!}\, p^{n} (1 - p)^{r}, \qquad n = 0, 1, 2, \ldots, \qquad \text{where } p = \frac{\mu}{\mu + r}.$$

For instance, the version of the negative binomial in R switches $p$ and $1 - p$ in the above expression.

Using maximum likelihood estimation, we obtain estimates of $\mu$ and $r$ for the Twitter dataset. Figure 4 shows the CDF for the negative binomial with these parameters, showing a far better correspondence to the empirical CDF.
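The degree of overdispersion can be made concrete with a few lines of Python. Note the assumptions: the sketch uses method-of-moments estimates, not the maximum likelihood estimates obtained by the authors, so the numbers are indicative only; and it assumes the probability mass function has the form $\frac{\Gamma(x+r)}{\Gamma(r)\,x!}p^{x}(1-p)^{r}$, with $p$ attached to the count, which is the reverse of R's convention. Function names are hypothetical.

```python
import math

MEAN, VARIANCE = 30.82, 10199.21  # observed mean and variance reported above

# A Poisson model forces variance == mean, so this ratio should be close to 1.
dispersion_index = VARIANCE / MEAN


def moment_estimates(mean, variance):
    """Method-of-moments (r, p) for a negative binomial with
    mean = r p / (1 - p) and variance = mean + mean**2 / r."""
    if variance <= mean:
        raise ValueError("no overdispersion: negative binomial not identifiable")
    r = mean ** 2 / (variance - mean)
    p = mean / (mean + r)
    return r, p


def neg_binomial_pmf(x, r, p):
    """P(X = x) = Gamma(x + r) / (Gamma(r) x!) * p**x * (1 - p)**r."""
    log_coeff = math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
    return math.exp(log_coeff + x * math.log(p) + r * math.log(1 - p))


r_hat, p_hat = moment_estimates(MEAN, VARIANCE)
print(f"dispersion index = {dispersion_index:.1f}")
print(f"moment estimates: r = {r_hat:.4f}, p = {p_hat:.4f}")
print(f"P(no tweets) = {neg_binomial_pmf(0, r_hat, p_hat):.3f}")
```

A size parameter far below 1 corresponds to a distribution with most of its mass at or near zero and a very long right tail, which is the shape one would expect if most day-location jars receive few tweets and a handful receive very many.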
3.2 Bayesian classifier
Based on our observations in Section 2.1 on associations between the various predictors, we consider stratifying the observed data to best utilise this information for prediction. We denote the probability of an event on day $d$ in location $\ell$ by $p_s$, where $s = s(d, \ell)$ is the stratum of the data containing day $d$ in location $\ell$. In Section 4 we consider a number of different stratifications of the observed data. We denote the number of red marbles (tweets indicative of a future event) mapped to day $d$ and location $\ell$, which are contained in stratum $s$, as $T_{d,\ell,s}$. Henceforth we will model only red marbles, or indicative tweets. We found that the number of green marbles, or non-indicative tweets, was roughly constant across day-location pairs, making them non-informative for our analysis. However, we note that the following method generalises trivially to utilising green marbles as well.

We assume that all the probabilities for days and cities contained in stratum $s$ have the same prior distribution, which we denote $\pi(p_s)$. This assigns the same prior probability of an event on any given day in stratum $s$, regardless of location. While this prior could be customised for different cities (for example, it is reasonable to assume that larger cities have a higher probability of seeing events occur than smaller ones), we wanted to begin with a fairly weak prior. We assume that this prior is a beta distribution with hyperparameters $\alpha$ and $\beta$, i.e.,

$$p_s \sim \mathrm{Beta}(\alpha, \beta).$$

Let the number of indicative tweets in stratum $s$ be $n_s$. Of these, $x_s$ occur on days having events. We assume that $x_s$ given $p_s$ is binomially distributed, i.e.,

$$x_s \mid p_s \sim \mathrm{Bin}(n_s, p_s),$$

which has the probability mass function

$$P(x_s \mid p_s) = \binom{n_s}{x_s}\, p_s^{x_s} (1 - p_s)^{n_s - x_s}.$$

Because the beta prior is conjugate to the binomial likelihood, it can easily be shown that the posterior distribution of $p_s$, given $x_s$, is

$$p_s \mid x_s \sim \mathrm{Beta}(\alpha + x_s,\; \beta + n_s - x_s),$$

i.e., a beta distribution with parameters $\alpha + x_s$ and $\beta + n_s - x_s$.

To estimate the hyperparameters $\alpha$ and $\beta$ we use an empirical Bayes approach and borrow information from the rest of the GSR dataset. We set $\alpha$ to be the number of event days across the entire country in the GSR dataset, and $\beta$ to be the number of non-event days in the GSR dataset. Note that we would obtain an equivalent parameterisation if we used a non-informative prior and then obtained a posterior distribution for the overall proportion of event days.

Now, we consider $T_{d,\ell,s}$, the number of indicative tweets from day $d$ in location $\ell$ which are contained in stratum $s$. Based on the fitting shown in Figure 4, we model the relationship between $T_{d,\ell,s}$ and $p_s$ using a negative binomial model:

$$T_{d,\ell,s} \mid p_s \sim \mathrm{NB}(r, p_s), \qquad P(T_{d,\ell,s} = t \mid p_s) = \frac{\Gamma(t + r)}{\Gamma(r)\, t!}\, p_s^{t} (1 - p_s)^{r}.$$

Using the posterior distribution for $p_s$ given $x_s$ as the prior, we get

$$p_s \mid x_s, T_{d,\ell,s} \sim \mathrm{Beta}(\alpha + x_s + T_{d,\ell,s},\; \beta + n_s - x_s + r),$$

which is once again a beta distribution. Note that we still need to estimate the parameter $r$. Once again we take an empirical Bayes approach, and use the MLE of $r$ from all indicative tweets.
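The two conjugate updates described in this section reduce to simple parameter arithmetic, sketched below in Python. Several things are assumptions for illustration: the negative binomial likelihood is taken to be proportional to $p^{t}(1-p)^{r}$ for $t$ indicative tweets; the prior Beta(226, 1437) uses the nationwide event and non-event day totals obtained by summing the strata tables in Section 4.1; and the stratum counts and the value of $r$ in the example are invented, not estimates from the paper.

```python
def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def update_with_stratum(alpha, beta, n_s, x_s):
    """Beta-binomial update: x_s successes out of n_s observations in the stratum."""
    return alpha + x_s, beta + (n_s - x_s)


def update_with_tweets(a, b, tweets, r):
    """Beta-negative-binomial update, assuming likelihood ~ p**tweets * (1 - p)**r."""
    return a + tweets, b + r


PRIOR = (226, 1437)        # event days, non-event days nationwide
N_S, X_S = 100, 20         # hypothetical stratum counts
R_HYPOTHETICAL = 0.1       # hypothetical dispersion parameter

stratum_posterior = update_with_stratum(*PRIOR, N_S, X_S)
quiet_day = update_with_tweets(*stratum_posterior, tweets=0, r=R_HYPOTHETICAL)
busy_day = update_with_tweets(*stratum_posterior, tweets=50, r=R_HYPOTHETICAL)

print("prior mean:        ", round(beta_mean(*PRIOR), 4))
print("quiet-day estimate:", round(beta_mean(*quiet_day), 4))
print("busy-day estimate: ", round(beta_mean(*busy_day), 4))
```

Because every stage only adds to the two beta parameters, the contribution of the prior, the stratum, and the day's tweets to a given prediction can be read off separately, which is the kind of disentangling motivated in the Introduction.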
4 Results
4.1 Bayesian analysis
In the previous section we outlined the Bayesian model that we will use to find the posterior distribution for an event on day $d$ in location $\ell$. In this model, we consider stratification of the observations to account for the observed relationships between location, month, and the probability of an event in the GSR (Section 2.1).
We considered three possible strata, based on:

location,

month, and

both location and month.
Tables 2 and 3 give the observed number of events and non-events for the location and month strata respectively. We omit the table of location+month strata for brevity.
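Table 2: Observed number of events and non-events for the location strata.

| Location | Days | Events | No events |
|---|---|---|---|
| Adelaide | 206 | 12 | 194 |
| Brisbane | 209 | 37 | 172 |
| Canberra | 206 | 27 | 179 |
| Darwin | 208 | 5 | 203 |
| Hobart | 207 | 11 | 196 |
| Melbourne | 209 | 57 | 152 |
| Perth | 209 | 30 | 179 |
| Sydney | 209 | 47 | 162 |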
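Table 3: Observed number of events and non-events for the month strata.

| Month | Days | Events | No events |
|---|---|---|---|
| Jul | 79 | 14 | 65 |
| Aug | 248 | 41 | 207 |
| Sep | 240 | 41 | 199 |
| Oct | 248 | 40 | 208 |
| Nov | 240 | 47 | 193 |
| Dec | 248 | 20 | 228 |
| Jan | 248 | 18 | 230 |
| Feb | 112 | 5 | 107 |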
Figure 5 shows the posterior distribution for $p_s$ given the location strata. For comparison, we have included the prior distribution for $p_s$ given only the overall number of events and non-events. For cities with an increased proportion of events, e.g., Melbourne, Sydney, and Brisbane, the distribution is shifted to the right, while for cities like Darwin, which have a low proportion of events, the posterior distribution is shifted to the left. This is intuitive, as we observed in Section 2.1 that larger cities are more likely to see more protest events.
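As a rough numerical companion to these shifts, the Python sketch below applies a beta-binomial update to the Table 2 counts and compares posterior means with the prior mean. It rests on assumptions made purely for this illustration: the prior is Beta(226, 1437), the nationwide totals obtained by summing Table 2, and each city's own event and non-event days are treated as the binomial observations for its stratum. It does not reproduce Figure 5.

```python
# (event days, non-event days) per city, transcribed from Table 2.
LOCATION_STRATA = {
    "Adelaide": (12, 194),
    "Brisbane": (37, 172),
    "Canberra": (27, 179),
    "Darwin": (5, 203),
    "Hobart": (11, 196),
    "Melbourne": (57, 152),
    "Perth": (30, 179),
    "Sydney": (47, 162),
}

# Empirical Bayes prior: nationwide event and non-event day totals.
PRIOR_ALPHA = sum(events for events, _ in LOCATION_STRATA.values())
PRIOR_BETA = sum(non_events for _, non_events in LOCATION_STRATA.values())


def posterior_mean(events, non_events, alpha=PRIOR_ALPHA, beta=PRIOR_BETA):
    """Mean of Beta(alpha + events, beta + non_events)."""
    return (alpha + events) / (alpha + beta + events + non_events)


prior_mean = PRIOR_ALPHA / (PRIOR_ALPHA + PRIOR_BETA)
print(f"prior mean: {prior_mean:.4f}")
for city, (events, non_events) in sorted(LOCATION_STRATA.items()):
    mean = posterior_mean(events, non_events)
    direction = "right" if mean > prior_mean else "left"
    print(f"{city:<10} posterior mean {mean:.4f} (shifted {direction})")
```

Under these assumptions the posterior means move only modestly away from the prior mean, since the nationwide prior carries the weight of all 1663 day-location observations.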
To analyse the performance of our models, we use Receiver Operating Characteristic (ROC) curves. We do this because we do not wish to evaluate the model on a “hard” classification of predictions as being strictly event/non-event days. Our model outputs a posterior distribution, from which we take the posterior mean as our prediction of the probability of seeing an event at day-location $(d, \ell)$. To convert this probability to a binary prediction would require selecting an arbitrary threshold, above which we would issue an alert for an upcoming event. On the other hand, the ROC curve allows us to visualise the performance of the model for all possible thresholds, which better respects the nature of the predictions output by the model. Figure 6 gives the ROC curves for a variety of models. We considered five models, as follows:
 Overall

This uses the overall number of events and non-events in the data and does not use any strata or any of the indicative tweets. This non-predictive model is equivalent to flipping a (biased) coin on each day we predict for. (Note: The ROC curve for this is by definition the line TPR = FPR, so we omit it from the figure for brevity.)
 Tweets only

This uses just the overall number of events and non-events in the data, and the observed number of indicative tweets for each day-location $(d, \ell)$, but does not use any strata. It is effectively a “data-only” model, using a (close to) uninformative prior.
 Location+tweets

This uses the strata based on the location, plus indicative tweet data.
 Month+tweets

This uses the strata based on the month, plus indicative tweets.
 Month/location+tweets

This uses the strata based on the month and location, plus indicative tweets.
We see an improvement in the ROC curves (which lie in the upper-left quadrant) once we start utilising the Twitter data. The best results are for the model using strata based on both month and location. This is further seen in Table 4, which gives the area under the ROC curve (AUC) for each model. We obtained similar AUC values using the other models. Furthermore, performing the same experiment using cross-validation with a random 70/30 train/test split produces comparable ROC curves with similar AUC values.
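Table 4: Area under the ROC curve (AUC) for each model.

| Strata | AUC |
|---|---|
| Overall | 0.50 |
| Tweets only | 0.73 |
| Month+tweets | 0.74 |
| Location+tweets | 0.74 |
| Month/location+tweets | 0.76 |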
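For readers unfamiliar with the metric, the following Python sketch shows how a ROC curve and its AUC can be computed from posterior-mean scores and observed event labels by sweeping the alert threshold. It is a generic, standard-library illustration with made-up inputs, not the evaluation code used to produce the values reported above.

```python
def roc_curve(scores, labels):
    """(FPR, TPR) points obtained by lowering the threshold through the scores.

    `labels` are 1 for event days and 0 for non-event days. Tied scores are
    processed together so that each distinct threshold yields one point.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one event and one non-event")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


example_scores = [0.9, 0.8, 0.7, 0.6]   # hypothetical posterior means
example_labels = [1, 0, 1, 0]           # hypothetical event indicators
print(auc(roc_curve(example_scores, example_labels)))
```

A model that assigns the same score to every day-location produces the single diagonal segment from (0, 0) to (1, 1) and hence an AUC of 0.5, matching the “Overall” model.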
We also considered the observed data by city to examine how well the model performs at predicting events in individual cities, rather than averaged Australia-wide. The ROC curves are given in Figure 7 and the AUC values are given in Table 5. We see that for Hobart the model performs particularly well, while the predictions for Melbourne are relatively poor, with an AUC of just 0.60. We explore this discrepancy further in the next subsection. Once again, we obtain similar results when performing cross-validation with a random 70/30 split (omitted for brevity).
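Table 5: Area under the ROC curve (AUC) by city.

| City | AUC |
|---|---|
| Melbourne | 0.60 |
| Sydney | 0.65 |
| Canberra | 0.67 |
| Darwin | 0.68 |
| Perth | 0.68 |
| Adelaide | 0.72 |
| Brisbane | 0.76 |
| Hobart | 0.83 |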
4.2 Prediction model
Figure 8 shows predictions from our model over the full time period considered. Comparison with Figure 1 shows that the general trends of the predictions are similar to those in the GSR, with the model predicting higher probabilities of events occurring in the larger cities, Sydney and Melbourne. More events were predicted to occur in the later months due to an increase in the volume of indicative tweets over this period.
To examine the amount of lead time in predictions made by our method, we performed $k$-days-ahead predictions, by considering only those collected tweets referring to an event that were authored no later than $k$ days before that event. Figure 9 shows the decay in AUC for predictions made based on the Twitter data available 0 to 30 days before each event. Here we consider all cities, and we use the model with no stratification. The model performs reasonably well up to a lead time of one week, after which there is a drop and the AUC continues to decay. In particular, note that 1-day-ahead predictions perform almost as well (in terms of AUC) as 0-day-ahead predictions do. This indicates that there is consistently usable information contained within Twitter data in the day before an event. For a potential end-user interested in advance warning of upcoming events this represents an actionable output from our model. We note that even after 30 days the model still outputs some usable predictions, performing slightly better than an uninformative (coin-flip) model, with an AUC slightly above 0.5.
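The lead-time restriction amounts to a filter on each tweet's authoring date relative to the day it refers to. The Python sketch below shows one way to express this; the record layout (`authored` and `resolved` dates) and the function name are assumptions, and the three tweets are invented.

```python
from datetime import date, timedelta


def tweets_available(tweets, target_day, lead_days):
    """Tweets that refer to `target_day` and were authored at least
    `lead_days` days before it."""
    cutoff = target_day - timedelta(days=lead_days)
    return [
        tweet for tweet in tweets
        if tweet["resolved"] == target_day and tweet["authored"] <= cutoff
    ]


event_day = date(2018, 1, 26)
example_tweets = [  # hypothetical records
    {"authored": date(2018, 1, 26), "resolved": event_day},
    {"authored": date(2018, 1, 25), "resolved": event_day},
    {"authored": date(2018, 1, 10), "resolved": event_day},
]

for lead in (0, 1, 7, 30):
    count = len(tweets_available(example_tweets, event_day, lead))
    print(f"{lead:>2}-day-ahead: {count} tweet(s) available")
```

Re-running the prediction on the output of this filter for each lead time would trace out a decay curve of the kind summarised in Figure 9, since evidence thins out as the cutoff moves earlier.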
To examine why predictions for Melbourne are poor relative to the other cities, we compared Sydney and Melbourne, two cities of roughly equal population. They also had a similar number of events over the period: 57 for Melbourne, and 47 for Sydney. The main difference between the cities becomes clear when we examine the days with low numbers of indicative tweets more closely. Figure 10 shows days with 25 or fewer indicative tweets in Sydney and Melbourne. While for Sydney there are only two event days containing 25 or fewer indicative tweets, we observe that there were a large number of events occurring in Melbourne for which very few indicative tweets were detected by our system. The lines are logistic regression fits to the data for each city. These make clear that, for days having a small number of indicative tweets, the logistic regression for Melbourne predicts a decreasing probability of an event as the number of indicative tweets increases. This suggests that either there were few tweets authored referring to these events, or that our protest classifier described in Section 2 is poorly tuned to detect these events.
Examining the events in Melbourne with fewer than 25 tweets shows that these events often concern smaller subpopulations. For example, some of the headlines corresponding to these low-tweet events include:
Vic pathology staff on indefinite strike
Accused Neil Erikson in bid to dismiss mock beheading charges
DVA rally: Families want royal commission after series of veteran suicides
Airport workers in Melbourne stage protest
In each of these cases, the subpopulation involved is small (e.g. pathology staff, or families of veterans). These groups likely use different methods (potentially other social media platforms, or a different medium entirely) to organise these events. Indeed, when we went back to the historical Twitter record to search for tweets mentioning these events before they occurred, we found only a very small number of tweets. It is clearly the case that PP will struggle to detect events concerning only small subpopulations of this type. It has been observed previously for other methods (even EMBERS) that signals for all types of events do not necessarily appear in all types of data sources (Korkmaz et al., 2016). We remark that while Twitter may not be the appropriate medium for detecting these particular events, it is likely that combining multiple datasets (e.g., Facebook posts or appropriate web searches) would improve our predictions, and our framework is flexible to allow for doing this in the same manner as we have done here for tweets.
5 Discussion
In this paper we have developed a Bayesian methodology for predicting events from Twitter data. Our method has the advantage of being interpretable: it can explicitly show the evidence upon which a particular prediction is made, and it separates the components of the prediction coming from the Twitter data from those coming from the prior belief about event probabilities. Furthermore, the model predicts the probability of an event occurring, rather than a binary classification of a particular day being an event/non-event day. This supports a greater understanding of the uncertainties associated with each prediction, and gives the end-user an indication of how much confidence they should place in any given prediction. Combined with the clear “audit trail” of evidence underlying each prediction output by our model, we argue that this facilitates more informed decision-making by potential end-users.
The framework developed here naturally generalises to incorporating multiple heterogeneous data sources for predictions, as in Korkmaz et al. (2016). Future work will explore methods to incorporate other open data sources such as Facebook pages or search activity. We will also look at predicting other characteristics of the events contained within the GSR data such as population group, type of event, and whether it is likely to be violent or non-violent.
In this work we treated individual tweets as being independent, which of course is unlikely to be a valid assumption. With the authors of these tweets being embedded in a complex social network, there exist clear dependencies between their activity patterns (Bagrow et al., 2017) which are unaccounted for here. Utilising this network structure between accounts may help uncover bot accounts (Nasim et al., 2018), which, while potentially still useful for making predictions, should at the very least be accounted for in the model. Future work will investigate using network characteristics of tweets referencing the same day as features in predictive models. For example, networks of tweets may be formed with edges representing some shared characteristic between those tweets (e.g., common hashtags, authors, or replies). Using structural characteristics of these networks may reduce “noise” in the data used for prediction, and produce better-quality evidence for upcoming events.
References
Agarwal, S. and Sureka, A. (2016). Investigating the Potential of Aggregated Tweets as Surrogate Data for Forecasting Civil Protests. In Proceedings of the 3rd IKDD Conference on Data Science, 2016 (CODS ‘16).
Alajajian, S. E., Williams, J. R., Reagan, A. J., Alajajian, S. C., Frank, M. R., Mitchell, L., Lahne, J., Danforth, C. M., and Dodds, P. S. (2017). The lexicocalorimeter: Gauging public health through caloric input and output on social media. PLoS ONE, 12(2):e0168893.
Alanyali, M., Preis, T., and Moat, H. S. (2015). Tracking Protests Using Geotagged Flickr Photographs. Under Review, pages 27–30.
Bagrow, J. P., Liu, X., and Mitchell, L. (2017). Information flow reveals prediction limits in online social activity. arXiv preprint, 1708.04575.
Bollen, J., Mao, H., and Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1):1–8.
Borge-Holthoefer, J., Perra, N., Gonçalves, B., González-Bailón, S., Arenas, A., Moreno, Y., and Vespignani, A. (2016). The dynamics of information-driven coordination phenomena: A transfer entropy analysis. Science Advances, 2(4).
Cadena, J., Korkmaz, G., Kuhlman, C. J., Marathe, A., Ramakrishnan, N., and Vullikanti, A. (2015). Forecasting Social Unrest Using Activity Cascades. PLoS ONE, 10(6):e0128879.
Chang, A. X. and Manning, C. D. (2012). SUTime: A library for recognizing and normalizing time expressions. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC2012), (iii):3735–3740.
Cody, E. M., Reagan, A. J., Mitchell, L., Dodds, P. S., and Danforth, C. M. (2015). Climate change sentiment on Twitter: An unsolicited public opinion poll. PLoS ONE, 10(8):e0136092.
Delignette-Muller, M. L., Dutang, C., et al. (2015). fitdistrplus: An R package for fitting distributions. Journal of Statistical Software, 64(4):1–34.
Gallagher, R. J., Reagan, A. J., Danforth, C. M., and Dodds, P. S. (2018). Divergent discourse between protests and counter-protests: #BlackLivesMatter and #AllLivesMatter. PLoS ONE, 13(4):1–23.
Hoegh, A., Leman, S., Saraf, P., and Ramakrishnan, N. (2015). Bayesian Model Fusion for Forecasting Civil Unrest. Technometrics, 57(3):332–40.
Korkmaz, G., Cadena, J., Kuhlman, C. J., Marathe, A., Vullikanti, A., and Ramakrishnan, N. (2016). Multi-source models for civil unrest forecasting. Social Network Analysis and Mining, 6(1).
Mitchell, L. (2018). Civil unrest event-relevant Twitter classifier training data. Mendeley Data, v2, doi:10.17632/mxcsxp3jxn.2.
Muthiah, S., Huang, B., Arredondo, J., Mares, D., Getoor, L., Katz, G., and Ramakrishnan, N. (2015). Planned protest modeling in news and social media. In Proceedings of the National Conference on Artificial Intelligence, volume 5, pages 3920–3927.
Nasim, M., Nguyen, A., Lothian, N., Cope, R., and Mitchell, L. (2018). Real-time detection of content polluters in partially observable Twitter networks. In Proceedings of the 26th International Conference on the World Wide Web (WWW ’18) Companion, pages 1331–1339.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Ramakrishnan, N., Butler, P., Muthiah, S., Self, N., Khandpur, R., Saraf, P., Wang, W., Cadena, J., Vullikanti, A., Korkmaz, G., Kuhlman, C., Marathe, A., Zhao, L., Hua, T., Chen, F., Lu, C.-T., Huang, B., Srinivasan, A., Trinh, K., Getoor, L., Katz, G., Doyle, A., Ackermann, C., Zavorin, I., Ford, J., Summers, K., Fayed, Y., Arredondo, J., Gupta, D., and Mares, D. (2014). ‘Beating the news’ with EMBERS: Forecasting Civil Unrest using Open Source Indicators. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’14), pages 1799–1808.
Strötgen, J. and Gertz, M. (2013). Multilingual and cross-domain temporal tagging. Language Resources and Evaluation, 47(2):269–298.
Theocharis, Y., Lowe, W., van Deth, J. W., and García-Albacete, G. (2015). Using Twitter to mobilize protest action: online mobilization patterns and action repertoires in the Occupy Wall Street, Indignados, and Aganaktismenoi movements. Information Communication and Society.
Xu, J., Lu, T. C., Compton, R., and Allen, D. (2014). Civil unrest prediction: A Tumblr-based exploration. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 8393 LNCS.