SAVITR: A System for Real-time Location Extraction from Microblogs during Emergencies
We present SAVITR, a system that leverages the information posted on the Twitter microblogging site to monitor and analyse emergency situations. Given that only a very small percentage of microblogs are geo-tagged, it is essential for such a system to extract locations from the text of the microblogs. We employ natural language processing techniques to infer the locations mentioned in the microblog text, in an unsupervised fashion and display it on a map-based interface. The system is designed for efficient performance, achieving an F-score of 0.79, and is approximately two orders of magnitude faster than other available tools for location extraction.
SMERP workshop, The Web Conference 2018.
CCS Concepts: Information systems → Information retrieval
1 Introduction
Online social media sites, especially microblogging sites like Twitter and Weibo, have been shown to be very useful for gathering situational information in real-time [Imran et al. (2015), Rudra et al. (2015)]. Consequently, it is imperative to not only process the vast incoming data stream on a real-time basis, but also to extract relevant information from the unstructured and noisy data accurately.
It is especially crucial to extract geographical locations from tweets (microblogs), since the locations help to associate the information available online with the physical locations. This task is challenging since geo-tagged tweets are very sparse, especially in developing countries like India, accounting for only 0.36% of the total tweet traffic. Hence it becomes necessary to extract locations from the text of the tweets.
This work proposes a novel and fast method of extracting locations from English tweets posted during emergency situations. The location is inferred from the tweet-text in an unsupervised fashion as opposed to using the geo-tagged field. Note that several methodologies for extracting locations from tweets have been proposed in literature; some of these are discussed in the next section. We compare the proposed methodology with several existing methodologies in terms of coverage (Recall) and accuracy (Precision). Additionally, we also compared the speed of operation of different methods, which is crucial for real-time deployment of the methods. The proposed method achieves very competitive values of Recall and Precision with the baseline methods, and the highest F-score among all methods. Importantly, the proposed methodology is several orders of magnitude faster than most of the prior methods, and is hence suitable for real-time deployment.
We deploy the proposed methodology on a system available at http://savitr.herokuapp.com, which is described in a later section.
2 Related Work
We discuss some existing information systems for use during emergencies, and some prior methods for location extraction from microblogs.
2.1 Information Systems
A few Information Systems have already been implemented in various countries for emergency informatics, and their efficacy has been demonstrated in a variety of situations. [Sakaki et al. (2010)] deployed a system for real-time earthquake detection in Japan, using Twitter users as social sensors. Simple systems like the Chennai Flood Map [Map (2015)], which combines crowdsourcing and open source mapping, demonstrated the need and utility of Information Systems during the 2015 Chennai floods. Likewise, Ushahidi enables local observers to submit reports using their mobile phones or the Internet, thereby creating a temporal and geospatial archive of an ongoing event. Ushahidi has been deployed in situations such as the earthquakes in Haiti and Chile, and forest fires in Italy and Russia.
Our system works on the same basic principle as the aforementioned ones – information extraction from crowdsourced data. However, unlike the Chennai Flood Map [Map (2015)] and Ushahidi, it does not require users to explicitly specify the location. Rather, we infer it from the tweet text, without any prior manual labeling.
2.2 Location Inferencing methods
Location inferencing is a specific variety of Named Entity Recognition (NER), whereby only the entities corresponding to valid geographical locations are extracted.
There have been seminal works on location extraction from microblog text, for instance inferring the location of a user from the user’s set of posted tweets, or predicting the probable location of a tweet by training on previous tweets having valid geo-tagged fields.
Publicly available tools like Stanford NER [Finkel et al. (2005)], TwitterNLP [Ritter et al. (2011)], OpenNLP [Ope (2010)] and the Google Cloud Natural Language API perform general-purpose named entity recognition, and can be used to extract location entities from text.
We focus our work only on extracting the locations from the tweet text, since we have observed that (i) a very small fraction of tweets are geo-tagged, and (ii) even for geo-tagged tweets, a tweet’s geo-tagged location is not always a valid representative of the incident mentioned in the tweet text. For instance, the tweet “Will discuss on TimesNow at 8.30 am today regarding Dengue Fever in Tamil Nadu.” clearly refers to Tamil Nadu, but the geo-tagged location is New Delhi (from where the tweet was posted).
We give an overview of the different types of methodologies used in location extraction systems. Prior state-of-the-art methods have performed common preprocessing steps like noun-phrase extraction and phrase matching [Malmasi and Dras (2016)], or regex matching [Bontcheva et al. (2013)] before employing the following techniques for location extraction.
Supervised methods: Well-known supervised models used in this context include maximum entropy (ME) based models such as OpenNLP, which [Lingad et al. (2013)] deployed without re-training; it infers locations using an ME classifier.
Semi-supervised methods: [Ji et al. (2016)] used semi-supervised techniques such as beam search and structured perceptrons to label word sequences, and linked the recognized locations with corresponding Foursquare location entities.
3 Extracting locations from microblogs
We now describe the proposed methodology for inferring locations from tweet text. The methodology involves the following tasks.
3.1 Hashtag Segmentation
Hashtags are a relevant source of information in Twitter. Especially for tweets posted during emergency situations, hashtags often contain location names embedded in them, e.g., #NepalQuake, #GujaratFloods. However, due to the peculiar style of coining hashtags, it becomes imperative to break them into meaningful words. Similar to [Malmasi and Dras (2016)] and [Al-Olimat et al. (2017)], we adopt a statistical word segmentation based algorithm [Norvig (2009)] to break a hashtag into distinct words, and extract locations from the distinct words. We also retain the original hashtag, to ensure we do not lose out on meaningful remote locations simply because they are uncommon.
We have observed that hashtag segmentation has some unforeseen outcomes. While it improves recall, it can hamper precision, especially when the segmented words correspond to actual locations. For example, ‘#Bengaluru’ (a city in India) is broken down into ‘bengal’ and ‘uru’, which are two other places in India. Again, ‘#Kisiizi’ (the name of a hospital in Uganda) is incorrectly segmented into ‘kissi’ and ‘zi’, neither of which is a location name, so the true location is lost.
In spite of these limitations of hashtag segmentation, we still carry out this step since we seek to extract all possible location names, including those embedded within hashtags.
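The segmentation step can be sketched as a dynamic program over unigram word probabilities, in the spirit of [Norvig (2009)]. The frequency table below is a toy stand-in with hypothetical counts, not the corpus statistics used by the actual system:

```python
import math
from functools import lru_cache

# Toy unigram counts standing in for a large corpus-derived frequency table
# (hypothetical values; the deployed system uses corpus statistics as in Norvig (2009)).
FREQ = {"nepal": 500, "quake": 300, "gujarat": 400, "floods": 600,
        "bengal": 450, "uru": 20}
TOTAL = sum(FREQ.values())

def log_prob(word):
    # Smoothed unigram log-probability; unseen words are penalised by length
    # so that segmentations into known words are preferred.
    if word in FREQ:
        return math.log(FREQ[word] / TOTAL)
    return math.log(10.0 / (TOTAL * 10 ** len(word)))

def segment(hashtag):
    """Split a hashtag body into its most probable sequence of words."""
    text = hashtag.lstrip("#").lower()

    @lru_cache(maxsize=None)
    def best(s):
        # Returns (total log-probability, word list) for the best split of s.
        if not s:
            return (0.0, [])
        options = []
        for i in range(1, len(s) + 1):
            score, rest = best(s[i:])
            options.append((log_prob(s[:i]) + score, [s[:i]] + rest))
        return max(options)

    return best(text)[1]
```

With this toy table, `segment("#Bengaluru")` reproduces exactly the failure mode described above: the hashtag is split into ‘bengal’ and ‘uru’ because both are frequent dictionary words.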
3.2 Tweet Preprocessing
We then apply common pre-processing steps to the tweet text: we remove URLs, user mentions, and stray tokens such as ‘RT’, brackets, ‘#’ and ellipses, and we segment CamelCase words. We do not perform case-folding on the text, since casing helps to detect proper nouns. Likewise, we abstain from stemming, since stemmed location names may be altered and then cannot be found in the gazetteer.
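A minimal sketch of these pre-processing steps using regular expressions (the exact patterns of the deployed system may differ):

```python
import re

def preprocess(tweet):
    """Remove URLs, mentions, 'RT' markers and stray symbols, segment
    CamelCase words, and keep the original casing so that proper nouns
    remain detectable."""
    text = re.sub(r"https?://\S+", " ", tweet)        # URLs
    text = re.sub(r"@\w+", " ", text)                 # user mentions
    text = re.sub(r"\bRT\b", " ", text)               # retweet marker
    text = re.sub(r"[#()\[\]]|\.{3}|…", " ", text)    # #, brackets, ellipses
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)  # CamelCase -> Camel Case
    return re.sub(r"\s+", " ", text).strip()
```

For example, `preprocess("RT @user Floods in #NepalQuake https://t.co/x ...")` yields "Floods in Nepal Quake", with casing preserved for the proper-noun step that follows.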
3.3 Disambiguating Proper Nouns from Parse Trees
Since most location names are likely to be proper nouns, we use a heuristic to determine whether a proper noun is a location.
We first apply a Part-of-Speech (POS) tagger to the pre-processed text. Several POS taggers are publicly available and could be applied here; we use spaCy.
Let t_i denote the POS tag of the i-th word w_i of the tweet. If t_i corresponds to a proper noun, we keep appending the words that succeed w_i, provided they are also proper nouns, delimiters or adjectives. We develop a list of common suffixes of location names (explained below). If w_i is followed by a noun in this suffix list, we consider the phrase to be a viable location. Acknowledging that Out-of-Vocabulary (OOV) words are common on Twitter, we also consider words that have a high Jaro-Winkler similarity with the words in the suffix list.
We also check the word immediately preceding w_i, to see if it is a preposition that usually precedes a place or location, such as ‘at’, ‘in’, ‘from’, ‘to’, ‘near’, etc. We then split the stream of words at the delimiters. Thus, we attempt to infer from the text those proper nouns which conform to locations by their syntactic structure.
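The heuristic above can be sketched as follows, operating on (word, POS-tag) pairs as produced by a tagger such as spaCy. The suffix excerpt is illustrative, and difflib stands in for the Jaro-Winkler similarity only to keep the sketch dependency-free:

```python
from difflib import SequenceMatcher

# Excerpt of the suffix list; the full list is described in the next subsection.
SUFFIXES = {"hospital", "school", "town", "city", "village", "river", "street"}
LOCATION_PREPS = {"at", "in", "from", "to", "near"}

def similar_to_suffix(word, threshold=0.85):
    # Stand-in for the Jaro-Winkler check; difflib is used here only to keep
    # the sketch free of external dependencies.
    return any(SequenceMatcher(None, word.lower(), s).ratio() >= threshold
               for s in SUFFIXES)

def candidate_locations(tagged):
    """Extract candidate location phrases from (word, POS-tag) pairs.

    `tagged` mimics the output of a POS tagger such as spaCy, using
    Universal POS tags (PROPN, NOUN, ADJ, ADP, PUNCT, ...).
    """
    candidates, i = [], 0
    while i < len(tagged):
        word, pos = tagged[i]
        if pos == "PROPN":
            phrase, j = [word], i + 1
            # Keep appending succeeding proper nouns, adjectives or delimiters.
            while j < len(tagged) and tagged[j][1] in {"PROPN", "ADJ", "PUNCT"}:
                phrase.append(tagged[j][0])
                j += 1
            # A following suffix-like noun or a preceding location preposition
            # makes the phrase a viable location candidate.
            followed = (j < len(tagged) and tagged[j][1] == "NOUN"
                        and similar_to_suffix(tagged[j][0]))
            preceded = i > 0 and tagged[i - 1][0].lower() in LOCATION_PREPS
            if followed:
                phrase.append(tagged[j][0])
            if followed or preceded:
                candidates.append(" ".join(w for w in phrase if w.isalnum()))
            i = j + 1
        else:
            i += 1
    return candidates
```

For instance, the tagged fragment [("at", "ADP"), ("Vinayak", "PROPN"), ("hospital", "NOUN")] yields the candidate "Vinayak hospital", triggered by both the preposition and the suffix noun.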
3.4 Regex matches
As mentioned in the previous section, we have compiled a suffix list containing words that usually come after a location name. The suffix list comprises different naming conventions for landforms, roads, buildings and towns, along with directions and emergency-related words (see the table below).
We perform this additional task of regex matching to account for cases where the tweet is posted entirely in lowercase, making it difficult to detect and disambiguate proper nouns. Using the suffix list enables us to detect places like ‘Vinayak hospital’ and ‘Gujranwala town’ from the tweet “Urgent B+ group platelets suffering from dengue,Ankit Arora At Vinayak hospital, Gujranwala town,delhi”.
Category | Example words
Landforms | doab, lake, stream, river, island, valley, mountain, hill
Roads | street, st, boulevard, junction, lane, rd, avenue, bridge
Buildings | hospital, school, shrine, cinema, villa, temple, mosque
Towns | city, district, village, gram, place, town, nagar
Directions | south, eastern, NW, SE, west, western, north east
Diseases | dengue, ebola, cholera, zika, malaria, chikungunya
Disasters | earthquake, floods, drought, tsunami, landslide, rains
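The suffix-based matching can be sketched with a regular expression built from an excerpt of the suffix list; it pairs each suffix with the immediately preceding word, case-insensitively, so that all-lowercase tweets are still covered:

```python
import re

# An excerpt of the suffix list shown in the table above.
SUFFIXES = ["hospital", "school", "town", "village", "city", "nagar",
            "street", "boulevard", "avenue", "bridge", "river", "valley"]

# Pair each suffix with the single immediately preceding word;
# IGNORECASE so lowercase tweets are matched as well.
SUFFIX_RE = re.compile(r"\b(\w+)\s+(%s)\b" % "|".join(SUFFIXES),
                       flags=re.IGNORECASE)

def suffix_matches(text):
    """Return '<word> <suffix>' phrases found in the text."""
    return ["%s %s" % (m.group(1), m.group(2))
            for m in SUFFIX_RE.finditer(text)]
```

On the lowercase fragment "ankit arora at vinayak hospital, gujranwala town, delhi", this extracts 'vinayak hospital' and 'gujranwala town', exactly the cases that proper-noun detection misses.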
3.5 Dependency Parsing of Emergency words
So far, the methodology aims at improving the precision, but does not look to improve recall. This step is meant to improve recall by capturing those locations which do not follow the common patterns listed above.
Considering that our objective is to monitor emergency scenarios, we identify a set of words corresponding to emergencies (e.g., the disease and disaster words listed above), and examine their proximity to candidate location names in the dependency parse of the tweet.
As an example, Figure 1 shows the dependency graph for the tweet “Mumbai lost its mudflats and wetlands, now floods with every monsoon.”. We see that the distance between Mumbai and floods in the dependency graph of the tweet is 2, whereas the actual distance between the words in the text is 7. Hence we can identify Mumbai as a proper location via dependency parsing. Also, we extract the noun phrases from the dependency graph (as in [Malmasi and Dras (2016)]) and use the spaCy NER tagger as in [Malmasi and Dras (2016), Lingad et al. (2013), Gelernter and Zhang (2013)].
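The distance check can be sketched as a breadth-first search over the dependency tree. The edge list below is hand-written to approximate the parse of the example tweet (in the running system the edges would come from spaCy's dependency parser):

```python
from collections import deque

def graph_distance(edges, a, b):
    """Shortest-path length between two tokens in a dependency tree,
    given as (head, child) pairs and treated as an undirected graph."""
    adj = {}
    for head, child in edges:
        adj.setdefault(head, set()).add(child)
        adj.setdefault(child, set()).add(head)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # tokens not connected

# Hand-written edges approximating the parse of the example tweet
# "Mumbai lost its mudflats and wetlands, now floods with every monsoon."
EDGES = [("lost", "Mumbai"), ("lost", "mudflats"), ("mudflats", "its"),
         ("mudflats", "wetlands"), ("wetlands", "and"), ("lost", "floods"),
         ("floods", "now"), ("floods", "with"), ("with", "monsoon"),
         ("monsoon", "every")]
```

With these edges, the distance between "Mumbai" and "floods" is 2 (via "lost"), matching the value discussed above, even though the two words are 7 positions apart in the surface text.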
3.6 Gazetteer Verification
The list of phrases and locations extracted by the above methods is then verified using a gazetteer, to retain only those words that correspond to real-world locations. For our system, the gazetteer also returns the geo-spatial coordinates, to enable plotting the location on a map.
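Gazetteer verification amounts to a lookup that filters candidate phrases and attaches coordinates. A minimal sketch with a toy in-memory gazetteer (the deployed system queries GeoNames or Open Street Map; the 'vinayak hospital' entry and its coordinates are hypothetical):

```python
# Toy gazetteer mapping lower-cased place names to (lat, lon) coordinates.
# The 'vinayak hospital' entry below is purely hypothetical.
GAZETTEER = {
    "mumbai": (19.0760, 72.8777),
    "tamil nadu": (11.1271, 78.6569),
    "vinayak hospital": (28.7041, 77.1025),
}

def verify(candidates):
    """Keep only candidate phrases that resolve to real-world coordinates,
    returning the coordinates needed to plot them on the map."""
    verified = {}
    for phrase in candidates:
        coords = GAZETTEER.get(phrase.lower())
        if coords is not None:
            verified[phrase] = coords
    return verified
```

A phrase like 'silicon city' that has no gazetteer entry is silently dropped at this stage, while 'Mumbai' survives along with its plotting coordinates.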
4 Comparative evaluation of the location inference
In this section, we describe the evaluation of the proposed methodology, and compare it with several baseline methods. We start by describing the dataset and some design choices made by us.
4.1 Dataset
We used the Twitter Streaming API to collect tweets; after de-duplication, our dataset contains 239,276 distinct tweets.
4.2 Gazetteer employed
In this work, we currently focus on collecting and displaying tweets within the bounding box of the country of India.
Thus, we need a lexicon / gazetteer to disambiguate whether a place is located inside India, and to obtain its geographical coordinates. To that end, we scraped the data publicly available from Geonames. However, Geonames lacks fine-grained entries (e.g., names of individual hospitals and streets). Consequently, we explored another gazetteer – the Open Street Map gazetteer – which does contain such fine-grained entries, but is considerably slower to query.
Thus the choice of the gazetteer is governed by a trade-off between recall and efficiency. We report performances using both gazetteers in this paper. Hence we consider two variants of the proposed methodology:
GeoLoc- Our proposed methodology using Geonames as the gazetteer.
OSMLoc- Our proposed methodology using Open Street Maps as the gazetteer.
4.3 Baseline methodologies
We compared our proposed approach with several baseline methodologies, which are listed below:
UniLoc- Take all unigrams in the processed tweet text and infer if any of those correspond to a possible location (by referring to a gazetteer).
BiLoc- Similar to UniLoc, except we consider both unigrams and bigrams in the tweet text.
StanfordNER - Employs the NER of coreNLP parser [Finkel et al. (2005)].
TwitterNLP - Employs the NER of the Twitter NLP parser (http://www.cs.cmu.edu/~ark/TweetNLP/) developed by Ritter et al. [Ritter et al. (2011)].
Google Cloud- Use the Google cloud API to infer locations.
SpaCyNER - Use the trained SpaCy NER tagger.
For all the baseline methods, the potential locations are checked using the GeoNames gazetteer.
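The UniLoc and BiLoc baselines can be sketched as n-gram lookups against the gazetteer; the gazetteer here is a toy set of lower-cased place names, for illustration only:

```python
def ngrams(tokens, n):
    # All contiguous n-word sequences from the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def biloc(text, gazetteer):
    """BiLoc baseline: check every unigram and bigram of the tweet text
    against the gazetteer (UniLoc is the same, with unigrams only)."""
    tokens = text.split()
    candidates = ngrams(tokens, 1) + ngrams(tokens, 2)
    return [c for c in candidates if c.lower() in gazetteer]

# Toy gazetteer of lower-cased place names, for illustration only.
GAZ = {"tamil nadu", "delhi", "mumbai"}
```

For example, `biloc("Dengue Fever in Tamil Nadu", GAZ)` finds 'Tamil Nadu' via its bigram, which UniLoc would miss since neither 'Tamil' nor 'Nadu' is a gazetteer entry on its own.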
4.4 Evaluation Measures
Given a tweet text, we wish to infer all possible locations contained in the tweet. Thus we should prefer a method which has higher recall. However, since we also aim to plot the location obtained from the tweet, the precision of our extracted locations also matters. Hence we apply the following measures.
Precision = |Correct locations ∩ Retrieved locations| / |Retrieved locations|
Recall = |Correct locations ∩ Retrieved locations| / |Correct locations|
where ‘Correct locations’ is the set of locations actually mentioned in a tweet, as found by human annotators, and ‘Retrieved locations’ is the set of locations inferred by a given methodology from the same tweet. To get an idea of both precision and recall, we use the F-score, which is the harmonic mean of the two.
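These measures can be computed per tweet as a straightforward implementation of the definitions above:

```python
def prf(correct, retrieved):
    """Per-tweet precision, recall and F-score over sets of location strings."""
    if not correct or not retrieved:
        return 0.0, 0.0, 0.0
    overlap = len(correct & retrieved)
    precision = overlap / len(retrieved)
    recall = overlap / len(correct)
    # Harmonic mean of precision and recall (0 when there is no overlap).
    f_score = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return precision, recall, f_score
```

For instance, if the annotators marked {mumbai, delhi} and a method retrieved only {mumbai}, precision is 1.0, recall is 0.5, and the F-score is 2/3.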
Moreover, since we wish to deploy the system on a real-time basis, the time a method takes to process tweets is also an important metric.
4.5 Evaluation results
We randomly selected 1,000 tweets from the collected set of tweets (as described earlier), and asked human annotators to identify those tweets which contain some location names. The annotators identified a set of 101 tweets that contained at least one location name. Hence the comparative evaluation is carried out over this set of 101 tweets.
Method | Precision | Recall | F-score | Timing (in s)
Table 2 compares the performances of the baseline methods and the proposed method.
The last column shows the average time in seconds needed to process the tweets that we are using for evaluation.
We observe that GeoLoc performs the best in terms of F-score compared to all other methods. It also scores high on precision, ranking third behind only StanfordNER and SpaCyNER. The high precision of SpaCyNER is counterbalanced by its very low recall, due to which it was hardly able to detect remote places like Mohali and May Hosp. from the tweet “Urgent B+ blood needed for a crit dengue patient at May Hosp., Mohali, (Chandigarh)”.
The slight decrease in precision is attributed to some common words like ‘song’, ‘monsoon’ and ‘parole’ being chosen as potential locations due to incorrect hashtag segmentation, and then being tagged as locations by the gazetteer, since these are also names of certain places in India.
It can also be seen that the proposed method using the GeoNames gazetteer is much faster than the other methods that achieve comparable performance (e.g., StanfordNER).
Choice of gazetteer: As stated earlier, the Geonames gazetteer lacks information at a granular level. Consequently, specific places such as hospitals and streets are often not recognized as valid locations. This hampers the recall of the system; e.g., the proposed methodology was unable to detect ‘star hospital’ in the tweet “We need O-ve blood grup for 8 years boy suffering with dengue in star hospital in karimnagar , please Contact.”
Open Street Map (OSM) is able to detect such specific locations and thus exhibits the highest recall amongst all methods. However, using OSM has the side-effect of classifying many simple noun phrases as valid locations. For instance, ‘silicon city’ is detected as a location in the tweet “@rajeev_mp seems its time to rename Bangalore as Floods city I/O silicon city.”, since ‘silicon city’ is judged to be a shortened form of the entry ‘Concorde Silicon Valley, Electronics City Phase 1, Vittasandra, Bangalore’. As a result of such errors, the method using OSM has the lowest precision amongst all the methods.
Performance over the entire dataset: From the entire set of 239,276 distinct tweets, only 3,493 were geo-tagged, out of which 869 were from India (a minute 0.36% of the entire dataset). Using our proposed technique with Geonames, 68,793 tweets from the entire dataset were successfully tagged with a location, which corresponds to approximately 28.7%. Hence the coverage is increased drastically. We manually inspected many of the inferred locations, and found a large fraction of them correct. The method could identify niche and remote places in India, like ‘Ghatkopar’, ‘Guntur’, ‘Pipra village’ and ‘Kharagpur’, besides metropolitan cities like ‘Delhi’, ‘Kolkata’ and ‘Mumbai’.
5 SAVITR: Deploying the location inference method
We have deployed the proposed techniques (using GeoNames) on a system named SAVITR, which is live at http://savitr.herokuapp.com. The software architecture of SAVITR is presented in Figure 2. Since the amount of data to be displayed is massive, we made certain design choices so that the information displayed is compact and visually enriching, while at the same time scalable. The system was built using the Dash framework by Plotly [Plo (2016)]. For our visualization purpose, we settled on a Mapbox map at the heart of the UI, with various controls, as described below. A snapshot of the system is shown in Figure 3.
A search bar at the top of the page: whenever a term is entered into the search bar, the map refreshes and shows tweets pertaining to that query term. It also supports multiple search queries like ‘Dengue, Malaria’.
The tweets on the map are color coded according to the time of the day. Tweets posted in the night are darker.
A date-picker – if one wishes to visualize tweets posted during a particular time duration, this provides fine grained date selection, both at the month and date level.
A Histogram – this shows the number of relevant (tagged) tweets posted per day.
Untagged tweets – Finally, at the bottom of the page we display the tweets for which location could not be inferred (and hence they could not be shown on the map).
We report the performance of the system during the massive dengue outbreak that plagued India in the fall of 2017.
Though the SAVITR system presently infers locations within India, it can be easily extended to infer locations within other countries, and the whole world in general.
6 Concluding discussion
We proposed a methodology for real-time inference of locations from tweet text, and deployed the methodology in a system (SAVITR). The proposed methodology performs better than many prior methods, and is much more suitable for real-time deployment.
We observed several challenges that remain to be solved. For instance, for some geo-tagged tweets, the tweet is posted from a place different from the locations mentioned in the text. A common phenomenon is that a tweet posted from a metropolitan city (e.g., Delhi) contains some information about a suburb; how to deal with such tweets is application-specific. Again, multiple locations can have the same name, hence disambiguating location names is a major challenge. We plan to explore these directions in future work.
References
- Ushahidi. 2008. https://www.ushahidi.com/. Accessed: 2018-01-22.
- OpenNLP. 2010. http://opennlp.apache.org.
- Chennai floods map: Crowdsourcing data and crisis mapping for emergency response. 2015. https://osm-in.github.io/flood-map/chennai.html#11/13.0000/80.2000. Accessed: 2018-01-22.
- Plotly Dash. 2016. https://plot.ly/products/dash/.
- Hussein S. Al-Olimat, Krishnaprasad Thirunarayan, Valerie L. Shalin, and Amit P. Sheth. 2017. Location Name Extraction from Targeted Text Streams using Gazetteer-based Statistical Language Models. CoRR abs/1708.03105 (2017). arXiv:1708.03105 http://arxiv.org/abs/1708.03105
- Kalina Bontcheva, Leon Derczynski, Adam Funk, Mark A Greenwood, Diana Maynard, and Niraj Aswani. 2013. TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text. In In Proceedings of the Recent Advances in Natural Language Processing (RANLP 2013), Hissar. Citeseer.
- Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL ’05). Association for Computational Linguistics, Stroudsburg, PA, USA, 363–370. DOI:http://dx.doi.org/10.3115/1219840.1219885
- Judith Gelernter and Wei Zhang. 2013. Cross-lingual geo-parsing for non-structured data. In Proceedings of the 7th Workshop on Geographic Information Retrieval. ACM, 64–71.
- Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. 2015. Processing Social Media Messages in Mass Emergency: A Survey. Comput. Surveys 47, 4 (June 2015), 67:1–67:38.
- Zongcheng Ji, Aixin Sun, Gao Cong, and Jialong Han. 2016. Joint recognition and linking of fine-grained locations from tweets. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1271–1281.
- John Lingad, Sarvnaz Karimi, and Jie Yin. 2013. Location extraction from disaster-related microblogs. In Proceedings of the 22nd international conference on world wide web. ACM, 1017–1020.
- Shervin Malmasi and Mark Dras. 2016. Location Mention Detection in Tweets and Microblogs. In Computational Linguistics, Kôiti Hasida and Ayu Purwarianti (Eds.). Springer Singapore, Singapore, 123–134.
- Stuart E Middleton, Lee Middleton, and Stefano Modafferi. 2014. Real-time crisis mapping of natural disasters using social media. IEEE Intelligent Systems 29, 2 (2014), 9–17.
- Peter Norvig. 2009. Natural Language Corpus Data. In Beautiful Data, Toby Segaran and Jeff Hammerbacher (Eds.). O’Reilly Media.
- Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’11). Association for Computational Linguistics, Stroudsburg, PA, USA, 1524–1534. http://dl.acm.org/citation.cfm?id=2145432.2145595
- Koustav Rudra, Subham Ghosh, Pawan Goyal, Niloy Ganguly, and Saptarshi Ghosh. 2015. Extracting Situational Information from Microblogs during Disaster Events: A Classification-Summarization Approach. In Proc. ACM CIKM.
- Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors. In Proc. International Conference on World Wide Web (WWW). 851–860.