Twitter Corpus of the #BlackLivesMatter Movement And Counter Protests: 2013 to 2020

Abstract

Black Lives Matter (BLM) is a grassroots movement protesting violence towards Black individuals and communities, with a focus on police brutality. The movement gained significant media and political attention following the killings of Ahmaud Arbery, Breonna Taylor, and George Floyd and the shooting of Jacob Blake in 2020 [1]. Because the movement is decentralized, the #BlackLivesMatter social media hashtag has come to both represent the movement and serve as a call to action. Similar hashtags have appeared to counter the BLM movement, such as #AllLivesMatter and #BlueLivesMatter. We introduce a data set of 41.8 million tweets from 10 million users, each of which contains at least one of the following keywords: BlackLivesMatter, AllLivesMatter and BlueLivesMatter (see footnote 1). This data set contains all currently available tweets from the beginning of the BLM movement in 2013 to June 2020. We summarize the data set and show temporal trends in the use of both the BlackLivesMatter keyword and the keywords associated with counter movements. Similarly themed BLM data sets, though much smaller in scope, have been used for studying discourse in protest and counter protest movements [3, 2], predicting retweets [5], examining the role of social media in protest movements [6, 4] and exploring narrative agency [7]. This paper open-sources a large-scale data set to facilitate research in the areas of computational social science, communications, political science, natural language processing, and machine learning.

Keywords: social media, Twitter, hashtags, social movements, protests, policing

1 Value of the Data

  • These data are useful because they showcase the entire course of a large, ongoing social movement (Black Lives Matter) and its counter protests (All Lives Matter and Blue Lives Matter). To our knowledge, no other Twitter data sets exist that cover the entire span of the Black Lives Matter movement to date.

  • All researchers interested in systemic racism, social movements, grassroots campaigns, racial inequality, police brutality and counter protests, especially those working in the fields of computational social science, communications, and political science, can benefit from this data.

  • The data set contains 41.8 million posts from 10 million users and can be used to identify linguistic patterns associated with the social movements and their counter protests, social networks (through friend/follower user data), temporal and spatial patterns (through the use of timestamps and latitude/longitude coordinates), inter- and intra- movement dialog and the spread of news and misinformation (through retweets and tweets linking news articles).

  • Since 2013, the BLM movement has grown exponentially, resulting in global protests and several counter protests. This historical data, starting in 2013 and ending in 2020, permits researchers to track this grassroots movement from its social media beginnings.

2 Data Description

Tweets containing the keywords BlackLivesMatter, AllLivesMatter and BlueLivesMatter were collected from the Twitter API from January 2013 to June 2020. Table 1 contains counts of the total number of tweets and users for the entire data set and for each keyword. It also includes counts of retweets (original tweets which are shared by other users on the platform), replies (tweets which directly respond to another tweet) and geotagged tweets (tweets with associated latitude/longitude coordinates), along with the top languages (the automatically detected language of each tweet). Retweets may or may not contain additional content created by the user doing the retweeting.

Keyword           Tweets      Users       Retweets    Replies    Geotagged  Top Languages
All               41,801,153  10,136,019  30,377,162  2,033,245  69,969     en, fr, es, pt, ja
BlackLivesMatter  36,892,699  9,543,924   27,565,206  1,583,077  61,392     en, fr, es, pt, ja
AllLivesMatter    3,001,012   1,462,712   1,463,972   368,035    8,977      en, es, nl, ja, fr
BlueLivesMatter   3,352,437   811,805     2,174,139   195,525    2,049      en, fr, es, ja, de

Table 1: Descriptive counts for the entire data set and each keyword. Note that tweets can contain more than one keyword and can therefore be included in more than one row. ISO 639-1 language codes: en = English, fr = French, es = Spanish, pt = Portuguese, ja = Japanese, nl = Dutch, de = German.
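
As a quick way to read Table 1, the counts can be turned into per-keyword shares. The sketch below copies the tweet, retweet, reply and geotagged counts from the table and divides each by the row's tweet total. The names `TABLE_1`, `shares` and `SHARES` are illustrative and not part of the released data set.

```python
# Counts copied from Table 1: (tweets, retweets, replies, geotagged).
TABLE_1 = {
    "All": (41_801_153, 30_377_162, 2_033_245, 69_969),
    "BlackLivesMatter": (36_892_699, 27_565_206, 1_583_077, 61_392),
    "AllLivesMatter": (3_001_012, 1_463_972, 368_035, 8_977),
    "BlueLivesMatter": (3_352_437, 2_174_139, 195_525, 2_049),
}


def shares(row):
    """Return each count in a Table 1 row as a fraction of that row's tweets."""
    tweets, retweets, replies, geotagged = row
    return {
        "retweets": retweets / tweets,
        "replies": replies / tweets,
        "geotagged": geotagged / tweets,
    }


# Per-keyword shares, keyed the same way as the table rows.
SHARES = {keyword: shares(row) for keyword, row in TABLE_1.items()}

if __name__ == "__main__":
    for keyword, share in SHARES.items():
        print(
            f"{keyword}: {share['retweets']:.1%} retweets, "
            f"{share['replies']:.1%} replies, "
            f"{share['geotagged']:.2%} geotagged"
        )
```

Because a tweet can appear in more than one keyword row, these shares describe each row separately and should not be summed across keywords.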

Tweets also contain a large number of other pieces of metadata, such as user profile data and place information. User profiles contain information such as user handles, free text descriptions and profile images. Places are named locations users decide to associate with a tweet. While Places describe physical locations, they do not imply the tweet originated from this location. Twitter users may manually tag a location when their tweet is about that Place. Due to the large number of additional fields available for each tweet, we do not provide counts for any additional content.

The monthly volume of each keyword is plotted in Figure 1. Here we plot the seven-day running average of the total count (logged) of all tweets containing one of our keywords. All labels marked with a single name indicate the date of a high-profile killing related to police brutality.

Figure 1: Seven-day moving average of logged monthly tweet count from 2013 to 2020 of all three keywords. We include markers for high-profile events associated with the BLM movement.
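
The smoothing behind a plot like Figure 1 can be sketched in a few lines. The text does not state the logarithm base or whether the log is taken before or after averaging, so this sketch assumes a natural log applied to each count first (using `log1p` so that zero counts are defined), followed by a trailing seven-point mean. The function names are illustrative.

```python
import math


def moving_average(values, window=7):
    """Trailing mean over each full window.

    Returns len(values) - window + 1 points, or an empty list when there
    are fewer values than the window size.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    values = list(values)
    if len(values) < window:
        return []
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


def logged_moving_average(counts, window=7):
    """Assumed order of operations: log each count, then smooth."""
    logged = [math.log1p(count) for count in counts]
    return moving_average(logged, window)
```

For example, `moving_average([1, 2, 3, 4, 5, 6, 7, 8])` yields two points, one for each complete seven-value window.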

3 Experimental Design, Materials and Methods

On July 14, 2016, we set up a data puller using the Python package TwitterMySQL (see footnote 2) to collect tweets matching at least one of our keywords: BlackLivesMatter, AllLivesMatter and BlueLivesMatter. This package uses the official Twitter Application Programming Interface (API) to stream tweets in real time. The data puller continuously collected tweets from the Twitter stream until the time of writing (July 2020). In total, we collected 50,574,955 tweets. Although the Twitter API was queried using the keywords BlackLivesMatter, AllLivesMatter and BlueLivesMatter, the API delivers a broader set of matching tweets. For example, a tweet might contain the phrase “black lives matter” or “blm”, among other variations, instead of the keyword BlackLivesMatter.

We note that the Twitter API limits such streams to 1% of the total Twitter volume at any given moment. To see if our keyword data set was limited at any point, we compared the monthly keyword volume to a full 1% monthly pull (not limited to any single keyword, location, etc.). Over the 4-year time span, our keyword data set pulled in a monthly average of 1,176,161 tweets (4,629,878 SD) as compared to a monthly average of 94,893,476 tweets (27,394,826 SD) from the full 1% pull. Thus, we do not believe our data set was limited by the Twitter API.
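
The comparison above can be expressed as a simple per-month check. The sketch below flags any month in which the keyword stream's volume reaches a chosen fraction of the full 1% sample's volume. The `threshold` of 0.9 is a hypothetical cutoff, and the function name and input layout (dictionaries keyed by `"YYYY-MM"` strings) are assumptions made for illustration.

```python
def months_near_cap(keyword_counts, sample_counts, threshold=0.9):
    """Return months where keyword volume is at least `threshold` times
    the volume of the full 1% sample for the same month."""
    flagged = []
    for month in sorted(keyword_counts):
        sample = sample_counts.get(month)
        if not sample:
            # No reference volume for this month, so no comparison is possible.
            continue
        if keyword_counts[month] / sample >= threshold:
            flagged.append(month)
    return flagged


# Ratio of the reported monthly averages (keyword pull vs. full 1% pull).
AVERAGE_RATIO = 1_176_161 / 94_893_476
```

With the monthly averages reported above, the keyword stream carries a little over 1% of the volume of the full 1% pull, which is far below any reasonable cutoff.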

Due to server maintenance, there were periods when we were unable to collect data. These include: October 17, 2016 through November 23, 2016; January 1, 2017 through January 21, 2017; March 11, 2017 through March 16, 2017; May 2, 2018 through December 18, 2018; and March 16, 2019 through March 20, 2019. Additionally, the Black Lives Matter movement began in 2013, roughly three years before the beginning of our data collection. In order to fill these gaps, we used the Python package GetOldTweets3 [9], which pulls historical tweets containing a given keyword. These tweets were pulled in June 2020. Using this method, we collected 4,276,423 historical tweets.
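
The periods that need historical backfill can be written down directly from the dates above. The sketch below treats every "through" date as inclusive, which is an assumption, and takes July 14, 2016 as the start of streaming. The names `STREAM_START`, `OUTAGES` and `needs_backfill` are illustrative.

```python
from datetime import date

# Start of the streaming collection, as stated in the text.
STREAM_START = date(2016, 7, 14)

# Server-maintenance gaps listed in the text; both endpoints are assumed inclusive.
OUTAGES = [
    (date(2016, 10, 17), date(2016, 11, 23)),
    (date(2017, 1, 1), date(2017, 1, 21)),
    (date(2017, 3, 11), date(2017, 3, 16)),
    (date(2018, 5, 2), date(2018, 12, 18)),
    (date(2019, 3, 16), date(2019, 3, 20)),
]


def needs_backfill(day):
    """True if `day` precedes the stream or falls inside a listed outage."""
    if day < STREAM_START:
        return True
    return any(start <= day <= end for start, end in OUTAGES)
```

A check like this can be used to decide, per day, whether tweets should come from the streaming collection or from the historical pull.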

Having two separate methods of pulling tweet data (prospective using the streaming API and retrospective using GetOldTweets3; see footnote 3) caused inconsistencies when reconstructing timelines of keyword use. While Twitter data is publicly available, at any point a user may delete a tweet, delete their account, or set their account to private. Thus, when pulling prospective data, we collected tweets which may have been deleted or made private at some point after the initial pull. On the other hand, one cannot pull deleted or private tweets with a retrospective collection. In order to ensure the data set only contained presently available tweets, we executed a one-time historical pull in June 2020. As a result, any tweet deleted after our initial pull will not be made available. Our final data set consisted of 41,801,153 tweets.
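
One way to picture the reconciliation is as set operations on tweet IDs. This is only a sketch under assumptions: it supposes that the streamed and historical collections are merged and de-duplicated by ID, and that a later availability check yields the set of IDs that are still public. The function name and the `available_ids` input are hypothetical, and the authors' actual procedure may differ.

```python
def reconcile(streamed_ids, historical_ids, available_ids):
    """Merge both collections by tweet ID and keep only IDs that are
    still publicly available (hypothetical `available_ids` input)."""
    collected = set(streamed_ids) | set(historical_ids)
    return collected & set(available_ids)
```

Under this reading, a tweet captured by the stream but later deleted drops out of the final set, while a tweet found by both methods is counted once.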

Due to Twitter’s Terms of Service, only numeric tweet IDs can be publicly shared. The numeric IDs can be used to pull the full tweet set using the Twitter API. There are a number of open source software packages which allow researchers to easily interface with the API. The authors used the Python package TwitterMySQL, which saves tweet information in a MySQL database. Other packages exist which do not rely on relational databases, such as the Python package twarc (see footnote 4), which saves tweets to text files in JSON format. Finally, Hydrator (see footnote 5), which relies on an easy-to-use GUI, saves tweets to both JSON and CSV formats.
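
Rebuilding tweets from IDs generally means sending the IDs to the API in batches. The sketch below shows only the batching logic and makes no API calls. The `lookup` argument is a hypothetical callable standing in for whatever client is used (for example, a wrapper around one of the packages above), and the default batch size of 100 is an assumption that should be checked against the current API documentation.

```python
from itertools import islice


def batched(ids, size=100):
    """Yield consecutive lists of at most `size` IDs."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def hydrate(ids, lookup, size=100):
    """Pass each batch to `lookup` (hypothetical) and yield what it returns.

    `lookup` takes a list of tweet IDs and returns an iterable of tweet
    records; IDs for deleted or private tweets may simply be absent.
    """
    for batch in batched(ids, size):
        yield from lookup(batch)
```

Packages such as twarc and Hydrator handle batching, authentication and rate limits themselves, so a hand-written loop like this is mainly useful for understanding what those tools do.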

4 Ethics Statement

The data used in this article is publicly available and is distributed in accordance with Twitter’s Terms of Service. Additionally, no human subjects were used in the data collection.

5 Acknowledgments

This research was supported in part by the Intramural Research Program of the NIH, National Institute on Drug Abuse (NIDA).

6 Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.

Footnotes

  1. Data available at: https://doi.org/10.5281/zenodo.4056563
  2. https://github.com/dlatk/TwitterMySQL
  3. https://github.com/Mottl/GetOldTweets3
  4. https://github.com/DocNow/twarc
  5. https://github.com/DocNow/hydrator

References

  1. M. Anderson, M. Barthel, A. Perrin and E. A. Vogels (2020-06) #BlackLivesMatter surges on Twitter after George Floyd’s death. Pew Research Center. https://www.pewresearch.org/fact-tank/2020/06/10/blacklivesmatter-surges-on-twitter-after-george-floyds-death/
  2. J. L. Blevins, J. J. Lee, E. E. McCabe and E. Edgerton (2019) Tweeting for social justice in #Ferguson: affective discourse in Twitter hashtags. New Media & Society 21 (7), pp. 1636–1653.
  3. R. J. Gallagher, A. J. Reagan, C. M. Danforth and P. S. Dodds (2018) Divergent discourse between protests and counter-protests: #BlackLivesMatter and #AllLivesMatter. PLoS ONE 13 (4), pp. e0195644.
  4. J. Ince, F. Rojas and C. A. Davis (2017) The social media response to Black Lives Matter: how Twitter users interact with Black Lives Matter through hashtag use. Ethnic and Racial Studies 40 (11), pp. 1814–1830.
  5. K. Keib, I. Himelboim and J. Han (2018) Important tweets matter: predicting retweets in the #BlackLivesMatter talk on Twitter. Computers in Human Behavior 85, pp. 106–115.
  6. M. Mundt, K. Ross and C. M. Burnett (2018) Scaling social movements through social media: the case of Black Lives Matter. Social Media + Society 4 (4), pp. 2056305118807911.
  7. G. Yang (2016) Narrative agency in hashtag activism: the case of #BlackLivesMatter. Media and Communication 4 (4), pp. 13.