Tweets on the road
The pervasiveness of mobile devices, which is increasing daily, is generating a vast amount of geo-located data, allowing us to gain further insights into human behavior. In particular, this new technology enables users to communicate through mobile social media applications, such as Twitter, anytime and anywhere. Thus, geo-located tweets offer the possibility of carrying out in-depth studies of human mobility. In this paper, we study the use of Twitter in transportation by identifying tweets posted from roads and rails in Europe between September 2012 and November 2013. We compute the percentage of highway and railway segments covered by tweets in each country. The coverage differs widely from country to country, and its variability can be partially explained by differences in Twitter penetration rates. Still, some of these differences might be related to cultural factors regarding mobility habits and online social interaction. Analyzing particular road sectors, our results show a positive correlation between the number of tweets on the road and the Average Annual Daily Traffic on highways in France and in the UK. Transport modality can be studied with these data as well, and we find very heterogeneous usage patterns across the continent.
An increasing amount of geo-located data is generated every day through mobile devices. This information allows for a better characterization of social interactions and human mobility patterns Watts2007 (); Vespignani2009 (). Indeed, several data sets coming from different sources have been analyzed during the last few years. Some examples include cell phone records Onnela2007 (); Eagle2009 (); Gonzalez2008 (); Song2010 (); Phithakkitnukoon2012 (); Wang2009 (); Ratti2006 (); Reades2007 (); Soto2011 (); Frias2012 (); Isaacman2012 (); Toole2014 (); Pei2013 (); Louail2014 (), credit card use information Hasan2012 (), GPS data from devices installed in cars Gallotti2012 (); Furletti2013 (), geo-located tweets Mocanu2013 (); Bailon2011 (); Hawelka2013 (); Lenormand2014 () or Foursquare data Noulas2012 (). This information has led to notable insights into human mobility at the individual level Gonzalez2008 (); Hawelka2013 (), but it also makes it possible to introduce new methods to extract origin-destination tables at a more aggregated scale Phithakkitnukoon2012 (); Isaacman2012 (); Lenormand2014 (), to study the structure of cities Louail2014 () and even to determine land use patterns Soto2011 (); Frias2012 (); Pei2013 (); Lenormand2014 ().
In this work, we analyze a Twitter database containing over million geo-located tweets from European countries with the aim of exploring the use of Twitter in transport networks. Two types of transportation systems are considered across the continent: highways and trains. Tweets on the road and on the rail between September 2012 and November 2013 have been identified, and the coverage of the total transportation system is analyzed country by country. Differences between countries arise from the different adoption or penetration rates of geo-located Twitter technology. However, our results show that the penetration rate alone cannot explain the full picture: the remaining differences across countries may be related to the cultural diversity at play. The paper is structured as follows. In the first section, the datasets are described and the method used to identify tweets on highways and railways is outlined. In the second section, we present the results, starting with general features of the Twitter database and then comparing European countries by the percentage of their highways and railways covered by the tweets. Finally, the number of tweets on the road is compared with the Average Annual Daily Traffic (AADT) in France and in the United Kingdom to assess its capacity as a proxy for traffic loads.
II Materials and Methods
The dataset comprises geo-located tweets across Europe emitted by Twitter users in the period from September 2012 to November 2013. The data were gathered through the general data streaming of the Twitter API API (). It is worth noting that the tweets are not uniformly distributed (see Figure 1a). Countries of Western Europe seem to be well represented, whereas countries of Eastern Europe are clearly under-represented (except for Turkey and Russia).
The highway (both directions) and railway networks of Europe were extracted from OpenStreetMap OSM () (see Figures 1b and 1c for maps of roads and railways, respectively). A close look at the three maps reveals that, while tweets concentrate in cities, a number of tweets follow the main roads and train lines. Even roads that go through relatively sparsely populated areas can be clearly discerned, such as those in Russia connecting the main cities, the area of Monegros in Spain, north of Zaragoza, or the main roads in the center of France (see the country-by-country maps in the appendix). Here we analyze in detail the statistics of the tweets posted on the roads and railways and discuss the possibility that they are a proxy for traffic and cultural differences. It is important to stress that we considered only the main highways (motorways and international primary roads), not rural roads, while for railways we considered all the main lines (standard gauge of the considered country). The European highways and railways that we consider have a total length of kilometers and kilometers, respectively, which have been divided into segments of kilometers each. The histograms of total lengths by country of highways (panel (a)) and railways (panel (b)) are plotted in Figure 2. Russia, Spain, Germany, France and Turkey represent of the total highway length in Europe. For the railways, Russia and Germany represent of the total length. Figure 2c shows that most European countries have a railway network larger than their highway network, except for Turkey, Norway, Greece, Spain, Portugal and Finland. In particular, Turkey has a highway system three times larger than its railway network.
ii.2 Identify the tweets on the road
To identify the tweets on the road/rail, we have considered all the tweets geo-located less than meters away from a highway (both directions) or a railway. Then, each tweet on the road/rail is associated with the closest segment of road/rail. Using this information, we can compute the percentage of road and rail segments covered by the tweets (hereafter called highway coverage and railway coverage). A segment of road or rail is covered by tweets if there is at least one tweet associated with this segment.
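The identification procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used for the paper: coordinates are assumed to be already projected to a planar system in metres, each segment is approximated by a straight line between two endpoints, and the names `point_segment_distance`, `assign_tweets` and `coverage`, the 100 m threshold and the toy data are all hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the straight segment a-b (planar metres)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def assign_tweets(tweets, segments, max_dist_m):
    """Count tweets per segment; a tweet is kept only if its closest
    segment lies less than max_dist_m away."""
    counts = {seg_id: 0 for seg_id in segments}
    for p in tweets:
        seg_id, dist = min(
            ((sid, point_segment_distance(p, a, b)) for sid, (a, b) in segments.items()),
            key=lambda item: item[1],
        )
        if dist < max_dist_m:
            counts[seg_id] += 1
    return counts

def coverage(counts):
    """Percentage of segments with at least one associated tweet."""
    covered = sum(1 for c in counts.values() if c > 0)
    return 100.0 * covered / len(counts)

# Toy data (hypothetical): three consecutive segments along the x axis.
segments = {
    "s0": ((0.0, 0.0), (10_000.0, 0.0)),
    "s1": ((10_000.0, 0.0), (20_000.0, 0.0)),
    "s2": ((20_000.0, 0.0), (30_000.0, 0.0)),
}
tweets = [(2_000.0, 50.0), (12_000.0, -30.0), (13_000.0, 5_000.0)]

# The threshold is an arbitrary illustrative value, not the one used in the paper.
counts = assign_tweets(tweets, segments, max_dist_m=100.0)
road_coverage = coverage(counts)
```

The brute-force search over all segments is quadratic and only suitable for small examples; for a continental network a spatial index would be the natural replacement.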
ii.3 Ethics statement
iii.1 General features
iii.1.1 Penetration rate
To evaluate the representativeness of the sample across European countries, the Twitter penetration rate, defined as the ratio between the number of Twitter users and the number of inhabitants of each country, is plotted in Figure 3a. This ratio is not distributed uniformly across European countries. The penetration rate is lower in countries of central Europe. It has been shown in previous studies Mocanu2013 (); Hawelka2013 () that the Gross Domestic Product (GDP) per capita (an indicator of the economic performance of a country) is positively correlated with the penetration rate at a world-wide scale. Figure 3b shows the penetration rate as a function of the GDP per capita in European countries. No clear correlation is observed in this case. This fact does not conflict with the previous results since our analysis is restricted to Europe and as shown in Mocanu2013 (), in this relationship, countries from different continents cluster together. This means that a global positive correlation appears if countries from all continents are considered but it is not necessarily significant when the focus is set instead on a particular area of the world.
iii.1.2 Social network
The penetration rate of geo-located tweets is different across European countries and does not show a clear relation to the GDP per capita of each country. Several factors can contribute to this diversity, such as the ease of access to mobile data or the prices charged by mobile data providers. In addition, generic cultural differences may also be present when facing an issue as delicate from the privacy perspective as declaring one's precise location in posted messages. One can then naturally wonder whether these differences extend to other aspects of the use of Twitter or are constrained to geographical issues. One obvious question to explore is the structure of the social network formed by the interactions between users. We extract interaction networks by taking the users as nodes and connecting a pair of them when they have exchanged a reply. Replies are specific messages in Twitter designed to answer the tweets of a particular user. A reply can be seen as a direct conversation between two users and, as shown in Grabowicz2012 () (and references therein), can be related to more intense social relations. A network per country was obtained by assigning to each user the country from which most of his or her geo-located tweets are posted.
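As an illustration of this construction, the following sketch builds one undirected reply network per country. It rests on assumptions that are not stated in the text: an edge is created as soon as one reply exists between two users, and an edge is kept only when both users are assigned to the same country. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def home_country(tweet_countries):
    """Map each user to the country where most of their geo-located tweets were posted."""
    per_user = defaultdict(Counter)
    for user, country in tweet_countries:
        per_user[user][country] += 1
    return {user: c.most_common(1)[0][0] for user, c in per_user.items()}

def reply_networks(replies, user_country):
    """Build one undirected reply network per country, stored as adjacency sets."""
    networks = defaultdict(lambda: defaultdict(set))
    for a, b in replies:
        if a == b or a not in user_country or b not in user_country:
            continue
        # Assumption: keep the edge only if both users share the same country.
        if user_country[a] == user_country[b]:
            net = networks[user_country[a]]
            net[a].add(b)
            net[b].add(a)
    return networks

def degrees(network):
    """Number of distinct reply partners per user."""
    return {user: len(neighbours) for user, neighbours in network.items()}

# Toy data (hypothetical): (user, country of a geo-located tweet) and (sender, receiver) replies.
tweet_countries = [
    ("ana", "ES"), ("ana", "ES"), ("ana", "FR"),
    ("bea", "ES"), ("carl", "ES"), ("dan", "FR"),
]
replies = [("ana", "bea"), ("bea", "ana"), ("ana", "carl"), ("ana", "dan")]

user_country = home_country(tweet_countries)
networks = reply_networks(replies, user_country)
es_degrees = degrees(networks["ES"])
```

Storing neighbours in sets means that repeated replies between the same pair of users count as a single connection, which matches the unweighted degree used in the analysis of the degree distribution.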
Figure 4a shows the degree distribution (number of connections per user) of the social networks of countries (Belgium, Croatia, Estonia, Hungary and the UK) drawn at random among the considered. The slopes of these distributions are very similar and can be fitted using a power-law distribution. More systematically, Figures 4b and 4c show, respectively, the box plot of the fitted exponent values obtained for the countries and the box plot of the R associated with these fits. All the networks have very similar degree distributions, although they show different maximum degrees as a result of the diverse network sizes. These networks are sparse because we keep users only if they post geo-located tweets and connections only if a reply between two users has taken place. Beyond the degree distribution, other topological features such as the average node clustering also seem to be quite similar across Europe, lying between and for the most populated countries (where we have more data for the network).
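The text does not specify how the power-law exponents were fitted. One simple possibility, shown here purely as an assumption, is an ordinary least-squares regression of log P(k) on log k, which yields both an exponent and a coefficient of determination. The function name `fit_power_law` and the synthetic degree sequence are hypothetical.

```python
import math
from collections import Counter

def fit_power_law(degree_sequence):
    """Least-squares fit of log P(k) = c - gamma * log k.

    Returns (gamma, r_squared). Nodes with degree zero are ignored.
    """
    hist = Counter(k for k in degree_sequence if k > 0)
    n = sum(hist.values())
    xs = [math.log(k) for k in hist]
    ys = [math.log(count / n) for count in hist.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return -slope, r_squared

# Synthetic degree sequence (hypothetical) with counts proportional to k**-2.
degree_sequence = [1] * 400 + [2] * 100 + [4] * 25 + [5] * 16 + [10] * 4 + [20] * 1
gamma, r_squared = fit_power_law(degree_sequence)
```

A regression on the raw histogram is known to be sensitive to the noisy tail of the distribution, so logarithmic binning or a maximum-likelihood estimate may be preferable on real data.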
iii.2 Highway and railway coverage
iii.2.1 Dependence between coverage and penetration rate
The percentage of segments (i.e., km) covered by the tweets in Europe is for the highway and for the railway. The highway coverage is better than the railway coverage, probably because the number of passenger-kilometers per year (the number of passengers transported per year times the kilometers traveled) is lower on the rail network. However, the coverage varies strongly from country to country. Indeed, in Figure 5 we can observe that western European countries have a better coverage than countries of eastern Europe, except for Turkey and, to a lesser extent, Russia.
Figure 6 shows the top European countries ranked by highway coverage (Figure 6a) and railway coverage (Figure 6b). The two countries with the best highway and railway coverages are the United Kingdom and the Netherlands. The tweets cover of the highway system in the UK and in the Netherlands. On the other hand, the tweets cover up to of the railway network in the UK and in the Netherlands. Conversely, the country with the lowest coverage is Moldova, with a highway coverage of and a railway coverage of . The first factor to take into account to understand such differences is the penetration rate. In fact, as can be observed in Figures 7a and 7b, the coverage of both highway and railway networks is, as a general trend, positively correlated with the penetration rate. As a consequence, a positive correlation can also be observed between the highway coverage and the railway coverage (Figure 7c). However, these relationships are characterized by a high dispersion around the regression curve. Note that the dispersion is higher than it may appear at first glance because the scales of the plots in Figure 7 are logarithmic. For the first two relationships the mean absolute error is around and for the third one it is around . This implies that differences in geo-located Twitter penetration do not fully explain the coverage differences between European countries.
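To make the notion of dispersion around the regression curve concrete, the sketch below fits a straight line in log-log space and then evaluates the mean absolute error in the original coverage units. Both choices are assumptions, since the text specifies neither the functional form of the regression nor the units of the error. All numbers are invented for illustration.

```python
import math

def loglog_fit(x, y):
    """Least-squares fit of log10(y) = intercept + slope * log10(x)."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
    intercept = my - slope * mx
    return slope, intercept

def mean_absolute_error(x, y, slope, intercept):
    """Mean absolute difference between fitted and observed y, in the units of y."""
    predicted = [10 ** (intercept + slope * math.log10(v)) for v in x]
    return sum(abs(p - o) for p, o in zip(predicted, y)) / len(y)

# Hypothetical values: penetration rate (users per inhabitant) and coverage (%).
penetration = [0.001, 0.01, 0.1]
coverage_pct = [1.0, 10.0, 10.0]

slope, intercept = loglog_fit(penetration, coverage_pct)
mae = mean_absolute_error(penetration, coverage_pct, slope, intercept)
```

Measuring the error in linear units rather than in log units is what reveals a dispersion that looks small on a logarithmic plot.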
iii.2.2 Differences across European countries
The disparity in coverage between countries cannot be satisfactorily explained by differences in fares or in accessibility to mobile data technology either. For example, two countries such as France and Spain are similar in terms of highway infrastructure, mobile phone data fares and accessibility, but their geo-located Twitter penetration rates are very different, as are their highway coverages, in Spain and in France. Besides penetration rates, divergences in coverage might be the product of cultural differences among European countries when using Twitter in transportation. As can be observed in Figure 8a, the proportion of tweets geo-located on the highway or railway networks is very different from country to country. In the following, we focus on three examples of countries with similar penetration rates but significant differences in transport network coverage.
Ireland and United Kingdom. The clearest example of the impact of cultural differences on the way people tweet in transport may be the case of Ireland and the United Kingdom. Indeed, these two countries have very similar penetration rates, but the UK has a proportion of tweets in transport more than two times higher than Ireland. Moreover, both highway and railway coverages are one and a half times higher in the UK than in Ireland.
Turkey and Netherlands. Turkey and the Netherlands, which have similar penetration rates, are also an interesting example illustrating how cultural and economic differences may influence coverage. Although both have a high highway coverage, the Netherlands has a railway coverage three times higher than Turkey. Different economic levels of train and car travelers in Turkey could be, for instance, an explanation for this.
Belgium and Norway. For countries having similar penetration rates, the higher the proportion of tweets in transport, the better the coverage. However, some exceptions exist. For example, Norway has a higher proportion of tweets in transport than Belgium but, conversely, Belgium has a highway coverage three times higher than Norway. Given the very extensive highway system of Norway, some of the segments, especially in the north, can have very low traffic, which could be the origin of this difference.
| Pair of countries | Difference between the percentage of rail passenger-kilometers | Difference between the railway coverages |
In general, the distribution of tweets according to the transport network is also very different from country to country (Figure 8b), and also from region to region. For example, countries from North and Central Europe have a higher proportion of tweets on the road relative to tweets on the rail than other European countries. This is probably due to differences in transport mode preference among European countries. To check this assumption, we studied the distribution of rail passenger-kilometers in Eurostat () according to the proportion of tweets on the rail. Figure 9 shows box plots of the distribution of rail passenger-kilometers, expressed as a percentage of total inland passenger-kilometers, according to the proportion of tweets on the rail among the tweets on the road and rail. Overall, the number of rail passenger-kilometers is lower for countries having a low proportion of tweets on the rail, which is consistent with our assumption.
In the same way, the distribution of rail passenger-kilometers in can be used to understand why two countries having the same highway coverage might have very different railway coverages. For example, Switzerland and Estonia have the same highway coverage, with about of road segments covered by the tweets, but their railway coverages are very different, with about of rail segments covered in Switzerland and in Estonia. This can be explained by the fact that in Switzerland trains accounted for of all inland passenger-kilometers in (which was the highest value among European countries in that year), whereas in Estonia trains accounted for of all inland passenger-kilometers (one of the lowest in Europe). More systematically, for each pair of countries having similar highway coverages, we compared the difference between railway coverages and the difference between the percentages of rail passenger-kilometers. First, pairs of countries having a highway coverage higher than and an absolute difference between their highway coverages lower than are selected. Thus, we have selected pairs of countries with a similar highway coverage. Table 1 displays the difference between the percentages of rail passenger-kilometers and the difference between the railway coverages for these pairs of countries. In out of cases, the differences have the same sign. This fact points towards a possible correlation between traffic levels and tweet coverage.
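The pairwise comparison described above can be expressed compactly. The sketch below is only illustrative: the two thresholds are left as parameters because their values are not reproduced here, and the country labels and figures are invented.

```python
from itertools import combinations

def similar_highway_pairs(countries, min_cov, max_diff):
    """Pairs of countries whose highway coverages both exceed min_cov
    and differ by less than max_diff (all values in percent)."""
    return [
        (a, b)
        for a, b in combinations(sorted(countries), 2)
        if countries[a]["highway"] > min_cov
        and countries[b]["highway"] > min_cov
        and abs(countries[a]["highway"] - countries[b]["highway"]) < max_diff
    ]

def same_sign_count(countries, pairs):
    """Number of pairs for which the difference in rail passenger-kilometer share
    and the difference in railway coverage have the same sign."""
    agree = 0
    for a, b in pairs:
        d_rail = countries[a]["rail_pkm"] - countries[b]["rail_pkm"]
        d_cov = countries[a]["railway"] - countries[b]["railway"]
        if d_rail * d_cov > 0:
            agree += 1
    return agree

# Hypothetical countries: highway coverage, railway coverage and
# rail share of inland passenger-kilometers, all in percent.
countries = {
    "A": {"highway": 60.0, "railway": 40.0, "rail_pkm": 15.0},
    "B": {"highway": 62.0, "railway": 10.0, "rail_pkm": 3.0},
    "C": {"highway": 61.0, "railway": 20.0, "rail_pkm": 18.0},
    "D": {"highway": 20.0, "railway": 5.0, "rail_pkm": 2.0},
}

# Threshold values are arbitrary placeholders.
pairs = similar_highway_pairs(countries, min_cov=50.0, max_diff=5.0)
agree = same_sign_count(countries, pairs)
```

Pairs with a zero difference in either quantity are counted as disagreeing in this sketch; whether such ties occur in the real data, and how they were treated, is not stated in the text.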
iii.3 Average Annual Daily Traffic
To assess more quantitatively this hypothetical relation between the number of vehicles and the number of tweets on the road, we compared the number of tweets and the Average Annual Daily Traffic (AADT) on the highways in the United Kingdom in UK () and in France in FR (). The AADT is the total vehicle traffic of a highway divided by days. The number of highway segments for which the AADT was gathered is in the UK and in France. The average length of these segments is kilometers in the UK and in France. As in the previous analysis, the number of tweets associated with a segment was computed by identifying all the tweets geo-located less than meters away from the segment.
Figures 10a and 10c show a comparison between the AADT and the number of tweets on the road for both case studies. There is a positive correlation between the AADT and the number of tweets on the road, but the Pearson correlation coefficient values are low, around for the France case study and around for the UK case study. This can be explained by the large number of highway segments having a high AADT but a very low number of tweets. To understand the origin of this disagreement between tweets and traffic, we have divided the segments into two groups: those having a high AADT and a very low number of tweets (red points) and the rest (blue points). These two types of segments have been separated using the black lines in Figures 10a and 10c. Figures 10b and 10d show the box plots of the highway segment length in kilometers according to the segment type for both case studies. It is interesting to note that the segments having a high AADT and a low number of tweets are overall shorter than those of the other group. Indeed, according to the Welch two sample t-test Welch1951 (), the average segment length of the first group ( km in France and in UK) is significantly lower than that of the second group ( km in France and in UK). Given a similar speed, one can assume that the shorter the road segment is, the less time people have to post a tweet. Other factors that may influence this result are the nature of the segments, rural vs urban, and the congestion levels, which can significantly alter the time spent by travelers in the different segments.
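The two statistics used in this comparison can be computed with the standard library alone. The sketch below gives the Pearson correlation coefficient and the Welch t statistic together with its Welch-Satterthwaite degrees of freedom; it does not compute a p-value, which requires the Student t distribution (for instance via `scipy.stats.ttest_ind` with `equal_var=False`). All data values are invented for illustration.

```python
import math
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two samples with possibly unequal variances."""
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1))
    return t, df

# Hypothetical AADT values (vehicles per day) and tweet counts for four segments.
aadt = [10_000, 20_000, 30_000, 40_000]
tweets = [5, 9, 16, 20]
r = pearson(aadt, tweets)

# Hypothetical segment lengths (km): high-AADT/low-tweet group vs. the rest.
short_group = [1.0, 2.0, 3.0]
other_group = [4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(short_group, other_group)
```

A negative t statistic here indicates that the first group has the smaller mean length, which is the direction of the effect reported in the text.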
In this work, we have investigated the use of Twitter in transport networks in Europe. To do so, we have extracted, from a Twitter database containing more than million geo-located tweets, those posted from the highway and railway networks of European countries. First, we show that the countries have different penetration rates for geo-located tweets, with no clear dependence on the economic performance of the country. Our results also show no clear difference between countries in terms of the topological features of the Twitter social network. Dividing the highway and railway systems into segments, we have also studied the coverage of the territory with geo-located tweets. European countries can be ranked according to their highway and railway coverage, which differs widely from country to country. Although some of this disparity can be explained by differences in penetration rate or by the use of different transport modalities, a large dispersion in the data still persists. Part of it could be due to cultural differences among European countries regarding the use of geo-located tools. Finally, we explore whether Twitter can be used as a proxy to measure traffic on highways by comparing the number of tweets and the Average Annual Daily Traffic (AADT) on the highways in the United Kingdom and France. We observe a positive correlation between the number of tweets and the AADT. However, the quality of this relationship is reduced by the short length of some AADT highway segments. We conclude that the number of tweets on the road (train) can be used as a valuable proxy to analyze the preferred transport modality as well as to study traffic congestion, provided that the segments are long enough to obtain significant statistics.
Partial financial support has been received from the Spanish Ministry of Economy (MINECO) and FEDER (EU) under projects MODASS (FIS2011-24785) and INTENSE@COSYP (FIS2012-30634), and from the EU Commission through projects EUNOIA, LASAGNE and INSIGHT. ML acknowledges funding from the Conselleria d’Educació, Cultura i Universitats of the Government of the Balearic Islands and JJR from the Ramón y Cajal program of MINECO.
- (1) Watts DJ (2007) A twenty-first century science. Nature 445: 489.
- (2) Vespignani A (2009) Predicting the Behavior of Techno-Social Systems. Science 325: 425–428.
- (3) Onnela J, Saramaki J, Hyvonen J, Szabo G, Lazer D, et al. (2007) Structure and tie strengths in mobile communication networks. Proc Natl Acad Sci USA 104: 7332–7336.
- (4) Eagle N, Pentland AS, Lazer D (2009) Inferring friendship network structure by using mobile phone data. Proc Natl Acad Sci USA 106: 15274–15278.
- (5) Gonzalez MC, Hidalgo CA, Barabasi AL (2008) Understanding individual human mobility patterns. Nature 453: 779–782.
- (6) Song C, Qu Z, Blumm N, Barabási AL (2010) Limits of Predictability in Human Mobility. Science 327: 1018–1021.
- (7) Phithakkitnukoon S, Smoreda Z, Olivier P (2012) Socio-geography of human mobility: A study using longitudinal mobile phone data. PLoS ONE 7: e39253.
- (8) Wang P, González M, Hidalgo C, Barabási AL (2009) Understanding the spreading patterns of mobile phone viruses. Science 324: 1071–1076.
- (9) Ratti C, Pulselli RM, Williams S, Frenchman D (2006) Mobile landscapes: using location data from cell phones for urban analysis. Environment and Planning B: Planning and Design 33: 727–748.
- (10) Reades J, Calabrese F, Sevtsuk A, Ratti C (2007) Cellular census: Explorations in urban data collection. Pervasive Computing, IEEE 6: 30–38.
- (11) Soto V, Frías-Martínez E (2011) Automated land use identification using cell-phone records. In: Proceedings of the 3rd ACM international workshop on MobiArch. New York, NY, USA: ACM, HotPlanet ’11, pp. 17–22. DOI: 10.1145/2000172.2000179. http://doi.acm.org/10.1145/2000172.2000179.
- (12) Frías-Martínez V, Soto V, Hohwald H, Frías-Martínez E (2012) Characterizing urban landscapes using geolocated tweets. In: SocialCom/PASSAT. IEEE, pp. 239–248.
- (13) Isaacman S, Becker R, Cáceres R, Martonosi M, Rowland J, et al. (2012) Human mobility modeling at metropolitan scales. In: Proceedings of the International Conference on Mobile Systems, Applications, and Services (MobiSys). ACM, pp. 239–252. DOI: 10.1145/2307636.2307659. http://dx.doi.org/10.1145/2307636.2307659.
- (14) Toole J, Ulm M, González M, Bauer D (2014) Inferring land use from mobile phone activity. Proceedings of the ACM SIGKDD International Workshop on Urban Computing pp 1–8.
- (15) Pei T, Sobolevsky S, Ratti C, Shaw SL, Zhou C (2013) A new insight into land use classification based on aggregated mobile phone data. ArXiv e-print arXiv:1310.6129.
- (16) Louail T, Lenormand M, Garcia Cantú O, Picornell M, Herranz R, et al. (2014) From mobile phone data to the spatial structure of cities. ArXiv e-print arXiv:1401.4540.
- (17) Hasan S, Schneider CM, Ukkusuri SV, González MC (2012) Spatiotemporal patterns of urban human mobility. Journal of Statistical Physics 151: 1–15.
- (18) Gallotti R, Bazzani A, Rambaldi S (2012) Towards a statistical physics of human mobility. International Journal of Modern Physics 23: 1250061.
- (19) Furletti B, Cintia P, Renso C, Spinsanti L (2013) Inferring human activities from GPS tracks. In: Proceedings of the 2nd ACM SIGKDD International Workshop on Urban Computing. New York, NY, USA: ACM, UrbComp ’13, pp. 5:1–5:8.
- (20) Mocanu D, Baronchelli A, Perra N, Gonçalves B, Zhang Q, et al. (2013) The Twitter of Babel: Mapping World Languages through Microblogging Platforms. PLoS ONE 8: e61981.
- (21) González-Bailón S, Borge-Holthoefer J, Rivero A, Moreno Y (2011) The dynamics of protest recruitment through an online network. Scientific Reports 1: 197.
- (22) Hawelka B, Sitko I, Beinat E, Sobolevsky S, Kazakopoulos P, et al. (2013) Geo-located twitter as a proxy for global mobility patterns. ArXiv e-print arXiv:1311.0680.
- (23) Lenormand M, Picornell M, Garcia Cantú O, Tugores A, Louail T, et al. (2014) Cross-checking different source of mobility information. ArXiv e-print arXiv:1404.0333.
- (24) Noulas A, Scellato S, Lambiotte R, Pontil M, Mascolo C (2012) A tale of many cities: Universal patterns in human urban mobility. PLoS ONE 7: e37027.
- (25) Twitter API, section for developers of Twitter Web page, https://dev.twitter.com.
- (26) OpenStreetMap (http://www.openstreetmap.org).
- (27) GDP per capita in by the International Monetary Fund (http://www.imf.org/external/pubs/ft/weo/2012/01/weodata/weoselgr.aspx).
- (28) Grabowicz PA, Ramasco JJ, Moro E, Pujol JM, Eguiluz VM (2012) Social Features of Online Networks: The Strength of Intermediary Ties in Online Social Media. PLoS ONE 7: e29358.
- (29) Source: Eurostat (http://epp.eurostat.ec.europa.eu/statistics_explained/index.php/Passenger_transport_statistics).
- (30) Data available online at http://www.dft.gov.uk/traffic-counts/.
- (31) These data are available upon request at Service studies on transport, roads and facilities (setra) (http://dtrf.setra.fr/).
- (32) Welch BL (1951) On the comparison of several mean values: An alternative approach. Biometrika 38: 330–336.