From Twitter to Traffic Predictor: Next-Day Morning Traffic Prediction Using Social Media Data

From Twitter to Traffic Predictor: Next-Day Morning Traffic Prediction Using Social Media Data


The effectiveness of traditional traffic prediction methods, such as autoregressive or spatio-temporal models, is often extremely limited when forecasting traffic dynamics in early morning. The reason is that traffic can break down drastically during the early morning commute, and the time and duration of this break-down vary substantially from day to day. Early morning traffic forecast is crucial to inform morning-commute traffic management, but they are generally challenging to predict in advance, particularly by midnight (called ‘next-day morning traffic prediction’ thereafter). In this paper, we propose to mine Twitter messages as a probing method to understand the impacts of people’s work and rest patterns in the evening/midnight of the previous day to the next-day morning traffic. The model is tested on freeway networks in Pittsburgh as experiments. The resulting relationship is surprisingly simple and powerful. We find that, in general, the earlier people rest as indicated from Tweets, the more congested roads will be in the next morning. The occurrence of big events in the evening before, represented by higher or lower tweet sentiment than normal, often implies lower travel demand in the next morning than normal days. Besides, people’s tweeting activities in the night before and early morning (by 5 am) are statistically associated with congestion in morning peak hours. We make use of such relationships to build a predictive framework which forecasts morning commute congestion using people’s tweeting profiles extracted by 5 am. In most cases, the tweet information collected by the midnight before is sufficient to make good prediction for next-day morning traffic. The Pittsburgh study supports that this framework can precisely predict morning congestion, particularly for some road segments upstream of roadway bottlenecks with large day-to-day congestion variation, while its prediction performance being no worse than baseline methods on other roads. Through experiments, we demonstrate our approach considerably outperforms those existing methods without Twitter message features, and it can learn meaningful representation of demand from tweeting profiles that offer managerial insights. The proposed social media empowered framework can be a promising tool for real-time traffic management and potentially extended for traffic prediction at other times of day.

Traffic prediction; Long-term prediction; Social media; Tweet; Sentiment analysis; Clustering

\ul \biboptionsauthoryear

1 Introduction

Existing traffic prediction methods are often of limited use to early morning commuters. According to American Community Survey (2011-2015) by U.S. Census Bureau (2015), 13% of the population nationwide were reported to leave home for work before 6 am to avoid the worst commute times, and 4.4 % were even out the door by 5 am. Congestion on roadways is known be extremely sensitive to travel demand. On each day, the departure time and volumes of those early morning commuters could cause traffic break down at various times of day, which has ripple effects on the following morning commuters. From day to day, morning commute would result in completely different congestion patterns on some road segments (as will be seen in the examples later). Therefore, a model capable of forecasting traffic during the entire morning commute (e.g. 6-10 am) for those road segments (particularly those primary ones) before early morning (e.g. by 5 am), would be mostly desired. Only with the ahead-of-curve prediction on when, where roads will start to congest and how long it would last, traffic managers can proactively engage information dissemination and various operational strategies.

Unfortunately, characteristics of free-flow traffic prevalent in transportation networks before the early morning is uninformative and do not necessarily represent congestion patterns during the morning rush hours. As illustrated in Fig. 1(a), autoregression-based method (AR) with a lag of 5 or 10 minutes is often unable to pick up the traffic break-down using only the time-series traffic data up to 5 am, except for a small fraction of days when heavy delay appeared already earlier than 5 am. Also, historical information does not seem to help due to the high variance of congestion patterns from day to day as shown in Fig. 1(b). Consequently, traditional traffic prediction models, which primarily rely on correlations between future and real-time traffic speed data, could potentially fail for such a next-day morning traffic prediction setting. This limitation by the lack of explanatory data, however, can be best tackled by employing data sources implying broader impacts on morning rush hour traffic. Our overarching idea is to use general data implying population general activities during the night before and early morning by 5 am to predict traffic patterns of the entire morning commute period, called ‘next-day morning traffic prediction’ thereafter.

(a) Autoregression-based (AR) method.
(b) History-based method.
Figure 1: Actual and predicted travel time of a representative highway road segment in Pittsburgh (ID:104-04531): time series data prior to 5 am is generally uninformative

Following traffic flow theory, traffic congestion occurs when travel demand exceeds available link supply (Sheffi, 1985). While supply-side descriptors, such as weather or traffic events (e.g., road construction, bridge precaution, etc.) can be easily collected, daily travel demand data is almost inaccessible on a daily basis. Historical travel surveys cannot account for the day-to-day variance of travel demand. Most crowdsourcing techniques (e.g., wearable sensors, mobile phones, etc.) which are capable of tracking and tracing individual activities before early morning (Harrison et al., 2020) are often insufficient for sensing the entire metropolitan area, nor do they offer predictive insights on congestion patterns. Questions arise: (1) is there a pervasive computing tool for collecting population’s night and early morning general activities in regional networks that would strongly imply travel characteristics in the morning commute? and (2) can we mine and translate the collected data into morning traffic predictors by 5 am on each day?

Sleep-wake status (sleep) “Good night [Sleeping face] I’m off to bed, but before I go, remember I love you.”
Sleep-wake status (wake) “Got up at 4am to take some friends to the airport. Def bout to take a nap”
Local event
“Not a good night for the @Pirates but it was great to spend some time near PNC Park
  with @TheBuccosFan @mryannagy @ANTOINETTE180 @padavies3.”
(Planned) traffic incident
“Short-term lane restrictions on Route 22 in Robinson Township, will begin on Friday,
  July 31 through August 21 between the 7AM to 5PM. Both east and westbound
  directions between the I-576 and the Bavington exits will be restricted.”
Table 1: Tweet examples capturing three types of information useful for next-day morning traffic.

Recently, the rise of social media provides new opportunities to understand the relationship between people’s daily activities and urban systems. User-generated contents, together with account profiles, posting time, and locations can be readily available in real-time with Internet access. Tweets posted in the night and early morning may imply people’s activity patterns, such as sleep-wake statuses or travel plans on the next day. Some demonstrative example tweets are listed in Table 1.

This paper presents tweet2traffic, a new class of social computing models for learning meaningful representation of morning travel demand directly from Twitter messages and using it to improve next-day morning traffic prediction in transportation networks. The proposed models integrate state-of-the-art language representation (e.g. neural language model, sentiment analysis) and geoprocessing techniques to map each tweet information into a high dimension feature space, and extracts daily tweeting profiles through aggregation by space and time. Specifically, we argue that tweets capture three types of useful information for explaining next-day morning traffic, which includes people’s sleep-wake status, local events, and (planned) traffic incidents showcased in Table 1. Considering the sparsity of geocoded tweets in the night before and early morning, we propose a novel social media data augmentation method which first filters influential users (i.e., residents) from noisy users, such as tourists, infers their approximate home locations, and augments the dataset with user timelines such as non-geocoded tweets, retweets, and favorites by assuming residents stay at home during the night before. A sentiment analysis model with a neural language model as a backend is constructed to detect abnormal local events, and incident records are extracted from traffic-related tweets.

Using freeway networks in the Pittsburgh metropolitan area as a case study, the proposed models encode the supply-side variables (traffic incidents, weather, etc.) and demand-side variables (tweet profiles, temporal variables, etc.) through a clustered learning structure which makes use of ordered spatio-temporal congestion patterns of segments along a road. Through experiments, we demonstrate that our approach outperforms existing methods without social media features for morning traffic prediction in terms of congestion starting time, duration, and planning time index prediction errors, and can learn meaningful representations of travel demand from tweeting profiles. Generally, we discover that the earlier people go to sleep, the more congested the road will be in the next morning. Tweeting activities in the early morning by 5 am are positively associated with congestion in the morning peak hours. Also, we find that the occurrence of big events in the evening before, represented by higher or lower tweet sentiment than normal, often results in lower travel demand in the next morning than normal weekdays. We illustrate through ablation studies that social media data, traffic events, and weather conditions are important data sources for next-day morning traffic prediction and their roles vary by space and time. We find that tweet related features can largely improve traffic prediction for road segments near the edge of spillback segments of a traffic bottleneck, where travel demand plays a crucial role in determining congestion. Besides, the implications of social media are more significant when special events such as sports games occur during the night, which possibly reduces or puts off morning travel demand. For traffic incidents and weather conditions, their impacts are obvious on non-recurrent days with roadworks or adverse weather conditions. Traffic incidents downstream to road segments generally shift congestion starting time while upstream incidents reduce congestion duration. Adverse weather conditions such as snow, rain, and floods often make roads more congested by bringing forward congestion starting time and increasing congestion duration and scales. Finally, our approach proves insensitive to the increase of forecasting horizons even before midnight and outperforms benchmarks using the same length of traffic or tweet data without the clustered structure.

1.1 Related work

Transportation research and social media Transport operators have recently incorporated social media into their situation awareness strategies (Cottrill et al., 2017; Rashidi et al., 2017). Promising transportation applications on social media can be categorized into traffic related incident detection (Zhang et al., 2018; Suma et al., 2017; Gu et al., 2016; D’Andrea et al., 2015; Schulz et al., 2013), social event detection (Khare et al., 2019), trip purpose (Cui et al., 2018; Gkiotsalitis and Stathopoulos, 2015) and travel demand modeling (Markou et al., 2019; Berlingerio et al., 2017; Hu and Jin, 2017), human mobility exploration (Zhang et al., 2017; Zhao and Zhang, 2017; Hasan and Ukkusuri, 2014), travel information retrieval (Kuflik et al., 2017) and so on. A detailed review of recent work can be found in (Zhang and He, 2019). Another branch of research used geotagged social media as traveler reviews and explored relations between public opinions, represented by tweet check-in types or counts and content, with infrastructure design (Huang et al., 2017; Ali et al., 2017). On the contrary, little has been done using the information from social media to directly inform real-time traffic prediction. Steiger et al. (2016) explored the spatiotemporal relationship between mobility patterns with traffic-related tweets using self-organizing maps. The strong correlations between spatiotemporal tweet clusters with proximity to special events, traffic incidents, and hazard reports show that social media can serve as a proxy indicator of collective mobility events and predict short-term traffic during unplanned events. Zhang et al. (2016) studied the correlation of traffic-related tweets with traffic surges within certain spatial and temporal ranges around posting coordinates and timestamps. Ni et al. (2016) used the counts of tweet users and event-related tweets to predict passenger flow at a subway station under event conditions. Lin et al. (2015) extracted adverse weather features from tweets and used it for traffic speed prediction. However, prior work either examined near-term or lagged effects of social media on traffic or focused on specific events, improving long-term prediction of day-to-day traffic with social media was not explored. The closest work to this is the work of (He et al., 2013), which used tweet posting activity and bag-of-words semantic features to enhance the prediction of traffic volumes averaged over the next 1,2,12 and 24 hours. Although one of their forecasting horizons (12 h) is similar to our study, the time resolution of predicted traffic is insufficient for morning commuters. The spatial information of tweets and user profiles were not used. When, where, and how can social media patterns and semantics during the night and early morning improve next-day morning traffic prediction were also unexamined.

Social media analytics Social media departs from other GPS-enabled crowdsourcing techniques in two ways: (1) social media data only include geographical positions at sparse points in time and the pathway between these points is unknown, and (2) user-generated contents and account profiles come with each data point. Therefore, state-of-the-art social media analytics focuses on mining spatiotemporal patterns from check-in coordinates and opinions embedded in contents. Clustering methods such as K-means (Kanungo et al., 2002), Gaussian Mixture Model or GMM (Banfield and Raftery, 1993), Density-Based Spatial Clustering of Applications with Noise or DBSCAN (Ester et al., 1996) and Mean-Shift (Comaniciu and Meer, 2002) are the most popular choices for identifying spatial clusters from tweet posting coordinates and exploring human mobility patterns (França et al., 2016; Wu et al., 2014; Hasan et al., 2013). Recent literature (Lin and Cromley, 2018; Huang et al., 2016) successfully inferred user home locations from just a few geocoded tweets without supervision. Our study is built upon DBSCAN and features (e.g., check-in rates) in these works for locating user home locations. For information retrieval from social media contents, keyword-based approach is mostly used, where a domain-specific dictionary is built beforehand and can be expanded later through the Apriori algorithm (Agarwal et al., 1994). Sentiment analysis derives people’s opinions and sentiment from a piece of writing, classifying it as is positive, negative or neutral (Liu, 2012). Commonly-used methods include lexicon-based (Bakshi et al., 2016; Hu and Liu, 2004) and machine learning approaches (Agarwal et al., 2011). Topic modeling can embed a tweet into a vector of topic features where each topic is represented by relevant word distributions. Popular topic modeling approaches are Latent Dirichlet Allocation or LDA (Blei et al., 2003) and Nonnegative Matrix Factorization or NMF (Shahnaz et al., 2006). Recently, deep neural language models (LM) via self-supervision have produced several pre-trained models for general-purpose language understanding, which include BERT (Devlin et al., 2019), OpenAI GPT-3 (Brown et al., 2020) and so on. Neural language models project a document of variable length into a high-dimensional vector while keeping semantics and local structure of the document. Our work is built upon BERT to extract sentiment from tweets for detecting local events.

1.2 The contribution of this study

Our paper is differentiated from prior work in four ways:

  1. We propose a novel machine learning model to use social media data to directly improve the morning traffic prediction on daily basis. This problem is fundamentally different from prior work that can be classified in the following three types: usage of social media for traffic incident detection or travel demand modeling, the examination of near-term effects of social media on traffic, or studying specific events;

  2. Three types of traffic-related information including people’s sleep-wake status, local events, and (planned) traffic incidents are defined and extracted from tweets. A novel social media data augmentation method that tackles the sparsity of geocoded tweets at night and early morning is proposed to supplement the dataset by geotagging non-geocoded tweets during this period with inferred user home locations. Those pieces of information can improve the prediction accuracy tremendously;

  3. Meaningful travel demand representation from tweeting profiles is extracted through learning this next-day morning traffic prediction. This is a data driven approach different from most existing works constructing complex and hypothetical steps to approximate travel demand;

  4. Our study illustrates the different roles of social media, traffic incidents, and weather conditions in traffic prediction by space and time. When, where and how the incorporation of multi-source data would improve traffic prediction offer implementation guidelines to transportation operators and ’interprets” the black box of machine learning.

2 Dataset

The main objective of this paper is to use tweets posted during the night before and early morning to improve the prediction of morning traffic congestion. This section describes the data sources used in the paper: (1) traffic datasets, which are comprised of probe-sourced traffic speed data from INRIX Traffic2 and Road Condition Report System (RCRS) incident data3 maintained by the Pennsylvania Department of Transportation; (2) weather datasets from Weather Underground4; and (3) Twitter streaming datasets from the free Twitter Streaming API5. Data within the City of Pittsburgh, Pennsylvania from Jan 23, 2014 to Dec 31, 2014 were collected.

Figure 2: Datasets used in this paper.

2.1 INRIX speed dataset

Historical traffic speed data were obtained from INRIX which cover major US highways in Pittsburgh metropolitan. The traffic speed data were georeferenced by Traffic Message Channel (TMC) coding. The raw datasets spanned 4,254 TMCs in the City of Pittsburgh. Since this paper focuses on morning traffic congestion, only 53 major highway TMC segments, where congestion (defined as the travel time index ) is observed in the morning periods (from 5 am to 11 am) at least once in the analysis period, are considered. The simplified morning transportation network with 53 TMCs along I-279 Southbound (I-279 S), PA-28 Southbound (PA-28 S), I-376 Westbound (I-376 W), and I-376 Eastbound (I-376 E) are shown in Fig. 2. All selected TMCs, not surprisingly, are heading from suburban areas to downtown Pittsburgh. The road speed data were reported every 5 minutes. Each record includes TMC code, timestamp, observed speed (mph), average speed (mph), reference speed (mph) and two parameters for the confidence of the speed, namely confidence score and confidence value.

2.2 PennDOT RCRS incident dataset

RCRS data feeds provide real-time information for traffic incidents, roadwork, winter road conditions, and other events that cover all state-owned roads. RCRS incident records include the incident location, road closure and open time stamps, as well as the category of the incident. In 2014, RCRS reported 2,696 traffic incidents in Pittsburgh. Fig. 2 shows the incidents in our simplified network by location and category. It is found that most incidents are roadwork, which can be assumed to be planned ahead of time.

2.3 Weather underground dataset

Weather underground reported hourly weather measurements. Each entry contains temperature, pressure, dew point, humidity, wind speed, precipitation, pavement condition, and visibility, etc. Weather data collected in Pittsburgh International Airport were used, and the location of weather station is shown in Fig. 2.

2.4 Twitter dataset

We construct our Twitter streaming dataset by collecting all geocoded and non-geocoded tweets posted within the bounding box (-80.20, 40.29; -79.80, 40.62) visualized in Fig. 2. 1,782,636 tweets from Jan 23, 2014 to Dec 31, 2014 were collected. This is done by first inquiring Twitter Streaming APIs using this bounding box to retrieve 1,349,179 geocoded tweets from 43,670 users, and then furthermore collecting user timeline tweets posted by those 43,670 users through Twint6, a Twitter web scraper. Those users were used to retrieve additional non-geocoded tweets and favorites, adding up to all 1,782,636 raw tweets including both geocoded and non-geocoded. Of those raw tweets, 672,527 (37.72%) tweets posted by those 43,670 users inquired from Twitter Streaming APIs are tagged with accurate locations. The Twitter data include date/time, text, user ID, language, latitude and longitude (if available), user profile location, etc. Also, 2014 US Census Tract Cartographic Boundary Shapefiles are used to join geocoded tweets with the geographical information.

3 Descriptive analysis: feasibility of using tweets for next-day traffic prediction

We first conduct a study of the correlation between the Twitter data from late night to 5 am and the morning traffic data for the roads in our simplified transportation network. This can be seen as a proof of concept to show the feasibility of our study. To examine the correlation in a straightforward way, we characterize the common patterns of morning traffic congestion and tweeting activities using cluster analysis. The relationship between traffic and tweeting patterns are visualized and tested by chi-squared statistics.

3.1 Clustering of morning traffic congestion

We follow the steps illustrated in Fig. 3 to identify typical morning congestion patterns of a road. First, we transform the raw speed data into Travel Time Index () to measure congestion on TMCs. The Travel Time Index is defined as the ratio of travel time in the peak period to the travel time at free-flow conditions (FHWA, 2019). Thus, on a road segment at time index can be computed by Eq. 1, as the ratio of the free-flow (reference) traffic speed to the observed speed on segment . To determine the reference (free-flow) speed of a TMC , the 85 percentile of observed speed on that segment for all time periods (Eq. 2) is used, which is the recommended approach for computing reference speed from probe-based speed data (Jha et al., 2018).

Figure 3: Speed processing steps for clustering morning congestion.

Then, we generate daily congestion profiles of a road, by concatenation of from 5 am to 11 am, for all TMCs on that road. A daily profile for road becomes a vector of , where denotes the number of TMCs on that road and is the number of sampling points of morning periods (in 5-min time intervals). In this study, we adopt principle component analysis (PCA) to address the sparseness and multicollinearity of high-dimension spatiotemporal congestion profiles. To remove noise while keeping most information, the first components that capture 90% of the variance are kept. Finally, clustering analysis, using K-means on the reduced -dimension matrix, is conducted to identify typical daily morning congestion patterns for each road in the network. Optimal cluster size is selected by the elbow method.

The morning road congestion profiles show ordered spatio-temporal patterns, as presented in Fig. 4 where the clusters are ordered from top to bottom by average of the cluster centroid. Because of spill-back effects, severe congestion with greater usually has more spatial and temporal impacts. Surprisingly, we also find that congestion profiles with greater are often coupled with earlier congestion starting time on the road. This might suggest that how early commuters hit the road have some impacts on morning congestion patterns. We intend to use social media data the night before to directly learn about the early-morning demand behavior so as to better interpret/predict the morning congestion.

Figure 4: Road congestion cluster centers ordered by average travel time index (TTI).

3.2 Characterization of tweeting activities during the night before and early morning

One intuition for using tweets from late night to early morning to predict morning traffic is that tweeting activities can imply daily sleep-wake related activities in the regional network and might be used to probe how early and how many commuters hit the road, which consequently, as illustrated in Fig. 4, impacts morning congestion patterns. An example would be if some users tweet late at 2 am, they cannot or don’t have the plan to get up early and add to morning traffic.

This section characterizes such sleep-wake patterns captured by tweets with cluster analysis. For simplicity, we remove spatial information of the tweets, i.e., all tweets within the bounding box are seen as posted from the same Pittsburgh region. To build tweeting profiles of each day, tweets posted from 6 pm to 5 am are first selected, and their counts are grouped by day and (half) hour index. To remove noise, a moving average smoother of a two-hour window is applied. Then, we normalize the tweeting activity curve by its total number, so that each daily profile sums up to one. A daily tweeting profile becomes a vector of . Finally, similar K-means clustering setups in Section 3.1 are used to derive typical sleep-wake patterns captured by tweets.

Figure 5: Typical late night and early morning tweeting profiles.

Four representative tweeting activity clusters are identified from the data. As shown in Fig. 5, while most tweets were posted between 8:30 and 10:30 pm, when the peak is reached and how tweeting activities declined after the peak vary substantially among clusters, and from day to day. To illustrate this, both clusters 1 and 2 show intense tweeting activities in midnight but cluster 2 implies steady decrease in activities till early morning, which differs from a dramatic decline to low tweeting activities immediately after 10:30 pm in cluster s1, 3 and 4. Clusters 3 and 4 show intense tweeting activities in the evening and late night, but cluster 4 reaches a peak earlier before 9:00 pm, and its activities are more concentrated around the peak.

3.3 Correlation analysis

We visualize and analyze the correlation by conditioning traffic congestion cluster distribution on tweeting activity patterns. Fig. 6 implies a surprisingly simple relationship, that the earlier tweeting activity profile achieves the peak during the night (tweeting peak: clusters ), the more likely severe congestion (traffic clusters: or ) occurs in the next-day morning. The intuition is that average sleep time of the population can be inferred from tweeting activity peaks (since apparently a person can’t tweet when they are asleep). As a result, because the earlier a person goes to sleep, the more likely they get up in the morning early and need to make the commuting trip, both of which can increase morning traffic congestion. Hence, tweeting activities are likely to be transformed into a probing measurement of morning travel demand. It is also observed that the more concentrated the tweeting activity is (cluster 4), the more likely congestion is to occur. This also suggests the feasibility of using Twitter messages as a representation of morning travel demand.

Figure 6: Relationships between tweeting and morning traffic patterns.

We perform chi-squared test and compute Cramer’s V for all four roads and present the results in Table 2. The chi-squared test indicates characteristic relationships between tweeting and traffic clusters with . Note that for simplicity of showing the relationship, we used discrete cluster representation and removed spatial and semantics information of tweets. While the Cramer’s V is not high, the results have proved that tweeting activities explain some variances of next-day morning traffic. It is also important to note that the night period of tweeting activities before midnight is the key factor for explaining morning traffic.

By far, we show the great potential of mining Twitter messages for predicting next-day morning traffic. The moderate correlation calls for a model that effectively makes use of spatial and semantics information of tweets, along with supply-side features, such as weather and (planned) traffic incidents.

Road name Number of traffic against tweet clusters statistic p-value Cramer’s V
I-279 S 4 – 4 106.291 <.001 0.330
I-376 W 4 – 4 110.003 <.001 0.337
I-376 E 4 – 4 92.744 <.001 0.307
PA-28 S 3 – 4 67.362 <.001 0.321
Table 2: Chi-squared test and Cramer’s V between traffic and tweeting clusters. Note: (1) Total sample size is 300 for all four roads, and (2) Cramers’s V between 0.3 and 0.5 indicates moderate association.

4 Method

Our proposed tweet2traffic predicts morning traffic, which is characterized by congestion status (S), congestion starting time (CST), congestion duration (CD) and planning time index (PTI) for each of TMC segments in the regional network, using multi-source data provided before early morning. We first describe the data processing steps. The proposed model architecture is then presented. Finally, we introduce baseline models for benchmarking the prediction performances and model variants for showing the effectiveness of our design through ablation study.

4.1 Data processing

Data processing steps for model outputs and inputs, which include traffic speed, social media, traffic incidents, and weather and temporal variables are introduced in this section.

Morning traffic congestion output

The morning period is defined as from 05:00 to 11:00 am, which consists of 72 data points of TMC speed measured every 5 minutes. Instead of predicting the entire morning speed time-series, we propose to describe morning traffic with a quadruple of variables , including congestion status (S), congestion starting time (CST), congestion duration (CD) and planning time index (PTI). We argue that this model output representation is straightforward to predict and is sufficient as providing ahead-of-curve travel information tp morning commuters.

  • Congestion status: a road segment is defined as congested at time on day (), if the observed travel time index , computed by Eq. 1. is greater than , i.e., , for at least minutes, as defined in Eq. 3. A segment is defined as congested on day , i.e., , if it is congested at any time during the morning period.


    The term “accuracy” is used thereafter to indicate the percentage of binary congestion status that is predicted correctly.

  • Congestion starting time: denotes the starting point of the congested period, and if multiple congested periods exist, CST is the starting point of the earliest period. To be consistent, we reverse the time indices for computing CST, so that the earlier congestion occurs, the larger CST values are. For example, if congestion starts at 5 am and if no congestion occurs during the morning.

  • Congestion duration: measures the length of congestion and is thus defined as the interval between the first congestion starting point and the last congestion ending point.

  • Planning time index: is computed as 95th percentile travel time during morning periods divided by free-flow travel time (Lyman and Bertini, 2008). It is a travel time reliability measure which represents how much total time a traveler should allow to ensure on-time arrival 95 percent of the time.

Tweet processing pipeline

We develop a social media processing pipeline to augment, clean, geocode and encode noisy Twitter messages into tweeting encoded feature vectors. We argue that tweets capture three types of information that can explain next-day morning traffic variances.

  • Sleep-wake patterns: as illustrated in Fig. 6, tweeting activities in the night before and early morning capture the sleep-wake patterns in urban districts and thus explain next-day morning traffic;

  • Event indicators: counts and sentiments of geocoded tweets, aggregated by space and time, have been proved in many prior works as an effective indicator for events and holidays, which can impact next-day morning traffic;

  • (Planned) traffic incidents: Twitter accounts owned by public agencies and media (e.g., @511PAPittsburgh) automatically report real-time traffic incidents (e.g., crashes and planned roadworks, etc.), which have impacts on next-day morning traffic.

We process and encode separately the three types of information into features. The detailed steps are as follows.

1. Augmenter A novel social media data augmentation method is proposed to alleviate user sparsity issues when extracting sleep-wake patterns in urban areas. Geocoded tweets are sparse because the sampled users change every day. For example, only 153 tweet users (1%) posted geocoded tweets between 9 pm-5 am for more than 7 days in a year. The sparsity issue adds noises to the estimates of daily sleep-wake variances. We propose to reduce user sparsity by focusing on influential users, i.e., those who claim as residents in Pittsburgh and have posted a sufficient number of geocoded tweets that can be used to infer their approximate home locations. We use all their tweeting profiles scraped from user timelines retrieved by using the open-source Twint, which includes non-geocoded tweets, retweets, favorites, etc., to augment the estimates of sleep-wake pulses. Besides constructing a consistent user sample set, this augmentation method refines the estimates of sleep-wake pulses because (1) given that most residents sleep at home, the augmented tweets can recover full user tweeting activities in late night and early morning, while safely probing their geotags with home location; (2) to explain next-day traffic, spatially-segmented features extracted from tweets should focus on users who live in that community and thus commuting from that area next morning, rather than temporary visitors who posted geocoded tweets there last evening or night, but left afterward. The difficulty of this augmentation approach, however, is to filter influential users from a large number of noisy users (e.g. visitors, etc.), and to infer their home locations with geocoded tweets.

Influential user filtering

Twitter users who posted at least five geocoded tweets in our dataset are selected. We then make use of self-reported user profiles to identify local residents of Pittsburgh. For users who have posted geocoded tweets within the bounding box, we select those who entered their residence with places in Pittsburgh. A resident classifier is carefully implemented using Regular Expressions (REs) that match local city names and nicknames (e.g. pittsburgh, pgh, da burgh, steel city, etc.), sports teams (steeler, etc.), zip codes (e.g. 15213, etc.), area code (e.g. 412), universities (e.g. cmu, chatham, etc.), townships (moon, robinson, etc.), neighborhoods (e.g. shadyside, oakland, etc.) and coordinates within the bounding box (e.g. 40.429, -79.932, etc.).

Home location inference

We infer users’ approximate home locations by two steps: (1) a density-based clustering algorithm called DBSCAN (Ester et al., 1996) is applied on tweet coordinates to identify a user’s frequently visited places; (2) a rule-based classifier is tuned to find user homes from frequently visited places using features of that place cluster, including land-use variables, check-in ratios, night-time activities and home-related tweet features.

(a) DBSCAN clustering (Ester et al., 1996).
(b) Home clusters and weighted average home inference.
Figure 7: Weighted user home location inference with DBSCAN.

As illustrated in Fig. 7(a), DBSCAN requires two input parameters: and . The algorithm iteratively starts with a random unvisited point in the graph and finds nearby points within radius neighborhood. A cluster is formed if more than points are found in the neighborhood area. The algorithm terminates if every point has been either visited or clustered and points not reachable from any cluster, at last, are identified as noise. Input parameters and are used to cluster the coordinates of geocoded tweets by user.

We propose a rule-based classifier to find a user’s home among identified clusters. For each of the coordinate clusters, we compute features including:

  • Land-use composition: a vector containing lane-use composition (%) of tweets posted in that cluster. Pittsburgh zoning map7 is used to join geocoded tweets spatially with land use information, which includes residence, downtown, education, industry, mixed-use, and amenity;

  • Check-in ranking: a positive rank of tweet amount posted in that cluster by the user;

  • Midnight activity: a binary variable indicating if the user has posted tweets in that cluster between 0 and 6 am;

  • Home tweets: a binary variable indicating if at least one home-related tweet is posted in that cluster. We apply a few keywords (e.g. sleep, wake, tv, sofa, bath, bed, etc.) to match tweets that users are likely to post at home;

  • Last destination: a binary variable indicating if the cluster is the last destination of the user for at least one day.

Our proposed home classifier is an extension to the most-check-in method (Hossain et al., 2016). The classifier labels user home with six steps below:

  1. Initialize all clusters as non-home cluster. For each influential user and for each of user’s coordinate clusters, repeat the steps below in sequence;

  2. If check-in ranking of the cluster is top 1, and midnight activity is positive, and industry and amenity account for less than 50% of land use, then label the cluster as user’s home candidate;

  3. If check-in ranking of the cluster is in top 3, and midnight activity is positive, and industry and amenity account for less than 50% of land use, and home tweets are positive, then label it as user’s home candidate;

  4. If that cluster is never user’s last destination or industry and amenity account for more than 50% of land use, then label that cluster as non-home cluster;

  5. If no home tweet has been posted in that cluster and other clusters of that user has been labelled as candidate home location, then label that cluster as non-home cluster;

  6. If multiple clusters of a user are labelled as home candidates, the cluster with the highest check-in ranking are then chosen as home cluster. Otherwise, the home candidate is chosen as home cluster of that user.

After a user’s home cluster is classified, check-in points within a home cluster can still spread across multiple land-use areas, as illustrated in Fig. 7(b). To locate user’s home, handcrafted weights are used to approximate the probability of a point in that land-use area being a residential place, which include weights for: (1) residence=1.0, (2) mixed-use=0.5, (3) education=0.2, (4) downtown=0.2, (5) Industry=0.0 and (6) amenity=0.0. Finally, the coordinates of the user’s geocoded tweets posted in the home cluster are weighted by land-use weights to approximate the user’s home location.


For influential users’ timeline activities during late night and early morning (9 pm - 5 am), we fill the missing tweeting coordinates with their inferred home locations by assuming residents sleep at home.

2. Cleaner We develop two text cleaners separately for individual user tweets and traffic incident related tweets. Individual tweet cleaner is proposed for geocoded tweets and retrieved user timeline tweets, and traffic tweet cleaner is applied to traffic incident tweets posted by traffic agencies, i.e., @511PAPittsburgh in this study.

Individual user tweets

Tweets contain spams and advertisements posted by bots, which should be removed for the purpose of extracting people’s tweeting patterns. To identify these bots, we first filter suspicious accounts by computing their tweeting location ranges. 173 users with tweeting location range less than 10 meters are selected as suspicious bots in this step. Then, we adopt a publicly-available Botometer API (Davis et al., 2016) to give each suspicious account a “bot” score. Botometer is a random forest bot classifier built with more than 1,000 features extracted from available account meta-data, interaction patterns, and content scraped from 200 most recent tweets. Finally, 103 accounts with “bot” score higher than 2.0 are labeled as bots. The threshold is selected by manual inspection. After removing spams, we follow a common procedure described below to clean individual tweet content:

  1. Lower the text content;

  2. Remove special tweet entities (e.g. urls, emojis, email addresses, phone numbers, user names, etc.);

  3. Segment hashtags to words and remove #, e.g., #LetsGoPens – lets go pens;

  4. Concatenate consecutive () single-character tokens, e.g. Ain’t H A P P Y – ain’t happy;

  5. Remove repeated suffix, e.g., Soooo goood lololol…– so good lol;

  6. Translate slangs to formal words with a slang dictionary, e.g. lol – laughing out loud;

  7. Remove special characters and brackets, e.g., *([)]&=;

  8. Strip and remove extra whitespaces;

  9. Add ending mark “.” to unfinished sentences and fill empty tweets with an ending mark.

Traffic incident related tweets

We extend the text processing steps in (Gu et al., 2016) to parse traffic incident tweets posted by public agencies. @511PAPittsburgh is an official Twitter account of the Pennsylvania Department of Transportation (PennDOT) for Southwest PA. The account reports real-time traffic incident status with a series of computer-generated tweets. For example, a full incident record is reported with tweets:

2019-12-27 06:42 – Multi vehicle crash on I-376 eastbound at Mile Post: 74.0. There is a lane restriction.
2019-12-27 07:18 – UPDATE: Multi vehicle crash on I-376 eastbound at Mile Post: 74.0. All lanes closed.
2019-12-27 08:02 – CLEARED: Multi vehicle crash on I-376 eastbound at Mile Post: 74.0.

Considering the fixed content format, we carefully implement Regular Expressions (REs) to extract highway or road names (e.g., I-376), direction (e.g., eastbound), exit or milepost (e.g., 74.0), incident type (e.g., crash), lane status (e.g., lane restriction – full closure – open) and tweet flag (e.g., occur – update – clear) from each incident tweet. Compared with (Gu et al., 2016), we add the retrieval of lane status and flags to better describe the incident record.

3. Geocoder Individual (geocoded+timeline) tweets are spatially joined with census tracts on their posting coordinates. For traffic incident tweets, we applied the GIS developed in (Gu et al., 2016) to translate the parsed incident highway/road name and exit/mile post into incident latitude/longitude coordinates.

4. Encoder Three types of information, i.e., sleep-wake patterns, local events and traffic incidents, embedded in tweets are encoded into features. Note that the three groups of features are extracted from different processed tweet datasets: (1) sleep-wake patterns are extracted from augmented influential user timeline tweets; (2) local event indicator features are built from geocoded tweets, and (3) traffic incidents are parsed from traffic incident tweets.

Individual user tweets

Two groups of features are extracted from user tweeting activities and sentiment. To extract tweet sentiment, a neural sentiment model is first constructed to embed tweet content into a vector. We finetune BERT (Devlin et al., 2019), a state-of-the-art neural language model shown in Fig. 8(a), on Sentiment140 dataset (Go et al., 2009). The dataset contains 1,600,000 tweets annotated with the polarity of content (i.e., Pos=positive, Neg=negative). We use the developed individual tweet cleaner to clean and normalize tweet content. A held-out dataset of 80,000 tweets is used to early-stop the training when validation accuracy does not increase. As shown in Fig. 8(b), the constructed model achieves an accuracy of 87.2% for classifying tweet sentiment on the validation dataset. Then, the fine tuned model is applied to user tweets to embed their content into vectors. We use the last sigmoid layer output , i.e, the probability of a tweet being positive, to label content into three categories (Pos=positive, Neu=neutral, Neg=negative). We label tweet content as positive if , neutral if and negative if . The features extracted from user tweeting activities and sentiment are described below.

(a) BERT language model (Devlin et al., 2019).
(b) Finetuning BERT on Sentiment140 dataset.
Figure 8: BERT-based neural tweet sentiment model.
  • Spatio-temporal sleep-wake activities: two histogram-like vectors describe the distributions of sleeping and waking up times of influential tweet users. The last augmented timeline tweets that the influential users post between 9 pm-3 am and the first augmented timeline tweets they post between 3 and 5 am are selected. Their counts, aggregated spatially by census tract and temporally by hour, are normalized by total counts to build the two histogram-like feature vectors.

  • Event indicators: abnormal posting frequency and sentiment of geocoded tweets describe the occurrence of events. Six periods, including EM–Early Morning (3 am-5 am), AM–morning (5 am-9 am), DA–Mid-day (9 am-6 pm), EV–Evening (6 pm- 9 pm), LN–Late Night (9 pm-0 am), and MN–Mid Night (0 am-3 am) are used to segment the day. The number of geocoded tweets posted in these periods are used as features. Moreover, the percentage of neutral tweets in the six periods, among all sentiment categories, are also encoded as features. We expect that the tweeting frequency will be higher while the neural tweet percentage will be lower than average if some special events occur because most users either show positive or depressing attitudes towards events.

Traffic incident related tweets

Traffic incident records, including closure/open timestamps, highway/road names, incident start/end coordinates, lane closure type are translated from the parsed incident tweet series. The same data format is also used in the PennDOT RCRS incident dataset so these two data sources can be easily integrated. Lane closure/open timestamps are defined as the first and last tweeting timestamps of the series of tweets describing an incident. If two or more exits/mileposts appear in a tweet, incident start location is defined as the coordinates of the smallest exit/milepost and incident end location uses the coordinates of the largest exit/milepost. Otherwise, the incident start and end locations both use the coordinates of the incident. Two types of lane closure, i.e., partial and full closure, are used to describe the severity of incident. We define an incident as a full-closure incident if lane status turns to full closure for at least once during the incident record.

Traffic incidents

Traffic incidents data, which include PennDOT RCRS incidents and the processed traffic incident tweets, are encoded so that those features represent the impacts of incidents on TMC segments. Three aspects of incident impacts are considered: (1) lane closure types, (2) incident location, and (3) incident time window. Two vectors, partial closure impacts and full closure impacts , are extracted as follows.

Figure 9: Five types of road closure location w.r.t. road segment location.

Impacts of lane closure type Let and be two complement binary variables describing the lane closure status of incident . when road is partially-closed and when road has full closure. The two types of lane closure are encoded separately into two features and , because lane closure types have been shown to impact very differently the change of traffic patterns in the network (Qian and Michael Zhang, 2012).

Impacts of incident location To encode incident locations, a three-dimension vector is used. Fig. 9 shows the five possible incident locations relative to road segments. We use to denote locations, where means the incident occurs on downstream, or intersects downstream of the TMC, means the incident contains the TMC, and means the incident is on upstream or intersects upstream of the TMC. Let be the minimum distance between incident start/end location and segment start/end location, and km is the farthest distance with which incident can impact the TMC in this study. Let be the impacts of an incident , which decays linearly with distance to incident location in Eq. 4. Note that elements in range from 0 to 1.


Impacts of incident time window A vector of binary variables is used where if the road is closed in hour t for incident . Note that is included at the front to describe if there are incidents happening in midnight before 5 am. Incidents that occur after morning periods, i.e., after 11 am, are not considered.

Weather and time features

Weather features used in this paper include six continuous variables – temperature, humidity, wind speed, pressure, visibility, hourly precipitation, and a binary variable – pavement condition. Continuous weather variables are scaled to 0-1 range by Min-Max normalization. Time features include four categorical variables: week-of-year, month-of-year, day-of-week, and holiday. We process them into an 8-dimension variable. For the cyclic month and week of year variables, we use sine and cosine functions to transform them into a two-dimension vector :


where denotes the week/month index and denotes the total weeks/months in 2014. An advantage of this “clockwise” encoding is that each variable is mapped onto a circle such that the lowest value for that variable appears right next to the largest value (e.g. January is next to December). For day-of-week and holiday variables, we apply one-hot encoding after combining similar time features. Specifically, while Monday and Friday are encoded separately, Tuesday-Thursday are merged into one variable, so are Saturday, Sunday, and official holiday variables. To include prior and lagged effects of weekends and holidays, we include the number of days before and after the nearest holiday or weekend in the feature set.

4.2 Our proposed model

We develop a clustered learning pipeline that makes use of the spatio-temporal patterns of morning traffic along a road for predicting segment-level traffic. As shown in Fig. 10, the pipeline model consists of three building blocks: (1) road-level traffic descriptor, (2) segment-level traffic classifier, and (3) segment-level traffic regressor.

Figure 10: tweet2traffic clustered model architecture.

Road-level traffic descriptor

This module predicts morning congestion scales of a road. Because of spillback effects, morning road traffic shows ordered spatio-temporal clustering patterns as described in Section 3.1. It is natural to use ordered logit to model the relationship between road traffic cluster index and road-level input features, which include tweet features, weather and time variables. Note that traffic incident features are redundant at road level and hence are only included in segment-level models. We develop a set of “one-versus-rest” binary classifiers for the ordered logit model. The binary classifiers respectively predict if the road congestion cluster index is greater than a given value . For example, the road-level descriptor of I-279 S which has four identified congestion clusters is comprised of three binary classifiers respectively predicting if the cluster index , and . Due to the high dimensions and co-linearity of spatio-temporal tweet features, we apply -norm regularization on model coefficients in Eq. 9 to remove irrelevant variables and to learn stable relationships.


Clustering features , which describe the predicted morning congestion scales of the road, are included in feature sets of segment-level models of the road.

Segment-level traffic classifier

-regularized logistic regression classifiers are separately built for each segment to predict next-day morning congestion status. As defined in Eqs. 10-11, penalty pushes the model to select critical features that explain segment-level variances, with road-level congestion explained by upper-level traffic descriptor .


For each segment, we fit two non-linear models on the selected features:

Random-forest-classification (RF) Random forest classification model trained on the features selected by -regularized logistic regression.

Neighbor-based-classification (KNN) A neighbors-based regression where the neighbors are obtained by selecting days with the closest learned road-level congestion scales . Uniform weights are applied to traffic outputs of neighbor days to trigger predictions on congestion status .

Segment-level traffic regressor

Similarly, linear models with regularization (Lasso) are trained on days when congestion occurs () to predict congestion starting time (), congestion duration () and planning time index () for each segment. These predictors learn weights such that:


For each segment, we fit two non-linear models on the selected features.

Random-forest-regression (RF) Random forest regression model trained on the features selected by Lasso.

Neighbor-based-regression (KNN) Similar to KNN classifier. Uniform weights are applied to traffic outputs of neighbor days to trigger continuous predictions on .

4.3 Baselines and model variants

Two baselines are implemented: HM interprets recurrent traffic patterns with historical data, and SAR show the additional power of real-time traffic speed data before early morning for explaining morning peak hour traffic.

Historical-mean (HM) Baseline model for next-day morning traffic prediction without using real-time data. HM makes predictions using day-of-week historical average of traffic outputs with a rolling window tuned by cross validation.

Seasonal-autoregression (SAR) Baseline model for traffic prediction with real-time traffic speed data. As defined by Eq. 15, real-time traffic speed before early morning and past day-of-week seasonal speeds are combined to predict TMC speed in the morning by an autoregressive way. Traffic outputs are computed from speed predictions following data processing steps in Section 4.1.1.


We experiment with three model variants to show the effectiveness of our model design.

Tweet2traffic-no-feature A model variant trained with the removal of a selected feature type. We assess the role of each data source by removing its all processed features from our model and computing the performance deduction.

Tweet2traffic-no-cluster A model variant without road-level traffic descriptor (i.e., k-means clustering and ordered logit model blocks). The high-dimensional social media features are fed into segment-level models directly. Other blocks (e.g. classifier, regressor, regularization, etc.) remain the same.

Tweet2traffic-before-time A model variant trained with data (e.g. tweet, weather, traffic incidents, speed, etc.) accessed before an earlier time than 5 am. This variant tests how ahead-of-curve our model can predict morning traffic, by quantifying the trade-offs between forecasting horizon and prediction performances.

5 Results and discussion

We run experiments with 300 days of traffic and social media data on 53 TMC road segments in the City of Pittsburgh in 2014. For all experiments, we use and min to identify morning congestion and to generate ground-truth traffic congestion labels. For days with multiple congested periods, we combine the periods if their gap is less than 15 minutes. We’ve used different training and testing datasets for (1) evaluating prediction performances and (2) interpreting the model, which are described as follows:

  1. When evaluating the model prediction performances, our model uses data known before 5 am, or earlier discussed in Section 5.4 to inform morning traffic predictions of on all segments, described by congestion status (), congestion starting time (), congestion duration () and planning time index () in morning periods. Hence, for traffic incidents, only planned traffic events (e.g. road construction, bridge precaution, etc.) and crashes reported before 5 am (or earlier depending on the cut-off time of data feed) are fed as inputs to the model. To train the models and evaluate performances, nested time-series cross validation (tsCV) is employed. In the outer loop, tsCV splits the dataset in sequence such that in the kth split, the first k folds are used for training and the (k+1)th fold is for testing. The inner loop applies cross validation (CV) to tune hyperparameters. Finally, the model is fitted on the whole training set with tuned model hyperparameters and evaluated on the test set in the outer loop. The experiment setting ensures that: (1) model training and hyperparameter tuning access only training sets, and (2) future information is not abused to predict the past when testing. The models are evaluated on each testing split by metrics including classification accuracy, precision and recall score for , and root-mean-squared-error (RMSE) for and . Note that for classification of , all training and testing samples are used by models but for regression of , only days with morning congestion (i.e., ) are fed into the model. Similarly, models are only tested on congested days. By applying 10-fold tsCV in the outer loop and 4-fold CV in the inner loop, we report the model prediction performances by each model and testing split. An aggregated score weighted by the number of samples in each testing split is also reported for each model.

  2. When interpreting model parameters and predictions, all 300 days of data are used for training, and hyperparameters are selected by 4-fold cross validation. This is to ensure the fitted model parameters are consistent across the splits. All types of traffic incidents (i.e., planned and unplanned traffic events) in the morning are assumed to be known to the model.

Tweet features
Spatio-temporal sleep-wake pulse Aggregated count distribution of influential users’ last tweets in night and first tweets
in early morning by hour and census tract (Feature name: Hour_CensusID)
Aggregated count Aggregated geocoded tweet count distribution by day period defined in Section 4.1.2
(Feature name: Period)
Aggregated sentiment Percentage of neutral geocoded tweets by day period (Feature name: Neu_Period)
Incident features
Partial closure impacts () Traffic impacts of a partial closure incident at location of the road segment in hour
(Feature name: p_Location_Hour)
Full closure impacts () Traffic impacts of a full closure incident at location of the road segment in hour
(Feature name: f_Location_Hour)
Weather features
Air temperature Air temperature in degree F immediately by hour (Feature name: temp_Hour)
Relative humidity Relative humidity in percent by hour (Feature name: hum_Hour)
Wind speed Wind speed in feet by hour (Feature name: wspd_Hour)
Pressure Mean sea level pressure by hour (Feature name: pressure_Hour)
Visibility Prevailing hourly visibility (Feature name: vis_Hour)
Pavement condition Binary variable (1 for wet, icy, flooded) by hour (Feature name: pave_cond_Hour)
Adverse weather Categorical descriptions of observed adverse weather conditions. Larger values
indicate severer weather conditions (Feature name: wx_phrase_Hour)
Hourly precipitation Hourly precipitation (Feature name: precip_hrly_Hour)
Time features
Cyclic month Clockwise encoding of month of year (Feature name: mon_sin, mon_cos)
Cyclic week Clockwise encoding of week of year (Feature name: week_sin, week_cos)
Holiday Binary variable indicating if the day is a holiday or during weekend
(Feature name: dow_wkd_holiday)
Day of week (Monday) Binary variable indicating if the day is Monday (Feature name: dow_mon)
Day of week (Tues–Thurs) Binary variable indicating if the day is between Tuesday and Thursday
(Feature name: dow_tue_thu)
Day of week (Friday) Binary variable indicating if day is Friday (Feature name: dow_fri)
Prior holiday effects Number of days before next holiday or weekend (Feature name: nxt_rest)
Lagged holiday effects Number of days after last holiday or weekend (Feature name: lst_rest)
Clustering features
Road congestion scale Probabilities of traffic congestion on road greater than a given level
predicted by road traffic descriptor (Feature name: c_Level)
Table 3: List of features used by the predictive model.

The features used by the predictive model are summarized in Table 3. We compare model performances with baselines in Section 5.1. Section 5.2 elaborates the respective contributions of tweets, traffic events and weather data to model prediction, what data patterns it captures, and discusses when and where our method outperforms the variants. In Section 5.3 we show the effectiveness of our proposed clustered model structure. How in advance can morning traffic be predicted with social media data is further examined in Section 5.4.

5.1 Next-day morning traffic prediction

The model prediction performances of our proposed model tweet2traffic against baselines, described by aggregated evaluation metrics across testing splits, are summarized in Table 4 and visualized by road segment in Fig. 11. The poor prediction performances of two baseline models, i.e., seasonal-autoregression (SAR) and historical-mean (HM), confirm the limitation of using real-time speed data for morning traffic prediction in Section 1 generally exist in traffic networks and Fig. 1 is no special case.

First we find that the autoregression-based method (SAR) can trigger reasonable traffic predictions of the next time step (5 min) for most road segments (). However, when used to predict traffic for whole morning periods using data by 5 am, SAR can often predict no congestion at all, which is supported by the extremely low recall of congestion status and high prediction errors of congestion starting time, duration and planning time index presented in Table 4. This is because congestion often doesn’t exist or traffic demand can grow very slowly prior to 5 am on some weekdays, and SAR is unable to pick up the drastic traffic break-down using only the traffic data up to 5 am. In addition, adding the seasonal features to AR models doesn’t seem to be a solution for the problem setting, as the model puts most weights on real-time traffic speed (i.e., AR(1) term), which can only be inferred autoregressively from model predictions if using only the traffic data up to 5 am. On the contrary, historical-mean (HM) baseline which doesn’t use any real-time information performs much better than SAR in terms of predictions of all congestion measurements. This is no surprise, as HM offers insights for day-to-day recurrent traffic. This is why HM is used by many traffic management centers to tune next-day traffic management strategies beforehand, and is offered as the ’typical traffic’ in Google Maps’. However, it is shown in Table 4 that the overall prediction performances of HM, with 79% accuracy for predicting congestion status, 48 minutes error for congestion starting time and 74 minutes error for congestion duration, do not provide adequate travel advice for morning commuters.

Method Congestion measurements
Congestion status (%) Starting time (h) Duration (h) Planning time index
Accuracy Precision Recall RMSE RMSE RMSE
Our model
Tweet2traffic 0.88 0.05 0.79 0.16 0.85 0.15 0.57 0.22 1.05 0.76 6.53 3.98
Tweet2traffic-RF 0.86 0.06 0.77 0.16 0.78 0.19 0.65 0.28 1.16 0.88 7.12 3.87
Tweet2traffic-KNN 0.85 0.07 0.78 0.14 0.75 0.20 0.61 0.25 1.19 1.01 6.56 3.72
SAR 0.58 0.16 0.22 0.29 0.09 0.20 3.50 0.59 2.33 1.57 15.29 13.28
HM 0.79 0.07 0.60 0.24 0.68 0.26 0.80 0.34 1.24 0.78 8.15 5.17
Table 4: Comparisons of model performances averaged over evaluation periods and road segments. The reported standard deviations () account for variances of evaluation metrics across different road segments.
(a) Morning traffic prediction error by road segment.
(b) Performance improvement against historical-mean baseline.
Figure 11: Visualization of tweet2traffic congestion prediction model performances by road segment: (a) Congestion prediction error of our model by road segment, and (b) percentage improvement against historical-mean baseline. Note: RMSE root mean square error; CST congestion starting time; CD congestion duration; PTI planning time index; percentage improvement.

Results show that our proposed models tweet2traffic generally outperform baselines for all congestion measurements and the performance improvement is more significant for predictions of congestion status and congestion starting time, if compared with the HM baseline. In addition, if compared with two nonlinear model variants of tweet2traffic using random forest (RF) and neighbors-based (KNN), the model performances degrade on most predictions of congestion measurements. Although one may expect adding feature nonlinearity often boosts performances, our experiments show otherwise, possibly because of the relatively small sample size in some time-series splits. To further locate the sources of performance improvement, the percentage improvement of our method is visualized against HM by road segment in Fig. 11.

It is found that the model prediction performances differ a lot by spatial locations. As shown in Fig. 11(a), some segment models can offer accurate CST predictions with less than 20 minutes error and CD predictions with less than 25 minutes error, while some other segments can have CST error over 65 minutes and CD error even over 2 hours where no significant improvement over HM is observed. A similar pattern is observed for percentage performance improvement against HM baselines in Fig. 11(b), where tweet2traffic can have 64% CST prediction improvement and 39% CD improvement for some special locations (e.g. at the edge of spillback regions upstream to traffic bottlenecks) in the regional network, while several segments only show a trivial improvement in terms of all congestion measurements with multi-source data features added. Besides different traffic conditions, the results suggest that the multi-source data features, including tweets, incidents, weather and temporal variables, are only effective for improving morning traffic predictions on some locations in the network, but not for all. This is completely anticipated as some road segments are much more likely to be influenced or implied by weather, population activities and incidents than others. The data driven approach precisely helps learn the “signals” of morning traffic congestion as much as it can from social media data, which is effective and powerful on a portion of roads. The tweet2traffic model may not work for other roads, simply because social media are not actively used in areas that are relevant to those roads.

It is worth noting that the tweet2traffic model works the best on those segments that are upstream of “choking’ points”, namely spill-back segments of bottlenecks with large day-to-day congestion variations. This is particularly appealing since this next-day traffic information can be the most valuable to both commuters and operators.

5.2 Role of each data source

We explain how multi-source data impact overall morning traffic prediction in Section 5.2.1. The respective contributions of each data source, and when and where our model outperforms those variants is discussed in Section 5.2.2.

Visualization of model parameters for road-level traffic descriptor

We first visualize in Figs. 12 to 15 the weights of road-level traffic descriptors fitted on four roads, i.e, I-279 S, I-376 W, I-376 E and PA-28 S used in our study. The road-level models describe the overall congestion patterns on the road by predicting if the road congestion cluster index is larger than a given level. For example, the road-level descriptor of I-279 S which has four identified congestion clusters (Fig. 4) is comprised of three binary classifiers respectively predicting if , and . For interpretability, the model weights of the first and last classifier of each road-level traffic descriptor are presented. The weights of the descriptor explain how multi-source tweets, weather and temporal features impact the morning traffic at the road level. The impacts of traffic incident information will be examined by the segment models in Section 5.2.2.

Figure 12: Model weights of road-level traffic descriptor of I-279 Southbound. Note: =congestion cluster index; full names and details of features used by the model can be found in Table 3.

Clearly, the selected features are consistent with our anticipation. As shown in Figs. 12 to 15, the selected features are across all types of tweet features, weather and temporal variables. It is found that tweeting features, which include both spatiotemporal sleep-wake pulses and event indicators, serve as a proxy for morning travel demand. As expected, the spatiotemporal sleep-wake pulses captured by each descriptor differ by space (i.e., census tract) and time (i.e., hour index) as morning travel demand among different Origin-Destination pairs is loaded into these four roads by different time. However, when looking at each road descriptor separately, the resulting relationships are surprisingly simple. Generally, we discover that the earlier people go to sleep, the more congested the road will be in the next morning. The early-sleeping pulse is represented by high resident tweeting activities in the evening, together with low tweeting activities during midnight in selected spatial areas. Accordingly, it is found in Figs. 12 to 15 that weights of sleep-wake pulses between 9 pm and midnight are mostly positive for classifiers while weights of sleep-wake pulses between 0-3 am are negative. This supports our finding that early-sleeping pulse positively impacts morning congestion if the roads will be congested the next morning. Interestingly, we also find that the early-wake pulses, represented by high resident tweeting activities in the early morning, also positively relates to the level of morning congestion on the road. This finding is supported by the positive weights of sleep-wake pulses between 3-5 am for classifiers in Figs. 12 to 15. It strongly supports our anticipation that in general the earlier people go to sleep, the higher chances they will commute and depart early, both of which contribute to morning congestion.

Figure 13: Model weights of road-level traffic descriptor on I-376 Westbound. Note: =congestion cluster index; full names and details of features used by the model can be found in Table 3.

Next, we present the impacts of tweet event indicator features which comprise of aggregated geocoded tweet counts and percentage of neutral tweets by time period. It is found that the percentage of neutral tweets in the evening (6-9 pm), i.e., labeled as “neu_EV”, is positively-correlated with next-day morning road congestion. This suggests that the occurrence or discussion of big events in the evening, represented by higher or lower tweet sentiment than normal, often results in low travel demand in the next morning. Possible examples of “big events” include the occurrence of sports games (e.g., Pittsburgh Pirates, Penguins, Steelers, etc.) in the evening or discussion of upcoming local holidays and so on, which are reasonable indicators of low travel demand in next morning.

Figure 14: Model weights of road-level traffic descriptor on I-376 Eastbound. Note: =congestion cluster index; full names and details of features used by the model can be found in Table 3 .
Figure 15: Model weights of road-level traffic descriptor on PA-28 Southbound. Note: =congestion cluster index; full names and details of features used by the model can be found in Table 3.

As expected, roads are found to be more congested between Tuesday and Thursday and traffic during weekends and official holidays are of low volume. Surprisingly, we show that weather features such as precipitation, pavement conditions and adverse weather conditions are only selected in classifiers. This suggests that adverse weather conditions in the early morning can often impact the morning road congestion levels if congestion occurs, but traffic congestion may be completely avoided when commuters become aware of the adverse weather in advance in the next morning.

Ablation study of data source

Experiments with tweet2traffic-no-feature model variants are conducted to determine the contribution of each data source to model prediction performances. We remove specific features from the model and computes performance degradation from tweet2traffic. Note that we only evaluate the importance of data sources on the top 10 road segments selected by performance improvement against historical-mean baseline in Section 5.1, as these segments make the most use of multi-source data for traffic prediction. Note that most of these segments are located at the edge of spillback regions upstream to traffic bottlenecks in the regional network. The results are shown in Fig. 16.

Figure 16: Percentage performance degradation by removing data source for the top 10 road segments. The error bars account for variances across road segment. Note: CST=congestion starting time; CD=congestion duration, and PTI=planning time index.

It is surprising to see that if we do not include Twitter related features, the model performances drop the most on all selected road segments in terms of all four congestion measurements. As shown in Fig. 16, the prediction of congestion duration and planning time index degrades over 15% if not including tweet features. On the contrary, if we remove traffic incident or weather features, the performance only degrades by 4% and 6%, or 5% and 4% for CD and PTI. This finding highlights the importance of having social media data for some specific road segments. There are three main reasons: (1) only traffic events and weather conditions reported before 5 am are used to fit the models so a large portion of traffic incidents, such as crashes during the morning commute, cannot be used by any models. The same applies to weather conditions, where moderately adverse weather conditions during the morning commute that do not prevent commuting trips can have unknown effects to traffic; (2) social media related features can be moderately correlated with those weather and indent related features. In other words, tweets may already include some information related to weather and incidents; and (3) traffic events and adverse weather conditions are sparsely distributed in time and space so the overall performance may not be affected significantly. Those hypotheses call for an investigation into testing samples to examine when and where our model outperforms tweet2traffic-no-feature variants.

Social media data We show when social media features are necessary for next-day morning traffic prediction using a representative road segment (ID: 104N04503) as an example. The segment is located right after the merging lanes of William Penn Hwy and Penn Ave into I-376 W and is upstream to Squirrel Hill Tunnel, which is one notorious congestion bottleneck in Pittsburgh with extremely large day-to-day congestion variation. Note that this segment is located at the upstream edge of the traffic spillback area of Squirrel Hill Tunnel, indexed by TMC order=1 in the cluster plots of I-376 W in Fig. 4. With tweet features approximating morning travel demand, this segment is an ideal location to test the effectiveness of social media data, since the limited supply only necessitates congestion while morning travel demand plays a crucial role in causing congestion on that segment.

Figure 17: Examples of social media impacts on next-day morning traffic: (a) time-series evaluation for a representative road segment. Each point is RMSE over a split; (b) Detailed predictions within a split. Each point represents error of one traffic prediction; (c) explanation of predictions made by traffic descriptor on Jul 22, 2014; (d) explanation of predictions made by traffic descriptor on Jul 22, 2014; (e) word cloud in the evening of Jul 21, 2014, and (d) word cloud in the late night of Jul 21, 2014.

As shown in Fig. 17, we first compare the prediction errors of models with and without tweet features by testing split in (a). The model with tweet features performs much better than its variant for the test split in July. Thus, we zoom into the split in (b) and select 7/22/2014 (Tuesday) as a representative sample for examination. In the prediction made by thee road-level traffic descriptor, we visualize the contributions8 of the top 5 features in (c) and (d). Our model with tweets describes morning congestion with cluster-level 2 while the variant without tweets predicts cluster level 3 which misleads the prediction. Hence, we further examine the predictions made by and classifiers of the traffic descriptor with tweets in (c) and (d). It is found that the percentages of neutral tweets in the evening and late night can trigger a prediction of level 2 morning congestion as in (c), but are not high enough to predict level 3 congestion as in (d). It indicates that some abnormal events which occurred in the evening and night of July 21 possibly reduced or put off morning travel demand on July 22. Finally, we confirm it by mining the topics of tweets posted in the two periods. The word cloud (e) of tweets posted in the evening, which contains “pnc park”, “scoreboard”, “pittsburgh pirates”, “game” as frequent words, together with word cloud (f) of tweets posted in late night, which contains emotional words such as “love”, “good”, “shit”, etc., and game-related words, indicate that there was a night game on July 21. In summary, we show that although our method does not improve morning traffic prediction for all road segments and days in our study, 78% of testing samples can benefit from the incorporation of social media by improving at least one of the four congestion measurement predictions.

Traffic event information For all segments, only testing days with traffic incidents, which are defined by nonzero summation of incident features over all hours, are considered to assess the effectiveness of incident data reported before 5 am for morning traffic prediction. As shown in Fig. 18, the performance degradation after removing incident features vary significantly by spatial locations of road segments. Traffic predictions of road segments adjacent to traffic bottlenecks (e.g., Ft. Pitt Tunnels, 40th St Bridge, Summer Hill, etc.) are generally more affected by the removal of traffic incident features. This is expected because traffic incidents occur frequently near traffic bottlenecks and are more likely to impact the traffic near the bottleneck and its upstream spillover. Interestingly, we also find in Fig. 18 that predictions of congestion status (Accuracy) and congestion starting time (CST) on segments upstream to these identified bottlenecks seem to be more affected by traffic incident features, while for predictions of CD and PTI, road segments upstream and downstream to traffic bottlenecks are equally affected. In fact, the results conform to the physical properties of network traffic flow because (1) congestion caused by traffic incidents propagates backward so all congestion measurements of segments upstream to traffic bottlenecks are affected by traffic incidents, and (2) traffic incidents can also block the outflow so segments downstream to traffic bottlenecks may be less congested than normal. Hence, predictions of CD and PTI for segments upstream to traffic bottlenecks can be affected by traffic incidents occurring at the bottleneck.

Feature Weight Feature Weight
BIAS 38.394 p_ds_10 -0.964
p_ds_7 0.518 vis_5 -0.491
precip_hrly_5 0.355 p_us_4 -0.474
wx_phrase_5 0.127 p_us_9 -0.197
c_2 0.098
precip_hrly_4 0.086
pave_cond_4 0.052
p_in_6 -2.246
Table 6: Segment-level model weights for CD prediction on 104-04530 (). Details of features are in Table 3.
Feature Weight Feature Weight
BIAS 8.286 p_ds_9 0.006
c_2 2.028 p_us_5 -0.001
c_3 1.843 p_us_6 -0.001
precip_hrly_5 0.823 p_us_4 -0.054
p_ds_8 0.728 p_us_9 -0.242
precip_hrly_2 0.128 p_ds_5 -0.391
wx_phrase_1 0.101 vis_3 -0.496
precip_hrly_4 0.018 vis_5 -0.55
Table 5: Segment-level model weights for CST prediction on 104-04530 (). Details of features are in Table 3.
Figure 18: Percentage performance degradation on non-recurrent incident days after removing traffic incident features.

We further examine a representative segment (ID: 104-04530) to support our finding. As shown in Fig. 18, the selected segment is close to Summer Hill where two highways I-279 SB and US-19 SB merge into Parkway North of Pittsburgh. The influences of traffic incident features on traffic prediction are crucial with and . We fit the segment-level model and visualize the model weights of CST and CD predictors in Table 6 and Table 6. For the CST predictor in Table 6, we show that traffic incidents occurring downstream at 7 am (p_ds_7) bring forward CST while incidents downstream at 10 am (p_ds_10) delay CST. Traffic incidents which partially closed the segment at 6 am (p_in_6), or upstream at 4 and 9 am (p_us_4, p_us_9) also delay the start of congestion. For the CD predictor in Table 6, a similar pattern is observed except that upstream incidents of all time periods (p_us_Hour) reduce the congestion duration. The fitted models support our proposed reasons (1) and (2) by explaining how traffic incident features impact morning traffic.

Weather conditions In this study, we assume adverse weather conditions are comprised of snowing, raining, and wet pavement conditions. Hence, testing days with a nonzero summation of these features across all hours are considered to assess the effectiveness of weather data known before 5 am for morning traffic prediction. Our results show that tweet2traffic-no-weather suffers from 3.5%1.9% performance degradation in terms of prediction accuracy of CS, 27.6%19.9% degradation in terms of RMSE of CST, 22.2%13.0% for CD and 21.5%13.3% for PTI on the same group of 10 road segments in Section 5.2.2. Compared with the results in Fig. 16, weather features are now crucial factors for predicting morning traffic, especially for CST, CD and PTI on days with adverse weather conditions. Note that we do not further analyze the contributions of weather data for different spatial locations because weather features are the same for all road segments. Also, we do not have enough segment-level attributes such as pavement types and slopes to explain the variances of precipitation impacts across the network.

5.3 Effectiveness of clustered model structure

We conduct experiments with tweet2traffic-no-cluster variant to show the effectiveness of the clustering in the proposed model structure. As shown in Fig. 19, after removing the clustered structure and feeding tweet features directly to segment-level models instead, most road segments suffer from performance degradation especially in terms of prediction errors of CST, CD and PTI. We also observe that the performance degradation is more significant for road segments in the middle of the road. For example, the RMSE of CST prediction for the segment in the middle of PA-28 SB increases by 80% after removing the clustered structure.

Figure 19: Percentage performance degradation after removing clustered model structure.

Our method which adds road traffic cluster probabilities into segment-level models resembles multi-level random intercept model (Raudenbush and Bryk, 2002), where the road congestion clusters can be considered as the random intercepts of upper groups (congestion clusters) and the features selected via penalty in segment-level models explain relationships in lower groups. As shown in Fig. 19, adding clustered structure is not always beneficial for segment models such as for segments on both ends of the road, which is reasonable as traffic at these locations does not follow overall congestion clusters closely. However, for segments in the middle of the road, model generalization performances have been largely improved because overall congestion clusters help segment models select critical features for explaining segment variances from high-dimensional features with a small sample size. The results also suggest that social media data seem to explain overall road-level congestion better than for specific locations.

5.4 Sensitivity analysis: how far ahead morning traffic be well predicted with tweet information

In previous sections, we show that most useful social media features are extracted from evening (EV) and late night (LN), which are readily available by midnight for the next-day traffic prediction. We thus conduct experiments with two tweet2traffic-before-time variants that use only features available before 3 am or midnight. Specifically, for social media features, sleep-wake patterns and event indicators later than the cut-off time are removed. The same applies to weather features. However, traffic incident features remain the same because we assume planned traffic events are known beforehand. To prevent data leakage in modeling training, all selected features are re-normalized with the available data for each experiment.

Figure 20: Next-day morning traffic prediction errors by road segment under three forecasting horizons. Each line represents a road segment and each point is model prediction error with a specified forecasting horizon on x-axis .

As shown in Fig. 20, it is found that our method is generally insensitive to the increase of forecasting horizon for most road segments when tweet data are mined and used. This implies that we can even make the next-day morning traffic prediction by the midnight before, which is particularly appealing for proactive traffic management during morning commute. A larger performance drop is observed when increasing the horizon from “before 5 am” to “before 3 am”, possibly because data between 3-5 am captures people’s waking-up patterns which can help explain morning traffic. However, the changes in performance with respect to forecasting horizons are relatively small, especially for CST, CD and PTI. For traffic management purposes, the optimal application of our method is to predict next-day morning traffic by midnight, which is expected to give commuters a sufficient time window to make trip decisions before/during morning commute.

6 Conclusions and future work

This paper demonstrates the possibility of using social media data that are publicly available online as a crowdsourcing tool to predict next-day morning traffic. A comprehensive framework, namely tweet2traffic, has been designed for regional networks which first characterizes morning road congestion, processes and extracts traffic information from social data, then examines the relationship between social media and morning congestion patterns, and ultimately uses the relationship to predict morning traffic using data by midnight or up to 5 am. Specifically, we extract three types of information from tweets for explaining next-day morning traffic, which include people’s sleep-wake status, local events, and planned traffic incidents for next day. Considering the sparsity of geocoded tweets, we propose a novel social media data augmentation method which first filters local residents from noisy users such as tourists, infers their home locations, and augments the dataset with user timelines such as non-geocoded tweets, retweets, and favorites by assuming residents stay at home during the night. A sentiment analysis model with a neural language model as backend is constructed to detect abnormal local events and incident records are extracted from traffic-related tweets. To remove compounding errors, we propose to describe morning traffic with a quadruple of variables including binary congestion status, congestion starting time, congestion duration, and planning time index. A traffic prediction pipeline which consists of three building blocks: (1) road-level traffic descriptor, (2) segment-level traffic classifier and (3) segment-level traffic regressor, is developed to encode the supply-side variables (traffic incidents, weather, etc.) and demand-side variables (tweets, temporal variables, etc.) through a clustered learning structure which makes use of ordered spatio-temporal congestion patterns of segments along a road.

Through experiments in the Pittsburgh region, we show that our approach outperforms existing baseline methods and can learn meaningful travel demand characteristics from tweeting profiles and activities. Generally, we discover that the earlier people go to sleep, the more congested the road will be in the next morning. Tweeting activities in the night before and early morning are statistically associated with morning congestion. In addition, we find that the occurrence of big events in the evening, represented by higher or lower tweet sentiment than normal, often implies lower travel demand in the next morning than normal days. We illustrate through ablation studies that social media data, traffic events, and weather conditions are important data sources for next-day morning traffic prediction and their roles in prediction vary by space and time. We find that tweet features can effectively improve next-day morning traffic prediction for some road segments, particularly upstream of the pillback areas of a typical bottleneck and with large day-to-day congestion variation, where congestion patterns can be very sensitive to travel demand characteristics. In addition, the impacts of social media data are more significant when special events, such as sports games, occur during the night, which possibly reduces or puts off morning travel demand. In addition, information of traffic incidents and weather conditions improves predictions on days of non-recurrent traffic with roadworks or adverse weather conditions. Traffic incidents downstream to road segments shift congestion starting time while upstream incidents reduce congestion duration. Adverse weather conditions often increase congestion scales. Finally, our approach proves to be insensitive to the increase in forecasting lag times, and multi-source data up to midnight can be sufficient in providing good prediction of next day morning traffic. This is particularly appealing for traffic management, as traffic information and management strategies can be provided to commuters proactively before the morning starts.

Our experiments suggest that the multi-source data features, including tweets, incidents, weather and temporal variables, are only effective for improving morning traffic predictions on some locations in the network, but not for all. This is completely anticipated as some road segments are much more likely to be influenced or implied by weather, population activities and incidents than others. The data driven approach precisely helps learn the “signals” of morning traffic congestion as much as it can from social media data, which is effective and powerful on a portion of roads. This tweet2traffic model may not work for other roads, simply because social media are not actively used in areas that are relevant to those roads.

Last but not least, it is important to note the limitations of the current study. First, the main task of this study is to improve morning traffic prediction with social media data and travel demand characteristics are interpreted from the linear predictive models. Hence, no causal relationship can been claimed by this paper. Second, the loss of Twitter users with geo-coded tweets has been observed over the past few years. It’s therefore worth investigating how our method can still work using the active tweet users that can change considerably over time. Finally, some road segments in our study are shown to have high prediction errors even using social media. In a nutshell, integrating social media data into traffic prediction opens up a wide range of opportunities for transportation research to mine cross-domain data sources for predicting traffic. In our future work, we plan to add new types of cross-domain data collected from buildings (Zheng et al., 2016), energy systems (Zhang and Qian, 2018), traffic signal timing systems (Yao and Qian, 2020), water/sewer systems, city light to predict long-term traffic, because all those data can strongly imply population activities relevant to travels. Examining the impacts of the changing tweet users and topics to traffic prediction over time is also an interesting direction.


This project is funded by National Science Foundation Award CMMI-1751448, and in part by Carnegie Mellon University’s Traffic21 Institute and Mobility21, a National USDOT University Transportation Center for mobility sponsored by the U.S. Department of Transportation.



  1. journal: Transportation Research Part C: Emerging Technologies
  2. INRIX traffic:
  3. RCRS incident:
  4. Weather underground:
  5. Twitter streaming API:
  6. Twint:
  7. Pittsburgh zoning map:
  8. The contribution of a feature to a prediction made by linear models are the product of feature value and feature weight.


  1. Agarwal, A., Xie, B., Vovsha, I., Rambow, O., Passonneau, R.J., 2011. Sentiment analysis of twitter data, in: Proceedings of the workshop on language in social media (LSM 2011), pp. 30–38.
  2. Agarwal, R., Srikant, R., et al., 1994. Fast algorithms for mining association rules, in: Proc. of the 20th VLDB Conference, pp. 487–499.
  3. Ali, F., Kwak, D., Khan, P., Islam, S.R., Kim, K.H., Kwak, K.S., 2017. Fuzzy ontology-based sentiment analysis of transportation and city feature reviews for safe traveling. Transportation Research Part C: Emerging Technologies 77, 33–48.
  4. Bakshi, R.K., Kaur, N., Kaur, R., Kaur, G., 2016. Opinion mining and sentiment analysis, in: 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), IEEE. pp. 452–455.
  5. Banfield, J.D., Raftery, A.E., 1993. Model-based gaussian and non-gaussian clustering. Biometrics , 803–821.
  6. Berlingerio, M., Ghaddar, B., Guidotti, R., Pascale, A., Sassi, A., 2017. The graal of carpooling: Green and social optimization from crowd-sourced data. Transportation Research Part C: Emerging Technologies 80, 20–36.
  7. Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent dirichlet allocation. Journal of machine Learning research 3, 993–1022.
  8. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al., 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 .
  9. Comaniciu, D., Meer, P., 2002. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on pattern analysis and machine intelligence 24, 603–619.
  10. Cottrill, C., Gault, P., Yeboah, G., Nelson, J.D., Anable, J., Budd, T., 2017. Tweeting transit: An examination of social media strategies for transport information management during a large event. Transportation Research Part C: Emerging Technologies 77, 421–432.
  11. Cui, Y., Meng, C., He, Q., Gao, J., 2018. Forecasting current and next trip purpose with social media data and google places. Transportation Research Part C: Emerging Technologies 97, 159–174.
  12. D’Andrea, E., Ducange, P., Lazzerini, B., Marcelloni, F., 2015. Real-time detection of traffic from twitter stream analysis. IEEE transactions on intelligent transportation systems 16, 2269–2283.
  13. Davis, C.A., Varol, O., Ferrara, E., Flammini, A., Menczer, F., 2016. Botornot: A system to evaluate social bots, in: Proceedings of the 25th International Conference Companion on World Wide Web, International World Wide Web Conferences Steering Committee. pp. 273–274.
  14. Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2019. Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
  15. Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise., in: Kdd, pp. 226–231.
  16. FHWA, 2019. Urban congestion report (ucr): documentation and definition. URL:
  17. França, U., Sayama, H., McSwiggen, C., Daneshvar, R., Bar-Yam, Y., 2016. Visualizing the “heartbeat” of a city with tweets. Complexity 21, 280–287.
  18. Gkiotsalitis, K., Stathopoulos, A., 2015. A utility-maximization model for retrieving users’ willingness to travel for participating in activities from big-data. Transportation Research Part C: Emerging Technologies 58, 265–277.
  19. Go, A., Bhayani, R., Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford 1.
  20. Gu, Y., Qian, Z.S., Chen, F., 2016. From twitter to detector: Real-time traffic incident detection using social media data. Transportation research part C: emerging technologies 67, 321–342.
  21. Harrison, G., Grant-Muller, S.M., Hodgson, F.C., 2020. New and emerging data forms in transportation planning and policy: Opportunities and challenges for “track and trace” data. Transportation Research Part C: Emerging Technologies 117, 102672.
  22. Hasan, S., Ukkusuri, S.V., 2014. Urban activity pattern classification using topic models from online geo-location data. Transportation Research Part C: Emerging Technologies 44, 363–381.
  23. Hasan, S., Zhan, X., Ukkusuri, S.V., 2013. Understanding urban human activity and mobility patterns using large-scale location-based data from online social media, in: Proceedings of the 2nd ACM SIGKDD international workshop on urban computing, ACM. p. 6.
  24. He, J., Shen, W., Divakaruni, P., Wynter, L., Lawrence, R., 2013. Improving traffic prediction with tweet semantics., in: IJCAI, pp. 1387–1393.
  25. Hossain, N., Hu, T., Feizi, R., White, A.M., Luo, J., Kautz, H., 2016. Inferring fine-grained details on user activities and home location from social media: Detecting drinking-while-tweeting patterns in communities. arXiv preprint arXiv:1603.03181 .
  26. Hu, M., Liu, B., 2004. Mining opinion features in customer reviews, in: AAAI, pp. 755–760.
  27. Hu, W., Jin, P.J., 2017. An adaptive hawkes process formulation for estimating time-of-day zonal trip arrivals with location-based social networking check-in data. Transportation Research Part C: Emerging Technologies 79, 136–155.
  28. Huang, A., Gallegos, L., Lerman, K., 2017. Travel analytics: Understanding how destination choice and business clusters are connected based on social media data. Transportation Research Part C: Emerging Technologies 77, 245–256.
  29. Huang, C., Wang, D., Zhu, S., Zhang, D.Y., 2016. Towards unsupervised home location inference from online social media, in: Big Data (Big Data), 2016 IEEE International Conference on, IEEE. pp. 676–685.
  30. Jha, K., Burris, M.W., Eisele, W.L., Schrank, D.L., Lomax, T.J., 2018. Estimating Reference Speed from Probe-based Travel Speed Data for Performance Measurement. Technical Report.
  31. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y., 2002. An efficient k-means clustering algorithm: Analysis and implementation. IEEE transactions on pattern analysis and machine intelligence 24, 881–892.
  32. Khare, A., He, Q., Batta, R., 2019. Predicting gasoline shortage during disasters using social media. OR Spectrum , 1–34.
  33. Kuflik, T., Minkov, E., Nocera, S., Grant-Muller, S., Gal-Tzur, A., Shoor, I., 2017. Automating a framework to extract and analyse transport related social media content: The potential and the challenges. Transportation Research Part C: Emerging Technologies 77, 275–291.
  34. Lin, J., Cromley, R.G., 2018. Inferring the home locations of twitter users based on the spatiotemporal clustering of twitter data. Transactions in GIS 22, 82–97.
  35. Lin, L., Ni, M., He, Q., Gao, J., Sadek, A.W., 2015. Modeling the impacts of inclement weather on freeway traffic speed: exploratory study with social media data. Transportation research record 2482, 82–89.
  36. Liu, B., 2012. Sentiment analysis and opinion mining. Synthesis lectures on human language technologies 5, 1–167.
  37. Lyman, K., Bertini, R.L., 2008. Using travel time reliability measures to improve regional transportation planning and operations. Transportation Research Record 2046, 1–10.
  38. Markou, I., Kaiser, K., Pereira, F.C., 2019. Predicting taxi demand hotspots using automated internet search queries. Transportation Research Part C: Emerging Technologies 102, 73–86.
  39. Ni, M., He, Q., Gao, J., 2016. Forecasting the subway passenger flow under event occurrences with social media. IEEE Transactions on Intelligent Transportation Systems 18, 1623–1632.
  40. Qian, Z., Michael Zhang, H., 2012. Full closure or partial closure? evaluation of construction plans for the i-5 closure in downtown sacramento. Journal of Transportation Engineering 139, 273–286.
  41. Rashidi, T.H., Abbasi, A., Maghrebi, M., Hasan, S., Waller, T.S., 2017. Exploring the capacity of social media data for modelling travel behaviour: Opportunities and challenges. Transportation Research Part C: Emerging Technologies 75, 197–211.
  42. Raudenbush, S.W., Bryk, A.S., 2002. Hierarchical linear models: Applications and data analysis methods. volume 1. Sage.
  43. Schulz, A., Ristoski, P., Paulheim, H., 2013. I see a car crash: Real-time detection of small scale incidents in microblogs, in: Extended semantic web conference, Springer. pp. 22–33.
  44. Shahnaz, F., Berry, M.W., Pauca, V.P., Plemmons, R.J., 2006. Document clustering using nonnegative matrix factorization. Information Processing & Management 42, 373–386.
  45. Sheffi, Y., 1985. Urban transportation networks .
  46. Steiger, E., Resch, B., de Albuquerque, J.P., Zipf, A., 2016. Mining and correlating traffic events from human sensor observations with official transport data using self-organizing-maps. Transportation Research Part C: Emerging Technologies 73, 91–104.
  47. Suma, S., Mehmood, R., Albeshri, A., 2017. Automatic event detection in smart cities using big data analytics, in: International Conference on Smart Cities, Infrastructure, Technologies and Applications, Springer. pp. 111–122.
  48. U.S. Census Bureau, 2015. Commuting characteristics by sex, 2011-2015, american community survey 5-year estimates. URL:
  49. Wu, L., Zhi, Y., Sui, Z., Liu, Y., 2014. Intra-urban human mobility and activity transition: Evidence from social media check-in data. PloS one 9, e97010.
  50. Yao, W., Qian, S., 2020. Learning to recommend signal plans under incidents with real-time traffic prediction. Transportation Research Record 2674, 45–59. doi:10.1177/0361198120917668.
  51. Zhang, P., Qian, Z.S., 2018. User-centric interdependent urban systems: using time-of-day electricity usage data to predict morning roadway congestion. Transportation Research Part C: Emerging Technologies 92, 392–411.
  52. Zhang, Z., He, Q., 2019. Social media in transportation research and promising applications, in: Transportation analytics in the era of big data. Springer, pp. 23–45.
  53. Zhang, Z., He, Q., Gao, J., Ni, M., 2018. A deep learning approach for detecting traffic accidents from social media data. Transportation research part C: emerging technologies 86, 580–596.
  54. Zhang, Z., He, Q., Zhu, S., 2017. Potentials of using social media to infer the longitudinal travel behavior: A sequential model-based clustering method. Transportation Research Part C: Emerging Technologies 85, 396–414.
  55. Zhang, Z., Ni, M., He, Q., Gao, J., Gou, J., Li, X., 2016. Exploratory study on correlation between twitter concentration and traffic surges. Transportation Research Record 2553, 90–98.
  56. Zhao, S., Zhang, K., 2017. Observing individual dynamic choices of activity chains from location-based crowdsourced data. Transportation Research Part C: Emerging Technologies 85, 1–22.
  57. Zheng, Z., Wang, D., Pei, J., Yuan, Y., Fan, C., Xiao, F., 2016. Urban traffic prediction through the second use of inexpensive big data from buildings, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ACM. pp. 1363–1372.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description