Inferring Passenger Type
from Commuter Eigentravel Matrices
Abstract
A sufficient knowledge of the demographics of a commuting public is essential in formulating and implementing more targeted transportation policies, as commuters exhibit different ways of traveling—including time in the day of travel, the duration of travel, and even the choice of transport mode. With the advent of the Automated Fare Collection system (AFC), probing the travel patterns of commuters has become less invasive and more accessible. Consequently, numerous transport studies related to human mobility have shown that these observed patterns allow one to pair individuals with locations and/or activities at certain times of the day. However, classifying commuters using their travel signatures is yet to be thoroughly examined.
Here, we contribute to the literature by demonstrating a procedure to characterize passenger types (Adult, Child/Student, and Senior Citizen) based on their three-month travel patterns taken from a smart fare card system. We first establish a method to construct distinct commuter matrices, which we refer to as eigentravel matrices, that capture the characteristic travel routines of individuals. From the eigentravel matrices, we build classification models that predict the type of passengers traveling. Among the models explored, the gradient boosting method (GBM) gives the best prediction accuracy at 76%, which is better than the minimum model accuracy (41%) required vis-à-vis the proportional chance criterion. In addition, we find that travel features generated during weekdays have greater predictive power than those on weekends. This work should be useful not only for transport planners, but for market researchers as well. With the awareness of which commuter types are traveling, ads, service announcements, and surveys, among others, can be made more targeted spatiotemporally. Finally, our framework should be effective in creating synthetic populations for use in real-world simulations that involve a metropolis’s public transport system.
keywords:
Transport, Human mobility, Activity pattern recognition, Commuter classification, Automated fare collection, Sociodemographics, Machine learning, Gradient boosting method, Random forest
1 Introduction
The era of big and smart data has provided a substantial impetus in understanding human mobility, revealing the regularity and predictability of human behavior. In transport studies in particular, the widespread use of contactless smart fare card systems has spurred considerable growth in the field Ma2013 ; Lee2011 ; Sun2012 ; Chakirov2011 ; Legara2015 . The main focus of most studies on human mobility has been on identifying and/or predicting activity locations given an individual’s past transportation transaction records, essentially spotting the places an individual goes to and stays at during certain times of day, and thereby revealing one’s home, work, and “third place” Jarv2014 ; Kusakabe2014 ; Pelletier2011 ; Goulias1999 ; Nassir2015 ; Lee2014 .
Understanding human mobility is especially consequential in urban land-use and transportation planning Chu2008 ; Utsunomiya2006 . Gaining insights on where people go and what activities they engage in, or even inferring what drives them to travel from one place to another, can help in designing smart cities that sufficiently address what their citizens need from their environment Medina2014 ; Othman2014 , thereby improving their overall well-being.
Notwithstanding the fact that most human mobility studies are centered on matching individuals with locations and/or activities, certain sociotechnical datasets have more to offer than spatial information. In this study, for example, we utilize data from travel fare cards that not only have spatiotemporal information such as origin, destination, time of travel, and duration of travel, but also provide a particular piece of demographic information: the type of passenger traveling, i.e., Adult, Child/Student, or Senior Citizen. In this work, instead of predicting where people go at certain times of the day, we determine a set of features based on travel routines that can help identify which passenger types are traveling. Knowing the commuter types can give us a better understanding of the structure of a society and of what its people need from their surroundings. From the perspective of transport planning, this can help stakeholders quantify more systematically how a certain group of commuters would react to or be affected by changes in the entire transport system, from infrastructure changes to policy changes Medina2014 ; Tong2013 . Finally, from the standpoint of modeling and simulations, our proposed approach can aid in setting up synthetic populations wherein different passenger categories exhibit varying travel signatures.
The paper is organized as follows. In the next section, we discuss the data used in the study. This is followed by a methods section where we (1) present some descriptive statistics relevant to the construction of our classification models, and (2) demonstrate in detail how we set up the eigentravel matrices that define the feature variables used in building the classification models. Finally, we end the article with a discussion and conclusion section where we elaborate on our results and share some insights into them.
2 Data
This paper looks into the movements of public transport commuters within Singapore using a three-month travel dataset. In the city-state, there is only one smart fare card system, called EZ-Link, which is used in both its bus and rail transit systems (RTS). Moreover, the public transport system has both entry and exit automated fare collection (AFC) for the bus and RTS. With both entry and exit AFC, the duration of travel for each transaction can be evaluated in a straightforward manner.
The dataset at hand has more than 3 million unique and anonymized card IDs; this includes single-journey transactions across the three months under study. For computational purposes, we used a randomly sampled population of 30,000 regular commuters. Such sampling yields a confidence interval equal to 99.99%, or an error of less than 0.01%. The population is equally split among three passenger types: adult, child/student, and senior citizen.
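The equal split across passenger types can be sketched as below. This is only an illustration of balanced random sampling, not the authors' actual procedure: the function name `balanced_sample`, the card IDs, and the group sizes are all hypothetical.

```python
import random

def balanced_sample(cards_by_type, n_per_type, seed=0):
    """Draw the same number of card IDs, without replacement, from each passenger type."""
    rng = random.Random(seed)
    # sorted() gives a reproducible ordering before sampling from each set
    return {ptype: rng.sample(sorted(ids), n_per_type)
            for ptype, ids in cards_by_type.items()}

# Hypothetical anonymized card IDs grouped by passenger type (illustrative only).
cards = {
    "Adult": {f"A{i}" for i in range(50)},
    "Child/Student": {f"C{i}" for i in range(40)},
    "Senior Citizen": {f"S{i}" for i in range(30)},
}
sample = balanced_sample(cards, n_per_type=10)
```

In the study itself the sample holds 30,000 commuters in total, so the per-type draw would be 10,000 rather than the toy value used here.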
Each travel transaction contains the following pieces of information that are relevant to the study: card ID, origin, destination, start date (of travel), start time (of travel), end time (of travel), mode of transport (bus or rail), and passenger type.
3 Methodology
3.1 Descriptive Statistics
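[Figure 1: Ride start time distributions for each passenger type. Panels: (a) Weekdays, (b) Saturdays, (c) Sundays.]
[Figure 2: Bus and rail usage for each passenger type. Panels: (a) Weekdays, (b) Saturdays, (c) Sundays.]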
We first look at some descriptive statistics that can be derived from the dataset. In Fig. 1, we show the temporal travel demand statistics via the ride start time distributions for each passenger type. A typical weekday travel demand curve in the literature has two distinct peaks that correspond to the AM and PM peak hours, when people go to work/school (in the morning) and when they go home (in the afternoon) Sun2012 ; Legara2015 . However, when we discriminate across passenger types, we see three distinct curves (Fig. 1a). The curve for the passenger type “Adult” (A-curve) is the same as the usual travel demand curves presented in the literature. For the travel demand curve of children/students (C-curve), however, there is only one sharp peak, found in the morning; in the afternoon, the curve plateaus. This suggests that children/students have varying end-of-school times, spread from 1300 hours to 1800 hours, as there are students who only go to class in the morning. Finally, the travel demand curve for the elderly (senior citizen, S-curve) does not reveal a peak, which implies that seniors typically do not have a “universal” schedule. These three demand curves give a hint on how to set up the different travel features for classifying passenger types. We probe these curves in greater detail in the Results and Discussion section below.
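A ride start time distribution of the kind described above can be computed with a simple hourly count. The sketch below is an assumption-laden illustration: the function name and the toy start hours are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def start_time_distribution(start_hours, first_hour=4, last_hour=23):
    """Return the fraction of rides that start in each clock hour."""
    counts = Counter(h for h in start_hours if first_hour <= h <= last_hour)
    total = sum(counts.values())
    return {h: (counts.get(h, 0) / total if total else 0.0)
            for h in range(first_hour, last_hour + 1)}

# Hypothetical ride start hours for two passenger types (illustrative only).
rides = {
    "Child/Student": [6, 6, 6, 7, 13, 15, 17],
    "Senior Citizen": [9, 10, 11, 11, 14, 16],
}
curves = {ptype: start_time_distribution(hours) for ptype, hours in rides.items()}
# Hour at which each toy curve peaks
peak_hour = {ptype: max(curve, key=curve.get) for ptype, curve in curves.items()}
```

Plotting each entry of `curves` against the hour of day would give one demand curve per passenger type, analogous to the A-, C-, and S-curves discussed in the text.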
We also look at how the two modes of Singapore public transport, bus and train, are utilized across the period under study for the three passenger types (see Figure 2). The bar plots show that, in general, bus usage dominates rail usage. This is more pronounced for the elderly, for whom bus trips account for most of the total trips.
3.2 Eigentravel Matrices
Building on what has been established in the previous section, we construct a unique eigentravel matrix $E$ for each agent to characterize an individual’s travelprint. $E$ captures, at the minimum, the observed differences in travel demand (or ride times) of each passenger type and their preferred modes of transport.
$E$ is a two-dimensional matrix. The forty-two (42) rows correspond to three 14-week partitions from the three-month data. The first fourteen rows aim to capture the travel patterns on weekdays, while the second and third fourteen-week slabs correspond to Saturdays and Sundays, respectively. In this study, only trips between 0400 and 2359 hours of each day are captured; this is depicted in the 20 columns that represent each hour of the time period under study. Figure 3 shows a schematic diagram of an individual’s matrix $E$. In Figure 4, we zoom in on one of the forty-two week slices in $E$. The figure is discussed in greater detail below.
Meanwhile, the eigentravel matrix is constructed to quantify not only when an agent is traveling, but also the commuter’s transport mode of choice and his/her durations of travel. Each cell $E_{w,h}$, for week slice $w$ (row) and hour $h$ (column), is given by
$$E_{w,h} = \sum_{j \in J_{w,h}} m_j \, d_{j,w,h} \qquad (1)$$
where $j$ is a journey in the journey set $J_{w,h}$, which is the collection of all journeys that begin in week $w$ at hour $h_s$ and end in the same week and day at hour $h_e$, where $h_s \le h \le h_e$. $d_{j,w,h}$, on the other hand, is the duration of travel (in minutes) of the individual on journey $j$ in week $w$ and hour $h$. If a journey covers two adjacent hours, say the travel was from 0651 hours to 0702 hours, covering hours 0600 ($h = 3$) and 0700 ($h = 4$), respectively, the corresponding travel duration for each hour is counted separately. Finally, to distinguish between using a bus and a rail transit, a multiplier $m_j$ of either 1.0 or 10.0 is introduced.
To illustrate the construction of $E$, consider the travel transactions of a hypothetical agent in Table 1. In Journeys 1 and 3, the agent utilized the bus system; therefore, the bus multiplier applies to the two trips. For Journey 2, on the other hand, the rail multiplier applies since the agent utilized the rail transit system (RTS). Journey 1 crosses two $h$-slices, $h = 3$ and $h = 4$ (see Fig. 4). Note that in this study we start at 0400 hours ($h = 1$); therefore, $h = 3$ and $h = 4$ correspond to 6AM and 7AM, respectively. In Fig. 4, we can see that Journey 1 contributes one duration to $h = 3$ and another to $h = 4$. Consequently, the two corresponding cells of $E$ have nonzero values, each computed as the multiplier times the minutes traveled within that hour. Journey 2 alone determines a third cell, and Journeys 2 and 3 together determine a fourth, where their weighted durations are summed. Actual samples of eigentravel matrices are shown in Figure 5.
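The construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the mapping of bus to 1.0 and rail to 10.0 is assumed (the text only says the multiplier is one of the two), no averaging across weekdays is applied, and the names `minutes_per_hour` and `eigentravel_matrix` are hypothetical.

```python
import numpy as np

FIRST_HOUR, N_HOURS, N_WEEKS = 4, 20, 14            # hours 0400..2359, 14 rows per slab
SLAB = {"weekday": 0, "saturday": 1, "sunday": 2}   # row offset = slab * 14
# ASSUMPTION: which mode receives 1.0 and which receives 10.0 is not stated in the text.
MULTIPLIER = {"bus": 1.0, "rail": 10.0}

def minutes_per_hour(start_min, end_min):
    """Split a journey [start_min, end_min), in minutes after midnight, into {clock_hour: minutes}."""
    out = {}
    t = start_min
    while t < end_min:
        hour = t // 60
        nxt = min((hour + 1) * 60, end_min)   # stop at the hour boundary or the journey end
        out[hour] = out.get(hour, 0) + (nxt - t)
        t = nxt
    return out

def eigentravel_matrix(journeys):
    """journeys: iterable of (week 1..14, day_type, start_min, end_min, mode)."""
    E = np.zeros((3 * N_WEEKS, N_HOURS))
    for week, day_type, start_min, end_min, mode in journeys:
        row = SLAB[day_type] * N_WEEKS + (week - 1)
        for hour, minutes in minutes_per_hour(start_min, end_min).items():
            col = hour - FIRST_HOUR
            if 0 <= col < N_HOURS:            # ignore travel outside 0400..2359
                E[row, col] += MULTIPLIER[mode] * minutes
    return E

# The 0651-to-0702 example from the text, taken here as a bus ride in week 1.
E = eigentravel_matrix([(1, "weekday", 6 * 60 + 51, 7 * 60 + 2, "bus")])
```

In this sketch the 0651 to 0702 ride contributes 9 minutes to the 0600 column and 2 minutes to the 0700 column, which mirrors the per-hour splitting rule stated in the text.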
3.3 Classification
We utilize different supervised machine learning models and perform predictive analytics on the constructed eigentravel matrices. The three best models are (1) a distributed random forest (DRF) model, (2) a gradient boosting method (GBM), and (3) a support vector machine (SVM). These methods are standard advanced classification techniques in machine learning and have demonstrated success in a wide range of systems Etter2012 ; Zhang2015 ; Hastie2009 ; Leshem2007 .
Both DRF and GBM are forward-learning ensemble models made up of multiple basis elements, the decision trees (DT) Zhang2015 ; Friedman2001 ; Click2015 . Each DT in each ensemble provides a “weak” solution to the classification problem at hand. The main difference between DRF and GBM lies in how the two models generate their base models. In the DRF, the individual DTs are generated independently, and the fitting simply averages the predictions of the learners; in the GBM, on the other hand, new learners are spawned through a gradient-descent-based boosting formulation whose objective is to minimize the loss function in every iteration. In spite of this, similar to DRF, the final fit is still a combination of the base models. Finally, SVM is a supervised classification technique in which sample clusters are separated by hyperplanes that give the largest minimum distance from each cluster.
The forward-learning ensemble models DRF and GBM are implemented using the H2O Python module h2o , while the SVM is implemented using scikit-learn scikit , a machine learning Python module. We note that linear models and deep learning methods produce results that are inferior to those of the methods described above.
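For readers who want to try the three model families side by side, the sketch below uses scikit-learn throughout. Note the substitution: scikit-learn's `RandomForestClassifier` and `GradientBoostingClassifier` stand in for H2O's DRF and GBM, and the data are synthetic, so neither the settings nor the resulting accuracies reflect the study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_class, n_features = 100, 40          # toy sizes, not the paper's 840 features
# Synthetic stand-in data: each class has larger values in its own block of 10 features.
X = rng.random((3 * n_per_class, n_features))
y = np.repeat([0, 1, 2], n_per_class)      # 0=Adult, 1=Child/Student, 2=Senior Citizen
for label in range(3):
    X[y == label, label * 10:(label + 1) * 10] += 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(kernel="rbf"),
}
# Fit each model on the training split and record its held-out accuracy.
accuracy = {name: model.fit(X_train, y_train).score(X_test, y_test)
            for name, model in models.items()}
```

Because the toy classes are well separated by design, all three models score highly here; on real eigentravel features the ranking reported in the paper would be the quantity of interest.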
3.4 Features
We reshape each of the eigentravel matrices into one-dimensional arrays whose elements correspond to the features considered in this study. The features contain information on the travel time of the individuals and their preferred mode of transport as described in Section 3.2. The predictor variables are labelled $x_1, x_2, \ldots, x_{840}$, where $x_1$ corresponds to the average ride pattern of an individual during weekdays for the first week under consideration at 0400 hours. $x_2$, on the other hand, is for the same week averaged across weekdays at 0500 hours. Finally, $x_{380}$ is the 5th Saturday in the data set in the last hour slot (2300 to 2359 hours). The response variable is the passenger type.
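The reshaping step can be illustrated with NumPy. The sketch assumes row-major flattening, which is consistent with the first two features being consecutive hours of the same week; the helper `feature_index` and the placeholder matrix are hypothetical.

```python
import numpy as np

N_ROWS, N_COLS, FIRST_HOUR = 42, 20, 4

def feature_index(row, hour):
    """1-based feature number for a 1-based matrix row and a clock hour (4..23), row-major."""
    return (row - 1) * N_COLS + (hour - FIRST_HOUR) + 1

# Placeholder matrix whose entries simply count cells, so positions are easy to check.
E = np.arange(N_ROWS * N_COLS, dtype=float).reshape(N_ROWS, N_COLS)
x = E.reshape(-1)   # row-major flattening into 840 features

# Row 19 is the 5th Saturday (14 weekday rows + 5); hour 23 is the last column.
fifth_saturday_last_hour = feature_index(19, 23)
```

Under this ordering, week 1 at 0400 hours maps to feature 1, week 1 at 0500 hours to feature 2, and the 5th Saturday in the last hour slot to feature 380.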
4 Results and Discussion
We evaluate each model by comparing its accuracy rate against the proportional chance criterion (PCC), a common yardstick for judging the success of a classifier relative to a random chance prediction Legara2011 . The PCC is calculated by summing the squared proportions of each group represented in the sample. As a rule of thumb, a successful model, indicative of a significant predictive score, should have an accuracy of at least 1.25 times the PCC Hair2009 . With three equally sized groups, the PCC is $3 \times (1/3)^2 \approx 33\%$; accordingly, our objective is to have an accuracy of at least 41%.
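The PCC arithmetic is short enough to check directly. The sketch below assumes the equal three-way split described in the Data section and takes the 1.25 factor as the commonly cited rule of thumb; the function name is hypothetical.

```python
def proportional_chance_criterion(group_sizes):
    """Sum of squared group proportions."""
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)

# Three equally sized passenger groups, as in the sampled population.
pcc = proportional_chance_criterion([10_000, 10_000, 10_000])
# ASSUMPTION: the success threshold is 1.25 times the PCC.
threshold = 1.25 * pcc
```

For a balanced three-class sample this gives a PCC of one third and a threshold of roughly 0.417, in line with the 41% minimum accuracy quoted in the paper.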
Among the three models, GBM resulted in the highest prediction accuracy of 76%, which is 84% better than the minimum required model accuracy (41%) derived from the PCC. DRF and SVM gave 72% and 64% accuracy rates, respectively. The deep learning method we performed, with layers sampled from 1 to 200 and hidden nodes from 100 to 600, only reached a maximum of 64%, while the linear methods also gave inferior results.
Focusing on both GBM and DRF, which resulted in accuracy rates greater than 70%, we provide heatmaps of the scaled variable importances (see Figure 6). What is apparent in the figure is that most of the variables associated with the trips made during weekdays (boxed slabs) dominate the rest of the features; that is, the predictor variables corresponding to weekend travels do not contribute significantly to identifying passenger clusters. This finding concurs with the travel demand curves shown in Fig. 1, where the overall profiles of the curves for the three types for both Saturdays (Fig. 1b) and Sundays (Fig. 1c) are structurally similar.
In Fig. 7, we zoom in on the weekdays of the fourteen weeks by taking the average of the scaled variable importances of features that represent the same hour of the weekdays. In the plot, for the GBM, the leading variables are those identified with the hours: 0600, 1100, 1400, and 1500; for the DRF, the prominent variables are those at 0800, 1100, 1400, and 1500.
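The hour-wise averaging described above can be sketched as follows. This assumes the 840 importances are ordered row by row as in the flattened 42 by 20 matrix, with the first 14 rows being the weekday slab; the importance values are made up for illustration.

```python
import numpy as np

N_WEEKS, N_HOURS, FIRST_HOUR = 14, 20, 4

def weekday_hourly_importance(importances):
    """Average the scaled importances of weekday features that share the same hour."""
    grid = np.asarray(importances, dtype=float).reshape(3 * N_WEEKS, N_HOURS)
    return grid[:N_WEEKS].mean(axis=0)    # rows 0..13 are the weekday slab

# Hypothetical importances: zero everywhere except the 1100-hour weekday features.
imp = np.zeros(3 * N_WEEKS * N_HOURS)
imp.reshape(3 * N_WEEKS, N_HOURS)[:N_WEEKS, 11 - FIRST_HOUR] = 1.0
hourly = weekday_hourly_importance(imp)
top_hour = FIRST_HOUR + int(np.argmax(hourly))
```

Sorting `hourly` in descending order would give the ranking of hours that the paragraph reports for each model.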
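Table 2: Sample confusion matrix from the GBM (rows: actual passenger type; columns: predicted passenger type).
Actual   Adult   Child   Senior   Error    Rate
Adult    3,637   392     971      27.3%    1,363 / 5,000
Child    497     4,223   280      15.5%    777 / 5,000
Senior   1,175   379     3,444    31.1%    1,554 / 4,998
Total    5,309   4,994   4,695    24.6%    3,694 / 14,998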
The variable importance values may be explained by looking at Fig. 1 and Table 2, which shows a sample confusion matrix resulting from implementing the GBM. Table 2 establishes that predicting children and/or students gives the highest accuracy, with an error rate of only about 15.5%, compared to the adults and senior citizens, whose error rates are 27.3% and 31.1%, respectively. In addition, most of the misclassifications are between the adults and senior citizens; therefore, we reckon that predictor variables that maximize the dissimilarity between the adults and senior citizens will play more significant roles in the models.
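The per-class error rates can be recomputed from the counts in Table 2. The sketch assumes that rows are actual types and columns are predicted types, which is consistent with each row summing to about 5,000; the variable names are illustrative.

```python
labels = ["Adult", "Child", "Senior"]
# Rows: actual type; columns: predicted type (counts copied from Table 2).
confusion = [
    [3637, 392, 971],
    [497, 4223, 280],
    [1175, 379, 3444],
]

def error_rates(matrix):
    """Fraction of each row that falls off the diagonal."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]

per_class = dict(zip(labels, error_rates(confusion)))
correct = sum(confusion[i][i] for i in range(3))
total = sum(map(sum, confusion))
overall_error = 1 - correct / total
# Share of all misclassifications that are adult/senior confusions
adult_senior_share = (confusion[0][2] + confusion[2][0]) / (total - correct)
```

The last quantity shows that adult and senior confusions make up more than half of all errors in this sample, which supports the remark that separating those two groups matters most.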
We now discuss what insights we can derive from the travel patterns of the commuters, focusing on the sets of the most relevant predictor variables. For ease of discussion, we introduce $H_h$ to represent the set of 14 weekday predictor variables that fall under a given hour index $h$. To recap, from Section 3.2, $H_1$ refers to 0400 hours, $H_2$ to 0500 hours, and $H_{19}$ to 2200 hours. To illustrate further, $H_1$ is the set of 14 (weekday) variables that fall within the first hour of each weekday considered. From the variable importance results for the GBM and DRF, we focus on the sets $\{H_3, H_8, H_{11}, H_{12}\}$ and $\{H_5, H_8, H_{11}, H_{12}\}$, respectively; these sets refer to variables under the following time frames: 0600, 1100, 1400, and 1500 hours for the GBM and 0800, 1100, 1400, and 1500 hours for the DRF.
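The mapping between clock hours, hour indices, and the 14 weekday variables per hour can be made explicit. The sketch below assumes 1-based hour indices starting at 0400 hours and row-major feature numbering over the 20 hourly columns; both helper names are hypothetical.

```python
N_WEEKS, N_HOURS, FIRST_HOUR = 14, 20, 4

def hour_index(clock_hour):
    """1-based hour index: 0400 hours -> 1, 0500 hours -> 2, and so on."""
    return clock_hour - FIRST_HOUR + 1

def weekday_feature_set(h):
    """1-based feature numbers of the 14 weekday variables under hour index h."""
    return [week * N_HOURS + h for week in range(N_WEEKS)]

# Leading hours reported for each model, converted to hour indices.
gbm_hours = [hour_index(c) for c in (6, 11, 14, 15)]
drf_hours = [hour_index(c) for c in (8, 11, 14, 15)]
first_hour_set = weekday_feature_set(1)
```

With these conventions the GBM hours map to indices 3, 8, 11, and 12, and the DRF hours to 5, 8, 11, and 12.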
In Section 3.1, the general profiles of the different travel demand curves in Fig. 1 for the adults (A-curve), the children/students (C-curve), and the senior citizens (S-curve) are discussed. We look into specific segments of the curves guided by the two sets of leading predictors identified for the GBM and the DRF. Note that each variable set in either of the two isolates one particular curve from the rest of the curves. This is intuitive since the best predictors maximize the dissimilarity between curves.
First, we take a look at the weekday predictor set $H_3$ (0600 hours), where the C-curve is at its highest and narrowest (also when compared against the two other curves). At this hour, almost all child commuters are on their way to school. The narrowness of the C-curve peak implies that school start times are highly likely the same across the city-state and that they are more rigid than adult working hours; the A-curve within the same time frame is wider. In addition, at 0600 hours, the C-curve is isolated from the intersecting A- and S-curves. Similar dynamics are surmised for $H_5$ (0800 hours); however, by then the C-curve peak has dropped to lower travel demand levels.
Second, we focus on $H_8$ at 1100 hours. At 1100 hours, both the students/children and the working adult population are in their schools/offices; that is, they are not traveling. This may explain why the A- and C-curves overlap at that hour, isolating the S-curve. In addition, notice that in Fig. 1a, the S-curve has no prominent peak, unlike the A-curve (2 peaks) and the C-curve (1 peak). This is not surprising since most senior citizens do not follow “regular” adult working hours (although some may still do, as depicted by the two shallow “bumps” around the same region where the A-curve peaks). It can be said that, by and large, the elderly do not have a “universal” schedule unlike the working adults, and that during “working hours”, when the students and working adults are at work or in school, more elderly commuters are traveling.
Finally, for $H_{11}$ and $H_{12}$, at 1400 to 1500 hours, the A-curve is left at the lower levels of the travel demand and is isolated from the C- and S-curves. This is a particularly interesting trend for the student/child population, as it reveals that most students are only in school for half a day and that they have varying end-of-school times. This is manifested in the C-curve, where no second peak is observed as it starts to plateau from 1300 to 1800 hours. In addition, from 1400 to 1500 hours, the travel demand curve implies that most working adults are still in their offices. The insights presented here are summarized in Table 3.
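Table 3
Predictor      Hour of Day        Isolated Curve   Remarks
H_3, H_5       0600, 0800 hours   C-curve          Children/students travel to school within a narrow morning window.
H_8            1100 hours         S-curve          Adults and students are at work or in school; more elderly commuters are traveling.
H_11, H_12     1400, 1500 hours   A-curve          Most working adults are still in their offices; students have varying end-of-school times.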
5 Conclusion
A sufficient knowledge of the demographics of a commuting public is essential in formulating and implementing more targeted transportation policies, since different schemes can affect different commuter types in different ways. In this work, using data taken from Singapore’s automated fare collection (AFC) system, we showed that commuters exhibit varying travel patterns that can be used to categorize passengers into three general types: adult, child/student, and senior citizen. We first established a method to construct distinct commuter matrices, which we referred to as eigentravel matrices, that capture the characteristic travel routines of individuals by taking into account their times of travel in the day, durations of travel, and preferred modes of transport. We then performed a multivariate analysis (840 feature variables) on the eigentravel matrices using three supervised machine learning models: gradient boosting method (GBM), distributed random forest (DRF), and support vector machine (SVM). GBM gave the best prediction accuracy of 76%. Furthermore, a variable importance analysis showed that features associated with weekday travels are better predictors than those associated with weekends.
Many cities are already using AFC systems, and some metropolitan areas in the developing world are already transitioning to such technology. However, not all AFC systems provide passenger type information like the dataset in this study does. Nevertheless, with the approach presented, urban planners now have a way to determine passenger types by looking at the “natural tendencies” of public transport commuters, through eigentravel matrices, in a noninvasive manner.
The technique demonstrated allows transport planners to formulate more targeted transportation policies and schemes. The framework is not only useful to urban and transport planners; the field of marketing research may also find this work relevant and beneficial. With adequate awareness of which passenger types dominate the travel demand at specific times of day, ad agencies (and survey firms) can create and put up more focused advertisements, service announcements, and surveys, helping stakeholders to properly channel their resources. Finally, from the perspective of modeling and simulations, the categorization presented can be useful in generating synthetic populations for use as inputs to computational models (e.g., agent-based models) to accurately capture the revealed travel signatures for each commuter category.
Acknowledgement
We would like to thank the Land Transport Authority of Singapore for the ticketing data used in this work, Nasri bin Othman for his assistance in preparing the datasets, and Hu Nan for his valuable feedback on the work. This research is supported by Singapore ASERC Complex Systems Programme research grant (1224504056).
References
 (1) X. Ma, Y.-J. Wu, Y. Wang, F. Chen, J. Liu, Mining smart card data for transit riders’ travel patterns, Transportation Research Part C: Emerging Technologies 36 (2013) 1–12.
 (2) S. G. Lee, M. D. Hickman, Travel pattern analysis using smart card data of regular users, in: Proceedings of the 90th Transportation Research Board Annual Meeting, 11-4258, 2011.
 (3) L. Sun, D.-H. Lee, A. Erath, X. Huang, Using smart card data to extract passenger’s spatiotemporal density and train’s trajectory of MRT system, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.
 (4) A. Chakirov, A. Erath, Use of public transport smart card fare payment data for travel behaviour analysis in Singapore, in: 16th International Conference of Hong Kong Society for Transportation Studies, Hong Kong, 2011.
 (5) E. F. Legara, K. K. Lee, G. G. Hung, C. Monterola, Mechanism-based model of a mass rapid transit system: A perspective, in: International Journal of Modern Physics Conference, no. 1560011 in 36, 2015.
 (6) O. Järv, R. Ahas, F. Witlox, Understanding monthly variability in human activity spaces: A twelve-month study using mobile phone call detail records, Transportation Research Part C: Emerging Technologies 38 (2014) 122–135.
 (7) T. Kusakabe, Y. Asakura, Behavioural data mining of transit smart card data: A data fusion approach, Transportation Research Part C: Emerging Technologies 46 (2014) 179–191.
 (8) M.-P. Pelletier, M. Trépanier, C. Morency, Smart card data use in public transit: A literature review, Transportation Research Part C: Emerging Technologies 19 (4) (2011) 557–568.
 (9) K. G. Goulias, Longitudinal analysis of activity and travel pattern dynamics using generalized mixed Markov latent class models, Transportation Research Part B: Methodological 33 (8) (1999) 535–558.
 (10) N. Nassir, M. Hickman, Z. Ma, Activity detection and transfer identification for public transit fare card data, Transportation 42 (2015) 683–705.
 (11) S. G. Lee, M. D. Hickman, Trip purpose inference using automated fare collection data, Public Transport 6 (1-2) (2014) 1–20.
 (12) K. K. A. Chu, R. Chapleau, Enriching archived smart card transaction data for transit demand modeling, in: Transportation Research Record: Journal of the Transportation Research Board, no. 2063, 2008, pp. 63–72.
 (13) M. Utsunomiya, J. Attanucci, N. H. Wilson, Potential uses of transit smart card registration and transaction data to improve transit planning, in: Transportation Research Record: Journal of the Transportation Research Board, no. 1971, 2006, pp. 119–126.
 (14) S. A. O. Medina, A. Erath, Estimating dynamic workplace capacities by means of public transport smart card data and household travel survey in Singapore, in: Transportation Research Record: Journal of the Transportation Research Board, Vol. 2344, 2014, pp. 20–30.
 (15) N. bin Othman, E. F. Legara, V. Selvam, C. Monterola, Simulating congestion dynamics of train rapid transit using smart card data, in: Procedia Computer Science, Vol. 29, 2014, pp. 1610–1620.
 (16) M. A. Ortega-Tong, Classification of London’s public transport users using smart card data, Master’s thesis, Massachusetts Institute of Technology (June 2013).
 (17) V. Etter, M. Kafsi, E. Kazemi, Been there, done that: What your mobility traces reveal about your behavior, in: International Conference on Pervasive Computing: Mobile Data Challenge, 2012.
 (18) Y. Zhang, A. Haghani, A gradient boosting method to improve travel time prediction, Transportation Research Part C: Emerging Technologies.
 (19) T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, Springer, 2009.
 (20) G. Leshem, Y. Ritov, Traffic flow prediction using AdaBoost algorithm with random forests as a weak learner, International Journal of Intelligent Technology 2 (2) (2007) 111–116.
 (21) J. H. Friedman, Greedy function approximation: A gradient boosting machine, The Annals of Statistics 29 (5) (2001) 1189–1232.
 (22) C. Click, J. Lanford, M. Malohlava, V. Parmar, Gradient Boosted Models with H2O’s R Package, 2nd Edition, H2O.ai, Inc., 2307 Leghorn Street Mountain View, CA 94043, 2015.
 (23) H. C. Team [online] (2015) [cited 20 August 2015]. [link].
 (24) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
 (25) E. F. Legara, C. Abundo, C. Monterola, Ranking of predictor variables based on effectsize criterion provides an accurate means of automatically classifying opinion column articles, Physica A: Statistical Mechanics and Its Applications 390 (1) (2011) 110–119.
 (26) J. F. Hair Jr., W. C. Black, B. J. Babin, R. E. Anderson, Multivariate Data Analysis, 7th Edition, Prentice Hall, 2009.