Estimation of Passenger Route Choice Pattern
Using Smart Card Data for Complex Metro Systems
Abstract
Nowadays, metro systems play an important role in meeting the urban transportation demand in large cities. Understanding passenger route choice is critical for public transit management. The wide deployment of Automated Fare Collection (AFC) systems opens up a new opportunity. However, only each trip's tap-in and tap-out timestamps and stations can be obtained directly from AFC system records; the train and route chosen by a passenger, which are necessary to solve our problem, are unknown. While existing methods work well in some specific situations, they do not work for complicated situations. In this paper, we propose a solution that needs no additional equipment or human involvement beyond the AFC systems. We develop a probabilistic model that can estimate, from empirical analysis, how the passenger flows are distributed among different routes and trains. We validate our approach using a large-scale data set collected from the Shenzhen metro system. The measured results provide us with useful inputs when building the passenger path choice model.
I Introduction
Nowadays, metro systems play an important role in meeting the urban transportation demand in large cities. Due to its fast speed, high efficiency, large capacity and punctuality, the urban metro has become the first choice of many people. In Shenzhen, China, in mid-June 2015, there were around 3.5 million metro trips every day, which was around one third of the total public traffic. Fig. 1 illustrates the metro operating map of Shenzhen. With further expansion of the metro system, the number of passengers may increase rapidly. On one hand, the increasing usage of metros can effectively help reduce the traffic pressure on surface roads. On the other hand, it also brings a dramatic increase in passenger demand on metro systems.
The traffic patterns of large metro systems are usually very complex. Under the condition of network operation and seamless transfer in current metro systems, the train and route chosen by a passenger are unknown. It is common to have more than one route between the origin station and the destination station, a.k.a. multipath in transportation systems. As shown in Figure 2(a), there are two routes from station O to station D. This means that for an OD pair with more than one route, we do not know how passengers are distributed over these routes and trains.
This missing information at a fine granularity could be important for both passengers and metro operators. From the operators' point of view, understanding the flow distribution of passengers in the whole metro network is important for improving the service reliability. Potential applications include a mobile trip-planning application for metro passengers, a monitoring system for metro operators, and a route suggestion and emergency management system for urban administrators. This paper aims to develop a solution to calculate the probability of each route being chosen for an OD pair, which can be used to estimate the passenger flow at the granularity of the trains of each line, as shown in Figure 2(d).
Traditional approaches are not scalable. To understand passengers' route choice behavior, one traditional method is to conduct field surveys at train stations, asking passengers which route they will take to reach their destinations. This method has limitations: firstly, most surveys focus on a subset of the passengers at particular locations within a limited time window, hence the results are often limited in diversity, scale and accuracy; secondly, conducting such surveys is both labor-intensive and time-consuming.
The wide deployment of Automated Fare Collection (AFC) systems opens up a new opportunity for metro network analysis: the transaction records from AFC can reveal the Origin (O) and the Destination (D) of every passenger's trip, as passengers are required to tap their smart cards or RFID-based tickets each time they enter the O station or exit the D station. Passenger flows can be coarsely described by OD (origin-destination) pairs. However, AFC records do not expose the passengers' routes directly. Even when the route of an OD pair is unique, the AFC records still cannot show which train a passenger takes. Many factors can affect a passenger's final plan, i.e., the train or train combination that he/she takes. For example, if a train does not have enough capacity to accommodate all passengers waiting on a platform, some passengers have to wait for another train. This phenomenon, known as “travelers left behind”, is quite common during rush hours or at large stations. There are already some studies using transaction records from AFC to understand the passengers' route and train choice behavior [1, 2]. Although these methods work well in some specific situations, they do not work for complicated situations, such as the case where there are various “left-behinds” at different stations caused by the imbalanced geographical distribution of passengers. Also, the walking time between the fare gate and the platform, and the walking time for transfer between platforms, usually cannot be ignored.
In this paper, we propose a solution that needs no additional input beyond the train operating timetable and the AFC records. By matching a passenger's smart card records with the train operating timetable, the routes that he/she might have chosen can be narrowed down. We develop a probabilistic model that can empirically estimate how the passenger flows are distributed among different routes and trains. As a concrete example in Figure 2(b)(c), if a passenger taps in at station O at one time point and taps out at station D at a later time point, both Route 1 and Route 2 remain possible choices after we narrow down the possible plans based on the timetable. Our solution answers with what probability each route is chosen by the passengers. For a route like Route 1, which further has multiple possible train combinations that satisfy the timetable constraints, we further derive the probability that passengers choose each of these plans.
The contributions of this paper include:

We define two kinds of time-dependent multinomial distributions of the number of trains waited for by passengers. The first is the number of trains that a passenger waits for at his/her origin station. The other is the number of trains that a passenger waits for when he/she transits at a transfer station. A set of algorithms is proposed to calculate the parameters of the two distributions.

We further propose a probabilistic model that can estimate how the passenger flows are distributed among different routes and trains.

We then deploy the algorithms on a cloud platform and develop supporting modules for the system level solution.

Finally, we validate our approach using a large-scale data set collected from the Shenzhen metro system. The measured results provide us with useful inputs when we build the passenger path choice model.
For the rest of this paper, we discuss the related work in Section II. The overview of this study is given in Section III. Section IV discusses the solution in detail. We present the system design and the algorithm implementation on a cloud platform in Section V. Section VI presents the experimental studies. Finally, Section VII concludes the paper.
II Related Work
Building a route choice model for users is an important research direction in the field of transportation [3], and is the basis for traffic management policy-making. Because it is hard to observe how likely each route is to be chosen for an OD pair with multiple routes, most past studies focus on building route choice models from an empirical perspective. They assume that all passengers have full knowledge of the transportation system when attempting to minimize some objective function, e.g., minimizing their own travel time (user equilibrium) or minimizing the total system travel time (system optimum) [4, 5, 6]. However, those models depend heavily on behavioral assumptions and lack reliable supporting data. Given the dynamic and stochastic nature of transportation systems, the assumption of the passengers' global knowledge is questionable.
Fortunately, the large amount of smart card data collected over a long period provides a great opportunity to analyze passengers' transit behavior and evaluate transit service. There have been a few previous studies on the use of smart card data. The work in [7] considered the potential usage of smart card data for travel. The work in [8] analyzed users' travel behavior using data mining technology, clustering users into four groups according to their temporal travel patterns. Our recent work [9] studied individual passengers' temporal and spatial travel patterns. We found that if a passenger is temporally regular, it is very likely that the passenger is also spatially regular. Besides understanding travel behavior, smart card data have been used to improve public transit services. The study in [10] gave a comprehensive review of the use of smart card data at different levels: strategic, tactical and operational. To improve the resilience of metro systems to service disruptions, the paper [11] investigated a practical problem of integrating localized bus service with the metro network. Using the same data set, three optimization models were formulated to design demand-driven timetables for a single-track metro service [12].
To understand passengers' flow assignment in a metro system, the authors in [1] proposed a method to estimate which trains every smart card holder boarded during his/her journey. This method could be used to estimate trains' occupancies. However, it was also based on some assumptions that may hold only in limited scenarios: (a) The methodology assumed that all passengers know the train timetable beforehand. When choosing a route, they first choose the plan with the minimum total waiting time, then the plan with fewer transits. The remaining small percentage of undecided trips were assumed to be assigned to all possible plans with equal probability. (b) The walking time between the fare gates and the platform, and the walking time for transfer between platforms, are ignored, which may lead to a mismatch between passengers' tap times and trains' operation times.
The authors in [13] proposed a linear regression model to analyze the individual trajectory during a metro trip, which could be used to estimate the spatial-temporal density inside a metro system. However, their model focused on a single-track scenario, which is oversimplified. The study in [14] used a clustering algorithm to infer the route-use patterns of metro passengers from smart card data. It confirmed that a Gaussian mixture model worked well in finding the route shares and the mean and variance of travel times for each route of the London Underground. But the conclusion is based on two preconditions. First, the number of routes used by passengers must be known in advance. Second, the probability distribution of the travel time of each route must be Gaussian. The study in [2] developed an integrated Bayesian approach to infer both network attributes and passenger route choice behavior in a complex metro system, which worked well in cases where the train timetable is unavailable. But a large set of explanatory variables and the probability distributions of these variables need to be calibrated; for example, it assumed that all link costs are characterized by independent normal distributions. This is not always true. Taking the “left-behind” phenomenon as an example, the imbalanced geographical distribution of passengers leads to various “left-behinds” at different stations, so that the time cost at a station does not follow a normal distribution across the different numbers of trains that need to be waited for. There is other related work on big-data-based analysis for smart transportation systems [15, 16, 17, 18], but it does not target metro systems.
In sum, the existing methods did not consider the passengers' “left-behind” in detail, although it is one of the main factors affecting our understanding of passengers' path choice behavior. In this paper, we propose a novel approach to calculate the probability of each route being used for an OD pair with multiple routes. It can be applied to the complicated traffic situations of complex metro networks, especially those where “left-behind” occurs often.
III Overview
III-A Dataset
There are two types of data used in this study: smart card transaction data and train operation data. A smart card transaction record is reported when a passenger passes through the entrance or exit gate by tapping a smart card. It includes the following fields: the unique identifier of the smart card, the metro station, the transaction time, and the event type (enter or exit). A train operation record is collected when a train arrives at or departs from a station. It includes the following fields: the train sequence, the metro line, the metro station, the transaction time, and the event type (arrive or leave).
For a passenger's trip, we can observe the trip's beginning time, origin, end time, and destination by joining the tap-in and tap-out events together. If the trip needs transfers, we say that the trip has multiple parts. The first part runs from the passenger entering the metro system to his/her getting off the first boarded train. The last part runs from the passenger getting off the second-to-last train to his/her exiting the metro system. If the passenger does not need to transfer, then the first part of the trip is also the last part.
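The join of tap-in and tap-out events described above can be sketched as follows. This is only an illustration: the field names (`card_id`, `station`, `time`, `kind`), the use of seconds as the time unit, and the rule of pairing each tap-in with the next tap-out of the same card are assumptions made for the example, not details given in the text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Tap:
    card_id: str   # unique identifier of the smart card
    station: str   # metro station where the tap happened
    time: int      # transaction time in seconds (assumed unit)
    kind: str      # "enter" or "exit"


@dataclass(frozen=True)
class Trip:
    card_id: str
    origin: str
    t_in: int
    destination: str
    t_out: int


def join_taps(taps: Iterable[Tap]) -> List[Trip]:
    """Pair each tap-in with the next tap-out of the same card."""
    trips: List[Trip] = []
    pending: Dict[str, Tap] = {}  # last unmatched tap-in per card
    for tap in sorted(taps, key=lambda t: (t.card_id, t.time)):
        if tap.kind == "enter":
            # A repeated tap-in replaces the earlier unmatched one.
            pending[tap.card_id] = tap
        elif tap.kind == "exit" and tap.card_id in pending:
            start = pending.pop(tap.card_id)
            trips.append(
                Trip(tap.card_id, start.station, start.time, tap.station, tap.time)
            )
        # A tap-out without a preceding tap-in is dropped as errant data.
    return trips
```

Unmatched events are simply discarded here; a production pipeline would more likely log them during the data cleaning step.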
III-B Notations and Assumptions
Consider the set of effective routes of an OD pair. For simplicity, we divide one day into fixed time slots of equal length, which we set to half an hour, so that one day is split into a sequence of half-hour slots. We assume that the probability of each route being chosen is stable within each time slot. We further define that the route chosen in a specific time slot obeys a multinomial distribution whose parameter assigns one probability to each effective route, with the probabilities summing to one. Given the train operation table and the set of passenger trips that begin in a given time slot, we aim to calculate this parameter.
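As a small illustration of the half-hour slotting, the sketch below maps a tap-in time to a slot index and groups trips by OD pair and slot. The tuple layout `(origin, destination, t_in, t_out)` and the use of seconds since midnight are assumptions made for the example.

```python
from collections import defaultdict

SLOT_SECONDS = 30 * 60  # half-hour slots, as stated in the text


def slot_index(t_in_seconds):
    """Index of the half-hour slot containing a tap-in time (seconds since midnight)."""
    return t_in_seconds // SLOT_SECONDS


def group_by_od_and_slot(trips):
    """Group (origin, destination, t_in, t_out) tuples by OD pair and tap-in slot."""
    groups = defaultdict(list)
    for origin, destination, t_in, t_out in trips:
        groups[(origin, destination, slot_index(t_in))].append((t_in, t_out))
    return dict(groups)
```

Each resulting group is the set of trips over which one set of route-choice probabilities would be estimated.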
For simplicity, we further present some assumptions and notations that will be used in this paper.

We assume that the time that most passengers spend walking between the platform and the entrance/exit gate of the origin or destination station is less than the departure interval between two adjacent trains. This assumption is reasonable because, for most metro systems, the distance between the gate and the platform is not far.

We assume that most passengers will exit the metro station through the exit gate as soon as possible after getting off the train that reaches their destination.
Based on the two assumptions, we can infer that, given a passenger's trip and the route that he/she chooses, the train that he/she boards in the last part of the trip can be determined uniquely.
Tapin: the passengers who enter the metro system in a given time slot at a given station and choose a given metro line.
Transfer: the passengers who transfer from one metro line to another at a given transit station in a given time slot.
To calculate the route-choice probabilities of an OD pair, all effective routes are needed first. Then, given all effective routes, all trips of the OD pair that start in a given time slot, and the train operation table, we use the maximum likelihood function in Equation (1) to calculate these probabilities. The likelihood is built from the probability that a passenger passes through the exit gate at the observed time, conditioned on the given data and on the route chosen.
(1) 
In practice, the time cost of a trip has a certain relation with the train operation data. So, given a passenger's trip, we can find all possible plans (a train or a combination of trains) that the passenger may have chosen for a route by matching the two types of data. The probability in Equation (1) can therefore be calculated by summing up the probabilities of all plans.
In order to get the probability of each plan being chosen, we first transform the train that a passenger may board into the number of trains that need to be waited for. Then we define that the number of trains waited for by the passengers of a Tapin group obeys one multinomial distribution, and the number of trains waited for by the passengers of a Transfer group obeys another multinomial distribution, each with its own parameter.
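From the process of a passenger's trip, we can infer that the waiting distribution at the transfer station is affected by the waiting distribution at the origin station. So we calculate the parameter of the latter first, then that of the former. As not all OD pairs have multiple routes, the trips with one route and no transfer can be used to estimate the waiting distribution at the origin station, because the train chosen is unique. Then, considering this distribution as prior knowledge, the trips with one route and some transfers can be used to estimate the waiting distribution at the transfer station. Finally, considering both distributions as prior knowledge, the route-choice probabilities can be estimated by maximizing function (1).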
III-C Framework
The framework is illustrated in Figure 3. The details are given as follows.
For smart card data, we found several kinds of errant data, e.g., missing data, duplicated data and data with logical errors. So, in the data preprocessing step, we conduct a detailed cleaning process to filter out errant data on a daily basis. In the route set generation step, we use the algorithm proposed in [19] to find the k shortest paths of all OD pairs. Then, according to the time cost of passengers in practice, we filter out the routes that passengers have never used. In the trip classification step, according to the number of routes and transfers of their ODs, we classify all trips into four groups: No-transfer-one-route, One-transfer-one-route, Multi-transfer-one-route and Multi-routes. In the possible plan analysis step, we find all possible plans that a passenger may have chosen by matching smart card data with the train operation timetable. The trips in the No-transfer-one-route and One-transfer-one-route groups are used for estimating the parameters of the waiting distributions at the origin station and at the transfer station respectively. Then, considering these two parameters as prior knowledge, we calculate the probability of each route being chosen for an OD pair with multiple routes. Finally, as an application, passenger flows are analyzed.
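The trip classification step can be illustrated with a short function. It is a sketch under the assumption that, for each OD pair, the number of effective routes and the number of transfers on the single route (when there is only one) are already known; the function and argument names are hypothetical.

```python
def classify_trip(num_routes: int, num_transfers: int) -> str:
    """Assign a trip to one of the four groups used by the framework.

    num_routes:    number of effective routes of the trip's OD pair
    num_transfers: number of transfers on the route (only used when
                   the OD pair has a single effective route)
    """
    if num_routes < 1 or num_transfers < 0:
        raise ValueError("an OD pair needs at least one route and a non-negative transfer count")
    if num_routes > 1:
        return "Multi-routes"
    if num_transfers == 0:
        return "No-transfer-one-route"
    if num_transfers == 1:
        return "One-transfer-one-route"
    return "Multi-transfer-one-route"
```

Only the first two single-route groups feed the estimation of the waiting distributions; the Multi-routes group is the one whose route probabilities are ultimately estimated.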
IV Methodology
IV-A Finding all effective routes for an OD pair
In this subsection, we use two steps to find all effective routes for an OD pair. The first step is to find all possible routes for the OD pair. The second step is to filter out, from these possible routes, those that have never been used by passengers. We use the algorithm proposed in [19] to find the k shortest routes efficiently, with a running time that depends on the numbers of vertices and edges of the digraph. We define the cost of a route as the maximum time cost that contains the minimum of walking time and running time of trains. The value of k is determined according to the accessibility and complexity of the metro system. In practice, not all of the routes of an OD pair are used by passengers. In order to filter out the routes that passengers never choose, we sort all trips of an OD pair over two months by time cost. We then discard a top share of the trips with the largest time cost. Although most passengers do not linger too long inside the metro system, some passengers show abnormal travelling behavior, such as beggars and express couriers. Their time cost and travel plan choice may be anomalous. Our recent work [9] found that a reasonable value of this filtering parameter is 2, which can filter out the abnormal passengers with high accuracy. We then take the largest time cost of the remaining trips. Finally, we filter out, from all possible routes, those whose time cost is longer than this largest time cost. The rest are effective routes. In this paper, unless explicitly stated otherwise, a route refers to an effective one.
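The second filtering step can be sketched as below. The function name, the dictionary layout of `route_costs`, and the expression of the filtering parameter as a fraction (`drop_fraction`) are assumptions made for illustration; the candidate routes are assumed to come from a separate k-shortest-path computation.

```python
def effective_routes(trip_durations, route_costs, drop_fraction):
    """Keep the candidate routes that observed trips could plausibly have used.

    trip_durations: observed time costs of all trips of one OD pair
    route_costs:    dict mapping a route id to that route's time cost
    drop_fraction:  share of the longest trips discarded as abnormal,
                    in [0, 1) (hypothetical parameterization)
    """
    if not 0 <= drop_fraction < 1:
        raise ValueError("drop_fraction must be in [0, 1)")
    if not trip_durations:
        return []
    ordered = sorted(trip_durations)
    n_drop = int(len(ordered) * drop_fraction)
    remaining = ordered[: len(ordered) - n_drop]
    largest_cost = remaining[-1]  # largest time cost of the remaining trips
    return [route for route, cost in route_costs.items() if cost <= largest_cost]
```

For instance, with eight observed trips and `drop_fraction=0.25`, the two longest trips are discarded before the cost threshold is taken.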
IV-B Extracting all possible plans chosen by each passenger
In this subsection, given a passenger's smart card data and the train operation data, we extract all possible plans that could have been chosen by the passenger. A general trip of a passenger in the metro system can be depicted as the steps shown in Figure 4: (1) passing through the entrance gate and walking to the platform, (2) waiting on the platform for a train, (3) boarding a train and staying on it until it reaches the passenger's destination, and (4) getting off the train and exiting the metro system. Note that if the passenger needs to transfer, a further step, (5) transit between platforms, takes place before step (4). So the whole trip duration is composed of: entry time (ETT), wait time (WTT), on-train time (OTT), transfer time (TFT), and exit time (EXT).
In this paper, we consider the minimal walking times of ETT, TFT and EXT, which depend on the metro station s and on the metro lines l and l' involved. The method for calculating such minimal walking times has been given in our previous paper [20].
Consider a train of a metro line, with its arrival and departure times at each station. Suppose a passenger enters the metro system at an origin station. For the passenger to be able to board the train, Equation (2)(i) must be satisfied. Likewise, if a passenger exits the metro system at a destination station, for the train to be one that he/she could have ridden before exiting, Equation (2)(ii) must be satisfied. For a passenger who needs to transit at a transfer station, consider the arrival time of a train of one line at that station and the departure time of another train of another line at the same station. For the passenger getting off the first train to be able to board the second train, Equation (2)(iii) must be satisfied.
(2) 
In sum, for a passenger, given each route, we can find all plans that he/she may have chosen during the trip by Equation (2).
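A minimal sketch of the plan extraction follows. It assumes that the three conditions of Equation (2) amount to the checks below: (i) tap-in time plus minimal entry walking time is no later than the train's departure from the origin, (ii) the train's arrival at the destination plus minimal exit walking time is no later than the tap-out time, and (iii) arrival at the transfer station plus minimal transfer walking time is no later than the departure of the connecting train. This reading, the function names and the tuple layouts are all assumptions made for illustration.

```python
def feasible_direct(t_in, t_out, trains, walk_in, walk_out):
    """Candidate trains for a trip without transfer.

    trains: list of (train_id, departure_at_origin, arrival_at_destination)
    """
    return [
        train_id
        for train_id, dep_origin, arr_dest in trains
        if t_in + walk_in <= dep_origin      # assumed condition (i)
        and arr_dest + walk_out <= t_out     # assumed condition (ii)
    ]


def feasible_one_transfer(t_in, t_out, first_leg, second_leg,
                          walk_in, walk_transfer, walk_out):
    """Candidate (first train, second train) plans for a trip with one transfer.

    first_leg:  list of (train_id, departure_at_origin, arrival_at_transfer)
    second_leg: list of (train_id, departure_at_transfer, arrival_at_destination)
    """
    plans = []
    for a_id, dep_origin, arr_transfer in first_leg:
        if t_in + walk_in > dep_origin:                      # violates (i)
            continue
        for b_id, dep_transfer, arr_dest in second_leg:
            if (arr_transfer + walk_transfer <= dep_transfer  # assumed condition (iii)
                    and arr_dest + walk_out <= t_out):        # assumed condition (ii)
                plans.append((a_id, b_id))
    return plans
```

Note that these checks alone only bound the candidate set; the uniqueness of the last train relies additionally on the two assumptions of Section III-B, which would tighten condition (ii) in practice.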
IV-C Solution of the two waiting distributions
In this section, we give the approaches for calculating the parameters of the two waiting distributions. As mentioned above, we define that the number of trains waited for by the passengers of a Tapin group obeys one multinomial distribution, and the number of trains waited for by the passengers of a Transfer group obeys another multinomial distribution, each with its own parameter. We consider the two distributions separately because transfer passengers arrive at the transit station almost simultaneously, whereas the time at which passengers arrive at the origin station is more random. Hence, we first show how to transform a plan chosen by a passenger into the number of trains that the passenger waits for. Then we give an approach to estimate the two parameters using several specific groups of trips.
IV-C1 The number of trains waited for by passengers
Given a train boarded by a passenger of a Tapin group, in order to transform it into the number of trains that the passenger waited for, we divide the passengers of the Tapin group into several subgroups according to the arrival time of trains. Each subgroup consists of the passengers who enter the metro system between the departure times of two successive trains. Suppose that the passengers of a subgroup may board any train in a set of candidate trains, as shown in Figure 5(a), whose size is the maximum number of trains needed to be waited for. Field observations show that the first-come-first-served policy is not applicable in practice. Many factors affect which train a passenger eventually gets on, such as the distance between the gate and the platform, walking speed, and the number of passengers in the waiting queue. Furthermore, a typical train has six to eight cars with multiple doors available for boarding simultaneously. A wise strategy or good luck in choosing train doors could also lead to an earlier boarding. So the train that a passenger eventually gets on is better treated as a random variable in practice. Let each candidate train have a probability of being boarded by these passengers. Then the number of trains needed to be waited for (equivalently, the number of passengers on each of these trains) obeys a multinomial distribution whose probabilities sum to one.
Likewise, given a train boarded by a passenger of a Transfer group, in order to transform it into the number of trains that the passenger waited for, we divide the passengers of the Transfer group into several subgroups according to the arrival time of the trains of the line they transfer from. Each subgroup consists of the passengers who get off the same train of that line. Suppose that the passengers of a subgroup may transfer to any train in a set of candidate trains, as shown in Figure 5(b). For a passenger, the train that he/she may get on is also influenced by many factors, such as the distance between platforms, walking speed, the waiting position for the train, and the number of people in the waiting queue. So we treat the train that a passenger eventually gets on as a random variable. We assume that the probability of the number of trains needed to be waited for is stable within the same time slot. Each candidate train has a probability of being boarded by the passengers of the Transfer group. Thus the number of passengers on these trains obeys a multinomial distribution whose probabilities sum to one.
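The transformation from a boarded train to a number of trains waited for can be sketched as follows. The sketch assumes a sorted list of departure times at the station, counts the first train departing at or after the passenger's entry as "1", and ignores walking time; the function name and these conventions are assumptions made for illustration.

```python
from bisect import bisect_left


def trains_waited(departures, t_enter, boarded_departure):
    """Number of trains a passenger waited for, counting the boarded one.

    departures:        ascending departure times of trains at the station
    t_enter:           time at which the passenger entered (or reached the platform)
    boarded_departure: departure time of the train actually boarded
    """
    first = bisect_left(departures, t_enter)          # first train still catchable
    boarded = bisect_left(departures, boarded_departure)
    if boarded >= len(departures) or departures[boarded] != boarded_departure:
        raise ValueError("boarded train is not in the timetable")
    if boarded < first:
        raise ValueError("boarded train left before the passenger entered")
    return boarded - first + 1
```

A passenger who enters at time 150 and boards the train departing at 400, with departures at 100, 400, 700 and 1000, gets on the first available train; boarding the one departing at 1000 means having waited for three trains.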
IV-C2 Calculating the parameters of the two waiting distributions
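From the process of a passenger's trip, we can infer that the waiting distribution at the transfer station is affected by the waiting distribution at the origin station. So we estimate the latter first, then the former.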
In order to calculate the parameter of the waiting distribution at the origin station, we assume that several specific trips of a Tapin group are representative enough to analyze the distribution of the number of trains needed to be waited for at an origin station. This is a practical choice because it is difficult to ascertain the exact train chosen by every passenger during the first part of his/her trip, especially in complex scenarios with multiple transfers and multiple routes. However, the train chosen for a trip with only one route and no transfer can be inferred. So we classify the trips with only one route and no transfer as group 1. According to our assumptions in Section III-B, given the route chosen, the train chosen in the last part of a trip can be determined uniquely. That means the train boarded in the first part of a trip of group 1 can be inferred, because these trips have only one part during the whole journey. According to our statistics on the share of group 1 trips among all trips, these trips are representative enough [21]. So we can use these trips with only one route and no transfer to estimate this parameter.
Similarly, we assume that several specific trips of a Transfer group are representative enough to analyze the distribution of the number of trains needed to be waited for at a transfer station. A trip with only one route and one transfer has two parts during its journey. We classify these trips as group 2 in this paper. The train boarded in the last part of a trip in group 2 can be inferred uniquely. Though there may be more than one candidate train that the passenger boarded in the first part, the probability of each of these trains is known from the waiting distribution at the origin station. So we can treat that distribution as prior knowledge and use the trips in group 2 to estimate the parameter of the waiting distribution at the transfer station.
So, we divide all trips into four classes according to the number of routes and transfers of their ODs: No-transfer-one-route, One-transfer-one-route, Multi-transfer-one-route and Multi-routes. The passengers in No-transfer-one-route and One-transfer-one-route are used for estimating the parameters of the waiting distributions at the origin station and at the transfer station respectively. Then, using the passengers in Multi-routes and considering these two parameters as prior knowledge, we calculate the probability of each route being chosen for an OD pair with multiple routes in the next subsection.
Consider the passengers of a Tapin group that belong to group 1 (one route, no transfer). Using the approach given in Section IV-B, we obtain the number of trains that each of these passengers waits for. By counting identical values, we obtain the number of passengers who wait for each possible number of trains. We then use maximum-likelihood estimation to obtain the parameter of the waiting distribution at the origin station by Equations (3a) and (3b).
(3a)  
(3b) 
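For a multinomial distribution, the maximum-likelihood estimate of each category probability is the relative frequency of that category. The sketch below applies this standard result to the observed numbers of trains waited for; the function name and the dictionary output format are illustrative choices.

```python
from collections import Counter


def estimate_wait_distribution(waits):
    """Maximum-likelihood estimate of a multinomial waiting distribution.

    waits: observed number of trains waited for, one value per passenger.
    Returns a dict mapping each observed value k to count(k) / total.
    """
    if not waits:
        raise ValueError("at least one observation is required")
    counts = Counter(waits)
    total = len(waits)
    return {k: counts[k] / total for k in sorted(counts)}
```

With observations `[1, 1, 1, 2, 2, 3, 1, 1]`, five of eight passengers boarded the first train, so the estimate is 0.625, 0.25 and 0.125 for one, two and three trains respectively.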
Consider the passengers of a Transfer group that belong to group 2. They may have different origin stations and beginning time slots. For each of these passengers, we consider the first part of his/her trip and the set of plans that he/she may have chosen. The train chosen in the second part of the trip can be obtained. A plan can therefore be represented by the trains boarded in the two parts of the trip, from which we get the numbers of trains needed to be waited for at the origin station and at the transfer station. We use maximum-likelihood estimation to obtain the parameter of the waiting distribution at the transfer station by Equation (4). It is difficult to differentiate the logarithm of a sum of terms in Equation (4a) in order to maximize it. So we first convert it to Equation (4b) by applying Jensen's inequality, and then calculate the parameter by maximizing the right-hand side of Equation (4b).
(4a)  
(4b) 
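One common way to maximize a Jensen lower bound of this kind is an expectation-maximization (EM) style iteration; the text does not state which numerical procedure is used, so the sketch below is an assumed stand-in rather than the authors' algorithm. It takes, for each group 2 passenger, the candidate pairs (trains waited for at the origin, trains waited for at the transfer station) and a known origin-station distribution `alpha`; all names are hypothetical.

```python
def estimate_transfer_distribution(candidates, alpha, max_wait, iters=200):
    """EM-style estimate of the transfer-station waiting distribution.

    candidates: one list per passenger of (i, j) pairs, where i is the number
                of trains waited for at the origin and j at the transfer station
    alpha:      dict i -> probability, the origin-station distribution (prior)
    max_wait:   largest value of j considered
    Returns a list beta where beta[j - 1] estimates P(wait for j trains).
    """
    beta = [1.0 / max_wait] * max_wait  # uniform starting point
    for _ in range(iters):
        totals = [0.0] * max_wait
        for plans in candidates:
            # E-step: weight of each candidate plan under the current beta.
            weights = [alpha.get(i, 0.0) * beta[j - 1] for i, j in plans]
            norm = sum(weights)
            if norm == 0.0:
                continue  # passenger carries no information under this model
            for (_, j), w in zip(plans, weights):
                totals[j - 1] += w / norm
        mass = sum(totals)
        if mass == 0.0:
            break
        # M-step: normalized expected counts maximize the lower bound.
        beta = [t / mass for t in totals]
    return beta
```

When every passenger has exactly one candidate plan, the iteration reduces to the relative-frequency estimate after a single pass.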
IV-D Calculating the probability of each route being chosen for an OD pair
In this subsection, we give an approach to calculate the probability of each route being chosen for an OD pair with multiple effective routes. Consider the set of effective routes from an origin station to a destination station. Each route has a probability of being chosen in a given time slot, and these probabilities sum to one.
Consider the set of passengers who enter the metro system during a given time slot and travel from the origin station to the destination station. We assume that they are independent. For each passenger in this set, given that he/she chooses a particular route, we can obtain all plans that the passenger may have chosen during the trip using the approach given in Section IV-B. Each plan in this set has a probability of being chosen by the passenger, conditioned, among other given quantities, on the route chosen. For the same reason as before, it is difficult to differentiate the logarithm of a sum of terms in Equation (5a), so we transform it into Equation (5b).
(5a)  
(5b) 
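The route-level estimation can be sketched in two pieces: the likelihood of a passenger's observation under one route, obtained by summing plan probabilities, and an iteration over the route probabilities. The plan encoding, the use of a single transfer distribution `beta` for every transfer, and the EM-style update are all simplifying assumptions for illustration, not the authors' stated procedure.

```python
def route_likelihood(plans, alpha, beta):
    """Probability of the observed trip under one route, summed over its plans.

    plans: list of (origin_wait, [transfer_wait, ...]) candidate plans
    alpha: dict wait -> probability at the origin station
    beta:  dict wait -> probability at a transfer station (shared, by assumption)
    """
    total = 0.0
    for origin_wait, transfer_waits in plans:
        p = alpha.get(origin_wait, 0.0)
        for w in transfer_waits:
            p *= beta.get(w, 0.0)
        total += p
    return total


def estimate_route_probabilities(likelihoods, iters=200):
    """EM-style estimate of route-choice probabilities for one OD pair and slot.

    likelihoods[p][r] is the likelihood of passenger p's trip under route r.
    """
    n_routes = len(likelihoods[0])
    theta = [1.0 / n_routes] * n_routes
    for _ in range(iters):
        totals = [0.0] * n_routes
        used = 0
        for row in likelihoods:
            weights = [t * l for t, l in zip(theta, row)]
            norm = sum(weights)
            if norm == 0.0:
                continue  # trip incompatible with every route
            used += 1
            for r, w in enumerate(weights):
                totals[r] += w / norm
        if used == 0:
            break
        theta = [t / used for t in totals]
    return theta
```

If each trip is compatible with only one route, the estimate is simply the share of trips compatible with that route.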
V System Implementation
Our algorithm for calculating the probability of each route being chosen for an OD pair with multiple routes operates on a large amount of data. The framework of our system, illustrated in Figure 6, has three layers: the platform layer, the analysis layer and the view layer.
The platform layer is mainly used for storage and job processing. Our algorithms need batch processing on a large amount of data, so it is more efficient to run them on a parallel platform [22]. We use the distributed computing platform Hadoop [23], which was designed for batch processing of big data. It mainly includes two modules, HDFS [24] and MapReduce [25]. HDFS provides high-throughput access to large data. MapReduce is for parallel processing of large data sets. In our platform, we utilize a 34 TB Hadoop Distributed File System (HDFS) on a cluster consisting of 11 nodes, each of which is equipped with 32 cores and 32 GB RAM. To improve retrieval efficiency, some MapReduce-based software tools, such as Pig [26] and HBase [27], are used.
The analysis layer, running on the platform layer, is the keystone of our paper. It mainly contains three parts: route generation, for getting all effective routes of each OD pair; train choice analysis, for finding all possible trains that a passenger may get on; and route choice analysis, for calculating the probability of each route being chosen for an OD pair with multiple routes. The three parts operate on a large volume of data. They are all translated into a series of MapReduce jobs that run in the distributed environment.
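To give a feel for how such an analysis decomposes into map and reduce phases, here is a plain-Python illustration that averages the number of trains waited for per (station, line, time slot). It does not use the Hadoop API; the record layout and function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (key, value) pair per record.

    records: iterable of (station, line, slot, trains_waited) tuples
    """
    for station, line, slot, waited in records:
        yield (station, line, slot), waited


def reduce_phase(pairs):
    """Group values by key and average them, as a reducer would per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) / len(values) for key, values in grouped.items()}
```

In a real MapReduce job the grouping between the two phases is done by the framework's shuffle step, and each key is reduced independently on the cluster.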
The view layer, built on the analysis layer, performs passenger flow analysis and displays the results to the public or transport agencies for strategic planning and management, such as the spatio-temporal passenger flow analysis for all metro lines, trains, sections, and so on.
VI Case Study
VI-A Dataset
The dataset used in this study consists of the smart card transaction records and train operation logs in Shenzhen, China. The metro system has 5 metro lines by 2013. The whole data set, collected from around 4 million smart cards, has more than 300 million smart card transaction records, covering 60 consecutive days from June 1, 2015 to July 30, 2015.
VI-B Left-behind analysis
Figure 7(a) shows the average number of trains waited for by passengers at station LaoJie of metro line 1 (LuoHu-JiChangDong) in the first part of their trips, at different time slots of one day. Figure 7(b) shows the number of passengers passing through station LaoJie (the sectional flow between the two adjacent stations LaoJie and DaJuYuan) at different time slots of one day. Station LaoJie, located in the heart of the Shenzhen business district, is a transfer station of line 1 and line 3. There are about 12 thousand tap-in passengers and 60 thousand transfer passengers at LaoJie per day. From Figure 7, we can see that the two curves are remarkably similar. The cross-correlation between the number of trains waited for and the sectional flow is 0.75, which is larger than 0.7. So we can assume that the more passengers pass through a station, the more left-behind passengers there are at that station.
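The text does not define its cross-correlation measure. One plausible reading is the zero-lag normalized cross-correlation of the two per-slot series, which coincides with the Pearson correlation coefficient; the sketch below computes it under that assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long series (zero-lag, normalized)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

Here `xs` would hold the average number of trains waited for in each time slot and `ys` the sectional flow in the same slots.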
Figure 7(a) also shows that the left-behind phenomenon varies with time, which makes it a good indicator of transit service performance and a basis for better travel advice to users. There are two obvious peak periods in Figure 7, 7:00-9:00 and 18:00-20:00. During these rush hours, so many passengers travel to work in the morning and back home in the evening that train capacity cannot meet the demand, so passengers must wait for more trains. Another point that deserves explanation is that even during off-peak periods, when trains may not be crowded, the average number of trains waited for is larger than 1.0. That is, not all passengers board the first available train. This is understandable because some passengers care about comfort: they anticipate that the next train will have seats available and choose to wait.
Figure 8 gives the distribution of the number of trains waited for by passengers at all stations of metro line 1 (JiChangDong-LuoHu) in four time slots (07:00-07:30, 07:30-08:00, 08:00-08:30, 08:30-09:00) of the morning rush hours. In Figure 8, the bar labelled “1st” is the probability that a passenger waits for one train (boards the first arriving train), “2nd” is the probability that a passenger waits for two trains, and so on. Figure 8 shows that the left-behind phenomenon varies in time and space, because the distribution of passengers is uneven in time and space, as shown in Figure 9.
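A distribution like the one in Figure 8 can be sketched as relative frequencies per time slot, which is the maximum likelihood estimate of a categorical distribution. The sketch assumes that the number of trains waited for is already known for each sampled passenger (for example, passengers with a single feasible plan); the function name, the bucket limit `max_trains` and the sample observations are hypothetical.

```python
from collections import Counter, defaultdict

def wait_distribution(observations, max_trains=4):
    """Estimate P(passenger boards the k-th train) per time slot.

    observations: iterable of (slot, k), where k >= 1 is the number of
    trains the passenger waited for. Counts with k >= max_trains are
    merged into the last bucket.
    Returns {slot: [P(k=1), ..., P(k=max_trains)]}.
    """
    counts = defaultdict(Counter)
    for slot, k in observations:
        if k < 1:
            raise ValueError("a passenger boards at least the first train")
        counts[slot][min(k, max_trains)] += 1
    dist = {}
    for slot, counter in counts.items():
        total = sum(counter.values())
        # Relative frequency is the maximum likelihood estimate.
        dist[slot] = [counter[k] / total for k in range(1, max_trains + 1)]
    return dist

# Hypothetical observations, for illustration only.
obs = (
    [("07:00-07:30", 1)] * 6
    + [("07:00-07:30", 2)] * 3
    + [("07:00-07:30", 3)] * 1
    + [("08:00-08:30", 1)] * 2
    + [("08:00-08:30", 2)] * 2
)
dist = wait_distribution(obs)
for slot, probs in sorted(dist.items()):
    print(slot, [round(p, 2) for p in probs])
```

Grouping by station as well as by time slot would give the space dimension; only the dictionary key would change.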
Figure 9 gives all sectional flows of metro line 1 (JiChangDong-LuoHu) in four time slots (07:00-07:30, 07:30-08:00, 08:00-08:30, 08:30-09:00) of the morning peak. The left-behind phenomenon is most serious from XX (short for XiXiang) to TY (short for TaoYuan) in 08:30-08:30, which indicates that train capacity cannot meet passenger demand at these stations.
Moreover, JCD (JiChangDong) is the first station of line 1 in the direction from JCD to LuoHu, so all trains arriving at this station are empty and have more remaining capacity than at other stations. Nevertheless, the figure shows that many passengers at JCD also wait for more than one train. There are two reasons. First, JCD is the closest metro station to ShenZhen airport, so many passengers have just left a plane, carry heavy luggage and struggle to board a local train. Second, for safety and comfort, these passengers are more likely to choose the next train with more seats. Because JCD is the first station of line 1, passengers there prefer a train with seats more strongly than passengers at other stations.
VI-C Route choice pattern
In this section, two typical OD pairs of stations were chosen to illustrate our proposed method to calculate the probability of each route being chosen. Figure 10(a) shows the layout of the two OD pairs.
The first OD pair is BaiShiLong-FuTian, which has two effective routes: (1) take metro line 4 and get off at ShaoNianGong, then take metro line 3 and get off at FuTian; (2) take metro line 4 and get off at ShiMinZhongXin, then take metro line 2 and get off at FuTian. Both routes need one transfer, and their average time costs are nearly the same. The first route is recommended by some mobile apps, such as the Baidu map app and the ShenZhen metro app provided by ShenZhen Metro Group Company. However, our experimental results show that, at both rush hours and off-peak hours, the probability of the first route being chosen is less than that of the second route. The results tell us that the route given by a mobile app does not always reflect most passengers’ real choice. They also provide evidence for including a walking penalty when the generalized cost of a path is calculated.
According to a survey of all transfer stations in ShenZhen, ShaoNianGong is one of the transfer stations with the longest walking time. The walking time there is about 5 minutes, compared with about 2 minutes at ShiMinZhongXin. Our onsite investigations tell us that most passengers prefer not to walk too long when transferring. Some passengers tell us that they do not know the actual walking time at every transfer station. Generally, they follow the shortest path given by a map app on their smartphone, which tells them that the first route has a lower time cost than the second one. This makes the passengers’ route choice behavior for BaiShiLong-FuTian understandable. Most passengers in the peak period are local residents. They are more familiar with the metro and know more about the walking time cost than passengers in the off-peak period, such as visitors. Tourists, who have less experience, are more likely to rely on mobile apps, so they are more likely to follow the app-recommended first route and transfer at ShaoNianGong.
The second OD pair is WuHe-YanNan. There are also two effective routes, as shown in Figure 10(b): (1) take metro line 5 and get off at HuangBeiLing, then take metro line 2; (2) take metro line 5 and get off at ShenZhenBei, then take metro line 4 and get off at ShiMinZhongXin, then take metro line 2. The first route costs ten minutes more than the second one. However, our analysis shows that some passengers still choose the first route. This is because the second route has one more transfer, which is likely to offset its advantage in time cost. The result provides evidence for including a transfer penalty when the generalized cost of a path is calculated.
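The walking and transfer penalties discussed above can be made concrete with a small generalized-cost sketch. The linear form, the weights `walk_weight` and `transfer_penalty_min`, and the in-vehicle times are assumptions chosen for illustration; only the transfer walking times (about 5 and 2 minutes) come from the text.

```python
from dataclasses import dataclass

@dataclass
class Route:
    name: str
    in_vehicle_min: float  # riding time in minutes (hypothetical values below)
    walk_min: float        # total transfer walking time in minutes
    transfers: int         # number of transfers

def generalized_cost(route, walk_weight=2.0, transfer_penalty_min=5.0):
    """Linear generalized cost in equivalent minutes.

    Walking is weighted more heavily than riding, and each transfer adds
    a fixed penalty. Both weights are illustrative assumptions.
    """
    return (
        route.in_vehicle_min
        + walk_weight * route.walk_min
        + transfer_penalty_min * route.transfers
    )

# BaiShiLong-FuTian: equal raw travel time (25 min), different walking time.
via_shaoniangong = Route("via ShaoNianGong", in_vehicle_min=20.0, walk_min=5.0, transfers=1)
via_shiminzhongxin = Route("via ShiMinZhongXin", in_vehicle_min=23.0, walk_min=2.0, transfers=1)

for route in (via_shaoniangong, via_shiminzhongxin):
    print(route.name, generalized_cost(route))
```

With these assumed weights, the route with the shorter transfer walk has the lower generalized cost even though both routes take the same raw time, which matches the direction of the observed preference. In the same way, the extra `transfers` term can outweigh a ten-minute time advantage when the per-transfer penalty is large enough.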
VI-D Spatiotemporal density analysis
The spatiotemporal density of all trains of metro line 1 (LuoHu-JiChangDong) is shown in Figure 11, where the x axis and y axis represent time and station respectively. Every train starts at the lowest station and finally reaches the highest station on the y axis. Each diagonal line represents a train and carries that train’s spatiotemporal density information. The color represents the density of passengers. The figure shows two peak periods, in the morning and in the evening. The density in the morning peak is higher than in the evening peak, so the departure intervals of trains in the morning peak could be set shorter than those in the evening peak.
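The density along one diagonal line of Figure 11 can be sketched as follows. The sketch assumes that the train choice analysis has already produced, for one train, the number of passengers boarding and alighting at each station; the function name, the counts and the `capacity` value are hypothetical.

```python
from itertools import accumulate

def section_loads(boardings, alightings):
    """Passengers on board between consecutive stations for one train.

    boardings[i] and alightings[i] are the counts at station i, in travel
    order. loads[i] is the load on the section from station i to i + 1.
    """
    if len(boardings) != len(alightings):
        raise ValueError("one boarding and one alighting count per station")
    net = [b - a for b, a in zip(boardings, alightings)]
    # Running sum of net boardings; the value after the last station is dropped.
    loads = list(accumulate(net))[:-1]
    if any(load < 0 for load in loads):
        raise ValueError("more passengers alighted than were on board")
    return loads

# Hypothetical counts for a four-station run.
boardings = [50, 30, 20, 0]
alightings = [0, 10, 40, 50]
loads = section_loads(boardings, alightings)

capacity = 100  # assumed train capacity
density = [load / capacity for load in loads]
print(loads, density)
```

Repeating this for every train and placing each section value at the time the train runs that section gives a station-by-time grid of the kind the figure colors.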
Besides, this spatiotemporal density information can be used for assessing the metro service, forecasting the density of all running trains, and so on. The sectional flow of the whole metro system in four time slots of the morning peak is shown in Figure 12. We can see that (1) the passenger flows are most crowded in 8:00-8:30, and (2) the passenger flow is unevenly distributed throughout the whole metro system. Metro lines 1, 3 and 4 have more passengers than lines 2 and 5. For lines 1, 3 and 4, the density is higher in the middle sections than at both ends. This can help to design schedules for shuttle trains. All this information can improve both passengers’ and transportation operators’ knowledge of the transportation system. For example, operators can use it to improve the service by redesigning timetables or adjusting train speeds, and passengers can plan their trips based on it.
VII Conclusion and discussion
In this paper, we present an approach to calculate the probability of route choices for an OD pair with multiple routes in a complex metro network. To do so, we find, for each passenger, all possible plans that he/she can choose for each effective route by matching smart card data with train operation logs. We also estimate two kinds of time-dependent multinomial distributions using maximum likelihood estimation: the number of trains that a passenger waits for at his/her origin station, and the number of trains that a passenger waits for when he/she changes at a transfer station. Based on these, we propose a probability model to calculate the probability of each route being chosen for an OD pair with multiple paths. The approach is applied to the Shenzhen metro system. Onsite investigations validate that our algorithm is accurate and can be used to estimate passenger flows.
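The way the two waiting distributions could feed a route probability can be sketched as below. This is an illustrative simplification, not the model of this paper: it assumes that each feasible plan is described only by the number of trains waited for at the origin and at each transfer, that these waits are independent, and that a route is scored by the summed probability of its feasible plans times a prior. All names and numbers are hypothetical.

```python
def route_posterior(plans_by_route, p_origin, p_transfer, prior=None):
    """Posterior probability of each route for one observed trip.

    plans_by_route: {route: [(k_origin, [k_transfer, ...]), ...]}, the plans
        of each route that are consistent with the tap-in and tap-out times.
    p_origin, p_transfer: {k: probability of waiting for k trains}.
    prior: optional {route: prior probability}; uniform if omitted.
    """
    routes = list(plans_by_route)
    if prior is None:
        prior = {route: 1.0 / len(routes) for route in routes}
    scores = {}
    for route in routes:
        likelihood = 0.0
        for k_origin, k_transfers in plans_by_route[route]:
            p = p_origin.get(k_origin, 0.0)
            for k in k_transfers:
                p *= p_transfer.get(k, 0.0)
            likelihood += p  # feasible plans are mutually exclusive
        scores[route] = prior[route] * likelihood
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no feasible plan has non-zero probability")
    return {route: score / total for route, score in scores.items()}

# Hypothetical waiting distributions and feasible plans.
p_origin = {1: 0.7, 2: 0.3}
p_transfer = {1: 0.8, 2: 0.2}
plans = {
    "route 1": [(1, [1])],
    "route 2": [(1, [2]), (2, [1])],
}
posterior = route_posterior(plans, p_origin, p_transfer)
print({route: round(p, 3) for route, p in posterior.items()})
```

Averaging such per-trip posteriors over all trips of an OD pair in a time slot would give one possible estimate of the route choice share for that pair.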
Acknowledgment
The authors would like to thank anonymous reviewers for their valuable comments. This work was supported in part by the China National Basic Research Program (973 Program) under Grant 2015CB352400, by the National High Technology Research and Development Program of China (863 Program) under Grant 2014AA01A702, by NSFC under Grant U1401258, by NSF under Grant CMMI 1436786 and CNS 1526638, by Research Program of Shenzhen under Grant JSGG20150512145714248, KQCX2015040111035011 and CYZZ20150403111012661, by the Natural Science Foundation of Hubei Province under Grant 2014CFB1007, and by the Fundamental Research Funds for the Central Universities.
References
 [1] T. Kusakabe, T. Iryo, and Y. Asakura, “Estimation method for railway passengers’ train choice behavior with smart card transaction data,” Transportation, vol. 37, no. 5, pp. 731–749, 2010.
 [2] L. Sun, Y. Lu, J. G. Jin, D.H. Lee, and K. W. Axhausen, “An integrated bayesian approach for passenger flow assignment in metro networks,” Transportation Research Part C: Emerging Technologies, vol. 52, pp. 116 – 131, 2015.
 [3] S. Peeta and A. K. Ziliaskopoulos, “Dynamic traffic assignment: The past, the present and the future,” Networks and Spatial Economics, vol. 1, 2001.
 [4] Y. Sheffi, “Urban transportation networks: Equilibirum analysis with mathematical programming methods,” 1985, englewood Cliffs, NJ: PrenticeHall, c1985.
 [5] H. Talaat and B. Abdulhai, “Modeling driver psychological deliberation during dynamic route selection processes,” in in Intelligent Transportation Systems Conference, 2006. ITSC’06. IEEE, 2006, pp. 695–700.
 [6] S. Nakayama and R. Kitamura, “Route choice model with inductive learning,” pp. 63–70, 2000.
 [7] M. Bagchi and P. White, “The potential of public transport smart card data,” Transport Policy, vol. 12, no. 5, pp. 464–474, 2005.
 [8] B. Agard, C. Morency, and M. Trépanier, “Mining public transport user behaviour from smart card data,” in 12th IFAC Symposium on Information Control Problems in ManufacturingINCOM, 2006, pp. 17–19.
 [9] J. Zhao, C. Tian, F. Zhang, C. Xu, and S. Feng, “Understanding temporal and spatial travel patterns of individual passengers by mining smart card data,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 2991–2997.
 [10] M.P. Pelletier, M. Trépanier, and C. Morency, “Smart card data use in public transit: A literature review,” Transportation Research Part C: Emerging Technologies, vol. 19, no. 4, pp. 557–568, 2011.
 [11] J. G. Jin, L. C. Tang, L. Sun, and D.H. Lee, “Enhancing metro network resilience via localized integration with bus services,” Transportation Research Part E: Logistics and Transportation Review, vol. 63, pp. 17 – 30, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1366554514000039
 [12] L. Sun, J. G. Jin, D.H. Lee, K. W. Axhausen, and A. Erath, “Demanddriven timetable design for metro services,” Transportation Research Part C: Emerging Technologies, vol. 46, pp. 284 – 299, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0968090X1400182X
 [13] L. Sun, D.H. Lee, A. Erath, and X. Huang, “Using smart card data to extract passenger’s spatiotemporal density and train’s trajectory of mrt system,” in Proceedings of the ACM SIGKDD International Workshop on Urban Computing. ACM, 2012, pp. 142–148.
 [14] L. R. H. S. Fu, Q., “A bayesian modelling framework for individual passenger¡¯s probabilistic route choices: a case study on the london underground,” in Transportation Research Board 93rd Annual Meeting(No. 145328), pp. 197–203.
 [15] Z. Tian, Y. Wang, C. Tian, F. Zhang, L. Tu, and C. Xu, “Understanding operational and charging patterns of electric vehicle taxis using gps records,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 2472–2479.
 [16] J. Zhang, X. Yu, C. Tian, F. Zhang, L. Tu, and C. Xu, “Analyzing passenger density for public bus: Inference of crowdedness and evaluation of scheduling choices,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 2015–2022.
 [17] J. Huang, L. Wang, C. Tian, F. Zhang, and C. Xu, “Mining freight truck’s trip patterns from gps data,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 1988–1994.
 [18] Y. Li, C. Tian, F. Zhang, and C. Xu, “Traffic condition matrix estimation via weighted spatiotemporal compressive sensing for unevenlydistributed and unreliable gps data,” in Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on. IEEE, 2014, pp. 1304–1311.
 [19] E. D., “Finding the k shortest paths[j],” Siam Journal on Computing, 2006.
 [20] F. Zhang, J. Zhao, C. Tian, C. Xu, X. Liu, and L. Rao, “Spatiotemporal segmentation of metro trips using smart card data,” Vehicular Technology, IEEE Transactions on, vol. PP, no. 99, pp. 1–1, 2015.
 [21] Wikipedia, the free encyclopedia, “Sampling statistics,” https://en.wikipedia.org/wiki/Sampling statistics.
 [22] F.Y. Wang, “Parallel control and management for intelligent transportation systems: Concepts, architectures, and applications,” Intelligent Transportation Systems, IEEE Transactions on, vol. 11, no. 3, pp. 630–638, 2010.
 [23] T. White, Hadoop: the definitive guide. O’Reilly, 2012.
 [24] D. Borthakur, “Hdfs architecture guide,” HADOOP APACHE PROJECT http://hadoop. apache. org/common/docs/current/hdfs design. pdf, 2008.
 [25] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
 [26] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, “Building a highlevel dataflow system on top of mapreduce: the pig experience,” Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1414–1425, 2009.
 [27] L. George, HBase: the definitive guide. O’Reilly Media, Inc., 2011.
Juanjuan Zhao received her MS from the Department of Computer Science, Wuhan University of Technology, China, in 2009. She was a research assistant from 2009 to 2012 at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, and is currently a Ph.D. student there. Her research interests include cloud computing, big data processing, streaming-data processing, data fusion techniques, big-data-driven systems and spatiotemporal data mining.
Fan Zhang is an associate professor at the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. He received his Ph.D. in Communication and Information System from Huazhong University of Science and Technology in 2007. He was a postdoctoral fellow at the University of New Mexico and the University of Nebraska-Lincoln from 2009 to 2011. His research topics include big data processing, data privacy and urban computing.
Lai Tu received the B.S. degree in Communication Engineering and the Ph.D. degree in Information and Communication Engineering from Huazhong University of Science and Technology, China, in 2002 and 2007 respectively. From 2007/7 to 2008/12, he worked as a postdoctoral fellow in the Department of EIE at Huazhong University of Science and Technology. From 2009/1 to 2010/10, he worked as a postdoctoral researcher in the Department of CSIE at National Cheng Kung University, Taiwan. Currently, he is an associate professor in the School of Electronic Information and Communications at Huazhong University of Science and Technology. His research areas include urban computing, human behavior study, mobile computing and networking.
Chengzhong Xu received his Ph.D. degree from the University of Hong Kong in 1993. He is currently a professor in the Department of Electrical and Computer Engineering of Wayne State University, USA. He also holds an adjunct appointment with the Shenzhen Institute of Advanced Technology of Chinese Academy of Science as the Director of the Institute of Advanced Computing and Data Engineering. His research interest is in parallel and distributed systems and cloud computing. He has published more than 200 papers in journals and conferences. He was the Best Paper Nominee of 2013 IEEE High Performance Computer Architecture (HPCA), and the Best Paper Nominee of 2013 ACM High Performance Distributed Computing (HPDC). He serves on a number of journal editorial boards, including IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Cloud Computing, Journal of Parallel and Distributed Computing and China Science Information Sciences. He was a recipient of the Faculty Research Award, Career Development Chair Award, and the President’s Award for Excellence in Teaching of WSU. He was also a recipient of the “Outstanding Oversea Scholar” award of NSFC. For more information, visit http://www.ece.eng.wayne.edu/ czxu. Dr. Xu is an IEEE Fellow.
Dayong Shen received the bachelor’s degree and master’s degree in system engineering from NUDT in 2011 and 2013 respectively. He is currently working toward the Ph.D. degree in Social Transportation and Social Logistics. His current research interests include intelligent scheduling, artificial intelligence algorithms and parallel social systems. He has rich experience in designing and implementing parallel logistics system projects.
Chen Tian is an associate professor at State Key Laboratory for Novel Software Technology, Nanjing University, China. He was previously an associate professor at School of Electronics Information and Communications, Huazhong University of Science and Technology, China. Dr. Tian received the BS (2000), MS (2003) and PhD (2008) degrees at Department of Electronics and Information Engineering from Huazhong University of Science and Technology, China. From 2012 to 2013, he was a postdoctoral researcher with the Department of Computer Science, Yale University. His research interests include data center networks, network function virtualization, distributed systems, Internet streaming and urban computing. 
XiangYang Li is a professor at the School of Computer Science and Technology, University of Science and Technology of China. He was previously a professor at the Computer Science Department of the Illinois Institute of Technology. Dr. Li received the MS (2000) and PhD (2001) degrees from the Department of Computer Science, University of Illinois at Urbana-Champaign, and a Bachelor degree from the Department of Computer Science and a Bachelor degree from the Department of Business Management, Tsinghua University, P.R. China, both in 1995. His research interests include wireless networking, mobile computing, security and privacy, cyber physical systems, and algorithms. He and his students won five best paper awards (IEEE GlobeCom 2015, IEEE HPCCC 2014, ACM MobiCom 2014, COCOON 2001, IEEE HICSS 2001) and one best demo award (ACM MobiCom 2012). He published a monograph, “Wireless Ad Hoc and Sensor Networks: Theory and Applications”. He coedited several books, including “Encyclopedia of Algorithms”. Dr. Li is an editor of several journals, including IEEE Transaction on Mobile Computing and IEEE/ACM Transaction on Networking. He has served many international conferences in various capacities, including ACM MobiCom, ACM MobiHoc and IEEE MASS. He is an IEEE Fellow and an ACM Distinguished Scientist.
Zhengxi Li is a doctoral supervisor, professor and vice president of North China University of Technology. His research interests cover intelligent traffic control and management, control theory and control engineering, and electric drive technology.