ALCNN: Attention-based Model for Fine-grained Demand Inference of Dock-less Shared Bike in New Cities


Chang Liu (Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China), Yanan Xu (Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China), and Yanmin Zhu (Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China)

In recent years, dock-less shared bikes have spread across many cities in China and made people’s lives more convenient. At the same time, they raise many management problems because the real distribution of bikes often does not match demand. Before deploying dock-less shared bikes in a city, companies need a plan for dispatching bikes from places with excess bikes to locations with high demand in order to provide better service. In this paper, we study the problem of inferring fine-grained bike demands anywhere in a new city before the deployment of bikes. This problem is challenging because a new city lacks training data and bike demands vary by both place and time. To solve the problem, we provide various methods to extract discriminative features from multi-source geographic data, such as POI, road networks and nighttime light, for each place. We utilize correlation Principal Component Analysis (coPCA) on the extracted features of both the old city and the new city to realize distribution adaption. Then, we adopt a discrete wavelet transform (DWT) based model to mine daily patterns for each place from fine-grained bike demand. We propose an attention-based local CNN model, ALCNN, which infers the daily patterns from the coPCA latent features and uses multiple CNNs to model the influence of neighboring places. In addition, ALCNN merges latent features from the multiple CNNs and can select a suitable size of influenced region. Extensive experiments on real-life datasets show that the proposed approach outperforms competitive methods.

Urban computing, dock-less shared bike, attention mechanism, discrete wavelet transform, transfer learning
CCS Concepts: Information systems → Data mining; Information systems → Spatial-temporal systems; Applied computing → Transportation

1. Introduction

Recently, dock-less shared bike services have achieved great success and reinvented the bike sharing business in China, especially in major cities. Dock-less shared bikes provide an environmentally friendly solution to the last mile problem, which refers to the troublesome distance between home and the nearest traffic center. Many dock-less shared bike companies, such as ofo and Mobike, have grown rapidly and seized the market. The rules they set are similar: a user finds a dock-less shared bike via a GPS-based smartphone app, follows the app’s instructions to unlock the bike (scanning a QR code or entering a password), rides the bike anywhere they want, and finally locks the bike and pays the company. The convenience of this mode benefits many people and has reinvented the bike industry in China.

Figure 1. Fine-grained bike demands in three different places in a same day. Bike demands are counted for each half hour.

But the prosperity of this new way of transportation inevitably leads to new problems. Before companies officially deploy dock-less shared bikes in a new city, they need a plan for managing bikes, as the real-time distribution of bikes may not match bike demand. Managing bikes according to bike demands will greatly improve the effective use of bikes, i.e., serving more customers at lower cost. However, the bike demands in a city vary spatially and temporally. We regard the curve of demand per half hour in one area as its fine-grained bike demand. Figure 1 shows fine-grained bike demands in a day for three locations. We can see that for each location, the fine-grained bike demand varies with time, and different places have different bike demands. If we know the fine-grained bike demands of the three places in Figure 1, at noon we can move spare bikes from the second place to the first place and the third place.

Fine-grained bike demands are easy to collect in a city which has deployed shared bikes for a while. But for a city having no shared bikes, the lack of historical bike demand records makes it hard to build a prediction model. In this paper, we focus on the problem of inferring fine-grained bike demands within a time slot for a place in new cities which have not deployed dock-less shared bikes. We want to build a demand inference approach based on geo-related data, such as POI, road and transportation, in a city having bikes. Then we can transfer it to a new city and infer bike demands. Companies can use the inference result to design a scheduling algorithm before the deployment to balance the supply and demand of bikes in all regions (Li et al., 2018).

Many existing works on dock-less shared bike system take geo-related data into consideration to infer the distribution of bike demand in a city (Liu et al., 2018a, b). But they only focused on the spatial distribution and omitted temporal variability which is also important for the management of bikes. In this paper, we aim to infer the bike demand distribution over both spatial and temporal dimensions.

However, there are three challenges in solving the problem. First, the bike demands are time-varying but we only have static geographic data like POIs. The bike demands vary with time in a day for one place. Even at the same time on different days, the demands may vary a little. Second, our task is to infer fine-grained bike demands in a new city that has not deployed shared bikes, and we don’t have any bike demand data in that city. The data distribution may differ between cities. Finally, each place is influenced by its neighbors, and the size of the area influenced by neighbors is hard to determine.

To address these challenges, we first analyze the fine-grained bike demand data and find that there exist daily demand patterns for many places (details are shown in Section 3). We propose to utilize discrete wavelet transform (DWT) to mine daily bike demand patterns from the fine-grained bike demands of every day in each place. The mined daily demand patterns are stable and are used as the ground truth of inference. To deal with the lack of training data in a new city, we next utilize multi-source geographic data to extract discriminative features and train an inference model to predict bike demands. The inference model is trained with data from an old city which has deployed shared bikes and is transferred to new cities. We employ correlation Principal Component Analysis (coPCA) to realize distribution adaption when extracting latent features from the geographic data of both the old city and the new city. To account for the influence of neighbors, we then propose an attention-based model, ALCNN, which utilizes CNN to aggregate geographic features from neighbors in a local region for each place. As the size of the influenced area varies between places, we build several CNN models with local regions of different sizes and merge their learned latent features using an attention mechanism. In other words, the attention mechanism can choose local regions of a suitable size for each place automatically.

The main contributions of this paper are summarized as follows.

  • To the best of our knowledge, we are the first to study the problem of inferring fine-grained bike demands in a new city. We present the importance of finding temporal patterns from fine-grained bike demands and propose to use discrete wavelet transform to mine daily patterns from bike demands.

  • We utilize local CNN to aggregate geographic features of neighbors in local regions of various sizes for each place. We propose to use attention mechanism to merge the learned latent features from local regions of different sizes.

  • We evaluate our inference approach on real Mobike data of three cities in China. The results show that our approach outperforms all baselines. The attention mechanism can improve the inference performance.

The remainder of this paper is organized as follows. In Section 2, we formulate our fine-grained bike demand inference problem and introduce the framework of our approach. In Section 3, we analyze the Mobike dataset of dock-less shared bikes. Next, we introduce the feature extraction processes and our proposed model ALCNN in Section 4 and Section 5, respectively. Then, we conduct extensive experiments to evaluate our approach in Section 6.

2. Overview

In this section, we first introduce several fundamental definitions and present the formulation of our bike demand inference problem. Next, we show the overall framework of our approach.

2.1. Preliminary

Definition 2.1 (City grid and city grid map ).

We regard a city as one rectangle and divide it into disjointed grids of the same size according to latitude and longitude. Each grid is denoted by , where and . As a result, the city can be seen as a grid map, i.e., . For convenience, we use function to judge whether a point is inside a grid .

Definition 2.2 (Mobike trip record collection ).

We use the data from a dock-less shared bike company called Mobike. Our dataset of dock-less shared bikes is a collection of riding trip records, i.e., . Each record is a tuple the elements of which denote the starting location and time, ending location and time, respectively. One location is formed by its longitude and latitude.

If the starting location of one record , is within a grid , we say that a renting behavior happened in that grid. Based on the trip records, we define a bike demand set as follows.

Definition 2.3 (Fine-grained bike demand set ).

For easy processing, we divide each day into time slots of the same length and use to denote the -th time slot in the -th day. Next, we can obtain fine-grained bike demands of each grid in a certain time slot with the following equation.


As we aim to infer daily bike demand pattern of each grid, we use one demand vector to denote bike demands of all time slots in day for grid , i.e., . For each grid, its bike demand vectors of different days can form a set of fine-grained bike demands
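The counting step in Definition 2.3 can be illustrated with a minimal Python sketch. It assumes half-hour slots (as in Figure 1), a square grid cell measured in degrees, and trip records reduced to their starting location and time; the function name `count_demands` and the parameters `lat0`, `lon0` and `cell_deg` are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from datetime import datetime


def count_demands(trips, lat0, lon0, cell_deg, slot_minutes=30):
    """Count rentals per (grid row, grid column, day), one counter per time slot.

    trips: iterable of (start_lat, start_lon, start_time) tuples.
    lat0, lon0: south-west corner of the city rectangle (assumed).
    cell_deg: side length of a grid cell in degrees (assumed).
    """
    slots_per_day = 24 * 60 // slot_minutes
    demand = defaultdict(lambda: [0] * slots_per_day)
    for lat, lon, t in trips:
        row = int((lat - lat0) // cell_deg)
        col = int((lon - lon0) // cell_deg)
        slot = (t.hour * 60 + t.minute) // slot_minutes
        # One demand vector per grid and day, indexed by time slot.
        demand[(row, col, t.date())][slot] += 1
    return dict(demand)


trips = [
    (31.205, 121.405, datetime(2017, 6, 7, 8, 10)),
    (31.206, 121.406, datetime(2017, 6, 7, 8, 25)),
    (31.205, 121.405, datetime(2017, 6, 7, 18, 40)),
]
demand = count_demands(trips, lat0=31.0, lon0=121.0, cell_deg=0.01)
```

Each value in `demand` is the demand vector of one grid on one day, so the vectors of a grid across days form its set of fine-grained bike demands.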

Figure 2. Fine-grained bike demands on different days in a same grid

As Figure 2 shows, in most grids the fine-grained bike demands on different days are similar to each other. Based on this observation, we can draw two conclusions. On the one hand, most grids have daily patterns in their bike demands. On the other hand, it is better to mine these daily patterns, which are more stable and benefit the convergence of our inference model. Therefore, we define daily temporal patterns of bike demand for each grid as follows.

Figure 3. The framework of our system
Definition 2.4 (Daily demand pattern ).

For each grid, we aim to find a demand vector which is similar to the demand vectors of most days. Considering that the bike demands of different grids or different days can vary widely, we normalize each demand vector by the sum of its elements. We employ Kullback-Leibler (KL) divergence to measure the difference between demand vectors. If we manage to find a demand vector such that the KL divergence between and all other demand vectors of the same grid is smaller than a threshold , we treat as the daily demand pattern of grid . Note that does not have a superscript because is not a real bike demand vector of any day. We will discuss how to find by DWT later.


Problem Formulation. We consider two cities S and T, where S has deployed dock-less shared bikes and T is a new city to launch shared bike business. We divide both cities into grids of the same size. We regard city S as a source domain while city T as a target domain. Given a set of raw bike data in city S, we compute the set of fine-grained bike demands D according to Definition 2.3. Our goal is to infer fine-grained bike demands in the new city T, based on the known fine-grained bike demands from city S and multiple geo-related data (e.g., POI and road networks) in both city S and T.

2.2. Inference Framework

To achieve the fine-grained bike demand inference goal, we propose an inference system consisting of two major components, i.e., feature extraction and model training, as shown in Figure 3.

In the feature extraction component, we first extract features from multi-source geographic data including POIs, road networks, satellite lights, transportation centers and business centers. We also process the original records of Mobike and get bike demands. Next, we utilize Correlation Principal Component Analysis (coPCA) and Discrete Wavelet Transform (DWT) to deal with the two kinds of features in a further step. coPCA produces discriminative latent features as the input of our inference model. coPCA processes geo-related data in source city and target city together. It realizes the distribution adaption between the two cities and improves the inference performance. DWT extracts temporal patterns from bike demand vectors and can improve the stability of our inference model as it represents the demand vectors with fewer dimensions and parameters.

In the second component, we feed the extracted latent features to the proposed attention-based local CNN model. We train the model with data from the source city S and transfer the trained model to a new city T. Using the geographic features of the target city T, we can infer the bike demand in the new city. The proposed attention-based local CNN model selects the features of each grid and its neighbors and processes them with multiple convolution networks to model the influence of neighbors. The attention mechanism in the model helps choose the most suitable size of local region to provide better inference results.

3. Data Analysis

Before we conduct empirical analysis on dock-less shared bike data, we need an evaluation method that defines how close two different fine-grained demands are. To do this, we adopt the Kullback-Leibler (KL) divergence to compute the closeness between two fine-grained bike demands. Based on the definitions in Section 2, we can define the KL divergence:


Intuitively, the smaller the KL divergence is, the closer two fine-grained demand vectors are. In our context, the closeness of every two consecutive bike demand vectors indicates the stability of daily fine-grained bike demands.

As we have many fine-grained bike demands on each grid for different days, we still need a way to evaluate the divergence among all fine-grained bike demands on each grid. As mentioned in Section 2, we use the maximum KL divergence among all fine-grained bike demands as the evaluation. A lower divergence means a higher possibility that we can find a pattern on that grid.


where is a vector of days. With this definition of divergence, we can do empirical analysis on real data to see whether it is necessary to find a pattern. We scan all the real data and get the result shown in Figure 4.
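As a minimal sketch of this evaluation, the snippet below normalizes each demand vector by its sum, computes the KL divergence between two days, and takes the maximum over all ordered pairs of days in one grid. The small constant `eps` that avoids division by zero in empty time slots is an assumption, as are the function names; the paper's exact handling of zero counts is not stated.

```python
import numpy as np


def normalize(v, eps=1e-9):
    """Turn a demand vector into a distribution (eps smoothing is an assumption)."""
    v = np.asarray(v, dtype=float) + eps
    return v / v.sum()


def kl_divergence(p, q):
    """KL(p || q) between two normalized demand vectors."""
    p, q = normalize(p), normalize(q)
    return float(np.sum(p * np.log(p / q)))


def max_divergence(day_vectors):
    """Largest KL divergence over all ordered pairs of days (needs >= 2 days)."""
    vs = list(day_vectors)
    return max(
        kl_divergence(a, b)
        for i, a in enumerate(vs)
        for j, b in enumerate(vs)
        if i != j
    )


stable_grid = [[1, 4, 2, 5], [2, 8, 4, 10], [1, 5, 2, 5]]
noisy_grid = [[1, 4, 2, 5], [9, 0, 7, 0]]
```

Because the vectors are normalized first, two days with the same shape but different total volume (the first two rows of `stable_grid`) have a divergence close to zero.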

Figure 4. Distribution of divergence between demands of different days

We can see that about of real fine-grained demands have a divergence less than , and almost of real fine-grained demands have a divergence less than . Based on this observation, we conclude that most grids in the city have temporal patterns, so it is worthwhile to mine these patterns for further work.

In practice, temporal patterns are very complicated. For example, even if we mainly consider the peaks in a pattern, the number, time and height of the peaks all vary across the city. However, we don’t think that the difference among temporal patterns is just random. Instead, we think that there is a connection between temporal patterns and the geographic data. For example, most grids near subway stations have “TRIPLE PEAK” patterns (three peaks during the day), because a subway station has a large flow of people that surges at every possible peak. In contrast, the grids near business centers usually have “DOUBLE PEAK” patterns (two peaks during the day), because the people working in business centers have a fast pace of work and do not have enough time to use bikes at noon, so the noon peak does not exist.

In summary, we find that there is a need to mine temporal patterns from fine-grained bike demands, because most grids have low divergence. We also find that temporal patterns differ because of geographic diversity, which inspires us to leverage geographic data sources such as POIs, subway stations and business centers when inferring the patterns.

4. Feature Extraction

In this section, we show the details of the feature extraction component. The result of feature extraction is the input of our proposed ALCNN model in Section 5.

4.1. Features from Multi-source Data

Geographic data is utilized to infer bike demands in this paper because it reflects the condition of transportation and is highly related to bike demands. We consider five kinds of geographic data: POI, road networks, nighttime light, transportation centers, and business centers. We give a brief introduction to each kind of data and present the feature extraction methods. The detailed features extracted from each kind of data are denoted by , , , , and , respectively.

1. POI Features:

Each POI represents a city venue with its name, address, category and spatial coordinates. The number and diversity of POIs on a grid reflect its prosperity and hence are related to bike demand density. We extract the following three POI features for each grid:

(1) POI category frequency (): We associate each city grid with a POI category frequency vector . is a -dimension vector and the -th element in indicates the number of venues of the -th category located in the grid .

(2) Number of POIs (): We can count the total number of POIs based on frequency vector on a grid: .

(3) POI entropy (): Besides the number of POIs, we also compute the entropy based on for indicating the heterogeneity of POIs in a grid.
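The three POI features can be sketched in a few lines of Python. The sketch assumes the entropy is the Shannon entropy of the category proportions with a natural logarithm, which is the usual choice but is not confirmed by the text; the function name `poi_features` and the integer category encoding are hypothetical.

```python
import math
from collections import Counter


def poi_features(categories, num_categories):
    """Return (category frequency vector, POI count, POI entropy) for one grid.

    categories: list of integer category ids of the POIs located in the grid.
    """
    counts = Counter(categories)
    freq = [counts.get(k, 0) for k in range(num_categories)]
    total = sum(freq)
    if total == 0:
        return freq, 0, 0.0  # empty grid: no POIs, zero entropy by convention
    # Shannon entropy over non-empty categories (natural log is an assumption).
    entropy = -sum((c / total) * math.log(c / total) for c in freq if c > 0)
    return freq, total, entropy


freq, total, entropy = poi_features([0, 0, 2, 2], num_categories=3)
```

A grid whose POIs are split evenly between two categories, as in the example, has entropy ln 2, while a grid with a single category has entropy zero.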


2. Road Network Features:

Intuitively, the more roads located in one grid, the more convenient the traffic will be. In our dataset, each road consists of its name, category (or level), start point and end point. We propose to use two road network features for every grid:

(1) Road category frequency (): We can associate road category frequency vector for each city grid . The -th element in denotes the number of -th type roads overlapped with grid .

(2) Number of roads (): After we get the frequency vector , we can count the total number of overlapped roads : .

3. Nighttime Light Features:

The nighttime light data are collected by satellites. The light intensity is given at sample points taken every meters on the map. Intuitively, nighttime light intensity is positively correlated with the business prosperity and population density of a grid, which implies large shared bike demands. We identify two kinds of features:

(1) Average light intensity (): To reduce the influence of data noises, we calculate the average light intensity. Denote as the set of light points in the city, and as the intensity of every point. We compute the average light intensity in :


(2) Distance to the nearest light center (): From the map of nighttime lights, we identify several centers whose light intensity is larger than that of other places. Each light center is denoted by its longitude and latitude. Based on the light centers, we compute the geographic distance between each grid and its nearest light center.


where function calculates the geographic distance between two locations and . For a grid , we use the longitude and latitude of its center as the location. denotes the set of all light centers in a city.
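A minimal sketch of the nearest-center distance follows. It assumes the geographic distance is the great-circle (haversine) distance on a spherical Earth, which the text does not specify; `haversine_m` and `nearest_center_distance` are hypothetical names. The same helper applies unchanged to transportation centers and business centers.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in meters (haversine formula, spherical Earth)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def nearest_center_distance(grid_center, centers):
    """Distance from a grid center (lat, lon) to the closest of the given centers."""
    return min(haversine_m(*grid_center, *c) for c in centers)


light_centers = [(1.0, 0.0), (0.0, 0.5)]
dist = nearest_center_distance((0.0, 0.0), light_centers)
```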

4. Transportation Features:

Most dock-less shared bikes are parked near transportation centers, such as subway stations, because people usually get off buses or metros at transportation centers and need a bike to travel a short distance, e.g., from a bus station to home. We extract two main features:

(1) Number of transportation centers (): We use as the set of transportation centers in the city. Then, we can count the number of transportation centers in each grid:


(2) Distance to the nearest transportation center (): For each grid, we also compute the distance from its center to the nearest transportation center:


5. Business Center Features:

Around business centers, the flow of people can be very large, which indicates large bike demands. In our business center dataset, each business center is denoted by its name, level, and location. We use to denote the set of all business centers in a city. We extract two business center features for each grid:

(1) Distance to the nearest business center ()

With denoting the set of all business centers, we have:


(2) Level of the nearest business center ()


4.2. Correlation Principal Component Analysis for Transfer Learning

As our task is to infer fine-grained bike demands in a new city and the feature distributions of the two cities may differ, we propose to apply coPCA over the extracted features to achieve distribution adaption. Originally, PCA aims to represent the raw features with low-dimensional latent features. If each data sample is seen as one point in a coordinate system, PCA rotates the coordinate axes so that the samples have the largest variance along the first few axes. The latent features on these axes preserve the effective information and filter out data noise. In this paper, we use coPCA to find a transformation that minimizes the difference between the data distributions of the two cities.

We first construct two matrices to store features of grids from two cities, respectively. In the matrices, each row denotes one grid in a city and every column stands for one kind of feature, in other words the matrices are concatenations of and . Next, we concatenate the two matrices in the first dimension and apply PCA to the entire matrix. When we get the result of coPCA, we will divide the output matrix into two parts along the first dimension. The two parts can be seen as the results of dimension reduction of data from source city and target city, respectively.
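The stack, project and split procedure described above can be sketched with NumPy. This is a plain PCA computed through an SVD of the jointly centered matrix; whether the features are additionally standardized, and how the number of components is chosen, are not stated in the text and are left as assumptions. The name `co_pca` is hypothetical.

```python
import numpy as np


def co_pca(source, target, n_components):
    """Fit one PCA on the grids of both cities and split the result again.

    source: (n_source_grids, n_features), target: (n_target_grids, n_features).
    """
    stacked = np.vstack([source, target])          # concatenate along the first dimension
    centered = stacked - stacked.mean(axis=0)      # joint centering over both cities
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    latent = centered @ vt[:n_components].T
    return latent[: len(source)], latent[len(source):]


rng = np.random.default_rng(0)
src = rng.normal(size=(6, 4))             # toy source-city features
tgt = rng.normal(loc=1.0, size=(5, 4))    # toy target-city features, shifted
src_latent, tgt_latent = co_pca(src, tgt, n_components=2)
```

Because the projection is fitted on both cities at once, the source and target latent features live in the same coordinate system, which is what allows a model trained on the source city to be applied to the target city.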

Figure 5. The structure of our local CNN model. Four grids are selected as training samples. The size of local region of each grid is set to in this figure.

4.3. Discrete Wavelet Transform

As we discuss in Section 3, each grid has its daily bike demand patterns. Moreover, the dimension of a fine-grained bike demand vector can be large, which may lead to high computational complexity and a large number of parameters. Inspired by the dimension reduction of the input features, we also want to reduce the dimension of the output, i.e., the bike demands. We utilize Discrete Wavelet Transform (DWT) to solve the problem. As with other wavelet transforms, a key advantage it has over Fourier transforms is temporal resolution: it captures both frequency and location information (location in time).

DWT works by passing the signal through a series of filters. For convenience, we use two filters, a low-pass filter and a high-pass filter, which are determined by the mother wavelet and are known as quadrature mirror filters. According to early research on DWT (Kronland-Martinet et al., 1987), the low-pass filter maps the original signal to the approximation coefficients, while the high-pass filter maps the original signal to the detail coefficients. The main reason why we use DWT is that the approximation obtained by the low-pass filter uses fewer parameters, because the remaining parameters carry high-frequency information. The basic formula is as follows:


where is the fine-grained bike demand in grid on day . is the reduced parameter size, which is half of the original size in this case. is the high-pass filter while is the low-pass filter. is the result after DWT. Then we use the inverse DWT (IDWT) to transform into a new fine-grained demand, and we combine all new fine-grained demands in grid on different days to get the candidate daily pattern . The selection of and depends on the mother wavelet, which is not unique; options include Haar, Daubechies, and Symlet wavelets. In this paper, we choose Daubechies wavelets.
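To make the filter-then-invert step concrete, the sketch below implements a single-level DWT by hand and rebuilds a smoothed profile from the approximation coefficients only. Note that it deliberately swaps in the Haar wavelet, the simplest mother wavelet, so that the filters fit in a few lines; the paper itself uses Daubechies wavelets, whose filters are longer. All function names are hypothetical, and the input length is assumed to be even.

```python
import numpy as np


def haar_dwt(x):
    """Single-level Haar DWT: returns (approximation, detail), each half the length of x."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-pass filter + downsampling
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-pass filter + downsampling
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out


def profile(x):
    """Smoothed demand: keep the approximation, drop the high-frequency detail."""
    approx, detail = haar_dwt(x)
    return haar_idwt(approx, np.zeros_like(detail))


demand = np.array([2, 4, 10, 12, 6, 6, 1, 3])
smooth = profile(demand)   # array([ 3.,  3., 11., 11.,  6.,  6.,  2.,  2.])
```

The profile is described by half as many coefficients as the original vector and keeps the total demand unchanged, which is the dimension reduction the section relies on.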

Since we have found a way to describe the original distribution approximately with fewer parameters, we can use DWT to preprocess a fine-grained bike demand and get a profile for every day. Then, we can get the candidate temporal pattern of one grid from all profiles assembled from its original fine-grained bike demands on different days. We need to set a divergence threshold, as defined in Sections 2 and 3, to judge whether is a temporal pattern or the fine-grained demands of this grid are just disorganized. After all the above steps, we either get the temporal pattern of a grid or decide that it doesn’t have one.

In summary, we choose Daubechies wavelets for DWT. For each existing fine-grained bike demand of a grid , we use DWT to find its profile, which reduces the number of parameters. Then, we set a threshold to judge whether this grid has a temporal pattern using the similarity among all its profiles on different days. After these steps, we find that most grids have temporal patterns and put them into our training set, with the mined patterns as their labels.

5. Attention-Based Local CNN Model

After extracting features from multi-source data, we propose an Attention-based Local CNN (ALCNN) model to infer bike demand of each grid.

5.1. Local CNN

Geographically adjacent regions share similar characteristics. For example, if a place is near a metro station, it will have pedestrian volumes as high as the station's, and the station has a large influence on the traffic of that place. Inspired by this thought, we develop a local convolution network to model the influence of neighbors for each grid.

Figure 5 shows the structure of our proposed local CNN which can be seen as a variant of traditional CNN. The input of local CNN is the feature tensor of all grids in a city. We first select one target grid and its neighbors within a certain distance as a local region with Equation 15. E.g., in Figure 5, the target grid and its adjacent grids (distance equal to ) are selected. Note that every local region has all the features of the central grid itself and its neighbors, but it only has the label (temporal pattern after DWT) of the central grid, since it is just an enlargement of the central grid actually.


where is a target grid we consider, is the feature tensor for all grids in the city after coPCA. The selected feature map for every grid is a tensor called , where is the neighbor size and is the dimension of new features after coPCA.
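Selecting a local region amounts to cutting a square window out of the city feature tensor. The sketch below assumes a window of side 2r + 1 centered on the target grid and zero padding for grids at the city border; both choices, and the name `local_region`, are assumptions, since the text does not state how border grids are handled.

```python
import numpy as np


def local_region(features, row, col, radius):
    """Cut the (2*radius+1) x (2*radius+1) window around grid (row, col).

    features: (H, W, F) tensor of per-grid latent features.
    Grids outside the city are filled with zeros (assumed border handling).
    """
    padded = np.pad(features, ((radius, radius), (radius, radius), (0, 0)))
    return padded[row : row + 2 * radius + 1, col : col + 2 * radius + 1, :]


grid_features = np.arange(4 * 5 * 2, dtype=float).reshape(4, 5, 2)
region = local_region(grid_features, row=0, col=0, radius=1)
```

The window keeps the features of the target grid at its center, and a training instance pairs this window with the label of the central grid only.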

Next, convolution operations are performed on the feature map of selected local region. Different target grids share CNN parameters. Then the CNN is followed with two fully connected layers for learning interactions between features.


where is the parameter tensor of -th convolution filter, are the -th bias vectors, are -th weight matrices and is the activation function.

5.2. Attention Mechanism

Figure 6. The structure of our attention-based model. The depth of maps denotes the number of features of one grid. The purple grid is the target place of this sample for inferring bike demands. CNNs are applied to local regions of different sizes, i.e., , and etc. The learned latent features from different local regions are merged with attention mechanism.

With different local region sizes, we can produce several hidden vectors for each grid. The simplest way to merge these vectors is to concatenate them or to average them dimension by dimension. However, the size of the affected area may differ between grids. For example, the area affected by a metro station is larger than that affected by a park. In other words, different grids may prefer different sizes of local region in the bike demand inference task. Therefore, it may not be suitable to treat the outputs of CNNs with different local region sizes equally or to give them static weights.

In this paper, we propose to utilize the attention mechanism to weight the hidden features from different local CNNs for each grid. The attention module is shown in Figure 6. Given the hidden features output by the different local CNNs, we first compute the attention weights with the following equations, using the geo-related features of this grid and as one of the inputs.


where is a feature vector with size equal to and is a weight matrix.

Then, the hidden features are merged with attention weights using element-wise product as shown in Equation (18).


where is the result vector after attention mechanism.
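The merging step can be illustrated with a small NumPy sketch. The exact scoring function is not recoverable from the text, so the sketch assumes a common additive form: each hidden vector is concatenated with the grid's geo-related features, passed through a tanh layer, scored, and the scores are normalized with a softmax. The parameters `W` and `v`, their shapes, and the function names are all hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def attention_merge(hidden, geo, W, v):
    """Weight the hidden vectors of K local-region sizes and sum them.

    hidden: (K, D), one hidden vector per local CNN.
    geo:    (G,), geo-related features of the target grid.
    W: (A, D + G) and v: (A,) are assumed learnable parameters.
    """
    inputs = np.hstack([hidden, np.tile(geo, (hidden.shape[0], 1))])  # (K, D + G)
    scores = np.tanh(inputs @ W.T) @ v                                # (K,)
    weights = softmax(scores)
    merged = (weights[:, None] * hidden).sum(axis=0)                  # (D,)
    return merged, weights


rng = np.random.default_rng(0)
hidden = rng.normal(size=(3, 4))   # e.g. three local-region sizes
geo = rng.normal(size=5)
W = rng.normal(size=(6, 9))
v = rng.normal(size=6)
merged, weights = attention_merge(hidden, geo, W, v)
```

Because the weights depend on the grid's own features, two grids can favor different local-region sizes, in contrast to concatenation or plain averaging.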

Finally, the attention module outputs a hidden vector for each grid as its representation. With a fully connected layer, we infer the fine-grained bike demand for the grid, which is also the pattern on this grid.

Input : 
POI features: ;
Road network features: ;
Satellite light features: ;
Transportation features: ;
Business centers: ;
Bike demands: , ;
Output :  ALCNN model;
// Extract features and construct training instances
;
for all in city  do
      ;
      Expand to local region: ;
      Put an instance in ;
end for
// Train the model
Initialize all parameters in ALCNN;
repeat
      Randomly select a batch of instances from ;
      Find by minimizing loss function KLMSE;
until model converges or a stopping criterion is met;
Algorithm 1 ALCNN training process

5.3. Learning Process and Algorithm

To learn model parameters of our models, we use KLMSE as the objective function which is defined below:


where is the corresponding ground truth, is the inference of our model, and denotes the KL divergence. We adopt mini-batch Adam to update parameters iteratively. To prevent overfitting to the training data, we apply dropout to randomly drop neurons in the two fully-connected layers in Equation (16). We also perform batch normalization to address the covariate shift problem and achieve faster convergence.
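Since the exact form of KLMSE is not recoverable from the text, the following sketch only illustrates one plausible reading: a KL divergence term plus a mean squared error term, both computed on normalized demand vectors and averaged over the batch. The additive combination, the weight `alpha`, and the smoothing constant `eps` are assumptions, not the authors' definition.

```python
import numpy as np


def klmse(y_true, y_pred, alpha=1.0, eps=1e-9):
    """Assumed KL + alpha * MSE loss over a batch of demand vectors.

    y_true, y_pred: (batch, n_slots) non-negative arrays.
    """
    y_true = np.asarray(y_true, dtype=float) + eps
    y_pred = np.asarray(y_pred, dtype=float) + eps
    p = y_true / y_true.sum(axis=-1, keepdims=True)   # ground-truth distribution
    q = y_pred / y_pred.sum(axis=-1, keepdims=True)   # inferred distribution
    kl = np.sum(p * np.log(p / q), axis=-1)
    mse = np.mean((p - q) ** 2, axis=-1)
    return float(np.mean(kl + alpha * mse))


perfect = klmse([[1, 2, 3]], [[1, 2, 3]])
reversed_loss = klmse([[1, 2, 3]], [[3, 2, 1]])
```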

Algorithm 1 outlines the training process of ALCNN. Notice that every feature in the input contains two sets, one for the source city and the other for the target city. We first use coPCA and DWT to extract features from the multi-source geo-related data. For each grid, the features of its local region and its own bike demand pattern are used as one training sample. Next, we initialize our model. During the iterations, we randomly select a batch of training samples and update parameters using gradient descent until the model converges.

6. Experiments

6.1. Experimental Settings

Datasets. We crawled Mobike data of three different cities in China, i.e., Beijing, Shanghai and Ningbo, between 07/06/2017 and 15/07/2017. We also collected the kinds of geographic data mentioned in Section 4. Table 1 shows the statistics of both our Mobike data and geographic data.

Type                        Beijing      Shanghai     Ningbo
POIs
  # POIs                    532,094      694,898      85,613
  # POI categories          17           17           17
Satellite light
  # samples                 23,021       28,954       7,482
  Average intensity (cd)    16.086       11.179       20.309
  Max distance (m)          18.779       32.305       15.430
Road networks
  # roads                   23,021       29,398       5,351
  # road levels             29           32           26
Transportation centers
  # centers                 334          366          53
  Max distance (m)          33.475       43.526       10.861
Business centers
  # centers                 26           28           17
  Max level                 4            4            4
Mobike records
  # bikes                   656,437      591,295      35,591
  # records                 3,010,873    2,601,398    161,234
Table 1. Details of data

Compared Methods. As we study a novel research problem, there are few methods that are specially designed for solving it. As a result, we compare our proposed approach with several classic machine learning methods and one state-of-the-art approach.

  • Linear Regression (LR). This method ignores the features of neighbors and infers the bike demand of each grid independently using linear regression with -norm regularization.

  • KNN. KNN predicts fine-grained bike demands by computing the average demands of nearest neighbors. The neighbors are selected from the training set using cosine similarity on geographic features.

  • RandomForest. RandomForest constructs a multitude of decision trees to boost prediction performance.

  • XGBoost (Chen and Guestrin, 2016). XGBoost is an optimized distributed gradient boosting method with high efficiency.

  • CoFA GeoConv (Liu et al., 2018a). CoFA GeoConv is the state-of-the-art method for inferring bike demands and it combines joint Factor Analysis and convolutional neural network techniques.
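To make the KNN baseline concrete, the sketch below averages the demand vectors of the training grids whose geographic features are most similar to the query grid under cosine similarity. It is an illustrative stand-alone version, not the scikit-learn configuration used in the experiments; the function names and the value of `k` are placeholders.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def knn_infer(query_features, train_features, train_demands, k):
    """Average the demand vectors of the k training grids most similar to the query."""
    ranked = sorted(
        range(len(train_features)),
        key=lambda i: cosine_similarity(query_features, train_features[i]),
        reverse=True,
    )
    top = ranked[:k]
    n_slots = len(train_demands[0])
    return [sum(train_demands[i][t] for i in top) / len(top) for t in range(n_slots)]
```

For example, with training features `[[1, 0], [0, 1], [1, 1]]`, demands `[[10, 0], [0, 10], [4, 6]]`, query `[2, 0.1]`, and `k=2`, the first and third grids are selected and their demands are averaged.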

Parameter Setting. We tune all the methods mentioned above and report their performance with optimal parameter settings. LR, KNN, RandomForest, and XGBoost are implemented with scikit-learn, a popular Python machine learning library. For LR, we use normalization and set penalty weight . For KNN, we set the number of selected nearest neighbors to . For RandomForest, we use random state and bootstrap, and we set the number of estimators as . For XGBoost, hyperparameters are set to , , , , , , , . As for CoFA GeoConv, we use the same parameters as in their paper, i.e., , .

We implement our approach with TensorFlow. We set the learning rate to and batch size to . We also set our kernel size as , but if the local region size is less than or equal to , we reduce the kernel size accordingly. We use early stopping based on the performance on the validation set (training stops when there is no improvement in rounds). To prevent the model from overfitting, we employ dropout on the attention network and fully-connected layers, and the dropout ratio is set to . Batch normalization (Ioffe and Szegedy, 2015) is applied to the fully-connected layers to achieve faster convergence and better performance.
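The early-stopping rule can be sketched as below. This is a generic patience-based helper, assuming that the validation metric is a loss to be minimized; the class name, the `run` helper, and the patience values are hypothetical and are not taken from our TensorFlow implementation.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` rounds."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_rounds = 0

    def step(self, val_loss):
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience


def run(val_losses, patience):
    """Replay a sequence of validation losses; return (last round index, best loss)."""
    stopper = EarlyStopping(patience)
    for round_idx, loss in enumerate(val_losses):
        if stopper.step(loss):
            return round_idx, stopper.best
    return len(val_losses) - 1, stopper.best
```

A larger patience tolerates temporary plateaus in the validation curve at the cost of more training rounds.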

Evaluation Protocol. As our task in this paper is to infer bike demands for assisting the deployment of bikes in a new city, we use the dataset of one city to train our model and transfer it to another city. We employ the KLMSE between the inferred demands and the ground truth to evaluate the performance of the competing methods. Considering that bikes are usually deployed first in developed cities rather than in less developed ones, we test two kinds of transfer learning, i.e., transferring between two similar developed cities and transferring from a developed city to a less developed city. In our experiments, we treat Beijing (BJ) and Shanghai (SH) as developed cities and Ningbo (NB) as a less developed city. We have three transfer learning tasks: BJ → SH, SH → BJ, and SH → NB.

6.2. Comparison Results

Method              KLMSE
LR                  0.232    0.506    0.493
KNN                 0.182    0.213    0.201
RandomForest        0.175    0.199    0.187
XGBoost             0.151    0.163    0.570
CoFA GeoConv        0.120    0.135    0.129
ALCNN-DWT           0.107    0.120    0.117
ALCNN-coPCA         0.104    0.116    0.113
ALCNN-attention     0.112    0.128    0.126
ALCNN               0.093    0.104    0.099
Table 2. Comparison among different methods

We first give the comparison with other models on the three transfer learning tasks. The comparison results are shown in Table 2. ALCNN-DWT, ALCNN-coPCA, and ALCNN-attention denote our approach without DWT, coPCA, and attention modules, respectively.

We find that our attention-based local CNN model performs best on all three demand inference tasks and achieves the lowest KLMSE. Among all baselines, LR performs the worst because it does not consider the influence of neighbors. Apart from LR, the KLMSE values of the other methods are all below , since they all consider neighbors' features, which shows the effectiveness of leveraging neighbor information. RandomForest and XGBoost use multiple trees to perform ensemble learning, which improves the inference performance. CoFA GeoConv achieves the lowest KLMSE among all baselines because it uses coPCA to realize the distribution adaption between the datasets of the source city and the target city. In addition to coPCA, ALCNN uses DWT and the attention mechanism, which explains why it performs better. Compared with ALCNN-DWT, ALCNN-coPCA, and ALCNN-attention, ALCNN with all three modules achieves the best performance, which shows the effectiveness of using DWT, coPCA, and the attention mechanism.

6.3. Effectiveness of Attention Mechanism

Figure 7. Effectiveness of attention mechanism

In our approach, the attention mechanism aims to automatically find a suitable local region size, i.e., to determine the influence distance of neighbors. To investigate whether the attention mechanism works, we test the performance of our model with different fixed region sizes and compare the results with those of our method with the attention mechanism. The experimental results are shown in Figure 7.

In the figure, we can see that our approach performs differently with different local region sizes. When the local region size is , our approach achieves the lowest KLMSE. This shows that the local region size affects the performance of the inference model and that an optimal local region size exists. In addition, the red line in the figure shows the performance of our approach with the attention mechanism. With the attention mechanism, our approach performs much better, even compared with the method with the optimal fixed region size (i.e., ). We believe this is because the attention mechanism can select a different local region size for each grid, and different grids affect areas of different sizes. For example, the area affected by a metro station is much larger than that affected by a park in the suburbs.
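The idea of letting each grid weight several candidate region sizes can be illustrated with a generic softmax attention over the feature vectors produced by the different CNN branches. The sketch below is an assumption-laden simplification and not the formulation of our attention network: it scores each branch by a dot product with a hypothetical `query` vector, whereas the actual scoring function is defined in the model section.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_merge(branch_features, query):
    """Merge per-region-size feature vectors with softmax attention weights.

    branch_features: one feature vector per candidate local region size.
    query: hypothetical scoring vector (learned in a real model).
    """
    scores = [sum(q * h for q, h in zip(query, feats)) for feats in branch_features]
    weights = softmax(scores)
    dim = len(branch_features[0])
    merged = [
        sum(w * feats[d] for w, feats in zip(weights, branch_features))
        for d in range(dim)
    ]
    return weights, merged
```

With a zero query, all region sizes receive equal weight, which corresponds to a plain average of the branches. A grid whose query strongly favours one branch effectively selects that region size.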

6.4. Influence of Feature Extraction Module

Figure 8. Influence of our feature extraction model

Our feature extraction module contains two parts: coPCA and DWT. The two parts realize the distribution adaption between the datasets of the source city and the target city for the input features and the outputs, respectively, and improve the inference performance. Besides coPCA, other methods can perform distribution adaption, including Factor Analysis (FA) (Harman, 1960). For DWT, there are many wavelets to choose from, such as Haar wavelets and Daubechies wavelets. Therefore, we want to know whether coPCA is better than FA and which wavelets are the best for DWT. We compare our approach with these alternatives, and the results are shown in Figure 8.

In the experiments, we compare coPCA and FA, and test DWT with Haar wavelets (haar), Daubechies wavelets (db), Biorthogonal wavelets (bior), Coiflets wavelets (coif), and Symlets wavelets (sym). We can see that all combinations of distribution adaption methods and wavelets work well (KLMSE). When coFA is used, Biorthogonal wavelets work best, while when coPCA is used, Daubechies wavelets lead. Daubechies wavelets with coPCA have the lowest KLMSE. As a result, we use Daubechies-wavelet-based DWT and coPCA for our feature extraction component.
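To illustrate how a DWT separates a coarse demand pattern from fine fluctuations, the sketch below implements a single-level Haar transform, its inverse, and a smoothing step that discards the detail coefficients. Haar is used here only because it is the simplest wavelet (it coincides with the first-order Daubechies wavelet); this is not our DWT-based model, and the function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal):
    """Single-level Haar DWT: return (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild the signal from both coefficient lists."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / SQRT2)
        signal.append((a - d) / SQRT2)
    return signal


def smooth(signal, levels):
    """Keep only the coarse approximation after `levels` decompositions."""
    approx = list(signal)
    for _level in range(levels):
        approx, _detail = haar_dwt(approx)
    for _level in range(levels):
        approx = haar_idwt(approx, [0.0] * len(approx))
    return approx
```

Applying `smooth` to an hourly demand series with one level replaces each pair of adjacent hours by their mean, so more levels yield a coarser pattern.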

7. Related Work

7.1. Research on Bike Sharing Systems and Urban Computing

Several researchers have studied problems in bike sharing systems. Some of them focused on the prediction of traffic flow based on historical data (Hoang et al., 2016; Li et al., 2015; Yang et al., 2016). Zeng et al. introduced a station-level demand prediction method (Zeng et al., 2016). Other works mainly focused on the problem of station site optimization, which aims to choose the best locations for bike stations (Liu et al., 2015; Martinez et al., 2012). However, they only focused on a single city populated with dock-less shared bikes, without applying the insights learned from one city to other cities.

Many works consider temporal diversity in other fields of urban computing. The work of Ferreira et al. (Ferreira et al., 2013) supports origin-destination queries from users that enable the study of mobility across the city, while the work of Burns et al. (Burns et al., 2018) evaluates urban emissions in a river system. The work that influences us most is by Yao et al. (Yao et al., 2018), which proposes a multi-view network for taxi demand prediction. However, none of these works addresses the problem that the influence of an area is variable, which is the main contribution of our work.

7.2. Transfer Learning on Urban Computing

Some researchers have applied transfer learning approaches to urban computing, especially transfer learning among cities (Pan and Yang, 2009). Wei et al. solved the problem of predicting air quality in a target city by transferring knowledge learned from a source city (Wei et al., 2016). Several domain adaptation approaches (Do and Ng, 2006; Dong et al., 2015; Ganin and Lempitsky, 2014; Ganin et al., 2016) have been proposed to adapt a model trained in the source domain to the target domain, some of which use the ideas of multi-view and multi-task learning. Among all related work, the work of Liu et al. (Liu et al., 2018a) is the most closely related to ours. They mainly focus on how to infer bike distribution in new cities, which is a simplification of our problem. They propose a new way of feature extraction and fully apply the idea of transfer learning. Our work is greatly influenced by theirs in feature extraction, especially in how to use geographic data. Since our problem is more complicated than theirs, we mainly focus on temporal patterns and the attention mechanism. Their work does not consider the temporality of dock-less shared bikes, which is very important because dock-less shared bikes move very frequently.

7.3. Attention Mechanism

Since the attention mechanism was proposed, attention-based neural networks have been successfully used in many tasks (Ba et al., 2014; Vaswani et al., 2017), including machine translation (Luong et al., 2015), computer vision (You et al., 2016), speech recognition (Chorowski et al., 2015), and healthcare (Ma et al., 2018). However, in the field of urban computing, the attention mechanism is rarely used. Many works use networks similar to local CNN, but none of them adds an attention mechanism. Yao et al. point out that local CNN may cause problems because the neighbor size is fixed (Yao et al., 2018), which inspires us to use an attention mechanism to find a suitable local neighbor size for every grid.

8. Conclusion

In this paper, we focus on the problem of inferring fine-grained daily bike demands in a new city, which is an important issue faced by bike sharing companies. We point out the importance of mining temporal patterns of fine-grained bike demands with DWT. We also adopt coPCA to extract features from multi-source geographic data that can be transferred to new cities. Then, we propose a local CNN to infer fine-grained bike demands with consideration of neighbors' influence. We also use an attention mechanism to determine a suitable local region size for every grid. The experiments on a real-world dataset demonstrate that our proposed approach outperforms all competing prediction methods and show the effectiveness of the attention mechanism in selecting local region sizes.


References

  • J. Ba, V. Mnih, and K. Kavukcuoglu (2014) Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.
  • E. E. Burns, L. J. Carter, D. W. Kolpin, J. Thomas-Oates, and A. B. Boxall (2018) Temporal and spatial variation in pharmaceutical concentrations in an urban river system. Water research 137, pp. 72–85.
  • T. Chen and C. Guestrin (2016) Xgboost: a scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785–794.
  • J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. In Advances in neural information processing systems, pp. 577–585.
  • C. B. Do and A. Y. Ng (2006) Transfer learning for text classification. In Advances in Neural Information Processing Systems, pp. 299–306.
  • D. Dong, H. Wu, W. He, D. Yu, and H. Wang (2015) Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 1723–1732.
  • N. Ferreira, J. Poco, H. T. Vo, J. Freire, and C. T. Silva (2013) Visual exploration of big spatio-temporal urban data: a study of new york city taxi trips. IEEE Transactions on Visualization and Computer Graphics 19 (12), pp. 2149–2158.
  • Y. Ganin and V. Lempitsky (2014) Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030.
  • H. H. Harman (1960) Modern factor analysis.
  • M. X. Hoang, Y. Zheng, and A. K. Singh (2016) Forecasting citywide crowd flows based on big data. ACM SIGSPATIAL 2016.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  • R. Kronland-Martinet, J. Morlet, and A. Grossmann (1987) Analysis of sound patterns through wavelet transforms. International journal of pattern recognition and artificial intelligence 1 (02), pp. 273–302.
  • Y. Li, Y. Zheng, and Q. Yang (2018) Dynamic bike reposition: a spatio-temporal reinforcement learning approach. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1724–1733.
  • Y. Li, Y. Zheng, H. Zhang, and L. Chen (2015) Traffic prediction in a bike-sharing system. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 33.
  • J. Liu, Q. Li, M. Qu, W. Chen, J. Yang, H. Xiong, H. Zhong, and Y. Fu (2015) Station site optimization in bike sharing systems. In 2015 IEEE International Conference on Data Mining, pp. 883–888.
  • Z. Liu, Y. Shen, and Y. Zhu (2018a) Inferring dockless shared bike distribution in new cities. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 378–386.
  • Z. Liu, Y. Shen, and Y. Zhu (2018b) Where will dockless shared bikes be stacked?: parking hotspots detection in a new city. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 566–575.
  • M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
  • F. Ma, Q. You, H. Xiao, R. Chitta, J. Zhou, and J. Gao (2018) Kame: knowledge-based attention model for diagnosis prediction in healthcare. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 743–752.
  • L. M. Martinez, L. Caetano, T. Eiró, and F. Cruz (2012) An optimisation algorithm to establish the location of stations of a mixed fleet biking system: an application to the city of lisbon. Procedia-Social and Behavioral Sciences 54, pp. 513–524.
  • S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • Y. Wei, Y. Zheng, and Q. Yang (2016) Transfer knowledge between cities. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1905–1914.
  • Z. Yang, J. Hu, Y. Shu, P. Cheng, J. Chen, and T. Moscibroda (2016) Mobility modeling and prediction in bike-sharing systems. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pp. 165–178.
  • H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye, and Z. Li (2018) Deep multi-view spatial-temporal network for taxi demand prediction. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo (2016) Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4651–4659.
  • M. Zeng, T. Yu, X. Wang, V. Su, L. T. Nguyen, and O. J. Mengshoel (2016) Improving demand prediction in bike sharing system by learning global features. Machine Learning for Large Scale Transportation Systems (LSTS)@ KDD-16.