ALCNN: Attention-based Model for Fine-grained Demand Inference of Dockless Shared Bikes in New Cities
Abstract.
In recent years, dockless shared bikes have spread widely across many cities in China and facilitate people's lives. At the same time, however, the mismatch between demand and the real distribution of bikes raises many management problems. Before deploying dockless shared bikes in a city, companies need a plan for dispatching bikes from places with excessive bikes to locations with high demand in order to provide better services. In this paper, we study the problem of inferring fine-grained bike demands anywhere in a new city before the deployment of bikes. This problem is challenging because a new city lacks training data and bike demands vary over both place and time. To solve the problem, we provide various methods to extract discriminative features for each place from multi-source geographic data, such as POIs, road networks and nighttime light. We utilize correlation Principal Component Analysis (coPCA) to process the extracted features of both the old city and the new city to realize distribution adaption. Then, we adopt a discrete wavelet transform (DWT) based model to mine daily patterns for each place from fine-grained bike demands. We propose an attention-based local CNN model, ALCNN, to infer the daily patterns from the latent features produced by coPCA, using multiple CNNs to model the influence of neighboring places. In addition, ALCNN merges the latent features from multiple CNNs and can select a suitable size of influenced region. Extensive experiments on real-life datasets show that the proposed approach outperforms competitive methods.
1. Introduction
Recently, dockless shared bike services have achieved great success and reinvented the bike sharing business in China, especially in major cities. Dockless shared bikes provide an environmentally friendly solution to the last-mile problem, which refers to the troublesome distance between home and the nearest transit center. Many dockless shared bike companies, such as ofo and Mobike¹ (https://mobike.com/cn/), have grown rapidly and seized the market. The rules they set are similar: users find a dockless shared bike via a GPS-based smartphone app, follow the app's instructions to unlock the bike (scan the QR code or enter a password), ride the bike anywhere they want, and finally lock the bike and pay the company. The convenience of this mode benefits many people and has reinvented the bike industry in China.
But the prosperity of this new mode of transportation inevitably leads to new problems. Before companies officially deploy dockless shared bikes in a new city, they need a plan for managing bikes, as the real-time distribution of bikes may not match bike demand. Managing bikes according to bike demands will greatly improve their effective use, i.e., serving more customers at less cost. However, the bike demands in a city vary spatially and temporally. We regard the curve of demand per half hour in one area as the fine-grained bike demand. Figure 1 shows the fine-grained bike demands in a day for three locations. We can see that for each location, the fine-grained bike demand varies with time, and different places have different bike demands. If we know the fine-grained bike demands of the three places in Figure 1, at noon we can move spare bikes from the second place to the first and third places.
Fine-grained bike demands are easy to collect in a city that has deployed shared bikes for a while. But for a city with no shared bikes, the lack of historical bike demand records makes it hard to build a prediction model. In this paper, we focus on the problem of inferring fine-grained bike demands within a time slot for a place in new cities that have not yet deployed dockless shared bikes. We want to build a demand inference approach based on geo-related data, such as POIs, roads and transportation, in a city that already has bikes. Then we can transfer it to a new city and infer bike demands there. Companies can use the inference results to design a scheduling algorithm before the deployment to balance the supply and demand of bikes in all regions (Li et al., 2018).
Many existing works on dockless shared bike systems take geo-related data into consideration to infer the distribution of bike demand in a city (Liu et al., 2018a, b). But they focus only on the spatial distribution and omit temporal variability, which is also important for the management of bikes. In this paper, we aim to infer the bike demand distribution over both the spatial and temporal dimensions.
However, there are three challenges in solving this problem. First, the bike demands are time-varying but we only have static geographic data like POIs. The bike demands of one place vary with time within a day, and even for the same time of different days the demands may vary slightly. Second, our task is to infer fine-grained bike demands in a new city that has not deployed shared bikes, so we do not have any bike demand data in that city, and the data distributions of different cities may differ. Finally, each place is influenced by its neighbors, and the size of the area of influence is hard to determine.
To address these challenges, we first analyze the fine-grained bike demand data and find that there exist daily demand patterns for many places (details are shown in Section 3). We propose to utilize the discrete wavelet transform (DWT) to mine daily bike demand patterns from the fine-grained bike demands of every day in each place. The mined daily demand patterns are stable and are used as the ground truth of inference. To deal with the lack of training data in a new city, we next utilize multi-source geographic data to extract discriminative features and train an inference model to predict bike demands. The inference model is trained with data from an old city that has deployed shared bikes and is then transferred to new cities. We employ correlation Principal Component Analysis (coPCA) to realize distribution adaption when extracting latent features from the geographic data of both the old city and the new city. To consider the influence of neighbors, we then propose an attention-based model, ALCNN, which utilizes CNNs to aggregate geographic features from neighbors in a local region for each place. As the size of the influenced area varies across places, we build several CNN models with local regions of different sizes and merge their learned latent features using an attention mechanism. In other words, the attention mechanism can automatically choose a local region of suitable size for each place.
The main contributions of this paper are summarized as follows.

To the best of our knowledge, we are the first to study the problem of inferring fine-grained bike demands in a new city. We present the importance of finding temporal patterns in fine-grained bike demands and propose to use the discrete wavelet transform to mine daily patterns from bike demands.

We utilize local CNNs to aggregate the geographic features of neighbors in local regions of various sizes for each place. We propose to use an attention mechanism to merge the learned latent features from local regions of different sizes.

We evaluate our inference approach on real Mobike data of three cities in China. The results show that our approach outperforms all baselines and that the attention mechanism improves the inference performance.
The remainder of this paper is organized as follows. In Section 2, we formulate our fine-grained bike demand inference problem and introduce the framework of our approach. In Section 3, we analyze the Mobike dockless shared bike dataset. Next, we introduce the feature extraction process and our proposed model ALCNN in Section 4 and Section 5, respectively. Then, we conduct extensive experiments to evaluate our approach in Section 6.
2. Overview
In this section, we first introduce several fundamental definitions and present the formulation of our bike demand inference problem. Next, we show the overall framework of our approach.
2.1. Preliminary
Definition 2.1 (City grid $g_{i,j}$ and city grid map $G$).
We regard a city as one rectangle and divide it into disjoint grids of the same size according to latitude and longitude. Each grid is denoted by $g_{i,j}$, where $1 \le i \le I$ and $1 \le j \le J$. As a result, the city can be seen as a grid map, i.e., $G = \{g_{i,j} \mid 1 \le i \le I,\, 1 \le j \le J\}$. For convenience, we use the function $\mathrm{in}(p, g_{i,j})$ to judge whether a point $p$ is inside a grid $g_{i,j}$.
Definition 2.2 (Mobike trip record collection $R$).
We use the data from a dockless shared bike company called Mobike. Our dataset of dockless shared bikes is a collection of riding trip records, i.e., $R = \{r\}$. Each record $r$ is a tuple $(l_s, t_s, l_e, t_e)$ whose elements denote the starting location and time, and the ending location and time, respectively. A location is formed by its longitude and latitude.
If the starting location $r.l_s$ of one record $r$ is within a grid $g_{i,j}$, we say that a renting behavior happened in that grid. Based on the trip records, we define a bike demand set as follows.
Definition 2.3 (Fine-grained bike demand set $D_{i,j}$).
For easy processing, we divide each day into $N$ time slots of the same length and use $t_{m,n}$ to denote the $n$-th time slot in the $m$-th day. Next, we can obtain the fine-grained bike demand of each grid in a certain time slot with the following equation:
(1)  $d_{i,j}^{m,n} = \big|\{\, r \in R \mid \mathrm{in}(r.l_s, g_{i,j}) \wedge r.t_s \in t_{m,n} \,\}\big|$
As we aim to infer the daily bike demand pattern of each grid, we use one demand vector $\mathbf{d}_{i,j}^{m} = (d_{i,j}^{m,1}, \ldots, d_{i,j}^{m,N})$ to denote the bike demands of all time slots in day $m$ for grid $g_{i,j}$. For each grid, its bike demand vectors of different days form a set of fine-grained bike demands $D_{i,j} = \{\mathbf{d}_{i,j}^{m}\}$.
As Figure 2 shows, we find that in most grids, the fine-grained bike demands of different days are similar to each other. Based on this observation, we can draw two conclusions. On the one hand, most grids have daily patterns in their bike demands. On the other hand, it is better to mine these daily patterns, which are more stable and benefit the convergence of our inference model. Therefore, we define the daily temporal pattern of bike demand for each grid as follows.
Definition 2.4 (Daily demand pattern $\mathbf{p}_{i,j}$).
For each grid, we aim to find a demand vector which is similar to the demand vectors of most days. Considering that the bike demands of different grids or different days can vary widely, we normalize each demand vector by the sum of its elements. We employ the Kullback-Leibler (KL) divergence to measure the difference between demand vectors. If we manage to find a demand vector $\mathbf{p}_{i,j}$ whose KL divergence to all other demand vectors of the same grid is smaller than a threshold $\delta$, we treat $\mathbf{p}_{i,j}$ as the daily demand pattern of grid $g_{i,j}$:
(2)  $\mathrm{KL}\big(\mathbf{p}_{i,j} \,\|\, \mathbf{d}_{i,j}^{m}\big) < \delta, \quad \forall\, \mathbf{d}_{i,j}^{m} \in D_{i,j}$
Note that $\mathbf{p}_{i,j}$ does not have a superscript $m$ because it is not a real bike demand vector of any day. We will discuss how to find $\mathbf{p}_{i,j}$ by DWT later.
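The pattern test of Definition 2.4 can be sketched in code. This is a minimal illustration: the demand vectors are normalized to sum to one before comparison, and the threshold value below is illustrative rather than the paper's actual choice of the threshold.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL divergence between two demand vectors, each normalized to sum to 1.
    A small epsilon avoids division by zero for empty time slots."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def is_daily_pattern(candidate, daily_demands, threshold=0.1):
    """Treat `candidate` as the grid's daily pattern if its KL divergence
    to every observed day's demand vector stays below `threshold`."""
    return all(kl_divergence(candidate, d) < threshold for d in daily_demands)
```

For example, a flat candidate vector matches a set of flat daily demands regardless of their absolute scale, since normalization removes the magnitude difference.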
Problem Formulation. We consider two cities S and T, where S has deployed dockless shared bikes and T is a new city about to launch a shared bike business. We divide both cities into grids of the same size and regard city S as the source domain and city T as the target domain. Given a set of raw bike data in city S, we compute the set of fine-grained bike demands $D$ according to Definition 2.3. Our goal is to infer the fine-grained bike demands in the new city T, based on the known fine-grained bike demands from city S and multiple geo-related data sources (e.g., POIs and road networks) in both cities S and T.
2.2. Inference Framework
To achieve the fine-grained bike demand inference goal, we propose an inference system consisting of two major components, i.e., feature extraction and model training, as shown in Figure 3.
In the feature extraction component, we first extract features from multi-source geographic data including POIs, road networks, satellite lights, transportation centers and business centers. We also process the original Mobike records to get bike demands. Next, we utilize correlation Principal Component Analysis (coPCA) and the Discrete Wavelet Transform (DWT) to further process the two kinds of features. coPCA produces discriminative latent features as the input of our inference model; it processes the geo-related data of the source city and target city together, which realizes distribution adaption between the two cities and improves the inference performance. DWT extracts temporal patterns from the bike demand vectors and improves the stability of our inference model, as it represents the demand vectors with fewer dimensions and parameters.
In the second component, we feed the extracted latent features to the proposed attention-based local CNN model. We train the model with data of the source city S and transfer the trained model to a new city T. Using the geographic features of the target city T, we can infer the bike demands in the new city. The proposed attention-based local CNN model selects the features of each grid and its neighbors and processes them with multiple convolution networks to model the influence of neighbors. The attention mechanism in the model helps choose the most suitable size of local region to provide better inference results.
3. Data Analysis
Before we conduct empirical analysis on the dockless shared bike data, we need to set up an evaluation method to measure how close two different fine-grained demands are. To do this, we adopt the Kullback-Leibler (KL) divergence to compute the closeness between two fine-grained bike demands. Based on the definitions in Section 2, we define the KL divergence between the (normalized) demand vectors of two days $a$ and $b$:
(3)  $\mathrm{KL}\big(\mathbf{d}_{i,j}^{a} \,\|\, \mathbf{d}_{i,j}^{b}\big) = \sum_{n=1}^{N} \mathbf{d}_{i,j}^{a}(n) \log \dfrac{\mathbf{d}_{i,j}^{a}(n)}{\mathbf{d}_{i,j}^{b}(n)}$
Intuitively, the smaller the KL divergence is, the closer two fine-grained demand vectors are. In our context, the closeness of every two consecutive bike demand vectors indicates the stability of the daily fine-grained bike demands.
As we have many fine-grained bike demands for different days on each grid, we still need a way to evaluate the divergence among all fine-grained bike demands of a grid. As mentioned in Section 2, we use the maximum KL divergence among all fine-grained bike demands as the evaluation. A lower divergence means a higher possibility that we can find a pattern in that grid.
(4)  $\mathrm{Div}(g_{i,j}) = \max_{1 \le a,\, b \le M} \mathrm{KL}\big(\mathbf{d}_{i,j}^{a} \,\|\, \mathbf{d}_{i,j}^{b}\big)$
where $M$ is the number of days. With this definition of divergence, we can conduct empirical analysis on real data to see whether it is necessary to find a pattern. We scan all the real data and get the result shown in Figure 4.
We can see that most of the real fine-grained demands have a small divergence (the exact proportions are shown in Figure 4). Based on this observation, we draw the conclusion that most grids in the city have temporal patterns, so it is necessary to mine temporal patterns for further work.
Actually, temporal patterns are very complicated. For example, even if we mainly consider the peaks in patterns, the number, time and height of the peaks all change across the city. However, we do not think that the differences among temporal patterns are just random. Instead, we think there is a connection between temporal patterns and the geographic data. For example, most grids near subway stations have "TRIPLE PEAK" patterns (three peaks during the day), because a subway station has a large flow of people and bursts at every possible peak time, while the grids near business centers usually have "DOUBLE PEAK" patterns (two peaks during the day): the people working in a business center have a rapid pace of work, so they do not have enough time to use bikes at noon, which makes the noon peak disappear.
In summary, we find that it is necessary to mine temporal patterns from fine-grained bike demands, since most grids have a low divergence. We also find that temporal patterns typically differ because of geographic diversity, which inspires us to leverage geographic data sources such as POIs, subway stations and business centers when inferring the patterns.
4. Feature Extraction
In this section, we show the details of the feature extraction component. The result of feature extraction will be the input of our proposed ALCNN model in Section 5.
4.1. Features from Multisource Data
Geographic data are utilized to infer bike demands in this paper, as they reflect the condition of transportation and are highly related to the bike demands. We consider five kinds of geographic data: POIs, road networks, nighttime light, transportation centers, and business centers. We give a brief introduction to each kind of data and present the feature extraction methods. The detailed features extracted from the five kinds of data are denoted by $F_{poi}$, $F_{road}$, $F_{light}$, $F_{trans}$, and $F_{busi}$, respectively.
1. POI Features:
Each POI represents a city venue with its name, address, category and spatial coordinates. The number and diversity of POIs in a grid reflect its prosperity and hence are related to the bike demand density. We extract the following three POI features for each grid:
(1) POI category frequency ($f_{cat}$): We associate each city grid $g_{i,j}$ with a POI category frequency vector $f_{cat}$. $f_{cat}$ is a $C$-dimensional vector whose $c$-th element indicates the number of venues of the $c$-th category located in grid $g_{i,j}$.
(2) Number of POIs ($f_{num}$): We count the total number of POIs in a grid based on the frequency vector: $f_{num} = \sum_{c=1}^{C} f_{cat}(c)$.
(3) POI entropy ($f_{ent}$): Besides the number of POIs, we also compute the entropy based on $f_{cat}$ to indicate the heterogeneity of POIs in a grid.
(5)  $f_{ent} = -\sum_{c=1}^{C} \dfrac{f_{cat}(c)}{f_{num}} \log \dfrac{f_{cat}(c)}{f_{num}}$
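The three POI features above can be sketched as follows. This is a minimal illustration that assumes the per-category POI counts of a grid have already been tallied:

```python
import numpy as np

def poi_features(category_counts):
    """Given the per-category POI counts of one grid, return
    (category frequency vector, total POI count, POI entropy)."""
    freq = np.asarray(category_counts, dtype=float)
    total = freq.sum()
    if total == 0:
        return freq, 0.0, 0.0          # empty grid: zero entropy by convention
    probs = freq[freq > 0] / total     # skip zero categories (0 * log 0 = 0)
    entropy = float(-np.sum(probs * np.log(probs)))
    return freq, float(total), entropy
```

A grid with POIs spread uniformly over two categories reaches entropy $\log 2$, while a grid dominated by one category has entropy near zero, matching the heterogeneity intuition.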
2. Road Network Features:
Intuitively, the more roads are located in one grid, the more convenient its traffic will be. In our dataset, each road consists of its name, category (or level), start point and end point. We propose two road network features for every grid:
(1) Road category frequency ($r_{cat}$): We associate a road category frequency vector $r_{cat}$ with each city grid $g_{i,j}$. The $c$-th element of $r_{cat}$ denotes the number of roads of the $c$-th type overlapping grid $g_{i,j}$.
(2) Number of roads ($r_{num}$): Given the frequency vector $r_{cat}$, we count the total number of overlapping roads: $r_{num} = \sum_{c} r_{cat}(c)$.
3. Nighttime Light Features:
The nighttime light data are collected by satellites. We sample the light intensity at regularly spaced points on the map. Intuitively, nighttime light intensity is positively correlated with the business prosperity and population density of a grid, which imply large shared bike demands. We identify two kinds of features:
(1) Average light intensity ($f_{avg}$): To reduce the influence of data noise, we calculate the average light intensity. Denote by $Q$ the set of light points in the city, and by $v(q)$ the intensity of point $q$. We compute the average light intensity in grid $g_{i,j}$:
(6)  $f_{avg} = \dfrac{\sum_{q \in Q,\, \mathrm{in}(q, g_{i,j})} v(q)}{\big|\{\, q \in Q \mid \mathrm{in}(q, g_{i,j}) \,\}\big|}$
(2) Distance to the nearest light centre ($f_{lc}$): From the map of nighttime lights, we identify several centers whose light intensity is larger than other places. Each light center is denoted by its longitude and latitude. Based on the light centers, we compute the geographic distance between each grid and its nearest light center.
(7)  $f_{lc} = \min_{c \in C_{l}} \mathrm{dist}(g_{i,j}, c)$
where $\mathrm{dist}(l_1, l_2)$ calculates the geographic distance between two locations $l_1$ and $l_2$. For a grid $g_{i,j}$, we use the longitude and latitude of its center as its location. $C_{l}$ denotes the set of all light centers in a city.
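The nearest-center distance feature can be sketched as follows. The haversine great-circle formula is used here as one common choice of geographic distance; the paper does not specify the exact distance function, so this is an assumption:

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points."""
    r = 6_371_000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_center_distance(grid_center, centers):
    """Distance from a grid's center (lon, lat) to its nearest center
    from a list of (lon, lat) centers."""
    return min(haversine_m(*grid_center, *c) for c in centers)
```

The same helper applies unchanged to the transportation-center and business-centre distance features defined later in this section.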
4. Transportation Features:
As we all know, most dockless shared bikes are parked near transportation centers, such as subway stations, because people usually get off buses or metros at transportation centers and need to find a bike to travel a short distance, e.g., from the bus station to home. We extract two main features:
(1) Number of transportation centers ($f_{tn}$): We use $C_{t}$ to denote the set of transportation centers in the city. Then, we count the number of transportation centers in each grid:
(8)  $f_{tn} = \big|\{\, c \in C_{t} \mid \mathrm{in}(c, g_{i,j}) \,\}\big|$
(2) Distance to the nearest transportation center ($f_{td}$): For each grid, we also compute the distance from its center to the nearest transportation center:
(9)  $f_{td} = \min_{c \in C_{t}} \mathrm{dist}(g_{i,j}, c)$
5. Business Centre Features:
Around business centres, the flow of people can be very large, which indicates large bike demands. In our business centre dataset, each business centre is denoted by its name, level, and location. We use $B$ to denote the set of all business centres in a city. We extract two business centre features for each grid:
(1) Distance to the nearest business centre ($f_{bd}$):
(10)  $f_{bd} = \min_{b \in B} \mathrm{dist}(g_{i,j}, b)$
(2) Level of the nearest business centre ($f_{bl}$):
(11)  $f_{bl} = \mathrm{level}\Big(\arg\min_{b \in B} \mathrm{dist}(g_{i,j}, b)\Big)$
4.2. Correlation Principal Component Analysis for Transfer Learning
As our task is to infer fine-grained bike demands in a new city and the data distributions of the two cities may be different, we propose to apply coPCA over the extracted features to achieve distribution adaption. Originally, PCA aims to use low-dimensional latent features to represent the raw features. If each data sample is seen as one point in a coordinate system, PCA rotates the coordinate axes so that the samples have the largest variance along the top several axes. The latent features on these axes preserve effective information and filter out data noise. In this paper, we use coPCA to find a transformation that minimizes the difference between the distributions of data from the two cities.
We first construct two matrices to store the features of the grids of the two cities, respectively. In the matrices, each row denotes one grid in a city and every column stands for one kind of feature; in other words, each row is a concatenation of the extracted feature vectors such as $F_{poi}$ and $F_{road}$. Next, we concatenate the two matrices along the first dimension and apply PCA to the entire matrix. When we get the result of coPCA, we divide the output matrix into two parts along the first dimension. The two parts can be seen as the dimension-reduced data of the source city and the target city, respectively.
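The stack-project-split procedure described above can be sketched as follows. This is a minimal NumPy version via SVD; a library PCA (e.g., scikit-learn's) fitted on the stacked matrix would work equally well:

```python
import numpy as np

def co_pca(src_feats, tgt_feats, n_components):
    """Fit PCA on the stacked source+target feature matrices so both cities
    share one latent space, then split the projection back per city."""
    stacked = np.vstack([src_feats, tgt_feats])
    stacked = stacked - stacked.mean(axis=0)        # center jointly
    # principal axes come from the SVD of the centered matrix
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    latent = stacked @ vt[:n_components].T          # project onto top axes
    return latent[: len(src_feats)], latent[len(src_feats):]
```

Because the axes are fitted on both cities jointly, the two projections live in the same latent space, which is what allows a model trained on the source city's latent features to be applied to the target city's.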
4.3. Discrete Wavelet Transform
As discussed in Section 3, each grid has its daily bike demand pattern. Moreover, the dimension of a fine-grained bike demand vector can be large, which may lead to high computational complexity and a large number of parameters. Inspired by the dimension reduction of the input features, we also want to reduce the dimension of the output, i.e., the bike demands. We utilize the Discrete Wavelet Transform (DWT) to solve this problem. As with other wavelet transforms, a key advantage it has over Fourier transforms is temporal resolution: it captures both frequency and location (in time) information.
The process of DWT is actually passing the signal through a series of filters. For convenience, we use two filters, a low-pass filter and a high-pass filter, which are decided by the mother wavelet and are known as quadrature mirror filters² (https://en.wikipedia.org/wiki/Quadrature_mirror_filter). According to early research on DWT (Kronland-Martinet et al., 1987), the low-pass filter turns the original signal into the approximation coefficients, while the high-pass filter turns it into the detail coefficients. The main reason we use DWT is that it obtains the approximation via the low-pass filter, which uses fewer parameters because the discarded parameters carry the high-frequency information. The basic math formulas are as follows:
(12)  $y_{low}[k] = \sum_{n} x[n]\, l[2k - n]$
(13)  $y_{high}[k] = \sum_{n} x[n]\, h[2k - n]$
(14)  $\hat{x} = \mathrm{idwt}(y_{low}, \mathbf{0})$
where $x$ is the fine-grained bike demand vector of a grid on one day, $k$ indexes the reduced coefficients, whose number is half of the original size in this case, $h$ is the high-pass filter while $l$ is the low-pass filter, and $y_{low}$, $y_{high}$ are the results of the DWT. Then we apply the inverse DWT (idwt) to the approximation coefficients alone to obtain a new fine-grained demand $\hat{x}$, and we ensemble all the new fine-grained demands of a grid on different days to get its candidate daily pattern. The choice of $l$ and $h$ is based on the mother wavelet, which is not unique, e.g., Haar wavelets, Daubechies wavelets and Symlet wavelets. In this paper, we choose Daubechies wavelets.
Since we have found a way to approximately describe the original distribution with fewer parameters, we use DWT to preprocess the fine-grained bike demand and get a profile for every day. Then, we get the candidate temporal pattern of one grid by assembling the profiles of its original fine-grained bike demands on different days. We need to set a standard divergence threshold $\delta$, defined in Sections 2 and 3, to judge whether the candidate is a temporal pattern or the fine-grained demands of this grid are just disorganized. After all the above steps, we get the temporal pattern of every grid, or decide that it does not have one.
In summary, we choose Daubechies wavelets for the DWT. For each existing fine-grained bike demand of one grid, we use DWT to find its profile, which reduces the number of parameters. Then, we set a threshold to judge whether the grid has a temporal pattern using the similarity among its profiles on different days. After these steps, we find that most grids have temporal patterns and put them into our training set, with the mined patterns as their labels.
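The profile extraction above can be sketched with a one-level Haar wavelet, which keeps the example self-contained; the paper uses Daubechies wavelets, which a wavelet library would supply, and the averaging step used here to assemble the profiles into a candidate pattern is our simplifying assumption:

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: return (approximation, detail) coefficients.
    The input length is assumed even."""
    x = np.asarray(x, dtype=float).reshape(-1, 2)
    approx = (x[:, 0] + x[:, 1]) / np.sqrt(2)   # low-pass output
    detail = (x[:, 0] - x[:, 1]) / np.sqrt(2)   # high-pass output
    return approx, detail

def haar_profile(x):
    """Reconstruct from the approximation only (detail set to zero),
    giving a smoothed 'profile' of the daily demand vector."""
    approx, _ = haar_dwt(x)
    return np.repeat(approx / np.sqrt(2), 2)

def candidate_pattern(daily_demands):
    """Assemble the smoothed profiles of all days into a candidate pattern
    (here simply by averaging them)."""
    return np.mean([haar_profile(d) for d in daily_demands], axis=0)
```

Note that a demand vector that is constant within each pair of slots is reproduced exactly by the approximation alone, while within-pair fluctuations land in the discarded detail coefficients.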
5. Attention-based Local CNN Model
After extracting features from multi-source data, we propose an Attention-based Local CNN (ALCNN) model to infer the bike demand of each grid.
5.1. Local CNN
As we all know, geographically adjacent regions share similar characteristics. For example, if a place is near a metro station, it will have pedestrian volumes as high as the station's, and the station has a large influence on the traffic of that place. Inspired by this, we develop a local convolution network to model the influence of neighbors for each grid.
Figure 5 shows the structure of our proposed local CNN, which can be seen as a variant of the traditional CNN. The input of the local CNN is the feature tensor of all grids in a city. We first select one target grid and its neighbors within a certain distance $s$ as a local region with Equation (15). E.g., in Figure 5, the target grid and its adjacent grids (distance equal to $1$) are selected. Note that every local region has all the features of the central grid itself and its neighbors, but it only has the label (the temporal pattern after DWT) of the central grid, since the region is actually just an enlargement of the central grid.
(15)  $S_{i,j} = \{\, F_{i',j'} \mid |i' - i| \le s,\ |j' - j| \le s \,\}$
where $g_{i,j}$ is the target grid we consider and $F$ is the feature tensor of all grids in the city after coPCA. The selected feature map of every grid is a tensor $S_{i,j}$ of size $(2s+1) \times (2s+1) \times k$, where $s$ is the neighbor size and $k$ is the dimension of the new features after coPCA.
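The local region selection of Equation (15) can be sketched as follows. Grids outside the city boundary are zero-padded here; the boundary handling is our assumption, as the paper does not specify it:

```python
import numpy as np

def local_region(features, i, j, s):
    """Select the (2s+1) x (2s+1) x k feature patch centered on grid (i, j).
    `features` has shape (I, J, k); neighbors outside the city are zero-padded."""
    I, J, k = features.shape
    padded = np.zeros((I + 2 * s, J + 2 * s, k))
    padded[s:s + I, s:s + J, :] = features          # place the city in the middle
    return padded[i:i + 2 * s + 1, j:j + 2 * s + 1, :]
```

Stacking these patches over all target grids yields the input batch for one local CNN with neighbor size $s$; repeating this for several values of $s$ produces the inputs for the different branches merged later by the attention module.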
Next, convolution operations are performed on the feature map of the selected local region. Different target grids share the CNN parameters. The CNN is then followed by two fully connected layers for learning interactions between features.
(16)  $h_{i,j} = \sigma\big(W_2\, \sigma\big(W_1\, \mathrm{Conv}(S_{i,j}; K) + b_1\big) + b_2\big)$
where $K$ is the parameter tensor of the convolution filters, $b_1$ and $b_2$ are the bias vectors, $W_1$ and $W_2$ are the weight matrices of the fully connected layers, and $\sigma$ is the activation function.
5.2. Attention Mechanism
With different local region sizes, we can produce several hidden vectors for each grid. The simplest way to merge these vectors is to concatenate them or compute the average value of each dimension. However, the sizes of the affecting areas of different grids may differ. For example, the affecting area of a metro station is larger than that of a park. In other words, different grids may prefer different local region sizes in the bike demand inference task. Therefore, it may not be suitable to treat the outputs of CNNs with different local region sizes equally or give them static weights.
In this paper, we propose to utilize the attention mechanism to weight the hidden features from the different local CNNs for each grid. The attention module is shown in Figure 6. Given the hidden features $h_u$ output by the different local CNNs, we first compute the attention weight $\alpha_u$ with the following equation, using the geo-related features $z_{i,j}$ of this grid and $h_u$ as the input:
(17)  $\alpha_u = \dfrac{\exp\big(z_{i,j}^{\top} W_a h_u\big)}{\sum_{u'} \exp\big(z_{i,j}^{\top} W_a h_{u'}\big)}$
where $z_{i,j}$ is the feature vector of the grid and $W_a$ is a weight matrix.
Then, the hidden features are merged with the attention weights using element-wise products, as shown in Equation (18):
(18)  $\tilde{h}_{i,j} = \sum_{u} \alpha_u \odot h_u$
where $\tilde{h}_{i,j}$ is the result vector of the attention mechanism.
Finally, the attention module outputs a hidden vector for each grid as its representation. With a fully connected layer, we infer the fine-grained bike demand of the grid, i.e., its pattern.
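The attention merge of Equations (17) and (18) can be sketched as follows. The bilinear scoring form with the matrix `w` is an illustrative stand-in for the paper's attention network, and scalar weights per region size are assumed:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_merge(hidden_list, geo_feats, w):
    """Merge hidden vectors from local CNNs of different region sizes.
    A score per region size is computed from the grid's geo features
    via a bilinear form with weight matrix `w`, then the hidden vectors
    are combined with the resulting softmax weights."""
    h = np.stack(hidden_list)                          # (n_sizes, hidden_dim)
    scores = np.array([geo_feats @ w @ hu for hu in h])
    weights = softmax(scores)                          # one weight per region size
    return weights @ h                                 # weighted sum of hidden vectors
```

Because the weights depend on the grid's own geo features, two grids with different surroundings (say, a metro station versus a park) can end up favoring the branches built from different local region sizes.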
5.3. Learning Process and Algorithm
To learn the parameters of our models, we use KLMSE as the objective function, which is defined below:
(19)  $\mathcal{L} = \dfrac{1}{M} \sum_{m=1}^{M} \mathrm{KL}\big(y_m \,\|\, \hat{y}_m\big)^2$
where $y_m$ is the corresponding ground truth, $\hat{y}_m$ is the inference of our model and $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the KL divergence. We adopt mini-batch Adam to update the parameters iteratively. To prevent overfitting on the training data, we apply dropout to randomly drop neurons in the two fully-connected layers in Equation (16). We also perform batch normalization to address the covariate shift problem and achieve faster convergence.
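One plausible reading of the KLMSE objective, the mean of squared KL divergences between ground-truth and inferred demand patterns, can be sketched as follows; the exact combination of KL and MSE is not fully specified here, so this form is an assumption:

```python
import numpy as np

def kl_div(p, q, eps=1e-9):
    """KL divergence between two demand vectors normalized to sum to 1."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def klmse(y_true, y_pred):
    """Mean of squared KL divergences over all training samples
    (one plausible reading of the KLMSE objective)."""
    return float(np.mean([kl_div(t, p) ** 2 for t, p in zip(y_true, y_pred)]))
```

In a deep learning framework this quantity would be expressed with the framework's ops so that gradients flow through the inferred patterns; the NumPy version above is only for checking values.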
Algorithm 1 outlines the training process of our ALCNN. Notice that all features in the input contain two sets, one for the source city and the other for the target city. We first use coPCA and DWT to extract features from the multi-source geo-related data. For each grid, the features of its local region and its own bike demand pattern are used as one training sample. Next, we initialize our model. During the iterations, we randomly select a batch of training samples and update the parameters using gradient descent until the model converges.
6. Experiments
6.1. Experimental Settings
Datasets. We crawled Mobike data of three different cities in China, i.e., Beijing, Shanghai and Ningbo, between 07/06/2017 and 15/07/2017. We also collected the various kinds of geographic data mentioned in Section 4. Table 1 shows the statistics of both our Mobike data and the geographic data.
Table 1. Statistics of the Mobike and geographic datasets.

Type                       Beijing    Shanghai   Ningbo
POI
  # POIs                   532,094    694,898    85,613
  # POI categories         17         17         17
Satellite light
  # samples                23,021     28,954     7,482
  Average intensity (cd)   16.086     11.179     20.309
  Max distance (m)         18.779     32.305     15.430
Road networks
  # roads                  23,021     29,398     5,351
  # road levels            29         32         26
Transportation centers
  # centers                334        366        53
  Max distance (m)         33.475     43.526     10.861
Business centers
  # centers                26         28         17
  Max level                4          4          4
Mobike records
  # bikes                  656,437    591,295    35,591
  # records                3,010,873  2,601,398  161,234
Compared Methods. As we study a novel research problem, there are few methods specially designed for it. As a result, we compare our proposed approach with several classic machine learning methods and one state-of-the-art approach.

Linear Regression (LR): This method ignores the features of neighbors and infers the bike demand of each grid independently using linear regression with norm regularization.

KNN. KNN predicts fine-grained bike demands by computing the average demands of the nearest neighbors. The neighbors are selected from the training set using cosine similarity on the geographic features.

RandomForest. RandomForest constructs a multitude of decision trees to boost prediction performance.

XGBoost (Chen and Guestrin, 2016). XGBoost is an optimized distributed gradient boosting method with high efficiency.

CoFA GeoConv (Liu et al., 2018a). CoFA GeoConv is the stateoftheart method for inferring bike demands and it combines joint Factor Analysis and convolutional neural network techniques.
Parameter Setting. Generally, we tune all the methods mentioned above and report their performance under the optimal parameter settings. LR, KNN, RandomForest, and XGBoost are implemented with scikit-learn³ (http://scikit-learn.org), a popular Python machine learning library. For LR, we use normalization and tune the penalty weight. For KNN, we tune the number of selected nearest neighbors. For RandomForest, we use a fixed random state with bootstrap and tune the number of estimators. For XGBoost, we tune its hyperparameters on the validation set. As for CoFA GeoConv, we use the same parameters as in their paper.
We implement our approach with TensorFlow⁴ (https://www.tensorflow.org/). We tune the learning rate and batch size, and fix the convolution kernel size; if the local region size is smaller than the kernel, we decrease the kernel size correspondingly. We utilize early stopping based on the performance on the validation set (training stops when there is no improvement for several consecutive rounds). To prevent the model from overfitting, we employ dropout on the attention network and the fully-connected layers. Batch normalization (Ioffe and Szegedy, 2015) is applied to the fully-connected layers for faster convergence and better performance.
Evaluation Protocol. As our task in this paper is to infer bike demands for assisting the deployment of bikes in a new city, we use the dataset of one city to train our model and transfer it to another city. We employ the KLMSE between the inferred demands and the ground truth to evaluate the various competing methods. Considering that bikes are usually deployed first in developed cities rather than in less developed ones, we test two kinds of transfer learning, i.e., transferring between two similarly developed cities and transferring from a developed city to a less developed one. In our experiments, we treat Beijing (BJ) and Shanghai (SH) as developed cities and Ningbo (NB) as a less developed city, giving three transfer learning tasks: BJ → SH, BJ → NB, and SH → NB.
6.2. Comparison Results
Table 2. Comparison results on the three transfer tasks (KLMSE; lower is better).

Method          | BJ → SH | BJ → NB | SH → NB
----------------|---------|---------|--------
LR              | 0.232   | 0.506   | 0.493
KNN             | 0.182   | 0.213   | 0.201
RandomForest    | 0.175   | 0.199   | 0.187
XGBoost         | 0.151   | 0.163   | 0.570
CoFA GeoConv    | 0.120   | 0.135   | 0.129
ALCNN-DWT       | 0.107   | 0.120   | 0.117
ALCNN-coPCA     | 0.104   | 0.116   | 0.113
ALCNN-attention | 0.112   | 0.128   | 0.126
ALCNN           | 0.093   | 0.104   | 0.099
We first compare our model with the others on the three transfer learning tasks; the results are shown in Table 2. ALCNN-DWT, ALCNN-coPCA, and ALCNN-attention denote our approach without the DWT, coPCA, and attention modules, respectively.
We find that our attention-based local CNN model performs best on all three demand inference tasks, achieving the lowest KLMSE. Among the baselines, LR performs worst because it does not consider the influence of neighbors. The KLMSE values of the other methods are markedly lower, since they all leverage neighbors' features, which proves the effectiveness of this information. RandomForest and XGBoost use ensembles of trees, which further improves inference performance. CoFA GeoConv achieves the lowest KLMSE among the baselines because it realizes distribution adaptation between the source-city and target-city datasets. Beyond distribution adaptation, our ALCNN also utilizes DWT and the attention mechanism, which explains its further improvement. Compared with ALCNN-DWT, ALCNN-coPCA, and ALCNN-attention, the full ALCNN achieves the best performance, which proves the effectiveness of all three modules.
6.3. Effectiveness of Attention Mechanism
In our approach, the attention mechanism aims to automatically find a suitable local region size, i.e., to determine the influence distance of neighbors. To investigate whether the attention mechanism works, we test our model with different fixed region sizes and compare the results with those of our method with the attention mechanism. The experimental results are shown in Figure 7.
From the figure, we can see that different local region sizes lead to different performance, and one intermediate size achieves the lowest KLMSE. This proves that the local region size affects the performance of the inference model and that an optimal size exists. In addition, the red line in the figure shows the performance of our approach with the attention mechanism: it performs noticeably better even than the method with the optimal fixed region size. We believe the attention mechanism can select a different optimal local region size for each grid, which improves inference performance, because different grids affect areas of different sizes; e.g., the area affected by a metro station is much larger than that of a park in the suburbs.
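The intuition that attention lets each grid weight differently sized neighborhoods can be illustrated with a plain softmax merge over per-region-size latent vectors. All names, dimensions, and values here are illustrative, not the paper's actual architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def attention_merge(region_feats, query):
    """Merge latent features produced by CNNs over different local region
    sizes. region_feats: (n_sizes, d), one latent vector per region size;
    query: (d,), a grid-specific query vector. Returns the attention-
    weighted combination and the weights themselves."""
    scores = region_feats @ query   # one relevance score per region size
    weights = softmax(scores)       # normalized attention weights
    merged = weights @ region_feats # weighted sum over region sizes
    return merged, weights

# Toy example: latent features for three region sizes (e.g., 3x3, 5x5, 7x7).
feats = np.array([[1., 0., 0., 0.],
                  [0., 2., 0., 0.],
                  [0., 0., 1., 1.]])
query = np.array([0., 1., 0., 0.])
merged, weights = attention_merge(feats, query)
```

For this query the second region size dominates the merge, mimicking a grid whose demand is best explained by a mid-sized neighborhood.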
6.4. Influence of Feature Extraction Module
Our feature extraction module contains two parts, coPCA and DWT, which realize the distribution adaptation between the source-city and target-city datasets for the input features and the outputs, respectively, and improve the inference performance. Besides coPCA, other methods can perform distribution adaptation, such as Factor Analysis (FA) (Harman, 1960). For DWT, there are many wavelets to choose from, such as Haar wavelets and Daubechies wavelets. We therefore examine whether coPCA outperforms FA and which wavelet works best for DWT. The comparison results are shown in Figure 8.
In the experiments, we compare coPCA with FA and test DWT with Haar (haar), Daubechies (db), Biorthogonal (bior), Coiflets (coif), and Symlets (sym) wavelets. All combinations of distribution adaptation method and wavelet work well in terms of KLMSE. When FA is used, Biorthogonal wavelets turn out to work best, while when coPCA is used, Daubechies wavelets are leading. Since Daubechies wavelets with coPCA achieve the lowest KLMSE overall, we adopt Daubechies-based DWT and coPCA for our feature extraction component.
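As a concrete illustration of the DWT step, a single-level Haar transform (the simplest of the wavelets compared above) splits a demand series into a coarse trend (approximation) and fine fluctuations (detail). This hand-rolled version is only a sketch with toy numbers; a full wavelet library would be used in practice:

```python
import numpy as np

def haar_dwt_level1(x):
    """One level of the Haar DWT: scaled pairwise sums give the
    approximation (coarse trend) and scaled pairwise differences
    give the detail coefficients."""
    x = np.asarray(x, dtype=float)
    assert len(x) % 2 == 0, "Haar needs an even-length series"
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

# Toy hourly bike demand over 8 hours.
demand = [4, 6, 10, 10, 8, 4, 2, 2]
approx, detail = haar_dwt_level1(demand)

# The transform is invertible, so no information is lost:
recon = np.empty(len(demand))
recon[0::2] = (approx + detail) / np.sqrt(2.0)
recon[1::2] = (approx - detail) / np.sqrt(2.0)
```

Repeating the decomposition on the approximation coefficients yields coarser daily patterns at each level, which is the sense in which DWT mines multi-scale temporal structure from the demand series.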
7. Related Work
7.1. Research on Bike Sharing Systems and Urban Computing
Several researchers have studied problems in bike sharing systems. Some focused on predicting traffic flow from historical data (Hoang et al., 2016; Li et al., 2015; Yang et al., 2016). Zeng et al. introduced a station-level demand prediction method (Zeng et al., 2016). Other works focused on station site optimization, i.e., choosing the best locations for bike stations (Liu et al., 2015; Martinez et al., 2012). However, these works focused on a single city populated with dockless shared bikes, without applying the insights learned in one city to other cities.
Many works consider temporal diversity in other fields of urban computing. Ferreira et al. (Ferreira et al., 2013) support origin-destination queries from users that enable the study of mobility across a city, while Burns et al. (Burns et al., 2018) evaluate pharmaceutical concentrations in an urban river system. The work that influences us most is by Yao et al. (Yao et al., 2018), which predicts taxi demand from multiple views. However, none of these works addresses the fact that the influence range of an area is variable, which is the main contribution of our work.
7.2. Transfer Learning on Urban Computing
Some researchers have applied transfer learning to urban computing, especially transfer among cities (Pan and Yang, 2009). Wei et al. predicted air quality in a target city by transferring knowledge learned from a source city (Wei et al., 2016). Several domain adaptation approaches (Do and Ng, 2006; Dong et al., 2015; Ganin and Lempitsky, 2014; Ganin et al., 2016) have been proposed to adapt a model trained in a source domain to a target domain, some of which use multi-view and multi-task ideas. Among all related work, that of Liu et al. (Liu et al., 2018a) is the closest to ours. They focus on inferring bike distribution in new cities, which is a simplification of our problem; they propose a new way of feature extraction and fully apply the idea of transfer learning. Our work is greatly influenced by their feature extraction, especially the use of geographic data. Since our problem is more complicated than theirs, we focus additionally on temporal patterns and the attention mechanism. Their work does not consider the temporality of dockless shared bikes, which is important because dockless shared bikes flow very frequently.
7.3. Attention Mechanism
Since the idea of the attention mechanism was proposed, attention-based neural networks have been successfully used in many tasks (Ba et al., 2014; Vaswani et al., 2017), including machine translation (Luong et al., 2015), computer vision (You et al., 2016), speech recognition (Chorowski et al., 2015), and healthcare (Ma et al., 2018). However, in the field of urban computing, the attention mechanism is rarely used. Many works use networks similar to the local CNN, but none adds an attention mechanism to it. Yao et al. point out that a local CNN may cause problems because the neighbor size is fixed (Yao et al., 2018), which inspires us to use the attention mechanism to find a suitable local neighborhood size for every grid.
8. Conclusion
In this paper, we focus on inferring fine-grained daily bike demands in a new city, an important issue faced by bike sharing companies. We point out the importance of mining temporal patterns of fine-grained bike demands with DWT. We also adopt coPCA to extract, from multi-source geographic data, features that can be transferred to new cities. We then propose a local CNN to infer fine-grained bike demands while accounting for neighbors' influence, and utilize an attention mechanism to determine a suitable local region size for every grid. Experiments on real-world datasets demonstrate that our proposed approach outperforms all competing methods and prove the effectiveness of the attention mechanism in selecting local region sizes.
References
Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.
Temporal and spatial variation in pharmaceutical concentrations in an urban river system. Water Research 137, pp. 72–85.
XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577–585.
Transfer learning for text classification. In Advances in Neural Information Processing Systems, pp. 299–306.
Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 1723–1732.
Visual exploration of big spatio-temporal urban data: a study of New York City taxi trips. IEEE Transactions on Visualization and Computer Graphics 19 (12), pp. 2149–2158.
Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495.
Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030.
Modern factor analysis.
Forecasting citywide crowd flows based on big data. ACM SIGSPATIAL 2016.
Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Analysis of sound patterns through wavelet transforms. International Journal of Pattern Recognition and Artificial Intelligence 1 (02), pp. 273–302.
Dynamic bike reposition: a spatiotemporal reinforcement learning approach. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1724–1733.
Traffic prediction in a bike-sharing system. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 33.
Station site optimization in bike sharing systems. In 2015 IEEE International Conference on Data Mining, pp. 883–888.
Inferring dockless shared bike distribution in new cities. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 378–386.
Where will dockless shared bikes be stacked? Parking hotspots detection in a new city. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 566–575.
Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
KAME: knowledge-based attention model for diagnosis prediction in healthcare. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 743–752.
An optimisation algorithm to establish the location of stations of a mixed fleet biking system: an application to the city of Lisbon. Procedia - Social and Behavioral Sciences 54, pp. 513–524.
A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.
Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
Transfer knowledge between cities. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1905–1914.
Mobility modeling and prediction in bike-sharing systems. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pp. 165–178.
Deep multi-view spatial-temporal network for taxi demand prediction. In Thirty-Second AAAI Conference on Artificial Intelligence.
Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4651–4659.
Improving demand prediction in bike sharing system by learning global features. Machine Learning for Large Scale Transportation Systems (LSTS) @ KDD16.