Forecaster: A Graph Transformer for Forecasting Spatial and Time-Dependent Data
Abstract
Spatial and time-dependent data are of interest in many applications. Forecasting such data is difficult due to complex spatial dependency, long-range temporal dependency, data non-stationarity, and data heterogeneity. To address these challenges, we propose Forecaster, a graph Transformer architecture. Specifically, we start by learning the structure of the graph that parsimoniously represents the spatial dependency between the data at different locations. Based on the topology of the graph, we sparsify the Transformer to account for the strength of spatial dependency, long-range temporal dependency, data non-stationarity, and data heterogeneity. We evaluate Forecaster on the problem of forecasting taxi ride-hailing demand and show that our proposed architecture significantly outperforms state-of-the-art baselines.
1 Introduction
Spatial and time-dependent data describe the evolution of signals (i.e., the values of attributes) at multiple spatial locations across time [39, 14]. Such data occur in many domains, including economics [8], global trade [10], environmental studies [15], public health [20], and traffic networks [16], to name a few. For example, the gross domestic product (GDP) of different countries over the past century, the daily temperature measurements of different cities for the last decade, and the hourly taxi ride-hailing demand at various urban locations in the recent year are all spatial and time-dependent data. Forecasting such data allows us to proactively allocate resources and take actions to improve the efficiency of society and the quality of life.
However, forecasting spatial and time-dependent data is challenging — the data exhibit complex spatial dependency, long-range temporal dependency, heterogeneity, and non-stationarity. Take the spatial and time-dependent data in a traffic network as an example. The data at a location (e.g., taxi ride-hailing demand) may correlate more with the data at a geographically remote location than with the data at a nearby location [16], exhibiting complex spatial dependency. Also, the data at a time instant may be similar to the data at a recent time instant, say an hour ago, but may also highly correlate with the data a day ago or even a week ago, showing strong long-range temporal dependency. Additionally, spatial and time-dependent data may be influenced by many other relevant factors (e.g., weather influences taxi demand); such factors carry relevant information and shall be taken into account. In other words, in this paper, we propose to perform forecasting with heterogeneous sources of data at different spatial and time scales, including auxiliary information of a different nature or modality. Further, the data may be non-stationary due to unexpected incidents or traffic accidents [16]. This non-stationarity makes conventional time series forecasting methods such as autoregressive integrated moving average (ARIMA) and vector autoregression (VAR), which usually rely on stationarity, inappropriate for accurate forecasting of spatial and time-dependent data [16, 40].
Recently, deep learning models have been proposed for forecasting spatial and time-dependent data [16, 35, 11, 36, 7, 38, 34, 40]. To deal with spatial dependency, most of these models use either predefined distance/similarity metrics or other prior knowledge, such as the adjacency matrices of traffic networks, to determine the dependency among locations. Then, they often use a (standard or graph) convolutional neural network (CNN) to characterize the spatial dependency between these locations. These ad hoc methods may lead to errors: locations that are considered dependent (independent) may actually be independent (dependent) in practice. As a result, these models may encode the data at a location by considering the data at independent locations and neglecting the data at dependent locations, leading to inaccurate encoding. Regarding temporal dependency, most of these models use recurrent neural networks (RNN), CNN, or their variants to capture the long-range temporal dependency and non-stationarity of the data. But it is well documented that these networks may fail to capture the temporal dependency between distant time epochs [9, 29].
To tackle these challenges, we propose Forecaster, a new deep learning architecture for forecasting spatial and time-dependent data. Our architecture consists of two parts. First, we use the theory of Gaussian Markov random fields [24] to learn the structure of the graph that parsimoniously represents the spatial dependency between the locations (we call such a graph a dependency graph). Gaussian Markov random fields model spatial and time-dependent data as a multivariate Gaussian distribution over the spatial locations, and we estimate the precision matrix of this distribution [6]: its nonzero entries indicate which locations are conditionally dependent and thus define the structure of the dependency graph. Second, based on this graph, we sparsify the Transformer so that the sparsified architecture captures the spatial dependency, long-range temporal dependency, non-stationarity, and heterogeneity of the data.
To evaluate the effectiveness of our proposed architecture, we apply it to the task of forecasting taxi ride-hailing demand in New York City [28]. We pick 996 hot locations in New York City and forecast the hourly taxi ride-hailing demand around each location from January 1st, 2009 to June 30th, 2016. Our architecture accounts for crucial auxiliary information such as weather, day of the week, hour of the day, and holidays, which significantly improves the forecast. Evaluation results show that our architecture reduces the root mean square error (RMSE) and mean absolute percentage error (MAPE) of the Transformer by 8.8210% and 9.6192%, respectively, and that it significantly outperforms the other state-of-the-art baselines.
In this paper, we make the following key contributions:

- Forecaster combines the theory of Gaussian Markov random fields with deep learning. It uses the former to find the dependency graph among locations, and this graph becomes the basis for the deep learner to forecast spatial and time-dependent data.

- Forecaster sparsifies the architecture of the Transformer based on the dependency graph, allowing the Transformer to better capture the spatiotemporal dependency within the data.

- We apply Forecaster to forecasting taxi ride-hailing demand and demonstrate the advantage of its architecture over state-of-the-art baselines.
2 Methodology
In this section, we introduce the proposed architecture of Forecaster. We start by formalizing the problem of forecasting spatial and time-dependent data (Section 2.1). Then, we use Gaussian Markov random fields to determine the dependency graph among data at different locations (Section 2.2). Based on this dependency graph, we design a sparse linear layer, which is a fundamental building block of Forecaster (Section 2.3). Finally, we present the entire architecture of Forecaster (Section 2.4).
2.1 Problem Statement
We define spatial and time-dependent data as a series of spatial signals, each collecting the data at all spatial locations at a certain time. For example, the hourly taxi demand at a thousand locations in 2019 is spatial and time-dependent data, while the hourly taxi demand at these locations between 8 a.m. and 9 a.m. on January 1st, 2019 is a spatial signal. The goal of our forecasting task is to predict the future spatial signals given the historical spatial signals and historical/future auxiliary information (e.g., weather history and forecast). We formalize forecasting as learning a function that maps historical spatial signals and historical/future auxiliary information to future spatial signals, as in Equation (1):
$$\mathbf{x}_{t+1}, \ldots, \mathbf{x}_{t+T'} = f\left(\mathbf{x}_{t-T+1}, \ldots, \mathbf{x}_{t};\; \mathbf{u}_{t-T+1}, \ldots, \mathbf{u}_{t+T'}\right) \quad (1)$$

where $\mathbf{x}_t \in \mathbb{R}^{N}$ is the spatial signal at time $t$, with $x_t^i$ the data at location $i$ at time $t$; $N$ is the number of locations; $\mathbf{u}_t \in \mathbb{R}^{D}$ is the auxiliary information at time $t$; and $D$ is the dimension of the auxiliary information.
2.2 Gaussian Markov Random Field
We use Gaussian Markov random fields to find the dependency graph of the data over the different spatial locations. Gaussian Markov random fields model the spatial and time-dependent data as a multivariate Gaussian distribution over locations, i.e., the probability density function of the vector $\mathbf{x} = [x^1, \ldots, x^N]^{T}$ collecting the data at the $N$ locations is
$$p(\mathbf{x}) = \frac{|\mathbf{Q}|^{1/2}}{(2\pi)^{N/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{T}\, \mathbf{Q}\, (\mathbf{x} - \boldsymbol{\mu})\right) \quad (2)$$

where $\boldsymbol{\mu}$ and $\mathbf{Q}$ are the expected value (mean) and precision matrix (inverse of the covariance matrix) of the distribution.
The precision matrix $\mathbf{Q}$ characterizes the conditional dependency between different locations — whether the data $x^i$ and $x^j$ at the $i$-th and $j$-th locations depend on each other or not given the data at all the other locations ($k \neq i, j$). We can measure the conditional dependency between locations $i$ and $j$ through their conditional correlation coefficient $r_{ij}$:
$$r_{ij} = -\frac{Q_{ij}}{\sqrt{Q_{ii}\, Q_{jj}}} \quad (3)$$

where $Q_{ij}$ is the $(i, j)$ entry of $\mathbf{Q}$. In practice, we set a threshold on $r_{ij}$ and treat locations $i$ and $j$ as conditionally dependent if the absolute value of $r_{ij}$ is above the threshold.
The nonzero entries of the thresholded matrix define the structure of the dependency graph between locations. Figure 1 shows an example of a dependency graph: locations 1 and 2 and locations 2 and 3 are conditionally dependent, while locations 1 and 3 are conditionally independent. This simple example illustrates the advantage of Gaussian Markov random fields over ad hoc pairwise similarity metrics — the former lead to parsimonious (sparse) graph representations.
We estimate the precision matrix by graphical lasso [6], an L1-penalized maximum likelihood estimator:

$$\hat{\mathbf{Q}} = \underset{\mathbf{Q} \succ 0}{\arg\min}\; \operatorname{tr}(\mathbf{S}\mathbf{Q}) - \log|\mathbf{Q}| + \lambda \|\mathbf{Q}\|_{1} \quad (4)$$

where $\lambda$ controls the sparsity of $\hat{\mathbf{Q}}$ and $\mathbf{S}$ is the empirical covariance matrix computed from the data:

$$\mathbf{S} = \frac{1}{T} \sum_{t=1}^{T} (\mathbf{x}_t - \hat{\boldsymbol{\mu}})(\mathbf{x}_t - \hat{\boldsymbol{\mu}})^{T} \quad (5)$$

where $T$ is the number of time samples used to compute $\mathbf{S}$ and $\hat{\boldsymbol{\mu}}$ is the empirical mean.
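To make this step concrete, the following sketch estimates a precision matrix and thresholds the conditional correlations of Equation (3) to obtain a dependency graph. For brevity it substitutes a ridge-regularized inverse of the empirical covariance for the graphical lasso of Equation (4); in practice an L1-penalized solver (e.g., scikit-learn's `GraphicalLasso`) would be used, making the estimate itself sparse. The function name and the `ridge` and `threshold` defaults are our own illustrative choices.

```python
import numpy as np

def dependency_graph(X, threshold=0.1, ridge=1e-3):
    """Estimate a dependency graph from spatial signals.

    X: (T, N) array holding one spatial signal per row.
    Returns the conditional correlation matrix r (Eq. 3) and a
    boolean adjacency matrix obtained by thresholding |r|.
    """
    S = np.cov(X, rowvar=False)                    # empirical covariance (Eq. 5)
    # Stand-in for graphical lasso (Eq. 4): ridge-regularized inverse covariance.
    Q = np.linalg.inv(S + ridge * np.eye(S.shape[0]))
    d = np.sqrt(np.diag(Q))
    r = -Q / np.outer(d, d)                        # conditional correlations (Eq. 3)
    np.fill_diagonal(r, 1.0)
    adj = np.abs(r) >= threshold                   # prune weak dependencies
    np.fill_diagonal(adj, False)
    return r, adj
```

With a true graphical lasso estimate, most entries of `r` are exactly zero and the threshold mainly removes numerical noise.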
2.3 Building Block: Sparse Linear Layer
We use the dependency graph to sparsify the architecture of the Transformer, which lets the Transformer better capture the spatial dependency within the data. The Transformer contains multiple linear layers; our sparsification replaces all of them with the sparse linear layers described in this section.
We use the dependency graph to build a sparse linear layer. Figure 2 shows an example (based on the dependency graph in Figure 1). Suppose that initially the input layer (of five neurons) is fully connected to the output layer (of nine neurons). We assign neurons to the data at different locations (marked as "1", "2", and "3" for locations 1, 2, and 3, respectively) and to the auxiliary information (marked as "a"). How to assign neurons is a design choice for users. In this example, we assign one neuron to each location and two neurons to the auxiliary information at the input layer, and assign two neurons to each location and three neurons to the auxiliary information at the output layer. After assigning neurons, we prune connections based on the structure of the dependency graph. As locations 1 and 3 are conditionally independent, we prune the connections between them. We also prune the connections between the neurons associated with locations and those associated with the auxiliary information to further simplify the architecture.
Our sparse linear layer is similar to state-of-the-art graph convolution approaches such as GCN [12] and TAGCN [5, 26] — all of them transform the data based on the adjacency matrix of the graph. The major difference is that our sparse linear layer learns individual weights for the nonzero entries of the adjacency matrix (equivalent to the weights of the sparse linear layer), since different locations may depend on each other with different strengths.
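The construction above can be sketched as an ordinary linear layer whose weight matrix is masked elementwise by a block-structured connectivity pattern derived from the dependency graph. The NumPy sketch below (function names and block sizes are our own) reproduces the pruning rules of Figure 2: conditionally independent location pairs are disconnected, and location-to-auxiliary connections are pruned.

```python
import numpy as np

def sparse_linear_mask(adj, n_in, n_out, a_in, a_out):
    """Build the boolean connectivity mask of a sparse linear layer.

    adj:   (N, N) boolean dependency graph between locations.
    n_in / n_out: input/output neurons assigned to each location.
    a_in / a_out: input/output neurons assigned to the auxiliary info.
    """
    N = adj.shape[0]
    keep = adj | np.eye(N, dtype=bool)      # each location always sees itself
    # Expand every kept location pair into an (n_out x n_in) block of weights.
    loc = np.kron(keep, np.ones((n_out, n_in), dtype=bool))
    mask = np.zeros((N * n_out + a_out, N * n_in + a_in), dtype=bool)
    mask[:N * n_out, :N * n_in] = loc
    # Auxiliary neurons connect only among themselves
    # (location-to-auxiliary connections are pruned).
    mask[N * n_out:, N * n_in:] = True
    return mask

def sparse_linear(x, W, b, mask):
    """Apply a linear layer whose weights are pruned by `mask`."""
    return (W * mask) @ x + b
```

Instantiated on the dependency graph of Figure 1, with one input and two output neurons per location plus two/three auxiliary neurons, this yields the 9 x 5 connectivity of Figure 2.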
2.4 Entire Architecture: Graph Transformer
Forecaster adopts an architecture similar to that of the Transformer, except that all the linear layers in the Transformer are substituted with our sparse linear layers designed based on the dependency graph. Figure 3 shows the architecture. Forecaster employs an encoder-decoder architecture [27], which has been widely adopted in sequence generation tasks such as taxi demand forecasting [16] and pose prediction [30]. The encoder encodes the historical spatial signals and historical auxiliary information; the decoder predicts the future spatial signals based on the output of the encoder and the future auxiliary information. We omit what Forecaster shares with the Transformer (e.g., positional encoding, multi-head attention) and emphasize only their differences in this section; we provide a brief introduction to multi-head attention in the appendix.
Encoder
At each time step in the history, we concatenate the spatial signal with its auxiliary information. This way, we obtain a sequence where each element is a vector consisting of the spatial signal and the auxiliary information at a specific time step. The encoder takes this sequence as input. A sparse embedding layer (a sparse linear layer with ReLU activation) then maps each element of this sequence to the state space of the model and outputs a new sequence. In Forecaster, except for the sparse linear layer at the end of the decoder, all the layers have the same output dimension; we term this dimension and the space with this dimension the state space of the model. After that, we add positional encoding to the new sequence, giving temporal order information to each element of the sequence. Next, we pass the obtained sequence through stacked encoder layers to generate the encoding of the input sequence. Each encoder layer consists of a sparse multi-head attention layer and a sparse feedforward layer. These are the same multi-head attention layer and feedforward layer as in the Transformer, except that sparse linear layers, which reflect the spatial dependency between locations, replace the linear layers within them. The sparse multi-head attention layer enriches the encoding of each element with the information of the other elements in the sequence, capturing the long-range temporal dependency between elements. It takes each element as a query, as a key, and as a value. A query is compared with the other keys to obtain the similarities between an element and the other elements, and these similarities are then used to weight the values and produce the new encoding of the element. Note that each query, key, and value consists of two parts: the part encoding the spatial signal and the part encoding the auxiliary information — both impact the similarity between a query and a key. As a result, in the new encoding of each element, the part encoding the spatial signal takes the auxiliary information into account. The sparse feedforward layer further refines the encoding of each element.
Decoder
For each time step in the future, we concatenate its auxiliary information with the (predicted) spatial signal one step before, and input the resulting sequence to the decoder. The decoder first uses a sparse embedding layer to map each element of the sequence to the state space of the model, adds the positional encoding, and then passes the sequence through stacked decoder layers to obtain the new encoding of each element. Finally, the decoder uses a sparse linear layer to project this encoding back and predict the next spatial signal. Similar to the Transformer, each decoder layer contains two sparse multi-head attention layers and a sparse feedforward layer. The first (masked) sparse multi-head attention layer compares the elements in the sequence, obtaining a new encoding for each element. As in the Transformer, we place a mask here so that an element is compared only with earlier elements in the sequence. This is because, in the inference stage, a prediction can be made based only on the earlier predictions and the past history — information about later predictions is not available; the mask ensures that the training stage mirrors the inference stage. The second sparse multi-head attention layer compares each element of the sequence in the decoder with the history sequence in the encoder so that we can learn from the past history. If non-stationarity occurs, the comparison reveals that the element differs from the historical elements it is normally similar to, so the model instead learns from other, more similar historical elements, handling the non-stationarity. The following sparse feedforward layer further refines the encoding of each element.
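The mask in the first decoder attention layer can be illustrated with a small generic helper (not Forecaster's exact implementation): disallowed positions receive a similarity of minus infinity and therefore zero weight after the softmax.

```python
import numpy as np

def causal_mask(L):
    """Boolean (L, L) mask: element i may attend only to elements j <= i."""
    return np.tril(np.ones((L, L), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over the last axis, with masked-out positions zeroed."""
    s = np.where(mask, scores, -np.inf)
    s = s - s.max(axis=-1, keepdims=True)   # subtract row max for stability
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)
```

During training, applying this mask to the decoder's self-attention scores makes each position's prediction depend only on earlier positions, exactly as at inference time.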
3 Evaluation
In this section, we apply Forecaster to the problem of forecasting taxi ride-hailing demand in Manhattan, New York City. We demonstrate that Forecaster outperforms state-of-the-art baselines (the Transformer [29] and DCRNN [16]) and a conventional time series forecasting method (VAR [19]).
3.1 Evaluation Settings
Dataset
Our evaluation uses the NYC Taxi dataset [28] from 01/01/2009 to 06/30/2016 (7.5 years in total). This dataset records detailed information for each taxi trip in New York City, including its pickup and dropoff locations. Based on this dataset, we select 996 locations with hot taxi ride-hailing demand in Manhattan, New York City, shown in Figure 4. Specifically, we compute the taxi ride-hailing demand at each location by accumulating the taxi rides closest to that location. Note that these selected locations are not uniformly distributed, as different regions of Manhattan have distinct taxi demand.
Our evaluation uses hourly weather data from [32] to construct (part of) the auxiliary information. Each record in this weather data contains seven entries — temperature, wind speed, precipitation, visibility, and the Booleans for rain, snow, and fog.
Details of the Forecasting Task
In our evaluation, we forecast taxi demand for the next three hours based on the previous 674 hours and the corresponding auxiliary information (i.e., a history of around four weeks; $T = 674$ and $T' = 3$ in Equation (1)). Instead of directly inputting this history sequence into the model, we first filter it. This filtering is based on the following observation: future taxi demand correlates most with the taxi demand in the recent past hours, at similar hours of the past week, and at similar hours on the same weekday in the past several weeks. In other words, we shrink the history sequence and input only the elements relevant to forecasting. Specifically, our filtered history sequence contains the data for the following taxi demand (and the corresponding auxiliary information):

- The recent past hours;

- Similar hours of the past week;

- Similar hours on the same weekday of the past several weeks.
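As an illustration, such a filtered history can be assembled as a set of hour offsets relative to the forecast time. The specific offsets below (five recent hours, a one-hour window around the same hour one week back, and the same hour in earlier weeks) are hypothetical placeholders, not the exact ranges used in the paper:

```python
def filtered_history_indices(t, recent=5, weeks=4):
    """Return sorted hour indices of the filtered history for a forecast at hour t.

    `recent` and `weeks` are illustrative values, not the paper's settings.
    """
    idx = [t - h for h in range(1, recent + 1)]            # recent past hours
    idx += [t - 7 * 24 + d for d in (-1, 0, 1)]            # similar hours, past week
    idx += [t - w * 7 * 24 for w in range(2, weeks + 1)]   # same weekday, past weeks
    return sorted(set(i for i in idx if i >= 0))
```

The resulting index set is far shorter than the full 674-hour window, so the attention layers only compare the forecast against history elements that plausibly matter.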
Evaluation Metrics
Similar to prior work [16, 7], we use root mean square error (RMSE) and mean absolute percentage error (MAPE) to evaluate the quality of the forecasting results. Suppose that for forecasting job $k$ ($k = 1, \ldots, K$) the ground truth is $x_{k,j}^{i}$ and the prediction is $\hat{x}_{k,j}^{i}$, where $i = 1, \ldots, N$ indexes locations and $j = 1, \ldots, T'$ indexes forecast steps; $N$ is the number of locations and $T'$ is the length of the forecasted sequence. Then RMSE and MAPE are:

$$\text{RMSE} = \sqrt{\frac{1}{K N T'} \sum_{k=1}^{K} \sum_{j=1}^{T'} \sum_{i=1}^{N} \left(\hat{x}_{k,j}^{i} - x_{k,j}^{i}\right)^{2}}, \qquad \text{MAPE} = \frac{100\%}{K N T'} \sum_{k=1}^{K} \sum_{j=1}^{T'} \sum_{i=1}^{N} \left|\frac{\hat{x}_{k,j}^{i} - x_{k,j}^{i}}{x_{k,j}^{i}}\right| \quad (6)$$
Following the practice in prior work [7], we set a threshold on the ground truth when computing MAPE: if the ground truth is below the threshold, we disregard the term associated with it. This practice prevents small ground-truth values from dominating MAPE.
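These two metrics, with the MAPE threshold applied to the ground truth, can be written directly in NumPy; the threshold default of 1.0 below is an illustrative placeholder, not the value used in the paper:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over all locations and forecast steps."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mape(y_true, y_pred, threshold=1.0):
    """Mean absolute percentage error, ignoring entries whose ground
    truth is below `threshold` (small values would otherwise dominate)."""
    keep = y_true > threshold
    err = np.abs(y_pred[keep] - y_true[keep]) / y_true[keep]
    return float(100.0 * np.mean(err))
```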
3.2 Models Details
We evaluate Forecaster and compare it against baseline models including VAR, DCRNN, and the Transformer.
Our model: Forecaster
Forecaster uses weather (7-dimensional vector), weekday (one-hot encoding, 7-dimensional vector), hour (one-hot encoding, 24-dimensional vector), and a Boolean for holidays (1-dimensional vector) as auxiliary information (a 39-dimensional vector in total). Concatenated with a spatial signal (996-dimensional vector), each element of the input sequence for Forecaster is a 1035-dimensional vector. Forecaster uses one encoder layer and one decoder layer. Except for the sparse linear layer at the end of the decoder, all the layers of Forecaster use four neurons for encoding the data at each location and 64 neurons for encoding the auxiliary information, and thus have 4048 neurons in total. The sparse linear layer at the end has 996 neurons. Forecaster uses the following loss function:
$$\text{loss} = \text{RMSE} + \gamma \cdot \text{MAPE} \quad (7)$$

where $\gamma$ is a constant balancing the impact of RMSE with MAPE.
Baseline model: Vector Autoregression
Vector autoregression (VAR) [19] is a conventional multivariate time series forecasting method. It predicts the future endogenous variables (i.e., the spatial signal in our case) as a linear combination of the past endogenous variables and the current exogenous variables (i.e., the auxiliary information in our case):

$$\mathbf{x}_{t} = \sum_{i=1}^{p} \mathbf{A}_{i}\, \mathbf{x}_{t-i} + \mathbf{B}\, \mathbf{u}_{t} + \mathbf{e}_{t} \quad (8)$$

where $p$ is the order of the model, $\mathbf{A}_{i} \in \mathbb{R}^{N \times N}$ and $\mathbf{B} \in \mathbb{R}^{N \times D}$ are coefficient matrices, and $\mathbf{e}_{t}$ is an error term. The matrices $\mathbf{A}_{i}$ and $\mathbf{B}$ are estimated during the training stage. Our implementation is based on Statsmodels [25], a standard Python package for statistics.
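For intuition, a first-order version of this model ($p = 1$) can be fit by ordinary least squares; the sketch below is a minimal NumPy stand-in for the Statsmodels estimator used in our evaluation (the function name and setup are our own):

```python
import numpy as np

def fit_var1(X, U):
    """Least-squares fit of x_t ~ A x_{t-1} + B u_t.

    X: (T, N) endogenous spatial signals; U: (T, D) exogenous auxiliary inputs.
    Returns the coefficient matrices A (N, N) and B (N, D).
    """
    Z = np.hstack([X[:-1], U[1:]])               # regressors [x_{t-1}, u_t]
    Y = X[1:]                                    # targets x_t
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    N = X.shape[1]
    return coef[:N].T, coef[N:].T                # A, B
```

Forecasting then iterates the fitted recursion forward, feeding each prediction back in as the previous endogenous state.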
Baseline model: DCRNN
DCRNN [16] is a deep learning model that captures the dependency relations between locations as a diffusion process guided by a predefined distance metric. It leverages a graph CNN to capture the spatial dependency and an RNN to capture the temporal dependency within the data.
Baseline model: Transformer
The Transformer [29] uses the same input and loss function as Forecaster. It also adopts a similar architecture, except that all the layers are fully connected. For a comprehensive comparison, we evaluate two versions of the Transformer:

- Transformer (same width): All the layers in this implementation have the same width as in Forecaster. The linear layer at the end of the decoder has a width of 996; the other layers have a width of 4048.

- Transformer (best width): We vary the width of all the layers (except for the linear layer at the end of the decoder, which has a fixed width of 996) from 64 to 4096 and pick the best-performing width.
3.3 Results
Table 1: RMSE and MAPE of Forecaster and the baseline models (mean ± standard deviation over six runs; VAR is deterministic and run once).

| Metric | Model | Average | Next step | Second next step | Third next step |
|---|---|---|---|---|---|
| RMSE | VAR | 6.9991 | 6.4243 | 7.1906 | 7.3476 |
| RMSE | DCRNN | 5.3750 ± 0.0691 | 5.1627 ± 0.0644 | 5.4018 ± 0.0673 | 5.5532 ± 0.0758 |
| RMSE | Transformer (same width) | 5.6802 ± 0.0206 | 5.4055 ± 0.0109 | 5.6632 ± 0.0173 | 5.9584 ± 0.0478 |
| RMSE | Transformer (best width) | 5.6898 ± 0.0219 | 5.4066 ± 0.0302 | 5.6546 ± 0.0581 | 5.9926 ± 0.0472 |
| RMSE | Forecaster | 5.1879 ± 0.0082 | 4.9629 ± 0.0102 | 5.2275 ± 0.0083 | 5.3651 ± 0.0065 |
| MAPE (%) | VAR | 33.7983 | 31.9485 | 34.5338 | 34.9126 |
| MAPE (%) | DCRNN | 24.9853 ± 0.1275 | 24.4747 ± 0.1342 | 25.0366 ± 0.1625 | 25.4424 ± 0.1238 |
| MAPE (%) | Transformer (same width) | 22.5787 ± 0.2153 | 21.8932 ± 0.2006 | 22.3830 ± 0.1943 | 23.4583 ± 0.2541 |
| MAPE (%) | Transformer (best width) | 22.2793 ± 0.1810 | 21.4545 ± 0.0448 | 22.1954 ± 0.1792 | 23.1868 ± 0.3334 |
| MAPE (%) | Forecaster | 20.1362 ± 0.0316 | 19.8889 ± 0.0269 | 20.0954 ± 0.0299 | 20.4232 ± 0.0604 |
Our evaluation of Forecaster starts by using Gaussian Markov random fields to determine the spatial dependency between the data at different locations. Based on the method in Section 2.2, we obtain a conditional correlation matrix where each entry represents the conditional correlation coefficient between two locations. If the absolute value of an entry is less than a threshold, we treat the corresponding two locations as conditionally independent and round the entry to zero. This threshold can be chosen based only on the performance on the validation set. Figure 5 shows the structure of the conditional correlation matrix under a threshold of 0.1. The matrix is sparse, which means that a location generally depends on just a few other locations rather than on all of them; we found that a location depends on only 2.5 other locations on average. There are some locations on which many other locations depend. For example, there is a location in Lower Manhattan on which 16 other locations depend. This may be because there are many locations with significant taxi demand in Lower Manhattan, and these locations share a strong dependency. Figure 6 shows the top 400 spatial dependencies. We see some long-range spatial dependency between remote locations. For example, there is a strong dependency between Grand Central Terminal and New York Penn Station, which are important stations in Manhattan with a large traffic of passengers.
After determining the spatial dependency between locations, we use the graph Transformer architecture of Forecaster to predict the taxi demand. Table 1 contrasts the performance of Forecaster with the baseline models. We run all the evaluated deep learning models six times (with different seeds) and report the mean and standard deviation of the results; as VAR is not subject to random initialization, we run it once. For all the evaluated models, the RMSE and MAPE of predicting the next step are lower than those of predicting later steps (e.g., the third next step). This is because, for all the models, the prediction of later steps builds upon the prediction of the next step, and thus the error of the former includes the error of the latter. Comparing the models, the RMSE and MAPE of VAR are higher than those of the deep learning models. This is because VAR does not model well the nonlinearity and non-stationarity within the data; it also does not consider the spatial dependency between locations in the structure of its coefficient matrices in Equation (8). Among the deep learning models, DCRNN and the Transformer perform similarly: the former captures the spatial dependency within the data but does not capture well the long-range temporal dependency, while the latter exploits the long-range temporal dependency but neglects the spatial dependency. Forecaster outperforms all the baseline methods at every future step of forecasting. On average (over these future steps), Forecaster achieves an RMSE of 5.1879 and a MAPE of 20.1362%, which are 8.8210% and 9.6192% better than Transformer (best width), and 3.4809% and 19.4078% better than DCRNN. This demonstrates the advantage of Forecaster in capturing both the spatial dependency and the long-range temporal dependency.
4 Related Work
To our knowledge, this work is the first (1) to integrate Gaussian Markov random fields with deep learning to forecast spatial and time-dependent data, using the former to derive a dependency graph; and (2) to sparsify the architecture of the Transformer based on the dependency graph, significantly improving the forecasting quality of the resulting architecture. The most closely related work is a set of proposals on forecasting spatial and time-dependent data and on the Transformer, which we briefly review in this section.
4.1 Spatial and TimeDependent Data Forecasting
Conventional methods for forecasting spatial and time-dependent data, such as ARIMA and Kalman filtering-based methods [18, 17], usually impose strong stationarity assumptions on the data, which are often violated [16]. Recently, deep learning-based methods have been proposed to tackle the non-stationary and highly nonlinear nature of the data [35, 38, 36, 7, 34, 16]. Most of these works consist of two parts: modules to capture spatial dependency and modules to capture temporal dependency. Regarding spatial dependency, the literature mostly uses prior knowledge, such as the physical closeness between regions, to derive an adjacency matrix and/or predefined distance/similarity metrics to decide whether two locations are dependent. Based on this information, these works usually use a (standard or graph) CNN to characterize the spatial dependency between dependent locations. However, such metrics are not good predictors of the dependency relations between the data at different locations. Regarding temporal dependency, available works [35, 36, 7, 34, 16] usually use RNNs and CNNs to extract the long-range temporal dependency. However, neither RNNs nor CNNs learn the long-range temporal dependency well: the number of operations used to relate signals at two distant time positions in a sequence grows at least logarithmically with the distance between them [29].
We evaluate our architecture on the problem of forecasting taxi ride-hailing demand around a large number of spatial locations. The problem has two essential features: (1) the locations are not uniformly distributed like pixels in an image, making standard CNN-based methods [35, 34, 38] a poor fit; (2) it is desirable to perform multi-step forecasting, i.e., forecasting at several time instants in the future, which makes works mainly designed for single-step forecasting [36, 7] less applicable. DCRNN [16] is the state-of-the-art baseline satisfying both features. Hence, we compare our architecture with DCRNN and show that our work outperforms it.
4.2 Transformer
The Transformer [29] avoids recurrence and instead relies purely on the self-attention mechanism to let the data at distant positions in a sequence relate to each other directly, which benefits learning long-range temporal dependency. The Transformer and its extensions have been shown to significantly outperform RNN-based methods in NLP and image generation tasks [29, 22, 3, 33, 4, 21, 13]. It has also been applied to graph and node classification problems [1, 37]. However, it remains unknown how to apply the architecture of the Transformer to spatial and time-dependent data, especially how to deal with the spatial dependency between locations. Later work [31] extends the architecture of the Transformer to video generation. Even though this also needs to address spatial dependency between pixels, the nature of the problem is different from our task. In video generation, pixels exhibit spatial dependency only over a short time interval, lasting for at most tens of frames — two pixels may be dependent for a few frames and become independent in later frames. On the contrary, in spatial and time-dependent data, locations exhibit long-term spatial dependency lasting for months or even years. This fundamental difference enables us to use Gaussian Markov random fields to determine the dependency graph as the basis for sparsifying the Transformer. Child et al. [2] propose another sparse Transformer architecture with the different goal of accelerating the multi-head attention operations in the Transformer; that architecture is very different from ours.
5 Conclusion
Forecasting spatial and time-dependent data is challenging due to the complex spatial dependency, long-range temporal dependency, non-stationarity, and heterogeneity within the data. This paper proposes Forecaster, a graph Transformer architecture that tackles these challenges. Forecaster uses Gaussian Markov random fields to determine the dependency graph between the data at different locations. Then, Forecaster sparsifies the architecture of the Transformer based on the structure of this graph and lets the sparsified Transformer (i.e., graph Transformer) capture the spatiotemporal dependency, non-stationarity, and heterogeneity in one shot. We apply Forecaster to the problem of forecasting taxi ride-hailing demand at a large number of spatial locations. Evaluation results demonstrate that Forecaster significantly outperforms state-of-the-art baselines (the Transformer and DCRNN).

Acknowledgements. We thank the reviewers. This work is partially supported by NSF CCF (award 1513936).
Appendix: MultiHead Attention
The multi-head attention layer is a core component of the Transformer for capturing long-range temporal dependency within data. It takes a query sequence $\{\mathbf{q}_i\}$, a key sequence $\{\mathbf{k}_j\}$, and a value sequence $\{\mathbf{v}_j\}$ as inputs and outputs a new sequence, where each element of the output sequence is impacted by the corresponding query and by all the keys and values, no matter how distant these keys and values are from the query in the temporal order; it thus captures the long-range temporal dependency. The detailed procedure is as follows.
First, the multi-head attention layer compares each query with each key to get their similarity from multiple perspectives (termed multi-head; $H$ is the number of heads, $h = 1, \ldots, H$):

$$s_{ij}^{h} = \left\langle \mathbf{W}_{Q}^{h}\, \mathbf{q}_{i},\; \mathbf{W}_{K}^{h}\, \mathbf{k}_{j} \right\rangle \quad (9)$$

where $s_{ij}^{h}$ is the similarity between $\mathbf{q}_i$ and $\mathbf{k}_j$ under head $h$; $\mathbf{W}_{Q}^{h}$ and $\mathbf{W}_{K}^{h}$ are parameter matrices for head $h$ that need to be learned; $\langle \cdot, \cdot \rangle$ is the inner product between two vectors.
In our work, to balance the impact of the spatial signals and the auxiliary information on the prediction, we first scale each query $\mathbf{q}_i$ and use its scaled version $\tilde{\mathbf{q}}_i$ instead in Equation (9) when computing the similarity $s_{ij}^{h}$. Suppose that in $\mathbf{q}_i$ and $\mathbf{k}_j$ the first $d_1$ dimensions encode the spatial signals and the next $d_2$ dimensions encode the auxiliary information; we compute $\tilde{\mathbf{q}}_i$ as:

$$\tilde{\mathbf{q}}_{i} = \mathbf{q}_{i} \odot \mathbf{c} \quad (10)$$

where $\odot$ is the Hadamard product and $\mathbf{c}$ is a constant vector that applies one scaling factor to the first $d_1$ dimensions and another to the remaining $d_2$ dimensions.
Second, the multi-head attention layer uses these similarities as weights to generate a new sequence $\{\mathbf{z}_{i}^{h}\}$ for each head $h$:

$$\mathbf{z}_{i}^{h} = \sum_{j} \mathrm{softmax}_{j}\left(s_{ij}^{h}\right) \mathbf{W}_{V}^{h}\, \mathbf{v}_{j} \quad (11)$$

where $\mathbf{W}_{V}^{h}$ is another parameter matrix for head $h$ that needs to be learned.
Third, the sequences under all heads are concatenated elementwise and used to generate the final output sequence $(o_1, \ldots, o_n)$:

$$o_i = W_O \big[\, o_i^{(1)} \,\|\, o_i^{(2)} \,\|\, \cdots \,\|\, o_i^{(H)} \,\big] \qquad (12)$$

where $W_O$ is a parameter matrix that needs to be learned; $\|$ represents the concatenation of two vectors.
In summary, the multi-head attention layer needs to learn the parameter matrices $W_Q^{(h)}$, $W_K^{(h)}$, and $W_V^{(h)}$ for each head $h$, together with $W_O$; all of these can be treated as linear layers without bias. In our architecture, we replace these linear layers with sparse linear layers, capturing the spatial dependency between locations.
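Putting Equations (9)-(12) together, the whole layer can be sketched in NumPy as below. This is a dense reference version; in Forecaster the parameter matrices would additionally be masked according to the dependency graph, which we do not reproduce here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Multi-head attention, Eqs. (9)-(12).
    Q: (n, dm); K, V: (m, dm); Wq, Wk, Wv: (H, d, dm); Wo: (dm, H*d)."""
    H, d, _ = Wq.shape
    heads = []
    for h in range(H):
        # Eq. (9): scaled similarity between projected queries and keys.
        a = (Q @ Wq[h].T) @ (K @ Wk[h].T).T / np.sqrt(d)
        # Eq. (11): softmax-weighted sum of projected values.
        heads.append(softmax(a, axis=-1) @ (V @ Wv[h].T))   # (n, d)
    # Eq. (12): concatenate heads and apply the output projection.
    return np.concatenate(heads, axis=-1) @ Wo.T            # (n, dm)
```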
Footnotes
 The approach for estimating the precision matrix of a Gaussian Markov random field (i.e., the graphical lasso) can also be used with non-Gaussian distributions [23].
 For simplicity, we assume in this work that all locations share the same auxiliary information, i.e., the auxiliary information can impact the spatial signal at any location. However, it is easy to generalize our approach to the case where locations do not share the same auxiliary information.
 However, our architecture still allows the encoding of the data at each location (i.e., the encoding of the spatial signal) to account for the auxiliary information through the sparse multi-head attention layers in our architecture, which we illustrate in Section 2.4.
 We use the following algorithm to select the locations. Our roadmap initially has 5464 locations. We compute the average hourly taxi demand at each of these locations and then use a threshold (= 10) together with an iterative procedure to down-select to the 996 hot locations. The algorithm considers locations in decreasing order of demand. Each time a candidate location is tentatively added to the pool of selected locations, we recompute the average hourly taxi demand at every location in the pool by remapping the taxi rides to the locations in the pool. If every location in the pool still has a demand no less than the threshold, we keep the candidate; otherwise, we remove it from the pool. We iterate this procedure over all 5464 locations, which guarantees that every selected location has an average hourly taxi demand no less than the threshold.
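The iterative down-selection above can be sketched as follows. This is a minimal version under our own assumptions: rides are remapped to the nearest pool location by Euclidean distance, and all function and variable names are illustrative.

```python
import numpy as np

def select_hot_locations(coords, ride_coords, threshold, hours):
    """Greedy down-selection of hot locations.
    coords: (L, 2) candidate location coordinates.
    ride_coords: (R, 2) ride pickup coordinates.
    hours: total hours spanned by the rides (for the hourly average)."""
    def avg_hourly_demand(pool_idx):
        # Remap every ride to its nearest location in the pool.
        pool = coords[pool_idx]                                        # (P, 2)
        d = np.linalg.norm(ride_coords[:, None, :] - pool[None, :, :], axis=2)
        counts = np.bincount(d.argmin(axis=1), minlength=len(pool_idx))
        return counts / hours

    # Consider candidates from higher to lower raw demand.
    raw = avg_hourly_demand(np.arange(len(coords)))
    pool = []
    for cand in np.argsort(-raw):
        trial = pool + [int(cand)]
        # Keep the candidate only if every pool location still meets the threshold.
        if avg_hourly_demand(trial).min() >= threshold:
            pool = trial
    return pool
```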
References
 (2019) Path-Augmented Graph Transformer Network. In Workshop on Learning and Reasoning with Graph-Structured Data (ICML workshop), pp. 1–5.
 (2019) Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509.
 (2019) Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Annual Meeting of the Association for Computational Linguistics (ACL), pp. 2978–2988.
 (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 4171–4186.
 (2017) Topology Adaptive Graph Convolutional Networks. arXiv preprint arXiv:1710.10370, pp. 1–13.
 (2008) Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics 9 (3), pp. 432–441.
 (2019) Spatiotemporal Multi-Graph Convolution Network for Ride-Hailing Demand Forecasting. In AAAI Conference on Artificial Intelligence, pp. 3656–3663.
 (2016) The Forces of Economic Growth: A Time Series Perspective. Princeton University Press.
 (2001) Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies.
 (2017) Visual Exploration of Global Trade Networks with Time-Dependent and Weighted Hierarchical Edge Bundles on GPU. Computer Graphics Forum 36 (3), pp. 273–282.
 (2016) Structural-RNN: Deep Learning on Spatio-Temporal Graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5308–5317.
 (2017) Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR), pp. 1–14.
 (2019) Text Generation from Knowledge Graphs with Graph Transformers. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 2284–2293.
 (2019) Cope: Interactive Exploration of Co-Occurrence Patterns in Spatial Time Series. IEEE Transactions on Visualization and Computer Graphics 25 (8), pp. 2554–2567.
 (2014) Vismate: Interactive Visual Analysis of Station-Based Observation Data on Climate Changes. In IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 133–142.
 (2018) Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations (ICLR), pp. 1–16.
 (2013) Short-Term Traffic Flow Forecasting: An Experimental Comparison of Time-Series Analysis and Supervised Learning. IEEE Transactions on Intelligent Transportation Systems 14 (2), pp. 871–882.
 (2011) Discovering Spatio-Temporal Causal Interactions in Traffic Data Streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1010–1018.
 (2005) New Introduction to Multiple Time Series Analysis. Springer.
 (2009) Expectation-Based Scan Statistics for Monitoring Spatial Time Series Data. International Journal of Forecasting 25 (3), pp. 498–517.
 (2018) Image Transformer. In International Conference on Machine Learning (ICML), pp. 4055–4064.
 (2018) Improving Language Understanding by Generative Pre-Training. OpenAI.
 (2011) High-Dimensional Covariance Estimation by Minimizing L1-Penalized Log-Determinant Divergence. Electronic Journal of Statistics 5, pp. 935–980.
 (2005) Gaussian Markov Random Fields: Theory and Applications (Monographs on Statistics and Applied Probability). Chapman & Hall/CRC.
 (2010) Statsmodels: Econometric and Statistical Modeling with Python. In Python in Science Conference, pp. 57–61.
 (2018) Classification with Vertex-Based Graph Convolutional Neural Networks. In Asilomar Conference on Signals, Systems, and Computers (ACSSC), pp. 752–756.
 (2014) Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems (NIPS), pp. 3104–3112.
 (2018) Trip Record Data.
 (2017) Attention is All You Need. In Advances in Neural Information Processing Systems (NIPS), pp. 5998–6008.
 (2017) The Pose Knows: Video Forecasting by Generating Pose Futures. In IEEE International Conference on Computer Vision (ICCV), pp. 3332–3341.
 (2018) Non-Local Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803.
 (2018) Historical Weather.
 (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.
 (2019) Revisiting Spatial-Temporal Similarity: A Deep Learning Framework for Traffic Prediction. In AAAI Conference on Artificial Intelligence, pp. 5668–5675.
 (2018) Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction. In AAAI Conference on Artificial Intelligence, pp. 2588–2595.
 (2018) Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 3634–3640.
 (2019) Graph Transformer Networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 11960–11970.
 (2017) Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction. In AAAI Conference on Artificial Intelligence, pp. 1655–1661.
 (2003) Correlation Analysis of Spatial Time Series Datasets: A Filter-and-Refine Approach. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 532–544.
 (2017) Spatio-Temporal Neural Networks for Space-Time Series Forecasting and Relations Discovery. In IEEE International Conference on Data Mining (ICDM), pp. 705–714.