ST-UNet: A Spatio-Temporal U-Network for Graph-structured Time Series Modeling


Bing Yu, School of Mathematical Sciences, Peking University, Beijing, China 100871
Haoteng Yin, Center for Data Science (AAIS), Peking University, Beijing, China 100871
Zhanxing Zhu, Center for Data Science, Peking University; Beijing Institute of Big Data Research, Beijing, China 100871

Spatio-temporal graph learning is becoming an increasingly important topic in graph research. Many application domains involve highly dynamic graphs where temporal information is crucial, e.g. traffic networks and financial transaction graphs. Despite steady progress in learning from structured data, there is still a lack of effective means to extract complex dynamic features from spatio-temporal structures. In particular, conventional models such as convolutional networks or recurrent neural networks cannot simultaneously reveal short- and long-term temporal patterns and explore local and global spatial properties of spatio-temporal graphs. To tackle this problem, we design a novel multi-scale architecture, Spatio-Temporal U-Net (ST-UNet), for graph-structured time series modeling. In this U-shaped network, a paired sampling operation is proposed in the spacetime domain: the pooling (ST-Pool) coarsens the input graph spatially according to a deterministic partition while abstracting multi-resolution temporal dependencies through dilated recurrent skip connections; based on the settings used in downsampling, the unpooling (ST-Unpool) restores the original structure of spatio-temporal graphs and resumes regular intervals within graph sequences. Experiments on spatio-temporal prediction tasks demonstrate that our model effectively captures comprehensive features at multiple scales and achieves substantial improvements over mainstream methods on several real-world datasets.

Keywords: spatio-temporal graph, multi-scale framework, U-network, graph convolution, dilated recurrent skip-connections

1. Introduction

With the recent success of extending deep learning approaches from regular grids to structured data, graph representation learning has become an active research area. Many kinds of real-world data, such as social relations, biological molecules and sensor networks, naturally take the form of a graph. Recently, there has been a surge of interest in exploring and analyzing graph representations for tasks like node classification and link prediction (Kipf and Welling, 2016; Hamilton et al., 2017; Gao et al., 2018). However, among those studies, the dynamic graph has received relatively less attention than the static graph, which consists of fixed node values or labels. The spatio-temporal graph is a typical dynamic graph, with the input of each node varying along the time axis, e.g. traffic sensor streams and human action sequences. In this work, we systematically study the dynamic graph in the spacetime domain, aiming to develop a principled and effective method to interpret the spatio-temporal graph and to forecast future values or labels of certain nodes, or to predict the whole graph over the next few time steps.

In the field of spatio-temporal data, videos are a well-studied example, whose successive frames consistently share spatial and temporal structures. Hybrid frameworks that combine different types of neural networks have been constructed to exploit such spatio-temporal regularities within video frames, for instance in applications to weather radar echoes (Xingjian et al., 2015) and traffic heatmaps (Zhang et al., 2018). In these models, each frame first passes through convolutional neural networks (CNN) for visual feature extraction, followed by recurrent neural networks (RNN) for sequence learning. Even though images can be regarded as special cases of graphs, widely used deep learning models still face significant challenges when applied to spatio-temporal graphs. First, graph-structured data are generated from a non-Euclidean domain and may not align with the regular grids required by existing models. Second, compared to grid-like data, there is no spatial locality or ordering among the nodes of a graph. Due to such irregularities, standard operations (for example, convolution and pooling) are not directly applicable to the graph domain.

To bridge this gap, Bruna et al. (2013) propose graph convolutional networks (GCNs), which redefine the notion of convolution and generalize it to arbitrary graphs based on spectral graph theory. The introduction of GCNs has driven the recent rapid development of graph research, and GCNs have been successfully adopted in a variety of applications that involve dynamic graphs. For instance, in action recognition, human action sequences can be assembled into a spatio-temporal graph in which body joints form a series of skeleton graphs changing along the time axis. Correspondingly, Li et al. (2018a) design a GCN-based model to capture the spatial patterns of skeleton sequences as well as the temporal dynamics contained therein. In traffic forecasting, each sensor station streams the traffic status of a certain road within a traffic network. In this sensor graph, the spatial edges are weighted by the pairwise distance between stations in the network, while the temporal edges connect the same sensor between adjacent time frames. Recent studies have investigated the feasibility of combining GCNs with RNN (Li et al., 2018b) or CNN (Yu et al., 2018) for traffic prediction on road networks. GCN-based models obtain considerable improvements over traditional ones, which typically ignore spatio-temporal correlations and lack the capability to handle structured sequences.

To accurately understand the local and global properties of dynamic graphs, it is necessary to process the data at multiple scales. The spatio-temporal graph particularly requires such scale-spanning analysis because of its complexity in the spacetime domain. However, most mainstream methods have overlooked this principle, partially because of the difficulty of extending existing operations like pooling to graph data. Nevertheless, multi-scale modeling of the dynamic graph resembles the pixel-wise prediction task, with an image pixel corresponding to a graph node. U-shaped networks, with U-Net (Ronneberger et al., 2015) as the representative, achieve state-of-the-art performance on pixel-level prediction; their architecture has a high capacity for representing both locally distributed and globally hidden information within the input. Thus, it is particularly appealing to apply such a U-shaped design to modeling dynamic graphs.

In this paper, we propose a novel multi-scale framework, Spatio-Temporal U-Net (ST-UNet), to model and predict graph-structured time series. To precisely capture the spatio-temporal correlations in dynamic graphs, we first generalize the U-shaped architecture from images to spatio-temporal graphs. ST-UNet employs multi-granularity graph convolution to extract both generalized and localized spatial features, and adds dilated recurrent skip-connections to capture multi-resolution temporal dependencies. Within ST-UNet, we define the two essential operations of the framework. The spatio-temporal pooling (ST-Pool) operation samples nodes to form a smaller graph from the output of a deterministic graph partition (Maue and Sanders, 2007) and abstracts time series at multiple temporal resolutions through skip connections between recurrent units. The unpooling (ST-Unpool), as its paired operation, restores the original structure and temporal dependency of dynamic graphs based on the settings used in downsampling. To better localize the representation from the input, higher-level features retrieved from the pooling part are concatenated with the upsampled output. Overall, thanks to its hierarchical U-shaped design, ST-UNet is able to effectively derive multi-scale features and precisely learn representations from the spatio-temporal graph.

2. Related Work

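Following the spectral-based formulation (Bruna et al., 2013; Niepert et al., 2016; Defferrard et al., 2016), the graph convolution operator ‘$*_{\mathcal{G}}$’ is introduced as the multiplication of a graph signal $x \in \mathbb{R}^{n}$ with a kernel $\Theta(\Lambda) = \mathrm{diag}(\theta)$, where $\theta \in \mathbb{R}^{n}$ is a vector of Fourier coefficients, as

$$\Theta *_{\mathcal{G}} x = \Theta(L)\, x = \Theta(U \Lambda U^{\top})\, x = U\, \Theta(\Lambda)\, U^{\top} x, \tag{1}$$

where $U \in \mathbb{R}^{n \times n}$ is the graph Fourier basis, which is a matrix of eigenvectors of the normalized graph Laplacian $L = I_n - D^{-1/2} W D^{-1/2} = U \Lambda U^{\top}$ ($I_n$ is an identity matrix and $D$ is the diagonal degree matrix of the adjacency matrix $W$ with $D_{ii} = \sum_j W_{ij}$), while $\Lambda$ is the diagonal matrix of eigenvalues of $L$ (Shuman et al., 2012). In order to localize the filter, the kernel $\Theta$ can be restricted to a truncated expansion of Chebyshev polynomials $T_k$ up to order $K-1$, evaluated at the rescaled $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I_n$ (with $\lambda_{\max}$ the largest eigenvalue of $L$), as $\Theta(\Lambda) \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda})$, where $\theta \in \mathbb{R}^{K}$ is a vector of Chebyshev coefficients (Hammond et al., 2011). The graph convolution can then be expressed as

$$\Theta *_{\mathcal{G}} x = \Theta(L)\, x \approx \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})\, x, \tag{2}$$

where $T_k(\tilde{L}) \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order $k$ evaluated at the rescaled Laplacian $\tilde{L} = 2L/\lambda_{\max} - I_n$.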
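
The Chebyshev form of the graph convolution can be evaluated without an eigendecomposition of the filter, using the recurrence $T_k(y) = 2y\,T_{k-1}(y) - T_{k-2}(y)$ with $T_0 = 1$ and $T_1 = y$. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names `rescaled_laplacian` and `cheb_conv` are hypothetical, the adjacency matrix is assumed symmetric with non-negative weights, and the signal has a single channel.

```python
import numpy as np

def rescaled_laplacian(W):
    """Return 2 L / lambda_max - I for L = I - D^{-1/2} W D^{-1/2}.

    Assumes W is a symmetric, non-negative adjacency matrix.
    """
    n = W.shape[0]
    deg = W.sum(axis=1)
    # Isolated nodes (degree 0) get a zero scaling factor.
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(L).max()
    return 2.0 * L / lam_max - np.eye(n)

def cheb_conv(x, L_tilde, theta):
    """Compute sum_k theta[k] * T_k(L_tilde) @ x with the Chebyshev recurrence."""
    t_prev, t_curr = x, L_tilde @ x          # T_0 x and T_1 x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        # T_k x = 2 L~ T_{k-1} x - T_{k-2} x
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + theta[k] * t_curr
    return out

# Usage on a 4-node ring with a kernel of size K = 3.
W = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L_tilde = rescaled_laplacian(W)
y = cheb_conv(np.array([1.0, 2.0, 3.0, 4.0]), L_tilde, np.array([0.5, -1.0, 0.25]))
```

Each additional order costs one more multiplication by the rescaled Laplacian, so a kernel of size $K$ mixes information from nodes at most $K-1$ hops away.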

Apart from convolutional operations on graphs, several recent studies focus on structured sequence learning. Structured RNN (Jain et al., 2016) fits the spatio-temporal graph into a mixture of recurrent neural networks by associating each node and edge with a certain type of network. Building on the framework of convLSTM (Xingjian et al., 2015), the graph convolutional recurrent network (GCRN) (Seo et al., 2016) was the first model proposed for structured sequences that replaces regular 2D convolution with spectral-based graph convolution. It has set a trend of GCN-embedded designs in follow-up studies (Li et al., 2018b; Yu et al., 2018). Recently, an encoder-decoder model on graphs was developed for graph embedding tasks. This model, known as graph U-Net (Gao and Ji, 2019), brings pooling and upsampling operations to graph data. However, its scope is limited to static graphs. Additionally, it introduces extra trainable parameters for node selection during the pooling procedure. Furthermore, its pooling operation does not preserve the original structure of the input graph, which may be a problem for tasks in which local spatial relations are critical.

3. Methodology

In this section, we start with the definition of the spatio-temporal graph and the problem formulation of prediction tasks on it. We then elaborate the design of the U-shaped network, with the essential operations of pooling and upsampling defined on the spatio-temporal graph. Based on these components, a multi-scale architecture, Spatio-Temporal U-Net, is finally introduced for graph-structured time series modeling.

Figure 1. An illustration of the proposed Spatio-Temporal U-Net architecture. ST-UNet employs graph convolutional gated recurrent units (GCGRU) as its backbone. In this example, the proposed framework contains three GCGRU layers arranged in a U-shaped structure, with one ST-Pool on the contracting side and one ST-Unpool on the expansive side. Spatio-temporal features obtained from the input are downsampled into multi-resolution representations through an ST-Pool operation. As subgraph (a) shows, the input graph at each time step is identically coarsened to nearly a quarter of its original size at level 2, combined with feature pooling along the channel dimension. Meanwhile, the temporal dependency of the input sequence is dilated to 2, with skip-connections crossing every other recurrent unit, as shown in subgraph (b). The ST-Unpool operation, as the reverse operation, restores the spatio-temporal graph to its original structure by upsampling spatial features and concurrently resumes the regular dependencies of the time series. To assemble a more precise output with better localized representations, high-level features from the pooling side are fused with the upsampled output through a skip connection at the same level. The final output can be used for predicting node attributes or the entire graph over the next few time steps.

3.1. Spatio-temporal Graph Modeling

Suppose spatio-temporal data are gathered over a structured spatial region consisting of $n$ nodes. Each node holds $d$ measurements which vary over time. Thus, the observation of a node at any time $t$ can be represented by a feature vector $x_t \in \mathbb{R}^{d}$, and the data collected over the whole region can be expressed as a feature matrix $X_t \in \mathbb{R}^{n \times d}$. As time goes by, a chronological sequence of matrices is accumulated, which can be further formalized as the spatio-temporal graph defined as follows.

Definition 3.1 (Spatio-temporal Graph).

A spatio-temporal graph is an attributed graph with a time-variable feature matrix $X$. It is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, W, X)$, where $\mathcal{V}$ is the set of $n = |\mathcal{V}|$ vertices, $\mathcal{E}$ is the set of edges, and $W \in \mathbb{R}^{n \times n}$ is an adjacency matrix recording the weighted connectedness between two vertices. Contrary to the static graph, the node attributes of the spatio-temporal graph evolve over time as $X \in \mathbb{R}^{T \times n \times d}$, where $T$ is the number of time steps and $d$ is the dimension of features in each node.

In practice, due to structural properties of the data, spatio-temporal graph modeling can be formulated as the prediction task of graph-structured time series. The objective of this task is to accurately predict future attributes of nodes in a given spatio-temporal graph based on historical records, which is formally described below.

Definition 3.2 (Spatio-temporal Prediction).

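Spatio-temporal prediction aims to forecast the most likely future length-$H$ sequence of node attributes in a graph given the previous $M$ observations:

$$\hat{X}_{t+1}, \ldots, \hat{X}_{t+H} = \operatorname*{arg\,max}_{X_{t+1}, \ldots, X_{t+H}} \log P\left(X_{t+1}, \ldots, X_{t+H} \mid X_{t-M+1}, \ldots, X_t;\ \mathcal{G}\right),$$

where $X_t \in \mathbb{R}^{n \times d}$ is an observation of node attributes linked by a weighted graph $\mathcal{G}$ at time step $t$.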
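
In practice, this formulation is usually turned into supervised training pairs by sliding a window over the recorded sequence. The sketch below illustrates one way to do so, assuming the whole record is available as an array of shape (T, n, d); the function name `make_windows` and the parameter names `n_hist` (number of past observations) and `n_pred` (forecast length) are hypothetical.

```python
import numpy as np

def make_windows(series, n_hist, n_pred):
    """Split a (T, n, d) sequence into (history, future) training pairs.

    Returns X with shape (S, n_hist, n, d) and Y with shape (S, n_pred, n, d),
    where S = T - n_hist - n_pred + 1 is the number of sliding windows.
    """
    n_samples = series.shape[0] - n_hist - n_pred + 1
    if n_samples <= 0:
        raise ValueError("sequence too short for the requested windows")
    X = np.stack([series[i:i + n_hist] for i in range(n_samples)])
    Y = np.stack([series[i + n_hist:i + n_hist + n_pred]
                  for i in range(n_samples)])
    return X, Y

# Usage: 10 time steps, 1 node, 1 feature; 3 steps of history, 2 steps ahead.
series = np.arange(10, dtype=float).reshape(10, 1, 1)
X, Y = make_windows(series, n_hist=3, n_pred=2)
```

The graph itself does not appear in the windows because the adjacency matrix is shared by all time steps.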

3.2. Pooling Operation on Spatio-temporal Graphs

The spatio-temporal graph can be decomposed into two domains: graph-structured data in the spatial domain and time series in the temporal domain. As a result, it inherits structural complexity from graphs and dynamic complexity from sequences. In this section, we therefore discuss the downsampling approaches from the spatial and the temporal perspective in turn. Finally, a unified pooling operation is defined in the spacetime domain.

Spatial Graph Pooling

Pooling layers play a vital role in CNNs because they reduce features. A pooling layer generally follows a convolutional layer to progressively reduce the spatial resolution of feature maps and enlarge receptive fields, thereby limiting overfitting and achieving better generalization. However, the standard pooling operation is not directly applicable to graph-structured data, since it requires well-defined local neighborhoods, which graphs do not provide. Besides local pooling, there are operations applied to the whole input that bypass the requirement of locality information, such as global pooling and $k$-max pooling. But these pooling approaches bring their own issues of limited flexibility and inconsistent selection (Gao and Ji, 2019).

Pooling is indispensable for extracting multilevel abstractions of graphs. We therefore use the improved path growing algorithm (PGA) (Maue and Sanders, 2007) to partition graphs by solving the maximum weight matching problem (referred to as ‘MaxWeightMatching’). Given a graph $\mathcal{G}_t = (\mathcal{V}, \mathcal{E}, W)$ with $n$ nodes at time step $t$, PGA finds an approximate solution to the problem, i.e. a subset of edges $\mathcal{M} \subseteq \mathcal{E}$ satisfying: 1) no two members of $\mathcal{M}$ share an endpoint; 2) its total weight is the largest. Subsequently, the algorithm generates the partition by gradually removing the edges in $\mathcal{M}$ and merging the nodes they connect, as Algorithm 1 describes. At each level, it reduces the size of a graph by a factor of about two, producing a coarser graph that corresponds to observing the data domain at a different resolution:

$$\mathcal{G}_t^{(p)} = \mathrm{gPartition}\left(\mathcal{G}_t,\ p\right),$$

where $\mathcal{G}_t^{(p)} = (\mathcal{V}^{(p)}, \mathcal{E}^{(p)}, W^{(p)})$ is a partitioned graph with $n_p \approx n / 2^{p}$ nodes at the level $p$, which controls the reduction scale of the input. $\mathcal{V}^{(p)}$ is a set of super nodes, each element of which contains a disjoint subset of $\mathcal{V}$. We use $\Gamma$ to denote the mapping relations between nodes in $\mathcal{V}$ and $\mathcal{V}^{(p)}$. Formally, after the graph convolutional layer, we can acquire the convolved feature matrix of a coarser graph through the graph partition algorithm as


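$$Y_t^{(p)} = \Gamma\left(\Theta *_{\mathcal{G}} X_t\right),$$

where $X_t \in \mathbb{R}^{n \times d}$ is a graph signal matrix with $d$ attributes in each node of $\mathcal{V}$, while $Y_t^{(p)} \in \mathbb{R}^{n_p \times k \times c}$ holds a length-$k$ feature matrix with $c$ channels in each node of $\mathcal{V}^{(p)}$; $k$ is the number of nodes contained in each super node $v^{(p)} \in \mathcal{V}^{(p)}$, and $\Gamma$ is the partition mapping that groups the nodes of $\mathcal{V}$ into the super nodes of $\mathcal{V}^{(p)}$. Finally, we take the maximum or mean feature activation over the nodes in each partitioned region to obtain the pooled features of each super node in $\mathcal{V}^{(p)}$ along the channel dimension as

$$Z_t^{(p)}[i, :] = \operatorname*{pool}_{1 \le j \le k}\ Y_t^{(p)}[i, j, :], \qquad \operatorname{pool} \in \{\max, \operatorname{mean}\},$$

where $Z_t^{(p)} \in \mathbb{R}^{n_p \times c}$ is the output of spatial graph pooling with $c$-channel features on $n_p$ nodes. Figure 1 (a) shows an example of the proposed spatial graph pooling. Since the graph partition is calculated in advance, the operation is very efficient and introduces no extra trainable parameters. Moreover, the scope of spatial graph pooling can be adjusted precisely through the level $p$. In order to address the inconsistency issue in node selection, the deterministic partition produced by PGA is applied identically to the graph at every time step.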
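
Once a partition is fixed, the feature pooling step reduces to a grouped max or mean over the nodes of each super node. The sketch below shows this under stated assumptions: the partition is given as an integer array `cluster` that maps every original node to the index of its super node, and the names `spatial_pool`, `cluster` and `mode` are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def spatial_pool(X, cluster, mode="max"):
    """Pool node features over the super nodes of a fixed partition.

    X:       array of shape (..., n, c), e.g. (T, n, c) for T time steps.
    cluster: integer array of shape (n,), cluster[i] = super node of node i.
    Returns an array of shape (..., n_coarse, c).
    """
    cluster = np.asarray(cluster)
    n_coarse = int(cluster.max()) + 1
    out_shape = X.shape[:-2] + (n_coarse, X.shape[-1])
    if mode == "max":
        out = np.full(out_shape, -np.inf)
        for node, c in enumerate(cluster):
            out[..., c, :] = np.maximum(out[..., c, :], X[..., node, :])
        return out
    if mode == "mean":
        out = np.zeros(out_shape)
        for node, c in enumerate(cluster):
            out[..., c, :] += X[..., node, :]
        counts = np.bincount(cluster, minlength=n_coarse)
        return out / counts[:, None]
    raise ValueError("mode must be 'max' or 'mean'")

# Usage: 4 nodes, nodes 1 and 2 merged into one super node, 1 channel.
X = np.array([[1.0], [5.0], [3.0], [2.0]])
cluster = np.array([0, 1, 1, 2])
pooled_max = spatial_pool(X, cluster, "max")    # rows: 1, 5, 2
pooled_mean = spatial_pool(X, cluster, "mean")  # rows: 1, 4, 2
```

Because `cluster` does not depend on time, the same call pools every time step of a (T, n, c) tensor with an identical node grouping, which is the consistency property the text asks for.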

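Input: graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, W)$; pooling level $p$
Output: graph $\mathcal{G}^{(p)}$ with adjusted $W^{(p)}$
I. Edge Selection
1 A subset of edges, $\mathcal{M} := \emptyset$;
2 while $\mathcal{E} \neq \emptyset$ do
3       $P := \langle\rangle$;
4       deterministically choose $x \in \mathcal{V}$ with $\deg(x) > 0$;
5       while $\deg(x) > 0$ do
6             let $e = \{x, y\}$ be the heaviest edge adjacent to $x$;
7             append $e$ to $P$;
8             remove $x$ and its adjacent edges from $\mathcal{G}$;
9             $x := y$;
10       end while
11      $\mathcal{M} := \mathcal{M} \cup \mathrm{MaxWeightMatching}(P)$;
12       extend $\mathcal{M}$ to a maximal matching;
13 end while
II. Graph Coarsening
15 while $\mathcal{M} \neq \emptyset$ do
16       remove $e = \{x, y\}$ selected in $\mathcal{M}$ from $\mathcal{E}$;
17       merge $x, y$ connected by $e$ as a super node $v'$;
18       adjust adjacent edges with related weights of $v'$;
19       remove $e$ from $\mathcal{M}$;
20 end while
III. Multilevel Partition
22 while $p > 1$ do
23       repeat Part I & II with the coarsened graph $\mathcal{G}'$ at step 1 as input;
24       $p := p - 1$;
25 end while
Algorithm 1. Graph Partition Algorithm (gPartition)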
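
The three parts of Algorithm 1 can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' code nor a faithful port of the implementation by Maue and Sanders: the graph is a dense symmetric weight matrix with non-negative entries, paths are started in node-index order to keep the result deterministic, the matching is extended to a maximal one once per level with a greedy pass, and merged edge weights are summed. The function names `path_matching`, `coarsen_once` and `g_partition` are hypothetical.

```python
import numpy as np

def path_matching(path):
    """Maximum weight matching on a path given as (u, v, w) edges in order."""
    m = len(path)
    best = [0.0] * (m + 1)          # best[i]: optimum over the first i edges
    take = [False] * (m + 1)
    for i in range(1, m + 1):
        with_edge = path[i - 1][2] + (best[i - 2] if i >= 2 else 0.0)
        take[i] = with_edge > best[i - 1]
        best[i] = max(with_edge, best[i - 1])
    chosen, i = [], m
    while i >= 1:                   # backtrack through the table
        if take[i]:
            chosen.append(path[i - 1][:2])
            i -= 2
        else:
            i -= 1
    return chosen

def coarsen_once(W):
    """One coarsening level; returns (cluster, W_coarse)."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    A = W.copy()
    np.fill_diagonal(A, 0.0)
    partner = -np.ones(n, dtype=int)
    # Part I: grow paths along heaviest edges, then match each path.
    for start in range(n):
        x, path = start, []
        while A[x].max() > 0:       # x still has an incident edge
            y = int(np.argmax(A[x]))
            path.append((x, y, A[x, y]))
            A[x, :] = 0.0
            A[:, x] = 0.0           # remove x and its edges
            x = y
        for u, v in path_matching(path):
            partner[u], partner[v] = v, u
    # Greedy extension to a maximal matching.
    for u in range(n):
        if partner[u] < 0:
            cand = [v for v in range(n)
                    if v != u and W[u, v] > 0 and partner[v] < 0]
            if cand:
                v = max(cand, key=lambda j: W[u, j])
                partner[u], partner[v] = v, u
    # Part II: merge matched pairs into super nodes.
    cluster = -np.ones(n, dtype=int)
    n_coarse = 0
    for u in range(n):
        if cluster[u] < 0:
            cluster[u] = n_coarse
            if partner[u] >= 0:
                cluster[partner[u]] = n_coarse
            n_coarse += 1
    S = np.zeros((n, n_coarse))
    S[np.arange(n), cluster] = 1.0  # one-hot assignment matrix
    W_coarse = S.T @ W @ S          # sum the weights between super nodes
    np.fill_diagonal(W_coarse, 0.0)
    return cluster, W_coarse

def g_partition(W, level):
    """Part III: repeat the coarsening `level` times and compose the mapping."""
    mapping = np.arange(np.asarray(W).shape[0])
    for _ in range(level):
        cluster, W = coarsen_once(W)
        mapping = cluster[mapping]
    return mapping, W
```

The returned `mapping` plays the role of the node-to-super-node relation used by pooling and unpooling: it can be computed once for the sensor graph and reused at every time step.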

Temporal Downsampling

Recurrent neural networks and their variants have shown impressive stability and capability in tackling sequence learning problems. Conventional recurrent models such as long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRU) (Chung et al., 2014) were initially designed for regular sequences with fixed time intervals, which significantly limits their capacity for capturing complex data dependencies. Recently, several studies have explored how to extend the recurrent units in RNNs to more sophisticated data such as spatio-temporal data. Based on the fully-connected LSTM (FC-LSTM), Xingjian et al. (2015) develop a modified recurrent network with embedded convolutional layers (convLSTM) to forecast spatio-temporal sequences. Inside each recurrent unit, convolutions with kernels replace the multiplications by dense matrices, which enables the network to handle image sequences. Afterwards, Seo et al. (2016) extend this approach by replacing the standard convolution with the graph convolution for structured sequence modeling. Following a similar scheme, we combine the GRU model and GCN layers into Graph Convolutional Gated Recurrent Units (GCGRU) to discover temporal patterns in graph-structured time series:

$$
\begin{aligned}
z_t &= \sigma\left(\Theta_{xz} *_{\mathcal{G}} x_t + \Theta_{hz} *_{\mathcal{G}} h_{t-1}\right), \\
r_t &= \sigma\left(\Theta_{xr} *_{\mathcal{G}} x_t + \Theta_{hr} *_{\mathcal{G}} h_{t-1}\right), \\
\tilde{h}_t &= \tanh\left(\Theta_{xh} *_{\mathcal{G}} x_t + \Theta_{hh} *_{\mathcal{G}} \left(r_t \odot h_{t-1}\right)\right), \\
h_t &= z_t \odot h_{t-1} + \left(1 - z_t\right) \odot \tilde{h}_t,
\end{aligned}
$$

where $\odot$ is the Hadamard product, and $\sigma(\cdot)$ and $\tanh(\cdot)$ stand for non-linear activation functions. In this setting, $z_t$ and $r_t$ represent the update gate and the reset gate at time step $t$, while $\tilde{h}_t$ and $h_t$ denote the current memory content and the final memory at the current time step respectively. Both $\Theta_{x\cdot}$ and $\Theta_{h\cdot}$ are parameters of the size-$K$ graph convolutional kernel. We use the notation ‘$*_{\mathcal{G}}$’ to describe the graph convolution between the graph signal and filters that are functions of the graph Laplacian $L$ parameterized by $K$-localized Chebyshev coefficients, as Eq. (2) notes. By stacking several graph convolutional recurrent layers, the adopted backbone GCGRU can be used as a seq2seq model for graph-structured sequence learning.
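
A single GCGRU step can be sketched in NumPy by replacing the dense products of a GRU with Chebyshev graph convolutions. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes the common GRU convention in which the final memory is $z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$, uses no bias terms, and takes a precomputed rescaled Laplacian `L_tilde`; the function names and the parameter dictionary keys (`xz`, `hz`, `xr`, `hr`, `xh`, `hh`) are hypothetical.

```python
import numpy as np

def cheb_basis(L_tilde, X, K):
    """Stack T_0(L~) X, ..., T_{K-1}(L~) X on a new last axis: (n, c, K)."""
    terms = [X]
    if K > 1:
        terms.append(L_tilde @ X)
    for _ in range(2, K):
        terms.append(2.0 * (L_tilde @ terms[-1]) - terms[-2])
    return np.stack(terms, axis=-1)

def gconv(L_tilde, X, Theta):
    """Graph convolution of X (n, c_in) with a kernel Theta (K, c_in, c_out)."""
    basis = cheb_basis(L_tilde, X, Theta.shape[0])
    return np.einsum("nck,kco->no", basis, Theta)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gcgru_step(L_tilde, x_t, h_prev, params):
    """One recurrent step; x_t is (n, c_in), h_prev is (n, c_hidden)."""
    z = sigmoid(gconv(L_tilde, x_t, params["xz"])
                + gconv(L_tilde, h_prev, params["hz"]))      # update gate
    r = sigmoid(gconv(L_tilde, x_t, params["xr"])
                + gconv(L_tilde, h_prev, params["hr"]))      # reset gate
    h_tilde = np.tanh(gconv(L_tilde, x_t, params["xh"])
                      + gconv(L_tilde, r * h_prev, params["hh"]))
    return z * h_prev + (1.0 - z) * h_tilde                  # final memory

# Usage: 4-node ring, 2 input channels, 3 hidden channels, kernel size 3.
rng = np.random.default_rng(0)
n, c_in, c_hidden, K = 4, 2, 3, 3
ring = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L_tilde = -ring / 2.0      # rescaled normalized Laplacian of the ring
shapes = {"xz": c_in, "hz": c_hidden, "xr": c_in,
          "hr": c_hidden, "xh": c_in, "hh": c_hidden}
params = {k: 0.1 * rng.normal(size=(K, c, c_hidden)) for k, c in shapes.items()}
h = gcgru_step(L_tilde, rng.normal(size=(n, c_in)), np.zeros((n, c_hidden)), params)
```

Running the step over a sequence and feeding each output back as `h_prev` gives one recurrent layer; stacking such layers yields the backbone described above.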

The above architecture may be sufficient to model structured sequences by exploiting local stationarity and spatio-temporal correlations, but it is still restricted to interpreting temporal dynamics over fixed periods. For multi-timescale modeling, many attempts have been made to extend recurrent networks to various time scopes, including phased LSTM (Neil et al., 2016) and clockwork RNNs (Koutnik et al., 2014). Inspired by the jumping design between recurrent units in (Chang et al., 2017), we insert skip connections between gated recurrent units to learn graph-structured sequences with multilevel temporal dependencies. This also creates a dilation between successive cells, which is equivalent to abstracting temporal features at a different resolution. Denote by $c_t^{(l)}$ the GCGRU cell in layer $l$ at time $t$. The dilated skip connection can be expressed as

$$c_t^{(l)} = f\left(x_t^{(l)},\ c_{t - s^{(l)}}^{(l)}\right),$$

where $x_t^{(l)}$ is the input to layer $l$ at time $t$; $s^{(l)}$ denotes the skip length, also referred to as the dilation of layer $l$; and $f(\cdot)$ represents the GRU cell and output operations. Figure 1 (b) provides a diagram of the proposed temporal downsampling implemented by the dilated recurrent skip-connections. Such a hierarchical design of dilation brings multiple temporal scales to recurrent units at different layers. It also broadens the range of temporal dependency, as a regular skip connection does, but with fewer parameters and higher efficiency.
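
The effect of the dilation is easy to see in code: the state fed to the cell at time $t$ is the one produced $s$ steps earlier rather than the previous one. The following minimal sketch assumes a generic cell given as a Python callable and an initial state used for the first $s$ steps; the names `run_dilated_layer`, `cell`, `h0` and `dilation` are hypothetical.

```python
def run_dilated_layer(cell, inputs, h0, dilation):
    """Apply h_t = cell(x_t, h_{t - dilation}) over a sequence.

    For t < dilation there is no earlier state, so the initial state h0 is used.
    With dilation = 1 this is an ordinary recurrent layer.
    """
    states = []
    for t, x_t in enumerate(inputs):
        h_prev = states[t - dilation] if t >= dilation else h0
        states.append(cell(x_t, h_prev))
    return states

# Usage with a toy accumulating cell: the dilated layer keeps two
# interleaved chains of states, one for even and one for odd time steps.
toy_cell = lambda x, h: x + h
regular = run_dilated_layer(toy_cell, [1, 1, 1, 1, 1], 0, dilation=1)
dilated = run_dilated_layer(toy_cell, [1, 1, 1, 1, 1], 0, dilation=2)
```

With the toy cell, the regular layer accumulates 1, 2, 3, 4, 5 while the layer with dilation 2 produces 1, 1, 2, 2, 3: each state depends on inputs spaced two steps apart, so the same number of recurrent updates spans twice the time range.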

In summary, based on the above proposals, we define spatio-temporal pooling (ST-Pool) as the operation that downsamples a spatio-temporal graph in two ways: on its spatial projection, it aggregates convolved features over non-overlapping partitions along the channel dimension; on its temporal projection, it dilates the dynamic dependencies between recurrent units in the same layer.

3.3. Spatio-temporal Unpooling Operation

As the inverse operation of downsampling, unpooling is crucial in the U-shaped network for recovering pooled features to their original resolution through upsampling. Several approaches defined on grid-like data serve this aim, such as transposed convolution (Zeiler et al., 2011) and unpooling layers (Zeiler and Fergus, 2014). However, these operations are not directly applicable to the spatio-temporal domain due to the particular, composite nature of its data. To this end, we propose spatio-temporal unpooling (ST-Unpool): to restore the primary structure of the input, the operation uses the reversed partition mapping to place merged nodes and edges back from the coarsened graph to the original graph; to resume regular temporal dependencies between recurrent units, the output of each time step in a skip-connected layer is fed into a vanilla recurrent layer without further temporal dilation.

Meanwhile, we provide three strategies for upsampling node attributes from a coarser graph, namely direct copy, ordered deconv and weighted deconv. As the name suggests, the first approach directly copies the features of a super node to each node it contains, while ordered deconvolution assigns parameterized features to each merged node based on its degree order. On top of ordered deconvolution, the weighted variant concatenates the structural information of the merged nodes in a sub-graph, as an embedded feature vector, to the upsampled features. All three upsampling methods are tested and compared in Section 4.4.
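
Of the three strategies, direct copy is simple enough to sketch in a few lines. The example below assumes the partition is stored as an integer array `cluster` mapping each original node to its super node, and also shows the channel-wise concatenation with features from the pooling side; the function names `unpool_direct_copy` and `fuse_skip` are hypothetical, and the two deconvolution variants are not covered.

```python
import numpy as np

def unpool_direct_copy(Z, cluster):
    """Copy each super node's features to every node it contains.

    Z:       array of shape (..., n_coarse, c).
    cluster: integer array of shape (n,), cluster[i] = super node of node i.
    Returns an array of shape (..., n, c).
    """
    return np.take(Z, np.asarray(cluster), axis=-2)

def fuse_skip(upsampled, skip):
    """Concatenate upsampled features with same-level features along channels."""
    return np.concatenate([upsampled, skip], axis=-1)

# Usage: 3 super nodes restored to 4 nodes; nodes 1 and 2 share super node 1.
Z = np.array([[1.0], [5.0], [2.0]])
cluster = np.array([0, 1, 1, 2])
restored = unpool_direct_copy(Z, cluster)          # rows: 1, 5, 5, 2
fused = fuse_skip(restored, np.zeros((4, 3)))      # shape (4, 4)
```

Since the same `cluster` array is used for pooling and unpooling, the restored tensor always has the original node count and ordering, which is what allows the concatenation with the features saved before pooling.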

3.4. Architecture of Spatio-Temporal U-Net

Based on the spatio-temporal pooling and unpooling operations proposed above, we develop a U-shaped multi-scale architecture, Spatio-Temporal U-Net, to address the challenge of analyzing and predicting graph-structured sequences. Following the classic U-shaped design, it contains two symmetric parts: downsampling and upsampling. The contracting part first applies graph convolution to aggregate information from each node’s neighborhood, followed by the ST-Pool layer, which encodes the convolved features at multiple spatio-temporal resolutions. The expansive part uses the ST-Unpool layer to upsample the reduced features to their original dimensions, concatenating them with the corresponding high-level features retrieved from downsampling. At the end, one graph convolution layer is attached to propagate the information across multiple spatial scales for the final prediction. The proposed architecture is illustrated in Figure 1. The main characteristics of ST-UNet can be summarized in three aspects:

  • To the best of our knowledge, it is the first time that a multi-scale network with U-shaped design is applied to learn and model spatio-temporal structures from graph-structured time series.

  • A novel pair of operators, spatio-temporal pooling and unpooling, is proposed for the first time for extracting and fusing multilevel features in the spacetime domain.

  • The proposed framework ST-UNet balances accuracy and efficiency with considerable scalability through multi-scale feature extraction and fusion, as shown in the experiments below.

4. Experimental Studies

In this section, we evaluate the model proposed in Section 3.4. Several mainstream models are tested and analyzed on spatio-temporal prediction tasks. Experiments show that ST-UNet consistently outperforms the other models and achieves state-of-the-art prediction accuracy. We also perform an ablation study to validate the effectiveness of the spatio-temporal pooling and unpooling operations. A comparison among GCN-based models suggests that ST-UNet balances efficiency and scalability better on the large-scale dataset. For a fair comparison, we use grid search on the validation set to determine the best hyper-parameters for all tested models.

4.1. Spatio-temporal Sequence Modeling on Moving-MNIST

To investigate the ability of node-level prediction, we compare ST-UNet with its plain version, GCGRU, on a synthetic dataset, moving-MNIST, constructed by Xingjian et al. (2015). It consists of 20-frame sequences (the first 10 frames as input and the last 10 for prediction), each of which contains two handwritten digits bouncing inside a 64 × 64 patch. (To make the task feasible for all tested models, the image frames in moving-MNIST are downsampled to 32 × 32 in the experiment of this section.) Following the default setup in (Seo et al., 2016), image frames are converted into spatio-temporal graphs. The adjacency matrix is that of a k-nearest-neighbor graph built from the distances between each pixel node and its equidistant neighbors in four directions (up, down, left and right). The kernel size of graph convolution is set to 3 for both models. The visualized outcome of moving sequence prediction in Figure 2 indicates that, thanks to hierarchical feature fusion in the spacetime domain, the U-shaped network learns a better representation and achieves better node-level performance than the model based purely on GCNs. It also suggests that such multi-scale designs transfer from regular grids to the non-Euclidean domain.
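
For illustration, the pixel graph of a frame can be built as a 4-neighbor grid graph. The sketch below is an assumption-laden simplification: it uses unit edge weights (on a regular grid the four neighbors are equidistant, but the weighting actually used in the experiment is not reproduced here), and the function name `grid_adjacency` is hypothetical.

```python
import numpy as np

def grid_adjacency(height, width):
    """Adjacency matrix of a pixel grid with up/down/left/right neighbors.

    Pixels are indexed row by row, so pixel (r, c) is node r * width + c.
    Unit weights are an assumption made for this sketch.
    """
    n = height * width
    W = np.zeros((n, n))
    for r in range(height):
        for c in range(width):
            i = r * width + c
            if c + 1 < width:                  # right neighbor
                W[i, i + 1] = W[i + 1, i] = 1.0
            if r + 1 < height:                 # lower neighbor
                W[i, i + width] = W[i + width, i] = 1.0
    return W

# Usage: the 32 x 32 frames of this section give a graph with 1024 nodes.
W = grid_adjacency(32, 32)
```

Each interior pixel has four neighbors, border pixels have three, and corner pixels have two, so the graph keeps the locality of the image while discarding the explicit grid coordinates.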

Figure 2. Qualitative results for moving MNIST. First row is the ground truth, second and third are the predictions of ST-UNet() and GCGRU() respectively.

4.2. Graph-structured Time-series Modeling on Traffic Prediction

Experimental Setup

For the traffic prediction task, we conduct experiments on two real-world public datasets. METR-LA, released by Li et al. (2018b), includes traffic information gathered by 207 loop detectors in Los Angeles County over 4 months, ranging from March 1st to June 30th of 2012. PeMS (M/L), generated by Yu et al. (2018), contains traffic status collected from monitoring stations deployed over the California state highway system on the weekdays of May and June of 2012, with 228 and 1026 stations respectively. Both datasets aggregate traffic records into 5-minute intervals, with an adjacency matrix describing the sensor topology of the traffic network. We use the same experimental settings as previous studies on these two datasets, including data preprocessing, dataset split, and other related configurations.

The following mainstream methods are selected as baselines: 1) Historical Average (HA); 2) Linear Support Vector Regression (LSVR); 3) Auto-Regressive Integrated Moving Average (ARIMA); 4) Feedforward Neural Network (FNN); 5) Fully-Connected LSTM (FC-LSTM) (Sutskever et al., 2014); 6) Spatio-Temporal Graph Convolutional Networks (STGCN) (Yu et al., 2018); 7) Diffusion Convolutional Recurrent Neural Network (DCRNN) (Li et al., 2018b).

This task requires using observed traffic time series in the window of one hour to forecast future status in the next 15, 30, and 60 minutes. Thus, three standard metrics of sequence prediction are adopted to measure the performance of different methods, namely, Mean Absolute Errors (MAE), Mean Absolute Percentage Errors (MAPE), and Root Mean Squared Errors (RMSE).
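
With 5-minute records, the one-hour input window corresponds to 12 time steps and the 15, 30 and 60 minute horizons to 3, 6 and 12 steps ahead. The three metrics can be sketched as below; the function names are hypothetical, and excluding zero-valued targets from MAPE is an assumption of this sketch, since the handling of such entries is not specified in this section.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    """Mean absolute percentage error in percent, skipping zero targets."""
    mask = y != 0
    return 100.0 * np.mean(np.abs((y[mask] - y_hat[mask]) / y[mask]))

def rmse(y, y_hat):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Usage on three toy readings.
y = np.array([10.0, 20.0, 40.0])
y_hat = np.array([12.0, 18.0, 40.0])
scores = (mae(y, y_hat), mape(y, y_hat), rmse(y, y_hat))
```

For the toy readings the absolute errors are 2, 2 and 0, which gives an MAE of 4/3, a MAPE of 10% and an RMSE of the square root of 8/3.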

ST-UNet Settings

All ST-UNet models use the kernel size for the graph convolution. Both spatial pooling level and temporal dilation are set at 2 with ‘direct copy’ employed as the upsampling approach. We train our models by using Adam optimizer to minimize the mean of and loss for 80 epochs with the batch size as 50. The schedule sampling and layer normalization are utilized in training for better generalization. The initial learning rate is with a decay rate of 0.7 after every 8 epochs. The hidden size of recurrent units in our model is 96 for METR-LA dataset; while it is assigned to 64 for the rest.
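
The stated schedule, a decay rate of 0.7 after every 8 epochs, is a step decay. A minimal sketch follows; the function name `step_decay_lr` is hypothetical, and the initial learning rate is passed in as `base_lr`, with the value in the usage line serving only as a placeholder.

```python
def step_decay_lr(epoch, base_lr, decay=0.7, every=8):
    """Learning rate for a zero-based epoch index under step decay."""
    return base_lr * decay ** (epoch // every)

# Usage: learning rates over 80 epochs for a placeholder initial rate.
schedule = [step_decay_lr(e, base_lr=1e-3) for e in range(80)]
```

Over 80 epochs the rate is reduced nine times, so the last epochs train at 0.7 to the power 9, roughly 4% of the initial rate.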

Results Analysis

Table 1 reports the numerical results of spatio-temporal traffic prediction on the METR-LA and PeMS-M datasets. We observe the following phenomena on both datasets: 1) graph convolution based models, including STGCN, DCRNN and ST-UNet, generally outperform the other baselines, which emphasizes the importance of including graph topology in traffic prediction; 2) RNN-based models tend to perform better for long-term prediction, suggesting their advantage in capturing temporal dependency; 3) on the adopted metrics, ST-UNet achieves the best performance for all three forecasting windows, which validates the effectiveness of multi-scale designs in spatio-temporal sequence modeling; 4) traditional approaches such as LSVR and ARIMA mostly perform worse than deep learning models, due to their limited capacity for handling complex non-linear data. In addition, the historical average reflects traffic status over the long term and is therefore invariant to short-term effects.

Model | METR-LA (15/30/60 min): MAE | MAPE (%) | RMSE | PeMS-M (15/30/60 min): MAE | MAPE (%) | RMSE
HA | 4.16 | 13.0 | 7.80 | 4.01 | 10.61 | 7.20
LSVR | 2.97/3.64/4.67 | 7.68/9.9/13.63 | 5.89/7.35/9.13 | 2.50/3.63/4.54 | 5.81/8.88/11.50 | 4.55/6.67/8.28
ARIMA | 3.99/5.15/6.90 | 9.6/12.7/17.4 | 8.21/10.45/13.23 | 5.55/5.86/6.83 | 12.92/13.94/17.34 | 9.00/9.13/11.48
FNN | 3.99/4.23/4.49 | 9.9/12.9/14.0 | 7.94/8.17/8.69 | 2.39/3.41/4.88 | 5.53/8.16/12.12 | 4.40/6.40/8.84
FC-LSTM | 3.44/3.77/4.37 | 9.6/10.9/13.2 | 6.30/7.23/8.69 | 3.67/3.87/4.19 | 9.09/9.57/10.55 | 6.58/7.03/7.79
STGCN | 2.87/3.48/4.45 | 7.4/9.4/11.8 | 5.54/6.84/8.41 | 2.25/3.03/4.02 | 5.26/7.33/9.85 | 4.04/5.70/7.64
DCRNN | 2.77/3.15/3.60 | 7.3/8.8/10.5 | 5.38/6.45/7.59 | 2.25/2.98/3.83 | 5.30/7.39/9.85 | 4.04/5.58/7.19
ST-UNet | 2.72/3.12/3.55 | 6.9/8.4/10.0 | 5.13/6.16/7.40 | 2.15/2.81/3.38 | 5.06/6.79/8.33 | 4.03/5.42/6.68
Table 1. Performance comparison of different models on METR-LA and PeMS-M datasets.

4.3. Ablation Study of ST-Pool & ST-Unpool

As the above two tasks reveal, ST-UNet steadily outperforms mainstream models in spatio-temporal prediction. But it may be argued that the performance gains are actually due to the deeper architecture, or come from multilevel abstraction in the spatial or temporal dimension alone. Therefore, we conduct an ablation study to investigate the contribution of the spatio-temporal pooling and unpooling operations in our model. We test ST-UNet in four variants: the plain version (GCGRU), with all ST-Pool and ST-Unpool operations removed; T-UNet, with pooling and unpooling only in the temporal dimension; S-UNet, with pooling and unpooling only in the spatial dimension; and the full version. For a clean comparison, we test these variants without additional training tricks. The numerical outcome in Table 2 confirms that the proposed operations improve the model in both the spatial and the temporal dimension. Moreover, thanks to the multi-scale feature integration of the U-shaped network, applying pooling and unpooling jointly in space and time yields further improvement and better generalization.

Models GCGRU T-UNet S-UNet ST-UNet

MAE 2.248 ± 0.004
MAPE(%) 5.244 ± 0.028
RMSE 3.994 ± 0.005

MAE 2.980 ± 0.011
MAPE(%) 7.124 ± 0.037
RMSE 5.452 ± 0.019

MAE 3.756 ± 0.031
MAPE(%) 8.844 ± 0.082
RMSE 6.716 ± 0.025
Table 2. Comparison of ST-UNet variants with or without ST-Pool & ST-Unpool operations in terms of prediction accuracy on PeMS-M.
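
Each entry in Table 2 pairs a value with a much smaller second number, which reads as a mean and a spread over repeated runs. Under that assumption, a summary of this kind could be produced as sketched below. The helper name is hypothetical, and the choice of the sample standard deviation is also an assumption, since the text does not say which estimator was used.

```python
import statistics

def summarize_runs(values, digits=3):
    """Format repeated-run scores as 'mean ± std'.
    Assumption: sample standard deviation (n - 1 in the denominator)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return f"{mean:.{digits}f} ± {std:.{digits}f}"

# Toy scores for illustration only (not results from the paper).
print(summarize_runs([1.0, 2.0, 3.0]))
```

Reporting the spread alongside the mean makes it easier to judge whether the gap between two variants exceeds run-to-run variation.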

4.4. Comparison Study of Upsampling Approaches in ST-Unpool

As discussed in Section 3.3, there are three methods for upsampling spatial features in the unpooling part. We carry out an experiment to examine how the choice of method affects model performance. Table 3 summarizes the comparison of the three upsampling approaches in terms of the mean square error. Direct copy generally performs better than the other two, especially over relatively long horizons, which suggests that the simpler mechanism may be more stable and robust in this case. Furthermore, local properties within a super node, such as degree order and connectedness, may not contain enough information to support complex feature reconstruction, owing to the isomorphism of its node elements and the significant structural differences among other nodes.

Models Direct-Copy Ordered-Deconv Weighted-Deconv
15min 3.980 ± 0.009
30min 5.452 ± 0.019
60min 6.716 ± 0.025
Table 3. Comparison of different upsampling approaches in ST-Unpool in terms of MSE on PeMS-M (the notation ‘’ indicates that the tested model may not eventually converge).
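
The direct-copy approach compared in Table 3 can be illustrated with a small sketch. Section 3.3 is not reproduced here, so the reading below is an assumption: every node that was merged into a super node during pooling simply receives a copy of that super node's feature vector. The function name and the `assignment` representation are hypothetical.

```python
def direct_copy_unpool(coarse_features, assignment):
    """Restore per-node features by copying from super nodes.

    coarse_features: list of feature vectors, one per super node.
    assignment: assignment[i] is the index of the super node that
                original node i was merged into during pooling.
    """
    # list(...) makes an independent copy for every original node.
    return [list(coarse_features[c]) for c in assignment]

# Toy example: 5 original nodes pooled into 2 super nodes.
coarse = [[1.0, 2.0], [3.0, 4.0]]
assignment = [0, 0, 1, 0, 1]
restored = direct_copy_unpool(coarse, assignment)
print(restored)
```

Under this reading the operation has no trainable parameters, which is consistent with the observation above that the simpler mechanism may be more robust.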

4.5. Scalability and Efficiency Study on Large-scale Graph Data

To test the scalability and efficiency of ST-UNet, we evaluate our model and other GCN-based ones on a large dataset, PeMS-L, which contains over one thousand sensor nodes in a single graph. Table 4 compares the prediction accuracy of four major models. Conventional graph convolution based approaches, including GCGRU and DCRNN, face great challenges in handling such large-scale graphs. We use the symbol ‘’ to mark models whose batch size had to be halved because their GPU memory consumption exceeded the capacity of a standard GPU card. (All experiments are compiled and tested on a CentOS cluster; CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, GPU: NVIDIA GeForce GTX 1080.) Thanks to its fully convolutional structure, STGCN is able to process such a large dataset at once. By exploring spatio-temporal correlations from a global view, it performs well in short- and mid-term prediction but suffers from overfitting over long periods. DCRNN, on the other hand, maintains a higher standard in long-term forecasting, but at the cost of massive computational demands: the model normally takes more than 10 minutes to train one epoch with a batch size of 16 on PeMS-L. By contrast, ST-UNet achieves better results in less than half the time that DCRNN needs. It strikes a balance between time efficiency and prediction accuracy through the spatial and temporal pooling operations applied. It also has advantages in extracting spatial features and temporal dependencies with fewer parameters and in multilevel abstraction.

Models PeMS-L (15/ 30/ 60 min)
HA 4.60 12.50 8.05
2.48/ 3.43/ 4.08 5.76/ 8.45/ 10.28 4.40/ 6.25/ 7.62
STGCN 2.37/ 3.27/ 4.36 5.56/ 7.98/ 11.59 4.32/ 6.21/ 8.31
2.41/ 3.28/ 4.32 5.61/ 8.18/ 11.33 4.22/ 5.87/ 7.58
ST-UNet 2.34/ 3.02/ 3.66 5.54/ 7.56/ 9.52 4.32/ 5.81/ 7.14
Table 4. Comparison of GCN-based models in terms of prediction accuracy on the large-scale dataset PeMS-L.

5. Conclusion

In this paper, we propose a universal multi-scale architecture, ST-UNet, to learn and predict graph-structured time series, integrating multi-granularity graph convolution and dilated recurrent skip-connections through a U-shaped network design. Experiments show that our model consistently outperforms other state-of-the-art methods on several real-world datasets, indicating its great potential for extracting comprehensive spatio-temporal features through scale-spanning sequence modeling. The ablation study validates the efficiency improvement obtained from the proposed pooling and unpooling operations in the spacetime domain. Moreover, ST-UNet achieves a balance between efficiency and capacity with considerable flexibility. These features are promising and practical for structured sequence modeling in future research and industrial applications.


  • Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
  • Chang et al. (2017) Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A Hasegawa-Johnson, and Thomas S Huang. 2017. Dilated recurrent neural networks. In Advances in Neural Information Processing Systems. 77–87.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
  • Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
  • Gao and Ji (2019) Hongyang Gao and Shuiwang Ji. 2019. Graph U-Net.
  • Gao et al. (2018) Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. 2018. Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1416–1424.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
  • Hammond et al. (2011) David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. 2011. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30, 2 (2011), 129–150.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
  • Jain et al. (2016) Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5308–5317.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Koutnik et al. (2014) Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. 2014. A clockwork rnn. arXiv preprint arXiv:1402.3511 (2014).
  • Li et al. (2018a) Chaolong Li, Zhen Cui, Wenming Zheng, Chunyan Xu, and Jian Yang. 2018a. Spatio-Temporal graph convolution for skeleton based action recognition. In AAAI Conference on Artificial Intelligence.
  • Li et al. (2018b) Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018b. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International Conference on Learning Representations.
  • Maue and Sanders (2007) Jens Maue and Peter Sanders. 2007. Engineering algorithms for approximate weighted matching. In International Workshop on Experimental and Efficient Algorithms. Springer, 242–255.
  • Neil et al. (2016) Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. 2016. Phased lstm: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems. 3882–3890.
  • Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In International conference on Machine Learning. 2014–2023.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention. Springer, 234–241.
  • Seo et al. (2016) Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. 2016. Structured sequence modeling with graph convolutional recurrent networks. arXiv preprint arXiv:1612.07659.
  • Shuman et al. (2012) David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. 2012. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. arXiv preprint arXiv:1211.0053 (2012).
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
  • Xingjian et al. (2015) SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems. 802–810.
  • Yu et al. (2018) Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2018. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3634–3640.
  • Zeiler and Fergus (2014) Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision. Springer, 818–833.
  • Zeiler et al. (2011) Matthew D Zeiler, Graham W Taylor, Rob Fergus, et al. 2011. Adaptive deconvolutional networks for mid and high level feature learning. In International Conference on Computer Vision, Vol. 1. 6.
  • Zhang et al. (2018) Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, Xiuwen Yi, and Tianrui Li. 2018. Predicting citywide crowd flows using deep spatio-temporal residual networks. Artificial Intelligence 259 (2018), 147–166.