STUNet: A Spatio-Temporal U-Network for
Graph-structured Time Series Modeling
Abstract.
Spatiotemporal graph learning is becoming an increasingly important topic in graph research. Many application domains involve highly dynamic graphs in which temporal information is crucial, e.g. traffic networks and financial transaction graphs. Despite steady progress in learning from structured data, effective means of extracting complex dynamic features from spatiotemporal structures are still lacking. In particular, conventional models such as convolutional or recurrent neural networks cannot simultaneously reveal short- and long-term temporal patterns and explore local and global spatial properties of spatiotemporal graphs. To tackle this problem, we design a novel multi-scale architecture, Spatio-Temporal U-Net (STUNet), for graph-structured time series modeling. In this U-shaped network, a paired sampling operation is proposed in the space-time domain: the pooling (STPool) coarsens the input graph spatially according to a deterministic partition and abstracts multi-resolution temporal dependencies through dilated recurrent skip connections; based on the settings used in downsampling, the unpooling (STUnpool) restores the original structure of the spatiotemporal graph and resumes regular intervals within graph sequences. Experiments on spatiotemporal prediction tasks demonstrate that our model effectively captures comprehensive features at multiple scales and achieves substantial improvements over mainstream methods on several real-world datasets.
1. Introduction
With the recent success of extending deep learning approaches from regular grids to structured data, graph representation learning has become an active research area. Many kinds of real-world data, such as social relations, biological molecules and sensor networks, naturally take the form of a graph. Recently, there has been a surge of interest in exploring and analyzing graph representations for tasks such as node classification and link prediction (Kipf and Welling, 2016; Hamilton et al., 2017; Gao et al., 2018). Among those studies, however, the dynamic graph has received less attention than the static graph, which consists of fixed node values or labels. The spatiotemporal graph is a typical dynamic graph, whose node inputs vary along the time axis, e.g. traffic sensor streams and human action sequences. In this work, we systematically study the dynamic graph in the space-time domain, aiming to develop a principled and effective method to interpret the spatiotemporal graph and to forecast future values or labels of certain nodes, or to predict the whole graph over the next few time steps.
In the field of spatiotemporal data, videos are a well-studied example, whose successive frames consistently share spatial and temporal structures. By combining different types of neural networks, hybrid frameworks have been constructed to exploit such spatiotemporal regularities within video frames, for instance in applications to weather radar echoes (Xingjian et al., 2015) and traffic heatmaps (Zhang et al., 2018). In these frameworks, each frame first passes through a convolutional neural network (CNN) for visual feature extraction, followed by a recurrent neural network (RNN) for sequence learning. Even though images can be regarded as special cases of graphs, widely used deep learning models still face significant challenges when applied to spatiotemporal graphs. First, graph-structured data are generated from a non-Euclidean domain and may not align with the regular grids that existing models require. Second, compared to grid-like data, the nodes of a graph have no spatial locality or ordering. Due to such irregularities, standard operations (for example, convolution and pooling) are not directly applicable to the graph domain.
To bridge this gap, Bruna et al. (2013) propose graph convolutional networks (GCNs), which redefine convolution and generalize it to arbitrary graphs based on spectral graph theory. The introduction of GCNs has driven the recent rapid development of graph research. Moreover, GCNs have been successfully adopted in a variety of applications that are closely tied to dynamic graphs. For instance, in action recognition, human action sequences can be assembled into a spatiotemporal graph in which body joints form a series of skeleton graphs changing along the time axis. Accordingly, Li et al. (2018a) design a GCN-based model to capture the spatial patterns of skeleton sequences as well as the temporal dynamics contained therein. In traffic forecasting, each sensor station streams the traffic status of a certain road within a traffic network. In this sensor graph, spatial edges are weighted by the pairwise distance between stations in the network, while temporal edges connect the same sensor across adjacent time frames. Recent studies have investigated the feasibility of combining GCNs with RNNs (Li et al., 2018b) or CNNs (Yu et al., 2018) for traffic prediction on road networks. GCN-based models obtain considerable improvements over traditional ones, which typically ignore spatiotemporal correlations and lack the capability to handle structured sequences.
To accurately understand the local and global properties of dynamic graphs, it is necessary to process the data at multiple scales. The spatiotemporal graph particularly requires such scale-spanning analysis because of its particularity and complexity in the space-time domain. However, most mainstream methods have overlooked this principle, partly because of the difficulty of extending existing operations such as pooling to graph data. Nevertheless, multi-scale modeling of the dynamic graph resembles the pixel-wise prediction task, with an image pixel corresponding to a graph node. U-shaped networks, with U-Net (Ronneberger et al., 2015) as the representative, achieve state-of-the-art performance on pixel-level prediction; their architecture has a high capacity for representing both locally distributed and globally hidden information in the input. It is therefore particularly appealing to apply such a U-shaped design to modeling dynamic graphs.
In this paper, we propose a novel multi-scale framework, Spatio-Temporal U-Net (STUNet), to model and predict graph-structured time series. To precisely capture the spatiotemporal correlations in dynamic graphs, we first generalize the U-shaped architecture from images to spatiotemporal graphs. STUNet employs multi-granularity graph convolution to extract both generalized and localized spatial features, and adds dilated recurrent skip connections to capture multi-resolution temporal dependencies. Under these settings, we define the two essential operations of the framework. The spatiotemporal pooling (STPool) operation samples nodes to form a smaller graph from the output of a deterministic graph partition (Maue and Sanders, 2007) and abstracts time series at multiple temporal resolutions through skip connections between recurrent units. The unpooling (STUnpool), as its paired operation, restores the original structure and temporal dependency of dynamic graphs based on the settings used in downsampling. To better localize the representation from the input, higher-level features retrieved from the pooling part are concatenated with the upsampled output. Overall, thanks to its hierarchical U-shaped design, STUNet is able to effectively derive multi-scale features and precisely learn representations from the spatiotemporal graph.
2. Related Work
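Following the spectral-based formulation (Bruna et al., 2013; Niepert et al., 2016; Defferrard et al., 2016), the graph convolution operator '$*_{\mathcal{G}}$' is introduced as the multiplication of a graph signal $x \in \mathbb{R}^{n}$ with a kernel $\Theta = \mathrm{diag}(\theta)$, where $\theta \in \mathbb{R}^{n}$ is a vector of Fourier coefficients, as
(1) $$\Theta *_{\mathcal{G}} x = \Theta(L)\,x = \Theta(U \Lambda U^{\top})\,x = U\,\Theta(\Lambda)\,U^{\top} x,$$
where $U \in \mathbb{R}^{n \times n}$ is the graph Fourier basis, which is the matrix of eigenvectors of the normalized graph Laplacian $L = I_n - D^{-1/2} W D^{-1/2} = U \Lambda U^{\top}$ ($I_n$ is an identity matrix and $D$ is the diagonal degree matrix of the adjacency matrix $W$ with $D_{ii} = \sum_j W_{ij}$), while $\Lambda$ is the diagonal matrix of eigenvalues of $L$ (Shuman et al., 2012). In order to localize the filter, the kernel $\Theta$ can be restricted to a truncated expansion of Chebyshev polynomials $T_k$ up to order $K-1$ with the rescaled $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I_n$ as $\Theta(\Lambda) \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda})$, where $\theta \in \mathbb{R}^{K}$ is a vector of Chebyshev coefficients (Hammond et al., 2011). Hence, the graph convolution can be expressed as
(2) $$\Theta *_{\mathcal{G}} x = \Theta(L)\,x \approx \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{L})\, x,$$
where $T_k(\tilde{L}) \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order $k$ evaluated at the rescaled Laplacian $\tilde{L} = 2L/\lambda_{\max} - I_n$.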
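The Chebyshev-approximated graph convolution can be illustrated with a minimal NumPy sketch. It assumes a small, dense, symmetric adjacency matrix and a single-channel signal; the function names `normalized_laplacian` and `cheb_graph_conv` are illustrative and are not taken from the authors' implementation. The polynomial terms are built with the standard recurrence $T_k(\tilde{L})x = 2\tilde{L}\,T_{k-1}(\tilde{L})x - T_{k-2}(\tilde{L})x$, so no eigendecomposition of the filter itself is needed.

```python
import numpy as np

def normalized_laplacian(W):
    """Return L = I - D^{-1/2} W D^{-1/2} for a symmetric weighted adjacency W."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d, dtype=float)
    nz = d > 0                      # isolated nodes keep a zero scaling factor
    d_inv_sqrt[nz] = 1.0 / np.sqrt(d[nz])
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def cheb_graph_conv(W, x, theta):
    """Compute sum_k theta[k] * T_k(L_tilde) @ x with the Chebyshev recurrence."""
    L = normalized_laplacian(W)
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - np.eye(L.shape[0])   # spectrum rescaled to [-1, 1]
    t_prev, t_curr = x, L_tilde @ x                    # T_0 x and T_1 x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        # T_k x = 2 * L_tilde @ T_{k-1} x - T_{k-2} x
        t_prev, t_curr = t_curr, 2.0 * L_tilde @ t_curr - t_prev
        out = out + theta[k] * t_curr
    return out
```

With a kernel of size $K$ the filter only mixes information within $K-1$ hops of each node, which is what makes the operation spatially localized.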
Apart from convolutional operations on graphs, several recent studies focus on structured sequence learning. Structured RNN (Jain et al., 2016) fits the spatiotemporal graph into a mixture of recurrent neural networks by associating each node and edge with a certain type of network. Building on the convLSTM framework (Xingjian et al., 2015), the graph convolutional recurrent network (GCRN) (Seo et al., 2016) was the first to model structured sequences by replacing regular 2D convolution with spectral-based graph convolution, and it set a trend of GCN-embedded designs for follow-up studies (Li et al., 2018b; Yu et al., 2018). Recently, an encoder-decoder model on graphs was developed for graph embedding tasks. This model, known as graph U-Net (Gao and Ji, 2019), brings pooling and upsampling operations to graph data. However, its scope is limited to static graphs. It also introduces extra trainable parameters for node selection during pooling. Furthermore, its pooling operation does not preserve the original structure of the input graph, which may be an issue for tasks in which local spatial relations are critical.
3. Methodology
In this section, we start with the definition of the spatiotemporal graph and the formulation of prediction tasks on it. We then elaborate the design of the U-shaped network, with the essential operations of pooling and upsampling defined on the spatiotemporal graph. Based on these components, a multi-scale architecture, Spatio-Temporal U-Net, is finally introduced for graph-structured time series modeling.
3.1. Spatiotemporal Graph Modeling
Suppose spatiotemporal data are gathered over a structured spatial region consisting of $n$ nodes. Inside each node, there are $D$ measurements which vary over time. Thus, the observation of node $i$ at any time $t$ can be represented by a feature vector $x_t^i \in \mathbb{R}^{D}$. Moreover, the data collected over the whole region can be expressed as a feature matrix $X_t \in \mathbb{R}^{n \times D}$. As time goes by, a chronological sequence of matrices $X_1, X_2, \dots$ is accumulated, which can be further formalized as the spatiotemporal graph defined as follows.
Definition 3.1 (Spatiotemporal Graph).
A spatiotemporal graph is an attributed graph with a time-variable feature matrix $X$. It is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, W, X)$, where $\mathcal{V}$ is the set of $|\mathcal{V}| = n$ vertices, $\mathcal{E}$ is the set of edges, and $W \in \mathbb{R}^{n \times n}$ is an adjacency matrix recording the weighted connectedness between two vertices. Contrary to the static graph, node attributes of the spatiotemporal one evolve over time as $X = \{X_1, X_2, \dots, X_T\} \in \mathbb{R}^{T \times n \times D}$, where $T$ is the number of time steps and $D$ is the dimension of features in each node.
In practice, owing to the structural properties of the data, spatiotemporal graph modeling can be formulated as a prediction task on graph-structured time series. The objective is to accurately predict the future attributes of nodes in a given spatiotemporal graph based on historical records, as formally described below.
Definition 3.2 (Spatiotemporal Prediction).
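Spatiotemporal prediction aims to forecast the most likely future length-$H$ sequence of node attributes in a graph given the previous $M$ observations:
(3) $$\hat{X}_{t+1}, \dots, \hat{X}_{t+H} = \underset{X_{t+1}, \dots, X_{t+H}}{\arg\max}\ \log P\left(X_{t+1}, \dots, X_{t+H} \mid X_{t-M+1}, \dots, X_{t};\ \mathcal{G}\right),$$
where $X_t$ is an observation of node attributes linked by a weighted graph $\mathcal{G}$ at time step $t$.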
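To make the prediction setup concrete, the sketch below slices a recorded series into (history, future) training pairs. It is only an illustration: the helper name `make_windows`, the array layout `(T, n, D)`, and the parameter names `M` (history length) and `H` (forecast horizon) are assumptions made here, not details confirmed by the text.

```python
import numpy as np

def make_windows(X, M, H):
    """Split a (T, n, D) series into inputs (N, M, n, D) and targets (N, H, n, D).

    Sample i uses steps i .. i+M-1 as history and steps i+M .. i+M+H-1 as target.
    """
    T = X.shape[0]
    N = T - M - H + 1                 # number of complete (history, future) pairs
    if N <= 0:
        raise ValueError("series too short for the requested window sizes")
    inputs = np.stack([X[i:i + M] for i in range(N)])
    targets = np.stack([X[i + M:i + M + H] for i in range(N)])
    return inputs, targets
```

For records aggregated at 5-minute intervals, a one-hour history corresponds to `M = 12` steps and a 60-minute horizon to `H = 12` steps.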
3.2. Pooling Operation on Spatiotemporal Graphs
The spatiotemporal graph can be decomposed into two domains: graph-structured data in space and time series in time. As a result, it inherits structural complexity from graphs and dynamic complexity from sequences. In this section, we therefore discuss downsampling approaches from the spatial and the temporal perspective in turn. Finally, a unified pooling operation is defined in the space-time domain.
Spatial Graph Pooling
Pooling layers play a vital role in CNNs because they achieve feature reduction. A pooling layer generally follows a convolutional layer to progressively reduce the spatial resolution of feature maps and enlarge receptive fields, thereby controlling overfitting and achieving better generalization. However, the standard pooling operation is not directly applicable to graph-structured data, since it requires well-defined neighborhoods, which graphs do not readily provide. Besides local pooling, there are operations applied to the input as a whole that bypass the requirement of locality information, such as global pooling and max pooling. But these pooling approaches bring issues of limited flexibility and inconsistent selection (Gao and Ji, 2019).
Pooling is indispensable for extracting multi-level abstractions of graphs. We therefore use the improved path growing algorithm (PGA) (Maue and Sanders, 2007) to partition the graph by solving the maximum weight matching problem (referred to as 'MaxWeightMatching'). Given the graph at a time step, PGA finds an approximate solution to the problem: a subset of edges such that 1) no two edges in the subset share an endpoint, and 2) their total weight is as large as possible. The algorithm then generates the partition by gradually removing the matched edges and merging the nodes they connect, as Algorithm 1 describes. At each level, it reduces the size of the graph by a factor of two, producing a coarser graph that corresponds to observing the data domain at a different resolution:
(4) 
where is a partitioned graph with nodes at the level which controls reduction scale of the input. is a set of super nodes, each element of which contains a disjoint subset of . We use to denote mapping relations between nodes in and . Formally, after the graph convolutional layer, we can acquire the convolved feature matrix of a coarser graph through the graph partition algorithm as
(5) 
where is a graph signal matrix with attributes in each node of while is a length feature matrix with channels in each node of and is the number of nodes contained in each super node . Finally, we employ the maximum or mean feature activation over nodes in partitioned regions to obtain pooled features in each of regarding the channel dimension as
(6) 
where is the output of spatial graph pooling with channel features on nodes. Figure 1 (a) shows an example of the proposed spatial graph pooling. Since the graph partition is computed in advance, the operation is very efficient and introduces no extra trainable parameters. Moreover, the scope of spatial graph pooling can be adjusted precisely through the level parameter. To address the inconsistency issue in node selection, the deterministic partition produced by PGA is applied identically to the graph at every time step.
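The partition-then-pool idea can be sketched as follows. Note the substitution: instead of the improved path growing algorithm, this sketch uses a simple greedy heavy-edge matching (pick the heaviest remaining edge whose endpoints are both unmatched), which is a different and cruder approximation of maximum weight matching. The function names `greedy_matching_partition` and `pool_features` are hypothetical. Unmatched nodes become singleton super nodes, so a single level reduces the node count by at most a factor of two.

```python
import numpy as np

def greedy_matching_partition(W):
    """Return a length-n array of super-node ids from a greedy heavy-edge matching.

    This is a stand-in for the improved PGA, not an implementation of it.
    """
    n = W.shape[0]
    edges = [(W[i, j], i, j) for i in range(n) for j in range(i + 1, n) if W[i, j] > 0]
    edges.sort(key=lambda e: -e[0])          # heaviest first; stable, hence deterministic
    cluster = -np.ones(n, dtype=int)
    next_id = 0
    for _, i, j in edges:
        if cluster[i] < 0 and cluster[j] < 0:   # both endpoints still unmatched
            cluster[i] = cluster[j] = next_id
            next_id += 1
    for i in range(n):                        # leftover nodes become singletons
        if cluster[i] < 0:
            cluster[i] = next_id
            next_id += 1
    return cluster

def pool_features(X, cluster, mode="max"):
    """Pool an (n, C) feature matrix into (n_super, C), channel by channel.

    mode="max" takes the maximum activation; any other value takes the mean.
    """
    reduce_fn = np.max if mode == "max" else np.mean
    n_super = cluster.max() + 1
    return np.stack([reduce_fn(X[cluster == s], axis=0) for s in range(n_super)])
```

Because the cluster assignment depends only on the adjacency matrix, it can be computed once and reused for the feature matrix of every time step, which mirrors the consistency argument made above.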
Temporal Downsampling
Recurrent neural networks and their variants have shown impressive stability and capability in tackling sequence learning problems. Conventional recurrent models such as long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRU) (Chung et al., 2014) were initially designed for regular sequences with fixed time intervals, which significantly limits their capacity for capturing complex data dependencies. Recently, several studies have explored how to extend the recurrent units in RNNs to more sophisticated data such as spatiotemporal data. Based on the fully-connected LSTM (FCLSTM), Xingjian et al. (2015) develop a modified recurrent network with embedded convolutional layers (convLSTM) to forecast spatiotemporal sequences. Inside each recurrent unit, convolutions with kernels replace multiplications by dense matrices, which enables the network to handle image sequences. Seo et al. (2016) later extend this approach to structured sequence modeling by replacing the standard convolution with graph convolution. Following a similar scheme, we combine the GRU model with GCN layers as Graph Convolutional Gated Recurrent Units (GCGRU) to discover temporal patterns in graph-structured time series:
(7)  
where is the Hadamard product and stands for nonlinear activation functions. In this setting, and represent the gate of update and reset at time step ; while and denote the current memory content and final memory at current time step respectively. Both and are parameters of the size graph convolutional kernel. We use the notion ‘’ to describe the graph convolution between the graph signal and filters which are the functions of the graph Laplacian parameterized by localized Chebyshev coefficients as Eq. (2) notes. By stacking several graph convolutional recurrent layers, the adopted backbone GCGRU can be used as a seq2seq model for graphstructured sequence learning.
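As an illustration of how a GRU and graph convolution can be combined, the following NumPy sketch performs one recurrent step. It follows the standard GRU gating (update gate, reset gate, candidate memory) with every dense multiplication replaced by a Chebyshev graph convolution. The exact form is an assumption: concatenating input and hidden state, omitting biases, and the names `cheb_basis`, `graph_conv`, `gcgru_step` and the `params` dictionary are choices made for this sketch and may differ from Eq. (7).

```python
import numpy as np

def cheb_basis(L_tilde, K):
    """Return [T_0(L~), ..., T_{K-1}(L~)] as dense (n, n) matrices."""
    n = L_tilde.shape[0]
    basis = [np.eye(n)]
    if K > 1:
        basis.append(L_tilde)
    for _ in range(2, K):
        basis.append(2.0 * L_tilde @ basis[-1] - basis[-2])
    return basis

def graph_conv(basis, X, Theta):
    """Multi-channel graph convolution: X is (n, C_in), Theta is (K, C_in, C_out)."""
    return sum(T @ X @ Theta[k] for k, T in enumerate(basis))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gcgru_step(basis, x, h, params):
    """One GRU step on a graph. x: (n, C_x), h: (n, C_h); returns the new (n, C_h) state."""
    xh = np.concatenate([x, h], axis=1)
    z = sigmoid(graph_conv(basis, xh, params["z"]))       # update gate
    r = sigmoid(graph_conv(basis, xh, params["r"]))       # reset gate
    xrh = np.concatenate([x, r * h], axis=1)              # r * h is the Hadamard product
    c = np.tanh(graph_conv(basis, xrh, params["c"]))      # current memory content
    return z * h + (1.0 - z) * c                          # final memory at this step
```

Each entry of `params` has shape `(K, C_x + C_h, C_h)`, so the number of parameters depends on the kernel size and channel widths but not on the number of nodes.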
The above architecture may be sufficient to model structured sequences by exploiting local stationarity and spatiotemporal correlations, but it is still restricted to interpreting temporal dynamics over fixed periods. For multi-timescale modeling, many attempts have been made to extend recurrent networks to various time scopes, including phased LSTM (Neil et al., 2016) and clockwork RNNs (Koutnik et al., 2014). Inspired by the skip design between recurrent units in (Chang et al., 2017), we insert skip connections between gated recurrent units to learn graph-structured sequences at multiple levels of temporal dependency. This also creates a dilation between successive cells, which is equivalent to abstracting temporal features at a different resolution. Denote $c_t^{(l)}$ as the GCGRU cell in layer $l$ at time $t$. The dilated skip connection can be expressed as
(8) $$c_t^{(l)} = f\left(x_t^{(l)},\ c_{t-s^{(l)}}^{(l)}\right),$$
where $x_t^{(l)}$ is the input to layer $l$ at time $t$; $s^{(l)}$ denotes the skip length, also referred to as the dilation of layer $l$; and $f(\cdot)$ represents the GRU cell and output operations. Figure 1 (b) provides a diagram of the proposed temporal downsampling implemented by the dilated recurrent skip connections. This hierarchical design of dilation introduces multiple temporal scales for recurrent units at different layers. Like the regular skip connection, it also broadens the range of temporal dependency, but with fewer parameters and higher efficiency.
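The dilated skip connection amounts to a small change in how a recurrent layer is unrolled: the state fed to the cell at step t is the one produced `dilation` steps earlier rather than the immediately preceding one. The sketch below shows this for an arbitrary cell function; `run_dilated_layer` and its arguments are illustrative names, and using a shared initial state `h0` for the first `dilation` steps is an assumption of this sketch.

```python
def run_dilated_layer(cell, inputs, dilation, h0):
    """Unroll a recurrent layer whose state at step t depends on step t - dilation.

    cell(x_t, prev_state) -> new_state; steps with no earlier state use h0.
    """
    states = []
    for t, x_t in enumerate(inputs):
        prev = states[t - dilation] if t >= dilation else h0
        states.append(cell(x_t, prev))
    return states
```

With `dilation=1` this reduces to an ordinary recurrent layer. With `dilation=2` the sequence is effectively split into two interleaved sub-sequences (even and odd steps), each processed at half the temporal resolution.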
In summary, building on the above proposals for pooling on spatiotemporal data, we define spatiotemporal pooling (STPool) as the operation that downsamples a spatiotemporal graph in two ways: on its spatial projection, it aggregates convolved features over non-overlapping partitions with respect to the channel dimension; on its temporal projection, it dilates dynamic dependencies over the recurrent units aligned in the same layer.
3.3. Spatiotemporal Unpooling Operation
As the inverse of downsampling, unpooling is crucial in the U-shaped network for recovering pooled features to their original resolution through upsampling. Several approaches defined on grid-like data serve this aim, such as transposed convolution (Zeiler et al., 2011) and unpooling layers (Zeiler and Fergus, 2014). However, these operations are not directly applicable to the spatiotemporal domain due to the specialty and compositionality of its data. To this end, we propose spatiotemporal unpooling (STUnpool): to restore the primary structure of the input, the operation uses the reversed mapping to place merged nodes and edges back from the coarsened graph into the original graph; to resume regular temporal dependencies between recurrent units, the output of each time step in a skip-connected layer is fed into a vanilla recurrent layer without further temporal dilation.
Meanwhile, we provide three strategies for upsampling node attributes from a coarser graph, namely direct copy, ordered deconv and weighted deconv. As the name suggests, the first approach directly copies the features of a super node to each node it contains, while ordered deconvolution assigns parameterized features to each merged node based on its degree order. On top of ordered deconvolution, the weighted variant concatenates the structural information of merged nodes in a subgraph, as an embedded feature vector, to the upsampled features. All three upsampling methods are tested and compared in Section 4.4.
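Of the three strategies, direct copy is simple enough to sketch in a few lines. The example assumes the node-to-super-node mapping saved during pooling is available as an integer array `cluster` of length n; this representation and the names `unpool_direct_copy` and `unpool_and_concat` are assumptions of this sketch. The second helper shows the concatenation with features retained from the downsampling path, as used in U-shaped designs.

```python
import numpy as np

def unpool_direct_copy(pooled, cluster):
    """Copy each super node's (C,) features to every original node it contains.

    pooled: (n_super, C); cluster: length-n array of super-node ids -> (n, C).
    """
    return pooled[np.asarray(cluster)]

def unpool_and_concat(pooled, cluster, skip):
    """Upsample by direct copy, then concatenate the (n, C_skip) skip features."""
    return np.concatenate([unpool_direct_copy(pooled, cluster), skip], axis=1)
```

Direct copy has no trainable parameters, which is one plausible reason for a simple mechanism to behave robustly.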
3.4. Architecture of SpatioTemporal UNet
Based on the spatiotemporal pooling and unpooling operations proposed above, we develop a U-shaped multi-scale architecture, Spatio-Temporal U-Net, to address the challenge of analyzing and predicting graph-structured sequences. Following the classic U-shaped design, it contains two symmetric parts: downsampling and upsampling. In the contracting part, it first applies graph convolution to aggregate information from each node's neighborhood, followed by the STPool layer, which encodes the convolved features at multiple spatiotemporal resolutions. In the expansive part, it uses the STUnpool layer to upsample the reduced features to their original dimensions and concatenates them with the corresponding high-level features retrieved from downsampling. Finally, one graph convolution layer is attached to propagate information across multiple spatial scales for the final prediction. The proposed architecture is illustrated in Figure 1. The main characteristics of STUNet can be summarized in three aspects:

1) To the best of our knowledge, this is the first time that a multi-scale network with a U-shaped design has been applied to learn and model spatiotemporal structures from graph-structured time series.

2) A novel pair of operators, spatiotemporal pooling and unpooling, is proposed for extracting and fusing multi-level features in the space-time domain.

3) The proposed framework STUNet achieves a balance between accuracy and efficiency with considerable scalability through multi-scale feature extraction and fusion, as shown in the experiments below.
4. Experimental Studies
In this section, we evaluate the model proposed in Section 3.4. Several mainstream models are tested and analyzed on spatiotemporal prediction tasks. Experiments show that STUNet consistently outperforms other models and achieves state-of-the-art prediction accuracy. We also perform an ablation study to validate the effectiveness of the spatiotemporal pooling and unpooling operations. A comparison among GCN-based models suggests that STUNet is superior in balancing efficiency and scalability on the large-scale dataset. For a fair comparison, we run a grid search to determine the best hyperparameters on the validation set for all tested models.
4.1. Spatiotemporal Sequence Modeling on MovingMNIST
To investigate the capability of node-level prediction, we compare STUNet with its plain version, GCGRU, on a synthetic dataset, movingMNIST, constructed by Xingjian et al. (2015). It consists of 20-frame sequences (the first 10 frames as input and the last 10 for prediction), each of which contains two handwritten digits bouncing inside a 64 × 64 patch. To make the task feasible for all tested models, the frames are downsampled to 32 × 32 in this experiment. Following the default setup in (Seo et al., 2016), image frames are converted into spatiotemporal graphs. The adjacency matrix is constructed from the distances between each pixel node and its neighbors in a k-nearest-neighbor graph in four directions (up, down, left and right). The kernel size of graph convolution is set to 3 for both models. The visualized results of moving sequence prediction in Figure 2 indicate that, thanks to hierarchical feature fusion in the space-time domain, the U-shaped network learns better representations and performs better at the node level than the model based purely on GCNs. This also suggests that such multi-scale designs transfer from regular grids to non-Euclidean domains.
4.2. Graphstructured Timeseries Modeling on Traffic Prediction
Experimental Setup
For the traffic prediction task, we conduct experiments on two real-world public datasets. METRLA, released by Li et al. (2018b), includes traffic information gathered by 207 loop detectors in Los Angeles County over 4 months, from March 1st to June 30th of 2012. PeMS (M/L), generated by Yu et al. (2018), contains traffic status collected from monitoring stations deployed over the California state highway system on the weekdays of May and June of 2012, covering 228 and 1026 stations respectively. Both datasets aggregate traffic records into 5-minute intervals and provide an adjacency matrix describing the sensor topology of the traffic network. We use the same experimental settings as previous studies on these two datasets, including data preprocessing, dataset split, and other related configurations.
The following mainstream methods are selected as baselines: 1) Historical Average (HA); 2) Linear Support Vector Regression (LSVR); 3) Auto-Regressive Integrated Moving Average (ARIMA); 4) Feed-forward Neural Network (FNN); 5) Fully-Connected LSTM (FCLSTM) (Sutskever et al., 2014); 6) Spatio-Temporal Graph Convolutional Networks (STGCN) (Yu et al., 2018); 7) Diffusion Convolutional Recurrent Neural Network (DCRNN) (Li et al., 2018b).
This task uses traffic time series observed within a one-hour window to forecast the status in the next 15, 30, and 60 minutes. Three standard metrics of sequence prediction are adopted to measure the performance of different methods, namely Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE).
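For reference, the three metrics can be computed as in the sketch below. The definitions are the standard ones; skipping targets whose magnitude is below `eps` when computing MAPE is an assumption of this sketch (it avoids division by zero) and is not stated in the text.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred, eps=1e-8):
    """Mean Absolute Percentage Error in percent, skipping near-zero targets."""
    mask = np.abs(y_true) > eps
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100.0)

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

RMSE penalizes large individual errors more heavily than MAE, while MAPE normalizes by the magnitude of the target, which is why the three are usually reported together.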
STUNet Settings
All STUNet models use the kernel size for the graph convolution. Both spatial pooling level and temporal dilation are set at 2 with ‘direct copy’ employed as the upsampling approach. We train our models by using Adam optimizer to minimize the mean of and loss for 80 epochs with the batch size as 50. The schedule sampling and layer normalization are utilized in training for better generalization. The initial learning rate is with a decay rate of 0.7 after every 8 epochs. The hidden size of recurrent units in our model is 96 for METRLA dataset; while it is assigned to 64 for the rest.
Results Analysis
Table 1 reports the numerical results of spatiotemporal traffic prediction on the METRLA and PeMSM datasets. We observe the following on both datasets: 1) graph-convolution-based models, including STGCN, DCRNN and STUNet, generally outperform the other baselines, which emphasizes the importance of including graph topology in traffic prediction; 2) RNN-based models tend to perform better on long-term prediction, suggesting their advantage in capturing temporal dependency; 3) on the adopted metrics, STUNet achieves the best performance for all three forecasting windows, which validates the effectiveness of multi-scale designs in spatiotemporal sequence modeling; 4) traditional approaches such as LSVR and ARIMA mostly perform worse than deep learning models, due to their limited capacity for handling complex nonlinear data. In addition, the historical average reflects long-term traffic status and is therefore invariant to short-term impacts.
Results on METRLA and PeMSM (each cell: 15/ 30/ 60 min)

| Model | METRLA MAE | METRLA MAPE (%) | METRLA RMSE | PeMSM MAE | PeMSM MAPE (%) | PeMSM RMSE |
|---|---|---|---|---|---|---|
| HA | 4.16 | 13.0 | 7.80 | 4.01 | 10.61 | 7.20 |
| LSVR | 2.97/ 3.64/ 4.67 | 7.68/ 9.9/ 13.63 | 5.89/ 7.35/ 9.13 | 2.50/ 3.63/ 4.54 | 5.81/ 8.88/ 11.50 | 4.55/ 6.67/ 8.28 |
| ARIMA | 3.99/ 5.15/ 6.90 | 9.6/ 12.7/ 17.4 | 8.21/ 10.45/ 13.23 | 5.55/ 5.86/ 6.83 | 12.92/ 13.94/ 17.34 | 9.00/ 9.13/ 11.48 |
| FNN | 3.99/ 4.23/ 4.49 | 9.9/ 12.9/ 14.0 | 7.94/ 8.17/ 8.69 | 2.39/ 3.41/ 4.88 | 5.53/ 8.16/ 12.12 | 4.40/ 6.40/ 8.84 |
| FCLSTM | 3.44/ 3.77/ 4.37 | 9.6/ 10.9/ 13.2 | 6.30/ 7.23/ 8.69 | 3.67/ 3.87/ 4.19 | 9.09/ 9.57/ 10.55 | 6.58/ 7.03/ 7.79 |
| STGCN | 2.87/ 3.48/ 4.45 | 7.4/ 9.4/ 11.8 | 5.54/ 6.84/ 8.41 | 2.25/ 3.03/ 4.02 | 5.26/ 7.33/ 9.85 | 4.04/ 5.70/ 7.64 |
| DCRNN | 2.77/ 3.15/ 3.60 | 7.3/ 8.8/ 10.5 | 5.38/ 6.45/ 7.59 | 2.25/ 2.98/ 3.83 | 5.30/ 7.39/ 9.85 | 4.04/ 5.58/ 7.19 |
| STUNet | 2.72/ 3.12/ 3.55 | 6.9/ 8.4/ 10.0 | 5.13/ 6.16/ 7.40 | 2.15/ 2.81/ 3.38 | 5.06/ 6.79/ 8.33 | 4.03/ 5.42/ 6.68 |
4.3. Ablation Study of STPool & STUnpool
As the above two tasks reveal, STUNet steadily outperforms mainstream models for spatiotemporal prediction. It may be argued, however, that the performance gains are actually due to the deeper architecture, or to multi-level abstraction in the spatial or temporal dimension alone. We therefore conduct an ablation study to investigate the contribution of the spatiotemporal pooling and unpooling operations in our model. We test STUNet in four variants: the plain version (GCGRU), with all STPool and STUnpool operations removed; TUNet, with pooling and unpooling in the temporal dimension only; SUNet, with pooling and unpooling in the spatial dimension only; and the full version. For a clean comparison, we test these variants without additional training tricks. The numerical results in Table 2 confirm that the proposed operations improve the model in both the spatial and the temporal dimension. Moreover, thanks to the multi-scale feature integration of the U-shaped network, applying pooling and unpooling jointly in space and time yields further improvement and better generalization.
Models  GCGRU  TUNet  SUNet  STUNet  
min 
MAE  2.248 0.004  
MAPE(%)  5.244 0.028  
RMSE  3.994 0.005  
min 
MAE  2.980 0.011  
MAPE(%)  7.124 0.037  
RMSE  5.452 0.019  
min 
MAE  3.756 0.031  
MAPE(%)  8.844 0.082  
RMSE  6.716 0.025 
4.4. Comparison Study of Upsampling Approaches in STUnpool
As discussed in Section 3.3, there are three methods for upsampling spatial features in the unpooling part. We carry out an experiment to examine how these methods affect model performance. Table 3 summarizes the comparison of the three upsampling approaches in terms of mean square error. Direct copy generally performs better than the other two, especially over relatively long horizons, which suggests that the simple mechanism may be more stable and robust in this case. Furthermore, local properties within a super node, such as degree order and connectedness, may not contain enough information to support complex feature reconstruction, due to the isomorphism of its node elements and the significant structural differences among other nodes.
Models  DirectCopy  OrderedDeconv  WeightedDeconv 

15min  3.980 0.009  
30min  5.452 0.019  
60min  6.716 0.025 
4.5. Scalability and Efficiency Study on Largescale Graph Data
To test the scalability and efficiency of STUNet, we evaluate our model and other GCN-based ones on a large dataset, PeMSL, which contains over one thousand sensor nodes in a single graph. Table 4 compares the prediction accuracy of four major models. Conventional graph-convolution-based approaches, including GCGRU and DCRNN, clearly face great challenges in handling such large-scale graphs. We mark the models whose batch size had to be halved because their GPU memory consumption exceeded the capacity of a standard GPU card. (All experiments are compiled and tested on a CentOS cluster; CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, GPU: NVIDIA GeForce GTX 1080.) Thanks to its fully convolutional structure, STGCN is able to process such a large dataset at once. By exploring spatiotemporal correlations in a global view, it behaves well in short- and mid-term prediction but suffers from overfitting over long periods. DCRNN, on the other hand, maintains a higher standard in long-term forecasting, but at the cost of massive computational demands. For instance, the model normally takes more than 10 minutes to train one epoch with a batch size of 16 on PeMSL. By contrast, STUNet delivers better results in less than half of the time that DCRNN needs. It reaches a balance between time efficiency and prediction accuracy through the spatial and temporal pooling operations. It also has advantages in extracting spatial features and temporal dependencies with fewer parameters and in multi-level abstraction.
| Models | PeMSL MAE (15/ 30/ 60 min) | PeMSL MAPE (%) (15/ 30/ 60 min) | PeMSL RMSE (15/ 30/ 60 min) |
|---|---|---|---|
| HA | 4.60 | 12.50 | 8.05 |
| | 2.48/ 3.43/ 4.08 | 5.76/ 8.45/ 10.28 | 4.40/ 6.25/ 7.62 |
| STGCN | 2.37/ 3.27/ 4.36 | 5.56/ 7.98/ 11.59 | 4.32/ 6.21/ 8.31 |
| | 2.41/ 3.28/ 4.32 | 5.61/ 8.18/ 11.33 | 4.22/ 5.87/ 7.58 |
| STUNet | 2.34/ 3.02/ 3.66 | 5.54/ 7.56/ 9.52 | 4.32/ 5.81/ 7.14 |
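For reference, the three metrics reported in Table 4 can be computed as in the sketch below. It uses the standard definitions of MAE, MAPE and RMSE. Whether the original evaluation masks missing or zero readings is not stated in this section, so the sketch simply assumes that all ground-truth values are nonzero; the function names are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def _check(y_true: Sequence[float], y_pred: Sequence[float]) -> None:
    """Validate that both series are non-empty and of equal length."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    _check(y_true, y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Assumes every ground-truth value is nonzero (an assumption of this sketch).
    """
    _check(y_true, y_pred)
    ratios = (abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * sum(ratios) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    _check(y_true, y_pred)
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


if __name__ == "__main__":
    # Toy readings, not data from the paper.
    truth = [10.0, 20.0, 40.0]
    forecast = [12.0, 18.0, 40.0]
    print(mae(truth, forecast), mape(truth, forecast), rmse(truth, forecast))
```

RMSE penalizes large individual errors more heavily than MAE, which is one reason a model can rank differently under the two metrics within the same table.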
5. Conclusion
In this paper, we propose a universal multiscale architecture, STUNet, to learn and predict graph-structured time series, integrating multi-granularity graph convolution and dilated recurrent skip connections through a U-shaped network design. Experiments show that our model consistently outperforms other state-of-the-art methods on several real-world datasets, indicating its great potential for extracting comprehensive spatiotemporal features through scale-spanning sequence modeling. The ablation study validates the efficiency improvement obtained from the proposed pooling and unpooling operations in the spacetime domain. Moreover, STUNet achieves a balance between efficiency and capacity with considerable flexibility. These features are promising and practical for structured sequence modeling in future research and industrial applications.
References
 Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
 Chang et al. (2017) Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A Hasegawa-Johnson, and Thomas S Huang. 2017. Dilated recurrent neural networks. In Advances in Neural Information Processing Systems. 77–87.
 Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
 Gao and Ji (2019) Hongyang Gao and Shuiwang Ji. 2019. Graph U-Net. https://openreview.net/forum?id=HJePRoAct7
 Gao et al. (2018) Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. 2018. Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1416–1424.
 Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
 Hammond et al. (2011) David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. 2011. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30, 2 (2011), 129–150.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
 Jain et al. (2016) Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5308–5317.
 Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
 Koutnik et al. (2014) Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. 2014. A clockwork RNN. arXiv preprint arXiv:1402.3511 (2014).
 Li et al. (2018a) Chaolong Li, Zhen Cui, Wenming Zheng, Chunyan Xu, and Jian Yang. 2018a. Spatio-temporal graph convolution for skeleton based action recognition. In AAAI Conference on Artificial Intelligence.
 Li et al. (2018b) Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018b. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International Conference on Learning Representations.
 Maue and Sanders (2007) Jens Maue and Peter Sanders. 2007. Engineering algorithms for approximate weighted matching. In International Workshop on Experimental and Efficient Algorithms. Springer, 242–255.
 Neil et al. (2016) Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. 2016. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. In Advances in Neural Information Processing Systems. 3882–3890.
 Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning. 2014–2023.
 Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 234–241.
 Seo et al. (2016) Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. 2016. Structured sequence modeling with graph convolutional recurrent networks. arXiv preprint arXiv:1612.07659 (2016).
 Shuman et al. (2012) David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. 2012. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. arXiv preprint arXiv:1211.0053 (2012).
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
 Xingjian et al. (2015) SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems. 802–810.
 Yu et al. (2018) Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2018. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 3634–3640.
 Zeiler and Fergus (2014) Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision. Springer, 818–833.
 Zeiler et al. (2011) Matthew D Zeiler, Graham W Taylor, Rob Fergus, et al. 2011. Adaptive deconvolutional networks for mid and high level feature learning. In International Conference on Computer Vision, Vol. 1. 6.
 Zhang et al. (2018) Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, Xiuwen Yi, and Tianrui Li. 2018. Predicting citywide crowd flows using deep spatio-temporal residual networks. Artificial Intelligence 259 (2018), 147–166.