Recurrent Graph Tensor Networks
Yao Lei Xu, Danilo P. Mandic
Department of Electrical and Electronic Engineering, Imperial College London, SW7 2AZ, UK
Emails: {yao.xu15, d.mandic}@imperial.ac.uk
Abstract
Recurrent Neural Networks (RNNs) are among the most successful machine learning models for sequence modelling. In this paper, we show that the modelling of hidden states in RNNs can be approximated through a multilinear graph filter, which describes the directional flow of temporal information. The so-derived multilinear graph filter is then generalized to a tensor network form to improve its modelling power, resulting in a novel Recurrent Graph Tensor Network (RGTN). To validate the expressive power of the derived network, several variants of RGTN models were proposed and employed for the task of time-series forecasting, demonstrating superior properties in terms of convergence, performance, and complexity. By leveraging the multi-modal nature of tensor networks, RGTN models were shown to outperform a standard RNN by 45% in terms of mean-squared-error while using up to 90% fewer parameters. Therefore, by combining the expressive power of tensor networks with a suitable graph filter, we show that the proposed RGTN models can outperform a classical RNN at a drastically lower parameter complexity, especially in the multi-modal setting.
Tensor Networks, Tensor Decomposition, Graph Neural Networks, Graph Signal Processing, Recurrent Neural Networks.
1 Introduction
Tensors and graphs have found numerous applications in neural networks, offering promising solutions for improving deep learning systems. In this context, tensor methods have been used to relax the computational complexity of neural networks [9], as well as to alleviate their notorious "black-box" nature [5]. Graph-based methods have generalized classical convolutional neural networks to irregular data domains, resulting in graph neural networks that have achieved state-of-the-art results in many applications [14]. Despite promising results, there is a void in the literature regarding the combination of both techniques to solve deep learning challenges, especially in the area of sequence modelling. To address this issue, we set out to investigate Recurrent Neural Networks (RNNs) [8], the de facto deep learning tool for sequence modelling, from a tensor and graph theoretic perspective.
Tensors are a multilinear generalization of vectors and matrices to multi-way arrays, which allows for a richer representation of the data that is not limited to the classical "flat-view" matrix approaches [4]. Recent developments in tensor manipulation have led to Tensor Decomposition (TD) techniques that can represent high-dimensional tensors through a contracting network of smaller core tensors. Such TD techniques can be used to compress the number of parameters needed to represent high-dimensional data, and have found many applications in deep learning. Notably, it has been shown that TD techniques, such as the Tensor-Train Decomposition (TTD) [11], can be used to compress neural networks considerably while maintaining comparable performance [9].
The field of Graph Signal Processing (GSP) generalizes traditional signal processing concepts to irregular domains [12], which are naturally represented as graphs. Developments in GSP have led to a series of spatial- and spectral-based techniques that generalise the notions of frequency and locality to irregular domains, allowing for the processing of signals in a way that takes into account the underlying data domain [2]. Several concepts developed in GSP have found application in neural networks, where graph filters can be implemented across multiple layers to incorporate graph information [14].
However, despite the promising results achieved in both fields, the full potential arising from the combination of tensors and graphs is yet to be explored, especially in the area of sequence modelling. To this end, we set out to investigate RNNs using the theoretical framework underpinning tensor networks and graph signal processing. More specifically, we show that the modelling of RNN hidden states can be approximated through a multilinear graph filtering operation, which can be used in conjunction with tensor networks to create a novel Recurrent Graph Tensor Network (RGTN). Our experimental results confirm the superiority of the proposed RGTN models, demonstrating desirable properties in terms of convergence, performance, and complexity.
The rest of the paper is organised as follows. Section 2 introduces the necessary theoretical background regarding tensors, graphs, and RNNs. Section 3 derives the proposed RGTN models. Section 4 analyses the experimental results achieved by the proposed models, demonstrating their effectiveness. Finally, Section 5 summarises the virtues of the introduced framework.
2 Theoretical Background
A short theoretical background is presented below, covering several topics in tensor networks, graph signal processing, and recurrent neural networks. We refer the reader to [4], [12], and [8] for an in-depth treatment of these subjects.
2.1 Tensors and Tensor Networks
Table 1: Basic tensor nomenclature.
$\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$  Tensor of order $N$ and size $I_1 \times I_2 \times \cdots \times I_N$
$\mathbf{X} \in \mathbb{R}^{I_1 \times I_2}$  Matrix of size $I_1 \times I_2$
$\mathbf{x} \in \mathbb{R}^{I_1}$  Vector of size $I_1$
$x$  Scalar
$x_{i_1, i_2, \ldots, i_N}$  The $(i_1, i_2, \ldots, i_N)$ entry of $\mathcal{X}$
An order-$N$ tensor, $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, represents an $N$-way array with $N$ modes, where the $n$-th mode is of size $I_n$, for $n = 1, \ldots, N$. Scalars, vectors, and matrices are special cases of tensors of order 0, 1, and 2, respectively, as detailed in Table 1. A tensor can be reshaped into a matrix through the matricization process, while the reverse process is referred to as tensorization [4]. A tensor can also be reshaped into a vector through the vectorization process, which is denoted by the operator $\text{vec}(\cdot)$. The tensor indices in this paper are grouped according to the Little-Endian convention [6].
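To make the reshaping operations above concrete, the following is a minimal NumPy sketch of matricization and vectorization on an order-3 tensor; the use of `order='F'` (first index varying fastest) as a stand-in for the Little-Endian grouping is our assumption, since NumPy defaults to C ordering.

```python
import numpy as np

# An order-3 tensor of size 2 x 3 x 4.
x = np.arange(24).reshape(2, 3, 4)

# One possible matricization: merge the last two modes into columns.
mat = x.reshape(2, 12)

# Vectorization with the first index varying fastest (Little-Endian style).
vec = x.reshape(-1, order='F')

assert mat.shape == (2, 12)
assert vec[1] == x[1, 0, 0]   # the first mode index moves fastest
```

Tensorization is simply the inverse reshape, e.g. `vec.reshape(2, 3, 4, order='F')` recovers `x`.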
An $(m, n)$-contraction, denoted by $\times_n^m$, between an $N$-th order tensor, $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_n \times \cdots \times I_N}$, and an $M$-th order tensor, $\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_m \times \cdots \times J_M}$, with equal dimensions $I_n = J_m$, yields a tensor of order $(N + M - 2)$, $\mathcal{C} = \mathcal{A} \times_n^m \mathcal{B}$, with entries defined as in (1) [4]. For the special case of matrices, $\mathbf{A} \in \mathbb{R}^{I_1 \times I_2}$ and $\mathbf{B} \in \mathbb{R}^{J_1 \times J_2}$ with $I_2 = J_1$, the contraction $\mathbf{A} \times_1^2 \mathbf{B}$ denotes the standard matrix multiplication $\mathbf{A}\mathbf{B}$.
(1)  $c_{i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N, j_1, \ldots, j_{m-1}, j_{m+1}, \ldots, j_M} = \sum_{i_n = 1}^{I_n} a_{i_1, \ldots, i_n, \ldots, i_N} \, b_{j_1, \ldots, j_{m-1}, i_n, j_{m+1}, \ldots, j_M}$
A (left) Kronecker product between two tensors, $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$, denoted by $\mathcal{C} = \mathcal{A} \otimes \mathcal{B}$, yields a tensor of the same order, $\mathcal{C} \in \mathbb{R}^{I_1 J_1 \times \cdots \times I_N J_N}$, with entries $c_{\overline{i_1 j_1}, \ldots, \overline{i_N j_N}} = a_{i_1, \ldots, i_N} b_{j_1, \ldots, j_N}$, where $\overline{i_n j_n} = i_n + (j_n - 1) I_n$ [4]. For the special case of matrices $\mathbf{A} \in \mathbb{R}^{I_1 \times I_2}$ and $\mathbf{B} \in \mathbb{R}^{J_1 \times J_2}$, the Kronecker product yields a block matrix:
(2)  $\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} \mathbf{A} b_{11} & \cdots & \mathbf{A} b_{1 J_2} \\ \vdots & \ddots & \vdots \\ \mathbf{A} b_{J_1 1} & \cdots & \mathbf{A} b_{J_1 J_2} \end{bmatrix}$
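A quick NumPy check of the left Kronecker convention may be helpful; note that `np.kron(A, B)` implements the standard (right) Kronecker product with blocks $a_{ij}\mathbf{B}$, so under the assumption that the left Kronecker product satisfies $\mathbf{A} \otimes_L \mathbf{B} = \mathbf{B} \otimes \mathbf{A}$, it is obtained by swapping the arguments:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

# Left Kronecker product (assumed convention): blocks of A scaled by entries of B.
left_kron = np.kron(B, A)

assert left_kron.shape == (4, 4)
# Block (1, 2) of the result equals b_{12} * A = 1 * A.
assert np.array_equal(left_kron[:2, 2:], A)
```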
A tensor network admits a graphical representation of tensor contractions, where each tensor is represented as a node, and the number of edges that extend from that node corresponds to the tensor order [3]. An edge connecting two nodes represents a linear contraction over the modes of equal dimension associated with that edge. An example of tensor contraction in tensor network form is shown in Figure 1.
Special instances of tensor networks include Tensor Decomposition (TD) networks. The TD methods approximate high-order, large-dimensional tensors via contractions of smaller core tensors, which reduces the computational complexity drastically while preserving the data structure [3]. For instance, the Tensor-Train decomposition (TTD) [10], [11] is a highly efficient TD method that can decompose a large order-$N$ tensor, $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, into $N$ smaller third-order core tensors, $\mathcal{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, as:
(3)  $\mathcal{X} = \mathcal{G}^{(1)} \times_1^3 \mathcal{G}^{(2)} \times_1^3 \cdots \times_1^3 \mathcal{G}^{(N)}$
where the set of $R_n$, for $n = 0, 1, \ldots, N$ with $R_0 = R_N = 1$, is referred to as the TT-rank. By virtue of TTD, the number of entries in the original tensor is effectively reduced from $\prod_{n=1}^{N} I_n$ to $\sum_{n=1}^{N} R_{n-1} I_n R_n$, which is highly efficient for high $N$ and low TT-rank. An example of TTD is shown in Figure 2.
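To illustrate how a tensor factorizes into a train of third-order cores, the following is a minimal sequential-SVD sketch (in the spirit of the TT-SVD algorithm of [11]); the function names, the C-order index grouping, and the single `max_rank` truncation parameter are our simplifying assumptions:

```python
import numpy as np

def tt_decompose(x, max_rank):
    """Decompose an order-N tensor into N third-order cores G[n] of shape
    (R_{n-1}, I_n, R_n), with R_0 = R_N = 1, via sequential truncated SVDs."""
    dims = x.shape
    cores = []
    r_prev = 1
    c = x.reshape(1, -1)                     # running remainder
    for n in range(len(dims) - 1):
        c = c.reshape(r_prev * dims[n], -1)  # unfold: current mode vs the rest
        u, s, vt = np.linalg.svd(c, full_matrices=False)
        r = min(max_rank, len(s))            # truncate to the target TT-rank
        cores.append(u[:, :r].reshape(r_prev, dims[n], r))
        c = s[:r, None] * vt[:r]             # carry the remainder forward
        r_prev = r
    cores.append(c.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into the full tensor."""
    full = cores[0]
    for g in cores[1:]:
        full = np.tensordot(full, g, axes=([full.ndim - 1], [0]))
    return full.squeeze(axis=(0, full.ndim - 1))

# A rank-1 tensor of size 3 x 4 x 5 is represented exactly with TT-rank 1:
x = np.einsum('i,j,k->ijk', np.arange(1.0, 4.0),
              np.arange(1.0, 5.0), np.arange(1.0, 6.0))
cores = tt_decompose(x, max_rank=1)
assert np.allclose(tt_reconstruct(cores), x)
```

Here the 60 entries of the full tensor are stored in 3 + 4 + 5 = 12 core entries, illustrating the compression ratio quoted above.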
2.2 Graph Signal Processing
A graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ is defined by a set of $N$ vertices (or nodes), $v_n \in \mathcal{V}$ for $n = 1, \ldots, N$, and a set of edges, $e_{mn} = (v_m, v_n) \in \mathcal{E}$, connecting the $m$-th and $n$-th vertices, for $m = 1, \ldots, N$ and $n = 1, \ldots, N$. A signal on a given graph is defined by a vector $\mathbf{x} \in \mathbb{R}^{N}$ such that $x_n = x(v_n)$, which associates a signal value to every node on the graph [13].
A given graph can be fully described in terms of its weighted adjacency matrix, $\mathbf{A} \in \mathbb{R}^{N \times N}$, such that $a_{mn} > 0$ if $e_{mn} \in \mathcal{E}$, and $a_{mn} = 0$ if $e_{mn} \notin \mathcal{E}$. Alternatively, the same graph can be described by its Laplacian matrix, defined as $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D}$ is the diagonal degree matrix such that $d_{nn} = \sum_{m} a_{nm}$. In addition, both the weighted adjacency matrix and the Laplacian matrix can be presented in normalized form, as $\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$ and $\mathbf{D}^{-1/2}\mathbf{L}\mathbf{D}^{-1/2}$, respectively [13].
In addition to capturing the underlying graph structure, both the Laplacian matrix and the weighted adjacency matrix can be used as shift operators to filter signals on graphs. Practically, a graph shift based filter results in a linear combination of vertex-shifted graph signals, which captures graph information at a local level [12]. For instance, the operation $\mathbf{x}' = \mathbf{A}\mathbf{x}$ results in a filtered signal such that:
(4)  $x'_n = \sum_{m \in \pi_n} a_{nm} x_m$
where $\pi_n$ denotes the set of one-hop neighbours that are directly connected to the $n$-th node. For $K$ graph signals stacked in matrix form as $\mathbf{X} \in \mathbb{R}^{N \times K}$, equation (4) can be compactly written as:
(5)  $\mathbf{X}' = \mathbf{A}\mathbf{X}$
For reaching neighbours that are up to $K$ hops away, equation (5) can be extended to its polynomial form, $\mathbf{X}' = \sum_{k=0}^{K} h_k \mathbf{A}^k \mathbf{X}$, where $h_k$ are scalar constants [12]. Fundamentally, a $K$-hop graph filter acts locally in the vertex space of a graph, which takes into account the irregular domain underlying the data, as described by its weighted adjacency matrix.
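A polynomial graph filter of this kind can be sketched in a few lines of NumPy; the function name and the toy directed line graph below are our illustrative assumptions:

```python
import numpy as np

def graph_filter(A, X, h):
    """Polynomial graph filter: X' = sum_k h[k] * A^k @ X,
    computed by repeatedly shifting the signal with A."""
    out = np.zeros_like(X, dtype=float)
    shifted = X.astype(float)
    for hk in h:
        out += hk * shifted
        shifted = A @ shifted   # one more graph shift (one more hop)
    return out

# Toy directed line graph over 3 nodes: information flows 0 -> 1 -> 2.
A = np.array([[0, 0, 0],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
x = np.array([[1.0], [2.0], [3.0]])

# Identity tap plus a damped one-hop tap: x' = x + 0.5 * A x.
x_filt = graph_filter(A, x, [1.0, 0.5])
assert np.allclose(x_filt, [[1.0], [2.5], [4.0]])
```

The triangular structure of this adjacency matrix (only past nodes feed future ones) anticipates the recurrent graph filter derived in Section 3.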
2.3 Recurrent Neural Networks
Recurrent Neural Networks (RNNs) [8], [7] are among the most successful deep learning tools for sequence modelling. A standard RNN layer captures time-varying dependencies by processing hidden states, $\mathbf{h}_t \in \mathbb{R}^{M}$, at time $t$, through feedback (or recurrent) weights as:
(6)  $\mathbf{h}_t = \sigma\left(\mathbf{W}\mathbf{h}_{t-1} + \mathbf{P}\mathbf{x}_t + \mathbf{b}\right)$
where $\mathbf{h}_{t-1} \in \mathbb{R}^{M}$ is the hidden state vector from the previous time-step, $\mathbf{x}_t \in \mathbb{R}^{I}$ is the input feature vector at time $t$, $\mathbf{W} \in \mathbb{R}^{M \times M}$ is the feedback matrix, $\mathbf{P} \in \mathbb{R}^{M \times I}$ is the input weight matrix, $\mathbf{b} \in \mathbb{R}^{M}$ is a bias vector, and $\sigma(\cdot)$ is an element-wise activation function.
Finally, after extracting the hidden states, these can be passed through additional weight matrices to generate outputs, $\mathbf{y}_t$, at time $t$, in the form:
(7)  $\mathbf{y}_t = \sigma\left(\mathbf{V}\mathbf{h}_t + \mathbf{b}_y\right)$
where $\mathbf{V}$ is the output weight matrix, $\mathbf{h}_t$ is the hidden state at time $t$, $\mathbf{b}_y$ is a bias vector, and $\sigma(\cdot)$ is an element-wise activation function.
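The hidden-state recursion of a standard RNN layer can be sketched as follows; the symbol names `W`, `P`, `b` for the feedback, input, and bias parameters, and the zero initial state, are our naming assumptions:

```python
import numpy as np

def rnn_forward(X, W, P, b, activation=np.tanh):
    """Run a plain RNN layer over a (T, I) input matrix X (rows are x_t):
    h_t = activation(W @ h_{t-1} + P @ x_t + b), with h_0 = 0.
    Returns the (T, M) matrix of hidden states."""
    T = X.shape[0]
    M = W.shape[0]
    H = np.zeros((T, M))
    h = np.zeros(M)
    for t in range(T):
        h = activation(W @ h + P @ X[t] + b)
        H[t] = h
    return H

# Sanity check: with zero feedback, no bias, and a linear activation,
# each hidden state is just the projected input P @ x_t.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
H = rnn_forward(X, np.zeros((2, 2)), np.eye(2), np.zeros(2),
                activation=lambda z: z)
assert np.allclose(H, X)
```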
3 Recurrent Graph Tensor Networks
3.1 Special Recurrent Graph Filter
In this section, we derive the implicit graph filter underlying RNNs by considering a special case of the hidden state equation for modelling sequential data.
Consider a linear form of equation (6) without the bias term (non-linearity and bias can be introduced later on, as discussed in Section 3.3). Let $\mathbf{h}_0 = \mathbf{0}$ for $T$ successive time-steps; then equation (6) can be written in block-matrix form as:
(8)  $\begin{bmatrix} \mathbf{h}_1 \\ \mathbf{h}_2 \\ \vdots \\ \mathbf{h}_T \end{bmatrix} = \begin{bmatrix} \mathbf{P} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{W}\mathbf{P} & \mathbf{P} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{W}^{T-1}\mathbf{P} & \mathbf{W}^{T-2}\mathbf{P} & \cdots & \mathbf{P} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_T \end{bmatrix}$
We now define: (i) $\mathbf{X} \in \mathbb{R}^{T \times I}$ as the matrix generated by stacking the input vectors, $\mathbf{x}_t^{\mathsf{T}}$, as row-vectors over $T$ successive time-steps; (ii) $\mathbf{H} \in \mathbb{R}^{T \times M}$ as the matrix generated by stacking the hidden state vectors, $\mathbf{h}_t^{\mathsf{T}}$, as row-vectors at the corresponding time-steps; and (iii) $\mathbf{R} \in \mathbb{R}^{TM \times TI}$ as the block matrix composed of the powers of $\mathbf{W}$ from (8). This allows equation (8) to be expressed as:
(9)  $\text{vec}(\mathbf{H}) = \mathbf{R} \, \text{vec}(\mathbf{X})$
Consider the special case of $\mathbf{W} = \lambda\mathbf{I}$, where $\lambda$ is a positive constant less than 1 and $\mathbf{I}$ is the identity matrix. This allows the hidden state equation to be expressed as $\mathbf{h}_t = \lambda\mathbf{h}_{t-1} + \mathbf{P}\mathbf{x}_t$, which represents a system where past information is propagated to the future via a damping factor $\lambda$. For this special case, we can simplify the block matrix in equation (8) as:
(10)  $\mathbf{R} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ \lambda & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \lambda^{T-1} & \lambda^{T-2} & \cdots & 1 \end{bmatrix} \otimes \mathbf{P}$
Let $\mathbf{G} \in \mathbb{R}^{T \times T}$ denote the triangular matrix in (10); then the same equation can be expressed in matrix form as $\mathbf{H} = \mathbf{G}\mathbf{X}\mathbf{P}^{\mathsf{T}}$. Observe that the matrix $\mathbf{G}$ can be further decomposed as $\mathbf{G} = \mathbf{I} + \mathbf{A}$, where $\mathbf{I}$ is the identity matrix and $\mathbf{A}$ is the weighted adjacency matrix composed of the powers of the constant $\lambda$, that is, $a_{t\tau} = \lambda^{t-\tau}$ for $t > \tau$ and $a_{t\tau} = 0$ otherwise. Therefore, the special case of recurrent modelling can be expressed as:
(11)  $\mathbf{H} = \left(\mathbf{I} + \mathbf{A}\right)\mathbf{X}\mathbf{P}^{\mathsf{T}}$
which is a form of localized graph filter, as discussed in (5). Each of the $T$ time-steps can now be considered as a node of a graph from which signals are sampled, and $\mathbf{A}$ is the corresponding weighted graph adjacency matrix connecting different time-steps. This also justifies the triangular nature of the graph filter, since only past information can influence future states, and not vice-versa.
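The equivalence between the damped recurrence and its graph-filter form can be verified numerically; the dimensions and the random weights below are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
T, I, M = 6, 4, 3          # time-steps, input features, hidden units (toy sizes)
lam = 0.5                  # damping factor, assumed positive and less than 1
P = rng.standard_normal((M, I))
X = rng.standard_normal((T, I))

# (a) Run the damped linear recurrence h_t = lam * h_{t-1} + P @ x_t directly.
H_rnn = np.zeros((T, M))
h = np.zeros(M)
for t in range(T):
    h = lam * h + P @ X[t]
    H_rnn[t] = h

# (b) Build the graph filter G = I + A, with A[t, tau] = lam^(t - tau) for t > tau,
# and apply it as H = G @ X @ P.T, matching the localized filter of equation (5).
A = np.zeros((T, T))
for t in range(T):
    for tau in range(t):
        A[t, tau] = lam ** (t - tau)
H_graph = (np.eye(T) + A) @ X @ P.T

assert np.allclose(H_rnn, H_graph)
```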
3.2 General Recurrent Graph Filter
We now proceed to relax the restrictions from Section 3.1, so as to extend the graph filter from (11) to the general case of sequence modelling.
Let the feedback matrix be a scaled idempotent matrix, that is, $\mathbf{W} = \lambda\mathbf{Q}$, where $\lambda$ is a positive constant less than 1 that models the damping effect, and $\mathbf{Q}$ is an idempotent matrix ($\mathbf{Q}\mathbf{Q} = \mathbf{Q}$) that models how information propagates between successive time-steps. For this setup, the feedback matrix has the property $\mathbf{W}^k = \lambda^k\mathbf{Q}$ for $k$ greater than 0, which allows the block matrix $\mathbf{R}$ to be simplified as:
(12)  $\mathbf{R} = \begin{bmatrix} \mathbf{P} & \mathbf{0} & \cdots & \mathbf{0} \\ \lambda\mathbf{Q}\mathbf{P} & \mathbf{P} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda^{T-1}\mathbf{Q}\mathbf{P} & \lambda^{T-2}\mathbf{Q}\mathbf{P} & \cdots & \mathbf{P} \end{bmatrix}$
The block matrix $\mathbf{R}$ with the above properties can be further decomposed by using: (i) the weighted graph adjacency matrix, $\mathbf{A}$, from equation (11); (ii) the idempotent matrix, $\mathbf{Q}$; and (iii) the identity matrix, $\mathbf{I}$, to yield:
(13)  $\mathbf{R} = \mathbf{I} \otimes \mathbf{P} + \mathbf{A} \otimes \left(\mathbf{Q}\mathbf{P}\right)$
which allows us to express equation (9) in its full form as:
(14)  $\text{vec}(\mathbf{H}) = \left(\mathbf{I} \otimes \mathbf{P} + \mathbf{A} \otimes \left(\mathbf{Q}\mathbf{P}\right)\right) \text{vec}(\mathbf{X})$
Finally, we define the multilinear graph filter, $\mathcal{F} \in \mathbb{R}^{T \times M \times T \times I}$, to be the 4th-order tensorization of the block matrix, $\mathbf{R}$, from equation (13). This allows us to simplify the expression in (14) without the vectorization operator, by means of a double tensor contraction, as:
(15)  $\mathbf{H} = \mathcal{F} \times_{3,4}^{1,2} \mathbf{X}$, that is, $h_{t,m} = \sum_{\tau=1}^{T} \sum_{i=1}^{I} f_{t,m,\tau,i} \, x_{\tau,i}$
This indicates that the graph structure discussed in Section 3.1 can be considered for the general case of sequence modelling, resulting in a multilinear graph filter, $\mathcal{F}$, capable of modelling sequential information defined on a time-vertex graph domain.
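As with the special case, the general scaled-idempotent construction can be checked numerically; the particular idempotent matrix and the toy dimensions below are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
T, I, M = 5, 3, 4
lam = 0.6                                  # damping factor, assumed less than 1
Q = np.diag([1.0, 1.0, 0.0, 1.0])          # a simple idempotent matrix: Q @ Q == Q
W = lam * Q                                # scaled idempotent feedback matrix
P = rng.standard_normal((M, I))
X = rng.standard_normal((T, I))

# (a) Linear recurrence h_t = W @ h_{t-1} + P @ x_t.
H_rnn = np.zeros((T, M))
h = np.zeros(M)
for t in range(T):
    h = W @ h + P @ X[t]
    H_rnn[t] = h

# (b) Graph-filter form implied by R = I kron P + A kron (Q P):
#     H = X @ P.T + A @ X @ (Q @ P).T, with A[t, tau] = lam^(t - tau) for t > tau.
A = np.zeros((T, T))
for t in range(T):
    for tau in range(t):
        A[t, tau] = lam ** (t - tau)
H_filter = X @ P.T + A @ X @ (Q @ P).T

assert np.allclose(Q @ Q, Q)               # idempotency, so W^k = lam^k * Q for k > 0
assert np.allclose(H_rnn, H_filter)
```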
3.3 Tensor Network Formulation
We next introduce a number of novel Recurrent Graph Tensor Network (RGTN) models, which benefit from the expressive power of tensor networks and the graph filters derived in the previous sections.
Consider the special case of graph filtering in Section 3.1, where the hidden states are extracted through a graph filter that is localized in the vertex space. The extracted hidden states can be considered as feature maps, which can be flattened and passed through additional dense layers to generate application-dependent outputs [1]. Using the tensor network notation, we can represent the special graph filter contraction and the dense layer matrix contraction as a unique tensor network, as shown in Figure 3. We refer to this special graph filter based tensor network as the special Recurrent Graph Tensor Network (sRGTN).
For the general case of graph filtering in Section 3.2, we can similarly represent the double contraction in (15) to derive the tensor network in Figure 4. This is referred to as the general Recurrent Graph Tensor Network (gRGTN). Unlike the sRGTN in Figure 3, where the graph filter and the feature map contractions can be modelled separately, the gRGTN implies a stronger coupling of the features with the underlying graph domain, as captured by the multilinear graph filter $\mathcal{F}$.
For multi-modal problems, we propose a highly efficient variant of the sRGTN by appealing to the super-compression power of the TTD. Indeed, by reshaping dense layer matrices into higher-order tensors and representing them in the TT format, we can drastically reduce the parameter complexity, as discussed in [9]. This allows us to simultaneously: (i) maintain the inherent tensor structure of the problem; (ii) drastically reduce the parameter complexity of the model; and (iii) incorporate the underlying graph topology. This leads to the multi-modal, Tensor-Train variant of the sRGTN (sRGTN-TT), which is illustrated in Figure 5.
Finally, non-linearity can be introduced into all considered models by applying a point-wise activation function on top of a contraction. Non-linear layers can also be stacked one after another to increase the overall expressive power.
4 Experiments
4.1 Experimental Setting
In this section, the proposed Recurrent Graph Tensor Network (RGTN) models are implemented and compared against a standard Recurrent Neural Network (RNN), so as to validate the proposed models on the task of time-series forecasting.
The learning task of this experiment is to forecast the PM2.5 level across multiple sites in China, using the Beijing Multi-Site Air-Quality dataset [15]. Specifically, the data consists of hourly air-quality measurements of 12 variables (PM2.5, PM10, SO2, NO2, CO, O3, TEMP, PRES, DEWP, RAIN, wd, WSPM), recorded between 2013 and 2017 across 12 different geographical sites.
The given dataset is preprocessed by: (i) filling the missing data points with the corresponding feature median; (ii) scaling the numerical features between 0 and 1; and (iii) encoding the categorical features via one-hot encoding, which increases the total number of features to 27 per site.
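The three preprocessing steps above can be sketched on toy arrays as follows; the column split into numerical and categorical parts, and the values themselves, are hypothetical:

```python
import numpy as np

# Hypothetical numerical columns with a missing entry, and an integer-coded
# categorical column (e.g. a wind-direction code).
num = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [5.0, 0.0]])
cat = np.array([0, 2, 1])

# (i) Fill missing entries with the corresponding column median.
med = np.nanmedian(num, axis=0)
num = np.where(np.isnan(num), med, num)

# (ii) Min-max scale each numerical feature to the [0, 1] range.
num = (num - num.min(axis=0)) / (num.max(axis=0) - num.min(axis=0))

# (iii) One-hot encode the categorical column (3 categories here).
onehot = np.eye(3)[cat]

assert not np.isnan(num).any()
assert num.min() == 0.0 and num.max() == 1.0
assert np.allclose(onehot.sum(axis=1), 1.0)
```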
For the given task, a total of 4 models were implemented and compared. To achieve comparable results, all models were given the same architectural specification, as shown in Table 2, with the only difference being the choice of feature extraction method. Specifically, the four feature extraction methods are based on: (i) a simple recurrent neural network (RNN); (ii) a special Recurrent Graph Tensor Network (sRGTN), which implements the architecture in Figure 3; (iii) a general Recurrent Graph Tensor Network (gRGTN), which implements the architecture in Figure 4; and (iv) an sRGTN with TT decomposition (sRGTN-TT), which implements the architecture in Figure 5. Finally, all the models were trained with the same settings, that is, using: (i) a stochastic gradient descent optimizer with a learning rate of ; (ii) a mean-squared-error loss function; (iii) a batch size of 32; and (iv) a total of 100 epochs. The first 70% of the data was used for training purposes (20% of which is used for validation), and the remaining 30% for testing.
Table 2: Architecture specification of the four considered models.
RNN Model  Layer 1  Layer 2  Layer 3
Layer Type  RNN  Dense  Dense
Units  8  12  12
Activation  tanh  tanh  linear
sRGTN Model  Layer 1  Layer 2  Layer 3
Layer Type  sRGTN  Dense  Dense
Units  8  12  12
Activation  tanh  tanh  linear
gRGTN Model  Layer 1  Layer 2  Layer 3
Layer Type  gRGTN  Dense  Dense
Units  8  12  12
Activation  tanh  tanh  linear
sRGTN-TT Model  Layer 1  Layer 2  Layer 3
Layer Type  sRGTN  TT-Dense  Dense
Units  8  12  12
Activation  tanh  tanh  linear
TT-Rank  n.a.  (1,2,2,1)  n.a.
Note that for the sRGTN-TT model, each input data sample was kept in its natural multi-modal form, which contains a window of consecutive time-steps of data across the 12 different sites, where each site contains 27 features. For the RNN, sRGTN, and gRGTN models, each input data sample was a matrix created by concatenating the 27 features per site across all 12 sites (324 features in total) over the same consecutive time-steps. For all models, the target prediction variable is the PM2.5 measurement at the successive time-step across all 12 sites.
4.2 Experiment Results
Table 3: Number of trainable parameters and final test MSE of the four considered models.
Model  RNN  sRGTN  gRGTN  sRGTN-TT
N. Param.  3408  3384  3448  463
Test MSE  0.010935  0.009998  0.009183  0.008467
This section analyses the results obtained from the proposed experiment, demonstrating the superiority of the proposed RGTN models over a classical RNN model in terms of convergence, performance, and complexity, especially in the multi-modal setting.
Table 3 shows the number of trainable parameters and the final test Mean-Squared-Error (MSE) achieved by the four considered models. Simulation results confirm the superiority of the proposed class of RGTN models over a standard RNN, especially in the multi-modal setting. In particular, the proposed sRGTN-TT model successfully captured the inherent multi-modality of the underlying problem, achieving the best MSE score while using up to 90% fewer parameters than the other models.
Figure 6 shows the validation MSE of all four considered models during the training phase (in log scale). The error curves show that all RGTN models exhibited better convergence properties than the standard RNN model, requiring fewer epochs to converge.
5 Conclusion
We have proposed a novel Recurrent Graph Tensor Network (RGTN) architecture, by merging the expressive power of tensor networks with graph signal processing methods over irregular domains. We have provided the theoretical framework underpinning the proposed RGTN models and applied them to the task of time-series forecasting. The experimental results have shown that the proposed class of RGTN models exhibits desirable properties in terms of convergence, performance, and complexity. In particular, when dealing with multi-modal data, the proposed RGTN models have outperformed a standard RNN both in terms of performance (45% improvement in mean-squared-error) and complexity (90% reduction in trainable parameters).
References
[1] (2019) A state-of-the-art survey on deep learning theory and architectures. Electronics 8(3), p. 292.
[2] (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4), pp. 18–42.
[3] (2016) Tensor networks for dimensionality reduction and large-scale optimization: part 1, low-rank tensor decompositions. Foundations and Trends in Machine Learning 9(4-5), pp. 249–429.
[4] (2014) Era of big data processing: a new approach via tensor networks and tensor decompositions. arXiv preprint arXiv:1403.2048.
[5] (2016) On the expressive power of deep learning: a tensor analysis. In Proceedings of the Conference on Learning Theory, pp. 698–728.
[6] (2014) Alternating minimal energy methods for linear systems in higher dimensions. SIAM Journal on Scientific Computing 36(5), pp. A2248–A2271.
[7] (2020) The role of hidden Markov models and recurrent neural networks in event detection and localization for biomedical signals: theory and application. Information Fusion.
[8] (2001) Recurrent neural networks for prediction: learning algorithms, architectures and stability. John Wiley & Sons, Inc.
[9] (2015) Tensorizing neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pp. 442–450.
[10] (2009) Breaking the curse of dimensionality, or how to use SVD in many dimensions. SIAM Journal on Scientific Computing 31(5), pp. 3744–3759.
[11] (2011) Tensor-train decomposition. SIAM Journal on Scientific Computing 33(5), pp. 2295–2317.
[12] (2019) Graph signal processing, part II: processing and analyzing signals on graphs. arXiv preprint arXiv:1909.10325.
[13] (2019) Graph signal processing, part I: graphs, graph spectra, and spectral clustering. arXiv preprint arXiv:1907.03467.
[14] (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.
[15] (2017) Cautionary tales on air-quality improvement in Beijing. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 473(2205), p. 20170457.