
Recurrent Graph Tensor Networks

Yao Lei Xu, Danilo P. Mandic

Department of Electrical and Electronic Engineering, Imperial College London, SW7 2AZ, UK
E-mails: {yao.xu15, d.mandic}@imperial.ac.uk

Abstract

Recurrent Neural Networks (RNNs) are among the most successful machine learning models for sequence modelling. In this paper, we show that the modelling of hidden states in RNNs can be approximated through a multi-linear graph filter, which describes the directional flow of temporal information. The so-derived multi-linear graph filter is then generalized to a tensor network form to improve its modelling power, resulting in a novel Recurrent Graph Tensor Network (RGTN). To validate the expressive power of the derived network, several variants of RGTN models were proposed and employed for the task of time-series forecasting, demonstrating superior properties in terms of convergence, performance, and complexity. By leveraging the multi-modal nature of tensor networks, RGTN models were shown to outperform a standard RNN by 45% in terms of mean-squared-error, while using up to 90% fewer parameters. Therefore, by combining the expressive power of tensor networks with a suitable graph filter, we show that the proposed RGTN models can outperform a classical RNN at a drastically lower parameter complexity, especially in the multi-modal setting.

Keywords: Tensor Networks, Tensor Decomposition, Graph Neural Networks, Graph Signal Processing, Recurrent Neural Networks.

1 Introduction

Tensors and graphs have found numerous applications in neural networks, offering promising solutions for improving deep learning systems. In this context, tensor methods have been used to reduce the computational complexity of neural networks [9], as well as to alleviate their notorious “black-box” nature [5]. Graph-based methods have generalized classical convolutional neural networks to irregular data domains, resulting in graph neural networks that have achieved state-of-the-art results in many applications [14]. Despite these promising results, there is a void in the literature regarding the combination of the two techniques to solve deep learning challenges, especially in the area of sequence modelling. To address this issue, we set out to investigate Recurrent Neural Networks (RNNs) [8], the de facto deep learning tool for sequence modelling, from a tensor and graph theoretic perspective.

Tensors are multi-linear generalizations of vectors and matrices to multi-way arrays, which allow for a richer representation of data that is not limited to the classical “flat-view” matrix approaches [4]. Recent developments in tensor manipulation have led to Tensor Decomposition (TD) techniques that can represent high-dimensional tensors through a contracting network of smaller core tensors. Such TD techniques can be used to compress the number of parameters needed to represent high-dimensional data, and have found many applications in deep learning. Notably, it has been shown that TD techniques, such as the Tensor-Train Decomposition (TTD) [11], can be used to compress neural networks considerably while maintaining comparable performance [9].

The field of Graph Signal Processing (GSP) generalizes traditional signal processing concepts to irregular domains [12], which are naturally represented as graphs. Developments in GSP have led to a series of spatial and spectral techniques that generalize the notions of frequency and locality to irregular domains, allowing for the processing of signals in a way that takes into account the underlying data domain [2]. Several concepts developed in GSP have found application in neural networks, where graph filters can be implemented across multiple layers to incorporate graph information [14].

However, despite the promising results achieved in both fields, the full potential arising from the combination of tensors and graphs is yet to be explored, especially in the area of sequence modelling. To this end, we set out to investigate RNNs using the theoretical framework underpinning tensor networks and graph signal processing. More specifically, we show that the modelling of RNN hidden states can be approximated through a multi-linear graph filtering operation, which can be used in conjunction with tensor networks to create a novel Recurrent Graph Tensor Network (RGTN). Our experimental results confirm the superiority of the proposed RGTN models, demonstrating desirable properties in terms of convergence, performance, and complexity.

The rest of the paper is organised as follows. Section 2 introduces the necessary theoretical background on tensors, graphs, and RNNs. Section 3 derives the proposed RGTN models. Section 4 analyses the experimental results achieved by the proposed models, demonstrating their effectiveness. Finally, Section 5 summarises the virtues of the introduced framework.

2 Theoretical Background

A short theoretical background is presented below, covering topics in tensor networks, graph signal processing, and recurrent neural networks. We refer the reader to [4], [12], and [8] for an in-depth treatment of these subjects.

2.1 Tensors and Tensor Networks


$\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$: $N$-th order tensor of size $I_1 \times I_2 \times \cdots \times I_N$

$\mathbf{X} \in \mathbb{R}^{I \times J}$: Matrix of size $I \times J$

$\mathbf{x} \in \mathbb{R}^{I}$: Vector of size $I$

$x$: Scalar

$x_{i_1, i_2, \ldots, i_N}$: $(i_1, i_2, \ldots, i_N)$-th entry of $\mathcal{X}$

Table 1: Tensor, matrix, vector, and scalar notation.

An order-$N$ tensor, $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, represents an $N$-way array with $N$ modes, where the $n$-th mode is of size $I_n$, for $n = 1, \ldots, N$. Scalars, vectors, and matrices are special cases of tensors of order 0, 1, and 2, respectively, as detailed in Table 1. A tensor can be reshaped into a matrix through the matricization process, while the reverse process is referred to as tensorization [4]. A tensor can also be reshaped into a vector through the vectorization process, denoted by the operator $\text{vec}(\cdot)$. The tensor indices in this paper are grouped according to the Little-Endian convention [6].

A mode-$(m,n)$ contraction, denoted by $\times_n^m$, between an $N$-th order tensor, $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, and an $M$-th order tensor, $\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_M}$, with equal dimensions $I_n = J_m$, yields a tensor of order $(N + M - 2)$, $\mathcal{C}$, with entries defined as in (1) [4]. For the special case of matrices, $\mathbf{A} \in \mathbb{R}^{I \times J}$ and $\mathbf{B} \in \mathbb{R}^{J \times K}$, the contraction $\mathbf{A} \times_2^1 \mathbf{B}$ denotes the standard matrix multiplication $\mathbf{A}\mathbf{B}$.

$c_{i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N, j_1, \ldots, j_{m-1}, j_{m+1}, \ldots, j_M} = \sum_{i_n = 1}^{I_n} a_{i_1, \ldots, i_{n-1}, i_n, i_{n+1}, \ldots, i_N} \, b_{j_1, \ldots, j_{m-1}, i_n, j_{m+1}, \ldots, j_M}$ (1)
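To make the contraction in (1) concrete, the following NumPy sketch (our own illustration, not part of the original paper) contracts an order-3 tensor with an order-4 tensor over a pair of modes of equal size, and checks the matrix special case against ordinary matrix multiplication.

```python
import numpy as np

# Contraction of an order-3 tensor A with an order-4 tensor B over one
# pair of modes of equal dimension, using np.tensordot.
A = np.random.rand(2, 3, 4)      # order-3 tensor
B = np.random.rand(3, 5, 6, 7)   # order-4 tensor

# Contract mode 2 of A (size 3) with mode 1 of B (size 3):
# the result has order 3 + 4 - 2 = 5.
C = np.tensordot(A, B, axes=([1], [0]))
print(C.shape)  # (2, 4, 5, 6, 7)

# Special case: for matrices, contracting the 2nd mode of A with the
# 1st mode of B recovers the standard matrix product AB.
A2 = np.random.rand(4, 3)
B2 = np.random.rand(3, 5)
assert np.allclose(np.tensordot(A2, B2, axes=([1], [0])), A2 @ B2)
```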

A (left) Kronecker product between two $N$-th order tensors, $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$, denoted by $\mathcal{A} \otimes_L \mathcal{B}$, yields a tensor of the same order, $\mathcal{C} \in \mathbb{R}^{I_1 J_1 \times \cdots \times I_N J_N}$, with entries $c_{\overline{i_1 j_1}, \ldots, \overline{i_N j_N}} = a_{i_1, \ldots, i_N} b_{j_1, \ldots, j_N}$, where $\overline{i_n j_n} = i_n + (j_n - 1) I_n$ [4]. For the special case of matrices, $\mathbf{A} \in \mathbb{R}^{I_1 \times I_2}$ and $\mathbf{B} \in \mathbb{R}^{J_1 \times J_2}$, the Kronecker product yields a block-matrix:

$\mathbf{A} \otimes_L \mathbf{B} = \begin{bmatrix} b_{11}\mathbf{A} & \cdots & b_{1 J_2}\mathbf{A} \\ \vdots & \ddots & \vdots \\ b_{J_1 1}\mathbf{A} & \cdots & b_{J_1 J_2}\mathbf{A} \end{bmatrix}$ (2)
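The block structure of a Kronecker product can be checked numerically. The short sketch below (ours) uses the standard np.kron; under the Little-Endian convention, the left variant is assumed here to differ only in the ordering of the blocks.

```python
import numpy as np

# Standard Kronecker product of two matrices: a block matrix whose
# (i, j)-th block is a_ij * B.
A = np.random.rand(2, 3)
B = np.random.rand(4, 5)
K = np.kron(A, B)
print(K.shape)  # (8, 15)

# Block structure check: the top-left block equals a_00 * B.
assert np.allclose(K[:4, :5], A[0, 0] * B)
```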
Figure 1: Tensor network representation of a contraction between tensors $\mathcal{A}$ and $\mathcal{B}$, performed over a pair of modes with equal dimensions.
Figure 2: Tensor network representation of the TT decomposition of an order-4 tensor, $\mathcal{X}$, as in (3).

A tensor network admits a graphical representation of tensor contractions, where each tensor is represented as a node, while the number of edges that extend from that node corresponds to the tensor order [3]. An edge connecting two nodes represents a linear contraction over the corresponding pair of modes of equal dimension. An example of tensor contraction in tensor network form is shown in Figure 1.

Special instances of tensor networks include Tensor Decomposition (TD) networks. TD methods approximate high-order, large-dimensional tensors via contractions of smaller core tensors, which reduces the computational complexity drastically while preserving the data structure [3]. For instance, the Tensor-Train decomposition (TTD) [10], [11] is a highly efficient TD method that can decompose a large order-$N$ tensor, $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, into $N$ smaller core tensors, $\mathcal{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, as:

$x_{i_1, i_2, \ldots, i_N} = \sum_{r_1 = 1}^{R_1} \cdots \sum_{r_{N-1} = 1}^{R_{N-1}} g^{(1)}_{1, i_1, r_1} \, g^{(2)}_{r_1, i_2, r_2} \cdots g^{(N)}_{r_{N-1}, i_N, 1}$ (3)

where the set of ranks $\{R_0, R_1, \ldots, R_N\}$, with $R_0 = R_N = 1$, is referred to as the TT-rank. By virtue of the TTD, the number of entries needed to represent the original tensor is effectively reduced from $\prod_{n=1}^{N} I_n$ to $\sum_{n=1}^{N} R_{n-1} I_n R_n$, which is highly efficient for high tensor order $N$ and low TT-rank. An example of TTD is shown in Figure 2.
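As an illustration of (3), the sketch below implements a simple TT-SVD style decomposition (our own illustrative routine, not the authors' implementation) and compares the number of parameters in the full tensor with that of the TT cores.

```python
import numpy as np

def tt_svd(x, ranks):
    """Decompose a tensor of shape (I1, ..., IN) into TT cores with
    prescribed internal ranks (R1, ..., R_{N-1}); R0 = RN = 1."""
    dims = x.shape
    n_modes = len(dims)
    ranks = [1] + list(ranks) + [1]
    cores = []
    c = x.reshape(dims[0], -1)              # unfold the remaining modes
    for n in range(n_modes - 1):
        u, s, vt = np.linalg.svd(c, full_matrices=False)
        r = ranks[n + 1]
        u, s, vt = u[:, :r], s[:r], vt[:r, :]
        cores.append(u.reshape(ranks[n], dims[n], r))
        c = (np.diag(s) @ vt).reshape(r * dims[n + 1], -1)
    cores.append(c.reshape(ranks[-2], dims[-1], 1))
    return cores

x = np.random.rand(4, 5, 6, 7)

# Low TT-rank approximation: far fewer parameters than the full tensor.
cores = tt_svd(x, ranks=(2, 2, 2))
print(x.size, sum(c.size for c in cores))   # 840 vs 66 parameters

# With full ranks the decomposition is exact (up to numerical error).
full = tt_svd(x, ranks=(4, 20, 7))
x_rec = np.einsum('aib,bjc,ckd,dle->ijkl', full[0], full[1], full[2], full[3])
assert np.allclose(x, x_rec)
```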

2.2 Graph Signal Processing

A graph is defined by a set of $N$ vertices (or nodes), $v_n$, for $n = 1, \ldots, N$, and a set of edges connecting the $m$-th and $n$-th vertices, for $m = 1, \ldots, N$ and $n = 1, \ldots, N$. A signal on a given graph is defined by a vector $\mathbf{x} \in \mathbb{R}^{N}$, such that $x_n$ associates a signal value with the $n$-th node of the graph [13].

A given graph can be fully described in terms of its weighted adjacency matrix, $\mathbf{A} \in \mathbb{R}^{N \times N}$, such that $a_{mn} > 0$ if an edge connects the $m$-th and $n$-th vertices, and $a_{mn} = 0$ otherwise. Alternatively, the same graph can be described by its Laplacian matrix, defined as $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D}$ is the diagonal degree matrix with entries $d_{mm} = \sum_{n} a_{mn}$. In addition, both the weighted adjacency matrix and the Laplacian matrix can be presented in normalized form, as $\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$ and $\mathbf{D}^{-\frac{1}{2}} \mathbf{L} \mathbf{D}^{-\frac{1}{2}}$, respectively [13].

In addition to capturing the underlying graph structure, both the Laplacian matrix and the weighted adjacency matrix can be used as shift operators to filter signals on graphs. In practice, a graph-shift based filter produces a linear combination of vertex-shifted graph signals, which captures graph information at a local level [12]. For instance, the operation $\mathbf{y} = \mathbf{x} + \mathbf{A}\mathbf{x}$ results in a filtered signal, $\mathbf{y}$, such that:

$y_n = x_n + \sum_{m \in \mathcal{N}_n} a_{nm} x_m$ (4)

where $\mathcal{N}_n$ denotes the set of 1-hop neighbours that are directly connected to the $n$-th node. For multiple graph signals stacked in matrix form as $\mathbf{X} \in \mathbb{R}^{N \times I}$, equation (4) can be compactly written as:

$\mathbf{Y} = \mathbf{X} + \mathbf{A}\mathbf{X}$ (5)

For reaching neighbours that are up to $K$ hops away, equation (4) can be extended to its polynomial form, $\mathbf{Y} = \sum_{k=0}^{K} h_k \mathbf{A}^k \mathbf{X}$, where $h_k$ are scalar constants [12]. Fundamentally, a $K$-hop graph filter acts locally in the vertex space of a graph, thereby taking into account the irregular domain underlying the data, as described by its weighted adjacency matrix.
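The filters in (4) and (5), together with their polynomial extension, can be sketched in a few lines of NumPy; the directed line graph and the filter coefficients below are illustrative choices of ours.

```python
import numpy as np

def graph_filter(A, X, h):
    """Polynomial graph filter Y = sum_k h[k] * A^k @ X."""
    Y = np.zeros_like(X)
    Ak = np.eye(A.shape[0])          # A^0
    for hk in h:
        Y = Y + hk * (Ak @ X)
        Ak = Ak @ A                  # next power of the shift operator
    return Y

# Small example: 4 nodes on a directed line graph 0 -> 1 -> 2 -> 3.
A = np.zeros((4, 4))
A[1, 0] = A[2, 1] = A[3, 2] = 1.0    # a_nm != 0 if node m feeds node n
X = np.random.rand(4, 3)             # 3 signals per node

# One-hop filter of equation (5): Y = X + A X.
Y1 = graph_filter(A, X, h=[1.0, 1.0])
assert np.allclose(Y1, X + A @ X)

# Two-hop filter reaches nodes up to two edges away.
Y2 = graph_filter(A, X, h=[1.0, 0.5, 0.25])
```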

2.3 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) [8], [7] are among the most successful deep learning tools for sequence modelling. A standard RNN layer captures time-varying dependencies by processing hidden states, $\mathbf{h}_t \in \mathbb{R}^{H}$, at time $t$, through feedback (recurrent) weights, as:

$\mathbf{h}_t = \sigma\left( \mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}_h \right)$ (6)

where $\mathbf{h}_{t-1} \in \mathbb{R}^{H}$ is the hidden state vector from the previous time-step, $\mathbf{x}_t \in \mathbb{R}^{I}$ is the input feature vector at time $t$, $\mathbf{W}_h \in \mathbb{R}^{H \times H}$ is the feedback matrix, $\mathbf{W}_x \in \mathbb{R}^{H \times I}$ is the input weight matrix, $\mathbf{b}_h \in \mathbb{R}^{H}$ is a bias vector, and $\sigma(\cdot)$ is an element-wise activation function.

Finally, after extracting the hidden states, these can be passed through additional weight matrices to generate the outputs, $\mathbf{y}_t$, at time $t$, in the form:

$\mathbf{y}_t = \sigma\left( \mathbf{W}_y \mathbf{h}_t + \mathbf{b}_y \right)$ (7)

where $\mathbf{W}_y$ is the output weight matrix, $\mathbf{h}_t$ is the hidden state at time $t$, $\mathbf{b}_y$ is a bias vector, and $\sigma(\cdot)$ is an element-wise activation function.
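A minimal NumPy sketch of the forward pass described by (6) and (7) is given below; the weight shapes and the tanh/linear activations are illustrative assumptions of ours.

```python
import numpy as np

def rnn_forward(X, Wh, Wx, bh, Wy, by):
    """Unrolled RNN: X has shape (T, I); returns hidden states and outputs."""
    T, _ = X.shape
    H = Wh.shape[0]
    h = np.zeros(H)
    hs, ys = [], []
    for t in range(T):
        h = np.tanh(Wh @ h + Wx @ X[t] + bh)     # equation (6)
        hs.append(h)
        ys.append(Wy @ h + by)                   # equation (7), linear output
    return np.stack(hs), np.stack(ys)

T, I, H, O = 10, 5, 8, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((T, I))
hs, ys = rnn_forward(X,
                     Wh=0.1 * rng.standard_normal((H, H)),
                     Wx=rng.standard_normal((H, I)),
                     bh=np.zeros(H),
                     Wy=rng.standard_normal((O, H)),
                     by=np.zeros(O))
print(hs.shape, ys.shape)  # (10, 8) (10, 2)
```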

3 Recurrent Graph Tensor Networks

3.1 Special Recurrent Graph Filter

In this section, we derive the implicit graph filter underlying RNNs by considering a special case of the hidden state equation for modelling sequential data.

Consider a linear form of equation (6) without the bias term (non-linearity and bias can be introduced later on, as discussed in Section 3.3), and let the initial hidden state be $\mathbf{h}_0 = \mathbf{0}$. Then, over $T$ successive time-steps, equation (6) can be written in block-matrix form as:

(8)

We now define: (i) $\mathbf{X} \in \mathbb{R}^{T \times I}$ as the matrix generated by stacking the input vectors, $\mathbf{x}_t$, as row-vectors over $T$ successive time-steps; (ii) $\mathbf{H} \in \mathbb{R}^{T \times H}$ as the matrix generated by stacking the hidden state vectors, $\mathbf{h}_t$, as row-vectors at the corresponding time-steps; and (iii) $\mathbf{R}$ as the block matrix composed of the powers of $\mathbf{W}_h$ from (8). This allows equation (8) to be expressed as:

(9)

Consider the special case of $\mathbf{W}_h = \gamma \mathbf{I}$, where $\gamma$ is a positive constant less than 1 and $\mathbf{I}$ is the identity matrix. This allows the hidden state equation to be expressed as $\mathbf{h}_t = \gamma \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t$, which represents a system where past information is propagated to the future via a damping factor, $\gamma$. For this special case, we can simplify equation (8) as:

(10)

Let $\mathbf{G}$ denote the upper-triangular matrix in (10); then the same equation can be expressed compactly in matrix form. Observe that the matrix $\mathbf{G}$ can be further decomposed as $\mathbf{G} = \mathbf{I} + \mathbf{A}$, where $\mathbf{I}$ is the identity matrix and $\mathbf{A}$ is the weighted adjacency matrix composed of the powers of the constant $\gamma$. Therefore, the special case of recurrent modelling can be expressed as:

(11)

which is a form of localized graph filter, as discussed in (5). In this view, each of the $T$ time-steps can be considered as a node of a graph from which signals are sampled, and $\mathbf{A}$ is the corresponding weighted graph adjacency matrix connecting different time-steps. This also justifies the triangular nature of the graph filter, since only past information can influence future states, but not vice-versa.
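This equivalence can be verified numerically. The sketch below is our own, and uses a row-stacking convention for the time-steps (so the triangular orientation of the adjacency matrix may differ from the paper's notation): it unrolls a linear, bias-free RNN with $\mathbf{W}_h = \gamma\mathbf{I}$ and compares it with a graph filter of the form $(\mathbf{I} + \mathbf{A})\mathbf{X}\mathbf{W}_x^{\mathsf{T}}$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, I, H = 6, 4, 3
gamma = 0.5
Wx = rng.standard_normal((H, I))
X = rng.standard_normal((T, I))          # rows are x_t

# Linear RNN with W_h = gamma * I and h_0 = 0.
h = np.zeros(H)
H_rnn = []
for t in range(T):
    h = gamma * h + Wx @ X[t]
    H_rnn.append(h)
H_rnn = np.stack(H_rnn)                  # rows are h_t

# Equivalent graph filter H = (I + A) X Wx^T, where the adjacency
# matrix A holds powers of gamma between past and future time-steps
# (triangular orientation follows our row-stacking convention).
A = np.zeros((T, T))
for t in range(T):
    for k in range(t):
        A[t, k] = gamma ** (t - k)
H_graph = (np.eye(T) + A) @ X @ Wx.T
assert np.allclose(H_rnn, H_graph)
```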

3.2 General Recurrent Graph Filter

We now proceed to relax the restrictions from Section 3.1, to make it possible to extend the graph filter from (11) to the general case of sequence modelling.

Let the feedback matrix be a scaled idempotent matrix, that is, $\mathbf{W}_h = \gamma \mathbf{P}$, where $\gamma$ is a positive constant less than 1 that models the damping effect, and $\mathbf{P}$ is an idempotent matrix ($\mathbf{P}\mathbf{P} = \mathbf{P}$) that models how information propagates between successive time-steps. For this setup, the feedback matrix has the property $\mathbf{W}_h^k = \gamma^k \mathbf{P}$ for $k > 0$, which allows the block matrix $\mathbf{R}$ to be simplified as:

(12)

The block matrix $\mathbf{R}$ with the above properties can be further decomposed by using: (i) the weighted graph adjacency matrix, $\mathbf{A}$, from equation (11); (ii) the idempotent matrix, $\mathbf{P}$; and (iii) the identity matrix, $\mathbf{I}$, to yield:

(13)

which allows us to express equation (9) in its full form as:

(14)

Finally, we define the multi-linear graph filter, $\mathcal{G}$, as the 4th-order tensorization of the block matrix, $\mathbf{R}$, from equation (13). This allows us to simplify the expression in (14), without the vectorization operator, by means of a double tensor contraction, as:

(15)

This indicates that the graph structure discussed in Section 3.1 carries over to the general case of sequence modelling, resulting in a multi-linear graph filter, $\mathcal{G}$, capable of modelling sequential information defined on a time-vertex graph domain.
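The same kind of numerical check applies to the general case. The sketch below (again our own, under the same row-stacking convention) uses a scaled idempotent feedback matrix $\mathbf{W}_h = \gamma\mathbf{P}$ and confirms that the unrolled linear RNN coincides with a graph-filtered form built from the identity part and the $\mathbf{A}$-weighted, $\mathbf{P}$-propagated part.

```python
import numpy as np

rng = np.random.default_rng(2)
T, I, H = 6, 4, 3
gamma = 0.5
Wx = rng.standard_normal((H, I))
X = rng.standard_normal((T, I))

# Idempotent matrix P (a projection), so that W_h = gamma * P
# satisfies W_h^k = gamma^k * P for k > 0.
B = rng.standard_normal((H, 2))
P = B @ np.linalg.pinv(B)
assert np.allclose(P @ P, P)
Wh = gamma * P

# Unrolled linear RNN with h_0 = 0.
h = np.zeros(H)
H_rnn = []
for t in range(T):
    h = Wh @ h + Wx @ X[t]
    H_rnn.append(h)
H_rnn = np.stack(H_rnn)

# Graph-filter form: identity part plus the A-weighted, P-propagated part
# (our row-stacking convention).
A = np.zeros((T, T))
for t in range(T):
    for k in range(t):
        A[t, k] = gamma ** (t - k)
H_graph = X @ Wx.T + A @ X @ Wx.T @ P.T
assert np.allclose(H_rnn, H_graph)
```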

3.3 Tensor Network Formulation

We next introduce a number of novel Recurrent Graph Tensor Network (RGTN) models, which benefit from the expressive power of tensor networks and the graph filters derived in the previous sections.

Consider the special case of graph filtering in Section 3.1, where the hidden states are extracted through a graph filter that is localized in the vertex space. The extracted hidden states can be considered as feature maps, which can be flattened and passed through additional dense layers to generate application-dependent outputs [1]. Using the tensor network notation, we can represent the special graph filter contraction and the dense layer matrix contraction as a single tensor network, as shown in Figure 3. We refer to this special graph filter based tensor network as the special Recurrent Graph Tensor Network (sRGTN).

Figure 3: Illustration of the sRGTN model. The section encircled by the dotted line represents the special graph filtering operation for extracting the hidden states, $\mathbf{H}$, as introduced in (11).
Figure 4: Illustration of the gRGTN model. The section encircled by the dotted line represents the general graph filtering operation for extracting the hidden states, $\mathbf{H}$, as introduced in (15).
Figure 5: Illustration of the sRGTN-TT model. The section encircled by the dotted line denotes the special graph filtering operation for extracting the hidden states, given a multi-modal input. The dense layer matrix (in yellow) is tensorized and represented in tensor-train format to reduce complexity.

For the general case of graph filtering in Section 3.2, we can similarly represent the double contraction in (15) to derive the tensor network in Figure 4. This is referred to as the general Recurrent Graph Tensor Network (gRGTN). Unlike the sRGTN in Figure 3, where the graph filter and the feature map contractions are modelled separately, the gRGTN implies a stronger coupling of the features with the underlying graph domain, as captured by the multi-linear graph filter, $\mathcal{G}$.

For multi-modal problems, we propose a highly efficient variant of the sRGTN by appealing to the super-compression power of the TTD. Indeed, by reshaping dense layer matrices into higher order tensors and representing them in a TT format, we can drastically reduce the parameter complexity, as discussed in [9]. This allows us to simultaneously: (i) maintain the inherent tensor structure of the problem; (ii) drastically reduce the parameter complexity of the model; and (iii) incorporate the underlying graph topology. This leads to the Multi-Modal, Tensor-Train variant of sRGTN (sRGTN-TT), which is illustrated in Figure 5.

Finally, non-linearity can be introduced into all considered models by applying a point-wise activation function on top of a contraction. Non-linear layers can also be stacked one after another to increase the overall expressive power.
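To illustrate the parameter savings behind the TT-based dense layer of the sRGTN-TT, the sketch below (a rough illustration of ours, not the authors' implementation) parameterizes a dense map from a (sites x features) input to 12 outputs through two small TT cores, instead of forming the full weight matrix.

```python
import numpy as np

# TT (matrix-product-operator) structured dense layer: the full weight
# matrix of shape (O1*O2, I1*I2) is never formed; two small cores
# parameterize it instead (illustrative shapes, not the paper's).
I1, I2 = 12, 27          # e.g. sites x features
O1, O2 = 4, 3            # factorized output dimension (12 units)
R = 2                    # TT-rank

rng = np.random.default_rng(3)
G1 = rng.standard_normal((O1, I1, R)) * 0.1
G2 = rng.standard_normal((R, O2, I2)) * 0.1

def tt_dense(x):
    """Apply the TT-structured layer to an input of shape (I1, I2)."""
    # y[o1, o2] = sum_{i1, i2, r} G1[o1, i1, r] * G2[r, o2, i2] * x[i1, i2]
    return np.einsum('air,rbj,ij->ab', G1, G2, x)

x = rng.standard_normal((I1, I2))
y = tt_dense(x).reshape(-1)          # 12 outputs

n_tt = G1.size + G2.size
n_dense = (I1 * I2) * (O1 * O2)
print(n_tt, n_dense)                 # 258 vs 3888 parameters
```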

4 Experiments

4.1 Experimental Setting

In this section, the proposed Recurrent Graph Tensor Network (RGTN) models are implemented and compared against a standard Recurrent Neural Network (RNN), in order to validate the proposed models on the task of time-series forecasting.

The learning task of this experiment is to forecast the PM2.5 level across multiple sites in China, using the Beijing Multi-Site Air-Quality dataset [15]. Specifically, the data consists of hourly air quality measurements of 12 variables (PM2.5, PM10, SO2, NO2, CO, O3, TEMP, PRES, DEWP, RAIN, wd, WSPM) between 2013 and 2017 across 12 different geographical sites.

The given dataset is pre-processed by: (i) filling the missing data points with the corresponding feature median; (ii) scaling the numerical features to between 0 and 1; and (iii) encoding the categorical features via one-hot-encoding, which increases the total number of features to 27 per site.
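A hedged pandas sketch of these pre-processing steps is given below; the file name is hypothetical, while the column names follow the dataset description (11 numerical variables plus the one-hot encoded wind direction are assumed to give the 27 features per site).

```python
import pandas as pd

# Hypothetical loading of one site's hourly measurements; the file path
# is illustrative, the columns follow the Beijing Multi-Site Air-Quality data.
df = pd.read_csv("PRSA_Data_site.csv")

numeric_cols = ["PM2.5", "PM10", "SO2", "NO2", "CO", "O3",
                "TEMP", "PRES", "DEWP", "RAIN", "WSPM"]
categorical_cols = ["wd"]            # wind direction (categorical)

# (i) fill missing values with the corresponding feature median
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# (ii) min-max scale numerical features to [0, 1]
mins, maxs = df[numeric_cols].min(), df[numeric_cols].max()
df[numeric_cols] = (df[numeric_cols] - mins) / (maxs - mins)

# (iii) one-hot encode categorical features; with 16 wind directions this
# yields 11 + 16 = 27 features per site.
df = pd.get_dummies(df, columns=categorical_cols)
```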

For the given task, a total of 4 models were implemented and compared. To ensure comparable results, all models were given the same architectural specification, as shown in Table 2, with the only difference being the choice of feature extraction method. Specifically, the four feature extraction methods are based on: (i) a simple recurrent neural network (RNN); (ii) a special Recurrent Graph Tensor Network (sRGTN), which implements the architecture in Figure 3; (iii) a general Recurrent Graph Tensor Network (gRGTN), which implements the architecture in Figure 4; and (iv) an sRGTN with TT decomposition (sRGTN-TT), which implements the architecture in Figure 5. Finally, all models were trained under the same settings, that is, using: (i) a stochastic gradient descent optimizer with a fixed learning rate; (ii) a mean-squared-error loss function; (iii) a batch size of 32; and (iv) a total of 100 epochs. The first 70% of the data was used for training (20% of which was used for validation), and the remaining 30% for testing.

RNN Model Layer 1 Layer 2 Layer 3
Layer Type RNN Dense Dense
Units 8 12 12
Activation tanh tanh linear
sRGTN Model Layer 1 Layer 2 Layer 3
Layer Type sRGTN Dense Dense
Units 8 12 12
Activation tanh tanh linear
gRGTN Model Layer 1 Layer 2 Layer 3
Layer Type gRGTN Dense Dense
Units 8 12 12
Activation tanh tanh linear
sRGTN-TT Model Layer 1 Layer 2 Layer 3
Layer Type sRGTN TT-Dense Dense
Units 8 12 12
Activation tanh tanh linear
TT-Rank n.a. (1,2,2,1) n.a.
Table 2: Architecture of the models used in the experiment.

Note that for the sRGTN-TT model, each input data sample was kept in its natural multi-modal form, containing a window of consecutive time-steps of data across the 12 different sites, with 27 features per site. For the RNN, sRGTN, and gRGTN models, each input data sample was instead a matrix created by concatenating the 27 features per site across all 12 sites (324 features in total) over the same window of consecutive time-steps. For all models, the target prediction variable is the PM2.5 measurement at the successive time-step across all 12 sites.
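For clarity, the sketch below contrasts the two input arrangements; the window length T is an illustrative value of ours, as the exact figure is not restated here.

```python
import numpy as np

T = 8                     # illustrative window length (assumption)
sites, features = 12, 27

# One training sample of hourly data across all sites.
window = np.random.rand(T, sites, features)

# Multi-modal input for sRGTN-TT: keep the natural (time, site, feature) form.
x_multimodal = window                          # shape (T, 12, 27)

# Flattened input for RNN / sRGTN / gRGTN: concatenate features across sites.
x_flat = window.reshape(T, sites * features)   # shape (T, 324)
```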

4.2 Experiment Results

RNN sRGTN gRGTN sRGTN-TT
N. Param. 3408 3384 3448 463
Test MSE 0.010935 0.009998 0.009183 0.008467
Table 3: The number of trainable parameters and the Test MSE for all proposed models. The sRGTN-TT obtained the best results with a drastically lower number of parameters.

This section analyses the results obtained from the proposed experiment, demonstrating the superior properties of the proposed RGTN models over a classical RNN model, in terms of convergence, performance, and complexity, especially in the multi-modal setting.

Table 3 shows the number of trainable parameters and the final test Mean-Squared-Error (MSE) achieved by the four considered models. The simulation results confirm the superiority of the proposed class of RGTN models over a standard RNN, especially in the multi-modal setting. In particular, the proposed sRGTN-TT model successfully captured the inherent multi-modality of the underlying problem, achieving the best MSE score while using up to 90% fewer parameters than the other models.

Figure 6 shows the validation MSE of all four considered models during the training phase (in log scale). The error curves show that all RGTN models exhibited better convergence properties than the standard RNN model, as fewer epochs were needed to converge.

Figure 6: Validation MSE over 100 epochs of training, for all considered models. The RGTN models exhibited faster convergence and enhanced performance over the RNN.

5 Conclusion

We have proposed a novel Recurrent Graph Tensor Network (RGTN) architecture, which merges the expressive power of tensor networks with graph signal processing methods over irregular domains. We have provided the theoretical framework underpinning the proposed RGTN models and applied them to the task of time-series forecasting. The experimental results have shown that the proposed class of RGTN models exhibits desirable properties in terms of convergence, performance, and complexity. In particular, when dealing with multi-modal data, the proposed RGTN models have outperformed a standard RNN both in terms of performance (45% improvement in mean-squared-error) and complexity (90% reduction in trainable parameters).

References

1. M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, M. Hasan, B. C. Van Essen, A. Awwal, and V. K. Asari (2019) A state-of-the-art survey on deep learning theory and architectures. Electronics 8(3), pp. 292.
2. M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine 34(4), pp. 18–42.
3. A. Cichocki, N. Lee, I. Oseledets, A. Phan, Q. Zhao, and D. P. Mandic (2016) Tensor networks for dimensionality reduction and large-scale optimization: Part 1: Low-rank tensor decompositions. Foundations and Trends in Machine Learning 9(4-5), pp. 249–429.
4. A. Cichocki (2014) Era of big data processing: A new approach via tensor networks and tensor decompositions. arXiv preprint arXiv:1403.2048.
5. N. Cohen, O. Sharir, and A. Shashua (2016) On the expressive power of deep learning: A tensor analysis. In Proceedings of the Conference on Learning Theory, pp. 698–728.
6. S. V. Dolgov and D. V. Savostyanov (2014) Alternating minimal energy methods for linear systems in higher dimensions. SIAM Journal on Scientific Computing 36(5), pp. A2248–A2271.
7. Y. Khalifa, D. P. Mandic, and E. Sejdic (2020) The role of hidden Markov models and recurrent neural networks in event detection and localization for biomedical signals: Theory and application. Information Fusion.
8. D. P. Mandic and J. Chambers (2001) Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. John Wiley & Sons, Inc.
9. A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov (2015) Tensorizing neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pp. 442–450.
10. I. V. Oseledets and E. E. Tyrtyshnikov (2009) Breaking the curse of dimensionality, or how to use SVD in many dimensions. SIAM Journal on Scientific Computing 31(5), pp. 3744–3759.
11. I. V. Oseledets (2011) Tensor-train decomposition. SIAM Journal on Scientific Computing 33(5), pp. 2295–2317.
12. L. Stankovic, D. Mandic, M. Dakovic, M. Brajovic, B. Scalzo, and A. G. Constantinides (2019) Graph signal processing – Part II: Processing and analyzing signals on graphs. arXiv preprint arXiv:1909.10325.
13. L. Stankovic, D. P. Mandic, M. Dakovic, M. Brajovic, B. Scalzo, and T. Constantinides (2019) Graph signal processing – Part I: Graphs, graph spectra, and spectral clustering. arXiv preprint arXiv:1907.03467.
14. Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.
15. S. Zhang, B. Guo, A. Dong, J. He, Z. Xu, and S. X. Chen (2017) Cautionary tales on air-quality improvement in Beijing. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 473(2205), pp. 20170457.