Tensor graph convolutional neural network
Abstract
In this paper, we propose a novel tensor graph convolutional neural network (TGCNN) to conduct convolution on factorizable graphs. We focus on two types of problems: sequential dynamic graphs and cross-attribute graphs. In particular, we propose a graph preserving layer that memorizes the salient nodes of the factorized subgraphs through two operations, cross graph convolution and graph pooling. For cross graph convolution, a parameterized Kronecker sum operation is proposed to generate a conjunctive adjacency matrix characterizing the relationship between every pair of nodes across two subgraphs. With this operation, general graph convolution can be performed efficiently as a composition of small matrices, which reduces the memory and computational burden. By encapsulating sequence graphs into a recursive learning process, both the dynamics and the spatial layout of graphs can be efficiently encoded. To validate the proposed TGCNN, experiments are conducted on skeleton action datasets as well as a matrix completion dataset. The experimental results demonstrate that our method achieves competitive performance compared with state-of-the-art methods.
I Introduction
Graph-structured data arise in many applications, ranging from sequential dynamic graphs such as skeleton sequences to cross-attribute graphs such as recommender systems (the connection of users and goods). In fact, all these graphs can be easily represented as a composition of several subgraphs. In this paper, we focus on how to perform more efficient graph convolution on this type of problem. In particular, we consider two classic tasks: skeleton-based action recognition and recommender systems.
As a representative of dynamic sequence graphs, skeleton-based action recognition has become a hot topic in computer vision and has drawn wide attention in recent years due to its broad applications, e.g. video surveillance, game consoles and robot vision. In the previous literature, various algorithms have been proposed [20, 3, 13, 18, 1, 6, 22, 5, 11, 10] for action recognition from skeleton data. Some of them focus only on modeling temporal evolution and fail to characterize the spatial dependencies among joints. In contrast, other works attempt to model spatial structure by employing structure learning algorithms such as the Riemannian network [6], LieNet [7], various types of recurrent neural network (RNN) [5, 17, 11, 22] and graph-based representation [19]. Among them, the graph representation proposed in [19] provides an efficient way to describe irregular graph-structured skeleton data by modeling the joints of a skeleton as the nodes of a graph. Furthermore, in recent years, the graph convolutional neural network (GCNN), based on spectral graph theory, has been proposed in [4, 9, 12] for irregular data as an alternative to the CNN and shows promising performance. Although graph convolution has been used for sequence data, it mostly constructs a graph for each frame while ignoring temporal correlation. In contrast, we conduct a large spatio-temporal graph convolution by simultaneously modeling the spatial and temporal relationships of different nodes.
Another representative artificial intelligence application based on cross-attribute factorized graphs is the recommender system, which can be formulated as a matrix completion problem and has been investigated in the previous literature [8, 12, 15]. In this task, two separable graphs of different attributes, i.e. column graphs and row graphs, are provided to represent the similarities of users and items, and they have been shown to benefit the performance of recommender systems. The method in [12] applies graph convolution to this task, considering both user and item graphs from the Fourier transform perspective, and achieves state-of-the-art performance. In contrast to [12], we propose to conduct cross graph convolution for this task, aiming to optimally model the relationship between every pair of nodes from the two graphs.
In this paper, taking these two representative tasks, we explore convolution on factorizable graphs. More generally, we propose a tensor graph convolutional neural network (TGCNN) to deal with the problem. For sequential dynamic graphs in particular, we apply a recursive mechanism to consecutive tensor convolutions to overcome the problem of computational explosion. To this end, a graph preserving layer is proposed to recursively optimize both the previously encoded spatio-temporal graph and the successively input subgraph. In each recursive step, the salient nodes in the previous subgraphs are preserved and further connected to those in the current input graph, so that the graph preserving layer is able to globally model all the nodes in the factorized graphs. Two operations are employed in the graph preserving layer: cross graph convolution (for building the cross graph relationship) and graph pooling (for node selection). Specifically, for cross graph convolution, we design a novel parameterized Kronecker sum operation to learn an optimal conjunctive graph. In the derivation, the property of the Kronecker product is used so that graph filtering can be conducted on smaller matrices, which avoids high memory and computational costs. Following the conjunctive cross graph relationship, graph pooling is used to choose an optimal subset of salient nodes for the next recursive step. To evaluate the proposed TGCNN, we conduct experiments on two large scale action datasets, the NTU RGB+D (NTU) dataset [17] and the Large Scale Combined (LSC) dataset [21]. Moreover, to test the generalization ability of the proposed cross graph convolution on graph structural data, we also conduct an additional experiment on a matrix completion dataset [8, 12] named the Synthetic ‘Netflix’ dataset. The experimental results show that our method outperforms state-of-the-art methods.
In summary, our main contributions are threefold:

We propose a novel tensor graph convolutional neural network which is able to globally learn an optimal graph from multiple factorized subgraphs through a recursive learning process.

We design cross graph convolution to efficiently encode the relationship of each pair of nodes across two subgraphs, and derive it efficiently in tensor space by using the property of the Kronecker product.

We experimentally validate the effectiveness of our method on both action recognition and matrix completion datasets, and report state-of-the-art results.
II Preliminary of spectral filtering
Consider a signal in which each element is associated with a vertex of a graph, together with the corresponding adjacency matrix. The spectral filtering on this signal can be formulated as:
(1) 
where
(2) 
Eqn. 2 involves the diagonal degree matrix, in which each diagonal element is the sum of the corresponding row of the adjacency matrix.
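As a rough illustration of spectral filtering, the following Python sketch applies a polynomial filter built from a symmetrically normalized adjacency matrix. The polynomial form, the normalization, and the names `normalized_adjacency`, `polynomial_filter` and `theta` are assumptions made for illustration; they are not taken from Eqns. 1 and 2.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix (row sums of A)."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5   # isolated nodes keep a zero factor
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]

def polynomial_filter(A, x, theta):
    """Compute sum_k theta[k] * A_hat^k x without forming matrix powers explicitly."""
    A_hat = normalized_adjacency(A)
    term = np.asarray(x, dtype=float).copy()   # A_hat^0 x
    y = np.zeros_like(term)
    for coeff in theta:
        y += coeff * term
        term = A_hat @ term                    # advance to the next power
    return y

# Usage on a 3-node path graph with an assumed second-order filter.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
x = np.array([1.0, 2.0, 3.0])
y = polynomial_filter(A, x, theta=[0.5, 2.0, -1.0])
```

Each additional order costs one matrix-vector product, which is why the size of the adjacency matrix dominates the cost of filtering.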
III TGCNN architecture
The whole architecture of the proposed TGCNN is shown in Fig. 1, in which the key component is the graph preserving layer. In the following subsections, we first describe the two key operations of the graph preserving layer, i.e. cross graph convolution and graph pooling. We then present the whole recursive learning process of the graph preserving layer.
III-A Cross graph convolution
Cross graph convolution consists of two main steps: cross graph construction and spectral filtering. Cross graph construction aims to build a conjunctive graph between two subgraphs in which each pair of nodes from the two subgraphs is well modeled. Given two signals defined on the two subgraphs and their corresponding adjacency matrices, the conjunctive signal is defined as follows:
(4) 
where
According to spectral graph theory, the adjacency matrix of the conjunctive signal should describe the similarities between each pair of its nodes. Previous research proposed the Kronecker sum operation, which can be used to describe the conjunctive similarities of two given subgraphs:
(5) 
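The classic Kronecker sum of two adjacency matrices can be sketched as follows. This is a minimal illustration of the standard definition, A1 ⊗ I + I ⊗ A2; the names `kronecker_sum` and `pair_index` and the toy subgraphs are hypothetical and only serve the example.

```python
import numpy as np

def kronecker_sum(A1, A2):
    """Classic Kronecker sum: A1 (x) I_{n2} + I_{n1} (x) A2."""
    n1, n2 = A1.shape[0], A2.shape[0]
    return np.kron(A1, np.eye(n2)) + np.kron(np.eye(n1), A2)

def pair_index(i, p, n2):
    """Row/column index of the node pair (i from subgraph 1, p from subgraph 2)."""
    return i * n2 + p

A1 = np.array([[0.0, 1.0],
               [1.0, 0.0]])              # 2-node subgraph
A2 = np.array([[0.0, 1.0, 0.0],
               [1.0, 0.0, 1.0],
               [0.0, 1.0, 0.0]])         # 3-node path subgraph
A = kronecker_sum(A1, A2)                # 6 x 6 conjunctive adjacency matrix

# Two node pairs are linked when they share one node and the other two nodes are adjacent.
same_second_node = A[pair_index(0, 0, 3), pair_index(1, 0, 3)]   # equals A1[0, 1]
same_first_node = A[pair_index(0, 0, 3), pair_index(0, 1, 3)]    # equals A2[0, 1]
both_differ = A[pair_index(0, 0, 3), pair_index(1, 1, 3)]        # no direct link
```

The conjunctive matrix has one row per node pair, so its side length is the product of the two subgraph sizes. This is the growth that makes direct filtering on it expensive.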
However, due to the complexity of signals, the conjunctive adjacency matrix generated by the classic Kronecker sum operation may not fit the conjunctive signal well. For this reason, we parameterize the Kronecker sum operation, expecting to learn an optimal conjunctive adjacency matrix from the adjacency matrices of the two subgraphs. The new operation is named the parameterized Kronecker sum operation and is defined as follows:
(6)  
(7) 
The two parameters introduced in Eqn. 6 are both trainable vectors.
Then, we conduct cross graph convolution by applying spectral filtering to the conjunctive signal with the learned conjunctive adjacency matrix:
(8) 
In this process, the key step is to calculate the k-th order polynomials of the adjacency matrix. Formally, the 1st order polynomial can be calculated as follows:
(9)  
(10)  
(11) 
where vec(·) denotes the vectorization of a matrix by stacking its columns into a single column vector, and its inverse operation transforms a single column vector back into a matrix. So, we also have the following equation:
(12) 
In the above equations, Eqn. 10 is transformed into Eqn. 11 by using the property of the Kronecker product. The 2nd order polynomial can then be calculated as:
(13)  
(14)  
(15) 
Based on the results for the 1st and 2nd order polynomials, the k-th order polynomial can be deduced and written as:
(16)  
(17) 
By using the property of the Kronecker product, the graph convolution on the conjunctive graph, which would require computing polynomials of one large matrix, is transformed into matrix products involving two much smaller matrices. Correspondingly, the computational cost also decreases, which effectively relieves the computational burden.
Moreover, to further improve computational efficiency, the vectorization operation and its inverse need not be conducted frequently if Eqn. 8 is transformed into the following form:
(18) 
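The role of the Kronecker product property can be checked numerically. The sketch below uses the classic (unparameterized) Kronecker sum and the standard identity (Bᵀ ⊗ A) vec(X) = vec(A X B) with column-stacking vectorization; the function names and the random test matrices are assumptions for illustration, not the paper's parameterized operator.

```python
import numpy as np

def vec(X):
    """Stack the columns of X into a single vector."""
    return X.reshape(-1, order="F")

def kron_sum_apply(A1, A2, X):
    """Apply (A1 (x) I + I (x) A2) to vec(X) in matrix form; X has shape (n2, n1)."""
    return A2 @ X + X @ A1.T

def kron_sum_power_apply(A1, A2, X, k):
    """Apply the k-th power of the Kronecker sum using only small matrix products."""
    for _ in range(k):
        X = kron_sum_apply(A1, A2, X)
    return X

rng = np.random.default_rng(0)
n1, n2 = 4, 3
A1 = rng.random((n1, n1))
A2 = rng.random((n2, n2))
X = rng.random((n2, n1))                 # conjunctive signal kept in matrix form

# Reference computation with the explicit (n1*n2) x (n1*n2) matrix.
S = np.kron(A1, np.eye(n2)) + np.kron(np.eye(n1), A2)
direct = np.linalg.matrix_power(S, 3) @ vec(X)
factored = vec(kron_sum_power_apply(A1, A2, X, 3))
```

In this sketch, one application of the explicit matrix costs on the order of (n1·n2)² operations, whereas the factored form costs on the order of n1·n2·(n1 + n2). Keeping the signal in matrix form throughout also avoids repeated vectorization and reshaping.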
III-B Graph pooling
Cross graph convolution yields a large graph whose nodes are pairs of nodes from the two factorized graphs, so its node count is the product of the two subgraph sizes. Some of these nodes may be irrelevant and contribute little to action recognition. Such nodes increase the computational cost and may also degrade performance. To reduce the disturbance of these irrelevant nodes as well as the graph size, a projecting matrix is used to weight nodes so that an optimal subset of salient nodes can be effectively selected.
Formally, given the nodes and their corresponding adjacency matrix, the graph pooling process can be described as follows:
(19)  
(20) 
where the projecting matrix is the parameter to be learned. The larger a matrix element is, the more important the corresponding weighted node is for action recognition. This graph pooling operation not only further improves the performance of TGCNN, but also avoids serious graph expansion during the recursive learning process, which is introduced in detail in the following subsection.
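One plausible reading of graph pooling with a projecting matrix is sketched below. The specific form (pooled signal P X and pooled adjacency P A Pᵀ) is an assumption, since Eqns. 19 and 20 are not available here; `graph_pool`, `P`, `X` and `A` are illustrative names.

```python
import numpy as np

def graph_pool(P, X, A):
    """Project n nodes onto m preserved nodes.

    P: (m, n) projecting matrix, X: (n, d) node signals, A: (n, n) adjacency.
    Assumed form: X_pooled = P X and A_pooled = P A P^T.
    """
    X_pooled = P @ X
    A_pooled = P @ A @ P.T
    return X_pooled, A_pooled

# With a hard selection matrix, pooling simply keeps the chosen nodes.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
P = np.eye(3)[[0, 2]]                    # keep nodes 0 and 2
X_kept, A_kept = graph_pool(P, X, A)

# Column-wise magnitude of P as a rough per-node importance score (assumption).
importance = np.abs(P).sum(axis=0)
```

A learned, dense P would mix nodes softly instead of selecting them, but the output sizes are the same: the number of rows of P fixes how many nodes survive each step.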
III-C Recursive learning process
The recursive learning process transforms factorized graph modeling from processing all nodes at one time into recurrent convolution on two subgraphs step by step, which effectively relieves the computational burden. Formally, given a sequence of factorized subgraphs and their corresponding signals, and based on cross graph convolution and graph pooling, the recursive learning process can be formulated as follows:
(21)  
(22)  
(23)  
(24)  
(25) 
The above equations involve the conjunctively constructed adjacency matrix and signal, which are generated from the current input graph and the previously preserved graph; the preserved adjacency matrix and the filtered signal at each recursive step; and the output feature of that step. The output feature is obtained from the filtered signal through learnable variables, followed by a nonlinear activation function that endows the graph preserving layer with flexibility.
Eqns. 21-25 cooperate to achieve global learning on all input factorized subgraphs. Given successive factorized graphs, the nodes of the preserved and input graphs are jointly modeled through cross graph convolution (Eqns. 21-23). Based on this, graph pooling is then applied, acting as a memory unit that remembers the salient nodes. These two operations are conducted recursively, so the learning process makes TGCNN inherently deep, as the previous input graphs are connected with the current one. Thus, the output features are able to describe the global graph structure.
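The recursion described above can be sketched as a toy loop. Everything specific in it is an assumption made for illustration: the classic Kronecker sum stands in for the parameterized one, the conjunctive signal is taken as a Kronecker product of the two signals, pooling uses a projection of the form P A Pᵀ, the activation is tanh, and random values stand in for learned parameters. The explicit conjunctive matrix is formed only for readability.

```python
import numpy as np

def recursive_step(A_prev, x_prev, A_in, x_in, theta, P, W, b):
    """One toy step: cross graph convolution followed by graph pooling."""
    m, n = A_prev.shape[0], A_in.shape[0]
    # Conjunctive adjacency via the classic Kronecker sum (size m*n x m*n).
    A_conj = np.kron(A_prev, np.eye(n)) + np.kron(np.eye(m), A_in)
    # Conjunctive signal: one value per node pair (assumed form).
    x_conj = np.kron(x_prev, x_in)
    # Polynomial filtering: sum_k theta[k] * A_conj^k x_conj.
    filtered = np.zeros_like(x_conj)
    term = x_conj.copy()
    for coeff in theta:
        filtered += coeff * term
        term = A_conj @ term
    # Graph pooling with a projection P of shape (m_keep, m*n).
    A_next = P @ A_conj @ P.T
    x_next = P @ filtered
    # Output feature computed from the pooled, filtered signal.
    feature = np.tanh(W @ x_next + b)
    return A_next, x_next, feature

def run_recursion(A_list, x_list, theta, m_keep, d_out, seed=0):
    """Fold a sequence of subgraphs into one preserved graph, one step per input."""
    rng = np.random.default_rng(seed)    # random stand-ins for learned parameters
    A_prev, x_prev = A_list[0], x_list[0]
    W = rng.normal(scale=0.1, size=(d_out, m_keep))
    b = np.zeros(d_out)
    features = []
    for A_in, x_in in zip(A_list[1:], x_list[1:]):
        P = rng.normal(scale=0.1, size=(m_keep, A_prev.shape[0] * A_in.shape[0]))
        A_prev, x_prev, feat = recursive_step(A_prev, x_prev, A_in, x_in, theta, P, W, b)
        features.append(feat)
    return A_prev, x_prev, features

# Usage: three 4-node path graphs, keeping 3 nodes after every step.
path = np.array([[0.0, 1.0, 0.0, 0.0],
                 [1.0, 0.0, 1.0, 0.0],
                 [0.0, 1.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0, 0.0]])
A_list = [path, path, path]
x_list = [np.arange(4, dtype=float) + t for t in range(3)]
A_fin, x_fin, feats = run_recursion(A_list, x_list, theta=[1.0, 0.5], m_keep=3, d_out=5)
```

Because pooling caps the preserved graph at `m_keep` nodes, the conjunctive graph at every step has at most `m_keep` times the input size, regardless of how many subgraphs have already been absorbed.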
III-D The loss functions
For action recognition, the output features of the graph preserving layer are further passed through a fully connected layer and a softmax layer. The cross entropy loss is then employed for TGCNN training, which can be defined as follows:
in which
Here, the cross entropy loss is the mean negative logarithm of the predicted probability over the training samples. It depends on the number of training samples, the input factorized graphs of each training sample, and the corresponding labels.
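For concreteness, the mean negative log-probability can be written as the short sketch below. The `logits` array stands for the outputs of the fully connected layer, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-probability assigned to the true label of each sample."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))

# Uniform predictions over 3 classes give a loss of log(3).
uniform_loss = cross_entropy(np.zeros((4, 3)), np.array([0, 1, 2, 0]))
```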
Besides action recognition, to test the generalization ability of the proposed cross graph convolution on graph structural data, we also conduct an extensive experiment on a matrix completion dataset [8, 12]. In matrix completion, only two factorized subgraphs are provided instead of multiple graphs. So, this task can be treated as a simplified application case of our TGCNN. As the algorithm in [12] achieves the current best performance, for a fair comparison, we embed our cross graph convolution into the framework of [12] and employ the same loss function as [12], which is commonly employed in matrix completion algorithms.
IV Experiments
We evaluate our TGCNN on two action recognition datasets, the NTU RGB+D (NTU) dataset [17] and the Large Scale Combined (LSC) dataset [21]. In addition, to test the generalization ability on graph structural data, we conduct an additional experiment on a matrix completion dataset named the Synthetic ‘Netflix’ dataset [8], in which two factorized graphs are provided, describing the relationships among users and among items respectively. In the following subsections, we first introduce these three datasets and then give the implementation details, including the preprocessing of the action recognition datasets. Finally, we compare the experimental results with state-of-the-art methods.
IV-A Datasets
The NTU dataset contains 56880 RGB+D video samples performed by 40 different human subjects whose ages range from 10 to 35. Three synchronous Microsoft Kinect v2 sensors are used to collect various modalities of signals from three different horizontal angles, where the modalities include RGB videos, depth sequences, skeleton data and infrared frames. For skeleton data, the human skeleton is represented by the 3D locations of 25 major body joints. This dataset is very challenging due to its large number of samples, multiple viewpoints and intra-class variations.
The LSC dataset is an integrated dataset created by combining nine existing public datasets. It contains 4953 video sequences with red, green and blue (RGB) video and depth information. These sequences cover 94 action classes performed by 107 subjects in total. As the video sequences come from different individual datasets, the variations with respect to subjects, performing manners and backgrounds are very large. Moreover, the number of samples differs greatly from one action to another. All these factors, i.e. the large size, the large variations and the class imbalance, make this dataset challenging for recognition.
The Synthetic ‘Netflix’ dataset is frequently used in the matrix completion task to evaluate different algorithms. In this dataset, the row axis represents different items (e.g. movies) while the column axis represents different users. Thus, the value of each element shows whether a user would like an item or not. The matrix is generated to satisfy certain assumptions, e.g. the low-rank property and smoothness along rows and columns. As a result, there is strong community structure in the generated user and item graphs. The advantage of this dataset is that it allows the behaviour of different algorithms to be studied in controlled settings.
IV-B Preprocessing on skeleton data
The preprocessing of skeleton data aims to eliminate noise and to make the model robust to different kinds of variations, e.g. body orientation variation and body scale variation. It consists of the following three steps:
(i) Each action sequence is first split into a fixed number of subsequences, and one frame is then chosen from each subsequence so that all generated sequences contain the same number of frames.
(ii) The skeletons are randomly scaled by factors in the range [0.95, 1.05] so that the adaptive scaling capacity of the model can be improved.
(iii) During the training stage, the skeletons are randomly rotated about two coordinate axes by angles in the range [-45°, 45°], which makes the model robust to orientation variation.
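The three preprocessing steps can be sketched as follows. The array layout `(frames, joints, 3)`, the choice of the x and y axes for rotation, the use of degrees, and all function names are assumptions for illustration; the source does not specify them.

```python
import numpy as np

def sample_frames(seq, num_segments, rng):
    """Split the sequence into segments and pick one random frame from each.

    seq has shape (T, J, 3) and is assumed to have T >= num_segments frames.
    """
    segments = np.array_split(np.arange(len(seq)), num_segments)
    idx = np.array([rng.choice(seg) for seg in segments])
    return seq[idx]

def random_scale(seq, rng, low=0.95, high=1.05):
    """Scale the whole sequence by one random factor in [low, high]."""
    return seq * rng.uniform(low, high)

def random_rotate(seq, rng, max_deg=45.0):
    """Rotate all joints about the x and y axes (assumed) by random angles."""
    ax, ay = np.deg2rad(rng.uniform(-max_deg, max_deg, size=2))
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(ax), -np.sin(ax)],
                   [0.0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0.0, np.sin(ay)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(ay), 0.0, np.cos(ay)]])
    R = Ry @ Rx
    return seq @ R.T                      # applies R to every 3D joint position

rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 25, 3))        # 50 frames, 25 joints
sampled = sample_frames(raw, 10, rng)
scaled = random_scale(sampled, rng)
rotated = random_rotate(scaled, rng)      # training-time augmentation only
```

Sampling one frame per segment keeps the temporal order while fixing the sequence length, and drawing the frame at random within each segment gives slightly different inputs across epochs.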
IV-C Experiment on NTU dataset
The experiments on the NTU dataset follow the two protocols defined in [17], named the cross subject and cross view protocols. For the cross subject protocol, samples are split into training and testing sets according to the subjects’ ID numbers. Under this protocol, the training and testing sets contain 40320 and 16560 samples respectively, and the samples in each set are performed by 20 subjects. For the cross view protocol, there are 37920 and 18960 samples in the training and testing sets respectively. The training samples are captured by cameras 2 and 3, while the testing samples are captured by camera 1.
The main parameters of TGCNN are the polynomial order, denoted as K, the number of preserved nodes and the dimension of the output features. For both protocols, K is set to 2, the number of preserved nodes is 50 and the dimension of the output feature is 128. Table I shows the comparison results on the NTU dataset. The proposed framework is compared with various state-of-the-art methods, including different kinds of recurrent neural networks (RNNs): hierarchical bidirectional recurrent neural networks (HBRNN) [5], part-aware LSTM (PLSTM) [17], spatio-temporal LSTM (STLSTM) [11], and geometric features LSTM (GFLSTM) [22]. For both protocols, our algorithm achieves the best performance.
IV-D Experiment on Large Scale Combined dataset
We conduct experiments on the LSC dataset following the two protocols employed in [21]. In the first protocol, named Random Cross Sample (RCSam) and using data of 88 action classes, half of the samples of each class are randomly selected as training data while the rest are used as testing data. In the second protocol, named Random Cross Subject (RCSub) and also using data of 88 action classes, half of the subjects are randomly selected for training and the remaining subjects are used for testing. In both protocols, only skeleton data are used for recognition. Due to the imbalance of samples across classes, precision and recall are used to evaluate performance instead of accuracy.
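Precision and recall for an imbalanced multi-class problem can be computed per class and then averaged, as in the sketch below. Macro-averaging is an assumption here; the source does not state how the per-class values are combined. The function name and the toy labels are illustrative.

```python
def macro_precision_recall(y_true, y_pred):
    """Average per-class precision and recall with equal weight for every class."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)

# Toy example with three classes and one misclassified sample.
precision, recall = macro_precision_recall([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
```

Giving every class equal weight prevents frequent actions from dominating the score, which plain accuracy would allow.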
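Table II: Comparison on the LSC dataset.
Protocol | Method | Precision | Recall
RCSam | | 84.6 | 84.1
RCSam | | 85.9 | 85.6
RCSam | | 84.2 | 84.9
RCSam | TGCNN | 86.6 | 82.9
RCSub | | 63.1 | 59.3
RCSub | | 74.5 | 73.7
RCSub | | 76.3 | 74.6
RCSub | TGCNN | 83.1 | 76.5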
The parameter settings for the RCSam and RCSub protocols are the same: the number of nodes in the preserving layer is set to 30, the dimension of the output feature is set to 80, and K is set to 2. The comparisons on the LSC dataset are shown in Table II. Except for the recall value under the RCSam protocol, our algorithm outperforms the previous state-of-the-art methods.
IV-E Experiment on Synthetic ‘Netflix’ dataset
Table III: Matrix completion results on the Synthetic ‘Netflix’ dataset.
Methods | Complexity | RMSE
GMC [8] | mn | 0.3693
GRALS [15] | m + n | 0.0114
RGCNN [12] | mn | 0.0053
sRGCNN [12] | m + n | 0.0106
TGCNN | mm+nn | 0.0042
Table IV: TGCNN vs. IGCNN on the LSC dataset.
Architecture | RCSam Precision | RCSam Recall | RCSub Precision | RCSub Recall
IGCNN | | | |
TGCNN | 86.6 | 82.9 | 83.1 | 76.5
We follow the protocol employed in [12] to evaluate the performance of our TGCNN on the Synthetic ‘Netflix’ dataset. Under this protocol, a chosen part of the user values is first removed, and the task of the algorithm is to recover the missing values of the matrix from the given fraction of entries. Finally, the root mean squared (RMS) error is used to evaluate the difference between the recovered matrix and the ground truth. A smaller RMS error means better performance.
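The evaluation metric can be sketched as an RMS error restricted to a chosen set of entries. Which entries are scored (for example, only the removed ones) is an assumption left to the `mask` argument; the function name and toy matrices are illustrative.

```python
import numpy as np

def masked_rmse(pred, truth, mask):
    """Root mean squared error over the entries selected by a boolean mask."""
    diff = (pred - truth)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

truth = np.array([[1.0, 2.0], [3.0, 4.0]])
pred = np.array([[1.0, 2.0], [3.0, 6.0]])

all_entries = np.ones_like(truth, dtype=bool)
held_out = np.array([[False, False], [False, True]])

rmse_all = masked_rmse(pred, truth, all_entries)      # error spread over 4 entries
rmse_held_out = masked_rmse(pred, truth, held_out)    # error on the single hidden entry
```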
The results of different matrix completion methods are reported in Table III, along with their theoretical complexities. Our TGCNN is compared with geometric matrix completion (GMC) [8], recurrent graph CNN (RGCNN) [12], separable recurrent graph CNN (sRGCNN) [12] and graph regularized alternating least squares (GRALS) [15]. As shown, our TGCNN model achieves the lowest RMS error, which demonstrates the generalization ability of TGCNN on graph structural data.
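Table V: Classic vs. parameterized Kronecker sum.
Architecture | LSC Precision | LSC Recall | Synthetic ‘Netflix’ RMSE
Kronecker sum | | |
PKronecker sum | 86.6 | 82.9 | 0.0042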
IV-F Analysis of TGCNN
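Table VI: Effect of the polynomial order K on the LSC dataset (RCSub protocol).
K | Precision | Recall
2 | 83.1 |
3 | |
4 | |
5 | | 77.7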
As TGCNN achieves promising performance, it is meaningful to verify how much the newly proposed operations, e.g. cross graph convolution and the parameterized Kronecker sum, improve the performance of the network, and how the parameter setting influences the result. For these purposes, several additional experiments are conducted, listed as follows:

TGCNN vs isolate graph CNN (IGCNN). To evaluate the effectiveness of the proposed recursive process, we conducted additional experiments on the LSC dataset comparing the results of TGCNN with IGCNN, where in IGCNN the graph filtering is conducted only on each isolated graph (Table IV).

Parameterized Kronecker sum vs classic Kronecker sum. To see how much improvement the parameterized Kronecker sum operation brings, we conducted experiments on both LSC and Synthetic ‘Netflix’ dataset to compare the performance (Table V).

Setting different orders of polynomials. We conduct additional experiments on the LSC dataset following the RCSub protocol to see how different polynomial orders influence the performance (Table VI).
From the results we can have the following observations:

TGCNN outperforms IGCNN, which verifies the effectiveness of the recursive learning process: compared with IGCNN, it globally learns the additional cross graph relationship.

The parameterized Kronecker sum operation further improves the performance compared with the classic one. This indicates that, through the parameterized Kronecker sum operation, the constructed conjunctive cross graph better fits the corresponding conjunctive signal.

The value of polynomial order influences the performance of graph filtering. On LSC dataset, the best precision is achieved by setting K to be 2 while the best recall is achieved when K is set to be 5.
V Conclusion
In this paper, we proposed a novel framework named TGCNN to globally model the subgraphs factorized from a large graph. For this purpose, we proposed a recursive learning process on graphs by designing a novel graph preserving layer. Serving as a memory unit, the graph preserving layer memorizes the salient nodes of successively input graphs, where the memory function is achieved by applying the newly designed cross graph convolution and graph pooling. Specifically, cross graph convolution models the relationship of each pair of nodes across graphs and can be conducted efficiently. In addition, the proposed parameterized Kronecker sum learns an optimal conjunctive adjacency matrix, which further improves the performance. Comprehensive experiments conducted on action recognition and matrix completion datasets verify the competitive performance of our TGCNN.
References
[1] B. B. Amor, J. Su and A. Srivastava, Action recognition using rate-invariant analysis of skeletal shape trajectories. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 1-13, 2016.
[2] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[3] Rizwan Chaudhry, Ferda Ofli, Gregorij Kurillo, Ruzena Bajcsy, and René Vidal, Bio-inspired dynamic 3D discriminative skeletal features for human action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 471-478, 2013.
[4] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems, pp. 3844-3852, 2016.
[5] Yong Du, Wei Wang, and Liang Wang, Hierarchical recurrent neural network for skeleton based action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110-1118, 2015.
[6] Zhiwu Huang and Luc J Van Gool, A Riemannian network for SPD matrix learning. AAAI, vol. 2, pp. 16, 2017.
[7] Zhiwu Huang, Chengde Wan, Thomas Probst, and Luc Van Gool, Deep learning on Lie groups for skeleton-based action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 6099-6108, 2017.
[8] Vassilis Kalofolias, Xavier Bresson, Michael Bronstein, and Pierre Vandergheynst, Matrix completion on graphs. arXiv preprint arXiv:1408.1717, 2014.
[9] Thomas N Kipf and Max Welling, Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[10] Chaolong Li, Zhen Cui, Wenming Zheng, Chunyan Xu, Rongrong Ji, and Jian Yang, Action-attending graphic neural network. arXiv preprint arXiv:1711.06427, 2017.
[11] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang, Spatio-temporal LSTM with trust gates for 3D human action recognition. European Conference on Computer Vision, pp. 816-833, Springer, 2016.
[12] Federico Monti, Michael M Bronstein, and Xavier Bresson, Geometric matrix completion with recurrent multi-graph neural networks. arXiv preprint arXiv:1704.06803, 2017.
[13] Ferda Ofli, Rizwan Chaudhry, Gregorij Kurillo, René Vidal, and Ruzena Bajcsy, Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition. Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 24-38, 2014.
[14] Omar Oreifej and Zicheng Liu, HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. Computer Vision and Pattern Recognition, pp. 716-723, 2013.
[15] Nikhil Rao, Hsiang-Fu Yu, Pradeep K Ravikumar, and Inderjit S Dhillon, Collaborative filtering with graph information: Consistency and scalable methods. Advances in Neural Information Processing Systems, pp. 2107-2115, 2015.
[16] Aliaksei Sandryhaila and José MF Moura, Discrete signal processing on graphs. IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1644-1656, 2013.
[17] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang, NTU RGB+D: A large scale dataset for 3D human activity analysis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010-1019, 2016.
[18] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa, Human action recognition by representing 3D skeletons as points in a Lie group. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 588-595, 2014.
[19] Pei Wang, Chunfeng Yuan, Weiming Hu, Bing Li, and Yanning Zhang, Graph based skeleton motion representation and similarity measurement for action recognition. In Proc. European Conference on Computer Vision, pp. 370-385. Springer, 2016.
[20] Lu Xia, Chia-Chih Chen, and JK Aggarwal, View invariant human action recognition using histograms of 3D joints. IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 20-27, 2012.
[21] Jing Zhang, Wanqing Li, Pichao Wang, Philip Ogunbona, Song Liu, and Chang Tang, A large scale RGB-D dataset for action recognition.
[22] Songyang Zhang, Xiaoming Liu, and Jun Xiao, On geometric features for skeleton-based action recognition using multilayer LSTM networks. In Proc. IEEE Winter Conference on Applications of Computer Vision, pp. 148-157, 2017.