Sparse Diffusion-Convolutional Neural Networks
The predictive power and overall computational efficiency of diffusion-convolutional neural networks (DCNNs) make them an attractive choice for node classification tasks. However, a naive dense-tensor-based implementation of DCNNs leads to $O(N^2)$ memory complexity, which is prohibitive for large graphs. In this paper, we introduce a simple method for thresholding input graphs that provably reduces the memory requirements of DCNNs to $O(N)$ (i.e., linear in the number of nodes in the input) without significantly affecting predictive performance.
James Atwood (now at Google Brain), UMass Amherst CICS, Amherst, MA, firstname.lastname@example.org
Siddharth Pal, Raytheon BBN Technologies, Cambridge, MA, email@example.com
Don Towsley, UMass Amherst CICS, Amherst, MA, firstname.lastname@example.org
Ananthram Swami, U.S. Army Research Lab, Adelphi, MD, email@example.com
31st Conference on Neural Information Processing Systems (NIPS 2017), Barcelona, Spain.
There has been much recent interest in adapting models and techniques from deep learning to the domain of graph-structured data [1, 2, 4, 6, 7, 9]. Proposed by Atwood and Towsley [1], diffusion-convolutional neural networks (DCNNs) approach the problem by learning ‘filters’ that summarize local information in a graph via a diffusion process. These filters have been observed to provide an effective basis for node classification.
DCNNs have been shown to possess attractive qualities, such as obtaining a latent representation for graphical data that is invariant under isomorphism and utilizing tensor operations that can be efficiently implemented on the GPU. Nevertheless, as was remarked in [1], when implemented using dense tensor operations, DCNNs have an $O(N^2)$ memory complexity in the number of nodes $N$, which could become prohibitively large for massive graphs with millions or billions of nodes.
In an effort to improve the memory complexity of the DCNN technique, we investigate two approaches to thresholding the diffusion process: a pre-thresholding technique that thresholds the transition matrix itself, and a post-thresholding technique that enforces sparsity on the power series of the transition matrix. We show that pre-thresholding the transition matrix provides provably linear ($O(N)$) memory requirements while the model’s predictive performance remains unhampered for small to moderate thresholding values. On the other hand, the post-thresholding technique does not offer any gains in memory complexity. This result suggests that pre-thresholded sparse DCNNs (sDCNNs) are suitable models for large graphical datasets.
We study node classification on a single graph, say $G = (V, E)$, with $V$ being the vertex or node set, and $E$ being the set of edges. No constraints are imposed on the graph $G$; the graph can be weighted or unweighted, directed or undirected. Each vertex is assumed to be associated with $F$ features, leading to the graph being described by an $N \times F$ design matrix, $X$, and an $N \times N$ adjacency matrix, $A$, with $N$ being the number of vertices. In DCNN, we compute a degree-normalized transition matrix $P$ that gives the probability of moving from one node to another in a single step. However, in a sparse implementation of DCNN (sDCNN), rather than using the transition matrix directly, we remove edges with probabilities below a threshold $\sigma$ in order to both improve memory complexity and regularize the graph structure.
Assume the nodes are associated with labels, i.e., each node in $V$ has a label $Y$ drawn from a finite set of classes. Given a set of labelled nodes in a graph, the node classification task is to find labels for the unlabelled nodes. Note that, while in this work we focus on node classification tasks, this framework can be easily extended to graph classification tasks, where graphs rather than individual nodes have labels associated with them [1].
Next, we describe the DCNN framework in greater detail. The neural network takes the graph $G$ and the design matrix $X$ as input, and returns a hard prediction for $Y$ or a conditional distribution $P(Y \mid X)$ for unlabelled nodes. Each node is transformed to a diffusion-convolutional representation, which is an $H \times F$ real matrix defined by $H$ hops of graph diffusion over $F$ features. The core operation of a DCNN is a mapping from nodes and their features to the results of a diffusion process that begins at that node. The node class label is finally obtained by integrating the result of the diffusion process over the graph through a fully connected layer, thus combining the structural and feature information in the graph data. In sDCNN, the diffusion process itself is thresholded to reduce the computational complexity of the diffusion process over a large graph.
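The construction of the transition matrix and the edge-removal step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `transition_matrix`, `pre_threshold`, `A`, `P` and `sigma`, the use of dense NumPy arrays, and the small star graph are all assumptions made for the example.

```python
import numpy as np

def transition_matrix(adjacency):
    """Row-normalize an adjacency matrix into a transition matrix."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = adjacency.sum(axis=1, keepdims=True)
    # Rows of isolated nodes (zero degree) are left as all zeros.
    return np.divide(adjacency, degree,
                     out=np.zeros_like(adjacency), where=degree > 0)

def pre_threshold(transition, sigma):
    """Set transition probabilities strictly below sigma to zero."""
    return np.where(transition >= sigma, transition, 0.0)

# Hypothetical star graph: node 0 is connected to nodes 1, 2 and 3.
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]])
P = transition_matrix(A)
P_bar = pre_threshold(P, sigma=0.5)

print(np.count_nonzero(P), np.count_nonzero(P_bar))
```

In this example the hub's outgoing probabilities (1/3 each) fall below the threshold and are dropped, while the leaves' single outgoing edge (probability 1) survives, so thresholding removes edges preferentially at high-degree nodes.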
DCNN with no thresholding Consider a node classification task where a label $Y$ is associated with each node in a graph. Let $P^*$ be an $N \times H \times N$ tensor containing the power series of the transition matrix $P$. The probability of reaching node $l$ from node $i$ through $j$ hops is captured by $P^*_{ijl}$, or equivalently by $(P^j)_{il}$. The diffusion-convolutional activation $Z_{ijk}$ for node $i$, hop $j$ and feature $k$ of graph $G$ is given by
$$Z_{ijk} = f\Big(W^c_{jk} \sum_{l=1}^{N} P^*_{ijl} X_{lk}\Big),$$
where $W^c_{jk}$ are the learned weights of the diffusion-convolutional layer, and $f$ is the activation function. Briefly, the weights determine the effect of neighboring nodes’ features on the class label of a particular node. The activations can be expressed more concisely using tensor notation as
$$Z = f(W^c \odot P^* X),$$
where the operator $\odot$ represents element-wise multiplication. Observe that the model only entails $O(H \times F)$ parameters, making the size of the latent diffusion-convolutional representation independent of the size of the input.
The output from the diffusion-convolutional layer connects to the output layer, which has one neuron per class, through a fully connected layer with weights $W^d$. A hard prediction for $Y$, denoted $\hat{Y}$, can be obtained by taking the maximum activation, as follows
$$\hat{Y} = \arg\max\big(f(W^d \odot Z)\big),$$
whereas a conditional probability distribution $P(Y \mid X)$ can be obtained by applying the softmax function
$$P(Y \mid X) = \mathrm{softmax}\big(f(W^d \odot Z)\big).$$
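The forward pass described above can be sketched with dense NumPy arrays. Several choices here are assumptions made for illustration and are not taken from the text: the kernel stacks powers 0 through `hops` of the transition matrix (so hop 0 carries a node's own features), the activation is `tanh`, the fully connected layer is a plain matrix product on the flattened activations, and all names (`power_series`, `dcnn_forward`, `W_c`, `W_d`) are hypothetical.

```python
import numpy as np

def power_series(P, hops):
    """Stack P^0, ..., P^hops into a tensor of shape (N, hops + 1, N)."""
    powers = [np.eye(P.shape[0])]
    for _ in range(hops):
        powers.append(powers[-1] @ P)
    return np.stack(powers, axis=1)

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def dcnn_forward(P, X, W_c, W_d):
    """Dense DCNN forward pass for node classification."""
    hops = W_c.shape[0] - 1
    P_star = power_series(P, hops)                   # (N, hops + 1, N)
    diffused = np.einsum("ijl,lk->ijk", P_star, X)   # (N, hops + 1, F)
    Z = np.tanh(W_c[np.newaxis, :, :] * diffused)    # element-wise weights
    logits = Z.reshape(Z.shape[0], -1) @ W_d         # fully connected layer
    probabilities = softmax(logits)
    return Z, probabilities, probabilities.argmax(axis=1)

# Hypothetical toy problem: complete graph on 5 nodes, random features and weights.
rng = np.random.default_rng(0)
N, F, H, C = 5, 3, 2, 4
A = np.ones((N, N)) - np.eye(N)
P = A / A.sum(axis=1, keepdims=True)
X = rng.normal(size=(N, F))
W_c = rng.normal(size=(H + 1, F))
W_d = rng.normal(size=((H + 1) * F, C))
Z, probabilities, y_hat = dcnn_forward(P, X, W_c, W_d)
```

Note that `P_star` is a dense tensor with on the order of N squared entries per hop, which is the storage cost that the thresholding methods aim to avoid.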
Since DCNNs require computation and storage of tensors representing the power series of the transition matrix, they are costly in terms of computational resources. In this work, we investigate methods to enforce sparsity in such tensors and consequently reduce memory utilization. In what follows, we describe two thresholding methods for enforcing sparsity.
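Pre-thresholding In this technique the transition matrix is first thresholded, and then a power series of the thresholded transition matrix is computed. For a threshold value $\sigma$, the pre-thresholded activation $Z^{\mathrm{pre}}_{ijk}$ for node $i$, hop $j$, and feature $k$ is given by
$$Z^{\mathrm{pre}}_{ijk} = f\Big(W^c_{jk} \sum_{l=1}^{N} (\bar{P}^{\,j})_{il} X_{lk}\Big),$$
where $\bar{P}$ is the thresholded transition matrix, and
$$\bar{P}_{il} = P_{il}\, \mathbb{1}[P_{il} \geq \sigma].$$
Note that for an event $B$, $\mathbb{1}[B] = 1$ if the event is true, and $0$ otherwise.
Post-thresholding This thresholding method enforces sparsity on the power series of the transition matrix $P$. For a threshold value $\sigma$, the post-thresholded activation $Z^{\mathrm{post}}_{ijk}$ for node $i$, hop $j$, and feature $k$ is given by
$$Z^{\mathrm{post}}_{ijk} = f\Big(W^c_{jk} \sum_{l=1}^{N} \mathbb{1}[(P^j)_{il} \geq \sigma]\, (P^j)_{il} X_{lk}\Big).$$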
Qualitatively, pre-thresholding only considers strong ties within a particular node’s neighborhood, with all the intermediary ties being sufficiently strong, whereas post-thresholding looks at the entire neighborhood of a node and chooses the strong ties, allowing long-hop ties to be chosen even when they pass through multiple weak ties. When either threshold parameter is set to zero, all ties are considered, and we recover the DCNN setting in this limit. On the other hand, when the threshold parameter is set to its maximum value of one, only the relevant node’s own feature values are considered, together with those of its neighbor if the node in question has exactly one neighbor. This is qualitatively close to, though not exactly the same as, a logistic regression setting.
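The qualitative difference between the two methods shows up already on a three-node path graph. The sketch below is illustrative only; the function names, the dense NumPy representation, and the choice of a threshold of 0.6 are assumptions.

```python
import numpy as np

def pre_thresholded_powers(P, sigma, hops):
    """Powers of the thresholded matrix: threshold first, then multiply."""
    P_bar = np.where(P >= sigma, P, 0.0)
    powers = [np.eye(P.shape[0])]
    for _ in range(hops):
        powers.append(powers[-1] @ P_bar)
    return powers

def post_thresholded_powers(P, sigma, hops):
    """Thresholded powers of the full matrix: multiply first, then threshold."""
    powers = [np.eye(P.shape[0])]
    dense = np.eye(P.shape[0])
    for _ in range(hops):
        dense = dense @ P  # the dense power is still formed here
        powers.append(np.where(dense >= sigma, dense, 0.0))
    return powers

# Path graph 0 - 1 - 2: the middle node splits its probability mass in half.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
pre = pre_thresholded_powers(P, sigma=0.6, hops=2)
post = post_thresholded_powers(P, sigma=0.6, hops=2)

print(pre[2])
print(post[2])
```

With this threshold the two weak ties leaving node 1 are removed before diffusion, so the pre-thresholded two-hop matrix is empty. Post-thresholding keeps the entry for node 1 returning to itself in two hops, which has probability one even though every path to it passes through a weak tie.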
3 Complexity results
Lemma 3.1. For $H \geq 2$, the memory complexity for the DCNN method is $O(N^2 + NHF)$. For $H = 1$, the memory complexity is $O(|E| + NF)$.
Proof. An efficient way to store the power series is to store the product of the powers of the transition matrix with the design matrix. However, the intermediate powers of the transition matrix need to be stored, which requires $O(N^2)$ memory. Storing the product of the power series tensor with the design matrix requires $O(NHF)$ memory. Therefore, the upper bound on the memory usage is $O(N^2 + NHF)$.
However, if $H = 1$, the transition matrix can be represented in $O(|E|)$ memory, thereby getting rid of the $O(N^2)$ memory requirement. ∎
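To give a sense of scale for the dense intermediate powers, the following back-of-the-envelope calculation compares one dense matrix with a sparse one. The figures are assumptions chosen for illustration: 4-byte values and indices, coordinate-format storage, a graph with one million nodes, and ten retained non-zeros per row.

```python
def dense_matrix_bytes(num_nodes, bytes_per_entry=4):
    """Bytes needed to hold one dense num_nodes x num_nodes matrix."""
    return num_nodes * num_nodes * bytes_per_entry

def coo_matrix_bytes(num_nonzeros, bytes_per_value=4, bytes_per_index=4):
    """Bytes for a coordinate-format sparse matrix: value, row and column index."""
    return num_nonzeros * (bytes_per_value + 2 * bytes_per_index)

num_nodes = 1_000_000
dense_bytes = dense_matrix_bytes(num_nodes)
sparse_bytes = coo_matrix_bytes(num_nonzeros=10 * num_nodes)

print(f"dense:  {dense_bytes / 1e12:.1f} TB")
print(f"sparse: {sparse_bytes / 1e6:.0f} MB")
```

Under these assumptions a single dense power occupies terabytes, while a matrix with a bounded number of non-zeros per row fits comfortably in memory.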
For and a fixed threshold , memory complexity under the pre-thresholding technique is . For , the memory complexity is
We argue that a sparse representation of the power series tensor product with the design matrix occupies memory in an inductive manner. For node in and hop , we define the set of nodes in the -hop neighborhood that influence node through pre-thresholding as
For , is the identity matrix , which is sparse with exactly non-zero entries.
For , is the thresholded transition matrix. In the pre-thresholding operation, transition probabilities from node that are less than are set to zero. For a particular node , the set of nodes that have 1-hop transition probabilities greater than or equal to , is exactly . Since, and , we must have . Therefore, can have at most entries.
For , is the thresholded transition matrix raised to the power. Suppose the sparse representation is such that
non-zero entries for each in , implying that have entries
The final step is to prove that for , has entries. For in V, let
By assumption, . Observe that is the set . Since and for all that are in , we have the bound . Thus, we have proved that has non-zero entries. Thus by induction, the costliest operation is computing which requires memory, and the result follows. ∎
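The counting argument behind the induction can be written out as follows. The notation is assumed for this sketch and may differ from the authors' own: $\bar{P}$ denotes the thresholded transition matrix, $\sigma$ the threshold, $S_i^{(j)}$ the set of nodes with a non-zero entry in row $i$ of $\bar{P}^{\,j}$, and $\mathrm{nnz}$ the number of non-zero entries. It relies only on the rows of the transition matrix summing to at most one and on all entries being non-negative.

```latex
\begin{align*}
S_i^{(j)} &= \{\, l \in V : (\bar{P}^{\,j})_{il} > 0 \,\}, \\
\sigma \,\bigl|S_i^{(1)}\bigr| &\le \sum_{l \in S_i^{(1)}} P_{il} \le \sum_{l \in V} P_{il} \le 1
  \quad\Longrightarrow\quad \bigl|S_i^{(1)}\bigr| \le \lfloor 1/\sigma \rfloor, \\
S_i^{(j)} &\subseteq \bigcup_{m \in S_i^{(j-1)}} S_m^{(1)}
  \quad\Longrightarrow\quad \bigl|S_i^{(j)}\bigr| \le \bigl|S_i^{(j-1)}\bigr| \,\lfloor 1/\sigma \rfloor
  \le \lfloor 1/\sigma \rfloor^{\,j}, \\
\mathrm{nnz}\bigl(\bar{P}^{\,j}\bigr) &= \sum_{i \in V} \bigl|S_i^{(j)}\bigr| \le N \,\lfloor 1/\sigma \rfloor^{\,j}.
\end{align*}
```

For a fixed threshold and a fixed number of hops, the factor $\lfloor 1/\sigma \rfloor^{\,j}$ is a constant, so each power of the thresholded matrix has a number of non-zero entries that is linear in $N$.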
For and a fixed threshold , memory complexity under the post-thresholding technique is . For , the memory complexity is .
Even though the post-thresholded power series tensor can be proven to be sparse, in a manner similar to that of Lemma 3.2, the intermediate powers of the dense transition matrix still have to be computed. This requires $O(N^2)$ memory, and therefore no improvement in memory utilization is obtained. ∎
Assume ; To obtain the computational complexity of the DCNN method, we observe that two matrices are multiplied times, and a matrix needs to be multiplied with another matrix, times. Two matrices can be multiplied in complexity using efficient matrix multiplication algorithms for square matrices [3]. The product between the transition matrix and the design matrix can be performed in operations, so the overall complexity is . The computational complexity of the post-thresholded sDCNN is also , because the dense power series tensor needs to be computed.
The pre-thresholded DCNN achieves an improvement in the computational complexity because the power series tensor is computed by multiplying two sparse matrices. The costliest operation is computing which is obtained by multiplying , a sparse matrix with at most non-zero entries, and , another sparse matrix with at most non-zero entries. Using efficient sparse matrix multiplication methods [8, 11], if the condition holds, then the computational complexity of the sparse method is .
Therefore, pre-thresholded sDCNN achieves $O(N)$ memory complexity together with a reduced computational complexity, a significant improvement over DCNN, which requires $O(N^2)$ memory. However, post-thresholded sDCNN still requires the same memory and computational complexity as DCNN. Therefore, in what follows we consider only pre-thresholded sDCNN, although post-thresholding could be thought of as a way of regularizing the DCNN method.
In this section we explore how thresholding affects both the density of the transient diffusion kernel and the performance of DCNNs.
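The sparsity of the pre-thresholded powers can be checked empirically with a simple row-wise sparse product. This sketch uses a dictionary-of-rows representation and a plain triple loop purely for illustration; it is not one of the fast sparse multiplication algorithms cited above. The ring-like test graph and all names are hypothetical.

```python
import math

def threshold_rows(rows, sigma):
    """Keep only transition probabilities that are at least sigma."""
    return {i: {l: p for l, p in row.items() if p >= sigma}
            for i, row in rows.items()}

def sparse_multiply(left, right):
    """Multiply two sparse matrices stored as {row: {column: value}}."""
    product = {}
    for i, row in left.items():
        accumulated = {}
        for m, a in row.items():
            for l, b in right.get(m, {}).items():
                accumulated[l] = accumulated.get(l, 0.0) + a * b
        product[i] = accumulated
    return product

def max_row_nonzeros(rows):
    """Largest number of stored entries in any row."""
    return max((len(row) for row in rows.values()), default=0)

# Hypothetical graph: each of 50 nodes links to its two nearest neighbors on each side.
n, sigma, hops = 50, 0.25, 3
P = {i: {(i + d) % n: 0.25 for d in (-2, -1, 1, 2)} for i in range(n)}
P_bar = threshold_rows(P, sigma)
bound = math.floor(1 / sigma)

powers = [P_bar]
for _ in range(hops - 1):
    powers.append(sparse_multiply(powers[-1], P_bar))
row_counts = [max_row_nonzeros(power) for power in powers]

print(row_counts, [bound ** j for j in range(1, hops + 1)])
```

In this example the number of non-zeros per row stays well below the worst-case bound of `floor(1 / sigma) ** j` for hop `j`, and neither quantity depends on the number of nodes.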
4.1 Effect of Thresholding on Density
Figure 2 shows the results of applying the two thresholding strategies to the Cora dataset. Both thresholding techniques show a decrease in diffusion kernel density as the threshold is increased. However, the decrease is more abrupt for the post-thresholding method: transition probabilities reach low values for a greater number of hops, and when post-thresholded these lead to low densities for a relatively slight increase in the diffusion threshold. The pre-thresholding method, on the other hand, is better behaved, with the kernel density decreasing in a more gradual fashion.
The darker lines, corresponding to larger diffusion kernels obtained through a greater number of hops $H$, have higher diffusion kernel density for low diffusion thresholds. However, as the diffusion threshold is increased, the darker lines cross over the lighter lines for the pre-thresholding method. The justification for this phenomenon is that, once the diffusion threshold is large enough, only the contribution of the identity matrix remains, and the larger diffusion kernels therefore show lower density. A similar phenomenon occurs for the post-thresholding technique, except that the crossover region occurs much earlier. Although we show only the behavior of the Cora dataset, we expect the behavior to hold for other datasets as well.
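A density curve of this kind can be computed as in the sketch below. It does not reproduce the Cora experiment: the graph is a small synthetic one (a ring with random chords), the kernel is assumed to stack powers 0 through `hops` with the identity left unthresholded, and the function name `kernel_density` and the threshold grid are hypothetical.

```python
import numpy as np

def kernel_density(P, sigma, hops, mode):
    """Fraction of non-zero entries in the stacked kernel [P^0, ..., P^hops]."""
    n = P.shape[0]
    step = np.where(P >= sigma, P, 0.0) if mode == "pre" else P
    current = np.eye(n)
    nonzeros = n  # hop 0 is the identity matrix
    for _ in range(hops):
        current = current @ step
        if mode == "pre":
            kept = current
        else:
            kept = np.where(current >= sigma, current, 0.0)
        nonzeros += np.count_nonzero(kept)
    return nonzeros / (n * n * (hops + 1))

# Synthetic graph: a ring plus random chords, so every node has degree >= 2.
rng = np.random.default_rng(0)
n = 30
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
chords = np.triu(rng.random((n, n)) < 0.1, k=2)
A = np.maximum(A, (chords | chords.T).astype(float))
P = A / A.sum(axis=1, keepdims=True)

thresholds = [0.0, 0.1, 0.2, 0.3, 0.5, 1.0]
pre = [kernel_density(P, s, hops=3, mode="pre") for s in thresholds]
post = [kernel_density(P, s, hops=3, mode="post") for s in thresholds]
```

At a threshold of zero both curves coincide with the dense kernel density, and at a threshold of one only the identity slice survives, so the density falls to `1 / (n * (hops + 1))`; a kernel with more hops therefore ends at a lower density.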
4.2 Effect of Thresholding on Performance
Figure 3 shows the effect of thresholding on DCNN performance. For both thresholding techniques, small-to-moderate thresholding values have no significant effect on performance, although performance degrades for larger thresholds. This suggests that applying a small threshold when computing the transient diffusion kernel is an effective means of scaling DCNNs to larger graphs.
However, it should be noted that for moderate thresholds the performance begins to decay. Eventually, when the threshold exceeds the reciprocal of the minimum degree, the benefit of including neighborhood information vanishes entirely because no edges are left in the graph.
5 Related Work
Neural networks for graphs were introduced by Gori et al. [5] and followed by Scarselli et al. [10], in a departure from the traditional approach of transforming the graph into a simpler representation that could then be tackled by conventional machine learning algorithms. Both works used recursive neural networks for processing graph data, requiring repeated application of contraction maps until node representations reach a stable state. Bruna et al. [2] proposed two generalizations of CNNs to signals defined on general domains: one based upon a hierarchical clustering of the domain, and another based on the spectrum of the graph Laplacian. This was followed by Henaff et al. [6], who used these techniques to address a setting where the graph structure is not known a priori and needs to be inferred. However, the parametrization of the CNNs developed in [2, 6] depends on the input graph size, while that of DCNNs or sDCNNs does not, making the technique transferable, i.e., a DCNN or sDCNN learned on one graph can be applied to another. Niepert et al. [9] proposed a CNN approach that extracts locally connected regions of the input graph, requiring the definition of a node ordering as a pre-processing step.
Atwood and Towsley [1] remarked that the DCNN technique uses $O(N^2)$ memory. It is worth noting that the sparse implementation of DCNN reduces this to $O(N)$. This offers performance parity with more recent techniques that report $O(|E|)$ memory complexity, such as the graph convolutional network method of Kipf and Welling [7] and the localized spectral filtering method of Defferrard et al. [4], when the (unthresholded) input graph is sparse ($|E| = O(N)$), which is the case for many real-world datasets of interest, and a performance improvement when graphs are dense.
We have shown that, by applying a simple thresholding technique, we can reduce both the computational complexity and the memory complexity of diffusion-convolutional neural networks, the latter to $O(N)$. This is achieved without significantly affecting the predictive performance of the model.
- (1) Atwood, J. & Towsley, D. (2016). Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1993–2001.
- (2) Bruna, J., Zaremba, W., Szlam, A. & LeCun, Y. (2013). Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
- (3) Coppersmith, D. & Winograd, S. (1990). Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9(3), pp. 251–280.
- (4) Defferrard, M., Bresson, X. & Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3837–3845.
- (5) Gori, M., Monfardini, G. & Scarselli, F. (2005). A new model for learning in graph domains. In IEEE International Joint Conference on Neural Networks, Vol. 2, pp. 729–734.
- (6) Henaff, M., Bruna, J. & LeCun, Y. (2015). Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
- (7) Kipf, T. N. & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
- (8) Le Gall, F. (2012). Faster algorithms for rectangular matrix multiplication. In IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS).
- (9) Niepert, M., Ahmed, M. & Kutzkov, K. (2016). Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pp. 2014–2023.
- (10) Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M. & Monfardini, G. (2009). The graph neural network model. IEEE Transactions on Neural Networks, 20(1), pp. 61–80.
- (11) Yuster, R. & Zwick, U. (2005). Fast sparse matrix multiplication. ACM Transactions on Algorithms (TALG), 1(1), pp. 2–13.