# Large-Scale Learnable Graph Convolutional Networks

###### Abstract.

Convolutional neural networks (CNNs) have achieved great success on grid-like data such as images, but face tremendous challenges in learning from more generic data such as graphs. In CNNs, the trainable local filters enable the automatic extraction of high-level features. The computation with filters requires a fixed number of ordered units in the receptive fields. However, the number of neighboring units is neither fixed nor are they ordered in generic graphs, thereby hindering the applications of convolutional operations. Here, we address these challenges by proposing the learnable graph convolutional layer (LGCL). LGCL automatically selects a fixed number of neighboring nodes for each feature based on value ranking in order to transform graph data into grid-like structures in 1-D format, thereby enabling the use of regular convolutional operations on generic graphs. To enable model training on large-scale graphs, we propose a sub-graph training method to reduce the excessive memory and computational resource requirements suffered by prior methods on graph convolutions. Our experimental results on node classification tasks in both transductive and inductive learning settings demonstrate that our methods can achieve consistently better performance on the Cora, Citeseer, Pubmed citation network, and protein-protein interaction network datasets. Our results also indicate that the proposed methods using sub-graph training strategy are more efficient as compared to prior approaches.

KDD ’18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, United Kingdom. DOI: 10.1145/3219819.3219947. ISBN: 978-1-4503-5552-0/18/08.

CCS Concepts: Computing methodologies → Neural networks; Computing methodologies → Structured outputs; Computing methodologies → Artificial intelligence.

## 1. Introduction

Deep learning methods are becoming increasingly powerful in solving various challenging artificial intelligence tasks. Among these deep learning methods, convolutional neural networks (CNNs) (LeCun et al., 1998) have demonstrated promising performance in many image-related applications, such as image classification (Deng et al., 2009), semantic segmentation (Chen et al., 2016), and object detection (Ren et al., 2015; He et al., 2017). A variety of CNN models have been proposed to continuously set the performance records (Krizhevsky et al., 2012b; Simonyan and Zisserman, 2015; Szegedy et al., 2015; He et al., 2016a). In addition to images, CNNs have also been successfully applied to natural language processing tasks such as neural machine translation (Cho et al., 2014; Luong et al., 2015; Gehring et al., 2017). One common characteristic behind these tasks is that the data can be represented by grid-like structures. This enables the use of convolutional operations in the form of the same local filters scanning every position on the input. Unlike traditional hand-crafted filters, the local filters used in convolutional layers are trainable. The networks can automatically decide what kind of features to extract by learning the weights in these trainable filters, thereby avoiding hand-crafted feature extraction (Wang et al., 2012).

In many real-world applications, the data can be naturally represented as graphs, such as social, citation, and biological networks. Figure 1 provides an illustration of graph data. Many interesting discoveries can be made by analyzing these graph data, such as social network analysis (Grover and Leskovec, 2016). An important task on graph data is node classification (Kipf and Welling, 2017; Veličković et al., 2017), in which models make predictions for every node in a graph based on node features and graph topology. As mentioned above, CNNs, with the power of automatic feature extraction, have achieved great success on tasks with grid-like data, which can be considered as special cases of graph data. Therefore, applying deep learning models, especially CNNs, on graph tasks is appealing. However, using regular convolutional operations on generic graphs faces two main challenges. These challenges result from the fact that regular convolutions require every node to have the same number of neighboring nodes and require these neighboring nodes to be ordered. In generic graphs, the numbers of neighboring nodes usually differ for different nodes in a graph. In addition, among the neighboring nodes of a node, there is no ranking information based on which we can order them to ensure the output is deterministic. In this work, we analyze the necessity of having a fixed number of ordered neighboring nodes in regular convolutional operations and propose elegant solutions to address these challenges.

Several recent studies tried to apply convolutional operations on generic graphs. Graph convolutional networks (GCNs) (Kipf and Welling, 2017) proposed to use a convolution-like operation to aggregate features of all adjacent nodes for each node, followed by a linear transformation to generate a new feature representation for a given node. Specifically, all feature vectors in the neighborhood, including the feature vector of the central node itself, are summed up, weighted by non-trainable weights depending on the number of neighbors. This can be thought of as a convolution-like operation which, however, is intrinsically different from the regular convolutional operation in two aspects. First, it does not use the same local filter to scan every node; that is, nodes that have different numbers of adjacent nodes have filters of different sizes and weights. Second, the weights in the filters are the same for all neighboring nodes in the receptive field as they are determined by the number of neighbors. Consequently, the weights are not learned. Graph attention networks (GATs) (Veličković et al., 2017) employed the attention mechanism (Bahdanau et al., 2015) to obtain different and trainable weights for adjacent nodes by measuring the correlation between their feature vectors and that of the central node. Yet the graph attention operation still differs from the regular convolution, which learns weights in local filters directly. Moreover, the attention mechanism requires extra computation in terms of pairs of feature vectors, resulting in excessive memory and computational resource requirements in practice.

In this work, we make two major contributions to applying CNNs on generic graph data. First, we propose the learnable graph convolutional layer (LGCL) to enable the use of regular convolutional operations on graphs. Note that prior studies modified the original convolutional operations to fit them for graph data. In contrast, our LGCL transforms the graphs to enable the use of regular convolutions. Our models based on LGCL achieve better performance on both transductive and inductive node classification tasks, as demonstrated by our experimental results. Second, we observe another limitation of prior methods; that is, their training process takes the adjacency matrix of the whole graph as an input. This requires excessive memory and computational resources when the graph has a large number of nodes, which is usually the case in real-world tasks. In order to overcome this limitation, we develop a sub-graph training method, which is a simple yet effective approach to allow the training of deep learning methods on large-scale graph data. The sub-graph training method can significantly reduce the amount of required memory and computational resources, with negligible loss in terms of model performance.

## 2. Related Work

A few recent studies have tried to apply convolutional operations on graph data. Graph convolutional networks (GCNs) were introduced in (Kipf and Welling, 2017) and achieved state-of-the-art performance on several node classification tasks. The authors defined and used a convolution-like operation termed the spectral graph convolution. This enables CNNs to directly operate on graphs. Basically, each layer in GCNs updates the feature vector representation of each node in the graph by considering the features of neighboring nodes. To be specific, the layer-wise forward-propagation operation of GCNs can be expressed as

$$X_{l+1} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X_l W_l\right) \tag{1}$$

where $X_l$ and $X_{l+1}$ are the input and output matrices of layer $l$, respectively. For both matrices, the numbers of rows are the same, corresponding to the number of nodes in the graph, while the numbers of columns can be different, depending on the dimensions of the input and output feature space. In Eq (1), $\tilde{A} = A + I$ is used to aggregate feature vectors of adjacent nodes, where $A$ is the adjacency matrix of the graph, and $I$ is the identity matrix. Also, $\tilde{A}$ is used, instead of $A$, because the layers need to add self-loop connections to make sure that the old feature vector of the node itself is taken into consideration when updating the representation of a node. $\tilde{D}$ is the diagonal node degree matrix, which is used to normalize $\tilde{A}$ so that the scale of feature vectors after aggregation remains the same. $W_l$ is a trainable weight matrix and represents a linear transformation that changes the dimension of feature space. Therefore, the dimension of $W_l$ depends on how many features each node has in the input and output, i.e., the number of columns in $X_l$ and $X_{l+1}$, respectively. $\sigma(\cdot)$ denotes an activation function like ReLU.

We analyze the convolution-like operation, which is the feature aggregation step through pre-multiplying $X_l$ by $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$. Consider a node with a feature vector corresponding to the $i$-th row in $X_l$. The aggregation output, controlled by the $i$-th row in $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$, is a weighted sum of the feature vectors of all of its adjacent nodes, including the node itself. We can see that the operation is equivalent to having a local filter for each node, whose receptive field consists of the node itself and all its neighboring nodes. As it is common for nodes in a generic graph to have different numbers of adjacent nodes, the receptive field size varies, resulting in different local filters. This is a key difference from the regular convolutional operation, where the same local filter is applied to scan each position in grid-like data. Moreover, while using local filters of different sizes for graph data seems reasonable, it is worth noting that there is no trainable parameter in $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$. In addition, each adjacent node receives the same weight in the weighted sum, which makes it a simple average. While CNNs achieve the power of automatic feature extraction by learning the weights in local filters, this non-trainable aggregation operation in GCNs limits the capability of CNNs on generic graph data.
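The aggregation step discussed above can be sketched in a few lines of NumPy. This is a minimal illustration of the GCN propagation rule from (Kipf and Welling, 2017), not the authors' code: the function name `gcn_layer`, the dense-matrix representation, and the toy path graph are assumptions made for the example.

```python
import numpy as np

def gcn_layer(X, A, W, activation=lambda z: np.maximum(z, 0.0)):
    """One GCN step: activation(D^-1/2 (A + I) D^-1/2 X W), with dense matrices."""
    A_tilde = A + np.eye(A.shape[0])            # add self-loop connections
    deg = A_tilde.sum(axis=1)                   # node degrees, self-loop included
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Normalized aggregation matrix: fixed by the graph, no trainable entries.
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return activation(A_hat @ X @ W)            # W is the only trainable part

# Toy path graph 0 - 1 - 2 (hypothetical data).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = np.eye(3)   # one-hot features, so the output exposes the aggregation weights
W = np.eye(3)   # identity transform for illustration only
H = gcn_layer(X, A, W)
```

With identity features and weights, `H` is just the normalized aggregation matrix, which makes it easy to see that the aggregation coefficients are determined by node degrees alone and that only `W` would be updated during training.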

From this perspective, graph attention networks (GATs) (Veličković et al., 2017) tried to enable learnable weights when aggregating neighboring feature vectors by employing the attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017). Like GCNs, each node still has a local filter with a receptive field covering the node itself and all of its adjacent nodes. When performing the weighted sum of feature vectors, each neighbor receives a different weight by measuring the correlation between its feature vector and that of the central node. Mathematically, for a node $i$ and one of its adjacent nodes $j$, the correlation measurement process between layer $l$ and $l+1$ is given by

$$\begin{aligned} e_{i,j} &= a\left(W_l x_l^i,\, W_l x_l^j\right), \\ \alpha_{i,j} &= \operatorname{softmax}_i(e_{i,j}) = \frac{\exp(e_{i,j})}{\sum_{m \in \mathcal{N}_i} \exp(e_{i,m})}, \end{aligned} \tag{2}$$

where $x_l^i$ and $x_l^j$ represent the corresponding feature vectors, i.e., the $i$-th and $j$-th row in $X_l$, respectively, $W_l$ is a shared linear transformation, $a(\cdot)$ represents a single-layer feed-forward neural network, $\mathcal{N}_i$ is the neighborhood of node $i$, and $\alpha_{i,j}$ is the weight for node $j$ in the feature aggregation operation of node $i$. Although in this way, GATs provide different and trainable weights to different adjacent nodes, the learning process differs from that of regular CNNs where weights in local filters are learned directly. Also, the attention mechanism requires extra computation between a node and all of its adjacent nodes, which will cause memory and computational resource problems in practice.
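To make the per-pair cost of attention concrete, the following sketch computes softmax-normalized attention weights for a single node. It is an illustration under stated assumptions rather than the GAT reference implementation: the function name `attention_weights`, the concatenation-based scorer, the LeakyReLU slope of 0.2, and the random toy data are all choices made for this example.

```python
import numpy as np

def attention_weights(X, A, W, a, i, slope=0.2):
    """Attention weights of node i over itself and its adjacent nodes."""
    H = X @ W                                          # shared linear transformation
    nbrs = np.union1d(np.flatnonzero(A[i]), [i])       # receptive field of node i
    # One score per (node i, neighbor) pair from a single-layer scorer `a`.
    pairs = np.hstack([np.tile(H[i], (len(nbrs), 1)), H[nbrs]])
    e = pairs @ a
    e = np.where(e > 0, e, slope * e)                  # LeakyReLU (assumed nonlinearity)
    e = np.exp(e - e.max())                            # numerically stable softmax
    return nbrs, e / e.sum()

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]])
X = rng.normal(size=(4, 3))     # hypothetical node features
W = rng.normal(size=(3, 2))     # hypothetical transformation
a = rng.normal(size=4)          # scorer weights over the concatenated pair
nbrs, alpha = attention_weights(X, A, W, a, i=0)
```

Every node needs one score for each node in its receptive field, so the total work and memory grow with the number of edges, which is the extra cost the paragraph above refers to.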

Unlike these prior models, which modified the regular convolutional operations to fit them for generic graph data, we instead propose to transform graphs into grid-like data to enable the use of CNNs directly. This idea was previously explored in (Niepert et al., 2016). However, the transformation in (Niepert et al., 2016) is implemented in the preprocessing process while our method includes the transformation in the networks. Additionally, we introduce a sub-graph training method in this work, which is a simple yet effective approach to allow large-scale training.

## 3. Methods

In this section, we introduce the learnable graph convolutional layer (LGCL) and the sub-graph training strategy on generic graph data. Based on these developments, we propose the large-scale learnable graph convolutional networks (LGCNs).

### 3.1. Challenges of Applying Convolutional Operations on Graph Data

In order to apply regular convolutional operations on graphs, we need to overcome two main challenges that are caused by two major differences between generic graphs and grid-like data. First, the number of adjacent nodes usually varies for different nodes in a generic graph. Second, we cannot order the neighboring nodes in generic graphs, since there is no ranking information among them. For example, in a social network, each person in the network can be seen as a node and the edges represent friendships between people. Obviously, the number of adjacent nodes differs for each node since people can have different numbers of friends. Meanwhile, it is hard to order these friends without additional information for ranking.

Note that grid-like data can be viewed as a special type of graph data, where each node has a fixed number of ordered neighbors. As convolutional operations apply directly on grid-like data such as images, we analyze why the two characteristics mentioned above are necessary for performing regular convolutions. To see the need for a fixed number of adjacent nodes with ranking information, consider a convolutional filter with a size of $3 \times 3$ scanning an image. We think of the image as a special graph by thinking of each pixel as a node. During the scan, the computation involves a central node with $8$ adjacent nodes each time. These nodes become neighbors of the central node by having edges connecting them in the special graph. Meanwhile, we can order these neighboring nodes by their relative positions with respect to the central node. This is crucial to convolutional operations since the correspondence between weights in the filter and nodes in the graph must be maintained during the scan. For instance, in the example above, the upper left weight in the filter should always be multiplied with the neighboring node at the top left of the central node. Without such ranking information, the outputs of convolution operations are no longer deterministic. We can see from the above discussions that it is challenging to directly apply regular convolutional operations on generic graph data. To address these two challenges, we propose an approach to transform generic graphs into grid-like data.

### 3.2. Learnable Graph Convolutional Layers

To enable the use of regular convolutional operations on generic graphs, we propose the learnable graph convolutional layer (LGCL). Following the notations defined in Section 2, the layer-wise propagation rule of LGCL is formulated as

$$\begin{aligned} \tilde{X}_l &= g(X_l, A, k), \\ X_{l+1} &= c(\tilde{X}_l), \end{aligned} \tag{3}$$

where $A$ is the adjacency matrix, $g(\cdot)$ is an operation that performs the $k$-largest node selection to transform generic graphs to data of grid-like structures, and $c(\cdot)$ denotes a regular 1-D CNN that aggregates neighboring information and outputs a new feature vector for each node. We discuss $g(\cdot)$ and $c(\cdot)$ separately below.

$k$-largest Node Selection. We propose a novel method known as the $k$-largest node selection to achieve the transformation from graphs to grid-like data, where $k$ is a hyper-parameter of LGCL. After this operation, each node aggregates neighboring information and is represented in a 1-D grid-like format with $(k+1)$ positions. The transformed data is then fed into a 1-D CNN to generate the updated feature vector.

Suppose $X_l \in \mathbb{R}^{N \times C}$ with row vectors $x_l^1, x_l^2, \ldots, x_l^N$, representing a graph of $N$ nodes where each node has $C$ features. We are given the adjacency matrix $A \in \mathbb{N}^{N \times N}$ and a fixed $k$. Now consider a specific node $i$ whose feature vector is $x_l^i$ and which has $n$ neighboring nodes. Through a simple look-up operation in $A$, we can obtain the indices of these adjacent nodes, say $i_1, i_2, \ldots, i_n$. Concatenating the corresponding feature vectors $x_l^{i_1}, x_l^{i_2}, \ldots, x_l^{i_n}$ outputs a matrix $M_l^i \in \mathbb{R}^{n \times C}$. Without loss of generality, assume that $n \geq k$. If $n < k$ in practice, we can pad $M_l^i$ using rows of zeros. The $k$-largest node selection is conducted on $M_l^i$; that is, for each column, we rank the values and select the $k$ largest values. This gives us a $k \times C$ output matrix. As the columns in $M_l^i$ represent features, the operation is equivalent to selecting the $k$ largest values for each feature. By inserting $x_l^i$ in the first row, the output becomes $\tilde{M}_l^i \in \mathbb{R}^{(k+1) \times C}$. This is illustrated in the left part of Figure 2. By repeating this process for each node, $g(\cdot)$ transforms $X_l$ to $\tilde{X}_l \in \mathbb{R}^{N \times (k+1) \times C}$.

Note that $\tilde{X}_l \in \mathbb{R}^{N \times (k+1) \times C}$ can be viewed as a 1-D grid-like structure by considering $N$, $(k+1)$, and $C$ as the batch size, the spatial size, and the number of channels, respectively. Therefore, the $k$-largest node selection function $g(\cdot)$ successfully achieves the transformation from generic graphs to grid-like data. The operation makes use of the natural ranking information among real numbers and forces each node to have a fixed number of ordered neighbors.
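The selection procedure described above can be sketched as follows. This is a plain NumPy illustration written from the description in this section, not the released implementation: the function name `k_largest_node_selection`, the dense adjacency matrix without self-loops, and the toy star graph are assumptions.

```python
import numpy as np

def k_largest_node_selection(X, A, k):
    """Map node features X of shape (N, C) to a grid-like tensor (N, k + 1, C).

    Assumes A is a dense adjacency matrix with a zero diagonal (no self-loops).
    """
    N, C = X.shape
    out = np.zeros((N, k + 1, C), dtype=X.dtype)
    for i in range(N):
        nbrs = np.flatnonzero(A[i])            # look up the adjacent nodes of node i
        M = X[nbrs]                            # (n, C) matrix of neighbor features
        if M.shape[0] < k:                     # fewer than k neighbors: pad with zeros
            pad = np.zeros((k - M.shape[0], C), dtype=X.dtype)
            M = np.vstack([M, pad])
        # Rank each feature column independently and keep its k largest values.
        top_k = np.sort(M, axis=0)[::-1][:k]
        out[i, 0] = X[i]                       # the node's own features go first
        out[i, 1:] = top_k
    return out

# Toy star graph: node 0 is connected to nodes 1, 2, and 3 (hypothetical data).
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]])
X = np.array([[1.0, 9.0],
              [5.0, 2.0],
              [3.0, 7.0],
              [4.0, 4.0]])
grid = k_largest_node_selection(X, A, k=2)
```

For node 0, the first feature keeps the values 5 and 4 and the second keeps 7 and 4, so a row of the result can mix values that come from different neighbors. This is the sense in which the ranking is done per feature rather than per node.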

1-D Convolutional Neural Networks. As discussed in Section 3.1, regular convolutional operations can be directly applied on grid-like data. As $\tilde{X}_l$ is 1-D, we employ a 1-D CNN model $c(\cdot)$. The basic functionality of LGCL is to aggregate adjacent information and update the feature vector for each node. Consequently, it requires $X_{l+1} \in \mathbb{R}^{N \times D}$, where $D$ is the dimension of the updated feature space. The 1-D CNN $c(\cdot)$ should take $\tilde{X}_l \in \mathbb{R}^{N \times (k+1) \times C}$ as input and output a matrix of dimension $N \times D$, or equivalently, $N \times 1 \times D$. Basically, $c(\cdot)$ reduces the spatial size from $(k+1)$ to $1$.

Note that $N$ is considered as the batch size, which is not related to the design of $c(\cdot)$. As a result, we focus on only one data sample, i.e., one node in the graph. Taking the example above, for node $i$, the transformed output is $\tilde{M}_l^i \in \mathbb{R}^{(k+1) \times C}$, which serves as the input to $c(\cdot)$. Due to the fact that any regular convolutional operation with a filter size larger than one and no padding reduces the spatial size, the simplest $c(\cdot)$ has only one convolutional layer with a filter size of $(k+1)$ and no padding. The numbers of input and output channels are $C$ and $D$, respectively. Meanwhile, any multi-layer CNN can be employed, provided its final output has the dimension of $1 \times D$. The right part of Figure 2 illustrates an example of a two-layer CNN. Again, applying $c(\cdot)$ for all the $N$ nodes outputs $X_{l+1} \in \mathbb{R}^{N \times D}$. In summary, our LGCL transforms generic graphs to grid-like data using the proposed $k$-largest node selection and applies a regular 1-D CNN to perform feature aggregation and refine the feature vector for each node.
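The shape bookkeeping above can be checked with a small unpadded 1-D convolution. This is a sketch only: `conv1d_valid` is a hypothetical helper written for this example, the filter sizes in the two-layer variant are arbitrary choices that happen to reduce the spatial size to 1, and the random tensors stand in for real features.

```python
import numpy as np

def conv1d_valid(X_tilde, filters, bias):
    """Unpadded 1-D convolution (cross-correlation) along the spatial axis.

    X_tilde: (N, L, C), filters: (K, C, D), bias: (D,) -> output (N, L - K + 1, D).
    """
    N, L, C = X_tilde.shape
    K, _, D = filters.shape
    out = np.empty((N, L - K + 1, D))
    for p in range(L - K + 1):
        window = X_tilde[:, p:p + K, :]                     # (N, K, C) slice
        out[:, p, :] = np.einsum("nkc,kcd->nd", window, filters) + bias
    return out

rng = np.random.default_rng(0)
N, k, C, D = 4, 4, 3, 2
X_tilde = rng.normal(size=(N, k + 1, C))     # output of the k-largest node selection

# Simplest choice: one layer whose filter covers all k + 1 positions.
F = rng.normal(size=(k + 1, C, D))
b = np.zeros(D)
single = conv1d_valid(X_tilde, F, b)                        # shape (N, 1, D)

# A two-layer alternative: spatial size 5 -> 3 -> 1 with filters of size 3.
F1, b1 = rng.normal(size=(3, C, 6)), np.zeros(6)
F2, b2 = rng.normal(size=(3, 6, D)), np.zeros(D)
two = conv1d_valid(conv1d_valid(X_tilde, F1, b1), F2, b2)   # shape (N, 1, D)
```

Both variants end with a spatial size of 1, so squeezing that axis yields one $D$-dimensional feature vector per node. Unlike the aggregation weights in a GCN layer, every entry of the filters is a trainable parameter shared across all nodes.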

### 3.3. Learnable Graph Convolutional Networks

It is known that deeper networks usually yield better performance. However, prior deep models on graphs like GCNs only have two layers. While they suffer from performance loss when going deeper (Kipf and Welling, 2017), our LGCL enables a deeper design, resulting in the learnable graph convolutional networks (LGCNs) for graph node classification. We build LGCNs based on the architecture of densely connected convolutional networks (DCNNs) (Huang et al., 2017; He et al., 2016b), which achieved state-of-the-art performance in the ImageNet classification challenge (Krizhevsky et al., 2012a).

In LGCNs, we first apply a graph embedding layer to produce low-dimensional representations of nodes, since the original inputs are usually very high-dimensional feature vectors in some graph datasets, such as Cora (Sen et al., 2008). The graph embedding layer is essentially a linear transformation in the first layer expressed as

$$X_1 = X_0 W_0 \tag{4}$$

where $X_0 \in \mathbb{R}^{N \times C_0}$ represents the high-dimensional input and $W_0 \in \mathbb{R}^{C_0 \times C_1}$ changes the dimension of feature space from $C_0$ to $C_1$. As a result, $X_1 \in \mathbb{R}^{N \times C_1}$ and $C_1 < C_0$. Alternatively, a GCN layer can be used for graph embedding. As illustrated in Section 2, the number of training parameters in a GCN layer is equal to that of a regular graph embedding layer.

After the graph embedding layer, we stack multiple LGCLs, according to the complexity of the graph data. As each LGCL only aggregates information from first-order neighboring nodes, i.e., direct neighboring nodes, stacked LGCLs can collect information from a larger set of nodes, which is commonly done in regular CNNs. In order to promote the model performance and facilitate the training process, we apply skip connections to concatenate the inputs and outputs of LGCLs. Finally, a fully-connected layer is used before the softmax function for final predictions.
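The effect of concatenating skip connections on feature width can be sketched as below. The layers here are random linear maps used purely as stand-ins for LGCLs, and the widths (a 32-dimensional embedding and 8 new features per layer) are illustrative values; `dense_stack` and `make_linear` are hypothetical helper names.

```python
import numpy as np

def dense_stack(X, layers):
    """Apply layers in sequence, concatenating each layer's input with its output."""
    for layer in layers:
        X = np.concatenate([X, layer(X)], axis=1)
    return X

rng = np.random.default_rng(0)

def make_linear(in_dim, out_dim):
    """Stand-in for an LGCL: a fixed random linear map (assumption for illustration)."""
    W = rng.normal(size=(in_dim, out_dim))
    return lambda X: X @ W

N = 5
X1 = rng.normal(size=(N, 32))                         # output of the embedding layer
layers = [make_linear(32, 8), make_linear(40, 8)]     # second layer sees 32 + 8 inputs
out = dense_stack(X1, layers)                         # width grows 32 -> 40 -> 48
```

Because each layer's input is carried forward unchanged, the final fully-connected layer sees the embedding together with the features produced at every depth, and each later layer must accept the widened input.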

Following the design principle of LGCNs, $k$ and the number of stacked LGCLs are the most important hyper-parameters. The average degree of nodes in the graph can be a good reference for selecting $k$. Meanwhile, the number of LGCLs should depend on the complexity of tasks, such as the number of classes, the number of nodes in a graph, etc. More complicated tasks require deeper models.

Table 1. Summary of the datasets used in the experiments.

Dataset | #Nodes | #Features | #Classes | #Training Nodes | #Validation Nodes | #Test Nodes | Degree | Setting |
---|---|---|---|---|---|---|---|---|
Cora | 2708 | 1433 | 7 | 140 | 500 | 1000 | 4 | Transductive |
Citeseer | 3327 | 3703 | 6 | 120 | 500 | 1000 | 5 | Transductive |
Pubmed | 19717 | 500 | 3 | 60 | 500 | 1000 | 6 | Transductive |
PPI | 56944 | 50 | 121 | 44906 (20 graphs) | 6514 (2 graphs) | 5524 (2 graphs) | 31 | Inductive |

### 3.4. Sub-Graph Training on Large-Scale Data

Most prior deep models on graphs suffer from another limitation. In particular, during training the inputs are the feature vectors of all the nodes along with the adjacency matrix of the whole graph, whose sizes become large for large graph data. These prior models work properly on small-scale graphs. However, for large-scale graphs, those methods usually result in excessive memory and computational resource requirements, which limit the practical applications of these models.

Similar problems also happen for deep neural networks on other types of data, such as grid-like data. For example, deep models on image segmentation usually use randomly cropped patches when dealing with large images. Motivated by this strategy, we intend to randomly “crop” a graph to obtain smaller graphs for training. However, while a rectangular patch of an image naturally maintains neighboring information among pixels, how to handle irregular connections between nodes in a graph remains challenging.

In this work, we propose a sub-graph selection algorithm to address the memory and computational resource problems on large-scale graph data, as shown in Algorithm 1. Given a graph, we first sample some initial nodes. Starting from them, we use the Breadth-First-Search (BFS) algorithm to expand adjacent nodes into the sub-graph iteratively. With multiple iterations, high-order neighboring nodes of the initial nodes are included. Note that we use a single parameter $N_m$, which limits the number of nodes expanded in each iteration, in Algorithm 1 for simplicity. In practice, we can set $N_m$ to different values for each iteration. Figure 4 provides an example of the sub-graph selection process.

With such randomly “cropped” sub-graphs, we are able to train deep models on large-scale graphs. In addition, we can take advantage of the mini-batch training strategy to accelerate the learning process. In each training iteration, we can use the proposed sub-graph selection algorithm to sample several sub-graphs and put them in a mini-batch. The corresponding feature vectors and adjacency matrices form the inputs to the networks.
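A BFS-style sub-graph selection along the lines described above could look like the following. This is a sketch written from the prose description, not a transcription of Algorithm 1: the function name `select_subgraph`, the adjacency-list input, the parameter names `max_size` and `max_expand`, and the stopping rule when no new neighbors remain are assumptions.

```python
import random

def select_subgraph(adj, init_nodes, max_size, max_expand=None, seed=0):
    """Grow a node set from init_nodes by repeated neighbor expansion.

    adj: dict mapping each node to an iterable of its adjacent nodes.
    max_size: upper bound on the number of nodes in the sub-graph.
    max_expand: optional cap on nodes added per iteration (None = no cap).
    """
    rng = random.Random(seed)
    selected = set(init_nodes)
    frontier = set(init_nodes)
    while frontier and len(selected) < max_size:
        # Unvisited neighbors of the nodes added in the previous iteration.
        candidates = sorted({v for u in frontier for v in adj[u]} - selected)
        if max_expand is not None and len(candidates) > max_expand:
            candidates = rng.sample(candidates, max_expand)
        room = max_size - len(selected)
        if len(candidates) > room:                 # do not exceed the size budget
            candidates = rng.sample(candidates, room)
        selected.update(candidates)
        frontier = set(candidates)                 # empty frontier ends the loop
    return selected

# Toy graph with two components: a path 0-1-2-3 and an edge 4-5 (hypothetical data).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
full_component = select_subgraph(adj, init_nodes=[0], max_size=10)
truncated = select_subgraph(adj, init_nodes=[0], max_size=3)

# Star graph: the per-iteration cap limits how many leaves are added.
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
capped = select_subgraph(star, init_nodes=[0], max_size=10, max_expand=2)
```

The expansion stops early when the component containing the initial nodes is exhausted, so the returned set can be much smaller than `max_size`. The selected indices can then be used to slice the feature matrix and adjacency matrix before feeding them to the network.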

## 4. Experimental Studies

In this section, we evaluate our proposed large-scale learnable graph convolutional networks (LGCNs) on node classification tasks under both transductive and inductive learning settings. In addition to comparisons with prior state-of-the-art models, some performance studies are performed to investigate how to choose hyper-parameters. Experiments are also conducted to analyze the training strategy based on the proposed sub-graph selection algorithm. Experimental results show that LGCNs yield improved performance, and the sub-graph training is much more efficient than whole-graph training. Our code is publicly available at https://github.com/divelab/lgcn/.

### 4.1. Datasets

In our experiments, we focus on node classification tasks under both transductive and inductive learning settings.

Transductive Learning. Under the transductive setting, the unlabeled testing data are accessible and available during training. To be specific, for node classification, only a part of the nodes in the graph are labeled. The testing nodes, which are also in the same graph, are accessible during training, including their features and connections, except for the labels. This means the training process knows about the graph structure that contains testing nodes. We use three standard benchmark datasets for transductive learning experiments: Cora, Citeseer, and Pubmed (Sen et al., 2008), as summarized in Table 1. These three datasets are citation networks with nodes and edges representing documents and citations, respectively. The feature vector of each node corresponds to a bag-of-words representation for a document. For these three datasets, we employ the same experimental settings as those in GCN (Kipf and Welling, 2017). For each dataset, 20 nodes per class are used for training, while 500 nodes and 1,000 nodes in total are used for validation and testing, respectively.

Inductive Learning. For inductive learning, the testing data are not available during training, which means the training process does not learn about the structure of test graphs. In inductive learning tasks, we usually have different training, validation, and testing graphs. During training, the model only uses the training graphs without access to validation and testing graphs. We use the protein-protein interaction (PPI) dataset (Zitnik and Leskovec, 2017), which contains 20 graphs for training, 2 graphs for validation, and 2 graphs for testing. Since the graphs for validation and testing are separate, the training process does not use them. There are 2,372 nodes in each graph on average. Each node has 50 features including positional, motif genes and signatures. Each node has multiple labels from 121 classes.

### 4.2. Experimental Setup

We describe the experimental setup under both transductive and inductive learning settings.

Transductive Learning. In transductive learning tasks, we employ the proposed LGCN models as illustrated in Figure 3. Since transductive learning datasets employ high-dimensional bag-of-words representations as feature vectors of nodes, the inputs go through a graph embedding layer to reduce the dimension. Here, we use a GCN layer as the graph embedding layer. The dimension of the embedding output is 32. Then we apply LGCLs, each of which uses $k = 8$ and produces 8-component feature vectors. For the Cora, Citeseer, and Pubmed datasets, we stack 2, 1, and 1 LGCLs, respectively. We use concatenation in skip connections. Finally, a fully-connected layer is used as a classifier to make predictions. Before the fully-connected layer, we perform a simple sum to aggregate feature vectors of adjacent nodes. Dropout (Srivastava et al., 2014) is applied on both input feature vectors and adjacency matrices in each layer with rates of 0.16 and 0.999, respectively. All LGCN models in transductive learning tasks use the sub-graph training strategy. The sub-graph size is set to 2,000.

Inductive Learning. For inductive learning, the same LGCN model as above is used except for some hyper-parameters. For the graph embedding layer, the dimension of output feature vectors is 128. We stack two LGCLs with $k = 64$. We also employ the sub-graph training strategy, with sub-graph initial node size equal to 500 and 200. Dropout with a rate of 0.9 is applied in each layer.

For both transductive and inductive learning LGCN models, the following configurations are shared. For all layers, only the identity activation function is used, which means no nonlinearity is involved in the networks. In order to avoid over-fitting, $L_2$ regularization with $\lambda = 0.0005$ is applied. For training, the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.1 is used. Weights in LGCNs are initialized by the Glorot initialization (Glorot and Bengio, 2010). We employ the early stopping strategy based on the validation accuracy and train for 1,000 epochs at most.

Table 2. Node classification accuracies on transductive learning tasks.

Models | Cora | Citeseer | Pubmed |
---|---|---|---|
DeepWalk (Perozzi et al., 2014) | 67.2% | 43.2% | 65.3% |
Planetoid (Yang et al., 2016) | 75.7% | 64.7% | 77.2% |
Chebyshev (Defferrard et al., 2016) | 81.2% | 69.8% | 74.4% |
GCN (Kipf and Welling, 2017) | 81.5% | 70.3% | 79.0% |
LGCN$_{sub}$ (Ours) | 83.3 ± 0.5% | 73.0 ± 0.6% | 79.5 ± 0.2% |

Table 3. Micro-averaged F1 scores on the inductive learning task.

Models | PPI |
---|---|
GraphSAGE-GCN (Hamilton et al., 2017) | 0.500 |
GraphSAGE-mean (Hamilton et al., 2017) | 0.598 |
GraphSAGE-pool (Hamilton et al., 2017) | 0.600 |
GraphSAGE-LSTM (Hamilton et al., 2017) | 0.612 |
LGCN$_{sub}$ (Ours) | 0.772 ± 0.002 |

### 4.3. Analysis of Results

The experimental results are summarized in Tables 2 and 3 for transductive and inductive learning settings, respectively.

Transductive Learning. For transductive learning experiments, we report node classification accuracies as in (Kipf and Welling, 2017). Table 2 provides the comparisons with other graph models. According to the results, our LGCN models achieve better performance than the current state-of-the-art GCNs by margins of 1.8%, 2.7%, and 0.5% on the Cora, Citeseer, and Pubmed datasets, respectively.

Inductive Learning. For inductive learning experiments, we report micro-averaged F1 scores as in (Hamilton et al., 2017). From Table 3, we can observe that our LGCN model outperforms GraphSAGE-LSTM by a margin of 16%. Without observing the structure of test graphs in training, the LGCN model still achieves good generalization.
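For reference, the micro-averaged F1 score on a multi-label task pools true positives, false positives, and false negatives over all node-label pairs before computing a single F1 value. The sketch below uses the standard definition; the function name `micro_f1` and the tiny label matrices are hypothetical and unrelated to the PPI results.

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for binary indicator matrices of shape (nodes, labels)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)        # predicted and correct
    fp = np.sum(~y_true & y_pred)       # predicted but wrong
    fn = np.sum(y_true & ~y_pred)       # missed labels
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Two nodes, three labels each (hypothetical data).
y_true = [[1, 0, 1],
          [0, 1, 0]]
y_pred = [[1, 0, 0],
          [0, 1, 1]]
score = micro_f1(y_true, y_pred)   # tp = 2, fp = 1, fn = 1
```

Because the counts are pooled before the ratio is taken, frequent labels influence the score more than rare ones, which is the usual reason for reporting the micro average when each node carries many labels.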

The results above show that the proposed LGCN models on generic graphs consistently yield new state-of-the-art performance in node classification tasks on different datasets. These results demonstrate the effectiveness of applying regular convolutional operations on transformed graph data. In addition, the proposed transformation approach through the $k$-largest node selection is shown to be effective.

### 4.4. LGCL versus GCN Layers

It may be argued that our LGCN models employ a deeper network architecture than GCNs, which could explain the improved performance. However, the performance of GCNs is reported to decrease when going deeper by stacking more layers. In addition, we conduct another experiment by replacing all LGCLs in LGCN models by GCN layers, denoted as LGCN-GCN model. All the other settings remain the same in order to ensure the fairness of the comparisons. Table 4 provides the comparison results between LGCN and LGCN-GCN. The results show that LGCN has better performance than LGCN-GCN, which indicates that the LGCL is more effective than the GCN layer.

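Table 4. Comparison between LGCN and LGCN-GCN on transductive learning tasks.

Models | Cora | Citeseer | Pubmed |
---|---|---|---|
LGCN-GCN | 82.2 ± 0.5% | 71.1 ± 0.5% | 79.0 ± 0.2% |
LGCN$_{sub}$ (Ours) | 83.3 ± 0.5% | 73.0 ± 0.6% | 79.5 ± 0.2% |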

### 4.5. Sub-Graph versus Whole-Graph Training

For the experiments above, we use the sub-graph training strategy to learn the LGCN models, which aims at saving memory and training time. However, since the sub-graph selection algorithm samples some nodes as a sub-graph from the whole graph, the models trained in this way do not learn about the structure of the whole graph during training. Meanwhile, in transductive learning tasks, the information of testing nodes may be ignored, which raises the risk of performance loss. To address this concern, we perform experiments on transductive learning tasks to compare the sub-graph training strategy with the previous whole-graph training strategy. Through the experiments, we show the advantages of using the sub-graph training strategy, with negligible loss in terms of model performance.

For the sub-graph selection process described in Algorithm 1, the algorithm starts with some randomly selected initial nodes. In transductive learning tasks, we sample initial nodes only from the nodes with training labels to make sure that training can be conducted. Specifically, we sample 140, 120, and 60 initial nodes when selecting the sub-graph for the Cora, Citeseer, and Pubmed datasets, respectively. In each iteration of the sub-graph selection algorithm, we do not limit the number of nodes expanded into the sub-graph. The maximum number of nodes in the sub-graph is set to 2,000 for all three datasets, which is a feasible size for the GPUs at hand.
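The expansion procedure described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not a reproduction of Algorithm 1: the function name `select_subgraph`, its parameters, and the adjacency-dictionary representation are hypothetical, and the random sampling applied when a cap is reached is an illustrative choice.

```python
import random


def select_subgraph(adj, train_nodes, n_init, max_size,
                    per_iter_limit=None, seed=0):
    """Grow a sub-graph outward from randomly chosen labeled nodes.

    adj            : dict mapping node -> iterable of neighbor nodes
    train_nodes    : nodes that carry training labels
    n_init         : number of initial nodes to sample
    max_size       : cap on the total number of nodes in the sub-graph
    per_iter_limit : optional cap on nodes added per expansion step
                     (None means no cap, as in the setup described above)
    All names here are hypothetical and chosen for illustration.
    """
    rng = random.Random(seed)
    pool = sorted(train_nodes)
    initial = rng.sample(pool, min(n_init, len(pool), max_size))
    selected = set(initial)
    frontier = set(initial)

    while frontier and len(selected) < max_size:
        # Collect neighbors of the current frontier that are not yet selected.
        candidates = set()
        for node in frontier:
            candidates.update(adj.get(node, ()))
        candidates = sorted(candidates - selected)

        # Optional per-iteration cap on newly added nodes.
        if per_iter_limit is not None and len(candidates) > per_iter_limit:
            candidates = rng.sample(candidates, per_iter_limit)

        # Never exceed the overall size cap.
        room = max_size - len(selected)
        if len(candidates) > room:
            candidates = rng.sample(candidates, room)

        frontier = set(candidates)
        selected |= frontier

    return selected


# Toy graph with two disconnected components: {0, 1, 2} and {3, 4}.
toy_adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
toy_subgraph = select_subgraph(toy_adj, train_nodes={0}, n_init=1, max_size=2000)
```

In this toy example the expansion stops as soon as the frontier is empty, so the selected sub-graph contains only the component reachable from the labeled node and stays well below the size cap.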

For comparison, we perform experiments using the same LGCN models, but train them with the same whole-graph training strategy as GCNs, which means the inputs are representations of the entire graph. We denote such models as LGCN$_{whole}$, in contrast to LGCN$_{sub}$, which uses the sub-graph training strategy. Table 5 compares these two models with GCNs. The number of nodes reported is the number of nodes used in one training iteration. The time reported is the training time for 100 epochs on a single TITAN Xp GPU.

Model | Metric | Cora | Citeseer | Pubmed |
---|---|---|---|---|
GCN | # Nodes | 2708 | 3327 | 19717 |
 | Accuracy | 81.5% | 70.3% | 79.0% |
 | Time | 7s | 4s | 38s |
LGCN$_{whole}$ | # Nodes | 2708 | 3327 | 19717 |
 | Accuracy | 83.8 ± 0.5% | 73.0 ± 0.6% | 79.5 ± 0.2% |
 | Time | 58s | 30s | 1080s |
LGCN$_{sub}$ | # Nodes | 644 | 442 | 354 |
 | Accuracy | 83.3 ± 0.5% | 73.0 ± 0.6% | 79.5 ± 0.2% |
 | Time | 14s | 3.6s | 2.6s |

The actual numbers of nodes in the training sub-graph for the Cora, Citeseer, and Pubmed datasets are 644, 442, and 354, respectively, which are far smaller than the maximum sub-graph size of 2,000. This indicates that the nodes in these datasets are sparsely connected. Specifically, starting from several initial nodes with training labels, only a small set of nodes is selected by expanding neighboring nodes to form connected sub-graphs. While each of these datasets is usually considered a single large graph, the whole graph is actually composed of several separate sub-graphs that have no connection to each other. The sub-graph training strategy takes advantage of this fact and makes efficient use of the nodes with training labels. Since only the initial nodes have training labels and all their connectivity information is included in the selected sub-graphs, the information loss in sub-graph training is minimized, resulting in negligible performance loss. This is demonstrated by comparing the node classification accuracies of LGCN$_{sub}$ (sub-graph training) and LGCN$_{whole}$ (whole-graph training). According to the results, the LGCN$_{sub}$ models lose only 0.5 percentage points of accuracy on the Cora dataset and match the LGCN$_{whole}$ models on the Citeseer and Pubmed datasets.

Having investigated the risk of performance loss, we now point out the advantages of the sub-graph training strategy in terms of training speed. With sub-graph training, LGCN models take a sub-graph with fewer nodes as input instead of the whole graph, which is expected to greatly improve training efficiency. The results in Table 5 confirm that the improvement is substantial: for the same LGCN models, the training time for 100 epochs drops from 58s to 14s on Cora, from 30s to 3.6s on Citeseer, and from 1080s to 2.6s on Pubmed. Although GCNs require simpler computation, their running time is much longer than that of LGCN models with sub-graph training on large-scale graph datasets such as Pubmed (38s versus 2.6s). Powerful deep models are usually used on large-scale data, which makes the sub-graph training strategy useful in practice. The sub-graph training strategy enables the use of more complex layers, such as the proposed LGCLs, without the concern of long training time. As a result, our large-scale LGCNs with the sub-graph training strategy are not only effective but also very efficient.
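To make the efficiency comparison concrete, the short Python sketch below recomputes node fractions and speedup ratios from the values reported in Table 5. The dictionary layout and the helper name `summarize` are hypothetical and introduced only for illustration; the numbers are copied from the table.

```python
# Values copied from Table 5; times are seconds for 100 training epochs.
TABLE5 = {
    "Cora":     {"nodes_whole": 2708,  "nodes_sub": 644,
                 "t_gcn": 7.0,  "t_whole": 58.0,   "t_sub": 14.0},
    "Citeseer": {"nodes_whole": 3327,  "nodes_sub": 442,
                 "t_gcn": 4.0,  "t_whole": 30.0,   "t_sub": 3.6},
    "Pubmed":   {"nodes_whole": 19717, "nodes_sub": 354,
                 "t_gcn": 38.0, "t_whole": 1080.0, "t_sub": 2.6},
}


def summarize(row):
    """Derive ratios for one dataset row (hypothetical helper)."""
    return {
        # Share of the whole graph used in one sub-graph training iteration.
        "node_fraction": row["nodes_sub"] / row["nodes_whole"],
        # Speedup of sub-graph training over whole-graph training of LGCN.
        "speedup_vs_whole": row["t_whole"] / row["t_sub"],
        # Speedup of sub-graph LGCN over GCN (values below 1 mean GCN is faster).
        "speedup_vs_gcn": row["t_gcn"] / row["t_sub"],
    }


summary = {name: summarize(row) for name, row in TABLE5.items()}

for name, stats in summary.items():
    print(f"{name}: {stats['node_fraction']:.1%} of nodes, "
          f"{stats['speedup_vs_whole']:.1f}x vs whole-graph LGCN, "
          f"{stats['speedup_vs_gcn']:.1f}x vs GCN")
```

Under these reported timings, the gain over whole-graph training grows with graph size, while on the small Cora graph GCN (7s) remains faster than the sub-graph LGCN (14s).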

### 4.6. Performance Study of k

As described in Section 3.3, the average degree of nodes in the graph can be helpful when choosing the hyper-parameter $k$ in LGCNs. In this part, we conduct experiments to show how different values of $k$ affect the performance of LGCN models. We vary the value of $k$ in LGCLs and observe the node classification accuracies on the Cora, Citeseer, and Pubmed datasets. The values of $k$ are selected from 2, 4, 8, 16, and 32, which cover a reasonable range of integer values.

Figure 5 plots the performance change of LGCN models under different values of $k$. As demonstrated in the figure, the LGCN models achieve the best performance on all three datasets when choosing $k = 8$. In the Cora, Citeseer, and Pubmed datasets, the average node degrees are 4, 5, and 6, respectively. This indicates that the best $k$ is usually a bit larger than the average node degree in the dataset. When $k$ is too large, the performance of LGCN models decreases. A possible explanation is that if $k$ is much larger than the average node degree in the graph, too much zero padding is used in the $k$-largest node selection process, which compromises the performance of the following 1-D CNN models. For the inductive learning task on the PPI dataset, we also explore different values of $k$. The best performance is given by a value of $k$ that likewise exceeds the average node degree, which is 31. This is consistent with our results above.
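The zero-padding effect discussed above can be illustrated with a small Python sketch. It assumes a simplified form of the $k$-largest node selection in which, for each feature column, the $k$ largest values among a node's neighbors are kept in descending order and missing entries are filled with zeros. The function names `k_largest_selection` and `padding_fraction` and the dictionary-based graph representation are hypothetical, and the sketch omits any further steps of the actual layer.

```python
def k_largest_selection(features, neighbors, node, k):
    """Return k rows of C values for `node`: per feature column, the k largest
    neighbor values in descending order, zero-padded if the node has fewer
    than k neighbors. Simplified illustration only."""
    n_feat = len(features[node])
    columns = []
    for c in range(n_feat):
        # Rank neighbor values for this feature independently of other features.
        vals = sorted((features[j][c] for j in neighbors[node]), reverse=True)[:k]
        vals += [0.0] * (k - len(vals))  # zero padding for missing neighbors
        columns.append(vals)
    # Transpose the per-feature columns into k rows of C values each.
    return [list(row) for row in zip(*columns)]


def padding_fraction(neighbors, k):
    """Fraction of the k slots per node that are filled by zero padding."""
    padded = sum(max(k - len(nbrs), 0) for nbrs in neighbors.values())
    return padded / (k * len(neighbors))


# Toy graph: node 0 has two neighbors, nodes 1 and 2 have one each.
features = {0: [1.0, 5.0], 1: [3.0, 2.0], 2: [2.0, 9.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}

grid = k_largest_selection(features, neighbors, node=0, k=3)
# grid == [[3.0, 9.0], [2.0, 2.0], [0.0, 0.0]]

fractions = {k: padding_fraction(neighbors, k) for k in (1, 2, 4, 8)}
```

On this toy graph the padded share of the grid rises from 0 at $k = 1$ toward 1 as $k$ grows past the node degrees, which is the mechanism the paragraph above offers as a possible explanation for the accuracy drop at large $k$.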

## 5. Conclusions and Future Work

In this work, we propose the learnable graph convolutional layer (LGCL), which transforms generic graphs to data of grid-like structures and enables the use of regular convolutional operations. The transformation is conducted through a novel $k$-largest node selection process, which uses the ranking between node feature values. Based on our LGCL, we build deeper networks, known as learnable graph convolutional networks (LGCNs), for node classification tasks on graphs. Experimental results show that the proposed LGCN models yield consistently better performance than prior methods under both transductive and inductive learning settings. Our LGCN models achieve new state-of-the-art results on four different datasets, demonstrating the effectiveness of LGCLs.

In addition, we propose a sub-graph selection algorithm, resulting in the sub-graph training strategy, which can solve the problem of excessive requirements for memory and computational resources on large-scale graph data. With the sub-graph training, the proposed LGCN models are both effective and efficient. Our experiments indicate that the sub-graph training strategy brings a significant advantage in terms of training speed, with a negligible amount of performance loss. The new training strategy is very useful as it enables the use of more complex models efficiently.

Based on this work, we discuss several possible directions for future work. First, our methods mainly address the node classification problems. In practice, many other interesting tasks can be formulated as graph classification problems, where each graph has a label. While they are similar to image classification tasks, current graph convolutional methods, including ours, are not able to perform down-sampling on graphs, like the pooling operations on image data. We need a layer to reduce the number of nodes effectively, which is necessary for graph classification. Second, our methods are mainly applied to generic graph data like citation networks. For other data like text, our methods may also be helpful, since we can treat text data as graphs. We will explore these directions in the future.

###### Acknowledgements.

This work was supported in part by National Science Foundation grants DBI-1641223 and IIS-1633359, and by Defense Advanced Research Projects Agency grant N66001-17-2-4031.

## References

- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (2015).
- Chen et al. (2016) Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2016. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. Transactions on Pattern Analysis and Machine Intelligence (2016).
- Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. Syntax, Semantics and Structure in Statistical Translation (2014), 103.
- Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
- Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, and Yann N Dauphin. 2017. A convolutional encoder model for neural machine translation. Annual Meeting of the Association for Computational Linguistics (2017).
- Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 249–256.
- Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 855–864.
- Hamilton et al. (2017) William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS.
- He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. IEEE International Conference on Computer Vision (2017).
- He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
- He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Huang et al. (2017) Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. 2017. Densely connected convolutional networks. IEEE Conference on Computer Vision and Pattern Recognition (2017).
- Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. The International Conference on Learning Representations (2015).
- Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (2017).
- Krizhevsky et al. (2012a) Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. 2012a. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger (Eds.). 1106–1114.
- Krizhevsky et al. (2012b) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012b. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
- Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. Conference on Empirical Methods in Natural Language Processing (2015).
- Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning. 2014–2023.
- Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
- Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91–99.
- Sen et al. (2008) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. 2008. Collective classification in network data. AI magazine 29, 3 (2008), 93.
- Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going Deeper With Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 6000–6010.
- Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph Attention Networks. arXiv preprint arXiv:1710.10903 (2017).
- Wang et al. (2012) Tao Wang, David J Wu, Adam Coates, and Andrew Y Ng. 2012. End-to-end text recognition with convolutional neural networks. In Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 3304–3308.
- Yang et al. (2016) Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2016. Revisiting semi-supervised learning with graph embeddings. International Conference on Machine Learning (2016).
- Zitnik and Leskovec (2017) Marinka Zitnik and Jure Leskovec. 2017. Predicting multicellular function through multi-layer tissue networks. Bioinformatics 33, 14 (2017), i190–i198.