# Bayesian Graph Convolutional Neural Networks using Node Copying

###### Abstract

Graph convolutional neural networks (GCNN) have numerous applications in different graph based learning tasks. Although the techniques obtain impressive results, they often fall short in accounting for the uncertainty associated with the underlying graph structure. In the recently proposed Bayesian GCNN (BGCN) framework, this issue is tackled by viewing the observed graph as a sample from a parametric random graph model and targeting joint inference of the graph and the GCNN weights. In this paper, we introduce an alternative generative model for graphs based on copying nodes and incorporate it within the BGCN framework. Our approach has the benefit that it uses information provided by the node features and training labels in the graph topology inference. Experiments show that the proposed algorithm compares favourably to the state-of-the-art in benchmark node classification tasks.

## 1 Introduction

Recently, there has been an increased research focus on graph convolutional neural networks (GCNNs) due to their successful application in various graph based learning problems such as node and graph classification, matrix completion, and learning of node embeddings. Prior work leading to the development of GCNNs includes (Bruna et al., 2013; Henaff et al., 2015; Duvenaud et al., 2015). (Defferrard et al., 2016) propose an approach based on spectral filtering which is also followed in (Levie et al., 2019; Chen et al., 2018b; Kipf and Welling, 2017). Other works (Atwood and Towsley, 2016; Hamilton et al., 2017) consider spatial filtering and aggregation strategies. A general framework for learning on graphs and manifolds with neural networks is derived in (Monti et al., 2017) and this includes various other existing methods as special cases.

Several modifications can improve the performance of the GCNN, including adding attention nodes (Veličković et al., 2018), gates (Li et al., 2016c; Bresson and Laurent, 2017), edge conditioning and skip connections (Sukhbaatar et al., 2016; Simonovsky and Komodakis, 2017). Other approaches involve the use of graph ensembles (Anirudh and Thiagarajan, 2017), multiple adjacency matrices (Such et al., 2017), the dual graph (Monti et al., 2018), or random perturbation (Sun et al., 2019). Employing localized sampling methods (Hamilton et al., 2017), importance sampling (Chen et al., 2018b) or control variate based stochastic approximation (Chen et al., 2018a) has been shown to improve the scalability of these methods for processing large graphs.

The majority of existing approaches treat the observed graph as the ground truth. However, in many practical settings, the graph is derived from noisy data or inaccurate modelling assumptions. As a result, spurious edges may be present, or edges between very similar nodes might be omitted. This can lead to deterioration in the performance of the learning algorithms. Various existing approaches, such as the graph attention network (Veličković et al., 2018) and the graph ensemble based approach (Anirudh and Thiagarajan, 2017), address this issue partially. Nevertheless, neither of these methods has the flexibility to add edges that could be missing from the observed graph. A principled way to address the uncertainty in the graph structure is to consider the graph as a random sample drawn from a probability distribution over graphs. The Bayesian framework of (Zhang et al., 2019) proposes to use a parametric random graph model as the generative model of the graph and formulates the learning task as the inference of the joint posterior distribution of the graph and the weights of the GCNN. Despite the effectiveness of the approach, the choice of a suitable random graph model is crucial and heavily dependent on the learning task and datasets. Furthermore, the method in (Zhang et al., 2019) conducts the posterior inference of the graph solely conditioned on the observed graph topology. This results in a complete disregard of any information provided by the node features and the training labels, which is undesirable if these data are highly correlated with the true graph structure.

In this paper, we introduce a novel generative model for graphs based on copying nodes from one location to another. While this idea is similar to the full duplication process presented in (Chung et al., 2003), we do not grow the graph since we only copy existing nodes rather than adding new ones. This results in a formulation in which the posterior inference of the graph is carried out conditioned on the features and training labels as well as the observed graph topology. Experimental results demonstrate the efficacy of our approach for the semi-supervised node classification task, particularly if a limited number of training labels is available. The rest of the paper is organized as follows. We provide a brief review of the GCNN in Section 2 and present the proposed approach in Section 3. We report the results of the numerical experiments in Section 4 and make concluding remarks in Section 5.

## 2 Graph convolutional neural networks

Although graph convolutional neural networks are suitable for a variety of learning tasks, here we restrict ourselves to the discussion of the node classification problem on a graph for brevity. In this setting, an observed graph $\mathcal{G}_{obs} = (\mathcal{V}, \mathcal{E})$ is available, where $\mathcal{V}$ is the set of $N$ nodes and $\mathcal{E}$ denotes the set of edges. There is a feature vector $\mathbf{x}_i$ associated with each node $i$ and its class label is denoted by $y_i$. The labels are known only for the nodes in the training set $\mathcal{L} \subset \mathcal{V}$. The goal is to predict the labels of the remaining nodes using the information provided by the observed graph $\mathcal{G}_{obs}$, the feature matrix $\mathbf{X}$ and the training labels $\mathbf{Y}_{\mathcal{L}}$.

In a GCNN, learning is performed using graph convolution operations within a neural network architecture. A layerwise propagation rule for the simpler architectures (Defferrard et al., 2016; Kipf and Welling, 2017) is written as:

$$\mathbf{H}^{(1)} = \sigma\left(\mathbf{A}_{\mathcal{G}}\, \mathbf{X}\, \mathbf{W}^{(0)}\right), \tag{1}$$

$$\mathbf{H}^{(l+1)} = \sigma\left(\mathbf{A}_{\mathcal{G}}\, \mathbf{H}^{(l)}\, \mathbf{W}^{(l)}\right). \tag{2}$$

The normalized adjacency operator $\mathbf{A}_{\mathcal{G}}$ is derived from the observed graph and it controls the aggregation of the output features across the neighbouring nodes at each layer. $\sigma(\cdot)$ denotes a pointwise non-linear activation function and $\mathbf{H}^{(l)}$ are the output features from layer $l$. $\mathbf{W}^{(l)}$ represents the weights of the neural network at layer $l$. We use $\mathbf{W} = \{\mathbf{W}^{(l)}\}_{l=0}^{L-1}$ to denote the collection of GCNN weights across all layers. In an $L$-layer network, the final output $\mathbf{Z} = \mathbf{H}^{(L)}$ is collected from the last layer. The weights of the neural network are learned via backpropagation with the objective of minimizing an error metric between the training labels $\mathbf{Y}_{\mathcal{L}}$ and the network predictions at the nodes in the training set.
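As a concrete illustration, the layerwise propagation rule above can be sketched in a few lines of NumPy. This is a minimal sketch, not any particular published implementation; the function names (`normalized_adjacency`, `gcnn_forward`) and the choice of ReLU between layers are illustrative assumptions.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with added
    self-loops, in the style of Kipf and Welling (2017)."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcnn_forward(adj, feats, weights):
    """Apply H^(l+1) = sigma(A_G H^(l) W^(l)) layer by layer; a ReLU
    non-linearity between layers, raw scores at the final layer."""
    a_g = normalized_adjacency(adj)
    h = feats
    for l, w in enumerate(weights):
        h = a_g @ h @ w
        if l < len(weights) - 1:
            h = np.maximum(h, 0.0)  # pointwise non-linearity
    return h
```

In a classification network, a softmax would be applied to the returned scores to obtain class probabilities.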

## 3 Methodology

In the Bayesian paradigm, the observed graph is viewed as a random quantity and posterior inference for the underlying graph is required. We postulate a model which allows sampling of a random graph by copying the observed graph and then, with a high probability, replacing each node's edges by the edges of a similar node randomly selected from the observed graph, while the node features remain unchanged.

### 3.1 Node-Copying Graph Model

In order to sample a graph $\mathcal{G}$ from the proposed model, we introduce an auxiliary random vector $\zeta = (\zeta_1, \dots, \zeta_N)$, where the $j$'th entry $\zeta_j$ denotes the node whose edges are to replace the edges of the $j$'th node in the observed graph. The entries of $\zeta$ are assumed to be mutually independent. For sampling the $\zeta_j$'s, we use a base classification algorithm, which uses the observed graph $\mathcal{G}_{obs}$, the features $\mathbf{X}$ and the training labels $\mathbf{Y}_{\mathcal{L}}$, to obtain a predicted label $\hat{z}_i$ for each node $i$ in the graph. Then for each class $c$, we collect the nodes with predicted label $c$ into the set $\mathcal{C}_c$:

$$\mathcal{C}_c = \{ i \in \mathcal{V} \mid \hat{z}_i = c \}\,. \tag{3}$$

We define the posterior distribution of $\zeta$ as follows:

$$p(\zeta_j = i \mid \mathcal{G}_{obs}, \mathbf{X}, \mathbf{Y}_{\mathcal{L}}) = \begin{cases} \dfrac{1}{|\mathcal{C}_c|}\,, & \text{if } \hat{z}_j = c \text{ and } i \in \mathcal{C}_c\,, \\ 0\,, & \text{otherwise}\,, \end{cases} \tag{4}$$

for $1 \leq j \leq N$ and $1 \leq c \leq K$, where $K$ is the number of classes. Sampling $\zeta_j$ from this model boils down to selecting a node uniformly at random from the collection of nodes that have the same predicted label as the $j$'th node. Conditioned on $\zeta$ and the observed graph $\mathcal{G}_{obs}$, the sampling of the graph $\mathcal{G}$ is carried out by copying, with a high probability and independently for all $1 \leq j \leq N$, the $\zeta_j$'th node of $\mathcal{G}_{obs}$ in place of the $j$'th node of $\mathcal{G}$. More formally, the generative model is given as:

$$p(\mathcal{G} \mid \mathcal{G}_{obs}, \zeta) = \prod_{j=1}^{N} \left[ (1-\epsilon)\, \mathbb{1}\{\mathcal{G}_j = (\mathcal{G}_{obs})_{\zeta_j}\} + \epsilon\, \mathbb{1}\{\mathcal{G}_j = (\mathcal{G}_{obs})_j\} \right], \tag{5}$$

where $0 < \epsilon < 1$ is a hyperparameter and $\mathbb{1}\{\mathcal{G}_j = (\mathcal{G}_{obs})_{\zeta_j}\}$ denotes the indicator function of copying the $\zeta_j$'th node of $\mathcal{G}_{obs}$ in place of the $j$'th node of $\mathcal{G}$. The copying operation involves changing the set of neighbours of the $j$'th node of $\mathcal{G}$ to be the same as the set of neighbours of the $\zeta_j$'th node of $\mathcal{G}_{obs}$.
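The two sampling steps described by equations (3)-(5) can be sketched as follows. This is an illustrative sketch, not the authors' code: the graph is assumed to be stored as per-node neighbour lists, `eps` is the probability of keeping a node's own edges, and all names are hypothetical.

```python
import random
from collections import defaultdict

def sample_zeta(pred_labels, rng):
    """Draw each zeta_j uniformly from C_c, the set of nodes sharing
    node j's predicted class (equations (3)-(4))."""
    groups = defaultdict(list)
    for node, c in enumerate(pred_labels):
        groups[c].append(node)
    return [rng.choice(groups[c]) for c in pred_labels]

def sample_graph(neighbours, zeta, eps, rng):
    """Replace node j's neighbour list by that of node zeta_j with
    probability 1 - eps; keep the original edges otherwise (eq. (5))."""
    return [list(neighbours[z]) if rng.random() > eps else list(nbrs)
            for nbrs, z in zip(neighbours, zeta)]
```

Both functions make a single pass over the nodes, which is the source of the linear-time sampling cost discussed in Section 3.2.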

### 3.2 Bayesian Graph Convolutional Neural Networks

As in (Zhang et al., 2019), we compute the marginal posterior probability of the node labels via marginalization with respect to the graph and the GCNN weights:

$$p(\mathbf{Z} \mid \mathbf{Y}_{\mathcal{L}}, \mathbf{X}, \mathcal{G}_{obs}) = \int p(\mathbf{Z} \mid \mathbf{W}, \mathcal{G}, \mathbf{X})\, p(\mathbf{W} \mid \mathbf{Y}_{\mathcal{L}}, \mathbf{X}, \mathcal{G})\, p(\mathcal{G} \mid \zeta, \mathcal{G}_{obs})\, p(\zeta \mid \mathbf{Y}_{\mathcal{L}}, \mathbf{X}, \mathcal{G}_{obs})\, d\mathbf{W}\, d\mathcal{G}\, d\zeta\,. \tag{6}$$

Here $\mathbf{W}$ denotes the random weights of a Bayesian GCNN over the graph $\mathcal{G}$, and $\zeta$ is the $N$-dimensional random vector associated with the proposed node-copying model. In a node classification problem with $K$ classes, the term $p(\mathbf{Z} \mid \mathbf{W}, \mathcal{G}, \mathbf{X})$ is modelled using a $K$-dimensional categorical distribution by applying a softmax function to the output of the GCNN. In (Zhang et al., 2019), $\mathcal{G}_{obs}$ is viewed as a sample realization from a collection of graphs associated with a parametric random graph model and posterior inference of $\mathcal{G}$ is targeted via marginalization of the random graph parameters. Their approach thus ignores any possible dependence of the graph on the features $\mathbf{X}$ and the labels $\mathbf{Y}_{\mathcal{L}}$. By contrast, our approach models the marginal posterior distribution of the graph as $p(\mathcal{G} \mid \mathbf{Y}_{\mathcal{L}}, \mathbf{X}, \mathcal{G}_{obs})$. This allows us to incorporate the information provided by the features and the training labels in the graph inference process. The integral in equation (6) is not analytically tractable. Hence, a Monte Carlo approximation is formed as follows:

$$p(\mathbf{Z} \mid \mathbf{Y}_{\mathcal{L}}, \mathbf{X}, \mathcal{G}_{obs}) \approx \frac{1}{V} \sum_{v=1}^{V} \frac{1}{N_G S} \sum_{i=1}^{N_G} \sum_{s=1}^{S} p(\mathbf{Z} \mid \mathbf{W}_{s,i,v}, \mathcal{G}_{i,v}, \mathbf{X})\,. \tag{7}$$

In this approximation, $V$ samples $\zeta_v$ are drawn from $p(\zeta \mid \mathbf{Y}_{\mathcal{L}}, \mathbf{X}, \mathcal{G}_{obs})$. The graphs $\mathcal{G}_{i,v}$ are sampled from $p(\mathcal{G} \mid \zeta_v, \mathcal{G}_{obs})$, and subsequently the weight matrices $\mathbf{W}_{s,i,v}$ are sampled from the Bayesian GCNN corresponding to the graph $\mathcal{G}_{i,v}$.
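The nested Monte Carlo estimate of equation (7) can be sketched as the triple loop below. The callables `sample_zeta`, `sample_graph`, and `gcnn_predict` are hypothetical stand-ins for the three sampling steps (they are not defined by the paper in this form); `gcnn_predict` is assumed to return the categorical output probabilities of one stochastic forward pass.

```python
import numpy as np

def bgcn_predict(V, N_G, S, sample_zeta, sample_graph, gcnn_predict):
    """Average categorical GCNN outputs over V samples of zeta, N_G
    graph samples per zeta, and S weight samples per graph (eq. (7))."""
    total, count = 0.0, 0
    for _ in range(V):
        zeta = sample_zeta()              # zeta_v ~ p(zeta | Y_L, X, G_obs)
        for _ in range(N_G):
            graph = sample_graph(zeta)    # G_{i,v} ~ p(G | zeta_v, G_obs)
            for _ in range(S):
                # one stochastic forward pass with weights W_{s,i,v}
                total = total + gcnn_predict(graph)
                count += 1
    return total / count
```

The predicted label of each node is then the argmax of its row of the averaged probability matrix.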

Sampling graphs from the node-copying model in Section 3.1 has several advantages compared to the graph inference technique based on mixed membership stochastic block models (MMSBMs) (Airoldi et al., 2009), which was adopted in (Zhang et al., 2019). First, the sampling of $\zeta$ is computationally much cheaper than the inference of the parameters of the parametric model, an advantage that becomes more pronounced as the size of the graph increases. Second, it is in general extremely difficult to carry out accurate inference for high-dimensional MMSBM parameters (Li et al., 2016b), and inaccurate parameter estimates result in the sampling of graphs which are very different from the observed graph. This can impact classification performance adversely, particularly if the observed graph does not fit the MMSBM well. For the proposed copying model, by contrast, the similarity between the sampled graphs and the observed graph depends mostly on the performance of the base classifier. If a state-of-the-art graph based classification method (e.g., a GCNN) is used, we can obtain more representative graph samples from this model, particularly for large graphs. The expected graph edit distance between the random graphs and the observed graph can be controlled by the choice of the parameter $\epsilon$. A low value of $\epsilon$ is chosen since it causes high variability among the random graph samples, which was found to be effective empirically. Third, sampling a graph from the MMSBM scales as $O(N^2)$, whereas the proposed method offers $O(N)$ complexity.

For the Bayesian inference of GCNN weights, we can use various techniques including expectation propagation (Hernández-Lobato and Adams, 2015), variational inference (Gal and Ghahramani, 2016; Sun et al., 2017; Louizos and Welling, 2017), and Markov Chain Monte Carlo methods (Neal, 1993; Korattikara et al., 2015; Li et al., 2016a). Similar to (Zhang et al., 2019), we train a GCNN on $\mathcal{G}_{i,v}$ and use Monte Carlo dropout (Gal and Ghahramani, 2016) to sample $\mathbf{W}_{s,i,v}$. This is equivalent to sampling the weights from a variational approximation, with a particular structure, of $p(\mathbf{W} \mid \mathbf{Y}_{\mathcal{L}}, \mathbf{X}, \mathcal{G}_{i,v})$. The resulting algorithm is summarized in Algorithm 1.
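The weight-sampling idea behind Monte Carlo dropout can be sketched for a plain dense network: dropout is kept active at prediction time, so each forward pass corresponds to one approximate posterior sample of the weights. This is a minimal illustration, assuming inverted-dropout scaling, not the authors' GCNN implementation.

```python
import numpy as np

def mc_dropout_pass(x, weights, drop_prob, rng):
    """One stochastic forward pass of a dense network with dropout kept
    on; repeated calls give Monte Carlo samples of the output."""
    h = x
    for l, w in enumerate(weights):
        # Drop each unit with probability drop_prob, rescale survivors
        keep = (rng.random(h.shape) >= drop_prob) / (1.0 - drop_prob)
        h = (h * keep) @ w
        if l < len(weights) - 1:
            h = np.maximum(h, 0.0)
    return h
```

Averaging the outputs of many such passes approximates the posterior predictive distribution, which is exactly how the innermost sum over $s$ in equation (7) is computed.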

## 4 Numerical Experiments and Results

We address a semi-supervised node classification task for three citation networks (Sen et al., 2008): Cora, CiteSeer, and Pubmed. In these datasets each node represents a scientific publication and an undirected edge is formed between two nodes if any one of them cites the other. Each node has a sparse bag-of-words feature vector and the label describes the topic of the document. During training, we have access to the labels of only a few nodes per class and the goal is to infer labels for the other nodes.

We consider two different strategies for splitting the data into training and test sets, as specified in (Zhang et al., 2019). In the first setting, we use the fixed split from (Yang et al., 2016), which contains 20 labels per class in the training set. For the cases with 5 and 10 training labels per class in the fixed split scenario, the first 5 and 10 labels in the original partition of (Yang et al., 2016) are used. The second type of split is constructed by sampling the training and test sets randomly for each trial. Since a specific split of data can impact the classification performance significantly, random splitting provides a more robust comparison of performance of the algorithms.

We compare the proposed BGCN in this paper with ChebyNet (Defferrard et al., 2016), GCNN (Kipf and Welling, 2017), GAT (Veličković et al., 2018) and the BGCN in (Zhang et al., 2019). The hyperparameters of the GCNN are set according to (Kipf and Welling, 2017) and the same values are used for the BGCN algorithms as well. For the proposed BGCN, we use the GCNN (Kipf and Welling, 2017) as the base classification method. For both splitting strategies, each algorithm is run for 50 trials with random weight initializations. The average accuracies for the Cora, CiteSeer and Pubmed datasets, along with their standard errors, are reported in Tables 1, 2 and 3, respectively.

Table 1: Classification accuracy (in %) for the Cora dataset.

| | 5 labels | 10 labels | 20 labels |
|---|---|---|---|
| **Random split** | | | |
| ChebyNet | 61.7 ± 6.8 | 72.5 ± 3.4 | 78.8 ± 1.6 |
| GCNN | 70.0 ± 3.7 | 76.0 ± 2.2 | 79.8 ± 1.8 |
| GAT | 70.4 ± 3.7 | 76.6 ± 2.8 | 79.9 ± 1.8 |
| BGCN | 74.6 ± 2.8 | 77.5 ± 2.6 | 80.2 ± 1.5 |
| BGCN (ours) | 73.8 ± 2.7 | 77.6 ± 2.6 | 80.3 ± 1.6 |
| **Fixed split** | | | |
| ChebyNet | 67.9 ± 3.1 | 72.7 ± 2.4 | 80.4 ± 0.7 |
| GCNN | 74.4 ± 0.8 | 74.9 ± 0.7 | 81.6 ± 0.5 |
| GAT | 73.5 ± 2.2 | 74.5 ± 1.3 | 81.6 ± 0.9 |
| BGCN | 75.3 ± 0.8 | 76.6 ± 0.8 | 81.2 ± 0.8 |
| BGCN (ours) | 75.1 ± 1.3 | 76.7 ± 0.7 | 81.4 ± 0.6 |

Table 2: Classification accuracy (in %) for the CiteSeer dataset.

| | 5 labels | 10 labels | 20 labels |
|---|---|---|---|
| **Random split** | | | |
| ChebyNet | 58.5 ± 4.8 | 65.8 ± 2.8 | 67.5 ± 1.9 |
| GCNN | 58.5 ± 4.7 | 65.4 ± 2.6 | 67.8 ± 2.3 |
| GAT | 56.7 ± 5.1 | 64.1 ± 3.3 | 67.6 ± 2.3 |
| BGCN | 63.0 ± 4.8 | 69.9 ± 2.3 | 71.1 ± 1.8 |
| BGCN (ours) | 63.9 ± 4.2 | 68.5 ± 2.3 | 70.2 ± 2.0 |
| **Fixed split** | | | |
| ChebyNet | 53.0 ± 1.9 | 67.7 ± 1.2 | 70.2 ± 0.9 |
| GCNN | 55.4 ± 1.1 | 65.8 ± 1.1 | 70.8 ± 0.7 |
| GAT | 55.4 ± 2.6 | 66.1 ± 1.7 | 70.8 ± 1.0 |
| BGCN | 57.3 ± 0.8 | 70.8 ± 0.6 | 72.2 ± 0.6 |
| BGCN (ours) | 61.4 ± 2.3 | 69.6 ± 0.6 | 71.9 ± 0.6 |

Table 3: Classification accuracy (in %) for the Pubmed dataset.

| | 5 labels | 10 labels | 20 labels |
|---|---|---|---|
| **Random split** | | | |
| ChebyNet | 62.7 ± 6.9 | 68.6 ± 5.0 | 74.3 ± 3.0 |
| GCNN | 69.7 ± 4.5 | 73.9 ± 3.4 | 77.5 ± 2.5 |
| GAT | 68.0 ± 4.8 | 72.6 ± 3.6 | 76.4 ± 3.0 |
| BGCN | 70.2 ± 4.5 | 73.3 ± 3.1 | 76.0 ± 2.6 |
| BGCN (ours) | 71.0 ± 4.2 | 74.6 ± 3.3 | 77.5 ± 2.4 |
| **Fixed split** | | | |
| ChebyNet | 68.1 ± 2.5 | 69.4 ± 1.6 | 76.0 ± 1.2 |
| GCNN | 69.7 ± 0.5 | 72.8 ± 0.5 | 78.9 ± 0.3 |
| GAT | 70.0 ± 0.6 | 71.6 ± 0.9 | 76.9 ± 0.5 |
| BGCN | 70.9 ± 0.8 | 72.3 ± 0.8 | 76.6 ± 0.7 |
| BGCN (ours) | 71.2 ± 0.5 | 73.6 ± 0.5 | 79.1 ± 0.4 |

We observe that the proposed BGCN algorithm obtains higher classification accuracy compared to its competitors in most cases. The improvement in accuracy compared to GCNN is more significant when the number of available labels is limited to 5 or 10. From Figure 3, we observe that in most cases, for the Cora and the Citeseer datasets, the proposed BGCN algorithm corrects more errors of the GCNN base classifier for nodes with lower degree.

## 5 Conclusion

In this paper, we present a Bayesian GCNN using a node-copying based generative model for graphs. The proposed algorithm exhibits superior performance in the semi-supervised node classification task when the amount of labels available for training is limited. Future work will involve conducting a more thorough experimental evaluation and exploring ways to extend the methodology to other graph based learning tasks.

## References

- Mixed membership stochastic blockmodels. In Proc. Adv. Neural Inf. Proc. Systems, pp. 33–40.
- Bootstrapping graph convolutional neural networks for autism spectrum disorder classification. arXiv:1704.07487.
- Diffusion-convolutional neural networks. In Proc. Adv. Neural Inf. Proc. Systems.
- Residual gated graph convnets. arXiv:1711.07553.
- Spectral networks and locally connected networks on graphs. In Proc. Int. Conf. Learning Representations, Scottsdale, AZ, USA.
- Stochastic training of graph convolutional networks with variance reduction. In Proc. Int. Conf. Machine Learning.
- FastGCN: fast learning with graph convolutional networks via importance sampling. In Proc. Int. Conf. Learning Representations.
- Duplication models for biological networks. J. of Computat. Biology 10 (5), pp. 677–687.
- Convolutional neural networks on graphs with fast localized spectral filtering. In Proc. Adv. Neural Inf. Proc. Systems.
- Convolutional networks on graphs for learning molecular fingerprints. In Proc. Adv. Neural Inf. Proc. Systems.
- Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proc. Int. Conf. Machine Learning.
- Inductive representation learning on large graphs. In Proc. Adv. Neural Inf. Proc. Systems.
- Deep convolutional networks on graph-structured data. arXiv:1506.05163.
- Probabilistic backpropagation for scalable learning of Bayesian neural networks. In Proc. Int. Conf. Machine Learning.
- Semi-supervised classification with graph convolutional networks. In Proc. Int. Conf. Learning Representations.
- Bayesian dark knowledge. In Proc. Adv. Neural Inf. Proc. Systems.
- CayleyNets: graph convolutional neural networks with complex rational spectral filters. IEEE Trans. Signal Processing 67 (1).
- DeepGraph: graph structure predicts network growth. arXiv:1610.06251.
- Scalable MCMC for mixed membership stochastic blockmodels. In Proc. Artificial Intelligence and Statistics, pp. 723–731.
- Gated graph sequence neural networks. In Proc. Int. Conf. Learning Representations.
- Multiplicative normalizing flows for variational Bayesian neural networks. arXiv:1703.01961.
- Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proc. IEEE Conf. Comp. Vision and Pattern Recognition.
- Dual-primal graph convolutional networks. arXiv:1806.00770.
- Bayesian learning via stochastic dynamics. In Proc. Adv. Neural Inf. Proc. Systems, pp. 475–482.
- Collective classification in network data. AI Magazine 29 (3), pp. 93.
- Dynamic edge-conditioned filters in convolutional neural networks on graphs. arXiv:1704.02901.
- Robust spatial filtering with graph convolutional neural networks. IEEE J. Sel. Topics Signal Proc. 11 (6), pp. 884–896.
- Learning multiagent communication with backpropagation. In Proc. Adv. Neural Inf. Proc. Systems.
- Fisher-Bures adversary graph convolutional networks. arXiv:1903.04154.
- Learning structured weight uncertainty in Bayesian neural networks. In Proc. Artificial Intelligence and Statistics.
- Graph attention networks. In Proc. Int. Conf. Learning Representations, Vancouver, Canada.
- Revisiting semi-supervised learning with graph embeddings. arXiv:1603.08861.
- Random graph models for dynamic networks. Eur. Phys. J. B, pp. 90–200.
- Bayesian graph convolutional neural networks for semi-supervised classification. In Proc. AAAI Conf. Artificial Intelligence.