Pre-Training Graph Neural Networks for
Generic Structural Feature Extraction
Abstract
Graph neural networks (GNNs) have been shown to be successful in modeling applications with graph structures. However, training an accurate GNN model requires a large collection of labeled data and expressive features, which might be inaccessible for some applications. To tackle this problem, we propose a pretraining framework that captures generic graph structural information that is transferable across tasks. Our framework leverages the following three tasks: 1) denoising link reconstruction, 2) centrality score ranking, and 3) cluster preserving. The pretraining procedure can be conducted purely on synthetic graphs, and the pretrained GNN is then adapted for downstream applications. With the proposed pretraining procedure, generic structural information is learned and preserved; thus the pretrained GNN requires less labeled data and fewer domain-specific features to achieve high performance on different downstream tasks. Comprehensive experiments demonstrate that our proposed framework can significantly enhance the performance of various tasks at the node, link, and graph levels.
Ziniu Hu, Changjun Fan, Ting Chen, Kai-Wei Chang, Yizhou Sun
University of California, Los Angeles
{bull, cjfan2017, tingchen, kwchang, yzsun}@cs.ucla.edu
Preprint. Under review.
1 Introduction
Graphs are a fundamental abstraction for modeling relational data in physics, biology, neuroscience, and social science. Although there are numerous types of graph structures, some graphs are known to exhibit rich connectivity patterns that appear generically across graphs from different applications. Take network motifs, small recurring subgraph structures, as an example: they are considered the building blocks for many graph-related tasks [benson2016higher]; e.g., triangular motifs are crucial for social networks, and two-hop paths are essential for understanding air traffic patterns. Despite their importance, researchers previously had to manually design various specific rules or patterns to extract motif structures to serve as features for different applications. This process is tedious and ad hoc.
Recently, researchers have adopted deep representation learning in the graph domain and proposed various graph neural network architectures [gcn; graphsage; DBLP:conf/iclr/VelickovicCCRLB18] to alleviate these issues by automatically capturing complex information about graph structures from data. In general, GNNs take a graph with attributes as input and apply convolution filters to generate node embeddings at different granularity levels, layer by layer. The GNN framework is often trained in an end-to-end manner towards some specific task and has been shown to achieve competitive performance in various graph-based applications, such as semi-supervised node classification [gcn], recommendation systems [DBLP:conf/kdd/YingHCEHL18], and knowledge graphs [DBLP:conf/esws/SchlichtkrullKB18].
Despite this success, most GNN applications rely heavily on domain-specific input features. For example, in the PPI dataset [zitnik2017predicting], which is widely used as a node classification benchmark, positional gene sets, motif gene sets, and immunological signatures are used as features. However, these domain-specific features can be hard to obtain, and they do not generalize to other tasks. Without access to such domain-specific features, the performance of GNNs is often suboptimal. Besides, unlike in computer vision (CV) or natural language processing (NLP), where large-scale labeled data are available [deng2009imagenet], in many graph-related applications, graphs with high-quality labels are expensive or even inaccessible. For example, researchers in neuroscience use fMRI to scan and construct brain networks [brain], which is extremely time-consuming and costly. Consequently, the insufficiency of labeled graphs restricts the potential to train deep GNNs that learn generic graph features from scratch. To tackle these issues, in this paper we consider training a deep GNN on a large set of unlabeled graphs and transferring the learned knowledge to downstream tasks with only a few labels, an idea known as pretraining. We focus on the following research questions: Can GNNs learn generic graph structural information via pretraining? And to what extent can pretraining benefit downstream tasks?
Although the idea of unsupervised pretraining has proven successful in CV and NLP [girshick2014rich; DBLP:journals/corr/abs181004805], adopting it for training GNNs is still a non-trivial problem. First, even for unlabeled graphs, collecting a high-quality and generalizable graph dataset for pretraining is hard. Graphs in different domains usually possess different properties [snapnets]. Taking the degree distribution [barabasi2004network] as an example, it tends to be uniform in chemical molecular networks, while it typically follows a power law in social networks. Therefore, a model trained on graphs from one domain often cannot generalize well to another domain. As a consequence, obtaining a high-quality graph dataset that covers various graph properties is crucial and challenging for pretraining GNNs. To solve this, we propose to generate synthetic graphs with different statistical properties as the pretraining data. Specifically, we exploit the degree-corrected stochastic block model [holland1983stochastic] to generate synthetic graphs, with varied parameters used to produce a variety of graphs with different properties.
Second, as shown in [newman2010networks], graphs have a wide spectrum of structural properties, ranging from nodes and edges to subgraphs. Existing unsupervised objectives, such as forcing node embeddings to preserve similarity derived from random walks [perozzi2014deepwalk; grover2016node2vec], only focus on capturing relatively local structures and overlook higher-level structural information. To capture the information present at different levels of graph structure, we design three self-supervised pretraining tasks: 1) denoising link reconstruction, 2) centrality score ranking, and 3) cluster preserving. Guided by these tasks, the pretrained GNNs are able to capture general graph properties and benefit downstream tasks.
Our main contributions are as follows:

We propose a pretraining framework that enables GNNs to learn generic graph structural features. The framework utilizes synthetic graphs with adjustable statistical properties as the pretraining data; therefore, it does not require additional labeled data. The pretrained model can benefit applications with unseen graphs.

We explore three self-supervised pretraining tasks that focus on different levels of graph structure, enabling GNNs to capture multi-perspective graph structure information.

We perform extensive experiments on different downstream tasks and demonstrate that all of them benefit significantly from transferring the knowledge learned via pretraining. We will release source code and data to facilitate future research along this line.
2 Related Work
The goal of pretraining is to allow a neural network model to initialize its parameters with weights learned from pretraining tasks. In this way, the model can leverage the commonality between the pretraining and downstream tasks. Pretraining has proven effective in boosting the performance of many downstream applications in computer vision (CV), natural language processing (NLP), and graph mining. In the following, we review approaches and applications of pretraining in these fields.
Pretraining strategies for graph applications
Previous studies have proposed utilizing pretraining to learn graph representations. These attempts directly parameterize the node embedding vectors and optimize them by preserving some deterministic measure, such as network proximity [tang2015line] or statistics derived from random walks [grover2016node2vec]. However, embeddings learned this way cannot generalize to unseen graphs, because the information they capture is graph-specific. In contrast, we consider a transfer learning setting, where our goal is to learn generic graph representations.
With the increasing focus on graph neural networks (GNNs), researchers have explored pretraining GNNs on unannotated data. GraphSAGE [graphsage] adds an unsupervised loss using a random-walk-based similarity metric. Graph Infomax [DBLP:journals/corr/abs180910341] proposes to maximize the mutual information between node representations obtained from GNNs and a graph summary representation. All these studies pretrain GNNs only on a particular set of graphs with task-specific feature inputs. As the feature types across graph datasets are quite different, it is hard to generalize the knowledge learned from one set to another. Instead, our approach learns generic structural information of graphs, regardless of their specific feature types.
Pretraining strategies for other machine learning applications
Pretraining in CV [girshick2014rich; zeiler2014visualizing; donahue2014decaf] mostly follows this paradigm: first pretrain a model on a large-scale supervised dataset (such as ImageNet [deng2009imagenet]), then fine-tune the pretrained model on downstream tasks [girshick2014rich] or directly extract the representations as features [donahue2014decaf]. This approach has been shown to make a significant impact on various downstream tasks, including object detection [girshick2014rich; he2017mask], image segmentation [chen2018encoder; long2015fully], etc.
Pretraining has also been used in various NLP tasks. At the word level, word embedding models [mikolov2013distributed; pennington2014glove; bojanowski2017enriching] capture the semantics of words by leveraging co-occurrence statistics on text corpora and have been used as a fundamental component in many NLP systems. At the sentence level, pretraining approaches have been applied to derive sentence representations [kiros2015skip; le2014distributed]. Recently, contextualized word embeddings [DBLP:conf/naacl/PetersNIGCLZ18; DBLP:journals/corr/abs181004805] have been proposed, which pretrain a text encoder on large corpora with a language model objective to better encode words and their context. This approach has reached state-of-the-art performance on multiple language understanding tasks in the GLUE benchmark [DBLP:conf/emnlp/WangSMHLB18] and on a question answering dataset [DBLP:conf/emnlp/RajpurkarZLL16].
3 Approach
In this section, we first provide an overview of the proposed pretraining framework. Then, we introduce three self-supervised tasks through which different types of structural graph information can be captured by the designed GNN. Finally, we discuss how to pretrain the GNN model on synthetic graphs and how to adapt the pretrained GNN to downstream tasks.
3.1 Pretraining Framework
Since the goal of pretraining is to learn a good feature extractor and parameter initialization from pretraining tasks for downstream tasks, the model architecture and input attributes have to be consistent across these two phases. Based on this requirement, we design a pretraining framework, as summarized in Figure 1.
We consider the following encoder-decoder framework. Given a graph G = (V, E), where V and E denote the sets of nodes and edges, let A denote the adjacency matrix, which is normally sparse. We encode each node in the graph with a multi-layer GNN encoder into a set of representation vectors H. These node representations are then fed into task-specific decoders to perform the downstream tasks. If a large collection of labeled data is provided, the encoder and the task-specific decoder can be trained jointly in an end-to-end manner. However, the labeled data for some downstream tasks are often scarce in practice, so the performance of end-to-end training may be suboptimal. In this paper, we propose to first pretrain the encoder on self-supervised tasks, which are designed to facilitate the encoder in capturing generic graph structures. Note that the node representations output by an L-layer GNN utilize information from the L-hop neighborhood context. Thus, if pretrained well, the encoder with its learned weights can capture high-level graph structure information that cannot easily be represented by local features. We then cascade the resulting encoder with a downstream decoder and fine-tune the model. We discuss the input representation and model architecture as follows:
Input Representation
As our goal is to enable GNNs to learn generalizable graph structural features, the input to the model should be generic to all graphs. Thus, we select four node-level features that can be computed efficiently: (1) Degree, which defines the local importance of a node; (2) Core Number, which defines the connectivity of a node's subgroup; (3) Collective Influence, which defines the importance of a node's neighborhood; and (4) Local Clustering Coefficient, which defines the connectivity of a node's 1-hop neighborhood (detailed descriptions and time complexities of these features are in the Appendix).
To generalize these features to graphs of different sizes, we apply min-max normalization to each of them except the Local Clustering Coefficient (which is already normalized). We then concatenate the four features and cascade them with a non-linear transformation to encode the local features into a d-dimensional embedding vector, where d is the input dimension to the GNN.
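These four input features can be computed with standard graph libraries. The sketch below is our illustration, not the paper's implementation; in particular, we use a simple 1-hop variant of collective influence, CI(v) = (deg(v) - 1) Σ_{u∈N(v)} (deg(u) - 1), whereas the general definition allows a larger ball radius.

```python
import networkx as nx
import numpy as np

def structural_features(G):
    """Degree, core number, collective influence (1-hop variant), and local
    clustering coefficient for each node; the first three are min-max normalized."""
    nodes = list(G.nodes())
    core = nx.core_number(G)   # dict: node -> core number
    clust = nx.clustering(G)   # dict: node -> local clustering coefficient
    deg = np.array([G.degree(v) for v in nodes], dtype=float)
    core_a = np.array([core[v] for v in nodes], dtype=float)
    ci = np.array([(G.degree(v) - 1) * sum(G.degree(u) - 1 for u in G.neighbors(v))
                   for v in nodes], dtype=float)
    cc = np.array([clust[v] for v in nodes], dtype=float)

    def minmax(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    # clustering coefficient is already in [0, 1], so it stays unnormalized
    return np.stack([minmax(deg), minmax(core_a), minmax(ci), cc], axis=1)
```

The resulting (n, 4) matrix would then be passed through the non-linear transformation described above to produce the d-dimensional GNN input.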
GNN Model Architecture
As discussed above, our GNN architecture should be (1) powerful enough to capture graph structural features at both the local and the global level, and (2) general enough that it can be pretrained and transferred. Based on these conditions, we design a modified Graph Convolutional Network (GCN) operator as the basic building block and stack such blocks to construct the proposed architecture (denoted as GNN throughout this paper). Similar to GCNs, our architecture uses a fixed convolutional filter, as this makes the model easier to transfer. Mathematically, our modified GCN block is defined as

H^{(l+1)} = \hat{L} \, \mathrm{BN}\big( \sigma( \sigma( H^{(l)} W_1^{(l)} ) W_2^{(l)} ) \big), \qquad \hat{L} = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2},   (1)

where H^{(l)} is the input of the l-th hidden layer of the GNN, i.e., the output of the (l-1)-th hidden layer, and \hat{L} is the normalized Laplacian matrix of the graph, serving as the convolutional filter. \hat{A} = A + I is the adjacency matrix with self-loops, which serves as a skip-connection, and \hat{D} is a diagonal matrix with \hat{D}_{ii} = \sum_j \hat{A}_{ij}, representing the degree of each node. W_1^{(l)} and W_2^{(l)} are the trainable weight matrices of the l-th hidden layer, and \sigma is the activation function. We choose two weight matrices because [DBLP:journals/corr/abs181000826] points out that a multi-layer perceptron can help GNNs learn better representations. \mathrm{BN} is the batch normalization [DBLP:conf/icml/IoffeS15] operation, which helps convergence. In summary, for any given node, a GNN layer transforms each node by a two-layer perceptron with weights W_1^{(l)}, W_2^{(l)}, applies the normalization \mathrm{BN}, and then aggregates the transformed embeddings of its neighbors with the weights specified in \hat{L}.
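This block can be sketched in a few lines of numpy. This is our illustration, not the authors' code: batch normalization is reduced to per-feature standardization without learned affine parameters or running statistics, and ReLU stands in for the unspecified activation.

```python
import numpy as np

def gcn_block(H, A, W1, W2, eps=1e-5):
    """One modified GCN block: two-layer MLP, batch norm, then aggregation
    with the normalized Laplacian built from the self-loop adjacency."""
    A_hat = A + np.eye(A.shape[0])            # adjacency with self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_hat = D_inv_sqrt @ A_hat @ D_inv_sqrt   # fixed convolutional filter

    relu = lambda x: np.maximum(x, 0.0)
    Z = relu(relu(H @ W1) @ W2)               # two-layer perceptron
    Z = (Z - Z.mean(axis=0)) / np.sqrt(Z.var(axis=0) + eps)  # batch norm (no affine)
    return L_hat @ Z                          # neighborhood aggregation
```

A real implementation would use sparse matrix products and a trained batch-norm layer, but the dataflow is the same.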
To conduct different tasks using the output of the GNN, we need to extract a representation vector z_v to serve as each task's input. As prior studies [DBLP:conf/naacl/PetersNIGCLZ18] observe that different tasks require information from different layers, we take z_v to be a linear combination of the node representations h_v^{(l)} from the different GNN layers, so that different tasks can utilize different perspectives of the structural information:

z_v = \gamma \sum_{l=0}^{L} \alpha_l \, h_v^{(l)}, \qquad \alpha = \mathrm{softmax}(s),   (2)

where \gamma is a scaling scalar for ease of optimization, and \alpha is a softmax-normalized vector obtained from the parameters s, assigning different attention to different layers. The obtained z_v is then fed into task-specific decoders to conduct the different tasks. Different from [DBLP:conf/naacl/PetersNIGCLZ18], which considers only a single pretraining task, we have multiple pretraining tasks that focus on different aspects of graph structure. Therefore, we use the same weighted-combination architecture but maintain a different set of \gamma and s for each pretraining task, which are also learned.
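The layer mixing of Eq. (2) is a small computation on its own; a numpy sketch (our illustration, with `s` and `gamma` standing for the task-specific parameters):

```python
import numpy as np

def combine_layers(layer_outputs, s, gamma):
    """ELMo-style mixing: softmax weights over per-layer node representations.

    layer_outputs: list of (n, d) arrays, one per GNN layer (including layer 0);
    s: raw per-layer logits; gamma: scalar scale."""
    alpha = np.exp(s - np.max(s))
    alpha = alpha / alpha.sum()                    # softmax-normalized weights
    mixed = sum(a * h for a, h in zip(alpha, layer_outputs))
    return gamma * mixed
```

In training, `s` and `gamma` would be learned per pretraining task while the layer outputs come from the shared GNN encoder.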
3.2 Self-supervised Pretraining Tasks
As shown in [newman2010networks], graphs have a wide spectrum of structural properties, ranging from nodes and edges to subgraphs. Based on this, we design three self-supervised tasks that focus on different perspectives of graph structure to pretrain the GNN model. In the following, we use z_v^t to denote the representation of node v for the task indexed by t. Note that for each task, z_v^t shares the same GCN parameters, but has its own scaling parameter \gamma_t and mixing parameters s_t.
Task 1: Denoising Link Reconstruction
A good feature extractor should be able to restore links even if they are removed from the given graph. Based on this intuition, we propose denoising link reconstruction (t = 1). Specifically, we add noise to an input graph G and obtain its noised version \tilde{G} by randomly removing a fraction of the existing edges. The GNN model takes the noised graph as input and learns to represent it as node embeddings z_v^1. The learned representations are then passed to a neural tensor network (NTN) [DBLP:conf/nips/SocherCMN13] pairwise decoder \mathcal{D}_1, which predicts whether two nodes u and v are connected. Both the encoder and decoder are optimized jointly with an empirical binary cross-entropy loss:

\mathcal{L}_1 = - \sum_{u, v} \big[ A_{uv} \log \mathcal{D}_1(z_u^1, z_v^1) + (1 - A_{uv}) \log\big(1 - \mathcal{D}_1(z_u^1, z_v^1)\big) \big].
In this way, the pretrained GNN is able to learn a robust representation of the input data [DBLP:journals/jmlr/VincentLLBM10], which is especially useful for incomplete or noisy graphs. Since we want this capacity to handle noisy graphs to be maintained in the other two tasks, they also take the noised graph as input.
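The mechanics of this task can be sketched as follows. This is our simplified illustration: random edge dropping plus a plain bilinear scorer sigmoid(h_u^T M h_v) with binary cross-entropy, whereas the paper uses a full neural tensor network decoder.

```python
import random
import numpy as np

def drop_edges(edges, frac, rng):
    """Randomly remove `frac` of the edges; returns (kept, removed)."""
    edges = list(edges)
    rng.shuffle(edges)
    k = int(len(edges) * frac)
    return edges[k:], edges[:k]

def link_bce_loss(H, pos_pairs, neg_pairs, M):
    """Binary cross-entropy over bilinear link scores sigmoid(h_u^T M h_v)."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = 0.0
    for (u, v) in pos_pairs:                      # removed (true) edges
        loss -= np.log(sig(H[u] @ M @ H[v]) + 1e-12)
    for (u, v) in neg_pairs:                      # sampled non-edges
        loss -= np.log(1.0 - sig(H[u] @ M @ H[v]) + 1e-12)
    return loss / max(len(pos_pairs) + len(neg_pairs), 1)
```

In the actual framework, `H` would be the GNN embeddings of the noised graph, the removed edges serve as positive pairs, and `M` would be replaced by the NTN's tensor parameters.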
Task 2: Centrality Score Ranking
Node centrality is an important metric for graphs [10.2307/2780000]. It measures the importance of nodes based on their structural roles in the whole graph. Based on this, we propose to pretrain GNNs to rank nodes by their centrality scores (t = 2), so that the pretrained GNNs can capture the structural role of each node. Specifically, four centrality scores are used:

Eigen-centrality [eigen] measures a node's influence based on the idea that high-score nodes contribute more to their neighbors. It describes a node's 'hub' role in the whole graph.

Betweenness [freeman1977set] measures the number of times a node lies on the shortest paths between other nodes. It describes a node's 'bridge' role in the whole graph.

Closeness [closeness] measures the total length of the shortest paths between a node and all others. It describes a node's 'broadcaster' role in the whole graph.

Subgraph Centrality [estrada2005subgraph] measures the participation of each node in all subgraphs of the network (as a sum of closed walks). It describes a node's 'motif' role in the whole graph.
The above four centrality scores concentrate on different roles of a node in the whole graph. Thus, a GNN that can estimate these scores is able to preserve multi-perspective global information about the graph. Since centrality scores are not comparable across graphs of different scales, we resort to ranking the relative order of nodes within the same graph with respect to each of the four centrality scores. For a node pair (u, v) and a centrality score s, with relative order denoted by u \succ_s v, a decoder \mathcal{D}_s for centrality score s estimates the rank score r_v = \mathcal{D}_s(z_v^2). Following the pairwise ranking setting defined in [burges2005learning], the probability of the estimated rank order is defined as \hat{P}(u \succ_s v) = \mathrm{sigmoid}(r_u - r_v). We optimize the encoder and \mathcal{D}_s for each centrality score by:

\mathcal{L}_2 = - \sum_{s} \sum_{u \succ_s v} \log \hat{P}(u \succ_s v).
In this way, the pretrained GNNs can learn the global roles of each node in the whole graph.
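All four centrality scores are available in standard graph libraries, and the pairwise objective is a RankNet-style cross-entropy. A sketch (our illustration; the function and variable names are ours):

```python
import networkx as nx
import numpy as np

# the four centrality scores used as ranking targets
CENTRALITIES = {
    "eigen": nx.eigenvector_centrality_numpy,
    "betweenness": nx.betweenness_centrality,
    "closeness": nx.closeness_centrality,
    "subgraph": nx.subgraph_centrality,
}

def pairwise_rank_loss(r_u, r_v, u_outranks_v):
    """RankNet-style loss on P(u > v) = sigmoid(r_u - r_v)."""
    p = 1.0 / (1.0 + np.exp(-(r_u - r_v)))
    return -np.log(p + 1e-12) if u_outranks_v else -np.log(1.0 - p + 1e-12)
```

During pretraining, the ground-truth order of a node pair comes from the centrality values of the (synthetic) graph, while `r_u`, `r_v` come from the task decoder on the GNN embeddings.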
Task 3: Cluster Preserving
One important characteristics of real graphs is the cluster structure (DBLP:journals/csr/Schaeffer07, ), meaning that nodes in a graph have denser connections within clusters than inter clusters. We suppose the nodes in a graph is grouped into different nonoverlapping clusters , where and . We also suppose there is an indicator function to tell which cluster a given node v belongs to. Based on , we then pretrain GNNs to learn node representations so that the cluster information is preserved (i.e., ). First, we use an attentionbased aggregator to get a cluster representation by . Then, a neural tensor network (NTN) DBLP:conf/nips/SocherCMN13 () decoder estimates the similarity of node with cluster by: . We then estimate the probability that belongs to by . Finally, we optimize and by: .
The pretrained GNN thus learns to embed the nodes of a graph into a representation space that preserves the cluster information.
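A minimal sketch of this objective, with uniform mean-pooling standing in for the attention-based aggregator and a dot product standing in for the NTN similarity (both simplifications are ours):

```python
import numpy as np

def cluster_loss(H, assign, n_clusters):
    """Softmax cross-entropy of each node against its true cluster.

    H: (n, d) node embeddings; assign: length-n array of cluster ids."""
    assign = np.asarray(assign)
    # cluster representations: mean-pool member embeddings (paper: attention pooling)
    C = np.stack([H[assign == k].mean(axis=0) for k in range(n_clusters)])
    logits = H @ C.T                        # node-vs-cluster similarity scores
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)    # softmax over clusters
    return -np.mean(np.log(p[np.arange(len(assign)), assign] + 1e-12))
```

Embeddings that separate the clusters give a low loss, while embeddings (or assignments) that mix them do not, which is exactly the pressure this task puts on the encoder.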
3.3 Pretraining Procedure
As stated in the introduction, we wish to pretrain on synthetically generated graphs that cover as wide a range of graph properties as possible. Therefore, we choose the degree-corrected stochastic block model (DCBM) [holland1983stochastic], a well-known graph generation model that produces networks with an underlying cluster structure and a controlled degree distribution. Specifically, in the DCBM, we assume there exist K non-overlapping clusters, randomly assign each node to a cluster, and obtain an indicator function I(·) telling which cluster a given node belongs to. We then sample a symmetric probability matrix Q, where Q_{ij} denotes the probability that a node in cluster i connects to a node in cluster j, and we assume nodes in the same cluster have a higher probability of being connected than nodes in different clusters. Next, the DCBM uses degree-correction factors \theta to control the degree distribution of each node. Since most real-world graphs follow a power-law degree distribution, we assume the \theta values are sampled from a power-law distribution. Finally, we generate the network by sampling the adjacency matrix as A_{uv} \sim \mathrm{Bernoulli}(\theta_u \theta_v Q_{I(u), I(v)}). Note that we have five hyperparameters controlling the generated graph: the number of nodes, the total number of clusters K, the cluster densities given by Q, and two parameters controlling the degree distribution. Thus, by enumerating these parameters, we can generate a wide range of graphs with various properties, as shown in Figure 2.
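The generation procedure above can be sketched directly with numpy. This is our illustration; the particular block matrix, the Pareto-based power-law degree factors, and the default parameter values are our assumptions, not the paper's settings.

```python
import numpy as np

def sample_dcbm(n, k, p_in=0.3, p_out=0.05, gamma=2.5, rng=None):
    """Sample an undirected DCBM graph: (adjacency matrix, cluster assignment)."""
    rng = rng or np.random.default_rng()
    c = rng.integers(0, k, size=n)                  # random cluster assignment
    # within-cluster probability p_in > cross-cluster probability p_out
    Q = np.full((k, k), p_out) + np.eye(k) * (p_in - p_out)
    theta = rng.pareto(gamma - 1.0, size=n) + 1.0   # heavy-tailed degree factors
    theta = theta / theta.max()                     # keep edge probabilities in [0, 1]
    P = np.outer(theta, theta) * Q[np.ix_(c, c)]    # P(A_uv = 1)
    A = (rng.random((n, n)) < P).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                                     # symmetric, no self-loops
    return A, c
```

Sweeping `n`, `k`, `p_in`, `p_out`, and the tail exponent plays the role of enumerating the five hyperparameters described above.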
3.4 Adaptation Procedure
In the literature [DBLP:journals/corr/abs190305987], there are two widely used methods for adapting a pretrained model to downstream tasks: feature-based and fine-tuning. Both have their pros and cons. Inspired by them, in this paper we combine the two methods into a more flexible scheme. As shown in Figure 1, after the weighted combination, we can choose a fix-tune boundary in the middle of the GNN. The GNN blocks below this boundary are fixed, while those above the boundary are fine-tuned. Thus, for tasks closely related to our pretraining tasks, we choose a higher boundary and reduce the number of parameters to be fine-tuned, leaving more computation capacity for the downstream decoder. For tasks less related to our pretraining tasks, we choose a lower boundary so that the pretrained GNN can adjust to these tasks and learn task-specific representations.
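The fix-tune boundary itself is a one-line rule: freeze everything below the boundary, fine-tune everything above. A framework-agnostic sketch (ours), where each layer is represented as a dict with a `trainable` flag:

```python
def apply_fixtune_boundary(layers, boundary):
    """Freeze layers [0, boundary); mark layers [boundary, len) for fine-tuning."""
    for i, layer in enumerate(layers):
        layer["trainable"] = (i >= boundary)
    return layers
```

In a deep-learning framework this would correspond to disabling gradient updates for the frozen blocks' parameters; sweeping `boundary` recovers the pure feature-based (all frozen) and pure fine-tuning (none frozen) extremes.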
4 Experiments
In this section, we investigate to what extent the pretraining framework can benefit downstream tasks. We also conduct ablation studies to analyze our approach.
Experiment Setting
We evaluate the pretraining strategy on tasks at the node, link, and graph levels. For node classification, we use the Cora and Pubmed datasets to classify the subject categories of papers. For link classification, we evaluate on the MovieLens 100K and 1M datasets, which aim to classify a user's rating of a movie. For graph classification, we consider the IMDB-Multi and IMDB-Binary benchmarks to evaluate model performance on classifying the genre of a movie based on the associated graph of actors and actresses.
To pretrain the GNN, we generate 900 synthetic graphs for training and 124 for validation, with node counts ranging between 100 and 2000. The cluster assignments and centralities can be directly obtained or calculated during the graph generation process. At each pretraining step, we sample 32 graphs, mask a fraction of the links, use the masked graphs as input for all three tasks, and use the masked links as labels for the denoising link reconstruction task. The model is optimized with the corresponding pretraining losses using the Adam optimizer. For downstream tasks, we model them with a standard 2-layer GCN cascaded with the pretrained model. In the adaptation phase, unless mentioned otherwise, we fix the embedding transformation and fine-tune all GNN layers. For each setting, we independently run experiments 10 times and report the mean and variance of the performance. All results are evaluated with the micro F1-score.
Table 1: Results on downstream tasks (micro F1-score; numeric values not preserved in this copy). Datasets: Cora, Pubmed (node classification); ML-100K, ML-1M (link classification); IMDB-M, IMDB-B (graph classification).
First block (structure feature only, w/o additional node attributes): Baseline (No Pretrain); Pretrain (All Tasks).
Second block (with additional node attributes; no attributes available for the IMDB datasets): Baseline (Attr. Only); Baseline (No Pretrain + Attr.); Pretrain (All Tasks + Attr.).
Table 2: Ablation over individual pretraining tasks, structure features only (micro F1-score; numeric values not preserved in this copy). Datasets as in Table 1. Rows: Pretrain (Reconstruct); Pretrain (Rank); Pretrain (Cluster); Pretrain (All Tasks).
To what extent can pretraining benefit downstream tasks?
The first experiment verifies whether pretraining can capture generic structural features and improve the performance of downstream models. To this end, we compare our approach with a baseline model whose parameters are initialized randomly. As shown in the first block of Table 1, pretraining outperforms the baseline on all six downstream tasks by 7.7% micro-F1 on average.
Next, we show that pretraining can also improve state-of-the-art models when these models leverage additional node attributes as strong features. To this end, for the pretraining approach, we concatenate the node attributes with the model output and adapt the entire model on the downstream tasks. We compare this model with a baseline that takes only the node attributes as input features, and with a baseline that concatenates the node attributes with a randomly initialized model without pretraining. As shown in the second block of Table 1, both the baseline models and the pretrained model benefit from the additional node attributes compared to the first block, and our pretrained model consistently improves performance even when these strong node features are given.
Ablation Studies
We further conduct ablation experiments to analyze the proposed approach.
First, we analyze which pretraining task benefits a given downstream task the most. We follow the same setting to pretrain the GNN models on each individual task and adapt them to the downstream tasks. The results are shown in Table 2. As expected, different pretraining tasks benefit different downstream tasks according to their properties. For example, node classification benefits the most from the cluster preserving pretraining task, indicating that cluster information is useful for detecting a node's label, while link classification benefits more from the denoising link reconstruction task, as both rely on robust representations of node pairs. Graph classification gains more from centrality score ranking and denoising link reconstruction, as these capture the most significant local patterns. Overall, using all three tasks consistently improves the downstream models.
The fact that different pretraining tasks provide different levels of information about graphs is also revealed by visualizing the attention weights \alpha on each GNN layer. Figure 5 shows that cluster preserving focuses mostly on high-level structural information, while denoising link reconstruction focuses on the mid-level and centrality score ranking on the low-level.
Second, we investigate the adaptation strategy. Existing adaptation techniques range from fixing all layers to fine-tuning all layers of the pretrained model, and it is unclear what the best strategy is in the graph setting. To analyze this, we treat the fix-tune boundary as a hyperparameter and conduct experiments on the Cora dataset. Specifically, given a fix-tune boundary, we fix all GNN layers below it and fine-tune the rest. As a reference, we also show a competitive model in which the parameters of the layers above the boundary are randomly initialized (rather than initialized by the pretrained model). As shown in Figure 5, fixing the embedding transformation and the bottom GNN layer achieves the best performance. Meanwhile, GNNs initialized with pretrained weights outperform random initialization in all cases. This result confirms that the benefits of the pretraining framework come from both the fixed feature extractor and the parameter initialization.
Finally, we analyze how much benefit our pretraining framework provides given different amounts of downstream data. In Figure 5, we show the performance difference between our approach and the baseline for different amounts of training data (represented by the percentage of the training data used) on the Cora dataset. The results show that the benefits of pretraining are consistent; when training data is extremely scarce (e.g., 10%), the improvement is more substantial.
5 Conclusion
In this paper, we propose to pretrain GNNs to learn generic graph structural features using three self-supervised tasks: denoising link reconstruction, centrality score ranking, and cluster preserving. The GNNs can be pretrained purely on synthetic graphs covering a wide range of graph properties, and then be adapted to benefit various downstream tasks. Experiments demonstrate the effectiveness of the proposed pretraining framework on various tasks at the node, link, and graph levels.
References
 [1] Albert-Laszlo Barabasi and Zoltan N Oltvai. Network biology: understanding the cell’s functional organization. Nature Reviews Genetics, 5(2):101, 2004.
 [2] Vladimir Batagelj and Matjaz Zaversnik. An O(m) algorithm for cores decomposition of networks. CoRR, cs.DS/0310049, 2003.
 [3] Austin R Benson, David F Gleich, and Jure Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016.
 [4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
 [5] Phillip Bonacich. Power and centrality: A family of measures. American Journal of Sociology, 92(5):1170–1182, 1987.
 [7] Ed Bullmore and Olaf Sporns. Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience, 2009.
 [8] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
 [9] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
 [10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
 [11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
 [12] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655, 2014.
 [13] Ernesto Estrada and Juan A RodriguezVelazquez. Subgraph centrality in complex networks. Physical Review E, 71(5):056103, 2005.
 [14] Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977.
 [15] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
 [16] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
 [17] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynamics, and function using networkx. In Gaël Varoquaux, Travis Vaught, and Jarrod Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11 – 15, Pasadena, CA USA, 2008.
 [18] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 49 December 2017, Long Beach, CA, USA, pages 1025–1035, 2017.
 [19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask rcnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
 [20] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109–137, 1983.
 [21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 611 July 2015, pages 448–456, 2015.
 [22] Thomas N. Kipf and Max Welling. Semisupervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR17), 2017.
 [23] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skipthought vectors. In Advances in neural information processing systems, pages 3294–3302, 2015.
 [24] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196, 2014.
 [25] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
 [26] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2–7, 2018, pages 3538–3545, 2018.
 [27] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
 [28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 [29] Flaviano Morone and Hernán A. Makse. Influence maximization in complex networks through optimal percolation. CoRR, abs/1506.08326, 2015.
 [30] Mark Newman. Networks: an introduction. Oxford University Press, 2010.
 [31] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
 [32] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.
 [33] Matthew Peters, Sebastian Ruder, and Noah A. Smith. To tune or not to tune? adapting pretrained representations to diverse tasks. CoRR, abs/1903.05987, 2019.
 [34] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1–6, 2018, Volume 1 (Long Papers), pages 2227–2237, 2018.
 [35] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1–4, 2016, pages 2383–2392, 2016.
 [36] Gert Sabidussi. The centrality index of a graph. Psychometrika, pages 581–603, 1966.
 [37] Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, 2007.
 [38] Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings, pages 593–607, 2018.
 [39] Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5–8, 2013, Lake Tahoe, Nevada, United States, pages 926–934, 2013.
 [40] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
 [41] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings, 2018.
 [42] Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. Deep graph infomax. In 7th International Conference on Learning Representations (ICLR’19), 2019.
 [43] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
 [44] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1, 2018, pages 353–355, 2018.
 [45] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 1998.
 [46] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In 7th International Conference on Learning Representations, ICLR 2019, abs/1810.00826, 2019.
 [47] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19–23, 2018, pages 974–983, 2018.
 [48] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
 [49] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.
Appendix A Implementation Details about Pretraining Framework
In this section, we provide implementation details of each component of our proposed pretraining framework, for ease of reproducibility.
A.1 Details about the Four Local Node Features
As discussed before, the input to the model should be generic to all graphs and computationally efficient to calculate. Therefore, we choose four node-level graph properties. We list their descriptions and the time complexity to calculate them for all nodes in the graph as follows:

Degree of a node $v$ is the number of edges linked to $v$. It is the most basic node structural property in graph theory and can represent a node's local importance. It is also a frequently used feature for existing GNNs when there are no node attributes. The time complexity to calculate the degree of all nodes is $O(|E|)$.

Core Number [2] defines the connectivity of a node's subgroup. The core number of a node $v$ is the largest $k$ such that a $k$-core contains $v$, where a $k$-core is a maximal subgraph whose nodes all have degree at least $k$. Mathematically, a subgraph $G_S$ induced by a subset of nodes $S$ is a $k$-core iff $\forall u \in S: \deg_{G_S}(u) \geq k$. After computing the degrees, the time complexity to find all $k$-cores and assign a core number to each node is $O(|E|)$.

Collective Influence [29] ($CI$) defines the importance of a node's neighborhood. The $CI$ of a node $v$ in its $\ell$-hop neighborhood is defined as $CI_{\ell}(v) = (\deg(v) - 1) \sum_{u \in \partial B(v, \ell)} (\deg(u) - 1)$, where $\partial B(v, \ell)$ denotes the frontier of the $\ell$-hop neighborhood of $v$. In our experiment, we only calculate the 1-hop Collective Influence ($\ell = 1$) for computational efficiency. After computing the degrees, the time complexity of computing 1-hop Collective Influence for all nodes is $O(|E|)$.

Local Clustering Coefficient [45] ($C$) of a node $v$ is the fraction of closed triplets among all the triplets containing $v$ within its 1-hop neighborhood: $C(v) = \frac{2\,|\{(u, w) : u, w \in N(v), (u, w) \in E\}|}{\deg(v)(\deg(v) - 1)}$. The time complexity to calculate $C(v)$ for each node is $O(\deg(v)^2)$.
Combining these four features, we can quickly capture different local structural properties of each node. Except for Collective Influence, which we implemented on our own, the other three features are calculated using the networkx package [17]. After calculating them, we concatenate them into a 4-dim feature vector for each node, followed by a linear transformation and a non-linear activation to get the local embedding of each node.
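As a minimal sketch, the four raw features described above can be computed with networkx plus a hand-rolled 1-hop Collective Influence; the function name and output layout are illustrative, not from the paper:

```python
import networkx as nx

def local_node_features(G):
    """Compute the four local structural features for every node.

    Degree, core number, and local clustering coefficient come from
    networkx; 1-hop Collective Influence is implemented directly as
    CI(v) = (deg(v) - 1) * sum_{u in N(v)} (deg(u) - 1).
    Returns a dict: node -> [degree, core_number, CI, clustering].
    """
    deg = dict(G.degree())
    core = nx.core_number(G)
    clust = nx.clustering(G)
    ci = {
        v: (deg[v] - 1) * sum(deg[u] - 1 for u in G.neighbors(v))
        for v in G.nodes()
    }
    return {v: [deg[v], core[v], ci[v], clust[v]] for v in G.nodes()}
```

The resulting 4-dim vectors would then be fed through the linear and non-linear transformations described above.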
A.2 Details about the GNN Model Architecture
As we need to combine the outputs of all GCN blocks, the dimensions of these output vectors must stay unchanged. Therefore, both weight matrices in each block are $512 \times 512$, so that the dimension of the hidden vector at each layer is always 512.
Based on the observation reported in [26], after applying multiple GCN blocks, the node representations become closer to each other, and thus the variance of the output node embeddings shrinks. This accords with the fact that the high-order Laplacian matrix converges to a stationary distribution, so that the outputs of a deep multi-layer GCN become nearly identical. To avoid this problem, we add batch normalization over all the node embeddings within one batch, so that we can concentrate on the "difference" part of each node instead of the "average" part of the whole graph. As the full-batch version of GCN training loads all the nodes in the graph, it effectively normalizes over all the nodes. Suppose $h_v$ is the embedding of node $v$; the normalization operation is as follows:
$$\mathrm{BN}(h_v) = \gamma \odot \frac{h_v - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \quad \text{where } \mu = \frac{1}{|V|}\sum_{u \in V} h_u, \;\; \sigma^2 = \frac{1}{|V|}\sum_{u \in V} (h_u - \mu)^2 \qquad (3)$$
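A minimal numpy sketch of the per-dimension normalization over node embeddings in Eq. (3); treating gamma and beta as optional fixed arrays (rather than learnable parameters) is a simplifying assumption:

```python
import numpy as np

def node_batch_norm(H, gamma=None, beta=None, eps=1e-5):
    """Normalize node embeddings H (num_nodes x dim) across all nodes
    in the batch, per feature dimension, as in Eq. (3)."""
    mu = H.mean(axis=0)
    var = H.var(axis=0)
    H_hat = (H - mu) / np.sqrt(var + eps)
    if gamma is not None:
        H_hat = gamma * H_hat
    if beta is not None:
        H_hat = H_hat + beta
    return H_hat
```

In a full GCN implementation this would be applied after each block, so that every layer's output has roughly zero mean and unit variance across the nodes.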
A.3 Details about the Four Node Centrality Definitions

Eigenvector centrality of a node is calculated based on the centrality of its neighbors. The eigenvector centrality of node $i$ is the $i$-th element of the vector $x$ defined by the equation
$$A x = \lambda x, \qquad (4)$$
where $A$ is the adjacency matrix of the graph and $\lambda$ is an eigenvalue. By virtue of the Perron–Frobenius theorem, there is a unique solution $x$, all of whose entries are positive, if $\lambda$ is the largest eigenvalue of the adjacency matrix [30]. The time complexity of computing eigenvector centrality via power iteration is $O(|E|)$ per iteration.

Betweenness Centrality of a node $v$ is defined as:
$$C_B(v) = \frac{2}{(n-1)(n-2)} \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}, \qquad (5)$$
where $n$ denotes the number of nodes in the network, $\sigma_{st}$ denotes the number of shortest paths from $s$ to $t$, and $\sigma_{st}(v)$ denotes the number of shortest paths from $s$ to $t$ that pass through $v$. The time complexity of betweenness centrality is $O(|V||E|)$.

Closeness Centrality of a node $v$ is defined as:
$$C_C(v) = \frac{n - 1}{\sum_{u \neq v} d(u, v)}, \qquad (6)$$
where $d(u, v)$ is the distance between nodes $u$ and $v$. The time complexity of closeness centrality is $O(|V||E|)$.

Subgraph Centrality of a node $v$ is the sum of weighted closed walks of all lengths starting and ending at node $v$ [13]. The weights decrease with path length, and each closed walk is associated with a connected subgraph. It is defined as:
$$C_S(v) = \sum_{j=1}^{n} (u_j^v)^2 e^{\lambda_j}, \qquad (7)$$
where $u_j^v$ is the $v$-th element of the $j$-th eigenvector $u_j$ of the adjacency matrix corresponding to the eigenvalue $\lambda_j$. The time complexity of subgraph centrality is $O(|V|^3)$ due to the full eigendecomposition.
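All four centrality metrics above are available in networkx, which offers one way to produce the ranking targets; the wrapper below is an illustrative sketch, and the `max_iter` value is an assumption for convergence on small graphs:

```python
import networkx as nx

def centrality_scores(G):
    """Compute the four centrality metrics used as ranking targets.

    Each networkx call returns a dict mapping node -> score, which
    can then be turned into pairwise ranking labels.
    """
    return {
        "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
        "betweenness": nx.betweenness_centrality(G),
        "closeness": nx.closeness_centrality(G),
        "subgraph": nx.subgraph_centrality(G),
    }
```

Since only the relative order of scores matters for Centrality Score Ranking, the raw values can be compared pairwise without any normalization.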
A.4 Details about the Decoders for the Three Tasks
Both the decoders for Denoising Link Reconstruction and Cluster Preserving are implemented by a pairwise Neural Tensor Network (NTN) [39]. Mathematically:
$$D(e_1, e_2) = \sigma\!\left(e_1^{\top} W^{[1:k]} e_2 + V \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} + b\right), \qquad (8)$$
where $e_1$ and $e_2$ are the two representation vectors of the pair input, each of dimension $d$ (in our case, node embeddings from the GNN output or cluster embeddings that aggregate multiple node embeddings), and the output of $D(e_1, e_2)$ is a $k$-dim vector. The term $e_1^{\top} W^{[1:k]} e_2$ is a bilinear tensor product that results in a $k$-dim vector modeling the interaction of $e_1$ and $e_2$. Each entry of this vector is computed by one slice of the tensor $W^{[1:k]} \in \mathbb{R}^{d \times d \times k}$, i.e., $(e_1^{\top} W^{[1:k]} e_2)_i = e_1^{\top} W^{[i]} e_2$, where $W^{[i]} \in \mathbb{R}^{d \times d}$. The second part is a linear transformation of the concatenation of $e_1$ and $e_2$ with $V \in \mathbb{R}^{k \times 2d}$, plus a bias term $b \in \mathbb{R}^{k}$.
After getting the $k$-dim vector output, we cascade it with an output layer to conduct classification for Denoising Link Reconstruction and Cluster Preserving.
For Centrality Score Ranking, we simply decode the node embedding with a two-layer MLP to get a scalar output as the rank score for each centrality metric. We then cascade it with a pairwise ranking layer to conduct the task.
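The pairwise NTN scoring in Eq. (8) can be sketched in numpy as follows; the symbol names mirror the equation, while the choice of tanh as the non-linearity follows the original NTN paper and is an assumption here:

```python
import numpy as np

def ntn_score(e1, e2, W, V, b):
    """Pairwise Neural Tensor Network, as in Eq. (8).

    e1, e2 : (d,) input embeddings for the pair
    W      : (k, d, d) bilinear tensor, one (d, d) slice per output dim
    V      : (k, 2d) linear weights over the concatenation [e1; e2]
    b      : (k,) bias
    Returns a k-dim vector; entry i of the bilinear part is e1^T W[i] e2.
    """
    bilinear = np.einsum('i,kij,j->k', e1, W, e2)
    linear = V @ np.concatenate([e1, e2])
    return np.tanh(bilinear + linear + b)
```

A downstream classification layer (for link reconstruction or cluster preserving) would then consume this $k$-dim output.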
A.5 Details about the Pretraining Procedure
As discussed above, we choose the degree-corrected block model (DCBM) [20] to generate networks with underlying cluster structure and a controlled degree distribution. Specifically, we generate the graph with adjacency matrix $A$ by sampling each edge as $A_{uv} \sim \mathrm{Bernoulli}(\theta_u \theta_v P_{c_u c_v})$, where $c_u$ denotes the cluster of node $u$.
There are two parts through which we can control the generation process: the symmetric probability matrix of cluster connectivity $P$, and the degree-correction terms $\theta$, which follow a power-law distribution. For the cluster matrix, we sample the total cluster number randomly from 2 to 10, with a parameter controlling the ratio of the in-cluster probability to the inter-cluster probability, sampled from 3 to 6. For the degree-correction term $\theta$, we sample a scale parameter from 0.1 to 2, which controls the connectivity of the graph: a higher value makes the graph denser, and vice versa. We then sample a power-law exponent from 2 to 10, which controls how closely the degree distribution follows a power law: a higher exponent makes the degree distribution more uniform, and vice versa. Using all these parameters, we generate 1024 graphs with different structural properties, where the first 900 are used for training and the remaining 124 for validation. The details of the pretraining procedure are described in Algorithm 1: we iteratively sample a subset of the graphs, mask some links out, and train the model using the three self-supervised losses.
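The DCBM sampling described above can be sketched as follows; the parameter names and default values are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
import networkx as nx

def sample_dcbm(n=100, num_clusters=4, p_in=0.3, ratio=4.0,
                theta_exponent=2.5, seed=0):
    """Sample a graph from a degree-corrected block model (DCBM).

    Each node gets a cluster label and a degree-correction weight
    theta drawn from a power-law distribution; an edge (u, v) appears
    with probability theta_u * theta_v * P[c_u, c_v], clipped to [0, 1].
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, num_clusters, size=n)
    # Symmetric cluster-connectivity matrix: the in-cluster probability
    # is `ratio` times the inter-cluster probability.
    P = np.full((num_clusters, num_clusters), p_in / ratio)
    np.fill_diagonal(P, p_in)
    # Power-law degree-correction terms, normalized to mean 1.
    theta = rng.pareto(theta_exponent, size=n) + 1.0
    theta /= theta.mean()
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for u in range(n):
        for v in range(u + 1, n):
            p = min(1.0, theta[u] * theta[v] * P[labels[u], labels[v]])
            if rng.random() < p:
                G.add_edge(u, v)
    return G, labels
```

Repeatedly calling this generator with randomly sampled cluster counts, probability ratios, and power-law exponents would yield a pool of synthetic graphs like the 1024 used for pretraining.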
Appendix B Details about Experimental Datasets
The dataset statistics of our three different types of downstream tasks are summarized in Table 3.
Table 3: Dataset statistics.

| Task  | Data    | Type      | Graphs | Nodes  | Edges     | Classes | Node Attributes | Train/Test rate |
| Node  | Cora    | Citation  | 1      | 2,708  | 5,429     | 7       | bag-of-words    | 0.052           |
| Node  | Pubmed  | Citation  | 1      | 19,717 | 44,338    | 3       | bag-of-words    | 0.003           |
| Link  | ML-100K | Bipartite | 1      | 2,625  | 100,000   | 5       | User/Item Tags  | 0.8             |
| Link  | ML-1M   | Bipartite | 1      | 9,940  | 1,000,209 | 5       | User/Item Tags  | 0.8             |
| Graph | IMDB-M  | Ego       | 1500   | 13.00  | 65.94     | 3       | /               | 0.1             |
| Graph | IMDB-B  | Ego       | 1000   | 19.77  | 96.53     | 2       | /               | 0.1             |
Citation networks
We consider two citation networks for the node classification tasks: Cora and Pubmed. In these networks, nodes represent documents, and edges denote citations between documents. Each node contains a sparse bag-of-words feature vector representing the document.
MovieLens (ML) networks
We consider two MovieLens networks for the link classification task: ML-100K and ML-1M, where each node represents a user or a movie, and each edge denotes a rating of a movie made by a user. ML-100K was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from the data set. ML-100K contains 100,000 ratings (1-5) from 943 users on 1682 movies. ML-1M contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000.
IMDB networks
We consider the IMDB networks, a widely used graph classification dataset, for evaluation. Each graph in this dataset corresponds to the ego-network of an actor/actress, where nodes correspond to actors/actresses and an edge is drawn between two actors/actresses if they appear in the same movie. Each graph is derived from a pre-specified genre of movies, and the task is to classify the genre from which the graph is derived. This dataset does not have explicit node features, so the node degree is typically used as the only input feature. IMDB-M denotes the multi-class classification dataset, and IMDB-B denotes the binary classification dataset.