Rethinking Kernel Methods for
Node Representation Learning on Graphs
Abstract
Graph kernels are kernel methods measuring graph similarity and serve as a standard tool for graph classification. However, the use of kernel methods for node classification, which is a related problem to graph representation learning, is still ill-posed and the state-of-the-art methods are heavily based on heuristics. Here, we present a novel theoretical kernel-based framework for node classification that can bridge the gap between these two representation learning problems on graphs. Our approach is motivated by graph kernel methodology but extended to learn the node representations capturing the structural information in a graph. We theoretically show that our formulation is as powerful as any positive semidefinite kernel. To efficiently learn the kernel, we propose a novel mechanism for node feature aggregation and a data-driven similarity metric employed during the training phase. More importantly, our framework is flexible and complementary to other graph-based deep learning models, e.g., Graph Convolutional Networks (GCNs). We empirically evaluate our approach on a number of standard node classification benchmarks, and demonstrate that our model sets the new state of the art. The source code is publicly available at https://github.com/bluer555/KernelGCN.
1 Introduction
Graph structured data, such as citation networks Giles et al. (1998); McCallum et al. (2000); Sen et al. (2008), biological models Gilmer et al. (2017); You et al. (2018), grid-like data Tang et al. (2018); Tian et al. (2018); Zhu et al. (2018) and skeleton-based motion systems Chen et al. (2019); Yan et al. (2018); Zhao et al. (2018, 2019), are abundant in the real world. Therefore, learning to understand graphs is a crucial problem in machine learning. Previous studies in the literature generally fall into two main categories: (1) graph classification Draief et al. (2018); Kipf and Welling (2017); Xu et al. (2019); Zhang et al. (2018b, c), where the whole structure of graphs is captured for similarity comparison; (2) node classification Abu-El-Haija et al. (2018); Kipf and Welling (2017); Veličković et al. (2018); Xu et al. (2018); Zhang et al. (2018a), where the structural identity of nodes is determined for representation learning.
For graph classification, kernel methods, i.e., graph kernels, have become a standard tool Kriege et al. (2017). Given a large collection of graphs, possibly with node and edge attributes, such algorithms aim to learn a kernel function that best captures the similarity between any two graphs. The graph kernel function can be utilized to classify graphs via standard kernel methods such as support vector machines or nearest neighbors. Moreover, recent studies Xu et al. (2019); Zhang et al. (2018b) also demonstrate that there is a close connection between Graph Neural Networks (GNNs) and the Weisfeiler-Lehman graph kernel Shervashidze et al. (2011), and relate GNNs to the classic graph kernel methods for graph classification.
Node classification, on the other hand, is still an ill-posed problem in representation learning on graphs. Although identification of node classes often leverages their features, a more challenging and important scenario is to incorporate the graph structure for classification. Recent efforts in Graph Convolutional Networks (GCNs) Kipf and Welling (2017) have made great progress on node classification. In particular, these efforts broadly follow a recursive neighborhood aggregation scheme to capture structural information, where each node aggregates feature vectors of its neighbors to compute its new features Abu-El-Haija et al. (2018); Xu et al. (2018); Zhang et al. (2018a). Empirically, these GCNs have achieved the state-of-the-art performance on node classification. However, the design of new GCNs is mostly based on empirical intuition, heuristics, and experimental trial-and-error.
In this paper, we propose a novel theoretical framework leveraging kernel methods for node classification. Motivated by graph kernels, our key idea is to decouple the kernel function so that it can be learned from the node class labels on the graph, while its validity and expressive power remain guaranteed. To be specific, this paper makes the following contributions:

We propose a learnable kernel-based framework for node classification. The kernel function is decoupled into a feature mapping function and a base kernel to ensure that it is valid as well as learnable. Then we present a data-driven similarity metric and its corresponding learning criteria for efficient kernel training. The implementation of each component is extensively discussed. An overview of our framework is shown in Fig. 1.

We demonstrate the validity of our learnable kernel function. More importantly, we theoretically show that our formulation is powerful enough to express any valid positive semidefinite kernel.

A novel feature aggregation mechanism for learning node representations is derived from the perspective of kernel smoothing. Compared with GCNs, our model captures the structural information of a node by aggregation in a single step rather than recursively, and is thus more efficient.

We discuss the close connection between the proposed approach and GCNs. We also show that our method is flexible and complementary to GCNs and their variants but more powerful, and can be leveraged as a general framework for future work.
2 Related Work
Graph Kernels. Graph kernels are kernels defined on graphs to capture the graph similarity, which can be used in kernel methods for graph classification. Many graph kernels are instances of the family of convolutional kernels Haussler (1999). Some of them measure the similarity between walks or paths on graphs Borgwardt and Kriegel (2005); Vishwanathan et al. (2010). Other popular kernels are designed based on limited-sized substructures Horváth et al. (2004); Shervashidze et al. (2009); Shervashidze and Borgwardt (2009); Shervashidze et al. (2011). Most graph kernels are employed in models which have learnable components, but the kernels themselves are hand-crafted and motivated by graph theory. Some learnable graph kernels have been proposed recently, such as Deep Graph Kernels Yanardag and Vishwanathan (2015) and Graph Matching Networks Li et al. (2019). In contrast to these approaches, our method targets learning kernels for node representation learning.
Node Representation Learning. Conventional methods for learning node representations largely focus on matrix factorization. They directly adopt classic techniques for dimension reduction Ahmed et al. (2013); Belkin and Niyogi (2002). Other methods are derived from the random walk algorithm Mikolov et al. (2013); Perozzi et al. (2014) or subgraph structures Grover and Leskovec (2016); Tang et al. (2015); Yang et al. (2016); Ribeiro et al. (2017). Recently, Graph Convolutional Networks (GCNs) have emerged as an effective class of models for learning representations of graph structured data. They were introduced in Kipf and Welling (2017) and consist of an iterative process that aggregates and transforms the representation vectors of each node's neighbors to capture structural information. Several variants have since been proposed, which employ self-attention mechanisms Veličković et al. (2018) or improve network architectures Xu et al. (2018); Zhang et al. (2018a) to boost the performance. However, most of them are based on empirical intuition and heuristics.
3 Preliminaries
We begin by summarizing some of the most important concepts about kernel methods as well as representation learning on graphs and, along the way, introduce our notations.
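Kernel Concepts. A kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a function of two arguments: $K(x, y)$ for $x, y \in \mathcal{X}$. The kernel function is symmetric, i.e., $K(x, y) = K(y, x)$, which means it can be interpreted as a measure of similarity. If the Gram matrix $\mathbf{K} \in \mathbb{R}^{N \times N}$ defined by $\mathbf{K}(i, j) = K(x_i, x_j)$ for any $\{x_i\}_{i=1}^{N}$ is positive semidefinite (p.s.d.), then $K$ is a p.s.d. kernel Murphy (2012). If $K(x, y)$ can be represented as $\langle \Psi(x), \Psi(y) \rangle$, where $\Psi$ is a feature mapping function, then $K$ is a valid kernel.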
Graph Kernels. In the graph space $\mathcal{G}$, we denote a graph as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the edge set of $G$. Given two graphs $G_i = (V_i, E_i)$ and $G_j = (V_j, E_j)$ in $\mathcal{G}$, the graph kernel $K_G(G_i, G_j)$ measures the similarity between them. According to the definition in Scholkopf and Smola (2001), the kernel $K_G$ must be p.s.d. and symmetric. The graph kernel between $G_i$ and $G_j$ is defined as:
$$K_G(G_i, G_j) = \sum_{v_i \in V_i} \sum_{v_j \in V_j} k_{\text{base}}\big(f(v_i), f(v_j)\big), \tag{1}$$
where $k_{\text{base}}$ is the base kernel for any pair of nodes in $G_i$ and $G_j$, and $f$ is a function to compute the feature vector associated with each node. However, deriving a new p.s.d. graph kernel is a non-trivial task. Previous methods often implement $k_{\text{base}}$ and $f$ as the dot product between hand-crafted graph heuristics Neuhaus and Bunke (2005); Shervashidze and Borgwardt (2009); Borgwardt and Kriegel (2005). There are few learnable parameters in these approaches.
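As a minimal sketch of the double sum in Eq. (1), the following assumes that each graph is given simply as a list of precomputed node feature vectors and that the base kernel is the dot product. The helper names `dot` and `graph_kernel` and the toy feature values are illustrative assumptions, not part of the original work.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Linear base kernel: the inner product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def graph_kernel(
    feats_i: Sequence[Vector],
    feats_j: Sequence[Vector],
    k_base: Callable[[Vector, Vector], float] = dot,
) -> float:
    """Sum the base kernel over every cross-graph pair of node features."""
    return sum(k_base(fu, fv) for fu in feats_i for fv in feats_j)


# Hypothetical node features f(v) for two small graphs.
G1 = [[1.0, 0.0], [0.0, 1.0]]
G2 = [[1.0, 1.0], [2.0, 0.0], [0.0, 1.0]]
value = graph_kernel(G1, G2)
```

Because the sum runs over all node pairs, the cost grows with the product of the two node counts, which is one reason hand-crafted graph kernels often rely on cheap node features.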
Representation Learning on Graphs. Although graph kernels have been applied to a wide range of applications, most of them depend on hand-crafted heuristics. In contrast, representation learning aims to automatically learn to encode graph structures into low-dimensional embeddings. Formally, given a graph $G = (V, E)$, we follow Hamilton et al. (2017) to define representation learning as an encoder-decoder framework, where we minimize the empirical loss $\mathcal{L}$ over a set of training node pairs $\mathcal{D} \subseteq V \times V$:
$$\mathcal{L} = \sum_{(v_i, v_j) \in \mathcal{D}} \ell\big(\text{ENC-DEC}(v_i, v_j),\, s_G(v_i, v_j)\big). \tag{2}$$
Equation (2) has three methodological components: ENC-DEC, $s_G$ and $\ell$. Most of the previous methods on representation learning can be distinguished by how these components are defined. The detailed meaning of each component is explained as follows.

$\text{ENC-DEC}: V \times V \to \mathbb{R}$ is an encoder-decoder function. It contains an encoder which projects each node into an $M$-dimensional vector to generate the node embedding. This function contains a number of trainable parameters to be optimized during the training phase. It also includes a decoder function, which reconstructs pairwise similarity measurements from the node embeddings generated by the encoder.

$s_G$ is a pairwise similarity function defined over the graph $G$. This function is user-specified, and it is used for measuring the similarity between nodes in $G$.

$\ell$ is a loss function, which is leveraged to train the model. This function evaluates the quality of the pairwise reconstruction between the estimated value $\text{ENC-DEC}(v_i, v_j)$ and the true value $s_G(v_i, v_j)$.
4 Proposed Method: Learning Kernels for Node Representation
Given a graph $G$, as we can see from Eq. (2), the encoder-decoder ENC-DEC aims to approximate the pairwise similarity function $s_G$, which leads to a natural intuition: we can replace ENC-DEC with a kernel function $K_\theta$ parameterized by $\theta$ to measure the similarity between nodes in $G$, i.e.,
$$\mathcal{L} = \sum_{(v_i, v_j) \in \mathcal{D}} \ell\big(K_\theta(v_i, v_j),\, s_G(v_i, v_j)\big). \tag{3}$$
However, there exist two technical challenges: (1) designing a valid p.s.d. kernel which captures the node feature is non-trivial; (2) it is impossible to handcraft a unified kernel to handle all possible graphs with different characteristics Ramon and Gärtner (2003). To tackle these issues, we introduce a novel formulation to replace $K_\theta$. Inspired by the graph kernel as defined in Eq. (1) and the mapping kernel framework Shin and Kuboyama (2008), our key idea is to decouple $K_\theta$ into two components: a base kernel $k_{\text{base}}$ which is p.s.d. to maintain the validity, and a learnable feature mapping function $g_\theta$ to ensure the flexibility of the resulting kernel. Therefore, we rewrite Eq. (3) by $K_\theta(v_i, v_j) = k_{\text{base}}\big(g_\theta(v_i), g_\theta(v_j)\big)$ for $v_i, v_j \in V$ of the graph $G$ to optimize the following objective:
$$\mathcal{L} = \sum_{(v_i, v_j) \in \mathcal{D}} \ell\big(k_{\text{base}}(g_\theta(v_i), g_\theta(v_j)),\, s_G(v_i, v_j)\big). \tag{4}$$
Theorem 1 demonstrates that the proposed formulation, i.e., $K_\theta(v_i, v_j) = k_{\text{base}}\big(g_\theta(v_i), g_\theta(v_j)\big)$, is still a valid p.s.d. kernel for any feature mapping function $g_\theta$ parameterized by $\theta$.
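Theorem 1.
Let $g_\theta: V \to \mathbb{R}^M$ be a function which maps nodes (or their corresponding features) to an $M$-dimensional Euclidean space. Let $k_{\text{base}}: \mathbb{R}^M \times \mathbb{R}^M \to \mathbb{R}$ be any valid p.s.d. kernel. Then, $K_\theta(v_i, v_j) = k_{\text{base}}\big(g_\theta(v_i), g_\theta(v_j)\big)$ is a valid p.s.d. kernel.
Proof.
Let $\Phi$ be the corresponding feature mapping function of the p.s.d. kernel $k_{\text{base}}$. Then, we have $k_{\text{base}}(z_i, z_j) = \langle \Phi(z_i), \Phi(z_j) \rangle$, where $z_i, z_j \in \mathbb{R}^M$. Substitute $g_\theta(v_i), g_\theta(v_j)$ for $z_i, z_j$, and we have $k_{\text{base}}\big(g_\theta(v_i), g_\theta(v_j)\big) = \langle \Phi(g_\theta(v_i)), \Phi(g_\theta(v_j)) \rangle$. Write the new feature mapping as $\Psi(v) = \Phi(g_\theta(v))$, and we immediately have that $K_\theta(v_i, v_j) = \langle \Psi(v_i), \Psi(v_j) \rangle$. Hence, $K_\theta$ is a valid p.s.d. kernel. ∎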
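The statement of Theorem 1 can be sanity-checked numerically. The sketch below composes an RBF base kernel with an arbitrary, hypothetical feature map `g` (a stand-in for the learned mapping, chosen only for illustration) and checks that the resulting Gram matrix is symmetric and that random quadratic forms are non-negative up to rounding. This is an empirical check under these assumptions, not a proof.

```python
import math
import random


def rbf(a, b, gamma=0.5):
    """RBF base kernel, a standard p.s.d. kernel on Euclidean vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def g(x):
    """Hypothetical, arbitrary feature map standing in for the learned mapping."""
    return [math.tanh(x[0] - 2.0 * x[1]), x[0] * x[1]]


def gram(points, kernel):
    """Gram matrix with entries kernel(p_i, p_j)."""
    return [[kernel(p, q) for q in points] for p in points]


def quad_form(K, z):
    """Quadratic form z^T K z, which is >= 0 for every z when K is p.s.d."""
    n = len(z)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))


rng = random.Random(0)
nodes = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(8)]
K = gram([g(x) for x in nodes], rbf)  # K[i][j] = k_base(g(v_i), g(v_j))
min_quad = min(
    quad_form(K, [rng.gauss(0, 1) for _ in nodes]) for _ in range(200)
)
```

Swapping `g` for any other deterministic function leaves the check intact, which mirrors the "for any feature mapping function" part of the theorem.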
A natural follow-up question is whether our proposed formulation is, in principle, powerful enough to express any valid p.s.d. kernel. Our answer, in Theorem 2, is yes: if the base kernel has an invertible feature mapping function, then the resulting kernel is able to model any valid p.s.d. kernel.
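Theorem 2.
Let $K(v_i, v_j)$ be any valid p.s.d. kernel for node pairs $(v_i, v_j) \in V \times V$. Let $k_{\text{base}}$ be a p.s.d. kernel which has an invertible feature mapping function $\Phi$. Then there exists a feature mapping function $g_\theta$, such that $K(v_i, v_j) = k_{\text{base}}\big(g_\theta(v_i), g_\theta(v_j)\big)$.
Proof.
Let $\Psi$ be the corresponding feature mapping function of the p.s.d. kernel $K$, and then we have $K(v_i, v_j) = \langle \Psi(v_i), \Psi(v_j) \rangle$. Similarly, for $z_i, z_j \in \mathbb{R}^M$, we have $k_{\text{base}}(z_i, z_j) = \langle \Phi(z_i), \Phi(z_j) \rangle$. Substitute $g_\theta(v)$ for $z$, and then it is easy to see that $g_\theta(v) = \Phi^{-1}(\Psi(v))$ is the desired feature mapping function when $\Phi^{-1}$ exists. ∎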
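The substitution step in the proof can be spelled out as a short chain of equalities. The symbols here are assumed names: $\Psi$ denotes the feature map of the target kernel $K$, $\Phi$ the invertible feature map of the base kernel, and $g(v) = \Phi^{-1}(\Psi(v))$ the candidate mapping. The chain also assumes that $\Psi(v)$ lies in the image of $\Phi$, so that $\Phi^{-1}(\Psi(v))$ is well defined.

```latex
\begin{align*}
k_{\text{base}}\big(g(v_i), g(v_j)\big)
  &= \big\langle \Phi(g(v_i)),\, \Phi(g(v_j)) \big\rangle \\
  &= \big\langle \Phi\big(\Phi^{-1}(\Psi(v_i))\big),\,
                 \Phi\big(\Phi^{-1}(\Psi(v_j))\big) \big\rangle \\
  &= \big\langle \Psi(v_i),\, \Psi(v_j) \big\rangle
   = K(v_i, v_j).
\end{align*}
```

In the special case where $\Phi$ is the identity (the dot-product base kernel), the candidate mapping reduces to $g = \Psi$.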
4.1 Implementation and Learning Criteria
Theorems 1 and 2 have demonstrated the validity and power of the proposed formulation in Eq. (4). In this section, we discuss how to implement and learn $g_\theta$, $k_{\text{base}}$, and $\ell(K_\theta, s_G)$, respectively.
Implementation of the Feature Mapping Function $g_\theta$. The function $g_\theta$ aims to project the feature vector $x_v$ of each node $v$ into a better space for similarity measurement. Our key idea is that in a graph, connected nodes usually share some similar characteristics, and thus changes between nearby nodes in the latent space of nodes should be smooth. Inspired by the concept of kernel smoothing, we consider $g_\theta$ as a feature smoother which maps $x_v$ into a smoothed latent space according to the graph structure. The kernel smoother estimates a function as the weighted average of neighboring observed data. To be specific, given a node $v \in V$, according to the Nadaraya-Watson kernel-weighted average Friedman et al. (2001), a feature smoothing function is defined as:
$$g(v) = \frac{\sum_{u \in V} k(u, v)\, p(u)}{\sum_{u \in V} k(u, v)}, \tag{5}$$
where $p$ is a mapping function to compute the feature vector of each node, and here we let $p(v) = x_v$; $k$ is a pre-defined kernel function to capture pairwise relations between nodes. Note that we omit $\theta$ for $g$ here since there are no learnable parameters in Eq. (5). In the context of graphs, the natural choice of computing $k$ is to follow the graph structure, i.e., the structural information within the node's $h$-hop neighborhood.
To compute $g$, we let $A$ be the adjacency matrix of the given graph $G$ and $I$ be the identity matrix with the same size. We notice that $I + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is a valid p.s.d. matrix, where $D_{ii} = \sum_j A_{ij}$. Thus we can employ this matrix to define the kernel function $k$. However, in practice, this matrix would lead to numerical instabilities and exploding or vanishing gradients when used for training deep neural networks. To alleviate this problem, we adopt the renormalization trick Kipf and Welling (2017): $I + D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \to \bar{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$, where $\tilde{A} = A + I$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. Then the $h$-hop neighborhood can be computed directly from the $h$-th power of $\bar{A}$, i.e., $\bar{A}^h$. And the kernel for node pairs is computed as $k(v_i, v_j) = \bar{A}^h_{ij}$. After collecting the feature vector of each node into a matrix $X_V$, we rewrite Eq. (5) approximately into its matrix form:
$$g(V) \approx \bar{A}^h X_V. \tag{6}$$
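The renormalization trick and the resulting multi-hop smoothing can be sketched in a few lines of plain Python. This is a minimal illustration that assumes a dense, unweighted adjacency matrix given as a list of lists; the helper names (`matmul`, `renormalized_adjacency`, `smooth`) and the toy graph are assumptions made for the example.

```python
import math


def matmul(A, B):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def renormalized_adjacency(A):
    """Add self-loops to A, then normalize symmetrically by the new degrees."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]  # degrees including the self-loop
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def smooth(A, X, h):
    """Apply the renormalized adjacency to the features X exactly h times."""
    A_bar = renormalized_adjacency(A)
    out = [list(row) for row in X]
    for _ in range(h):
        out = matmul(A_bar, out)
    return out


# Toy path graph 0 - 1 - 2; only node 0 carries a non-zero feature.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0], [0.0], [0.0]]
H1 = smooth(A, X, 1)  # node 2 is two hops from node 0, so it sees nothing yet
H2 = smooth(A, X, 2)  # after two applications node 2 receives a share
```

Applying the matrix to the features repeatedly avoids forming the matrix power explicitly, which is usually cheaper when the feature dimension is small relative to the number of nodes.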
Next, we enhance the expressive power of Eq. (6) to model any valid p.s.d. kernel by implementing it with deep neural networks based on the following two aspects. First, we make use of multi-layer perceptrons (MLPs) to model and learn the composite function $\Phi^{-1} \circ \Psi$ in Theorem 2, thanks to the universal approximation theorem Hornik (1991); Hornik et al. (1989). Second, we add learnable weights to different hops of node neighbors. As a result, our final feature mapping function $g_\theta$ is defined as:
$$g_\theta(V) = \Big(\sum_{h} \omega_h \cdot \big(\bar{A}^h \odot M^{(h)}\big)\Big) \cdot \text{MLP}^{(l)}(X_V), \tag{7}$$
where $\theta$ means the set of parameters in $g_\theta$; $\omega_h$ is a learnable parameter for the $h$-hop neighborhood of each node $v$; $\odot$ is the Hadamard (element-wise) product; $M^{(h)}$ is an indicator matrix where $M^{(h)}_{ij}$ equals 1 if $v_j$ is an $h$-th hop neighbor of $v_i$ and 0 otherwise. The hyperparameter $l$ controls the number of layers in the MLP.
Equation (7) can be interpreted as a weighted feature aggregation schema around the given node and its neighbors, which is employed to compute the node representation. It has a close connection with Graph Neural Networks. We leave it in Section 5 for a more detailed discussion.
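The weighted, single-step aggregation described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes that hop 0 denotes the node itself, that an "h-th hop neighbor" means a node at shortest-path distance exactly h, that the per-hop matrix is the h-th power of the renormalized adjacency masked by the hop indicator, and that `Z` stands in for the MLP output. All helper names and the fixed `hop_weights` are hypothetical; in the paper the weights are learned.

```python
import math
from collections import deque


def matmul(A, B):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def renormalized_adjacency(A):
    """Add self-loops to A, then normalize symmetrically by the new degrees."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def hop_distances(A):
    """All-pairs shortest-path hop counts via BFS (-1 if unreachable)."""
    n = len(A)
    dist = [[-1] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if A[u][v] and dist[s][v] < 0:
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist


def aggregate(A, Z, hop_weights):
    """One-shot aggregation: sum over hops of weight * masked matrix power, times Z."""
    n = len(A)
    A_bar = renormalized_adjacency(A)
    dist = hop_distances(A)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # 0-th power
    S = [[0.0] * n for _ in range(n)]
    for h, w in enumerate(hop_weights):
        for i in range(n):
            for j in range(n):
                if dist[i][j] == h:  # hop indicator: j is exactly h hops from i
                    S[i][j] += w * power[i][j]
        power = matmul(A_bar, power)  # advance to the next matrix power
    return matmul(S, Z)


# Toy path graph 0 - 1 - 2 with illustrative (not learned) hop weights.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
Z = [[1.0], [0.0], [0.0]]
out = aggregate(A, Z, hop_weights=[1.0, 0.5, 0.25])
```

Since the combined matrix `S` depends only on the graph and the hop weights, it is formed once and multiplied with the features in a single step, which is the contrast with recursive aggregation drawn in the text.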
Implementation of the Base Kernel $k_{\text{base}}$. As we have shown in Theorem 2, in order to model an arbitrary p.s.d. kernel, we require that the corresponding feature mapping function $\Phi$ of the base kernel $k_{\text{base}}$ must be invertible, i.e., $\Phi^{-1}$ exists. An obvious choice is to let $\Phi$ be the identity function, in which case $k_{\text{base}}$ reduces to the dot product between nodes in the latent space. Since $g_\theta$ maps node representations to a finite dimensional space, the identity function makes our model directly measure the node similarity in this space. On the other hand, an alternative choice of $k_{\text{base}}$ is the RBF kernel, which additionally projects node representations to an infinite dimensional latent space before comparison. We compare both implementations in the experiments for further evaluation.
Data-Driven Similarity Metric $s_G$ and Criteria $\ell$. In node classification, each node $v_i \in V$ is associated with a class label $y_i$. We aim to measure node similarity with respect to their class labels rather than hand-designed metrics. Naturally, we define the pairwise similarity $s_G$ as:
$$s_G(v_i, v_j) = \begin{cases} 1 & \text{if } y_i = y_j, \\ -1 & \text{otherwise}. \end{cases} \tag{8}$$
However, in practice, it is hard to directly minimize the loss between $K_\theta$ and $s_G$ in Eq. (8). Instead, we consider a "soft" version of $s_G$, where we require that the similarity of node pairs with the same label is greater than those with distinct labels by a margin. Therefore, we train the kernel $K_\theta$ to minimize the following objective function on triplets:
$$\mathcal{L}_K = \sum_{(v_i, v_j, v_k) \in \mathcal{T}} \ell_K\big(K_\theta(v_i, v_j),\, K_\theta(v_i, v_k)\big), \tag{9}$$
where $\mathcal{T} \subseteq V \times V \times V$ is a set of node triplets: $v_i$ is an anchor, and $v_j$ is a positive of the same class as the anchor while $v_k$ is a negative of a different class. The loss function $\ell_K$ is defined as:
$$\ell_K\big(K_\theta(v_i, v_j),\, K_\theta(v_i, v_k)\big) = \big[K_\theta(v_i, v_k) - K_\theta(v_i, v_j) + \alpha\big]_+. \tag{10}$$
It ensures that given two positive nodes of the same class and one negative node, the kernel value of the negative should be farther away than the one of the positive by the margin $\alpha$. Here, we present Theorem 3 and its proof to show that minimizing Eq. (9) leads to $K_\theta = s_G$.
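Theorem 3.
If $|K_\theta(v_i, v_j)| \le 1$ for any $v_i, v_j \in V$, minimizing Eq. (9) with $\alpha = 2$ yields $K_\theta = s_G$.
Proof.
Let $\mathcal{T}$ be all triplets $(v_i, v_j, v_k)$ satisfying $y_i = y_j$, $y_i \ne y_k$. Suppose that for $\alpha = 2$, the loss in Eq. (10) is zero for all $(v_i, v_j, v_k) \in \mathcal{T}$. It means $K_\theta(v_i, v_j) - K_\theta(v_i, v_k) \ge 2$ for all $(v_i, v_j, v_k) \in \mathcal{T}$. As $|K_\theta(v_i, v_j)| \le 1$, we have $K_\theta(v_i, v_j) = 1$ for all $y_i = y_j$ and $K_\theta(v_i, v_k) = -1$ for all $y_i \ne y_k$. Hence, $K_\theta = s_G$. ∎
We note that $|K_\theta(v_i, v_j)| \le 1$ can be simply achieved by letting $k_{\text{base}}$ be the dot product and normalizing all $g_\theta$ to the norm ball. In the following sections, the normalized $K_\theta$ is denoted by $\bar{K}_\theta$.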
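A minimal sketch of the margin-based triplet criterion with a normalized dot-product kernel is given below. It assumes embeddings are plain lists of floats and that normalization means scaling each embedding to unit Euclidean norm; the function names are illustrative, and the default margin of 0.1 follows the value reported in the supplementary material.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def kernel(a, b):
    """Dot product of normalized embeddings, so the value lies in [-1, 1]."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def triplet_loss(anchor, positive, negative, alpha=0.1):
    """Hinge loss: zero once K(anchor, positive) beats K(anchor, negative) by alpha."""
    return max(0.0, kernel(anchor, negative) - kernel(anchor, positive) + alpha)


# Hypothetical embeddings: the positive is aligned with the anchor,
# the first negative is orthogonal, the second points the opposite way.
anchor, positive = [1.0, 0.0], [2.0, 0.0]
orthogonal_negative, opposite_negative = [0.0, 3.0], [-1.0, 0.0]

loss_small_margin = triplet_loss(anchor, positive, orthogonal_negative, alpha=0.1)
loss_margin_two = triplet_loss(anchor, positive, orthogonal_negative, alpha=2.0)
loss_opposite = triplet_loss(anchor, positive, opposite_negative, alpha=2.0)
```

With a margin of 2 the loss vanishes only when the positive pair scores 1 and the negative pair scores -1, which is the situation analysed in Theorem 3; with a small margin an orthogonal negative is already sufficient.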
4.2 Inference for Node Classification
Once the kernel function $K_\theta$ has learned how to measure the similarity between nodes, we can leverage the output of the feature mapping function $g_\theta$ as the node representation for node classification. In this paper, we introduce the following two classifiers.
Nearest Centroid Classifier. The nearest centroid classifier extends the nearest neighbors algorithm by assigning to observations the label of the class of training samples whose centroid is closest to the observation. It does not require additional parameters. To be specific, given a testing node $u$, for all nodes $v_i$ with class label $y$ in the training set, we compute the per-class average similarity $\mu_y$ between $u$ and $v_i$: $\mu_y = \frac{1}{|V_y|} \sum_{v_i \in V_y} K_\theta(u, v_i)$, where $V_y$ is the set of nodes belonging to class $y$. Then the class assigned to the testing node $u$ is:
$$y^*_u = \arg\max_{y} \mu_y. \tag{11}$$
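The per-class average similarity rule can be sketched as below. The example assumes a dot-product kernel on already-computed embeddings; the function name `predict`, the toy embeddings, and the string labels are hypothetical.

```python
from collections import defaultdict


def dot(a, b):
    """Dot-product kernel on embeddings."""
    return sum(x * y for x, y in zip(a, b))


def predict(test_emb, train_embs, train_labels, kernel=dot):
    """Return the class whose training nodes are most similar on average."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for emb, label in zip(train_embs, train_labels):
        sums[label] += kernel(test_emb, emb)
        counts[label] += 1
    # Per-class mean similarity, then arg max over classes.
    return max(sums, key=lambda label: sums[label] / counts[label])


# Hypothetical training embeddings for two classes.
train_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
train_labels = ["a", "a", "b"]
pred_first = predict([0.8, 0.2], train_embs, train_labels)
pred_second = predict([0.1, 1.0], train_embs, train_labels)
```

No parameters are fitted here, which matches the remark that this classifier needs no additional parameters beyond the learned kernel.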
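Softmax Classifier. The idea of the softmax classifier is to reuse the ground truth labels of nodes for training the classifier, so that it can be directly employed for inference. To do this, we add the softmax activation after $g_\theta(v_i)$ to minimize the following objective:
$$\mathcal{L}_Y = -\sum_{v_i \in V} q(y_i) \log\big(\text{softmax}(g_\theta(v_i))\big), \tag{12}$$
where $q(y_i)$ is the one-hot ground truth vector. Note that Eq. (12) is optimized together with Eq. (9) in an end-to-end manner. Let $\Psi$ denote the corresponding feature mapping function of $K_\theta$, then we have $K_\theta(v_i, v_j) = \langle \Psi(v_i), \Psi(v_j) \rangle = k_{\text{base}}\big(g_\theta(v_i), g_\theta(v_j)\big)$. In this case, we use the node feature produced by $\Psi$ for classification since $\Psi$ projects node features into the dot-product space, which is a natural metric for similarity comparison. To this end, the feature mapping of $k_{\text{base}}$ is fixed to be the identity function for the softmax classifier, so that we have $\langle \Psi(v_i), \Psi(v_j) \rangle = \langle g_\theta(v_i), g_\theta(v_j) \rangle$ and thus $\Psi(v_i) = g_\theta(v_i)$.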
5 Discussion
Our feature mapping function $g_\theta$ proposed in Eq. (7) has a close connection with Graph Convolutional Networks (GCNs) Kipf and Welling (2017) in the way of capturing node latent representations. In GCNs and most of their variants, each layer leverages the following aggregation rule:
$$H^{(l+1)} = \sigma\big(\bar{A} H^{(l)} W^{(l)}\big), \tag{13}$$
where $W^{(l)}$ is a layer-specific trainable weighting matrix; $\sigma$ denotes an activation function; $H^{(l)}$ denotes the node features in the $l$-th layer, and $H^{(0)} = X_V$. Through stacking multiple layers, GCNs aggregate the features for each node from its $L$-hop neighbors recursively, where $L$ is the network depth. Compared with the proposed $g_\theta$, GCNs actually interleave two basic operations of $g_\theta$: feature transformation and Nadaraya-Watson kernel-weighted average, and repeat them recursively.
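For comparison, one GCN layer of the standard Kipf and Welling form (renormalized adjacency, times features, times weights, followed by a nonlinearity) can be sketched as follows. ReLU is assumed as the activation, and the identity weight matrix and toy inputs are illustrative only.

```python
import math


def matmul(A, B):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def renormalized_adjacency(A):
    """Add self-loops to A, then normalize symmetrically by the new degrees."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def relu(M):
    """Element-wise ReLU activation."""
    return [[max(0.0, x) for x in row] for row in M]


def gcn_layer(A_bar, H, W):
    """One propagation step: aggregate neighbors, transform features, activate."""
    return relu(matmul(matmul(A_bar, H), W))


# Two connected nodes; the weight matrix is the identity for readability.
A = [[0, 1], [1, 0]]
A_bar = renormalized_adjacency(A)
H0 = [[1.0, -1.0], [3.0, -1.0]]
W0 = [[1.0, 0.0], [0.0, 1.0]]
H1 = gcn_layer(A_bar, H0, W0)
```

Stacking several such calls, each with its own weight matrix, gives the recursive scheme: every additional layer needs the previous layer's full feature matrix, which is the intermediate state the single-step aggregation avoids.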
We contrast our approach with GCNs in terms of the following aspects. First, our aggregation function is derived from the kernel perspective, which is novel. Second, we show that aggregating features in a recursive manner is inessential. Powerful $h$-hop node representations can be obtained by our model where aggregation is performed only once. As a result, our approach is more efficient both in storage and time when handling very large graphs, since no intermediate states of the network have to be kept. Third, our model is flexible and complementary to GCNs: our function $g_\theta$ can be directly replaced by GCNs and other variants, which can be exploited for future work.
Time and Space Complexity. We assume the number of features $F$ is fixed for all layers and both GCNs and our method have $L$ layers. We count matrix multiplications as in Chiang et al. (2019). GCN's time complexity is $O(L\|\bar{A}\|_0 F + LNF^2)$, where $\|\bar{A}\|_0$ is the number of nonzeros of $\bar{A}$ and $N$ is the number of nodes in the graph, while ours is $O(\|\bar{A}^h\|_0 F + LNF^2)$, since we do not aggregate features recursively. The term $\|\bar{A}^h\|_0 F$ is constant in $L$, whereas $L\|\bar{A}\|_0 F$ is linear in $L$. For space complexity, GCNs have to store all the feature matrices for recursive aggregation, which needs $O(LNF + LF^2)$ space, where $LF^2$ is for storing trainable parameters of all layers, and thus the first term is linear in $L$. Instead, ours is $O(NF + LF^2)$, where the first term is again constant in $L$. Our experiments indicate that we save 20% (0.3 ms) time and 15% space on the Cora dataset McCallum et al. (2000) compared with GCNs.
6 Experiments
We evaluate the proposed kernel-based approach on three benchmark datasets: Cora McCallum et al. (2000), Citeseer Giles et al. (1998) and Pubmed Sen et al. (2008). They are citation networks, where the task of node classification is to classify academic papers of the network (graph) into different subjects. These datasets contain bag-of-words features for each document (node) and citation links between documents.
We compare our approach to five state-of-the-art methods: GCN Kipf and Welling (2017), GAT Veličković et al. (2018), FastGCN Chen et al. (2018), JK Xu et al. (2018) and KLED Fouss et al. (2006). KLED is a kernel-based method, while the others are based on deep neural networks. We test all methods in the supervised learning scenario, where all data in the training set are used for training. We evaluate the proposed method in two different experimental settings according to FastGCN Chen et al. (2018) and JK Xu et al. (2018), respectively. The statistics of the datasets together with their data split settings (i.e., the number of samples contained in the training, validation and testing sets, respectively) are summarized in Table 1. Note that there are more training samples in the data split of JK Xu et al. (2018) than in that of FastGCN Chen et al. (2018). We report the means and standard deviations of node classification accuracy, computed from ten runs, as the evaluation metrics.
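Table 1: Statistics of the datasets and their data splits (training / validation / testing).

| Dataset | Nodes | Edges | Classes | Features | Data split of FastGCN Chen et al. (2018) | Data split of JK Xu et al. (2018) |
|---|---|---|---|---|---|---|
| Cora McCallum et al. (2000) | 2,708 | 5,429 | 7 | 1,433 | 1,208 / 500 / 1,000 | 1,624 / 542 / 542 |
| Citeseer Giles et al. (1998) | 3,327 | 4,732 | 6 | 3,703 | 1,827 / 500 / 1,000 | 1,997 / 665 / 665 |
| Pubmed Sen et al. (2008) | 19,717 | 44,338 | 3 | 500 | 18,217 / 500 / 1,000 | - |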
6.1 Variants of the Proposed Method
As we have shown in Section 4.1, there are alternative choices to implement each component of our framework. In this section, we summarize all the variants of our method employed for evaluation.
Choices of the Feature Mapping Function $g_\theta$. We implement the feature mapping function $g_\theta$ according to Eq. (7). In addition, we also choose GCN and GAT as alternative implementations of $g_\theta$ for comparison, referred to below as the GCN-based and GAT-based variants, respectively.
Choices of the Base Kernel $k_{\text{base}}$. The base kernel has two different implementations: the dot product and the RBF kernel. Note that when the softmax classifier is employed, we set the base kernel to be the dot product.
Choices of the Loss $\mathcal{L}$ and Classifier. We consider the following three combinations of the loss function and classifier. (1) The kernel loss $\mathcal{L}_K$ in Eq. (9) is optimized, and the nearest-centroid classifier is employed for classification. This combination aims to evaluate the effectiveness of the learned kernel. (2) The classification loss $\mathcal{L}_Y$ in Eq. (12) is optimized, and the softmax classifier is employed for classification. This combination is used in a baseline without kernel methods. (3) Both Eq. (9) and Eq. (12) are optimized, and we denote this loss by $\mathcal{L}_{K+Y}$. The softmax classifier is employed for classification. This combination aims to evaluate how the learned kernel improves the baseline method.
In the experiments, we distinguish kernel-based variants from those without the kernel function. All these variants are implemented by MLPs with two layers. Due to the space limitation, we ask the readers to refer to the supplementary material for implementation details.
6.2 Results of Node Classification
The means and standard deviations of node classification accuracy (%) following the setting of FastGCN Chen et al. (2018) are organized in Table 2. Our best-performing variant sets the new state of the art on all datasets. On the Pubmed dataset, all our variants improve on previous methods by a large margin. This demonstrates the effectiveness of employing kernel methods for node classification, especially on datasets with large graphs. Interestingly, our non-kernel baseline even achieves state-of-the-art performance, which shows that our feature mapping function can capture more flexible structural information than previous GCN-based approaches. For the choice of the base kernel, we find that the RBF kernel outperforms the dot product on the two larger datasets: Citeseer and Pubmed. We conjecture that when handling complex datasets, a non-linear kernel, e.g., the RBF kernel, is a better choice than the linear kernel.
To evaluate the performance of our feature mapping function, we also report in Table 2 the results of two variants that utilize GCN and GAT as the feature mapping function, respectively. As expected, our $g_\theta$ outperforms the GCN-based and GAT-based variants on most datasets. This demonstrates that the recursive aggregation schema of GCNs is inessential, since the proposed $g_\theta$ aggregates features only in a single step, which is still powerful enough for node classification. On the other hand, it is also observed that both the GCN-based and GAT-based variants outperform their original non-kernel implementations, which shows that learning with kernels yields better node representations.
Table 3 shows the results following the setting of JK Xu et al. (2018). Note that we do not evaluate on Pubmed in this setup since its corresponding data split for training and evaluation is not provided by Xu et al. (2018). As expected, our method achieves the best performance on all datasets, which is consistent with the results in Table 2. For Cora, the improvement of our method is less pronounced. We conjecture that the results in Table 3 involve more training data due to different data splits, which narrows the performance gap between different methods on datasets with small graphs, such as Cora.
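Table 2: Node classification accuracy (%, mean ± standard deviation) following the setting of FastGCN Chen et al. (2018).

| Method | Cora McCallum et al. (2000) | Citeseer Giles et al. (1998) | Pubmed Sen et al. (2008) |
|---|---|---|---|
| KLED Fouss et al. (2006) | 82.3 | - | 82.3 |
| GCN Kipf and Welling (2017) | 86.0 | 77.2 | 86.5 |
| GAT Veličković et al. (2018) | 85.6 | 76.9 | 86.2 |
| FastGCN Chen et al. (2018) | 85.0 | 77.6 | 88.0 |
| | 86.68 ± 0.17 | 77.92 ± 0.25 | 89.22 ± 0.17 |
| | 86.12 ± 0.05 | 78.68 ± 0.38 | 89.36 ± 0.21 |
| | 88.40 ± 0.24 | 80.28 ± 0.03 | 89.42 ± 0.01 |
| | 87.56 ± 0.14 | 79.80 ± 0.03 | 89.24 ± 0.14 |
| | 87.04 ± 0.09 | 77.12 ± 0.23 | 87.84 ± 0.12 |
| | 86.10 ± 0.33 | 77.92 ± 0.19 | - |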
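Table 3: Node classification accuracy (%, mean ± standard deviation) following the setting of JK Xu et al. (2018).

| Method | Cora McCallum et al. (2000) | Citeseer Giles et al. (1998) |
|---|---|---|
| GCN Kipf and Welling (2017) | 88.20 ± 0.70 | 77.30 ± 1.30 |
| GAT Veličković et al. (2018) | 87.70 ± 0.30 | 76.20 ± 0.80 |
| JK-Concat Xu et al. (2018) | 89.10 ± 1.10 | 78.30 ± 0.80 |
| Ours | 89.24 ± 0.31 | 80.78 ± 0.28 |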
6.3 Ablation Study on Node Feature Aggregation Schema
In Table 4, we implement several variants of $g_\theta$ (2-hop and 2-layer with a learnable $\omega_h$ by default) to evaluate the proposed node feature aggregation schema. We answer the following three questions. (1) How does performance change with fewer (or more) hops? We change the number of hops from 1 to 3, and the performance generally improves with more hops, which shows that capturing long-range structures of nodes is important. (2) How many layers of MLP are needed? We show results with different numbers of layers ranging from 1 to 3. The best performance is obtained with two layers, while networks overfit the data when more layers are employed. (3) Is it necessary to have a trainable parameter $\omega_h$? We replace $\omega_h$ with a fixed constant and vary its value. We can see that a larger constant improves the performance. However, all results are worse than learning a weighting parameter $\omega_h$, which shows its importance.
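Table 4: Ablation on the node feature aggregation schema (accuracy %, mean ± standard deviation).

| Variants of $g_\theta$ | Cora McCallum et al. (2000) | Citeseer Giles et al. (1998) | Pubmed Sen et al. (2008) |
|---|---|---|---|
| Default | 88.40 ± 0.24 | 80.28 ± 0.03 | 89.42 ± 0.01 |
| 1-hop | 85.56 ± 0.02 | 77.73 ± 0.02 | 88.98 ± 0.01 |
| 3-hop | 88.25 ± 0.01 | 80.13 ± 0.01 | 89.53 ± 0.01 |
| 1-layer | 82.60 ± 0.01 | 77.63 ± 0.01 | 85.80 ± 0.01 |
| 3-layer | 86.33 ± 0.04 | 78.53 ± 0.20 | 89.46 ± 0.05 |
| | 69.33 ± 0.09 | 74.48 ± 0.03 | 84.68 ± 0.02 |
| | 76.98 ± 0.10 | 77.47 ± 0.04 | 86.45 ± 0.01 |
| | 84.25 ± 0.01 | 77.99 ± 0.01 | 87.45 ± 0.01 |
| | 87.31 ± 0.01 | 78.57 ± 0.01 | 88.68 ± 0.01 |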
6.4 t-SNE Visualization of Node Embeddings
We visualize the node embeddings of GCN, GAT and our method on Citeseer with t-SNE. For our method, we use the embedding of the variant which obtains the best performance. Figure 2 illustrates the results. Compared with other methods, our method produces a more compact clustering result. Specifically, our method clusters the "red" points tightly, while in the results of GCN and GAT, they are loosely scattered into other clusters. This is because both GCN and GAT minimize only the classification loss $\mathcal{L}_Y$, targeting accuracy alone. They tend to learn node embeddings driven by the classes that contain the majority of nodes. In contrast, our model is trained with both $\mathcal{L}_K$ and $\mathcal{L}_Y$. Our kernel-based similarity loss encourages data within the same class to be close to each other. As a result, the learned feature mapping function encourages geometrically compact clusters.
Due to the space limitation, we ask the readers to refer to the supplementary material for more experiment results, such as the results of link prediction and visualization on other datasets.
7 Conclusions
In this paper, we introduce a kernel-based framework for node classification. Motivated by the design of graph kernels, we learn the kernel from ground truth labels by decoupling the kernel function into a base kernel and a learnable feature mapping function. More importantly, we show that our formulation is valid as well as powerful enough to express any p.s.d. kernel. Then the implementation of each component in our approach is extensively discussed. From the perspective of kernel smoothing, we also derive a novel feature mapping function to aggregate features from a node's neighborhood. Furthermore, we show that our formulation is closely connected with GCNs but more powerful. Experiments on standard node classification benchmarks are conducted to evaluate our approach. The results show that our method outperforms the state of the art.
Acknowledgments
This work is funded by ARO-MURI-68985NSMUR and NSF 1763523, 1747778, 1733843, 1703883.
Appendix A Supplementary Materials
A.1 Implementation Details
We use different network settings for the combinations of loss function and inference method described in Section 6.1 of the original paper. For Variant (1), we set the output dimensions of the first and second layers to 512 and 128, respectively. We train this combination for 10 epochs on Cora and Citeseer and for 100 epochs on Pubmed.
For GAT Veličković et al. (2018), due to its large memory cost, the output dimensions of the first and second layers are set to 64 and 8, respectively.
For Variants (2) and (3), the output dimension of the first layer is set to 16, and the output dimension of the second layer equals the number of node classes. We train this combination for 100 epochs for GAT and for 200 epochs for the other setups.
For Eq. (9) of the original paper, we randomly sample 10,000 triplets in each epoch. The hyperparameter in Eq. (10) of the original paper is set to 0.1 for all datasets. All methods are optimized using Adam Kingma and Ba (2014) with a learning rate of 0.01. We use the model that performs best on the validation set for testing. Each reported result is the average over 10 runs.
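The per-epoch triplet sampling mentioned above can be sketched as follows. This is only an illustration and not the authors' code: Eq. (9) is not reproduced in this appendix, so the sketch assumes that each triplet consists of an anchor node, a node with the same label, and a node with a different label. The function name `sample_triplets` and the toy `labels` dictionary are hypothetical.

```python
import random


def sample_triplets(labels, num_triplets, seed=None):
    """Sample (anchor, positive, negative) node triplets from labeled nodes.

    labels: dict mapping node id -> class label (training nodes only).
    Assumption (not from the paper): positive shares the anchor's label,
    negative has a different label.
    """
    rng = random.Random(seed)
    by_class = {}
    for node, y in labels.items():
        by_class.setdefault(y, []).append(node)

    # Sampling is only possible with at least two classes and at least one
    # class that contains two or more nodes.
    has_pair = any(len(nodes) >= 2 for nodes in by_class.values())
    if len(by_class) < 2 or not has_pair:
        raise ValueError("need >= 2 classes and a class with >= 2 nodes")

    nodes = list(labels)
    triplets = []
    while len(triplets) < num_triplets:
        anchor = rng.choice(nodes)
        y = labels[anchor]
        positive = rng.choice(by_class[y])
        if positive == anchor:
            continue  # rejected: positive must be a different node
        negative = rng.choice(nodes)
        if labels[negative] == y:
            continue  # rejected: negative must come from another class
        triplets.append((anchor, positive, negative))
    return triplets


# Toy example: five labeled nodes, 10,000 triplets for one epoch.
labels = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c"}
triplets = sample_triplets(labels, 10_000, seed=0)
```

Rejection sampling keeps the sketch short; with many classes, sampling the negative directly from the nodes of other classes would avoid wasted draws.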
A.2 Additional Experimental Results
A.2.1 Results of Link Prediction
In addition to node classification, we also conduct link prediction experiments to demonstrate that the proposed framework generalizes to different graph-based tasks. Following Kipf and Welling (2016), we train the models on an incomplete version of the three citation datasets (Cora, Citeseer and Pubmed): the node features are kept, but part of the citation links (edges) are missing. The validation and test sets are also constructed following the setup of Kipf and Welling (2016).
We choose to be the dot product and set to be the feature mapping function. Given graph , for , the similarity measure is defined as:
(14) 
The feature mapping function can be learned by minimizing the following objective function in a data-driven manner:
(15) 
where is the set of training edges, and is the binary cross entropy loss.
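The description above (dot-product base kernel on learned node embeddings, trained with a binary cross-entropy loss over training edges) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the dot product is squashed by a sigmoid before the loss is applied, it averages the loss over pairs, and the names `similarity`, `link_loss` and the toy `embeddings` (standing in for the output of the feature mapping function) are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors (the base kernel)."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def similarity(embeddings, u, v):
    """Assumed similarity: sigmoid of the dot product of node embeddings."""
    return sigmoid(dot(embeddings[u], embeddings[v]))


def link_loss(embeddings, pairs, labels, eps=1e-12):
    """Mean binary cross-entropy over node pairs (label 1 = edge, 0 = no edge)."""
    total = 0.0
    for (u, v), y in zip(pairs, labels):
        p = similarity(embeddings, u, v)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(pairs)


# Toy embeddings: nodes 0 and 1 point the same way, node 2 the opposite way.
embeddings = {0: [1.0, 0.0], 1: [2.0, 0.0], 2: [-2.0, 0.0]}
pairs = [(0, 1), (0, 2)]
loss_correct = link_loss(embeddings, pairs, [1, 0])  # edge (0,1), no edge (0,2)
loss_wrong = link_loss(embeddings, pairs, [0, 1])    # labels swapped
```

With these toy embeddings the loss is small when the labels agree with the embedding geometry and large when they are swapped, which is the behavior the objective relies on.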
Table 5 summarizes the link prediction results of our kernel-based method, the variational graph autoencoder (VGAE) Kipf and Welling (2016) and its non-probabilistic variant (GAE). Our kernel-based method is comparable with these state-of-the-art methods, which shows the potential of applying the proposed framework to different applications on graphs.
Method                         Cora                      Citeseer                  Pubmed
                               AUC          AP           AUC          AP           AUC          AP
GAE Kipf and Welling (2016)    91.0 ± 0.02  92.0 ± 0.03  89.5 ± 0.04  89.9 ± 0.05  96.4 ± 0.00  96.5 ± 0.00
VGAE Kipf and Welling (2016)   91.4 ± 0.01  92.6 ± 0.01  90.8 ± 0.02  92.0 ± 0.02  94.4 ± 0.02  94.7 ± 0.02
Ours                           93.1 ± 0.06  93.2 ± 0.07  90.9 ± 0.08  91.8 ± 0.04  94.5 ± 0.03  94.2 ± 0.01
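Table 5 reports AUC (area under the ROC curve) and AP (average precision). As a reminder of what these two metrics measure for link prediction scores, here is a small pure-Python sketch. It is not the evaluation code used to produce the table; the function names and the toy scores are hypothetical, and ties in AP are simply broken by input order.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive pair outscores a
    random negative pair (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive pair."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


# Toy example: four candidate node pairs, two true edges (label 1).
scores = [0.9, 0.8, 0.7, 0.4]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)           # 3 of 4 positive/negative pairs ordered correctly
ap = average_precision(scores, labels)  # (1/1 + 2/3) / 2
```

The quadratic loop in `roc_auc` is fine for a toy example; a rank-based formula would be preferable for the number of candidate edges in a real test set.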
A.2.2 t-SNE Visualization on Cora
We visualize the node embeddings of GCN Kipf and Welling (2017), GAT Veličković et al. (2018) and our method on Cora with t-SNE in Fig. 3. Our method produces tight and clearly separated clusters (especially for the “red” and “violet” points), which suggests that, compared with GCN and GAT, our method learns more reasonable feature embeddings for nodes.
References
 [1] (2018) N-GCN: multi-scale graph convolution for semi-supervised node classification. arXiv preprint arXiv:1802.08888. Cited by: §1, §1.
 [2] (2013) Distributed large-scale natural graph factorization. In Proceedings of the International Conference on World Wide Web (WWW), pp. 37–48. Cited by: §2.
 [3] (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems (NeurIPS), pp. 585–591. Cited by: §2.
 [4] (2005) Shortest-path kernels on graphs. In Proceedings of the IEEE International Conference on Data Mining (ICDM). Cited by: §2, §3.
 [5] (2018) FastGCN: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247. Cited by: §6.2, Table 1, Table 2, §6.
 [6] (2019) Construct dynamic graphs for hand gesture recognition via spatial-temporal attention. In Proceedings of the British Machine Vision Conference (BMVC). Cited by: §1.
 [7] (2019) Cluster-GCN: an efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 257–266. Cited by: §5.
 [8] (2018) KONG: kernels for ordered-neighborhood graphs. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4051–4060. Cited by: §1.
 [9] (2006) An experimental investigation of graph kernels on a collaborative recommendation task. In Proceedings of the International Conference on Data Mining (ICDM), pp. 863–868. Cited by: Table 2, §6.
 [10] (2001) The elements of statistical learning. Springer Series in Statistics, New York. Cited by: §4.1.
 [11] (1998) CiteSeer: an automatic citation indexing system. In Proceedings of the Third ACM Conference on Digital Libraries, pp. 89–98. Cited by: §1, Table 1, Table 2, Table 3, Table 4, §6.
 [12] (2017) Neural message passing for quantum chemistry. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1263–1272. Cited by: §1.
 [13] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 855–864. Cited by: §2.
 [14] (2017) Representation learning on graphs: methods and applications. arXiv preprint arXiv:1709.05584. Cited by: §3.
 [15] (1999) Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. Cited by: §2.
 [16] (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2 (5), pp. 359–366. Cited by: §4.1.
 [17] (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (2), pp. 251–257. Cited by: §4.1.
 [18] (2004) Cyclic pattern kernels for predictive graph mining. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 158–167. Cited by: §2.
 [19] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.1.
 [20] (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Cited by: §A.2.1, §A.2.1, Table 5.
 [21] (2017) Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §A.2.2, §1, §1, §2, §4.1, §5, Table 2, Table 3, §6.
 [22] (2017) A unifying view of explicit and implicit feature maps for structured data: systematic studies of graph kernels. arXiv preprint arXiv:1703.00676. Cited by: §1.
 [23] (2019) Graph matching networks for learning the similarity of graph structured objects. In Proceedings of the International Conference on Machine Learning (ICML). Cited by: §2.
 [24] (2000) Automating the construction of internet portals with machine learning. Information Retrieval 3 (2), pp. 127–163. Cited by: §1, §5, Table 1, Table 2, Table 3, Table 4, §6.
 [25] (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3111–3119. Cited by: §2.
 [26] (2012) Machine learning: a probabilistic perspective. MIT Press. Cited by: §3.
 [27] (2005) Self-organizing maps for learning the edit costs in graph matching. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 35 (3), pp. 503–514. Cited by: §3.
 [28] (2014) DeepWalk: online learning of social representations. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 701–710. Cited by: §2.
 [29] (2003) Expressivity versus efficiency of graph kernels. In Proceedings of the International Workshop on Mining Graphs, Trees and Sequences, pp. 65–74. Cited by: §4.
 [30] (2017) Struc2vec: learning node representations from structural identity. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 385–394. Cited by: §2.
 [31] (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press. Cited by: §3.
 [32] (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93–93. Cited by: §1, Table 1, Table 2, Table 4, §6.
 [33] (2009) Fast subtree kernels on graphs. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1660–1668. Cited by: §2, §3.
 [34] (2011) Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12, pp. 2539–2561. Cited by: §1, §2.
 [35] (2009) Efficient graphlet kernels for large graph comparison. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 488–495. Cited by: §2.
 [36] (2008) A generalization of Haussler’s convolution kernel: mapping kernel. In Proceedings of the International Conference on Machine Learning (ICML), pp. 944–951. Cited by: §4.
 [37] (2015) LINE: large-scale information network embedding. In Proceedings of the International Conference on World Wide Web (WWW), pp. 1067–1077. Cited by: §2.
 [38] (2018) Quantized densely connected U-Nets for efficient landmark localization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 339–354. Cited by: §1.
 [39] (2018) CR-GAN: learning complete representations for multi-view generation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 942–948. Cited by: §1.
 [40] (2018) Graph attention networks. In Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §A.1, §A.2.2, §1, §2, Table 2, Table 3, §6.
 [41] (2010) Graph kernels. Journal of Machine Learning Research 11 (Apr), pp. 1201–1242. Cited by: §2.
 [42] (2019) How powerful are graph neural networks?. In Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §1, §1.
 [43] (2018) Representation learning on graphs with jumping knowledge networks. In Proceedings of the International Conference on Machine Learning (ICML). Cited by: §1, §1, §2, §6.2, Table 1, Table 3, §6.
 [44] (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). Cited by: §1.
 [45] (2015) Deep graph kernels. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1365–1374. Cited by: §2.
 [46] (2016) Revisiting semi-supervised learning with graph embeddings. In Proceedings of the International Conference on Machine Learning (ICML), pp. 40–48. Cited by: §2.
 [47] (2018) Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 6410–6421. Cited by: §1.
 [48] (2018) Graph node-feature convolution for representation learning. arXiv preprint arXiv:1812.00086. Cited by: §1, §1, §2.
 [49] (2018) An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). Cited by: §1, §1.
 [50] (2018) RetGK: graph kernels based on return probabilities of random walks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3964–3974. Cited by: §1.
 [51] (2019) Semantic graph convolutional networks for 3D human pose regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3425–3435. Cited by: §1.
 [52] (2018) Learning to forecast and refine residual motion for image-to-video generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 387–403. Cited by: §1.
 [53] (2018) A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.