A Survey on Graph Kernels

Nils M. Kriege Department of Computer Science
TU Dortmund University, Dortmund, Germany
{nils.kriege,christopher.morris}@tu-dortmund.de
Fredrik D. Johansson Institute for Medical Engineering and Science, MIT
fredrikj@mit.edu
Christopher Morris Department of Computer Science
TU Dortmund University, Dortmund, Germany
{nils.kriege,christopher.morris}@tu-dortmund.de
Abstract

Graph kernels have become an established and widely-used technique for solving classification tasks on graphs. This survey gives a comprehensive overview of techniques for kernel-based graph classification developed in the past 15 years. We describe and categorize graph kernels based on properties inherent to their design, such as the nature of their extracted graph features, their method of computation and their applicability to problems in practice. In an extensive experimental evaluation, we study the classification accuracy of a large suite of graph kernels on established benchmarks as well as new datasets. We compare the performance of popular kernels with several baseline methods and study the effect of applying a Gaussian RBF kernel to the metric induced by a graph kernel. In doing so, we find that simple baselines become competitive after this transformation on some datasets. Moreover, we study the extent to which existing graph kernels agree in their predictions (and prediction errors) and obtain a data-driven categorization of kernels as result. Finally, based on our experimental results, we derive a practitioner’s guide to kernel-based graph classification.

1 Introduction

Machine learning analysis of large, complex datasets has become an integral part of research in both the natural and social sciences. Largely, this development was driven by the empirical success of supervised learning of vector-valued data or image data. However, in many domains, such as chemo- and bioinformatics, social network analysis or computer vision, observations describe relations between objects or individuals and cannot be interpreted as vectors or fixed grids; instead, they are naturally represented by graphs. This poses a particular challenge in the application of traditional data mining and machine learning approaches. In order to learn successfully from such data, it is necessary for algorithms to exploit the rich information inherent to the graphs’ structure and annotations associated with their vertices and edges.

A popular approach to learning with graph-structured data is to make use of graph kernels—functions which measure the similarity between graphs—plugged into a kernel machine, such as a support vector machine. Due to the prevalence of graph-structured data and the empirical success of kernel-based methods for classification, a large body of work in this area exists. In particular, in the past 15 years, numerous graph kernels have been proposed, motivated either by their theoretical properties or by their suitability and specialization to particular application domains. Despite this, there are no review articles aimed at a comprehensive comparison between different graph kernels nor at giving practical guidelines for choosing between them. As the number of methods grows, it is becoming increasingly difficult for both non-expert practitioners and researchers new to the field to identify an appropriate set of candidate kernels for their application.

This survey is intended to give an overview of the graph kernel literature, targeted at the active researcher as well as the practitioner. First, we describe and categorize existing kernels in terms of their theoretical properties, their computational complexity and their expressivity. Second, we perform an extensive experimental evaluation of state-of-the-art graph kernels on a wide range of benchmark datasets for graph classification stemming from chemo- and bioinformatics as well as social network analysis and computer vision. Finally, we provide guidelines for the practitioner for the successful application of graph kernels.

1.1 Contributions

We summarize our contributions below.

  • We give a comprehensive overview of the graph kernel literature, categorizing kernels according to several properties. Primarily, we distinguish graph kernels by their mathematical definition and by which graph features they use to measure similarity. Moreover, we discuss whether kernels are applicable to (i) graphs annotated with continuous attributes, (ii) graphs with discrete labels, or (iii) unlabeled graphs only. Additionally, we describe which kernels rely on the kernel trick as opposed to being computed from explicit feature vectors, and what effects this has on running time and flexibility. Finally, we give a brief review of approaches to graph comparison and learning based on deep neural networks.

  • We give an overview of applications of graph kernels in different domains and review theoretical work on the expressive power of graph kernels.

  • We compare state-of-the-art graph kernels in an extensive experimental study across a wide range of established and new benchmark datasets. Specifically, we show the strengths and weaknesses of the individual kernels or classes of kernels for specific datasets.

    • We compare popular kernels to simple baseline methods in order to assess the need for more sophisticated methods which are able to take more structural features into account. To this end, we analyze the ability of graph kernels to distinguish the non-isomorphic graphs in common benchmark datasets.

    • Moreover, we investigate the effect of combining a Gaussian RBF kernel with the metric induced by a graph kernel in order to learn non-linear decision boundaries in the feature space of the graph kernel. We observe that with this approach simple baseline methods become competitive to state-of-the-art kernels for some datasets, but fail for others.

    • We study the similarity between graph kernels in terms of their classification predictions and errors on graphs from the chosen datasets. This analysis provides a qualitative, data-driven means of assessing the similarity of different kernels in terms of which graphs they deem similar.

  • Finally, we provide guidelines for the practitioner and new researcher for the successful application of graph kernels.

1.2 Related Work

The most recent surveys of graph kernels are the works of Ghosh et al. [1] and Zhang et al. [2]. Ghosh et al. [1] place a strong emphasis on covering the fundamentals of kernel methods in general and on summarizing known experimental results for graph kernels. The article does not, however, cover the most recent contributions to the literature, and the authors do not perform (nor reproduce) original experiments on graph classification. The survey by Zhang et al. [2] focuses on kernels for graphs without attributes, which is a small subset of the scope of this survey. Another survey was published in 2010 by Vishwanathan et al. [3], but it focuses mainly on random walk kernels and does not include recent advances. Moreover, various PhD theses give (incomplete or dated) overviews, see, e.g., [4, 5, 6, 7].

1.3 Outline

In Section 2, we introduce notation and provide mathematical definitions necessary to understand the rest of the paper. Section 3 gives an overview of the graph kernel literature. We start off by introducing kernels based on neighborhood aggregation techniques. Subsequently, we describe kernels based on assignments, substructures, walks and paths, and neural networks, as well as approaches that do not fit into any of the former categories. In Section 4, we survey theoretical work on the expressivity of kernels and in Section 5 we describe applications of graph kernels in four domain areas. Finally, in Section 6 we introduce and analyze the results of a large-scale experimental study of graph kernels in classification problems, and provide guidelines for the successful application of graph kernels.

2 Fundamentals

In this section, we cover notation and definitions of fundamental concepts pertaining to graph-structured data, kernel methods, and graph kernels. In Section 3, we use these concepts to define and categorize popular graph kernels.

2.1 Graph Data

A graph $G$ is a pair $(V, E)$ of a finite set of vertices $V$ and a set of edges $E$. A vertex is typically used to represent an object (e.g., an atom) and an edge a relation between objects (e.g., a molecular bond). We denote the set of vertices and the set of edges of $G$ by $V(G)$ and $E(G)$, respectively. We restrict our attention to undirected graphs in which no two edges with identical (unordered) end points, nor any self-loops exist. For ease of notation we denote the edge $\{u, v\}$ in $E(G)$ by $uv$ or $vu$. A labeled graph is a graph $G$ endowed with a label function $\ell \colon V(G) \to \Sigma$, where $\Sigma$ is some alphabet, e.g., the set of natural or real numbers. We say that $\ell(v)$ is the label of $v$. In the case $\Sigma = \mathbb{R}^d$ for some $d > 0$, $\ell(v)$ is the (continuous) attribute of $v$. In Section 5, we give examples of applications involving graphs with vertex labels and attributes. The edges of a graph may also be assigned labels or attributes (e.g., weights representing vertex similarity), in which case the domain of the labeling function is extended to the edge set.

We let $N(v)$ denote the neighborhood of a vertex $v$ in $G$, i.e., $N(v) = \{u \in V(G) \mid vu \in E(G)\}$. The degree of a vertex is the size of its neighborhood, $\deg(v) = |N(v)|$. A walk in a graph is an ordered sequence of vertices $(v_0, \dots, v_k)$ such that any two subsequent vertices are connected by an edge. A path is a walk that starts in $v_0$ and ends in $v_k$ with no repeated vertices. A graph is called connected if there is a path between any pair of vertices in $V(G)$ and disconnected otherwise.

We say that two unlabeled graphs $G$ and $H$ are isomorphic, denoted by $G \simeq H$, if there exists a bijection $\varphi \colon V(G) \to V(H)$ such that $uv \in E(G)$ if and only if $\varphi(u)\varphi(v) \in E(H)$ for all $u, v$ in $V(G)$. For labeled graphs, isomorphism additionally requires that the bijection maps vertices and edges only to vertices and edges with the same label. Finally, a graph $G'$ is a subgraph of a graph $G$ if $V(G') \subseteq V(G)$ and $E(G') \subseteq E(G)$. Let $S$ be a subset of vertices in $V(G)$. Then $G[S]$ denotes the subgraph induced by $S$, with $E(G[S]) = \{uv \in E(G) \mid u, v \in S\}$.

Graphs are often represented in matrix form. Perhaps most frequent is the adjacency matrix $A$, with binary elements $A_{ij} = 1$ if $v_i v_j \in E(G)$ and $A_{ij} = 0$ otherwise (weighted graphs are represented by their corresponding edge weight matrix). An alternative representation is the graph Laplacian $L$, defined as $L = D - A$, where $D$ is the diagonal degree matrix such that $D_{ii} = \deg(v_i)$. Finally, the incidence matrix of a graph is the binary matrix $B$ with vertex-edge-pair elements $B_{ve} = 1$ representing the event that the vertex $v$ is incident to the edge $e$. If each edge is given an arbitrary orientation, the resulting signed incidence matrix $B$ satisfies $L = B B^\top$. The matrices $A$, $L$ and $B$ all carry the same information.

Figure 1: Graph representation fundamentals, including the shortest path (sequence of vertices) between two vertices.

2.2 Kernel Methods

Kernel methods refer to machine learning algorithms that learn by comparing pairs of data points using particular similarity measures—kernels. We give an overview below; for an in-depth treatment, see [8, 9]. Consider a non-empty set of data points $\mathcal{X}$, such as $\mathbb{R}^d$ or a finite set of graphs, and let $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a function. Then, $k$ is a kernel on $\mathcal{X}$ if there is a Hilbert space $\mathcal{H}_k$ and a feature map $\phi \colon \mathcal{X} \to \mathcal{H}_k$ such that $k(x, y) = \langle \phi(x), \phi(y) \rangle$ for all $x, y \in \mathcal{X}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product of $\mathcal{H}_k$. Such a feature map exists if and only if $k$ is a positive-semidefinite function. A trivial example is $\mathcal{X} = \mathbb{R}^d$ with $\phi$ the identity map, in which case the kernel equals the dot product, $k(x, y) = x^\top y$.

An important concept in kernel methods is the Gram matrix $\mathbf{K}$, defined with respect to a finite set of data points $x_1, \dots, x_n$. The Gram matrix of a kernel $k$ has elements $\mathbf{K}_{ij}$, for $i, j \in \{1, \dots, n\}$, equal to the kernel value between pairs of data points, i.e., $\mathbf{K}_{ij} = k(x_i, x_j)$. If the Gram matrix of $k$ is positive semidefinite for every possible set of data points, $k$ is a kernel [10]. Kernel methods have the desirable property that they do not rely on explicitly characterizing the vector representation of data points, but access data only via the Gram matrix $\mathbf{K}$. The benefit of this is often illustrated using the Gaussian radial basis function (RBF) kernel on $\mathbb{R}^d$, defined as

$$k(x, y) = \exp\left(-\frac{\lVert x - y \rVert_2^2}{2\sigma^2}\right), \qquad (1)$$

where $\sigma > 0$ is a bandwidth parameter. The Hilbert space associated with the Gaussian RBF kernel has infinite dimension, but the kernel may be readily computed for any pair of points (see [11] for further details). Kernel methods have been developed for most machine learning paradigms, e.g., support vector machines (SVM) for classification [12], Gaussian processes (GP) for regression [13], kernel PCA and kernel k-means for unsupervised learning and clustering [10], and kernel density estimation (KDE) for density estimation [14]. In this work, we restrict our attention to classification of objects in a non-empty set of graphs $\mathcal{G}$. In this setting, a kernel $k \colon \mathcal{G} \times \mathcal{G} \to \mathbb{R}$ is called a graph kernel. Like kernels on vector spaces, graph kernels can be calculated either explicitly (by computing the feature map $\phi$) or implicitly (by computing only kernel values). Traditionally, learning with implicit kernel representations means that the value of the chosen kernel applied to every pair of graphs in the training set must be computed and stored. Explicit computation means that we compute a finite-dimensional feature vector for each graph; the values of the kernel can then be computed on-the-fly during learning as the inner product of feature vectors. If explicit computation is possible, and the dimensionality of the resulting feature vectors is not too high, or the vectors are sparse, then it is usually faster and more memory efficient than implicit computation, see also [15, 16].
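To make the distinction between implicit and explicit computation concrete, the following minimal Python sketch computes the same Gram matrix in both ways. It assumes toy "graphs" represented only by their vertex label multisets; the function names and the label-histogram feature map are illustrative choices, not taken from any particular graph kernel library.

```python
import numpy as np

def gram_implicit(objects, kernel):
    # Implicit computation: evaluate the kernel for every pair of objects.
    n = len(objects)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(objects[i], objects[j])
    return K

def gram_explicit(objects, feature_map):
    # Explicit computation: map each object to a feature vector first,
    # then obtain the Gram matrix from pairwise inner products.
    X = np.array([feature_map(o) for o in objects])
    return X @ X.T

# Toy example: "graphs" given as multisets of vertex labels from {0, 1, 2}.
graphs = [[0, 0, 1], [1, 2, 2, 2], [0, 1, 2]]
feature_map = lambda g: np.bincount(g, minlength=3)   # label histogram
kernel = lambda g, h: float(feature_map(g) @ feature_map(h))

print(np.allclose(gram_implicit(graphs, kernel), gram_explicit(graphs, feature_map)))  # True
```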

2.3 Design Paradigms For Kernels on Structured Data

When working with vector-valued data, it is common practice for kernels to compare objects using differences between vector components (see for example the Gaussian RBF kernel in Equation 1). The structure of a graph, however, is invariant to permutations of its representation—the ordering by which vertices and edges are enumerated does not change the structure—and vector distances between, e.g., adjacency matrices are typically uninformative. For this reason, it is important to compare graphs in ways that are themselves permutation invariant. As mentioned previously, two graphs with identical structure (irrespective of representation) are called isomorphic, a concept that could in principle be used for learning. However, not only is there no known polynomial-time algorithm for testing graph isomorphism [17], but isomorphism is also typically too strict for learning—it is akin to learning with the equality operator. In practice, it is often desirable to have smoother measures of comparison in order to gain generalizable knowledge from the comparison of graphs.

The vast majority of graph kernels proposed in the literature are instances of so-called convolution kernels. Given two discrete structures, e.g., two graphs, the idea of Haussler’s Convolution Framework [18] is to decompose these two structures into substructures, e.g., vertices or subgraphs, and then evaluate a kernel between each pair of such substructures. The convolution kernel is defined below.

Definition 2.1 (Convolution Kernel).

Let $\mathcal{R} \subseteq \mathcal{X}_1 \times \dots \times \mathcal{X}_d \times \mathcal{X}$ denote a relation between components from the spaces $\mathcal{X}_1, \dots, \mathcal{X}_d$ and composite objects from $\mathcal{X}$, such that $(x_1, \dots, x_d, x) \in \mathcal{R}$ if and only if the components $x_1, \dots, x_d$ make up the object $x$, and let $\mathcal{R}^{-1}(x) = \{(x_1, \dots, x_d) \mid (x_1, \dots, x_d, x) \in \mathcal{R}\}$. Then, the $\mathcal{R}$-convolution kernel is

$$k(x, y) = \sum_{(x_1, \dots, x_d) \in \mathcal{R}^{-1}(x)} \; \sum_{(y_1, \dots, y_d) \in \mathcal{R}^{-1}(y)} \; \prod_{i=1}^{d} k_i(x_i, y_i), \qquad (2)$$

where $k_i$ is a kernel on $\mathcal{X}_i$ for $i$ in $\{1, \dots, d\}$.

In our context, we may view the inverse relation $\mathcal{R}^{-1}(G)$ of the convolution kernel as the set of all components of a graph $G$ that we wish to compare. A simple example of the $\mathcal{R}$-convolution kernel is the vertex label kernel, for which the relation $\mathcal{R}$ associates the label of each vertex with the graph it belongs to. We expand on this notion in Section 3.3 and give a small illustration below. A benefit of the convolution kernel framework when working with graphs is that if the kernels on substructures are invariant to orderings of vertices and edges, so is the resulting graph kernel.
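As a minimal illustration of Definition 2.1 (a toy sketch under stated assumptions, not a reference implementation), the following code instantiates an $\mathcal{R}$-convolution kernel whose components are the vertices of a graph and whose base kernel is the Dirac (equality) kernel on vertex labels; graphs are assumed to be given only as vertex label multisets.

```python
def vertex_label_kernel(labels_g, labels_h):
    # R-convolution with d = 1: decompose each graph into its vertices and
    # compare all pairs of vertex labels with the Dirac (equality) kernel.
    return sum(1 for a in labels_g for b in labels_h if a == b)

# Two toy molecular graphs, represented only by their vertex label multisets.
print(vertex_label_kernel(['C', 'C', 'O'], ['C', 'O', 'N']))  # 3 matching pairs
```

Because the double sum ranges over unordered multisets of components, the value is invariant to any reordering of the vertices, which is exactly the permutation invariance discussed above.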

A property of convolution kernels often regarded as unfavorable is that the sum in Equation 2 applies to all pairs of components. When the considered components become more and more specific, each object becomes increasingly similar to itself, but no longer to any other objects. This phenomenon is referred to as the diagonal dominance problem, since the entries on the main diagonal of the Gram matrix are much higher than the other entries. This problem has been observed for graph kernels, and weights between the components were introduced to alleviate it [19, 20]. In addition, the fact that convolution kernels compare all pairs of components may be unsuitable in situations where each component of one object corresponds to exactly one component of the other (such as the features of two faces). Shin and Kuboyama [21] studied mapping kernels, where the sum moves over a predetermined subset of pairs rather than the entire cross product. It was shown that, for general primitive kernels $k_i$, a valid mapping kernel is obtained if and only if the considered subsets of pairs are transitive. This does not necessarily hold when assigning the components of two objects to each other such that a correspondence of maximum total similarity w.r.t. $k_i$ is obtained. As a consequence, this approach does not lead to valid kernels in general. However, graph kernels following this approach have been studied in detail and are often referred to as optimal assignment kernels, see Section 3.2.

3 Graph Kernels

The first methods for graph comparison referred to as graph kernels were proposed in 2003 [22, 23]. However, several approaches similar to graph kernels had been developed in the field of chemoinformatics, long before the term graph kernel was coined. The timeline in Figure 2 shows milestones in the development of graph kernels and related learning algorithms for graphs. We postpone the discussion of the latter to Section 5. Following the introduction of graph kernels, subsequent work focused for a long time on making kernels computationally tractable for large graphs with (predominantly) discrete vertex labels. Since 2012, several kernels specifically designed for graphs with continuous attributes have been proposed. In a very recent development, neural networks have become increasingly popular tools in graph classification; these methods are discussed briefly in Section 3.6. It remains a current challenge in research to develop neural techniques for graphs that are able to learn feature representations that are clearly superior to the fixed feature spaces used by graph kernels.

In the following, we give an overview of the graph kernel literature in order of popular design paradigms. We begin our treatment with kernels that are based on neighborhood aggregation techniques. The subsequent subsections deal with assignment- and matching-based kernels and kernels based on the extraction of subgraph patterns, respectively. The final subsections deal with kernels based on walks and paths, kernels for graphs with continuous labels, neural approaches, and kernels that do not fall into any of the previous categories. Table 1 gives an overview of the discussed graph kernels and their properties.

Graph Kernel Computation Labels Attributes
Shortest-Path [24] IM + +
Generalized Shortest-Path [25] IM + +
Graphlet [26] EX
Cycles and Trees [27] EX +
Tree Pattern Kernel [28, 29] IM + +
Ordered Directed Acyclic Graphs [30, 31] EX +
GraphHopper [32] IM + +
Graph Invariant [33] IM + +
Subgraph Matching [34] IM + +
Weisfeiler-Lehman Subtree [35] EX +
Weisfeiler-Lehman Edge [35] EX +
Weisfeiler-Lehman Shortest-Path [35] EX +
k-dim. Local Weisfeiler-Lehman Subtree [36] EX +
Neighborhood Hash Kernel [37] EX +
Propagation Kernel [38] EX + +
Neighborhood Subgraph Pairwise Distance Kernel [39] EX +
Random Walk [22, 23, 40, 3, 41, 42] IM + +
Optimal Assignment Kernel [43] IM + +
Weisfeiler-Lehman Optimal Assignment [44] IM +
Pyramid Match [45] IM +
Matchings of Geometric Embeddings [46] IM + +
Descriptor Matching Kernel [47] IM + +
Graphlet Spectrum [48] EX +
Multiscale Laplacian Graph Kernel [49] IM + +
Global Graph Kernel [50] EX
Deep Graph Kernels [19] IM +
Smoothed Graph Kernels [51] IM +
Hash Graph Kernel [52] EX + +
Depth-based Representation Kernel [53] IM
Aligned Subtree Kernel [54] IM +
Table 1: Summary of selected graph kernels: Computation by explicit (EX) and implicit (IM) feature mapping and support for attributed graphs. The column 'Labels' refers to whether the kernel supports comparison of graphs with discrete vertex and edge labels in a way that depends on the interplay between structure and labels. The column 'Attributes' refers to the same capability but for continuous or more general vertex attributes. An empty entry indicates that the corresponding annotations are not supported; for some kernels, the annotation was not considered in the original publication although the method can be extended, and some kernels support vertex annotations only.

1973: Fingerprints for chemical similarity [55]
1986: Systematic evaluation of fingerprint similarities [56]
2000: Extended connectivity fingerprints [57]
2003: Random walk kernels [22, 23]
2003: Tree pattern kernels [28, 29]
2004: Cycles and Trees kernel [27]
2005: Shortest-path kernel [24]
2005: Kernels from chemical similarities [58]
2005: Optimal assignment kernels [43]
2005: Molecular graph networks [59]
2009: Graphlet kernels [26]
2009: Neighborhood Hash Kernel [37]
2009: Weisfeiler-Lehman kernels [35]
2010: Neighborhood subgraph pairwise distance kernel [39]
2012: Ordered Directed Acyclic Graphs [30]
2012: Subgraph matching kernel [34]
2013: GraphHopper kernel [32]
2015: Generalized shortest-path kernel [25]
2015: Graph Invariant [33]
2015: Neural molecular fingerprints [60]
2016: Descriptor matching kernel [47]
2016: Hash graph kernels [52]
2016: Valid optimal assignment kernels [44]
2017: Graph convolutional networks [61]
2017: Neural message passing [62]
2017: GraphSAGE [63]
2018: SplineCNN [64]
2019: k-GNN [65]

Figure 2: Timeline. Selected techniques for graph classification with a focus on kernels, including fingerprint-based approaches, graph kernels (several of them designed for cheminformatics or attributed graphs), and neural network methods.

3.1 Neighborhood Aggregation Approaches

One of the dominating paradigms in the design of graph kernels is representation and comparison of local structure. Two vertices are considered similar if they have identical labels—even more so if their neighborhoods are labeled similarly. Expanding on this notion, two graphs are considered similar if they are composed of vertices with similar neighborhoods, i.e., that they have similar local structure. The different ways by which local structure is defined, represented and compared form the basis for several influential graph kernels. We describe a first example next.

Neighborhood aggregation approaches work by assigning an attribute to each vertex based on a summary of the local structure around it. Iteratively, for each vertex, the attributes of its immediate neighbors are aggregated to compute a new attribute for the target vertex, eventually representing the structure of its extended neighborhood. Shervashidze et al. [35] introduced a highly influential class of neighborhood aggregation kernels for graphs with discrete labels based on the 1-dimensional Weisfeiler-Lehman (1-WL) or color refinement algorithm—a well-known heuristic for the graph isomorphism problem, see e.g., [66]. We illustrate an application of the 1-WL algorithm in Figure 3.

Let $G$ and $H$ be graphs, and let $\ell \colon V(G) \cup V(H) \to \Sigma$ be the observed vertex label function of $G$ and $H$ (if the graphs are unlabeled, let $\ell$ map to a constant). In a series of iterations $i = 0, 1, \dots$, the 1-WL algorithm computes new label functions $\ell_i \colon V(G) \cup V(H) \to \Sigma$, each of which can be used to compare $G$ and $H$. In iteration $0$ we set $\ell_0 = \ell$, and in subsequent iterations $i > 0$ we set

$$\ell_i(v) = \mathsf{relabel}\Big(\big(\ell_{i-1}(v),\; \mathsf{sort}\big(\{\!\{\ell_{i-1}(u) \mid u \in N(v)\}\!\}\big)\big)\Big) \qquad (3)$$

for $v$ in $V(G) \cup V(H)$, where $\mathsf{sort}$ returns a sorted tuple of the multiset of neighbor labels and the injection $\mathsf{relabel}$ maps each pair to a unique value in $\Sigma$ that has not been used in previous iterations. Now, if $G$ and $H$ have an unequal number of vertices with label $\sigma \in \Sigma$, we can conclude that the graphs are not isomorphic. Moreover, if the cardinality of the image of $\ell_{i-1}$ equals the cardinality of the image of $\ell_i$, the algorithm terminates.

Figure 3: Weisfeiler-Lehman (WL) relabeling. Two iterations of WL for a graph with discrete vertex labels.

The idea of the Weisfeiler-Lehman subtree graph kernel is to run the above algorithm for $h$ iterations and, after each iteration $i$, compute a feature vector $\phi_i(G) \in \mathbb{R}^{|\Sigma_i|}$ for each graph $G$, where $\Sigma_i \subseteq \Sigma$ denotes the image of $\ell_i$. Each component of $\phi_i(G)$ counts the number of occurrences of vertices labeled with the corresponding $\sigma \in \Sigma_i$. The overall feature vector $\phi_{\text{WL}}(G)$ is defined as the concatenation of the feature vectors of all $h$ iterations, i.e.,
$$\phi_{\text{WL}}(G) = \big(\phi_0(G), \dots, \phi_h(G)\big).$$

Then the Weisfeiler-Lehman subtree kernel for $h$ iterations is defined as $k_{\text{WL}}(G, H) = \langle \phi_{\text{WL}}(G), \phi_{\text{WL}}(H) \rangle$. The running time for a single feature vector computation is in $O(hm)$, and the computation of the Gram matrix for a set of $N$ graphs takes $O(Nhm + N^2hn)$ time [35], where $n$ and $m$ denote the maximum number of vertices and edges over all graphs, respectively.
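The following Python sketch illustrates one way to implement 1-WL refinement and the resulting subtree features. It assumes graphs are given as adjacency dictionaries with hashable vertex labels; the dictionary `compress`, which realizes the injective relabeling function, must be shared across all graphs that are compared. All names are illustrative rather than taken from any existing library.

```python
from collections import Counter

def wl_features(adj, labels, h, compress):
    # adj: dict mapping each vertex to a list of neighbors,
    # labels: dict mapping each vertex to its discrete label,
    # compress: dict shared across graphs, realizing the injective relabeling.
    # Returns a sparse feature vector (Counter) over (iteration, label) pairs.
    features = Counter((0, l) for l in labels.values())
    cur = dict(labels)
    for i in range(1, h + 1):
        new = {}
        for v in adj:
            signature = (cur[v], tuple(sorted(cur[u] for u in adj[v])))
            new[v] = compress.setdefault(signature, len(compress))
        cur = new
        features.update((i, l) for l in cur.values())
    return features

def wl_subtree_kernel(f_g, f_h):
    # Dot product of two sparse feature vectors.
    return sum(c * f_h[key] for key, c in f_g.items())

# Toy example: a path and a triangle, both with constant labels.
compress = {}
path = wl_features({0: [1], 1: [0, 2], 2: [1]}, {0: 'a', 1: 'a', 2: 'a'}, 2, compress)
tri = wl_features({0: [1, 2], 1: [0, 2], 2: [0, 1]}, {0: 'a', 1: 'a', 2: 'a'}, 2, compress)
print(wl_subtree_kernel(path, tri))
```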

The WL subtree kernel suggests a general paradigm for comparing graphs at different levels of resolution: iteratively relabel graphs using the WL algorithm and construct a graph kernel based on a base kernel applied at each level. Indeed, in addition to the subtree kernel, Shervashidze et al. [35] introduced two other variants, the Weisfeiler-Lehman edge and the Weisfeiler-Lehman shortest-path kernel. Instead of counting the labels of vertices after each iteration, the Weisfeiler-Lehman edge kernel counts the colors of the two endpoints for all edges. The Weisfeiler-Lehman shortest-path kernel is the sum of shortest-path kernels applied to the graphs with refined labels $\ell_i$ for $i \in \{0, \dots, h\}$.

Morris et al. [36] introduced a graph kernel based on higher-dimensional variants of the Weisfeiler-Lehman algorithm. Here, instead of iteratively labeling vertices, the algorithm labels $k$-tuples or sets of cardinality $k$. Morris et al. [36] also provide efficient approximation algorithms to scale the method to large datasets. In [37], a graph kernel similar to the 1-WL was introduced which replaces the neighborhood aggregation function of Equation 3 by a function based on binary arithmetic. Similarly, in [38] the propagation kernel is defined, which propagates labels and real-valued attributes for several iterations while tracking their distribution for every vertex. A randomized approach based on locality-sensitive hashing is used to obtain unique features after each iteration.

Bai et al. [53, 54] proposed graph kernels based on depth-based representations, which can be seen as a different form of neighborhood aggregation. For a vertex $v$, the $m$-layer expansion subgraph is the subgraph induced by the vertices of shortest-path distance at most $m$ from $v$. In order to obtain an embedding for $v$, the Shannon entropy of these subgraphs is computed for all $m$ up to a given parameter [53]. A similar concept is applied in [54], where depth-based representations are used to compute strengthened vertex labels. Both methods are combined with matching-based techniques to obtain a graph kernel.

3.2 Assignment- and Matching-based Approaches

A common approach to comparing two composite or structured objects is to identify the best possible matching of the components making up the two objects. For example, when comparing two chemical molecules it is instructive to map each atom in one graph to the atom in the other graph that is most similar in terms of, for example, neighborhood structure and attached chemical and physical measurements. This idea has been used also in graph kernels, an early example of which was proposed by Fröhlich et al. [43] in the optimal assignment (OA) kernel. In the OA kernel, each vertex is endowed with a representation (e.g., a label) that is compared using a base kernel. Then, a similarity value for a pair of graphs is computed based on a mapping between their vertices such that the total similarity between the matched vertices with respect to a base kernel is maximized. An illustration of the optimal assignment kernel can be seen in Figure 4. The OA kernel can be defined as follows.

Figure 4: Assignment kernels. Illustration of optimal assignment kernels with vertex embeddings.
Definition 3.1 (Optimal assignment kernel).

Let $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_n\}$ be sets of components from $\mathcal{X}$ and let $k$ be a base kernel on components. The optimal assignment kernel is

$$K_A(X, Y) = \max_{\pi \in \Pi_n} \sum_{i=1}^{n} k\big(x_i, y_{\pi(i)}\big), \qquad (4)$$

where $\Pi_n$ is the set of all possible permutations of $\{1, \dots, n\}$. In order to apply the assignment kernel to sets of different cardinality, we fill the smaller set with dummy objects $z$ and define $k(z, x) = 0$ for all $x$.
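A small sketch of Equation (4) using the Hungarian algorithm from SciPy; the components are assumed to be scalar vertex attributes and the base kernel a Gaussian RBF, both purely illustrative. As discussed below, the resulting similarity is not positive semidefinite for general base kernels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_assignment_similarity(xs, ys, base_kernel):
    # Pad the smaller set implicitly with dummy components of zero similarity
    # by working on a square matrix of size max(|X|, |Y|).
    n = max(len(xs), len(ys))
    S = np.zeros((n, n))
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            S[i, j] = base_kernel(x, y)
    row, col = linear_sum_assignment(-S)  # maximum-weight bijection
    return S[row, col].sum()

# Toy example: components are scalar vertex attributes, base kernel is an RBF.
rbf = lambda a, b: np.exp(-(a - b) ** 2)
print(optimal_assignment_similarity([0.1, 0.9], [0.0, 1.0, 2.0], rbf))
```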

The careful reader may have noticed a superficial similarity between the OA kernel and the $\mathcal{R}$-convolution and mapping kernels (see Section 2.3). However, instead of summing the base kernel over a fixed ordering of component pairs, the OA kernel searches for the optimal mapping between the components of two objects. Unfortunately, this means that Equation (4) is not a positive-semidefinite kernel in general [67, 3]. This fact complicates the use of assignment similarities in kernel methods, although generalizations of SVMs for arbitrary similarity measures have been developed, see, e.g., [68] and references therein. Moreover, kernel methods such as SVMs have been found to work well empirically also with indefinite kernels [46], without enjoying the guarantees that apply to positive definite kernels.

Several different approaches to obtain positive definite graph kernels from indefinite assignment similarities have been proposed. Woźnica et al. [69] derived graph kernels from set distances and employed a matching-based distance to compare graphs, which was shown to be a metric [70]. In order to obtain a valid kernel, the authors use so-called prototypes, an idea prevalent also in the theory of learning with (non-kernel) similarity functions under the name landmarks [71]. Prototypes are a selected set of instances (e.g., graphs) to which all other instances are compared. Each graph is then represented by a feature vector in which each component is the distance to a different prototype. Prototypes were used also by Johansson and Dubhashi [46], who proposed to embed the vertices of a graph into a low-dimensional real vector space in order to compute a matching between the vertices of two graphs with respect to the Euclidean distance. Several methods for the embedding were proposed; in particular, the authors used Cholesky decompositions of matrix representations of graphs including the graph Laplacian and its pseudo-inverse. The authors found empirically that the indefinite graph similarity matrix obtained from the matching worked as well as prototypes. In Section 6, we use this indefinite version.

Instead of generating feature vectors from prototypes, Kriege et al. [44] showed that Equation (4) is a valid kernel for a restricted class of base kernels $k$. These so-called strong base kernels give rise to hierarchies from which the optimal assignment kernels are computed in linear time by histogram intersection. For graph classification, a base kernel based on Weisfeiler-Lehman refinement was proposed. The derived Weisfeiler-Lehman optimal assignment kernel often provides better classification accuracy on real-world benchmark datasets than the Weisfeiler-Lehman subtree kernel (see Section 6).

Pachauri et al. [72] studied a generalization of the assignment problem to more than two sets, which was used to define transitive assignment kernels for graphs [73]. The method is based on finding a single assignment between the vertices of all graphs of the dataset instead of finding an optimal assignment for each pair of graphs. This approach satisfies the transitivity constraint of mapping kernels and therefore leads to positive-semidefinite kernels. However, non-optimal assignments between individual pairs of graphs are possible. Nikolentzos et al. [45] proposed a matching-based approach based on the Earth Mover's Distance, which results in an indefinite kernel function. In order to deal with this, they employ a variation of the SVM algorithm specialized for learning with indefinite kernels. Additionally, they propose an alternative solution based on the pyramid match kernel, a generic kernel for comparing sets of features [74]. The pyramid match kernel avoids the indefiniteness of other assignment kernels by comparing features through multi-resolution histograms (with bins determined globally, rather than for each pair of graphs).

3.3 Subgraph Patterns

Figure 5: Graphlets. Illustration of graphlets on 3 vertices.

In many applications, a strong baseline for representations of composite objects such as documents, images or graphs is one that ignores the structure altogether and represents objects as bags of components. A well-known example is the so-called bag-of-words representation of text—statistics of word occurrences without context—which remains a staple in natural language processing. For additional specificity, it is common to compare statistics also of bigrams (sequences of two words), trigrams etc. A similar idea may be used to compare graphs by ignoring large-scale structure and viewing graphs as bags of vertices or edges. The vertex label kernel does precisely this by comparing graphs only at the level of similarity between all pairs of vertex labels from two different graphs,
$$k_{\text{VL}}(G, H) = \sum_{v \in V(G)} \sum_{u \in V(H)} k\big(\ell(v), \ell(u)\big).$$
With the base kernel $k$ the equality indicator function, $k_{\text{VL}}$ is a linear kernel on the (unnormalized) distributions of vertex labels in $G$ and $H$. Similar in spirit, the edge label kernel is defined as the sum of base kernel evaluations on all pairs of edge labels (or triplets of the edge label and incident vertex labels). Note that such kernels are a paramount example of instances of the convolution kernel framework, see Section 2.3.

A downside of vertex and edge label kernels is that they ignore the interplay between structure and labels and are almost completely uninformative for unlabeled graphs. Instead of viewing graphs as bags of vertices or edges, we may view them as bags of subgraph patterns. To this end, Shervashidze et al. [26] introduced a kernel based on counting occurrences of subgraph patterns of a fixed size—so called graphlets (see Figure 5). Every graphlet is an instance of an isomorphism type—a set of graphs that are all isomorphic—such as a graph on three vertices with two edges. While there are three graphs that connect three vertices with two edges, they are all isomorphic and considered equivalent as graphlets.

Graphlet kernels count the isomorphism types of all induced (possibly disconnected) subgraphs on $k$ vertices of a graph $G$. Let $\#_i(G)$ for $i \in \{1, \dots, N_k\}$ denote the number of instances of isomorphism type $i$ in $G$, where $N_k$ denotes the number of different types. The kernel computes a feature map $\phi_{\text{GL}}$ for $G$,
$$\phi_{\text{GL}}(G) = \big(\#_1(G), \dots, \#_{N_k}(G)\big).$$
The graphlet kernel is finally defined as $k_{\text{GL}}(G, H) = \langle \phi_{\text{GL}}(G), \phi_{\text{GL}}(H) \rangle$ for two graphs $G$ and $H$.
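A sketch of the feature map for graphlets on 3 vertices, where the isomorphism type of an induced subgraph is fully determined by its number of edges. Unlabeled graphs are assumed to be given as a vertex list and an edge list, and the function names are illustrative.

```python
from itertools import combinations

def graphlet3_features(vertices, edges):
    # Count induced subgraphs on 3 vertices by isomorphism type; for 3 vertices
    # the type is determined by the number of edges among them (0, 1, 2 or 3).
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for triple in combinations(vertices, 3):
        m = sum(1 for pair in combinations(triple, 2) if frozenset(pair) in edge_set)
        counts[m] += 1
    return counts

def graphlet_kernel(f_g, f_h):
    return sum(a * b for a, b in zip(f_g, f_h))

# Toy example: a triangle with a pendant vertex.
f = graphlet3_features([0, 1, 2, 3], [(0, 1), (1, 2), (0, 2), (2, 3)])
print(f)                      # [0, 1, 2, 1]
print(graphlet_kernel(f, f))
```

The exhaustive enumeration used here scales as the number of vertex triples; the sampling and counting schemes discussed below are needed for larger graphlet sizes.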

The time required to compute the graphlet kernel scales exponentially with the size of the considered graphlets. To remedy this, Shervashidze et al. [26] proposed two algorithms for speeding up the computation of the feature map for small graphlet sizes. In particular, it is common to restrict the kernel to connected graphlets (isomorphism types). Additionally, the statistics used by the graphlet kernel may be estimated approximately by subgraph sampling, see, e.g., [75, 76, 77, 78]. Please note that the graphlet kernel as proposed by Shervashidze et al. [26] does not consider any labels or attributes. However, the concept (but not all speed-up tricks) can be extended to labeled graphs by using labeled isomorphism types as features, see, e.g., [79]. Mapping (sub)graphs to their isomorphism type is known as the graph canonization problem, for which no polynomial-time algorithm is known [17]. However, this is not a severe restriction for small graphs such as graphlets and, in addition, well-engineered algorithms solving most practical instances in a short time exist [80]. Horváth et al. [27] proposed a kernel which decomposes graphs into cycles and tree patterns, for which the canonization problem can be solved in polynomial time and simple practical algorithms are known.

Costa and De Grave [39] introduced the neighborhood subgraph pairwise distance kernel which associates a string with every vertex representing its neighborhood up to a certain depth. In order to avoid solving the graph canonization problem, they proposed using a graph invariant that may, in rare cases, map non-isomorphic neighborhood subgraphs to the same string. Then, pairs of these neighborhood graphs together with the shortest-path distance between their central vertices are counted as features. The approach is similar to the Weisfeiler-Lehman shortest-path kernel (see Section 3.1).

An alternative to subgraph patterns are tree patterns, which may contain repeated vertices just like random walks. They were initially proposed for use in graph comparison by Ramon and Gärtner [28] and later refined by Mahé and Vert [29]. Tree pattern kernels are similar to the Weisfeiler-Lehman subtree kernel, but do not only consider the full neighborhood in each step but also all possible subsets of neighbors [35], and hence do not scale to larger datasets. Da San Martino et al. [30] proposed decomposing a graph into trees and applying a kernel defined on trees. In Da San Martino et al. [31], a fast hashing-based computation scheme for the aforementioned graph kernel is proposed.

3.4 Walks and Paths

A downside of the subgraph pattern kernels described in the previous section is that they require the specification of a set of patterns, or subgraph size, in advance. To ensure efficient computation, this often restricts the patterns to a fairly small scale, emphasizing local structure. A popular alternative is to compare the sequences of vertex or edge attributes that are encountered during traversals of graphs. In this section, we describe two families of traversal algorithms which yield different attribute sequences and thus different kernels—shortest paths and random walks.

3.4.1 Shortest-path kernels

One of the very first, and most influential, graph kernels is the shortest-path (SP) kernel [24]. The idea of the SP kernel is to compare the attributes and lengths of the shortest paths between all pairs of vertices in two graphs. The shortest path between two vertices is illustrated in Figure 1. Formally, let $G$ and $H$ be graphs with label function $\ell$ and let $d(u, v)$ denote the shortest-path distance between two vertices $u$ and $v$ of the same graph. Then, the kernel is defined as

$$k_{\text{SP}}(G, H) = \sum_{\substack{u, v \in V(G) \\ u \neq v}} \;\; \sum_{\substack{w, z \in V(H) \\ w \neq z}} k\big((u, v), (w, z)\big), \qquad (5)$$

where
$$k\big((u, v), (w, z)\big) = k_V\big(\ell(u), \ell(w)\big) \cdot k_V\big(\ell(v), \ell(z)\big) \cdot k_d\big(d(u, v), d(w, z)\big).$$
Here, $k_V$ is a kernel for comparing vertex labels and $k_d$ is a kernel to compare shortest-path distances, such that $k_d(a, b) = 0$ if $a = \infty$ or $b = \infty$.

The running time for evaluating the general form of the SP kernel for a pair of graphs is in $O(n^4)$ when using the Floyd-Warshall algorithm for solving the all-pairs shortest-path problem. This is prohibitively large for most practical applications. However, in the case of discrete vertex and edge labels, e.g., a finite subset of the natural numbers, and with $k_V$ the indicator function, we can compute the feature map corresponding to the kernel explicitly. In this case, each component of the feature map of a graph counts the number of corresponding triples $(\ell(u), \ell(v), d(u, v))$ over all vertex pairs $u, v$ of that graph. Using this approach, the time complexity of the SP kernel is reduced to that of the Floyd-Warshall algorithm, which is in $O(n^3)$. In [25] the shortest-path kernel is generalized by considering all shortest paths between two vertices.
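A sketch of the explicit feature map for graphs with discrete labels: Floyd-Warshall computes all shortest-path distances, and the features count (label, label, distance) triples over connected vertex pairs. The graph representation (an edge list over vertices 0..n-1) and the function names are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

def floyd_warshall(n, edges):
    INF = float('inf')
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v in edges:
        d[u][v] = d[v][u] = 1
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def sp_features(n, edges, labels):
    # Count triples (label(u), label(v), shortest-path distance) over all
    # connected vertex pairs; the two labels are sorted to ignore orientation.
    d = floyd_warshall(n, edges)
    feats = Counter()
    for u, v in combinations(range(n), 2):
        if d[u][v] != float('inf'):
            a, b = sorted((labels[u], labels[v]))
            feats[(a, b, d[u][v])] += 1
    return feats

def sp_kernel(f_g, f_h):
    return sum(c * f_h[t] for t, c in f_g.items())

# Toy example: a labeled path vs. a labeled triangle.
f1 = sp_features(3, [(0, 1), (1, 2)], ['C', 'O', 'C'])
f2 = sp_features(3, [(0, 1), (1, 2), (0, 2)], ['C', 'C', 'O'])
print(sp_kernel(f1, f2))
```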

3.4.2 Random walk kernels

Gärtner et al. [22] and Kashima et al. [23] simultaneously proposed graph kernels based on random walks, which count the number of (label sequences along) walks that two graphs have in common. The description of the random walk kernel by Kashima et al. [23] is motivated by a probabilistic view of kernels and based on the idea of so-called marginalized kernels. The feature space of the kernel comprises all possible label sequences produced by random walks; since the length of the walks is unbounded, the space is of infinite dimension. A method of computation is proposed based on a recursive reformulation of the kernel, which at the end boils down to finding the stationary state of a discrete-time linear system. Since this kernel was later generalized by [3] we do not go into the mathematical details of the original publication. The approach fully supports attributed graphs, since vertex and edge labels encountered on walks are compared by user-specified kernels.

Mahé et al. [40] extended the original formulation of random walk kernels with a focus on applications in cheminformatics [81] to improve their scalability and relevance as a similarity measure. A mostly unfavorable characteristic of random walks is that they may visit the same vertex several times. Walks are even allowed to traverse an edge from $u$ to $v$ and instantly return to $u$ via the same edge, a problem referred to as tottering. These repeated consecutive vertices do not provide useful information and may even harm the validity of the similarity measure. Hence, the marginalized graph kernel was extended to avoid tottering by replacing the underlying first-order Markov random walk model by a second-order Markov random walk model. This technique only eliminates walks $(v_1, v_2, \dots)$ with $v_i = v_{i+2}$ for some $i$; it does not require the considered walks to be paths, i.e., repeated vertices still occur.

Like other random walk kernels, Gärtner et al. [22] define the feature space of their kernel as the label sequences derived from walks, but propose a different method of computation based on the direct product graph of two labeled input graphs.

Definition 3.2 (Direct Product Graph).

For two labeled graphs $G$ and $H$ the direct product graph $G \times H$ is defined by
$$V(G \times H) = \big\{(v, w) \in V(G) \times V(H) \mid \ell(v) = \ell(w)\big\},$$
$$E(G \times H) = \big\{\{(v, w), (v', w')\} \mid vv' \in E(G) \wedge ww' \in E(H) \wedge \ell(vv') = \ell(ww')\big\}.$$
A vertex (edge) in $G \times H$ has the same label as the corresponding vertices (edges) in $G$ and $H$.

Figure 6: Direct product graph. Two graphs $G$ and $H$ and their direct product graph $G \times H$.

The concept is illustrated in Figure 6. There is a one-to-one correspondence between walks in $G \times H$ and pairs of walks in $G$ and $H$ with the same label sequence. The direct product kernel is then defined as

$$K_{\times}(G, H) = \sum_{i, j = 1}^{|V(G \times H)|} \left[\sum_{l=0}^{\infty} \lambda_l A_{\times}^{\,l}\right]_{ij}, \qquad (6)$$

where $A_\times$ is the adjacency matrix of $G \times H$ and $(\lambda_0, \lambda_1, \dots)$ is a sequence of weights such that the above sum converges. This is the case for $\lambda_l = \gamma^l$ with $\gamma < \frac{1}{\Delta}$, where $\Delta$ is the maximum degree of $G \times H$. For this choice of weights and with $I$ the identity matrix, there exists a closed-form expression,

$$K_{\times}(G, H) = \sum_{i, j = 1}^{|V(G \times H)|} \left[(I - \gamma A_{\times})^{-1}\right]_{ij}, \qquad (7)$$

which can be computed by matrix inversion. Since the expression is reminiscent of the geometric series transferred to matrices, Equation (7) is referred to as the geometric random walk kernel. The running time to compute the geometric random walk kernel between two graphs is dominated by the inversion of the adjacency matrix associated with the direct product graph and is given as roughly $O(n^6)$ [3].
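For unlabeled graphs the adjacency matrix of the direct product graph is the Kronecker product of the two adjacency matrices, so the closed form of Equation (7) can be sketched in a few lines; the decay parameter must satisfy the convergence condition stated above, and the function name is an illustrative choice.

```python
import numpy as np

def geometric_random_walk_kernel(A_g, A_h, gamma):
    # Direct product graph of two unlabeled graphs: Kronecker product of adjacencies.
    A_x = np.kron(A_g, A_h)
    n = A_x.shape[0]
    # Closed form of the geometric series: sum of all entries of (I - gamma * A_x)^-1.
    # Requires gamma < 1 / (maximum degree of the product graph) for convergence.
    return np.linalg.inv(np.eye(n) - gamma * A_x).sum()

# Toy example: a path on three vertices and a triangle.
P3 = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
K3 = np.ones((3, 3)) - np.eye(3)
print(geometric_random_walk_kernel(P3, K3, gamma=0.1))
```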

Vishwanathan et al. [3] propose a generalizing framework for random walk based graph kernels and argue that the approaches by Kashima et al. [23] and Gärtner et al. [22] can be considered special cases of this kernel. The framework does not address vertex labels, but makes extensive use of the Kronecker product $\otimes$ between matrices and lifts it to the feature space associated with an (edge) kernel. Given an edge kernel $k_E$ on attributes from a set $\mathcal{A}$, let $\phi$ be an associated feature map. For an attributed graph $G$, the feature matrix $\Phi(G)$ is then defined by $\Phi(G)_{ij} = \phi(\ell(v_i v_j))$ if $v_i v_j \in E(G)$ and $\Phi(G)_{ij} = 0$ otherwise. Then, $W_\times = \Phi(G) \otimes \Phi(H)$ yields a weight matrix of the direct product graph $G \times H$ (here vertex labels are ignored, i.e., $V(G \times H) = V(G) \times V(H)$). The proposed kernel is defined as

$$K(G, H) = \sum_{l=0}^{\infty} \mu_l \, q_{\times}^{\top} W_{\times}^{\,l} \, p_{\times}, \qquad (8)$$

where $p_\times$ and $q_\times$ are initial and stopping probability distributions and the $\mu_l$ are coefficients such that the sum converges. Several methods of computation are proposed, which yield different running times depending on a parameter $r$ specific to the respective approach; the parameter denotes either the number of fixed-point iterations, the number of power iterations, or the effective rank of $W_\times$. The running times to compare graphs of order $n$ also depend on the edge labels of the input graphs and the desired edge kernel: for unlabeled graphs a running time of $O(rn^3)$ is achieved and $O(rdn^3)$ for graphs labeled from an alphabet of size $d$. The same running time is attained by edge kernels with a $d$-dimensional feature space, while $O(rn^4)$ time is required in the case of an infinite-dimensional feature space. For sparse graphs, $O(rn^2)$ is achieved in all cases, where a graph is said to be sparse if $|E(G)| = O(|V(G)|)$. Further improvements of the running time were subsequently achieved by non-exact algorithms based on low-rank approximations [42]. Recently, the phenomenon of halting in random walk kernels has been studied by Sugiyama and Borgwardt [41]; it refers to the fact that walk-based graph kernels may down-weight longer walks so much that their value is dominated by walks of length 1.

The classical random walk kernels described above in theory take all walks, without any limitation on their length, into account, which leads to a high-dimensional feature space. Several application-oriented papers used walks up to a certain length only, e.g., for the prediction of protein functions [82] or image classification [83]. These walk-based kernels are not susceptible to the phenomenon of halting. Kriege et al. [15, 16] systematically studied kernels based on all walks of a predetermined fixed length $\ell$, referred to as the $\ell$-walk kernel, and on all walks of length at most $\ell$, called the Max-$\ell$-walk kernel, respectively. For these, computation schemes based on implicit and explicit feature maps were proposed and compared experimentally. Computation by explicit feature maps provides better performance for graphs with discrete labels of low diversity and for small walk lengths. Conceptually different, Zhang et al. [84] derived graph kernels based on return probabilities of random walks.

3.5 Graph Kernels for Graphs with Continuous Labels

Most real-world graphs have attributes, mostly real-valued vectors, associated with their vertices and edges. For example, atoms of chemical molecules have physical and chemical properties; individuals in social networks have demographic information; and words in documents carry semantic meaning. Kernels based on pattern counting or neighborhood aggregation are of a discrete nature, i.e., two vertices are regarded as similar if and only if they exactly match, structure-wise as well as attribute-wise. However, in most applications it is desirable to compare real-valued attributes with more nuanced similarity measures such as the Gaussian RBF kernel of Equation 1.

Kernels suitable for attributed graphs typically rely on user-defined kernels for the comparison of vertex and edge labels. These kernels are then combined with kernels on structure through operations that yield a valid kernel on graphs, such as addition or multiplication. Two examples of this, the recently proposed kernels for attributed graphs, GraphHopper [32] and GraphInvariant [33], can be expressed as

$$K_{\text{WV}}(G, H) = \sum_{v \in V(G)} \sum_{u \in V(H)} k_W(v, u) \cdot k_V(v, u). \qquad (9)$$

Here, $k_V$ is a user-specified kernel comparing vertex attributes and $k_W$ is a kernel that determines a weight for a vertex pair based on the individual graph structures. Kernels belonging to this family are easily identifiable as instances of $\mathcal{R}$-convolution kernels, cf. Definition 2.1.

For graphs with real-valued attributes, one could set $k_V$ to the Gaussian RBF kernel. The selection of the kernel $k_W$ is essential to take the graph structure into account and makes it possible to obtain different instances of weighted vertex kernels. One implementation of $k_W$ motivated along the lines of GraphInvariant [33] is
$$k_W(v, u) = \sum_{i=0}^{h} k_\delta\big(\tau_i(v), \tau_i(u)\big),$$
where $\tau_i(v)$ denotes the discrete label of the vertex $v$ after the $i$-th iteration of Weisfeiler-Lehman label refinement of the underlying unlabeled graph and $k_\delta$ is the Dirac kernel. Intuitively, this kernel reflects to what extent the two vertices have a structurally similar neighborhood.

Another graph kernel which fits into the framework of weighted vertex kernels is the GraphHopper kernel [32] with
$$k_W(v, u) = \langle M(v), M(u) \rangle_F.$$
Here $M(v)$ and $M(u)$ are $\delta \times \delta$ matrices, where the entry $M(v)_{ij}$ counts the number of times the vertex $v$ appears as the $i$-th vertex on a shortest path of discrete length $j$ in its graph, $\delta$ denotes the maximum diameter over all graphs, and $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product.
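The following sketch instantiates the weighted vertex kernel of Equation (9) with a Gaussian RBF on continuous attributes and a GraphInvariant-style weight that counts shared Weisfeiler-Lehman labels across refinement iterations. It assumes that per-vertex attribute vectors and per-vertex sequences of WL labels (e.g., computed as sketched in Section 3.1) are already available; all names are illustrative.

```python
import numpy as np

def weighted_vertex_kernel(attrs_g, attrs_h, wl_g, wl_h, gamma=1.0):
    # attrs_*: list of continuous attribute vectors, one per vertex.
    # wl_*:    list of tuples of WL labels (iterations 0..h), one per vertex.
    value = 0.0
    for x, cx in zip(attrs_g, wl_g):
        for y, cy in zip(attrs_h, wl_h):
            k_v = np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))  # RBF on attributes
            k_w = sum(a == b for a, b in zip(cx, cy))  # shared WL labels over iterations
            value += k_w * k_v
    return value

# Toy example: two graphs with 2-dimensional vertex attributes and h = 1 WL labels.
print(weighted_vertex_kernel(
    [[0.0, 1.0], [1.0, 0.0]], [[0.1, 0.9], [2.0, 2.0]],
    [(0, 3), (1, 4)], [(0, 3), (1, 5)]))
```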

Kriege and Mutzel [34] proposed the subgraph matching kernel, which is computed by considering all bijections between all subgraphs on at most $k$ vertices and which allows vertex attributes to be compared by a custom kernel. Moreover, in [47] the Descriptor Matching kernel is defined, which captures the graph structure by a propagation mechanism between neighbors and uses a variant of the VG kernel [85] to compare attributes between vertices. The kernel can be computed in time linear in the number of edges.

Morris et al. [52] introduced a scalable framework to compare attributed graphs. The idea is to iteratively turn the continuous attributes of a graph into discrete labels using randomized hash functions. This makes it possible to apply fast explicit graph feature maps, which are limited to graphs with discrete annotations, such as the one associated with the Weisfeiler-Lehman subtree kernel [35]. For special hash functions, the authors obtain approximation results for several state-of-the-art kernels which can handle continuous information. Moreover, they derived a variant of the Weisfeiler-Lehman subtree kernel which can handle continuous attributes.

3.6 Neural Approaches

Graph kernels are typically defined with respect to a fixed set of substructures or patterns (matching- and optimal assignment kernels are important exceptions). As a result, their capacity to distinguish graphs of different classes does not adapt to the given data distribution. In recent years, graph neural networks (GNNs) have emerged as a machine learning framework to address this issue. Standard GNNs can be viewed as a feed-forward neural network version of the 1-WL algorithm, where colors (labels) are replaced by continuous feature vectors and network layers are used to aggregate over vertex neighborhoods [63, 61]. More specifically, a basic GNN can be implemented as follows [86]. In each layer $t > 0$, we compute a new feature

$$f^{(t)}(v) = \sigma\Big( W_1^{(t)}\, f^{(t-1)}(v) + W_2^{(t)} \sum_{u \in N(v)} f^{(t-1)}(u) \Big) \qquad (10)$$

in $\mathbb{R}^{d}$ for $v \in V(G)$, where $W_1^{(t)}$ and $W_2^{(t)}$ are parameter matrices from $\mathbb{R}^{d \times d}$, and $\sigma$ denotes a component-wise non-linear function, e.g., a sigmoid or a ReLU.

Following [62], one may also replace the sum defined over the neighborhood in the above equation by a permutation-invariant, differentiable function, and one may substitute the outer sum, e.g., by a column-wise vector concatenation or LSTM-style update step. Thus, in full generality a new feature is computed as

$$f^{(t)}(v) = f^{W_1}_{\text{merge}}\Big(f^{(t-1)}(v),\; f^{W_2}_{\text{aggr}}\big(\{\!\{ f^{(t-1)}(u) \mid u \in N(v) \}\!\}\big)\Big), \qquad (11)$$

where $f^{W_2}_{\text{aggr}}$ aggregates over the multiset of neighborhood features and $f^{W_1}_{\text{merge}}$ merges the representation of the vertex from step $t-1$ with the computed neighborhood features. Both $f^{W_2}_{\text{aggr}}$ and $f^{W_1}_{\text{merge}}$ may be arbitrary differentiable, permutation-invariant functions (e.g., neural networks), and, by analogy to Equation 10, we denote their parameters as $W_1$ and $W_2$, respectively.

A vector representation of the whole graph can be computed by summing over the vector representations computed for all vertices, i.e.,
$$f(G) = \sum_{v \in V(G)} f^{(T)}(v),$$
where $T$ denotes the last layer. More refined approaches use differentiable pooling operators based on sorting [87] and soft assignments [88].
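A small numpy sketch of the basic layer of Equation (10) together with the sum readout; random weight matrices stand in for parameters that would normally be trained end-to-end (see below), and all names are illustrative rather than taken from any GNN library.

```python
import numpy as np

def gnn_layer(A, F, W1, W2):
    # Equation (10) with features as rows of F: combine each vertex's previous
    # feature with the sum of its neighbors' features, then apply a ReLU.
    return np.maximum(0.0, F @ W1 + (A @ F) @ W2)

def readout(F):
    # Whole-graph representation: sum of the final vertex features.
    return F.sum(axis=0)

# Toy example: a triangle with 2-dimensional initial features and two layers.
rng = np.random.default_rng(0)
A = np.ones((3, 3)) - np.eye(3)          # adjacency matrix
F = rng.normal(size=(3, 2))              # initial vertex features
W1a, W2a = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
W1b, W2b = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
print(readout(gnn_layer(A, gnn_layer(A, F, W1a, W2a), W1b, W2b)))
```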

In order to adapt the parameters $W_1$ and $W_2$ of Equations 10 and 11 to a given data distribution, they are optimized in an end-to-end fashion (usually via stochastic gradient descent) together with the parameters of a neural network used for classification or regression. Recently, a connection between the 1-WL and GNNs has been established [65], showing that any possible GNN architecture cannot be more powerful than the 1-WL in terms of distinguishing non-isomorphic graphs.

Yanardag and Vishwanathan [19] use techniques from neural language modeling, such as skip-gram [89]. The authors build on known state-of-the-art kernels, but additionally take relationships between their features into account. This is demonstrated by hand-designed matrices encoding the similarities between features for selected graph kernels, such as the graphlet and Weisfeiler-Lehman subtree kernels. Similar ideas were used in [51], where smoothing methods for multinomial distributions were applied to the graph domain.

3.7 Other Approaches

Kondor et al. [48] derived a graph kernel using graph invariants based on group representation theory. In [49], a graph kernel is proposed which is able to capture the graph structure at multiple scales, i.e., neighborhoods around vertices of increasing depth, by using ideas from spectral graph theory. Moreover, the authors provide a low-rank approximation algorithm to scale the kernel computation to large graphs. Johansson et al. [50] define a graph kernel based on the Lovász number [90] and provide algorithms to approximate this kernel.

In [91], a kernel for dynamic graphs is proposed, where vertices and edges are added or deleted over time. The kernel is based on eigendecompositions. Kriege et al. [15, 16] investigated under which conditions it is possible and more efficient to compute the feature map corresponding to a graph kernel explicitly. They provide theoretical as well as empirical results for walk-based kernels. Li et al. [92] proposed a streaming version of the Weisfeiler-Lehman algorithm using a hashing technique. Aiolli et al. [20] and Massimo et al. [93] applied multiple kernel learning to the graph kernel domain. Nikolentzos et al. [94] proposed to first build the $k$-core decomposition of graphs to obtain a hierarchy of nested subgraphs, which are then individually compared by a graph similarity measure. The approach has been combined with several graph kernels, such as the Weisfeiler-Lehman subtree kernel, and was shown to improve the accuracy on some datasets.

4 Expressivity of Graph Kernels

While a large literature has studied the empirical performance of various graph kernels, comparatively few works deal with graph kernels exclusively from a theoretical point of view. Most works that provide learning guarantees for graph kernels attempt to formalize their expressivity.

The expressivity of a graph kernel refers broadly to the kernel's ability to distinguish certain patterns and properties of graphs. In an early attempt to formalize this notion, Gärtner et al. [22] introduced the concept of a complete graph kernel—a kernel for which the corresponding feature map is an injection. If a kernel is not complete, there are non-isomorphic graphs $G$ and $H$ with $\phi(G) = \phi(H)$ that cannot be distinguished by the kernel. In this case there is no way any classifier based on this kernel can separate these two graphs. However, computing a complete graph kernel is at least as hard as deciding whether two graphs are isomorphic [22], a problem for which no polynomial-time algorithm for general graphs is known [17]. Therefore, none of the graph kernels used in practice are complete. Note, however, that a kernel may be injective with respect to a finite or restricted family of graphs.

As no practical kernels are complete, attempts have been made to characterize expressivity in terms of which graph properties can be distinguished by existing graph kernels. In [95], a framework to measure the expressivity of graph kernels based on ideas from property testing was introduced. The authors show that graph kernels such as the Weisfeiler-Lehman subtree, the shortest-path and the graphlet kernel are not able to distinguish basic graph properties such as planarity or connectedness. Based on these results they propose a graph kernel based on frequency counts of the isomorphism types of subgraphs around each vertex up to a certain depth. This kernel is able to distinguish the above properties and is computable in polynomial time for graphs of bounded degree. Finally, the authors provide learning guarantees for 1-nearest neighbor classifiers. Similarly, [46] proved that optimal assignment kernels based on Laplacian embeddings of graphs can distinguish graphs with different densities as well as random graphs with and without planted cliques. Johansson et al. [50] studied global properties of graphs such as girth, density and clique number, and proposed kernels based on vertex embeddings associated with the Lovász $\vartheta$ and SVM-$\vartheta$ numbers, which have been shown to capture these properties.

The expressivity of graph kernels has been studied also from statistical perspectives. In particular, Oneto et al. [96] use well-known results from statistical learning theory to give results which bound measures of expressivity in terms of Rademacher complexity and stability theory. Moreover, they apply their theoretical findings in an experimental study comparing the estimated expressivity of popular graph kernels, confirming some of their known properties. Finally, [75] studied the statistical tradeoff between expressivity and differential privacy [97].

5 Applications of Graph Kernels

The following section outlines a non-exhaustive list of applications of the kernels described in Section 3, categorized by scientific area.

Chemoinformatics

Chemoinformatics is the study of chemistry and chemical compounds using statistical and computational resources [98]. An important application is drug development, in which new, untested medical compounds are modeled in silico before being tested in vitro or in animal tests. The primary object of study—the molecule—is well represented by a graph in which vertices take the place of atoms and edges that of bonds. The chemical properties of these atoms and bonds may be represented as vertex and edge attributes, and the properties of the molecule itself through features of the structure and attributes. The graphs derived from small molecules have specific characteristics. They typically have fewer than 50 vertices, their degree is bounded by a small constant (with few exceptions), and the distribution of vertex labels representing atom types is specific (e.g., most of the atoms are carbon). Almost all molecular graphs are planar, most of them even outerplanar [99], and they have a tree-like structure [100]. Molecular graphs are not only a common benchmark for graph kernels, but several kernels were specifically proposed for this domain, e.g., [27, 101, 102, 29, 43]. The pharmacophore kernel was introduced by Mahé et al. [103] to compare chemical compounds based on characteristic features of vertices together with their relative spatial arrangement. As a result, the kernel is designed to handle continuous distances. The pharmacophore kernel was shown to be an instance of the more general subgraph matching kernel [34]. Mahé and Vert [29] developed new tree pattern kernels for molecular graphs, which were then applied in toxicity and anti-cancer activity prediction tasks. Kernels for chemical compounds such as these have been successfully employed for various tasks in cheminformatics, including the prediction of mutagenicity, toxicity and anti-cancer activity [101].

However, such tasks have been addressed by computational methods long before the advent of graph kernels, cf. Figure 2. So-called fingerprints are a well-established classical technique in cheminformatics to represent molecules by feature vectors [98]. Commonly, features are obtained by (i) enumeration of all substructures of a certain class contained in the molecular graphs, (ii) taking them from a predefined dictionary of relevant substructures, or (iii) generating them in a preceding data-mining phase. Fingerprints then encode either the number of occurrences of a feature or merely its presence or absence with a single bit per feature. Often hashing is used to reduce the fingerprint length to a fixed size at the cost of information loss [see, e.g., 104]. Such fingerprints are typically compared using similarity measures such as the Tanimoto coefficient, which are closely related to kernels [58]. Approaches of the first category are, e.g., based on all paths contained in a graph [104] or all subgraphs up to a certain size [79], similar to graphlets. Ralaivola et al. [58] experimentally compared random walk kernels to kernels derived from path-based fingerprints and showed that these reach similar classification performance on molecular graph datasets. Extended connectivity fingerprints encode the neighborhood of atoms iteratively, similarly to the graph kernels discussed in Section 3.1, and have been a standard tool in cheminformatics for decades [57]. Predefined dictionaries compiled by experts with domain-specific knowledge exist, e.g., MACCS/MDL Keys for drug discovery [105].
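To make the fingerprint idea concrete, the following sketch folds an arbitrary collection of substructure descriptors into a fixed-size binary fingerprint by hashing and compares two fingerprints with the Tanimoto coefficient. The toy path strings and the use of MD5 as the folding hash are illustrative choices only, not the scheme of any particular cheminformatics toolkit.

```python
import hashlib

def hashed_fingerprint(substructures, n_bits=1024):
    """Fold a set of substructure descriptors (e.g., canonical path strings)
    into a fixed-size binary fingerprint via hashing."""
    bits = [0] * n_bits
    for s in substructures:
        # Hash the substructure string and set the corresponding bit.
        idx = int(hashlib.md5(s.encode()).hexdigest(), 16) % n_bits
        bits[idx] = 1
    return bits

def tanimoto(fp1, fp2):
    """Tanimoto (Jaccard) coefficient of two binary fingerprints."""
    both = sum(a & b for a, b in zip(fp1, fp2))
    either = sum(a | b for a, b in zip(fp1, fp2))
    return both / either if either > 0 else 1.0

# Toy example: fingerprints from (hypothetical) path descriptors of two molecules.
fp_a = hashed_fingerprint(["C-C", "C-O", "C-C-O"])
fp_b = hashed_fingerprint(["C-C", "C-N", "C-C-N"])
print(tanimoto(fp_a, fp_b))
```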

Bioinformatics

Understanding proteins, one of the fundamental building blocks of life, is a central goal in bioinformatics. Proteins are complex molecules which are often represented in terms of larger components such as helices, sheets and turns. Borgwardt et al. [82] model protein data as graphs where each vertex represents such a component, and each edge indicates proximity in space or in amino acid sequence. Both vertices and edges are annotated by categorical and real-valued attributes. The authors used a modified random walk kernel to classify proteins as enzymes or non-enzymes. In related work, Borgwardt et al. [106] predict disease outcomes from protein-protein interaction networks. Here, each vertex is a protein and each edge represents the physical interaction between a pair of proteins. In order to take missing edges into account, which is crucial for studying protein-protein interaction networks, the kernel

$K(G, G') = K_{\mathrm{RW}}(G, G') + K_{\mathrm{RW}}(\bar{G}, \bar{G}')$

was proposed, which is the sum of a random walk kernel applied to the original graphs $G$ and $G'$ as well as to their complement graphs $\bar{G}$ and $\bar{G}'$. Studying pairs of complement graphs may be useful in other applications as well.
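A minimal sketch of this construction is given below, assuming some base graph kernel function is available; the edge-count "kernel" used here is only a placeholder for the random walk kernel of the original work.

```python
import networkx as nx

def complement_sum_kernel(G1, G2, base_kernel):
    """Composite kernel in the spirit described above: the base kernel evaluated
    on the original graphs plus on their complement graphs."""
    return base_kernel(G1, G2) + base_kernel(nx.complement(G1), nx.complement(G2))

# Placeholder base kernel for illustration (a random walk kernel would be used
# in the actual application): compare graphs by their edge counts.
toy_kernel = lambda G, H: G.number_of_edges() * H.number_of_edges()

G1, G2 = nx.path_graph(5), nx.cycle_graph(5)
print(complement_sum_kernel(G1, G2, toy_kernel))
```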

Neuroscience

The connectivity and functional activity between neurons in the human brain are indicative of diseases such as Alzheimer's disease as well as of subjects' reactions to sensory stimuli. For this reason, researchers in neuroscience have studied the similarities of brain networks among human subjects to find patterns that correlate with known differences between them. Representing parts of the brain as vertices and the strength of connection between them as edges, several authors have applied graph kernels for this purpose [107, 108, 109, 110, 111]. Unlike in many other applications, the vertices in brain networks often have an identity, representing a specific part of the brain. Jie et al. [111] exploited this fact in learning to classify mild cognitive impairment (MCI). Their proposed kernel is based on iterative neighborhood expansion (similar to the Weisfeiler-Lehman kernel) and exploits the one-to-one mapping of vertices (brain regions) between different graphs; they find that it consistently outperforms baseline kernels in this task.

Natural language processing

Natural language processing is rife with relational data: words in a document relate through their location in the text, documents relate through their publication venues and authors, and named entities relate through the contexts in which they are mentioned. Graph kernels have been used to measure similarity between all of these concepts. For example, Nikolentzos et al. [112] use the shortest-path kernel to compute document similarity by converting each document into a graph in which vertices represent terms and two vertices are connected by an edge if the corresponding terms appear together in a fixed-size window. Hermansson et al. [113] used the co-occurrence network of person names in a large news corpus to classify which names belong to multiple individuals in the database. Each name was represented by the subgraph corresponding to the neighborhood of co-occurring names and labeled by domain experts. The output of the system was intended for use as preprocessing for an entity disambiguation system. In [114], the Weisfeiler-Lehman subtree kernel was used to define a similarity function on call graphs of Java programs in order to identify similar call graphs. de Vries [115] extended the Weisfeiler-Lehman subtree kernel so that it can handle RDF data.
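As an illustration of the graph-of-words construction mentioned above, the following sketch builds such a co-occurrence graph from a tokenized document; the window size and the whitespace tokenization are placeholder choices and not the exact preprocessing of Nikolentzos et al. [112].

```python
import networkx as nx

def cooccurrence_graph(tokens, window=3):
    """Build a graph-of-words: one vertex per distinct term, an edge between two
    terms if they co-occur within a sliding window of the given size."""
    G = nx.Graph()
    G.add_nodes_from(set(tokens))
    for i, t in enumerate(tokens):
        # Connect the current term to the terms in the remainder of the window.
        for u in tokens[i + 1 : i + window]:
            if u != t:
                G.add_edge(t, u)
    return G

doc = "graph kernels measure similarity between graphs".split()
print(cooccurrence_graph(doc, window=3).edges())
```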

Computer vision

Harchaoui and Bach [83] applied kernels based on walks of a fixed length to image classification and developed a dynamic programming approach for their computation. They also modified tree pattern kernels for image classification, where graphs typically have a fixed embedding in the plane. Wu et al. [116] proposed graph kernels for human action recognition in video sequences. To this end, they encode the features of each frame as well as the dynamic changes between successive frames by separate graphs. These graphs are compared by a linear combination of random walk kernels using multiple kernel learning, which leads to an accurate classification of human actions. The propagation kernel was applied to predict object categories in order to facilitate robot grasping [117]. To this end, 3D point cloud data was represented by $k$-nearest neighbor graphs.

6 Experimental Study

In our experimental study, we investigate various kernels considered to be state-of-the-art in detail and compare them to simple baseline methods using vertex and edge label histograms. We would like to answer the following research questions.

  1. Are the proposed graph kernels sufficiently expressive to distinguish the graphs of common benchmark datasets from each other according to their labels and structure?

  2. Can the classification accuracy of graph kernels be improved by finding non-linear decision boundaries in their feature space?

  3. Is there a graph kernel that is superior to the other graph kernels in terms of classification accuracy? Does the answer to question 1 explain the differences in prediction accuracy?

  4. Which graph kernels predict similarly? Do different graph kernels succeed and fail for the same graphs?

  5. Is there a kernel for graphs with continuous attributes that is superior to the other graph kernels in terms of classification accuracy?

6.1 Methods

We describe the methods we used to answer the research questions and summarize our experimental setup.

6.1.1 Classification Accuracy

In order to answer several of our research questions, it is necessary to determine the prediction accuracy achieved by the different graph kernels. We performed classification experiments using the $C$-SVM implementation LIBSVM [118]. We used nested cross-validation with the same number of folds in the inner and outer loop. In the inner loop, the kernel parameters and the regularization parameter $C$ were chosen by cross-validation based on the training set for the current fold. In the same way it was determined whether the kernel matrix should be normalized. The parameter $C$ was chosen from a fixed grid of values. We repeated the outer cross-validation ten times with different random folds and report average accuracies and standard deviations.
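For illustration, a minimal sketch of such a nested cross-validation protocol on a precomputed graph kernel matrix is given below (using scikit-learn rather than LIBSVM directly); the grid of $C$ values, the fold count and the stratified splitting are illustrative placeholders and not the exact settings of our experiments.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

def nested_cv_accuracy(K, y, C_grid=(1e-3, 1e-1, 1e1, 1e3), folds=10):
    """Nested cross-validation on a precomputed (n x n) graph kernel matrix K.
    The inner grid search selects the SVM regularization parameter C on the
    training part of each outer fold; the outer loop estimates accuracy."""
    y = np.asarray(y)
    outer = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    accuracies = []
    for train, test in outer.split(K, y):
        K_train = K[np.ix_(train, train)]   # kernel values among training graphs
        K_test = K[np.ix_(test, train)]     # test graphs vs. training graphs
        search = GridSearchCV(SVC(kernel="precomputed"),
                              {"C": list(C_grid)}, cv=folds)
        search.fit(K_train, y[train])
        accuracies.append(search.score(K_test, y[test]))
    return np.mean(accuracies), np.std(accuracies)

# Usage: acc, std = nested_cv_accuracy(K, y) for any graph kernel matrix K.
```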

6.1.2 Complete Graph Kernels

The theoretical concept of complete graph kernels has little practical relevance and is not suitable for answering question 1. Therefore, we generalize the concept of complete graph kernels. For a given dataset $\mathcal{D}$ of graphs with class labels $y(G)$ for all $G \in \mathcal{D}$, we say a graph kernel $K$ with a feature map $\phi$ is complete for $\mathcal{D}$ if for all graphs $G, H \in \mathcal{D}$ the implication $\phi(G) = \phi(H) \Rightarrow G \simeq H$ holds; it is label complete for $\mathcal{D}$ if for all graphs $G, H \in \mathcal{D}$ the implication $\phi(G) = \phi(H) \Rightarrow y(G) = y(H)$ holds. Note that we may test whether $\phi(G) = \phi(H)$ holds using the kernel trick without constructing the feature vectors. For a kernel $K$ on a set $\mathcal{X}$ with a feature map $\phi$, the kernel metric is

$d_K(x, y) = \lVert \phi(x) - \phi(y) \rVert$ (12)
$\phantom{d_K(x, y)} = \sqrt{K(x, x) + K(y, y) - 2\,K(x, y)}$. (13)

Therefore, $d_K(G, H) = 0$ if and only if $\phi(G) = \phi(H)$. We define the (label) completeness ratio of a graph kernel w.r.t. a dataset as the fraction of graphs in the dataset that can be distinguished from all other graphs (with different class labels) in the dataset. Note that the label completeness ratio upper-bounds the classification accuracy a kernel can achieve on a specific dataset.

We investigate how this measure aligns with the observed prediction accuracy. Clearly, classifiers based on complete kernels do not necessarily generalize well, e.g., because the feature vectors of graphs in different classes are not linearly separable in the feature space. In this case, an (additional) mapping into a high-dimensional feature space might improve the accuracy.
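The following sketch illustrates how the (label) completeness ratio can be computed directly from a kernel matrix via the kernel trick of Equation 13; the numerical tolerance used to decide whether two feature vectors coincide is an implementation detail chosen here for illustration.

```python
import numpy as np

def completeness_ratios(K, y, tol=1e-9):
    """Completeness and label completeness ratio of a kernel on a dataset,
    computed from the kernel matrix K via the kernel trick:
    d(G, H)^2 = K(G, G) + K(H, H) - 2 K(G, H)."""
    d2 = np.add.outer(np.diag(K), np.diag(K)) - 2 * K
    same_point = d2 <= tol                  # phi(G) == phi(H) up to tolerance
    np.fill_diagonal(same_point, False)     # ignore comparison with itself
    y = np.asarray(y)
    same_label = y[:, None] == y[None, :]
    # A graph is counted if it collides with no other graph (completeness)
    # or with no graph of a different class label (label completeness).
    complete = ~same_point.any(axis=1)
    label_complete = ~(same_point & ~same_label).any(axis=1)
    return complete.mean(), label_complete.mean()
```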

6.1.3 Non-linear Decision Boundaries in the Feature Space of Graph Kernels

Many graph kernels explicitly compute feature vectors and thus essentially transform graph data into vector data, cf. Section 3. Typically, these kernels then simply apply the linear kernel to these vectors to obtain a graph kernel. This is surprising, since it is well known that for vector data better results can often be obtained with a polynomial or Gaussian RBF kernel. These, however, are usually not used in combination with graph kernels. Sugiyama and Borgwardt [41] observed that applying a Gaussian RBF kernel to vertex and edge label histograms leads to a clear improvement over linear kernels. Moreover, for some datasets the approach was observed to be competitive with random walk kernels. Going beyond the application of standard kernels to graph feature vectors, Kriege [5] proposed to obtain modified graph kernels also from those based on implicit computation schemes by employing the kernel trick, e.g., by substituting the Euclidean distance in the Gaussian RBF kernel with the metric associated with a graph kernel. Since the kernel metric can be computed without explicit feature maps, any graph kernel can thereby be modified to operate in a different (high-dimensional) feature space. However, the approach was generally not employed in experimental evaluations of graph kernels. Only recently, Nikolentzos and Vazirgiannis [119] presented first experimental results of the approach for the shortest-path, Weisfeiler-Lehman and pyramid match graph kernels using a polynomial and a Gaussian RBF kernel for the successive embedding. Promising experimental results were presented, in particular for the Gaussian RBF kernel. We present a detailed evaluation of the approach on a wide range of graph kernels and datasets.

We apply the Gaussian RBF kernel to the feature vectors associated with graph kernels by substituting the Euclidean distance in Equation 1 with the metric associated with graph kernels. Note that the kernel metric can be computed from feature vectors according to Equation 12 or by employing the kernel trick according to Equation 13. In order to study the effect of this modification experimentally, we modified the computed kernel matrices as described above. The parameter $\gamma$ was selected by cross-validation in the inner cross-validation loop based on the training datasets.
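A minimal sketch of this transformation, assuming a precomputed kernel matrix, is given below; the function name is ours, and $\gamma$ would be selected by cross-validation as described above.

```python
import numpy as np

def rbf_from_kernel_matrix(K, gamma):
    """Turn a precomputed graph kernel matrix into a Gaussian RBF kernel matrix
    by substituting the kernel metric for the Euclidean distance:
    K_rbf(G, H) = exp(-gamma * d_K(G, H)^2), with d_K as in Equation 13."""
    diag = np.diag(K)
    d2 = np.add.outer(diag, diag) - 2 * K   # squared kernel metric
    return np.exp(-gamma * d2)

# gamma is a hyperparameter, e.g., chosen from a logarithmic grid by
# cross-validation on the training folds.
```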

6.2 Datasets

In our experimental evaluation, we considered graph data from various domains, most of which have been used previously to compare graph kernels. Moreover, we derived new large datasets from the data published by the National Center for Advancing Translational Sciences in the context of the Tox21 Data Challenge 2014 (https://tripod.nih.gov/tox21/challenge/), which was initiated with the goal of developing better toxicity assessment methods for small molecules. These datasets each contain more than 7000 graphs and thus exceed the size of the datasets typically used to evaluate graph kernels. We have made all datasets publicly available [120]. Their statistics are summarized in Table 2.

The datasets AIDS, BZR, COX2, DHFR, Mutagenicity, MUTAG, NCI1, NCI109, PTC and Tox21 are graphs derived from small molecules, where class labels encode a certain biological property such as toxicity or activity against cancer cells. The vertices and edges of the graphs represent the atoms and their chemical bonds, respectively, and are annotated by their atom and bond type. The datasets DD, ENZYMES and PROTEINS represent macromolecules using different graph models. Here, the vertices represent either secondary structure elements or amino acids, and the edges encode spatial proximity. The class labels are the 6 EC top-level classes or encode whether a protein is an enzyme. The datasets REDDIT-BINARY, IMDB-BINARY and IMDB-MULTI are derived from social networks. The MSRC datasets are associated with computer vision tasks. Images are encoded by graphs, where vertices represent superpixels with a semantic label and edges their adjacency. Finally, SYNTHETICnew and Synthie are synthetically generated graphs with continuous attributes. FRANKENSTEIN contains graphs derived from small molecules, where atom types are represented by high-dimensional vectors of pixel intensities of associated images.

Dataset Graphs Classes Avg. |V| Avg. |E| Vertex labels Edge labels Vertex attr. Edge attr. Ref.
AIDS 2000 2 15.69 16.20 + + + (4) [121]
BZR 405 2 35.75 38.36 + + (3) [122]
COX2 467 2 41.22 43.45 + + (3) [122]
DHFR 467 2 42.43 44.54 + + (3) [122]
DD 1178 2 284.32 715.66 + [123, 35]
ENZYMES 600 6 32.63 62.14 + + (18) [82, 124]
FRANKENSTEIN 4337 2 16.90 17.88 + (780) [33]
IMDB-BINARY 1000 2 19.77 96.53 [19]
IMDB-MULTI 1500 3 13.00 65.94 [19]
Mutagenicity 4337 2 30.32 30.77 + + [121, 125]
MSRC-9 221 8 40.58 97.94 + [38]
MSRC-21 563 20 77.52 198.32 + [38]
MSRC-21C 209 20 40.28 96.60 + [38]
MUTAG 188 2 17.93 19.79 + + [126, 34]
NCI1 4110 2 29.87 32.30 + [35]
NCI109 4127 2 29.68 32.13 + [35]
PTC-FM 349 2 14.11 14.48 + + [127, 34]
PTC-FR 351 2 14.56 15.00 + + [127, 34]
PTC-MM 336 2 13.97 14.32 + + [127, 34]
PTC-MR 344 2 14.29 14.69 + + [127, 34]
PROTEINS 1113 2 39.06 72.82 + + (1) [82, 123]
REDDIT-BINARY 2000 2 429.63 497.75 [51]
SYNTHETICnew 300 2 100.00 196.25 + (1) [32]
Synthie 400 4 95.00 173.92 + (15) [52]
Tox21-AR 9362 2 18.39 18.84 + + [128]
Tox21-MMP 7320 2 17.49 17.83 + + [128]
Tox21-AHR 8169 2 18.09 18.50 + + [128]
Table 2: Dataset statistics and properties.

6.3 Graph Kernels

As a baseline we included the vertex label kernel (VL) and the edge label kernel (EL), which are the dot products on vertex and edge label histograms, respectively. Here, an edge label is a triplet consisting of the label of the edge and the labels of its two endpoints. We used the Weisfeiler-Lehman subtree (WL) and the Weisfeiler-Lehman optimal assignment kernel (WL-OA), see Section 3.1. For both, the number of refinement operations was chosen by cross-validation. In addition, we implemented a graphlet kernel (GL3) and the shortest-path kernel (SP) [24]. GL3 is based on connected subgraphs with three vertices and takes labels into account, similar to the approach used by Shervashidze et al. [35]. For SP we used the indicator function to compare path lengths and computed the kernel by explicit feature maps in the case of discrete vertex labels, cf. [35]. These kernels were implemented in Java based on the same common data structures and support both vertex labels and (with the exception of VL and SP) edge labels.
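To illustrate these baselines, the following sketch computes VL and EL as dot products of label histograms; the input representation (plain label lists and label triplets) is a simplification of our Java implementation, chosen here for brevity.

```python
from collections import Counter

def vertex_label_kernel(labels_G, labels_H):
    """VL baseline: dot product of vertex label histograms.
    labels_G and labels_H are lists of discrete vertex labels."""
    hist_G, hist_H = Counter(labels_G), Counter(labels_H)
    return sum(hist_G[l] * hist_H[l] for l in hist_G)

def edge_label_kernel(edges_G, edges_H):
    """EL baseline: dot product of edge label histograms, where each edge is
    represented by its own label together with the labels of its endpoints
    (endpoint labels sorted to ignore edge direction)."""
    key = lambda u_lab, v_lab, e_lab: (e_lab,) + tuple(sorted((u_lab, v_lab)))
    hist_G = Counter(key(*e) for e in edges_G)
    hist_H = Counter(key(*e) for e in edges_H)
    return sum(hist_G[t] * hist_H[t] for t in hist_G)

# Example: edges given as (endpoint label, endpoint label, edge label) triplets.
print(vertex_label_kernel(["C", "C", "O"], ["C", "O", "N"]))
print(edge_label_kernel([("C", "O", "single")], [("O", "C", "single")]))
```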

We compare three kernels based on matchings of vertex embeddings: the matching kernel of Johansson and Dubhashi [46] with inverse Laplacian (MK-IL) and Laplacian (MK-L) embeddings, and the Pyramid Match (PM) kernel of [45]. The MK kernels lack hyperparameters; for the PM kernel, we used the default settings for the vertex embedding dimension and the number of matching levels in the implementation by Nikolentzos [129]. Finally, we include the shortest-path variant of the Deep Graph Kernel (DeepGK) [19] with parameters as suggested in Yanardag [130] (SP feature type, MLE kernel type, window size 5, 10 dimensions); we did not perform a parameter search for the Deep Graph Kernel, and its accuracy may improve with a more tailored choice. We further include the DBR kernel of Bai et al. [53] (no parameters, code obtained through correspondence) and the propagation kernel (Prop) [38, 131], for which we select the number of diffusion iterations by cross-validation and use the settings recommended by the authors for the other hyperparameters.

In a comparison of kernels for graphs with continuous vertex attributes, we use the shortest-path kernel [24] with a Gaussian RBF base kernel to compare vertex attributes (see also Equation 5), the GraphHopper kernel [32], the GraphInvariant kernel [33], the propagation kernel (P2K) [38], and the Hash Graph kernel [52]. We set the parameter $\gamma$ of the Gaussian RBF kernel to $1/d$ for the GraphHopper and the GraphInvariant kernel, as reported in [32, 33], where $d$ denotes the number of components of the vertex attributes. For datasets that do not have vertex labels, we used either the vertex degree or uniform labels instead (selected by (double) cross-validation).
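The following sketch shows such a Gaussian RBF base kernel for comparing two continuous vertex attribute vectors; defaulting $\gamma$ to $1/d$, with $d$ the attribute dimension, mirrors the setting described above and is otherwise an illustrative choice.

```python
import numpy as np

def rbf_attribute_kernel(a, b, gamma=None):
    """Gaussian RBF base kernel between two continuous vertex attribute vectors,
    as used inside kernels that compare attributed vertices (cf. Equation 5)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if gamma is None:
        gamma = 1.0 / a.shape[0]   # default: 1/d with d attribute components
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

print(rbf_attribute_kernel([0.1, 0.5, 0.9], [0.2, 0.4, 1.0]))
```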

6.4 Results and Discussion

We present our experimental results and discuss the research questions.

Question 1.

For these experiments we only considered kernels that are permutation-invariant and guarantee that two isomorphic graphs are represented by the same feature vector. This is not the case for the MK-* and PM kernels because of the vertex embedding techniques applied.

Figure 7: Completeness ratio.
Figure 8: Label completeness ratio.

Figure 7 shows the completeness ratio of various permutation-invariant graph kernels with different parameters on the datasets as a heatmap. The WL-OA kernels achieved the same results as the WL kernels and are therefore not depicted. As expected, VL achieves only a weak completeness ratio, since it ignores the graph structure completely. To a lesser extent, this also applies to EL and GL3. The SP kernel and the WL kernels with at least one refinement iteration provide a high completeness ratio close to one for most datasets. However, for the IMDB-BINARY dataset, shortest paths appear to be less powerful features than small local graphlets. This indicates structural differences between this dataset and the molecular graph datasets, where SP consistently achieves better results than GL3. As expected, DeepGK performs similarly to the SP kernel. WL and Prop are both based on a neighborhood aggregation mechanism, but WL achieves a higher completeness ratio on several datasets. This is explained by the fact that Prop does not support edge labels and does not employ a relabeling function after each propagation step. DBR does not take labels into account and consequently fails to distinguish many graphs of the datasets for which vertex labels are informative. The difficulty of distinguishing the graphs in a dataset varies strongly with the type of graphs. The computer vision graphs are almost perfectly distinguished by just considering the vertex label multiplicities, whereas molecular graphs often require multiple iterations of Weisfeiler-Lehman refinement or global features such as shortest paths. Among the social networks, the REDDIT-BINARY graphs are also effectively distinguished by Weisfeiler-Lehman refinement, while this is not possible for the two IMDB datasets. However, we observed that all the graphs in these two datasets that cannot be distinguished by WL are in fact isomorphic.

We now consider the label completeness ratio depicted in Figure 8. The label completeness ratio generally shows the same trends, but, as expected, higher values close to one are reached. For the datasets IMDB-BINARY and IMDB-MULTI, we have already observed that WL distinguishes all non-isomorphic graphs. As we see in Figure 8, these datasets contain a large number of isomorphic graphs that actually belong to different classes. Apparently, the information contained in the dataset is not sufficient to allow perfect classification. A general observation from the heatmaps is that WL (just as WL-OA) effectively distinguishes most graphs after only a few iterations of refinement. For some non-challenging datasets, even VL and EL are sufficiently expressive. Therefore, these kernels are interesting baselines for the accuracy experiments. In order to learn effectively with a graph kernel, it is not sufficient to merely distinguish graphs, which may even lead to strong overfitting; the kernel must also provide a smooth similarity measure that allows the classifier to generalize to unseen data.

Question 2.

We discuss the accuracy results of the classification experiments summarized in Tables 3 and 4. The classification accuracy of the simple kernels VL and EL can be drastically improved for several datasets by combining them with the Gaussian RBF kernel. A clear improvement is also achieved for GL3 on average. For WL and WL-OA, the Gaussian RBF kernel only leads to minor changes in classification accuracy on most datasets. However, a strong improvement is observed for WL on the dataset ENZYMES, even lifting the accuracy above the value reached by WL-OA on the same dataset. For the dataset REDDIT-BINARY, the accuracy of WL improves as well, but remains far below the accuracy obtained by WL-OA, which is based on the histogram intersection kernel applied to the WL feature vectors. A surprising result is that the trivial EL kernel combined with the Gaussian RBF kernel performs competitively with many sophisticated graph kernels. On average it provides a higher accuracy than the (unmodified) SP, GL3 and PM kernels. The DBR kernel does not take labels into account and performs poorly on most datasets.

The application of the Gaussian RBF kernel introduces the hyperparameter $\gamma$, which must be optimized, e.g., via grid search and cross-validation. This is computationally demanding for large datasets, in particular when the graph kernel itself also has parameters that must be optimized. Therefore, we suggest combining VL, EL and GL3 with a Gaussian RBF kernel as a baseline. For WL and WL-OA, the number of refinement iterations must be optimized in any case, and the accuracy gain is minor for most datasets, in particular for WL-OA. Therefore, their combination with a Gaussian RBF kernel cannot be generally recommended. Note that the combination with a Gaussian RBF kernel also complicates the application of fast linear classifiers, which are advisable for large datasets.

Dataset VL VL+RBF EL EL+RBF SP SP+RBF WL WL+RBF WL-OA WL-OA+RBF GL3 GL3+RBF
NCI1 64.6±0.1 67.2±2.8 66.3±0.1 71.8±0.3 73.2±0.3 79.3±0.4 85.9±0.1 86.2±0.1 86.2±0.2 86.6±0.2 70.5±0.2 76.5±0.4
NCI109 63.6±0.2 68.9±1.4 64.9±0.1 71.4±0.5 72.7±0.3 77.6±0.3 85.9±0.3 86.0±0.3 86.2±0.2 86.4±0.2 69.3±0.2 76.0±0.4
PTC-FR 67.9±0.4 66.9±0.5 66.8±0.5 65.2±1.2 67.1±2.0 63.7±2.0 67.1±1.2 66.8±1.5 67.8±1.1 67.0±1.3 65.5±0.9 65.0±1.4
PTC-MR 57.8±0.9 59.4±1.4 56.7±1.6 60.5±1.8 58.8±2.2 62.0±1.8 60.4±1.5 62.7±2.0 62.6±1.5 62.7±1.0 57.4±1.6 60.4±1.6
PTC-FM 63.9±0.5 62.6±0.9 64.5±0.4 60.5±1.4 62.7±1.0 60.2±1.3 62.8±1.2 60.9±0.8 61.6±1.2 61.7±1.2 60.2±3.0 60.7±0.8
PTC-MM 66.6±0.8 64.7±0.4 64.1±1.0 62.7±1.6 63.3±1.2 63.2±0.8 67.8±2.1 67.7±1.3 66.4±1.1 66.3±1.7 61.4±1.7 61.3±1.4
MUTAG 85.4±0.7 82.9±1.0 83.6±1.0 88.4±2.2 83.1±1.3 85.2±1.4 86.6±0.6 87.9±1.0 87.5±2.1 87.3±1.7 87.2±1.1 87.8±1.1
Mutagenicity 67.0±0.2 73.9±0.3 72.4±0.1 80.3±0.3 77.4±0.2 80.1±0.2 83.6±0.2 84.5±0.3 84.2±0.2 84.7±0.4 79.8±0.2 82.7±0.3
AIDS 99.7±0.0 99.7±0.0 99.5±0.0 99.4±0.0 99.6±0.0 99.7±0.0 99.7±0.0 99.7±0.0 99.7±0.0 99.7±0.0 99.2±0.1 99.3±0.1
BZR 78.8±0.1 86.0±0.2 79.1±0.5 86.3±0.3 86.5±0.9 88.1±0.5 88.5±0.7 87.9±0.8 88.2±0.4 88.0±0.5 81.6±0.7 85.4±1.0
COX2 78.2±0.0 80.6±0.3 82.0±0.6 83.9±0.7 80.6±0.9 81.7±0.8 81.2±1.0 81.7±0.7 80.4±0.9 80.8±1.3 81.3±0.7 81.9±0.5
DHFR 60.9±0.2 74.8±1.2 67.9±0.6 73.2±0.9 77.5±0.6 80.7±0.7 82.7±0.4 83.5±0.6 83.0±1.0 83.3±0.6 74.7±0.6 81.2±1.0
DD 78.2±0.4 80.1±0.4 77.5±0.6 78.7±0.7 79.5±0.6 74.5±0.2 78.9±0.4 80.9±0.3 79.2±0.4 79.9±0.5 79.7±0.7 79.1±0.6
PROTEINS 71.9±0.4 74.7±0.4 73.4±0.3 75.2±0.5 75.9±0.4 74.0±0.3 75.5±0.3 73.9±0.7 76.2±0.4 75.9±0.6 72.7±0.6 73.0±0.6
ENZYMES 23.4±1.1 41.7±1.1 27.7±0.7 45.1±1.2 41.9±1.7 59.5±1.3 53.7±1.5 62.6±1.2 59.9±1.0 62.3±1.1 30.4±1.1 58.6±1.0
IMDB-BINARY 46.3±0.9 56.5±0.6 46.0±0.9 62.6±1.2 57.3±0.6 70.2±0.8 72.9±0.6 71.3±1.0 73.1±0.7 73.5±0.6 59.4±0.4 70.1±0.8
IMDB-MULTI 31.9±0.5 39.5±0.9 30.8±0.9 46.9±0.6 39.6±0.2 46.1±0.7 50.3±0.4 50.7±0.6 50.4±0.5 50.7±0.5 40.6±0.4 47.1±0.5
REDDIT-BINARY 75.3±0.1 77.6±0.2 75.1±0.1 79.4±0.1 81.7±0.2 67.8±0.2 80.9±0.4 83.9±0.5 89.3±0.2 88.9±0.1 60.1±0.2 73.6±0.1
MSRC-9 88.4±1.3 87.7±1.0 92.6±0.9 90.2±0.7 91.4±0.8 89.2±1.0 90.1±0.8 89.1±0.9 90.7±0.8 90.1±0.7 91.6±0.7 91.6±0.9
MSRC-21 89.4±0.3 90.0±0.5 89.5±0.3 87.3±0.4 89.4±0.6 37.4±1.2 89.3±0.6 89.8±0.4 90.0±0.6 90.5±0.4 90.5±0.7 85.1±0.6
MSRC-21C 81.2±1.2 80.8±1.7 84.5±0.8 81.8±1.3 83.8±1.2 78.3±1.3 81.9±0.9 82.1±1.1 84.9±0.8 84.5±1.0 84.0±1.7 82.6±1.0
Tox21-AR 95.9±0.0 96.4±0.0 95.9±0.0 97.5±0.0 97.1±0.0 97.5±0.0 97.9±0.0 98.0±0.0 98.0±0.0 98.0±0.0 96.4±0.0 97.6±0.0
Tox21-MMP 84.3±0.0 86.5±0.1 84.5±0.0 89.7±0.2 86.4±0.1 90.7±0.2 92.5±0.1 93.0±0.2 92.7±0.1 92.8±0.1 87.3±0.1 91.2±0.2
Tox21-AHR 88.4±0.0 89.1±0.2 88.6±0.1 91.4±0.2 88.4±0.0 91.9±0.1 93.4±0.1 93.7±0.1 93.5±0.1 93.6±0.1 89.7±0.1 92.8±0.2
Average 71.2 74.5 72.2 76.2 75.6 74.9 79.6 80.2 80.5 80.6 73.8 77.5
Table 3: Classification accuracy and standard deviation for several kernels and their variants when plugged into the Gaussian RBF kernel (columns marked +RBF).
Dataset MK-IL MK-L PM DeepGK DBR Prop
(Computation of the DBR kernel did not finish within 48h on some datasets, as indicated by a line (—). The DBR kernel does not make use of label information.)