FeatGraph: A Flexible and Efficient Backend for Graph Neural Network Systems
Abstract
Graph neural networks (GNNs) are gaining popularity as a promising approach to machine learning on graphs. Unlike traditional graph workloads where each vertex/edge is associated with a scalar, GNNs attach a feature tensor to each vertex/edge. This additional feature dimension, along with consequently more complex vertex- and edge-wise computations, has enormous implications for locality and parallelism, which existing graph processing systems fail to exploit.
This paper proposes FeatGraph to accelerate GNN workloads by co-optimizing graph traversal and feature dimension computation. FeatGraph provides a flexible programming interface to express diverse GNN models by composing coarse-grained sparse templates with fine-grained user-defined functions (UDFs) on each vertex/edge. FeatGraph incorporates optimizations for graph traversal into the sparse templates and allows users to specify optimizations for UDFs with a feature dimension schedule (FDS). FeatGraph speeds up end-to-end GNN training and inference by up to 32× on CPU and 7× on GPU.
I Introduction
Graph neural networks (GNNs) have gained popularity in recent years as a promising approach to machine learning on graphs. Because they can incorporate multi-dimensional features on vertices and edges, as well as graph structure information, into a joint embedding for downstream tasks, GNNs have been applied successfully to social network mining [32], recommender systems [40], molecule analysis [7], and combinatorial optimization [18], to name a few.
Driven by this trend, specialized software frameworks are emerging to simplify the development and processing of GNN workloads. These GNN frameworks are typically built on top of existing deep learning systems. For example, NeuGraph [20] relies on TensorFlow [1]; PyTorch Geometric (PyG) [6] is built upon PyTorch [26]; DGL [37] supports multiple backends.
Unlike traditional neural network workloads that are dominated by dense operations, GNN workloads consist of both dense and sparse operations. The sparsity comes from the nature of the graph: each vertex normally connects to only a small number of other vertices. Empirically, sparse operations in a GNN model account for more than 60% of the total computation time, even when both the sparse and dense operations are fully optimized. While deep learning systems have benefited from years of development in optimizing dense operations such as convolution and matrix multiplication, they lack flexible support for the sparse operations that are essential for high-performance GNN training and inference. Specifically, deep learning systems rely on vendor-provided sparse libraries (e.g., MKL [11], cuSPARSE [24]), which offer highly optimized implementations for only a small subset of the kernels required by diverse GNN models.
On the other hand, graph processing systems have been extensively studied in the literature [21, 28, 8, 30, 38, 43], offering an alternative solution that expresses computations on graphs with a vertex- and/or edge-centric programming paradigm. As a representative attempt to circumvent the inflexibility of deep learning systems in handling sparse computations, DGL supports offloading the computation kernels in GNNs to existing graph processing systems such as Ligra [30] and Gunrock [38].
However, existing graph processing systems are not a panacea either. They are designed for traditional graph workloads (e.g., BFS, PageRank) where each vertex is associated with a scalar, rather than with a feature tensor as in GNNs. This additional feature dimension, along with consequently more complex vertex-wise and edge-wise computations, makes kernel optimizations substantially different. For example, existing graph partitioning techniques that aim to improve cache utilization [41, 45] do not take the feature dimension into consideration; hence the entire cache could be occupied by just a few feature tensors. Also, prior GPU graph processing systems [14, 15, 38] rarely exploit parallelism in feature dimension computation, focusing mainly on sophisticated load balancing methods that exploit parallelism in graph traversal.
This paper proposes FeatGraph to enable performant processing of GNN workloads. The key insight is that GNN workloads require co-optimizing graph traversal and feature dimension computation to achieve good performance. FeatGraph meets this need with a two-granularity programming interface. More concretely, FeatGraph expresses the diverse variants of GNN models by composing coarse-grained sparse templates with fine-grained feature dimension computations on each vertex/edge in the form of user-defined functions (UDFs). The coarse-grained level handles traversal over the graph topology, and the fine-grained level focuses on computation over the dense feature tensor of each vertex/edge. FeatGraph provides two sparse templates: generalized SpMM (sparse-dense matrix multiplication) for vertex-wise computations and generalized SDDMM (sampled dense-dense matrix multiplication) for edge-wise computations. Here, "generalized" means that the templates can take different fine-grained UDFs. For example, a commonly used GNN kernel, multi-layer perceptron (MLP) aggregation [29, 25], shown in Figure 1, is mapped to generalized SpMM: it calculates features by performing MLP and aggregates them by taking the max. Note that the vanilla SpMM operation corresponds to copying features and aggregating them by taking the sum. Similarly, attention calculation on edges is mapped to generalized SDDMM; the vanilla SDDMM operation corresponds to a specific attention mechanism that performs a dot product between the source vertex feature vector (i.e., a 1D tensor) and the destination vertex feature vector.
By cleanly decomposing a kernel specification into sparse templates and UDFs, FeatGraph enables decoupled, two-level optimizations. At the coarse-grained level, FeatGraph incorporates optimizations for graph traversal into the sparse templates: applying graph partitioning techniques to improve cache utilization on CPU, adapting parallelization strategies for sparse patterns to fully utilize the massive compute capacity on GPU, etc. At the fine-grained level, FeatGraph allows users to specify optimizations for UDFs, e.g., how to tile or parallelize feature dimension computation, with a feature dimension schedule (FDS). FeatGraph combines sparse templates with FDS, and extends a tensor compiler, namely Apache TVM [4], to generate efficient kernels for both CPU and GPU. In addition, by decoupling these two levels of optimizations, FeatGraph significantly improves the productivity of developing new kernels for emerging GNN models.
We perform a comprehensive evaluation to verify the efficiency and flexibility of FeatGraph. Compared with traditional graph processing systems (i.e., Ligra [30] on CPU and Gunrock [38] on GPU), FeatGraph achieves significantly higher performance. Compared with vendor-provided sparse libraries (i.e., MKL [11] on CPU and cuSPARSE [24] on GPU), FeatGraph achieves competitive performance on the kernels that these libraries support while being flexible enough to cover more kernels. We integrated FeatGraph into DGL, a popular GNN framework, to accelerate end-to-end GNN training and inference by up to 32× on CPU and 7× on GPU. To the best of our knowledge, FeatGraph is the first unified and generic solution that can flexibly integrate with different GNN frameworks and efficiently process GNN workloads on both CPU and GPU. We summarize the characteristics of FeatGraph and other works in Table I. FeatGraph is available as open source at https://github.com/dglai/FeatGraph.
Specifically, this paper makes the following contributions:

FeatGraph provides a flexible programming interface that is able to express the diverse variants of GNN models by composing coarse-grained sparse templates with customizable fine-grained feature dimension computations on each vertex/edge.

FeatGraph performs extensive optimizations in both graph traversal and feature dimension computation to generate efficient kernels. In addition, FeatGraph decouples these two levels of optimizations to improve the productivity of developing new kernels for emerging GNN models.

Experimental results on representative GNN models and a wide collection of datasets show that FeatGraph is portable to existing GNN frameworks and serves as a flexible and efficient backend.
The rest of the paper is organized as follows. Section II reviews the background of GNNs and tensor compilers, and motivates FeatGraph by examining the limitations of existing graph processing systems. Section III describes the programming interface design and optimization techniques of FeatGraph. Section IV presents the system implementation, followed by evaluation in Section V. We discuss related work in Section VI and summarize in Section VII.
II Background and Motivation
II-A Graph Neural Networks (GNNs)
In recent years, there has been rising interest in applying deep learning to structured data such as graphs. Unlike the dense objects (e.g., images, videos, texts) handled by traditional deep learning models, graphs represent sparse, irregularly connected links. Essentially, graphs are defined on a non-Euclidean domain equipped with vastly different distance measurements and geometric properties, creating demand for new neural network architectures.
GNNs are an emerging family of neural networks capable of learning a joint representation for each vertex/edge using both features and topological data. Recent studies [7, 3] have unified different GNN models into a message passing paradigm where each vertex computes a new representation by aggregating features (messages) from its neighbors. More formally, given a graph $G(V, E)$, we denote the input feature tensor associated with vertex $v$ as $x_v$, and that associated with the edge pointing from vertex $u$ to $v$ as $x_{uv}$. To get the representation of a vertex and an edge, the message passing paradigm carries out the following computations, where $N(v)$ denotes the set of vertices with an edge pointing to $v$:
$$h_v = \bigoplus_{u \in N(v)} \phi(x_u, x_v, x_{uv}) \tag{1}$$
$$h_{uv} = \psi(x_u, x_v, x_{uv}) \tag{2}$$
Here $\phi$, $\bigoplus$, and $\psi$ are customizable or parameterized functions (e.g., neural network modules) for calculating messages, aggregating messages, and updating edge representations, respectively. Similar to convolutional neural networks (CNNs), a GNN model iteratively applies Equations (1) and (2) to generate vertex and edge representations for higher layers.
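As an illustration of the paradigm just described, the following is a minimal pure-Python sketch, not FeatGraph code. The function name `message_passing` and its parameters `msg_fn`, `agg_fn`, and `edge_fn` are hypothetical stand-ins for the message, aggregation, and edge-update functions; the tiny graph and features are made up for the example.

```python
from functools import reduce

def message_passing(edges, x_v, x_e, msg_fn, agg_fn, edge_fn):
    """One round of message passing over an edge list (illustrative only).

    edges: list of (src, dst) pairs; edge i points from src to dst
    x_v:   per-vertex feature vectors
    x_e:   per-edge feature vectors
    """
    inbox = {}   # dst vertex -> messages arriving at it
    h_e = []     # new edge representations
    for eid, (u, v) in enumerate(edges):
        inbox.setdefault(v, []).append(msg_fn(x_v[u], x_v[v], x_e[eid]))
        h_e.append(edge_fn(x_v[u], x_v[v], x_e[eid]))
    # Reduce the messages of each vertex with the aggregation function.
    h_v = {v: reduce(agg_fn, msgs) for v, msgs in inbox.items()}
    return h_v, h_e

def vec_add(a, b):
    return [p + q for p, q in zip(a, b)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

edges = [(0, 2), (1, 2), (2, 0)]
x_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x_e = [[0.0], [0.0], [0.0]]

h_v, h_e = message_passing(
    edges, x_v, x_e,
    msg_fn=lambda xu, xv, xuv: xu,            # copy the source feature
    agg_fn=vec_add,                           # aggregate by sum
    edge_fn=lambda xu, xv, xuv: dot(xu, xv),  # dot-product score per edge
)
```

Swapping `agg_fn` for an element-wise max, or `msg_fn` for a small MLP, changes the model without touching the traversal loop, which is the separation the rest of the paper builds on.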
There is a strong connection between Equations (1) and (2) and sparse matrix operations. For example, given the vertex feature matrix $X_V$ and the adjacency matrix $A$, the vertex-wise computation in the graph convolutional network (GCN) [16], which copies source vertex features as messages and aggregates messages by taking the sum, is equivalent to SpMM (sparse-dense matrix multiplication) as follows.
$$H_V = A \times X_V \tag{3}$$
For edge-wise computations, many GNN models [35, 33] calculate an attention weight on each edge. One popular formulation calculates the attention weight as a dot product between the source and destination vertex features [34], that is, $\psi(x_u, x_v, x_{uv}) = x_u \cdot x_v$. Its tensorized implementation corresponds to SDDMM (sampled dense-dense matrix multiplication) [42], which multiplies two dense matrices, followed by an element-wise multiplication with a sparse mask matrix, to output a sparse matrix.
$$H_E = A \odot (X_V \times X_V^{T}) \tag{4}$$
Hence, Equations (1) and (2), when implemented as tensor operations, are generalized SpMM and SDDMM, respectively. They represent two distinct computation patterns in GNN workloads: reduction on each vertex and reduction on each edge. Moreover, according to the chain rule, the gradient computation of SpMM with respect to the adjacency matrix requires a dot product between the gradients of source and destination vertex features, thus following the SDDMM pattern. Likewise, the gradient computation of SDDMM follows the SpMM pattern. Therefore, these two computation patterns are essential for both inference and training of GNNs. In particular, our benchmarking shows that generalized SpMM and SDDMM account for most of the total run time in training a 2-layer GNN model when using existing solutions with sub-optimal sparse kernels.
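The correspondence between the two sparse patterns and matrix operations can be checked on a toy graph. The sketch below uses dense Python lists purely for readability; a real kernel would store the adjacency matrix in a sparse format and visit only its nonzeros. The convention that entry `A[v][u]` is 1 when an edge points from `u` to `v` is an assumption of this example.

```python
def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

edges = [(0, 2), (1, 2), (2, 0)]           # (src, dst) pairs
n = 3
A = [[0.0] * n for _ in range(n)]
for u, v in edges:
    A[v][u] = 1.0                          # assumed convention: row = destination

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # one feature row per vertex

# Vanilla SpMM: row v of the result is the sum of the features of v's sources.
spmm = matmul(A, X)

# Vanilla SDDMM: dense product X * X^T, kept only where A has a nonzero.
scores = matmul(X, transpose(X))
sddmm = [[A[v][u] * scores[v][u] for u in range(n)] for v in range(n)]
```

Here `spmm[2]` equals the sum of the features of vertices 0 and 1, and `sddmm` has exactly one nonzero per edge, holding the dot product of that edge's endpoint features.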
II-B Limitations of Existing Graph Processing Systems
Existing graph processing systems [30, 38] express computations on graphs with a vertex- and/or edge-centric programming paradigm, and they employ a scheduler to realize efficient graph traversal. For example, to ensure load balance on GPU, Gunrock [38] assigns the edges of a vertex to be processed by a thread, a warp, or a block, according to the number of edges. The edge is the unit of scheduling: the computation on an edge is a black box to the scheduler. The underlying assumption is that the computation on an edge is lightweight.
However, that assumption breaks down in GNNs, which attach a multi-dimensional feature tensor to each vertex/edge and consequently have more complex vertex-wise and edge-wise computations than traditional graph workloads. For example, MLP aggregation, as shown in Figure 1, performs a sequence of vector-matrix multiplications and nonlinear activations on each edge. Treating the computation on each edge as a black box, Gunrock fails to exploit the abundant parallelism in it.
To enable performant processing of GNN workloads, we need a system that: 1) makes vertex-wise and edge-wise computations a white box to the scheduler; 2) co-optimizes graph traversal and feature dimension computation. FeatGraph achieves these goals by adopting a tensor compiler approach.
II-C Tensor Compiler
Computation-intensive workloads typically operate on tensors, i.e., multi-dimensional data. For example, traditional deep learning models perform matrix multiplication and convolution over dense tensors; GNNs deal with both dense and sparse tensors (the feature tensor is dense and the adjacency matrix is sparse). Previously, people relied on vendor-specific libraries (e.g., MKL, cuBLAS) to obtain high performance for tensor computations on the vendors' own CPUs and GPUs. These libraries require heavy manual tuning by experienced engineers. As a result, they evolve slowly in contrast with the rapid emergence of new workloads.
An alternative solution is tensor compilation, which expresses the processing of tensors in its own intermediate representation (IR) [27, 4]. Tensor compilation separates the computation definition (i.e., what to compute) from the scheduling (i.e., how to compute), so that performance optimization can focus on the scheduling part. A scheduling scheme can apply loop transformations, vectorization, thread binding, etc., to manipulate the tensor computation. Optimizing one computation kernel for different hardware architectures is essentially a search over different scheduling schemes.
FeatGraph adopts the tensor compilation approach to optimize the computation kernels in GNN workloads. However, existing tensor compilers [4, 27] mostly focus on computations over dense tensors, and there is little support for computations involving sparse tensors. FeatGraph extends TVM [4] to support the core sparse patterns in GNNs (i.e., generalized SpMM and SDDMM), and allows customizable feature dimension computations on each vertex/edge through a two-granularity programming interface (Sec. III-B).
III System Design and Optimization
In this section, we first give an overview of the software stack of FeatGraph (Sec. III-A). We then describe the design of the programming interface and demonstrate its expressiveness using code examples (Sec. III-B). Finally, we cover the optimization techniques for generating efficient GNN kernels on CPU and GPU (Sec. III-C).
III-A System Overview
Figure 2 depicts the software stack of FeatGraph. At the top level, users define, train, and evaluate GNN models in specialized frameworks such as DGL and PyG, which handle dataflow programming and automatic differentiation. FeatGraph serves as a backend for these frameworks, targeting the message passing computation that is core to GNN workloads. FeatGraph provides a flexible programming interface to express the diverse variants allowed by the message passing paradigm. Specifically, FeatGraph describes feature dimension computations on each vertex/edge with user-defined functions (UDFs), and triggers UDFs through the SpMM or SDDMM sparse template. FeatGraph incorporates optimizations for graph traversal into the sparse templates, and allows users to specify optimizations for UDFs with a feature dimension schedule (FDS). FeatGraph combines templates with FDS, and leverages the TVM tensor compiler [4] to generate efficient kernels for both CPU and GPU. By decoupling these two levels of optimizations, FeatGraph significantly improves the productivity of developing new kernels for emerging GNN models.
III-B Programming Interface
There are two principles in the design of FeatGraph's programming interface. First, the interface should closely follow the mathematical definition of GNNs as described in Section II-A. Second, it should facilitate optimizations.
To these ends, we propose to decompose a kernel specification into two parts: UDFs written in a tensor expression language adopted from TVM to describe fine-grained feature dimension computations on each vertex/edge, and the choice of coarse-grained sparse patterns. FeatGraph provides two kernel templates, featgraph.spmm and featgraph.sddmm, for the SpMM and SDDMM sparse patterns that directly map to the vertex-wise and edge-wise computations in the message passing paradigm, i.e., Equations (1) and (2).
More concretely, featgraph.spmm takes in five arguments: an adjacency matrix, a message function, an aggregation function, the target (CPU or GPU), and an FDS to specify optimizations of the message function. Figure 2(a) shows the code for GCN aggregation, i.e., the message aggregation in the GCN model as described in Section II-A. Given the edge ID tuple (src, dst, eid), the user-defined message function msgfunc (lines 6–8) slices out the src row from the vertex feature matrix XV, which is equivalent to using the source vertex feature as the message. The aggregation function is sum; any commutative reducer is allowed. Figure 2(b) shows a more complex message function, which adds the source and destination vertex features, and then multiplies with a weight matrix, followed by a ReLU activation (i.e., taking the max with 0).
FeatGraph can easily support the commonly used message functions in GNNs, specifically all the built-in ones provided by DGL.
FeatGraph inlines the message function into the SpMM template to generate a fused kernel. In contrast, existing GNN frameworks (e.g., DGL, PyG, NeuGraph) that rely on deep learning systems as backend have to materialize the messages on every edge, causing inefficiency in both performance and memory consumption.
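The difference between materializing messages and fusing them into the aggregation can be sketched in plain Python. This is an illustration of the idea only, not of the kernels FeatGraph generates; the function names and the `extra_floats` bookkeeping are hypothetical.

```python
def aggregate_materialized(edges, x_v, num_vertices):
    """Two-step approach: store one message per edge, then reduce them."""
    d = len(x_v[0])
    messages = [list(x_v[u]) for u, _ in edges]   # |E| x d values live at once
    out = [[0.0] * d for _ in range(num_vertices)]
    for (_, v), msg in zip(edges, messages):
        for k in range(d):
            out[v][k] += msg[k]
    extra_floats = len(messages) * d              # temporary storage used
    return out, extra_floats

def aggregate_fused(edges, x_v, num_vertices):
    """Fused approach: accumulate each message element as it is computed."""
    d = len(x_v[0])
    out = [[0.0] * d for _ in range(num_vertices)]
    for u, v in edges:
        for k in range(d):
            out[v][k] += x_v[u][k]                # the message is never stored
    return out, 0

edges = [(0, 2), (1, 2), (2, 0)]
x_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
materialized, extra = aggregate_materialized(edges, x_v, 3)
fused, no_extra = aggregate_fused(edges, x_v, 3)
```

Both variants return the same aggregate, but the materialized one needs temporary storage proportional to the number of edges times the feature length, which is the overhead the paragraph above refers to.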
featgraph.sddmm takes in four arguments: an adjacency matrix, an edge function, the target (CPU or GPU), and an FDS to specify optimizations of the edge function. Figure 3(a) shows the code for dot-product attention, where the user-defined edge function edgefunc (lines 6–10) performs a dot product between the source and destination vertex feature vectors, and returns an attention weight as the new feature on the edge. Figure 3(b) shows a more complex edge function, which performs multiple dot products over the feature tensors.
This two-granularity programming interface simplifies implementing new GNN kernels and, more importantly, facilitates optimizations. By cleanly decomposing a kernel specification into coarse-grained sparse templates and fine-grained feature dimension computations on each vertex/edge in the form of UDFs, FeatGraph enables decoupled, two-level optimizations. Specifically, FeatGraph incorporates optimizations for graph traversal into the sparse templates and allows users to specify optimizations for UDFs with an FDS. Some FDS examples, for both CPU and GPU, are shown in Figure 2(a) at lines 11–15 and lines 19–22. It is worth noting that when the FDS is missing, FeatGraph essentially degrades to a traditional graph processing system designed without special handling of feature dimension computation.
III-C Decoupled, Two-level Optimizations
This subsection describes the optimizations for graph traversal, which are incorporated into the sparse templates, and the optimizations for feature dimension computation, which are specified by users with an FDS. We analyze the interplay between these two levels of optimizations, and show that by combining them, FeatGraph enables performant processing of GNN workloads. Throughout this subsection, we use the sample graph shown in Figure 5 to illustrate optimization techniques.
Graph Partitioning and Feature Dimension Tiling
On CPU, the key factor limiting the efficiency of graph traversal is poor locality, which causes low cache utilization. Prior work has attempted to improve locality in graph traversal by graph partitioning [41, 45]. FeatGraph proposes combining graph partitioning with feature dimension tiling to strike a balance between the efficiency of graph traversal and the efficiency of feature dimension computation.
Figure 6 illustrates how feature dimension tiling is combined with 1D graph partitioning [41], which partitions source vertices, to effectively optimize cache utilization in GCN aggregation, i.e., the vanilla SpMM operation. Here we assume the feature vector length is four, and the cache can hold two feature vectors. With 1D graph partitioning alone, source vertices are partitioned into four segments so that each segment fits into the cache; these segments are processed one by one to get four portions of intermediate results; in the end the intermediate results are merged. 1D graph partitioning improves read locality within each segment at the cost of merging intermediate results from different segments. When 1D graph partitioning is combined with feature dimension tiling, the merge cost is reduced since more vertices can fit into the cache under the same capacity. As shown in Figure 5(b), tiling each feature vector into two sub-vectors reduces the number of segments from four to two, which translates to a 50% saving in merge cost. However, feature dimension tiling results in traversing the graph twice, which means increased accesses to graph topological data (i.e., the adjacency matrix).
Thus, feature dimension tiling introduces a trade-off between accesses to graph topological data and accesses to feature data. In GNNs, feature vectors have a typical length ranging from 32 to 1024. When the tiling factor is properly selected, the gain from improved locality in accessing feature data far outweighs the overhead of increased accesses to graph topological data.
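The arithmetic behind the four-to-two segment reduction, and the fact that tiling leaves the result unchanged, can be reproduced with a short sketch. It assumes eight source vertices (consistent with four segments of two vectors each), a cache measured in feature elements, and a feature length divisible by the tiling factor; the function names are hypothetical and the edge scan per segment is deliberately naive.

```python
from math import ceil

def num_segments(num_src, feat_len, cache_floats, tile_factor):
    """Segments needed so each segment's tiled source features fit in cache."""
    tile_len = feat_len // tile_factor
    vertices_per_segment = cache_floats // tile_len
    return ceil(num_src / vertices_per_segment)

def spmm_partitioned(edges, x, num_dst, seg_size, tile_factor):
    """Vanilla SpMM (copy + sum) with 1D source partitioning and feature tiling."""
    d = len(x[0])
    tile_len = d // tile_factor
    out = [[0.0] * d for _ in range(num_dst)]
    topology_passes = 0
    for t in range(tile_factor):                       # one graph traversal per tile
        topology_passes += 1
        lo, hi = t * tile_len, (t + 1) * tile_len
        for seg_start in range(0, len(x), seg_size):   # one source segment at a time
            seg_end = seg_start + seg_size
            for u, v in edges:                         # naive scan; real code would
                if seg_start <= u < seg_end:           # pre-bucket edges by segment
                    for k in range(lo, hi):
                        out[v][k] += x[u][k]           # merge into running result
    return out, topology_passes

x = [[float(4 * u + k) for k in range(4)] for u in range(8)]
edges = [(u, u % 3) for u in range(8)] + [(u, (u + 1) % 3) for u in range(8)]

reference = [[0.0] * 4 for _ in range(3)]
for u, v in edges:
    for k in range(4):
        reference[v][k] += x[u][k]

untiled, passes_untiled = spmm_partitioned(edges, x, 3, seg_size=2, tile_factor=1)
tiled, passes_tiled = spmm_partitioned(edges, x, 3, seg_size=4, tile_factor=2)
```

With a cache of eight elements, `num_segments(8, 4, 8, 1)` gives four segments and `num_segments(8, 4, 8, 2)` gives two, while the tiled run makes two passes over the edge list instead of one, mirroring the trade-off described above.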
More complex UDFs that compute on multi-dimensional feature tensors may require a multi-level tiling scheme. To efficiently support diverse UDFs, FeatGraph allows users to specify optimizations for UDFs with an FDS. Figure 8 shows the FDS for MLP aggregation, which tiles both dimensions of the weight matrix for cache optimization.
For edge-wise computations, besides feature dimension tiling, FeatGraph employs a graph traversal scheme [22] based on the Hilbert curve. Edge-wise computations access both source and destination vertex features, and update edge features; Hilbert curve traversal exploits locality in accessing both source and destination vertices. The recursive structure of the Hilbert curve enables exploiting locality across a spectrum of granularities, e.g., L1/L2/L3 caches. FeatGraph combines Hilbert curve traversal with feature dimension tiling to fully optimize edge-wise computations.
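One common way to obtain such a traversal order is to treat each edge (src, dst) as a cell of the adjacency matrix and sort the edges by the cell's position along a Hilbert curve. The sketch below uses the standard bit-manipulation mapping from grid coordinates to curve position; whether FeatGraph computes the order exactly this way is an assumption, and the function names are hypothetical.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so every sub-curve has the same orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_order(edges, n):
    """Sort (src, dst) edges so consecutive edges touch nearby vertices."""
    return sorted(edges, key=lambda e: hilbert_index(n, e[0], e[1]))

edges = [(3, 0), (0, 1), (2, 3), (1, 1), (0, 0)]
ordered = hilbert_order(edges, 4)
```

Because consecutive cells on the curve are always neighbors in the grid, edges processed back to back tend to reuse recently loaded source and destination features.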
Adaptive Parallelization Strategies
To utilize the GPU's massive parallel compute capacity, prior graph processing systems exploit parallelism in graph traversal by implementing either vertex parallelization or edge parallelization [38, 15, 14]. However, because they treat UDFs as a black box, they are unable to exploit the abundant parallelism in feature dimension computation that arises in GNN workloads. FeatGraph exploits this parallelism by opening the black box of UDFs to the scheduler. Specifically, FeatGraph allows users to specify a parallelization scheme for UDFs with an FDS, which can be adapted to the diverse computation patterns of UDFs to fully exploit the parallelism in feature dimension computation.
For vertex-wise computations, FeatGraph incorporates vertex parallelization into the SpMM template and allows users to specify a parallelization scheme for the message function with an FDS. For example, the FDS for GCN aggregation is shown in Figure 2(a) at lines 19–22, which, combined with the SpMM template, defines the parallelization strategy shown in Figure 6(a): each CUDA block processes a number of vertices, which correspond to several rows in the adjacency matrix, and the feature dimension is parallelized across the threads in one CUDA block. This simple parallelization strategy turns out to be highly efficient: there is no load imbalance within each CUDA block since all threads are assigned exactly the same amount of work; there is no control divergence; and read requests to global memory from the threads within one CUDA block are contiguous and can be coalesced to achieve high bandwidth utilization. This parallelization strategy was first proposed in [39], which focuses on manually optimizing the vanilla SpMM kernel; we can easily express it with the programming infrastructure of FeatGraph to optimize a broad class of generalized SpMM computations.
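The work assignment described above can be simulated sequentially to see why threads within a block are balanced. The sketch assumes a CSR adjacency matrix whose rows are destination vertices and a strided mapping of feature elements to threads; it is a Python stand-in for illustration, not a CUDA kernel, and all names are hypothetical.

```python
def spmm_vertex_parallel(indptr, indices, x, rows_per_block, threads_per_block):
    """Simulate block/thread work assignment for vanilla SpMM on a CSR matrix."""
    num_rows, d = len(indptr) - 1, len(x[0])
    out = [[0.0] * d for _ in range(num_rows)]
    work = {}                                   # (block, thread) -> additions done
    num_blocks = (num_rows + rows_per_block - 1) // rows_per_block
    for block in range(num_blocks):             # each block owns a few rows
        first = block * rows_per_block
        last = min(first + rows_per_block, num_rows)
        for thread in range(threads_per_block):
            work[(block, thread)] = 0
            for row in range(first, last):
                for p in range(indptr[row], indptr[row + 1]):
                    src = indices[p]
                    # Thread t handles features t, t + T, ...; at each step,
                    # neighboring threads read neighboring addresses.
                    for k in range(thread, d, threads_per_block):
                        out[row][k] += x[src][k]
                        work[(block, thread)] += 1
    return out, work

indptr = [0, 1, 1, 3, 6]                        # 4 destination rows
indices = [2, 0, 1, 0, 1, 2]                    # source vertex of each nonzero
x = [[float(10 * v + k) for k in range(4)] for v in range(4)]
out, work = spmm_vertex_parallel(indptr, indices, x,
                                 rows_per_block=2, threads_per_block=2)
```

In this run both threads of block 0 perform 2 additions and both threads of block 1 perform 10, so work differs between blocks but never between threads of the same block.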
For edge-wise computations, FeatGraph incorporates edge parallelization into the SDDMM template and allows users to specify a parallelization scheme for the edge function with an FDS. For example, the FDS for dot-product attention is shown in Figure 3(a) at lines 13–16, which, combined with the SDDMM template, defines the parallelization strategy shown in Figure 6(b): each CUDA block processes a number of edges, which correspond to several non-zero elements in the adjacency matrix, and all the threads in one CUDA block collectively process the dot-product operations on edges using tree reduction [10]. Prior graph processing systems (e.g., Gunrock [38]), which are designed without awareness of feature dimension computation, fail to exploit this form of parallelism.
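Tree reduction itself is easy to sketch: each "thread" first forms one product, then pairs of partial sums are combined in rounds, halving the number of active threads each time. The following sequential Python stand-in (function name hypothetical) shows the access pattern and the logarithmic number of rounds; it is not GPU code.

```python
def tree_reduce_dot(a, b):
    """Dot product via pairwise (tree) reduction; returns (value, rounds)."""
    partial = [p * q for p, q in zip(a, b)]    # one product per simulated thread
    n = 1
    while n < len(partial):
        n *= 2
    partial += [0.0] * (n - len(partial))      # pad to a power of two
    rounds = 0
    stride = n // 2
    while stride > 0:
        for t in range(stride):                # these additions are independent,
            partial[t] += partial[t + stride]  # so threads could run them at once
        stride //= 2
        rounds += 1
    return partial[0], rounds

value, rounds = tree_reduce_dot([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
```

A feature vector of length 4 needs two rounds and one of length 8 needs three, compared with a number of sequential additions that grows linearly with the length.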
More complex UDFs that compute on multi-dimensional feature tensors require a multi-level parallelization scheme. Figure 9 shows the FDS for MLP aggregation, which parallelizes the first dimension across CUDA blocks and the second dimension across threads.
Hybrid Partitioning on GPU
The optimizations for graph traversal on CPU (e.g., 1D graph partitioning) are not directly applicable to GPU due to the differences between CPU and GPU memory architectures: shared memory (configurable up to 96 KB on a Tesla V100 GPU) is much smaller than the LLC, which is typically tens of megabytes (MB). To make effective use of the limited-capacity shared memory on GPU, we propose a hybrid partitioning method that processes high-degree vertices and low-degree vertices differently. Specifically, this method reorders the vertices into a low-degree part and a high-degree part according to a threshold; it only partitions high-degree vertices and loads them into shared memory. The intuition behind hybrid partitioning is that high-degree vertices are accessed more often and therefore benefit more from shared memory optimization. The key trade-off here is between read efficiency and merge cost: a smaller degree threshold leads to more partitions, which improves read efficiency but increases merge cost.
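A rough sketch of the reordering step follows. It assumes that "degree" means the out-degree of a source vertex (how often its features are read) and that high-degree vertices are cut into fixed-size partitions; both choices, like the function name and parameters, are assumptions made for illustration.

```python
def hybrid_partition(edges, num_vertices, degree_threshold, partition_size):
    """Split vertices by degree; only the high-degree part is partitioned."""
    out_degree = [0] * num_vertices
    for u, _ in edges:
        out_degree[u] += 1                     # assumed: degree = times read as source
    low = [v for v in range(num_vertices) if out_degree[v] < degree_threshold]
    high = [v for v in range(num_vertices) if out_degree[v] >= degree_threshold]
    # Each partition is small enough to be staged in shared memory.
    partitions = [high[i:i + partition_size]
                  for i in range(0, len(high), partition_size)]
    return low, partitions

edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 0), (1, 2), (1, 3),
         (2, 0), (3, 0), (5, 0), (5, 1)]
low_a, parts_a = hybrid_partition(edges, 6, degree_threshold=3, partition_size=1)
low_b, parts_b = hybrid_partition(edges, 6, degree_threshold=2, partition_size=1)
```

Lowering the threshold from 3 to 2 moves one more vertex into the high-degree part and raises the partition count from two to three, which is the read-efficiency versus merge-cost trade-off described above.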
IV System Implementation
This section describes the implementation of FeatGraph, in particular, how we extended TVM to support the core sparse patterns of GNNs (i.e., generalized SpMM and SDDMM), and how we integrated FeatGraph into DGL.
IV-A TVM IR Templates
We implemented the SpMM and SDDMM templates as TVM IR templates. TVM is a domain-specific language and compiler for tensor computations and has been widely adopted to accelerate deep learning workloads [19, 36]. Because TVM does not support sparse representation and computation in its tensor expression language, we implemented and optimized the SpMM and SDDMM templates by directly constructing and manipulating the IR (intermediate representation) using lower-level APIs. Feature dimension computations on each vertex/edge described by UDFs are dense and therefore easily supported. FeatGraph combines scheduling parameters from the sparse templates (e.g., number of graph partitions, number of CUDA blocks) and those from the FDS (e.g., feature dimension tiling factors) to create the design space. In this work we use naïve grid search to find the optimal parameters for a given input shape; trying more intelligent tuners [2, 5] for faster design space exploration is an interesting future direction. After performing optimizations for both the templates and UDFs, FeatGraph inlines UDFs into the templates to generate fused kernels.
We parallelize the kernels over multiple threads on CPU using the customized thread pool [19] in the TVM runtime, which is lightweight and particularly efficient at handling this kind of embarrassingly parallel workload. To avoid LLC contention after graph partitioning, we assign multiple threads to collectively work on one graph partition at a time instead of assigning each thread to a different partition.
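Grid search over such a combined design space amounts to evaluating every parameter combination and keeping the cheapest. The sketch below uses a made-up analytic cost function in place of a measured kernel time; the parameter names `num_partitions` and `feat_tile` and the cost model are purely hypothetical and carry no relation to FeatGraph's measured behavior.

```python
from itertools import product

def grid_search(param_grid, cost_fn):
    """Evaluate every combination in param_grid and return the cheapest one."""
    names = list(param_grid)
    best_params, best_cost = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        cost = cost_fn(params)                 # in practice: compile and time a kernel
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params, best_cost

def toy_cost(params):
    """Hypothetical cost: merge cost + cache-miss cost + topology re-read cost."""
    p, t = params["num_partitions"], params["feat_tile"]
    return p + 64 / (p * t) + t

grid = {"num_partitions": [1, 2, 4, 8], "feat_tile": [1, 2, 4]}
best_params, best_cost = grid_search(grid, toy_cost)
```

The search cost grows with the product of the grid sizes, which is why smarter tuners are attractive once the design space gets larger.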
IV-B DGL Integration
In order to evaluate the performance of FeatGraph in end-to-end GNN training and inference, we integrated FeatGraph into DGL, a popular open-source GNN framework. DGL implemented a minimal Gunrock-like graph kernel interface named Minigun [23]. With Minigun, DGL provided a set of built-in message functions and edge functions to support common GNN workloads. For each of these built-in functions, such as GCN aggregation and dot-product attention, we implemented a corresponding one with the programming infrastructure of FeatGraph. To handle more complex cases such as MLP aggregation, the current solution in DGL is to calculate and materialize the messages on every edge using deep learning systems as the backend. In contrast, FeatGraph generates fused kernels, thus both saving memory and improving efficiency. FeatGraph generates kernel code for a specific graph topology (i.e., the adjacency matrix); since GNN training typically involves hundreds of epochs, the compilation cost is amortized and negligible.
The integration requires a small amount of effort (only 300 lines of Python code) because both FeatGraph and DGL follow the message passing paradigm in their programming interface design. The integration with DGL demonstrates that it is straightforward to have FeatGraph be the backend to accelerate GNN frameworks in general, including PyG, NeuGraph, etc.
V Evaluation
This section seeks to answer the following questions:

What is the performance gain of GNN kernels on both CPU and GPU?

What is the implication of each of our proposed optimization techniques for both templates and UDFs?

Is the kernel performance sensitive to scheduling parameters and graph sparsity?

What is the speedup of end-to-end GNN training and inference brought by FeatGraph without affecting the accuracy of the models?
V-A Experiment Setup
Environment. For CPU evaluation, we conduct experiments on an Amazon EC2 c5.9xlarge instance, which is a one-socket 18-core 3.0 GHz Intel Xeon Platinum 8124M machine with 25 MB LLC and 68 GB DRAM. For GPU evaluation, we conduct experiments on a p3.2xlarge instance, which has a Tesla V100 GPU with 80 SMs; each SM has shared memory configurable up to 96 KB (the default size is 48 KB).
Datasets.
Table II lists the datasets used for evaluation:
ogbn-proteins represents proteins and their biological associations with vertices and edges; this dataset is from the Open Graph Benchmark.
Baselines. We compare FeatGraph with state-of-the-art graph processing systems, specifically Ligra on CPU and Gunrock on GPU. We also compare with vendor-provided sparse libraries, specifically MKL (2019.5) on CPU and cuSPARSE (10.1) on GPU, whenever possible, as only a subset of GNN kernels are supported in these libraries. In all the experiments, we first do a warm-up run and then take the average time of 10 runs as the measurement.
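The measurement protocol (one warm-up run, then the mean of 10 timed runs) can be expressed as a small helper. This is a generic sketch, not the authors' benchmarking script; the function name and its parameters are hypothetical.

```python
import time

def benchmark(fn, warmup=1, repeats=10):
    """Run fn for `warmup` untimed calls, then return the mean time of `repeats` calls."""
    for _ in range(warmup):
        fn()                                   # warm caches, trigger lazy setup
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

calls = []
mean_seconds = benchmark(lambda: calls.append(1))
```

For GPU kernels, a real harness would also need to synchronize the device before reading the clock, since kernel launches return asynchronously.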
V-B Performance Gain of GNN Kernels
We evaluate the performance gain of FeatGraph on three kernels: GCN aggregation, MLP aggregation, and dot-product attention. The kernels are executed on the full graph. We do the evaluation across a spectrum of feature lengths. For MLP aggregation, the feature length refers to d2 that is shown in Figure 2(b); d1 is fixed as 8.
Single-threaded CPU Performance. Table III shows that across all the evaluated datasets under different feature lengths, FeatGraph achieves 1.4×–4.0× speedup over Ligra on GCN aggregation, 4.4×–5.5× speedup on MLP aggregation, and 4.3×–6.0× speedup on dot-product attention, using a single thread. Compared against MKL on GCN aggregation, FeatGraph is faster in 14 out of 15 cases and achieves higher speedup with a larger feature length. Specifically, when the feature length is 512, FeatGraph is 1.8× faster on ogbn-proteins, 2.4× faster on reddit, and 4.4× faster on rand-100K. MKL does not support MLP aggregation and dot-product attention.
Multithreaded CPU Performance. Figure 10 shows that with 16 threads, for GCN aggregation on reddit, FeatGraph achieves a 12.6× speedup over its single-threaded execution, which is slightly higher than Ligra (9.5×) and MKL (9.8×). Similar observations apply to other datasets and kernels. As a result, FeatGraph consistently outperforms the others in a multithreaded environment. FeatGraph scales well due to two factors: 1) its parallelization method avoids LLC contention by assigning multiple threads to collectively work on one graph partition at a time; 2) the thread pool in the TVM runtime is lightweight [19].
GPU Performance. Table IV shows that FeatGraph is 24×–206× faster than Gunrock on GCN aggregation, 18×–96× faster on MLP aggregation, and 1.2×–3.1× faster on dot-product attention. The extreme slowness of Gunrock on GCN aggregation and MLP aggregation has two causes: 1) Gunrock’s edge parallelization execution incurs a huge overhead of atomic operations for vertex-wise reductions such as GCN aggregation and MLP aggregation; 2) Gunrock fails to exploit parallelism in feature dimension computation. FeatGraph is on par with cuSPARSE on GCN aggregation, being 10%–20% faster on ogbn-proteins and rand-100K while 10% slower on reddit. Notably, cuSPARSE does not support MLP aggregation and dot-product attention.
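The atomic-operation overhead of edge parallelization for vertex-wise reductions can be illustrated with a small sequential model. The sketch below is not Gunrock or FeatGraph code: the function names and the toy graph are assumptions, and the counter only marks updates that would need an atomic add on a GPU because several edge workers may target the same destination row.

```python
def edge_parallel_aggregate(edges, x, num_vertices):
    """Sum-aggregate with one logical worker per edge (assumed model).

    Every update to out[dst] may collide with another edge's update,
    so on a GPU each one would have to be an atomic add.
    """
    feat_len = len(x[0])
    out = [[0.0] * feat_len for _ in range(num_vertices)]
    contended_updates = 0
    for src, dst in edges:              # one iteration models one edge worker
        for f in range(feat_len):       # feature loop stays serial per worker
            out[dst][f] += x[src][f]
            contended_updates += 1
    return out, contended_updates


def vertex_parallel_aggregate(indptr, indices, x):
    """Sum-aggregate with one logical worker per destination vertex (CSR row).

    Each worker owns its output row, so no contended writes are needed;
    the inner feature loop is where feature-dimension parallelism would go.
    """
    feat_len = len(x[0])
    out = []
    for v in range(len(indptr) - 1):    # one iteration models one row worker
        acc = [0.0] * feat_len
        for e in range(indptr[v], indptr[v + 1]):
            src = indices[e]
            for f in range(feat_len):
                acc[f] += x[src][f]
        out.append(acc)                 # single private write per output row
    return out


# Toy graph (hypothetical): edges are (source, destination) pairs.
edges = [(1, 0), (0, 1), (2, 1), (3, 1), (3, 2)]
indptr = [0, 1, 4, 5, 5]               # same graph in CSR form, rows = destinations
indices = [1, 0, 2, 3, 3]
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

out_edge, contended_updates = edge_parallel_aggregate(edges, x, 4)
out_vertex = vertex_parallel_aggregate(indptr, indices, x)
```

Both functions produce the same result; the edge-parallel version performs one potentially contended update per edge per feature element, whereas the vertex-parallel version writes each output row once.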
V-C Optimization Implications
This subsection investigates the performance boost of each individual optimization technique described in Section III. For the sake of space, in each ablation analysis we only pick one dataset to show the optimization effects. Other datasets share similar observations.
Graph Partitioning and Feature Dimension Tiling. Figure 13 shows that feature dimension tiling combined with graph partitioning effectively boosts the performance of GCN aggregation on CPU. Specifically, when the feature length is 512, feature dimension tiling alone and graph partitioning alone bring 1.2× speedup and 1.7× speedup, respectively; combining the two achieves 2.2× speedup.
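How the two optimizations combine can be sketched as a loop nest over feature tiles and source-vertex partitions. The code below is a simplified illustration in plain Python, not FeatGraph's generated kernel: the loop order, the partitioning by source vertex, and the function names are assumptions, and a real implementation would pre-split the adjacency matrix per partition instead of filtering edges on the fly.

```python
def gcn_aggregate(indptr, indices, x):
    """Reference: out[v] = sum of x[u] over in-neighbors u of v (CSR by destination)."""
    feat_len = len(x[0])
    out = [[0.0] * feat_len for _ in range(len(indptr) - 1)]
    for v in range(len(indptr) - 1):
        for e in range(indptr[v], indptr[v + 1]):
            for f in range(feat_len):
                out[v][f] += x[indices[e]][f]
    return out


def gcn_aggregate_tiled(indptr, indices, x, num_graph_parts, num_feat_parts):
    """Same result, but the working set per step is one source partition
    restricted to one feature tile (the part meant to stay in cache)."""
    num_src, feat_len, num_dst = len(x), len(x[0]), len(indptr) - 1
    out = [[0.0] * feat_len for _ in range(num_dst)]
    part = max(1, -(-num_src // num_graph_parts))   # ceil division
    tile = max(1, -(-feat_len // num_feat_parts))
    for f0 in range(0, feat_len, tile):             # feature dimension tiling
        f1 = min(f0 + tile, feat_len)
        for s0 in range(0, num_src, part):          # graph (source) partitioning
            s1 = s0 + part
            for v in range(num_dst):
                for e in range(indptr[v], indptr[v + 1]):
                    u = indices[e]
                    if s0 <= u < s1:                # edge belongs to this partition
                        for f in range(f0, f1):
                            out[v][f] += x[u][f]
    return out


# Toy graph and features (hypothetical values).
indptr = [0, 1, 4, 5, 5]
indices = [1, 0, 2, 3, 3]
x = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0], [13.0, 14.0, 15.0, 16.0]]

reference = gcn_aggregate(indptr, indices, x)
tiled = gcn_aggregate_tiled(indptr, indices, x, num_graph_parts=2, num_feat_parts=2)
```

The partitioning factors change only the traversal order, not the result, which is why they can be tuned freely for locality.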
Adaptive Parallelization Strategies on GPU. Figure 13 shows that tree reduction boosts the performance of dot-product attention by up to 2×. The naïve parallelization strategy in Gunrock, which assigns the entire dot product operation on each edge to one CUDA thread, is less efficient in handling large feature lengths because it consumes too many registers per thread.
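The difference between the two strategies can be modeled sequentially. In the sketch below, which is an illustration rather than the CUDA code FeatGraph generates, each "thread" is a loop iteration; `num_threads` and the function names are assumptions, and the thread count is assumed to be a power of two.

```python
def dot_one_thread(a, b):
    """Naive strategy: a single worker walks the whole feature vector."""
    acc = 0.0
    for i in range(len(a)):
        acc += a[i] * b[i]
    return acc


def dot_tree_reduction(a, b, num_threads=8):
    """Cooperative strategy: workers compute strided partial sums, then
    combine them pairwise in log2(num_threads) reduction steps."""
    assert num_threads > 0 and num_threads & (num_threads - 1) == 0
    # Phase 1: each worker t accumulates elements t, t + num_threads, ...
    partial = [0.0] * num_threads
    for t in range(num_threads):
        for i in range(t, len(a), num_threads):
            partial[t] += a[i] * b[i]
    # Phase 2: tree reduction; the number of active workers halves each step.
    steps = 0
    stride = num_threads // 2
    while stride > 0:
        for t in range(stride):
            partial[t] += partial[t + stride]
        stride //= 2
        steps += 1
    return partial[0], steps


a = [float(i) for i in range(1, 17)]   # 1.0 .. 16.0
b = [2.0] * 16
naive = dot_one_thread(a, b)
tree, num_steps = dot_tree_reduction(a, b, num_threads=8)
```

With 8 workers, each one holds only a partial sum over two elements, and the final combination takes three pairwise steps instead of a 16-element serial loop.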
Hybrid Partitioning on GPU. Figure 13 shows the effect of hybrid partitioning on GCN aggregation tested on rand-100K. FeatGraph gets a 10%–20% performance boost from hybrid partitioning, and consequently outperforms cuSPARSE.
V-D Sensitivity Analysis
Sensitivity to Partitioning Factors. Figure 14 shows that the performance of FeatGraph is sensitive to partitioning factors for GCN aggregation on CPU. Specifically, on reddit, when the feature length is 128, the best performance is achieved with 16 graph partitions and 4 feature partitions. On the same graph, as the feature length increases, the optimal number of feature partitions increases proportionately, while the optimal number of graph partitions stays constant. Transferable tuning across graphs, i.e., using the optimal partitioning factors tuned on one graph to predict the optimal partitioning factors for a new graph, is more challenging and worth further study.
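One simple way to pick partitioning factors for a given graph and feature length is an exhaustive grid search over candidate values. The sketch below assumes a user-supplied `cost_fn(num_graph_parts, num_feat_parts)` that returns a measured kernel time; the helper name, the candidate lists, and the synthetic cost function are all hypothetical and are not measurements from the evaluation.

```python
import itertools


def tune_partitioning(cost_fn, graph_part_choices, feat_part_choices):
    """Return the (num_graph_parts, num_feat_parts) pair with the lowest cost.

    `cost_fn` is assumed to run the kernel with the given factors and
    return its execution time; ties keep the first candidate found.
    """
    best = None
    for g, f in itertools.product(graph_part_choices, feat_part_choices):
        cost = cost_fn(g, f)
        if best is None or cost < best[0]:
            best = (cost, g, f)
    return best[1], best[2]


def toy_cost(num_graph_parts, num_feat_parts):
    """Synthetic stand-in for a timing run, with its minimum placed at (16, 4)."""
    return abs(num_graph_parts - 16) + abs(num_feat_parts - 4)


best_graph_parts, best_feat_parts = tune_partitioning(
    toy_cost, graph_part_choices=[1, 2, 4, 8, 16, 32], feat_part_choices=[1, 2, 4, 8]
)
```

In practice `cost_fn` would wrap a timed kernel launch, and the search space would be pruned using the trend noted above, namely that the optimal number of feature partitions grows with the feature length while the number of graph partitions stays constant.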
Sensitivity to GPU Parameters. Figure 15 shows that FeatGraph performs better with a larger number of CUDA blocks for GCN aggregation on GPU, because a larger number of CUDA blocks can better utilize the massively parallel compute capacity of the GPU. In the evaluation, we set the number of CUDA blocks to the number of rows of the adjacency matrix.
Sensitivity to Graph Sparsity. Table V shows that FeatGraph achieves higher speedup over MKL as the graph sparsity decreases for GCN aggregation on CPU. This trend arises because a denser graph has more data reuse, which FeatGraph is able to exploit by graph partitioning and feature dimension tiling.
V-E End-to-End GNN Training and Inference
We integrated FeatGraph into DGL and evaluated the performance of FeatGraph in end-to-end GNN training and inference on three models: a 2-layer graph convolutional network (GCN) [16] of hidden size 512, a 2-layer GraphSage [9] of hidden size 256, and a 2-layer graph attention network (GAT) [35] of hidden size 256. GCN uses sum aggregation and requires generalized SpMM computations in both forward and backward propagation; GraphSage follows an architecture similar to that of GCN but allows more flexible aggregation functions (e.g., max); GAT uses dot-product attention, thus requiring both generalized SpMM and SDDMM computations.
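The two sparse patterns behind these models can be sketched in plain Python as follows. This is an illustrative reference, not FeatGraph's programming interface: the function names, the argument order, and the convention that vertices without in-neighbors receive zeros are assumptions.

```python
def generalized_spmm(indptr, indices, x, msg_fn, reduce_fn):
    """Vertex-wise pattern: for each destination v, reduce the messages
    msg_fn(x[u], x[v]) over its in-neighbors u (CSR rows = destinations)."""
    feat_len = len(x[0])
    out = []
    for v in range(len(indptr) - 1):
        acc = None
        for e in range(indptr[v], indptr[v + 1]):
            msg = msg_fn(x[indices[e]], x[v])
            if acc is None:
                acc = list(msg)
            else:
                acc = [reduce_fn(p, q) for p, q in zip(acc, msg)]
        out.append(acc if acc is not None else [0.0] * feat_len)
    return out


def generalized_sddmm(indptr, indices, x, edge_fn):
    """Edge-wise pattern: one output edge_fn(x[u], x[v]) per edge, in CSR order."""
    out = []
    for v in range(len(indptr) - 1):
        for e in range(indptr[v], indptr[v + 1]):
            out.append(edge_fn(x[indices[e]], x[v]))
    return out


# Toy graph and features (hypothetical values).
indptr = [0, 1, 4, 5, 5]
indices = [1, 0, 2, 3, 3]
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# GCN-style sum aggregation: message = source feature, reduction = add.
gcn_sum = generalized_spmm(indptr, indices, x, lambda src, dst: src, lambda p, q: p + q)
# GraphSage-style max aggregation: same message, reduction = max.
sage_max = generalized_spmm(indptr, indices, x, lambda src, dst: src, max)
# Dot-product attention score on every edge.
attention = generalized_sddmm(
    indptr, indices, x, lambda src, dst: sum(s * d for s, d in zip(src, dst))
)
```

Swapping only the message and reduction functions moves between the GCN and GraphSage aggregations, while the attention scores use the edge-wise pattern over the same graph structure.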
Accuracy. FeatGraph as a backend is for performance optimization without changing the semantics of GNN models. As a sanity check, we evaluate the accuracy of the three models on the task of vertex classification. The 233K vertices of the reddit dataset are split into 153K, 24K, and 56K for training, validation, and testing, respectively. We train the models for 200 epochs. The testing accuracy obtained by DGL using FeatGraph matches that obtained by DGL using its original backend Minigun: 93.7% for GCN and 93.1% for GraphSage. The training of GAT does not converge due to gradient explosion, with either the FeatGraph backend or the Minigun backend.
Speedup. Table VI reports the training and inference time of one epoch for the three GNN models. The time of tuning partitioning factors is excluded because it is amortized over multiple epochs; it is less than 1% of the time of training GCN for 200 epochs. Furthermore, the partitioning factors tuned on GCN are directly applied to GraphSage and GAT: the number of graph partitions is kept the same and the number of feature partitions is adjusted to the feature length. The results show that on CPU, FeatGraph speeds up both training and inference by more than 20× on all three models; on GPU, FeatGraph speeds up training by more than 2×, and inference by 1.4×–7.1×. The highest speedup is achieved on GAT, which has a more complex architecture than GCN and GraphSage.
VI Related Work
Recent years have seen an emergence of specialized frameworks that attempt to make the processing of GNN workloads easier and faster. For example, DGL [37] and PyG [6] wrap deep learning systems with a message-passing programming interface for GNNs. NeuGraph [20] addresses the challenge of large-scale GNN training by partitioning the dataflow over multiple GPUs and employing a chain-based streaming schedule. FeatGraph focuses on optimizing graph-specific kernels, and can be integrated into these GNN frameworks to serve as an efficient backend on both CPU and GPU.
Systems for processing traditional graph workloads (e.g., BFS, PageRank) have been extensively studied in the literature [21, 28, 8, 30, 38, 43]. These systems allow users to flexibly express graph algorithms by defining computation on each vertex/edge. Among them, Ligra [30] achieves superior performance on CPU by dynamically switching the message propagation direction (i.e., push or pull) based on the size of the frontier (active vertices) at each iteration, and Gunrock [38] achieves superior GPU performance through sophisticated scheduling methods that ensure load balance in its edge parallelization execution. However, Ligra does not exploit cache optimizations, and its push-pull optimization is no longer critical in GNN workloads since typically all vertices are active at each layer of a GNN model. Gunrock fails to achieve good performance for GNN workloads because it is unable to exploit parallelism in feature dimension computation, let alone adapt parallelization strategies to computation patterns.
Another series of works focuses on formulating graph algorithms as sparse linear algebra operations [12, 31, 13, 44]. For example, BFS is formulated as a sequence of sparse-matrix sparse-vector multiplications (SpMSpV); PageRank is formulated as a sequence of sparse-matrix dense-vector multiplications (SpMV). FeatGraph borrows from these works the general idea of mapping graph computations to sparse kernels. FeatGraph differs from these works in two major aspects: 1) FeatGraph can express more complex user-defined functions (UDFs) to support the diverse variants of GNN models; 2) FeatGraph pays special attention to optimizations of feature dimension computation, which are unexploited in previous efforts.
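The two formulations mentioned above can be sketched in plain Python on an adjacency-list graph. This is an illustrative example, not code from any of the cited systems: the function names, the damping factor of 0.85, the fixed iteration count, and the uniform handling of vertices without out-edges are assumptions.

```python
def bfs_spmspv(out_neighbors, source):
    """BFS levels. Each iteration is one sparse-matrix sparse-vector product
    over the Boolean (OR, AND) semiring: only the adjacency lists of the
    nonzero frontier entries are read, and visited vertices are masked out."""
    level = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        reached = set()
        for u in frontier:                     # nonzeros of the sparse vector
            reached.update(out_neighbors[u])
        frontier = {v for v in reached if v not in level}
        for v in frontier:
            level[v] = depth
    return level


def pagerank_spmv(out_neighbors, damping=0.85, num_iters=50):
    """PageRank. Each iteration is one sparse-matrix dense-vector product
    with the column-normalized adjacency matrix, plus the teleport term."""
    n = len(out_neighbors)
    rank = [1.0 / n] * n
    for _ in range(num_iters):
        new_rank = [(1.0 - damping) / n] * n
        for u, nbrs in enumerate(out_neighbors):
            if nbrs:
                share = damping * rank[u] / len(nbrs)
                for v in nbrs:
                    new_rank[v] += share
            else:                              # no out-edges: spread uniformly
                for v in range(n):
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank


# Toy directed graph (hypothetical): vertex -> list of out-neighbors.
out_neighbors = [[1, 2], [3], [3], []]
levels = bfs_spmspv(out_neighbors, source=0)
ranks = pagerank_spmv(out_neighbors)
```

In both cases the per-vertex data is a scalar, which is the setting these linear-algebra formulations target; GNN workloads replace the scalar with a feature vector.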
Vendor-specific libraries (e.g., MKL [11] and cuSPARSE [24]) provide highly optimized implementations for sparse kernels that are identified as important to a broad range of applications. Compared with these libraries, FeatGraph offers more comprehensive kernel coverage for GNN use cases. Besides, by adopting a tensor compiler approach in contrast to the manual optimization approach of these libraries, FeatGraph is able to search for the best scheduling schemes on both CPU and GPU.
TACO [17] is a compiler targeting general sparse tensor computations by the design of a flexible sparse tensor representation system. However, TACO does not allow scheduling as TVM does, and it lacks support for generating high-quality GPU code. Instead of targeting general sparse computations, FeatGraph targets the core sparse patterns of GNNs, namely, generalized SpMM for vertex-wise computations and generalized SDDMM for edge-wise computations. This design choice enables FeatGraph to fully exploit the optimization opportunities specific to GNN workloads.
VII Conclusion
We propose FeatGraph to enable performant processing of graph neural network (GNN) workloads. FeatGraph provides a flexible programming interface that is able to express the diverse variants of GNN workloads by composing sparse templates with customizable feature dimension computations on each vertex/edge. FeatGraph extensively explores optimization opportunities in both graph traversal and feature dimension computation. Moreover, it decouples these two levels of optimizations to improve the productivity of developing new kernels for emerging GNN models. FeatGraph is portable to existing GNN frameworks as a high-performance backend. Our evaluation verifies that FeatGraph offers comprehensive kernel coverage and outperforms the state-of-the-art solutions. Future work includes utilizing more intelligent tuners [5] to further improve the performance, and integrating FeatGraph into large-scale GNN training systems such as NeuGraph to accelerate multi-GPU training.
Acknowledgement
We thank the anonymous reviewers for valuable comments. The authors affiliated with Cornell University were funded in part by CRISP, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, and by AFRL and DARPA under agreement number FA86501827863. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of AFRL and DARPA or the U.S. Government.
Footnotes
 https://docs.dgl.ai/api/python/function.html#messagefunctions
 https://ogb.stanford.edu/
 The latest cuSPARSE supports dot-product attention via ConstrainedGeMM.
References
[1] (2016) TensorFlow: a system for large-scale machine learning. USENIX Symp. on Operating Systems Design and Implementation (OSDI). Cited by: §I.
[2] (2014) OpenTuner: an extensible framework for program autotuning. Int’l Conf. on Parallel Architectures and Compilation Techniques (PACT). Cited by: §IV-A.
[3] (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §II-A.
[4] (2018) TVM: an automated end-to-end optimizing compiler for deep learning. USENIX Symp. on Operating Systems Design and Implementation (OSDI). Cited by: §I, §II-C, §II-C, §III-A.
[5] (2018) Learning to optimize tensor programs. Conf. on Neural Information Processing Systems (NIPS). Cited by: §IV-A, §VII.
[6] (2019) Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428. Cited by: §I, §VI.
[7] (2017) Neural message passing for quantum chemistry. Int’l Conf. on Machine Learning (ICML). Cited by: §I, §II-A.
[8] (2012) PowerGraph: distributed graph-parallel computation on natural graphs. USENIX Symp. on Operating Systems Design and Implementation (OSDI). Cited by: §I, §VI.
[9] (2017) Inductive representation learning on large graphs. Conf. on Neural Information Processing Systems (NIPS). Cited by: §V-A, §V-E.
[10] Optimizing parallel reduction in CUDA. Note: http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf, 2012 Cited by: §III-C2.
[11] Intel math kernel library. Note: https://software.intel.com/content/www/us/en/develop/tools/mathkernellibrary.html Cited by: TABLE I, §I, §I, §VI.
[12] (2016) Mathematical foundations of the GraphBLAS. IEEE High Performance Extreme Computing Conf. (HPEC). Cited by: §VI.
[13] (2011) Graph algorithms in the language of linear algebra. Society for Industrial and Applied Mathematics. Cited by: §VI.
[14] (2015) Scalable SIMD-efficient graph processing on GPUs. Int’l Conf. on Parallel Architectures and Compilation Techniques (PACT). Cited by: §I, §III-C2.
[15] (2014) CuSha: vertex-centric graph processing on GPUs. Int’l Symp. on High-Performance Parallel and Distributed Computing (HPDC). Cited by: §I, §III-C2.
[16] (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §II-A, §V-E.
[17] (2017) The tensor algebra compiler. Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). Cited by: §VI.
[18] (2018) Combinatorial optimization with graph convolutional networks and guided tree search. Conf. on Neural Information Processing Systems (NIPS). Cited by: §I.
[19] (2019) Optimizing CNN model inference on CPUs. USENIX Annual Technical Conf. (ATC). Cited by: §IV-A, §IV-A, §V-B.
[20] (2019) NeuGraph: parallel deep neural network computation on large graphs. USENIX Annual Technical Conf. (ATC). Cited by: §I, §VI.
[21] (2010) Pregel: a system for large-scale graph processing. Int’l Conf. on Management of Data (SIGMOD). Cited by: §I, §VI.
[22] (2015) Scalability! but at what cost?. Workshop on Hot Topics in Operating Systems (HotOS). Cited by: §III-C1.
[23] (2019) Minigun: lightweight GPU kernel interface for graph operations. GitHub. Note: https://github.com/dglai/minigun Cited by: §IV-B.
[24] cuSPARSE library. Note: https://developer.nvidia.com/cusparse Cited by: TABLE I, §I, §I, §VI.
[25] (2018) Recurrent relational networks. Conf. on Neural Information Processing Systems (NIPS). Cited by: §I.
[26] (2019) PyTorch: an imperative style, high-performance deep learning library. Conf. on Neural Information Processing Systems (NeurIPS). Cited by: §I.
[27] (2013) Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices. Cited by: §II-C, §II-C.
[28] (2013) X-Stream: edge-centric graph processing using streaming partitions. ACM Symp. on Operating Systems Principles (SOSP). Cited by: §I, §VI.
[29] (2017) A simple neural network module for relational reasoning. Conf. on Neural Information Processing Systems (NIPS). Cited by: §I.
[30] (2013) Ligra: a lightweight graph processing framework for shared memory. ACM SIGPLAN Notices. Cited by: TABLE I, §I, §I, §II-B, §VI.
[31] (2015) GraphMat: high performance graph analytics made productive. Int’l Conf. on Very Large Data Bases (VLDB). Cited by: §VI.
[32] (2019) Deep representation learning for social network analysis. arXiv preprint arXiv:1904.08547. Cited by: §I.
[33] (2018) Attention-based graph neural network for semi-supervised learning. arXiv preprint arXiv:1803.03735. Cited by: §II-A.
[34] (2017) Attention is all you need. Conf. on Neural Information Processing Systems (NIPS). Cited by: §II-A.
[35] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §II-A, §V-E.
[36] (2019) A unified optimization approach for CNN model inference on integrated GPUs. Int’l Conf. on Parallel Processing (ICPP). Cited by: §IV-A.
[37] (2019) Deep graph library: towards efficient and scalable deep learning on graphs. arXiv preprint arXiv:1909.01315. Cited by: §I, §VI.
[38] (2016) Gunrock: a high-performance graph processing library on the GPU. ACM SIGPLAN Notices. Cited by: TABLE I, §I, §I, §I, §II-B, §III-C2, §III-C2, §VI.
[39] (2018) Design principles for sparse matrix multiplication on the GPU. European Conf. on Parallel Processing (Euro-Par). Cited by: §III-C2.
[40] (2018) Graph convolutional neural networks for web-scale recommender systems. Int’l Conf. on Knowledge Discovery and Data Mining (KDD). Cited by: §I.
[41] (2017) Making caches work for graph analytics. IEEE Int’l Conf. on Big Data. Cited by: §I, §III-C1, §III-C1.
[42] (2014) High performance machine learning through codesign and rooflining. Ph.D. Thesis, UC Berkeley. Cited by: §II-A.
[43] (2015) FlashGraph: processing billion-node graphs on an array of commodity SSDs. USENIX Conf. on File and Storage Technologies (FAST). Cited by: §I, §VI.
[44] (2016) Semi-external memory sparse matrix multiplication for billion-node graphs. IEEE Trans. on Parallel and Distributed Systems (TPDS). Cited by: §VI.
[45] (2015) GridGraph: large-scale graph processing on a single machine using 2-level hierarchical partitioning. USENIX Annual Technical Conf. (ATC). Cited by: §I, §III-C1.