UWB-GCN: Accelerating Graph Convolutional Networks through Runtime Workload Rebalancing
Abstract
Deep learning systems have been applied mostly to Euclidean data such as images, video, and audio. In many applications, however, entities and their relationships are better expressed with graphs. Graph convolutional networks (GCNs) are a promising approach to learning efficiently from graph data structures, and have shown advantages in many critical applications such as power system analysis, chemical reactivity prediction, material property prediction, E-commerce, and cybersecurity. As with other deep learning modalities, hardware acceleration is critical. The challenge is that real-world graphs are often extremely large and unbalanced; this poses both significant performance demands and design challenges.
We propose an architecture that accelerates GCN inference, the Ultra-Workload-Balanced GCN (UWB-GCN). To address the major performance bottleneck of workload imbalance, we propose two techniques: dynamic local sharing and dynamic remote switching. Both rely on hardware flexibility to auto-tune the system; this is effected with negligible area and delay overhead. In particular, UWB-GCN profiles the sparse graph pattern while continuously adjusting the workload distribution strategy among a large number of processing elements (PEs). Once the system converges to an ideal configuration, this configuration is used for the remaining iterations. To the best of our knowledge, UWB-GCN is the first accelerator design targeting GCNs and the first that auto-tunes workload balance in hardware rather than software. These methods result in near-ideal workload balance in processing sparse matrices. Experimental results show that UWB-GCN can perform inference on the Nell graph (66K vertices, 266K edges) in 8.1 ms; this corresponds to speedups of 199×, 16×, and 7.5× compared with, respectively, CPU, GPU, and a baseline design with no workload rebalancing.
I Introduction
Deep learning paradigms such as Convolutional Neural Networks (CNNs) [23] and Long Short-Term Memory (LSTM) [20] have been applied to a wide range of applications, from image classification, through video processing, to speech recognition and natural language processing. These paradigms, however, can only extract and analyze latent information from Euclidean data such as images, video, audio, and text [37]. This fact limits the real-world applicability of neural networks.
In the real world, an increasing number of applications, such as E-commerce [6, 42], molecular bioactivity identification in medical research [14], social network analysis [35, 22], and cascading failure prediction for national power grids, use non-Euclidean data structures and are modeled as graphs, with nodes referring to the objects involved in the target application and edges representing the relations between nodes. For all these applications, graphs have tremendously large numbers of nodes whose degrees vary dramatically, leading to significant data irregularity.
The irregularity in graph data makes most existing deep learning algorithms fall short: critical feature extraction operations, such as convolutions, are not directly applicable. As a result, Graph Neural Networks have been proposed, in various forms, to extend deep learning approaches to graph data [15, 30, 34, 27, 10, 37]. Among these, the Graph Convolutional Network (GCN), an approach that marries some ideas of CNNs to the distinct needs of graph data processing, has demonstrated significant potential [7, 11, 22].
With the rapid development of GCNs, designing specialized hardware accelerators for GCN inference has become an urgent issue. GCNs have already been adopted in multiple real-world applications, including electric grid cascading failure analysis [29], prediction of chemical reactivity [9], prediction of synthesized material properties [38], modeling polypharmacy side-effects [49], accurate advertisement in E-commerce [41], and cybersecurity [31], among others [26, 37]. Many of these applications require low-latency, high-throughput GCN inference. Existing platforms, however, are not well-suited to handling the irregularity of sparse graph data, which hinders the practical utilization of GCNs.
Although sparsity has already been addressed in many existing sparse-CNN (SCNN) accelerator designs [18, 45, 2, 21, 47], the challenges of accelerating GCNs are significantly different. The first difference is the source of sparsity. For SCNNs, the majority of the sparsity comes from the weights (due to model redundancy). Two types of pruning/compression techniques have been proposed for leveraging this sparsity: structural pruning condenses the weight matrix during training [12, 36], while unstructured pruning [18, 47, 25] can pre-profile the weight matrix for design specialization, as the weight matrix is fixed before inference. For GCNs, however, the sparsity comes from the input data itself and is only apparent at runtime. Therefore, existing weight-focused SCNN pruning and compression techniques cannot work well. In addition, compared with deep CNNs comprising dozens or even hundreds of layers, existing GCN models are usually very shallow, most of them having no more than three layers [46].
The second difference is the degree of sparsity. For SCNNs, the dimensions of the weights typically range from dozens to hundreds, with sparsity around 50%. Thus, it can either be assumed that the weight matrix fits in on-chip memory (after compression) [18, 12], or a small scheduling window can be used for index matching [40] when multiplying the sparse matrices (i.e., pairing the rows of the first matrix with the correct columns of the second matrix). For example, Cambricon-S [47, 45] and Stitch-X [25] compare hundreds (e.g., 256) of index pairs per cycle and assume that sufficient nonzeros (nnz) can always be found for each PE. This may be true for SCNNs, but for GCNs, the dimensions of the sparse matrices can range from thousands to millions, with sparsity larger than 99.9% (see Table I). This leads to insurmountable problems for both SCNN approaches. For the first, the sparse matrix no longer fits into on-chip memory, leading to irregular off-chip memory access. For the second, either the window size is kept the same and the system remains mostly idle, or the window size must be expanded tremendously, making the hardware infeasible.
The third difference is the distribution of sparsity. As shown in Figure 1, for SCNNs the distribution of the number of nonzeros per row or column is roughly balanced, so the impact of workload imbalance is not very prominent [47, 21]; it can thus be resolved via a centralized task queue [18]. However, the huge real-world graphs for GCNs often follow a power-law distribution (see Figure 1), meaning that a small set of rows/columns can hold the majority of the nonzeros. This can lead to serious runtime workload imbalance for which existing SCNN accelerators were not designed. For these reasons, novel accelerator designs are required for the emerging GCN workloads.
TABLE I: Density and dimensions of the matrices for the five evaluated datasets.

                   Cora    Citeseer  Pubmed   Nell     Reddit
Density    A       0.18%   0.11%     0.028%   0.0073%  0.043%
           W       100%    100%      100%     100%     100%
           X1      1.27%   0.85%     10.0%    0.011%   51.6%
           X2      78.0%   89.1%     77.6%    86.4%    60.0%
Dimension  Nodes   2708    3327      19717    65755    232965
           F1      1433    3703      500      61278    602
           F2      16      16        16       64       64
           F3      7       6         3        186      41
We thus propose UWB-GCN, a hardware accelerator for GCN inference with runtime workload rebalancing. The processing begins by using an online profiler to accurately assess the workload imbalance. The imbalance information is used by two strategies for rebalancing the workload: local sharing and remote switching. Local sharing rebalances the workload among neighbors. However, given only local sharing, when nonzero elements are clustered, it may take many iterations for the auto-tuner to converge. We solve this problem with remote switching, which shuffles larger regions. Figure 2 illustrates the auto-tuning process among 10 PEs. The color of a PE indicates its utilization. Utilization = 1 means the workload processed by this PE is the same as if the entire workload were evenly distributed among all PEs. Utilization > 1 (red shift) and utilization < 1 (blue shift) indicate over-utilization and under-utilization, respectively.
The goal is to balance the workload among all PEs. As can be seen in Figure 2 (starting top left and moving anticlockwise), in each auto-tuning iteration, we first perform remote switching to balance the workload of large regions (top left to bottom left) and then perform local sharing for fine-grained tuning among neighbors (bottom left to bottom right). We again measure the workload distribution and, if necessary, start a new rebalancing iteration (bottom right to top right). After several iterations, the system converges to the ideal balanced state, which is then used for the remainder of the computation. All profiling and adjustment are performed by hardware at runtime with negligible area and delay overhead.
We implement UWB-GCN in Verilog HDL and evaluate it on a Xilinx VCU118 FPGA. This is only for demonstration purposes, since UWB-GCN does not rely on any FPGA-specific features. The results show that UWB-GCN can enhance the average PE utilization from 60% to 92% as compared to a baseline design without workload balancing. Overall, this paper makes the following contributions:

We propose the first hardware accelerator for graph convolutional network inference.

To handle the ultra workload imbalance issue, we propose a hardware-based workload distribution auto-tuning framework, which includes an efficient workload profiling technique and two workload rebalancing strategies.

Evaluations on an FPGA show that the auto-tuning framework can quickly converge to the optimal workload distribution and thus achieve significant performance improvement. Compared with a CPU (Intel Xeon E5-2698 v4 + MKL v2018), a GPU (NVIDIA Tesla P100 + cuSPARSE v10.0), and a baseline accelerator without workload rebalancing, UWB-GCN achieves average speedups of 248.2×, 79.0×, and 2.8×, respectively.
II Motivation
In this section we briefly introduce GCNs, showing their differences from traditional network models such as CNNs and the hardware design challenges arising from these differences.
II-A Graph Convolutional Network Structure
Equation 1 shows the layer-wise forward propagation of a multi-layer spectral GCN [37, 22]:

X^(l+1) = σ( Ã X^(l) W^(l) )    (1)

A is the graph adjacency matrix, with each row delineating the connections of a vertex with all the other vertices in the graph. X^(l) is the matrix of input features in layer l; each column of X^(l) represents a feature while each row denotes a node. W^(l) is the weight matrix of layer l. σ denotes a nonlinear activation function such as ReLU [23]. In general, A needs to be normalized: Ã = D̂^(−1/2) Â D̂^(−1/2), where Â = A + I, I is the identity matrix, and D̂_ii = Σ_j Â_ij. The reason is that, without normalization, multiplying the feature vector by A changes its scale: nodes with more neighbors tend to have larger values under feature extraction. Note that during both training and inference of a GCN, A remains constant. Since Ã can be computed offline from A, in the remainder of this paper we use A to denote the normalized Ã. In general, A is multiplied only once per layer. However, when multi-hop neighboring information is to be collected, A can be multiplied twice or more (i.e., A², A³, etc.) per layer.
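As a concrete functional sketch of Equation 1, the normalization and one layer of propagation can be written in a few lines of NumPy (a toy dense example for clarity; the graph and dimensions here are illustrative, not one of the paper's datasets):

```python
import numpy as np

def normalize_adj(A):
    """A_tilde = D_hat^(-1/2) (A + I) D_hat^(-1/2), as in Eq. 1."""
    A_hat = A + np.eye(A.shape[0])      # add self-loops: A_hat = A + I
    d = A_hat.sum(axis=1)               # D_hat_ii = sum_j A_hat_ij
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, X, W):
    """One GCN layer: X_next = ReLU(A_norm @ X @ W)."""
    return np.maximum(A_norm @ X @ W, 0.0)

# Tiny 4-node path graph, 3 input features mapped to 2 output features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.random((4, 3))
W = rng.random((3, 2))
out = gcn_layer(normalize_adj(A), X, W)
print(out.shape)   # (4, 2): one 2-dimensional feature vector per node
```

Note how the normalized adjacency stays symmetric, so feature scale does not grow with node degree.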
Equation 1 is derived from graph signal processing theory: convolutions on a graph can be converted to a multiplication of a signal x (i.e., a scalar for each node) and a filter g in the frequency domain via the Fourier transform:

g ⋆ x = U ( (Uᵀ g) ⊙ (Uᵀ x) )    (2)

where ⊙ denotes the Hadamard product. U is the collection of eigenvectors of the normalized graph Laplacian L = I_N − D^(−1/2) A D^(−1/2) = U Λ Uᵀ. The diagonal matrix Λ comprises the eigenvalues. If a frequency-domain filter g_θ = diag(θ) is defined, then Equation 2 can be simplified [7] as:

g_θ ⋆ x = U g_θ Uᵀ x    (3)

Equation 3 can be further simplified by defining the filter g_θ as Chebyshev polynomials of the diagonal matrix Λ [11, 22], to obtain Equation 1.
Figure 3 illustrates the structure of a GCN layer. By multiplying A and X, information from 1-hop neighboring nodes is integrated. By multiplying with W and going through the nonlinear activation function σ, we obtain the output of this layer, which is also the feature matrix for the next layer. The matrix A in different layers can be the same (e.g., normal graphs) or different (e.g., hypergraphs). After multiple layers, the GCN is able to extract very high-level abstract features for various learning purposes.
II-B Workload Imbalance from Power-Law Graphs
Real-world graphs typically follow a power-law distribution, which states that the number of nodes with degree k is proportional to k^(−γ) for some constant γ. This implies that in the adjacency matrix A, a small number of the rows (or columns) contain the majority of the nonzeros, whereas the majority of the rows (or columns) are almost empty. Figure 4 shows the distribution of nonzero elements for the five publicly available datasets that are widely used for GCN evaluation [22]. The power-law effect is quite prominent for Cora, Citeseer, and Nell.
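The concentration of nonzeros in a few rows can be illustrated with a small synthetic power-law degree sample (a sketch; the exponent, graph size, and seed are illustrative and not drawn from the paper's datasets):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
gamma = 2.2                                  # illustrative power-law exponent
# Draw node degrees with a Zipf (discrete power-law) sampler, capped at n-1
degrees = np.minimum(rng.zipf(gamma, size=n), n - 1)

# What fraction of all edge endpoints (row nonzeros) do the top 1% of rows hold?
sorted_deg = np.sort(degrees)[::-1]
top1pct = sorted_deg[: n // 100].sum() / sorted_deg.sum()
print(f"top 1% of rows hold {top1pct:.0%} of nonzeros")
```

Under an even distribution the top 1% of rows would hold only 1% of nonzeros; a power-law sample concentrates a far larger share there, which is exactly the imbalance a static row-to-PE mapping cannot absorb.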
Matrix X is also sparse. For the first layer, the sparsity is usually high (see X1 in Table I). This is because, in a large graph (e.g., with millions of vertices), many of the features are local features. Therefore, the entries in X corresponding to these local features for remote nodes are zero (explaining the sparsity). As the weight matrix W is usually dense, the output of XW is also dense. However, because of the activation function, more zeros are generated; thus the final output (also the input of the next layer) is sparse, but with sparsity usually below 50% (see X2 in Table I).
The sizes of the matrices in GCNs depend on the dataset and can range from thousands to millions or more. Therefore, A can be extremely large and is stored in a sparse format. A in different layers can be identical or distinct (e.g., evolving or sampled from the same graph). X is very large for the first layer. However, in contrast with CNNs, where the number of features per layer is roughly similar or increasing, the number of features in GCNs often drops drastically by layer. It is possible for there to be thousands of features in the first layer, but only a few dozen in the second. X is stored in a sparse format, and the output matrix after ReLU is also sparse.
III GCN Baseline Architecture
In this section we describe an initial architecture for GCN acceleration. This “baseline” design shares some similarities with existing sparse-CNN accelerator designs, but in addition it can support ultra-high sparsity and large matrix sizes. In the next section we augment this design to achieve near-optimal workload balancing.
III-A Matrix Computation Order
To compute AXW, there are two alternative computation orders: (AX)W and A(XW). The choice is significant, as it dictates the number of nonzero multiplications. Based on our profiling, A is ultra-sparse and large, X is generally sparse and usually has a large number of columns, and W is small and dense. Looking first at (AX)W: since multiplying A and X requires complex sparse-sparse matrix multiplication and produces a very large dense matrix, multiplying that result by another dense matrix leads to a significant computation workload and long delay. Alternatively, for A(XW), both multiplications are sparse-dense matrix multiplications (SpMM), and the scale of computation is drastically smaller. Table II lists the amount of computation for the five datasets following the two approaches. Since the difference is quite obvious, in our design we first perform XW and then left-multiply with A.
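The gap between the two orders can be checked with a quick multiply-accumulate count, charging each sparse-times-dense product nnz(sparse) × (dense columns). The dimensions and densities below are illustrative stand-ins, not the exact profiled numbers of Table II:

```python
def spmm_cost(nnz_sparse, dense_cols):
    """MAC count for (sparse matrix) x (dense matrix) with given nnz."""
    return nnz_sparse * dense_cols

# Hypothetical GCN-like sizes: n nodes, f1 input features, f2 output features
n, f1, f2 = 20000, 1000, 16
nnz_A = int(n * n * 1e-4)       # adjacency density 0.01%
nnz_X = int(n * f1 * 0.01)      # feature density 1%

# Order 1: (A·X)·W — A·X yields a dense n x f1 matrix, then a dense GEMM
cost1 = spmm_cost(nnz_A, f1) + n * f1 * f2
# Order 2: A·(X·W) — two SpMMs against narrow dense operands (f2 columns)
cost2 = spmm_cost(nnz_X, f2) + spmm_cost(nnz_A, f2)

print(cost1, cost2, cost1 / cost2)   # order 2 is orders of magnitude cheaper
```

The dominant saving comes from never materializing the n × f1 dense intermediate: every product in order 2 has only f2 dense columns.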
TABLE II: Number of operations under the two computation orders.

                     Cora    Citeseer  Pubmed   Nell    Reddit
Layer 1   (A·X)·W    62.3M   197.5M    163.2M   257G    16.3G
          A·(X·W)    999.7K  1.87M     17.5M    47M     6.1G
Layer 2   (A·X)·W    468.2K  493.0K    2.3M     800M    764.3M
          A·(X·W)    329.3K  357.6K    1.06M    735M    530.3M
All       (A·X)·W    62.8M   198.0M    165.5M   258G    17.1G
          A·(X·W)    1.33M   2.23M     18.6M    782M    6.6G
III-B Baseline SpMM Design
Given B = A × X, if A is m×n, X is n×k, and B is m×k, then we can reformulate the product as:

b_j = Σ_{i=0}^{n−1} a_i · x_ij    (4)

where b_j is the j-th column of B, a_i is the i-th column of A, and x_ij is the element of X at row i and column j. In other words, by broadcasting the i-th element of column j of X to the entire column a_i of A, we obtain a partial column of B. Essentially, X is processed in a streaming fashion: each element finishes all the computation it is involved in at once, and then is done. In this way, we reuse the entire sparse matrix A for each column of X (k times in total). Such a design brings additional advantages when A and X are stored in Compressed-Sparse-Column (CSC) format (see Figure 5). A further benefit is that it provides opportunities to pipeline multiple SpMM operations, as will be discussed later. Since a complete result element of B requires an entire corresponding row of A, to avoid expensive parallel reduction in hardware, we partition A and B along the rows and assign them to PEs. Figure 6 shows the procedure for calculating B = A × X. The columns of A and elements of X in the same color are to be multiplied and stored as partial results in B with the same color. To reduce the memory access demand for matrix A, we propose an inter-layer interleaved data forwarding technique (discussed in Section III-D).
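Equation 4's column-by-column dataflow can be sketched with SciPy's CSC storage, which keeps each column's nonzeros contiguous in `data` with their row ids in `indices` (a functional model of the streaming computation, not the RTL):

```python
import numpy as np
from scipy.sparse import random as sprandom

def column_spmm(A_csc, X):
    """B = A @ X per Eq. 4: for each column j of X, scale the columns of A
    by x_ij and accumulate into column j of B."""
    m, n = A_csc.shape
    k = X.shape[1]
    B = np.zeros((m, k))
    for j in range(k):                        # one "round" per column of X
        for i in range(n):                    # stream column a_i of A
            x_ij = X[i, j]
            if x_ij == 0.0:
                continue
            start, end = A_csc.indptr[i], A_csc.indptr[i + 1]
            rows = A_csc.indices[start:end]   # row ids of column i's nonzeros
            B[rows, j] += A_csc.data[start:end] * x_ij
    return B

A = sprandom(50, 40, density=0.05, format="csc", random_state=0)
X = np.random.default_rng(0).random((40, 3))
assert np.allclose(column_spmm(A, X), A.toarray() @ X)
```

Note that the inner update touches only the rows listed in `indices`; in hardware those row ids are what steer each nonzero to the PE owning that row.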
Workload Mapping: In the baseline design, under the assumption that nonzeros are evenly distributed among the rows, we use a direct and static mapping from matrix rows to PEs. In Figure 7, every two rows of the sparse matrix are mapped to a separate PE; each PE eventually processes three nonzeros.
III-C Baseline Architecture Design
Figure 8 illustrates the baseline design, comprising the following modules: sparse-matrix memory (SpMMeM), dense-column memory (DCM), task distributor & queue (TDQ), PE array, and accumulation-buffers array (ACC). SpMMeM buffers the input sparse matrix; DCM buffers the input dense matrix. TDQ distributes tasks to the PEs. The PE array performs concurrent multiplication. Finally, ACC buffers the partial results of the resulting matrix for accumulation. Depending on the sparsity and storage format of the sparse input, we have two alternative designs for TDQ:
TDQ1 (left side of Figure 8) is used when the sparse input (e.g., X) is generally sparse and stored in dense format. We perform the direct row partition discussed above and map each partition to the input buffer of a PE (see Figure 7). Each cycle, data are forwarded to a PE, assuming evenly distributed nonzeros. As one PE may account for more than a single row, we allocate multiple task queues (TQs) per PE. As shown in Figure 8 (left), in each cycle a PE can receive up to 4 nonzero elements. We have four queues to buffer these nonzeros from different rows. Each cycle, an arbitrator selects a non-empty queue, pops an element, checks for a Read-after-Write (RaW) hazard (discussed later), and forwards it to the PE for processing.
TDQ2 (right side of Figure 8) is used when the sparse input (e.g., A) is ultra-sparse and stored in CSC format. Since in CSC the nonzeros are contiguous in a dense array, directly processing that array lets us avoid all the zeros. However, we incur the overhead of navigating to the correct PE, as the indices are no longer contiguous and are (essentially) stored in a separate index array. We use a multi-stage Omega network to route the nonzero data stream to the correct PE according to the row indices from the index array. Each router in the Omega network has a local buffer in case the buffer of the next stage is saturated. Our design attempts to balance the data forwarding rate and the processing capability of the PEs. This is achieved when nonzero elements are distributed evenly among rows. Compared with a global crossbar, the Omega network incurs much less area and hardware complexity, especially with a large number of PEs. Meanwhile, TDQ2 also accepts streaming data from a particular column of the dense matrix in DCM.
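Routing through an Omega network is purely destination-driven: after each perfect-shuffle stage, the 2×2 switch uses one bit of the destination PE id to pick its output port. The following functional sketch (of the standard destination-tag scheme, not this paper's RTL) shows a packet landing on the right PE after log2(N) stages:

```python
def omega_route(src, dst, n_pe):
    """Destination-tag routing through a log2(n_pe)-stage Omega network.
    Returns the sequence of positions the packet occupies after each stage.
    n_pe must be a power of two."""
    k = n_pe.bit_length() - 1                # number of stages
    pos, path = src, []
    for stage in range(k):
        # Perfect shuffle between stages: rotate position bits left by one
        pos = ((pos << 1) | (pos >> (k - 1))) & (n_pe - 1)
        # The switch sets the low bit from the current destination bit (MSB first)
        bit = (dst >> (k - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path

# Every (src, dst) pair reaches the correct PE in exactly log2(N) hops
N = 16
assert all(omega_route(s, d, N)[-1] == d for s in range(N) for d in range(N))
```

Each switch decision needs only one bit of the row index, which is why the network's control logic stays tiny compared with a full crossbar.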
PEs fetch partial results of the output matrix from ACC, perform the new multiplication task, add to the partial results, and save them back to ACC. Each PE is coupled with a bank of ACC that stores the rows it accounts for. A PE has two units: a multiply-accumulate unit (MAC) and an address-generation unit (AGU) for result address generation and forwarding. Since the output is roughly a dense matrix and stored in dense format, its rows are statically partitioned among the ACC buffers. Synchronization is only needed when an entire column of the resulting matrix has been completely calculated. Consequently, the imbalanced distribution of nonzeros across columns does not cause any performance problems.
An important issue here is the Read-after-Write (RaW) hazard. Since the computations are all floating-point, the pipelined MAC unit usually takes several cycles to process, but can still accept new tasks while processing. If a new task tries to accumulate to the same partial result (i.e., the same output row), it actually fetches a stale partial result from ACC, and a RaW hazard occurs. To avoid this hazard, we implement a stall buffer whose size equals the delay of the MAC units. We track the row indices currently being processed by the MAC and check, in the RaW-check unit (see Figure 8), whether the current element targets one of the same rows. If so, we buffer that job and delay it (for a few cycles) until the hazard is resolved. This is similar to the role of the scoreboard for register RaW hazards in processor design.
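The stall behavior can be modeled with a small cycle-level sketch. The MAC latency and the task streams below are hypothetical, chosen only to show that back-to-back updates to the same row stall while updates to distinct rows do not:

```python
from collections import deque

MAC_LATENCY = 4   # illustrative pipelined-MAC delay, in cycles

def run_pe(tasks):
    """Process (row, value) tasks on one PE. A task whose target row is
    still inside the MAC pipeline is stalled, mirroring the RaW-check unit."""
    in_flight = deque()            # (row, retire_cycle) of tasks in the MAC
    pending = deque(tasks)
    cycle = total_stalls = 0
    while pending or in_flight:
        while in_flight and in_flight[0][1] <= cycle:
            in_flight.popleft()    # retire finished MAC operations
        if pending:
            row, _ = pending[0]
            if any(r == row for r, _ in in_flight):
                total_stalls += 1  # RaW hazard: hold the task this cycle
            else:
                pending.popleft()
                in_flight.append((row, cycle + MAC_LATENCY))
        cycle += 1
    return cycle, total_stalls

# Repeated accumulations into row 0 force stalls; distinct rows flow freely
cycles_bad, stalls_bad = run_pe([(0, 1.0), (0, 2.0), (0, 3.0)])
cycles_ok, stalls_ok = run_pe([(0, 1.0), (1, 2.0), (2, 3.0)])
print(stalls_bad, stalls_ok)
```

This is also why power-law rows are doubly costly: a hot row not only concentrates work on one PE but serializes that PE's MAC pipeline.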
Overall, for each layer of a GCN, we first execute SpMM on XW. Since X is generally sparse and stored in dense format, we use TDQ1. The result XW is dense. We then compute A(XW), which again is SpMM. However, as A is ultra-sparse and stored in CSC format, we use TDQ2. The result is dense, but after ReLU a large fraction of the entries become zero, and we again have a sparse matrix as the input feature matrix for the next layer.
III-D Pipelining SpMM Chains
Intra-Layer SpMM Pipelining: One can exploit the parallelism between consecutive SpMMs (i.e., XW and A(XW)) within a layer. This is based on the observation that when a column of XW has finished computing, since A is constant and ready, we can start the multiplication of A with that column without waiting for the entire XW (Figure 9). This design has two major benefits: (i) we gain extra parallelism and reduce the overall delay through coarse-grained pipelining, and (ii) instead of needing large off-chip storage to cache the intermediate matrix, we only need to buffer a single column of XW; this can be done on-chip. This method can be reused within a GCN layer if the result is left-multiplied by other sparse matrices. For example, some GCNs collect information from 2-hop neighbors, so the layer formulation becomes A(A(XW)) and the three multiplications can be pipelined.
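The column-granularity handoff between the two SpMMs can be sketched with a generator: the first stage emits columns of XW as they become ready, and the second stage consumes each immediately, so only one intermediate column is ever buffered (a NumPy functional model with illustrative sizes, not a performance model):

```python
import numpy as np

def xw_columns(X, W):
    """First SpMM stage: yield columns of X @ W one at a time."""
    for j in range(W.shape[1]):
        yield X @ W[:, j]          # one column of the intermediate matrix

def pipelined_layer(A, X, W):
    """Second stage consumes each column as it arrives; at no point does the
    full X @ W matrix exist, mirroring the single-column on-chip buffer."""
    cols = [A @ c for c in xw_columns(X, W)]
    return np.maximum(np.stack(cols, axis=1), 0.0)   # ReLU at the layer output

rng = np.random.default_rng(0)
A = rng.random((6, 6)); X = rng.random((6, 4)); W = rng.random((4, 2))
assert np.allclose(pipelined_layer(A, X, W), np.maximum(A @ X @ W, 0))
```

In hardware the generator corresponds to the first SpMM engine's output port; correctness is unaffected because each output column of the layer depends on exactly one column of XW.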
Inter-Layer SpMM Pipelining: SpMMs from different layers can also be pipelined. To avoid bubbles and large intermediate buffers, we allocate hardware resources (i.e., numbers of PEs) in proportion to the workload of each SpMM stage (see Figure 10). In this way, the output generation rate of the previous SpMM matches the data consumption rate of the current SpMM. Pipelining SpMMs from different layers has three benefits: (i) we can exploit inter-layer parallelism; (ii) no off-chip memory access is required for the intermediate result matrix, since data are processed by the SpMM engines in a streaming manner; and (iii) if A is the same for all layers, it can be reused by the SpMM engines across the layers, avoiding extra off-chip accesses. This is done by forwarding elements of A through the layers. If the access rates to A differ across the layers, we forward elements of A with interleaving to match the ratio of the access rates.
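The rate-matching allocation reduces to a proportional split of the PE budget across stages. A minimal sketch (the per-stage MAC counts are hypothetical, standing in for figures like those in Table II):

```python
def allocate_pes(stage_workloads, total_pes):
    """Give each SpMM stage a PE count proportional to its MAC count, with at
    least one PE per stage; a sketch of the inter-layer rate matching."""
    total = sum(stage_workloads)
    alloc = [max(1, round(total_pes * w / total)) for w in stage_workloads]
    # Absorb rounding drift into the largest stage so the budget is exact
    alloc[alloc.index(max(alloc))] += total_pes - sum(alloc)
    return alloc

# e.g., two SpMM stages per layer across two layers (illustrative MAC counts)
print(allocate_pes([8.0e6, 2.0e6, 1.5e6, 0.5e6], 1024))
```

With this split, a stage with 4× the work gets 4× the PEs, so every stage drains its input at roughly the rate the previous stage produces it.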
Bandwidth Analysis: Off-chip access to the large adjacency matrix A can be a concern. However, as the access to A is always contiguous in our design, performance benefits greatly from the DRAM burst mode. If A fits in on-chip memory, we reuse it across layers. Otherwise, we cache part of A on-chip to mitigate the pressure on the off-chip bus. Based on our evaluation, the UWB-GCN accelerator requires around 227 Gbps of off-chip bandwidth to keep the hardware busy with 1024 PEs for the five datasets evaluated; this can generally be satisfied by modern computation platforms (e.g., the Xilinx VCU118 FPGA provides 384 Gbps of off-chip bandwidth, the VCU128 provides 3680 Gbps of HBM bandwidth, and the NVIDIA V100 provides 7176 Gbps of HBM bandwidth).
III-E The Workload Balance Problem
The baseline architecture works well when nonzeros are evenly distributed among the rows of the sparse matrix. When this assumption does not hold, however, the performance of the baseline architecture can degrade considerably due to workload imbalance among PEs and network congestion. Figures 11(A) and (B) illustrate two types of workload imbalance, local and remote, along with the histograms from mapping to eight PEs. Both types of imbalance lead to significant performance degradation: from the expected 2 cycles to 5 cycles and 7 cycles, respectively.
This imbalance issue is unique to GCNs and has not been faced or resolved by existing work such as sparse-CNN accelerators [18, 2, 45, 21], because the nonzeros in those sparse matrices are more or less evenly distributed. When dealing with huge and ultra-sparse matrices, such as the adjacency matrix of a social-network graph following a power-law distribution, the conditions are quite different. Efficiently handling this unique workload balance problem arising from the new GCN application domain is the major research problem of this work. Typically, when dealing with sparse data structures such as sparse matrices/tensors, trees, and graphs, the software approach to achieving workload balance is to profile or scan the structure (through, for example, symbolic analysis) in a preprocessing stage, and then use the sampled information to guide the partitioning strategy for the real processing. In this work, we show how to dynamically adjust the hardware configuration for continuous workload rebalancing. Our design can be applied to a variety of specialized accelerators for processing sparse data structures.
IV UWB-GCN Architecture
We treat the two types of imbalance problems (shown in Figure 11) separately. For local imbalance, we propose dynamic local sharing; for remote imbalance, we propose dynamic remote switching. Both are dynamic techniques that measure and adjust toward a better task distribution configuration each time a column of the dense input matrix is processed. After several columns, the configuration best matching the nonzero structure of the sparse input matrix is obtained. This configuration is then reused for the processing of the remaining columns of the dense matrix.
The difference between the two techniques is granularity. As described in Section I, Figure 2 shows the design flow. Initially, we employ the equal partitioning of the baseline design. Some PEs are over-utilized while others are under-utilized. The ultimate purpose of the design is to balance the colors (i.e., utilization) by adjusting or exchanging the workloads of PEs. During the processing of each column of the dense matrix (called a round), we employ local balancing by averaging out some of the overloaded work to neighbors, improving the situation. However, the offloaded work needs to be returned for aggregation after processing. The architecture is able to track the runtime utilization of the PEs with local balancing in the current round. Due to chip area and design complexity restrictions, we may exchange workload between direct neighbors, 2-hop neighbors, or even 3-hop neighbors, but not arbitrary PEs. The local strategy is not efficient enough when nonzeros are clustered in a region spanning several PEs.
To address the nonzero clustering issue, we propose remote PE switching, which (partially or completely) exchanges workloads between overloaded and underloaded PEs. At the end of each round, the utilization information tracked after local workload sharing is analyzed by dedicated hardware and used to optimize the remote switching strategy. By interchanging workloads between remote over-utilized and under-utilized PEs, followed by another round of local sharing, we can significantly improve load balancing. Note that local and remote rebalancing occur during the processing of every column of the dense matrix. As the same sparse matrix is reused for every column, the remote switching strategy generated in prior rounds remains valuable and can guide the processing of later rounds. Our accelerator remembers the remote switching plan at the end of each round and incrementally adjusts it when processing each subsequent column, based on the newly obtained utilization information from local balancing. After several rounds, the configuration best matching the sparse structure is obtained, and we use it for the remaining rounds with almost perfect workload balancing. In the following, we discuss how to realize this strategy in hardware.
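The two-step loop, remote switching between the hottest and coldest PEs followed by local sharing among neighbors, can be illustrated with a small behavioral model of per-PE workloads. The load numbers and the halve-the-gap heuristics below are illustrative stand-ins for the hardware's arithmetic, not its exact algorithm:

```python
def local_share(loads):
    """1-hop sharing: each PE evens out half its difference with the next
    neighbor (moves go in either direction)."""
    loads = loads[:]
    for i in range(len(loads) - 1):
        move = (loads[i] - loads[i + 1]) // 2
        loads[i] -= move
        loads[i + 1] += move
    return loads

def remote_switch(loads):
    """Swap half the gap between the most over- and under-utilized PEs."""
    loads = loads[:]
    hot, cold = loads.index(max(loads)), loads.index(min(loads))
    move = (loads[hot] - loads[cold]) // 2
    loads[hot] -= move
    loads[cold] += move
    return loads

loads = [90, 10, 10, 10, 10, 10, 10, 90]   # clustered nonzeros (illustrative)
for _ in range(8):                          # a few auto-tuning rounds
    loads = local_share(remote_switch(loads))
print(loads, max(loads) - min(loads))      # near the even split of 30 per PE
```

Remote switching alone cannot smooth a cluster's interior, and local sharing alone drains a cluster only one hop per round; interleaving the two is what makes the gap collapse in a handful of rounds.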
IV-A Dynamic Local Sharing
PE utilization differences must be estimated before the workload can be adjusted. Figure 12 illustrates 1-hop local sharing for TDQ1 and TDQ2.
TDQ1: Before a new task is pushed into a PE’s TQ, the PE compares the number of its pending tasks with those in the neighboring TQs. The task is then forwarded to the TQ with the fewest pending tasks. If forwarded to a neighbor, the result needs to be returned to the ACC buffer of the original PE for accumulation after the multiplication (see Figure 12(B)). The return address is calculated in the AGU unit of the PE.
TDQ2: For the Omega network, the final stage of the multi-stage network handles the forwarding. In Figure 12(C), two PEs share the same final-stage switch; we refer to this pair as a group. In Figure 13, a group has four PEs sharing the same final-stage switch, so we focus on the TQs of the final stage. After determining the pending-task status, the id of the destination PE is known. We adjust the address tag of the task before it is pushed to the TQs of the final stage. To enable PEs on the group edge (i.e., the leftmost or rightmost PEs) to communicate with their out-of-group neighbors, we add extra links in the final stage, as shown in Figure 12(D). Note that Figure 12(D) shows sharing only among 1-hop neighbors. By considering more distant neighbors, a more balanced design can be obtained at the cost of higher hardware complexity and area. This is discussed in the evaluation section.
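The forwarding decision in both TDQ variants is the same small comparison: among the home PE and its 1-hop neighbors, pick the task queue with the fewest pending tasks, keeping the task local on ties (a sketch of the decision logic; queue depths are illustrative):

```python
def forward_task(queue_depths, home_pe):
    """Choose the target TQ for a new task: the home PE or a 1-hop neighbor,
    whichever has the fewest pending tasks. Ties keep the task at home so no
    result needs to be routed back."""
    candidates = [home_pe]
    if home_pe > 0:
        candidates.append(home_pe - 1)
    if home_pe < len(queue_depths) - 1:
        candidates.append(home_pe + 1)
    # Sort key: pending depth first, then prefer the home PE on equal depth
    return min(candidates, key=lambda pe: (queue_depths[pe], pe != home_pe))

depths = [5, 0, 7, 3]
print(forward_task(depths, 2))   # PE 2 is loaded; its empty neighbor wins
```

When a task is forwarded, the AGU later steers the product back to the home PE's ACC bank, which is the only extra cost of the offload.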
IV-B Dynamic Remote Switching
For remote switching, a multi-round auto-tuning approach is again used. The idea is to find the most over-utilized PE and the most under-utilized PE in each round (i.e., for each column of the dense matrix), and switch part or all of their workloads. The fraction to be switched depends on their utilization gap. The design is shown in Figure 13.
The most over-utilized and under-utilized PEs are identified using the PE Status Monitor (PESM). Recall that each TQ has a counter to track the number of pending tasks; these trigger an empty signal upon reaching zero. The counters are all connected to a multi-stage MUX-tree. Once the jobs of the current round have been dispatched, monitoring begins. When an empty signal triggers, some PE has become idle. By voting at each level, the MUX-tree is able to identify the PE group with the highest number of empty signals, i.e., the cold spot. When all PEs have triggered their empty signals, the last one to finish is the hot spot.
Having identified the hot-spot and cold-spot PE tuple for the current round, to avoid thrashing we only exchange a portion of the workload between them. The switching ratio r for a tracked PE tuple is updated incrementally in the i-th round (i.e., for a column of the dense matrix):

r_{i+1} = r_i + G_i / (2 · W_0)    (5)

where G_i is the largest workload gap (i.e., the workload difference between the hot-spot and cold-spot PEs) in the i-th round, and W_0 is the initial per-PE workload under equal partitioning; the number of jobs (i.e., sparse-matrix rows) switched in round i is then r_i · W_0. In the current design, each tuple is tracked for two rounds using the PE Status Monitor in Figure 13. Each tracked PE tuple is updated every round according to Equation 5. The number of rounds tracked simultaneously can be customized; in Figure 13, two consecutive rounds are tracked.
The workload switching ratio for each tracked PE tuple is thus adjusted over two or more rounds and is highly likely to converge. The number of rounds we can track depends on the size of the tracking window in the PESM and is an area/performance tradeoff. Equation 5 is calculated in the Utilization Gap Tracker (Figure 13). To reduce the hardware cost of this calculation, we use a hardware-efficient approximation; details are omitted due to space limitations. Once the number of rows to be switched between remote PEs is known, the Shuffle Lookup Table (SLT) is used to determine which rows are to be interchanged. The IDs of these rows are forwarded to the Remote Balancing Control Register (RBCR). In the next round, the destination PE of these rows is updated in the Shuffle Switches (SS).
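Reading Equation 5 as an incremental switch-ratio update (an assumption consistent with the definitions of G_i and W_0; the paper omits the exact hardware approximation), the tracked ratio stops changing as soon as the observed gap vanishes, which is the convergence behavior described above:

```python
W0 = 128        # illustrative initial per-PE workload (rows)
raw_gap = 70    # illustrative true workload gap between the tracked PE pair

r = 0.0         # tracked switch ratio, starts at zero
for i in range(5):
    moved = int(r * W0)          # rows currently switched by the plan
    gap = raw_gap - 2 * moved    # moving rows shrinks the gap on both sides
    r += gap / (2 * W0)          # Eq. 5 update of the switch ratio
print(round(r * W0))             # settles at half the raw gap (35 rows)
```

In this idealized linear model one update closes the gap; with the hardware's coarse gap measurement and approximation, the same feedback needs a few rounds, matching the "two or more rounds" behavior noted above.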
Note that routing data from PEs to ACC buffers for remote switching does not require an extra interconnect. The data at the rows selected for switching do not need to be routed back to the original ACC. Instead, at the end of each iteration, the SpMM engines access the data cached in the ACCs of the previous SpMM engines (through a MUX set according to the remote-switch decision). The overhead is low: storage for a copy of the switch record and a comparator for the index check.
V Evaluation
We evaluate the baseline and UWBGCN designs and compare them against the same GCN networks running on other platforms.
V-A Evaluation Configuration
We implement the RTL of the baseline and UWBGCN designs in Verilog HDL and measure PE utilization, performance, energy efficiency, and hardware resource consumption on a Xilinx Virtex UltraScale+ VCU118 FPGA board. Note that the FPGA is used only as an evaluation platform to demonstrate the performance of UWBGCN; the design is a general architecture that does not leverage any FPGA-specific features or units.
To measure utilization, we add a counter to each PE to track the number of idle cycles. The number of operating cycles (i.e., execution delay) is measured with a cycle-accurate hardware counter that starts when the first data is forwarded to UWBGCN and stops when the last output feature is received. Hardware consumption and operating frequency are reported by Vivado Design Suite 2019.1 after synthesis and implementation. Board-level power consumption is measured with a power meter. For the cross-platform comparison, we implement the reference GCN networks in PyTorch and run them on a high-end server-class CPU (Intel Xeon E5-2698 v4) and an NVIDIA Tesla P100 GPU. For the SpMM computation, PyTorch uses the MKL v2018 and cuSPARSE v10.0 libraries. We also adapt two existing sparse-CNN (SCNN) designs, EIE [18] and Cambricon-S [47], to the GCN workload for comparison. All latency results reported in this paper are end-to-end; data access from the DDR/GDDR/HBM of the CPU/GPU/FPGA is included.
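The utilization numbers reported below follow directly from these counters. A minimal sketch of the computation (function names are ours):

```python
# Deriving the reported utilization figures from the per-PE idle-cycle
# counters and the cycle-accurate execution-delay counter.
def pe_utilization(idle_cycles, total_cycles):
    """Fraction of the measured window each PE spent doing useful work."""
    return [1.0 - idle / total_cycles for idle in idle_cycles]

def average_utilization(idle_cycles, total_cycles):
    """Overall utilization averaged across all PEs."""
    util = pe_utilization(idle_cycles, total_cycles)
    return sum(util) / len(util)
```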
The datasets used for evaluation are Cora, Citeseer, Pubmed, Nell and Reddit; these are the five most widely used publicly available datasets in GCN research.
V-B UWBGCN Evaluation
The efficiency of the design is evaluated by comparing the performance, hardware resource consumption, and PE utilization of the baseline design (Base) with four different design choices of UWBGCN: (i) 1-hop local sharing (Design (A)), (ii) 2-hop local sharing (Design (B)), (iii) 1-hop local sharing plus remote switching (Design (C)), and (iv) 2-hop local sharing plus remote switching (Design (D)). The only exception is Nell, for which we use 2-hop and 3-hop local sharing; the reasons are given below.
Figure 14 compares the overall GCN inference delay, including SpMM kernels and other operations (e.g., ReLU), and the average PE utilization of the five designs over the five datasets. The lines show the overall PE utilization; the bars show the breakdown of delay cycles by GCN layer. We also mark the latency lower bound assuming full PE utilization. For Cora, Citeseer, and Pubmed, 2-hop local sharing improves PE utilization from 53%, 71%, and 69% to 83%, 83%, and 93%, respectively, with corresponding performance improvements. Enabling remote switching further improves PE utilization to 90%, 91%, and 96%, respectively, bringing additional performance gains.
Our analysis shows that the remaining 4–10% utilization gap is due to PE underutilization during the autotuning phase. For Nell, as shown in Figure 4, the nonzeros are highly clustered; one or two PEs are extremely overutilized in the baseline design, leading to only 13% overall utilization. In this case, even 2-hop local sharing is insufficient to rebalance the workload. Therefore, for the Nell dataset only, we use 2-hop and 3-hop local sharing (rather than 1-hop and 2-hop) in our evaluation. The results show that 2-hop and 3-hop local sharing enhance PE utilization from 13% to 44% and 53%, respectively, resulting in 3.4× and 4.3× performance improvements. With remote switching enabled, utilization further increases to 69% and 81%, leading to 5.9× and 7.5× performance gains. For Reddit, local sharing raises utilization from 93% in the baseline to 99%. In UWBGCN, the hardware resources allocated to different layers are proportional to their volume of operations; thus, when perfect utilization is achieved, the same execution delay is observed for all layers. As shown in Figure 14, the green and red bars have similar lengths for Design (D), while their lengths vary significantly for the baseline.
Figure 15 further breaks down the operating-cycle counts for each SpMM kernel, demonstrating the benefits of UWBGCN on kernels with various sparsity and imbalance distributions. As shown in Figure 15, the execution times of the different SpMM kernels are almost the same, so SpMM engines are rarely idle while waiting for their neighbors. The shaded area of each bar represents the Sync cycles due to workload imbalance; the unshaded area represents the Ideal cycles assuming perfect workload balance. The bars in different colors represent the cycles of the four SpMM operations in the two-layer GCNs [22, 37]: XW of Layer 1, A(XW) of Layer 1, XW of Layer 2, and A(XW) of Layer 2. The curves show the corresponding PE utilizations.
Comparing the datasets: for Cora, Citeseer, and Pubmed, the imbalance occurs mainly in the SpMM operations of the input layer, and it is significantly mitigated by the rebalancing techniques of UWBGCN. For Nell, the imbalance occurs mainly in the SpMM of the hidden layer; this imbalance is also diminished by UWBGCN. Reddit is already well-balanced by itself. Comparing the SpMM operations, the utilization improves significantly for the more imbalanced kernels. For XW of Layer 2, although the feature matrix is sparse after being filtered by the ReLU activation function of Layer 1, its sparsity is much lower than that of the feature matrix in Layer 1; the utilization is thus also high for the baseline (except for Cora).
Figure 16 compares the overall hardware resource consumption of the five designs over the five datasets. The hardware resource cost is normalized to the number of Configurable Logic Blocks (CLBs) used in the design, the basic building block of the FPGA; in an ASIC design, it could instead be normalized to the number of transistors. The red area represents the CLB consumption of the TQs in the TDQ modules. The intuition is that a more unbalanced task distribution requires more TQ buffering slots to avoid backpressure; the rebalancing techniques of UWBGCN therefore reduce the area cost of the TQs, especially for Pubmed, Nell, and Reddit. The green area represents the other hardware modules, excluding the TQs. It remains almost unchanged across the five datasets, which means the area overhead of the rebalancing logic of UWBGCN is very small: only 2.1%, 3.9%, and 1.9% of the whole baseline-design area for the 1-hop local-sharing, 2-hop local-sharing, and remote-switching designs, respectively (2-hop and 3-hop for Nell). Combining the two parts, the UWBGCN design consumes fewer hardware resources than the baseline design, largely due to the dramatically reduced per-PE TQ size under more balanced workloads. For example, for Nell, the required TQ depth in the baseline design for A(XW) of Layer 1 is 65,128; in Design (D) it is just 2,675.
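A toy queueing model makes the TQ-depth argument concrete. In the sketch below (numbers and dispatch pattern are illustrative, not measurements from the paper), each PE retires one task per cycle, so a hotspot PE that receives tasks faster than it can drain them needs an ever-deeper queue, while balanced dispatch needs almost none.

```python
# Toy model: task-queue (TQ) depth under skewed vs. balanced dispatch.
# Each PE retires at most one task per cycle; dispatch is one batch of
# tasks per cycle. Illustrative only.
def max_tq_depth(batches, num_pes):
    """batches: per cycle, the list of PE ids receiving one task each."""
    depth = [0] * num_pes
    peak = 0
    for batch in batches:
        for pe in batch:
            depth[pe] += 1                       # enqueue new tasks
        depth = [max(d - 1, 0) for d in depth]   # one retire per PE/cycle
        peak = max(peak, max(depth))
    return peak

# Skewed: 4 tasks/cycle all to PE 0 vs. balanced round-robin over 4 PEs.
skewed   = [[0, 0, 0, 0] for _ in range(10)]
balanced = [[0, 1, 2, 3] for _ in range(10)]
```

With the skewed pattern, the hot PE's queue grows by three entries per cycle; under round-robin dispatch, the queues stay empty, which is the effect the rebalancing logic exploits to shrink the TQs.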
Finally, Figure 17 shows the utilization improvement due to iterative workload autotuning. Rebalancing is accomplished within 10 iterations, a small fraction of the total number of iterations; the vast majority of iterations therefore benefit from operating under the converged optimal strategy.
V-C Scalability of UWBGCN
We evaluate the scalability of UWBGCN by running GCN inference on the five datasets with the baseline as well as Designs (B) and (D) of UWBGCN, varying the number of PEs from 512 to 768 to 1024. In Figure 18, the bars represent the end-to-end inference execution cycles and the lines represent the PE utilizations; the dotted lines mark the full-utilization (100%) upper bound.
For the baseline design, PE utilization drops as the number of PEs increases: more PEs means fewer rows per PE, so each PE has fewer opportunities to absorb inter-row imbalance. In contrast, designs with both local sharing and remote switching show stable and high PE utilization. The PE utilization with only local sharing scales better than the baseline but worse than with both techniques combined. Overall, with the rebalancing techniques, the performance of UWBGCN scales almost linearly with the number of PEs, much better than the baseline design.
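The scaling behavior of the baseline can be reproduced with a small simulation. The sketch below is illustrative only: the heavy-tailed row-length distribution and its parameters are our assumptions, chosen to mimic a power-law graph, not data from the paper.

```python
# Toy model of baseline scaling: equal row partition over PEs with a
# heavy-tailed (Pareto-like) row-length distribution. Parameters are
# illustrative assumptions, not measurements.
import random

def baseline_utilization(num_rows, num_pes, seed=0):
    rng = random.Random(seed)
    # Heavy-tailed per-row nonzero counts mimicking a power-law graph.
    rows = [int(rng.paretovariate(1.5)) + 1 for _ in range(num_rows)]
    per_pe = [0] * num_pes
    for i, r in enumerate(rows):         # equal row partition, as in Base
        per_pe[i % num_pes] += r
    finish = max(per_pe)                 # delay is set by the hottest PE
    return sum(per_pe) / (num_pes * finish)

for p in (16, 64, 256):
    print(p, round(baseline_utilization(4096, p), 2))
```

Running this typically shows utilization falling as the PE count grows, because with fewer rows per PE a single heavy row dominates one PE's finish time.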
V-D Cross-platform Comparison
Table III presents the cross-platform evaluation of UWBGCN. We compare inference latency (milliseconds), energy efficiency (normalized to graph inferences per kJ), and operating frequency (MHz). The systems evaluated are UWBGCN (Design (D) with 1024 PEs); GCN implementations on an MKL v2018-based Intel Xeon E5-2698 v4 CPU and a cuSPARSE v10.0-based NVIDIA Tesla P100 GPU; and the baseline UWBGCN design (with 1024 PEs) without workload balancing.
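The energy-efficiency metric follows directly from measured latency and board power. A minimal sketch of the conversion (function name is ours):

```python
# Converting a measured latency and board power into the
# "graph inferences per kJ" energy-efficiency metric of Table III.
def inferences_per_kj(latency_ms, power_w):
    energy_per_inference_j = power_w * (latency_ms / 1000.0)  # J = W * s
    return 1000.0 / energy_per_inference_j    # inferences per kilojoule
```

For example, a hypothetical system taking one second per inference at 1 W would achieve 1000 inferences/kJ.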
We also evaluate a reproduced EIE reference implementation [18], tweaked and optimized for GCN processing. EIE stores the sparse weight matrices on-chip and distributively preloads them into the local buffers of the PEs; the original EIE design therefore cannot be applied directly to the GCN task here. To enable the comparison, we augment the original EIE design with the same off-chip access and task forwarding modules as UWBGCN. The EIE results in Table III represent the performance obtainable by applying existing SCNN designs to the GCN workload. The comparison demonstrates the effectiveness of workload rebalancing and the necessity of designing new GCN-specific accelerators.
We also attempted to reproduce Cambricon-S [47] for the GCN workload and compare against it. This is not feasible, however, because Cambricon-S adopts a 256-slot scheduling window for index matching: it assumes that, by fetching 256 elements per cycle, at least 16 elements can be successfully matched to keep the PEs busy. This may hold for CNN workloads, given their relatively small matrix sizes and low sparsity. For GCN workloads, however, guaranteeing that 16 elements can be forwarded to the PEs per cycle would require a window that processes a million elements per cycle, which is clearly infeasible in hardware. Instead, we keep the window size at 256 and measure the PE utilization of Cambricon-S on GCN workloads. The extremely high sparsity and large matrix dimensions lead to merely 0.0013% PE utilization, once again revealing the necessity of a GCN-specific accelerator like UWBGCN.
Table III shows that UWBGCN achieves its best speedup over EIE (7.5×) on Nell, the most unbalanced dataset. Real-world graphs mostly follow the power law; CNN designs like EIE are therefore not directly applicable (other CNN designs are discussed in Section VI). Despite running at a relatively low frequency, UWBGCN achieves significant average speedups over the high-end CPU, the GPU, the baseline design without workload balancing, and the reference EIE design across the five GCN graph datasets, along with correspondingly better energy efficiency.
In Table IV, we compare the kernel-level utilization of UWBGCN (Design (D)) with that of the cuSPARSE-based Tesla P100 GPU implementation on the five datasets. UWBGCN achieves much higher utilization than the GPU. The low GPU utilization is mainly due to: (i) for small datasets such as Cora and Citeseer, the total workload is too small to saturate all the available SMs; (ii) for large datasets, the utilization is better but still suffers from the workload imbalance that UWBGCN addresses.
Table III: Cross-platform comparison (latency in ms; energy efficiency in graph inferences/kJ).

Platform              Metric            Cora     Citeseer  Pubmed   Nell     Reddit
Intel Xeon            Freq              2.2-3.6 GHz
E5-2698 v4 (CPU)      Latency (ms)      3.90     4.33      34.15    1.61E3   1.08E4
                      Energy (inf./kJ)  1.90E3   1.71E3    216.9    4.61     0.69
NVIDIA Tesla          Freq              1328-1481 MHz
P100 (GPU)            Latency (ms)      1.78     2.09      7.71     130.65   2.43E3
                      Energy (inf./kJ)  1.87E3   1.59E3    432.3    25.51    1.37
EIE-like:             Freq              285 MHz
VCU118 FPGA           Latency (ms)      0.022    0.024     0.22     59.1     56.3
                      Energy (inf./kJ)  1.19E6   1.11E6    1.20E5   438.2    452.1
Baseline:             Freq              275 MHz
VCU118 FPGA           Latency (ms)      0.023    0.025     0.23     61.0     58.9
                      Energy (inf./kJ)  1.21E6   1.09E6    1.16E5   433.3    447.0
UWBGCN:               Freq              275 MHz
VCU118 FPGA           Latency (ms)      0.011    0.018     0.14     8.1      53.2
                      Energy (inf./kJ)  2.38E6   1.43E6    1.86E5   3.17E3   497.3
Table IV: Kernel-level PE utilization.

Platform  Kernel   Cora   Citeseer  Pubmed  Nell   Reddit
UWBGCN    XW1      0.93   0.90      0.98    0.72   0.99
          A(XW1)   0.87   0.88      0.93    0.92   0.99
          XW2      0.92   0.94      0.94    0.93   0.98
          A(XW2)   0.88   0.91      0.99    0.73   0.99
P100      XW1      0.10   0.11      0.18    0.19   0.18
          A(XW1)   0.03   0.03      0.18    0.39   0.50
          XW2      0.09   0.12      0.17    0.19   0.18
          A(XW2)   0.03   0.03      0.19    0.40   0.49
V-E Generalization of UWBGCN
The arithmetic primitives in GNNs can be abstracted into two categories, aggregation and embedding: the former multiplies the adjacency matrix by the feature matrix, and the latter multiplies the aggregated feature matrix by the weight matrix to shrink the feature-vector length of each node. Various dataflow combinations of these two primitives compose multiple GNN designs, such as GraphSage [17], GINConv [39], and the latest GTN [44]. Nevertheless, the kernel of both primitives is SpMM. Given that all these GNNs work on power-law real-world graphs and require SpMMs for embedding and aggregation, they can all benefit from the design practices of UWBGCN. We therefore argue that UWBGCN is not tied to a single algorithm but is a unified and versatile architecture for a variety of GNN algorithms.
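The two primitives above reduce to matrix products. A minimal dense sketch (the accelerator of course performs these as SpMMs, since A and the feature matrices are sparse in practice):

```python
# The two GNN primitives of this section as matrix products. Dense for
# brevity; in the accelerator both are executed as SpMMs.
import numpy as np

def embedding(x, w):
    return x @ w                  # shrink each node's feature length

def aggregation(a, xw):
    return a @ xw                 # combine neighbor features via A

# One GCN-style layer: A(XW) followed by ReLU.
def gcn_layer(a, x, w):
    return np.maximum(aggregation(a, embedding(x, w)), 0.0)
```

Swapping the order or repeating the primitives yields the dataflows of the other GNN variants mentioned above, while the underlying kernel stays the same.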
VI Related Work
Graph neural networks (GNNs) use neural-network-based approaches to address problems in the graph domain. The first GNN model was proposed by Gori et al. [15]; in it, target nodes collect information from their neighbor nodes through a recurrent neural structure until a converged state is reached. Over the past decade, researchers have continually optimized GNN algorithms and explored new neural-network approaches for tasks in the graph domain [10, 37, 30, 34].
More recently, influenced by the great success of convolutional neural networks (CNNs) in extracting local features from Euclidean data such as images and videos, graph convolutional networks (GCNs) were proposed to extract features from non-Euclidean data such as graphs by redefining the convolution operators for graph computation. In 2013, Bruna et al. [7] proposed a design of graph convolutional networks based on spectral graph theory; it was followed and further developed by a number of variants [19, 11, 22]. Meanwhile, other types of GNNs that are not based on spectral graph theory have also been explored, including spatial-based GCNs [13, 10], graph attention networks [1], and graph generative networks [43]. Among the different types of GNNs, spectral-based GCNs are at the center of research on neural-network-based graph approaches.
To the best of our knowledge, the present work is the first accelerator design focusing on GCNs. There have been many efforts on accelerating sparse CNNs [21, 45, 2, 18, 33, 24]; we briefly summarize these studies and explain why they fall short when applied to GCNs. Kung et al. condense the sparse parameter matrix through column grouping [24]; in case of conflict, only the most significant parameters are kept and the others are discarded, essentially trading off some accuracy for performance. Kim et al. [21] mitigate the workload imbalance problem of sparse CNNs, but rely on information from design-time profiling and pre-scanning. Han et al. [18] propose EIE, an SpMV accelerator that forwards nonzeros to PEs in column-major order; this is similar to our baseline design with TDQ1. However, they focus only on SpMV and do not address the workload imbalance among the rows of the sparse matrix. Zhang et al. [45] rely on different indexing methods to identify and select nonzeros, but these techniques do not function well when the matrix becomes ultra-sparse, as in GCNs. These studies do not touch on the workload imbalance issue partly because, compared with GCNs that process graphs, the impact of workload imbalance on sparse CNNs is much less significant. Chen et al. [8] propose Eyeriss; rather than skipping zeros, Eyeriss saves energy by power-gating computations involving zeros.
Beyond accelerating sparse CNNs, researchers have also proposed architectures for general SpMM. Zhuo and Prasanna [48] present an SpMV design for FPGAs using the CSR format, which can be applied to various sparse matrices; however, this design still suffers from irregular sparse structures and the workload imbalance problem. Pal [32] proposes an outer-product-based SpMM architecture. This work focuses on reducing redundant memory accesses to nonzeros and does not fundamentally address the ultra-workload-imbalance issue we face in GCNs. In their experimental results, load imbalance during the merge phase and uneven data-sharing patterns during the multiply phase degrade the speedup for datasets with highly unbalanced nonzero distributions.
Another active area of SpMM research is software optimization on general-purpose multicores and GPUs [16, 28, 3, 4, 5]. However, there are three reasons why solutions on general-purpose processors do not meet the strict latency requirements of GCNs. (1) Significant pre-scanning overhead: many recent approaches first scan the entire sparse matrix, collect its shape characteristics, and classify sparse rows into bins by their numbers of nonzero elements; different optimized kernels are then launched for different bins [16, 28, 3]. UWBGCN requires no pre-scanning or profiling. (2) Poor performance and low utilization under ultra-workload imbalance: GPU software implementations (PyTorch, TensorFlow) rely on the SpMM routines in cuSPARSE, but cuSPARSE itself performs poorly under such ultra-imbalanced conditions (as shown in Table IV). (3) Some sparse matrices in GCNs (feature matrices) are generated during processing, and others (adjacency matrices) evolve at runtime, making profiling and pre-scanning even less effective.
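The row-binning pre-scan used by the cited GPU approaches can be sketched as follows. The bin edges and names are illustrative assumptions; the point is that the whole matrix must be traversed once before any compute starts, overhead that UWBGCN avoids.

```python
# Sketch of the pre-scanning/binning step used by binning-based GPU SpMM
# approaches: rows are grouped by nonzero count so that a specialized
# kernel can be launched per bin. Bin edges are illustrative.
def bin_rows_by_nnz(row_nnz, bin_edges=(4, 32, 256)):
    bins = {edge: [] for edge in bin_edges}
    bins["large"] = []                   # overflow bin for heavy rows
    for row, nnz in enumerate(row_nnz):
        for edge in bin_edges:
            if nnz <= edge:
                bins[edge].append(row)   # smallest bin that fits the row
                break
        else:
            bins["large"].append(row)
    return bins
```

Besides the scan itself, this scheme must be redone whenever the matrix changes, which is exactly the situation with GCN feature matrices generated at runtime.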
VII Conclusion
In this paper, we propose an architecture called Ultra Workload Balanced GCN (UWBGCN) to accelerate graph convolutional network inference. To tackle the major performance issues arising from workload imbalance, we propose dynamic local workload sharing and remote workload switching techniques, which rely on hardware flexibility to realize performance autotuning with negligible area and delay overhead. This is the first accelerator design for GCNs that relies on hardware autotuning to achieve workload rebalancing for sparse matrix computations. We create RTL designs and run experiments on a Xilinx VCU118 FPGA with five widely used GCN graph datasets. The results show that UWBGCN achieves substantial average speedups over high-end CPUs and GPUs and over the baseline design without workload rebalancing.
Footnotes
 According to [26], stacking multiple GCN layers leads to oversmoothing and accuracy degradation, i.e., all nodes converge to the same value.
References
 S. AbuElHaija, B. Perozzi, R. AlRfou, and A. A. Alemi, “Watch your step: Learning node embeddings via graph attention,” in Advances in Neural Information Processing Systems, 2018, pp. 9180–9190.
 J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectualneuronfree deep neural network computing,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 1–13, 2016.
 A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, and P. Sadayappan, “Fast sparse matrixvector multiplication on gpus for graph applications,” in Proceedings of the international conference for high performance computing, networking, storage and analysis. IEEE Press, 2014, pp. 781–792.
 N. Bell and M. Garland, “Efficient sparse matrixvector multiplication on cuda,” Nvidia Technical Report NVR2008004, Nvidia Corporation, Tech. Rep., 2008.
 N. Bell and M. Garland, “Implementing sparse matrixvector multiplication on throughputoriented processors,” in Proceedings of the conference on high performance computing networking, storage and analysis. ACM, 2009, p. 18.
 R. v. d. Berg, T. N. Kipf, and M. Welling, “Graph convolutional matrix completion,” arXiv preprint arXiv:1706.02263, 2017.
 J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
 Y.H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energyefficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of SolidState Circuits, vol. 52, no. 1, pp. 127–138, 2016.
 C. W. Coley, W. Jin, L. Rogers, T. F. Jamison, T. S. Jaakkola, W. H. Green, R. Barzilay, and K. F. Jensen, “A graphconvolutional neural network model for the prediction of chemical reactivity,” Chemical science, vol. 10, no. 2, pp. 370–377, 2019.
 H. Dai, Z. Kozareva, B. Dai, A. Smola, and L. Song, “Learning steadystates of iterative algorithms over graphs,” in International Conference on Machine Learning, 2018, pp. 1114–1122.
 M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in neural information processing systems, 2016, pp. 3844–3852.
 C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan et al., “Circnn: accelerating and compressing deep neural networks using blockcirculant weight matrices,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017, pp. 395–408.
 H. Gao, Z. Wang, and S. Ji, “Largescale learnable graph convolutional networks,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1416–1424.
 J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the 34th International Conference on Machine LearningVolume 70. JMLR. org, 2017, pp. 1263–1272.
 M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., vol. 2. IEEE, 2005, pp. 729–734.
 J. L. Greathouse and M. Daga, “Efficient sparse matrixvector multiplication on gpus using the csr storage format,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’14, 2014, pp. 769–780.
 W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in neural information processing systems, 2017, pp. 1024–1034.
 S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 243–254.
 M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graphstructured data,” arXiv preprint arXiv:1506.05163, 2015.
 S. Hochreiter and J. Schmidhuber, “Long shortterm memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 D. Kim, J. Ahn, and S. Yoo, “A novel zero weight/activationaware hardware architecture of convolutional neural network,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. IEEE, 2017, pp. 1462–1467.
 T. N. Kipf and M. Welling, “Semisupervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 H. Kung, B. McDanel, and S. Q. Zhang, “Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization,” in Proceedings of the TwentyFourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2019, pp. 821–834.
 C.E. Lee, Y. S. Shao, J.F. Zhang, A. Parashar, J. Emer, S. W. Keckler, and Z. Zhang, “Stitchx: An accelerator architecture for exploiting unstructured sparsity in deep neural networks,” in SysML Conference, 2018.
 Q. Li, Z. Han, and X.M. Wu, “Deeper insights into graph convolutional networks for semisupervised learning,” in ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
 W. Liu and B. Vinter, “An efficient gpu general sparse matrixmatrix multiplication for irregular data,” in 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 2014, pp. 370–381.
 Y. Liu, N. Zhang, D. Wu, A. Botterud, R. Yao, and C. Kang, “Guiding cascading failure search with interpretable graph convolutional network,” arXiv preprint arXiv:2001.11553, 2020.
 A. Micheli, “Neural network for graphs: A contextual constructive approach,” IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 498–511, 2009.
 H.T. Nguyen, Q.D. Ngo, and V.H. Le, “Iot botnet detection approach based on psi graph and dgcnn classifier,” in 2018 IEEE International Conference on Information Communication and Signal Processing (ICICSP). IEEE, 2018, pp. 118–122.
 S. Pal, J. Beaumont, D.H. Park, A. Amarnath, S. Feng, C. Chakrabarti, H.S. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, “Outerspace: An outer product based sparse matrix multiplication accelerator,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 724–736.
 A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressedsparse convolutional neural networks,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 27–40.
 F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
 P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
 W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in neural information processing systems, 2016, pp. 2074–2082.
 Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” arXiv preprint arXiv:1901.00596, 2019.
 T. Xie and J. C. Grossman, “Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties,” Physical review letters, vol. 120, no. 14, p. 145301, 2018.
 K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
 C. Yang, A. Buluc, and J. D. Owens, “Graphblast: A highperformance linear algebrabased graph framework on the gpu,” arXiv preprint arXiv:1908.01407, 2019.
 H. Yang, “Aligraph: A comprehensive graph neural network platform,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2019, pp. 3165–3166.
 R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for webscale recommender systems,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 974–983.
 J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec, “Graphrnn: A deep generative model for graphs,” arXiv preprint arXiv:1802.08773, 2018.
 S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim, “Graph transformer networks,” in Advances in Neural Information Processing Systems, 2019, pp. 11 960–11 970.
 S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambriconx: An accelerator for sparse neural networks,” in The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 2016, p. 20.
 J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, “Graph neural networks: A review of methods and applications,” arXiv preprint arXiv:1812.08434, 2018.
 X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen, “Cambricons: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018, pp. 15–28.
 L. Zhuo and V. K. Prasanna, “Sparse matrixvector multiplication on fpgas,” in Proceedings of the 2005 ACM/SIGDA 13th international symposium on Fieldprogrammable gate arrays. ACM, 2005, pp. 63–74.
 M. Zitnik, M. Agrawal, and J. Leskovec, “Modeling polypharmacy side effects with graph convolutional networks,” Bioinformatics, vol. 34, no. 13, pp. i457–i466, 2018.