Accelerating PageRank using Partition-Centric Processing
PageRank is a fundamental link analysis algorithm and a key representative of the performance of other graph algorithms and Sparse Matrix Vector (SpMV) multiplication. PageRank computation generates fine-granularity random memory accesses, resulting in a large amount of wasteful DRAM traffic and poor bandwidth utilization. In this paper, we present a novel Partition-Centric Processing Methodology (PCPM) that drastically reduces the amount of DRAM communication while sustaining high memory bandwidth. PCPM’s partition-centric abstraction, coupled with the Gather-Apply-Scatter (GAS) programming model, performs partition-wise scatter and gather of updates. The state-of-the-art vertex-centric PageRank implementation also uses the GAS model but suffers from random DRAM accesses and redundant propagation of update values from nodes to their neighbors. In contrast, PCPM propagates a single update from a node to all of its neighbors in a partition, thus reducing redundancy. We further use this characteristic of PCPM to develop a novel bipartite Partition-Node Graph (PNG) data layout that enables streaming memory accesses with small pre-processing overhead.
We perform detailed analysis of PCPM and provide theoretical bounds on the amount of communication and random DRAM accesses. We experimentally evaluate our approach using 6 large graph datasets and demonstrate an average speedup in execution time and reduction in communication, compared to the state-of-the-art. We also show that unlike the Vertex-centric GAS implementation, PCPM is able to take advantage of intelligent node labeling that enhances locality in graphs, by reducing the main memory traffic even further. Although we use PageRank as the target application in this paper, our approach can be applied to generic SpMV computation.
High performance graph analytics has garnered substantial research interest in recent years. A large fraction of this research is focussed on shared memory platforms because of their low communication overhead compared to distributed systems and their high DRAM capacity, which enables a single server to fit large graphs in main memory [2, 3]. However, efficient utilization of computing resources is challenging even on a single node because of the low computation to communication ratio and irregular memory access patterns of graph algorithms. The growing disparity between CPU speed and external memory bandwidth, termed the memory wall, has become a key issue in the performance of graph analytics.
PageRank is a quintessential algorithm that exemplifies the performance challenges posed by graph computations. It is also a fundamental link analysis algorithm that iteratively performs Sparse Matrix Vector (SpMV) multiplication over the adjacency matrix of the target graph and the dense PageRank value vector. The irregularity in adjacency matrices leads to random accesses to the PageRank value vector, which comprise the majority of the main memory traffic. The resulting communication volume becomes a performance bottleneck for PageRank computation.
Recent works [6, 7] have proposed the use of the Binning with Vertex-centric Gather-Apply-Scatter (BVGAS) methodology to improve locality and reduce communication for PageRank. The GAS model splits computation into two phases: scatter source node values on edges, and gather propagated values on edges to compute destination node values. By binning the values on the basis of their destination nodes, BVGAS increases locality and reduces DRAM traffic. It also supports efficient parallel processing within each phase without the need for locks or atomics.
We observe that the BVGAS approach, though having several attractive features, also has some drawbacks that lead to inefficient memory accesses, both qualitatively and quantitatively. First, under BVGAS, a vertex repeatedly writes its value on all outgoing edges, resulting in a large number of reads and writes. Second, the vertex-centric nature of graph traversal in BVGAS results in random DRAM accesses during binning.
Our premise is that by changing the focus of computation from a single vertex to a cacheable group of vertices (partition), we can effectively identify and reduce redundant edge traversals as well as avoid random accesses to DRAM, while still retaining the benefits of BVGAS. Based on these insights, we develop a new partition-centric approach to compute PageRank. The major contributions of our work are:
A novel Partition-Centric Processing Methodology (PCPM) that propagates updates from nodes to partitions instead of neighboring nodes, thus reducing the redundancy associated with BVGAS.
A bipartite Partition-Node Graph (PNG) data layout that enables streaming all updates to one destination partition at a time. By imposing a partition-centric ordering on the traversal of edges during the scatter phase, we prevent random DRAM accesses, at very low pre-processing cost.
Using the GOrder reordering algorithm, we demonstrate that PCPM can also take advantage of intelligent node labeling to further reduce the communication volume, making it suitable even for high locality graphs.
We conduct extensive analytical and experimental evaluation on a shared memory system. On 6 large datasets, we observe that PCPM achieves speedup and reduction in communication over the state-of-the-art BVGAS method. Although we demonstrate the advantages of our approach on the PageRank algorithm in this paper, it is applicable to generic SpMV computation, as we show in section IV-D.
The rest of the paper is organized as follows: Section II provides motivation and related work; Section III discusses BVGAS based implementation; Section IV describes PCPM along with PNG data layout and bandwidth optimizations; Section V gives a detailed analytical evaluation of PCPM; Section VI reports the experimental results and Section VII concludes the paper.
II Background and Related Work
II-A PageRank Computation
In this section, we describe how PageRank is calculated and what makes it challenging to achieve high performance with conventional implementations. Table I lists a set of notations that we use to mathematically represent the algorithm.
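|$G(V,E)$||Input directed graph|
|$V$||vertex set of $G(V,E)$|
|$E$||edge set of $G(V,E)$|
|$A$||adjacency matrix of $G(V,E)$|
|$N_i(v)$||in-neighbors of vertex $v$|
|$N_o(v)$||out-neighbors of vertex $v$|
|$\overrightarrow{PR_i}$||PageRank value vector after iteration $i$|
|$PR_i(v)$||PageRank value of $v$ after iteration $i$|
|$\overrightarrow{SPR_i}$||scaled PageRank vector|
|$d$||damping factor in PageRank algorithm|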
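PageRank is computed iteratively and in each iteration, all vertex values are updated by the new weighted sum of their in-neighbors’ PageRank values as shown in equation 1, where $N_i(v)$ and $N_o(v)$ denote the in-neighbors and out-neighbors of vertex $v$, and $d$ is the damping factor.
$$PR_{i+1}(v) = \frac{1-d}{|V|} + d \sum_{u \in N_i(v)} \frac{PR_i(u)}{|N_o(u)|} \qquad (1)$$
To assist visualization of some techniques and optimizations, we also use the Linear Algebraic perspective where a PageRank iteration can be re-written as follows, with $A$ the adjacency matrix and $\overrightarrow{SPR_i}$ the scaled PageRank vector whose entries are $PR_i(v)/|N_o(v)|$:
$$\overrightarrow{PR_{i+1}} = \frac{1-d}{|V|} + d \cdot A^{T} \cdot \overrightarrow{SPR_i} \qquad (2)$$
The most computationally intensive term that dictates the performance of computing this equation is the Sparse Matrix-Vector (SpMV) multiplication $A^{T} \cdot \overrightarrow{SPR_i}$. Henceforth, we will focus on the SpMV term to improve the performance of the PageRank algorithm.
PageRank is typically computed in pull direction [3, 9, 10, 11], where each vertex pulls the values of its in-neighbors and accumulates them into its own value, as shown in algorithm 1. This corresponds to traversing $A^{T}$ in a row-major fashion and computing the dot product of each row vector with $\overrightarrow{SPR_i}$. Since the iterative SpMV computation uses only the scaled vector, we use a single array to store the scaled values during the computation and the PageRank values while outputting final results.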
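The pull-direction computation described above can be sketched as follows. This is a minimal single-threaded illustration, not the paper's algorithm 1: the function name `pagerank_pull`, the CSR arrays `in_offsets` and `in_neighbors` (rows of the transposed adjacency matrix), and the default damping factor are assumptions made for the example.

```python
def pagerank_pull(in_offsets, in_neighbors, out_degree, d=0.85, iterations=20):
    """Pull-direction PageRank over a CSR of in-neighbors (hypothetical helper)."""
    n = len(out_degree)
    pr = [1.0 / n] * n
    for _ in range(iterations):
        # Scaled values: every vertex divides its rank among its out-edges.
        spr = [pr[u] / out_degree[u] if out_degree[u] else 0.0 for u in range(n)]
        new_pr = [0.0] * n
        for v in range(n):
            acc = 0.0
            # Row v of the transposed adjacency matrix lists the in-neighbors of v.
            for e in range(in_offsets[v], in_offsets[v + 1]):
                acc += spr[in_neighbors[e]]  # potentially random access into spr
            new_pr[v] = (1.0 - d) / n + d * acc
        pr = new_pr
    return pr
```

Each vertex owns its output element, so the loop over `v` needs no partial sums or synchronization; the cost is the irregular read of `spr[in_neighbors[e]]`.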
In the pull direction implementation, each row completely owns the computation of its corresponding element in the output vector. This enables all rows of the matrix to be traversed asynchronously in parallel without the need to store partial sums in main memory. In contrast, in push direction, each node updates its out-neighbors by adding its own value to them. This requires a column-major traversal of the matrix, storage for partial sums (since each column contributes partially to multiple elements in the output vector), and synchronization to ensure conflict-free processing of multiple columns that update the same output element.
Challenges in PageRank Performance: Sparse matrix layouts like Compressed Sparse Row (CSR) store all non-zero elements of a row sequentially in memory, allowing fast row-major traversal of the matrix. However, the neighbors of a node (non-zero columns in the corresponding row of the matrix) can be scattered anywhere in the whole graph, and reading their values results in random accesses to the PageRank value vector. Similarly, the push direction implementation suffers from random accesses to the partial sums vector. These low locality and fine granularity access patterns incur a large number of cache misses and waste a large fraction of the data in cache lines.
II-B Related Work
The performance of PageRank is bounded by the total DRAM communication, which is largely determined by the locality in the memory access pattern of the graph. Since node adjacencies are stored sequentially in memory in order of their labels, node labelling has a significant impact on graph locality. Therefore, many prior works have investigated locality enhancement by the use of node reordering or partitioning [12, 13, 14, 15]. Recently, a relabelling algorithm, GOrder, was proposed that improves spatial locality and outperforms the well known partitioning and tree-based techniques. We further extended this line of research and proposed a Block Reordering algorithm that performs joint spatio-temporal locality optimization. Such sophisticated algorithms provide significant speedup but also introduce substantial pre-processing overhead, which limits their practicability. Other approaches propose the use of Hilbert curves to increase locality by reordering edges in the graph. However, this imposes edge-centric programming and makes it challenging to parallelize the application. In addition, scale-free graphs like social networks are less tractable by reordering transformations because of their skewed degree distributions.
Cache Blocking (CB) is another popular method used to induce locality by restricting random node accesses to an address range that can fit in cache [18, 19]. CB partitions the matrix along columns into multiple block matrices, each of which is stored as a CSR or edge list. However, SpMV computation with CB requires the partial sums to be re-read for each block, introducing extra work. In addition, the extremely sparse nature of these block matrices negatively impacts the locality of partial sum accesses.
Recently, [6, 7] proposed the use of the Vertex-centric Gather-Apply-Scatter (GAS) model for computing SpMV and PageRank. GAS is a popular model that has been widely implemented in many graph analytics frameworks [21, 22, 23]. It splits the graph analytic computation into scatter and gather phases. In the scatter phase, each vertex transmits updates on all of its outgoing edges, and in the gather phase, these updates are aggregated into the corresponding destination vertices. The updates for the PageRank algorithm correspond to scaled PageRank values and we will use these terms interchangeably in the paper. [6, 7] exploit the phased computation by binning the updates in a semi-sorted manner to induce spatio-temporal locality. Since our approach builds upon the BVGAS method, we discuss BVGAS in detail below.
III Binning with Vertex-centric GAS (BVGAS)
The Binning with Vertex-centric GAS (BVGAS) model divides the graph nodes into partitions and allocates bin space for each partition to store incoming updates. In the scatter phase, the updates (and the destination node IDs) are written at consecutive locations in the respective bins of the destination vertices, as shown in fig. 1. The number of partitions is kept small so that insertion points for all bins can fit in the cache, thus providing good spatial locality in the scatter phase.
At the same time, the size of a partition is selected in such a way that the range of vertices contained in it fits in the cache. The gather phase processes one bin at a time and thus enjoys good temporal locality, with accesses to partial sums of destination vertices being served by the cache. A similar technique, which partitions the graph into vertex sets such that each set fits in on-chip BRAM, has been employed in an FPGA accelerator for PageRank. However, that accelerator processes only one partition at a time, while [6, 7] can process distinct partitions in parallel.
It was further observed that since the graph is traversed in the same order in all iterations, destination IDs written in the first iteration remain unchanged and can be reused. This halves the amount of writes during the scatter phase. Collecting updates in cached buffers before writing them into the main memory has also been proposed. This ensures full cache line utilization and enables cache bypassing during buffer copy, to reduce scatter phase communication. BVGAS is the state-of-the-art PageRank implementation and outperforms Cache Blocking and the pull direction method for low locality graphs in terms of both DRAM traffic and execution time.
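To make the binning idea concrete, the sketch below shows one scatter and gather pass in the BVGAS style. It is a simplified single-threaded illustration under stated assumptions: the name `bvgas_iteration`, the out-edge CSR arrays `offsets` and `neighbors`, and the use of Python lists as bins are all hypothetical, and the cached buffers and non-temporal stores of the real implementations are omitted.

```python
def bvgas_iteration(offsets, neighbors, spr, num_partitions, part_size):
    """One BVGAS-style pass; returns per-node sums and the number of updates written."""
    n = len(spr)
    bins = [[] for _ in range(num_partitions)]
    # Scatter: every edge writes (destination ID, update) into the bin of the
    # partition that owns the destination vertex.
    for u in range(n):
        for e in range(offsets[u], offsets[u + 1]):
            v = neighbors[e]
            bins[v // part_size].append((v, spr[u]))
    # Gather: process one bin at a time, so the partial sums touched belong
    # to a single cacheable range of vertices.
    sums = [0.0] * n
    for one_bin in bins:
        for v, update in one_bin:
            sums[v] += update
    return sums, sum(len(b) for b in bins)
```

Note that the number of updates written equals the number of edges: a source value is repeated once per outgoing edge, even when several of those edges land in the same bin.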
However, the advantages of BVGAS come at the expense of redundant reads and writes of an update value on multiple edges. This redundancy manifests itself in the form of BVGAS’ inability to utilize high locality in graphs with optimized node labelling. Moreover, even though the scatter phase enjoys full cache line utilization, the selection of the destination bin to write into is data dependent, which results in random DRAM accesses and poor bandwidth utilization. Assembling updates in cached buffers introduces additional data dependent branches that worsen the sustained bandwidth of BVGAS.
IV Partition-Centric Processing Methodology
We propose a new Partition-Centric Processing Methodology (PCPM) that overcomes the drawbacks associated with BVGAS. It replaces the vertex-centric nature of BVGAS with a partition-centric approach by abstracting the graph as a set of connections between nodes and partitions. In PCPM, each thread traverses one graph partition at a time and propagates updates to neighboring partitions instead of neighboring nodes to reduce communication volume. Even within a partition, instead of scattering one source node at a time, PCPM scatters all updates to one destination partition consecutively to avoid random memory accesses. With static pre-allocation of separate bin spaces for each partition, PCPM can asynchronously scatter or gather multiple partitions in parallel. This section provides a detailed discussion on PCPM based computation and the required data layout.
IV-A Partition-centric Update Propagation
The unique abstraction of PCPM naturally leads to transmitting a single update from a source node to all the destination nodes in a partition. To implement such scattering, PCPM decouples the bin space allocated to store updates and destination node IDs into separate update bins and destination ID bins, respectively. Fig. 2 illustrates the difference between the PCPM and BVGAS scatter processes for the example graph shown in fig. 1(a). Note that the destination IDs are propagated only in the first iteration and just the updates are written in the subsequent iterations. Since the number of updates to be written is smaller than the total number of edges in the graph, PCPM drastically reduces the writes during the scatter phase.
PCPM manipulates the Most Significant Bit (MSB) of destination node IDs to indicate the range of nodes in a partition that use the same update value. In the destination ID bins, it consecutively writes the IDs of all nodes in the neighborhood of the same source vertex and sets the MSB of the first ID in this range to 1 for demarcation. This reduces the size of the vertex set of supported graphs from 4 billion to 2 billion for 4 Byte node IDs. However, to the best of our knowledge, this is still enough to support most of the large publicly available datasets.
The gather phase reads the update bins and the destination ID bins in a disjoint manner. It checks the MSB of each destination ID to determine whether to apply the previously read update or to read the next value. The MSB is then masked to generate the true ID of the destination node whose partial sum is updated. Algorithm 3 describes the scatter and gather functions for PageRank computation using PCPM.
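The MSB-based demarcation can be illustrated with the following sketch. It is not the paper's algorithm 3: the names `pcpm_scatter` and `pcpm_gather` are hypothetical, the code is single-threaded with one pair of bins per destination partition, and it assumes that the neighbors of each vertex are sorted by ID so that destinations in the same partition are adjacent.

```python
MSB = 1 << 31        # flag bit of a 4-byte node ID
ID_MASK = MSB - 1    # mask that recovers the true node ID

def pcpm_scatter(offsets, neighbors, spr, num_partitions, part_size):
    update_bins = [[] for _ in range(num_partitions)]
    dest_id_bins = [[] for _ in range(num_partitions)]
    for u in range(len(spr)):
        prev_part = -1
        for e in range(offsets[u], offsets[u + 1]):  # neighbors sorted by ID
            v = neighbors[e]
            q = v // part_size
            if q != prev_part:
                # First edge into partition q: write one update, flag the ID.
                update_bins[q].append(spr[u])
                dest_id_bins[q].append(v | MSB)
                prev_part = q
            else:
                # Remaining edges into q reuse the update already written.
                dest_id_bins[q].append(v)
    return update_bins, dest_id_bins

def pcpm_gather(update_bins, dest_id_bins, n):
    sums = [0.0] * n
    for updates, ids in zip(update_bins, dest_id_bins):
        pos = -1
        for raw in ids:
            if raw & MSB:  # flagged ID: advance to the next update value
                pos += 1
            sums[raw & ID_MASK] += updates[pos]
    return sums
```

On a graph with edges 0->1, 0->2, 0->3, 1->2, 1->3 and 2->0 split into two partitions of two nodes, this writes 4 updates for 6 edges. In a real implementation the destination ID bins would be written only in the first iteration.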
Although algorithm 3 reduces propagation of redundant update values, it still reads unused edges and inserts updates to random bins during the scatter phase. The manipulation of MSB in node identifiers introduces additional data dependent branches in both scatter and gather phases which hurts memory bandwidth utilization.
IV-B Bipartite Partition-Node Graph (PNG) Data Layout
In section IV-A, we used a column-wise partitioned CSR format to represent the adjacency matrix. In this subsection, we describe a new Bipartite Partition-Node Graph (PNG) data layout that rids PCPM of the redundant edge reads. PNG also brings out the true partition-centric nature of PCPM and reduces random memory accesses by grouping the edges on the basis of destination partition.
PNG exploits the fact that once the destination ID bins are written, the only required information in PCPM is the connectivity between nodes and partitions. Therefore, edges going from a source to all destinations in the same partition are compressed into a single edge whose new destination is the corresponding partition number. This gives rise to a bipartite graph (fig. 3) consisting of two disjoint vertex sets: the nodes and the partitions of the original graph. Such a bipartite division has the following effects:
Eff 1: the redundant unused edges are removed.
Eff 2: the range of destination IDs reduces from the number of nodes to the number of partitions.
The advantages of Eff 1 are obvious, but those of Eff 2 will only become clear when we discuss the storage format for PNG.
The compression step reduces memory traffic but the issue of random DRAM accesses still persists. To tackle this problem, we transpose the adjacency matrix of the bipartite PNG. The resulting CSC matrix stores edges sorted by destination partition, which enables streaming updates to one partition at a time. We construct PNG on a per-partition basis, i.e. every partition creates a separate bipartite graph, so that source nodes in the partition being processed are cached and can be randomly accessed efficiently.
Eff 2 is crucial for the transposition of bipartite graphs in all partitions. The number of offsets required to store an adjacency matrix in CSC format is equal to the range of destination node IDs. By reducing this range, Eff 2 reduces the storage requirement for the offsets of each bipartite graph from the order of the number of nodes to the order of the number of partitions.
Although PNG construction looks like a two-step approach, the compression and transposition can actually be merged into one step. We first scan the edges for each partition and individually compute the degree of all the destination partitions while discarding redundant edges. A prefix sum of these degrees is carried out to compute the offsets array for the CSC matrix. In the next scan, the edge array in CSC format is filled with source vertices, completing both compression and transposition. PNG construction can also be easily parallelized over all partitions, making pre-processing very cost effective.
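The two scans and the prefix sum can be sketched for a single source partition as follows. This is an illustrative reading of the construction rather than the authors' code: the name `build_png` and its arguments are hypothetical, the input is assumed to be an out-edge CSR with neighbors sorted by ID, and the source partition is given as the vertex range `[src_begin, src_end)`.

```python
def build_png(offsets, neighbors, src_begin, src_end, num_partitions, part_size):
    """Return (png_offsets, png_sources) in CSC form for one source partition."""
    # Scan 1: count compressed edges per destination partition.
    degree = [0] * num_partitions
    for u in range(src_begin, src_end):
        prev = -1
        for e in range(offsets[u], offsets[u + 1]):
            q = neighbors[e] // part_size
            if q != prev:  # further edges from u into q are redundant
                degree[q] += 1
                prev = q
    # Prefix sum of the degrees gives the CSC offsets array.
    png_offsets = [0] * (num_partitions + 1)
    for q in range(num_partitions):
        png_offsets[q + 1] = png_offsets[q] + degree[q]
    # Scan 2: fill the edge array with source vertices, grouped by destination.
    png_sources = [0] * png_offsets[-1]
    cursor = png_offsets[:-1]  # next free slot per destination partition (copy)
    for u in range(src_begin, src_end):
        prev = -1
        for e in range(offsets[u], offsets[u + 1]):
            q = neighbors[e] // part_size
            if q != prev:
                png_sources[cursor[q]] = u
                cursor[q] += 1
                prev = q
    return png_offsets, png_sources
```

The offsets array has one entry per destination partition plus one, independent of the number of nodes. Calls for different source partitions share no mutable state, which is what makes the construction easy to parallelize.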
Algorithm 4 shows the modified pseudocode for PCPM scatter phase using PNG layout. Note that unlike algorithm 3, the scatter function in algorithm 4 does not contain data dependent branches to check redundant edges, which also boosts the sustained memory bandwidth. Overall, using PNG provides dramatic performance gains for the scatter phase with very little pre-processing overhead.
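A scatter phase over the PNG layout might look like the sketch below. It is an assumption-laden illustration, not algorithm 4 itself: `pcpm_scatter_png` is a hypothetical name, each source partition's PNG is passed as a `(png_offsets, png_sources)` pair in CSC form, and a separate update list is kept for every (destination partition, source partition) pair to mimic statically pre-allocated bin space.

```python
def pcpm_scatter_png(png_per_partition, spr, num_partitions):
    """Stream updates one destination partition at a time using PNG (CSC) data."""
    # update_bins[q][p] holds updates sent from source partition p to partition q.
    update_bins = [[[] for _ in range(num_partitions)] for _ in range(num_partitions)]
    for p, (png_offsets, png_sources) in enumerate(png_per_partition):
        for q in range(num_partitions):
            out = update_bins[q][p]
            # All updates destined to partition q are produced consecutively;
            # no per-edge check for redundant edges is needed.
            for i in range(png_offsets[q], png_offsets[q + 1]):
                out.append(spr[png_sources[i]])
    return update_bins
```

The inner loop reads `spr` only for sources inside the partition being processed, which is the range assumed to be resident in cache, and writes sequentially into one bin at a time.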
IV-C Branch Avoidance
Data dependent branches have been shown to have a significant impact on the performance of graph algorithms, and PNG largely reduces such branches in the scatter phase of PCPM. In this subsection, we propose a branch avoidance mechanism for the PCPM gather phase. Branch avoidance enhances the sustained memory bandwidth but does not impact the amount of DRAM communication.
At this point, it is worth mentioning that the pop operations shown in algorithm 3 are implemented using pointers that increment after reading an entry from the respective bin. Let the destination ID pointer and the update pointer be the pointers into the destination ID bins and the update bins, respectively. Note that the destination ID pointer is always incremented, whereas the update pointer is incremented only if the MSB of the destination ID just read is set.
To implement the branch avoiding gather function, instead of using a condition check over the MSB, we add it directly to the update pointer. When the MSB is 0, the pointer is not incremented and the same update value is read in the next iteration; when the MSB is 1, the pointer is incremented, executing the pop operation on the update bins. The modified pseudocode for the gather phase is shown in algorithm 5.
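The pointer arithmetic behind branch avoidance can be sketched as below. This only illustrates the idea: `gather_branch_free` is a hypothetical name, 4-byte IDs with the flag in bit 31 are assumed, and Python itself does not expose hardware branch behavior, so the benefit would only materialize in a compiled implementation.

```python
MSB_SHIFT = 31
ID_MASK = (1 << MSB_SHIFT) - 1

def gather_branch_free(updates, dest_ids, sums):
    """Apply updates to partial sums without testing the MSB in a branch."""
    update_ptr = -1  # the first ID of a bin carries MSB = 1 and moves this to 0
    for raw in dest_ids:
        # The MSB (0 or 1) is added to the pointer instead of being checked.
        update_ptr += raw >> MSB_SHIFT
        sums[raw & ID_MASK] += updates[update_ptr]
    return sums
```

For example, with updates `[1.0, 2.0]` and destination IDs `[2|MSB, 3, 2|MSB, 3]`, nodes 2 and 3 each accumulate 3.0.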
IV-D Extension to Weighted Graphs and SpMV
PCPM can be easily extended for computation on weighted graphs by storing the edge weights along with the destination IDs in the destination ID bins. These weights can be read in the gather phase and applied to the source node (update) value before updating the destination node. PCPM can also be extended to generic SpMV with non-square matrices by partitioning the rows and columns separately. In this case, the outermost loops in the scatter phase (algorithm 4) and gather phase (algorithm 5) will iterate over column partitions and row partitions, respectively.
V Analytical Evaluation
In this section, we present analytical models to compare the performance of conventional Pull Direction PageRank (PDPR), BVGAS and PCPM based PageRank implementations. Our models provide valuable insights into the behavior of different methodologies with respect to varying graph structure and locality. Table II defines the parameters used in the analysis. We use a synthetic Kronecker graph of scale 25 (kron) as an example for illustration purposes.
V-A Communication Model
We analyze the amount of DRAM communication performed per iteration of PageRank. We assume that data from DRAM is accessed in quanta of one cache line and that BVGAS exhibits full cache line utilization. We also assume a deterministic layout for BVGAS and PCPM, so that destination indices are written only in the first iteration and hence are not accounted for in our model.
PDPR: The pull technique scans all edges in the graph once (algorithm 1). For a CSR format, this requires reading edge offsets and source node indices. Reading source node values incurs cache misses generating Bytes of DRAM traffic. For each vertex, the updated values are outputted using Bytes of writes to DRAM. Summing all individual contributions, we get the total memory traffic for PDPR as:
BVGAS: The scatter phase (algorithm 2) scans the graph and writes updates for outgoing edges of every source node, thus communicating Bytes. The gather phase loads updates and destination node IDs on all the edges and accumulates them into the new rank values generating Bytes of read traffic. After a bin is processed, it writes Bytes of new rank values into the DRAM. We ignore the apply phase communication since it can be merged with gather phase. Total DRAM communication for BVGAS is therefore, given by:
PCPM with PNG: In the scatter phase (algorithm 4), a scan of PNG reads Bytes and update insertion generates Bytes of DRAM traffic. The gather phase(algorithm 5) reads destination IDs and updates followed by new PageRank value writes. Net amount of data that PCPM communicates is:
Comparison: Performance of pull technique depends heavily on . In the worst case, all accesses are cache misses i.e. and in best case, only cold misses are encountered to load the PageRank values in cache i.e. . Assuming , we get . On the other hand, communication for BVGAS stays constant given by equation 4. BVGAS is inherently suboptimal since it performs additional loads and stores and can never reach the lower bound of . Comparatively, achieves optimality when for every vertex, its outgoing edges can be compressed into a single edge i.e. . In the worst case when , PCPM is still as good as BVGAS and we get . reaches same lower bound as and thus, gets rid of the inherent suboptimality of BVGAS.
In comparison, PCPM offers an easier constraint on (by a factor of ) becoming advantageous when:
The RHS in eq. 6 is constant, indicating that BVGAS is advantageous for low locality graphs. With optimized node ordering, we can reduce the cache miss ratio of PDPR and outperform BVGAS. On the contrary, the compression ratio in the RHS of eq. 7 is a function of locality. With an intelligent node labeling, the compression ratio also increases, which helps PCPM to remain advantageous even in high locality graphs. Fig. 4 shows the effect of the compression ratio on predicted DRAM communication for the kron graph. Obtaining an optimal node labeling that maximizes the compression ratio might be very difficult or even impossible for some graphs. However, as can be observed from fig. 4, DRAM traffic decreases rapidly at first and then converges slowly as the compression ratio grows. Therefore, a node reordering that achieves a moderate compression ratio is good enough to optimize communication for kron.
V-B Random Memory Accesses
We define a random access as a non-sequential jump in the address of the memory location being read from or written to DRAM. Random accesses can incur latency penalties and negatively impact the sustained memory bandwidth. In this subsection, we model the number of random accesses performed by different methodologies in a PageRank iteration.
PDPR: Reading edge offsets and source node IDs in pull technique is completely sequential because of CSR format. However, all accesses to source node rank values that are served by DRAM can potentially be random resulting in:
BVGAS: In scatter phase of algorithm 2, updates are inserted into bins chosen on the basis of destination node index which can be arbitrary. Therefore, BVGAS generates a total of random writes. Assuming full cache line utilization for BVGAS, for every Bytes written, there is at most random memory access. The gather phase sequentially reads the bins generating only random access per bin. Hence, total random DRAM accesses for BVGAS are given by:
PCPM: For the PNG layout, since updates destined to same partition are produced consecutively, scattering updates from a partition generates random accesses. Accesses to nodes within cached partition are not served by main memory and hence, do not add to random DRAM accesses. Scanning bipartite graphs in CSC format generates random access per partition. Since there are such partitions, total number of random accesses in PCPM is bound by:
Comparison: BVGAS exhibits less random accesses than PDPR for low locality graphs. However, is much smaller than both and irrespective of graph locality, for example, in the kron dataset with Bytes, Bytes and , M whereas M.
It is worth mentioning that the number of data dependent unpredictable branches in BVGAS is also , since for every update written, the scatter function has to check if the corresponding cached buffer is full. In contrast, the number of branch mispredictions for PCPM (using branch avoidance) is with misprediction for every destination partition () switch in algorithm 4. This enables PCPM to utilize available bandwidth much more efficiently than BVGAS. The derivations are similar to random access model and for the sake of brevity, we will not provide a detailed deduction.
VI Experimental Evaluation
VI-A Experimental Setup and Datasets
We conducted our experiments on a dual-socket Sandy Bridge server equipped with 8-core Intel Xeon E5-2650 v2 processors running Ubuntu 14.04 OS. Table III lists important characteristics of our machine. Memory bandwidth is measured using the STREAM benchmark. All codes are written in C++ and compiled using G++ 4.7.1 with the highest optimization flag, -O3. The memory statistics are collected using Intel Performance Counter Monitor. All data types used for indices and PageRank values are Bytes.
|Socket||no. of cores||8|
|shared L3 cache||25MB|
|Core||L1d cache||32 KB|
|L2 cache||256 KB|
|Read BW||59.6 GB/s|
|Write BW||32.9 GB/s|
We use 6 large real world and synthetic graph datasets coming from different applications for performance evaluation. Table IV summarizes the size and sparsity characteristics of these graphs. Gplus and twitter are follower graphs on social networks; pld, web and sd1 are hyperlink graphs obtained by web crawlers; and kron is a scale 25 graph generated using Graph500 Kronecker generator. The web is a very sparse graph but has high locality and the kron has higher edge density as compared to other datasets.
VI-B Implementation Details
We use a simple hand-coded implementation of algorithm 1 for the pull direction baseline and parallelize it over vertices using the OpenMP pragma parallel for. Our baseline does not incur the overheads associated with similar implementations in frameworks [3, 11, 10] and hence is faster than framework based programs.
To parallelize the BVGAS scatter phase (algorithm 2), we give each thread its own set of bins for different partitions and a statically load balanced fixed range of nodes to scatter. We use the Intel AVX non-temporal store instructions to bypass the cache while writing updates and use cache line aligned buffers to accumulate the updates for streaming stores. The gather phase is parallelized over partitions, with load balancing done using OpenMP dynamic scheduling. The optimal partition size is empirically determined as KB (K nodes) and buffer size is set to cache lines for each destination partition.
The PCPM scatter and gather functions are parallelized over partitions and load balancing in both the phases is done dynamically using OpenMP. Partition size is empirically determined and set to KB. A detailed design space exploration of PCPM is discussed in section VI-C2.
Bin spaces for each thread are pre-allocated for BVGAS and PCPM, to avoid atomicity issues. All the programs mentioned in this section execute iterations of PageRank on cores. For accuracy of the information, we repeat these algorithms times and report the average values.
VI-C1 Comparison with Baselines
In this section, we compare PCPM against the Pull Direction PageRank (PDPR) and BVGAS baselines using various criteria.
Execution Time: Fig. 5 gives a comparison of the GTEPS (computed as the ratio of giga edges in the graph to the runtime of single PageRank iteration) achieved by different implementations. We observe that PCPM is faster than the state-of-the-art BVGAS implementation and faster than PDPR (fig. 5). BVGAS achieves constant throughput irrespective of the graph structure and is able to accelerate computation on low locality graphs. However, it is worse than the pull direction baseline for high locality (web) and dense (kron) graphs. PCPM is able to outperform PDPR and BVGAS on all datasets, though the speedup on web graph is minute because of high performance of PDPR.
Detailed results for execution time of BVGAS and PCPM during different phases of computation are given in table V. The PCPM scatter phase benefits from a multitude of optimizations and the PNG data layout to achieve a dramatic speedup over BVGAS scatter phase.
Communication and Bandwidth: Fig. 6 shows the amount of data communicated with main memory normalized by the number of edges in the graph. Average communication in PCPM is and less than BVGAS and PDPR, respectively. Further, PCPM memory traffic per edge for web and kron is lower than other graphs because of their high compression ratio (table VI). The normalized communication for BVGAS is almost constant and therefore, its utility depends on the efficiency of pull direction baseline.
Note that the speedup obtained by PCPM is larger than the reduction in communication volume. This is because by avoiding random DRAM accesses and unpredictable branches, PCPM is able to efficiently utilize the available DRAM bandwidth. As shown in fig. 7, PCPM can sustain an average GBps bandwidth compared to GBps and GBps of PDPR and BVGAS, respectively. For large graphs like sd1, PCPM achieves of the peak read bandwidth (table III) of our system. Although both PDPR and BVGAS suffer from random memory accesses, the former executes very few instructions and therefore, has better bandwidth utilization.
The reduced communication and streaming access patterns in PCPM also enhance its energy efficiency, resulting in lower energy consumption per edge as compared to BVGAS and PDPR, as shown in fig. 8. Energy efficiency is particularly important from an eco-friendly computing perspective, as highlighted by the Green Graph500 benchmark.
Effect of Locality: To assess the impact of locality on different methodologies, we relabel the nodes in our graph datasets using the GOrder algorithm. We refer to the original node labeling in a graph as Orig and the GOrder labeling as simply GOrder. GOrder increases spatial locality by placing nodes with common in-neighbors closer together in memory. As a result, outgoing edges of the nodes tend to be concentrated in a few partitions, which significantly enhances the compression ratio as shown in table VI. However, the web graph exhibits near optimal compression () with Orig and does not show any improvement with GOrder.
Table VII shows the impact of GOrder on DRAM communication. As expected, BVGAS communicates a constant amount of data for a given graph irrespective of the labeling scheme used. On the contrary, memory traffic generated by PDPR and PCPM decreases because of reduced and increased , respectively. These observations are in complete accordance with the performance models discussed in section V-A. The effect on PCPM is not as drastic as on PDPR because when becomes greater than a threshold, PCPM communication decreases slowly as shown in fig. 4. Nevertheless, for almost all of the datasets, the net data transferred in PCPM is remarkably smaller than in both PDPR and BVGAS for either vertex labeling.
VI-C2 PCPM Design Space Exploration
Fig. 9 shows the individual effects of various optimizations in PCPM on the execution time of PageRank, normalized against the BVGAS baseline. The Partition-centric update propagation (section IV-A) reduces redundant update transfers to provide a significant speedup averaging (excluding the web graph). The web graph has a very high compression ratio which results in acceleration just because of this abstraction. This shows that redundancy reduction is extremely crucial for high locality graphs.
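The redundancy reduction described above can be illustrated with a minimal Python sketch. It assumes that partitions are contiguous ranges of node indices and that the compression ratio is the number of graph edges divided by the number of distinct (source node, destination partition) pairs; the function name `compression_ratio` and the toy edge list are hypothetical.

```python
def compression_ratio(edges, partition_nodes):
    """Ratio of graph edges to distinct (source, destination-partition) pairs.

    edges: iterable of distinct (src, dst) node-index pairs.
    partition_nodes: number of consecutive node indices per partition.
    """
    edges = list(edges)
    if not edges:
        raise ValueError("edge list must not be empty")
    # With partition-centric propagation, a source sends one update per
    # destination partition rather than one update per neighbor.
    partition_edges = {(src, dst // partition_nodes) for src, dst in edges}
    return len(edges) / len(partition_edges)


# Toy graph: node 0 has three neighbors in partition 1; node 1 has one
# neighbor in partition 1 and one in partition 2 (4 nodes per partition).
toy_edges = [(0, 4), (0, 5), (0, 6), (1, 4), (1, 9)]
toy_ratio = compression_ratio(toy_edges, partition_nodes=4)  # 5 edges / 3 pairs
```

A ratio close to 1 means that almost every neighbor lies in a different partition, so little redundancy can be removed; higher ratios indicate that many neighbors of a node share a partition.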
A major contributor to the performance is the PNG data layout (section IV-B) which gives an average of speedup. The branch avoidance mechanism (section IV-C) improves sustained memory bandwidth in the gather phase to provide another improvement in execution time. Overall, computing PageRank using PCPM is faster than the state-of-the-art BVGAS method.
Partition size represents an important tradeoff in PCPM. Large partitions force neighbors of each node to fit in fewer partitions, resulting in better compression but poor locality. Small partitions, on the other hand, ensure high locality for random accesses within partitions but reduce compression. We evaluate the impact on the performance of PCPM by varying the partition size from KB (K nodes) to MB (M nodes). We observe a reduction in DRAM communication volume with increasing partition size (fig. 10). However, if the partition size grows beyond a threshold, the cache is unable to fit all the nodes of a partition, resulting in cache misses and a drastic increase in main memory traffic.
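The cache-fit side of this tradeoff can be sketched in a few lines of Python. The sketch picks the largest candidate partition whose node values fit in a given cache budget; the 4-byte value size, the 256 KiB budget, and the function name `largest_fitting_partition` are illustrative assumptions rather than the configuration evaluated here, and the result is only a starting point for the empirical sweep described above.

```python
def largest_fitting_partition(candidate_nodes, cache_bytes, bytes_per_node=4):
    """Largest partition size (in nodes) whose values fit in the cache budget."""
    fitting = [n for n in candidate_nodes if n * bytes_per_node <= cache_bytes]
    if not fitting:
        raise ValueError("no candidate partition fits in the cache budget")
    # Larger partitions improve compression, so prefer the largest that fits.
    return max(fitting)


# Hypothetical sweep: power-of-two partition sizes from 8K to 1M nodes,
# checked against an assumed 256 KiB cache budget.
candidates = [2 ** k for k in range(13, 21)]
best_nodes = largest_fitting_partition(candidates, cache_bytes=256 * 1024)
```

This captures only the capacity constraint; the actual best size also depends on how much compression each candidate achieves on the graph at hand.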
The execution time (fig. 11) also benefits from communication reduction and is penalized by cache misses for large partitions. Note that for partition size KB and MB, communication volume decreases but execution time increases. This is because in this range many requests are served from the larger shared L3 cache, which is slower than the private L1 and L2 caches. This phenomenon slows down the computation but does not add to DRAM traffic.
VI-C3 Pre-processing Time
We assume that the adjacency matrix is available in CSR format and hence, PDPR does not need any pre-processing. Both BVGAS and PCPM, however, require the bin space and offsets for each partition to be computed beforehand. PCPM also constructs the PNG layout, incurring additional pre-processing time as shown in table VIII. Fortunately, the offset computation can be easily merged with PNG construction (section IV-B) to reduce the overhead. The pre-processing time also gets amortized over multiple iterations of PageRank, providing substantial overall time savings.
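The amortization argument can be made concrete with a short Python sketch. It assumes a baseline with no pre-processing and a fixed per-iteration time for both methods; the function name `break_even_iterations` and the timings are hypothetical and are not values from table VIII.

```python
import math


def break_even_iterations(preprocess_s, baseline_iter_s, optimized_iter_s):
    """Iterations after which the optimized method is no slower overall.

    Returns None if the optimized method saves no time per iteration.
    """
    saving_per_iteration = baseline_iter_s - optimized_iter_s
    if saving_per_iteration <= 0:
        return None
    # Solve preprocess_s + k * optimized_iter_s <= k * baseline_iter_s for k.
    return math.ceil(preprocess_s / saving_per_iteration)


# Hypothetical timings: 4 s of pre-processing, 1 s versus 0.5 s per iteration.
iterations_needed = break_even_iterations(4.0, 1.0, 0.5)
```

Every iteration beyond the break-even point contributes a net saving, so the benefit grows with the number of PageRank iterations executed.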
In this paper, we presented a novel Partition-Centric Processing Methodology (PCPM) that perceives a graph as a set of links between nodes and partitions instead of nodes and their individual neighbors. This abstraction reduces redundant data transfers and enables PCPM to take advantage of locality-optimized node ordering, unlike the state-of-the-art GAS-based method, which is oblivious to the labeling scheme used. The redundancy reduction feature can be generalized to distributed and even frontier-based graph algorithms like BFS. In future work, we will explore these generalizations as well as algorithms to optimize node labeling for PCPM.
We developed a novel PNG data layout for PCPM. The idea for a new layout originated when we were trying to relax the vertex-centric programming constraint that all outgoing edges of a vertex should be traversed consecutively. The additional freedom arising from treating edges individually allowed us to group edges in a way that avoids random memory accesses to sustain of peak memory bandwidth. We merge several pre-processing steps to reduce the cost of computing PNG and show that its benefits far exceed the pre-processing overhead. The streaming memory access patterns of PNG-enabled PCPM make it highly suitable for accelerators and High Bandwidth Memory (HBM) based architectures.
We conducted extensive analytical and experimental evaluation of PCPM and observed speedup and reduction in DRAM communication volume over the state-of-the-art. We prove that irrespective of the graph locality and density, PCPM remains the preferred choice of PageRank implementation as it communicates the least amount of data at higher bandwidth compared to other methodologies. Although we demonstrate the advantages of PCPM on the PageRank algorithm, we show that it can be easily extended to weighted graphs and generic SpMV computation.
-  R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang, “Introducing the graph 500,” Cray Users Group (CUG), 2010.
-  F. McSherry, M. Isard, and D. G. Murray, “Scalability! but at what COST?” in HotOS XV. USENIX Association, 2015.
-  J. Shun and G. E. Blelloch, “Ligra: a lightweight graph processing framework for shared memory,” in ACM Sigplan Notices, 2013.
-  W. A. Wulf and S. A. McKee, “Hitting the memory wall: implications of the obvious,” ACM SIGARCH computer architecture news, 1995.
-  L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.” Stanford InfoLab, Tech. Rep., 1999.
-  D. Buono, F. Petrini, F. Checconi, X. Liu, X. Que, C. Long, and T.-C. Tuan, “Optimizing sparse matrix-vector multiplication for large-scale data analytics,” in Proceedings of the 2016 ICS. ACM.
-  S. Beamer, K. Asanović, and D. Patterson, “Reducing pagerank communication via propagation blocking,” in IPDPS, 2017. IEEE.
-  H. Wei, J. X. Yu, C. Lu, and X. Lin, “Speedup graph processing by graph ordering,” in Proceedings of the 2016 International Conference on Management of Data. ACM.
-  Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens, “Gunrock: A high-performance graph processing library on the gpu,” in Proceedings of the 21st ACM PPoPP, 2016.
-  N. Sundaram, N. Satish, M. M. A. Patwary, S. R. Dulloor, M. J. Anderson, S. G. Vadlamudi, D. Das, and P. Dubey, “Graphmat: High performance graph analytics made productive,” Proceedings of the VLDB Endowment, 2015.
-  D. Nguyen, A. Lenharth, and K. Pingali, “A lightweight infrastructure for graph analytics,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, 2013.
-  P. Boldi, M. Santini, and S. Vigna, “Permuting web and social graphs,” Internet Mathematics, 2009.
-  W.-H. Liu and A. H. Sherman, “Comparative analysis of the cuthill–mckee and the reverse cuthill–mckee ordering algorithms for sparse matrices,” SIAM Journal on Numerical Analysis, 1976.
-  P. Boldi, M. Rosa, M. Santini, and S. Vigna, “Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks,” in Proceedings of the 20th WWW. ACM, 2011.
-  A. Abou-Rjeili and G. Karypis, “Multilevel algorithms for partitioning power-law graphs,” in IPDPS, 2006. IEEE.
-  K. Lakhotia, S. Singapura, R. Kannan, and V. Prasanna, “Recall: Reordered cache aware locality based graph processing,” to appear in International Conference on High Performance Computing. IEEE, 2017.
-  A.-J. N. Yzelman and R. H. Bisseling, “A cache-oblivious sparse matrix–vector multiplication scheme based on the hilbert curve,” Progress in Industrial Mathematics at ECMI 2010, 2012.
-  S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of sparse matrix–vector multiplication on emerging multicore platforms,” Parallel Computing, 2009.
-  M. Penner and V. K. Prasanna, “Cache-friendly implementations of transitive closure,” Journal of Experimental Algorithmics (JEA), 2007.
-  R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick, “When cache blocking of sparse matrix vector multiply works and why,” Applicable Algebra in Engineering, Communication and Computing, 2007.
-  A. Roy, I. Mihailovic, and W. Zwaenepoel, “X-stream: Edge-centric graph processing using streaming partitions,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, 2013.
-  G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of the 2010 ACM International Conference on Management of data.
-  M. Han and K. Daudjee, “Giraph unchained: Barrierless asynchronous parallel execution in pregel-like graph processing systems,” Proceedings of the VLDB Endowment, 2015.
-  S. Zhou, C. Chelmis, and V. K. Prasanna, “Optimizing memory performance for fpga implementation of pagerank,” in ReConFigurable Computing and FPGAs (ReConFig), 2015 International Conference on. IEEE, 2015, pp. 1–6.
-  O. Green, M. Dukhan, and R. Vuduc, “Branch-avoiding graph algorithms,” in Proceedings of the 27th ACM SPAA, 2015.
-  J. D. McCalpin, “Stream benchmark,” Link: www.cs.virginia.edu/stream/ref.html, 1995.
-  T. Willhalm, R. Dementiev, and P. Fay, “Intel performance counter monitor-a better way to measure cpu utilization. 2012,” URL: http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpuutilization, 2016.
-  N. Z. Gong, W. Xu, L. Huang, P. Mittal, E. Stefanov, V. Sekar, and D. Song, “Evolution of social-attribute networks: measurements, modeling, and implications using google+,” in Proceedings of the 2012 ACM conference on IMC.
-  R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer, “The graph structure in the web: Analyzed on different aggregation levels,” The Journal of Web Science, 2015.
-  H. Kwak, C. Lee, H. Park, and S. Moon, “What is twitter, a social network or a news media?” in Proceedings of the 19th WWW. ACM.
-  “Intel® c++ compiler 17.0 developer guide and reference,” Intel Corporation, 2016, available at https://software.intel.com/en-us/intel-cplusplus-compiler-17.0-user-and-reference-guide.
-  T. Hoefler, “Green graph500,” available at http://green.graph500.org/.