Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable

Laxman Dhulipala
CMU
ldhulipa@cs.cmu.edu
   Guy E. Blelloch
CMU
guyb@cs.cmu.edu
   Julian Shun
MIT CSAIL
jshun@mit.edu
Abstract

There has been significant interest in parallel graph processing recently due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature reports results on much smaller graphs, and the work that does use the Hyperlink graph runs in distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory.

This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly-available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically-efficient parallel algorithms for 13 important graph problems. We also present the optimizations and techniques that we used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We provide a publicly-available benchmark suite containing our implementations.


1 Introduction

Today, the largest publicly-available graph, the Hyperlink Web graph, consists of over 3.5 billion vertices and 128 billion edges [77]. This graph presents a significant computational challenge for both distributed- and shared-memory systems. Indeed, very few algorithms have been applied to this graph, and those that have often take hours to run [34, 70, 57], with the fastest times requiring between 1–6 minutes using a supercomputer [112, 113]. In this paper, we show that a wide range of fundamental graph problems can be solved quickly on this graph, often in minutes, on a single commodity shared-memory machine with a terabyte of RAM (these machines are roughly the size of a workstation and can be easily rented in the cloud, e.g., on Amazon EC2). For example, our k-core implementation takes under 3.5 minutes on 72 cores, whereas Slota et al. [113] report a running time of about 6 minutes for approximate k-core on a supercomputer with over 8000 cores. They also report that they can identify the largest connected component on this graph in 63 seconds, whereas we can identify all connected components in 38.3 seconds. Another recent result by Stergiou et al. [114] solves connectivity on the Hyperlink 2012 graph in 341 seconds on a 1000-node cluster with 24000 cores and 128TB of RAM. Compared to this result, our implementation is 8.9x faster on a system with 128x less memory and 333x fewer cores. However, we note that they are able to process a significantly larger private graph that would not fit in our memory footprint. Our running times are also significantly faster than the numbers reported by disk-based graph processing systems [34, 70, 57].

Importantly, all of our implementations have strong theoretical bounds on their work and depth. There are several reasons that algorithms with good theoretical guarantees are desirable. For one, they are robust as even adversarially-chosen inputs will not cause them to perform extremely poorly. Furthermore, they can be designed on pen-and-paper by exploiting properties of the problem instead of tailoring solutions to the particular dataset at hand. Theoretical guarantees also make it likely that the algorithm will continue to perform well even if the underlying data changes. Finally, careful implementations of algorithms that are nearly work-efficient can perform much less work in practice than work-inefficient algorithms. This reduction in work often translates to faster running times on the same number of cores [36]. We note that most running times that have been reported in the literature on the Hyperlink Web graph use parallel algorithms that are not theoretically-efficient.

In this paper, we present implementations of parallel algorithms with strong theoretical bounds on their work and depth for connectivity, biconnectivity, strongly connected components, low-diameter decomposition, maximal independent set, maximal matching, graph coloring, single-source shortest paths, betweenness centrality, minimum spanning forest, k-core decomposition, approximate set cover, and triangle counting. We describe the techniques used to achieve good performance on graphs with billions of vertices and hundreds of billions of edges and share experimental results for the Hyperlink 2012 and Hyperlink 2014 Web crawls, the largest and second largest publicly-available graphs, as well as several smaller real-world graphs at various scales. Some of the algorithms we describe are based on previous results from Ligra, Ligra+, and Julienne [104, 108, 36], and other papers on efficient parallel graph algorithms [19, 50, 109]. However, most existing implementations were changed significantly in order to be more memory efficient. Several algorithm implementations for problems like strongly connected components, minimum spanning forest, and biconnectivity are new, and required implementation techniques to scale that we believe are of independent interest. We also had to extend the compressed representation from Ligra+ [108] to ensure that our graph primitives for mapping, filtering, reducing and packing the neighbors of a vertex were theoretically efficient. We note that using compression techniques is crucial for representing the symmetrized Hyperlink 2012 graph in 1TB of RAM, as storing this graph in an uncompressed format would require over 900GB to store the edges alone, whereas the graph requires 330GB in our compressed format (less than 1.5 bytes per edge). We show the running times of our algorithms on the Hyperlink 2012 graph as well as their work and depth bounds in Tables 1 and 2. To make it easy to build upon or compare to our work in the future, we describe a benchmark suite containing our problems with clear I/O specifications, which we have made publicly available at https://github.com/ldhulipala/gbbs.

We present an experimental evaluation of all of our implementations, and in almost all cases, the numbers we report are faster than any previous performance numbers for any machines, even much larger supercomputers. We are also able to apply our algorithms to the largest publicly-available graph, in many cases for the first time in the literature, using a reasonably modest machine. Most importantly, our implementations are based on reasonably simple algorithms with strong bounds on their work and depth. We believe that our implementations are likely to scale to larger graphs and lead to efficient algorithms for related problems.

Problem Algorithm Model Work Depth
Breadth-First Search - TS
Integral-Weight SSSP (weighted BFS) [36] PW expected w.h.p.*
General-Weight SSSP (Bellman-Ford) [33] PW
Single-Source Betweenness Centrality (BC) [27] FA
Low-Diameter Decomposition [80] TS expected w.h.p.
Connectivity [107] TS expected w.h.p.
Biconnectivity [115] FA expected w.h.p.
Strongly Connected Components [20] PW expected w.h.p.
Minimum Spanning Forest [118] PW
Maximal Independent Set [19] FA expected w.h.p.
Maximal Matching [19] PW expected w.h.p.
Graph Coloring [50] FA
k-core [36] FA expected w.h.p.
Approximate Set Cover [22] PW expected w.h.p.
Triangle Counting [109] -
Table 1: The theoretical bounds for our implementations and the variant of the MT-RAM used are presented in the last three columns. *We use w.h.p. to mean with probability at least 1 - 1/n^c for any constant c > 0.
Problem (1) (72h) (S)
Breadth-First Search (BFS) 649 10.7 60
Integral-Weight SSSP (weighted BFS) 3770 58.1 64
General-Weight SSSP (Bellman-Ford) 4010 59.4 67
Single-Source Betweenness Centrality (BC) 2260 37.1 60
Low-Diameter Decomposition (LDD) 1150 19.6 58
Connectivity 2080 38.3 54
Biconnectivity 9860 165 59
Strongly Connected Components (SCC) 8130 185 43
Minimum Spanning Forest (MSF) 9520 187 50
Maximal Independent Set (MIS) 2190 32.2 68
Maximal Matching (MM) 7150 108 66
Graph Coloring 8920 158 56
k-core 8515 184 46
Approximate Set Cover 5630 98.4 57
Triangle Counting (TC) — 1470 —
Table 2: The last three columns report the running times (in seconds) of our implementations on the Hyperlink2012 graph where (1) is the single-thread time, (72h) is the 72 core time using hyper-threading and (S) is the speedup. We mark times that did not finish in 3 hours with —.

2 Related Work

Parallel Graph Algorithms. Parallel graph algorithms have received significant attention since the start of parallel computing, and many elegant algorithms with good theoretical bounds have been developed over the decades (e.g., [103, 61, 69, 4, 115, 81, 93, 56, 29, 89, 80, 39, 15, 79]). A major goal in parallel graph algorithm design is to find work-efficient algorithms with polylogarithmic depth. While many suspect that work-efficient algorithms may not exist for all parallelizable graph problems, as inefficiency may be inevitable for problems that depend on transitive closure, many problems that are of practical importance do admit work-efficient algorithms [60]. For these problems, which include connectivity, biconnectivity, minimum spanning forest, maximal independent set, maximal matching, and triangle counting, giving theoretically-efficient implementations that are simple and practical is important, as the amount of parallelism available on modern systems is still modest enough that reducing the amount of work done is critical for achieving good performance. Aside from intellectual curiosity, investigating whether theoretically-efficient graph algorithms also perform well in practice is important, as theoretically-efficient algorithms are less vulnerable to adversarial inputs than ad-hoc algorithms that happen to work well in practice.

Unfortunately, some problems that are not known to admit work-efficient parallel algorithms due to the transitive-closure bottleneck [60], such as strongly connected components (SCC) and single-source shortest paths (SSSP), are still important in practice. One method for circumventing the bottleneck is to give work-efficient algorithms for these problems that run in depth proportional to the diameter of the graph—as real-world graphs have low diameter, and theoretical models of real-world graphs predict a logarithmic diameter, these algorithms offer theoretical guarantees in practice [99, 20]. Other problems, like k-core, are P-complete [6], which rules out polylogarithmic-depth algorithms for them unless P = NC [46]. However, even k-core admits an algorithm with strong theoretical guarantees that is efficient in practice [36].

Parallel Graph Processing Frameworks. Motivated by the need to process very large graphs, there have been many graph processing frameworks developed in the literature (e.g., [72, 44, 68, 83, 104], among many others). We refer the reader to [75, 117] for surveys of existing frameworks. Several recent graph processing systems evaluate the scalability of their implementations by solving problems on massive graphs [112, 34, 70, 36, 57, 114]. All of these systems report running times either on the Hyperlink 2012 graph or the Hyperlink 2014 graph, two web crawls released by the WebDataCommons that are the largest and second largest publicly-available graphs respectively. We describe these recent systems and give a detailed comparison of how our implementations perform compared to their codes in Section 6. We review existing parallel graph algorithm benchmarks in Section C.

3 Preliminaries

Graph Notation. We denote an unweighted graph by G(V, E), where V is the set of vertices and E is the set of edges in the graph. A weighted graph is denoted by G = (V, E, w), where w is a function that maps an edge to a real value (its weight). The number of vertices in a graph is n = |V|, and the number of edges is m = |E|. Vertices are assumed to be indexed from 0 to n - 1. For undirected graphs we use N(v) to denote the neighbors of vertex v and deg(v) to denote its degree. For directed graphs, we use N^-(v) and N^+(v) to denote the in- and out-neighbors of a vertex v. We use diam(G) to refer to the diameter of the graph, i.e., the longest shortest-path distance between any vertex s and any vertex v reachable from s. Δ is used to denote the maximum degree of the graph. We assume that there are no self-edges or duplicate edges in the graph. We refer to graphs stored as a list of edges as being in the edgelist format, and to graphs stored in the compressed-sparse-column and compressed-sparse-row formats as CSC and CSR respectively.

Atomic Primitives. We use three common atomic primitives in our algorithms: test-and-set (TS), fetch-and-add (FA), and priority-write (PW). A test-and-set(&x) instruction checks if x is 0, and if so atomically sets it to 1 and returns true; otherwise it returns false. A fetch-and-add(&x) instruction atomically returns the current value of x and then increments x. A priority-write(&x, v, p) instruction atomically compares v with the current value of x using the priority function p, and if v has higher priority than the value of x according to p it sets x to v and returns true; otherwise it returns false.
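
To make the semantics concrete, the following is a minimal C++ sketch of the three primitives using std::atomic; the function names and the min-based priority function are illustrative choices, not the exact interface used in our code.

```cpp
#include <atomic>

// Test-and-set: returns true if x was 0 and we atomically set it to 1.
bool test_and_set(std::atomic<int>& x) {
  int expected = 0;
  return x.compare_exchange_strong(expected, 1);
}

// Fetch-and-add: returns the previous value of x and then increments it.
int fetch_and_add(std::atomic<int>& x, int delta = 1) {
  return x.fetch_add(delta);
}

// Priority-write specialized to "smaller is higher priority": writes v into x
// if v is smaller than the current value; returns true if the write succeeded.
bool priority_write_min(std::atomic<int>& x, int v) {
  int old = x.load();
  while (v < old) {
    if (x.compare_exchange_weak(old, v)) return true;
    // on failure, old is refreshed with the current value of x and we retry
  }
  return false;
}
```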

Model. In the analysis of algorithms we use the following work-depth model, which is closely related to the PRAM but better models current machines and programming paradigms that are asynchronous and allow dynamic forking. We can simulate the model on the CRCW PRAM equipped with the same operations with an additional factor in the depth due to load balancing. Furthermore, a PRAM algorithm using P processors and T time can be simulated in our model with O(PT) work and O(T) depth.

The Multi-Threaded Random-Access Machine (MT-RAM) consists of a set of threads that share an unbounded memory. Each thread is basically equivalent to a Random Access Machine—it works on a program stored in memory, has a constant number of registers, and has standard RAM instructions (including an end instruction to finish the computation). The MT-RAM extends the RAM with a fork instruction that takes a positive integer k and forks k new child threads. Each child thread receives a unique integer in the range [1, ..., k] in its first register and otherwise has the identical state as the parent, which has a 0 in that register. All children start by running the next instruction. When a thread performs a fork, it is suspended until all of its children terminate (execute an end instruction). A computation starts with a single root thread and finishes when that root thread ends. This model supports what is often referred to as nested parallelism. If the root thread never forks, the computation is a standard sequential program.

A computation can be viewed as a series-parallel DAG in which each instruction is a vertex, sequential instructions are composed in series, and the forked subthreads are composed in parallel. The work of a computation is the number of vertices and the depth is the length of the longest path in the DAG. We augment the model with three atomic instructions that are used by our algorithms—test-and-set (TS), fetch-and-add (FA), and priority-write (PW)—and refer to the model equipped with these operations as the TS, FA, and PW variants of the MT-RAM. As is standard with the RAM model, we assume that memory locations and registers have at most O(log M) bits, where M is the total size of the memory used. More details about the model can be found in [17].
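
As a concrete illustration of nested fork-join in this model (assuming a Cilk Plus-enabled compiler, which is what our experiments use), the sketch below forks a child computation at each level of a recursive sum and joins before combining; the function name and sequential cutoff are illustrative.

```cpp
#include <cilk/cilk.h>
#include <cstdint>

// Nested fork-join: each call spawns one child and continues in the parent,
// then joins, mirroring the model's fork/end instructions.
int64_t parallel_sum(const int64_t* A, int64_t n) {
  if (n <= 1024) {                    // small inputs: run sequentially
    int64_t s = 0;
    for (int64_t i = 0; i < n; i++) s += A[i];
    return s;
  }
  int64_t left = cilk_spawn parallel_sum(A, n / 2);    // forked child thread
  int64_t right = parallel_sum(A + n / 2, n - n / 2);  // parent continues
  cilk_sync;                                           // join: wait for the child
  return left + right;                                 // linear work, logarithmic depth
}
```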

Parallel Primitives. The following parallel procedures are used throughout the paper. Scan takes as input an array A of length n, an associative binary operator ⊕, and an identity element ⊥ such that ⊥ ⊕ x = x for any x, and returns the array (⊥, A[0], A[0] ⊕ A[1], ..., A[0] ⊕ ... ⊕ A[n-2]) as well as the overall sum, A[0] ⊕ ... ⊕ A[n-1]. Scan can be done in O(n) work and O(log n) depth (assuming ⊕ takes O(1) work) [56]. Reduce takes an array A and a binary associative function f and returns the sum of the elements in A with respect to f. Filter takes an array A and a predicate f and returns a new array containing the elements a ∈ A for which f(a) is true, in the same order as in A. Reduce and filter can both be done in O(n) work and O(log n) depth (assuming f takes O(1) work).
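
For reference, the following sequential C++ sketch pins down the input/output behavior of scan, reduce, and filter; the parallel versions compute the same results within the stated bounds, and the names and types here are illustrative.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Exclusive scan: returns the array of prefix "sums" (identity first) and the
// overall sum with respect to the associative operator op.
template <class T, class F>
std::pair<std::vector<T>, T> scan(const std::vector<T>& A, F op, T identity) {
  std::vector<T> out(A.size());
  T sum = identity;
  for (std::size_t i = 0; i < A.size(); i++) {
    out[i] = sum;          // sum of all elements strictly before position i
    sum = op(sum, A[i]);
  }
  return {out, sum};
}

// Reduce: the sum of all elements of A with respect to op.
template <class T, class F>
T reduce(const std::vector<T>& A, F op, T identity) {
  T sum = identity;
  for (const T& a : A) sum = op(sum, a);
  return sum;
}

// Filter: the elements of A satisfying pred, in their original order.
template <class T, class P>
std::vector<T> filter(const std::vector<T>& A, P pred) {
  std::vector<T> out;
  for (const T& a : A) if (pred(a)) out.push_back(a);
  return out;
}
```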

Ligra, Ligra+, and Julienne. We make use of the Ligra, Ligra+, and Julienne frameworks for shared-memory graph processing in this paper and review components from these frameworks here [104, 108, 36]. Ligra provides data structures for representing a graph G = (V, E) and vertexSubsets (subsets of the vertices). We make use of the edgeMap function provided by Ligra, which we use for mapping over edges. edgeMap takes as input a graph G = (V, E), a vertexSubset U, and two boolean functions F and C. edgeMap applies F to each edge (u, v) ∈ E such that u ∈ U and C(v) = true (call this subset of edges E_a), and returns a vertexSubset U' where v ∈ U' if and only if (u, v) ∈ E_a and F(u, v) = true. F can side-effect data structures associated with the vertices. edgeMap runs in O(Σ_{u ∈ U} deg(u)) work and O(log n) depth, assuming F and C take O(1) work. edgeMap either applies a sparse or a dense method based on the number of edges incident to the current frontier. We note that in our experiments we use an optimized version of the dense method which examines in-edges sequentially and stops once C returns false. This optimization lets us potentially examine significantly fewer edges than the fully parallel version, but at the cost of greater depth (proportional to the in-degree of a vertex).
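
The sketch below illustrates how edgeMap is used, expressing a BFS whose F function writes parent pointers. It is a simplified sequential rendition with illustrative types; the real Ligra implementation uses atomics in F, direction optimization, and compressed graph representations.

```cpp
#include <cstdint>
#include <vector>

using vertex = int64_t;
struct Graph { std::vector<std::vector<vertex>> out_neighbors; };
using vertexSubset = std::vector<vertex>;

// Simplified sparse edgeMap: applies F to edges (u, v) with u in the frontier
// and C(v) true, and returns the vertices for which F returned true.
template <class F, class C>
vertexSubset edgeMap(const Graph& G, const vertexSubset& U, F f, C c) {
  vertexSubset next;
  for (vertex u : U)                      // parallel over the frontier in Ligra
    for (vertex v : G.out_neighbors[u])   // parallel over each vertex's edges
      if (c(v) && f(u, v)) next.push_back(v);
  return next;
}

// BFS expressed with edgeMap: the parent array doubles as the visited flags.
std::vector<vertex> BFS(const Graph& G, vertex src) {
  std::vector<vertex> parents(G.out_neighbors.size(), -1);
  parents[src] = src;
  vertexSubset frontier = {src};
  while (!frontier.empty()) {
    frontier = edgeMap(G, frontier,
        [&](vertex u, vertex v) { parents[v] = u; return true; },  // F
        [&](vertex v) { return parents[v] == -1; });               // C
  }
  return parents;
}
```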

4 Algorithms

In this section we describe the I/O specifications of our benchmark, discuss related work, and present the theoretically-efficient algorithm we implemented for each problem. We cite the original papers that our algorithms are based on in Table 1. For implementations based on prior work, we discuss the related work, algorithms, and implementations in Section A. Section A also contains self-contained descriptions of all of our algorithms.

Shortest Path Problems

  Problem: Breadth-First Search (BFS)
Input: G = (V, E), an unweighted graph, and a source vertex src ∈ V.
Output: D, a mapping where D[v] is the shortest path distance from src to v in G, and ∞ if v is unreachable.

  Problem: Integral-Weight SSSP (weighted BFS)
Input: G = (V, E, w), a weighted graph with integral edge weights, and a source vertex src ∈ V.
Output: D, a mapping where D[v] is the shortest path distance from src to v in G, and ∞ if v is unreachable.

  Problem: General-Weight SSSP (Bellman-Ford)
Input: G = (V, E, w), a weighted graph, and a source vertex src ∈ V.
Output: D, a mapping where D[v] is the shortest path distance from src to v in G, and ∞ if v is unreachable. All distances must be -∞ if G contains a negative-weight cycle reachable from src.

  Problem: Single-Source Betweenness Centrality (BC)
Input: G = (V, E), an undirected graph, and a source vertex src ∈ V.
Output: S, a mapping from each vertex v to the centrality contribution from all shortest paths from src that pass through v.

 

Low-Diameter Decomposition

Input: G = (V, E), a directed graph, and a parameter 0 < β < 1.
Output: L, a mapping from each vertex to a cluster ID representing the decomposition. A (β, d)-decomposition partitions V into clusters V_1, ..., V_k such that the shortest path between two vertices in V_i using only vertices in V_i is at most d, and the number of edges (u, v) with u ∈ V_i, v ∈ V_j, and i ≠ j is at most βm.

 

Connectivity

Input: G = (V, E), an undirected graph.
Output: L, a mapping from each vertex to a unique label for its connected component.

 

Biconnectivity

Input: G = (V, E), an undirected graph.
Output: L, a mapping from each edge to the label of its biconnected component.

Sequentially, biconnectivity can be solved using the Hopcroft-Tarjan algorithm [52]. The algorithm uses depth-first search (DFS) to identify articulation points and requires O(m + n) work to label all edges with their biconnectivity label. It is possible to parallelize the sequential algorithm using a parallel DFS; however, the fastest parallel DFS algorithm is not work-efficient [2]. Tarjan and Vishkin present the first work-efficient algorithm for biconnectivity [115] (as stated in the paper the algorithm is not work-efficient, but it can be made so by using a work-efficient connectivity algorithm). Another approach relies on the fact that biconnected graphs admit open ear decompositions to solve biconnectivity efficiently [73, 94].

In this paper, we implement the Tarjan-Vishkin algorithm for biconnectivity, which runs in O(m) expected work and O(max(diam(G) log n, log^3 n)) depth w.h.p. on the FA-MT-RAM. Our implementation first computes connectivity labels using the algorithm from Section 4, which runs in O(m) expected work and O(log^3 n) depth w.h.p., and picks an arbitrary source vertex from each component. Next, we compute a spanning forest rooted at these sources using breadth-first search, which runs in O(m) work and O(diam(G) log n) depth. We note that the connectivity algorithm can be modified to compute a spanning forest in the same work and depth as connectivity, which would avoid the breadth-first search. We compute the per-vertex quantities needed by the algorithm (e.g., preorder numbers and low and high values) by running leaffix and rootfix sums on the spanning forests produced by BFS with fetch-and-add, which requires O(n) work and O(diam(G) log n) depth. Finally, we compute an implicit representation of the biconnectivity labels for each edge, using an idea from [13]. This step computes per-vertex labels by removing all critical edges and computing connectivity on the remaining graph. The resulting vertex labels can be used to assign biconnectivity labels to edges by giving tree edges the connectivity label of the endpoint further from the root in the tree, and assigning non-tree edges the label of either endpoint. Summing the cost of each step, the total work of this algorithm is O(m) in expectation and the total depth is O(max(diam(G) log n, log^3 n)) w.h.p.
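
The final labeling step can be illustrated with the following C++ sketch, which assumes a BFS-tree parent array (with parent[root] = root) and the per-vertex connectivity labels computed after removing critical edges; all names are illustrative, and the real implementation keeps the labels implicit rather than materializing an edge array.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct EdgeLabel { int64_t u, v, label; };

// Assign biconnectivity labels: a tree edge gets the label of the endpoint
// further from the root (the child), and a non-tree edge gets the label of
// either endpoint (here, arbitrarily, u).
std::vector<EdgeLabel> label_edges(
    const std::vector<std::pair<int64_t, int64_t>>& edges,
    const std::vector<int64_t>& parent,        // BFS/spanning-tree parents
    const std::vector<int64_t>& vertex_label)  // labels after removing critical edges
{
  std::vector<EdgeLabel> out;
  out.reserve(edges.size());
  for (auto [u, v] : edges) {
    bool tree_edge = (parent[v] == u) || (parent[u] == v);
    int64_t child = (parent[v] == u) ? v : u;  // endpoint further from the root
    int64_t label = tree_edge ? vertex_label[child] : vertex_label[u];
    out.push_back({u, v, label});
  }
  return out;
}
```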

Minimum Spanning Forest

Input: G = (V, E, w), a weighted graph.
Output: T, a set of edges representing a minimum spanning forest of G.

  Borůvka gave the first known sequential and parallel algorithm for computing a minimum spanning forest (MSF) [26]. Significant effort has gone into finding linear-work MSF algorithms both in the sequential and parallel settings [59, 29, 89]. Unfortunately, the linear-work parallel algorithms are highly involved and do not seem to be practical. Significant effort has also gone into designing practical parallel algorithms for MSF; we discuss relevant experimental work in Section 6. Due to the simplicity of Borůvka, many parallel implementations either directly implement Borůvka or variants of it.

In this paper, we present an implementation of Borůvka’s algorithm that runs in O(m log n) work and O(log^2 n) depth on the PW-MT-RAM. Our implementation is based on a recent implementation of Borůvka by Zhou [118] that runs on the edgelist format. We made several changes to the algorithm which improve performance and allow us to solve MSF on graphs stored in the CSR/CSC format, as storing an integer-weighted graph in edgelist format would require well over 1TB of memory to represent the edges of the Hyperlink2012 graph alone. Our code uses an implementation of Borůvka that works over an edgelist; to make it efficient we ensure that the size of the lists passed to it is much smaller than m. Our approach is to perform a constant number of filtering steps. Each filtering step solves an approximate k'th-smallest problem in order to extract the (approximately) k lightest edges in the graph (or all remaining edges) and runs Borůvka on this subset of edges. We then filter the remaining graph, packing out any edges that are now within the same component. This idea is similar to the theoretically-efficient algorithm of Cole et al. [29], except that instead of randomly sampling edges, we select a linear number of the lowest-weight edges. Each filtering step requires linear work and O(log n) depth, but as we only perform a constant number of steps, the filtering does not affect the work and depth asymptotically. In practice, most of the edges are removed after 3–4 filtering steps, and so the remaining edges can be copied into an edgelist and solved in a single Borůvka step. We also note that as the edges are initially represented in both directions, we can pack out the edges so that each undirected edge is only inspected once (we noticed that earlier edgelist-based implementations stored undirected edges in both directions).
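
The following sequential C++ sketch illustrates the filtering idea, substituting a simple union-find pass for the parallel Borůvka steps and std::nth_element for the approximate k'th-smallest selection; the names and the parameter k are illustrative.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct Edge { int u, v, w; };

// Union-find used in place of the parallel connectivity/Boruvka machinery.
struct UnionFind {
  std::vector<int> parent;
  explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
  int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
  bool unite(int a, int b) {
    a = find(a); b = find(b);
    if (a == b) return false;
    parent[a] = b; return true;
  }
};

// Filtering-based MSF sketch: repeatedly extract the (roughly) k lightest
// remaining edges, contract them, and pack out edges that became internal.
std::vector<Edge> msf_filtered(int n, std::vector<Edge> edges, std::size_t k) {
  std::vector<Edge> forest;
  UnionFind uf(n);
  while (!edges.empty()) {
    std::size_t take = std::min(k, edges.size());
    // Partition so the `take` lightest edges come first (k'th-smallest selection).
    std::nth_element(edges.begin(), edges.begin() + take - 1, edges.end(),
                     [](const Edge& a, const Edge& b) { return a.w < b.w; });
    std::sort(edges.begin(), edges.begin() + take,
              [](const Edge& a, const Edge& b) { return a.w < b.w; });
    for (std::size_t i = 0; i < take; i++)        // solve the light prefix
      if (uf.unite(edges[i].u, edges[i].v)) forest.push_back(edges[i]);
    // Filter step: drop the prefix and any remaining edge whose endpoints
    // are now in the same component.
    std::vector<Edge> rest;
    for (std::size_t i = take; i < edges.size(); i++)
      if (uf.find(edges[i].u) != uf.find(edges[i].v)) rest.push_back(edges[i]);
    edges.swap(rest);
  }
  return forest;
}
```

Because the selected prefix always contains the globally lightest remaining edges, processing proceeds in nondecreasing weight order, so the sketch computes a correct MSF.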

Strongly Connected Components

Input: G = (V, E), a directed graph.
Output: L, a mapping from each vertex to the label of its strongly connected component.

Tarjan’s algorithm is the textbook sequential algorithm for computing the strongly connected components (SCCs) of a directed graph [33]. As it uses depth-first search, we currently do not know how to efficiently parallelize it [2]. The current theoretical state-of-the-art for parallel SCC algorithms with polylogarithmic depth reduces the problem to computing the transitive closure of the graph, which requires significantly more work than the sequential algorithm when using combinatorial algorithms [43]. As the transitive-closure based approach performs a significant amount of work even for moderately sized graphs, subsequent research on parallel SCC algorithms has focused on improving the work while potentially sacrificing depth [41, 32, 99, 20]. Conceptually, these algorithms first pick a random pivot and use a reachability oracle to identify the SCC containing the pivot. They then remove this SCC, which partitions the remaining graph into several disjoint pieces, and recurse on the pieces.

In this paper, we present the first implementation of the SCC algorithm from Blelloch et al. [20]. Our implementation runs in O(m log n) expected work and O(diam(G) log n) depth w.h.p. on the PW-MT-RAM. One of the challenges in implementing this SCC algorithm is how to compute reachability information from multiple vertices simultaneously and how to combine this information to (1) identify SCCs and (2) refine the subproblems of visited vertices. In our implementation, we explicitly store F and B, the forward and backward reachability sets for the set of centers that are active in the current phase. The sets are represented as hash tables that store tuples of vertices and center IDs, (v, c), representing a vertex v in the same subproblem as c that is visited by a directed path from c. We explain how to make the hash table technique practical in Section 5. The reachability sets are computed by running simultaneous breadth-first searches from all active centers. In each round of the BFS, we apply edgeMap to traverse all out-edges (or in-edges) of the current frontier. When we visit an edge (u, v) we try to add u's center IDs to v. If this succeeds in adding any IDs, we test-and-set a visited flag for v, and return v in the next frontier if the test-and-set succeeded. Each BFS requires at most O(diam(G)) rounds, as each search adds the same labels on each round as it would have had it run in isolation.

After computing F and B, we deterministically assign (with respect to the random permutation of vertices generated at the start of the algorithm) the vertices visited in this phase a new label, which is either the label of a refined subproblem or a unique label for the SCC they are contained in. We first intersect the two tables and perform, for any tuple (v, c) contained in the intersection, a priority-write with min on the memory location corresponding to v's SCC label, with c as the label. Next, for all pairs (v, c) in either table we do a priority-write with min on v's subproblem label, which ensures that the highest-priority search that visited v sets its new subproblem.
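
The sketch below mirrors this label-assignment step sequentially, representing F and B as per-vertex sets of center IDs (smaller ID = higher priority) and replacing the priority-writes with plain minimum updates; the initial labels are assumed to be larger than any center ID, and all names are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <vector>

using vertex = int64_t;

// F[v]: centers whose forward search reached v; B[v]: centers whose backward
// search reached v. Labels are assumed to be pre-initialized to a value larger
// than any center ID.
void assign_labels(const std::vector<std::unordered_set<int>>& F,
                   const std::vector<std::unordered_set<int>>& B,
                   std::vector<int>& scc_label,     // per-vertex SCC label
                   std::vector<int>& subproblem) {  // per-vertex subproblem label
  const vertex n = (vertex)F.size();
  for (vertex v = 0; v < n; v++) {
    for (int c : F[v]) {
      // v lies in the SCC of center c iff c reaches v both forwards and backwards.
      if (B[v].count(c)) scc_label[v] = std::min(scc_label[v], c);
      // Every search that visited v refines its subproblem; the highest-priority
      // (smallest) center wins, mirroring the priority-write with min.
      subproblem[v] = std::min(subproblem[v], c);
    }
    for (int c : B[v]) subproblem[v] = std::min(subproblem[v], c);
  }
}
```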

We implemented an optimized search for the first phase, which just runs two regular BFSs over the in-edges and out-edges from a single pivot and stores the reachability information in bit-vectors instead of hash tables. It is well known that many directed real-world graphs have a single massive strongly connected component, and so with reasonable probability the first vertex in the permutation will find this giant component [28]. We also implemented a 'trimming' optimization that is reported in the literature [76, 111], which eliminates trivial SCCs by removing any vertices that have zero in- or out-degree. We implement a procedure that recursively trims until no zero in- or out-degree vertices remain, or until a maximum number of rounds is reached.

Maximal Independent Set and Maximal Matching

  Problem: Maximal Independent Set
Input: G = (V, E), an undirected graph.
Output: U, a set of vertices such that no two vertices in U are neighbors and all vertices in V \ U have a neighbor in U.

  Problem: Maximal Matching
Input: G = (V, E), an undirected graph.
Output: M, a set of edges such that no two edges in M share an endpoint and all edges in E \ M share an endpoint with some edge in M.

 

Maximal independent set (MIS) and maximal matching (MM) are easily solved in linear work sequentially using greedy algorithms. Many efficient parallel maximal independent set and matching algorithms have been developed over the years [61, 69, 4, 54, 19, 15]. Blelloch et al. show that when the vertices (or edges) are processed in a random order, the sequential greedy algorithms for MIS and MM can be parallelized efficiently and give practical algorithms [19]. Recently, Fischer and Noever showed an improved depth bound for this algorithm [40].

In this paper, we implement the rootset-based algorithm for MIS from Blelloch et al. [19], which runs in O(n + m) expected work and O(log^2 n) depth w.h.p. on the FA-MT-RAM. To the best of our knowledge this is the first implementation of the rootset-based algorithm; the implementations from [19] are based on processing appropriately-sized prefixes of an order generated by a random permutation. Our implementation of the rootset-based algorithm works on a priority-DAG defined by directing edges in the graph from the higher-priority endpoint to the lower-priority endpoint. On each round, we add all roots of the DAG into the MIS, compute the set of neighbors of the rootset that are still active, and finally decrement the priorities of their still-active neighbors. As the vertices in this set are at arbitrary depths in the priority-DAG, we only decrement the priority along an edge (u, v) if u has higher priority than v. The algorithm runs in O(n + m) work, as we process each edge once; the depth bound is O(log^2 n), as the priority-DAG has O(log n) depth w.h.p. [40] and each round takes O(log n) depth. We were surprised that this implementation usually outperforms the prefix-based implementation from [19], while also being very simple to implement.
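
A sequential C++ sketch of the rootset-based algorithm is shown below; the random priorities, counters, and round structure follow the description above, while the parallel version processes each round with parallel primitives and atomic decrements. All names are illustrative.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Rootset-based MIS: direct each edge from the higher-priority endpoint to the
// lower-priority one, repeatedly add the current roots to the MIS, and retire
// them and their neighbors, decrementing counts along priority edges.
std::vector<int> rootset_mis(const std::vector<std::vector<int>>& adj) {
  const int n = adj.size();
  std::vector<int> priority(n);                 // smaller value = higher priority
  std::iota(priority.begin(), priority.end(), 0);
  std::shuffle(priority.begin(), priority.end(), std::mt19937(42));

  std::vector<int> count(n, 0), status(n, 0);   // 0 = active, 1 = in MIS, 2 = removed
  for (int u = 0; u < n; u++)
    for (int v : adj[u])
      if (priority[v] < priority[u]) count[u]++;  // active higher-priority neighbors

  std::vector<int> roots, mis;
  for (int v = 0; v < n; v++) if (count[v] == 0) roots.push_back(v);

  while (!roots.empty()) {
    std::vector<int> removed;
    for (int r : roots) {                       // all roots join the MIS
      if (status[r] != 0) continue;
      status[r] = 1; mis.push_back(r);
      for (int v : adj[r])                      // neighbors of the rootset leave the graph
        if (status[v] == 0) { status[v] = 2; removed.push_back(v); }
    }
    std::vector<int> next;
    for (int u : removed)                       // notify still-active, lower-priority
      for (int v : adj[u])                      // neighbors of the removed vertices
        if (status[v] == 0 && priority[u] < priority[v] && --count[v] == 0)
          next.push_back(v);
    roots.swap(next);
  }
  return mis;
}
```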

Our maximal matching implementation is based on the prefix-based algorithm from [19], which takes O(m) expected work and polylogarithmic depth w.h.p. on the PW-MT-RAM (using the improved depth analysis of [40]). We had to make several modifications to run the algorithm on the large graphs in our experiments. The original code from [19] uses an edgelist representation, but we cannot directly use this implementation, as uncompressing all edges would require a prohibitive amount of memory for large graphs. Instead, as in our MSF implementation, we simulate the prefix-based approach by performing a constant number of filtering steps. Each filter step extracts the highest-priority edges, randomly permutes them, and then runs the edgelist-based algorithm on this prefix. After computing the new set of edges that are added to the matching, we filter the remaining graph and remove all edges that are incident to matched vertices. In practice, just 3–4 filtering steps are sufficient to remove essentially all edges in the graph. The last step uncompresses any remaining edges into an edgelist and runs the prefix-based algorithm. The filtering steps can be done within the work and depth bounds of the original algorithm.

Graph Coloring

Input: G = (V, E), an undirected graph.
Output: C, a mapping from each vertex to a color such that for each edge (u, v) ∈ E, C(u) ≠ C(v), using at most Δ + 1 colors.

 

As graph coloring is NP-hard to solve optimally, algorithms like greedy coloring, which guarantees a (Δ + 1)-coloring, are used instead in practice, and often use far fewer than Δ + 1 colors on real-world graphs [116, 50]. Jones and Plassmann (JP) parallelize the greedy algorithm using linear work, but unfortunately adversarial inputs exist for the heuristics they consider that may force the algorithm to run with very high depth. Hasenplaugh et al. introduce several heuristics that produce high-quality colorings in practice and also achieve provably low depth regardless of the input graph. These include LLF (largest-log-degree-first), which processes vertices ordered by the log of their degree, and SLL (smallest-log-degree-last), which processes vertices by repeatedly removing all lowest log-degree vertices from the graph, coloring the remaining graph, and finally coloring the removed vertices. For LLF, they show that the algorithm runs in linear work and has a provably low depth bound that holds in expectation.

In this paper, we implement a synchronous version of Jones-Plassmann using the LLF heuristic in Ligra on the FA-MT-RAM. The algorithm is implemented similarly to our rootset-based algorithm for MIS. In each round, after coloring the roots, we use a fetch-and-add to decrement a count on their neighbors, and add a neighbor as a root for the next round if its count is decremented to 0.
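
The following sequential sketch shows the round-synchronous JP structure with the LLF order (log-degree rank, with ties broken by vertex ID as an illustrative stand-in for random tie-breaking); the parallel implementation decrements the counts with fetch-and-add as described above.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Round-synchronous Jones-Plassmann with the LLF order: a vertex waits until
// all of its higher-ranked neighbors are colored, then takes the smallest
// color not used by its already-colored neighbors.
std::vector<int> jp_llf(const std::vector<std::vector<int>>& adj) {
  const int n = adj.size();
  auto rank = [&](int v) {
    int d = adj[v].size();
    return std::make_pair(d == 0 ? 0 : (int)std::log2(d), v);  // (log-degree, tie-break)
  };
  std::vector<int> count(n, 0), color(n, -1);
  for (int u = 0; u < n; u++)
    for (int v : adj[u]) if (rank(v) > rank(u)) count[u]++;    // higher-ranked neighbors

  std::vector<int> roots;
  for (int v = 0; v < n; v++) if (count[v] == 0) roots.push_back(v);

  while (!roots.empty()) {
    std::vector<int> next;
    for (int u : roots) {
      // Color u with the smallest color unused by its already-colored neighbors.
      std::vector<bool> used(adj[u].size() + 1, false);
      for (int v : adj[u])
        if (color[v] >= 0 && color[v] < (int)used.size()) used[color[v]] = true;
      color[u] = std::find(used.begin(), used.end(), false) - used.begin();
      // Notify lower-ranked neighbors; they become roots once all of their
      // higher-ranked neighbors are colored (fetch-and-add in the parallel code).
      for (int v : adj[u])
        if (rank(v) < rank(u) && --count[v] == 0) next.push_back(v);
    }
    roots.swap(next);
  }
  return color;
}
```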

k-core

Input: G = (V, E), an undirected graph.
Output: D, a mapping from each vertex to its coreness value.

k-cores were defined independently by Seidman [100], and by Matula and Beck [74], who also gave a linear-time algorithm for computing the coreness value of every vertex, i.e., the maximum k-core a vertex participates in. Anderson and Mayr showed that k-core (and therefore coreness) is in NC for k ≤ 2, but is P-complete for k ≥ 3 [6]. The Matula and Beck algorithm is simple and practical—it first bucket-sorts vertices by their degree, and then repeatedly deletes the minimum-degree vertex. The affected neighbors are moved to a new bucket corresponding to their induced degree. As each vertex, and each edge in each direction, is processed exactly once, the algorithm runs in O(m + n) work. In [36], the authors give a parallel algorithm based on bucketing that runs in O(m + n) expected work and O(ρ log n) depth w.h.p., where ρ is the peeling complexity of the graph, defined as the number of rounds needed to peel the graph down to an empty graph when each peeling step removes all minimum-degree vertices.

Our implementation of k-core in this paper is based on the implementation from Julienne [36]. One of the challenges in implementing the peeling algorithm for k-core is efficiently computing the number of edges removed from each vertex that remains in the graph. A simple approach is to just fetch-and-add a counter per vertex and update the bucket of the vertex based on this counter; however, this incurs significant contention on real-world graphs containing vertices with large degree. To make this step faster in practice, we implemented a work-efficient histogram which computes the number of edges removed from the remaining vertices while incurring very little contention. We describe our histogram implementation in Section 5.
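
A sequential sketch of the peeling loop is shown below; the unordered_map plays the role of the histogram (aggregating, per remaining vertex, the number of edges removed in a round), and the per-round linear scans stand in for Julienne's bucket structure, so the sketch is illustrative rather than work-efficient.

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

// Peeling-based coreness: repeatedly remove all vertices whose induced degree
// is at most the current level k, aggregating degree decrements per round.
std::vector<int> kcore_peel(const std::vector<std::vector<int>>& adj) {
  const int n = adj.size();
  std::vector<int> deg(n), core(n, 0);
  std::vector<bool> alive(n, true);
  for (int v = 0; v < n; v++) deg[v] = adj[v].size();

  int remaining = n, k = 0;
  while (remaining > 0) {
    std::vector<int> frontier;                   // vertices peeled at level k
    for (int v = 0; v < n; v++)
      if (alive[v] && deg[v] <= k) frontier.push_back(v);
    if (frontier.empty()) { k++; continue; }
    for (int v : frontier) { alive[v] = false; core[v] = k; }
    remaining -= frontier.size();

    // Histogram step: count, per remaining neighbor, how many of its edges
    // were removed this round, then apply all decrements at once.
    std::unordered_map<int, int> removed_edges;
    for (int v : frontier)
      for (int u : adj[v])
        if (alive[u]) removed_edges[u]++;
    for (auto [u, c] : removed_edges) deg[u] = std::max(k, deg[u] - c);
  }
  return core;
}
```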

Approximate Set Cover

Input: G = (V, E), an undirected graph representing a set cover instance.
Output: S, a set of sets such that every element is covered by some set in S, with |S| being an O(log n)-approximation to the optimal cover.

 

Triangle Counting

Input: G = (V, E), an undirected graph.
Output: TG, the total number of triangles in G.

 

5 Implementations and Techniques

In this section, we introduce several general implementation techniques and optimizations that we use in our algorithms. Due to lack of space, we defer the description of some techniques, such as a more cache-friendly sparse edgeMap that we call edgeMapBlocked, and our compression techniques, to the appendix.

A Work-efficient Histogram Implementation. Our initial implementation of the peeling-based algorithm for k-core suffered from poor performance due to a large amount of contention incurred by fetch-and-adds on high-degree vertices. This occurs because many social networks and web graphs have large maximum degree but relatively small degeneracy, or largest non-empty core (shown in Table 3). For these graphs, we observed that many early rounds, which process vertices with low coreness, perform a large number of fetch-and-adds on memory locations corresponding to high-degree vertices, resulting in high contention [105]. To reduce contention, we designed a work-efficient histogram implementation that can perform this step while only incurring logarithmic contention w.h.p. The Histogram primitive takes a sequence of (key, value) pairs and an associative and commutative operator R, and computes a sequence of (key, value) pairs where each key appears only once and its associated value is the sum of all values associated with that key in the input, combined with respect to R.
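
The reference semantics of the Histogram primitive are captured by the following sequential C++ sketch; the parallel implementations described next compute the same result with low contention, and the template names here are illustrative.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Sequential reference semantics for Histogram: combine the values of all
// pairs that share a key using the associative, commutative operator combine.
template <class K, class V, class R>
std::vector<std::pair<K, V>> histogram(const std::vector<std::pair<K, V>>& pairs,
                                       R combine) {
  std::unordered_map<K, V> table;
  for (const auto& [k, v] : pairs) {
    auto it = table.find(k);
    if (it == table.end()) table.emplace(k, v);
    else it->second = combine(it->second, v);
  }
  return std::vector<std::pair<K, V>>(table.begin(), table.end());
}
```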

A useful example of histogram to consider is computing, for each vertex v ∈ N(F) of a vertexSubset F, the number of edges (u, v) where u ∈ F (i.e., the number of incoming neighbors from the frontier). This operation can be implemented by running histogram on a sequence where each v ∈ N(F) appears once per (u, v) edge as a pair (v, 1), using + as the operator. One theoretically efficient implementation of histogram is to simply semisort the pairs using the work-efficient semisort algorithm from [47]. The semisort places pairs from the sequence into a set of heavy and light buckets, where heavy buckets contain a single key that appears many times in the input sequence, and light buckets contain a small number of distinct keys, each of which appears at most a logarithmic number of times w.h.p. (heavy and light keys are determined by sampling). We compute the reduced value for heavy buckets using a standard parallel reduction. For each light bucket, we allocate a hash table and hash the keys in the bucket in parallel to the table, combining multiple values for the same key using the operator. As each key appears at most a logarithmic number of times w.h.p., we incur at most logarithmic contention w.h.p. The output sequence can be computed by compacting the light tables and heavy arrays.

While the semisort implementation is theoretically efficient, it requires a likely cache miss for each key when inserting into the appropriate hash table. To improve cache performance in this step, we implemented a work-efficient algorithm based on radix sort. Our implementation is based on the parallel radix sort from PBBS [106]. As in the semisort, we first sample keys from the sequence and determine the set of heavy keys. Instead of directly moving the elements into light and heavy buckets, we break up the input sequence into blocks and sequentially sort the keys within each block into light and heavy buckets. Within each block, we reduce all heavy keys into a single value and compute an array holding the starting offset of each bucket within the block. Next, we perform a segmented scan [16] over the arrays of the blocks to compute the sizes of the light buckets and the reduced values for the heavy buckets, which each contain only a single key. Finally, we allocate tables for the light buckets, hash the light keys in parallel over the blocks, and compact the light tables and heavy keys into the output array. Compared to the original semisort implementation, this version incurs fewer cache misses because the light keys within a block are already sorted and consecutive keys are likely to go to the same hash table, which fits in cache. Comparing the histogram-based and fetch-and-add-based versions of k-core, we saw between a 1.1–3.1x improvement from using the histogram.

Techniques for overlapping searches. In this section, we describe how we compute and update the reachability labels for vertices that are visited in a phase of our SCC algorithm. Recall that each phase performs a graph traversal from the set of centers that are active in the phase and computes, for each center c, all vertices in the weakly-connected component of c's subproblem that can be reached by a directed path from it. We store this reachability information as a set of (v, c) pairs in a hash table, which represent the fact that v can be reached by a directed path from c. A phase performs two graph traversals from the centers to compute F and B, the out-reachability and in-reachability sets respectively. Each traversal allocates an initial hash table and runs rounds of edgeMap until no new label information is added to the table. We now discuss how we implement the tables and perform the traversal.

The main challenge in implementing one round of the traversal is (1) ensuring that the table has sufficient space to store all pairs that will be added in this round, and (2) efficiently iterating over all of the pairs associated with a vertex. We implement (1) by performing a parallel reduce that sums, over the vertices u in the current frontier, the number of neighbors of u in the same subproblem multiplied by the number of distinct labels currently assigned to u. This upper-bounds the number of distinct labels that could be added this round, and although we may overestimate the number of actual additions, we will never run out of space in the table. We update the number of elements currently in the table during concurrent insertions by storing a per-processor count which gets incremented whenever the processor performs a successful insertion. The counts are summed together at the end of a round and used to update the count of the number of elements in the table.

One simple implementation of (2) is to allocate, for every vertex, enough space for the maximum number of centers that can visit it during a phase, which is logarithmic w.h.p. However, this wastes a significant amount of space, as most vertices are visited just a few times. Instead, our implementation stores the (v, c) pairs for visited vertices v in the table and computes hashes based only on the ID of v. As each vertex is only expected to be visited a constant number of times during a phase, the expected probe length is still a constant. Storing the pairs for a vertex along the same probe sequence is helpful for two reasons. First, we may incur fewer cache misses than if we had hashed the pairs based on both entries, as multiple pairs for a vertex can fit on the same cache line. Second, storing the pairs for a vertex along the same probe sequence makes it extremely easy to find all pairs associated with a vertex v: we simply perform linear probing, reporting all pairs that have v as their key until we hit an empty cell. Our experiments show that this technique is practical, and we believe that it may have applications in similar algorithms, such as computing least-element lists or FRT trees in parallel [20, 21].
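
The following C++ sketch shows this table layout: (vertex, center) pairs hashed only by the vertex ID with linear probing, so all pairs for a vertex can be enumerated by probing until an empty cell. The hash function, sizing, and sequential insertion are illustrative (the real table is concurrent and sized per round), and the sketch assumes the table never fills.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Open-addressed table of (vertex, center) pairs hashed by the vertex only.
struct ReachTable {
  static constexpr int64_t kEmpty = -1;
  std::vector<std::pair<int64_t, int64_t>> cells;  // (vertex, center)
  explicit ReachTable(std::size_t capacity) : cells(capacity, {kEmpty, kEmpty}) {}

  std::size_t hash(int64_t v) const {
    return (v * 0x9E3779B97F4A7C15ULL) % cells.size();
  }

  // Insert (v, c) if not already present; returns true if a new pair was added.
  bool insert(int64_t v, int64_t c) {
    for (std::size_t i = hash(v); ; i = (i + 1) % cells.size()) {
      if (cells[i].first == kEmpty) { cells[i] = {v, c}; return true; }
      if (cells[i].first == v && cells[i].second == c) return false;
    }
  }

  // Enumerate all centers stored for v by walking its probe sequence.
  std::vector<int64_t> centers_of(int64_t v) const {
    std::vector<int64_t> out;
    for (std::size_t i = hash(v); cells[i].first != kEmpty; i = (i + 1) % cells.size())
      if (cells[i].first == v) out.push_back(cells[i].second);
    return out;
  }
};
```

Because every pair for v is inserted along the probe sequence starting at hash(v), the enumeration can stop at the first empty cell without missing any of v's centers.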

6 Experiments

In this section, we describe our experimental results on a set of real-world graphs and also discuss related experimental work. Tables 4 and 5 show the running times for our implementations on our graph inputs. For compressed graphs, we use the compression schemes from Ligra+ [108], which we extended to ensure theoretical efficiency.

Experimental Setup. We run all of our experiments on a 72-core Dell PowerEdge R930 (with two-way hyper-threading) with Intel 18-core E7-8867 v4 Xeon processors (with a 4800MHz bus and 45MB L3 cache) and 1TB of main memory. Our programs use Cilk Plus to express parallelism and are compiled with the g++ compiler (version 5.4.1) with the -O3 flag. By using Cilk's work-stealing scheduler we are able to obtain an expected running time of W/P + O(D) for an algorithm with W work and D depth on P processors [24]. For the parallel experiments, we use the command numactl -i all to balance the memory allocations across the sockets. All of the speedup numbers we report are the running times of our parallel implementation on 72 cores with hyper-threading over the running time of the implementation on a single thread.

Graph Dataset Num. Vertices Num. Edges Diameter Peeling Rounds Degeneracy
LiveJournal 4,847,571 68,993,773 16
LiveJournal-Sym 4,847,571 85,702,474 20 3480 372
com-Orkut 3,072,627 234,370,166 9 5,667 253
Twitter 41,652,231 1,468,365,182 65*
Twitter-Sym 41,652,231 2,405,026,092 23* 14,963 2488
3D-Torus 1,000,000,000 6,000,000,000 1500* 1 6
ClueWeb 978,408,098 42,574,107,469 821*
ClueWeb-Sym 978,408,098 74,744,358,622 132* 106,819 4244
Hyperlink2014 1,724,573,718 64,422,807,961 793*
Hyperlink2014-Sym 1,724,573,718 124,141,874,032 207* 58,711 4160
Hyperlink2012 3,563,602,789 128,736,914,167 5275*
Hyperlink2012-Sym 3,563,602,789 225,840,663,232 331* 130,728 10565
Table 3: Graph inputs, including the number of vertices and edges and the diameter of each graph. For undirected graphs, the last two columns give the number of peeling rounds and the largest non-empty core (degeneracy). We mark values where we are unable to calculate the exact diameter with * and report the effective diameter observed during our experiments, which is a lower bound on the actual diameter.

Graph Data. To show how our algorithms perform on graphs at different scales, we selected a representative set of real-world graphs of varying sizes. Most of the graphs are Web graphs and social networks—low-diameter graphs that are frequently used in practice. To test our algorithms on large-diameter graphs, we also ran our implementations on 3-dimensional tori, where each vertex is connected to its 2 neighbors in each dimension.

We list the graphs used in our experiments, along with their size, approximate diameter, peeling complexity [36], and degeneracy (for undirected graphs), in Table 3. LiveJournal is a directed graph of the social network obtained from a snapshot in 2008 [25]. com-Orkut is an undirected graph of the Orkut social network. Twitter is a directed graph of the Twitter network, where edges represent the follower relationship [64]. ClueWeb is a Web graph from the Lemur project at CMU [25]. Hyperlink2012 and Hyperlink2014 are directed hyperlink graphs obtained from the WebDataCommons dataset, where nodes represent web pages [77]. 3D-Torus is a 3-dimensional torus with 1 billion vertices and 6 billion edges. We mark symmetric versions of the directed graphs with the suffix -Sym. We create weighted graphs for evaluating weighted BFS, Borůvka, and Bellman-Ford by selecting edge weights uniformly at random. We process LiveJournal, com-Orkut, Twitter, and 3D-Torus in the uncompressed format, and ClueWeb, Hyperlink2014, and Hyperlink2012 in the compressed format.

Problem LiveJournal com-Orkut Twitter 3D-Torus
(1) (72h) (SU) (1) (72h) (SU) (1) (72h) (SU) (1) (72h) (SU)
Breadth-First Search (BFS) 0.59 0.018 32.7 0.41 0.012 34.1 5.45 0.137 39.7 301 5.53 54.4
Integral-Weight SSSP (weighted BFS) 1.45 0.107 13.5 2.03 0.095 21.3 33.4 0.995 33.5 437 18.1 24.1
General-Weight SSSP (Bellman-Ford) 1.96 0.086 22.7 3.98 0.168 23.6 48.7 1.56 31.2 6280 133 47.2
Single-Source Betweenness Centrality (BC) 1.66 0.049 33.8 2.52 0.057 44.2 26.3 3.26 8.06 496 12.5 39.6
Low-Diameter Decomposition (LDD) 0.54 0.027 20.0 0.33 0.019 17.3 8.48 0.186 45.5 275 7.55 36.4
Connectivity 1.20 0.050 24.0 1.64 0.056 29.2 26.1 0.807 32.3 351 14.3 24.5
Biconnectivity 5.36 0.261 20.5 7.31 0.292 25.0 146 4.86 30.0 1610 59.6 27.0
Strongly Connected Components (SCC)* 1.61 0.116 13.8 — — — 13.3 0.495 26.8 — — —
Minimum Spanning Forest (MSF) 3.64 0.204 17.8 4.58 0.227 20.1 61.8 3.02 20.4 617 23.6 26.1
Maximal Independent Set (MIS) 1.18 0.034 34.7 2.23 0.052 42.8 34.4 0.759 45.3 236 4.44 53.1
Maximal Matching (MM) 2.42 0.095 25.4 4.65 0.183 25.4 46.7 1.42 32.8 403 11.4 35.3
Graph Coloring 4.69 0.392 11.9 9.05 0.789 11.4 148 6.91 21.4 350 11.3 30.9
k-core 3.75 0.641 5.85 8.32 1.33 6.25 110 6.72 16.3 753 6.58 114.4
Approximate Set Cover 4.65 0.613 7.58 4.51 0.786 5.73 66.4 3.31 20.0 1429 40.2 35.5
Triangle Counting (TC) 13.5 0.342 39.4 78.1 1.19 65.6 1920 23.5 81.7 168 6.63 25.3
Table 4: Running times (in seconds) of our algorithms over our graph inputs in the uncompressed format on a 72-core machine (with hyper-threading) where (1) is the single-thread time, (72h) is the 72 core time using hyper-threading and (SU) is the speedup of the application (single-thread time divided by 72-core time).
Problem ClueWeb Hyperlink2014 Hyperlink2012
(1) (72h) (SU) (1) (72h) (SU) (1) (72h) (SU)
Breadth-First Search (BFS) 106 2.29 46.2 250 4.50 55.5 649 10.7 60
Integral-Weight SSSP (weighted BFS) 736 14.4 51.1 1390 22.3 62.3 3770 58.1 64
General-Weight SSSP (Bellman-Ford) 1050 16.2 64.8 1460 22.9 63.7 4010 59.4 67
Single-Source Betweenness Centrality (BC) 569 27.7 20.5 866 16.3 53.1 2260 37.1 60
Low-Diameter Decomposition (LDD) 176 3.62 48.6 322 6.84 47.0 1150 19.6 58
Connectivity 552 11.2 49.2 990 17.1 57.8 2080 38.3 54
Biconnectivity 2250 48.7 46.2 3520 71.5 49.2 9860 165 59
Strongly Connected Components (SCC)* 1240 38.1 32.5 2140 51.5 41.5 8130 185 43
Minimum Spanning Forest (MSF) 2490 45.6 54.6 3580 71.9 49.7 9520 187 50
Maximal Independent Set (MIS) 551 8.44 65.2 1020 14.5 70.3 2190 32.2 68
Maximal Matching (MM) 1760 31.8 55.3 2980 48.1 61.9 7150 108 66
Graph Coloring 2050 49.8 41.1 3310 63.1 52.4 8920 158 56
k-core 2370 62.9 37.6 3480 83.2 41.8 8515 184 46
Approximate Set Cover 1490 28.1 53.0 2040 37.6 54.2 5630 98.4 57
Triangle Counting (TC) — 272 — — 568 — — 1470 —
Table 5: Running times (in seconds) of our algorithms over our graph inputs stored in the compressed format on a 72-core machine (with hyper-threading) where (1) is the single-thread time, (72h) is the 72 core time using hyper-threading and (SU) is the speedup of the application (single-thread time divided by 72-core time). We mark experiments that are not applicable for a graph, and experiments that did not finish within 5 hours, with —.

SSSP Problems. We first discuss the running times for BFS, weighted BFS, Bellman-Ford, and betweenness centrality on our graphs. Our implementations achieve between an 8–67x speedup across all inputs. We ran all of our shortest-path experiments on the symmetrized versions of the graphs. Our experiments show that our weighted BFS and Bellman-Ford implementations perform as well as or better than our prior implementations from Julienne [36]. Our running times for BFS and betweenness centrality are the same as the times of the implementations in Ligra [104]. We note that our running times for weighted BFS on the Hyperlink graphs are larger than the times reported in Julienne. This is because the shortest-path experiments in Julienne were run on the directed version of the graph, where the average vertex can reach many fewer vertices than on the symmetrized version. We set a flag for our weighted BFS experiments on the ClueWeb and Hyperlink graphs that lets the algorithm switch to a dense edgeMap once the frontiers are sufficiently dense, which lets the algorithm run within half of the RAM on our machine. Before this change, our weighted BFS implementation would request a large amount of memory when processing the largest frontiers, which caused the graph to become partly evicted from the page cache. In an earlier paper [36], we compared the running time of our weighted BFS implementation to two existing parallel shortest-path implementations from the GAP benchmark suite [12] and Galois [71], as well as a fast sequential shortest-path algorithm from the DIMACS shortest path challenge, showing that our implementation is between 1.07–1.1x slower than the Δ-stepping implementation from GAP, and 1.6–3.4x faster than the Galois implementation. Our old version of Bellman-Ford was between 1.2–3.9x slower than weighted BFS; we note that after changing it to use the edgeMapBlocked optimization, it is now competitive with weighted BFS and is between 1.2x faster and 1.7x slower on our graphs, with the exception of 3D-Torus, where it performs 7.3x slower than weighted BFS, as it performs significantly more work on this graph.

Connectivity Problems. Our low-diameter decomposition (LDD) implementation achieves between 17–58x speedup across all inputs. We fixed the value of β in all of the codes that use LDD. The running time of LDD is comparable to the cost of a BFS that visits most of the vertices. We are not aware of any prior experimental work that reports running times for an LDD implementation.

Our work-efficient implementation of connectivity achieves 25–57x speedup across all inputs. We note that our implementation does not assume that vertex IDs in the graph are randomly permuted and always generates a random permutation, even on the first round, as adding vertices based on their original IDs can result in poor performance. There are several existing implementations of fast parallel connectivity algorithms [88, 106, 107, 111], however, only the implementation from [107], which presents the algorithm that we implement in this paper, is theoretically-efficient. The implementation from Shun et al. was compared to both the Multistep [111] and Patwary et al. [88] implementations, and shown to be competitive on a broad set of graphs. We compared our connectivity implementation to the work-efficient connectivity implementation from Shun et al. on our uncompressed graphs and observed that our code is between 1.2–2.1x faster in parallel.

Despite our biconnectivity implementation having depth proportional to the diameter of the graph, it achieves between a 20–59x speedup across all inputs, as the diameter of most of our graphs is extremely low. Our biconnectivity implementation is about 3–5 times slower than running connectivity on the graph, which seems reasonable as our current implementation performs two calls to connectivity and one breadth-first search. There are several existing implementations of biconnectivity. Cong and Bader [30] parallelize the Tarjan-Vishkin algorithm and demonstrated speedup over the Hopcroft-Tarjan (HT) algorithm. Edwards and Vishkin [38] also implement the Tarjan-Vishkin algorithm using the XMT platform, and show that their algorithm achieves good speedups. Slota and Madduri [110] present a BFS-based biconnectivity implementation which requires super-linear work in the worst case, but behaves like a linear-work algorithm in practice. We ran the Slota and Madduri implementation on 36 hyper-threads allocated from the same socket, the configuration on which we observed the best performance for their code, and found that our implementation is between 1.4–2.1x faster than theirs. We used a DFS-ordered subgraph corresponding to the largest connected component to test their code, which produced the fastest times. Using the original order of the graph affects the running time of their implementation, causing it to run between 2–3x slower, as the amount of work performed by their algorithm depends on the order in which vertices are visited.

Our strongly connected components implementation achieves between a 13–43x speedup across all inputs. Our implementation takes a parameter that controls the base of the exponential rate at which we grow the number of centers added. We tuned this parameter for our experiments and note that using a larger value can improve the running time on smaller graphs by up to a factor of 2x. Our SCC implementation is between 1.6x faster to 4.8x slower than running connectivity on the graph. There are several existing SCC implementations that have been evaluated on real-world directed graphs [51, 111, 76]. The Hong et al. algorithm [51] is a modified version of the FWBW-Trim algorithm from McLendon et al. [76], but neither algorithm has any theoretical bounds on work or depth. Unfortunately, [51] do not report running times, so we are unable to compare our performance with them. The Multistep algorithm [111] has a super-linear worst-case running time, but the authors point out that the algorithm behaves like a linear-time algorithm on real-world graphs. We ran our implementation on 16 cores configured similarly to their experiments and found that we are about 1.7x slower on LiveJournal, which easily fits in cache, and 1.2x faster on Twitter (we scaled the time to account for a small difference in graph sizes). While the Multistep algorithm may be slightly faster on some graphs, our SCC implementation has the advantage of being theoretically efficient and performs a predictable amount of work.

Our minimum spanning forest implementation achieves between a 17–50x speedup over its running time on a single thread across all of our inputs. Obtaining practical parallel algorithms for MSF has been a longstanding goal in the field, and several implementations exist [9, 84, 31, 106, 118]. We compared our implementation with the union-find-based MSF implementation from PBBS [106] and the implementation of Borůvka from [118], which is one of the fastest implementations we are aware of. Our MSF implementation is between 2.6–5.9x faster than the MSF implementation from PBBS. Compared to the edgelist-based implementation of Borůvka from [118], our implementation is between 1.2–2.9x faster.

MIS, Maximal Matching, and Graph Coloring. Our MIS and maximal matching implementations achieve between 31–70x and 25–70x speedup across all inputs. The implementations by Blelloch et al. [19] are the fastest existing implementations of MIS and maximal matching that we are aware of, and are the basis for our maximal matching implementation. They report that their implementations are 3–8x faster than Luby’s algorithm on 32 threads, and outperform a sequential greedy MIS implementation on more than 2 processors. We compared our rootset-based MIS implementation to the prefix-based implementation, and found that the rootset-based approach is between 1.1–3.5x faster. Our maximal matching implementation is between 3–4.2x faster than the implementation from [19]. Our implementation of maximal matching can avoid a significant amount of work, as each of the filter steps can extract and permute just the highest priority edges, whereas the edgelist-based version in PBBS must permute all edges. Our coloring implementation achieves between 11–56x speedup across all inputs. We note that our implementation appears to be between 1.2–1.6x slower than the asynchronous implementation of JP in [50], due to synchronizing on many rounds which contain few vertices.

k-core, Approximate Set Cover, and Triangle Counting. Our k-core implementation achieves between 5–46x speedup across all inputs, and 114x speedup on the 3D-Torus graph, as there is only one round of peeling in which all vertices are removed. There are several recent papers that implement parallel algorithms for k-core [35, 36, 96, 58]. Both the ParK algorithm [35] and the Kabir and Madduri algorithm [58] implement the peeling algorithm using work that is super-linear in the worst case, and are therefore not work-efficient. Our implementation is between 3.8–4.6x faster than ParK on a similar machine configuration. Kabir and Madduri show that their implementation achieves an average speedup of 2.8x over ParK. Our implementation is between 1.3–1.6x faster than theirs on a similar machine configuration.

Our approximate set cover implementation achieves between 5–57x speedup across all inputs. Our implementation is based on the implementation presented in Julienne [36]; the one major modification was to regenerate random priorities for sets that are active on the current round. We compared the running time of our implementation with the parallel implementation from [23], which is available in the PBBS library. We ran both implementations with the same value of the approximation parameter ε. Our implementation is between 1.2x slower to 1.5x faster than the PBBS implementation on our graphs, with the exception of 3D-Torus. On 3D-Torus, the implementation from [23] runs 56x slower than our implementation, as it does not regenerate priorities for active sets on each round, causing worst-case behavior. Our performance is also slow on this graph, as nearly all of the vertices stay active (in the highest bucket) during each round, and using a small value of ε causes a large number of rounds to be performed. We note that using a larger value of ε greatly improves our running time for 3D-Torus.

Our triangle counting (TC) implementation achieves between 39–81x speedup across all inputs. Unfortunately, we are unable to report speedup numbers for TC on our larger graphs, as the single-threaded times took too long due to the algorithm performing O(m^{3/2}) work. There are a number of experimental papers that consider multicore triangle counting [101, 45, 62, 109, 1, 68]. We implement the algorithm from [109], and adapted it to work on compressed graphs. We note that in our experiments we intersect directed adjacency lists sequentially, as there was sufficient parallelism in the outer parallel loop. There was no significant difference in running times between our implementation and the implementation from [109]. We ran our implementation on 48 threads on the Twitter graph to compare with the times reported by EmptyHeaded [1], and found that our times are about the same.
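To illustrate the rank-and-intersect style of counting referred to above, the following is a minimal sequential sketch written by us for exposition; the function and variable names are ours and are not the benchmark's actual API. It directs each edge from the lower-degree to the higher-degree endpoint and counts each triangle exactly once by intersecting directed adjacency lists.

import itertools  # not required; standard library only

def count_triangles(adj):
    """adj: dict mapping each vertex to a set of its undirected neighbors."""
    # Rank vertices by (degree, id); keep only edges directed from lower to
    # higher rank, so each triangle is counted exactly once.
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(order)}
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}
    # A triangle {v, u, w} with rank(v) < rank(u) < rank(w) is found exactly
    # once, when intersecting out[v] and out[u].
    return sum(len(out[v] & out[u]) for v in adj for u in out[v])

# Example: a 4-clique contains exactly 4 triangles.
clique4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
assert count_triangles(clique4) == 4

In the parallel implementation the outer loop over vertices is parallelized, while each pairwise intersection is performed sequentially, matching the description above.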


Figure 1: Log-linear plot of normalized throughput vs. number of vertices for MIS, BFS, BC, and coloring on the 3D-Torus graph family.

Performance on 3D-Torus. We ran experiments on a family of 3D-Torus graphs of different sizes to study how our diameter-bounded algorithms scale relative to algorithms with polylogarithmic depth. We were surprised to see that the running times of some of our polylogarithmic-depth algorithms on this graph, like LDD and connectivity, are 17–40x larger than their running times on Twitter and Twitter-Sym, despite 3D-Torus only having 4x and 2.4x more edges than Twitter and Twitter-Sym, respectively. The slightly worse scaling on this graph can be accounted for by the fact that we stored the graph ordered by dimension, instead of storing it using a local ordering. It would be interesting to see how much improvement we could gain by reordering the vertices. As Bellman-Ford requires O(diam(G) · m) work, on the 3D-Torus this translates to O(n^{4/3}) work, which explains the poor performance of Bellman-Ford on this graph.

In Figure 1 we show the normalized throughput of MIS, BFS, BC, and graph coloring for 3-dimensional tori of different sizes, where throughput is measured as the number of edges processed per second. The throughput of each application becomes saturated before our largest-scale graph, with the exception of BFS, which saturates on a graph with 2 billion vertices. The throughput curves show that the theoretical bounds are useful in predicting how the half-lengths (the graph size at which the system achieves half of its peak performance) are distributed. The half-lengths are ordered as follows: coloring, MIS, BFS, and BC. This is the same order as sorting these algorithms by their depth with respect to this graph.

Locality. While our algorithms are efficient on the MT-RAM, we do not analyze their cache complexity, and in general they may not be efficient in a model that takes caches into account. Despite this, we observed that our algorithms have good cache performance on the graphs we tested on. In this section we give some explanation for this fact by showing that our primitives make good use of the caches. Our algorithms are also aided by the fact that these graph datasets often come in highly local orders (e.g., see the Natural order in  [37]). Table 6 shows metrics for our experiments measured using Open Performance Counter Monitor (PCM).

Algorithm | Cycles Stalled (trillions) | LLC Hit Rate | LLC Misses (billions) | BW (GB/s) | Time (s)
k-core (histogram) | 9 | 0.223 | 49 | 96 | 62.9
k-core (fetch-and-add) | 67 | 0.155 | 42 | 24 | 221
weighted BFS (blocked) | 3.7 | 0.070 | 19 | 130 | 14.4
weighted BFS (unblocked) | 5.6 | 0.047 | 29 | 152 | 25.2
Table 6: Cycles stalled while the memory subsystem has an outstanding load (trillions), LLC hit rate and misses (billions), bandwidth in GB per second (total bytes read and written from the memory controllers, divided by running time), and running time in seconds. All experiments are run on the ClueWeb graph using 72 cores with hyper-threading.

Due to space limitations, we are only able to report numbers for the ClueWeb graph. We observed that using a work-efficient histogram is 3.5x faster than using fetch-and-add in our k-core implementation, which suffers from high contention on this graph. Using a histogram reduces the number of cycles stalled due to memory by more than 7x. We also ran our wBFS implementation with and without the edgeMapBlocked optimization, which reduces the number of cache lines read from and written to when performing a sparse edgeMap. The blocked implementation reads and writes 2.1x fewer bytes than the unoptimized version, which translates to a 1.7x faster running time. We disabled the dense optimization for this experiment to directly compare the two implementations of a sparse edgeMap.
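The following is a small sequential model, written by us for exposition, contrasting the two aggregation strategies discussed above for decrementing induced degrees during peeling; the function names are ours, and the parallel contention effects are only described in comments, not simulated.

from collections import Counter

def decrement_with_fetch_and_add(degrees, endpoints):
    # Models the fetch-and-add variant: one atomic-style decrement per removed
    # edge endpoint.  Under parallel execution, many decrements hitting the
    # same high-degree vertex contend on a single memory location.
    for v in endpoints:
        degrees[v] -= 1

def decrement_with_histogram(degrees, endpoints):
    # Models the histogram variant: first aggregate how many removed edges
    # touch each vertex (a semisort, modeled here by Counter), then apply one
    # update per distinct vertex, so the number of writes to any location is
    # independent of how many removed edges touch it.
    counts = Counter(endpoints)
    for v, c in counts.items():
        degrees[v] -= c

degrees = {0: 5, 1: 3, 2: 2}
decrement_with_histogram(degrees, [0, 0, 0, 1, 2])
assert degrees == {0: 2, 1: 2, 2: 1}

Both variants compute the same result; the difference observed in Table 6 comes entirely from how the updates are applied under contention.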

Processing Massive Web Graphs. In Tables 1 and  5, we show the running times of our implementations on the ClueWeb, Hyperlink2014, and Hyperlink2012 graphs. To put our performance on these problems in context, we compare our 72-core running times to running times reported for these graphs by existing work. Table 7 summarizes the results described in this section.

Paper | Problem | Graph | Memory (TB) | Cores | Nodes | Time (s) | Our Speedup
Mosaic [70] | BFS | 2014 | 0.768 | 96 | 1 | 6.55 | 1.4x
Mosaic [70] | Connectivity | 2014 | 0.768 | 96 | 1 | 708 | 41x
Mosaic [70] | SSSP | 2014 | 0.768 | 96 | 1 | 8.6 | -2.5x
FlashGraph [34] | BFS | 2012 | 0.512 | 64 | 1 | 208 | 19x
FlashGraph [34] | BC | 2012 | 0.512 | 64 | 1 | 595 | 16x
FlashGraph [34] | Connectivity | 2012 | 0.512 | 64 | 1 | 461 | 12x
FlashGraph [34] | TC | 2012 | 0.512 | 64 | 1 | 7818 | 5.3x
BigSparse [57] | BFS | 2012 | 0.064 | 32 | 1 | 2500 | 233x
BigSparse [57] | BC | 2012 | 0.064 | 32 | 1 | 3100 | 83x
Slota et al. [113] | Largest-CC | 2012 | 16.3 | 8192 | 256 | 63 | 1.6x
Slota et al. [113] | Largest-SCC | 2012 | 16.3 | 8192 | 256 | 108 | -1.6x
Slota et al. [113] | Approx k-core | 2012 | 16.3 | 8192 | 256 | 363 | 1.9x
Stergiou et al. [114] | Connectivity | 2012 | 128 | 24000 | 1000 | 341 | 8.9x
Table 7: Speedup of our implementations on 72 cores over existing numbers reported in the literature for the Hyperlink2014 and Hyperlink2012 graphs. We report the memory (terabytes), cores, and nodes used in each paper, and their running time on this configuration in seconds. Negative speedups indicate that our implementation is slower. The times used to compute our speedups are from Tables 1 and 5.

FlashGraph [34] reports disk-based running times on a 4-socket, 32-core machine with 512GB of memory and 15 SSDs, running experiments on the directed Hyperlink2012 graph. On 64 hyper-threads, their system solves BFS in 208s, BC in 595s, connected components in 461s, and triangle counting in 7818s. Our BFS and BC implementations are 19x and 16x faster than their times, and our triangle counting and connectivity implementations are 5.3x and 12x faster than theirs, respectively. Several other external-memory graph processing systems report times for the Hyperlink2014 and Hyperlink2012 graphs. Mosaic [70] reports in-memory running times on the Hyperlink2014 graph; we note that their system is optimized for external-memory execution. They solve BFS in 6.5s, connected components in 700s, and SSSP (Bellman-Ford) in 8.6s on a machine with 4 Xeon Phis (96 cores), 768GB of RAM, and 6 NVMe drives. With the exception of SSSP, all of our implementations are faster. Our BFS and connectivity implementations are 1.4x and 40x faster respectively, but our SSSP implementation is 2.5x slower. Unfortunately, as the authors do not provide details on their SSSP implementation, how they generate edge weights, or whether they run on the symmetric or directed Hyperlink2014 graph, we are not sure why their implementation performs better than ours. BigSparse [57] reports disk-based running times for BFS and BC on the directed Hyperlink2012 graph on a 32-core machine; they solve BFS in 2500s and BC in 3100s. On the same number of cores, our times are 98x faster for BFS and 33x faster for BC.

Slota et al. [113] report running times for the Hyperlink2012 graph on 256 nodes of the Blue Waters supercomputer. Each node contains two 16-core processors, for a total of 8192 cores. They report running times to find the largest connected component and the largest strongly connected component of the graph, which are 63s and 108s respectively. Our implementations find all connected components 1.6x faster than their largest connected component implementation, and find all strongly connected components 1.6x slower than their largest-SCC implementation. They solve approximate k-cores in 363s, where the approximate k-core of a vertex is its coreness rounded up to the nearest power of 2. Our implementation of k-core computes the exact coreness of each vertex in 184s, which is 1.9x faster than the approximate implementation while using 113x fewer cores.

Stergiou et al. [114] present a connectivity algorithm for the BSP model that runs in O(log n) rounds and report running times for the Hyperlink2012 graph. They implement their algorithm using a proprietary in-memory/secondary-storage graph processing system used within Yahoo, and run their experiments on a 1000-node cluster. Each node contains two 12-core processors and 128GB of RAM, for a total of 24000 cores and 128TB of RAM. Their fastest running time on the Hyperlink2012 graph is 341s on their 1000-node system. Our implementation solves connectivity on this graph in 38.3s, which is 8.9x faster on a system with 128x less memory and 333x fewer cores. They also report running times for solving connectivity on a private Yahoo webgraph with 272 billion vertices and 5.9 trillion edges, over 26 times the size of our largest graph. While such a graph seems to currently be out of reach of our machine, we are hopeful that techniques from theoretically efficient parallel algorithms can help solve problems on graphs at this scale and beyond.

7 Conclusion

In this paper, we showed that we can process the largest publicly-available real-world graph on a single shared-memory server with 1TB of memory using theoretically-efficient parallel algorithms. Our implementations outperform existing implementations on the largest real-world graphs while using far fewer resources than the distributed-memory solutions; on a per-core basis, our numbers are significantly better. Our results provide evidence that theoretically-efficient shared-memory graph algorithms can be efficient and scalable in practice.

Acknowledgements

We would like to thank the reviewers for their helpful comments. Thanks also to Lin Ma for helpful discussions. This work was supported in part by NSF grants CCF-1408940, CCF-1533858, and CCF-1629444.

References

  • [1] C. R. Aberger, A. Lamb, S. Tu, A. Nötzli, K. Olukotun, and C. Ré. Emptyheaded: A relational engine for graph processing. ACM Trans. Database Syst., 2017.
  • [2] A. Aggarwal, R. J. Anderson, and M.-Y. Kao. Parallel depth-first search in general directed graphs. In STOC, 1989.
  • [3] M. Ahmad, F. Hijaz, Q. Shi, and O. Khan. Crono: A benchmark suite for multithreaded graph algorithms executing on futuristic multicores. In IISWC, 2015.
  • [4] N. Alon, L. Babai, and A. Itai. A fast and simple randomized parallel algorithm for the maximal independent set problem. J. Algorithms, 1986.
  • [5] N. Alon, R. Yuster, and U. Zwick. Finding and counting given length cycles. Algorithmica, 17(3), 1997.
  • [6] R. Anderson and E. W. Mayr. A P-complete problem and approximations to it. Technical report, 1984.
  • [7] B. Awerbuch. Complexity of network synchronization. J. ACM, 32(4), 1985.
  • [8] B. Awerbuch and Y. Shiloach. New connectivity and MSF algorithms for Ultracomputer and PRAM. In ICPP, 1983.
  • [9] D. A. Bader and G. Cong. Fast shared-memory algorithms for computing the minimum spanning forest of sparse graphs. JPDC, 2006.
  • [10] D. A. Bader and K. Madduri. Design and implementation of the hpcs graph analysis benchmark on symmetric multiprocessors. In International Conference on High-Performance Computing, 2005.
  • [11] D. A. Bader and K. Madduri. Designing multithreaded algorithms for breadth-first search and st-connectivity on the cray mta-2. In ICPP, 2006.
  • [12] S. Beamer, K. Asanovic, and D. A. Patterson. The GAP benchmark suite. CoRR, abs/1508.03619, 2015.
  • [13] N. Ben-David, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, Y. Gu, C. McGuffey, and J. Shun. Implicit decomposition for write-efficient connectivity algorithms. In IPDPS, 2018.
  • [14] B. Berger, J. Rompel, and P. W. Shor. Efficient algorithms for set cover with applications to learning and geometry. J. Comput. Syst. Sci., 49(3), Dec. 1994.
  • [15] M. Birn, V. Osipov, P. Sanders, C. Schulz, and N. Sitchinava. Efficient parallel and external matching. In Euro-Par, 2013.
  • [16] G. E. Blelloch. Prefix sums and their applications. Synthesis of Parallel Algorithms, 1993.
  • [17] G. E. Blelloch and L. Dhulipala. Introduction to parallel algorithms 15-853 : Algorithms in the real world. 2018.
  • [18] G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and J. Shun. Internally deterministic algorithms can be fast. In PPoPP, 2012.
  • [19] G. E. Blelloch, J. T. Fineman, and J. Shun. Greedy sequential maximal independent set and matching are parallel on average. In SPAA, 2012.
  • [20] G. E. Blelloch, Y. Gu, J. Shun, and Y. Sun. Parallelism in randomized incremental algorithms. In SPAA, 2016.
  • [21] G. E. Blelloch, Y. Gu, and Y. Sun. A new efficient construction on probabilistic tree embeddings. In ICALP, 2017.
  • [22] G. E. Blelloch, R. Peng, and K. Tangwongsan. Linear-work greedy parallel approximate set cover and variants. In SPAA, 2011.
  • [23] G. E. Blelloch, H. V. Simhadri, and K. Tangwongsan. Parallel and I/O efficient set covering algorithms. In SPAA, 2012.
  • [24] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46(5), Sept. 1999.
  • [25] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW, 2004.
  • [26] O. Borůvka. O jistém problému minimálním. Práce Mor. Přírodověd. Spol. v Brně III, 3, 1926.
  • [27] U. Brandes. A faster algorithm for betweenness centrality. Journal of mathematical sociology, 25(2), 2001.
  • [28] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer networks, 33(1-6), 2000.
  • [29] R. Cole, P. N. Klein, and R. E. Tarjan. Finding minimum spanning forests in logarithmic time and linear work using random sampling. In SPAA, 1996.
  • [30] G. Cong and D. A. Bader. An experimental study of parallel biconnected components algorithms on symmetric multiprocessors (SMPs). In IPDPS, 2005.
  • [31] G. Cong and I. G. Tanase. Composable locality optimizations for accelerating parallel forest computations. In HPCC, 2016.
  • [32] D. Coppersmith, L. Fleischer, B. Hendrickson, and A. Pinar. A divide-and-conquer algorithm for identifying strongly connected components. 2003.
  • [33] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (3. ed.). MIT Press, 2009.
  • [34] D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay. FlashGraph: Processing billion-node graphs on an array of commodity SSDs. In FAST, 2015.
  • [35] N. S. Dasari, R. Desh, and M. Zubair. Park: An efficient algorithm for k-core decomposition on multicore processors. In Big Data, 2014.
  • [36] L. Dhulipala, G. E. Blelloch, and J. Shun. Julienne: A framework for parallel graph algorithms using work-efficient bucketing. In SPAA, 2017.
  • [37] L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, and A. Shalita. Compressing graphs and indexes with recursive graph bisection. In KDD, 2016.
  • [38] J. A. Edwards and U. Vishkin. Better speedups using simpler parallel programming for graph connectivity and biconnectivity. In PMAM, 2012.
  • [39] J. T. Fineman. Nearly work-efficient parallel algorithm for digraph reachability. In STOC, 2018.
  • [40] M. Fischer and A. Noever. Tight analysis of parallel randomized greedy MIS. In SODA, 2018.
  • [41] L. K. Fleischer, B. Hendrickson, and A. Pınar. On identifying strongly connected components in parallel. In IPDPS, 2000.
  • [42] H. Gazit. An optimal randomized parallel algorithm for finding connected components in a graph. SIAM J. Comput., 1991.
  • [43] H. Gazit and G. L. Miller. An improved parallel algorithm that computes the bfs numbering of a directed graph. Information Processing Letters, 28(2), 1988.
  • [44] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, 2012.
  • [45] O. Green, L. M. Munguia, and D. A. Bader. Load balanced clustering coefficients. In PPAA, 2014.
  • [46] R. Greenlaw, H. J. Hoover, and W. L. Ruzzo. Limits to Parallel Computation: P-completeness Theory. Oxford University Press, Inc., 1995.
  • [47] Y. Gu, J. Shun, Y. Sun, and G. E. Blelloch. A top-down parallel semisort. In SPAA, 2015.
  • [48] S. Halperin and U. Zwick. An optimal randomized logarithmic time connectivity algorithm for the EREW PRAM (extended abstract). In SPAA, 1994.
  • [49] S. Halperin and U. Zwick. Optimal randomized EREW PRAM algorithms for finding spanning forests. In J. Algorithms, 2000.
  • [50] W. Hasenplaugh, T. Kaler, T. B. Schardl, and C. E. Leiserson. Ordering heuristics for parallel graph coloring. In SPAA, 2014.
  • [51] S. Hong, N. C. Rodia, and K. Olukotun. On fast parallel detection of strongly connected components (SCC) in small-world graphs. In SC, 2013.
  • [52] J. Hopcroft and R. Tarjan. Algorithm 447: efficient algorithms for graph manipulation. Communications of the ACM, 1973.
  • [53] A. Iosup, T. Hegeman, W. L. Ngai, S. Heldens, A. Prat-Pérez, T. Manhardt, H. Chafi, M. Capotă, N. Sundaram, M. Anderson, I. G. Tănase, Y. Xia, L. Nai, and P. Boncz. LDBC Graphalytics: A benchmark for large-scale graph analysis on parallel and distributed platforms. Proc. VLDB Endow., 9(13), Sept. 2016.
  • [54] A. Israeli and Y. Shiloach. An improved parallel algorithm for maximal matching. Inf. Process. Lett., 1986.
  • [55] A. Itai and M. Rodeh. Finding a minimum circuit in a graph. In STOC, 1977.
  • [56] J. Jaja. Introduction to Parallel Algorithms. Addison-Wesley Professional, 1992.
  • [57] S. W. Jun, A. Wright, S. Zhang, S. Xu, and Arvind. Bigsparse: High-performance external graph analytics. CoRR, abs/1710.07736, 2017.
  • [58] H. Kabir and K. Madduri. Parallel k-core decomposition on multicore platforms. In IPDPSW, May 2017.
  • [59] D. R. Karger, P. N. Klein, and R. E. Tarjan. A randomized linear-time algorithm to find minimum spanning trees. J. ACM, 42(2), Mar. 1995.
  • [60] R. M. Karp and V. Ramachandran. Handbook of theoretical computer science (vol. a). chapter Parallel Algorithms for Shared-memory Machines. MIT Press, Cambridge, MA, USA, 1990.
  • [61] R. M. Karp and A. Wigderson. A fast parallel algorithm for the maximal independent set problem. In STOC, 1984.
  • [62] J. Kim, W.-S. Han, S. Lee, K. Park, and H. Yu. OPT: A new framework for overlapped and parallel triangulation in large-scale graphs. In SIGMOD, 2014.
  • [63] R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in mapreduce and streaming. ACM Trans. Parallel Comput., 2(3), Sept. 2015.
  • [64] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, 2010.
  • [65] M. Latapy. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comput. Sci., 2008.
  • [66] C. E. Leiserson and T. B. Schardl. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In SPAA, 2010.
  • [67] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8), Apr. 2012.
  • [68] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In UAI, July 2010.
  • [69] M. Luby. A simple parallel algorithm for the maximal independent set problem. SIAM J. Comput., 1986.
  • [70] S. Maass, C. Min, S. Kashyap, W. Kang, M. Kumar, and T. Kim. Mosaic: Processing a trillion-edge graph on a single machine. In EuroSys, 2017.
  • [71] S. Maleki, D. Nguyen, A. Lenharth, M. Garzarán, D. Padua, and K. Pingali. DSMR: A parallel algorithm for single-source shortest path problem. In ICS, 2016.
  • [72] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, 2010.
  • [73] Y. Maon, B. Schieber, and U. Vishkin. Parallel ear decomposition search (EDS) and st-numbering in graphs. Theoretical Computer Science, 47, 1986.
  • [74] D. W. Matula and L. L. Beck. Smallest-last ordering and clustering and graph coloring algorithms. J. ACM, 30(3), July 1983.
  • [75] R. R. McCune, T. Weninger, and G. Madey. Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Comput. Surv., 48(2), Oct. 2015.
  • [76] W. McLendon III, B. Hendrickson, S. J. Plimpton, and L. Rauchwerger. Finding strongly connected components in distributed graphs. Journal of Parallel and Distributed Computing, 65(8), 2005.
  • [77] R. Meusel, S. Vigna, O. Lehmberg, and C. Bizer. The graph structure in the web–analyzed on different aggregation levels. The Journal of Web Science, 1(1), 2015.
  • [78] U. Meyer and P. Sanders. Parallel shortest path for arbitrary graphs. In Euro-Par, 2000.
  • [79] U. Meyer and P. Sanders. Δ-stepping: a parallelizable shortest path algorithm. J. Algorithms, 49(1), 2003.
  • [80] G. L. Miller, R. Peng, and S. C. Xu. Parallel graph decompositions using random shifts. In SPAA, 2013.
  • [81] G. L. Miller and V. Ramachandran. A new graph triconnectivity algorithm and its parallelization. Combinatorica, 12(1), Mar 1992.
  • [82] L. Nai, Y. Xia, I. G. Tanase, H. Kim, and C.-Y. Lin. GraphBIG: Understanding graph computing in the context of industrial solutions. In SC, 2015.
  • [83] D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In SOSP, 2013.
  • [84] S. Nobari, T.-T. Cao, P. Karras, and S. Bressan. Scalable parallel minimum spanning forest computation. In PPoPP, 2012.
  • [85] M. Ortmann and U. Brandes. Triangle listing algorithms: Back from the diversion. In ALENEX, 2014.
  • [86] V. Osipov, P. Sanders, and J. Singler. The filter-kruskal minimum spanning tree algorithm. In ALENEX, 2009.
  • [87] R. Pagh and F. Silvestri. The input/output complexity of triangle enumeration. In PODS, 2014.
  • [88] M. Patwary, P. Refsnes, and F. Manne. Multi-core spanning forest algorithms using the disjoint-set data structure. In IPDPS, 2012.
  • [89] S. Pettie and V. Ramachandran. A randomized time-work optimal parallel algorithm for finding a minimum spanning forest. SIAM J. Comput., 31(6), 2002.
  • [90] C. A. Phillips. Parallel graph contraction. In SPAA, 1989.
  • [91] C. K. Poon and V. Ramachandran. A randomized linear work EREW PRAM algorithm to find a minimum spanning forest. In ISAAC, 1997.
  • [92] S. Rajagopalan and V. V. Vazirani. Primal-dual approximation algorithms for set cover and covering integer programs. SIAM J. Comput., 28(2), Feb. 1999.
  • [93] V. Ramachandran. A framework for parallel graph algorithm design. In Optimal Algorithms, 1989.
  • [94] V. Ramachandran. Parallel open ear decomposition with applications to graph biconnectivity and triconnectivity. In Synthesis of Parallel Algorithms, 1993.
  • [95] J. Reif. Optimal parallel algorithms for integer sorting and graph connectivity. TR-08-85, Harvard University, 1985.
  • [96] A. E. Sariyuce, C. Seshadhri, and A. Pinar. Parallel local algorithms for core, truss, and nucleus decompositions. arXiv preprint arXiv:1704.00386, 2017.
  • [97] T. Schank. Algorithmic aspects of triangle-based network analysis. PhD Thesis, Universitat Karlsruhe, 2007.
  • [98] T. Schank and D. Wagner. Finding, counting and listing all triangles in large graphs, an experimental study. In WEA, 2005.
  • [99] W. Schudy. Finding strongly connected components in parallel using reachability queries. In SPAA, 2008.
  • [100] S. B. Seidman. Network structure and minimum degree. Soc. Networks, 5(3), 1983.
  • [101] M. Sevenich, S. Hong, A. Welc, and H. Chafi. Fast in-memory triangle listing for large real-world graphs. In Workshop on Social Network Mining and Analysis, 2014.
  • [102] X. Shi, Z. Zheng, Y. Zhou, H. Jin, L. He, B. Liu, and Q.-S. Hua. Graph processing on GPUs: A survey. ACM Comput. Surv., 50(6), Jan. 2018.
  • [103] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. J. Algorithms, 1982.
  • [104] J. Shun and G. E. Blelloch. Ligra: A lightweight graph processing framework for shared memory. In PPoPP, 2013.
  • [105] J. Shun, G. E. Blelloch, J. T. Fineman, and P. B. Gibbons. Reducing contention through priority updates. In SPAA, 2013.
  • [106] J. Shun, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, A. Kyrola, H. V. Simhadri, and K. Tangwongsan. Brief announcement: the problem based benchmark suite. In SPAA, 2012.
  • [107] J. Shun, L. Dhulipala, and G. E. Blelloch. A simple and practical linear-work parallel algorithm for connectivity. In SPAA, 2014.
  • [108] J. Shun, L. Dhulipala, and G. E. Blelloch. Smaller and faster: Parallel processing of compressed graphs with Ligra+. In DCC, 2015.
  • [109] J. Shun and K. Tangwongsan. Multicore triangle computations without tuning. In ICDE, 2015.
  • [110] G. M. Slota and K. Madduri. Simple parallel biconnectivity algorithms for multicore platforms. In HiPC, 2014.
  • [111] G. M. Slota, S. Rajamanickam, and K. Madduri. BFS and coloring-based parallel algorithms for strongly connected components and related problems. In IPDPS, 2014.
  • [112] G. M. Slota, S. Rajamanickam, and K. Madduri. Supercomputing for Web Graph Analytics. Apr 2015.
  • [113] G. M. Slota, S. Rajamanickam, and K. Madduri. A case study of complex graph analysis in distributed memory: Implementation and optimization. In IPDPS, 2016.
  • [114] S. Stergiou, D. Rughwani, and K. Tsioutsiouliklis. Shortcutting label propagation for distributed connected components. In WSDM, 2018.
  • [115] R. E. Tarjan and U. Vishkin. An efficient parallel biconnectivity algorithm. SIAM Journal on Computing, 1985.
  • [116] D. J. Welsh and M. B. Powell. An upper bound for the chromatic number of a graph and its application to timetabling problems. The Computer Journal, 1967.
  • [117] D. Yan, Y. Bu, Y. Tian, and A. Deshpande. Big graph analytics platforms. Foundations and Trends in Databases, 7, 2017.
  • [118] W. Zhou. A practical scalable shared-memory parallel algorithm for computing minimum spanning trees. Master’s thesis, KIT, 2017.

Appendix A Algorithmic Details

In this section we give self-contained descriptions of all of the theoretically efficient algorithms implemented in our benchmark. We also include implementation details for algorithms implemented in prior papers that we did not significantly change. The pseudocode for many of the algorithms makes use of the edgeMap and vertexMap primitives, as well as test-and-set, fetch-and-add, and priority-write. All of these primitives are defined in Section 3.

Shortest Path Problems. Although work-efficient polylogarithmic-depth algorithms for single-source shortest paths (SSSP) type problems are not known due to the transitive-closure bottleneck [60], work-efficient algorithms that run in depth proportional to the diameter of the graph are known for the special cases considered in our benchmark. Several work-efficient parallel breadth-first search algorithms are known [11, 66, 18]. On weighted graphs with integral edge weights, SSSP can be solved in O(m) expected work and O(diam(G) log n) depth w.h.p. [36]. Parallel algorithms also exist for weighted graphs with positive edge weights [79, 78]. SSSP on graphs with negative integer edge weights can be solved using Bellman-Ford [33], where the number of iterations depends on the diameter of the graph. Betweenness centrality from a single source can be computed using two breadth-first searches [27, 104].

In this paper, we present implementations of the four SSSP problems that are based on graph search. Our implementations of BFS, Bellman-Ford, and betweenness centrality are based on the implementations in Ligra [104]. Algorithm 1 shows pseudocode for a frontier-based BFS implementation which synchronizes after each round of the BFS. The algorithm runs in O(m) work and O(diam(G) log n) depth on the TS-MT-RAM, as vertices use test-and-set to non-deterministically acquire unvisited neighbors on the next frontier. Algorithm 2 shows pseudocode for a frontier-based version of Bellman-Ford which uses a priority-write to write the minimum distance to a vertex on each round, and runs in O(diam(G) · m) work and O(diam(G) log n) depth on the PW-MT-RAM. Pseudocode for our betweenness centrality implementation is shown in Algorithm 3. It uses fetch-and-add to compute the total number of shortest paths through vertices, and runs in O(m) work and O(diam(G) log n) depth on the FA-MT-RAM. Algorithm 4 shows pseudocode for our weighted BFS implementation from Julienne [36]. The algorithm runs in O(m) work in expectation and O(diam(G) log n) depth w.h.p. on the PW-MT-RAM, as vertices use priority-write to write the minimum distance to a neighboring vertex on each round. The algorithm uses a bucketing structure and a generalized version of edgeMap called edgeMapData, which are discussed in the Julienne paper. The main change we made to these algorithms was to improve the cache efficiency of the edgeMap implementation using the block-based version of edgeMap, described in Section B.
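Before the pseudocode, the following is a minimal sequential model, written by us for exposition, of the frontier-based edgeMap pattern that Algorithms 1–4 below instantiate; the update/cond signatures are simplified and are not the framework's exact interface.

def edge_map(adj, frontier, update, cond):
    """Apply update(s, d) over out-edges of the frontier; return the targets
    for which cond(d) held and update(s, d) succeeded (the next frontier)."""
    nxt = set()
    for s in frontier:
        for d in adj[s]:
            if cond(d) and update(s, d):
                nxt.add(d)
    return nxt

def bfs_parents(adj, src):
    parents = {v: -1 for v in adj}
    parents[src] = src

    def cond(d):
        return parents[d] == -1

    def update(s, d):        # done with a test-and-set in the parallel code
        parents[d] = s
        return True

    frontier = {src}
    while frontier:
        frontier = edge_map(adj, frontier, update, cond)
    return parents

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
assert bfs_parents(adj, 0) == {0: 0, 1: 0, 2: 0, 3: 1}

In the parallel implementations below, the sequential check-then-write in update is replaced by an atomic test-and-set, fetch-and-add, or priority-write, which is the only place where the algorithms differ from this sequential model.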

1: Visited = {0, …, 0}, Parents = {−1, …, −1} ▷ arrays of length n
2: procedure Cond(v) return Visited[v] == 0
3: procedure Update(s, d)
4:     if (Visited[d] == 0 && TestAndSet(&Visited[d])) then
5:          Parents[d] = s
6:          return 1
7:     return 0
8: procedure BFS(G(V, E), src)
9:     Visited[src] = 1, Parents[src] = src, Frontier = {src}
10:     while (|Frontier| > 0) do
11:          Frontier = edgeMap(G, Frontier, Update, Cond)
12:     return Parents
Algorithm 1 Breadth-First Search
1: SP = {∞, …, ∞}, Changed = {0, …, 0} ▷ arrays of length n
2: procedure Cond(v) return true
3: procedure ResetFlags(v) Changed[v] = 0
4: procedure Update(s, d, w)
5:     if (SP[s] + w < SP[d]) then
6:          writeMin(&SP[d], SP[s] + w)
7:          if (TestAndSet(&Changed[d])) then return 1
8:     return 0
9: procedure BellmanFord(G(V, E, w), src)
10:     SP[src] = 0, Frontier = {src}
11:     while (|Frontier| > 0) do
12:          Frontier = edgeMap(G, Frontier, Update, Cond)
13:          vertexMap(Frontier, ResetFlags)
14:     return SP
Algorithm 2 Bellman-Ford
1: NumPaths = {0, …, 0}, Dependencies = {0.0, …, 0.0}
2: Visited = {0, …, 0}, Levels = [ ]
3: procedure Cond(v) return Visited[v] == 0
4: procedure SetVisited(v) Visited[v] = 1; return 1
5: procedure PathUpdate(s, d)
6:     old = fetchAndAdd(&NumPaths[d], NumPaths[s])
7:     return (old == 0) ▷ add each vertex to the next frontier only once
8: procedure DependencyUpdate(s, d)
9:     fetchAndAdd(&Dependencies[d], (NumPaths[d]/NumPaths[s]) · (1 + Dependencies[s]))
10:     return 1
11: procedure BC(G(V, E), src)
12:     NumPaths[src] = 1, Visited[src] = 1, Frontier = {src}
13:     while (|Frontier| > 0) do ▷ forward phase: count shortest paths
14:          Levels.append(Frontier)
15:          Frontier = edgeMap(G, Frontier, PathUpdate, Cond)
16:          vertexMap(Frontier, SetVisited)
17:     Visited = {0, …, 0}, i = |Levels| − 1
18:     while (i ≥ 0) do ▷ backward phase: accumulate dependencies
19:          Frontier = Levels[i]
20:          vertexMap(Frontier, SetVisited)
21:          edgeMap(G^T, Frontier, DependencyUpdate, Cond)
22:          i = i − 1
23:     return Dependencies
Algorithm 3 Betweenness Centrality
1: SP = {∞, …, ∞}, Flags = {0, …, 0}
2: procedure GetBucketNum(v) return SP[v]
3: procedure Cond(v) return true
4: procedure Update(s, d, w)
5:     newDist = SP[s] + w
6:     if (newDist < SP[d]) then
7:          if (writeMin(&SP[d], newDist) && TestAndSet(&Flags[d])) then return d
8:     return ⊥ ▷ d does not move buckets
9:     
10: procedure Reset(v)
11:     Flags[v] = 0
12:     return GetBucketNum(v) ▷ the bucket that v should move to
13: procedure wBFS(G(V, E, w), src)
14:     SP[src] = 0
15:     B = makeBuckets(n, GetBucketNum, increasing)
16:     while ((id, bkt) = B.nextBucket(); bkt ≠ nullBucket) do
17:          Moved = edgeMapData(G, bkt, Update, Cond)
18:          NewBuckets = vertexMap(Moved, Reset)
19:          B.updateBuckets(NewBuckets, |Moved|)
20:     return SP
Algorithm 4 wBFS
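To make the bucketing idea concrete, the following is a sequential Dial-style model of wBFS, written by us for exposition; it assumes positive integer edge weights and a simple dictionary of buckets rather than Julienne's bucketing structure.

def wbfs(adj, src):
    """adj: dict v -> list of (neighbor, weight), with integer weights >= 1."""
    INF = float('inf')
    dist = {v: INF for v in adj}
    dist[src] = 0
    buckets = {0: {src}}                 # bucket index = tentative distance
    while buckets:
        cur = min(buckets)               # lowest non-empty bucket
        frontier = buckets.pop(cur)
        for s in frontier:
            if dist[s] != cur:           # stale entry: s settled in an earlier bucket
                continue
            for d, w in adj[s]:
                nd = cur + w
                if nd < dist[d]:
                    dist[d] = nd
                    buckets.setdefault(nd, set()).add(d)
    return dist

adj = {0: [(1, 2), (2, 5)], 1: [(2, 1)], 2: []}
assert wbfs(adj, 0) == {0: 0, 1: 2, 2: 3}

A vertex may be inserted into several buckets as its tentative distance improves; only the copy in the bucket matching its current distance is processed, which mirrors the role of updateBuckets in Algorithm 4.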

Low-Diameter Decomposition. Low-diameter decompositions (LDD) were first introduced in the context of distributed computing [7], and were later used in metric embedding, linear-system solvers, and parallel algorithms. Awerbuch presents a simple sequential algorithm based on ball growing that computes a (β, O((log n)/β)) decomposition [7]. Miller, Peng, and Xu (MPX) present a work-efficient parallel algorithm that computes a (β, O((log n)/β)) decomposition [80]. For each vertex v, the algorithm draws a shift value, δ_v, from an exponential distribution with parameter β. The clustering is done by assigning each vertex u to the center v which minimizes d(u, v) − δ_v. This algorithm can be implemented by running a set of parallel breadth-first searches, where the initial breadth-first search starts at the vertex with the largest shift value, δ_max, and the breadth-first search from each other vertex v starts once δ_max − δ_v steps have elapsed.

In this paper, we present an implementation of the MPX algorithm (Algorithm 5) which computes a (β, O((log n)/β)) decomposition in O(m) expected work and O(log^2 n) depth w.h.p. on the TS-MT-RAM. Our implementation is based on the non-deterministic LDD implementation from Shun et al. [107]. The main change in our implementation is that we separate the LDD code from the connectivity implementation. We report a more detailed comparison between our new implementation and the prior code in Section 6.

Algorithm 5 shows the modified version of the Miller-Peng-Xu algorithm from [107], which computes a (β, O((log n)/β)) decomposition in O(m) expected work and O(log^2 n) depth w.h.p. on the TS-MT-RAM. The algorithm allows ties to be broken arbitrarily when two searches visit a vertex in the same time step, and one can show that this only affects the number of cut edges by a constant factor [107]. The algorithm first draws n independent shift values δ_v from the exponential distribution with parameter β (Line 1). Next, it computes the start time of each vertex, which is the difference between the maximum shift value δ_max and δ_v (Line 2). Initially, all vertices are unvisited (Line 3). The algorithm performs ball-growing while not all of the vertices have been visited (Line 10). On Line 11, it updates the current frontier with any vertices that are ready to start and have not yet been visited, and updates the number of visited vertices with the size of the current frontier (Line 12). Finally, on Line 13, it uses edgeMap to traverse the out-edges of the current frontier and non-deterministically acquire unvisited neighboring vertices.

1: δ = {δ_v ∼ Exp(β) : v ∈ V} ▷ n independent shift values
2: S = {δ_max − δ_v : v ∈ V} ▷ start times, where δ_max = max_v δ_v
3: Visited = {0, …, 0}, Cluster = {⊥, …, ⊥}
4: procedure Update(s, d)
5:     if (TestAndSet(&Visited[d])) then Cluster[d] = Cluster[s]; return 1
6:     return 0
7: procedure LDD(G(V, E), β)
8:     numVisited = 0, round = 0
9:     Frontier = {}
10:     while (numVisited < n) do
11:          Frontier = Frontier ∪ {v : Visited[v] == 0 and S[v] ≤ round} ▷ new centers; set Visited[v] = 1 and Cluster[v] = v
12:          numVisited = numVisited + |Frontier|
13:          Frontier = edgeMap(G, Frontier, Update)
14:          round = round + 1
15:     return Cluster
Algorithm 5 Low Diameter Decomposition
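The following is a sequential sketch of the shifted-start ball growing in Algorithm 5, written by us for exposition; the names (ldd, cluster, start) are illustrative, and the linear per-round scan for new centers is a simplification of the parallel implementation.

import random
from math import floor

def ldd(adj, beta, seed=None):
    """Cluster an undirected graph (dict: vertex -> iterable of neighbors)
    using exponentially shifted start times, as in Algorithm 5."""
    rng = random.Random(seed)
    delta = {v: rng.expovariate(beta) for v in adj}
    delta_max = max(delta.values())
    start = {v: floor(delta_max - delta[v]) for v in adj}   # integer start times
    cluster = {v: None for v in adj}
    frontier, num_visited, rnd = set(), 0, 0
    while num_visited < len(adj):
        # Vertices whose start time has arrived and that are still unclustered
        # become centers of new clusters.
        for v in adj:
            if cluster[v] is None and start[v] <= rnd:
                cluster[v] = v
                frontier.add(v)
        num_visited += len(frontier)
        nxt = set()
        for s in frontier:                       # one synchronous BFS round
            for d in adj[s]:
                if cluster[d] is None:           # a test-and-set in the parallel code
                    cluster[d] = cluster[s]
                    nxt.add(d)
        frontier, rnd = nxt, rnd + 1
    return cluster

Smaller values of beta produce fewer, larger clusters (and fewer cut edges), at the cost of larger cluster diameters, matching the (β, O((log n)/β)) trade-off described above.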

Connectivity. Connectivity can be solved sequentially in linear work using breadth-first or depth-first search. Parallel algorithms for connectivity have a long history; we refer readers to [107] for a review of the literature. Early work on parallel connectivity discovered many natural algorithms which perform O(m log n) work [103, 8, 95, 90]. A number of optimal parallel connectivity algorithms were discovered in subsequent years [42, 29, 48, 49, 91, 89, 107], but to the best of our knowledge the recent algorithm by Shun et al. is the only linear-work polylogarithmic-depth parallel algorithm that has been studied experimentally [107].

Algorithm 6 shows the connectivity algorithm from Shun et al. [107], which runs in O(m) expected work and O(log^3 n) depth w.h.p. on the TS-MT-RAM. The implementation uses the work-efficient algorithm for low-diameter decomposition (LDD) [80], which we discuss in Section 4. One change we made to the implementation from [107] was to separate the LDD and contraction steps from the connectivity algorithm. Refactoring these sub-routines allowed us to express the connectivity algorithm in Ligra in about 50 lines of code.

The connectivity algorithm from Shun et al. [107] takes as input an undirected graph G and a parameter β. We first run the LDD algorithm (Line 2), which decomposes the graph into clusters, each with diameter O((log n)/β), with βm inter-cluster edges in expectation. Next, we build G' by contracting each cluster to a single vertex and adding the inter-cluster edges, while removing duplicate edges and isolated vertices (Line 3). We then check whether the contracted graph consists of only isolated vertices (Line 4); if so, the clusters are the components, and we return the mapping from vertices to clusters (Line 5). Otherwise, we recurse on the contracted graph (Line 7) and return the connectivity labeling produced by assigning each vertex the label assigned to its cluster in the recursive call (Line 9).

1: procedure Connectivity(G(V, E), β)
2:     L = LDD(G, β)
3:     G'(V', E') = Contract(G, L)
4:     if (|E'| == 0) then
5:          return L
6:     else
7:          L' = Connectivity(G', β)
8:          L'' = {v ↦ L'[L[v]] : v ∈ V}
9:          return L''
Algorithm 6 Connectivity
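The following is a sequential sketch of Algorithm 6, written by us for exposition; it reuses the ldd() sketch given after Algorithm 5, keeps isolated cluster vertices in the contracted graph (a simplification), and terminates with probability 1 because each level of recursion uses fresh randomness.

def contract(adj, cluster):
    # Build the contracted graph: one vertex per cluster, one undirected edge
    # per pair of clusters joined by at least one inter-cluster edge.
    cadj = {c: set() for c in set(cluster.values())}
    for u, nbrs in adj.items():
        for v in nbrs:
            if cluster[u] != cluster[v]:
                cadj[cluster[u]].add(cluster[v])
                cadj[cluster[v]].add(cluster[u])
    return cadj

def connectivity(adj, beta=0.2):
    labels = ldd(adj, beta)
    cadj = contract(adj, labels)
    if all(len(nbrs) == 0 for nbrs in cadj.values()):
        return labels                               # clusters are the components
    clabels = connectivity(cadj, beta)
    return {v: clabels[labels[v]] for v in adj}     # relabel through the recursion

adj = {0: [1], 1: [0], 2: [3], 3: [2], 4: []}
comp = connectivity(adj)
assert comp[0] == comp[1] and comp[2] == comp[3] and comp[0] != comp[2]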

Biconnectivity. Algorithm 7 shows the Tarjan-Vishkin biconnectivity algorithm. We describe our implementation of this algorithm in more detail in Section 4. We first compute a spanning forest F of G using the work-efficient connectivity algorithm, where the trees in the forest can be rooted arbitrarily (Line 2). Next, we compute a preorder numbering, PN, with respect to the roots (Line 3). We then compute Low(v) and High(v) for each vertex v, which are the minimum and maximum preorder numbers, respectively, over the endpoints of all non-tree edges (u, w) where u is a vertex in v's subtree. We also compute Size(v), the size of each vertex's subtree. Note that since the vertices in v's subtree have consecutive preorder numbers, we can determine whether the parent of a vertex v is an articulation point by comparing Low(v) and High(v) against the preorder interval [PN(v), PN(v) + Size(v)) covered by v's subtree. As in [13], we refer to this set of tree edges (p(v), v), where p(v) is an articulation point, as critical edges. The last step of the algorithm is to solve connectivity on the graph with all critical edges removed. Now, the biconnectivity label of an edge is the connectivity label of the endpoint that is further from the root of the tree. The query data structure can thus report biconnectivity labels of edges in O(1) time using O(n) space, which is important for our implementations, as storing a biconnectivity label per edge explicitly would require a prohibitive amount of memory for large graphs. As the most costly step in this algorithm is to run connectivity, the algorithm runs in O(m) work in expectation and O(log^3 n) depth w.h.p. Our implementation of the Tarjan-Vishkin algorithm runs in the same work but O(max(diam(G) log n, log^3 n)) depth w.h.p., as it computes a spanning tree using BFS and performs leaffix and rootfix computations on this tree.

1: procedure Biconnectivity(G(V, E))
2:     F = SpanningForest(G) ▷ trees in F are rooted arbitrarily
3:     PN = PreorderNumber(F)
4:     For each v ∈ V, compute Low(v), High(v), and Size(v)
5:     C = {(p(v), v) ∈ F s.t. p(v) is an articulation point} ▷ critical edges
6:     L = CC(G \ C)
7:     return (F, L) ▷ sufficient to answer biconnectivity queries
Algorithm 7 Biconnectivity
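The following is a small sketch, written by us for exposition, of the constant-time edge query described above. It assumes that the per-vertex connectivity labels (computed on the graph with critical edges removed) and the depth of each vertex in the rooted spanning tree are already available; the concrete label values in the example are illustrative placeholders for what the connectivity step would produce.

def make_biconnectivity_query(depth, cc_label):
    def edge_label(u, v):
        # The label of an edge is the label of its endpoint farther from the
        # root; when the depths are equal, either endpoint gives the same answer.
        farther = u if depth[u] > depth[v] else v
        return cc_label[farther]
    return edge_label

# Two triangles {0,1,2} and {2,3,4} sharing the articulation point 2,
# with the spanning tree rooted at 0.
depth    = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2}
cc_label = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'}   # assumed output of the CC step
query = make_biconnectivity_query(depth, cc_label)
assert query(0, 1) == query(1, 2) == 'A'
assert query(2, 3) == query(3, 4) == 'B'

Only two words per vertex are stored (a depth and a label), which is why the query structure uses O(n) space even though the number of edges can be far larger.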

Strongly Connected Components. The Blelloch et al. algorithm for SCCs is shown in Algorithm 8. We refer the reader to Section 6.2 of [20] for proofs of correctness and its work and depth bounds. The algorithm is similar in spirit to randomized quicksort. On Line 2 we randomly permute the vertices and assign them to batches whose sizes grow geometrically. On Line 3 we set the initial label for all vertices to ⊥ and mark all vertices as not done. Now, we process the batches one at a time. For each batch, we compute the set of centers, which are the vertices in the batch that are not yet done (Line 5). The next step calls MarkReachable from the centers on both the graph and the transposed graph (Lines 6–7). MarkReachable takes the set of centers and uses a breadth-first search to compute the reachability sets, which for a center