Families of Distributed Memory Parallel Graph Algorithms from Self-Stabilizing Kernels – An SSSP Case Study
Abstract
Self-stabilizing algorithms are important because of their robustness and guaranteed convergence: starting from any arbitrary state, a self-stabilizing algorithm is guaranteed to converge to a legitimate state. However, these algorithms are not directly amenable to solving distributed graph processing problems when performance and scalability are important. In this paper, we present the “Abstract Graph Machine” (AGM) model, which can be used to convert self-stabilizing algorithms into forms suitable for distributed graph processing. An AGM is a mathematical model of parallel computation on graphs that adds work dependency and ordering to self-stabilizing algorithms. Using the AGM model we show that some of the existing distributed Single Source Shortest Path (SSSP) algorithms are actually specializations of self-stabilizing SSSP. We extend the AGM model to apply more fine-grained orderings at different spatial levels to derive additional scalable variants of SSSP algorithms, essentially enabling the algorithm to be generated for a specific target architecture. Experimental results show that this approach can generate new algorithmic variants that outperform standard distributed algorithms for SSSP.
I Introduction
Most existing parallel algorithms are developed based on the Parallel Random Access Machine (PRAM) [1] model. While PRAM is a simple machine model, it does not account for factors that are significant in distributed memory systems (e.g., synchronization overhead, remote message communication). Further, algorithms designed for PRAM may assume that global data structures and subgraph computations are efficient, so their cost is not counted toward algorithm performance. While these algorithms perform well on shared memory systems, they tend to perform poorly on distributed memory systems.
Self-stabilizing graph algorithms rely on local information to solve graph problems. In self-stabilizing algorithms every vertex in the graph is associated with a state. Whenever there is a state change, neighboring vertices are notified via edges. Self-stabilizing algorithms do not use global data structures and do not rely on operations such as subgraph computations (operations that are expensive in a distributed environment). The fact that self-stabilizing algorithms rely only on local information and do not assume global data structures motivates us to investigate their applicability to large scale static graph processing.
A self-stabilizing algorithm consists of a set of rules. Every rule has a condition, and a rule's action is executed only if its condition evaluates to true. A self-stabilizing algorithm reaches a legitimate state irrespective of its initial state.
Self-stabilizing graph algorithms have been introduced for a number of graph problems, including Single Source Shortest Path, Breadth First Search (BFS), Spanning Tree Construction, Maximal Independent Set and Graph Coloring. A survey of self-stabilizing graph algorithms is presented in [2].
The strategy for updating vertex states is defined by an execution model. In self-stabilization, those execution models are called demons. We find three types of demons in self-stabilizing algorithms: the central demon, the synchronous demon and the distributed demon. Under a central demon, only one vertex can update its state at a time. While a synchronous demon updates all vertex states at the same time, a distributed demon selects a subset of vertices to update states at the same time. Since our main focus is to reduce global synchronization and to rely on “local” data for processing, we only consider distributed demon algorithms.
A distributed demon self-stabilizing Single Source Shortest Path algorithm is presented in Algorithm 1 (discussed in [3]). At stabilization this algorithm holds the minimum distance to every vertex from a given source vertex. Self-stabilizing algorithms are described using rules. A rule consists of a condition and an action; when the condition evaluates to true, the action is invoked. Rules in Algorithm 1 have the format: current state → new state.
Note that in the algorithm, d(v) represents the distance state of vertex v and N(v) stands for the set of all neighbours of vertex v. The pre-assigned weight of an edge (u, v) is denoted by w(u, v).
The algorithm consists of two rules. The first rule is only applicable to the source: if the source “distance state” is not 0, then the “distance state” should be set to 0. The second rule is activated only if the current distance of the vertex is not equal to the minimum, over its neighbours, of the neighbour's distance plus the weight of the edge connecting the neighbour. So, the legitimate state of the system is d(s) = 0 (s is the given source) and d(v) = min{ d(u) + w(u, v) : u ∈ N(v) } for every other vertex v. [3] proves that this algorithm ultimately stabilizes and that, at stabilization, d(v) represents the shortest distance from the source for each vertex v.
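The two rules can be read as a chaotic relaxation loop. A minimal Python sketch (our own illustrative names, not code from the paper) that repeatedly applies Rule 1 and Rule 2 until no state changes:

```python
def self_stabilizing_sssp(graph, weight, source):
    """Sequentially apply the two rules until the state stabilizes.

    `graph` maps each vertex to its neighbor list and `weight[(u, v)]`
    holds the pre-assigned edge weight; both names are illustrative.
    """
    INF = float('inf')
    d = {v: INF for v in graph}   # one (arbitrary) starting state
    changed = True
    while changed:                # loop until no rule fires
        changed = False
        for v in graph:
            if v == source:
                if d[v] != 0:     # Rule 1: source distance must be 0
                    d[v] = 0
                    changed = True
            else:
                # Rule 2: d(v) must equal the minimum over neighbors u
                # of d(u) + w(u, v)
                m = min((d[u] + weight[(u, v)] for u in graph[v]),
                        default=INF)
                if d[v] != m:
                    d[v] = m
                    changed = True
    return d
```

At stabilization no rule's condition holds, matching the legitimate state described above; a distributed demon would instead fire the rules concurrently on subsets of vertices.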
Another important aspect of self-stabilizing algorithms is the atomicity requirement. Algorithm 1 requires querying vertex v's state and its neighbors' states, and updating vertex v's state (if the rule evaluates to true for vertex v), in a single atomic step. In self-stabilization, the requirement to query the current vertex state and its neighbours' states and update the current vertex state in a single atomic step is called composite atomicity. Read-write atomicity is another form of atomicity, discussed in detail in [4]. In this paper we consider self-stabilizing algorithms with composite atomicity.
Using self-stabilizing algorithms for static graph analysis is challenging for several reasons: 1. Self-stabilizing algorithms are designed for streaming contexts, in which they do not terminate after stabilizing; 2. To implement the composite atomicity requirement we need to lock the current vertex as well as its neighbors. This generates many synchronization regions and involves a significant number of locking/unlocking operations, reducing performance, especially in distributed execution; 3. A naive implementation of self-stabilizing SSSP would require the program to iterate through all the vertices and apply Rules 1 and 2 until they reach the legitimate state. When processing a large graph, such an implementation would generate a massive amount of unnecessary work and would perform very poorly.
To overcome these challenges, we model the self-stabilizing SSSP Algorithm 1 using an Abstract Graph Machine (AGM) [5]. An AGM is a model for designing distributed memory parallel graph algorithms. An AGM essentially converts the self-stabilizing algorithm into a stabilizing algorithm by adding work dependency, and uses ordering to reduce the amount of work. A stabilizing algorithm starts from a specific initial state, whereas a self-stabilizing algorithm can execute from any initial state. The modeled stabilizing algorithm is in a form suitable for distributed graph processing. We also show that some of the existing well-known distributed SSSP algorithms are specific versions of the modeled self-stabilizing algorithm in Algorithm 1.
We further enhance the AGM model to incorporate architecture-dependent spatial characteristics to generate less synchronized orderings. The extended model is called the Extended AGM (EAGM). Using the EAGM we generate nine variations of SSSP algorithms and compare their performance to standard distributed SSSP algorithms on three different types of graph inputs. Our results show that some of the generated algorithm variations outperform standard distributed SSSP algorithms.
II Self-Stabilizing SSSP & AGM
In the following we discuss several important aspects of modeling the self-stabilizing SSSP algorithm using the AGM. More specifically, our discussion is focused on termination, composite atomicity and ordering.
Self-stabilizing algorithms are not iterative algorithms, and neither are AGM algorithms. While self-stabilizing algorithms do not terminate (they go to an idle state once the states have stabilized), AGM algorithms terminate at stabilization. Algorithm termination depends on the amount of active work available in the system. Our implementations use standard termination detection algorithms to count the active work available in the system. As long as there are state changes, there will be active work available. When the active work reaches zero, there are no more state changes, which guarantees that the algorithm state has stabilized.
Under composite atomicity, Algorithm 1 needs to query states from neighbors, query the current state, and update the current vertex state in a single atomic step. This requires synchronization between neighboring vertices. However, for the SSSP algorithm in Algorithm 1, the synchronization requirement can be alleviated due to the monotonic function “min”. In Rule 2, neighbors' states are processed by the “min” function, which selects the minimum distance + weight value over all neighbors, irrespective of the order in which states are pushed in. Therefore we only need to maintain the state of the current vertex.
The AGM model makes state transitions based on work events. We define a unit of work to be a pair ⟨v, s⟩, where v ∈ V and s is the state associated with v. We call a unit of work a work item, and we denote the set of all possible work items generated by an algorithm by W. In the AGM model, every time a state is updated, new work items are generated.
Before processing, new work items are ordered. In an AGM, the ordering is defined as a strict weak ordering, which divides work items into different equivalence classes.
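In practice such a strict weak ordering can be represented by a key function: two work items are comparable exactly when their keys differ, so work items with equal keys form an equivalence class. A small Python sketch (a hypothetical helper, not from the paper):

```python
from itertools import groupby

def equivalence_classes(work_items, key):
    """Partition work items into the ordered equivalence classes
    induced by a strict weak ordering, where w1 precedes w2 iff
    key(w1) < key(w2).  Items with equal keys are mutually
    incomparable and therefore share a class."""
    ordered = sorted(work_items, key=key)
    return [list(cls) for _, cls in groupby(ordered, key=key)]
```

For example, keying vertex-distance pairs by their distance groups equal-distance work items into one class and orders the classes by increasing distance.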
III Abstract Graph Machine (AGM)
An Abstract Graph Machine (AGM) consists of a definition of a work item, an initial work item set, a set of states, a processing function and a strict weak ordering relation. The processing function takes a work item as input and may produce zero or more work items. Further, the processing function may change values associated with states. The strict weak ordering relation orders work items into a set of ordered (induced) equivalence classes.
The AGM model denotes a work item as a tuple. A tuple's first element stores a vertex and the remaining positions store state and ordering attribute values. For example, the self-stabilizing SSSP algorithm stores a vertex and a distance in a work item. The values are populated by the processing function. The size of the tuple (i.e., the number of additional elements) is determined by the states in the algorithm and the ordering attributes used in the AGM formulation.
To access tuple elements we use the bracket operator; e.g., if w ∈ W and w = ⟨v, d⟩, then w[0] = v and w[1] = d. The data (i.e., tuple elements) are read by the processing function. After reading values, the processing function can change the states associated with the vertex in the work item.
An AGM maintains state values as mappings (functions). The domain of a state mapping is the vertex set V. The codomain depends on the possible values that can be held in the state. For example, in the self-stabilizing SSSP algorithm the state mapping is distance : V → Distance. In AGM terminology, accessing the state value associated with a vertex (or edge) “v” is denoted as “mapping_name[v]” (e.g., distance[v]).
In addition to state mappings, algorithms also use read-only properties. These properties usually hold graph data and are interpreted as functions, e.g., edge weights (weight). In the abstraction, we treat these read-only properties as part of the graph definition. In terms of syntax, when reading values, the AGM model does not distinguish between read-only properties and state mappings.
Both state mappings and properties are accessed within a processing function; the processing function only reads from properties, but it may modify state mappings while processing a work item. The processing function updates the state value associated with a vertex in a single atomic step.
The logic inside the processing function is analogous to the code that runs indefinitely on every process in a self-stabilizing algorithm. However, in the AGM model a single process may run multiple instances of the processing function (e.g., in a distributed shared memory model). The logic inside the processing function is based on the rules defined for the self-stabilizing algorithm, but is modified to work with work items. A processing function (π) takes a work item and may produce more work items based on the logic defined inside π. In our formalism we treat states as side effects, in the sense that they are not passed as explicit arguments but are subject to change when executing π. We denote the set of states by Q and the powerset of a set A by P(A). Then, mathematically, π is declared as π : W → P(W).
The processing function defines the basic logic of an algorithm. It consists of a set of statements. A statement specifies a condition based on the input work item and/or states (s_c), an update to states (s_u), and how output work items should be constructed (s_g). The s_u and s_g of a statement are evaluated only if its condition s_c evaluates to true. We distinguish between s_u and s_g, since s_u has side effects (it updates states) while s_g does not create any side effects.
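For the SSSP processing function defined later, a single statement suffices. The following Python sketch (our own naming, not the paper's code) marks the condition, the state-updating side effect and the work-item generation:

```python
def processing_function(w, distance, graph, weight):
    """Single-statement SSSP processing function sketch.

    `w` is a work item <v, d>.  If the carried distance improves on
    the stored state, the state mapping is updated as a side effect
    and one new work item is emitted per neighbor; otherwise no work
    is produced.
    """
    v, d = w
    if d < distance.get(v, float('inf')):   # condition (s_c)
        distance[v] = d                     # state update (s_u)
        # work-item generation (s_g)
        return [(u, d + weight[(v, u)]) for u in graph[v]]
    return []
```

Note how the update is a side effect on the `distance` mapping while the generated work items are the function's return value, mirroring the s_u / s_g distinction.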
An abstract version of π is formally defined in Definition 1.
Definition 1
π(w) = { w′ ∈ W : s_c(w) ∧ s_g(w, w′) }, with the state update s_u applied as a side effect whenever s_c(w) evaluates to true;
where s_c is the statement's condition, s_u is the state update, and s_g determines how an output work item w′ is constructed from w.
The output work items of a processing function are ordered according to the strict weak ordering defined on W. The ordered work items are again fed into the processing function in the order in which they appear. The interaction between the processing function and the ordering is depicted graphically in Figure 1.
The strict weak ordering relation (denoted by ≺) must satisfy the following properties:
1. For all w ∈ W, ¬(w ≺ w).
2. For all w1, w2 ∈ W, if w1 ≺ w2, then ¬(w2 ≺ w1).
3. For all w1, w2, w3 ∈ W, if w1 ≺ w2 and w2 ≺ w3, then w1 ≺ w3.
4. For all w1, w2, w3 ∈ W, if w1 is not comparable with w2 and w2 is not comparable with w3, then w1 is not comparable with w3.
Properties 1 and 2 state that the strict weak ordering relation is irreflexive and asymmetric. Property 3 denotes the transitivity of comparable work items, and Property 4 states that transitivity is preserved among non-comparable elements of the set. These properties give rise to an equivalence relation (i.e., non-comparable work items belong to the same equivalence class) defined on W; hence the strict weak ordering relation partitions the complete set W. Since work items in different equivalence classes are comparable, the strict weak ordering relation defined on W induces an ordering on the generated equivalence classes. In general, there are several ways to define the induced ordering relation; for our work we use the definition given in Definition 2.
Definition 2
≺_eq is a binary relation defined on the equivalence classes induced by ≺, such that if C1 and C2 are equivalence classes, then C1 ≺_eq C2 iff w1 ≺ w2 for some w1 ∈ C1 and w2 ∈ C2.
Having defined all supporting concepts we now give the definition of an AGM in Definition 3.
Definition 3
An Abstract Graph Machine (AGM) is a 6-tuple (G, W, Q, π, ≺, S), where

G = (V, E) is the input graph,

W ⊆ V × P_1 × … × P_k is the work item set, where each P_i represents a state value or an ordering attribute,

Q – the set of states, represented as mappings,

π : W → P(W) is the processing function,

≺ – the strict weak ordering relation defined on work items,

S (⊆ W) – the initial work item set.
In the following we give the semantics of an AGM. An AGM starts execution with the initial work item set, which is ordered according to the strict weak ordering relation. Next, the work items within the smallest equivalence class are fed to the processing function. If the processing function generates new work items, they are separated into equivalence classes based on the strict weak ordering relation. The work items within a single equivalence class can execute the processing function in parallel, but work items in two different equivalence classes must be ordered according to the induced relation. Executing the work items in an equivalence class may generate new work items for the same equivalence class or for an equivalence class greater (as per the induced relation) than the one currently being processed. The AGM executes work items in the next equivalence class once it has finished executing all the work items in the current smallest equivalence class. An AGM terminates when it has executed all the work items in all the equivalence classes.
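These semantics can be sketched as a sequential driver loop. The Python below (a simplification under our own names; a real AGM runtime processes each class in parallel across machines) represents the strict weak ordering by a bucket key and assumes, as holds for the SSSP orderings in this paper, that new work never falls into an already-drained class:

```python
def run_agm(graph, weight, source, bucket_key):
    """AGM execution sketch: work items are partitioned into
    equivalence classes by `bucket_key` (the strict weak ordering),
    classes are processed in induced order, and items within a class
    are processed in any order.  Terminates when all classes drain."""
    distance = {}
    buckets = {}                      # class key -> list of work items
    def push(w):
        buckets.setdefault(bucket_key(w), []).append(w)
    push((source, 0))                 # the initial set S
    while buckets:
        k = min(buckets)              # smallest equivalence class
        current = buckets.pop(k)
        while current:                # any order within a class
            v, d = current.pop()
            if d < distance.get(v, float('inf')):
                distance[v] = d       # state update
                for u in graph[v]:
                    w = (u, d + weight[(v, u)])
                    if bucket_key(w) == k:
                        current.append(w)   # same class
                    else:
                        push(w)             # a later class
    return distance
```

With `bucket_key=lambda w: w[1]` this behaves like a distance-ordered execution; with a constant key it degenerates to a single chaotic class.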
The AGM has been used to model algorithms for graph applications such as SSSP, BFS, PageRank and Connected Components. The processing functions and orderings used in those algorithms are discussed in detail in [5].
III-A SSSP Algorithms in AGM
In this subsection, we build the AGM model for Algorithm 1. We also show that adding different orderings to the modeled algorithm reveals the behaviours of existing distributed SSSP algorithms.
The SSSP algorithms discussed in this paper can be formulated using a single statement. For brevity, we use the following notation to represent the processing function.
Notation 1
π : w → { w′ : s_c(w) } [s_u]
In Notation 1, w′ is a new work item generated from w. As discussed previously, s_u and s_c represent the state update and the condition.
To build the AGM for the self-stabilizing Algorithm 1, we need to define each tuple element in Definition 3 (i.e., (G, W, Q, π, ≺, S)). As the input graph we use G = (V, E, weight), where weight is a read-only property map that holds the weights associated with edges. As explained in Section II, the work item set is defined based on the vertex state. For Algorithm 1, the state we are interested in is the distance from the source vertex. Therefore, we define W ⊆ V × Distance, where Distance ⊆ ℝ≥0. The only state the AGM needs is the distance from the source vertex, and it is represented as the mapping distance. Values for the distance state are assigned when the processing function is executed with work items.
The AGM processing function is defined based on the rules in Algorithm 1 and given in Definition 4. Since Rule 1 is only applied to the source vertex, we can move it into the initial work item set in the AGM representation. Rule 2 is encoded into the processing function in the format defined in Definition 1. The definition of the processing function uses a helper function called neighbors (declared as neighbors : V → P(V)), which operates on graph vertices.
Definition 4
π : ⟨v, d⟩ → { ⟨u, d + weight(v, u)⟩ : u ∈ neighbors(v) ∧ d < distance[v] } [distance[v] ← d]
The next required parameter for the AGM model is the strict weak ordering relation (≺). If work items are not ordered in any form, then we have a single large equivalence class of generated work items. We obtain a single large equivalence class by defining the strict weak ordering relation as in Definition 5.
Definition 5
≺_ch is a binary relation defined on W where w1 ≺_ch w2 = false for all w1, w2 ∈ W.
Basically, the binary relation does not divide work items into comparable equivalence classes. Using the strict weak ordering defined in Definition 5, we present the AGM model for Algorithm 1 in Proposition 1.
Proposition 1
The self-stabilizing SSSP Algorithm 1 is an instance of AGM where:

G = (V, E, weight) is the input graph,

W ⊆ V × Distance,

Q = {distance} is the state mapping,

π is the processing function given in Definition 4,

strict weak ordering relation ≺ = ≺_ch,

S = {⟨v_s, 0⟩}, where v_s ∈ V is the source vertex.
Since the AGM presented in Proposition 1 does not perform any ordering on work items, we call it the Chaotic AGM. However, the ordering of work items can be improved by defining strict weak ordering relations that generate smaller comparable equivalence classes. There are numerous ways to define such strict weak orderings. Some of them yield existing distributed SSSP algorithms, such as Dijkstra's SSSP algorithm [6], the Δ-stepping SSSP algorithm [7] and the KLA SSSP algorithm [8]. All those algorithms share almost the same processing function but use different orderings on work items.
III-A1 Dijkstra’s Algorithm
Dijkstra’s SSSP algorithm is the classic work-efficient SSSP algorithm. It globally orders vertices by their associated distances, and the shortest-distance vertices are processed first. In the following, we define the ordering relation for Dijkstra’s algorithm and instantiate the algorithm using the Abstract Graph Machine.
Definition 6
≺_dj is a binary relation defined on W as follows: let w1, w2 ∈ W; then w1 ≺_dj w2 iff w1[1] < w2[1].
It can be proved that ≺_dj is a strict weak ordering satisfying the properties listed above (the proof is omitted to save space). The AGM instantiation for Dijkstra’s algorithm is given in Proposition 2.
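On a finite sample of work items, the four strict-weak-ordering properties can be checked exhaustively. A small Python sketch of such a check (our own helper, not part of the paper's formalism):

```python
from itertools import product

def is_strict_weak_ordering(items, prec):
    """Exhaustively check the four strict-weak-ordering properties
    on a finite sample, where prec(a, b) means a precedes b."""
    incomp = lambda a, b: not prec(a, b) and not prec(b, a)
    for a in items:
        if prec(a, a):                                    # irreflexive
            return False
    for a, b in product(items, repeat=2):
        if prec(a, b) and prec(b, a):                     # asymmetric
            return False
    for a, b, c in product(items, repeat=3):
        if prec(a, b) and prec(b, c) and not prec(a, c):  # transitive
            return False
        if incomp(a, b) and incomp(b, c) and not incomp(a, c):
            return False          # incomparability is also transitive
    return True
```

Comparing work items by their distance attribute, as the Dijkstra relation does, passes this check, while a non-strict comparison such as `<=` fails the irreflexivity test.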
Proposition 2
Dijkstra’s Algorithm is an instance of AGM where:

G = (V, E, weight) is the input graph,

W ⊆ V × Distance,

Q = {distance} is the state mapping,

π is the processing function given in Definition 4,

strict weak ordering relation ≺ = ≺_dj,

S = {⟨v_s, 0⟩}, where v_s ∈ V is the source vertex.
III-A2 Δ-Stepping Algorithm
The Δ-stepping SSSP algorithm [7] arranges vertex-distance pairs into distance ranges (buckets) of size Δ and executes the buckets in order. Within a bucket, vertex-distance pairs are not ordered and can be executed in any order. Processing a bucket may produce extra work for the same bucket or for successive buckets. The strict weak ordering relation for the Δ-stepping algorithm is given in Definition 7.
Definition 7
≺_Δ is a binary relation defined on W as follows: let w1, w2 ∈ W; then w1 ≺_Δ w2 iff ⌊w1[1] / Δ⌋ < ⌊w2[1] / Δ⌋.
The instantiation of the Δ-stepping algorithm in the AGM is the same as in Proposition 2, except that the strict weak ordering relation is ≺_Δ.
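The Δ-stepping relation reduces to comparing bucket indices ⌊d/Δ⌋. A Python sketch with hypothetical names:

```python
def delta_key(w, delta):
    """Equivalence class of a work item <v, d> under the
    delta-stepping ordering: distances in [i*delta, (i+1)*delta)
    share bucket i and are mutually incomparable."""
    return int(w[1] // delta)

def delta_precedes(w1, w2, delta):
    """w1 precedes w2 iff w1's bucket index is smaller."""
    return delta_key(w1, delta) < delta_key(w2, delta)
```

Larger Δ values merge more distances into each bucket, producing fewer, larger equivalence classes and thus fewer global synchronizations.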
III-A3 KLA SSSP Algorithm
The k-level asynchronous (KLA) paradigm [8] bridges the level-synchronous and asynchronous paradigms for processing graphs. In the level-synchronous approach (e.g., level-synchronous BFS), vertices are processed level by level. KLA processes vertices up to k levels asynchronously and then moves to the next k levels.
In the following we model KLA SSSP with the AGM. The KLA approach orders work items by their level in the resulting tree. Therefore, we need an additional ordering attribute in the work item definition. The KLA work item is defined as w ∈ W ⊆ V × Distance × Level, where Level ⊆ ℕ.
Further, the processing function needs to be altered to populate the value of the new ordering attribute. The processing function for KLA SSSP is defined in Definition 8. It generates work items with an updated level, while changing the distance state appropriately.
Definition 8
π : ⟨v, d, l⟩ → { ⟨u, d + weight(v, u), l + 1⟩ : u ∈ neighbors(v) ∧ d < distance[v] } [distance[v] ← d]
KLA SSSP orders work items based on the level. Work items within k consecutive levels can be executed in parallel, and work items not within k consecutive levels must be ordered. The strict weak ordering relation for KLA SSSP is given in Definition 9.
Definition 9
≺_kla is a binary relation defined on W as follows: let w1, w2 ∈ W; then w1 ≺_kla w2 iff ⌊w1[2] / k⌋ < ⌊w2[2] / k⌋.
The AGM formulation for the KLA SSSP algorithm is given in Proposition 3.
Proposition 3
The KLA SSSP Algorithm is an instance of AGM where:

G = (V, E, weight) is the input graph,

W ⊆ V × Distance × Level,

Q = {distance} is the state mapping,

π is the processing function given in Definition 8,

strict weak ordering relation ≺ = ≺_kla,

S = {⟨v_s, 0, 0⟩}, where v_s ∈ V is the source vertex and the level starts at 0.
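The KLA formulation differs from the Δ-stepping one only in the extra level attribute: the processing function propagates level l + 1 and the ordering compares ⌊l/k⌋. A Python sketch under assumed names:

```python
def kla_processing_function(w, distance, graph, weight):
    """KLA SSSP processing function sketch: the work item <v, d, l>
    carries a level attribute used only for ordering; every emitted
    work item gets level l + 1."""
    v, d, level = w
    if d < distance.get(v, float('inf')):   # condition
        distance[v] = d                     # state update
        return [(u, d + weight[(v, u)], level + 1) for u in graph[v]]
    return []

def kla_key(w, k):
    """Work items within k consecutive levels share an equivalence
    class: w1 precedes w2 iff floor(l1 / k) < floor(l2 / k)."""
    return w[2] // k
```

With k = 1 this degenerates to level-synchronous execution; larger k merges more levels into one asynchronously processed class.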
Each AGM discussed above divides W into equivalence classes differently (Figure 2). The Dijkstra AGM (Figure 2(a)) generates one equivalence class per distinct distance value: work items that have the same distance belong to the same equivalence class, while work items with different distances go into different equivalence classes. The Δ-stepping AGM (Figure 2(b)) also orders based on distance, but its equivalence classes are generated based on a Δ value and, in general, have more elements than those of the Dijkstra AGM. All the work items within a single equivalence class are guaranteed to have distances between i·Δ and (i+1)·Δ for some i. Similar to Δ-stepping, KLA, too, generates larger equivalence classes, but uses the level as the ordering attribute. Figure 2(c) shows how the KLA algorithm arranges equivalence classes. As shown in Figure 2(d), the Chaotic version has a single large equivalence class containing all the work items.
Each of the above algorithms differs in the way it generates equivalence classes and in how those equivalence classes are ordered. Otherwise, they tend to share the same processing function, which implements Rule 2 in Algorithm 1. Further, if two SSSP algorithms share the same ordering attribute, then they share the same processing function in the AGM model; for example, the Dijkstra and Δ-stepping AGMs share the same processing function. If two algorithms use different ordering attributes, then their processing functions differ only in updating the values of the different ordering attributes.
IV Extended AGM
Ordering in terms of distance reduces the amount of redundant work in SSSP algorithms. Of the algorithms discussed in the previous section, Dijkstra's algorithm performs the strictest ordering, so it does the minimum amount of redundant work. However, the overhead of ordering in Dijkstra's algorithm is significant in a parallel distributed runtime due to frequent synchronization: when an AGM instance generates more equivalence classes, the global synchronization overhead increases. The other algorithms discussed above reduce the overhead of ordering by chunking work items into larger equivalence classes, which reduces the number of equivalence classes generated. The Extended AGM also adds ordering to these chunks, but, to avoid the overhead of ordering at the AGM level, it applies the ordering at lower spatial levels.
A spatial level defines the amount of memory accessible when ordering work items. The highest spatial level is the accumulation of all the memory of the participating nodes (also called global memory). The next spatial level is the memory available at a node. The memory in a single node can be further subdivided into logical regions such as NUMA domains, and each NUMA domain may be shared by several threads. The lowest spatial level is the thread-local memory. Such a spatial division is highly architecture-dependent and hierarchical.
The EAGM depicts a spatial hierarchy as a tree (Figure 3). Every node in the spatial hierarchy has an ordering attached. The ordering attached to the root represents the ordering defined by the relevant AGM. The example given in Figure 3 is an EAGM hierarchy derived from the Δ-stepping AGM; as shown in Figure 3, the ≺_Δ ordering (Definition 7) is attached to the root node. The orderings attached to the lower spatial levels are applied to the work items available in the memory of the relevant spatial level. Therefore, the orderings attached to lower spatial levels are more relaxed than the ordering attached to the root. By default the EAGM spatial hierarchy assumes the Chaotic (i.e., no ordering) ordering, but specific orderings can be enforced. The example in Figure 3 enforces a strict weak ordering at the NUMA level and uses Chaotic orderings at the other levels.
An Extended AGM is an AGM, except that instead of a single strict weak ordering relation, an EAGM has a spatial hierarchy with annotated orderings. An EAGM extends an AGM if and only if the EAGM generates the same equivalence classes as the AGM at the root level of the spatial hierarchy. Therefore, an AGM can have multiple EAGMs, where each EAGM has the same ordering as the AGM at the root but different orderings at lower spatial levels. Each EAGM represents a variation of the algorithm modeled by the relevant AGM. If an AGM generates equivalence classes with many work items, then the EAGM has provision to perform fine-grained ordering at the different spatial levels. However, if the AGM ordering is such that it generates equivalence classes with few work items, then the derived EAGMs have less opportunity to perform ordering at the spatial levels. For example, the Δ-stepping AGM generates equivalence classes with many work items (provided Δ is sufficiently large); variations of the Δ-stepping AGM can be generated by applying ordering to work items at the process level, the NUMA level or the thread level. However, for the Dijkstra AGM, spatial orderings may not improve the overall performance of the algorithm, because the equivalence classes generated by the Dijkstra AGM have fewer work items on average.
Of the algorithms discussed in the previous section, fine-grained spatial ordering is effective for the AGMs defined for Δ-stepping, KLA and Chaotic. Using the spatial hierarchy of Figure 3, we apply Dijkstra's strict weak ordering relation (Definition 6) at the spatial hierarchy levels PROCESS, NUMA and THREAD to derive EAGMs (Figure 4). Each EAGM generates a variation of the main algorithm defined by its corresponding AGM.
Figure 4(a) shows the EAGM variations derived for the Δ-stepping algorithm: Figure 4(a)(i) applies the ordering at the THREAD level, Figure 4(a)(ii) at the NUMA level and Figure 4(a)(iii) at the PROCESS level. EAGMs for KLA and Chaotic are derived in the same way.
For convenience, we will refer to the original AGM implementation as buffer, the variation that does THREAD level ordering as threadq, the variation that does PROCESS level ordering as nodeq and the variation that does NUMA level ordering as numaq.
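As a concrete illustration, the threadq variation of Δ-stepping keeps the AGM's Δ-bucket ordering at the root of the hierarchy while each thread orders its share of the current bucket in a private priority queue, so the finer ordering needs no synchronization. A Python sketch (our own simplification, not the paper's C++ runtime):

```python
import heapq

class ThreadqDelta:
    """Sketch of the `threadq' EAGM variation: the root of the
    spatial hierarchy keeps the delta-stepping buckets (the AGM
    ordering), while each thread orders its portion of a bucket in
    a thread-local heap (Dijkstra ordering, nothing shared)."""

    def __init__(self, delta, num_threads):
        self.delta = delta
        self.local = [[] for _ in range(num_threads)]  # one heap per thread

    def bucket(self, w):
        """Root-level ordering: the delta-stepping bucket index."""
        return int(w[1] // self.delta)

    def push(self, thread_id, w):
        """Thread-local ordering: keyed by distance, no locking."""
        heapq.heappush(self.local[thread_id], (w[1], w))

    def pop(self, thread_id):
        return heapq.heappop(self.local[thread_id])[1]
```

The nodeq and numaq variations would instead share one concurrent priority queue per process or per NUMA domain, trading stronger ordering for synchronization on the shared queue.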
V Implementation
We implemented the EAGMs shown in Figure 4; each implementation generates a variation of the main algorithm. For the implementation we used a lightweight active-messaging framework based on MPI. To represent the local graph data we used the compressed sparse row format, and a 1D distribution is used to distribute the graph. States and read-only mappings are maintained as property maps indexed by vertices or edges; the property maps are also distributed. We used concurrent priority queues (the flat-combining synchronous priority queue [9]) for node-level and NUMA-level ordering.
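For reference, the compressed sparse row layout stores each vertex's out-edges contiguously in flat arrays. A minimal Python sketch (the paper's implementation is C++; the names here are ours):

```python
def build_csr(num_vertices, edges):
    """Build a compressed sparse row representation: `row_start[v]`
    indexes into the flat `targets`/`weights` arrays, so the
    out-edges of v are the slice row_start[v]:row_start[v + 1]."""
    adj = [[] for _ in range(num_vertices)]
    for u, v, w in edges:
        adj[u].append((v, w))
    row_start, targets, weights = [0], [], []
    for out in adj:
        for v, w in out:
            targets.append(v)
            weights.append(w)
        row_start.append(len(targets))
    return row_start, targets, weights
```

With a 1D distribution, each rank would hold such a structure for its owned vertices only, with remote endpoints reached via active messages.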
VI Results
We evaluated the performance of the EAGM variations on synthetic and real-world graphs. For synthetic graphs we used:

RMAT2: graphs generated based on the proposed Graph500 [12] SSSP benchmark specification, with the benchmark's RMAT parameters and uniformly random edge weights.
For real-world graphs, we used four graphs with varying characteristics from the SNAP [13] repository, with uniformly random edge weights.
All our experiments were carried out on a Cray XE6/XK7 supercomputer with 32 AMD Opteron “Abu Dhabi” cores and 64 GB of memory per node (4 NUMA domains), running the Cray Linux Environment and the cray-mpich2 ver. 7.2.5 MPI implementation. The runtime architecture we used aligns with the spatial hierarchy used in our EAGM implementations (Figure 4).
VI-A Scaling Results
We ran weak scaling experiments on Graph500 graphs from scale 19 (2^19 vertices) to scale 31. We compare each AGM algorithm (i.e., each main SSSP algorithm) to its EAGM variations (see Section IV). The implementation of the main AGM algorithm is denoted buffer for each algorithm (Δ-stepping, KLA and Chaotic). The thread-level, node-level and NUMA-level ordering variations are denoted threadq, nodeq and numaq, respectively.
VI-A1 Δ-Stepping Variations
In Fig. 5 we present weak scaling results for the Δ-stepping variations on RMAT1 and RMAT2 synthetic graphs with Δ values 3, 5 and 7. The original Δ-stepping (buffer) algorithm performs best in-node: since no communication is involved, the additional ordering provided by the other implementations does not provide sufficient benefit for its overhead.
In general, the threadq variation is the fastest in the distributed setting. The nodeq and numaq variations perform better with increasing Δ, but they are not competitive with the buffer implementation.
In summary, while in-node performance is dominated by the traditional Δ-stepping algorithm (the buffer implementation), distributed execution shows significant improvement with the threadq variation. Though the numaq and nodeq variations should outperform the threadq variation by providing more ordering, the overhead of concurrent ordering reduces their performance.
VI-A2 KLA Variations
KLA variations show different performance characteristics than Δ-stepping. Weak scaling results for the KLA implementations on RMAT1 and RMAT2 graphs are shown in Fig. 6. For KLA, the nodeq and numaq variations perform best at scale for the smallest k value. At greater k values, the performance of threadq is comparable to nodeq and numaq, but in absolute terms the performance at higher k values is worse than at the smallest k. As noted earlier, numaq and nodeq provide the best potential ordering by ordering the most items. Their overheads are kept at bay because, at the smallest k, all the writes to the next level's queue occur before all the reads; the flat-combining queue [9] that we use performs best in this scenario. For higher k values, writes and reads become more mixed, and the advantage of numaq and nodeq is less pronounced (when k is higher, more work items go into the queues and the concurrent-ordering overhead becomes significant). In KLA, for both RMAT1 and RMAT2 inputs, all EAGM variations (threadq, nodeq and numaq) perform better than the original KLA algorithm (buffer).
TABLE I: Execution times (mean ± deviation) of the AGM/EAGM variations on real-world graphs. For each graph, the three rows correspond to the Δ-stepping, KLA, and Chaotic AGM variations.

Graph           |V|      |E|        Diam.  Cores  AGM         buffer         threadq        nodeq           numaq
SOCLive [14]    4847571  68993773   16     64     Δ-stepping  22.7 ± 6.47    20.52 ± 5.8    14.93 ± 2.39    29.8 ± 3.87
                                                  KLA         90.39 ± 5.78   38.66 ± 4.33   25.08 ± 1.94    35.96 ± 2.55
                                                  Chaotic     39.82 ± 6.98   11.66 ± 0.63   166.26 ± 15.42  207.42 ± 22.21
WikiTalk [15]   2394385  5021410    9      64     Δ-stepping  2.27 ± 0.28    2.26 ± 0.57    3.44 ± 0.28     9.71 ± 0.5
                                                  KLA         3.73 ± 0.44    2.53 ± 0.34    1.94 ± 0.23     3.9 ± 0.4
                                                  Chaotic     41.46 ± 6.54   1.41 ± 0.05    8.34 ± 1.29     5.97 ± 0.78
roadNetCA [14]  1965206  2766607    849    1024   Δ-stepping  24.53 ± 6.49   22.28 ± 6.09   12.23 ± 1.2     19.18 ± 2.05
                                                  KLA         54.24 ± 6.14   54.63 ± 5.8    43.86 ± 5.18    51.35 ± 5.54
                                                  Chaotic     44.62 ± 7.4    2.68 ± 0.21    44.17 ± 2.08    23.76 ± 2.1
Orkut [16]      3072441  117185083  9      1024   Δ-stepping  3.72 ± 0.43    3.12 ± 0.4     4.29 ± 0.36     6.77 ± 0.3
                                                  KLA         20.94 ± 6.33   71.19 ± 23.5   15.87 ± 1.56    18.28 ± 2.12
                                                  Chaotic     64.84 ± 12.19  2.97 ± 1.14    56.81 ± 5.43    50.41 ± 3.99
VI-A3 Chaotic Variations
Figure 7 shows results for chaotic variations with RMAT1 and RMAT2 input graphs. The chaotic AGM has a single large equivalence class and does not perform any form of ordering. Due to work explosion, we were unable to run the chaotic algorithm except at small scales. For the same reason, both nodeq and numaq end up with larger queue sizes; hence the overhead of ordering becomes significant as we increase the scale. However, thread-level ordering shows good performance, especially in distributed execution. For RMAT2, threadq achieves almost perfect weak scaling. Furthermore, the threadq chaotic variation is faster than all other variations in terms of absolute performance, demonstrating how the structured (E)AGM approach can yield new, highly performant algorithms.
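The chaotic AGM's behavior can be sketched with a completely unordered work pool (a minimal sequential illustration under our naming, not our distributed implementation): any pending item may be processed next, and correctness follows solely from the self-stabilizing relaxation rule, while the lack of ordering is what causes work explosion.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Minimal sequential sketch of chaotic (unordered) SSSP relaxation.
struct Edge { std::size_t target; unsigned weight; };
using Graph = std::vector<std::vector<Edge>>;

std::vector<unsigned> chaotic_sssp(const Graph& g, std::size_t src) {
  const unsigned inf = std::numeric_limits<unsigned>::max();
  std::vector<unsigned> dist(g.size(), inf);
  std::vector<std::size_t> pool{src};   // unordered work pool
  dist[src] = 0;
  while (!pool.empty()) {
    std::size_t u = pool.back();        // any element would do; LIFO here
    pool.pop_back();
    for (const Edge& e : g[u]) {
      unsigned nd = dist[u] + e.weight;
      if (nd < dist[e.target]) {        // self-stabilizing relaxation rule
        dist[e.target] = nd;
        pool.push_back(e.target);       // re-examine the improved vertex
      }
    }
  }
  return dist;
}
```

With an adversarial processing order, a vertex can be relaxed many times with successively smaller distances, which is the redundant work that the EAGM orderings above are designed to suppress.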
VI-B Real-World Graphs
The real-world graphs used in our experiments are listed in Table I along with their characteristics and relevant results. Most of the real-world graphs are fairly small. For our experiments, we use either 64 or 1024 cores, depending on the size of the graph: SOC-LiveJournal [14] and WikiTalk [15] experiments run on 64 cores; California Road Network [14] and Orkut [16] experiments run on 1024 cores.
For LiveJournal, nodeq showed the best results among both the Δ-stepping variations and the KLA variations. Similar to the synthetic graph results, the chaotic variation shows its best results with threadq. KLA nodeq and chaotic threadq showed the best results on the WikiTalk graph. California Road Network has the highest diameter, with edge weights ranging from 0 to 100. Both Δ-stepping and KLA show good results with node-level ordering, and chaotic shows good results with threadq. For the Orkut graph, Δ-stepping shows its minimum timing values with the threadq implementation. Among chaotic variations, the threadq implementation continued to perform best for social-network graphs.
VII Related Work
SSSP is a classic example of an irregular application, and parallel algorithms for SSSP are well studied. Δ-stepping [7], KLA [8], Bellman-Ford [17], and Crauser's SSSP [18] are popular algorithms that address the distributed memory parallel SSSP problem.
A distributed SSSP algorithm connected to self-stabilization was studied in [19]. The same algorithm is discussed under different runtime characteristics in [20]. The algorithm discussed in [19] is an EAGM instantiation of the Chaotic AGM. Much of the work related to self-stabilization is already discussed in Section I.
AGM is an abstract model for designing distributed memory parallel graph algorithms. Most graph algorithms existing today are designed based on the PRAM architecture, in which individual processors read from and write to a single shared memory.
As discussed in Section I, PRAM algorithms suffer from performance issues in distributed memory settings. Bulk Synchronous Parallel (BSP) [21] is a model for designing distributed memory parallel algorithms. In BSP, execution proceeds in supersteps; each superstep consists of computation, communication, and barrier synchronization. There are also variations of BSP in which computation and communication are overlapped to improve performance, yet they still use barriers. BSP is essentially an extended version of PRAM, and hence graph algorithms designed for PRAM can be implemented in distributed settings using BSP. In addition, there are models that take network parameters and communication costs into account (e.g., LogP [22] and its descendants).
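The BSP superstep structure can be illustrated with a toy sequential simulation (names and the rotate-right computation are ours, purely for illustration): each "processor" computes on local state and queues messages, and all messages are delivered only at the barrier that ends the superstep.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy sequential simulation of BSP supersteps: p processors, each sending
// its local value to the right neighbor; messages queued in a superstep
// become visible only after the barrier.
std::vector<int> bsp_rotate(std::size_t p, int supersteps) {
  std::vector<int> local(p);
  std::vector<std::vector<int>> inbox(p), outbox(p);
  for (std::size_t i = 0; i < p; ++i) local[i] = static_cast<int>(i);

  for (int step = 0; step < supersteps; ++step) {
    // (1) computation: consume the previous superstep's messages, then send.
    for (std::size_t i = 0; i < p; ++i) {
      for (int m : inbox[i]) local[i] += m;
      inbox[i].clear();
      outbox[(i + 1) % p].push_back(local[i]);  // message to right neighbor
    }
    // (2) communication + (3) barrier: deliver all messages at once.
    for (std::size_t i = 0; i < p; ++i) {
      inbox[i] = std::move(outbox[i]);
      outbox[i].clear();
    }
  }
  return local;
}
```

The key property the sketch captures is that no message is observed within the superstep that sent it; this is the barrier cost that the EAGM's finer-grained spatial orderings try to avoid.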
Regarding spatial orderings, the Galois scheduler OBIM [23] considers spatial features when processing an irregular application, but it is an implementation for shared memory systems rather than an abstract model.
VIII Conclusion
Using the AGM abstraction, we converted Algorithm 1 to a form suitable for distributed memory parallel graph processing. We showed that existing distributed graph algorithms (Dijkstra's SSSP, Δ-stepping SSSP, and KLA SSSP) are variations of the converted algorithm. These algorithms essentially implement Rule 2 of Algorithm 1 with different orderings, based on the distance state or on a different ordering attribute such as level. We also showed that the proposed EAGM model can generate more fine-grained orderings at less synchronized spatial levels. Our experimental results showed that some of the generated algorithms outperform standard distributed memory parallel SSSP algorithms under different graph inputs.
Acknowledgment
This work is supported by the National Science Foundation under Grant No. 1319520. Further, this research was supported in part by Lilly Endowment, Inc., through its support for the Indiana University Pervasive Technology Institute, and in part by the Indiana METACyt Initiative. The Indiana METACyt Initiative at IU was also supported in part by Lilly Endowment, Inc.
References
 [1] S. Fortune and J. Wyllie, “Parallelism in random access machines,” in Proceedings of the tenth annual ACM symposium on Theory of computing. ACM, 1978, pp. 114–118.
 [2] N. Guellati and H. Kheddouci, “A survey on self-stabilizing algorithms for independence, domination, coloring, and matching in graphs,” Journal of Parallel and Distributed Computing, vol. 70, no. 4, pp. 406–415, 2010.
 [3] T. C. Huang and J.-C. Lin, “A self-stabilizing algorithm for the shortest path problem in a distributed system,” Computers & Mathematics with Applications, vol. 43, no. 1, pp. 103–109, 2002.
 [4] S. Dolev, A. Israeli, and S. Moran, “Self-stabilization of dynamic systems assuming only read/write atomicity,” Distributed Computing, vol. 7, no. 1, pp. 3–16, 1993.
 [5] T. A. Kanewala, M. Zalewski, and A. Lumsdaine, “Abstract graph machine,” arXiv preprint arXiv:1604.04772, 2016.
 [6] E. W. Dijkstra, “A Note on Two Problems in Connexion With Graphs,” Numerische mathematik, vol. 1, no. 1, pp. 269–271, 1959.
 [7] U. Meyer and P. Sanders, “Δ-Stepping: A Parallelizable Shortest Path Algorithm,” Journal of Algorithms, vol. 49, no. 1, pp. 114–152, 2003.
 [8] Harshvardhan, A. Fidel, N. M. Amato, and L. Rauchwerger, “KLA: A New Algorithmic Paradigm for Parallel Graph Computations,” in Proc. 23rd Internat. Conf. on Parallel Architectures and Compilation. ACM, 2014, pp. 27–38.
 [9] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir, “Flat combining and the synchronization-parallelism tradeoff,” in Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures. ACM, 2010, pp. 355–364.
 [10] R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang, “Introducing the graph 500 benchmark,” Cray User’s Group (CUG), 2010.
 [11] D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-MAT: A recursive model for graph mining,” in SDM, vol. 4. SIAM, 2004, pp. 442–446.
 [12] Graph 500 Contributors. (2016) Graph 500 benchmark 1 ("search"). [Online]. Available: http://www.cc.gatech.edu/~jriedy/tmp/graph500/
 [13] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, Jun. 2014.
 [14] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters,” Internet Mathematics, vol. 6, no. 1, pp. 29–123, 2009.
 [15] J. Leskovec, D. Huttenlocher, and J. Kleinberg, “Signed networks in social media,” in Proceedings of the SIGCHI conference on human factors in computing systems. ACM, 2010, pp. 1361–1370.
 [16] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.
 [17] R. Bellman, “On a Routing Problem,” DTIC Document, Tech. Rep., 1956.
 [18] A. Crauser, K. Mehlhorn, U. Meyer, and P. Sanders, “A Parallelization of Dijkstra’s Shortest Path Algorithm,” in Mathematical Foundations of Computer Science 1998. Springer, 1998, pp. 722–731.
 [19] M. Zalewski, T. A. Kanewala, J. S. Firoz, and A. Lumsdaine, “Distributed Control: Priority Scheduling for Single Source Shortest Paths Without Synchronization,” in Proc. of the Fourth Workshop on Irregular Applications: Architectures and Algorithms. IEEE, 2014, pp. 17–24.
 [20] J. S. Firoz, M. Barnas, M. Zalewski, and A. Lumsdaine, “Comparison of single source shortest path algorithms on two recent asynchronous many-task runtime systems.”
 [21] A. V. Gerbessiotis and L. G. Valiant, “Direct bulk-synchronous parallel algorithms,” Journal of Parallel and Distributed Computing, vol. 22, no. 2, pp. 251–267, 1994.
 [22] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. Von Eicken, “LogP: Towards a realistic model of parallel computation,” in ACM SIGPLAN Notices, vol. 28, no. 7. ACM, 1993, pp. 1–12.
 [23] A. Lenharth, D. Nguyen, and K. Pingali, “Priority Queues Are Not Good Concurrent Priority Schedulers,” The University of Texas at Austin, Department of Computer Sciences, Tech. Rep. TR1139, 2011.