Families of Distributed Memory Parallel Graph Algorithms from Self-Stabilizing Kernels–An SSSP Case Study

Thejaka Kanewala (1,2), Marcin Zalewski (1), Martina Barnas (2) and Andrew Lumsdaine (1)
(1) Pacific Northwest National Laboratory & University of Washington, Seattle, WA, USA.
(2) School of Informatics & Computing, Indiana University, Bloomington, IN, USA.
Email: (1) {thejaka.kanewala, marcin.zalewski, andrew.lumsdaine}@pnnl.gov, (2) {thejkane, mbarnas}@indiana.edu
Abstract

Self-stabilizing algorithms are important because of their robustness and guaranteed convergence: starting from any arbitrary state, a self-stabilizing algorithm is guaranteed to converge to a legitimate state. However, these algorithms are not directly amenable to solving distributed graph processing problems when performance and scalability are important. In this paper, we present the “Abstract Graph Machine” (AGM) model, which can be used to convert self-stabilizing algorithms into forms suitable for distributed graph processing. An AGM is a mathematical model of parallel computation on graphs that adds work dependency and ordering to self-stabilizing algorithms. Using the AGM model, we show that several existing distributed Single Source Shortest Path (SSSP) algorithms are in fact specializations of self-stabilizing SSSP. We extend the AGM model to apply more fine-grained orderings at different spatial levels and derive additional scalable variants of SSSP algorithms, essentially enabling an algorithm to be generated for a specific target architecture. Experimental results show that this approach can generate new algorithmic variants that out-perform standard distributed algorithms for SSSP.


I Introduction

Most existing parallel algorithms are developed based on the Parallel Random Access Machine (PRAM) [1] model. While PRAM is a simple machine model, it does not account for factors that are significant in distributed memory systems (e.g., synchronization overhead, remote message communication). Further, algorithms designed for PRAM may assume that global data structures and subgraph computations are efficient, and hence their cost is not counted toward algorithm performance. While these algorithms perform well on shared memory systems, they tend to perform poorly on distributed memory systems.

Self-stabilizing graph algorithms rely on local information to solve graph problems. In a self-stabilizing algorithm, every vertex in the graph is associated with a state. Whenever there is a state change, neighboring vertices are notified via edges. Self-stabilizing algorithms do not use global data structures and do not rely on operations such as subgraph computations (operations that are expensive in a distributed environment). The fact that self-stabilizing algorithms rely only on local information and do not assume global data structures motivates us to investigate the applicability of self-stabilizing graph algorithms to large-scale static graph processing.

A self-stabilizing algorithm consists of a set of rules. Every rule has a condition, and a rule's action is executed only if its condition evaluates to true. A self-stabilizing algorithm reaches a legitimate state irrespective of its initial state.

Self-stabilizing graph algorithms have been introduced for a number of graph applications, including Single Source Shortest Path (SSSP), Breadth First Search (BFS), Spanning Tree Construction, Maximal Independent Set, and Graph Coloring. A survey of self-stabilizing graph algorithms is presented in [2].

The strategy for updating vertex states is defined by an execution model. In self-stabilization, these execution models are called demons. Three types of demons appear in self-stabilizing algorithms: the central demon, the synchronous demon, and the distributed demon. Under a central demon, only one vertex can update its state at a time. While a synchronous demon updates all vertex states at the same time, a distributed demon selects a subset of vertices to update their states at the same time. Since our main focus is to reduce global synchronization and to rely on “local” data for processing, we only consider distributed demon algorithms.

A distributed demon self-stabilizing Single Source Shortest Path algorithm is presented in Algorithm 1 (discussed in [3]). At stabilization, this algorithm yields the minimum distance from a given source vertex to every vertex. Self-stabilizing algorithms are described using rules. A rule consists of a condition and an action; when the condition evaluates to true, the action is invoked. Rules in Algorithm 1 are written in the form: condition → new state.

1:  {For the source r}
2:  R0: d(r) ≠ 0 → d(r) := 0
3:  {For node v ≠ r}
4:  R1: d(v) ≠ min_{u ∈ N(v)} (d(u) + w(u, v)) → d(v) := min_{u ∈ N(v)} (d(u) + w(u, v))
Algorithm 1 Self-stabilizing SSSP for a distributed demon

Note that in the algorithm, d(v) represents the state of vertex v and N(v) stands for the set of all neighbours of vertex v. The pre-assigned weight of an edge (u, v) is denoted by w(u, v).

The algorithm consists of two rules. The first rule (R0) is only applicable to the source: if the source “distance state” is not 0, then the “distance state” should be set to 0. The second rule (R1) is activated only if the current distance of a vertex is not equal to the minimum, over its neighbours, of the neighbour's distance plus the weight of the edge connecting that neighbour. So the legitimate state of the system is d(r) = 0 and d(v) = min_{u ∈ N(v)} (d(u) + w(u, v)) for every v ≠ r (r is the given source), and [3] proves that this algorithm ultimately stabilizes and that, at stabilization, d(v) represents the shortest distance from the source for each vertex v.

Another important aspect of self-stabilizing algorithms is the atomicity requirement. Algorithm 1 requires querying vertex v's state and its neighbors' states, and updating vertex v's state (if the rule evaluates to true for vertex v), in a single atomic step. In self-stabilization, the requirement to query the current vertex state and its neighbours' states and to update the current vertex state in a single atomic step is called composite atomicity. Read-write atomicity is another form of atomicity, discussed in detail in [4]. In this paper we consider self-stabilizing algorithms with composite atomicity.

Using self-stabilizing algorithms for static graph analysis is challenging for several reasons: 1. Self-stabilizing algorithms are designed for streaming contexts in which the algorithms do not terminate after stabilizing; 2. Implementing the composite atomicity requirement requires locking the current vertex as well as its neighbors, which generates many synchronization regions and involves a significant number of locking/unlocking operations, and hence reduces performance, especially in distributed execution; 3. A naive implementation of the self-stabilizing SSSP would iterate through all the vertices and apply rules R0 and R1 until the legitimate state is reached (a sketch of such a loop is given below). When processing a large graph, such an implementation would generate a massive amount of unnecessary work and would perform very poorly.
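
To make the last point concrete, the following is a minimal sequential sketch of such a naive implementation; the graph representation and names are our own assumptions and are not the paper's code.

  #include <algorithm>
  #include <limits>
  #include <vector>

  struct Edge { int target; unsigned weight; };
  using Graph = std::vector<std::vector<Edge>>;   // adjacency list

  // Repeatedly sweep all vertices and apply rules R0 and R1 until no state
  // changes. The distance vector may start in an arbitrary state, as in the
  // self-stabilizing setting.
  void naive_self_stabilizing_sssp(const Graph& g, int source,
                                   std::vector<unsigned>& distance) {
    const unsigned inf = std::numeric_limits<unsigned>::max();
    bool changed = true;
    while (changed) {
      changed = false;
      for (int v = 0; v < (int)g.size(); ++v) {
        if (v == source) {                        // R0: d(r) != 0 -> d(r) := 0
          if (distance[v] != 0) { distance[v] = 0; changed = true; }
          continue;
        }
        // R1: d(v) != min over neighbours u of (d(u) + w(u, v))
        unsigned best = inf;
        for (const Edge& e : g[v])
          if (distance[e.target] != inf)
            best = std::min(best, distance[e.target] + e.weight);
        if (distance[v] != best) { distance[v] = best; changed = true; }
      }
    }
  }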

To overcome the challenges discussed above, we model the self-stabilizing SSSP Algorithm 1 using an Abstract Graph Machine (AGM) [5]. An AGM is a model for designing distributed memory parallel graph algorithms. An AGM essentially converts the self-stabilizing algorithm into a stabilizing algorithm by adding work dependency, and it uses ordering to reduce the amount of work. A stabilizing algorithm starts from a specific initial state, whereas a self-stabilizing algorithm can execute from any initial state. The modeled stabilizing algorithm is in a form suitable for distributed graph processing. We also show that some existing, well-known distributed SSSP algorithms are specific versions of the modeled self-stabilizing algorithm in Algorithm 1.

We further enhance the AGM model to incorporate architecture-dependent spatial characteristics and to generate less synchronized orderings. The extended model is called the Extended AGM (EAGM). Using the EAGM we generate nine variations of SSSP algorithms and compare their performance to standard distributed SSSP algorithms on three different types of graph inputs. Our results show that some of the generated algorithm variations perform better than standard distributed SSSP algorithms.

II Self-Stabilizing SSSP & AGM

In the following we discuss several important aspects of modeling the self-stabilizing SSSP algorithm with an AGM. More specifically, our discussion focuses on termination, composite atomicity, and ordering.

Neither self-stabilizing algorithms nor AGM algorithms are iterative algorithms. While self-stabilizing algorithms do not terminate (they go to an idle state once states are stabilized), AGM algorithms terminate at stabilization. Algorithm termination depends on the amount of active work available in the system. Our implementations use standard termination detection algorithms to count the active work available in the system. As long as there are state changes, there will be active work available. When the active work drops to zero, there are no more state changes, which guarantees that the algorithm state has stabilized.

Under composite atomicity, Algorithm 1 needs to query states from neighbors, query the current state, and update the current vertex state in a single atomic step. This requires synchronization between neighboring vertices. However, for the SSSP algorithm in Algorithm 1, the synchronization requirement can be alleviated due to the monotonic function “min”. In Rule R1, neighbors' states are processed by the “min” function. The “min” function selects the minimum distance + weight value over all neighbors, irrespective of the order in which states are pushed in. Therefore, we only need to maintain (and atomically update) the state of the current vertex.
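
The consequence can be seen in the following minimal sketch (our illustration, not the paper's runtime code): a neighbour's candidate value d(u) + w(u, v) can be applied to d(v) with a single atomic compare-and-swap on d(v) alone, so there is no need to lock v together with its neighbours.

  #include <atomic>

  // Because "min" is monotonic, a candidate distance for v can be folded into
  // d(v) atomically, without composite atomicity over v's neighbourhood.
  // Returns true if d(v) actually decreased.
  bool relax(std::atomic<unsigned>& dv, unsigned candidate) {
    unsigned current = dv.load(std::memory_order_relaxed);
    while (candidate < current) {
      if (dv.compare_exchange_weak(current, candidate,
                                   std::memory_order_relaxed))
        return true;   // d(v) updated; new work items may be generated
      // current was reloaded by compare_exchange_weak; retry if still larger
    }
    return false;      // d(v) was already <= candidate; no new work
  }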

The AGM model makes state transitions based on work events. We define a unit of work to be a pair <v, s>, where v is a vertex and s is the state associated with v. We call a unit of work a workitem, and we denote the set of all possible work items generated by an algorithm as WorkItems. In the AGM model, every time a state is updated, new work items are generated.

Before being processed, new work items are ordered. In an AGM, ordering is defined as a strict weak ordering. The strict weak ordering divides work items into different equivalence classes.

III Abstract Graph Machine (AGM)

An Abstract Graph Machine (AGM) consists of a definition of a workitem, an initial workitem set, a set of states, a processing function, and a strict weak ordering relation. The processing function takes a workitem as input and may produce zero or more work items. Further, the processing function may change values associated with states. The strict weak ordering relation orders work items into a set of ordered (induced) equivalence classes.

The AGM model denotes a workitem as a tuple. A tuple's first element stores a vertex, and the remaining positions store the state(s) and ordering attribute values. For example, the self-stabilizing SSSP algorithm stores a vertex and a distance in a workitem. The values are populated by the processing function. The size of the tuple (i.e., the number of additional elements) is determined by the states in the algorithm and the ordering attributes used in the AGM formulation.
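
For illustration, the workitem tuples used in this paper could be represented as follows (the C++ types and helper names are our own assumptions, not the paper's implementation).

  #include <cstdint>
  #include <tuple>

  using Vertex = std::uint64_t;
  using Distance = std::uint32_t;

  // SSSP workitem: <vertex, distance>.
  using SSSPWorkItem = std::tuple<Vertex, Distance>;

  // KLA SSSP (Section III-A3) adds an ordering attribute: <vertex, distance, level>.
  using Level = std::uint32_t;
  using KLAWorkItem = std::tuple<Vertex, Distance, Level>;

  // Accessing tuple elements corresponds to the bracket operator w[0], w[1], ...
  inline Vertex   vertex_of(const SSSPWorkItem& w)   { return std::get<0>(w); }
  inline Distance distance_of(const SSSPWorkItem& w) { return std::get<1>(w); }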

To access tuple elements we use the bracket operator; e.g., if w is a workitem and w = <v, s1, s2>, then w[0] = v, w[1] = s1, and w[2] = s2, etc. The data (i.e., tuple elements) are read by the processing function. After reading values, the processing function can change the states associated with the vertex in the workitem.

An AGM maintains state values as mappings (functions). The domain of a state mapping is the vertex set V (or the edge set E). The co-domain depends on the possible values that the state can hold. For example, in the self-stabilizing SSSP algorithm the state mapping is distance : V → Distance. In AGM terminology, accessing the state value associated with a vertex (or edge) “v” is denoted “mapping_name[v]” (e.g., distance[v]).

In addition to state mappings, algorithms also use read-only properties. These properties usually hold graph data and are also interpreted as functions, e.g., edge weights (weight : E → Weight). In the abstraction, we treat these read-only properties as part of the graph definition. In terms of the syntax used to read values, the AGM model does not distinguish between read-only properties and state mappings.

Both state mappings and properties are accessed within a processing function; the processing function only reads from the properties, whereas it may modify state mappings when processing a workitem. The processing function updates the state value associated with a vertex in a single atomic step.

The logic inside the processing function is analogous to the code that runs indefinitely on every process in a self-stabilizing algorithm. However, in the AGM model a single process may run multiple instances of the processing function (e.g., in a distributed shared memory model). The logic inside the processing function is based on the rules defined for the self-stabilizing algorithm, but is modified to work with work items. A processing function (pf) takes a workitem and may produce more work items based on the logic defined inside the pf. In our formalism we treat states as side effects, in the sense that they are not passed as explicit arguments but are subject to change when executing pf. We denote the set of states by Q and we use P(WorkItems) to denote the powerset of the set WorkItems. Then, mathematically, the pf is declared as pf : WorkItems → P(WorkItems).

The processing function defines the basic logic of an algorithm. It consists of a set of statements. A statement specifies a condition c based on the input workitem and/or the states Q, an update u to the states, and how the output work items should be constructed (the generation g). The action of a statement is evaluated only if its condition c evaluates to true. We distinguish between the state update u and the workitem generation g, since u has side effects (it updates states) while g does not create any side effects.

An abstract version of the processing function is formally defined in Definition 1.

Definition 1

pf(w) = st_1(w) ∪ st_2(w) ∪ … ∪ st_k(w)

where;

st_i(w) : c_i(w, Q) → ( u_i(Q), g_i(w) ), in which c_i is the condition of the i-th statement, u_i its state update, and g_i its workitem generation.
The output work items of a processing function are ordered according to the strict weak ordering defined on WorkItems. The ordered work items are again fed into the processing function in the order in which they appear. The interaction between the processing function and the ordering is depicted graphically in Figure 1.

Fig. 1: An overview of the Abstract Graph Machine

The strict weak ordering relation (denoted ≺) must satisfy the following properties:

  1. For all w ∈ WorkItems, ¬(w ≺ w).

  2. For all w1, w2 ∈ WorkItems, if w1 ≺ w2, then ¬(w2 ≺ w1).

  3. For all w1, w2, w3 ∈ WorkItems, if w1 ≺ w2 and w2 ≺ w3, then w1 ≺ w3.

  4. For all w1, w2, w3 ∈ WorkItems, if w1 is not comparable with w2 and w2 is not comparable with w3, then w1 is not comparable with w3.

Properties 1 and 2 state that the strict weak ordering relation is irreflexive and asymmetric. Property 3 denotes the transitivity of comparable work items, and Property 4 states that transitivity is also preserved among non-comparable elements of the set. These properties give rise to an equivalence relation on WorkItems (non-comparable work items belong to the same equivalence class); hence the strict weak ordering relation partitions the complete WorkItems set. Since work items in different equivalence classes are comparable, the strict weak ordering relation defined on WorkItems induces an ordering on the generated equivalence classes. In general, there are several ways to define the induced ordering relation (denoted ≺*). For our work we use the definition given in Definition 2.

Definition 2

≺* is a binary relation defined on the set of equivalence classes induced by ≺ on WorkItems, such that if C1 and C2 are equivalence classes, then C1 ≺* C2 iff w1 ≺ w2 for every w1 ∈ C1 and w2 ∈ C2.
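
In C++ terms (our analogy, not the paper's notation), a strict weak ordering is exactly the contract expected of standard-library comparators, and the induced ordering of Definition 2 can be evaluated through arbitrary representatives of two classes:

  // Given any strict weak ordering "lt" on work items, two work items are in
  // the same equivalence class exactly when neither precedes the other, and two
  // classes are compared through arbitrary representatives.
  template <typename Compare, typename WorkItem>
  bool same_equivalence_class(Compare lt, const WorkItem& a, const WorkItem& b) {
    return !lt(a, b) && !lt(b, a);          // a and b are incomparable
  }

  template <typename Compare, typename WorkItem>
  bool class_precedes(Compare lt, const WorkItem& rep_c1, const WorkItem& rep_c2) {
    return lt(rep_c1, rep_c2);              // induced ordering on equivalence classes
  }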

Having defined all supporting concepts we now give the definition of an AGM in Definition 3.

Definition 3

An Abstract Graph Machine (AGM) is a 6-tuple (G, WorkItems, Q, pf, ≺, S), where

  1. G = (V, E) is the input graph,

  2. WorkItems ⊆ V × A_1 × … × A_k, where each A_i represents a state value or an ordering attribute,

  3. Q is the set of states, represented as mappings,

  4. pf : WorkItems → P(WorkItems) is the processing function,

  5. ≺ is the strict weak ordering relation defined on work items,

  6. S (⊆ WorkItems) is the initial workitem set.

In the following we give the semantics of an AGM. An AGM starts execution with the initial workitem set. The initial set is ordered according to the strict weak ordering relation. Next, the work items within the smallest equivalence class are fed to the processing function. If the processing function generates new work items, they are separated into equivalence classes based on the strict weak ordering relation. The work items within a single equivalence class can execute the processing function in parallel. However, work items in two different equivalence classes must be ordered according to the induced relation (i.e., ≺*). Executing the work items in an equivalence class may generate new work items for the same equivalence class or for an equivalence class greater (as per ≺*) than the currently processing equivalence class. The AGM executes the work items in the next equivalence class once it has finished executing all the work items in the current smallest equivalence class. An AGM terminates when it has executed all the work items in all the equivalence classes. A sketch of this execution loop is given below.
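
The execution just described can be summarized by a short sequential sketch (our own code under the stated assumptions; an actual AGM runtime processes each equivalence class in parallel and in a distributed fashion):

  #include <cstddef>
  #include <map>
  #include <utility>
  #include <vector>

  // classify(w) maps a workitem to an id of its equivalence class; the ids are
  // ordered consistently with the induced relation, which std::map preserves.
  template <typename WorkItem, typename Classify, typename ProcessingFunction>
  void run_agm(std::vector<WorkItem> initial, Classify classify, ProcessingFunction pf) {
    using ClassId = decltype(classify(std::declval<WorkItem>()));
    std::map<ClassId, std::vector<WorkItem>> classes;
    for (auto& w : initial) classes[classify(w)].push_back(w);

    while (!classes.empty()) {
      auto smallest = classes.begin();                    // smallest equivalence class
      ClassId id = smallest->first;
      std::vector<WorkItem> work = std::move(smallest->second);
      classes.erase(smallest);

      for (std::size_t i = 0; i < work.size(); ++i) {     // conceptually parallel
        for (auto& nw : pf(work[i])) {                    // pf may generate new work
          ClassId c = classify(nw);                       // c is never smaller than id
          if (c == id) work.push_back(nw);                // same class: keep processing
          else classes[c].push_back(nw);                  // later class
        }
      }
    }
  }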

The AGM has been used to model graph algorithms for applications such as SSSP, BFS, PageRank, and Connected Components. The processing functions and orderings used in those algorithms are discussed in detail in [5].

III-A SSSP Algorithms in AGM

In this subsection, we build the AGM model for Algorithm 1. We also show that adding different orderings to the modeled algorithm reveals the behaviour of existing distributed SSSP algorithms.

The SSSP algorithms discussed in this paper can be formulated using a single statement. For brevity, we use the following notation to represent the processing function.

Notation 1

pf(w) : c(w, Q) → ( u(Q), {w'} )

In Notation 1, w' is the new workitem generated from w. As discussed previously, u and c represent the state update and the condition.

To build the AGM for the self-stabilizing Algorithm 1, we need to define each tuple element in Definition 3 (i.e., (G, WorkItems, Q, pf, ≺, S)). As the input graph we use G = (V, E, weight), where weight is a read-only property map that associates weights to edges. As explained in Section II, the WorkItems set is defined based on the vertex state. For Algorithm 1, the state we are interested in is the distance from the source vertex. Therefore, we define WorkItems = V × Distance, where Distance is the set of possible distance values. The only state the AGM needs is the distance from the source vertex, and it is represented by the mapping distance. Values of the distance state are assigned when the processing function is executed with work items.

The AGM processing function is defined based on the rules in Algorithm 1 and is given in Definition 4. Since Rule R0 applies only to the source vertex, we can move it to the initial workitem set in the AGM representation. Rule R1 is encoded into the processing function in the format defined in Definition 1. The definition of the processing function uses a helper function called neighbors (declared as neighbors : V → P(V)), which operates on graph vertices.
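
A concrete sketch of this single-statement processing function in C++ follows (our own types and graph representation, not the paper's code); the formal definition is given in Definition 4 below.

  #include <cstdint>
  #include <tuple>
  #include <vector>

  using Vertex = std::uint64_t;
  using Distance = std::uint32_t;
  using WorkItem = std::tuple<Vertex, Distance>;

  struct Edge { Vertex target; Distance weight; };

  // The statement's condition is d < distance[v]; the state update sets
  // distance[v] = d; the generated work items are <u, d + weight(v, u)> for
  // every neighbour u of v.
  std::vector<WorkItem> sssp_pf(const WorkItem& w,
                                std::vector<Distance>& distance,            // state mapping
                                const std::vector<std::vector<Edge>>& g) {  // read-only graph
    std::vector<WorkItem> out;
    Vertex v = std::get<0>(w);
    Distance d = std::get<1>(w);
    if (d < distance[v]) {                       // condition of the single statement
      distance[v] = d;                           // state update (atomic in the model)
      for (const Edge& e : g[v])                 // generate work for all neighbours
        out.emplace_back(e.target, d + e.weight);
    }
    return out;
  }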

Definition 4

pf(<v, d>) : d < distance[v] → ( distance[v] := d, { <n, d + weight(v, n)> : n ∈ neighbors(v) } )
The next required parameter for the AGM model is the strict weak ordering relation (≺). If work items are not ordered in any form, then we have a single large equivalence class of generated work items. We obtain a single large equivalence class by defining the strict weak ordering relation as in Definition 5.

Definition 5

≺_ch is a binary relation defined on WorkItems where, for all w1, w2 ∈ WorkItems, w1 ≺_ch w2 does not hold (no two work items are comparable).

Basically, the binary relation ≺_ch does not divide work items into separate comparable equivalence classes. Using the strict weak ordering defined in Definition 5, we present the AGM model for Algorithm 1 in Proposition 1.
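
Expressed as a comparator over the workitem tuple sketched earlier (our illustration), the chaotic ordering simply never orders anything:

  #include <cstdint>
  #include <tuple>

  using Vertex = std::uint64_t;
  using Distance = std::uint32_t;
  using WorkItem = std::tuple<Vertex, Distance>;

  // No work item ever precedes another, so all work items fall into a single
  // equivalence class.
  struct ChaoticOrder {
    bool operator()(const WorkItem&, const WorkItem&) const { return false; }
  };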

Proposition 1

Algorithm 1 is modeled by the AGM where;

  1. G = (V, E, weight) is the input graph,

  2. WorkItems = V × Distance,

  3. Q = {distance} is the set of state mappings,

  4. pf is the processing function given in Definition 4,

  5. Strict weak ordering relation = ≺_ch (Definition 5),

  6. S = {<v_s, 0>}, where v_s ∈ V is the source vertex.

Since the AGM presented in Proposition 1 does not perform any ordering on work items, we call it the Chaotic AGM. However, the ordering of work items can be improved by defining strict weak ordering relations that generate smaller comparable equivalence classes. There are numerous ways to define strict weak orderings so that they generate smaller equivalence classes. Some of these strict weak orderings yield existing distributed SSSP algorithms such as Dijkstra's SSSP algorithm [6], the Δ-stepping SSSP algorithm [7], and the KLA SSSP algorithm [8]. All of these algorithms share almost the same processing function but use different orderings on work items.

III-A1 Dijkstra's Algorithm

Dijkstra's SSSP algorithm is a work-efficient SSSP algorithm. Dijkstra's algorithm globally orders vertices by their associated distances, and the vertices with the shortest distances are processed first. In the following, we define the ordering relation for Dijkstra's algorithm and instantiate Dijkstra's algorithm as an Abstract Graph Machine.

Definition 6

≺_dij is a binary relation defined on WorkItems as follows: let w1, w2 ∈ WorkItems; then w1 ≺_dij w2 iff w1[1] < w2[1] (i.e., the workitem with the smaller distance comes first).

It can be proved that ≺_dij is a strict weak ordering, i.e., that it satisfies the properties listed above (the proof is omitted to save space). The AGM instantiation of Dijkstra's algorithm is given in Proposition 2.
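
As a comparator (our illustration), the Dijkstra ordering compares work items by their distance, so every distinct distance value forms its own equivalence class:

  #include <cstdint>
  #include <tuple>

  using Vertex = std::uint64_t;
  using Distance = std::uint32_t;
  using WorkItem = std::tuple<Vertex, Distance>;

  struct DijkstraOrder {
    bool operator()(const WorkItem& a, const WorkItem& b) const {
      return std::get<1>(a) < std::get<1>(b);   // compare by distance (w[1])
    }
  };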

Proposition 2

Dijkstra's Algorithm is an instance of the AGM where;

  1. G = (V, E, weight) is the input graph,

  2. WorkItems = V × Distance,

  3. Q = {distance} is the set of state mappings,

  4. pf is the processing function given in Definition 4,

  5. Strict weak ordering relation = ≺_dij (Definition 6),

  6. S = {<v_s, 0>}, where v_s ∈ V is the source vertex.

III-A2 Δ-Stepping Algorithm

The Δ-stepping SSSP algorithm [7] arranges vertex-distance pairs into distance ranges (buckets) of size Δ and executes the buckets in order. Within a bucket, vertex-distance pairs are not ordered and can be executed in any order. Processing a bucket may produce extra work for the same bucket or for successive buckets. The strict weak ordering relation for the Δ-stepping algorithm is given in Definition 7.

Definition 7

≺_Δ is a binary relation defined on WorkItems as follows: let w1, w2 ∈ WorkItems; then w1 ≺_Δ w2 iff ⌊w1[1]/Δ⌋ < ⌊w2[1]/Δ⌋.
The instantiation of the Δ-stepping algorithm in the AGM is the same as Proposition 2, except that the strict weak ordering relation is ≺_Δ (Definition 7).
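
As a comparator (our illustration), the Δ-stepping ordering compares work items by the bucket index ⌊distance/Δ⌋, so each bucket of width Δ is one equivalence class:

  #include <cstdint>
  #include <tuple>

  using Vertex = std::uint64_t;
  using Distance = std::uint32_t;
  using WorkItem = std::tuple<Vertex, Distance>;

  struct DeltaOrder {
    Distance delta;   // bucket width Δ (> 0)
    bool operator()(const WorkItem& a, const WorkItem& b) const {
      return std::get<1>(a) / delta < std::get<1>(b) / delta;
    }
  };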

III-A3 KLA SSSP Algorithm

The k-level asynchronous (KLA) paradigm [8] bridges the level-synchronous and asynchronous paradigms for processing graphs. In a level-synchronous approach (e.g., level-synchronous BFS), vertices are processed level by level. KLA processes vertices up to k levels asynchronously and then moves to the next k levels.

In the following we model KLA SSSP with the AGM. The KLA approach orders work items by the level of the resulting tree. Therefore, we need an additional ordering attribute in the workitem definition. The KLA workitem is defined as <v, d, level>, so WorkItems = V × Distance × Level, where Level is the set of non-negative integers.

Further, the processing function also needs to be altered to populate the value of the new ordering attribute. The processing function for KLA SSSP is defined in Definition 8. It generates work items with an updated level while changing the distance state appropriately.

Definition 8

pf_kla(<v, d, l>) : d < distance[v] → ( distance[v] := d, { <n, d + weight(v, n), l + 1> : n ∈ neighbors(v) } )
KLA SSSP orders work items based on the level. Work items within the same group of k consecutive levels can be executed in parallel, and work items in different groups must be ordered. The strict weak ordering relation for KLA SSSP is given in Definition 9.

Definition 9

≺_kla is a binary relation defined on WorkItems as follows: let w1, w2 ∈ WorkItems; then w1 ≺_kla w2 iff ⌊w1[2]/k⌋ < ⌊w2[2]/k⌋.
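
As a comparator (our illustration, assuming the bucketed form ⌊level/k⌋ given above), the KLA ordering groups k consecutive levels into one equivalence class:

  #include <cstdint>
  #include <tuple>

  using Vertex = std::uint64_t;
  using Distance = std::uint32_t;
  using Level = std::uint32_t;
  using KLAWorkItem = std::tuple<Vertex, Distance, Level>;

  struct KLAOrder {
    Level k;   // number of levels processed asynchronously (> 0)
    bool operator()(const KLAWorkItem& a, const KLAWorkItem& b) const {
      return std::get<2>(a) / k < std::get<2>(b) / k;   // compare by level group
    }
  };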

The AGM formulation of the KLA SSSP algorithm is given in Proposition 3.

Proposition 3

The KLA SSSP Algorithm is an instance of the AGM where;

  1. G = (V, E, weight) is the input graph,

  2. WorkItems = V × Distance × Level,

  3. Q = {distance} is the set of state mappings,

  4. pf_kla is the processing function given in Definition 8,

  5. Strict weak ordering relation = ≺_kla (Definition 9),

  6. S = {<v_s, 0, 0>}, where v_s ∈ V is the source vertex and the level starts at 0.

(a) Dijkstra’s SSSP
(b) Δ-Stepping SSSP
(c) k-Level (KLA)
(d) Chaotic (No ordering)
Fig. 2: Equivalence classes generated by different SSSP algorithms

Each AGM discussed above divides WorkItems into equivalence classes differently (Figure 2). The Dijkstra AGM (Figure 2(a)) generates an equivalence class for each distinct distance value: work items that have the same distance belong to the same equivalence class, while work items with different distances go into different equivalence classes. The Δ-stepping AGM (Figure 2(b)) also orders by distance, but its equivalence classes are generated based on a Δ value and in general contain more elements than those of the Dijkstra AGM. All the work items within a single equivalence class are guaranteed to have distances between i×Δ and (i+1)×Δ for some i. Similar to the Δ-stepping algorithm, KLA too generates larger equivalence classes, but uses the level as the ordering attribute. Figure 2(c) shows how the KLA algorithm arranges equivalence classes. As shown in Figure 2(d), the Chaotic version has a single large equivalence class containing all the work items.

Each of the above algorithms differs in the way it generates equivalence classes and in how those equivalence classes are ordered. Otherwise, they share essentially the same processing function, which implements Rule R1 of Algorithm 1. Further, if two SSSP algorithms share the same ordering attribute, then they share the same processing function in the AGM model. For example, the Dijkstra AGM and the Δ-stepping AGM share the same processing function. If two algorithms use different ordering attributes, then their processing functions differ only in updating the values of the different ordering attributes.

IV Extended-AGM

Ordering by distance reduces the amount of redundant work in SSSP algorithms. Of the algorithms discussed in the previous section, Dijkstra's algorithm performs the strictest ordering, so it does the minimum amount of redundant work. However, the overhead of ordering in Dijkstra's algorithm is significant in a parallel distributed runtime due to frequent synchronization. In other words, when the AGM instance generates more equivalence classes, the global synchronization overhead increases. The other algorithms discussed above reduce the overhead of ordering by chunking work items into larger equivalence classes, which reduces the number of equivalence classes generated. The Extended AGM adds ordering back to these chunks, but to keep the overhead of ordering low at the AGM level, it applies the ordering at lower spatial levels.

A spatial level defines the amount of memory accessible for ordering work items. The highest spatial level is the accumulation of all the memory of the participating nodes (also called global memory). The next spatial level is the memory available at a node. The memory in a single node can be further subdivided into logical regions such as NUMA domains. Each NUMA domain may be shared by several threads. The lowest spatial level is the thread-local memory. Such a spatial division is highly architecture-dependent and hierarchical.

Fig. 3: EAGM Spatial Hierarchy

The EAGM depicts a spatial hierarchy as a tree (Figure 3). Every node in the spatial hierarchy has an ordering attached to it. The ordering attached to the root represents the ordering defined by the relevant AGM. The example given in Figure 3 is an EAGM hierarchy derived from the Δ-stepping AGM. As shown in Figure 3, the ordering ≺_Δ (Definition 7) is attached to the root node. The orderings attached to the lower spatial levels are applied to the work items available in the memory of the relevant spatial level. Therefore, the orderings attached to lower spatial levels are more relaxed than the ordering attached to the root. By default, the EAGM spatial hierarchy assumes the Chaotic ordering (i.e., no ordering), but specific orderings can be enforced. The example in Figure 3 enforces the Dijkstra strict weak ordering (Definition 6) at the NUMA level and uses Chaotic orderings at the other levels.
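
The following sketch illustrates the idea (the types and configuration structure are ours; the paper does not define such an API): an EAGM can be described by the ordering chosen at each spatial level, with the root carrying the ordering of the underlying AGM.

  // An EAGM annotates every level of the spatial hierarchy with an ordering.
  enum class OrderKind { Chaotic, Dijkstra, DeltaStepping, KLA };

  struct EAGMConfig {
    OrderKind global;    // root of the hierarchy: the ordering of the underlying AGM
    OrderKind process;   // ordering of work items buffered per process
    OrderKind numa;      // ordering of work items buffered per NUMA domain
    OrderKind thread;    // ordering of work items buffered per thread
  };

  // The EAGM of Figure 3: Δ-stepping at the root, Dijkstra ordering at the NUMA
  // level, chaotic (no ordering) elsewhere -- i.e., the "numaq" Δ-stepping variant.
  const EAGMConfig numaq_delta{OrderKind::DeltaStepping, OrderKind::Chaotic,
                               OrderKind::Dijkstra, OrderKind::Chaotic};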

An Extended AGM (EAGM) is an AGM in which the single strict weak ordering relation is replaced by a spatial hierarchy with annotated orderings. An EAGM extends an AGM if and only if the EAGM generates the same equivalence classes as the AGM at the root level of the spatial hierarchy. Therefore, an AGM can have multiple EAGMs, where each EAGM has the same ordering as the AGM at the root but different orderings at lower spatial levels. Each EAGM represents a variation of the algorithm modeled by the relevant AGM. If an AGM generates equivalence classes with many work items, then the EAGM provides an opportunity to perform fine-grained ordering at different spatial levels. However, if the AGM ordering generates equivalence classes with few work items, then the derived EAGMs have less opportunity to perform ordering at the spatial levels. For example, the Δ-stepping AGM generates equivalence classes with many work items (provided Δ is sufficiently large). Variations of the Δ-stepping AGM can be generated by applying ordering to work items at the process level, the NUMA level, or the thread level. However, for the Dijkstra AGM, spatial orderings may not improve the overall performance of the algorithm, because the equivalence classes generated by the Dijkstra AGM have fewer work items on average.

Of the algorithms discussed in the previous section, fine-grained spatial ordering is effective for the AGMs defined for Δ-stepping, KLA and Chaotic. Using the spatial hierarchy of Figure 3, we apply Dijkstra's strict weak ordering relation (Definition 6) at the PROCESS, NUMA and THREAD spatial hierarchy levels to derive EAGMs (Figure 4). Each EAGM generates a variation of the main algorithm defined by its corresponding AGM.

(a) EAGMs of Δ-stepping AGM
(b) EAGMs of KLA AGM
(c) EAGMs of Chaotic AGM
Fig. 4: Thread ordered, NUMA ordered and Process ordered EAGMs for Δ-stepping, KLA and Chaotic AGMs.

Figure 4(a) shows the EAGM variations derived for the Δ-stepping algorithm: Figure 4(a)-(i) applies ordering at the THREAD level, Figure 4(a)-(ii) applies ordering at the NUMA level, and Figure 4(a)-(iii) applies ordering at the PROCESS level. EAGMs for KLA and Chaotic are derived in the same way.

For convenience, we will refer to the original AGM implementation as buffer, the variation that does THREAD-level ordering as threadq, the variation that does PROCESS-level ordering as nodeq, and the variation that does NUMA-level ordering as numaq.

V Implementation

We implemented the EAGMs shown in Figure 4. Each implementation generates a variation of the main algorithm. For the implementation we used a lightweight active-messaging framework based on MPI. To represent the local graph data we used the compressed sparse row format, and a 1D distribution is used to distribute the graph. States and read-only mappings are maintained as property maps indexed by vertices or edges; the property maps are also distributed. We used concurrent priority queues (the flat-combining synchronous priority queue [9]) for node-level and NUMA-level ordering.
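
As an illustration of the thread-level ordering (threadq), the following sketch (our own code, not the actual runtime) drains a thread-private priority queue of the work items assigned to one thread within the current global equivalence class; the extra ordering requires no cross-thread synchronization.

  #include <cstdint>
  #include <queue>
  #include <tuple>
  #include <vector>

  using Vertex = std::uint64_t;
  using Distance = std::uint32_t;
  using WorkItem = std::tuple<Vertex, Distance>;

  // process_one(w) stands for executing the processing function and routing
  // remote work or work for later classes elsewhere; it is assumed here and
  // returns the locally generated work items for the same equivalence class.
  template <typename ProcessOne>
  void drain_thread_queue(std::vector<WorkItem> assigned, ProcessOne process_one) {
    auto cmp = [](const WorkItem& a, const WorkItem& b) {
      return std::get<1>(a) > std::get<1>(b);       // min-heap on distance
    };
    std::priority_queue<WorkItem, std::vector<WorkItem>, decltype(cmp)> pq(cmp);
    for (const WorkItem& w : assigned) pq.push(w);

    while (!pq.empty()) {
      WorkItem w = pq.top();
      pq.pop();
      for (const WorkItem& nw : process_one(w)) pq.push(nw);
    }
  }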

Fig. 5: Timing results of Δ-stepping variations for RMAT1 and RMAT2 graphs, with Δ = 3, 5, and 7.

VI Results

Fig. 6: Timing results of KLA variations for RMAT1 and RMAT2 graphs, for three values of k.

We evaluated the performance of the EAGM variations on synthetic graphs and on real-world graphs. For synthetic graphs we used:

  • RMAT1: Graphs based on the current Graph500 [10] BFS benchmark specification with R-MAT [11] parameters , and and random edge weights from to .

  • RMAT2: Graphs generated based on the proposed Graph500 [12] SSSP benchmark specification with R-MAT parameters , and and random edge weights from to

For real world graphs, we use four graphs with varying parameter from the SNAP [13] repository with random edge weights from to

All our experiments were carried out on a Cray XE6/XK7 supercomputer with 32 AMD Opteron Abu Dhabi CPUs and 64 GB of memory per node (4 NUMA domains), running the Cray Linux Environment and the cray-mpich2 ver. 7.2.5 MPI implementation. The runtime architecture we used is aligned with the spatial hierarchy used in our EAGM implementations (Figure 4).

VI-A Scaling Results

Fig. 7: Timing results of chaotic variations for RMAT1 and RMAT2 graphs.

We ran weak scaling experiments on Graph500 graphs from scale 19 (2^19 vertices) to scale 31. We compare each AGM algorithm (i.e., each main SSSP algorithm) to its EAGM variations (see Section IV). The implementation of the main AGM algorithm is denoted buffer for each algorithm (Δ-stepping, KLA and Chaotic). The thread-level, node-level and NUMA-level ordering variations are denoted threadq, nodeq and numaq, respectively, for each algorithm.

VI-A1 Δ-Stepping Variations

In Fig. 5, we present weak scaling results for the Δ-stepping variations on the RMAT1 and RMAT2 synthetic graphs with Δ values 3, 5, and 7. The original Δ-stepping (buffer) algorithm performs the best in-node. Since no communication is involved, the additional ordering provided by the other implementations does not provide sufficient benefit to offset its overhead.

In general, the threadq variation is the fastest in the distributed setting. The nodeq and numaq variations perform better with increasing deltas, but even so they are not competitive with the buffer implementation.

In summary, while in-node performance is dominated by the traditional Δ-stepping algorithm (i.e., the buffer implementation), the distributed execution shows significant improvement with the threadq variation. Although the numaq and nodeq variations should provide better performance than the threadq variation by providing more ordering, the overhead of concurrent ordering reduces the performance of numaq and nodeq.

VI-A2 KLA Variations

The KLA variations show different performance characteristics than Δ-stepping. Weak scaling results for the KLA implementations on RMAT1 and RMAT2 graphs are shown in Fig. 6. For KLA, the nodeq and numaq variations perform the best at scale for the smallest k value tested. At greater k values, the performance of threadq is comparable to nodeq and numaq, but in absolute terms the performance at higher k values is worse than at the smallest k. As discussed in the previous section, numaq and nodeq provide the best potential ordering by ordering the most items. The overheads are kept at bay because, at the smallest k, all the writes to the next level's queue occur before all the reads; the flat combining queue [9] that we use performs the best in this scenario. For higher k values, writes and reads become more interleaved, and the advantage of numaq and nodeq becomes less pronounced (when k is higher, more work items go into the queues and the concurrent ordering overhead becomes significant). In KLA, for both RMAT1 and RMAT2 inputs, all EAGM variations (threadq, nodeq and numaq) perform better than the original KLA algorithm (buffer).

Graph | Vertices | Edges | Diameter | Cores | AGM | buffer | threadq | nodeq | numaq
SOC-Live [14] | 4847571 | 68993773 | 16 | 64 | Δ-stepping | 22.7 ±6.47 | 20.52 ±5.8 | 14.93 ±2.39 | 29.8 ±3.87
 | | | | | KLA | 90.39 ±5.78 | 38.66 ±4.33 | 25.08 ±1.94 | 35.96 ±2.55
 | | | | | Chaotic | 39.82 ±6.98 | 11.66 ±0.63 | 166.26 ±15.42 | 207.42 ±22.21
Wiki-Talk [15] | 2394385 | 5021410 | 9 | 64 | Δ-stepping | 2.27 ±0.28 | 2.26 ±0.57 | 3.44 ±0.28 | 9.71 ±0.5
 | | | | | KLA | 3.73 ±0.44 | 2.53 ±0.34 | 1.94 ±0.23 | 3.9 ±0.4
 | | | | | Chaotic | 41.46 ±6.54 | 1.41 ±0.05 | 8.34 ±1.29 | 5.97 ±0.78
roadNet-CA [14] | 1965206 | 2766607 | 849 | 1024 | Δ-stepping | 24.53 ±6.49 | 22.28 ±6.09 | 12.23 ±1.2 | 19.18 ±2.05
 | | | | | KLA | 54.24 ±6.14 | 54.63 ±5.8 | 43.86 ±5.18 | 51.35 ±5.54
 | | | | | Chaotic | 44.62 ±7.4 | 2.68 ±0.21 | 44.17 ±2.08 | 23.76 ±2.1
Orkut [16] | 3072441 | 117185083 | 9 | 1024 | Δ-stepping | 3.72 ±0.43 | 3.12 ±0.4 | 4.29 ±0.36 | 6.77 ±0.3
 | | | | | KLA | 20.94 ±6.33 | 71.19 ±23.5 | 15.87 ±1.56 | 18.28 ±2.12
 | | | | | Chaotic | 64.84 ±12.19 | 2.97 ±1.14 | 56.81 ±5.43 | 50.41 ±3.99
TABLE I: Timing results in seconds (mean ± deviation) for buffer, threadq, nodeq and numaq on the listed real-world graphs. The three rows per graph correspond to the Δ-stepping, KLA and Chaotic AGM variations, respectively.

VI-A3 Chaotic Variations

Figure 7 shows the results for the chaotic variations on the RMAT1 and RMAT2 input graphs. The chaotic AGM has a single large equivalence class and does not perform any form of ordering. Due to work explosion, we were unable to run the chaotic algorithm except at small scales. For the same reason, both nodeq and numaq end up with larger queue sizes; hence the overhead of ordering becomes significant as we increase the scale. However, the thread-level ordering shows good performance, especially in distributed execution. For RMAT2, threadq achieves almost perfect weak scaling. Furthermore, the threadq chaotic variation is faster than all other variations in terms of absolute performance, demonstrating how the structured (E)AGM approach may result in new, highly performant algorithms.

VI-B Real World Graphs

The real-world graphs used in our experiments are listed in Table I along with their characteristics and the relevant results. Most of the real-world graphs are fairly small. For our experiments, we use either 64 or 1024 cores, depending on the size of the graph: the SOC-LiveJournal [14] and Wiki-Talk [15] experiments are run on 64 cores, and the California road network [14] and Orkut [16] experiments are run on 1024 cores.

For LiveJournal, nodeq showed the best results among the Δ-stepping and the KLA variations. Similar to the synthetic graph results, the chaotic variation shows its best results with threadq. KLA nodeq and chaotic threadq showed the best results on the Wiki-Talk graph. The California road network has the highest diameter, with edge weights ranging from 0 to 100. Both Δ-stepping and KLA show good results with node-level ordering, and Chaotic shows good results with threadq. For the Orkut graph, Δ-stepping shows its minimum timing values with the threadq implementation. Among the Chaotic variations, the threadq implementation continued to perform better for social network graphs.

VII Related Work

SSSP is a classic example of an irregular application, and parallel graph algorithms for SSSP are well studied. Δ-stepping [7], KLA [8], Bellman-Ford [17] and Crauser's SSSP [18] are popular algorithms that address the distributed memory parallel SSSP problem.

A distributed SSSP algorithm connected to self-stabilization was studied in [19]. The same algorithm is discussed for different runtime characteristics in [20]. The algorithm discussed in [19] is an EAGM instantiation of the Chaotic AGM. Much of the work related to self-stabilization has already been discussed in Section I.

The AGM is an abstract model for designing distributed memory parallel graph algorithms. Most graph algorithms existing today are designed based on the PRAM architecture, in which there is a single shared memory and individual processors read from and write to that shared memory.

As discussed in Section I, PRAM algorithms suffer from performance issues in distributed memory settings. Bulk Synchronous Parallel (BSP) [21] is a model used for designing distributed memory parallel algorithms. In BSP, execution proceeds in supersteps; each superstep consists of computation, communication and barrier synchronization. There are also variations of BSP in which computation and communication are overlapped to improve performance, yet barriers are still used. BSP is essentially an extended version of PRAM, and hence the graph algorithms designed for PRAM can be implemented in distributed settings using BSP. In addition, there are models that take network parameters and communication costs into account (e.g., LogP [22] and its descendants).

Regarding spatial orderings, the Galois scheduler OBIM [23] considers spatial features when processing an irregular application, but it is an implementation for shared memory systems rather than an abstract model.

VIII Conclusion

Using the AGM abstraction, we converted Algorithm 1 into a form suitable for distributed memory, parallel graph processing. We showed that existing distributed graph algorithms, namely Dijkstra's SSSP, Δ-stepping SSSP and KLA SSSP, are variations of the converted algorithm. These algorithms essentially implement Rule R1 of Algorithm 1 with different orderings based on the distance state or on a different ordering attribute such as the level. We also showed that the proposed EAGM model can generate more fine-grained orderings at less synchronized spatial levels. The results of our experiments showed that some of the generated algorithms perform better than standard distributed memory, parallel SSSP algorithms under different graph inputs.

Acknowledgment

This work is supported by the National Science Foundation under Grant No. 1319520. Further, this research was supported in part by Lilly Endowment, Inc., through its support for the Indiana University Pervasive Technology Institute, and in part by the Indiana METACyt Initiative. The Indiana METACyt Initiative at IU was also supported in part by Lilly Endowment, Inc.

References

  • [1] S. Fortune and J. Wyllie, “Parallelism in random access machines,” in Proceedings of the tenth annual ACM symposium on Theory of computing.   ACM, 1978, pp. 114–118.
  • [2] N. Guellati and H. Kheddouci, “A survey on self-stabilizing algorithms for independence, domination, coloring, and matching in graphs,” Journal of Parallel and Distributed Computing, vol. 70, no. 4, pp. 406–415, 2010.
  • [3] T. C. Huang and J.-C. Lin, “A self-stabilizing algorithm for the shortest path problem in a distributed system,” Computers & Mathematics with Applications, vol. 43, no. 1, pp. 103–109, 2002.
  • [4] S. Dolev, A. Israeli, and S. Moran, “Self-stabilization of dynamic systems assuming only read/write atomicity,” Distributed Computing, vol. 7, no. 1, pp. 3–16, 1993.
  • [5] T. A. Kanewala, M. Zalewski, and A. Lumsdaine, “Abstract graph machine,” arXiv preprint arXiv:1604.04772, 2016.
  • [6] E. W. Dijkstra, “A Note on Two Problems in Connexion With Graphs,” Numerische mathematik, vol. 1, no. 1, pp. 269–271, 1959.
  • [7] U. Meyer and P. Sanders, “Δ-Stepping: A Parallelizable Shortest Path Algorithm,” Journal of Algorithms, vol. 49, no. 1, pp. 114–152, 2003.
  • [8] Harshvardhan, A. Fidel, N. M. Amato, and L. Rauchwerger, “KLA: A New Algorithmic Paradigm for Parallel Graph Computations,” in Proc. 23rd Internat. Conf. on Parallel Architectures and Compilation.   ACM, 2014, pp. 27–38.
  • [9] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir, “Flat combining and the synchronization-parallelism tradeoff,” in Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures.   ACM, 2010, pp. 355–364.
  • [10] R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang, “Introducing the graph 500 benchmark,” Cray User’s Group (CUG), 2010.
  • [11] D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-mat: A recursive model for graph mining.” in SDM, vol. 4.   SIAM, 2004, pp. 442–446.
  • [12] Graph500Contributors. (2016) Graph 500 benchmark 1 ("search"). [Online]. Available: http://www.cc.gatech.edu/~jriedy/tmp/graph500/
  • [13] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, Jun. 2014.
  • [14] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters,” Internet Mathematics, vol. 6, no. 1, pp. 29–123, 2009.
  • [15] J. Leskovec, D. Huttenlocher, and J. Kleinberg, “Signed networks in social media,” in Proceedings of the SIGCHI conference on human factors in computing systems.   ACM, 2010, pp. 1361–1370.
  • [16] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.
  • [17] R. Bellman, “On a Routing Problem,” DTIC Document, Tech. Rep., 1956.
  • [18] A. Crauser, K. Mehlhorn, U. Meyer, and P. Sanders, “A Parallelization of Dijkstra’s Shortest Path Algorithm,” in Mathematical Foundations of Computer Science 1998.   Springer, 1998, pp. 722–731.
  • [19] M. Zalewski, T. A. Kanewala, J. S. Firoz, and A. Lumsdaine, “Distributed Control: Priority Scheduling for Single Source Shortest Paths Without Synchronization,” in Proc. of the Fourth Workshop on Irregular Applications: Architectures and Algorithms.   IEEE, 2014, pp. 17–24.
  • [20] J. S. Firoz, M. Barnas, M. Zalewski, and A. Lumsdaine, “Comparison of single source shortest path algorithms on two recent asynchronous many-task runtime systems.”
  • [21] A. V. Gerbessiotis and L. G. Valiant, “Direct bulk-synchronous parallel algorithms,” Journal of parallel and distributed computing, vol. 22, no. 2, pp. 251–267, 1994.
  • [22] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. Von Eicken, “Logp: Towards a realistic model of parallel computation,” in ACM Sigplan Notices, vol. 28, no. 7.   ACM, 1993, pp. 1–12.
  • [23] A. Lenharth, D. Nguyen, and K. Pingali, “Priority Queues Are Not Good Concurrent Priority Schedulers,” The University of Texas at Austin, Department of Computer Sciences, Tech. Rep. TR-11-39, 2011.