The Reverse Cuthill-McKee Algorithm in Distributed-Memory ^{†}^{†}thanks: A two-page abstract of this work has appeared at the SIAM Workshop on Combinatorial Scientific Computing. Short abstracts at SIAM conferences are not archived and do not appear in published proceedings.
Abstract
Ordering vertices of a graph is key to minimizing fill-in and data structure size in sparse direct solvers, maximizing locality in iterative solvers, and improving performance in graph algorithms. Except for naturally parallelizable ordering methods such as nested dissection, many important ordering methods have not been efficiently mapped to distributed-memory architectures. In this paper, we present the first-ever distributed-memory implementation of the reverse Cuthill-McKee (RCM) algorithm for reducing the profile of a sparse matrix. Our parallelization uses a two-dimensional sparse matrix decomposition. We achieve high performance by decomposing the problem into a small number of primitives and utilizing optimized implementations of these primitives. Our implementation shows strong scaling up to 1024 cores for smaller matrices and up to 4096 cores for larger matrices.
I Introduction
Reordering a sparse matrix to reduce its bandwidth or profile can speed up many sparse matrix computations [1, 2]. For example, a matrix with a small profile is useful in direct methods for solving sparse linear systems since it allows a simple data structure to be used. It is also useful in iterative methods because the nonzero elements will be clustered close to the diagonal, thereby enhancing data locality. Given a symmetric matrix A, a bandwidth-reduction ordering aims to find a permutation P so that the bandwidth of PAP^T is small. Since obtaining a reordering that minimizes the bandwidth is an NP-complete problem [3], various heuristics are used in practice, such as Cuthill-McKee, reverse Cuthill-McKee (RCM), and Sloan's algorithms [4, 5, 6]. This paper focuses solely on the RCM algorithm [5] because, with careful algorithm design, it is amenable to massive distributed-memory parallelism, the primary topic of interest of this paper.
The need for distributed-memory RCM algorithms is driven by the move to extreme-scale computing. In a large-scale scientific application, the matrix has often been distributed already by the time one arrives at the numerical phases of sparse matrix computations. Hence, it would be a waste of time to gather a distributed matrix onto a single processor to compute an ordering using sequential or multithreaded algorithms. Furthermore, by clustering the nonzeros closer to the diagonal, RCM ordering not only increases the cache performance of iterative solvers, but it can often restrict the communication to resemble more of a nearest-neighbor pattern. Figure 1 illustrates the performance effects of RCM ordering on a preconditioned conjugate gradient solver of the popular PETSc package [7]. Notice that the performance benefit of RCM ordering actually increases as the number of cores increases, possibly due to reduced communication costs. Our goal in this paper is therefore to design and develop a scalable distributed-memory parallel implementation of the RCM algorithm [5].
Similar to many sparse matrix computations, RCM ordering has been shown to be a difficult problem to parallelize [8]. The RCM algorithm relies on the repeated application of breadth-first search (BFS), and its computational load is highly dynamic, especially if the graph has a high diameter. The problem is exacerbated at high concurrency, where load imbalance and communication overhead degrade the performance of the parallel algorithm. RCM is harder to parallelize than BFS primarily for two reasons. First, RCM imposes additional restrictions on how the vertices are numbered during the graph traversal, limiting the available parallelism. Second, RCM is often used for matrices with medium-to-high diameter, whereas most existing work on parallel BFS has focused on low-diameter graphs such as the synthetic graphs used by the Graph500 benchmark. A higher diameter increases the critical path of level-synchronous BFS algorithms, also limiting the available parallelism.
In this paper, we aim to overcome these challenges by using the graph-matrix duality and replacing unstructured graph operations with structured matrix/vector operations. We make the following contributions:

We present a scalable distributed-memory algorithm for RCM ordering. Our algorithm relies on a handful of bulk-synchronous parallel primitives that are optimized for both shared-memory and massively parallel distributed-memory systems.

The quality (bandwidth and envelope) of the ordering from our distributed-memory implementation is comparable to the state of the art and remains insensitive to the degree of concurrency.

We provide a hybrid OpenMP-MPI implementation of the RCM ordering that attains up to 38× speedup on matrices from various applications on 1024 cores of a Cray XC30 supercomputer. We provide a detailed performance evaluation on up to 4096 cores, which sheds light on the performance bottlenecks and opportunities for future research.
II Preliminaries
II-A Serial Algorithms
Let n be the number of columns in a symmetric matrix A. Let f_i(A) be the column subscript of the first nonzero element in row i of A: f_i(A) = min{j : a_ij ≠ 0}. The ith bandwidth is defined as β_i(A) = i − f_i(A). The overall bandwidth of matrix A is denoted β(A) and is defined as β(A) = max{β_i(A) : 1 ≤ i ≤ n}. Using these notations, the envelope is defined as:

Env(A) = {(i, j) : f_i(A) ≤ j < i, 1 ≤ i ≤ n}.

The quantity |Env(A)| is called the profile or envelope size of A.
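To make these definitions concrete, the following sketch computes the bandwidth and profile of a symmetric matrix stored in SciPy's CSR format. The function name and the SciPy-based setup are our own illustration, not part of the implementation described in this paper.

```python
import numpy as np
from scipy.sparse import csr_matrix

def bandwidth_and_profile(A):
    """Compute the bandwidth beta(A) and profile |Env(A)| of a sparse symmetric matrix."""
    A = csr_matrix(A)
    n = A.shape[0]
    bandwidth, profile = 0, 0
    for i in range(n):
        cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
        lower = cols[cols <= i]          # nonzeros at or below the diagonal in row i
        if lower.size == 0:
            continue
        beta_i = i - lower.min()         # i-th bandwidth: i - f_i(A)
        bandwidth = max(bandwidth, beta_i)
        profile += beta_i                # |Env(A)| = sum_i beta_i
    return bandwidth, profile

# A small arrow-shaped matrix: its last row couples to every column,
# so the bandwidth equals n - 1 before any reordering.
A = csr_matrix(np.array([[4., 0., 0., 1.],
                         [0., 3., 0., 1.],
                         [0., 0., 2., 1.],
                         [1., 1., 1., 5.]]))
print(bandwidth_and_profile(A))          # (3, 3)
```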
Reverse Cuthill-McKee ordering, or RCM, introduced by George [9], is a variant of the Cuthill-McKee ordering [4], which aims at reducing the bandwidth of a sparse symmetric matrix A. This is of particular importance when the matrix is to be stored using a profile-based format. Finding a reordering of the rows/columns of A corresponds to labeling the vertices of the graph associated with A.
RCM repeatedly labels vertices adjacent to the current vertex until all vertices have been labeled, as depicted in Algorithm LABEL:algo:seq_rcm. The algorithm essentially processes vertices level by level and labels them in reverse order. The resulting reordered matrix often has a smaller profile [9].
[Algorithm float: sequential RCM ordering (algo:seq_rcm).]
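A minimal sequential sketch of this procedure is shown below. It is purely illustrative (the adjacency-list input and function name are our own), assuming a connected undirected graph and a given starting vertex.

```python
from collections import deque

def rcm_ordering(adj, start):
    """Sketch of sequential RCM: BFS from `start`, appending each vertex's
    unvisited neighbors in order of increasing degree, then reversing the
    resulting Cuthill-McKee ordering."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    order = [start]                      # Cuthill-McKee ordering under construction
    visited = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        # Label unvisited neighbors of v in order of increasing degree.
        for u in sorted((u for u in adj[v] if u not in visited), key=degree.get):
            visited.add(u)
            order.append(u)
            queue.append(u)
    return order[::-1]                   # reverse to obtain RCM

# Usage on a tiny star graph rooted at vertex 0:
adj = {0: [3], 1: [3], 2: [3], 3: [0, 1, 2]}
print(rcm_ordering(adj, 0))              # [2, 1, 3, 0]
```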
For the graph G(A) = (V, E) associated with A, V is the set of vertices and E is the set of edges in G(A). The number of vertices in G(A) is denoted by n and the number of edges by m. The eccentricity [10] of a vertex v is defined as ℓ(v) = max{d(v, u) : u ∈ V}, where d(v, u) is the graph distance between u and v.
A level structure rooted at a vertex r is a partitioning of V into levels L_1, L_2, ..., where L_1 = {r} and, for i > 1, L_i consists of the vertices adjacent to L_{i-1} that do not appear in an earlier level. The length of a rooted level structure corresponds to the eccentricity of its root, while its width is defined as the maximum number of vertices in any of its levels.
The first vertex to be labeled strongly impacts the bandwidth of the permuted matrix. Experience shows that it is better to start with a vertex having a large eccentricity. A peripheral vertex is a vertex with maximum eccentricity. Finding such a vertex is prohibitively expensive, so a common heuristic is to use a pseudo-peripheral vertex instead [12, 5]. A pseudo-peripheral vertex is a vertex with a high eccentricity, as close to the graph diameter as possible. Gibbs et al. [13] introduced an algorithm to find such a vertex, which was later refined by George and Liu [12]. The process, given in Algorithm LABEL:algo:seq_pseudo, starts with an arbitrary vertex in V and computes its rooted level structure. Then, a vertex in the last level is picked and the corresponding level structure is computed. This process is repeated until the number of levels in the rooted level structure converges. Computing the level structure corresponds to a BFS of the graph G(A).
[Algorithm float: finding a pseudo-peripheral vertex (algo:seq_pseudo).]
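The following compact sketch illustrates the pseudo-peripheral vertex search just described; the helper names and adjacency-list representation are our own illustration, assuming a connected undirected graph.

```python
def bfs_levels(adj, root):
    """Return the rooted level structure of `root` as a list of levels."""
    levels, visited, frontier = [], {root}, [root]
    while frontier:
        levels.append(frontier)
        nxt = []
        for v in frontier:
            for u in adj[v]:
                if u not in visited:
                    visited.add(u)
                    nxt.append(u)
        frontier = nxt
    return levels

def pseudo_peripheral_vertex(adj, start):
    """Sketch of the pseudo-peripheral vertex search: repeatedly root a BFS at a
    minimum-degree vertex of the deepest level until the level count stops growing."""
    r = start
    nlevels_prev, nlevels_cur = -1, 0
    while nlevels_cur > nlevels_prev:
        nlevels_prev = nlevels_cur
        levels = bfs_levels(adj, r)
        nlevels_cur = len(levels)
        # Minimum-degree vertex of the last level becomes the next candidate root.
        r = min(levels[-1], key=lambda v: len(adj[v]))
    return r

# Usage on a path graph 0-1-2-3: starting from the middle vertex 1,
# the search ends at an endpoint of the path.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(pseudo_peripheral_vertex(adj, 1))   # 3
```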
TABLE I: Matrix-algebraic primitives used by the RCM and pseudo-peripheral vertex algorithms.

Function | Arguments | Returns / semantics | Serial complexity | Communication
Ind | x: a sparse vector | the local indices of the nonzero entries of x | O(nnz(x)) | None
Select | x: a sparse vector; v: a dense vector; P: logical expr. on v | an empty sparse vector y; for every nonzero x_i, if P(v_i) holds then y_i = x_i | O(nnz(x)) | None
Set | x: a sparse vector; v: a dense vector | for every nonzero x_i, v_i = x_i | O(nnz(x)) | None
SpMSpV | A: a sparse matrix; x: a sparse vector; SR: a semiring | returns y = A x over the semiring SR | see [18] | AllGather & AlltoAll on a subcommunicator [14]
Reduce | x: a sparse vector; v: a dense vector; op: a reduction operation | a reduction of v over the nonzero indices of x; e.g., with op = min, mv is initialized to the maximum value and mv = min{mv, v_i} for every nonzero x_i | O(nnz(x)) | AllReduce
SortPerm | x: a sparse vector; v: a dense vector | an empty array of tuples; for every nonzero x_i a tuple (x_i, v_i, i) is appended; the tuples are sorted lexicographically and the sorted permutation is returned | O(nnz(x) log nnz(x)) | AllToAll
II-B Previous Work on Parallel RCM
There is a strong connection between BFS, finding a pseudo-peripheral vertex, and computing the RCM ordering. However, RCM is often used for matrices with higher diameters than the graphs for which parallel BFS is typically optimized [14, 15]. In addition, computing the actual RCM ordering is even harder to parallelize because the vertices within each level of the traversal tree have to be ordered by degree.
Computing sparse matrix orderings in parallel has received intermittent attention over the last few decades. While RCM is often used to accelerate iterative solvers, one of its first reported supercomputer-scale implementations was in the context of direct solvers by Ashcraft et al. [16] on a CRAY X-MP. It is hard to compare results from 30 years ago, when both the architectures and the sizes of matrices were significantly different. Karantasis et al. [8] recently studied the shared-memory parallelization of various reordering algorithms, including RCM.
III Algorithms based on matrix algebra
In this section, we present a distributed-memory algorithm for computing the RCM ordering that takes advantage of the equivalence between graph algorithms and sparse matrix algebra to exploit the latest hardware platforms. The function nnz() computes the number of nonzeros in its input; e.g., nnz(x) returns the number of nonzeros in the vector x. We utilize the Matlab colon notation: A(:, i) denotes the ith column of A, and A(i, :) denotes the ith row.
[Algorithm float: the distributed-memory RCM ordering algorithm (algo:dist_RCM).]
[Algorithm float: the distributed-memory pseudo-peripheral vertex algorithm (algo:dist_pseudo).]
III-A Primitives for the RCM algorithm
With the aim of designing a scalable parallel algorithm, we redesign the sequential algorithms for RCM ordering (Algorithm LABEL:algo:seq_rcm) and for finding a pseudo-peripheral vertex (Algorithm LABEL:algo:seq_pseudo) in terms of matrix-algebraic operations. For this purpose, we use the primitives summarized in Table I. A sparse vector is used to represent a subset of vertices. Consider a subset of unique vertices represented by a sparse vector x. Then, for each vertex v_i in the subset, there is a nonzero entry in the ith location of x (i.e., x_i ≠ 0), and the number of nonzeros in x is equal to the size of the subset. The nonzero entries of the sparse vector can store arbitrary values depending on the context of the algorithm. By contrast, a dense vector stores information about all vertices in the graph; hence its length is always n. In sparse matrix computations, we are often interested only in where the nonzero elements of a sparse matrix are. Hence, if A is an n × n sparse symmetric matrix, we use a graph G(A) = (V, E) to describe the sparsity structure of A: V = {v_1, ..., v_n}, where v_i corresponds to row/column i of A, and (v_i, v_j) ∈ E when a_ij ≠ 0.
Let x be a sparse vector, v be a dense vector, P be a logical unary operation, and A be a sparse matrix. Ind(x) returns the indices of the nonzero entries in x. Select(x, v, P) returns every nonzero entry x_i of x for which P(v_i) is true. Set(x, v) replaces the values of v by the corresponding nonzero entries of x (other entries of v remain unchanged). Reduce(x, v, op) considers only the nonzero indices of the sparse vector x and performs a reduction on the corresponding values of the dense vector v using the operation op. In Table I, we provide an example where the reduction operation is to find the minimum value. SortPerm(x, v) creates a tuple for each nonzero index in x. The list of tuples is lexicographically sorted and the function returns the sorted permutation (the indices passed to the sorting routine are used to obtain the permutation). Finally, SpMSpV(A, x, SR) performs a sparse matrix-sparse vector multiplication over the semiring SR. For the purposes of this work, a semiring is defined over (potentially separate) sets of 'scalars', and has its two operations, 'multiplication' and 'addition', redefined. We refer to a semiring by listing its scalar operations, such as the (multiply, add) semiring. The usual semiring multiply for BFS is select2nd, which returns the second value it is passed. The BFS semiring is defined over two sets: the matrix elements are from the set of binary numbers, whereas the vector elements are from the set of integers.
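The sketch below illustrates how an SpMSpV over the (select2nd, min) semiring behaves, with Python dictionaries standing in for distributed sparse vectors and an adjacency list standing in for the sparse matrix; it is an illustration of the semantics, not the CombBLAS implementation.

```python
def spmspv_select2nd_min(adj, x):
    """Sketch of y = A x over the (select2nd, min) semiring.  `adj[j]` lists the
    row indices of the nonzeros in column j of the symmetric matrix A, and the
    sparse vector `x` maps a frontier vertex to its value (here, a label).
    'Multiply' (select2nd) forwards the value carried by the parent, and
    'add' (min) keeps the smallest value reaching each discovered vertex."""
    y = {}
    for j, value in x.items():           # for every nonzero x_j
        for i in adj[j]:                 # for every nonzero A(i, j) in column j
            y[i] = min(y.get(i, value), value)
    return y

# Frontier {2: 5, 3: 4}: vertex 1 is adjacent to both frontier vertices and
# therefore keeps the smaller parent label 4.
adj = {0: [2], 1: [2, 3], 2: [0, 1], 3: [1]}
print(spmspv_select2nd_min(adj, {2: 5, 3: 4}))   # {0: 5, 1: 4}
```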
IiiB The RCM algorithm using matrixalgebraic primitives
Algorithm LABEL:algo:dist_RCM describes the RCM algorithm using the operations from Table I. Here, we assume that the graph is connected. The case of more than one connected component can be handled by repeatedly invoking Algorithm LABEL:algo:dist_RCM on each connected component. Algorithm LABEL:algo:dist_RCM takes a sparse adjacency matrix A and a pseudo-peripheral vertex r as inputs, and returns the RCM ordering as a dense vector.
The ith element of the ordering vector is initialized to a sentinel value and retains that value until the ith vertex is visited and labeled. Two sparse vectors hold the sets of vertices in the current and the next BFS level, respectively; the former is called the BFS frontier (the set of currently active vertices). The algorithm starts by labeling the pseudo-peripheral vertex r and inserting it into the frontier.
The while loop (lines LABEL:algo:li_while1-LABEL:algo:li_while2 of Algorithm LABEL:algo:dist_RCM) explores the vertices level by level until the frontier becomes empty. Vertices in the frontier have already been labeled in the previous iteration or during initialization. Hence, each iteration of the while loop traverses the unvisited neighbors of the frontier and labels the vertices in the next level.
At the beginning of the while loop of Algorithm LABEL:algo:dist_RCM, the values of the sparse frontier vector are set to the labels of its vertices. Next, we discover the vertices that are adjacent to the current frontier by multiplying the adjacency matrix by the frontier over the (select2nd, min) semiring (line 7). Here, the overloaded multiplication operation select2nd passes the labels of parents to their children, and the overloaded addition operation min ensures that each newly discovered vertex becomes a child of a vertex in the frontier with the minimum label. This step is explained in Figure 2 with an example. Notice that the specific operator overloading by a (select2nd, min) semiring makes the vertex exploration deterministic, and parallelizing the specialized SpMSpV becomes more challenging than in traditional BFS. Thankfully, the linear-algebraic formulation with its operator-overloading feature hides the RCM-specific complexity and is able to achieve high performance despite the deterministic nature of the computation.
After the vertices in the next level are discovered via SpMSpV, previously labeled vertices are removed from the result (line 8). The next step is to label the vertices in the next level. For this purpose, we sort them based on the labels of their parents and their degrees. The sorted permutation of these vertices is used to set their labels. Finally, we update the labels of the newly visited vertices (line 12) and proceed to the next iteration of the while loop. When the current frontier becomes empty, Algorithm LABEL:algo:dist_RCM returns the RCM ordering by reversing the labels of the vertices.
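To make the control flow concrete, the following purely sequential mock-up mirrors this loop with Python dictionaries in place of the distributed sparse vectors; the Table I primitives are indicated in comments. It is an illustration under our own simplifying assumptions, not the CombBLAS-based implementation.

```python
def rcm_from_pseudo_peripheral(adj, r):
    """Sequential mock-up of the matrix-algebraic RCM loop: an SpMSpV over the
    (select2nd, min) semiring discovers the next level, already-labeled vertices
    are filtered out (Select), the new vertices are sorted by (parent label,
    degree, vertex id) (SortPerm), and consecutive labels are assigned (Set).
    `adj` is an adjacency list indexed 0..n-1; the graph is assumed connected."""
    n = len(adj)
    labels = [-1] * n                    # -1 marks an unvisited vertex
    labels[r] = 0
    frontier = {r: 0}                    # sparse vector: vertex -> its label
    next_label = 1
    while frontier:
        # SpMSpV over (select2nd, min): each discovered vertex receives the
        # minimum label among its parents in the current frontier.
        discovered = {}
        for parent, plabel in frontier.items():
            for v in adj[parent]:
                discovered[v] = min(discovered.get(v, plabel), plabel)
        # Select: keep only vertices that have not been labeled yet.
        discovered = {v: pl for v, pl in discovered.items() if labels[v] == -1}
        # SortPerm: order the new vertices by (parent label, degree, vertex id).
        ranked = sorted(discovered, key=lambda v: (discovered[v], len(adj[v]), v))
        # Set: assign consecutive labels in sorted order; they form the next frontier.
        frontier = {}
        for v in ranked:
            labels[v] = next_label
            frontier[v] = next_label
            next_label += 1
    return [n - 1 - lbl for lbl in labels]   # reverse the labels to obtain RCM

# Usage: adj[i] lists the neighbors of vertex i; r is a pseudo-peripheral vertex.
adj = [[3], [3], [3], [0, 1, 2]]
print(rcm_from_pseudo_peripheral(adj, 0))    # [3, 1, 0, 2]
```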
IiiC Finding the pseudoperipheral vertex
Algorithm LABEL:algo:dist_pseudo finds a pseudo-peripheral vertex using the matrix-algebraic operations from Table I. Similar to Algorithm LABEL:algo:seq_pseudo, Algorithm LABEL:algo:dist_pseudo starts with an arbitrary vertex r. Initially, the level counts of the current and previous BFSs are set so that the while loop in line 4 iterates at least twice before termination. Each iteration of the while loop in line 4 runs a full BFS starting from r. As before, two sparse vectors represent the subsets of vertices in the current and next levels of the BFS, and a dense vector stores the BFS level to which each vertex belongs (a sentinel value denotes an unvisited vertex). The do-while loop (lines 8-16) performs the BFS traversal similarly to Algorithm LABEL:algo:dist_RCM. The overloaded addition operation of the SpMSpV at line 12 is set to min only for convenience; it can be replaced by any equivalent operation because the order of vertices within a BFS level does not matter for the discovery of a pseudo-peripheral vertex. At the end of the BFS, we find a vertex with the minimum degree in the last nonempty level and make it the root of the next BFS (line 17). The while loop at line 4 terminates when the number of levels in the latest BFS is not greater than that of the previous BFS. At this point, Algorithm LABEL:algo:dist_pseudo returns the pseudo-peripheral vertex r.
IV Distributed-memory algorithm
In order to parallelize the RCM algorithm, we simply need to parallelize the matrix-algebraic functions described in Table I.
IV-A Data distribution and storage
We use the Combinatorial BLAS (CombBLAS) framework [17], which distributes its sparse matrices on a 2D processor grid. Processor P(i, j) stores the A_ij submatrix of the 2D block decomposition in its local memory. Vectors are also distributed on the same 2D processor grid. CombBLAS supports different formats for storing its local submatrices. In this work, we use the CSC format, as we found it to be the fastest for the SpMSpV operation with very sparse vectors, which is often the case for matrices where RCM is commonly used. CombBLAS uses a vector of {index, value} pairs for storing sparse vectors. To balance the load across processors, we randomly permute the input matrix before running the RCM algorithm.
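The following small sketch illustrates the idea of a 2D block decomposition by mapping a global nonzero to its owning process and local position. It assumes a simple contiguous block distribution; the exact blocking used by CombBLAS may differ, so this is only an illustration of the layout.

```python
def owner_and_local_index(i, j, n, pr, pc):
    """Illustrative mapping of the global nonzero A(i, j) of an n-by-n matrix to
    its owner on a pr-by-pc process grid and its local indices, assuming process
    (pi, pj) owns a contiguous block of rows and columns."""
    rows_per = (n + pr - 1) // pr        # rows per process row (last block may be smaller)
    cols_per = (n + pc - 1) // pc        # columns per process column
    pi, pj = i // rows_per, j // cols_per
    return (pi, pj), (i - pi * rows_per, j - pj * cols_per)

# Example: on a 2-by-2 grid with n = 8, the nonzero A(5, 2) is owned by
# process (1, 0) and stored at local position (1, 2).
print(owner_and_local_index(5, 2, 8, 2, 2))   # ((1, 0), (1, 2))
```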
IV-B Analysis of the distributed algorithm
We measure communication by the number of words moved (W) and the number of messages sent (S). The cost of communicating a message of length m is α + βm, where α is the latency and β is the inverse bandwidth, both defined relative to the cost of a single arithmetic operation. Hence, an algorithm that performs F arithmetic operations, sends S messages, and moves W words takes O(F + αS + βW) time.
Since the amount of work is variable across BFS iterations, we analyze the aggregate cost over the whole BFS. For ease of analysis, matrix and vector nonzeros are assumed to be i.i.d. distributed. We also assume a square √p × √p processor grid. The number of complete breadth-first searches is exactly one in Algorithm LABEL:algo:dist_RCM, but it can be more than one in Algorithm LABEL:algo:dist_pseudo.
We leverage the 2D SpMSpV algorithms implemented in CombBLAS. The complexity of the parallel SpMSpV algorithm has been analyzed in this context recently [18]; we refer the reader to that analysis for the precise bound rather than restating it here.
As described before, the SortPerm function returns the sorted permutation of the vertices in the next frontier based on the lexicographic order of (parent_label, degree, vertex_id), where parent_label is the label of the parent of a vertex in the next frontier. We observe that the parents of the next frontier are a subset of the current frontier, and that the vertices of the current frontier were labeled incrementally in the previous iteration. Hence, the number of unique parent labels is at most the size of the current frontier, and those labels form a small contiguous range. This observation motivates us to design a distributed bucket sort algorithm where the ith processor is responsible for sorting the vertices whose parents' labels fall in the ith contiguous subrange of this label range.
Hence, processors exchange tuples using AllToAll, and the responsible processors sort their tuples locally. After sorting, the vertex indices associated with the tuples serve as the inverse permutation. Processors conduct another round of AllToAll (on the indices only) to obtain the sorted permutation.
The total number of nonzeros sorted is exactly the sum of the frontier sizes over all iterations, which is at most n since every vertex appears in exactly one frontier. Hence, the per-process cost of SortPerm is bounded by the cost of sorting the locally received tuples plus the cost of two rounds of personalized all-to-all communication [19]. We found our specialized bucket sort to be faster than state-of-the-art general sorting libraries, such as HykSort [20].
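As an illustration of the bucketing idea (not the actual MPI code), the sketch below routes tuples by parent label, sorts each bucket locally, and concatenates the buckets; in the distributed algorithm the routing step corresponds to the first AllToAll and each bucket corresponds to one process.

```python
def bucket_sort_by_parent_label(tuples, nbuckets, min_label, max_label):
    """Sketch of the specialized sort: each tuple is (parent_label, degree,
    vertex_id).  Because the parent labels of the next frontier form a small
    contiguous range, tuples are routed to buckets by parent label, each bucket
    is sorted locally, and the concatenation of the buckets is globally sorted."""
    span = max_label - min_label + 1
    buckets = [[] for _ in range(nbuckets)]
    for t in tuples:
        buckets[(t[0] - min_label) * nbuckets // span].append(t)
    ordered = []
    for bucket in buckets:               # local sorts; concatenation preserves order
        ordered.extend(sorted(bucket))
    return [t[2] for t in ordered]       # vertex ids in sorted order

# Parent labels 4..7 split over 2 buckets: {4, 5} -> bucket 0, {6, 7} -> bucket 1.
tuples = [(6, 2, 10), (4, 3, 11), (4, 1, 12), (7, 2, 13), (5, 2, 14)]
print(bucket_sort_by_parent_label(tuples, 2, 4, 7))   # [12, 11, 14, 10, 13]
```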
V Results
Figure 3: The sparse matrix test suite (spy plots omitted).

Name | Description | Dimensions | Nonzeros | Pseudo-diameter | BW (pre-RCM) | BW (post-RCM)
nd24k | 3D mesh problem | 72K × 72K | 29M | 14 | 68,114 | 10,294
ldoor | structural problem | 952K × 952K | 42.49M | 178 | 686,979 | 9,259
Serena | gas reservoir simulation | 1.39M × 1.39M | 64.1M | 58 | 81,578 | 81,218
audikw_1 | structural problem | 943K × 943K | 78M | 82 | 925,946 | 35,170
dielFilterV3real | higher-order finite element | 1.1M × 1.1M | 89.3M | 84 | 1,036,475 | 23,813
Flan_1565 | 3D model of a steel flange | 1.6M × 1.6M | 114M | 199 | 20,702 | 20,600
Li7Nmax6 | nuclear configuration interaction calculations | 664K × 664K | 212M | 7 | 663,498 | 490,000
Nm7 | nuclear configuration interaction calculations | 4M × 4M | 437M | 5 | 4,073,382 | 3,692,599
nlpkkt240 | sym. indefinite KKT matrix | 78M × 78M | 760M | 243 | 14,169,841 | 361,755
V-A Experimental Platform
We evaluated the performance of the parallel RCM algorithm on Edison, a Cray XC30 supercomputer at NERSC. On Edison, nodes are interconnected with the Cray Aries network using a Dragonfly topology. Each compute node is equipped with 64 GB RAM and two 12-core 2.4 GHz Intel Ivy Bridge processors, each with a 30 MB L3 cache. We used Cray's MPI implementation, which is based on MPICH2. We used OpenMP for intra-node multithreading and compiled the code with gcc 5.2.0 using the -O2 -fopenmp flags. In our experiments, we only used square process grids because rectangular grids are not supported in CombBLAS [17]. When p cores were allocated for an experiment, we created a √(p/t) × √(p/t) process grid, where t was the number of threads per process. In our hybrid OpenMP-MPI implementation, all MPI processes performed local computation followed by synchronized communication rounds. Local computation in every matrix-algebraic kernel was fully multithreaded using OpenMP. Only one thread in every process made MPI calls in the communication rounds.
Table II: RCM bandwidth obtained by SpMP, and runtimes (in seconds) of SpMP and of our distributed RCM on 1, 6, and 24 threads.

Graph | SpMP BW | SpMP 1t | SpMP 6t | SpMP 24t | Dist. RCM 1t | Dist. RCM 6t | Dist. RCM 24t
nd24k | 10,608 | 0.26 | 0.06 | 0.03 | 1.45 | 0.38 | 0.12
ldoor | 9,099 | 3.25 | 0.52 | 0.28 | 4.63 | 1.52 | 0.74
Serena | 85,229 | 1.64 | 0.49 | 0.66 | 7.75 | 2.26 | 1.08
audikw_1 | 34,202 | 1.31 | 0.34 | 0.16 | 7.31 | 1.99 | 0.81
dielFilterV3real | 25,436 | 1.99 | 0.73 | 0.46 | 8.63 | 2.37 | 0.95
Flan_1565 | 20,849 | 1.86 | 0.44 | 0.17 | 12.11 | 3.88 | 1.35
Li7Nmax6 | 443,991 | 4.62 | 1.48 | 0.87 | 20.28 | 4.91 | 2.85
Nm7 | - | - | - | - | - | - | -
nlpkkt240 | 346,556 | 57.21 | 25.17 | 9.92 | - | - | -
V-B Matrix Suite
The sparse matrix test suite used in our experiments is shown in Figure 3. These matrices come from a set of real applications, where either a sparse linear system or an eigenvalue problem is solved. The matrices were chosen to represent a variety of different structures and nonzero densities. Since RCM is only well defined on symmetric matrices, all matrices are symmetric.
In the last two columns of Figure 3, the original (pre-RCM) and final (post-RCM) bandwidths of each matrix are shown. In the majority of the cases, RCM effectively reduces the bandwidth. Serena and Flan_1565 are the only two matrices where RCM was ineffective in that regard.
V-C Shared-memory performance
Our implementation is fully multithreaded to take advantage of the shared-memory parallelism available within a node of modern supercomputers. Here, we compare the quality and runtime of our algorithm with the RCM implementation in SpMP (Sparse Matrix Preprocessing) by Park et al. [23], which is based on optimizations from [24] and on the algorithm presented in [8]. The results from SpMP are shown in Table II. For four out of the eight matrices on which SpMP was able to compute RCM, the RCM ordering from our distributed-memory algorithm (shown in Figure 3) yields smaller bandwidths than SpMP. SpMP is faster than our implementation in shared memory because of our distributed-memory parallelization overheads. However, SpMP sometimes loses efficiency across NUMA domains; for example, SpMP slows down for Serena on 24 cores compared to 6 cores.
One very important thing to note is that in order to compute an ordering using a serial or multithreaded implementation of the RCM algorithm such as SpMP, the matrix structure has to be gathered on a single node. Indeed, in many real-life applications, the matrix is already distributed, and this mandatory communication step has a non-negligible cost. For example, it takes over 9 seconds to gather the nlpkkt240 matrix, distributed over 1024 cores, onto a single node/core. This gather time alone exceeds the time to compute RCM using our algorithm on the same number of cores. One of the key benefits of our approach is that it does not require this step. Adding the gather time to the time required to compute the ordering itself using SpMP makes our approach highly competitive and often faster.
V-D Distributed-memory performance
We ran the distributed-memory RCM algorithm on up to 4096 cores of Edison. All performance results shown in this section used six threads per MPI process, as this configuration performed the best. Figure 13 shows the strong scaling of the distributed-memory RCM algorithm for nine graphs from Figure 3. To better understand the performance, we break down the runtime into five parts at each concurrency, where the height of a bar denotes the total runtime of identifying a pseudo-peripheral vertex and computing the RCM ordering.
Our distributed algorithm scales up to 1024 cores for all of the nine graphs, as shown in Figure 13. The RCM algorithm attains its best speedups on 1024 cores for Li7Nmax6 and nd24k, whereas the speedups for ldoor and Flan_1565 are considerably lower. The sharp drop in parallel efficiency on the latter graphs is due to their relatively high diameters: the level-synchronous nature of our BFS incurs high latency costs and decreases the amount of work per processor on high-diameter graphs. Figure 13 shows that SpMSpV is usually the most expensive operation at lower concurrency. However, SortPerm starts to dominate at high concurrency because it performs an AllToAll among all processes, which has higher latency. The size of the matrix also contributes to the scalability of the distributed-memory RCM algorithm. For example, the largest two matrices in our test set (Nm7 and nlpkkt240) continue to scale on more than 4K cores, whereas smaller problems do not scale beyond 1K cores. This suggests that our algorithm can scale to even higher core counts when given larger matrices as inputs.
Since the distributed-memory SpMSpV is the most expensive step of our RCM algorithm, we investigate its performance more closely. Figure 23 shows the breakdown of computation and communication time of the distributed SpMSpV primitive. We observe that at lower concurrency, SpMSpV is dominated by its computation, as expected. The communication time of SpMSpV starts to dominate the computation time at higher concurrency, and the crossover point where communication becomes more expensive than computation depends on the properties of the matrices. Graphs with higher diameters incur higher communication overhead than graphs with low diameters. For example, ldoor has one of the highest pseudo-diameters among the problems we considered; hence, the RCM algorithm on ldoor becomes communication bound at 1K cores. By contrast, lower-diameter graphs of similar size remain compute bound and scale beyond 1K cores on Edison.
We have also experimented with a flat MPI approach. This non-threaded RCM implementation had higher communication overhead and ran slower than the multithreaded implementation, especially at higher concurrencies. For example, the flat MPI implementation took longer to compute RCM on the ldoor matrix using 4096 cores of Edison, as shown in Figure 24.
VI Conclusion and Future Work
We have introduced the first distributed-memory implementation of the RCM algorithm. In particular, we have described how the RCM algorithm can be reformulated using the sparse matrix/graph duality, so that the parallel implementation can be accomplished using a small set of highly optimized parallel primitives. Our experiments have shown that the distributed-memory implementation of the RCM algorithm presented in this work scales well up to 1024 processors for smaller matrices, and even to 4096 cores for the largest problems we have evaluated.
We have shown that this new approach is overall faster than the traditional approach of gathering the matrix structure on a single node, applying a serial or multithreaded implementation of the RCM algorithm, and then redistributing the permuted matrix. More importantly, our approach removes the memory bottleneck that may be caused by having to store the entire graph on a single node.
A scalable implementation of the RCM algorithm is of prime importance for many iterative solvers that benefit from reordering the matrix. We have assessed this on a sample problem using the conjugate gradient method with the block Jacobi preconditioner available in PETSc.
Immediate future work involves finding alternatives to sorting (e.g., sorting globally only at the end, or not sorting at all and sacrificing some ordering quality). Longer-term work will investigate alternative BFS formulations that are not level-synchronous, as the existing approach has trouble scaling to high-diameter graphs.
Acknowledgments
We thank Hari Sundar for sharing the source code of HykSort. This work is supported by the Applied Mathematics and SciDAC Programs of the DOE Office of Advanced Scientific Computing Research under contract number DE-AC02-05CH11231. We used resources of NERSC, supported by the Office of Science of the DOE under Contract No. DE-AC02-05CH11231.
References
 [1] I. S. Duff, J. K. Reid, and J. A. Scott, “The use of profile reduction algorithms with a frontal code,” International Journal for Numerical Methods in Engineering, vol. 28, no. 11, pp. 2555–2568, 1989.
 [2] I. S. Duff and G. A. Meurant, “The effect of ordering on preconditioned conjugate gradients,” BIT Numerical Mathematics, vol. 29, no. 4, pp. 635–657, 1989.
 [3] C. H. Papadimitriou, “The NP-completeness of the bandwidth minimization problem,” Computing, vol. 16, no. 3, pp. 263–270, 1976.
 [4] E. Cuthill and J. McKee, “Reducing the bandwidth of sparse symmetric matrices,” in Proc. of the 24th National Conference. ACM, 1969, pp. 157–172.
 [5] A. George and J. W.-H. Liu, Computer Solution of Large Sparse Positive Definite Systems. Englewood Cliffs, New Jersey: Prentice-Hall Inc., 1981.
 [6] S. Sloan, “An algorithm for profile and wavefront reduction of sparse matrices,” International Journal for Numerical Methods in Engineering, vol. 23, no. 2, pp. 239–251, 1986.
 [7] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang, “PETSc users manual,” Argonne National Laboratory, Tech. Rep. ANL-95/11 Revision 3.7, 2016. [Online]. Available: http://www.mcs.anl.gov/petsc
 [8] K. I. Karantasis, A. Lenharth, D. Nguyen, M. J. Garzaran, and K. Pingali, “Parallelization of reordering algorithms for bandwidth and wavefront reduction,” in International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14). IEEE, 2014, pp. 921–932.
 [9] J. A. George, “Computer implementation of the finite element method,” Ph.D. dissertation, Stanford, CA, USA, 1971.
 [10] C. Berge, The Theory of Graphs and its Applications. John Wiley & Sons, 1962.
 [11] I. Arany, L. Szoda, and W. Smyth, “An improved method for reducing the bandwidth of sparse symmetric matrices.” in IFIP Congress (2), 1971, pp. 1246–1250.
 [12] A. George and J. W. H. Liu, “An implementation of a pseudoperipheral node finder,” ACM Transactions on Mathematical Software (TOMS), vol. 5, no. 3, pp. 284–295, Sep. 1979.
 [13] N. E. Gibbs, W. G. Poole, Jr, and P. K. Stockmeyer, “An algorithm for reducing the bandwidth and profile of a sparse matrix,” SIAM Journal on Numerical Analysis (SINUM), vol. 13, no. 2, pp. 236–250, 1976.
 [14] A. Buluç and K. Madduri, “Parallel breadth-first search on distributed memory systems,” in International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11). ACM, 2011, pp. 65:1–65:12.
 [15] F. Checconi, F. Petrini, J. Willcock, A. Lumsdaine, A. R. Choudhury, and Y. Sabharwal, “Breaking the speed and scalability barriers for graph exploration on distributed-memory machines,” in International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12). IEEE, 2012, pp. 1–12.
 [16] C. C. Ashcraft, R. G. Grimes, J. G. Lewis, B. W. Peyton, H. D. Simon, and P. E. Bjørstad, “Progress in sparse matrix methods for large linear systems on vector supercomputers,” International Journal of High-Performance Computing Applications (IJHPCA), vol. 1, no. 4, pp. 10–30, 1987.
 [17] A. Buluç and J. R. Gilbert, “The Combinatorial BLAS: Design, implementation, and applications,” International Journal of High-Performance Computing Applications (IJHPCA), vol. 25, no. 4, 2011.
 [18] A. Azad and A. Buluç, “Distributed-memory algorithms for maximum cardinality matching in bipartite graphs,” in IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2016.
 [19] J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, and D. Weathersby, “Efficient algorithms for all-to-all communications in multiport message-passing systems,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 8, no. 11, pp. 1143–1156, 1997.
 [20] H. Sundar, D. Malhotra, and G. Biros, “HykSort: a new variant of hypercube quicksort on distributed memory architectures,” in International Conference on Supercomputing (ICS). ACM, 2013, pp. 293–302.
 [21] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, p. 1, 2011.
 [22] H. M. Aktulga, A. Buluç, S. Williams, and C. Yang, “Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations,” in IEEE International Parallel & Distributed Processing Symposium (IPDPS). IEEE Computer Society, 2014.
 [23] J. Park et al., “SpMP: Sparse Matrix Preprocessing.” [Online]. Available: https://github.com/jspark1105/SpMP
 [24] J. Chhugani, N. Satish, C. Kim, J. Sewall, and P. Dubey, “Fast and efficient graph traversal algorithm for CPUs: Maximizing single-node efficiency,” in IEEE International Parallel & Distributed Processing Symposium (IPDPS). IEEE, 2012, pp. 378–389.