A work-efficient parallel sparse matrix-sparse vector multiplication algorithm
Abstract
We design and develop a work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse. SpMSpV is an important primitive in the emerging GraphBLAS standard and is the workhorse of many graph algorithms including breadth-first search, bipartite graph matching, and maximal independent set. As thread counts increase, existing multithreaded SpMSpV algorithms can spend more time accessing the sparse matrix data structure than doing arithmetic. Our shared-memory parallel SpMSpV algorithm is work-efficient in the sense that its total work is proportional to the number of arithmetic operations required. The key insight is to avoid having each thread individually scan the list of matrix columns.
Our algorithm is simple to implement and operates on existing column-based sparse matrix formats. It performs well on diverse matrices and vectors with heterogeneous sparsity patterns. A high-performance implementation of the algorithm attains up to 15x speedup on a 24-core Intel Ivy Bridge processor and up to 49x speedup on a 64-core Intel KNL manycore processor. In contrast to implementations of existing algorithms, the performance of our algorithm is sustained on a variety of different input types, including matrices representing scale-free and high-diameter graphs.
I. Introduction
Sparse matrix-sparse vector multiplication (SpMSpV) is an important computational primitive with many applications in graph algorithms and machine learning. The SpMSpV operation can be formalized as $\mathbf{y} \leftarrow \mathbf{A}\mathbf{x}$, where a sparse matrix $\mathbf{A}$ is multiplied by a sparse vector $\mathbf{x}$ to produce a (potentially sparse) vector $\mathbf{y}$. Due to its lack of applications in traditional scientific computing, the research community has not paid much attention to computing SpMSpV efficiently. It is possible to interpret SpMSpV as a special case of sparse matrix-matrix multiplication (SpGEMM) where the second matrix has dimensions $n \times 1$. While this formulation can be relatively efficient for computing SpMSpV sequentially, for example by using Gustavson's SpGEMM algorithm [1], it is not a good fit for computing SpMSpV in parallel. This is because there is often little work in each SpMSpV operation, necessitating novel approaches in order to scale to increasing thread counts.
The computational pattern in many graph algorithms involves transforming a set of active vertices (often called "the current frontier") to a new set of active vertices (often called "the next frontier"). Such graph algorithms, which are often called "data-driven" algorithms [2], are harder to parallelize because the work changes dynamically as the algorithm proceeds and there is often very little work per transformation. This "frontier expansion" pattern is neatly captured by the SpMSpV primitive: the current frontier is represented by the input vector $\mathbf{x}$, the graph is represented by the matrix $\mathbf{A}$, and the next frontier is represented by $\mathbf{y}$. For this reason, SpMSpV is the workhorse of many graph algorithms that are implemented using matrix primitives, such as breadth-first search [3], maximal independent sets [4], connected components [5], and bipartite graph matching [6]. This makes SpMSpV one of the most important primitives in the upcoming GraphBLAS [7] standard (http://graphblas.org).
Even seemingly more regular graph algorithms, such as PageRank, are better implemented in a data-driven way using the SpMSpV primitive as opposed to using sparse matrix-dense vector multiplication. This is because SpMSpV allows marking a vertex "inactive", using the sparsity of the input vector, as soon as its value converges (i.e., stops changing). Finally, local graph clustering methods, such as those based on the Spielman-Teng algorithm [8] or the more practical Andersen-Chung-Lang (ACL) algorithm [9], essentially perform one SpMSpV at each step.
In the area of supervised learning, SpMSpV is the workhorse of many support vector machine (SVM) implementations that use the sequential minimal optimization (SMO) approach [10]. In this formulation, the working set is represented by the sparse matrix $\mathbf{A}$ and the sample data is represented by the sparse input vector $\mathbf{x}$. SpMSpV is also the primitive used for solving logistic regression problems in dual form [11].
For a given problem, the minimum amount of work that needs to be performed by any algorithm is called a lower bound, and algorithms that match the lower bound within a constant factor are called optimal. Parallel algorithms for which the total work performed by all processors is within a constant factor of the state-of-the-art serial algorithm are called work-efficient. A parallel algorithm is said to have a data race whenever multiple threads access the same part of memory and at least one of those accesses is a write operation. Whenever there is a data race among threads, the algorithm needs a synchronization mechanism to avoid inconsistencies.
In this work, we show that existing shared-memory parallel SpMSpV algorithms are not work-efficient: as parallelism increases, they spend more time accessing the sparse matrix data structure than doing arithmetic. We present a new shared-memory parallel SpMSpV algorithm. When the input and output vectors are not sorted, the algorithm is optimal, and hence work-efficient, for matrices with at least one nonzero per column on average. The key insight is to avoid having each thread individually scan the list of matrix columns, which is unscalable even if the columns are stored in a sparse format. We also implement and evaluate a variation of our algorithm where the input and output vectors are sorted, as it shows better performance in practice due to increased cache efficiency. Both variations of the algorithm avoid unnecessary synchronization. We experimentally evaluate our algorithm on an Intel Ivy Bridge multicore processor as well as the new Intel Knights Landing (KNL) manycore processor, on a variety of real-world matrices with varying nonzero structures and topologies.
II. Background
II-A. Notation
Sparse matrix-sparse vector multiplication is the operation $\mathbf{y} \leftarrow \mathbf{A}\mathbf{x}$ where a sparse matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is multiplied by a sparse vector $\mathbf{x} \in \mathbb{R}^{n \times 1}$ to produce a sparse vector $\mathbf{y} \in \mathbb{R}^{m \times 1}$. Intuitively, a matrix (vector) is said to be sparse when it is computationally advantageous to treat it differently from a dense matrix (vector). In this paper, we only consider this "right multiplication" by a column vector, because the "left multiplication" $\mathbf{y}^{\mathsf T} \leftarrow \mathbf{x}^{\mathsf T}\mathbf{A}$ by a row vector is symmetric and the algorithms we present can be trivially adapted to the "left multiplication" case.
The $\mathrm{nnz}()$ function computes the number of nonzeros in its input, e.g., $\mathrm{nnz}(\mathbf{x})$ returns the number of nonzeros in $\mathbf{x}$. The $\mathrm{nzc}()$ function, which is only applicable to matrices, computes the number of nonempty columns of its input. When the object is clear from the context, we sometimes drop the input and simply write $\mathrm{nnz}$ and $\mathrm{nzc}$. We follow the Matlab colon notation: $\mathbf{A}(:,i)$ denotes the $i$th column, $\mathbf{A}(i,:)$ denotes the $i$th row, and $\mathbf{A}(i,j)$ denotes the element at the $(i,j)$th position of matrix $\mathbf{A}$.
Our SpMSpV algorithm works for all inputs with different sparsity structures, as evidenced by our experimental results, but we will analyze its computational complexity on Erdős-Rényi random graphs for simplicity. In the Erdős-Rényi random graph model $G(n,p)$, each edge is present with probability $p$ independently from the others. For $p = d/n$ where $d \ll n$, in expectation $d$ nonzeros are uniformly distributed in each column. We use $f$ as shorthand for $\mathrm{nnz}(\mathbf{x})$ in our analysis.
II-B. Classes of SpMSpV algorithms
SpMSpV algorithms can be broadly classified into two classes: vector-driven and matrix-driven algorithms. In vector-driven algorithms, the nonzeros in the input vector drive the computation and data accesses. By contrast, the nonzeros of the matrix drive the computation in matrix-driven algorithms. In some sense, vector-driven algorithms can be classified as pull-based, since the entries of the matrix are selectively pulled depending on the locations of the nonzeros in the input vector. Following the same logic, matrix-driven algorithms can be classified as push-based. In the vector-driven formulation, the SpMSpV problem becomes reminiscent of merging multiple lists (i.e., the scaled columns $\mathbf{x}(i) \cdot \mathbf{A}(:,i)$ for which $\mathbf{x}(i) \neq 0$).
II-C. Sparse matrix and vector data structures
There is no shortage of sparse matrix formats, most of which were exclusively invented for the sparse matrix-dense vector multiplication (SpMV) operation. A recent paper includes an up-to-date list of sparse matrix storage formats [12]. The SpMV operation can be implemented by sequentially iterating over the nonzeros of the sparse matrix, and hence does not require fast random access to the columns of a matrix. In SpMSpV, however, only those columns $\mathbf{A}(:,i)$ for which $\mathbf{x}(i) \neq 0$ need to be accessed. Consequently, we only consider storage formats that allow fast random access to columns.
The Compressed Sparse Columns (CSC) format is perhaps the most widely used sparse matrix storage format, together with its row analog, Compressed Sparse Rows (CSR). CSC has three arrays: colptrs is an integer array of length $n{+}1$ that effectively stores pointers to the start and end positions of the nonzeros for each column, rowids is an integer array of length $\mathrm{nnz}$ that stores the row ids of the nonzeros, and values is an array of length $\mathrm{nnz}$ that stores the numerical values of the nonzeros. CSC supports random access to the start of a column in constant time. Some implementations of CSC keep the row ids of nonzeros within each column sorted, i.e., the range rowids[colptrs[i] ... colptrs[i+1]) is sorted for every column $i$, but this is not a universal requirement.
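Concretely, a CSC matrix might be encoded as below. This is a minimal sketch; the type and field names are ours for illustration, and later sketches in this paper reuse them.

    #include <cstdint>
    #include <vector>

    // Compressed Sparse Columns: column j occupies the half-open range
    // [colptrs[j], colptrs[j+1]) of rowids/values.
    struct CscMatrix {
        int64_t nrows = 0, ncols = 0;
        std::vector<int64_t> colptrs;  // length ncols+1
        std::vector<int64_t> rowids;   // length nnz: row id of each nonzero
        std::vector<double>  values;   // length nnz: numerical value of each nonzero
    };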
The Double-Compressed Sparse Column (DCSC) format [13] further compresses CSC by removing repetitions in the colptrs array, which arise from empty columns. In DCSC, only columns that have at least one nonzero are represented, together with their column indices. DCSC requires $O(\mathrm{nnz} + \mathrm{nzc})$ space, compared to CSC's $O(\mathrm{nnz} + n)$. DCSC can be augmented to support fast column indexing by building an auxiliary index array that enables random access to the start of a column in expected constant time. This additional array does not increase the asymptotic storage.
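A DCSC counterpart, under the same illustrative naming assumptions as the CSC sketch above, keeps one colptrs slot per nonempty column plus that column's index:

    // Double-Compressed Sparse Columns: only the nzc nonempty columns appear.
    struct DcscMatrix {
        int64_t nrows = 0, ncols = 0;
        std::vector<int64_t> jc;       // length nzc: index of each nonempty column
        std::vector<int64_t> colptrs;  // length nzc+1: the k-th stored column spans
                                       //   [colptrs[k], colptrs[k+1]) of rowids/values
        std::vector<int64_t> rowids;   // length nnz
        std::vector<double>  values;   // length nnz
        // An optional auxiliary index from column id to position in jc restores
        // expected constant-time column lookup without changing asymptotic storage.
    };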
There are two commonly utilized methods for storing sparse vectors. The list format simply stores the vector compactly as a list of (index, value) pairs. The list can be sorted or unsorted. In contrast to what its name suggests, the actual data structure is often an array of pairs, for maximizing cache performance. This format is space efficient, requiring only $O(\mathrm{nnz})$ space. It is often the format of choice for vector-driven algorithms, but it is inefficient for matrix-driven algorithms because it does not support constant-time random access by index. The alternative bitvector format [14] is composed of a length-$n$ bitmap that signals whether or not a particular index is nonzero, and a list of $\mathrm{nnz}$ values.
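The list format and a basic sequential vector-driven multiply over these types might look as follows. This is our illustration, not the paper's code; note the $O(n)$ accumulator initialization, which Section II-F below argues a work-efficient algorithm must avoid.

    #include <utility>

    using SparseVector = std::vector<std::pair<int64_t, double>>;  // (index, value) list

    // Sequential vector-driven SpMSpV, y <- A*x: only the f = nnz(x) needed
    // columns of A are touched; scaled columns are merged in a dense accumulator.
    SparseVector spmspv_seq(const CscMatrix& A, const SparseVector& x) {
        std::vector<double> spa(A.nrows, 0.0);    // O(n) initialization on every call
        std::vector<char>   occupied(A.nrows, 0); // marks rows that received a value
        std::vector<int64_t> touched;
        for (const auto& [j, xv] : x) {           // driven by the nonzeros of x
            for (int64_t p = A.colptrs[j]; p < A.colptrs[j + 1]; ++p) {
                const int64_t i = A.rowids[p];
                if (!occupied[i]) { occupied[i] = 1; touched.push_back(i); }
                spa[i] += A.values[p] * xv;       // merge the scaled columns
            }
        }
        SparseVector y;
        y.reserve(touched.size());
        for (int64_t i : touched) y.emplace_back(i, spa[i]);  // unsorted list output
        return y;
    }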
We require our algorithm to produce the output vector $\mathbf{y}$ in the same format in which it received the input vector $\mathbf{x}$. For example, if the input is presented in sorted list format, then the output should also be in sorted list format. This is necessary to ensure the usability of our algorithms as part of a larger computational process, where the output of one SpMSpV can be reused as the input of another SpMSpV.
II-D. A Lower Bound for SpMSpV
We present a simple lower bound for multiplying an $n \times n$ matrix $\mathbf{A}$ that represents the Erdős-Rényi graph $G(n, d/n)$ by a sparse $n \times 1$ vector $\mathbf{x}$ with $f = \mathrm{nnz}(\mathbf{x})$ nonzeros. Since $f \le n$, any SpMSpV algorithm has to access the nonzero entries in $f$ columns of $\mathbf{A}$. Since each column of $\mathbf{A}$ has $d$ nonzeros in expectation, the asymptotic lower bound of SpMSpV is $\Omega(df)$.
This lower bound assumes no restrictions on how the matrix and the vectors are stored. The algorithm we present in this paper attains this lower bound using unsorted vectors. No known algorithm attains this lower bound if we require the vectors to be sorted.
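In display form, the expected-cost argument reads as follows (our restatement of the bound just derived, under the $G(n, d/n)$ model):

    \mathbb{E}[\mathrm{work}]
      \;\ge\; \sum_{i \,:\, \mathbf{x}(i) \neq 0} \mathbb{E}\big[\mathrm{nnz}(\mathbf{A}(:,i))\big]
      \;=\; f \cdot d,
    \qquad \text{hence any SpMSpV algorithm requires } \Omega(df) \text{ work.}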
TABLE I: Classification of SpMSpV algorithms, with the data structures and strategies they use.

    Class         | Algorithm                 | Matrix format | Vector format | Merging strategy | Parallelization strategy
    --------------|---------------------------|---------------|---------------|------------------|-----------------------------------
    matrix-driven | GraphMat [14]             | DCSC          | bitvector     | SPA              | row-split matrix and private SPA
    vector-driven | CombBLAS-SPA [3]          | DCSC          | list          | SPA              | row-split matrix and private SPA
    vector-driven | CombBLAS-heap [3]         | DCSC          | list          | heap             | row-split matrix and private heap
    vector-driven | SpMSpV-sort [15]          | CSC           | list          | sorting          | concatenate, sort and prune
    vector-driven | SpMSpV-bucket (this work) | CSC           | list          | buckets          | 2-step merging and private SPA
II-E. Prior work on parallel SpMSpV algorithms
A summary of existing algorithms is shown in Table I, with references to their first appearances in the literature. Some libraries later extended their algorithms to support each other's ideas as well, but Table I only refers to the first implementation reported.
Combinatorial BLAS (CombBLAS) [16] includes implementations of a variety of vector-driven algorithms. The algorithms that use the DCSC format were first used in the context of parallel breadth-first search (BFS) [3]. For shared-memory parallelization, the BFS work advocated splitting the matrix row-wise into $t$ (the number of threads) pieces. Each thread-local $(n/t) \times n$ submatrix was then stored in DCSC format. The authors experimented with multiple data structures for merging the scaled columns of $\mathbf{A}$: a sparse accumulator (SPA) and a priority queue (heap). The SPA [17] is an abstract data type that (at minimum) consists of a dense vector of numerical values and a list of indices that refer to the nonzero entries in the dense vector. CombBLAS later extended its support to CSC.
GraphMat [14] supports a matrix-driven SpMSpV algorithm. In GraphMat, the matrix is represented in the DCSC format and the vector is stored in the bitvector format. GraphMat also splits the matrix row-wise. Nurvitadhi et al. [18] present a hardware accelerator for a vector-driven SpMSpV algorithm. Their algorithm description makes random accesses to the vector $\mathbf{x}$, without any reference to its sparsity. Yang et al. [15] present a vector-driven SpMSpV implementation on the GPU, using sorted input/output vectors.
II-F. Requirements for a parallel work-efficient SpMSpV algorithm

To attain the lower bound, an SpMSpV algorithm must be vector-driven. In contrast to matrix-driven algorithms, which need to iterate over $\mathrm{nzc}$ or $n$ columns for the DCSC and CSC formats, respectively, vector-driven algorithms can efficiently access only the $f$ needed columns, and hence only the required entries, of the matrix.

To attain the lower bound, a SPA-based SpMSpV algorithm should not initialize the entire SPA. Since the SPA is a dense vector of size $n$, initializing the entire SPA requires $O(n)$ time. By contrast, an algorithm that initializes only the entries of the SPA that are accessed in the multiplication requires $O(df)$ initialization time, and hence can be work-efficient.
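As a sketch of the difference, the sequential multiply from Section II-C becomes work-efficient if the dense state is preallocated once and only the touched entries are initialized and later restored. This is our illustration of the idea, not the paper's code; spa and occupied are assumed to be length-$n$ arrays reused across calls, with occupied all-zero on entry.

    SparseVector spmspv_seq_partial(const CscMatrix& A, const SparseVector& x,
                                    std::vector<double>& spa,       // length n, preallocated
                                    std::vector<char>& occupied) {  // length n, all zero on entry
        std::vector<int64_t> touched;
        for (const auto& [j, xv] : x) {
            for (int64_t p = A.colptrs[j]; p < A.colptrs[j + 1]; ++p) {
                const int64_t i = A.rowids[p];
                if (!occupied[i]) {      // initialize only the entries actually touched
                    occupied[i] = 1;
                    spa[i] = 0.0;
                    touched.push_back(i);
                }
                spa[i] += A.values[p] * xv;
            }
        }
        SparseVector y;
        y.reserve(touched.size());
        for (int64_t i : touched) {
            y.emplace_back(i, spa[i]);
            occupied[i] = 0;             // restore the all-clear invariant for the next call
        }
        return y;                        // total work O(df), with no O(n) pass
    }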

A parallel SpMSpV algorithm that splits the matrix row-wise is not work-efficient. Consider an algorithm that splits $\mathbf{A}$ row-wise into $t$ pieces and multiplies each of the $(n/t) \times n$ submatrices independently with $\mathbf{x}$, in parallel by $t$ threads, to produce one piece of $\mathbf{y}$ per thread. Here, each thread needs to access the entire input vector, requiring $O(f)$ time per thread to access $\mathbf{x}$. The total time to access $\mathbf{x}$ over all threads is $O(tf)$, making the algorithm work-inefficient. However, in the row-split case, each thread writes to a separate part of the output vector $\mathbf{y}$, so no synchronization is needed.

A parallel SpMSpV algorithm that splits the matrix column-wise needs synchronization among threads. Consider an algorithm that splits $\mathbf{A}$ column-wise into $t$ pieces and multiplies each of the $n \times (n/t)$ submatrices with the corresponding piece of $\mathbf{x}$, in parallel by $t$ threads, to produce $\mathbf{y}$. This algorithm is work-efficient because the nonzero entries of $\mathbf{A}$ and $\mathbf{x}$ are accessed only once. However, synchronization is required among threads in the column-split case because each thread writes to the same output vector $\mathbf{y}$ via a shared SPA (see the sketch after these requirements).

A parallel SpMSpV algorithm that employs 2D partitioning of the matrix is not work-efficient. Consider an algorithm that partitions $\mathbf{A}$ into a $\sqrt{t} \times \sqrt{t}$ grid and multiplies each of the $(n/\sqrt{t}) \times (n/\sqrt{t})$ submatrices with the corresponding piece of $\mathbf{x}$ to generate a partial piece of $\mathbf{y}$. Since each submatrix in a column of the grid needs to access the same piece of $\mathbf{x}$, the input vector is accessed $\sqrt{t}$ times across all threads, making the algorithm work-inefficient. Furthermore, threads processing submatrices in a row of the grid need to update the same part of the output vector $\mathbf{y}$, requiring synchronization among threads. This algorithm mimics the concepts of the distributed-memory SpMSpV algorithms in CombBLAS and GraphMat.
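For instance, a column-split parallelization might look like the sketch below: each nonzero of $\mathbf{A}$ and $\mathbf{x}$ is touched once (work-efficient), but concurrent updates to the shared accumulator must be synchronized, here with atomics. This is our illustration of the scheme, not any of the surveyed implementations; it reuses the CscMatrix/SparseVector types from Section II-C.

    #include <omp.h>

    // Column-split SpMSpV: the nonzeros of x (and hence the columns of A) are
    // split among threads, but all threads scatter into one shared SPA.
    std::vector<double> spmspv_column_split(const CscMatrix& A, const SparseVector& x) {
        std::vector<double> spa(A.nrows, 0.0);          // shared dense accumulator
        #pragma omp parallel for schedule(dynamic, 64)
        for (int64_t k = 0; k < static_cast<int64_t>(x.size()); ++k) {
            const auto& [j, xv] = x[k];
            for (int64_t p = A.colptrs[j]; p < A.colptrs[j + 1]; ++p) {
                #pragma omp atomic                      // rows are shared across threads
                spa[A.rowids[p]] += A.values[p] * xv;
            }
        }
        return spa;  // dense result; compacting it to list format needs another pass
    }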
TABLE II: Properties of SPA-based sequential and parallel SpMSpV algorithms.

    Algorithm aspects           | Attains lower bound? | Work-efficient? | Synch. needed?
    ----------------------------|----------------------|-----------------|---------------
    Sequential:                 |                      |                 |
      matrix driven             |                      |                 |
      vector driven             | ✓                    |                 |
      SPA full init             |                      |                 |
      SPA partial init          | ✓                    |                 |
    Parallel:                   |                      |                 |
      row-split (private SPA)   |                      |                 |
      column-split (shared SPA) |                      | ✓               | ✓
      2D (shared SPA)           |                      |                 | ✓
We summarize the properties of SPA-based sequential and parallel SpMSpV algorithms in Table II. Based on this table, an asymptotically optimal SpMSpV algorithm that attains the lower bound should be vector-driven and should initialize only the necessary entries of the SPA. A desirable parallel algorithm should be work-efficient and should perform as little synchronization as possible (synchronization-avoiding). However, none of the parallelization schemes described in Table II is both work-efficient and synchronization-free at the same time. This observation motivates us to develop a new parallel algorithm that incorporates the advantages of both the row- and column-split schemes, making the newly developed algorithm both work-efficient and synchronization-free. In contrast to CombBLAS and GraphMat, which split the matrix row-wise beforehand, our algorithm, called SpMSpV-bucket, splits the necessary columns of the matrix on the fly using a list of buckets. This approach adapts to the needs of each multiplication independently and has been shown to be very effective in sparse matrix-dense vector multiplication [19]. We describe the SpMSpV-bucket algorithm in the next section.
III. The SpMSpV-bucket algorithm
Algorithm 1 describes the steps of the SpMSpV-bucket algorithm, which takes a dense vector SPA of size $n$ and a list of nb buckets, along with $\mathbf{A}$ and $\mathbf{x}$, as inputs. The matrix is stored in CSC format and the vector is stored in list format. The buckets are uninitialized space used by the threads to temporarily store (row index, scaled value) pairs from the selected columns of the matrix. Each bucket corresponds to a subset of consecutive rows of the matrix. The $i$th location of the SPA corresponds to the $i$th row of the matrix and is accessed by a single thread only. The SpMSpV-bucket algorithm then performs the following three steps for the multiplication.
Step 1: Accumulate columns of $\mathbf{A}$ into buckets (lines 2-7 of Algorithm 1). In this step, the columns $\mathbf{A}(:,i)$ for which $\mathbf{x}(i) \neq 0$ are extracted, the values of the extracted columns are multiplied by the corresponding nonzero values of $\mathbf{x}$, and the scaled values, paired with their row indices, are stored in buckets. The bucket in which a scaled matrix entry is placed is determined by its row index: entries from the $i$th row are stored in the $\lfloor i \cdot nb / n \rfloor$th bucket, where nb is the number of buckets. This step is depicted in Step 1 of Figure 1, where the second, fifth and seventh columns of $\mathbf{A}$, corresponding to the nonzero indices of $\mathbf{x}$, are extracted and stored in four buckets B1, B2, B3, and B4. This step is similar to the column-split algorithm, which ensures the work-efficiency of our parallel algorithm.
In the parallel algorithm, each thread processes a subset of the nonzero entries of $\mathbf{x}$ and stores the scaled entries of the corresponding columns of $\mathbf{A}$ in their designated buckets. Writing to buckets would ordinarily require synchronization among threads, because multiple threads could write simultaneously to the same bucket when they extract entries from the same row range. To avoid expensive synchronization, we pass over the columns of $\mathbf{A}$ for which $\mathbf{x}$ is nonzero in a preprocessing step and count how many scaled entries each thread will write to each bucket in Step 1 of Algorithm 1. The preprocessing step is described in Algorithm 2, where Boffset[i][j] stores the number of entries that the $i$th thread will insert into the $j$th bucket in Step 1 of Algorithm 1. We use Boffset to precisely compute where each thread will insert into each bucket. Using this approach, threads can insert into the buckets (line 7 of Algorithm 1) without any synchronization.
Step 2: Merge entries in each bucket (lines 10-18 of Algorithm 1). At this point, the algorithm behaves like a row-split algorithm, with the scaled entries split row-wise among the buckets. Since there is no data dependency among buckets after they are populated, each bucket can be merged independently by a thread. At the beginning of this step, each thread initializes only those locations of the SPA that will be used in merging the entries of the current bucket. Next, the entries in a bucket are merged using the part of the SPA dedicated to this bucket. In this process, the algorithm retrieves the unique indices from the $k$th bucket and stores them in uind[k]. This step is depicted in Step 2 of Figure 1, where each of the four buckets independently merges its entries by adding values with the same row indices.
Step 3: Construct $\mathbf{y}$ by concatenating buckets using the SPA (lines 19-24 of Algorithm 1). In the final step, the unique indices identified in a bucket are coupled with the corresponding values of the SPA, and the (index, value) pairs are inserted into the result vector. To make this step synchronization-free, the unique indices in a bucket are mapped to indices of $\mathbf{y}$ using the prefix sum operation described in line 20 of Algorithm 1. This step is depicted in Step 3 of Figure 1, where six unique indices are coupled with the computed values of the SPA and saved in the output vector $\mathbf{y}$. Figure 1 also shows the situation where the indices in $\mathbf{y}$ are required to be sorted.
So far, we have not addressed the sortedness of the vectors. The algorithm works as-is for unsorted vectors: the sortedness of the input does not affect the correctness of the algorithm. However, in order to return a sorted output, the algorithm requires a modification that adds a sorting step at the very end. A condensed sketch of the full algorithm follows.
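The sketch below puts the preprocessing pass and the three steps together in one multithreaded routine. It reuses the CscMatrix/SparseVector types from Section II-C; the bucket layout, the Boffset bookkeeping, and the choice of $4t$ buckets are our reading of Algorithms 1 and 2, not the authors' code. The output is returned unsorted; the sorted variant would additionally sort each bucket's unique indices before Step 3.

    #include <omp.h>

    using Entry = std::pair<int64_t, double>;            // (row index, scaled value)

    SparseVector spmspv_bucket(const CscMatrix& A, const SparseVector& x,
                               std::vector<double>& SPA,       // length n, preallocated
                               std::vector<char>& occupied) {  // length n, all zero on entry
        const int t  = omp_get_max_threads();
        const int nb = 4 * t;                            // more buckets than threads (Sec. III-A)
        const int64_t rpb = (A.nrows + nb - 1) / nb;     // consecutive rows per bucket
        const int64_t fx  = static_cast<int64_t>(x.size());

        // Preprocessing (cf. Algorithm 2): Boffset[i][b] counts the entries thread i
        // will write to bucket b, so Step 1 can write without synchronization.
        std::vector<std::vector<int64_t>> Boffset(t, std::vector<int64_t>(nb, 0));
        #pragma omp parallel
        {
            auto& cnt = Boffset[omp_get_thread_num()];
            #pragma omp for schedule(static)
            for (int64_t k = 0; k < fx; ++k)
                for (int64_t p = A.colptrs[x[k].first]; p < A.colptrs[x[k].first + 1]; ++p)
                    ++cnt[A.rowids[p] / rpb];
        }
        // Exclusive prefix sums over threads give each thread its cursor per bucket.
        std::vector<std::vector<Entry>> bucket(nb);
        std::vector<std::vector<int64_t>> cur(t, std::vector<int64_t>(nb));
        for (int b = 0; b < nb; ++b) {
            int64_t sum = 0;
            for (int i = 0; i < t; ++i) { cur[i][b] = sum; sum += Boffset[i][b]; }
            bucket[b].resize(sum);
        }

        // Step 1: scatter scaled entries of the needed columns into buckets.
        // schedule(static) replays the same iteration-to-thread mapping as above.
        #pragma omp parallel
        {
            auto& off = cur[omp_get_thread_num()];
            #pragma omp for schedule(static)
            for (int64_t k = 0; k < fx; ++k) {
                const auto& [j, xv] = x[k];
                for (int64_t p = A.colptrs[j]; p < A.colptrs[j + 1]; ++p) {
                    const int64_t b = A.rowids[p] / rpb;
                    bucket[b][off[b]++] = {A.rowids[p], A.values[p] * xv};
                }
            }
        }

        // Step 2: merge each bucket with its private slice of the SPA; buckets
        // own disjoint row ranges, so no synchronization is needed.
        std::vector<std::vector<int64_t>> uind(nb);
        #pragma omp parallel for schedule(dynamic)
        for (int b = 0; b < nb; ++b) {
            for (const auto& [i, v] : bucket[b]) {
                if (!occupied[i]) {                      // partial SPA initialization
                    occupied[i] = 1;
                    SPA[i] = 0.0;
                    uind[b].push_back(i);
                }
                SPA[i] += v;
            }
            for (int64_t i : uind[b]) occupied[i] = 0;   // restore flags for the next call
        }

        // Step 3: a prefix sum over per-bucket counts maps each bucket to a
        // disjoint slice of y, so buckets are concatenated without locks.
        std::vector<int64_t> ystart(nb + 1, 0);
        for (int b = 0; b < nb; ++b)
            ystart[b + 1] = ystart[b] + static_cast<int64_t>(uind[b].size());
        SparseVector y(ystart[nb]);
        #pragma omp parallel for schedule(dynamic)
        for (int b = 0; b < nb; ++b)
            for (size_t q = 0; q < uind[b].size(); ++q)
                y[ystart[b] + q] = {uind[b][q], SPA[uind[b][q]]};
        return y;
    }

Passing SPA and occupied in as preallocated state mirrors the memory-allocation optimization discussed in the next subsection.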
III-A. Performance optimizations
Load balancing. In order to balance the work among threads, we create more buckets than the available number of threads. In our experiments, we use $4t$ buckets when using $t$ threads and employ dynamic scheduling of threads over the buckets. Using more buckets tends to improve the scalability of the SpMSpV-bucket algorithm, except when the input vector is extremely sparse (see the discussion in Section IV-F).
Cache efficiency. To improve the cache locality of Step 1 of Algorithm 1, we allocate a small private buffer for each thread. A thread first fills its private buffer as it accesses the columns of $\mathbf{A}$, and copies data from the private buffer to the buckets when the buffer is full. The thread-private buffer is small enough to fit in L1 or L2 cache. Sorting the input vector beforehand (if it is not sorted) improves the cache locality of the bucketing step when $\mathbf{x}$ is denser. This is due to the fact that when $\mathbf{x}$ is denser, the probability of accessing consecutive columns of $\mathbf{A}$ increases significantly. A sketch of the staging buffer follows.
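One hedged sketch of such a staging buffer, reusing the Entry type from the sketch above: the names and the 512-entry capacity are our choices, not values reported by the paper. Each thread batches (bucket, entry) pairs in a small cache-resident array and flushes them in one pass, instead of scattering every scaled entry straight into its bucket.

    // Thread-private staging for Step 1: batch writes, then flush in bursts.
    struct Staged { int64_t bucket; Entry e; };
    constexpr size_t kBufCap = 512;                 // small enough to stay in L1/L2

    inline void flush_buffer(std::vector<Staged>& buf,
                             std::vector<std::vector<Entry>>& bucket,
                             std::vector<int64_t>& off) {  // this thread's cursors
        for (const Staged& s : buf) bucket[s.bucket][off[s.bucket]++] = s.e;
        buf.clear();
    }
    // Inside the Step 1 loop, replace the direct write with:
    //   buf.push_back({b, {A.rowids[p], A.values[p] * xv}});
    //   if (buf.size() == kBufCap) flush_buffer(buf, bucket, off);
    // and call flush_buffer once more after the loop.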
Memory allocation. The memory allocation time for the buckets and the SPA can be expensive, especially when we run SpMSpV many times in an iterative algorithm such as BFS. Hence, we allocate enough memory for all buckets and the SPA only once and pass them to the SpMSpV-bucket algorithm. The number of entries inserted into all buckets is at most $\mathrm{nnz}(\mathbf{A})$. Hence, preallocating the buckets does not increase the total memory requirement of our algorithm.
III-B. Time and space complexity
Serial complexity. The preprocessing step described in Algorithm 2 and Step 1 of Algorithm 1 both access the nonzero entries of $f$ columns of $\mathbf{A}$. Hence these steps require $O(df)$ time in expectation. The initialization of the SPA and the merging of entries in all buckets require another $O(df)$ time in the second step. The total number of entries of $\mathbf{y}$ across all buckets is $\mathrm{nnz}(\mathbf{y})$. Since $\mathrm{nnz}(\mathbf{y}) \le df$, the overall complexity of the algorithm is $O(df)$. If $\mathbf{y}$ needs to be sorted by nonzero indices, another $O(\mathrm{nnz}(\mathbf{y}) \log \mathrm{nnz}(\mathbf{y}))$ time is required for sorting. However, sorting is very efficient in the SpMSpV-bucket algorithm because only the unique indices in each bucket need to be sorted. Hence each thread can run a sequential integer sorting function on its local indices, using an efficient sorting algorithm such as radix sort.
Parallel complexity. In the first step, the $f$ nonzero entries of the input vector are evenly distributed among the $t$ threads. Hence, each thread accesses $O(df/t)$ nonzero entries of the matrix in expectation. Since the nonzero entries of the matrix are evenly distributed among the rows in the Erdős-Rényi model, each bucket will have $O(df/nb)$ entries in expectation when nb buckets are used. Hence, with $nb \ge t$, the parallel complexity of the SpMSpV-bucket algorithm is $O(df/t)$ in expectation.
Space complexity. The total space required for all buckets is no more than $O(df)$. Hence, including the dense SPA, the total space requirement of our algorithm is $O(n + df)$.
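Collecting the bounds of this subsection in one place (our restatement, under the Erdős-Rényi assumptions of Section II-A, with $t$ threads and $nb \ge t$ buckets):

    \begin{aligned}
    T_{\mathrm{serial}}   &= O(df) \;+\; O\big(\mathrm{nnz}(\mathbf{y})\log \mathrm{nnz}(\mathbf{y})\big)
                             && \text{(second term only if } \mathbf{y} \text{ must be sorted)}\\
    T_{\mathrm{parallel}} &= O(df/t) && \text{in expectation}\\
    S                     &= O(n + df) && \text{(dense SPA plus buckets)}
    \end{aligned}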
TABLE III: Evaluated platforms.

                          | Cori (Intel KNL) | Edison (Intel Ivy Bridge)
    ----------------------|------------------|--------------------------
    Core                  |                  |
      Clock (GHz)         | 1.4              | 2.4
      L1 Cache (KB)       | 32               | 32
      L2 Cache (KB)       | 1024             | 256
      DP GFlop/s/core     | 44               | 19.2
    Node Arch.            |                  |
      Sockets/node        | 1                | 2
      Cores per socket    | 64               | 12
      STREAM BW (GB/s)    | 102              | 104
      Memory per node (GB)| 96               | 64
    Prog. Environment     |                  |
      Compiler            | gcc 5.3.0        | gcc 5.3.0
      Optimization        | -O3              | -O3
TABLE IV: Test graphs used in our evaluation.

    Class                | Graph              | #vertices (×10^6) | #edges (×10^6) | Pseudo-diameter | Description
    ---------------------|--------------------|-------------------|----------------|-----------------|-----------------------------------------------
    Low-diameter graphs  | amazon0312         | 0.40              | 3.20           | 21              | Amazon product co-purchasing network
                         | web-Google         | 0.92              | 5.11           | 16              | Web graph from the Google prog. contest, 2002
                         | wikipedia-20070206 | 3.56              | 45.03          | 14              | Wikipedia page links
                         | ljournal-2008      | 5.36              | 79.02          | 34              | LiveJournal social network
                         | wb-edu             | 9.85              | 57.16          | 38              | Web crawl on .edu domain
    High-diameter graphs | dielFilterV3real   | 1.10              | 89.31          | 84              | High-order vector finite element method in EM
                         | G3_circuit         | 1.56              | 7.66           | 514             | Circuit simulation problem
                         | hugetric-00020     | 7.12              | 21.36          | 3,662           | Undirected graph
                         | hugetrace-00020    | 16.00             | 48.00          | 5,633           | Frames from 2D dynamic simulations
                         | delaunay_n24       | 16.77             | 100.66         | 1,718           | Delaunay triangulation of random points
                         | rgg_n24_s0         | 16.77             | 165.1          | 3,069           | Random geometric graph
IV. Results
IV-A. Experimental Setup
We evaluate the performance of SpMSpV algorithms on Edison, a Cray XC30 supercomputer at NERSC, and on a KNL manycore processor of the kind that will be integrated into NERSC's Cori system. These two systems are described in Table III. We used OpenMP for multithreaded execution in our code.
IV-B. Impact of sorted input and output vectors on the performance of the SpMSpV-bucket algorithm
We implemented two variants of the SpMSpV-bucket algorithm based on the sortedness of the input and output vectors: in one variant both $\mathbf{x}$ and $\mathbf{y}$ are kept sorted by their indices, while the second variant works on unsorted vectors. Figure 2 shows the impact of sorted vectors on the performance of the SpMSpV-bucket algorithm for $\mathbf{x}$ with 10K and 2.5M nonzeros. When the vector is relatively dense, keeping the vectors sorted improves the performance of our algorithm, as can be seen in the right subfigure of Figure 2. This is due to the fact that when $\mathbf{x}$ is denser, the probability of accessing consecutive columns of $\mathbf{A}$ increases, making the bucketing step (Step 1 of Algorithm 1) more cache efficient. However, for relatively sparse vectors (e.g., when $\mathrm{nnz}(\mathbf{x})$ is a small fraction of $n$), sorted vectors do not significantly impact the performance of the SpMSpV-bucket algorithm because the access pattern to the columns of $\mathbf{A}$ remains more or less random. Since the unsorted version never seems to outperform the sorted version in practice, we only present results with sorted vectors in the remainder of the results section.
IV-C. Relative performance of SpMSpV algorithms
We compare the performance of SpMSpV-bucket with three other SpMSpV algorithms: CombBLAS-SPA, CombBLAS-heap, and GraphMat. These algorithms were discussed in Section II-E. First, we study the impact of the sparsity of the input vector on the performance of the SpMSpV algorithms. Figure 3 shows the runtimes of the four algorithms when the adjacency matrix of ljournal-2008 is multiplied by sparse vectors with varying $\mathrm{nnz}(\mathbf{x})$, using (a) 1 thread and (b) 12 threads on Edison. When $\mathbf{x}$ is very sparse (i.e., $\mathrm{nnz}(\mathbf{x})$ less than 50K), the runtime of GraphMat remains constant for a fixed thread count. This is a property of any matrix-driven algorithm whose runtime is dominated by the $O(\mathrm{nzc})$ term needed to pass through all nonzero columns of the matrix, especially when the vector is very sparse. CombBLAS-SPA also shows similar behavior for very sparse vectors because it fully initializes the SPA, requiring $O(n)$ time. By contrast, SpMSpV-bucket does not have any such overhead when the vector is very sparse; hence, on the sparsest vectors it outperforms its competitors by several orders of magnitude on a single thread. This huge gap in performance shrinks as the input vector becomes denser and the $O(\mathrm{nzc})$ and $O(n)$ terms become less significant. At the densest end, SpMSpV-bucket, CombBLAS-SPA, and GraphMat all perform similarly and run faster than CombBLAS-heap, because of the logarithmic term in the latter algorithm. The story remains more or less the same at higher concurrency, as can be seen in Figure 3(b).
IV-D. Performance of SpMSpV algorithms when used in BFS
BFS is arguably the most common customer of SpMSpV: the product of the adjacency matrix of the graph and the sparse vector representation of the current frontier yields the next frontier of the BFS. This approach has been successfully used in parallel BFS targeting GPUs as well as shared- and distributed-memory platforms [3, 14, 15]. Here we compare the performance of the four SpMSpV algorithms when they are used in BFS.
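As a sketch of this usage pattern (our illustration; GraphBLAS-style BFS would typically use a select-style semiring and masking instead of ordinary arithmetic), a level-synchronous BFS repeatedly multiplies the adjacency matrix by the current frontier and prunes the already-visited vertices:

    // Level-synchronous BFS via repeated SpMSpV. A is the adjacency matrix in
    // CSC with A(i,j) != 0 iff there is an edge j -> i, so y <- A*x expands x.
    std::vector<int> bfs_levels(const CscMatrix& A, int64_t source) {
        std::vector<int> level(A.nrows, -1);
        std::vector<double> SPA(A.nrows, 0.0);      // preallocated once, reused
        std::vector<char> occupied(A.nrows, 0);     // across all SpMSpV calls
        SparseVector frontier{{source, 1.0}};
        level[source] = 0;
        for (int depth = 1; !frontier.empty(); ++depth) {
            SparseVector reached = spmspv_bucket(A, frontier, SPA, occupied);
            frontier.clear();
            for (const auto& [v, val] : reached)
                if (level[v] == -1) {               // keep only newly discovered vertices
                    level[v] = depth;
                    frontier.emplace_back(v, 1.0);
                }
        }
        return level;
    }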
Figure 4 shows the performance of the four shared-memory SpMSpV algorithms on eleven real-world matrices from Table IV on a single node of Edison. To ensure fairness in comparing the algorithms, the same source vertex is used to start the BFS in all four algorithms, and only the runtime of the SpMSpV operations across all iterations is considered. For all problems in Figure 4, SpMSpV-bucket runs the fastest at all concurrencies. The performance improvement is more dramatic on high-diameter graphs, where SpMSpV-bucket runs several times faster than GraphMat, as can be seen in the bottom row of Figure 4. Following the discussion in Section IV-C, this performance gap is expected for high-diameter graphs, where BFS executes many SpMSpVs with very sparse vectors, a territory in which matrix-driven algorithms are inefficient. On scale-free graphs, SpMSpV-bucket still performs the best, but the gaps among the algorithms are narrower. This is due to the fact that BFS on a scale-free graph is usually dominated by a few iterations with dense frontiers, where matrix-driven algorithms usually perform at their best.
All four algorithms achieve nontrivial speedups when going from 1 thread to 24 threads on Edison. GraphMat attains better scalability, even when $\mathbf{x}$ is very sparse, because it always has the work needed to access all nonzero columns of the matrix. By contrast, our work-efficient algorithm might not scale well when the vector is very sparse (e.g., when $\mathrm{nnz}(\mathbf{x})$ is less than the number of threads) due to the lack of enough work for all threads. The parallel efficiency of CombBLAS-SPA decreases with increasing concurrency because the total amount of work performed by all threads increases as each thread scans the entire input vector. Poor serial runtime often contributes to the high speedups of the CombBLAS-heap algorithm.
IV-E. Performance of SpMSpV algorithms on the Intel KNL processor
Figure 5 shows the performance of three SpMSpV algorithms on the Intel KNL processor equipped with 64 cores. We were unable to run GraphMat on KNL. SpMSpV-bucket, CombBLAS-SPA, and CombBLAS-heap all achieve parallel speedups when going from 1 thread to 64 threads on KNL. As before, the serial performance of CombBLAS-SPA is similar to or slightly better than SpMSpV-bucket on scale-free graphs. However, the scalability of CombBLAS-SPA suffers with an increasing number of threads because of its work inefficiency. By contrast, SpMSpV-bucket scales well up to the 64 cores of KNL for diverse classes of matrices. We did not observe any benefit from using multiple threads per core on KNL.
IV-F. Performance breakdown of the SpMSpV-bucket algorithm
The SpMSpV-bucket algorithm has four distinct steps (including the preprocessing step), described in Section III. Here we show how these steps contribute to the total runtime of SpMSpV and how they scale as we increase the thread count. Figure 6 shows the strong scaling of the components of the SpMSpV-bucket algorithm when the adjacency matrix of ljournal-2008 is multiplied by vectors $\mathbf{x}$ with different sparsity patterns. SPA-based merging is the most expensive step of the sequential SpMSpV-bucket algorithm for all sparsity patterns of the input vector. As $\mathbf{x}$ becomes denser, bucketing becomes as expensive as merging on a single thread. For example, in Figure 6, SPA-based merging takes 73%, 62%, and 46% of the total sequential runtime when $\mathrm{nnz}(\mathbf{x})$ is 200, 10K, and 2.5M, respectively. By contrast, the bucketing step takes 10%, 17%, and 35% of the serial runtime when $\mathrm{nnz}(\mathbf{x})$ is 200, 10K, and 2.5M, respectively.
SPA-based merging scales better than the other steps of the SpMSpV-bucket algorithm for all sparsity levels of $\mathbf{x}$, because each thread independently performs the merging on its private buckets. The bucketing step also scales when $\mathrm{nnz}(\mathbf{x})$ is 10K or 2.5M when going from 1 core to 24 cores on Edison, but it slows down by a factor of 2 when $\mathrm{nnz}(\mathbf{x})$ is 200, because the overhead of managing 96 buckets (24 threads multiplied by 4) becomes more expensive than performing the per-bucket merging operations. Consequently, the bucketing step starts to dominate the runtime of the SpMSpV-bucket algorithm at high concurrency. The scalability of all components improves as the input vector becomes denser, as expected.
V. Conclusions and Future Work
We presented a work-efficient parallel algorithm for the sparse matrix-sparse vector multiplication (SpMSpV) problem. We carefully characterized different potential ways to organize the SpMSpV computation and identified the requirements for a high-performance algorithm, one that is work-efficient and also avoids unnecessary synchronization.
Our algorithm avoids synchronization by performing a row-wise partitioning of the input matrix on the fly, and it attains work efficiency by employing the common computational pattern of column-wise algorithms. Our algorithm achieves high performance for a wide range of vector sparsity levels thanks to its vector-driven nature. The implementation of our algorithm on the Intel Ivy Bridge and Intel KNL processors significantly outperforms existing approaches when the input vector is very sparse, and it performs competitively when the input vector gets denser. Matrix-driven algorithms are only competitive when the input vector is relatively dense. As future work, we will investigate when and if it is beneficial to switch to a matrix-driven algorithm.
Further refinements of the SpMSpV problem arise in different contexts. Some SVM implementations shrink the working set periodically, hence requiring a data structure that is more friendly to row deletions. This could affect the trade-offs involved in choosing the right SpMSpV algorithm, depending on the frequency of the shrinking. In addition, the GraphBLAS effort is in the process of defining masked operations, including SpMSpV. These could also affect the algorithmic trade-offs involved. Studying those effects is a subject of future work.
Acknowledgments
This work is supported by the Applied Mathematics Program of the DOE Office of Advanced Scientific Computing Research under contract number DE-AC02-05CH11231. We used resources of NERSC, supported by the Office of Science of the DOE under Contract No. DE-AC02-05CH11231.
References
[1] F. G. Gustavson, "Two fast algorithms for sparse matrices: Multiplication and permuted transposition," ACM Transactions on Mathematical Software (TOMS), vol. 4, no. 3, pp. 250–269, 1978.
[2] A. Lenharth, D. Nguyen, and K. Pingali, "Parallel graph analytics," Communications of the ACM, vol. 59, no. 5, pp. 78–87, 2016.
[3] A. Buluç and K. Madduri, "Parallel breadth-first search on distributed memory systems," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '11. New York, NY, USA: ACM, 2011.
[4] A. Buluç, E. Duriakova, A. Fox, J. Gilbert, S. Kamil, A. Lugowski, L. Oliker, and S. Williams, "High-productivity and high-performance analysis of filtered semantic graphs," in Proceedings of the IPDPS. IEEE Computer Society, 2013.
[5] K. Ekanadham, W. Horn, M. Kumar, J. Jann, J. Moreira, P. Pattnaik, M. Serrano, G. Tanase, and H. Yu, "Graph programming interface (GPI): A linear algebra programming model for large scale graph computations," in Proceedings of the ACM International Conference on Computing Frontiers. ACM, 2016, pp. 72–81.
[6] A. Azad and A. Buluç, "Distributed-memory algorithms for maximum cardinality matching in bipartite graphs," in Proceedings of the IPDPS. IEEE, 2016.
[7] J. Kepner, P. Aaltonen, D. Bader, A. Buluç, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, S. McMillan, J. Moreira, J. Owens, C. Yang, M. Zalewski, and T. Mattson, "Mathematical foundations of the GraphBLAS," in IEEE High Performance Extreme Computing (HPEC), 2016.
[8] D. A. Spielman and S.-H. Teng, "A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning," SIAM Journal on Computing, vol. 42, no. 1, pp. 1–26, 2013.
[9] R. Andersen, F. Chung, and K. Lang, "Local graph partitioning using PageRank vectors," in Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society, 2006, pp. 475–486.
[10] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[11] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
[12] D. Langr and P. Tvrdik, "Evaluation criteria for sparse matrix storage formats," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 2, pp. 428–440, 2016.
[13] A. Buluç and J. R. Gilbert, "On the representation and multiplication of hypersparse matrices," in Proceedings of the IPDPS, April 2008.
[14] N. Sundaram, N. Satish, M. M. A. Patwary, S. R. Dulloor, M. J. Anderson, S. G. Vadlamudi, D. Das, and P. Dubey, "GraphMat: High performance graph analytics made productive," Proceedings of the VLDB Endowment, vol. 8, no. 11, pp. 1214–1225, 2015.
[15] C. Yang, Y. Wang, and J. D. Owens, "Fast sparse matrix and sparse vector multiplication algorithm on the GPU," in IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2015, pp. 841–847.
[16] A. Buluç and J. R. Gilbert, "The Combinatorial BLAS: Design, implementation, and applications," IJHPCA, vol. 25, no. 4, 2011.
[17] J. R. Gilbert, C. Moler, and R. Schreiber, "Sparse matrices in MATLAB: Design and implementation," SIAM Journal on Matrix Analysis and Applications, vol. 13, no. 1, pp. 333–356, 1992.
[18] E. Nurvitadhi, A. Mishra, Y. Wang, G. Venkatesh, and D. Marr, "Hardware accelerator for analytics of sparse data," in Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2016, pp. 1616–1621.
[19] D. Buono, F. Petrini, F. Checconi, X. Liu, X. Que, C. Long, and T.-C. Tuan, "Optimizing sparse matrix-vector multiplication for large-scale data analytics," in Proceedings of the 2016 International Conference on Supercomputing. ACM, 2016, p. 37.
[20] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Trans. Math. Softw., vol. 38, no. 1, p. 1, 2011.