Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts
Abstract
Scalable sparse factorization is critical for efficient numerical simulation of circuits and electrical power grids. In this work, we present a new scalable sparse direct solver called Basker. Basker introduces a new algorithm to parallelize the Gilbert-Peierls algorithm for sparse LU factorization. As architectures evolve, there exists a need for algorithms that are hierarchical in nature to match the hierarchy in thread teams, individual threads, and vector-level parallelism. Basker is designed to map well to this hierarchy in architectures. There is also a need for data layouts to match multiple levels of hierarchy in memory. Basker uses a two-dimensional hierarchical structure of sparse matrices that maps to the hierarchy in the memory architecture and to the hierarchy in parallelism. We present performance evaluations of Basker on the Intel Sandy Bridge and Xeon Phi platforms using circuit and power-grid matrices taken from the University of Florida Sparse Matrix Collection and from Xyce circuit simulations. Basker achieves a substantial geometric-mean speedup over KLU on both CPU (16 cores) and Xeon Phi (32 cores), and outperforms Intel MKL Pardiso (PMKL) on both platforms for low fill-in circuit matrices. Furthermore, Basker provides over 5x speedup on a challenging matrix sequence taken from an actual Xyce simulation.
I Introduction
Scalable sparse direct linear solvers play a pivotal role in the efficiency of simulation codes on many-core systems. Current approaches process multiple columns with similar nonzero structure (supernodes) using threaded Basic Linear Algebra Subprograms (BLAS) [1]. The approach of using BLAS with one-dimensional data layouts may not be able to extract enough parallelism when the matrix has low fill-in or an irregular nonzero pattern, such as matrices generated by the Simulation Program with Integrated Circuit Emphasis (SPICE). Therefore, a new type of solver is needed that uses hierarchical structures to leverage fine-grain parallelism within the irregular nonzero pattern. In this work, we present a new shared-memory sparse direct solver, Basker, designed to use hierarchical data layouts that expose fine-grain parallelism and naturally fit the hierarchical memory structure of most many-core systems. Basker targets parallelizing the state-of-the-art Gilbert-Peierls algorithm [2] for low fill-in problems, and is thereby the first parallel shared-memory solver to do so.
Sparse factorization of unsymmetric indefinite systems is difficult due to the need for numerical pivoting for stability and the dynamic nonzero structure generated by such pivoting. Scaling sparse factorization therefore depends on efficiently finding concurrent work inside this dynamic nonzero structure while providing enough numerical stability. As a result, the speedups achievable for sparse factorization are far from ideal [3, 4]. Coefficient matrices with low fill-in are particularly difficult, since the existence of supernodes is limited. However, a hierarchical structure can often be found in these matrices that exposes multiple levels of parallelism.
Basker uses a hierarchy of two-dimensional sparse blocks designed to exploit the nonzero structure found in matrices from circuit/power-grid problems. These blocks can be found using traditional ordering techniques, such as block triangular form [5] and nested-dissection ordering [6]. This hierarchy of two-dimensional sparse blocks allows Basker to accomplish two goals: (1) exploiting any fine-grained parallelism found within or between blocks, and (2) providing a hierarchical data structure that fits the multiple levels of the memory hierarchy and divides data among threads appropriately. As a result, Basker parallelizes the Gilbert-Peierls algorithm by allowing multiple threads to work simultaneously on a single matrix column.
In this work, we present the algorithm and data layouts used by Basker to achieve hierarchical parallelism. Basker is implemented in templated C++11 with Kokkos [7], a package providing portability across multiple many-core processors and device backends. The main contributions of this work are:

Parallelization of the Gilbert-Peierls algorithm;

A method to expose hierarchical parallelism in sparse matrices using two-dimensional data layouts;

Empirical evaluation of Basker, KLU, and Pardiso on the Intel Sandy Bridge and Xeon Phi architectures;

Performance evaluation with matrices from a transient simulation performed by the Xyce circuit simulator.
The remainder of this paper is organized as follows. Section II presents an overview of previous solver work. We then introduce the hierarchically structured algorithm to extract parallelism from sparse matrices in Section III. Implementation choices are outlined in Section IV. Section V provides performance results and comparisons with other solvers. Finally, possible future improvements and a summary of our findings are described in Section VI.
II Background and Related Work
This section provides a brief overview of background and related work on the solution of the sparse linear system $Ax = b$, where $A$ is a large sparse coefficient matrix, $x$ is the solution vector, and $b$ is the given right-hand side vector.
Orderings. All sparse direct solvers use structural information to improve performance and scalability. Coefficient matrices are often reordered to limit fill-in, i.e., zeros becoming nonzero during factorization, or to cluster nonzeros into patterns that reveal dependencies in computation. Minimum degree orderings, such as approximate minimum degree ordering (AMD), are a type of ordering that is very efficient in reducing fill-in [8]. Nested-dissection (ND) [6] is another ordering, based on the graph $G(A)$ corresponding to a matrix $A$, using $A$ when $A$ is symmetric and $A + A^T$ when $A$ is unsymmetric. It is commonly used to provide a tree structure that can be used in parallel factorizations while reducing fill-in.
If an unsymmetric matrix does not have the strong Hall property, i.e., it is not the case that every set of $k$ columns has nonzeros in at least $k+1$ rows, then the matrix can be permuted into a block triangular form (BTF) where the block submatrices in the lower triangular part are all zeros. A coefficient matrix $A$ permuted by matrices $P$ and $Q$ into BTF has the form:

$$PAQ = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,m} \\ & A_{2,2} & \cdots & A_{2,m} \\ & & \ddots & \vdots \\ & & & A_{m,m} \end{bmatrix}$$
This form is common in irregular unsymmetric systems, such as those from circuit simulation [5]. In this form, only the submatrices on the diagonal (the $A_{i,i}$) need to be factored, resulting in far less work, reduced memory usage, and a great deal of parallelism. In addition to fill reduction, permuting the matrix to limit pivoting by placing nonzeros on the diagonal is common before computation [9]. Finding such a permutation is done by finding a maximum cardinality matching of a bipartite graph representation of the coefficient matrix [10]. However, nonzeros on the diagonal are only half of the issue; a variant that also tries to maximize the values on the diagonal is often used. We will call this variant maximum weight-cardinality matching ordering (MWCM) [10].
Sparse LU. We consider three popular solver packages, namely SuperLU-Dist [9], Pardiso [4], and KLU [5], and compare their design choices to Basker.
SuperLU-Dist is a distributed-memory unsymmetric direct solver [9] that uses a two-dimensional data layout and avoids pivoting by using an MWCM that maximizes the sum of the diagonal elements (MC64) [10]. In each block matrix, SuperLU-Dist performs a supernodal factorization. However, supernodal methods have limitations: a pivot can only be chosen from inside a single supernode, fill-in must be known beforehand, and scaling is limited by the size of supernodes [11]. A shared-memory version, SuperLU-MT [11], that uses a one-dimensional data layout also exists.
Pardiso [4] is a shared-memory, supernodal, sparse solver that uses a number of techniques to achieve high performance. These techniques include a left-right looking strategy to reduce synchronization and three levels of parallelism, namely parallelism from the etree, hybrid (left-right) parallelism at the top levels, and pipelining parallelism. We compare against the Intel MKL version of Pardiso and SuperLU-MT in Section V.
KLU [5] is a serial direct solver based on the Gilbert-Peierls algorithm, and the closest to our effort in algorithmic terms. It achieves good performance by first permuting the matrix into BTF. It then uses the Gilbert-Peierls algorithm to discover the nonzero pattern due to fill-in during numeric factorization, in time proportional to arithmetic operations (Algorithm 1 [2]). However, KLU has no method to factor any part in parallel. Basker was designed to replace KLU for circuit simulation problems by adding parallel execution both between and within the blocks of the BTF. It is part of the Trilinos library through both the Amesos2 [12] and ShyLU [13] packages.
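For reference, a minimal serial sketch of the Gilbert-Peierls kernel that KLU and Basker build on is shown below. It is a simplification, not the KLU or Basker source: partial pivoting is omitted and a zero-free diagonal is assumed (e.g., after MWCM). The depth-first search discovers the fill pattern of each column so that total work stays proportional to the arithmetic operations.

```cpp
// Minimal serial Gilbert-Peierls left-looking LU sketch (no pivoting,
// zero-free diagonal assumed); illustrative only, not KLU/Basker source.
#include <vector>
#include <functional>

struct Csc {                         // compressed sparse column matrix
  int n;
  std::vector<int> ptr, ind;         // column pointers, row indices
  std::vector<double> val;
};

// Factor A = L*U with unit-diagonal L; columns of L and U returned separately.
void gp_lu(const Csc& A,
           std::vector<std::vector<int>>& Lind, std::vector<std::vector<double>>& Lval,
           std::vector<std::vector<int>>& Uind, std::vector<std::vector<double>>& Uval) {
  const int n = A.n;
  Lind.assign(n, {}); Lval.assign(n, {});
  Uind.assign(n, {}); Uval.assign(n, {});
  std::vector<double> x(n, 0.0);     // dense work column
  std::vector<char> marked(n, 0);
  std::vector<int> pattern;          // reach (nonzero pattern) of column k

  // DFS over the graph of already-computed L columns: every row reachable
  // from the pattern of A(:,k) is a (possibly fill-in) nonzero of column k.
  std::function<void(int)> dfs = [&](int i) {
    marked[i] = 1;
    for (int j : Lind[i]) if (!marked[j]) dfs(j);
    pattern.push_back(i);            // postorder: reversed = topological order
  };

  for (int k = 0; k < n; ++k) {
    pattern.clear();
    for (int p = A.ptr[k]; p < A.ptr[k + 1]; ++p) {
      x[A.ind[p]] = A.val[p];
      if (!marked[A.ind[p]]) dfs(A.ind[p]);
    }
    // Sparse lower-triangular solve, visiting rows in topological order
    // so each x[i] is final before it updates rows below it.
    for (int t = (int)pattern.size() - 1; t >= 0; --t) {
      int i = pattern[t];
      if (i >= k) continue;          // entries of L(:,k); no further updates
      for (std::size_t q = 0; q < Lind[i].size(); ++q)
        x[Lind[i][q]] -= Lval[i][q] * x[i];
    }
    // Scatter x into U(:,k) (rows <= k) and L(:,k) (rows > k, scaled by ukk).
    double ukk = x[k];
    for (int i : pattern) {
      if (i <= k) { Uind[k].push_back(i); Uval[k].push_back(x[i]); }
      else        { Lind[k].push_back(i); Lval[k].push_back(x[i] / ukk); }
      x[i] = 0.0; marked[i] = 0;     // reset work arrays for the next column
    }
  }
}
```

Basker applies this kernel to two-dimensional submatrices rather than to the whole matrix, as described in Section III.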
The primary features of Basker are: (1) it is a nonsupernodal factorization; (2) it uses hierarchical data layouts; (3) it uses both MWCM and pivoting; (4) it is a templated C++ solver using Kokkos, a many-core portability package supporting multiple backends such as OpenMP and Pthreads.
III Basker Algorithm
This section introduces the parallel symbolic (pattern only) and numeric (pattern and values) factorization algorithms in Basker. The nonzero pattern of the coefficient matrix and the data layout, i.e., how matrix entries are stored, determine not only the work but also the parallelism available to a sparse factorization. Serial and multithreaded factorization codes traditionally utilize a flat one-dimensional (1D) layout of blocks, where blocks contain the nonzeros of rows/columns stored contiguously. These blocks are derived from some ordering of the matrix (e.g., see Figure 1). However, using only 1D layouts prevents the algorithms from exploiting sparsity patterns within and between block structures. For instance, a 1D multithreaded supernodal factorization's speedup will be limited by the threaded BLAS on a set of columns (rows) called separators (e.g., the separator block column in Figure 1). When these columns are not dense, as in circuit/power-grid problems, the use of BLAS is limited, leading to a serial bottleneck in the separators. Due to this observation, Basker uses a variety of reordering methods, such as BTF and ND, to derive a hierarchy of two-dimensional sparse blocks. This reordering allows Basker to fit the irregular nonzero pattern into a hierarchy of blocks that fits the memory structure of modern nodes and enables an algorithm that can utilize the 2D layout (called a 2D algorithm). 2D algorithms break columns into multiple submatrices (e.g., see Figures 2 and 3), allowing multiple threads to work on a column that would have been serial in a nonsupernodal method, or to efficiently use multiple calls of serial BLAS.
In this work, we focus on two levels of structure, i.e., BTF and ND. We leave a third level (supernodes) within the 2D blocks for future extensions. BTF provides both the coarse structure for the whole matrix and the fine structure for a collection of submatrices. ND provides the fine structure for very large submatrices from BTF. The fine structure of ND is used to arrive at a parallel 2D Gilbert-Peierls algorithm.
The notation used in this section is as follows. A submatrix is given as $A_{i,j}$, where $i$ and $j$ are the row and column indices in the two-dimensional block structure. The nonzero pattern of a column $k$ in a submatrix is given as $\mathcal{P}(A_{i,j}(:,k))$. We use C++ notation for comments in the algorithms.
III-A Coarse Block Triangular Structure
Basker uses block triangular form (BTF) on the input matrix to compute a coarse structure. It first permutes the matrix based on an ordering found from MWCM to ensure a nonzero diagonal with large entries. A strongly connected components algorithm is used next to reorder the matrix such that each component corresponds to a diagonal block. The reordered matrix produces a structure similar to that in Figure 2. This form is common to matrices from several domains, and is well studied [14]. Any of the large diagonal blocks may or may not exist for a particular matrix.
In Figure 2, a two-dimensional structure with three diagonal blocks is shown. As the multiple tiny subblocks in the first and last diagonal blocks provide enough natural parallelism (for factoring each block), Basker uses the ordering derived from BTF as their second-level structure as well. The submatrices from this second-level structure are handled using the fine block triangular structure method. In contrast, the middle diagonal block is very large, without an opportunity to expose parallelism. We reorder this block further with ND and use the fine nested-dissection structure method.
III-B Fine Block Triangular Structure
A typical representation of the fine BTF structure, such as the first and last diagonal blocks, is given in Figure 2. This substructure is easily dealt with, as the subblocks are independent of each other. Therefore, the sparsity pattern and factorization of each diagonal subblock can be computed concurrently. A two-dimensional sparse block structure is used here. The off-diagonal blocks are “partitioned” in a manner that helps the sparse matrix-vector multiplication when solving for a given right-hand side vector. They could be split further; however, they tend to be very sparse, as they retain the original nonzero pattern.
Parallel Symbolic Factorization. The symbolic factorization algorithm for the fine BTF block is shown in Algorithm 2. It is embarrassingly parallel over the blocks. We reorder each diagonal submatrix using AMD (Line 2) for fill reduction. Next, we find the number of nonzeros in each column and estimate the number of floating-point operations required to factor the block (Line 3). Using the number of floating-point operations, Basker assigns the submatrices among the threads, and memory for the factors can be allocated. The colors in Figure 2 provide one such assignment for four threads.
Parallel Numeric Factorization. After symbolic factorization, the numeric factorization uses the same mapping of threads to submatrices to call sparse factorization using the Gilbert-Peierls algorithm. The algorithm is not shown, as it is a simple parallel-for loop over the diagonal submatrices; a sketch follows.
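A minimal Kokkos-style sketch of this loop is shown below; `factor_block` is a hypothetical stand-in for the serial Gilbert-Peierls factorization of one diagonal subblock, and Kokkos is assumed to have been initialized.

```cpp
// Hedged sketch: fine-BTF numeric factorization as one data-parallel loop.
// The diagonal subblocks are independent, so no synchronization is needed
// beyond the implicit join at the end of the parallel_for.
#include <Kokkos_Core.hpp>

void factor_fine_btf(int num_blocks) {
  Kokkos::parallel_for("fine_btf_numeric",
      Kokkos::RangePolicy<>(0, num_blocks),
      KOKKOS_LAMBDA(const int b) {
        // factor_block(b);  // hypothetical: serial GP LU of subblock b
      });
  Kokkos::fence();
}
```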
III-C Fine Nested-Dissection Structure
A subblock such as the middle diagonal block in Figure 2 could be too large to be factored serially as in the BTF fine structure method above. This block could easily dominate the factorization time, but there is no simple way to factor it with multiple threads under the natural ordering. On average, this block constitutes a substantial fraction of the total matrix size in our problem test suite (see Section V). As observed before, using a 1D layout (Figure 1) does not provide enough parallelism. Instead, we reorder this block even further into finer 2D blocks. Using this structure, we design the first parallel Gilbert-Peierls algorithm, in which multiple threads can work on a single column.
The nested-dissection ordering is used to discover smaller independent subblocks to factor in parallel. Basker first permutes the block using an MWCM to find the locally best matching and reduce the need to pivot. Next, Basker computes the ND ordering on the graph of $A + A^T$, along with its ND tree. Basker currently limits the number of leaves in the ND tree to the number of threads available. We note that increasing the number of leaves in the ND tree may provide smaller, cache-friendly submatrices, but would limit the amount of pivoting allowed. This tradeoff is not explored in this paper. Additionally, current implementations of ND provide only a binary tree, and therefore Basker is limited to using a power-of-two number of threads. The reordered matrix is shown in Figure 3 for four threads. This two-dimensional structure of sparse matrices is used to store both the reordered matrix and its factors. The colors suggest one possible layout where the blocks of a particular color are owned by one thread.
Dependency Tree. Basker requires a method to map the ND structure to threads. One option is to use a task-dependency graph and a tasking runtime. However, Basker is currently limited to using data-parallel methods (parallel-for) due to its dependence on Kokkos and integration requirements with Trilinos and Xyce. Basker therefore transforms the task-dependency graph into a dependency tree that represents level sets that can be executed in parallel.
Figure 3 provides the general dependency tree used by both symbolic and numeric factorization for the two-dimensional matrix in Figure 3, and it is read from the bottom up. This tree represents two levels of dependency. The first-level dependencies are between matrices within a node: within each node, matrices listed in a particular row depend on the matrices listed in the rows below in the same node. For example, a submatrix in the top row of a node cannot be processed until the submatrices below it in that node are available. The second-level dependencies are between nodes and are represented with arrows. We use the term level only for the dependency tree (not the etree or ND tree). Nodes are colored to match the thread mapping in Figure 3. Note that this tree is different from an ND tree: it expresses the concurrency in the hierarchical layout so Basker can use level scheduling. One can easily see the difference with Figure 1, where the root node represents an entire block column, whereas in the new dependency tree the separator columns are distributed among multiple threads and the bottleneck at the root node is much smaller.
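The sketch below shows the level-scheduling shape this transformation produces, assuming a hypothetical `process_node` helper and an initialized Kokkos runtime; Section IV describes how Basker replaces the per-level barrier with point-to-point synchronization.

```cpp
// Hedged sketch of level scheduling over the dependency tree: all tree
// nodes in one level are independent and run inside one parallel_for;
// levels are processed bottom-up with a barrier between levels.
#include <Kokkos_Core.hpp>
#include <vector>

// nodes_per_level[l] = number of independent tree nodes at level l.
void run_dependency_tree(const std::vector<int>& nodes_per_level) {
  for (std::size_t lvl = 0; lvl < nodes_per_level.size(); ++lvl) {
    Kokkos::parallel_for("tree_level",
        Kokkos::RangePolicy<>(0, nodes_per_level[lvl]),
        KOKKOS_LAMBDA(const int node) {
          // process_node(lvl, node): hypothetical helper that factors the
          // submatrices owned by this dependency-tree node at this level.
        });
    Kokkos::fence();  // level barrier; Basker relaxes this to point-to-point
                      // synchronization between dependent threads (Section IV)
  }
}
```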
Parallel Symbolic Factorization. Basker needs an accurate estimate of the nonzero counts for the two-dimensional factors, found in parallel (Algorithm 3). A parallel symbolic factorization is crucial in a multithreaded environment, as repeated reallocation for the factors would require a system call, which is a performance bottleneck when done in a parallel region. We do not form the elimination tree of the whole matrix; instead, the appropriate portions are built by different threads.
Basker first processes the bottom two levels of the dependency tree to obtain an accurate nonzero count. The bottom-most level of the dependency tree has the submatrices corresponding to the leaves of the ND tree. First, we find both the nonzero count per column and the elimination tree [15] of either $A + A^T$ or $A^T A$ (depending on symmetry and pivoting options) in parallel (Line 5). Second, the nonzero counts for the remaining submatrices in each bottom-level node are found (Line 6).
Pivoting while factorizing the diagonal block will not affect these counts, by the fill-path theorem [17]. Therefore, Basker can use them to find the nonzero counts of the lower off-diagonal submatrices. Moreover, we record a data structure with the maximum and minimum row index of each column, which will be used for estimating nonzero counts at higher levels. Next, the nonzero counts for the upper off-diagonal submatrices can be found (Line 8). As these may depend on the pivoting in the diagonal block, the elimination tree must be used. For each column, the method counts the nodes encountered starting from each nonzero in the column up to the least common ancestor of any nonzero already explored, where the least common ancestor of two nodes is the least-numbered node that is an ancestor of both. A data structure is returned with the maximum and minimum row index for each row.
The estimated nonzero counts for submatrices in the higher levels of the dependency tree are found using these estimates by looping over the remaining levels (Line 11). At each level, all the nodes on the level are handled by finding the nonzero count of the diagonal subblock (Line 14). The pattern of such a block is the pattern that remains after the updates (multiplications) by the already-computed off-diagonal factors. Basker estimates an upper bound on this count using the recorded maximum and minimum row indices, by assuming a column is dense between the minimum and maximum whenever the ranges of the contributing blocks overlap for that column. We find that this is a reasonable upper bound and cheaper than storing the whole nonzero pattern. Finally, the column counts of any off-diagonal submatrices can be computed (Lines 15 and 16). The column counts for these submatrices use the same upper bound as well (i.e., fill-in estimated with the minimum and maximum row indices).
Parallel Numeric Factorization. This subsection describes the parallel left-looking Gilbert-Peierls algorithm (Algorithm 4). To facilitate understanding, we explain the algorithm using a series of block diagrams of the execution in Figure 4. Blocks that are not colored gray represent submatrices that are active/used at a stage, and the colors correspond to the thread mapping in Figure 3.
Submatrices are factored based on the dependency tree in Figure 3 in a column-by-column manner. Figure 4 starts with the submatrices at the bottom level. Basker factors the submatrices on the diagonal that have no dependencies (Line 4). This factorization uses the Gilbert-Peierls algorithm, similar to Algorithm 1, in parallel on each submatrix. Next, the just-computed column is used to compute the corresponding column of the lower off-diagonal submatrices in the same node (Line 5). This is done by discovering the nonzero pattern as a result of a parallel sparse matrix-vector multiplication. At the end of this level, a synchronization between all threads is needed before moving to the next level. Note that Basker would not necessarily need to synchronize all threads if this were done in a task-parallel manner.
The nodes in the dependency tree above the bottom level have a subtle but important distinction. All submatrices in a tree node are not computed before moving to the next node, as in the symbolic factorization. In contrast, only those submatrices in a tree node corresponding to a particular column are computed (Line 9). This implies multiple passes over the dependency tree (bottom-up, once per column, up to the level that owns the column). Figure 4(b) shows the block diagram for one such column, where the red line indicates the column being factored. The upper off-diagonal submatrices at this level are factorized in parallel using a method similar to Algorithm 1, except that the previously computed lower-triangular factor is used for the backsolve (Line 14).
Basker continues up the dependency tree with a loop over the levels (Line 15). At each new level, Basker must synchronize specific threads in order to combine their results (Line 18). Figure 4 shows the blocks used in the reduction. The reduction has two phases. The first phase is multiple parallel sparse matrix-vector multiplications of the colored matrices with the column entries just found (the red line in the colored blocks). The second phase subtracts each thread's matrix-vector product from the corresponding gray blocks in the column. The contributions from different subtrees are computed in parallel as well. Once the reduction is complete, the newly updated submatrix at this level can be factored like the other upper off-diagonal matrices (Line 20). Figure 4 provides a visual representation of this step. At the last step, at the root, there is one reduction against the already-computed results (Line 24, Figure 4), and then a simple factorization of the diagonal block can be computed (Line 26, Figure 4). This last factorization is the only serial bottleneck in the algorithm.
In the more general case, when the current level is the last level for the column (Line 22) and we are not at the root node (not shown in the figures), there is no further bottom-up traversal of the dependency tree. This would have been the case for block column three in our example. In matrix terms, this means that the upper off-diagonal factor for the column has been computed, and only the diagonal block and the blocks below it remain. This requires a reduction (Line 24) and factoring the diagonal submatrix (Line 26) as before, but any lower off-diagonal submatrices that remain need to be factored as well (Line 28).
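Putting the pieces together, the control flow of Algorithm 4 from a single thread's point of view can be summarized as follows. This is a comment-level C++ sketch with hypothetical helper names, not Basker's actual code.

```cpp
// Comment-level sketch of one thread's work in the 2D numeric
// factorization (Algorithm 4). All helper names are hypothetical.
void numeric_factor_2d_thread(int tid) {
  // Bottom level: factor this thread's leaf diagonal block column by
  // column with Gilbert-Peierls, then update its lower off-diagonal
  // blocks via sparse matrix-vector products (Lines 4-5).
  //
  // Separator columns: for each column k owned by a separator block,
  // make one bottom-up pass over the dependency tree:
  //   for (int lvl = 1; lvl <= last_level(k); ++lvl) {
  //     p2p_wait(partner(tid, lvl));        // point-to-point sync (Line 18)
  //     reduce_contributions(tid, lvl, k);  // two-phase SpMV + subtract
  //     if (lvl < last_level(k))
  //       factor_upper_offdiag(tid, lvl, k);   // backsolve with L (Line 20)
  //     else
  //       factor_diag_and_lower(tid, lvl, k);  // Lines 24-28
  //   }
}
```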
IV Basker Implementation
Data Layout. Basker uses a hierarchy of two-dimensional sparse matrix blocks to store both the original matrix and the factors. The 2D structure is composed of multiple compressed sparse column (CSC) format matrices. Parallelism must be extracted both between blocks in the BTF structure and within large blocks in order to achieve speedup on low fill-in matrices. In particular, a hierarchical structure needs to be exploited to reveal more parallelism. Additionally, this breaks the problem into fine-grain data structures that better fit the memory structure of modern many-core nodes. Basker implements this by building a structure of C++ classes during the symbolic factorization, after applying the aforementioned orderings.
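A minimal sketch of such a layout is shown below, assuming illustrative class names (Basker's actual classes differ): each submatrix is a self-contained CSC matrix, and a 2D block matrix is a dense grid of them.

```cpp
// Hedged sketch of a hierarchical 2D block layout: a grid of CSC
// submatrices. Names are illustrative, not Basker's actual classes.
#include <vector>

struct CscBlock {                  // one sparse submatrix in CSC format
  int nrow = 0, ncol = 0;
  std::vector<int>    col_ptr;     // size ncol + 1
  std::vector<int>    row_idx;     // row index of each nonzero
  std::vector<double> val;         // value of each nonzero
};

struct Block2D {                   // 2D grid of sparse blocks
  int nblk_row = 0, nblk_col = 0;
  std::vector<CscBlock> blocks;    // row-major: blocks[i * nblk_col + j]
  CscBlock&       at(int i, int j)       { return blocks[i * nblk_col + j]; }
  const CscBlock& at(int i, int j) const { return blocks[i * nblk_col + j]; }
};

// The full hierarchy (Section III): small independent fine-BTF diagonal
// blocks, plus large blocks reordered with ND and stored as Block2D grids.
struct HierMatrix {
  std::vector<CscBlock> fine_btf;
  std::vector<Block2D>  nd_blocks;
};
```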
Synchronization. Lightweight synchronizations are needed to allow multiple threads to work on a single column in Basker. There are multiple places where these synchronizations must happen, and they are marked in Algorithm 4. The number of threads that need to synchronize depends on the location and iteration in the algorithm. For instance, all threads need to synchronize when moving from factoring leaf nodes to parent nodes, but only two threads need to synchronize in separator columns.
A traditional data-parallel approach launches a parallel-for over a set of threads, and these threads rejoin the master only at the end of the loop. However, if synchronization took place between all threads at every level, the overhead would be too high. In particular, the total time spent on synchronization in this manner for the matrix G2_Circuit with 8 cores is 11% of the total time. Therefore, Basker uses a different mechanism to synchronize between threads: a point-to-point synchronization that writes to a volatile variable, so that synchronization only happens between the two threads that have a dependency. The importance of point-to-point synchronization for the speedup of sparse triangular solve has been shown before [18]. Using this method, Basker reduces synchronization overhead to 2.3% of total runtime (a nearly 5x improvement) for G2_Circuit.
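A minimal sketch of the idea follows, with std::atomic standing in for the volatile flag writes described above (names are illustrative):

```cpp
// Hedged sketch of point-to-point synchronization: a consumer thread spins
// on a flag set by the single producer it depends on, instead of joining a
// full barrier over all threads. Basker uses volatile writes; std::atomic
// with acquire/release is the portable C++11 equivalent shown here.
#include <atomic>
#include <vector>

struct PointToPoint {
  std::vector<std::atomic<int>> done;   // one flag per (thread, level) pair
  explicit PointToPoint(std::size_t n) : done(n) {
    for (auto& f : done) f.store(0, std::memory_order_relaxed);
  }
  void signal(std::size_t id) {         // producer: dependent data is ready
    done[id].store(1, std::memory_order_release);
  }
  void wait(std::size_t id) {           // consumer: spin until ready
    while (done[id].load(std::memory_order_acquire) == 0) { /* spin */ }
  }
};
```

For example, before performing the reduction for a separator column at a given level, a thread calls wait() only on the flag that its partner thread signals, leaving all other threads undisturbed.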
V Empirical Evaluation
We evaluate Basker against Pardiso MKL 11.2.2 (PMKL), SuperLU-MT 3.0 (SLUMT), and KLU 1.3.2 on a set of sparse matrices from circuit and power-grid simulations, in terms of memory and runtime. Our MWCM implementation is similar to the MC64 bottleneck ordering [10], unlike SuperLU-Dist's product/sum-based MC64 ordering. Scotch [6] 6.0 is used to obtain the ND ordering. Furthermore, we compare Basker's performance on a sequence of matrices from a circuit simulation of interest.
V-A Experimental Setup
Table I: Test suite sorted by increasing fill-in density (measured with KLU). nnz(L+U) for PMKL and Basker are from 8-core runs on Sandy Bridge; the rule separates the seven matrices with fill-in density above 4.

Matrix        n       nnz(A)   nnz(L+U)   nnz(L+U)   nnz(L+U)    BTF %   BTF blocks   Fill-in (KLU)
                               (KLU)      (PMKL)     (Basker)
RS_b39c30+    6.0E4   1.1E6    6.9E5      6.3E6      6.9E5       100     3E3          0.6
RS_b678c2+    3.6E4   8.8E6    5.8E6      5.9E7      5.8E6       100     271          0.7
Power0*+      9.8E4   4.8E5    6.4E5      9.1E5      6.4E5       100     7.7E3        1.3
Circuit5M     5.6E6   6.0E7    6.8E7      3.1E8      7.4E7       0       1            1.3
memplus       1.2E4   9.9E4    1.4E5      1.3E5      1.4E5       0.1     23           1.4
rajat21       4.1E5   1.9E6    2.8E6      4.9E6      2.8E6       2       5.9E3        1.5
trans5        1.2E5   7.5E5    1.2E6      1.3E6      1.2E6       0       1            1.6
circuit_4     8.0E4   3.1E5    5.0E5      5.8E5      5.1E5       34.8    2.8E4        1.6
Xyce0*        6.8E5   3.9E6    4.7E6      3.8E7      4.8E6       85      5.8E5        1.8
Xyce4*        6.2E6   7.3E7    4.5E7      5.0E7      4.5E7       12      7.5E5        2.0
Xyce1*        4.3E5   2.4E6    5.1E6      5.6E6      5.1E6       21      9.9E4        2.4
asic_680ks    6.8E5   1.7E6    4.5E6      2.9E7      4.5E6       86      5.8E5        2.6
bcircuit      6.9E4   3.8E5    1.1E6      1.1E6      1.1E6       0       1            2.8
scircuit      1.7E5   9.6E5    2.7E6      2.7E6      2.7E6       0.3     48           2.8
hvdc2+        1.9E5   1.3E6    3.8E6      3.0E6      3.8E6       100     67           2.8
--------------------------------------------------------------------------------------------
Freescale1    3.4E6   1.7E7    7.1E7      5.6E7      6.8E7       0       1            4.1
hcircuit      1.1E5   5.1E5    7.3E5      6.7E5      7.1E5       13      1.4E3        6.9
Xyce3*        1.9E6   9.5E6    7.6E7      4.3E7      7.7E7       20      4.0E5        9.2
memchip       2.7E6   1.3E7    1.3E8      6.5E7      9.4E7       0       1            9.9
G2_Circuit    1.5E5   7.3E5    2.0E7      1.3E7      2.0E7       0       1            27.7
twotone       1.2E5   1.2E6    4.8E7      2.7E7      4.7E7       0       5            39.9
onetone1      3.6E4   3.4E5    1.4E7      4.3E6      1.2E7       1.1     203          40.8
System Setup. We use two test beds for our experiments. The first system has two eight-core Xeon E5-2670 processors running at 2.6 GHz (Sandy Bridge). The two processors are interconnected using Intel's QuickPath Interconnect (QPI) and share 24 GB of DRAM. The second system has an Intel Xeon Phi coprocessor with 61 cores running at 1.238 GHz and 16 GB of memory. Since Basker requires a power-of-two number of threads, we only test up to 32 cores, as 64 threads would oversubscribe the device. All codes are compiled using the Intel 15.2 compiler with -O3 optimization.
Test Suite. Basker is evaluated over a test suite of circuit and power-grid matrices taken from Xyce and the University of Florida Sparse Matrix Collection [19]. These matrices vary in size, sparsity pattern, and number of BTF blocks. Additionally, these matrices vary in fill-in density, i.e., nnz(L+U)/nnz(A), where nnz(A) is the number of nonzeros in A. We note that the fill-in density can be less than one when using BTF, since only the diagonal subblocks of A are factored. As reported by Davis and Natarajan [5], coefficient matrices coming from circuit simulation generally have lower fill-in density than those coming from PDE simulations. Matrices with lower fill-in tend to perform better using a Gilbert-Peierls algorithm than a supernodal approach. For fairness, we include seven matrices with fill-in density larger than 4. Table I lists all matrices sorted by increasing fill-in density measured using KLU. The percentage of matrix rows in small independent diagonal submatrices (fine BTF structure) is shown as BTF %. The total number of BTF blocks is also shown. A rule divides the matrices with fill-in density higher than 4. The test suite is a mix of matrices with very different properties to exercise different portions of Basker.
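As a concrete instance of this metric, using the memplus row of Table I (simple arithmetic on the table's own values, not a new measurement):

```latex
\[
  \text{fill-in density(memplus)} \;=\; \frac{nnz(L+U)}{nnz(A)}
  \;=\; \frac{1.4\times 10^{5}}{9.9\times 10^{4}} \;\approx\; 1.4,
\]
% matching the value reported in the last column of Table I.
```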
V-B Memory Usage
We now compare memory requirements in terms of nnz(L+U). Table I lists the number of nonzeros in L+U for KLU, PMKL, and Basker. We do not report results for SLUMT due to its performance (shown below). The nonzeros reported for PMKL and Basker are from a run using 8 cores on Sandy Bridge. We note that this number varies slightly for Basker depending on the number of cores. We observe that Basker produces factors with fewer nonzero entries for most matrices with fill-in density below 4. This reduction can be as high as an order of magnitude, as for the matrix RS_b678c2+. This is the result of using the BTF structure and a fill-reducing ordering on the subblocks. However, PMKL uses slightly less memory on matrices with fill-in density above 4. The additional memory used by Basker on these matrices is far less than the additional memory used by PMKL on the first group of matrices.
V-C Performance
We first compare the general performance of the chosen sparse solver packages. Only the numeric factorization time is compared, since the symbolic factorization of both Basker and PMKL is limited by finding the ND ordering. Figure 5 gives the raw time on Intel Sandy Bridge for a selection of six matrices. These six matrices are selected for their varying fill-in density, and are ordered by increasing density. We first observe that PMKL is as good as or better than SuperLU-MT. Similar results have been reported in the past [20] when comparing against SuperLU-Dist for circuit problems. Additionally, Basker performs better than the other solvers on 5/6 matrices. For this reason, we only perform additional comparisons against PMKL.
V-D Scalability
In this section, we focus on the scalability of the numeric factorization phase of Basker and PMKL on the two architectures. We use speedup relative to KLU, as it is the state-of-the-art sequential solver, i.e.,

$$\mathrm{speedup}(A, s, p) = \frac{T_{\mathrm{KLU}}(A)}{T_{s}(A, p)},$$

where $T_{s}(A, p)$ is the time of the numeric factorization phase, $A$ is the input matrix, $s$ is either Basker or PMKL, and $p$ is the number of cores.
Figure 6 shows the speedup achieved for these six matrices on the Sandy Bridge platform. We provide the KLU time in the title of each subfigure. We observe that Basker achieves its largest speedup on hvdc2 and outperforms PMKL in all but one case (Xyce3), which has a high fill-in density of 9.2. Moreover, we observe that PMKL has a speedup of less than 1 in serial for four problems, demonstrating the inefficiency of a supernodal algorithm relative to a Gilbert-Peierls algorithm for matrices with low fill-in density. By adding more cores, PMKL is not able to recover from this inefficiency and reaches only a modest maximum speedup on the first four problems. The reason for this is the semi-dense columns that Basker is able to avoid factoring. PMKL does factor Xyce3 faster, given its high fill-in density, but Basker scales in a similar way.
The relative speedups of the same six matrices on the Intel Xeon Phi are shown in Figure 6. Again, the KLU time is provided in each subfigure's title. On the Intel Xeon Phi, Basker is able to outperform PMKL on four out of the six matrices. Among these six matrices, Basker achieves its maximum speedup on Power0, and PMKL achieves its maximum speedup on Xyce3. We observe that any overhead from using a Gilbert-Peierls algorithm on a matrix with high fill-in density is magnified on the Intel Xeon Phi. This is seen in both Freescale1 and Xyce3. One possible reason is that the submatrices in the lowest level of the hierarchical structure are too large to fit into a core's L2 cache. Basker currently makes the submatrices as large as possible to allow for better pivoting. However, Basker still achieves higher speedups than PMKL on the four matrices with low fill-in density.
As a next step, we compare the performance on the whole test suite. On Sandy Bridge with 16 cores, Basker achieves a higher geometric-mean speedup over KLU than PMKL, and is faster than PMKL on 17/22 matrices. The five matrices on which PMKL is faster have a high fill-in density. On the Xeon Phi with 32 cores, Basker is faster than PMKL on 16/22 matrices. These are the same matrices as on Sandy Bridge, except Freescale1. PMKL's relatively high geometric-mean speedup on the Xeon Phi is again due to its higher performance on high fill-in density matrices.
While the geometric mean gives some idea of relative performance, we use a performance profile to gain an understanding of the overall performance over the test suite. The performance profile measures the relative time of a solver on a given matrix against the best solver for that matrix. The values are plotted for all matrices in a graph with time relative to the best time on the x-axis and the fraction of matrices on the y-axis. The result is a figure in which a point (x, y) is plotted if a solver takes no more than x times the runtime of the fastest solver on a fraction y of the problems. For example, if a solver is within 1.2x of the fastest solver on 15 of the 22 matrices, the point (1.2, 15/22) appears on its curve.
Figure 7 shows the performance profile of Basker, PMKL, and KLU in serial on Sandy Bridge. This shows a baseline of how well each method does in serial. We observe that Basker is the best solver on most of the problems, while the supernodal method of PMKL is within a small factor of the best solver on many of the problems; PMKL is the better solver only on a minority of the problems. Despite having very similar algorithms, Basker is able to slightly beat KLU. This slight difference is because of the difference in orderings and the use of Kokkos memory padding.
The performance profile of the parallel solvers on Sandy Bridge (16 cores) is shown in Figure 7. Serial KLU is not included in this figure. Basker is the best solver on most of the matrices, with PMKL within a small factor of Basker on many of them. PMKL is the best solver on the matrices with high fill-in density. This demonstrates that Basker scales well on Sandy Bridge for low fill-in density matrices. On the Intel Xeon Phi with 32 cores, the performance profile is slightly different (Figure 7). Basker is still the best solver on most matrices, but PMKL is the best (or very close to the best) on a larger fraction of them. One can observe that Basker now does poorly on high fill-in density matrices. A reason for this is the lack of a large shared cache in which to share data during Basker's reductions.
V-E Comparison on Ideal Matrices
Next, we analyze how well Basker scales on low fill-in density matrices, compared to how well the supernodal solver PMKL scales on 2D/3D mesh problems. This comparison allows us to better understand whether Basker achieves speedups on its ideal input similar to PMKL's on its ideal input. The other reason is to see how well we can parallelize the Gilbert-Peierls algorithm on its ideal problems. We use a second test suite of matrices for PMKL that come from 2D/3D mesh problems, listed in Table II. The performance of PMKL on these matrices is compared to the performance of Basker on the six matrices of our primary test suite with the lowest fill-in density.
Table II: 2D/3D mesh matrices used as ideal inputs for PMKL.

Matrix          n       nnz(A)   nnz(L+U)   Description
pwtk            2.2E5   1.2E7    9.7E7      Wind tunnel stiffness matrix
ecology         1.0E6   5.0E6    7.1E7      5-pt stencil, model movement
apache2         7.2E5   4.8E6    2.8E8      Finite difference, 3D
bmwcra1         1.5E5   1.1E7    1.4E8      Stiffness matrix
parabolic_fem   5.3E5   3.7E6    5.2E7      Parabolic finite element
helm2d03        3.9E5   2.7E6    3.7E7      Helmholtz on square
Figure 8 provides a scatter plot of the speedup for each solver relative to itself over its six ideal matrices. A linear trend line is shown for each set of solver speedups. Both solvers achieve a similar speedup trend on Sandy Bridge for their ideal inputs. This demonstrates that, on systems with a large cache hierarchy, Basker achieves state-of-the-art performance on low fill-in density matrices. In Figure 8, a similar plot is given for our Xeon Phi platform. This time, Basker has a slightly lower trend line starting at 16 cores. We suspect this is due both to the size of the submatrices not fitting into cache and to the time for the reduction. We plan to address both issues in future versions of Basker.
V-F Xyce
Next, we consider the use of Basker on a sequence of matrices generated during the transient analysis of a circuit. Xyce is a transistor-level simulator that performs a SPICE-style simulation of circuits, where devices and their interconnectivity are transformed via modified nodal analysis into a set of nonlinear differential algebraic equations (DAEs). During transient analysis, these nonlinear DAEs are solved implicitly through numerical integration methods. Any numerical integration method requires the solution of a sequence of nonlinear equations, which in turn generates a sequence of linear systems. A transient analysis can generate millions of coefficient matrices with the same structure but significantly different values. Each factorization may require a different permutation due to pivoting for this reason. For very large circuits, this results in the numeric factorization being the limiting factor in the simulation's overall time and scalability. Furthermore, a solver package must reuse the symbolic factorization for all matrices in the sequence, as repeating the symbolic factorization would dramatically affect performance.
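The reuse pattern described above looks roughly like the following sketch in the style of the Amesos2 interface through which Basker is available (Section II). The exact template parameters and solver-name string may differ by Trilinos version; treat this as illustrative, not verbatim Xyce code.

```cpp
// Hedged sketch: reusing one symbolic factorization across a transient
// sequence of matrices with identical structure but different values.
#include <Amesos2.hpp>
#include <Teuchos_RCP.hpp>
#include <Tpetra_CrsMatrix.hpp>
#include <Tpetra_MultiVector.hpp>

using MAT = Tpetra::CrsMatrix<double>;
using MV  = Tpetra::MultiVector<double>;

void transient_sequence(Teuchos::RCP<MAT> A, Teuchos::RCP<MV> X,
                        Teuchos::RCP<MV> B, int num_systems) {
  auto solver = Amesos2::create<MAT, MV>("Basker", A, X, B);
  solver->symbolicFactorization();     // once: orderings and size estimates
  for (int i = 0; i < num_systems; ++i) {
    // ... the integrator updates the values (not structure) of A and B ...
    solver->numericFactorization();    // per matrix: values have changed
    solver->solve();
  }
}
```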
For this experiment, we chose a sequence from the circuit that generated Xyce1. This circuit is of particular interest because it has been used in prior studies [21] to illustrate the ineffectiveness of preconditioned iterative methods and of direct solvers other than KLU. Attempts to use the PMKL solver had been met with either solver failure or simulation failure until recently. Therefore, we wish to see how well Basker performs on a sequence of these matrices (1000 matrices), which represents a fraction of the desired transient length.
Over the sequence of 1000 matrices, Basker took 175.21 seconds, KLU took 914.77 seconds, and PMKL took 951.34 seconds. This is a speedup of roughly 5.4x when using Basker instead of PMKL, and 5.2x when using Basker instead of KLU. The scalable simulation of this circuit was previously limited by the serial bottleneck of using KLU as the direct solver, a choice justified by its performance compared to PMKL. Basker provides significant speedup compared to either KLU or PMKL, and finally provides a scalable direct solver to Xyce for performing the transient analysis of this circuit.
VI Conclusions and Future Work
We introduced a new multithreaded sparse LU factorization, Basker, that uses hierarchical parallelism and data layouts. Basker provides an alternative to traditional solvers that use a one-dimensional layout with BLAS. In particular, it is useful for coefficient matrices with hierarchical structure, such as circuit problems. We also introduced the first parallel implementation of the Gilbert-Peierls algorithm. Performance results show that Basker scales well for matrices with low fill-in density, yielding geometric-mean speedups over the test suite relative to KLU on both 16 Sandy Bridge cores and 32 Intel Xeon Phi cores. In particular, Basker achieves speedups on these matrices similar to PMKL's on 2D/3D mesh problems, and reduces the time for a sequence of circuit problems from Xyce by more than a factor of five. Basker shows that, in order to speed up sparse factorization on a many-core node, solvers must leverage all available parallelism, and may do so by using a hierarchical structure.
We plan to continue support of Basker in the ShyLU package of Trilinos for Xyce. Future scheduled improvements include adding supernodes to the hierarchical structure to improve performance on high fill-in matrices, and using asynchronous tasking to reduce synchronization costs.
Acknowledgment
We would like to thank Erik Boman, Andrew Bradley, Kyungjoo Kim, H.C. Edwards, Christian Trott, and Simon Hammond for insights and discussions. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the U.S. Department of Energy under contract DEAC0494AL85000.
References
 [1] J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu, “A supernodal approach to sparse partial pivoting,” SIAM J. Matrix Anal. Appl., vol. 20, no. 3, pp. 720–755, May 1999.
 [2] J. R. Gilbert and T. Peierls, “Sparse partial pivoting in time proportional to arithmetic operations,” SIAM Journal on Scientific and Statistical Computing, vol. 9, no. 5, pp. 862–874, 1988.
 [3] P. R. Amestoy, I. S. Duff, J.-Y. L'Excellent, and J. Koster, “MUMPS: a general purpose distributed memory sparse solver,” in Applied Parallel Computing. New Paradigms for HPC in Industry and Academia. Springer, 2001, pp. 121–130.
 [4] O. Schenk, K. Gärtner, W. Fichtner, and A. Stricker, “PARDISO: A high-performance serial and parallel sparse linear solver in semiconductor device simulation,” Future Gener. Comput. Syst., vol. 18, no. 1, pp. 69–78, Sep. 2001.
 [5] T. A. Davis and E. Palamadai Natarajan, “Algorithm 907: KLU, a direct sparse solver for circuit simulation problems,” ACM Trans. Math. Softw., vol. 37, no. 3, pp. 36:1–36:17, Sep. 2010.
 [6] F. Pellegrini and J. Roman, “Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs,” in Proceedings of the International Conference and Exhibition on High-Performance Computing and Networking, ser. HPCN Europe 1996. London, UK: Springer-Verlag, 1996, pp. 493–498.
 [7] H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos: Enabling manycore performance portability through polymorphic memory access patterns,” Journal of Parallel and Distributed Computing, vol. 74, no. 12, pp. 3202–3216, 2014, Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.
 [8] P. R. Amestoy, T. A. Davis, and I. S. Duff, “An approximate minimum degree ordering algorithm,” SIAM J. Matrix Anal. Appl., vol. 17, no. 4, pp. 886–905, Oct. 1996.
 [9] X. S. Li and J. W. Demmel, “SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems,” ACM Trans. Math. Softw., vol. 29, no. 2, pp. 110–140, Jun. 2003.
 [10] I. S. Duff and J. Koster, “On algorithms for permuting large entries to the diagonal of a sparse matrix,” SIAM J. Matrix Anal. Appl., vol. 22, no. 4, pp. 973–996, Jul. 2000.
 [11] J. W. Demmel, J. R. Gilbert, and X. S. Li, “An asynchronous parallel supernodal algorithm for sparse Gaussian elimination,” SIAM J. Matrix Anal. Appl., vol. 20, no. 4, pp. 915–952, Jul. 1999.
 [12] E. Bavier, M. Hoemmen, S. Rajamanickam, and H. Thornquist, “Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems,” Sci. Program., vol. 20, no. 3, pp. 241–255, Jul. 2012.
 [13] S. Rajamanickam, E. Boman, and M. Heroux, “ShyLU: A hybrid-hybrid solver for multicore platforms,” in Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, 2012, pp. 631–643.
 [14] A. Pothen and C.-J. Fan, “Computing the block triangular form of a sparse matrix,” ACM Trans. Math. Softw., vol. 16, no. 4, pp. 303–324, Dec. 1990.
 [15] T. A. Davis, Direct Methods for Sparse Linear Systems (Fundamentals of Algorithms 2). Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2006.
 [16] D. J. Rose and R. E. Tarjan, “Algorithmic aspects of vertex elimination on directed graphs,” SIAM Journal on Applied Mathematics, vol. 34, no. 1, pp. 176–197, 1978.
 [17] D. J. Rose, R. E. Tarjan, and G. S. Lueker, “Algorithmic aspects of vertex elimination on graphs,” SIAM Journal on Computing, vol. 5, no. 2, pp. 266–283, 1976.
 [18] J. Park, M. Smelyanskiy, N. Sundaram, and P. Dubey, “Sparsifying synchronization for high-performance shared-memory sparse triangular solver,” in Proceedings of the 29th International Conference on Supercomputing, Volume 8488, ser. ISC 2014. New York, NY, USA: Springer-Verlag New York, Inc., 2014, pp. 124–140.
 [19] T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix Collection.” [Online]. Available: http://www.cise.ufl.edu/research/sparse/matrices
 [20] H. Thornquist and S. Rajamanickam, “A hybrid approach for parallel transistor-level full-chip circuit simulation,” in High Performance Computing for Computational Science – VECPAR 2014, ser. Lecture Notes in Computer Science, M. Daydé, O. Marques, and K. Nakajima, Eds. Springer International Publishing, 2015, vol. 8969, pp. 102–111.
 [21] H. K. Thornquist, E. R. Keiter, R. J. Hoekstra, D. M. Day, and E. G. Boman, “A parallel preconditioning strategy for efficient transistor-level circuit simulation,” in ICCAD '09: Proceedings of the 2009 International Conference on Computer-Aided Design. New York, NY, USA: ACM, 2009, pp. 410–417.