Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts
Scalable sparse factorization is critical for efficient numerical simulation of circuits and electrical power grids. In this work, we present a new scalable sparse direct solver called Basker. Basker introduces a new algorithm to parallelize the Gilbert-Peierls algorithm for sparse factorization. As architectures evolve, there exists a need for algorithms that are hierarchical in nature to match the hierarchy in thread teams, individual threads, and vector level parallelism. Basker is designed to map well to this hierarchy in architectures. There is also a need for data layouts to match multiple levels of hierarchy in memory. Basker uses a two-dimensional hierarchical structure of sparse matrices that maps to the hierarchy in the memory architectures and to the hierarchy in parallelism. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulations. Basker achieves a geometric mean speedup of on CPU (16 cores) and on Xeon Phi (32 cores) relative to KLU. Basker outperforms Intel MKL Pardiso (PMKL) by as much as on CPU (16 cores) and on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides speedup on a challenging matrix sequence taken from an actual Xyce simulation.
Scalable sparse direct linear solvers play a pivotal role in the efficiency of simulation codes on many-core systems. Current approaches process multiple columns with similar nonzero structure (supernodes) with threaded Basic Linear Algebra Subprograms (BLAS) . The approach of using BLAS with one-dimensional data layouts of these matrices may not be able to extract enough parallelism when the matrix has low fill-in or an irregular nonzero pattern, such as matrices generated by Simulation Program with Integrated Circuit Emphasis (SPICE). Therefore, a new type of solver is needed that uses a hierarchical structures to leverage fine-grain parallelism within the irregular nonzero pattern. In this work, we present a new shared-memory sparse direct solver, Basker, designed to use hierarchical data layouts that exposes fine-grain parallelism and naturally fit the hierarchical memory structure of most many-core systems. Basker is targeted towards parallelizing the state-of-the-art Gilbert-Peierls algorithm  for low fill-in problems and thereby becoming the first parallel shared-memory solver to do so.
Sparse factorization of unsymmetric indefinite systems is difficult due to the need for numerical pivoting for stability and dynamic nonzero structure generated by such pivoting. Scaling sparse therefore depends on efficiently finding concurrent work inside this dynamic nonzero structure while providing enough numerical stability. As a result, speedups achievable for sparse factorization is far from ideal [3, 4]. Coefficient matrices with low fill-in are particularly difficult, since the existence of supernodes is limited. However, a hierarchical structure can often be found in these matrices that can expose multiple levels of parallelism.
Basker uses a hierarchy of two-dimensional sparse blocks designed to exploit the nonzero structure that can be found in a matrix from circuit/powergrid problems. These blocks can be found using traditional ordering techniques, such as block triangular form  and nested-dissection ordering . This hierarchy of two-dimensional sparse blocks design allows Basker to accomplish two goals: (1) exploiting any fine-grained parallelism found within or between blocks and (2) designing a hierarchical data structure that fits the multiple levels of memory hierarchy and divide data among threads appropriately. As a result, Basker enables parallelization of Gilbert-Peierls algorithm by allowing multiple threads work simultaneously on a single matrix column.
In this work, we present the algorithm and data layouts used by Basker to achieve hierarchical parallelism. Basker is a implemented in templated C++11 with Kokkos . Kokkos is a package providing portability across multiple manycore processors and device backends. The main contributions of this work are:
Parallelization of the Gilbert-Peierls algorithm;
A method to expose hierarchical parallelism in sparse matrices using two dimensional data-layouts;
Empirical evaluation of Basker, KLU, and Pardiso on the Intel Sandy Bridge and Xeon Phi architectures.
Performance evaluation with matrices from a transient simulation performed by the Xyce circuit simulator
The remainder of this paper is organized as follows. Section II presents an overview of previous solver work. We then introduce the hierarchically structured algorithm to extract parallelism from sparse matrices in Section III. Implementation choices are outlined in Section IV. Section V provides performance results and comparisons with other solvers. Finally, possible future improvements and a summary of our findings are described in Section VI.
Ii Background and Related Work
This section provides a brief overview of background and related work to the solution of the sparse linear system , where is a large sparse coefficient matrix, is the solution vector, is the given right-hand side vector.
Orderings. All sparse direct solvers use structural information to improve performance and scalability. Coefficient matrices are often reordered to limit fill-in, i.e., zeros becoming nonzero during factorization, or cluster nonzeros into patterns that reveal dependencies in computation. Minimum degree orderings, such as approximate minimum degree ordering (AMD), are a type of ordering that is very efficient in reducing fill-in . Nested-dissection (ND)  is another ordering based on the graph() corresponding to a matrix, using when is symmetric and + when is unsymmetric. It is commonly used to provide a tree-structure that can be used in parallel factorizations while reducing fill-in.
If an unsymmetric matrix does not have the strong Hall property, i.e., if every set of columns has nonzeros in at least + rows, then the matrix can be permuted into a block triangular form (BTF) where block submatrices in the lower triangular part are all zeros. A coefficient matrix permuted by matrices and into BTF has the form:
This form is common in irregular unsymmetric systems, such as those from circuit simulation . In this form, only submatrices on the diagonal () need to be factored resulting in far less work, reduced memory usage, and a great deal of parallelism. In addition to fill reduction, permuting the matrix to limit pivoting by placing nonzeros on the diagonal is common before computation . Finding such a permutation is done through finding a maximum cardinality matching of a bipartite graph representation of the coefficient matrix . However, nonzeros on the diagonal is only one half of the issue; a variant that also tries to maximize the values on the diagonal is often used. We will call this variant maximum weight-cardinality matching ordering (MWCM) .
SuperLU-Dist is a distributed memory unsymmetric direct solver  that uses a two-dimensional data layout and avoids pivoting by using MWCM that maximizes the sum of the diagonal element (MC64) . In each block matrix, SuperLU-Dist performs a supernodal based factorization. However, supernodal methods have limitations such as a pivot can only be chosen from inside a single supernode, fill-in must be known before hand, and scaling is limited by the size of supernodes . A shared-memory version SuperLU-MT  that uses a one-dimensional data layout exists.
Pardiso  is a shared-memory, supernodal, sparse solver that uses a number of techniques to achieve high performance. These techniques include using a left-right looking strategy to reduce synchronization and provide three levels of parallelism, namely from the etree, hybrid (left-right) at top levels, and pipelining parallelism. We compare against Intel MKL version of Pardiso and SuperLU-MT in Section V.
KLU  is a serial direct solver, based on the Gilbert-Peierls algorithm, and the closest to our effort in algorithmic terms. It achieves good performance by permuting the matrices first into BTF. It then uses the Gilbert-Peierls algorithm to discover the nonzero pattern due to fill-in during numeric factorization in time proportional to arithmetic operations (Algorithm 1 ). However, KLU has no method to factor any part in parallel. Basker was designed to replace KLU for circuit simulation problems by adding parallel execution both between blocks and within blocks of the BTF. It is part of Trilinos library through both Amesos2  and ShyLU  packages.
The primary features of Basker are: (1) It is a nonsupernodal factorization; (2) It uses a hierarchical data layouts; (3) It uses both MWCM and pivoting; (4) It is a templated C++ solver using a manycore portable package supporting multiple backends such as OpenMP and PThreads.
Iii Basker Algorithm
This section introduces the parallel symbolic (pattern only) and numeric (pattern and values) factorization algorithms in Basker. The nonzero pattern of the coefficient matrix and the data-layout,i.e., how matrix entries are stored, determines not only the work but also the available parallelism to a sparse factorization. Serial/multithreaded factorization codes traditionally utilize a flat one-dimensional (1D) layout of blocks where blocks contain nonzeros in rows/columns stored contiguously. These blocks are derived from some ordering of the matrix (e.g., See Figure 1). However, using only 1D layouts limit the algorithms from exploiting sparsity patterns within and between block structures. For instance, a 1D multithreaded supernodal factorization’s speedup will be limited by the threaded BLAS on a set of columns (rows) called separators (e.g, the block column in Figure 1). When these columns are not dense, like in circuit/powergrid problems the use of BLAS is limited leading to a serial bottleneck in the separators. Due to this observation, Basker uses a variety of reordering methods, such as BTF and ND, to derive a hierarchy of two-dimensional sparse blocks. This reordering allows Basker to fit the irregular nonzero pattern into a hierarchy of blocks that fit the memory structure of modern nodes and allow an algorithm that can utilize the 2D layouts (called 2D algorithm). 2D algorithms break columns into multiple submatrices (e.g., See Figures 2,3) allowing for multiple threads to work on a column that would have been serial in a nonsupernodal method or efficiently use multiple calls of serial BLAS.
In this work, we will focus on two levels of structures, i.e., BTF and ND. We leave the third level (supernodes) within the 2D blocks for future extensions. BTF provides both the coarse structure for the whole matrix, and the fine structure for a collection of submatrices. ND provides the fine structure for very large submatrices from BTF. The fine structure of ND is used to arrive at a parallel 2D Gilbert-Peierls algorithm.
The notation used in this section is as follows. A submatrix is given as , where and are the indices in the row and column in the two-dimensional block structure. The nonzero pattern of a column () in a submatrix is given as . We use C++ notation for comments in the algorithms.
Iii-a Coarse Block Triangular Structure
Basker uses block triangular form (BTF) on the input matrix to compute a coarse structure. It permutes the matrix based on an ordering found from MWCM () to ensure a non-zero diagonal with large entries. A strongly connected components algorithm is used next to reorder the matrix () such that each component corresponds to a block diagonal. The reordered matrix, i.e., , produces a structure similar to that in Figure 2. This form is common to matrices from several domains, and is well studied . Any of the large diagonal blocks may or may not exist for a particular matrix.
In Figure 2, a two-dimensional structure with three diagonal blocks is shown. As the multiple tiny subblocks in and provide enough natural parallelism (for factoring each block), Basker uses this ordering derived from BTF as their second level structure as well. The submatrices from this second level structure are handled using a Fine Block Triangular Structure based method. In contrast, is very large without an opportunity to expose parallelism. We will use ND to reorder further and use Fine Nested-Dissection Structure based method.
Iii-B Fine Block Triangular Structure
A typical representation of fine BTF structure, such as and , is given in Figure 2. The substructure is easily dealt with as the subblocks are independent of each other. Therefore, the sparsity pattern and factorization of each subblock () can be computed concurrently. A two-dimensional sparse block structure is used here. The off-diagonal blocks are “partitioned” in a manner to help the sparse matrix-vector multiplication when solving for a given right-hand side vector. They could further be split, however they tend to be very sparse as they retain the original nonzero pattern.
Parallel Symbolic Factorization. The symbolic factorization algorithm for the fine BTF block is shown in Algorithm 2. It is embarrassingly parallel over the blocks. We reorder each diagonal submatrix using AMD (Line 2) for fill-reduction. Next, we find the number of nonzeros of each column and estimate the number of floating-point operations required to factor (Line 3). Using the number of floating-point operations, Basker assigns the submatrices among the threads and memory for factors can be allocated. The colors in Figure 2 provides one such assignment for four threads.
Parallel Numeric Factorization. After symbolic factorization, the numeric factorization uses the same thread mapping to submatrices to call sparse factorization using Gilbert-Peierls algorithm. The algorithm is not shown as it is a simple parallel-for loop over the diagonal submatrices.
Iii-C Fine Nested-Dissection Structure
A subblock, such as in Figure 2 could be too large to be factored in serial as in the above BTF fine structure method. This block could easily dominate the factorization time, but there is no simple way to factor this block with multiple threads with natural ordering. This block constitutes an average of of the total matrix size in our problem test suite (see Section V). As observed before, using a 1D layout (Figure 1) does not provide enough parallelism. Instead we reorder this block even further into finer 2D blocks. Using this structure,we design the first parallel Gilbert-Peierls algorithm so multiple threads can work on a single column.
The nested-dissection ordering is used in order to discover smaller independent subblocks to factor in parallel. Basker first permutes using a MWCM () to find the locally best matching and reduce the need to pivot. Next, Basker computes the ND ordering on the graph of + with a ND tree. Basker currently limits the number of leafs in the ND tree to the number of threads available (). We note that increasing the number of leafs in the ND tree may provide smaller cache friendly submatrices, but would limit the amount of pivoting allowed. This trade-off is not explored in this paper. Additionally, current implementations of ND provide only a binary tree, and therefore, Basker is limited to using a power of two threads. The ND ordering () results in , and the reordered matrix is given in Figure 3 for four threads. This two-dimensional structure of sparse matrices is used to store both the reordered matrix and factorization (). The colors suggest one possible layout where blocks of a particular color are shared by a thread.
Dependency Tree. Basker requires a method to map the ND structure to threads. One option is to use a task-dependency graph, and use a tasking runtime. However, Basker is currently limited to using data-parallel methods (parallel-for) due to dependence on Kokkos and integration requirements with Trilinos and Xyce. Basker does this by transforming a task-dependency graph into a dependency tree that represents level sets that can be executed in parallel.
Figure 3 provides a general dependency tree used by both symbolic and numeric factorization for the two-dimensional matrix in Figure 3, and is read from the bottom-up. This tree represents two levels of dependency. The first level dependencies are between matrices within a node. Within each node, matrices listed in a particular row depend on matrices listed in rows below in the same node. For example, depends on having . The second level dependencies are between nodes and are represented with arrows. The levels in the dependency tree is denoted as , and will always be used for only the dependency tree (not the or ND tree). Nodes are colored to match the thread mapping in Figure 3. Note that this tree is different from a ND tree, and expresses the concurrency in the hierarchical layout so Basker can use level scheduling. One can easily see the difference with Figure 1 where the root node represents the entire block column, whereas in the new dependence tree are distributed to multiple threads and the bottleneck in the root node is much smaller.
Parallel Symbolic Factorization. Basker now needs an accurate estimate of the nonzero count for the two-dimensional factors found in parallel (Algorithm 3). A parallel symbolic factorization is crucial in a multithreaded environment as repeated reallocation for factors would require a system call, which is a performance bottleneck when done in a parallel region. We do not form the of the whole matrix and instead build the appropriate portions of the in different threads.
Basker first processes the bottom two levels in the dependency tree (Line 2-9) to obtain an accurate nonzero count. The bottom most level of the dependency tree, i.e., -1, has submatrices corresponding to . First, we find both the nonzero count per column and the  of either + or ) (depending on symmetry and pivoting options) in parallel (Line 5). Second, the nonzero counts for remaining in the node at -1 is found (Line 6). We note that
Also, pivoting while factorizing will not affect as by the fill-path theorem . Therefore, Basker can use the above expression to find the nonzeros counts of the lower-diagonal submatrices. Moreover, we find a data structure with the maximum and minimum row index for each column that will be used for estimating nonzero counts in higher . At 0, nonzero counts for the upper-diagonal submatrices, i.e., , can be found (Line 8). As may depend on the pivoting on the must be used. For each column (), the method counts the nodes encountered starting from each nonzero in the column of to the least common ancestor of any nonzero already explored, where the least common ancestor of two nodes is the least numbered node that is the ancestor of both. A data structure is returned with the maximum and minimum row index for each row.
The estimated nonzero counts for submatrices in the higher levels of the dependency tree are found using the estimates and by looping over the remaining (Line 11). At each , all the nodes on the level are handled by finding the nonzero count of the diagonal subblock, e.g., (Line 14). Now,
for these blocks, where is the pattern after the multiplication of . Basker estimates an upper bound of using the and by assuming the column is dense between the minimum and maximum if and overlap for the column. We find that this is a reasonable upper bound and cheaper than storing the whole nonzero pattern. Finally, the column count of any off-diagonal submatrices, such as and , can be computed (Line 15 and 16). The column count for these submatrices use the upper bound as well (i.e., fill-in estimated with and ).
Parallel Numeric Factorization. This subsection describes the parallel left-looking Gilbert-Peierls algorithm (Algorithm 4). To facilitate understanding, we explain the algorithm using a series of block diagrams of the execution in Figure 4. Blocks that are not colored gray represent submatrices that are active/used at a stage, and the colors correspond to the thread mapping as in Figure 3.
Submatrices are factored based on the dependency tree in Figure 3 in a column-by-column manner. Figure 4 starts with the submatrices in -1. Basker factors the submatrices on the diagonal that have no dependencies, i.e., computing (Line 4). This factorization uses the Gilbert-Peierls algorithm similar to Algorithm 1 in parallel on each submatrix. Next, the just computed column is used to compute column in the lower off-diagonal submatrices in the node at -1, e.g., and (Line 5). This is done by discovering the nonzero pattern as a result of parallel sparse matrix-vector multiplication. At -1, a level synchronization between all threads is needed before moving to next . Note that Basker need not necessarily sync all threads if done in a task parallel manner.
The nodes in the dependency tree starting at has a subtle but important distinction. All submatrices in a tree node are not computed before moving to next node as in the symbolic factorization. In contrast, only those submatrices in a tree node corresponding to a particular column are computed (Line 9). The indicates multiple passes over the dependency tree (bottom up until ). Figures 4(b)- 4 show the block diagram of with , where the red line indicates the column being factored. Submatrices at this (Figure 4(b)), e.g., , are factorized in parallel using a method similar to Algorithm 1 except that is used for the backsolve (Line 14).
Basker continues up the dependency tree with a loop over (Line 15). At each new level, Basker must synchronize specific threads in order to combine their results (Line 18). Figure 4 shows the blocks used in the reduction. The reduction has two phases. The first phase is multiple parallel sparse matrix-vector multiplication of the matrices colored in and the column of just found (the red line in the colored blocks). The second phase is subtracting each threads’ matrix-vector product from the corresponding blocks in (where gray blocks in the column are , and ). For example, one thread computes the reduction results in . and are computed in parallel as well. Once the reduction is complete, the newly updated submatrix at can be factored similar to other upper off-diagonal matrices (Line 20). Figure 4 provides a visual representation of this step. At the last step, when , at the root, there is one reduction needed to the already computed (Line 24, Figure 4) and then a simple factorization in the diagonal block can be computed (Line 26, Figure 4). This last factorization is the only serial bottleneck in the algorithm.
In the more general case, when = (Line 22) and we are not at the root node (not shown in the figures), there is no farther bottom-up traversal of the dependency tree. This would have been true for the for block column three in our example. In matrix terms, this means that for a column has been computed and only the block diagonal and remain to be computed (e.g, , , and ). This requires a reduction (Line 24) and factoring the diagonal submatrix (Line 26) as before, but any lower off-diagonal submatrices of that remain, such as , need to be factored as well (Line 28).
Iv Basker Implementation
Data Layout. Basker uses a hierarchy of two-dimensional sparse matrix blocks to store both the original matrix and factors. The 2D structure is composed of multiple compressed sparse column (CSC) format matrices. Parallelism must be extracted from between blocks in the BTF structure and within large blocks in order to achieve speedup on low fill-in matrices. In particular, a hierarchical structure needs to be exploited to reveal more parallelism. Additionally, this also breaks the problem into fine-grain data structures that better fit the structure of memory in modern many-core nodes. Basker implements this by building this structure of C++ classes during the symbolic factorization after applying the aforementioned orderings.
Synchronization. Light weight synchronizations are needed to allow multiple threads to work on a single column in Basker. There are multiple places where these synchronizations need to happen in Basker, and they are marked in Algorithm 4. The number of threads that need to synchronize depends on location and iteration in the algorithm. For instance, all threads need to synchronize moving from factoring leaf nodes and parent nodes, but only two threads need to sync in separator columns.
A traditional data-parallel approach launches parallel-for over a set of threads, and these threads rejoin the master only after the end of the loop. However, if synchronization takes place between all threads at every level, the overhead would be too high. In particular, the total time spent for synchronization in this manner for matrix G2 Circuit with 8 cores is 11% of total time. Therefore, Basker uses a different mechanism to synchronize between threads. This mechanism is a point-to-point synchronization that utilizes writing to a volatile variable where synchronization only happens between two threads that have a dependency. Point-to-point synchronization’s importance in the speedup of sparse triangular solve has been shown before . Using this method, Basker is able to reduce synchronization overhead to 2.3% ( improvement) of total runtime for G2 Circuit.
V Empirical Evaluation
We evaluate Basker against Pardiso MKL 11.2.2 (PMKL), SuperLU-MT 3.0 (SLU-MT), and KLU 1.3.2 on a set of sparse matrices from circuit and powergrid simulations in terms of memory and runtime. Our MWCM implementation is similar to MC64 bottle-neck ordering , unlike SuperLU-Dist’s product/sum based MC64 ordering. Scotch  6.0 is used to obtain the ND ordering. Furthermore, we compare Basker’s performance on a sequence of matrices from circuit simulation of interest.
V-a Experimental Setup
System Setup. We use two test beds for our experiments. The first system has two eight-core Xeon E5-2670 running at 2.6GHz (SandyBridge). The two processors are interconnected using Intel’s QuickPath Interconnect (QPI), and share 24GB of DRAM. The second system has an Intel Xeon Phi coprocessor with 61 cores running at 1.238GHz and 16GB of memory. Since Basker requires a power of two threads, we only test up to 32 cores as 64 threads would oversubscribe the device. All codes are compiled using Intel 15.2 with -03 optimization.
Test Suite. Basker is evaluated over a test suite of circuit and powergrid matrices taken from Xyce and the University of Florida Sparse Matrix Collection . These matrices vary in size, sparsity pattern, and number of BTF blocks. Additionally, these matrices vary in fill-in density, i.e., where is the number of nonzeros in . We note that fill-in can be when using BTF, since only the diagonal subblocks of are factored to . In Davis and Natarajan , coefficient matrices coming from circuit simulation generally have lower fill-in density than those coming from PDE simulations, i.e., . Matrices with lower fill-in tend to perform better using a Gilbert-Peierls algorithm than a supernodal approach. For fairness, we include seven matrices with fill-in density larger than . Table I lists all matrices sorted by increasing fill-in density measured using KLU. The percent of matrix rows in small independent diagonal submatrices (Fine BTF Structure) is shown as BTF%. The total number of BTF blocks is also shown. A double line divides matrices with fill-in density higher than . The test suite is a mix of matrices with very different properties to exercise different portions of Basker.
V-B Memory Usage
We now compare memory requirements in terms of +. Table I lists the number of nonzeros in + for KLU, PMKL, and Basker. We do not report results for SLU-MT due performance considerations (shown below). The nonzeros reported for PMKL and Basker are from a run using 8 cores on SandyBridge. We note that this number varies slightly for Basker depending on number of cores. The best result between PMKL and Basker is in bold. We observe that Basker provides factors with less nonzero entries for most matrices with fill-in density 4. This reduction can be as high as an order of magnitude for the matrix RS_b678c2+. This is the result of using the BTF structure and using fill reducing ordering on the subblocks. However, PMKL uses slightly less memory on matrix with fill-in density 4. The additional memory used by Basker on these matrices is far less than the additional memory used by PMKL on the first group of matrices.
We first compare the general performance of the chosen sparse solver packages. Only the numeric time is compared, since the symbolic factorization of both Basker and PMKL is limited by finding ND ordering. Figure 5 gives the raw time on Intel SandyBridge for a selection of six matrices. These six matrices are selected due to their varying fill-in density, and ordered increasing from a density of to . We first observe that PMKL is as good or better than SuperLU-MT. Similar results have been reported in the past  in comparing against SuperLU-Dist for circuit problems. Additionally, Basker performs better than other solvers in 5/6 matrices. For this reason, we only perform additional comparison to PMKL.
In this section, we focus on the scalability of the numeric factorization phase of Basker and PMKL on the two architectures. We use the relative speedup to KLU as that is the state-of-the-art sequential solver, i.e., , where is the time of the numeric factorization phase, is the input matrix, is either Basker or PMKL, and is the number of cores.
Figure 6 shows the speedup achieved for these six matrices on SandyBridge platform. We provide in the title of each figure. We observe that Basker can achieve up to speedup (hvdc2) and outperform PMKL in all but one case (Xyce3) that has a high fill-density of 9.2. Moreover, we observe that PMKL has a speedup less than 1 in serial for four problems demonstrating the inefficiency of a supernodal algorithm to a Gilbert-Peierls algorithm for matrices with low fill-in density. By adding more cores, PMKL is not able to recover from this inefficiency and reaches a max speedup of on the first four problems. The reason for this is due to semi-dense columns that Basker is able to avoid factoring. PMKL does factor Xyce3 faster with its high fill-in density, but Basker scales in a similar way.
The relative speedup of the same six matrices on the Intel Xeon Phi are shown in Figure 6. Again, KLU time is provided in each figure’s title. On Intel Xeon Phi, Basker is able to out perform PMKL on four out of the six matrices. Basker achieves a maximum speedup (Power0) on these six matrices and PMKL achieves maximum speedup (Xyce3). We observe that any overhead from using a Gilbert-Peierls algorithm on a matrix with high fill-in density is magnified by the Intel Phi. This is exposed and seen in both Freescale1 and Xyce3. One possible reason for this is that the submatrices in the lowest level of the hierarchical structure are too large to fit into a core’s L2 cache (). Basker currently makes the submatrices as large as possible to allow for better pivoting. However, Basker still achieves speedups higher than PMKL on the four matrices with low fill-in density.
As a next step, we compare the performance on the whole test suite. On SandyBridge, the geometric mean of speedup for all the matrices with Basker is and with PMKL is it using 16 cores. On 16 cores, Basker is faster than PMKL on 17/22 matrices. The five matrices PMKL is faster on have a high fill-in density. On the Xeon Phi, the geometric mean speedup with Basker is and with PMKL it is using 32 cores. On 32 cores, Basker is faster than PMKL on 16/22 matrices. This includes the same matrices as on the SandyBridge except Freescale1. The reason for such a high speedup for PMKL on Xeon Phi is again its higher performance on high fill-in density matrices.
While the geometric mean gives some idea on relative performance, we use a performance profile to gain an understanding of the overall performance over the test suite. The performance profile measures the relative time of a solver on a given matrix to the best solver. The values are plotted for all matrices in a graph with an x-axis of time relative to best time and a y-axis as fraction of matrices. The result is a figure where a point(x,y) is plotted if a solver takes no more than x times the runtime of of the fastest solver for y problems.
Figure 7 shows the performance profile of Basker, PMKL, and KLU in serial on SandyBridge. This shows a baseline of how well each method does in serial. We observe that Basker is better on of the problems, while the supernodal method of PMKL is within of the the best solver for of the problems. However, PMKL is the better solver for of the problems. Despite having very similar algorithms, Basker is able to slightly beat KLU. This slight difference is because of the difference in orderings and the use of Kokkos memory padding.
The performance profile of the parallel solvers on SandyBridge (16 cores) is shown in Figure 7. Serial KLU is not included in this figure. Basker is the best solver for of the matrices, and PMKL is within of Basker on of the matrices. PMKL is the best solver for of the matrices, which correspond to matrices with high fill-in density. This demonstrates Basker scales well on SandyBridge for low fill-in density matrices. On Intel Xeon Phi with 32 cores, the performance profile is slightly different (Figure 7). Basker now is the best solver for matrices, and PMKL is within of Basker for of matrices. PMKL is the best (or very close to the best) for of the matrices. One can observe Basker now does poorly on high fill-in density matrices. A reason for that is the missing large shared to share data needed during the Basker’s reductions.
V-E Comparison on Ideal Matrices
Next, we analyze how well Basker scales on low fill-in density matrices, compared to how well the supernodal solver PMKL scales on 2/3D mesh problems. This comparison allows us to better understand if Basker achieves speedup for its ideal input similar to PMKL on its ideal input. The other reason is to see how well we can parallelize Gilbert-Peierls algorithm for its ideal problems. We use a second test suite of matrices for PMKL that come from 2/3D mesh problems in Table II. Performance of PMKL on these matrices will be compared to the performance of Basker on the six matrices of our primary test suite with the lowest fill-in density.
|pwtk||2.2E5||1.2E7||9.7E7||Wind tunnel stiffness matrix|
|ecology||1.0E6||5.0E6||7.1E7||5 pt stencil model movement|
|apache2||7.2E5||4.8E6||2.8E8||Finite difference 3D|
|parabolic_fem||5.3E5||3.7E6||5.2E7||Parabolic finite element|
|helm2d03||3.9E5||2.7E6||3.7E7||Helmholtz on square|
Figure 8 provides a scatter plot of the speedup for each solver relative to itself over its ideal six matrices. A linear trend line is shown for each set of solver speedups. Both solvers achieve similar speedup trend on SandyBridge for their ideal inputs. This demonstrates that on systems with a large cache hierarchy Basker is able to achieve so called state-of-the-art performance on low fill-in density matrices. In Figure 8, a similar plot is given for our Xeon Phi platform. This time Basker has a slightly lower trend line starting at 16 cores. We suspect this is due to both the size of the submatrices not fitting into cache and time for the reduction. We plan to address both these issues in future versions of Basker.
Next, we consider the use of Basker on a sequence of matrices generated during the transient analysis of a circuit. Xyce is a transistor-level simulator that performs a SPICE-style simulation of circuits, where devices and their interconnectivity are transformed via modified nodal analysis into a set of nonlinear differential algebraic equations (DAEs). During transient analysis, these nonlinear DAEs are solved implicitly through numerical integration methods. Any numerical integration method requires the solution to a sequence of nonlinear equations, which in-turn generates a sequence of linear systems. A transient analysis can generate millions of coefficient matrices with the same structure and significantly different values. Each factorization may require a different permutation due to pivoting for this reason. For very large circuits, this results in the numeric factorization being the limiting factor of the simulation overall time and scalability. Furthermore, a solver package must reuse the symbolic factorization for all matrices in the sequence as repeating symbolic factorization would dramatically affect performance.
For this experiment, we chose a sequence from the circuit that generated Xyce1. This circuit is of particular interest because it has been used in prior studies  to illustrate the ineffectiveness of preconditioned iterative methods and direct solvers other than KLU. Attempts to use the PMKL solver had either been met with solver failure or simulation failure until recently. Therefore, we wish to see how well Basker performs on a sequence of these matrices (1000 matrices) which represent of the desired transient length.
Over the sequence of 1000 matrices, Basker took 175.21 seconds, KLU took 914.77 seconds, and PMKL took 951.34 seconds. This is a speedup of when using Basker instead of PMKL and when using Basker instead of KLU. The scalable simulation of this circuit was previously limited by the serial bottleneck produced by using KLU as the direct solver, which is justified due to its performance compared to PMKL. Basker provides significant speedup compared to either KLU or PMKL, and will finally provide a scalable direct solver to Xyce for performing the transient analysis of this circuit.
Vi Conclusions and Future Work
We introduced a new multithreaded sparse factorization, Basker, that uses hierarchical parallelism and data layouts. Basker provides a nice alternative to traditional solvers that use one-dimensional layout with BLAS. In particular, it is useful for coefficient matrices with hierarchical structure such as circuit problems. We also introduced the first parallel implementation of Gilbert-Peierls algorithm. Performance results show that Basker scales well for matrices with low fill-in density resulting in a speedup of (geometric mean) over the test suite on 16 SandyBridge cores and over the test suite on 32 Intel Xeon Phi cores relative to KLU. Particularly, Basker can have speedups on these matrices similar to PMKL on 2/3D mesh problems and reduce the time for a sequence of circuit problems from Xyce by . Basker shows that in order to speedup sparse factorization on many-core node, solvers must leverage all available parallelism and may do so by using a hierarchical structure.
We plan to continue support of Basker in the ShyLU package of Trilinos for Xyce. Future scheduled improvements include adding supernodes to the hierarchy structure to improve performance on high fill-in matrices, and using asynchronous tasking to reduce synchronization costs.
We would like to thank Erik Boman, Andrew Bradley, Kyungjoo Kim, H.C. Edwards, Christian Trott, and Simon Hammond for insights and discussions. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the U.S. Department of Energy under contract DE-AC04-94-AL85000.
-  J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu, “A supernodal approach to sparse partial pivoting,” SIAM J. Matrix Anal. Appl., vol. 20, no. 3, pp. 720–755, May 1999.
-  J. R. Gilbert and T. Peierls, “Sparse partial pivoting in time proportional to arithmetic operations,” SIAM Journal on Scientific and Statistical Computing, vol. 9, no. 5, pp. 862–874, 1988.
-  P. R. Amestoy, I. S. Duff, J.-Y. LâExcellent, and J. Koster, “Mumps: a general purpose distributed memory sparse solver,” in Applied Parallel Computing. New Paradigms for HPC in Industry and Academia. Springer, 2001, pp. 121–130.
-  O. Schenk, K. Gärtner, W. Fichtner, and A. Stricker, “Pardiso: A high-performance serial and parallel sparse linear solver in semiconductor device simulation,” Future Gener. Comput. Syst., vol. 18, no. 1, pp. 69–78, Sep. 2001.
-  T. A. Davis and E. Palamadai Natarajan, “Algorithm 907: Klu, a direct sparse solver for circuit simulation problems,” ACM Trans. Math. Softw., vol. 37, no. 3, pp. 36:1–36:17, Sep. 2010.
-  F. Pellegrini and J. Roman, “Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs,” in Proceedings of the International Conference and Exhibition on High-Performance Computing and Networking, ser. HPCN Europe 1996. London, UK, UK: Springer-Verlag, 1996, pp. 493–498.
-  H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos: Enabling manycore performance portability through polymorphic memory access patterns,” Journal of Parallel and Distributed Computing, vol. 74, no. 12, pp. 3202 – 3216, 2014, domain-Specific Languages and High-Level Frameworks for High-Performance Computing.
-  P. R. Amestoy, T. A. Davis, and I. S. Duff, “An approximate minimum degree ordering algorithm,” SIAM J. Matrix Anal. Appl., vol. 17, no. 4, pp. 886–905, Oct. 1996.
-  X. S. Li and J. W. Demmel, “Superlu dist: A scalable distributed-memory sparse direct solver for unsymmetric linear systems,” ACM Trans. Math. Softw., vol. 29, no. 2, pp. 110–140, Jun. 2003.
-  I. S. Duff and J. Koster, “On algorithms for permuting large entries to the diagonal of a sparse matrix,” SIAM J. Matrix Anal. Appl., vol. 22, no. 4, pp. 973–996, Jul. 2000.
-  J. W. Demmel, J. R. Gilbert, and X. S. Li, “An asynchronous parallel supernodal algorithm for sparse gaussian elimination,” SIAM J. Matrix Anal. Appl., vol. 20, no. 4, pp. 915–952, Jul. 1999.
-  E. Bavier, M. Hoemmen, S. Rajamanickam, and H. Thornquist, “Amesos2 and belos: Direct and iterative solvers for large sparse linear systems,” Sci. Program., vol. 20, no. 3, pp. 241–255, Jul. 2012.
-  S. Rajamanickam, E. Boman, and M. Heroux, “Shylu: A hybrid-hybrid solver for multicore platforms,” in Parallel Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, 2012, pp. 631–643.
-  A. Pothen and C.-J. Fan, “Computing the block triangular form of a sparse matrix,” ACM Trans. Math. Softw., vol. 16, no. 4, pp. 303–324, Dec. 1990.
-  T. A. Davis, Direct Methods for Sparse Linear Systems (Fundamentals of Algorithms 2). Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2006.
-  D. J. Rose and R. E. Tarjan, “Algorithmic aspects of vertex elimination on directed graphs,” SIAM Journal on Applied Mathematics, vol. 34, no. 1, pp. 176–197, 1978.
-  D. J. Rose, R. E. Tarjan, and G. S. Lueker, “Algorithmic aspects of vertex elimination on graphs,” SIAM Journal on Computing, vol. 5, no. 2, pp. 266–283, 1976.
-  J. Park, M. Smelyanskiy, N. Sundaram, and P. Dubey, “Sparsifying synchronization for high-performance shared-memory sparse triangular solver,” in Proceedings of the 29th International Conference on Supercomputing - Volume 8488, ser. ISC 2014. New York, NY, USA: Springer-Verlag New York, Inc., 2014, pp. 124–140.
-  T. A. Davis and Y. Hu, “The university of florida sparse matrix collection.” [Online]. Available: http://www.cise.ufl.edu/research/sparse/matrices
-  H. Thornquist and S. Rajamanickam, “A hybrid approach for parallel transistor-level full-chip circuit simulation,” in High Performance Computing for Computational Science – VECPAR 2014, ser. Lecture Notes in Computer Science, M. DaydÃ©, O. Marques, and K. Nakajima, Eds. Springer International Publishing, 2015, vol. 8969, pp. 102–111.
-  H. K. Thornquist, E. R. Keiter, R. J. Hoekstra, D. M. Day, and E. G. Boman, “A parallel preconditioning strategy for efficient transistor-level circuit simulation,” in ICCAD ’09: Proceedings of the 2009 International Conference on Computer-Aided Design. New York, NY, USA: ACM, 2009, pp. 410–417.