Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis
Abstract.
Sympiler is a domain-specific code generator that optimizes sparse matrix computations by decoupling the symbolic analysis phase from the numerical manipulation stage in sparse codes. The computation patterns in sparse numerical methods are guided by the input sparsity structure and the sparse algorithm itself. In many real-world simulations, the sparsity pattern changes little or not at all. Sympiler takes advantage of these properties to symbolically analyze sparse codes at compile-time and to apply inspector-guided transformations that enable applying low-level transformations to sparse codes. As a result, the Sympiler-generated code outperforms highly-optimized matrix factorization codes from commonly-used specialized libraries, obtaining average speedups over Eigen and CHOLMOD of 3.8× and 1.5× respectively.
1. Introduction


Sparse matrix computations are at the heart of many scientific applications and data analytics codes. The performance and efficient memory usage of these codes depend heavily on their use of specialized sparse matrix data structures that only store the nonzero entries. However, such compaction is done using index arrays that result in indirect array accesses. Due to these indirect array accesses, it is difficult to apply conventional compiler optimizations such as tiling and vectorization, even for operations with static index arrays like sparse matrix-vector multiply. A static index array does not change during the algorithm; for more complex operations with dynamic index arrays, such as matrix factorization and decomposition, the nonzero structure is modified during the computation, making conventional compiler optimization approaches even more difficult to apply.
The most common approach to accelerating sparse matrix computations is to identify a specialized library that provides a manually-tuned implementation of the specific sparse matrix routine. A large number of sparse libraries are available (e.g., SuperLU (Demmel et al., 1999a), MUMPS (Amestoy et al., 2000), CHOLMOD (Chen et al., 2008), KLU (Davis and Palamadai Natarajan, 2010), UMFPACK (Davis, 2011)) for different numerical kernels, supported architectures, and specific kinds of matrices. While hand-written specialized libraries can provide high performance, they must be manually ported to new architectures and may stagnate as architectural advances continue. Alternatively, compilers can be used to optimize code while providing architecture portability. However, indirect accesses and the resulting complex dependence structure run up against the limitations of compile-time loop transformation frameworks.
Compiler loop transformation frameworks such as those based on the polyhedral model use algebraic representations of loop nests to transform code and successfully generate highly-efficient dense matrix kernels (Ancourt and Irigoin, 1991; Kelly, 1998; Quilleré et al., 2000; Vasilache et al., 2006; Tiwari et al., 2009; Chen, 2012). However, such frameworks are limited when dealing with non-affine loop bounds and/or array subscripts, both of which arise in sparse codes. Recent work has extended polyhedral methods to effectively operate on kernels with static index arrays by building runtime inspectors that examine the nonzero structure and executors that use this knowledge to transform code execution (Venkat et al., 2015; Strout et al., 2016; Van Der Spek and Wijshoff, 2010; Venkat et al., 2014; Venkat et al., 2016). However, these techniques are limited to transforming sparse kernels with static index arrays. Sympiler addresses these limitations by performing symbolic analysis at compile-time to compute fill-in structure and to remove dynamic index arrays from sparse matrix computations. Symbolic analysis, a term from the numerical computing community, refers to phases that determine the computational patterns that depend only on the nonzero pattern and not on the numerical values. Information from symbolic analysis can be used to make subsequent numeric manipulation faster, and the information can be reused as long as the matrix nonzero structure remains constant.
For a number of sparse matrix methods such as LU and Cholesky, it is well known that viewing their computations as a graph (e.g., elimination tree, dependence graph, or quotient graph) and applying a method-dependent graph algorithm yields information about dependences that can then be used to compute the numerical method more efficiently (Davis, 2006). Most high-performance sparse matrix computation libraries utilize symbolic information, but they couple this symbolic analysis with the numeric computation, further complicating compiler optimization of such codes.
This work presents Sympiler, which generates high-performance sparse matrix code by fully decoupling the symbolic analysis from the numeric computation and transforming the code to utilize the symbolic information. After obtaining symbolic information by running a symbolic inspector, Sympiler applies inspector-guided transformations, such as variable-sized blocking, resulting in performance equivalent to hand-tuned libraries. But Sympiler goes further than existing numerical libraries by generating code for a specific matrix nonzero structure. Because matrix structure often arises from properties of the underlying physical system that the matrix represents, in many cases the same structure reoccurs multiple times with different nonzero values. Thus, Sympiler-generated code can combine inspector-guided and low-level transformations to produce even more efficient code. The transformations applied by Sympiler improve the performance of sparse matrix codes through single-core optimizations such as vectorization and increased data locality, and should extend to improve performance on shared- and distributed-memory systems.
1.1. Motivating Scenario
Sparse triangular solve takes a lower triangular matrix L and a right-hand side (RHS) vector b and solves the linear equation Lx = b for x. It is a fundamental building block in many numerical algorithms such as factorization (Davis, 2006; Li, 2005), direct system solvers (Davis, 2005), and rank update methods (Davis and Hager, 2009), where the RHS vector is often sparse. A naïve implementation visits every column of matrix L to propagate the contributions of its corresponding x value to the rest of x (see Figure (b)). However, with a sparse b, the solution vector x is also sparse, reducing the required iteration space of sparse triangular solve to be proportional to the number of nonzero values in x. Taking advantage of this property requires first determining the nonzero pattern of x. Based on a theorem from Gilbert and Peierls (Gilbert and Peierls, 1988), the dependence graph DG_L = (V, E) for matrix L, with nodes V = {1, ..., n} and edges E = {(j, i) | l_ij ≠ 0}, can be used to compute the nonzero pattern of x, where n is the matrix rank and numerical cancellation is neglected. The nonzero indices in x are given by Reach_L(β), the set of all nodes reachable from any node in β = {i | b_i ≠ 0}, which can be computed with a depth-first search of the directed graph DG_L starting with β. An example dependence graph is illustrated in Figure (a). The blue colored nodes correspond to set β, and the final reachset contains all the colored nodes in the graph.
Figure 6 shows four different implementations of sparse triangular solve. Most solvers assume the input matrix L is stored in a compressed sparse column (CSC) storage format. While the naïve implementation in Figure (b) traverses all columns, the typical library implementation shown in Figure (c) skips iterations when the corresponding value in x is zero.
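As a concrete sketch of these first two variants, the following C code assumes a CSC layout (column pointers Lp, row indices Li, values Lx, with each column's diagonal entry stored first) and hypothetical function names; it is an illustration of the idea, not the figure's exact code.

```c
/* Naive forward solve (as in Figure (b)): visits every column of L.
 * CSC layout is an assumption: column j occupies Lp[j] .. Lp[j+1]-1,
 * diagonal first. x holds b on entry and the solution on exit. */
void lsolve_naive(int n, const int *Lp, const int *Li,
                  const double *Lx, double *x)
{
    for (int j = 0; j < n; j++) {
        x[j] /= Lx[Lp[j]];                    /* divide by the diagonal */
        for (int p = Lp[j] + 1; p < Lp[j + 1]; p++)
            x[Li[p]] -= Lx[p] * x[j];         /* propagate to rows below */
    }
}

/* Library-style solve (as in Figure (c)): skips columns whose
 * solution entry is zero, so they contribute nothing. */
void lsolve_skip(int n, const int *Lp, const int *Li,
                 const double *Lx, double *x)
{
    for (int j = 0; j < n; j++) {
        if (x[j] == 0.0) continue;            /* nothing to propagate */
        x[j] /= Lx[Lp[j]];
        for (int p = Lp[j] + 1; p < Lp[j + 1]; p++)
            x[Li[p]] -= Lx[p] * x[j];
    }
}
```

Both variants still check or touch all n columns; only the amount of numeric work differs.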
The implementation in Figure (d) shows a decoupled code that uses the symbolic information provided by the precomputed reachset. This decoupling simplifies the numerical manipulation and reduces its runtime complexity from O(n + f) in Figure (c) to O(|b| + f) in Figure (d), where f is the number of floating-point operations and |b| is the number of nonzeros in b. Sympiler goes further by building the reachset at compile-time and leveraging it to generate code specialized for the specific matrix structure and RHS. The Sympiler-generated code is shown in Figure (e), where the code only iterates over reached columns and peels iterations where the number of nonzeros in a column is greater than some threshold (in the case of the figure, this threshold is 2). These peeled loops can be further transformed with vectorization to speed up execution. This shows the power of fully decoupling the symbolic analysis phase from the code that manipulates numeric values: the compiler can aggressively apply conventional optimizations, using the reachset to guide the transformation. On matrices from the SuiteSparse Matrix Collection, the Sympiler-generated code shows speedups between 8.4× and 19× with an average of 13.6× compared to the forward solve code (Figure (b)), and from 1.2× to 1.7× with an average of 1.3× compared to the library-equivalent code (Figure (c)).
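The decoupled variant of Figure (d) can be sketched as a numeric loop over a precomputed, topologically ordered reachset, again assuming a CSC layout (Lp/Li/Lx, diagonal first in each column) and illustrative names. Sympiler-generated code would go one step further and hardcode the reachset into the emitted source, peeling and vectorizing the dense columns.

```c
/* Decoupled solve (as in Figure (d)): the numeric phase iterates only
 * over the precomputed reachset, which must list reached columns in
 * topological order of the dependence graph DG_L. */
void lsolve_reach(const int *Lp, const int *Li, const double *Lx,
                  const int *reachSet, int reachSize, double *x)
{
    for (int t = 0; t < reachSize; t++) {
        int j = reachSet[t];                  /* only reached columns */
        x[j] /= Lx[Lp[j]];
        for (int p = Lp[j] + 1; p < Lp[j + 1]; p++)
            x[Li[p]] -= Lx[p] * x[j];
    }
}
```

The numeric phase no longer scans or tests the n columns; its cost is proportional to the reachset and the floating-point work it triggers.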
1.2. Static Sparsity Patterns
A fundamental concept that Sympiler is built on is that the structure of sparse matrices in scientific codes is dictated by the physical domain and as such does not change in many applications. For example, in power system modeling and circuit simulation problems the sparse matrix used in the matrix computations is often a Jacobian matrix, where the structure is derived from interconnections among the power system and circuit components such as generation, transmission, and distribution resources. While the numerical values in the sparse input matrix change often, a change in the sparsity structure occurs on rare occasions with a change in circuit breakers, transmission lines, or one of the physical components. The sparse systems in simulations in domains such as electromagnetics (Luna et al., 2013; Dorf, 2006; ExpositoDominguez et al., 2012; ElKurdi et al., 2015), computer graphics (Gibson and Mirtich, 1997), and fluid mechanics (Anderson et al., 1984) are assembled by discretizing a physical domain and approximating a partial differential equation on the mesh elements. A sparse matrix method is then used to solve the assembled systems. The sparse structure originates from the physical discretization and therefore the sparsity pattern remains the same except where there are deformations or if adaptive mesh refinement is used. Sparse matrices in many other physical domains exhibit the same behavior and benefit from Sympiler.
1.3. Contributions
This work describes Sympiler, a sparsityaware code generator for sparse matrix algorithms that leverages symbolic information to generate fast code for a specific matrix structure. The major contributions of this paper are:

A novel approach for building compile-time symbolic inspectors that obtain information about a sparse matrix, to be used during compilation.

Inspector-guided transformations that leverage compile-time information to transform sparse matrix code for specific algorithms.

Implementations of symbolic inspectors and inspector-guided transformations for two algorithms: sparse triangular solve and sparse Cholesky factorization.

A demonstration of the performance impact of our code generator, showing that Sympiler-generated code can outperform state-of-the-art libraries for triangular solve and Cholesky factorization by up to 1.7× and 6.3× respectively.
2. Sympiler: A Symbolic-Enabled Code Generator
Sympiler generates efficient sparse kernels by tailoring sparse code to specific matrix sparsity structures. By decoupling the symbolic analysis phase, Sympiler uses information from symbolic analysis to guide code generation for the numerical manipulation phase of the kernel. In this section, we describe the overall structure of the Sympiler code generator, as well as the domain-specific transformations enabled by leveraging information from the symbolic inspector.
2.1. Sympiler Overview
Sparse triangular solve and Cholesky factorization are currently implemented in Sympiler. Given one of these numerical methods and an input matrix stored using compressed sparse column (CSC) format, Sympiler utilizes a method-specific symbolic inspector to obtain information about the matrix. This information is used to apply domain-specific optimizations while lowering the code for the numerical method. In addition, the lowered code is annotated with additional low-level transformations (such as unrolling) when applicable based on domain- and matrix-specific information. Finally, the annotated code is further lowered to apply low-level optimizations and output to C source code.
Code implementing the numerical solver is represented in a domain-specific abstract syntax tree (AST). Sympiler produces the final code by applying a series of phases to this AST, transforming the code in each phase. An overview of the process is shown in Figure 7. The initial AST for triangular solve is shown in Figure 7a prior to any transformations.
Figure 12. Inspector-guided transformations: (a) before and (b) after Variable Iteration Space Pruning, loop[k].VIPrune(pruneSet, pruneSetSize); (c) before and (d) after 2D Variable-Sized Blocking, loop[I].VSBlock(blockSet, blockSetSize).
2.2. Symbolic Inspector
Different numerical algorithms can make use of symbolic information in different ways, and prior work has described runtime graph traversal strategies for various numerical methods (Pothen and Toledo, 2004; Li and Demmel, 2003; Davis, 2004, 2006). The compile-time inspectors in Sympiler are based on these strategies. For each class of numerical algorithms with the same symbolic analysis approach, Sympiler uses a specific symbolic inspector to obtain information about the sparsity structure of the input matrix and stores it in an algorithm-specific way for use during later transformation stages.
We classify the symbolic inspectors based on the numerical method as well as the transformations enabled by the obtained information. For each combination of algorithm and transformation, the symbolic inspector creates an inspection graph from the given sparsity pattern and traverses it during inspection using a specific inspection strategy. The result of the inspection is the inspection set, which contains the result of running the inspector on the inspection graph. Inspection sets are used to guide the transformations in Sympiler. Additional numerical algorithms and transformations can be added to Sympiler, as long as the required inspectors can be described in this manner as well.
For our motivating example, triangular solve, the reachset can be used to prune loop iterations that perform work that is unnecessary due to the sparsity of matrix L or the right-hand side. In this case, the inspection set is the reachset, and the inspection strategy is a depth-first search over the inspection graph, which is the directed dependency graph DG_L of the triangular matrix. For the example linear system shown in Figure 6, the symbolic inspector generates the reachset {6, 1, 7, 8, 9, 10}.
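Such an inspector can be sketched in C as a DFS whose reversed post-order yields a topologically ordered reachset. The CSC layout (Lp/Li, diagonal first per column) and all names are illustrative assumptions, not Sympiler's internal API.

```c
#include <stdlib.h>

/* Recursive DFS over DG_L: an edge j -> Li[p] exists for every
 * off-diagonal nonzero in column j. Records nodes in post-order. */
static void dfs_visit(int j, const int *Lp, const int *Li,
                      char *marked, int *post, int *cnt)
{
    marked[j] = 1;
    for (int p = Lp[j] + 1; p < Lp[j + 1]; p++)
        if (!marked[Li[p]])
            dfs_visit(Li[p], Lp, Li, marked, post, cnt);
    post[(*cnt)++] = j;                       /* dependents recorded first */
}

/* Computes Reach_L(beta) into reachSet (size >= n) and returns its size.
 * The reversed post-order is a valid topological order of the DAG. */
int reachset(int n, const int *Lp, const int *Li,
             const int *beta, int betaSize, int *reachSet)
{
    char *marked = calloc((size_t)n, 1);
    int cnt = 0;
    for (int k = 0; k < betaSize; k++)
        if (!marked[beta[k]])
            dfs_visit(beta[k], Lp, Li, marked, reachSet, &cnt);
    for (int i = 0; i < cnt / 2; i++) {       /* reverse the post-order */
        int t = reachSet[i];
        reachSet[i] = reachSet[cnt - 1 - i];
        reachSet[cnt - 1 - i] = t;
    }
    free(marked);
    return cnt;
}
```

In Sympiler this traversal happens at compile time, so only the resulting set, not the inspector, survives into the generated code.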
2.3. Inspectorguided Transformations
The initial lowered code along with the inspection sets obtained by the symbolic inspector are passed to a series of passes that further transform the code. Sympiler currently supports two transformations guided by the inspection sets: Variable Iteration Space Pruning and 2D Variable-Sized Blocking, which can be applied independently or jointly depending on the input sparsity. As shown in Figure 7a, the code is annotated with information showing where inspector-guided transformations may be applied. The symbolic inspector provides the required information to the transformation phases, which decide whether to transform the code based on the inspection sets. Given the inspection set and annotated code, transformations occur as illustrated in Figure 12.
2.3.1. Variable Iteration Space Pruning
Variable Iteration Space Pruning (VIPrune) prunes the iteration space of a loop using information about the sparse computation. The iteration space for sparse codes can be considerably smaller than that for dense codes, since the computation needs to only consider iterations with nonzeros. The inspection stage of Sympiler generates an inspection set that enables transforming the unoptimized sparse code to a code with a reduced iteration space.
Given this inspection set, the VIPrune transformation can be applied at a particular loop level to transform the sparse code from Figure (a) to Figure (b). In the figure, the transformation is applied to the loop nest in line 4. In the transformed code the iteration space is pruned to pruneSetSize, which is the inspection set size. In addition to the new loop, all references to the original loop index are replaced by the corresponding value from the inspection set, pruneSet. Furthermore, the transformation phase utilizes inspection set information to annotate specific loops with further low-level optimizations to be applied by later stages of code generation. These annotations are guided by thresholds that decide when specific low-level optimizations result in faster code.
In our running example of triangular solve, the generated inspection set from the symbolic inspector enables reducing the iteration space of the code. The VIPrune transformation elides unnecessary iterations due to zeros in the right hand side. In addition, depending on the number of iterations the loops will run (which is known thanks to the symbolic inspector), loops are annotated with directives to unroll and/or vectorize during code generation.
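The mechanical effect of VIPrune can be illustrated on a simple stand-in kernel body (a reduction); this is not Sympiler's lowered AST, just a minimal before/after sketch in which the loop bound becomes pruneSetSize and every use of the old index goes through pruneSet.

```c
/* Before VIPrune: the k-loop visits every column, even those that
 * contribute nothing because their entry is zero. */
double work_full(int n, const double *x)
{
    double s = 0.0;
    for (int k = 0; k < n; k++)
        s += x[k];                  /* zero entries waste iterations */
    return s;
}

/* After loop[k].VIPrune(pruneSet, pruneSetSize): the loop runs over
 * the inspection set, and references to k become pruneSet[t]. */
double work_pruned(const double *x, const int *pruneSet,
                   int pruneSetSize)
{
    double s = 0.0;
    for (int t = 0; t < pruneSetSize; t++)
        s += x[pruneSet[t]];
    return s;
}
```

The two functions compute the same value whenever the skipped iterations contribute nothing, which is exactly the property the inspector certifies.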
2.3.2. 2D Variable-Sized Blocking
2D Variable-Sized Blocking (VSBlock) converts a sparse code to a set of non-uniform dense subkernels. In contrast to the conventional approach of blocking/tiling dense codes, where the input and computations are blocked into smaller uniform subkernels, the unstructured computations and inputs of sparse kernels make blocking optimizations challenging. The symbolic inspector identifies subkernels with similar structure in the sparse matrix methods and the sparse inputs to provide the VSBlock stage with “blockable” sets that are not necessarily of the same size or consecutively located. These blocks are similar to the concept of supernodes (Li, 2005) in sparse libraries. VSBlock must deal with a number of challenges:

The block sizes are variable in a sparse kernel.

Due to using compressed storage formats, the block elements may not be in consecutive memory locations.

The type of numerical method used may need to change after applying this transformation. For example, applying VSBlock to Cholesky factorization requires dense Cholesky factorization on the diagonal segment of each block, and the off-diagonal segments of the blocks must be updated using a set of dense triangular solves.
To address the first challenge, the symbolic inspector uses an inspection strategy that provides an inspection set specifying the size of each block. For the second challenge, the transformed code allocates temporary block storage and copies data as needed prior to operating on the block. Finally, to deal with the last challenge, the synthesized loops/instructions in the lowering phase contain information about the block location in the matrix, and when applying this transformation, the correct operation is chosen for each loop/instruction. As with the VIPrune transformation, VSBlock also annotates loops with further low-level transformations such as tiling to be applied during code generation. By leveraging specific information about the matrix when applying the transformation, Sympiler is able to mitigate all of the difficulties of applying VSBlock to sparse numerical methods.
An off-diagonal version of the VSBlock transformation is shown in Figures (c) and (d). As shown, a new outer loop is created that provides the block information to the inner loops using the given blockSet. The inner loop in Figure (c) is transformed into two nested loops (lines 2–6) that iterate over the block specified by the outer loop. The diagonal version of VSBlock depends heavily on domain information. More detailed examples of applying this transformation to triangular solve and Cholesky factorization are described in Section 3.
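The data-movement half of VSBlock, addressing the second challenge above, can be sketched as a gather of one variable-sized column block from CSC storage into a dense column-major buffer on which a dense subkernel (e.g., a dense triangular solve) can then run. The layout and names are illustrative assumptions; real supernodal codes avoid some of this copying by storing supernodes densely.

```c
#include <string.h>

/* Gathers the block of columns blockStart .. blockStart+blockSize-1
 * from CSC arrays (Lp, Li, Lx) into a zero-filled dense buffer of
 * leading dimension ld (column-major). Row indices are assumed to be
 * relative offsets Li[p] - blockStart within [0, ld). */
void gather_block(const int *Lp, const int *Li, const double *Lx,
                  int blockStart, int blockSize, int ld,
                  double *dense /* ld x blockSize */)
{
    memset(dense, 0, sizeof(double) * (size_t)ld * (size_t)blockSize);
    for (int j = 0; j < blockSize; j++) {
        int col = blockStart + j;
        for (int p = Lp[col]; p < Lp[col + 1]; p++)
            dense[j * ld + (Li[p] - blockStart)] = Lx[p];
    }
}
```

After the dense subkernel finishes, a symmetric scatter step would write the results back; blockSize varies per block, which is why the buffer size cannot be fixed at a single tile size as in dense tiling.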
2.4. Enabled Conventional Low-level Transformations
While applying inspector-guided transformations, the original loop nests are transformed into new loops with potentially different iteration spaces, enabling the application of conventional low-level transformations. Based on the applied inspector-guided transformations as well as the properties of the input matrix and right-hand side vectors, the code is annotated with transformation directives. An example of these annotations is shown in Figure 7b, where loop peeling is annotated within the VIPruned code. To decide when to add these annotations, the inspector-guided transformations use sparsity-related parameters such as the average block size. The main sources of enabled low-level transformations are:

Symbolic information provides dependency information at compile-time, allowing Sympiler to apply more transformations such as peeling based on the reachset in Figure 6;

Inspectorguided transformations remove some of the indirect memory accesses and annotate the code with potential conventional transformations;

Sparsity-specific code generation enables Sympiler to know details such as loop boundaries at compile-time. Thus, several customized transformations are applied, such as vectorization of loops with iteration counts greater than a threshold.
Figure (e) shows how some of the iterations in the triangular solve code after VIPrune can be peeled. In this example, the inspection set used for VIPrune is the reachset {6, 1, 7, 8, 9, 10}. Because the reachset is created in topological order, iteration ordering dependencies are met and thus code correctness is guaranteed after loop peeling. As shown in Figure 7b, the transformed code after VIPrune is annotated with the enabled peeling transformation based on the number of nonzeros in the columns (the column count). The two selected iterations with column count greater than 2 are peeled to replace them with a specialized kernel or to apply another transformation such as vectorization.
3. Case studies
Table 1. Inspection graphs, inspection strategies, and inspection sets used by the symbolic inspector, and the low-level transformations each inspector-guided transformation enables.

Transformation | Triangular Solve (Graph / Strategy / Set) | Cholesky (Graph / Strategy / Set) | Enabled Low-level
VIPrune | DG_L + SP(RHS) / DFS / Pruneset (reachset) | etree + SP(A) / single-node up-traversal / Pruneset (SP(L)) | dist, unroll, peel, vectorization
VSBlock | DG_L / node equivalence / Blockset (supernodes) | etree + ColCount(A) / up-traversal / Blockset (supernodes) | tile, unroll, peel, vectorization
Sympiler currently supports two important sparse matrix computations: triangular solve and Cholesky factorization. This section discusses some of the graph theory and algorithms used in Sympiler’s symbolic inspector to extract inspection sets for these two matrix methods. The runtime complexity of the symbolic inspector is also presented to evaluate inspection overheads. Finally, we demonstrate how the VIPrune and VSBlock transformations are applied using the inspection sets. Table 1 shows a classification of the inspection graphs, strategies, and resulting inspection sets for the two studied numerical algorithms in Sympiler. As shown in Table 1, the symbolic inspector performs a set of known inspection methods and generates inspection sets that encode the symbolic information. The last column of Table 1 shows the list of transformations enabled by each inspector-guided transformation. We also discuss extending Sympiler to other matrix methods.
3.1. Sparse Triangular Solve
Theory: In the symbolic inspector, the dependency graph DG_L is traversed using depth-first search (DFS) to determine the inspection set for the VIPrune transformation, which in this case is the reachset computed from L and the right-hand side vector. The graph can also be used to detect blocks with similar sparsity patterns, also known as supernodes, in sparse triangular solve. The blockset, which contains columns of L grouped into supernodes, is identified by inspecting DG_L using a node equivalence method. The node equivalence algorithm first assumes two nodes are equivalent and then compares their outgoing edges. If the outgoing edges go to the same destination nodes, the two nodes are equivalent and are merged.
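A simplified sketch of this node equivalence pass, under the common restriction that equivalent columns are adjacent, compares the off-diagonal pattern of column j with the full pattern of column j+1 in CSC storage; the layout and names are illustrative assumptions.

```c
/* Columns j and j+1 are equivalent when the outgoing edges of j
 * (off-diagonal row indices of column j) match the row indices of
 * column j+1, including its diagonal. Assumes sorted CSC columns
 * with the diagonal stored first. */
static int same_supernode(const int *Lp, const int *Li, int j)
{
    int len0 = Lp[j + 1] - (Lp[j] + 1);   /* col j without its diagonal */
    int len1 = Lp[j + 2] - Lp[j + 1];     /* col j+1 with its diagonal */
    if (len0 != len1) return 0;
    for (int t = 0; t < len1; t++)
        if (Li[Lp[j] + 1 + t] != Li[Lp[j + 1] + t]) return 0;
    return 1;
}

/* Greedy pass producing the blockset: blockSet[b] is the first column
 * of block b; a sentinel blockSet[nb] = n closes the last block.
 * Returns the number of blocks (blockSet needs n+1 slots). */
int find_supernodes(int n, const int *Lp, const int *Li, int *blockSet)
{
    int nb = 0;
    for (int j = 0; j < n; ) {
        blockSet[nb++] = j;
        int k = j;
        while (k + 1 < n && same_supernode(Lp, Li, k)) k++;
        j = k + 1;
    }
    blockSet[nb] = n;
    return nb;
}
```

Each pair blockSet[b] .. blockSet[b+1]-1 then forms one variable-sized block for VSBlock.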
Inspector-guided Transformations: Using the reachset, VIPrune limits the iteration spaces of the loops in triangular solve to only those that operate on the necessary nonzeros. The VSBlock transformation changes the loops to apply blocking as shown in Figure 7a for triangular solve. The diagonal block of each column-block, which is a small triangular solve, is solved first. The solution of the diagonal components is then substituted into the off-diagonal segment of the matrix.
Symbolic Inspection: The time complexity of the DFS on the dependency graph DG_L is proportional to the number of edges traversed plus the number of nonzeros in the RHS of the system. The time complexity of the node equivalence algorithm is proportional to the number of nonzeros in L. We provide overheads for these methods for the tested matrices in Section 4.3.
3.2. Cholesky Factorization
Cholesky factorization is commonly used in direct solvers and to precondition iterative solvers. The algorithm factors a Hermitian positive definite matrix A into LLᵀ, where L is a lower triangular matrix. Figure 14 shows an example matrix A and the corresponding matrix L after factorization.
Theory: The elimination tree (etree) (Davis and Hager, 2005) is one of the most important graph structures used in the symbolic analysis of sparse factorization algorithms. Figure 14 shows the corresponding elimination tree for factorizing the example matrix A. The etree of A is a spanning tree of the filled graph G⁺(A) satisfying parent(j) = min{ i > j : l_ij ≠ 0 }, where G⁺(A) is the graph of L + Lᵀ. The filled graph G⁺(A) results at the end of the elimination process and includes all edges of the original matrix A as well as fill-in edges. In-depth discussions of the theory behind the elimination tree, the elimination process, and the filled graph can be found in (Davis, 2006; Pothen and Toledo, 2004).
Figure 13 shows the pseudocode of the left-looking sparse Cholesky, which is performed in two phases: update (lines 3–6) and column factorization (lines 7–10). The update phase gathers the contributions of the already-factorized columns on the left. The column factorization phase calculates the square root of the diagonal element and applies it to the off-diagonal elements.
To find the pruneset that enables the VIPrune transformation, the row sparsity pattern of L has to be computed; Figure 13 shows how this information is used to prune the iteration space of the update phase in the Cholesky algorithm. Since L is stored in compressed column format, the etree and the sparsity pattern of A are used to determine the row sparsity pattern. A non-optimal method for finding the sparsity pattern of row i in L is as follows: for each nonzero a_ij in row i of A, the etree is traversed upwards from node j until node i is reached or a marked node is found. The visited nodes in this subtree give the nonzero columns of row i, and their number is the row-count of i. Sympiler uses a similar but more optimized approach from (Davis, 2006) to find row sparsity patterns.
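The non-optimal up-traversal just described can be sketched directly; parent encodes the etree (with -1 for a root), mark must be initialized to a value that is never a row index (e.g., -1), and the names are illustrative.

```c
/* Computes the below-diagonal pattern of row r of L. rowIdx holds the
 * column indices of the nnzRow nonzeros a_rj (j < r) in row r of A.
 * For each such j, walk up the etree until reaching r or a node
 * already marked for this row; the visited nodes form the pattern.
 * Returns the row-count (number of entries written to pattern). */
int row_pattern(int r, int nnzRow, const int *rowIdx,
                const int *parent, int *mark, int *pattern)
{
    int cnt = 0;
    for (int t = 0; t < nnzRow; t++)
        for (int i = rowIdx[t];
             i != r && i != -1 && mark[i] != r;
             i = parent[i]) {
            mark[i] = r;            /* mark node as visited for row r */
            pattern[cnt++] = i;
        }
    return cnt;
}
```

Note how the walk from column 0 below also discovers a fill-in column that has no corresponding nonzero in A: the etree predicts fill without touching numerical values.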
Supernodes used in VSBlock for Cholesky are found with the sparsity pattern of A and the etree. The sparsity pattern of L is different from that of A because of fill-ins created during factorization. However, the elimination tree along with the sparsity pattern of A can be used to find the sparsity pattern of L prior to factorization. As a result, memory for L can be allocated ahead of time to eliminate the need for dynamic memory allocation. To create the supernodes, the fill-in pattern should first be determined. Equation (1) is based on a theorem from (George and Liu, 1981a) and computes the sparsity pattern L_j of column j in L, where parent(c) = j means that j is the parent of node c in the etree and “∖” means exclusion. The theorem states that the nonzero pattern of column j of L is the union of the nonzero patterns of the children of j in the etree and the nonzero pattern of column j in A.
L_j = A_j ∪ {j} ∪ ⋃_{parent(c) = j} ( L_c ∖ {c} )   (1)
When the sparsity pattern of L is obtained, the following rule is used to merge columns to create basic supernodes: when the number of nonzeros in two adjacent columns c and c+1, disregarding the diagonal entry in column c, is equal, and c is the only child of c+1 in the etree, the two columns can be merged.
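This merge rule can be written as a small predicate; colCount and childCount are assumed precomputed from the pattern of L and the etree, and the exact counting convention (column c loses only its diagonal entry relative to column c+1) is our reading of the rule, so treat the sketch accordingly.

```c
/* Returns nonzero when adjacent columns c and c+1 may be merged into
 * one supernode: c must be the only etree child of c+1, and their
 * nonzero counts must match once the diagonal of column c is ignored.
 * colCount[j] = |L_j| (nnz of column j of L, diagonal included);
 * parent[] is the etree; childCount[j] = number of etree children. */
int mergeable(int c, const int *colCount, const int *parent,
              const int *childCount)
{
    return parent[c] == c + 1
        && childCount[c + 1] == 1
        && colCount[c] - 1 == colCount[c + 1];
}
```

Applying the predicate greedily over c = 0, 1, ... yields the basic supernode partition used as the blockset.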
Inspector-guided transformations: The VIPrune transformation applies to the update phase of Cholesky. With the row sparsity pattern information, when factorizing column j Sympiler only iterates over the dependent columns instead of all columns smaller than j. The VSBlock transformation applies to both the update and column factorization phases. Therefore, the outer loop of the Cholesky algorithm in Figure 13 is converted to a new loop that iterates over the provided blockset. All references to columns in the inner loops are changed to references to the corresponding block. For the diagonal part of the column factorization, a dense Cholesky is computed instead of the square root of the non-supernodal version. The resulting factor from the diagonal elements is applied to the off-diagonal rows through a sequence of dense triangular solves. VSBlock also converts the update phase from vector operations to matrix operations.
Symbolic Inspection: The computational complexity of building the etree in Sympiler is nearly O(|A|), where |A| is the number of nonzeros in A. The runtime complexity of finding the sparsity pattern of row i is proportional to the number of nonzeros in row i of L; executing the method for all columns results in a runtime of nearly O(|L|). The inspection overhead of finding the blockset for VSBlock includes the sparsity detection, which is done in nearly O(|L|), and the supernode detection (Davis, 2006).
3.3. Other Matrix Methods
The inspection graphs and inspection strategies supported in the current version of Sympiler can support a large class of commonly-used sparse matrix computations. The applications of the elimination tree go beyond the Cholesky factorization method and extend to some of the most commonly used sparse matrix routines in scientific applications, such as LU, QR, and orthogonal factorization methods (Liu, 1990), as well as incomplete and factorized sparse approximate inverse preconditioner computations (Janna et al., 2015). Inspection of the dependency graph and the proposed inspection strategies that extract reachsets and supernodes from the dependency graph are the fundamental symbolic analyses required to optimize algorithms such as rank update and rank increase methods (Davis and Hager, 2009), incomplete LU(0) (Naumov, 2012), incomplete Cholesky preconditioners, and up-looking implementations of factorization algorithms. Thus, Sympiler with the current set of symbolic inspectors can be made to support many of these matrix methods. We plan to extend to an even larger class of matrix methods and to support more optimization methods.
4. Experimental results
Table 2. Matrix set used in the experiments.

Problem ID | Name | n (×10^3) | nnz(A) (×10^6)
1 | cbuckle | 13.7 | 0.677
2 | Pres_Poisson | 14.8 | 0.716
3 | gyro | 17.4 | 1.02
4 | gyro_k | 17.4 | 1.02
5 | Dubcova2 | 65.0 | 1.03
6 | msc23052 | 23.1 | 1.14
7 | thermomech_dM | 204 | 1.42
8 | Dubcova3 | 147 | 3.64
9 | parabolic_fem | 526 | 3.67
10 | ecology2 | 1000 | 5.00
11 | tmt_sym | 727 | 5.08
We evaluate Sympiler by comparing its performance to two state-of-the-art libraries, namely Eigen (Guennebaud and Jacob, 2010) and CHOLMOD (Chen et al., 2008), for the Cholesky factorization method and the sparse triangular solve algorithm. Section 4.1 discusses the experimental setup and methodology. In Section 4.2 we demonstrate that the transformations enabled by Sympiler generate highly-optimized codes for sparse matrix algorithms compared to state-of-the-art libraries. Although symbolic analysis is performed only once at compile-time for a fixed sparsity pattern in Sympiler, we analyze the cost of the symbolic inspector in Section 4.3 and compare it with the symbolic costs in Eigen and CHOLMOD.
4.1. Methodology
We selected a set of symmetric positive definite matrices from (Davis and Hu, 2011), which are listed in Table 2. The matrices originate from different domains and vary in size. All matrices are real and stored in double precision. The testbed architecture is a 3.30GHz Intel® Core™ i7-5820K processor with L1, L2, and L3 cache sizes of 32KB, 256KB, and 15MB respectively, with Turbo Boost disabled. We use OpenBLAS 0.2.19 (Xianyi, 2016) for dense BLAS (Basic Linear Algebra Subprogram) routines when needed. All Sympiler-generated codes are compiled with GCC v5.4.0 using the -O3 option. Each experiment is executed 5 times and the median is reported.
We compare the performance of the Sympiler-generated code with CHOLMOD (Chen et al., 2008) as a specialized library for Cholesky factorization and with Eigen (Guennebaud and Jacob, 2010) as a general numerical library. CHOLMOD provides one of the fastest implementations of Cholesky factorization on single-core architectures (Gould et al., 2007). Eigen supports a wide range of sparse and dense operations, including sparse triangular solve and Cholesky. Thus, for Cholesky factorization we compare with both Eigen and CHOLMOD, while results for triangular solve are compared to Eigen. Both libraries are installed and executed using the recommended default configuration. Since Sympiler's current version does not support node amalgamation (Duff and Reid, 1983), this setting is not enabled in CHOLMOD. For Cholesky factorization both libraries support the more commonly used left-looking (supernodal) algorithm, which is also the algorithm used by Sympiler. Sympiler applies either one or both of the inspector-guided transformations as well as some of the enabled low-level transformations; currently, Sympiler implements unrolling, scalar replacement, and loop distribution from among the possible low-level transformations.
4.2. Performance of Generated Code
This section shows how the combination of the introduced transformations and the decoupling strategy enables Sympiler to outperform two state-of-the-art libraries for sparse Cholesky and sparse triangular solve.
Triangular solve: Figure 15 shows the performance of Sympiler-generated code compared to the Eigen library for a sparse triangular solve with a sparse RHS. The nonzero fill-in of the RHS in our experiments is selected to be less than 5%. The sparse triangular system solver is often used as a sub-kernel in algorithms such as left-looking LU (Davis, 2006) and Cholesky rank update methods (Davis and Hager, 2009), or as a solver after matrix factorization. Thus, the sparsity of the RHS in sparse triangular systems is typically close to the sparsity of the columns of a sparse matrix. For the tested problems, the number of nonzeros in every column of L was less than 5%.
The average improvement of the Sympiler-generated code, which we refer to as Sympiler (numeric), over the Eigen library is 1.49×. Eigen implements the approach demonstrated in Figure (c), where symbolic analysis is not decoupled from the numerical code. The Sympiler-generated code, in contrast, only manipulates numerical values, which leads to higher performance. Figure 15 also shows the effect of each transformation on the overall performance of the Sympiler-generated code. In the current version of Sympiler the symbolic inspector is designed to generate sets so that VSBlock can be applied before VIPrune. Our experiments show that this ordering often leads to better performance, mainly because Sympiler supports supernodes with a full diagonal block. As support for more transformations is added to Sympiler, we will enable it to automatically decide the best transformation ordering. Whenever applicable, the vectorization and peeling low-level transformations are also applied after VSBlock and VIPrune. Peeling leads to higher performance if applied after VSBlock, where iterations related to single-column supernodes are peeled. Vectorization is always applied after VSBlock and does not improve performance if only VIPrune is applied.
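To give a rough sense of what VIPrune buys, the generated numeric-only triangular solve reduces to a plain loop over a prune-set fixed at compile time instead of over all n columns. The sketch below is illustrative, not code emitted by Sympiler; it assumes CSC storage with the diagonal entry stored first in each column and a prune-set given in topological order.

```python
# Numeric-only sparse triangular solve after VIPrune: instead of scanning all
# n columns (or inspecting the RHS at runtime), the code visits only the
# prune-set that was computed symbolically.  Lp/Li/Lx are CSC column pointers,
# row indices, and values of the lower triangular matrix L.

def solve_pruned(Lp, Li, Lx, x, prune_set):
    # x holds b on entry and the solution on exit; prune_set must list the
    # columns that can touch a nonzero, in dependency (topological) order.
    for j in prune_set:                   # pruned iteration space
        x[j] /= Lx[Lp[j]]                 # divide by the diagonal L[j, j]
        for p in range(Lp[j] + 1, Lp[j + 1]):
            x[Li[p]] -= Lx[p] * x[j]      # scatter update into dependents
    return x
```

The loop body is identical to a standard forward solve; only the iteration space changes, which is why the transformation composes cleanly with low-level optimizations such as unrolling.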
Matrices 3, 4, 5, and 7 do not benefit from the VSBlock transformation, so their Sympiler runtimes in Figure 15 are only for VIPrune. Since small supernodes often do not lead to better performance, Sympiler does not apply the VSBlock transformation if the average size of the participating supernodes is smaller than a threshold. This parameter is currently hand-tuned and is set to 160. VSBlock is not applied to matrices 3, 4, 5, and 7 since their average supernode size is too small to improve performance. Also, since these matrices have small column counts, vectorization does not pay off.
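The supernodes that VSBlock and its size threshold operate on are maximal runs of consecutive columns whose nonzero structure below a dense diagonal block is identical. A minimal detector might look as follows; this is a sketch, assuming CSC storage with sorted row indices and the diagonal entry stored first in each column, and the names are illustrative rather than Sympiler's.

```python
# Minimal supernode detection: column j joins the supernode of column j-1
# when the rows of column j (including its diagonal) equal the rows of
# column j-1 below its diagonal, i.e. the two columns share a dense diagonal
# block and identical structure beneath it.

def find_supernodes(Lp, Li, n):
    supernodes = [[0]]                      # assumes n >= 1
    for j in range(1, n):
        below_prev = Li[Lp[j - 1] + 1:Lp[j]]   # rows of col j-1 below diagonal
        col_cur = Li[Lp[j]:Lp[j + 1]]          # rows of col j from diagonal
        if below_prev == col_cur:
            supernodes[-1].append(j)           # extend the current supernode
        else:
            supernodes.append([j])             # start a new supernode
    return supernodes
```

With such a partition in hand, an average-size check (e.g. against the hand-tuned threshold of 160 mentioned above) is a one-line decision over `len` of the parts.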
Cholesky: We compare the numerical manipulation code of Eigen and CHOLMOD for Cholesky factorization with the Sympiler-generated code. The results for CHOLMOD and Eigen in Figure 16 refer to the numerical code performance in floating point operations per second (FLOP/s). Eigen and CHOLMOD both execute parts of the symbolic analysis only once if the user explicitly indicates that the same sparse matrix is used for subsequent executions. However, even with such input from the user, neither library fully decouples the symbolic information from the numerical code. This is because they cannot afford to have a separate implementation for each sparsity pattern and also do not implement sparsity-specific optimizations. For fairness, when using Eigen and CHOLMOD we explicitly tell the library that the sparsity is fixed and thus report only the time related to the library's numerical code (which still contains some symbolic analysis).
As shown in Figure 16, for Cholesky factorization Sympiler performs up to 2.4× and 6.3× better than CHOLMOD and Eigen respectively. Eigen uses the left-looking non-supernodal approach; therefore, its performance does not scale well for large matrices. CHOLMOD benefits from supernodes and thus performs well for large matrices with large supernodes. However, CHOLMOD does not perform well for some small matrices and for large matrices with small supernodes. Sympiler provides the highest performance for almost all tested matrix types, which demonstrates the power of sparsity-specific code generation.
The application of kernel-specific and aggressive optimizations when generating code for dense sub-kernels enables Sympiler to generate fast code for any sparsity pattern. Since BLAS routines are not well-optimized for small dense kernels, they often do not perform well for the small blocks produced when applying VSBlock to sparse codes (Shin et al., 2010). Therefore, libraries such as CHOLMOD do not perform well for matrices with small supernodes. Sympiler has the luxury of generating code for its dense sub-kernels; instead of being handicapped by the performance of BLAS routines, it generates specialized and highly-efficient code for small dense sub-kernels. If the average column-count for a matrix exceeds a tuned threshold, Sympiler calls BLAS routines (Xianyi, 2016) instead. Since the column-count directly specifies the number of dense triangular solves, which is the most important dense sub-kernel in Cholesky, the average column-count is used to decide when to switch to BLAS routines. For example, the average column-counts of matrices 3, 4, 6, and 8 are less than the column-count threshold.
Decoupling the prune-set calculation from the numerical manipulation phase also improves the performance of the Sympiler-generated code. As discussed in subsection 3.2, the sparse Cholesky implementation needs to obtain the row sparsity pattern of L. The elimination tree of A and the upper triangular part of A are both used in CHOLMOD and Eigen to find the row sparsity pattern. Since A is symmetric and only its lower part is stored, both libraries compute the transpose of A in the numerical code to access its upper triangular elements. By fully decoupling symbolic analysis from the numerical code, Sympiler has the row sparsity information in the prune-set ahead of time; therefore, both the reach function and the matrix transpose operations are removed from the numeric code.
4.3. Symbolic Analysis Time
All symbolic analysis is performed at compile-time in Sympiler, and its generated code only manipulates numerical values. Since symbolic analysis is performed once for a specific sparsity pattern, its overhead amortizes over repeated executions of the numerical code. However, as demonstrated in Figures 17 and 18, even if the numerical code is executed only once, which is not common in scientific applications, the accumulated symbolic+numeric time of Sympiler is close to Eigen for the triangular solve and faster than both Eigen and CHOLMOD for Cholesky.
Triangular solve: Figure 17 shows the time Sympiler spends on symbolic analysis at compile-time, Sympiler (symbolic), for the sparse triangular solve. No symbolic time is available for Eigen since, as discussed, Eigen uses the code in Figure (c) for its triangular solve implementation. Figure 17 shows the symbolic analysis and numerical manipulation time of Sympiler normalized over Eigen's runtime. Sympiler's numeric plus symbolic time is on average 1.27× the runtime of the Eigen code. In addition, code generation and compilation in Sympiler cost between 6× and 197× the cost of the numeric solve, depending on the matrix. It is important to note that since the sparsity structure of the matrix in triangular solve does not change in many applications, the overhead of the symbolic inspector and compilation is only paid once. For example, in preconditioned iterative solvers a triangular system must be solved per iteration, and the iterative solver often executes thousands of iterations (Benzi et al., 2000; Papadrakakis and Bitoulas, 1993; Kershaw, 1978) until convergence, since the systems in scientific applications are not necessarily well-conditioned.
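The amortization argument can be made concrete with a small back-of-the-envelope helper. The numbers in the example are hypothetical, chosen near the upper end of the one-time cost range reported above, and are not measurements from this paper.

```python
# Amortized slowdown of a one-time symbolic-analysis + code-generation cost
# over k executions of the numeric code.  Costs are expressed in units of one
# numeric solve.

def amortized_slowdown(one_time_cost, k):
    """Total time of (one-time cost + k solves) relative to k pure solves."""
    return (one_time_cost + k) / k

# Hypothetical example: a one-time cost equal to 200 numeric solves spread
# over 1000 iterations of an iterative solver adds only 20% overall.
slowdown = amortized_slowdown(200, 1000)
```

For a solver that runs thousands of iterations, even the worst-case one-time cost quickly becomes a small fraction of the total runtime.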
Cholesky: Sparse libraries perform symbolic analysis ahead of time, which can be reused for the same sparsity pattern and improves the performance of their numerical executions. We compare the analysis time of the libraries with Sympiler's symbolic inspection time. Figure 18 provides the symbolic analysis and numeric manipulation times for both libraries normalized to Eigen's time. The time spent by Sympiler to perform symbolic analysis is referred to as Sympiler (symbolic). CHOLMOD (symbolic) and Eigen (symbolic) refer to the partially decoupled symbolic code that is only run once if the user indicates that the sparsity remains static. In nearly all cases Sympiler's accumulated time is better than that of the other two libraries. Code generation and compilation, which are not shown in the chart, add a very small amount of time, costing at most 0.3× the cost of numeric factorization. Also, as in the triangular solve example, a matrix with a fixed sparsity pattern must be factorized many times in scientific applications. For example, in Newton-Raphson (NR) solvers for nonlinear systems of equations, a Jacobian matrix is factorized in each iteration, and NR solvers require tens or hundreds of iterations to converge (Pawlowski et al., 2006; de Moura and de Moura, 2013).
5. Related Work
Compilers for general languages are hampered by optimization methods that either give up on optimizing sparse codes or only apply conservative transformations that do not lead to high performance. This is due to the indirection required to index and loop over the nonzero elements of sparse data structures. Polyhedral methods are limited when dealing with non-affine loop nests or subscripts (Ancourt and Irigoin, 1991; Chen, 2012; Kelly, 1998; Quilleré et al., 2000; Vasilache et al., 2006; Tiwari et al., 2009), which are common in sparse computations.
To make it possible for compilers to apply more aggressive loop and data transformations to sparse codes, recent work (Venkat et al., 2015; Strout et al., 2016; Van Der Spek and Wijshoff, 2010; Venkat et al., 2014; Venkat et al., 2016) has developed compile-time techniques for automatically creating inspectors and executors for use at runtime. These techniques use an inspector to analyze index arrays in sparse codes at runtime and an executor that uses this runtime information to execute code with specific optimizations. These inspector-executor techniques are limited in that they only apply to sparse codes with static index arrays; such codes require the matrix structure to not change during the computation. This approach performs well for methods such as sparse incomplete LU(0) and Gauss-Seidel, where additional nonzeros (fill-ins) are not introduced during computation. However, in a large class of sparse matrix methods, such as direct solvers including Cholesky, LU, and QR decompositions, index arrays change dynamically during computation since the algorithm itself introduces fill-ins. In addition, the indirections and dependencies in sparse direct solvers are tightly coupled with the algorithm, making it difficult to apply inspector-executor techniques.
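For static index arrays, the inspector-executor pattern described above can be sketched in a few lines: a runtime inspector derives level sets (wavefronts) of a sparse triangular solve, and an executor then processes the columns within each level in any order, or in parallel. This is a generic illustration of the technique under assumed CSC inputs, not code from any of the cited systems.

```python
# Inspector: compute level sets of a lower triangular CSC matrix.  Column j's
# level is 1 + the maximum level of the columns it depends on, so all columns
# inside one level are mutually independent and could be run concurrently by
# the executor.

def inspect_levels(Lp, Li, n):
    level = [0] * n
    for j in range(n):
        for p in range(Lp[j], Lp[j + 1]):
            i = Li[p]
            if i > j:                        # x[j] feeds the update of x[i]
                level[i] = max(level[i], level[j] + 1)
    groups = {}
    for j in range(n):
        groups.setdefault(level[j], []).append(j)
    return [groups[l] for l in sorted(groups)]  # wavefronts, shallow to deep
```

The key contrast with Sympiler is when this analysis runs: here the inspection happens at runtime on the index arrays, whereas Sympiler performs the analogous analysis once at compile-time.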
Domain-specific compilers integrate domain knowledge into the compilation process, improving the compiler's ability to transform and optimize specific kinds of computations. Such an approach has been used successfully for stencil computations (Ragan-Kelley et al., 2013; Tang et al., 2011; Holewinski et al., 2012), signal processing (Püschel et al., 2005), dense linear algebra (Gunnels et al., 2001; Spampinato and Püschel, 2014), matrix assembly and mesh analysis (Alnæs et al., 2014; Luporini et al., 2016), simulation (Kjolstad et al., 2016; Bernstein et al., 2016), and sparse operations (Davis, 2013; Rong et al., 2016). Though the simulation and sparse compilers use some knowledge of matrix structure to optimize operations, they do not build specialized matrix solvers.
Specialized libraries are the typical approach for sparse direct solvers. These libraries differ in (1) which numerical methods are implemented, (2) the implementation strategy or variant of the solver, (3) the type of platform supported, and (4) whether the algorithm is specialized for specific applications.
Each numerical method is suitable for different classes of matrices; for example, Cholesky factorization requires the matrix to be symmetric (or Hermitian) positive definite. Libraries such as SuperLU (Demmel et al., 1999a), KLU (Davis and Palamadai Natarajan, 2010), UMFPACK (Davis, 2004), and Eigen (Guennebaud and Jacob, 2010) provide optimized implementations of LU decomposition methods. Cholesky factorization is available through libraries such as Eigen (Guennebaud and Jacob, 2010), CSparse (Davis, 2006), CHOLMOD (Chen et al., 2008), MUMPS (Amestoy et al., 2000; Amestoy et al., 2001, 2006), and PARDISO (Schenk and Gärtner, 2004; Schenk et al., 2000). QR factorization is implemented in SPARSPAK (George and Liu, 1979, 1981b), SPOOLES (Ashcraft and Grimes, 1999), Eigen (Guennebaud and Jacob, 2010), and CSparse (Davis, 2006). The optimizations and algorithm variants used to implement sparse matrix methods differ between libraries. For example, LU decomposition can be implemented using multifrontal (Davis, 2004; Gupta et al., 1997; Davis, 2011), left-looking (Demmel et al., 1999a; Davis and Palamadai Natarajan, 2010; Duff and Reid, 1996; George and Liu, 1979), right-looking (Li and Demmel, 2003; Shen et al., 2000; Duff et al., 1991), and up-looking (Davis, 2005; Sherman, 1978) methods. Libraries are developed to support different platforms, such as sequential implementations (Davis, 2006; Chen et al., 2008; Davis and Palamadai Natarajan, 2010), shared memory (Davis, 2011; Demmel et al., 1999b; Schenk and Gärtner, 2004), and distributed memory (Demmel et al., 1999b; Amestoy et al., 2001). Finally, some libraries are designed to perform well on matrices arising from a specific domain. For example, KLU (Davis and Palamadai Natarajan, 2010) works best for circuit simulation problems. In contrast, SuperLU_MT applies optimizations with the assumption that the input matrix structure leads to large supernodes; such a strategy is a poor fit for circuit simulation problems.
6. Conclusion
In this paper we demonstrated how decoupling symbolic analysis from numerical manipulation can enable the generation of domain-specific, highly-optimized sparse codes with static sparsity patterns. Sympiler, the proposed domain-specific code generator, takes the sparse matrix pattern and the sparse matrix algorithm as inputs to perform symbolic analysis at compile-time. It then uses the information from symbolic analysis to apply a number of inspector-guided and low-level transformations to the sparse code. The Sympiler-generated code outperforms two state-of-the-art sparse libraries, Eigen and CHOLMOD, for the sparse Cholesky and the sparse triangular solve algorithms.
References
 Alnæs et al. (2014) Martin S Alnæs, Anders Logg, Kristian B Ølgaard, Marie E Rognes, and Garth N Wells. 2014. Unified form language: A domain-specific language for weak formulations of partial differential equations. ACM Transactions on Mathematical Software (TOMS) 40, 2 (2014), 9.
 Amestoy et al. (2000) Patrick R Amestoy, Iain S Duff, and JY L’Excellent. 2000. Multifrontal parallel distributed symmetric and unsymmetric solvers. Computer methods in applied mechanics and engineering 184, 2 (2000), 501–520.
 Amestoy et al. (2001) Patrick R Amestoy, Iain S Duff, Jean-Yves L'Excellent, and Jacko Koster. 2001. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J. Matrix Anal. Appl. 23, 1 (2001), 15–41.
 Amestoy et al. (2006) Patrick R Amestoy, Abdou Guermouche, Jean-Yves L'Excellent, and Stéphane Pralet. 2006. Hybrid scheduling for the parallel solution of linear systems. Parallel computing 32, 2 (2006), 136–156.
 Ancourt and Irigoin (1991) Corinne Ancourt and François Irigoin. 1991. Scanning polyhedra with DO loops. In ACM Sigplan Notices, Vol. 26. ACM, 39–50.
 Anderson et al. (1984) Dale Arden Anderson, John C Tannehill, and Richard H Pletcher. 1984. Computational fluid mechanics and heat transfer. (1984).
 Ashcraft and Grimes (1999) Cleve Ashcraft and Roger G Grimes. 1999. SPOOLES: An Object-Oriented Sparse Matrix Library. In PPSC.
 Benzi et al. (2000) Michele Benzi, Jane K Cullum, and Miroslav Tuma. 2000. Robust approximate inverse preconditioning for the conjugate gradient method. SIAM Journal on Scientific Computing 22, 4 (2000), 1318–1332.
 Bernstein et al. (2016) Gilbert Louis Bernstein, Chinmayee Shah, Crystal Lemire, Zachary Devito, Matthew Fisher, Philip Levis, and Pat Hanrahan. 2016. Ebb: A DSL for Physical Simulation on CPUs and GPUs. ACM Trans. Graph. 35, 2, Article 21 (May 2016), 12 pages. DOI:https://doi.org/10.1145/2892632
 Chen (2012) Chun Chen. 2012. Polyhedra scanning revisited. ACM SIGPLAN Notices 47, 6 (2012), 499–508.
 Chen et al. (2008) Yanqing Chen, Timothy A Davis, William W Hager, and Sivasankaran Rajamanickam. 2008. Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate. ACM Transactions on Mathematical Software (TOMS) 35, 3 (2008), 22.
 Davis (2004) Timothy A Davis. 2004. Algorithm 832: UMFPACK V4.3—an unsymmetric-pattern multifrontal method. ACM Transactions on Mathematical Software (TOMS) 30, 2 (2004), 196–199.
 Davis (2005) Timothy A Davis. 2005. Algorithm 849: A concise sparse Cholesky factorization package. ACM Transactions on Mathematical Software (TOMS) 31, 4 (2005), 587–591.
 Davis (2006) Timothy A Davis. 2006. Direct methods for sparse linear systems. Vol. 2. SIAM.
 Davis (2011) Timothy A Davis. 2011. Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank-revealing sparse QR factorization. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 8.
 Davis (2013) Timothy A Davis. 2013. Algorithm 930: FACTORIZE: An object-oriented linear system solver for MATLAB. ACM Transactions on Mathematical Software (TOMS) 39, 4 (2013), 28.
 Davis and Hager (2005) Timothy A Davis and William W Hager. 2005. Row modifications of a sparse Cholesky factorization. SIAM J. Matrix Anal. Appl. 26, 3 (2005), 621–639.
 Davis and Hager (2009) Timothy A Davis and William W Hager. 2009. Dynamic supernodes in sparse Cholesky update/downdate and triangular solves. ACM Transactions on Mathematical Software (TOMS) 35, 4 (2009), 27.
 Davis and Hu (2011) Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1.
 Davis and Palamadai Natarajan (2010) Timothy A Davis and Ekanathan Palamadai Natarajan. 2010. Algorithm 907: KLU, a direct sparse solver for circuit simulation problems. ACM Transactions on Mathematical Software (TOMS) 37, 3 (2010), 36.
 de Moura and de Moura (2013) Ailson P de Moura and Adriano Aron F de Moura. 2013. Newton–Raphson power flow with constant matrices: a comparison with decoupled power flow methods. International Journal of Electrical Power & Energy Systems 46 (2013), 108–114.
 Demmel et al. (1999a) James W Demmel, Stanley C Eisenstat, John R Gilbert, Xiaoye S Li, and Joseph WH Liu. 1999a. A supernodal approach to sparse partial pivoting. SIAM J. Matrix Anal. Appl. 20, 3 (1999), 720–755.
 Demmel et al. (1999b) James W Demmel, John R Gilbert, and Xiaoye S Li. 1999b. An asynchronous parallel supernodal algorithm for sparse gaussian elimination. SIAM J. Matrix Anal. Appl. 20, 4 (1999), 915–952.
 Dorf (2006) Richard C Dorf. 2006. Electronics, power electronics, optoelectronics, microwaves, electromagnetics, and radar. CRC press.
 Duff et al. (1991) Iain S Duff, Nick IM Gould, John K Reid, Jennifer A Scott, and Kathryn Turner. 1991. The factorization of sparse symmetric indefinite matrices. IMA J. Numer. Anal. 11, 2 (1991), 181–204.
 Duff and Reid (1983) Iain S Duff and John K Reid. 1983. The multifrontal solution of indefinite sparse symmetric linear equations. ACM Transactions on Mathematical Software (TOMS) 9, 3 (1983), 302–325.
 Duff and Reid (1996) Iain S Duff and John Ker Reid. 1996. The design of MA48: a code for the direct solution of sparse unsymmetric linear systems of equations. ACM Transactions on Mathematical Software (TOMS) 22, 2 (1996), 187–226.
 El-Kurdi et al. (2015) Yousef El-Kurdi, Maryam Mehri Dehnavi, Warren J. Gross, and Dennis Giannacopoulos. 2015. Parallel finite element technique using Gaussian belief propagation. Computer Physics Communications 193 (2015), 38–48. DOI:https://doi.org/10.1016/j.cpc.2015.03.019
 Exposito-Dominguez et al. (2012) Gonzalo Exposito-Dominguez, Jose-Manuel Fernandez Gonzalez, Pablo Padilla de la Torre, and Manuel Sierra-Castaner. 2012. Dual circular polarized steering antenna for satellite communications in X band. Progress In Electromagnetics Research 122 (2012), 61–76.
 George and Liu (1979) Alan George and Joseph WH Liu. 1979. The design of a user interface for a sparse matrix package. ACM Transactions on Mathematical Software (TOMS) 5, 2 (1979), 139–162.
 George and Liu (1981a) Alan George and Joseph W. Liu. 1981a. Computer Solution of Large Sparse Positive Definite Systems. Prentice Hall Professional Technical Reference.
 George and Liu (1981b) Alan George and Joseph W Liu. 1981b. Computer solution of large sparse positive definite systems. (1981).
 Gibson and Mirtich (1997) Sarah FF Gibson and Brian Mirtich. 1997. A survey of deformable modeling in computer graphics. Technical Report. Citeseer.
 Gilbert and Peierls (1988) John R Gilbert and Tim Peierls. 1988. Sparse partial pivoting in time proportional to arithmetic operations. SIAM J. Sci. Statist. Comput. 9, 5 (1988), 862–874.
 Gould et al. (2007) Nicholas IM Gould, Jennifer A Scott, and Yifan Hu. 2007. A numerical evaluation of sparse direct solvers for the solution of large sparse symmetric linear systems of equations. ACM Transactions on Mathematical Software (TOMS) 33, 2 (2007), 10.
 Guennebaud and Jacob (2010) Gaël Guennebaud and Benoit Jacob. 2010. Eigen. URL: http://eigen.tuxfamily.org (2010).
 Gunnels et al. (2001) John A Gunnels, Fred G Gustavson, Greg M Henry, and Robert A Van De Geijn. 2001. FLAME: Formal linear algebra methods environment. ACM Transactions on Mathematical Software (TOMS) 27, 4 (2001), 422–455.
 Gupta et al. (1997) Anshul Gupta, George Karypis, and Vipin Kumar. 1997. Highly scalable parallel algorithms for sparse matrix factorization. IEEE Transactions on Parallel and Distributed Systems 8, 5 (1997), 502–520.
 Holewinski et al. (2012) Justin Holewinski, Louis-Noël Pouchet, and P. Sadayappan. 2012. High-performance Code Generation for Stencil Computations on GPU Architectures. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS ’12). ACM, New York, NY, USA, 311–320. DOI:https://doi.org/10.1145/2304576.2304619
 Janna et al. (2015) Carlo Janna, Massimiliano Ferronato, and Giuseppe Gambolati. 2015. The use of supernodes in factored sparse approximate inverse preconditioning. SIAM Journal on Scientific Computing 37, 1 (2015), C72–C94.
 Kelly (1998) Wayne Kelly. 1998. Optimization within a unified transformation framework. (1998).
 Kershaw (1978) David S Kershaw. 1978. The incomplete Cholesky–conjugate gradient method for the iterative solution of systems of linear equations. J. Comput. Phys. 26, 1 (1978), 43–65.
 Kjolstad et al. (2016) Fredrik Kjolstad, Shoaib Kamil, Jonathan RaganKelley, David IW Levin, Shinjiro Sueda, Desai Chen, Etienne Vouga, Danny M Kaufman, Gurtej Kanwar, Wojciech Matusik, and Saman Amarasinghe. 2016. Simit: A language for physical simulation. ACM Transactions on Graphics (TOG) 35, 2 (2016), 20.
 Li (2005) Xiaoye S Li. 2005. An overview of SuperLU: Algorithms, implementation, and user interface. ACM Transactions on Mathematical Software (TOMS) 31, 3 (2005), 302–325.
 Li and Demmel (2003) Xiaoye S Li and James W Demmel. 2003. SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Transactions on Mathematical Software (TOMS) 29, 2 (2003), 110–140.
 Liu (1990) Joseph W. H. Liu. 1990. The Role of Elimination Trees in Sparse Factorization. SIAM J. Matrix Anal. Appl. 11, 1 (Jan. 1990), 134–172. DOI:https://doi.org/10.1137/0611010
 Luna et al. (2013) José A Zevallos Luna, Alexandre Siligaris, Cédric Pujol, and Laurent Dussopt. 2013. A packaged 60 GHz low-power transceiver with integrated antennas for short-range communications. In Radio and Wireless Symposium. 355–357.
 Luporini et al. (2016) Fabio Luporini, David A Ham, and Paul HJ Kelly. 2016. An algorithm for the optimization of finite element integration loops. arXiv preprint arXiv:1604.05872 (2016).
 Naumov (2012) Maxim Naumov. 2012. Parallel incomplete-LU and Cholesky factorization in the preconditioned iterative methods on the GPU. NVIDIA Technical Report NVR-2012-003 (2012).
 Papadrakakis and Bitoulas (1993) M Papadrakakis and N Bitoulas. 1993. Accuracy and effectiveness of preconditioned conjugate gradient algorithms for large and ill-conditioned problems. Computer methods in applied mechanics and engineering 109, 3–4 (1993), 219–232.
 Pawlowski et al. (2006) Roger P Pawlowski, John N Shadid, Joseph P Simonis, and Homer F Walker. 2006. Globalization techniques for Newton–Krylov methods and applications to the fully coupled solution of the Navier–Stokes equations. SIAM review 48, 4 (2006), 700–721.
 Pothen and Toledo (2004) Alex Pothen and Sivan Toledo. 2004. Elimination Structures in Scientific Computing. (2004).
 Püschel et al. (2005) Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nicholas Rizzolo. 2005. SPIRAL: Code Generation for DSP Transforms. Proceedings of the IEEE, special issue on “Program Generation, Optimization, and Adaptation” 93, 2 (2005), 232–275.
 Quilleré et al. (2000) Fabien Quilleré, Sanjay Rajopadhye, and Doran Wilde. 2000. Generation of efficient nested loops from polyhedra. International Journal of Parallel Programming 28, 5 (2000), 469–498.
 Ragan-Kelley et al. (2013) Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices 48, 6 (2013), 519–530.
 Rong et al. (2016) Hongbo Rong, Jongsoo Park, Lingxiang Xiang, Todd A Anderson, and Mikhail Smelyanskiy. 2016. Sparso: Context-driven optimizations of sparse linear algebra. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. ACM, 247–259.
 Schenk and Gärtner (2004) Olaf Schenk and Klaus Gärtner. 2004. Solving unsymmetric sparse systems of linear equations with PARDISO. Future Generation Computer Systems 20, 3 (2004), 475–487.
 Schenk et al. (2000) Olaf Schenk, Klaus Gärtner, and Wolfgang Fichtner. 2000. Efficient sparse LU factorization with left-right looking strategy on shared memory multiprocessors. BIT Numerical Mathematics 40, 1 (2000), 158–176.
 Shen et al. (2000) Kai Shen, Tao Yang, and Xiangmin Jiao. 2000. S+: Efficient 2D sparse LU factorization on parallel machines. SIAM J. Matrix Anal. Appl. 22, 1 (2000), 282–305.
 Sherman (1978) Andrew H Sherman. 1978. Algorithms for sparse Gaussian elimination with partial pivoting. ACM Transactions on Mathematical Software (TOMS) 4, 4 (1978), 330–338.
 Shin et al. (2010) Jaewook Shin, Mary W Hall, Jacqueline Chame, Chun Chen, Paul F Fischer, and Paul D Hovland. 2010. Speeding up Nek5000 with autotuning and specialization. In Proceedings of the 24th ACM International Conference on Supercomputing. ACM, 253–262.
 Spampinato and Püschel (2014) Daniele G Spampinato and Markus Püschel. 2014. A basic linear algebra compiler. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, 23.
 Strout et al. (2016) Michelle Mills Strout, Alan LaMielle, Larry Carter, Jeanne Ferrante, Barbara Kreaseck, and Catherine Olschanowsky. 2016. An approach for code generation in the sparse polyhedral framework. Parallel Comput. 53 (2016), 32–57.
 Tang et al. (2011) Yuan Tang, Rezaul Alam Chowdhury, Bradley C Kuszmaul, Chi-Keung Luk, and Charles E Leiserson. 2011. The Pochoir stencil compiler. In Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures. ACM, 117–128.
 Tiwari et al. (2009) Ananta Tiwari, Chun Chen, Jacqueline Chame, Mary Hall, and Jeffrey K Hollingsworth. 2009. A scalable autotuning framework for compiler optimization. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE, 1–12.
 Van Der Spek and Wijshoff (2010) Harmen LA Van Der Spek and Harry AG Wijshoff. 2010. Sublimation: expanding data structures to enable data instance specific optimizations. In International Workshop on Languages and Compilers for Parallel Computing. Springer, 106–120.
 Vasilache et al. (2006) Nicolas Vasilache, Cédric Bastoul, and Albert Cohen. 2006. Polyhedral code generation in the real world. In International Conference on Compiler Construction. Springer, 185–201.
 Venkat et al. (2015) Anand Venkat, Mary Hall, and Michelle Strout. 2015. Loop and data transformations for sparse matrix code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 521–532.
 Venkat et al. (2016) Anand Venkat, Mahdi Soltan Mohammadi, Jongsoo Park, Hongbo Rong, Rajkishore Barik, Michelle Mills Strout, and Mary Hall. 2016. Automating wavefront parallelization for sparse matrix computations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 41.
 Venkat et al. (2014) Anand Venkat, Manu Shantharam, Mary Hall, and Michelle Mills Strout. 2014. Nonaffine extensions to polyhedral code generation. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization. ACM, 185.
 Xianyi (2016) Z Xianyi. 2016. OpenBLAS: an optimized BLAS library. (2016).