Fast Hierarchical Solvers for Sparse Matrices Using Extended Sparsification and Low-Rank Approximation
(Funding from the "Army High Performance Computing Research Center" (AHPCRC) at Stanford, sponsored by the U.S. Army Research Laboratory under contract No. W911NF-07-2-0027, supported this research in part. The second author is a postdoctoral fellow of the Research Foundation Flanders (FWO) and a Francqui Foundation fellow of the Belgian American Educational Foundation (BAEF). The financial support is gratefully acknowledged.)
Abstract
Inversion of sparse matrices with standard direct solve schemes is robust, but computationally expensive. Iterative solvers, on the other hand, demonstrate better scalability, but need to be used with an appropriate preconditioner (e.g., ILU, AMG, Gauss-Seidel, etc.) for proper convergence. The choice of an effective preconditioner is highly problem dependent. We propose a novel fully algebraic sparse matrix solve algorithm, which has linear complexity in the problem size. Our scheme is based on Gauss elimination. For a given matrix, we approximate the LU factorization with a tunable accuracy determined a priori. This method can be used as a standalone direct solver with linear complexity and tunable accuracy, or as a black-box preconditioner in conjunction with iterative methods such as GMRES. The proposed solver is based on the low-rank approximation of fill-ins generated during the elimination. Similar to $\mathcal{H}$-matrices, fill-ins corresponding to blocks that are well-separated in the adjacency graph are represented via a hierarchical structure. The linear complexity of the algorithm is guaranteed if the blocks corresponding to well-separated clusters of variables are numerically low-rank.
Key words. sparse, hierarchical, low-rank, elimination, compression
AMS subject classifications. 65F05, 65F08, 65F50, 65N55, 68Q25
1 Introduction
In the realm of scientific computing, solving a sparse linear system,
(1) $A x = b,$
is known to be one of the challenging parts of many calculations, and is often the main bottleneck. Such a system of equations may be the result of the discretization of some partial differential equation (PDE), or more generally, can represent the local interactions of units in a network.
Solving a system of equations of size $N$ using a naive implementation of Gauss elimination has $O(N^3)$ time complexity. The best proved time complexity to solve a general linear system is $O(N^{\omega})$, where $\omega < 2.38$ [12, 17, 40]. In the case of sparse matrices, the time and memory complexity can be reduced when a proper elimination order is employed. Finding the optimal ordering (that results in the minimum number of new nonzeros in the LU factorization) is known to be an NP-complete problem [57]. For matrices resulting from the discretization of some PDE in physical space, nested dissection [23, 43] is known as an efficient elimination strategy. [1] discusses the complexity of nested dissection based on the sparsity pattern of the matrix. For a three-dimensional problem, the time and memory complexities are expected to be $O(N^2)$ and $O(N^{4/3})$, respectively, when nested dissection is employed. As the size of the problem grows, such complexities make direct solvers prohibitive.
Iterative methods, such as conjugate gradient [36], minimal residual [47], and generalized minimal residual [53], are generally more time and memory efficient. In addition, iterative solvers, such as those based on Krylov subspaces, can be accelerated using fast linear algebra techniques. The fast multipole method (FMM) [20, 22, 28, 45, 49, 58], for example, can accelerate the matrix-vector multiplication (from quadratic complexity to linear), which is the bulk of the calculation in Krylov subspace solvers. However, in practice, iterative methods need to be used in conjunction with preconditioners to limit the number of iterations. The choice of an efficient preconditioner is highly problem dependent, and there are many ongoing efforts to develop preconditioners that are optimized for particular applications. Hence, there is a need for general-purpose preconditioners. Hierarchical matrices enable us to develop such preconditioners.
FMM matrices are a subclass of a larger category of matrices called hierarchical matrices ($\mathcal{H}$-matrices) [7, 9, 32]. $\mathcal{H}$-matrices have a hierarchical low-rank structure. For instance, in a hierarchically off-diagonal low-rank (HODLR) matrix [5], the off-diagonal blocks can be represented through a hierarchy of low-rank interactions. If the bases used in the hierarchy are nested (i.e., the low-rank basis at each level is constructed using the low-rank basis of the child level), the matrix is called hierarchically semi-separable (HSS) [3, 15, 56]. In more general classes of hierarchical matrices, more complex low-rank structures can be considered. A full-rank dense matrix with many low-rank structures is in fact data-sparse [31, 33]. A data-sparse matrix can be represented via an extended sparse matrix, which has extra rows/columns, but only a few nonzero entries [2]. The hierarchical structure of such matrices can be used for efficient calculation and storage.
Recently, the hierarchical interpolative factorization [38, 39] was proposed, which can be used to directly solve systems obtained from differential and integral equations based on elliptic operators. The fast factorization is obtained by skeletonization of the fronts in the multifrontal scheme. Using the low-rank structure of the off-diagonal blocks to develop fast direct solvers for linear systems arising from integral equations has been widely studied [18, 26, 27, 41]. [25] proposed a direct solver for elliptic PDEs with variable coefficients on two-dimensional domains by exploiting internal low-rank structures in the matrices. [44] used hierarchical low-rank structures of the off-diagonal blocks to introduce a preconditioner for sparse matrices based on a multifrontal variant of sparse LU factorization. [46] introduced a black-box linear solver using the tensor-train format. [42] used a recursive low-rank approximation algorithm based on the Sherman-Morrison formula to obtain a preconditioner for symmetric sparse matrices.
Sparse matrices can be considered as a very special case of hierarchical matrices, where instead of low-rank blocks they initially have zero blocks. However, during the elimination process in a direct solve scheme, many of the zero blocks get filled. For a large category of matrices, including those obtained from the discretization of PDEs, most of the new fill-ins are numerically low-rank. This is justified when the Green's function associated to the PDE is smooth (non-oscillatory). In this paper, we use a hierarchical matrix structure to compress the fill-ins. A similar process can be applied in the elimination of an extended sparse matrix resulting from an originally dense matrix [4, 19]. This reduces the complexity of the direct solver to linear. The linear complexity of the method is guaranteed if the blocks corresponding to the interaction of well-separated nodes are numerically low-rank. We define the well-separated condition in LABEL:sec:HTree.
The proposed algorithm can be considered as an extension to block incomplete LU (ILU) [52] preconditioners. In a block ILU factorization, most of the new fill-ins (i.e., blocks that are created during the elimination process but are originally zero) are ignored; therefore, the block sparsity of the matrix is preserved, while the accuracy is not. In the proposed algorithm, instead, we use low-rank approximations to compress the new fill-ins. Using a tree structure, new fill-ins at the fine level are compressed and pushed to the parent (coarse) level. The elimination and compression processes are done in a bottom-to-top traversal.
In addition, the proposed algorithm has formal similarities with algebraic multigrid (AMG) methods [10, 11, 51, 54]. However, the two methods differ in the way they build the coarse system, and in how they use the restriction and prolongation operators. In AMG, the original system is solved at different levels (from fine to coarse). Here, the compressed fill-ins of each level (corresponding to the Schur complements) are solved at the coarser level above. Note that the proposed algorithm is purely algebraic, similar to AMG. If the matrix comes from the discretization of a PDE on a physical grid, the grid information can be exploited to improve the performance of the solver, similar to geometric multigrid.
The algorithm presented in this paper computes a hierarchical representation of the LU factorization of a sparse matrix using low-rank approximations. We introduce intermediate operations to compress the new fill-ins. The compressed fill-ins are represented using a set of extra variables. This technique is known as extended sparsification [14]. The accuracy of the factorization phase (i.e., Gauss elimination and compression), $\epsilon$, can be determined a priori. The time and memory complexities of the factorization are both $O(N)$, as will be clarified in LABEL:sec:standalone.
The method presented in this paper is similar to the fast hierarchical methods developed by Hackbusch et al. [9, 31, 32, 33] in the sense that both methods use a tree decomposition to identify and represent low-rank blocks. The key difference, however, is that in Hackbusch's algorithm the LU factorization is computed using a depth-first tree traversal order, whereas here we use a breadth-first (level by level, from leaf to root) traversal. The connections and differences between the proposed algorithm and Hackbusch's fast $\mathcal{H}$-matrix algebra are discussed in our companion paper [19], in which a similar method is used for dense matrix factorization.
Our solver can be used as a standalone direct solver with tunable accuracy. The factorization part is completely separate from the solve part, and is generally more expensive. This makes the algorithm appealing when multiple right hand sides are available (e.g., when using the proposed solver as a black-box preconditioner in an iterative method). We have implemented the algorithm in C++ (the code can be downloaded from bitbucket.org/hadip/lorasp), and benchmarked it both as a standalone solver (see LABEL:sec:standalone), and as a preconditioner in conjunction with the generalized minimum residual (GMRES) iterative solver [53] (see LABEL:sec:precond).
Furthermore, the proposed algorithm has interesting parallelization properties. On one hand, all calculations are block matrix computations, which can be highly accelerated using BLAS3 operations [6]. On the other hand, since the sparsity pattern at every level is preserved, the data dependency is very local, which is an interesting property for reducing the amount of communication. In addition, the amount of calculation scales with the third power of the block sizes, while the communication scales with the second power of the block sizes. This helps with the concurrency of the parallel implementation. Moreover, the order of elimination does not change the complexity of the presented algorithm, which is a particularly appealing property for a parallel implementation. The parallel implementation of the proposed method is not further discussed in this paper.
The remainder of this paper is organized as follows. In LABEL:sec:SLS we briefly introduce a graph representation of sparse matrices, and an interpretation of the Gauss elimination using the adjacency graph. In LABEL:sec:HTree some concepts related to the hierarchical representation of matrices are defined. The algorithm is explained in detail in LABEL:sec:alg, and the linear complexity analysis is provided in LABEL:sec:lin. We present numerical results obtained from various benchmarks in LABEL:sec:numerical. There are many avenues for optimization and extension of the algorithm; we discuss some of these opportunities in LABEL:sec:conclusion.
2 Sparse linear systems
In this section we briefly introduce the graphical framework that is required in the rest of the paper. We assume a sparse linear system of size $N$, as in LABEL:eqn:linsys, is given.
2.1 Adjacency graph
In many algorithms, including the method proposed in this paper, it is necessary (or more efficient) to operate on sub-blocks of the matrix rather than single elements. The blocks of the matrix can be identified using a partitioning as defined below.
Definition 1.
(partitioning) A partitioning $\Pi$ is defined as a surjective map $\Pi : \{1, \dots, N\} \to \{1, \dots, k\}$. $\Pi$ groups the rows/columns of $A$ into $k$ clusters $I_i = \Pi^{-1}(i)$, for $1 \le i \le k$.
We denote an entry of matrix $A$ located in row $i$ and column $j$ by $A(i,j)$. For clusters $I_s$ and $I_t$, we use $A(I_s, I_t)$ to represent the submatrix formed by concatenating all entries $A(i,j)$ such that $i \in I_s$ and $j \in I_t$. Additionally, for a vector $x$ of size $N$, we use $x(I_s)$ to represent the subvector formed by concatenating all entries $x(i)$ such that $i \in I_s$.
It is often fruitful to represent sparse matrices using graphs. An adjacency graph, as defined below, represents a sparse matrix with a given partitioning.
Definition 2.
(adjacency graph) A sparse matrix $A$ with a partitioning $\Pi$ can be represented by its adjacency graph $G = (V, E)$, where $V = \{v_1, \dots, v_k\}$. Each $v_i$ for $1 \le i \le k$ represents a cluster $I_i$ of rows and columns of $A$. A vertex $v_i$ is connected to a vertex $v_j$ by a directed edge if and only if the block $A(I_j, I_i)$ in $A$ is nonzero. (The adjacency graph of a matrix $A$ with a partitioning $\Pi$ is essentially the quotient graph of the adjacency graph of matrix $A$ with identity partitioning, where the equivalence relation is induced by partitioning $\Pi$.)
In LABEL:fig:domain, an example of the adjacency graph is illustrated. In the rest of the paper we use vertex and node interchangeably for the elements of $V$ in the adjacency graph.
The linear system in LABEL:eqn:linsys can also be represented using the adjacency graph of $A$, $G = (V, E)$. For a node $v_i$, $x_i$ denotes the vector of variables corresponding to cluster $I_i$. Similarly, $b_i$ denotes the vector of right hand sides corresponding to cluster $I_i$. Also, for an edge $e = (v_j \to v_i)$, $A_{ij} = A(I_i, I_j)$ denotes the submatrix corresponding to cluster $I_j$ of columns and cluster $I_i$ of rows. For the example shown in LABEL:fig:domain, the following two notations represent the same set of equations corresponding to a node $v_i$ and its incoming edges.
(2a) $A(I_i, :)\, x = b(I_i)$
(2b) $\sum_{(v_j \to v_i) \in E} A_{ij}\, x_j = b_i$
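To make the block notation concrete, the following minimal sketch (illustrative only; the names and types are ours, not LoRaSp's) builds the directed edge set of the adjacency graph of LABEL:def:adjGraph from an Eigen sparse matrix and a partition map:

```cpp
#include <Eigen/SparseCore>
#include <set>
#include <utility>
#include <vector>

// Build the directed edge set of the adjacency graph G(A, Pi).
// part[r] gives the cluster index of row/column r (the map Pi).
// An edge (v_i -> v_j) is recorded when the block A(I_j, I_i) has a
// nonzero, i.e., some entry with row in I_j and column in I_i is nonzero.
std::set<std::pair<int, int>> adjacencyEdges(
    const Eigen::SparseMatrix<double>& A, const std::vector<int>& part) {
  std::set<std::pair<int, int>> edges;
  for (int col = 0; col < A.outerSize(); ++col)
    for (Eigen::SparseMatrix<double>::InnerIterator it(A, col); it; ++it)
      edges.insert({part[it.col()], part[it.row()]});  // v_{Pi(col)} -> v_{Pi(row)}
  return edges;  // self-edges (i, i) correspond to the diagonal blocks
}
```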
2.2 Elimination
The block Gauss elimination process, or block LU factorization, can also be explained using the graph representation of the matrix. At step $s$ of the elimination process, a set of unknowns, $x_s$, is eliminated from the system of equations. This corresponds to eliminating vertex $v_s$ from the adjacency graph $G$. The self-edge from $v_s$ to itself corresponds to the pivot diagonal sub-block $A_{ss}$. After eliminating $v_s$, for every pair of an outgoing edge to a vertex $v_i$ and an incoming edge from a vertex $v_j$, a new edge from $v_j$ to $v_i$ is created, corresponding to the Schur complement of the eliminated edges, that is
(3) $A_{ij} \leftarrow A_{ij} - A_{is}\, A_{ss}^{-1}\, A_{sj}.$
Note that if the edge between $v_j$ and $v_i$ exists before elimination, the Schur complement adds to the existing sub-block.
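The elimination step of eq. (3) can be sketched as follows, storing one dense block per edge. This is an illustrative fragment under the notation above (the map-based storage and function names are assumptions, not the LoRaSp implementation):

```cpp
#include <Eigen/Dense>
#include <map>
#include <utility>
#include <vector>

using Block = Eigen::MatrixXd;
using EdgeMap = std::map<std::pair<int, int>, Block>;  // (i,j) -> A_ij

// Eliminate vertex s: for every outgoing edge (s -> v_i), block A_is, and
// incoming edge (v_j -> s), block A_sj, apply the Schur complement update
// of eq. (3): A_ij -= A_is * A_ss^{-1} * A_sj. A fill-in block is created
// where no edge existed before.
void eliminate(int s, EdgeMap& A,
               const std::vector<int>& in, const std::vector<int>& out) {
  Eigen::PartialPivLU<Block> pivot(A.at({s, s}));  // factor the pivot block once
  for (int i : out)
    for (int j : in) {
      Block update = A.at({i, s}) * pivot.solve(A.at({s, j}));
      auto it = A.find({i, j});
      if (it == A.end()) A[{i, j}] = -update;  // new fill-in edge v_j -> v_i
      else it->second -= update;               // edge existed: accumulate
    }
}
```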
The process described above reveals the fact that during the elimination process many new edges are introduced in the graph. This corresponds to generating new nonzero blocks in the matrix during the LU factorization. The generation of many dense blocks is what makes the direct factorization of sparse matrices prohibitive. Essentially, a matrix $A$ can be sparse, while the factors $L$ and $U$ in the LU factorization of $A$ are dense. In the next section, we explain how we can preserve the sparsity of the matrix during the elimination process by compressing the well-separated interactions. This process is known as extended sparsification.
2.3 Key idea
An important observation in the elimination process is the fact that fill-ins (i.e., new edges created during the elimination process) that correspond to well-separated vertices are often numerically low-rank. For a linear system obtained from a discretized PDE, well-separated vertices refer to points that are physically far enough from each other. For a general sparse matrix, two vertices are well-separated if their distance in the adjacency graph is large enough; this is formally defined in LABEL:def:wellSep. We replace such fill-ins with a sequence of low-rank matrices.
For example, consider the following symmetric linear system that is partitioned into 3 blocks
(4) $\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{12}^T & A_{22} & 0 \\ A_{13}^T & 0 & A_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$
In LABEL:fig:keyIdea:sub1 the adjacency graph of the system of equations in LABEL:eqn:example is shown. Now, consider eliminating $x_1$. The resulting system is as follows
(5) $\begin{pmatrix} \tilde{A}_{22} & \tilde{A}_{23} \\ \tilde{A}_{23}^T & \tilde{A}_{33} \end{pmatrix} \begin{pmatrix} x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} \tilde{b}_2 \\ \tilde{b}_3 \end{pmatrix}$
where $\tilde{A}_{22} = A_{22} - A_{12}^T A_{11}^{-1} A_{12}$, $\tilde{A}_{33} = A_{33} - A_{13}^T A_{11}^{-1} A_{13}$, $\tilde{A}_{23} = -A_{12}^T A_{11}^{-1} A_{13}$, $\tilde{b}_2 = b_2 - A_{12}^T A_{11}^{-1} b_1$, and $\tilde{b}_3 = b_3 - A_{13}^T A_{11}^{-1} b_1$. The adjacency graph of the system of equations in LABEL:eqn:example1 is depicted in LABEL:fig:keyIdea:sub2. Nodes 2 and 3 can be considered well-separated (see LABEL:def:wellSep). They get connected due to the elimination of node 1. We assume their interaction is low-rank, and can be written as
(6) $\tilde{A}_{23} \approx U V^T$
where $U$ and $V$ are tall matrices. We can use any low-rank approximation in LABEL:eqn:lowrank, for example the singular value decomposition (SVD). We combine LABEL:eqn:lowrank and LABEL:eqn:example1 to define a new set of equations, in which the direct interaction of nodes 2 and 3 is replaced by a sequence of low-rank interactions
(7) $\begin{pmatrix} \tilde{A}_{22} & 0 & U & 0 & 0 & 0 \\ 0 & \tilde{A}_{33} & 0 & V & 0 & 0 \\ U^T & 0 & 0 & 0 & -I & 0 \\ 0 & V^T & 0 & 0 & 0 & -I \\ 0 & 0 & -I & 0 & 0 & I \\ 0 & 0 & 0 & -I & I & 0 \end{pmatrix} \begin{pmatrix} x_2 \\ x_3 \\ z_2 \\ z_3 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} \tilde{b}_2 \\ \tilde{b}_3 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}$
In the above equation, we introduced extra variables ($z_2$, $z_3$, $y_2$, and $y_3$) to represent the far-field interactions. This technique is known as extended sparsification [14]. Note that the system of equations LABEL:eqn:extend is equivalent to the system of equations LABEL:eqn:example1 up to the accuracy of LABEL:eqn:lowrank. In fact, if we apply a Gaussian elimination to the matrix in LABEL:eqn:extend, eliminating the auxiliary block rows starting from the last, we get back to the original matrix in LABEL:eqn:example1.
In an analogy with the fast multipole method, $y_2$ and $y_3$ can be considered as the multipole coefficients, and $z_2$ and $z_3$ as the local coefficients. In LABEL:fig:keyIdea:sub3 the adjacency graph of the extended linear system is depicted. The red and black nodes correspond to the extended part of the matrix in LABEL:eqn:extend (the extra variables introduced above). As a result of extended sparsification, nodes 2 and 3 are disconnected in LABEL:fig:keyIdea:sub3. Therefore, in the general case node 2 could be eliminated without introducing new fill-ins between node 3 and the other neighbors of node 2. (Note that in this example, we intentionally chose the order of elimination $x_1, x_2, x_3$ to introduce a new fill-in and demonstrate the low-rank compression process. Using $x_2, x_3, x_1$ as the order of elimination would result in no fill-ins for this example. Finding such an order of elimination, however, is a hard problem and may not be possible for practical matrices.) As shown in LABEL:lem:sparse, removing the well-separated interactions of nodes before elimination preserves the sparsity.
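For concreteness, the following sketch assembles the extended matrix of LABEL:eqn:extend from the blocks of LABEL:eqn:example1 and the factors of LABEL:eqn:lowrank. It is a minimal illustration (the function name and the dense storage are ours, not LoRaSp's):

```cpp
#include <Eigen/Dense>
using Eigen::MatrixXd;

// Assemble the extended system of eq. (7) from the blocks of eqs. (5)-(6).
// A22, A33: diagonal blocks after eliminating x1; U (n2 x k) and V (n3 x k)
// with A23 ~ U * V^T. Variable order: (x2, x3, z2, z3, y2, y3).
MatrixXd extendedSystem(const MatrixXd& A22, const MatrixXd& A33,
                        const MatrixXd& U, const MatrixXd& V) {
  const int n2 = A22.rows(), n3 = A33.rows(), k = U.cols();
  const MatrixXd I = MatrixXd::Identity(k, k);
  MatrixXd E = MatrixXd::Zero(n2 + n3 + 4 * k, n2 + n3 + 4 * k);
  int x2 = 0, x3 = n2, z2 = n2 + n3, z3 = z2 + k, y2 = z3 + k, y3 = y2 + k;
  E.block(x2, x2, n2, n2) = A22;   E.block(x3, x3, n3, n3) = A33;
  E.block(x2, z2, n2, k) = U;      E.block(z2, x2, k, n2) = U.transpose();
  E.block(x3, z3, n3, k) = V;      E.block(z3, x3, k, n3) = V.transpose();
  E.block(z2, y2, k, k) = -I;      E.block(y2, z2, k, k) = -I;
  E.block(z3, y3, k, k) = -I;      E.block(y3, z3, k, k) = -I;
  E.block(y2, y3, k, k) = I;       E.block(y3, y2, k, k) = I;
  return E;  // eliminating (z, y) from E recovers eq. (5) with A23 = U V^T
}
```

One can verify the equivalence numerically: the Schur complement of the auxiliary block in the returned matrix reproduces the two-by-two block system of LABEL:eqn:example1 with $\tilde{A}_{23}$ replaced by $U V^T$.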
In the general case, we start with a standard block Gauss elimination. When fill-ins are created, similar to the above example, we apply the extended sparsification (compression). This results in the creation of new nodes in the graph belonging to a coarser level.
As we proceed with the elimination process at a given level, edges between the auxiliary nodes at the parent level are created. After the elimination process at the current level is completed, we proceed with the elimination at the parent level, and continue up to the root (level 0). Essentially, in this process we form a hierarchical approximation of the $L$ and $U$ matrices (which are generally dense) in the LU factorization of $A$, without directly computing them.
In addition, similar to the agglomeration process in multigrid methods, we consider one red node and one black node for every pair of clusters (i.e., the well-separated interactions of each pair of clusters are compressed together). Therefore, the number of red nodes at the parent level is half of the number of red nodes at the current level. We formally define the hierarchy of nodes in LABEL:sec:HTree, and the details of the algorithm in LABEL:sec:alg.
3 Hierarchical representation
To form a hierarchical tree, we recursively partition the row/column indices of the matrix. This is formally defined as a sequence of nested partitionings.
Definition 3.
(nested partitionings) A sequence of partitionings $\Pi_0, \Pi_1, \dots, \Pi_L$ with $|\Pi_l| = 2^l$ is called nested if every cluster $I \in \Pi_l$, for $0 \le l < L$, is the union of two clusters $I_1, I_2 \in \Pi_{l+1}$. Cluster $I$ is then called the parent of the two child clusters $I_1$ and $I_2$. We call $I_1$ and $I_2$ sibling clusters. The level of a cluster is defined as the index of its defining partitioning. The clusters associated to the finest partitioning, $\Pi_L$, have no children and are called leaf clusters, while the cluster associated to $\Pi_0$ has no parent and is called the root cluster. All other clusters have exactly two children and one parent. (The binary subdivision is not necessary; we can generalize this to quadtrees, octrees, etc.)
A visual example of nested partitionings with six levels (i.e., $L = 5$) is illustrated in LABEL:sec:appB. Now, we define the hierarchical tree (denoted by $\mathcal{H}$-tree; in fact, it is not strictly a tree, since, as we see later, there are edges between nodes at each level) of a sparse matrix given a sequence of nested partitionings.
Definition 4.
(hierarchical tree) Given a sparse matrix $A$ and a nested sequence of partitionings $\Pi_0, \dots, \Pi_L$, the hierarchical tree ($\mathcal{H}$-tree) is defined as a directed graph with red and black vertices and two types of edges (parent-child and interaction edges).
The vertices of the $\mathcal{H}$-tree correspond to the clusters associated with the nested partitionings. For every cluster $I \in \Pi_l$ with $0 \le l < L$ there is a corresponding black node $b_I$ in the vertex set of the $\mathcal{H}$-tree. Also, there is a pair of red nodes $r_{I_1}$ and $r_{I_2}$ corresponding to the sibling clusters $I_1$ and $I_2$ (children of the cluster $I$) in the vertex set of the $\mathcal{H}$-tree. We call such a pair of red nodes a supernode, and denote it by $\bar{r}_I$; it corresponds to the cluster $I$. $b_I$ is connected to $r_{I_1}$ and $r_{I_2}$ by parent-child edges. Hence, $b_I$ is also the parent of the supernode $\bar{r}_I$. Additionally, the red node corresponding to $I$ is connected to $b_I$ by a parent-child edge. We denote the parent of a node $n$ (which can be a red, black, or super node) by $\mathcal{P}(n)$.
We also consider one special red node associated to the root cluster. The level of each vertex of the $\mathcal{H}$-tree is denoted by a superscript index. Additionally, the depth of the $\mathcal{H}$-tree is defined as $L$.
There is an interaction edge between two red nodes at level $L$ (leaf red nodes) if the corresponding vertices in the adjacency graph of $A$ with partitioning $\Pi_L$ are connected. Therefore, the subgraph of the $\mathcal{H}$-tree induced by the leaf red nodes and the interaction edges is the adjacency graph of $A$ with partitioning $\Pi_L$.
There is no interaction edge between non-leaf vertices (shown transparent in LABEL:fig:HTree) of an $\mathcal{H}$-tree before applying the elimination. Non-leaf nodes are reserved to be used for the extended sparsification, similar to the red and black nodes in LABEL:fig:keyIdea:sub3.
Similar to the vertices of an adjacency graph, each node of an $\mathcal{H}$-tree also corresponds to a set of variables and equations. Leaf nodes of the $\mathcal{H}$-tree correspond to the variables and equations in LABEL:eqn:linsys partitioned using $\Pi_L$. Non-leaf nodes of the $\mathcal{H}$-tree, however, correspond to the auxiliary variables and equations. In LABEL:sec:comp, when we explain the extended sparsification in the general case, we will introduce the variables and equations corresponding to the non-leaf nodes of the $\mathcal{H}$-tree.
An example of an $\mathcal{H}$-tree is depicted in LABEL:fig:HTree. A parent-child edge between two nodes is shown by a dashed line, whereas interaction edges are shown by solid lines. In the rest of the paper, we use edge and interaction edge interchangeably, while parent-child edges are explicitly mentioned.
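One possible in-memory representation of the $\mathcal{H}$-tree is sketched below. The layout is hypothetical and only meant to fix ideas about the two node colors and the two edge types; LoRaSp's actual data structures may differ:

```cpp
#include <Eigen/Dense>
#include <utility>
#include <vector>

// Illustrative H-tree node. Red siblings are merged into a supernode during
// factorization; the black node at the same level is the supernode's parent,
// and the red node of the coarser level is the black node's parent.
struct Node {
  enum Kind { Red, Black, Super } kind;
  int level;                    // 0 = root level, L = leaf level
  Node* parent = nullptr;       // parent-child edge
  std::vector<Node*> children;  // empty for leaf red nodes
  // Interaction edges: source node and the associated matrix block.
  std::vector<std::pair<Node*, Eigen::MatrixXd>> inEdges;
  Eigen::VectorXd x, b;         // variables and right hand side of this node
};
```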
Definition 5.
(adjacent clusters) Consider a sparse matrix $A$ and a nested sequence of partitionings $\Pi_0, \dots, \Pi_L$. Two leaf clusters $I$ and $J$ are called adjacent (or neighbors) iff $A(I, J)$ or $A(J, I)$ are nonzero blocks (i.e., the nodes corresponding to $I$ and $J$ in the adjacency graph of $A$ with partitioning $\Pi_L$ are connected). Two clusters $I$ and $J$ (not necessarily with the same level) are adjacent iff a leaf descendant cluster of $I$ is adjacent to a leaf descendant cluster of $J$. (Cluster $I'$ is a descendant of cluster $I$ iff $I' \subseteq I$.)
Definition 6.
(well-separated nodes in the $\mathcal{H}$-tree) Nodes $n_1$ and $n_2$ in the vertex set of an $\mathcal{H}$-tree (which can be red, black, or supernodes at different levels) are well-separated if their corresponding clusters are not adjacent. An interaction edge that connects two well-separated nodes is called a well-separated edge. Well-separated edges do not exist initially in the $\mathcal{H}$-tree, and are created as a result of elimination.
Low-rank interaction is typically observed in physical systems when two clusters are sufficiently far apart from each other (e.g., well-separated clusters in the fast multipole method). We can replace this condition by an equivalent distance requirement in the graph. However, for simplicity we weaken the requirement and declare two clusters well-separated whenever they are not adjacent (LABEL:def:wellSep). If the partitioning of the domain is adequate, this simple definition is sufficient to approximate the more accurate separation requirement based on distance. In many cases, well-separated interactions are low-rank, but there is no guarantee; it depends on various details not studied in this work, such as the shape of the clusters.
4 Algorithm
In the previous section we defined the $\mathcal{H}$-tree, which is used to represent the extended system graphically, similar to the example provided in LABEL:fig:keyIdea:sub3. In this section, we explain the details of the algorithm to compute a hierarchical representation of an approximate LU factorization of matrix $A$. Note that while $A$ is sparse, the $L$ and $U$ matrices are typically dense. However, we assume that they can be represented using a hierarchical tree (through the extended sparsification). We could find the matrices $L$ and $U$ through an elimination process, and compute their hierarchical representation using an extended sparsification afterwards. However, in order to achieve linear complexity, we perform the elimination and the extended sparsification (compression) together.
The algorithm presented in this paper takes advantage of a technique similar to the inverse fast multipole method (IFMM) [19]. The IFMM can be used to compute the hierarchical representation of the LU factorization of a dense matrix which is given in hierarchical form (i.e., an FMM matrix). Essentially, in the IFMM, we are given an $\mathcal{H}$-tree that has interaction edges at all levels. This tree represents a dense matrix.
In the sparse case, however, the $\mathcal{H}$-tree initially has interaction edges only at the leaf level. In both the sparse and dense cases, we start with an $\mathcal{H}$-tree that represents the matrix $A$, and end up with an $\mathcal{H}$-tree representing an approximate LU factorization of $A$. In LABEL:alg:fact the overall factorization scheme is introduced; the various sub-algorithms are explained afterwards. In addition, a step-by-step example of the elimination process on the $\mathcal{H}$-tree and the corresponding extended matrix is presented in LABEL:sec:appA. Similar to the standard LU factorization, after the elimination process we are able to efficiently compute $x = A^{-1} b$ through forward and backward substitutions.
4.1 Initializing the $\mathcal{H}$-tree
The Initialize() function in LABEL:alg:fact consists of computing the nested partitionings and forming the $\mathcal{H}$-tree with depth $L$, as defined in LABEL:def:Htree. An example of an $\mathcal{H}$-tree and the corresponding matrix is depicted in LABEL:fig:pictorial_alg0.
Leaf nodes and interaction edges of the $\mathcal{H}$-tree initially represent the given linear system of equations LABEL:eqn:linsys with partitioning $\Pi_L$. Through the elimination process, we extend the system of equations, and use the non-leaf nodes of the $\mathcal{H}$-tree to represent the new variables and equations.
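A minimal sketch of the nested-partitioning construction is given below. It assumes a user-supplied bisection routine; in LoRaSp this role is played by a graph partitioner (SCOTCH), whose API is not shown here:

```cpp
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

// Recursive bisection producing nested partitionings Pi_0 ... Pi_L
// (Definition 3). Cluster ids are assigned so that cluster c at level l
// has children 2c and 2c+1 at level l+1 (a binary tree of clusters).
using Cluster = std::vector<int>;  // row/column indices of one cluster
using Bisect = std::function<std::pair<Cluster, Cluster>(const Cluster&)>;

void nestedPartition(const Cluster& I, int level, int L, int id,
                     std::vector<std::vector<Cluster>>& Pi,
                     const Bisect& bisect) {
  if (Pi.size() < static_cast<size_t>(L + 1)) Pi.resize(L + 1);
  Pi[level].resize(std::max<size_t>(Pi[level].size(), id + 1));
  Pi[level][id] = I;
  if (level == L) return;  // leaf cluster
  auto [left, right] = bisect(I);  // e.g., an edge-cut minimizing partitioner
  nestedPartition(left,  level + 1, L, 2 * id,     Pi, bisect);
  nestedPartition(right, level + 1, L, 2 * id + 1, Pi, bisect);
}
```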
4.2 Forming supernodes
The outer loop in LABEL:alg:fact is over the different levels, from the bottom to the top of the $\mathcal{H}$-tree. At each level, we start by merging red siblings into supernodes. In LABEL:alg:fact this process is denoted by the function MergeRedNodes(). This process results in a coarser representation of the linear system.
We substitute the interactions (i.e., edges) between any two pairs of red siblings ($r_{i_1}, r_{i_2}$ and $r_{j_1}, r_{j_2}$) with an interaction between the corresponding supernodes ($\bar{r}_i$ and $\bar{r}_j$) as follows.
(8) $A_{\bar{r}_i \bar{r}_j} = \begin{pmatrix} A_{r_{i_1} r_{j_1}} & A_{r_{i_1} r_{j_2}} \\ A_{r_{i_2} r_{j_1}} & A_{r_{i_2} r_{j_2}} \end{pmatrix}$
Similarly, the variable and right hand side vectors corresponding to a supernode are formed by concatenating the variables and right hand sides of its constituting red nodes.
(9) $x_{\bar{r}_i} = \begin{pmatrix} x_{r_{i_1}} \\ x_{r_{i_2}} \end{pmatrix}$
(10) $b_{\bar{r}_i} = \begin{pmatrix} b_{r_{i_1}} \\ b_{r_{i_2}} \end{pmatrix}$
The process of merging red nodes to form the supernodes is illustrated in LABEL:fig:merge. The merging process is also depicted in LABEL:fig:pictorial_alg1 in the example provided in LABEL:sec:appA.
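The merge operations of eqs. (8)-(10) amount to block concatenations, as in the following illustrative fragment (function names are ours):

```cpp
#include <Eigen/Dense>
using Eigen::MatrixXd;
using Eigen::VectorXd;

// Merge two red siblings into a supernode: the four interaction blocks
// between the two pairs become one 2x2 block matrix (eq. (8)).
MatrixXd mergeBlocks(const MatrixXd& A11, const MatrixXd& A12,
                     const MatrixXd& A21, const MatrixXd& A22) {
  MatrixXd M(A11.rows() + A21.rows(), A11.cols() + A12.cols());
  M << A11, A12,
       A21, A22;
  return M;
}

// Stack the variable / right hand side vectors (eqs. (9)-(10)).
VectorXd mergeVectors(const VectorXd& v1, const VectorXd& v2) {
  VectorXd v(v1.size() + v2.size());
  v << v1, v2;
  return v;
}
```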
4.3 Compressing well-separated edges
The next sub-algorithm to consider is the compression. In LABEL:alg:fact this process is denoted by the function Compress(). During the compression, the well-separated interactions of a supernode are pushed to the parent-level nodes, which represent a set of auxiliary variables. This is essentially the extended sparsification method which we discussed in the example in LABEL:sec:keyIdea (i.e., replacing the well-separated edges in LABEL:fig:keyIdea:sub2 by the sequence of edges between red and black nodes at the parent level, as shown in LABEL:fig:keyIdea:sub3).
Assume we are at level $l$, and about to apply Compress($\bar{r}$) to a supernode $\bar{r}$. Also, assume $\bar{r}$ is of size $m$ (i.e., corresponds to $m$ variables and equations), and interacts with (i.e., has an edge to) $w$ well-separated nodes $p_1, \dots, p_w$ with sizes $n_1, \dots, n_w$, respectively. As will become clear later, the nodes $p_i$ for $1 \le i \le w$ can either be red nodes (at the parent level) or supernodes (at the same level).
Assume the blocks $B_1, \dots, B_w$ are associated to the outgoing well-separated edges from $\bar{r}$ to $p_1, \dots, p_w$, respectively, i.e., $B_i = A_{p_i \bar{r}}$, where $B_i$ is an $n_i$ by $m$ matrix. Similarly, $C_1, \dots, C_w$ are associated to the incoming well-separated edges to $\bar{r}$, i.e., $C_i = A_{\bar{r} p_i}$, where $C_i$ is an $m$ by $n_i$ matrix. This is depicted schematically in LABEL:fig:comp (left).
Similar to the example in LABEL:sec:keyIdea, we assume well-separated edges can be approximated using a low-rank factorization. We compress all well-separated interactions of a supernode together. We then introduce auxiliary variables, and replace the well-separated edges by new edges involving the auxiliary variables (i.e., going from the left configuration to the right configuration in LABEL:fig:comp).
In order to compress all well-separated edges together, we form a temporary matrix $T$ by vertically concatenating $B_1, \dots, B_w$ as well as $C_1^T, \dots, C_w^T$. Using a low-rank approximation method (e.g., SVD), we write:
(11) $T = \begin{pmatrix} B_1 \\ \vdots \\ B_w \\ C_1^T \\ \vdots \\ C_w^T \end{pmatrix} \approx \begin{pmatrix} U_1 \\ \vdots \\ U_w \\ V_1 \\ \vdots \\ V_w \end{pmatrix} W^T$
where $U_i$ and $V_i$ are $n_i$ by $r$ matrices for $1 \le i \le w$, and $W$ is an $m$ by $r$ matrix. $r$ is the rank of the above low-rank approximation. From LABEL:eqn:lra we can write
(12) $B_i \approx U_i W^T, \qquad C_i \approx W V_i^T, \qquad 1 \le i \le w.$
$B_i$ contributes to the equation corresponding to the node $p_i$ with a term of the form $B_i x_{\bar{r}}$, for $1 \le i \le w$. We can rewrite this term as
(13) $B_i x_{\bar{r}} \approx U_i \big( W^T x_{\bar{r}} \big).$
Similarly, the contribution of $C_i$ to the equation corresponding to the node $\bar{r}$ can be written as
(14) $C_i x_{p_i} \approx W \big( V_i^T x_{p_i} \big).$
Now, we apply the extended sparsification, and introduce new variables and equations as follows
(15) $W^T x_{\bar{r}} - x_{r'} = 0$
(16) $\sum_{i=1}^{w} V_i^T x_{p_i} - x_{b} = 0$
In the above, we defined two new vectors of variables, $x_{b}$ and $x_{r'}$, each of size $r$ (similar to the vectors $y$ and $z$ in the example of LABEL:sec:keyIdea). We assign equations LABEL:eqn:interpol and LABEL:eqn:anterpol to the black node $b = \mathcal{P}(\bar{r})$ and the red node $r' = \mathcal{P}(b)$, respectively. Therefore, the right hand sides associated with the new nodes are zero:
(17) $b_{b} = 0, \qquad b_{r'} = 0.$
Now, we apply the following edge updates:
- Remove the edges from $\bar{r}$ to $p_i$ and vice versa, for $1 \le i \le w$.
- Add edges from $r'$ to $p_i$ with blocks $U_i$, for $1 \le i \le w$.
- Add edges from $p_i$ to $r'$ with blocks $V_i^T$, for $1 \le i \le w$.
- Add an edge from $b$ to $\bar{r}$ and vice versa, with blocks $W$ and $W^T$, respectively.
- Add an edge from $r'$ to $b$ and vice versa, with blocks $-I_r$ (minus the identity matrix of size $r$).
Therefore, in the compression process all well-separated edges connected to the supernode $\bar{r}$ are substituted with edges to/from the parent nodes $b$ and $r'$, as shown in LABEL:fig:comp. Note that each step of the compression process described here is in some sense "half" of the extended sparsification step as described in LABEL:sec:keyIdea. For example, in LABEL:fig:pictorial_alg2 there is a well-separated interaction between two supernodes. After the compression step, this interaction is substituted with a well-separated edge between one supernode and the parent-level red node on the other side (LABEL:fig:pictorial_alg3). Ultimately, when we are about to eliminate that supernode, we further compress this edge and make the connection between the two parent-level nodes (LABEL:fig:pictorial_alg6), which is now the same as the example in LABEL:fig:keyIdea.
As a result of the compression process on a supernode $\bar{r}$, we defined $2r$ new auxiliary variables, corresponding to the two nodes $b$ and $r'$, each of size $r$. Note that if the matrix is symmetric, $U_i = V_i$ for $1 \le i \le w$. Therefore, we would not need to concatenate the $C_i^T$'s in LABEL:eqn:lra, and only half of the above calculations are required.
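The core of the compression step, eqs. (11)-(12), reduces to a truncated SVD of the stacked blocks. The sketch below shows one way to obtain $U_i$, $V_i$, and $W$ with Eigen; the struct, the truncation rule, and the placement of the singular values inside $W$ are our choices, not necessarily LoRaSp's (the graph rewiring described above is omitted):

```cpp
#include <Eigen/Dense>
#include <vector>
using Eigen::MatrixXd;

struct Compressed { std::vector<MatrixXd> U, V; MatrixXd W; };

// B[i] is n_i x m (outgoing), C[i] is m x n_i (incoming), eps is the
// relative singular value cutoff. Assumes B and C refer to the same
// well-separated neighbors, so the row counts match pairwise.
Compressed compress(const std::vector<MatrixXd>& B,
                    const std::vector<MatrixXd>& C, double eps) {
  int m = B[0].cols(), rows = 0;
  for (const auto& Bi : B) rows += Bi.rows();
  MatrixXd T(2 * rows, m);                  // T = [B_1;...;B_w;C_1^T;...;C_w^T]
  int r0 = 0;
  for (const auto& Bi : B) { T.middleRows(r0, Bi.rows()) = Bi; r0 += Bi.rows(); }
  for (const auto& Ci : C) { T.middleRows(r0, Ci.cols()) = Ci.transpose(); r0 += Ci.cols(); }

  Eigen::JacobiSVD<MatrixXd> svd(T, Eigen::ComputeThinU | Eigen::ComputeThinV);
  const auto& s = svd.singularValues();     // sorted in decreasing order
  int r = 0;
  while (r < s.size() && s(r) >= eps * s(0)) ++r;  // truncation, cf. eq. (23)

  // T ~ [U_1;...;U_w;V_1;...;V_w] * W^T, putting the singular values in W.
  MatrixXd left = svd.matrixU().leftCols(r);
  Compressed out;
  out.W = svd.matrixV().leftCols(r) * s.head(r).asDiagonal();
  r0 = 0;
  for (const auto& Bi : B) { out.U.push_back(left.middleRows(r0, Bi.rows())); r0 += Bi.rows(); }
  for (const auto& Ci : C) { out.V.push_back(left.middleRows(r0, Ci.cols())); r0 += Ci.cols(); }
  return out;
}
```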
4.4 Elimination
After compressing all well-separated edges, we apply the standard elimination. As a result of the compression process, the supernode $\bar{r}$ is only connected to its original neighbor nodes. This is a key property of the algorithm: it preserves the sparsity of the matrix at the cost of a slightly larger system of equations (a constant factor times the original size of the matrix). The elimination process for a node is explained in LABEL:sec:elimination. We first eliminate the supernode $\bar{r}$, and then eliminate its black parent $b$. In the example provided in LABEL:sec:appA, LABEL:fig:pictorial_alg7, LABEL:fig:pictorial_alg5, LABEL:fig:pictorial_alg4 and LABEL:fig:pictorial_alg2 illustrate the graph and matrix after the elimination process.
4.5 Solve
After the factorization part is completed, we can solve for multiple right hand sides. The solve process consists of two steps: a forward and a backward traversal of all nodes. This is identical to the standard forward and backward substitutions in the LU factorization. In the forward traversal we visit all nodes in the order they have been eliminated, and in the backward traversal we visit nodes in the exact reverse order. The solve process is introduced in LABEL:alg:solve.
Note that in the factorization part we introduced auxiliary variables and equations (i.e., all variables and equations associated to non-leaf nodes). We denote this extended system of equations by $\bar{A} \bar{x} = \bar{b}$, which is equivalent to the system LABEL:eqn:linsys up to the accuracy of LABEL:eqn:lra (i.e., eliminating the auxiliary variables from the extended system recovers the original system of equations). In the solve part, we have to solve for all variables (original and auxiliary, i.e., $\bar{x}$) even though we are just interested in the original variables, $x$. The number of extra variables is limited (see LABEL:lem:lincomp), and is of the same order as the number of original variables.
The solve algorithm begins with SetRHS(), which sets the right hand side of all nodes in the $\mathcal{H}$-tree. As explained in LABEL:def:Htree, the leaf red nodes of the $\mathcal{H}$-tree correspond to the original equation LABEL:eqn:linsys; therefore, the right hand side of each leaf red node is a subvector of $b$ determined by the leaf partitioning of the $\mathcal{H}$-tree, $\Pi_L$. Based on LABEL:eqn:rhszero, the right hand side of every non-leaf red node and black node is zero. The right hand sides of supernodes are computed by LABEL:eqn:concatRHS.
After setting the right hand side vectors, we apply the functions SolveL() and SolveU() to all supernodes and black nodes. The function SolveL() (see LABEL:alg:solveL) is applied through a forward traversal, and updates the right hand side vectors. The function SolveU() (see LABEL:alg:solveU) is applied through a backward traversal, and solves for the variables of each node. After solving for the variables of a supernode, we split the solution between its two constituting red nodes (denoted by the function SplitVar()) according to LABEL:eqn:concatX. When LABEL:alg:solve is completed, the solution vector is formed by concatenating the variable vectors of all leaf red nodes.
In LABEL:alg:solveU and LABEL:alg:solveL, OutGoingEdges($n$) and InComingEdges($n$), respectively, denote the set of all outgoing and incoming edges of a node $n$ in the tree. The function Order($n$) returns the order of elimination of a node $n$, i.e., if in LABEL:alg:fact a node $n_2$ is eliminated after a node $n_1$, then Order($n_2$) > Order($n_1$).
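The two traversals mirror standard block forward/backward substitution. The following sketch assumes nodes are stored in elimination order and that each node keeps its factored pivot block and its post-elimination edge blocks; all names are illustrative, not LoRaSp's:

```cpp
#include <Eigen/Dense>
#include <utility>
#include <vector>

struct SNode {
  Eigen::PartialPivLU<Eigen::MatrixXd> pivot;           // factored pivot block A_ss
  std::vector<std::pair<SNode*, Eigen::MatrixXd>> in;   // incoming edges, blocks A_sj
  std::vector<std::pair<SNode*, Eigen::MatrixXd>> out;  // outgoing edges, blocks A_is
  Eigen::VectorXd x, b;
  int order;                                            // elimination order
};

void solve(std::vector<SNode*>& nodes) {  // nodes sorted by elimination order
  // SolveL: forward traversal, update right hand sides of later nodes.
  for (SNode* n : nodes)
    for (auto& [m, Ams] : n->out)
      if (m->order > n->order) m->b -= Ams * n->pivot.solve(n->b);
  // SolveU: backward traversal, solve for the variables of each node.
  for (auto it = nodes.rbegin(); it != nodes.rend(); ++it) {
    SNode* n = *it;
    Eigen::VectorXd rhs = n->b;
    for (auto& [m, Asm] : n->in)          // subtract already-solved unknowns
      if (m->order > n->order) rhs -= Asm * m->x;
    n->x = n->pivot.solve(rhs);
  }
}
```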
5 Linear complexity
In this section we show that the block sparsity of the extended matrix is preserved through the elimination process. Therefore, the factorization algorithm has provable linear complexity, provided that the block sizes (and thus the ranks of the low-rank approximations) are bounded.
In LABEL:def:wellSep we defined well-separated nodes in an $\mathcal{H}$-tree. In this section, we generalize this concept, and define the distance between nodes in a given $\mathcal{H}$-tree.
Definition 7.
(distance of nodes in the $\mathcal{H}$-tree) Consider nodes $n_1$ and $n_2$ in the vertex set of an $\mathcal{H}$-tree of a sparse matrix $A$ with a sequence of nested partitionings $\Pi_0, \dots, \Pi_L$. Using LABEL:def:Htree, $n_1$ and $n_2$ correspond to clusters $I \in \Pi_{l_1}$ and $J \in \Pi_{l_2}$ for some levels $l_1$ and $l_2$. Assume $l_1 \le l_2$, and $J$ is a descendant of a cluster $J' \in \Pi_{l_1}$. The distance between nodes $n_1$ and $n_2$ is defined as the length (i.e., number of edges) of the minimum path between the nodes corresponding to $I$ and $J'$ in the adjacency graph of $A$ with partitioning $\Pi_{l_1}$.
Corollary 8.
Nodes $n_1$ and $n_2$ in the vertex set of an $\mathcal{H}$-tree are well-separated iff their distance is greater than 1.
Note that other criteria are possible to define well-separated nodes (see LABEL:sec:conclusion).
Theorem 9.
(preservation of sparsity) In LABEL:alg:fact, we never create an edge between two nodes with distance greater than 2.
Proof.
Elimination can result in connecting nodes at large distances. However, in LABEL:alg:fact, before applying the elimination we remove all edges to nodes with distance larger than 1 (i.e., the well-separated edges). Therefore, after eliminating a node we only create edges between nodes with distance at most 2.
LABEL:lem:sparse shows that for each node we need to process at most $d_1 + d_2$ edges, where $d_1$ and $d_2$ are, respectively, the maximum number of supernodes at distance 1 and 2 from a supernode. Note that $d_1$ and $d_2$ depend on the original matrix sparsity pattern, and are independent of the size of the matrix. To establish the linear complexity of the factorization, we need to bound the size of the nodes (i.e., the number of variables associated to them) in the $\mathcal{H}$-tree.
For matrices arising from the discretization of a PDE, well-separated edges correspond to the interaction of points that are physically far from each other. Therefore, if the Green's function of the associated PDE is smooth enough, one can expect a well-separated interaction to be numerically low-rank. We provide numerical evidence in LABEL:sec:numerical to support this argument. Note that the low-rank property of well-separated nodes depends on the quality of the partitioning and the definition of well-separation. These are topics for follow-up studies.
For general sparse matrices we can guarantee linear complexity by bounding the rank growth. This is explained in the next theorem.
Theorem 10.
(linear complexity condition) Let $m^{(l)}$ be the maximum size of supernodes at level $l$ of an $\mathcal{H}$-tree with depth $L$ resulting from LABEL:alg:fact, such that
(18) $m^{(l)} \le c\, m^{(l+1)}, \qquad 0 \le l < L,$
where $c < \sqrt[3]{2}$ is a constant number. Also assume $d_1$ and $d_2$, the maximum number of supernodes at distance 1 and 2 from a supernode, are $O(1)$ quantities. Under these conditions the cost of the algorithm is linear with respect to the problem size.
Proof.
For a given supernode at level $l$ the compression cost is $O\big((m^{(l)})^3\big)$, and the elimination cost is $O\big((m^{(l)})^3\big)$. Note that the required memory scales with $(m^{(l)})^2$ for each node. Ignoring the constant factors $d_1$ and $d_2$, and noting that there are $O(2^l)$ supernodes at level $l$, the order of the total cost of the factorization is as follows:
(19) $\mathcal{C} = \sum_{l=0}^{L} 2^l\, \big(m^{(l)}\big)^3.$
Plug LABEL:eqn:geometric in LABEL:eqn:cost:
(20) $\mathcal{C} \le \sum_{l=0}^{L} 2^l\, c^{3(L-l)} \big(m^{(L)}\big)^3 = 2^L \big(m^{(L)}\big)^3 \sum_{l=0}^{L} \Big(\frac{c^3}{2}\Big)^{L-l}.$
Note that for $c < \sqrt[3]{2}$ we have $\sum_{l=0}^{L} (c^3/2)^{L-l} = O(1)$. Furthermore, $2^L m^{(L)}$ is the number of variables in supernodes at the leaf level, which is $O(N)$. Therefore:
(21) $\mathcal{C} = O\big(2^L (m^{(L)})^3\big) = O\big(N (m^{(L)})^2\big) = O(N).$
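As a concrete illustration of the condition in LABEL:eqn:geometric, take $c = 1.2$: then $c^3/2 = 0.864$, and the geometric sum in LABEL:eqn:cost is bounded by $1/(1 - 0.864) \approx 7.4$ independently of the depth $L$. In other words, the maximum supernode size may grow by up to roughly 20% per level, from leaves to root, while the total cost remains $O\big(2^L (m^{(L)})^3\big) = O(N)$.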
6 Numerical results
We have implemented the algorithm described in LABEL:sec:alg in C++. The code (we call it LoRaSp, for Low-Rank Sparse solver) can be downloaded from bitbucket.org/hadip/lorasp. We use Eigen [29] as the backend for linear algebra calculations, and SCOTCH [48] for graph partitioning. We present results for various benchmarks, where LoRaSp is used as a direct solver, or as a preconditioner in conjunction with an iterative solver.
6.1 LoRaSp as a standalone solver
In this section we employ LoRaSp as a standalone solver. The accuracy of the solver depends on the accuracy of the low-rank approximations during the compression step, as explained in LABEL:sec:comp. Any low-rank approximation method can be used for the compression. Here, we use the SVD. For every well-separated interaction, we first compute the SVD, and then truncate the singular values at some point. There are many possible criteria to truncate the singular values; we discuss some of them below. LABEL:fig:decay shows the decay of singular values for blocks corresponding to the interaction between randomly chosen well-separated nodes at different levels of an $\mathcal{H}$-tree. The tree corresponds to a matrix obtained from the second-order uniform discretization of the Poisson equation:
(22) $-\nabla^2 u = f.$
The domain is a three-dimensional unit cube. The matrix size is 32,768, and the depth of the corresponding $\mathcal{H}$-tree is 11. Evidently, the singular values decay exponentially at all levels of the tree. The zero (up to machine precision) singular values are not shown in the plot.
To demonstrate the linear complexity of the method, we considered a sequence of problems with a growing number of variables, obtained from uniform discretizations of the domain in LABEL:eqn:Pois. The matrix size is increased by a factor of 2 between consecutive problems. Hence, to keep the size of the leaf supernodes constant among all problems, we increase the depth of the corresponding $\mathcal{H}$-trees by one from each problem to the next. In general, the depth of the $\mathcal{H}$-tree should scale with $\log_2 N$, where $N$ is the size of the matrix.
Well-separated edges corresponding to a block $T$, as shown in LABEL:eqn:lra, with singular values $\sigma_1 \ge \sigma_2 \ge \cdots$, are compressed by keeping only the singular values that satisfy:
(23) $\sigma_i \ge \epsilon\, \sigma_1.$
Smaller values of $\epsilon$ lead to a more accurate approximation of each block, and consequently a more accurate approximation of the final solution. For a given linear system $A x = b$, the precision of any approximate solution $\hat{x}$ is quantified by the relative error and relative residual defined as follows
(24) $e = \frac{\|\hat{x} - x\|_2}{\|x\|_2}, \qquad \rho = \frac{\|A \hat{x} - b\|_2}{\|b\|_2}.$
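A small helper in the spirit of eq. (24) is sketched below (illustrative; in manufactured tests one picks a reference $x$, sets $b = A x$, and compares against the computed $\hat{x}$):

```cpp
#include <Eigen/Dense>
#include <Eigen/SparseCore>

// Relative error and relative residual of eq. (24) for an approximate
// solution xh of A x = b, given a reference solution x.
struct Accuracy { double relError, relResidual; };

Accuracy measure(const Eigen::SparseMatrix<double>& A,
                 const Eigen::VectorXd& b,
                 const Eigen::VectorXd& x,     // reference solution
                 const Eigen::VectorXd& xh) {  // approximate solution
  return { (xh - x).norm() / x.norm(),
           (A * xh - b).norm() / b.norm() };
}
```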
LABEL:fig:directSolve_varyN:sub2 shows that smaller values of $\epsilon$ (i.e., more accurate low-rank approximations) result in a more accurate estimate of the solution to the linear system, at the cost of larger factorization and solve times, as shown in LABEL:fig:directSolve_varyN:sub1. For a constant $\epsilon$, the times spent in the factorization and solve parts are asymptotically linear in the problem size. Note that for smaller values of $\epsilon$ the linear scaling is achieved at larger problem sizes. In addition, the error and residual of the estimated solution for a fixed $\epsilon$ barely change with the problem size (see LABEL:fig:directSolve_varyN:sub2).
As is clear from LABEL:fig:directSolve_varyN, we can obtain more accurate solutions by decreasing the parameter $\epsilon$ in LABEL:eqn:cutOff1. To show the convergence of the solver, we picked a fixed problem size, and measured the accuracy of the estimated solution as $\epsilon$ decreases. In addition, for comparison purposes, we consider a 2D variation of LABEL:eqn:Pois which is discretized using a finite volume approach with a Voronoi tessellation. The points are drawn from a uniform random distribution in $[0, 1]^2$. The discretization results in a matrix $A = D S$, where $D$ is a diagonal matrix with the inverses of the Voronoi cell areas on the diagonal, and $S$ is a symmetric matrix. We apply the factorization directly to $A$. Note that the average number of nonzeros per row for a matrix corresponding to a 2D Voronoi discretization is 7, which is the same as for a uniform second-order 3D discretization. (This can be shown by double counting the angles in a 2D Voronoi tessellation, once through points, and once through the triangles of the corresponding Delaunay triangulation.)
In LABEL:fig:convergence the convergence of the solution for a 3D Poisson problem with (corresponding to a