FAST HIERARCHICAL SOLVERS FOR SPARSE MATRICES USING EXTENDED SPARSIFICATION AND LOW-RANK APPROXIMATIONFunding from the “Army High Performance Computing Research Center” (AHPCRC), sponsored by the U.S. Army Research Laboratory under contract No. W911NF-07-2-0027, at Stanford, supported in part this research. The second author is a post-doctoral fellow of the Research Foundation Flanders (FWO) and a Francqui Foundation fellow of the Belgian American Educational Foundation (BAEF). The financial support is gratefully acknowledged.

# Fast Hierarchical Solvers for Sparse Matrices Using Extended Sparsification and Low-Rank Approximation††thanks: Funding from the “Army High Performance Computing Research Center” (AHPCRC), sponsored by the U.S. Army Research Laboratory under contract No. W911NF-07-2-0027, at Stanford, supported in part this research. The second author is a post-doctoral fellow of the Research Foundation Flanders (FWO) and a Francqui Foundation fellow of the Belgian American Educational Foundation (BAEF). The financial support is gratefully acknowledged.

HADI POURANSARI222Stanford University, Department of Mechanical Engineering, Stanford, CA 94305, USA ([hadip,pcoulier,darve]@stanford.edu).    PIETER COULIER222Stanford University, Department of Mechanical Engineering, Stanford, CA 94305, USA ([hadip,pcoulier,darve]@stanford.edu). 333KU Leuven, Department of Civil Engineering, Kasteelpark Arenberg 40, 3001 Leuven, Belgium.    ERIC DARVE222Stanford University, Department of Mechanical Engineering, Stanford, CA 94305, USA ([hadip,pcoulier,darve]@stanford.edu). 444Stanford University, Institute for Computational and Mathematical Engineering, Stanford, CA 94305, USA.
###### Abstract

Inversion of sparse matrices with standard direct solve schemes is robust, but computationally expensive. Iterative solvers, on the other hand, demonstrate better scalability, but need to be used with an appropriate preconditioner (e.g., ILU, AMG, Gauss-Seidel, etc.) for proper convergence. The choice of an effective preconditioner is highly problem dependent. We propose a novel fully algebraic sparse matrix solve algorithm, which has linear complexity with the problem size. Our scheme is based on the Gauss elimination. For a given matrix, we approximate the LU factorization with a tunable accuracy determined a priori. This method can be used as a stand-alone direct solver with linear complexity and tunable accuracy, or it can be used as a black-box preconditioner in conjunction with iterative methods such as GMRES. The proposed solver is based on the low-rank approximation of fill-ins generated during the elimination. Similar to -matrices, fill-ins corresponding to blocks that are well-separated in the adjacency graph are represented via a hierarchical structure. The linear complexity of the algorithm is guaranteed if the blocks corresponding to well-separated clusters of variables are numerically low-rank.

Key words. sparse, hierarchical, low-rank, elimination, compression

AMS subject classifications. 65F05, 65F08, 65F50, 65N55, 68Q25

## 1 Introduction

In the realm of scientific computing, solving a sparse linear system,

 Ax=b, (1)

is known to be one of the challenging parts of many calculations, and is often the main bottleneck. Such a system of equations may be the result of the discretization of some partial differential equation (PDE), or more generally, can represent the local interactions of units in a network.

Solving a system of equations of size using a naive implementation of Gauss elimination has time complexity. The best proved time complexity to solve a general linear system is , where [12, 17, 40]. In the case of sparse matrices, the time and memory complexity can be reduced when a proper elimination order is employed. Finding the optimal ordering (that results in the minimum number of new non-zeros in the LU factorization) is known to be an NP-complete problem [57]. For matrices resulting from the discretization of some PDE in physical space, nested dissection [23, 43] is known as an efficient elimination strategy. [1] discusses the complexity of nested dissection based on the sparsity pattern of the matrix. For a three-dimensional problem, the time and memory complexities are expected to be and , respectively, when nested dissection is employed. As the size of the problem grows, such complexities make direct solvers prohibitive.

Iterative methods, such as conjugate gradient [36], minimum residual [47], and general minimum residual [53], are generally more time and memory efficient. In addition, iterative solvers such as those based on Krylov subspace can be accelerated using fast linear algebra techniques. The fast multipole method (FMM) [20, 22, 28, 45, 49, 58], for example, can accelerate matrix-vector multiplication —from quadratic complexity to linear—, which is the bulk calculation in iterative solvers based on Krylov subspace. However, in practice, iterative methods need to be used in conjunction with preconditioners to limit the number of iterations. The choice of an efficient preconditioner is highly problem dependent. There are many ongoing efforts to develop preconditioners that are optimized for particular applications. Hence, there is a need for general purpose preconditioners. Hierarchical matrices enable us to develop such preconditioners.

FMM matrices are a subclass of a larger category of matrices called hierarchical matrices (-matrices) [7, 9, 32]. -matrices have a hierarchical low-rank structure. For instance, in a hierarchically off-diagonal low-rank (HODLR) matrix [5], off-diagonal blocks can be represented through a hierarchy of low-rank interactions. If the bases used in the hierarchy are nested (i.e., the low-rank basis at each level is constructed using the low-rank basis of the child level) the method is called hierarchically semi-separable (HSS) [3, 15, 56]. In a more general case of hierarchical matrices, more complex low-rank structures can be considered. A full-rank dense matrix with many low-rank structures is in fact data-sparse [31, 33]. A data-sparse matrix can be represented via an extended sparse matrix, which has extra rows/columns, but with only few non-zero entries [2]. The hierarchical structure of such matrices can be used for efficient calculation and storage.

Recently, hierarchical interpolative factorization [38, 39] was proposed, which can be used to directly solve systems obtained from differential and integral equations based on elliptic operators. The fast factorization is obtained by skeletonization of fronts in the multifrontal scheme. Using low-rank structure of the off-diagonal blocks to develop fast direct solvers for linear systems arising from integral equations has been widely studied [18, 26, 27, 41]. [25] proposed a direct solver for elliptic PDEs with variable coefficients on two-dimensional domains by exploiting internal low-rank structures in the matrices. [44] used hierarchical low-rank structures of the off-diagonal blocks to introduce a preconditioner for sparse matrices based on a multifrontal variant of sparse LU factorization. [46] introduced a black-box linear solver using tensor-train format. [42] used a recursive low-rank approximation algorithm based on the Sherman-Morrison formula to obtain a preconditioner for symmetric sparse matrices.

Sparse matrices can be considered as a very special case of hierarchical matrices, where instead of low-rank blocks they initially have zero blocks. However, during the elimination process in a direct solve scheme, many of the zero blocks get filled. For a large category of matrices, including those obtained from the discretization of PDEs, most of the new fill-ins are numerically low-rank. This is justified when the Green’s function associated to the PDE is smooth (non-oscillatory). In this paper, we will use the -matrix structure to compress the fill-ins. A similar process can be applied in the elimination of an extended sparse matrix resulting from an originally dense matrix [4, 19]. This reduces the complexity of the direct solver to linear. The linear complexity of the method is guaranteed if the blocks corresponding to the interaction of well-separated nodes are numerically low-rank. We define the well-separated condition in LABEL:sec:HTree.

The proposed algorithm can be considered as an extension to the block incomplete LU (ILU) [52] preconditioners. In a block ILU factorization, most of the new fill-ins (i.e., blocks that are created during the elimination process which are originally zero) are ignored, and therefore, the block sparsity of the matrix is preserved, while the accuracy is not. In the proposed algorithm, instead, we use low-rank approximations to compress new fill-ins. Using a tree structure, new fill-ins at the fine level are compressed and pushed to the parent (coarse) level. The elimination and compression processes are done in a bottom-to-top traversal.

In addition, the proposed algorithm has formal similarities with algebraic multi-grid (AMG) methods [10, 11, 51, 54]. However, the two methods differ in the way they build the coarse system, and use restriction and prolongation operators. In AMG, the original system is solved at different levels (from fine to coarse). Here, the compressed fill-ins —corresponding to the Schur complements— of each level are solved at the coarser level above. Note that the proposed algorithm is purely algebraic, similar to AMG. If the matrix comes from discretization of a PDE on a physical grid, the grid information can be exploited to improve the performance of the solver, similar to geometric multi-grid.

The algorithm presented in this paper computes a hierarchical representation of the LU factorization of a sparse matrix using low-rank approximations. We introduce intermediate operations to compress new fill-ins. The compressed fill-ins are represented using a set of extra variables. This technique is known as extended sparsification [14]. The accuracy of the factorization phase (i.e., Gauss elimination and compression), , can be determined a priori. The time and memory complexity of the factorization are and , respectively, as will be clarified in LABEL:sec:standalone.

The method presented in this paper is similar to the fast hierarchical methods developed by Hackbusch et al. [9, 31, 32, 33] in a sense that both methods use a tree decomposition to identify and represent low-rank blocks. The key difference, however, is that in the Hackbusch’s algorithm the LU factorization is computed using a depth-first tree traversal order, whereas here we use a breadth-first (level by level from leaf to root) traversal. The connection and differences of the proposed algorithm and Hackbusch’s fast -algebra is discussed in our companion paper [19], in which a similar method is used for dense matrix factorization.

Our solver can be used as a stand-alone direct solver with tunable accuracy. The factorization part is completely separate from the solve part and is generally more expensive. This makes the algorithm appealing when multiple right hand sides are available (e.g., using the proposed solver as a black-box preconditioner in an iterative method). We have implemented the algorithm in C++ (the code can be downloaded from bitbucket.org/hadip/lorasp), and benchmarked it as both a stand-alone solver (see LABEL:sec:standalone), and a preconditioner in conjunction with the generalized minimum residual (GMRES) iterative solver [53] (see LABEL:sec:precond).

Furthermore, the proposed algorithm has interesting parallelization properties. On one hand, all calculations are block matrix computations which can be highly accelerated using BLAS3 operations [6]. On the other hand, since the sparsity pattern at every level is preserved, the data dependency is very local, which is an interesting property to reduce the amount of communications. In addition, the amount of calculation scales with the third power of the size of blocks, while the communications scales with the second power of block sizes. This helps with the concurrency of the parallel implementation. Moreover, the order of elimination does not change the complexity of the presented algorithm. This is in particular an appealing property for parallel implementation. The parallel implementation of the proposed method is not further discussed in this paper.

The remainder of this paper is organized as follows. In LABEL:sec:SLS we briefly introduce a graph representation of sparse matrices, and an interpretation of the Gauss elimination using the adjacency graph. In LABEL:sec:HTree some concepts related to the hierarchical representation of matrices are defined. The algorithm is explained in LABEL:sec:alg in detail, and the linear complexity analysis is provided in LABEL:sec:lin. We present numerical results obtained from various benchmarks in LABEL:sec:numerical. There are many avenues for optimization and extension of the algorithm. We discuss some of these opportunities in LABEL:sec:conclusion.

## 2 Sparse linear systems

In this section we briefly introduce the graphical framework that is required in the rest of the paper. We assume a sparse linear system of size as in LABEL:eqn:linsys is given.

In many algorithms, including the method proposed in this paper, it is necessary (or more efficient) to operate on sub-blocks of the matrix rather than single elements. The blocks of the matrix can be identified using a partitioning as defined below.

###### Definition 1.

(partitioning) A partitioning is defined as a surjective map . groups rows/columns of into clusters, for .

We denote an entry of matrix located in row and column by . For , we use to represent a sub-matrix formed by concatenating all entries such that and . Additionally, for a vector of size , we use to represent a sub-vector formed by concatenating all entries such that .

It is often fruitful to represent sparse matrices using graphs. An adjacency graph, as defined below, represents a sparse matrix with partitioning.

###### Definition 2.

(adjacency graph) A sparse matrix with a partitioning can be represented by its adjacency graph , where . Each for represents a cluster of rows and columns of . A vertex is connected to a vertex by a directed edge if and only if the block in is non-zero.111The adjacency graph of a matrix with a partitioning is essentially the quotient graph of the adjacency graph of matrix with identity partitioning, where the equivalence relation is induced by partitioning .

In LABEL:fig:domain, an example of the adjacency graph is illustrated. In the rest of the paper we use vertex and node interchangeably for the elements of in the adjacency graph.

The linear system in LABEL:eqn:linsys can also be represented using the adjacency graph of , . For a node , denotes the vector of variables corresponding to cluster . Similarly, denotes the vector of right hand sides corresponding to cluster . Also for an edge , denotes the sub-matrix corresponding to cluster of columns and cluster of rows. For the example shown in LABEL:fig:domain, the following two notations represent the same set of equations corresponding to the node and its incoming edges. \cref@addtoresetequationparentequation

 A2,1x1+A2,2x2+A2,3x3+A2,4x4=b2 (2a) Mat(ev1→v2)⋅Var(v1)+Mat(ev2→v2)⋅Var(v2)+Mat(ev3→v2)⋅Var(v3)+% Mat(ev4→v2)⋅Var(v4)=RHS(v2) (2b)

### 2.2 Elimination

The block Gauss elimination process, or block LU factorization, can also be explained using the graph representation of the matrix. At step of the elimination process, a set of unknowns, , is eliminated from the system of equations. This corresponds to eliminating vertex from the adjacency graph . The self-edge from to itself corresponds to the pivot diagonal sub-block in the matrix. After eliminating , for every pair of outgoing edge to a vertex and incoming edge from a vertex , a new edge from to is created, corresponding to the Schur complement of the eliminated edges, that is

 −Mat(evi→vj)⋅Mat(evi→vi)−1⋅Mat(evk→vi)=−Aj,iA−1i,iAi,k (3)

Note that if the edge between and exists before elimination, the Schur complement adds to the existing sub-block.

The process described above reveals the fact that during the elimination process many new edges are introduced in the graph. This corresponds to generating new non-zero blocks in the matrix during the LU factorization. The generation of many dense blocks is what makes the direct factorization of sparse matrices a prohibitive process. Essentially, a matrix can be sparse, while and in the LU factorization of are dense. In the next section, we explain how we can preserve the sparsity of the matrix during the elimination process by compressing the well-separated interactions. This process is known as extended sparsification.

### 2.3 Key idea

An important observation in the elimination process is the fact that fill-ins (i.e., new edges created during the elimination process) that correspond to well-separated vertices are often numerically low-rank. For a linear system obtained from a discretized PDE, well-separated vertices refers to points that are physically far enough from each other. For a general sparse matrix two vertices are well-separated if their distance in the adjacency graph is large enough. It is formally defined in LABEL:def:wellSep. We replace such fill-ins with a sequence of low-rank matrices.

For example, consider the following symmetric linear system that is partitioned into 3 blocks

 ⎛⎜⎝SBCB⊺PC⊺Q⎞⎟⎠⎛⎜⎝x1x2x3⎞⎟⎠=⎛⎜⎝b1b2b3⎞⎟⎠ (4)

In LABEL:fig:keyIdea:sub1 the adjacency graph of the system of equations in LABEL:eqn:example is shown. Now, consider eliminating . The resulting system is as follows

 (P′DD⊺Q′)(x2x3)=(b′2b′3), (5)

where , , , , and . The adjacency graph of the system of equations in LABEL:eqn:example1 is depicted in LABEL:fig:keyIdea:sub2. Nodes 2 and 3 can be considered well-separated (see LABEL:def:wellSep). They get connected due to the elimination of node 1. We assume their interaction is low-rank, and can be written as

 D≃UKV⊺, (6)

where and are tall matrices. We can use any low-rank approximation in LABEL:eqn:lowrank, for example the singular value decomposition (SVD). We combine LABEL:eqn:lowrank and LABEL:eqn:example1 to define a new set of equations, in which direct interaction of the nodes 2 and 3 is replaced by a sequence of low-rank interactions

 ⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝P′UQ′VU⊺−IV⊺−I−IK−IK⊺⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝x2x3z2z3y2y3⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝b′2b′30000⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠ (7)

In the above equation, we introduced extra variables (, and ) to represent the far-field interactions. This technique is known as extended sparsification [14]. Note that the system of equations LABEL:eqn:extend is equivalent to the system of equations LABEL:eqn:example1 up to the accuracy of LABEL:eqn:lowrank. In fact, if we do a Gaussian elimination on the matrix in LABEL:eqn:extend using blocks starting from the right, we get back to the original matrix in LABEL:eqn:example1.

In an analogy with the fast multipole method, and can be considered as the multipole coefficients, and and as the local coefficients. In LABEL:fig:keyIdea:sub3 the adjacency graph of the extended linear system is depicted. The red and black nodes correspond to the extended part of the matrix in LABEL:eqn:extend (extra variables introduced above). As a result of extended sparsification, nodes 2 and 3 are disconnected in LABEL:fig:keyIdea:sub3. Therefore, in the general case could be eliminated without introducing new fill-ins between and other neighbors of .222Note that in this example, we intentionally chose order of elimination , to introduce a new fill-in, and demonstrate the low-rank compression process. Using , as the order of elimination would result in no fill-ins for this example. Finding such an order of elimination, however, is a hard problem and may not be possible for practical matrices. As shown In LABEL:lem:sparse, removing well-separated interactions of nodes before elimination preserves the sparsity.

In the general case, we start by a standard block Gauss elimination. When we create fill-ins, similar to the above example, we apply the extended sparsification (compression). This results in the creation of new nodes in the graph belonging to a coarser level.

As we proceed with the elimination process at level , edges between the auxiliary nodes at level are created. After the elimination process at level is completed, we proceed with the elimination at level , and continue up to the root (level 0). Essentially, in this process we form a hierarchical approximation of and matrices (which are generally dense) in the LU factorization of , without directly computing them.

In addition, similar to the agglomeration process in multi-grid methods, we consider one red-node and one black-node for every pair of clusters (i.e., the well-separated interactions of each pair of clusters are compressed together). Therefore, the number of red-nodes at the parent level is half of the number of red-nodes at the current level. We formally define the hierarchy of nodes in LABEL:sec:HTree, and the details of the algorithm in LABEL:sec:alg.

## 3 Hierarchical representation

To form a hierarchical tree, we recursively partition the rows/columns indices of the matrix. This is formally defined as a sequence of nested partitionings.

###### Definition 3.

(nested partitionings) A sequence of partitionings with is called nested if every cluster is the union of two clusters and for . Cluster is then called the parent of two child clusters and . We call and sibling clusters. The level of a cluster is defined as the index of its defining partitioning . The clusters associated to the finest partitioning, , have no children and are called leaf clusters, while the cluster associated to has no parent and is called the root cluster. All other clusters have exactly two children and one parent.333The binary subdivision is not necessary. We can generalize this to quad-tree, octree, etc.

A visual example of nested partitionings with six levels (i.e., ) is illustrated in LABEL:sec:appB. Now, we define the hierarchical tree444In fact, it is not a tree. As we see later, there are edges between nodes at each level. (denoted by -tree) of a sparse matrix given a sequence of nested partitionings.

###### Definition 4.

(hierarchical tree) Given a sparse matrix , and a nested sequence of partitionings , the hierarchical tree (-tree) is defined as a directed graph with red and black vertices and two types of edges (parent-child and interaction edges).

The vertices of -tree are corresponding to the clusters associated with the nested partitionings. For every cluster with and there is a corresponding black-node in the vertex set of -tree. Also, there is a pair of red-nodes and corresponding to the sibling clusters and (children of the cluster ) in the vertex set of -tree. We call such a pair of red-nodes a super-node, and denote it by , that is corresponding to the cluster . is connected to and by parent-child edges. Hence, is also the parent of super-node . Additionally, the red-node corresponding to is connected to by a parent-child edge. We denote the parent of a node (which can be a red, black, or super node) by .

We also consider one special red-node associated to the root cluster. The level of each vertex of -tree is denoted by a superscript index. Additionally, the depth of -tree is defined as .

There is an interaction edge between two red-nodes with level (leaf red-nodes), if the corresponding vertices in the adjacency graph of with partitioning are connected. Therefore, the subgraph of -tree induced by and for is the adjacency graph of with partitioning .

There is no interaction edge between non-leaf vertices (shown transparent in LABEL:fig:HTree) of an -tree before applying the elimination. Non-leaf nodes are reserved to be used for the extended sparsification similar to the red and black nodes in LABEL:fig:keyIdea:sub3.

Similar to the vertices of an adjacency graph, each node of an -tree also corresponds to a set of variables and equations. Leaf-nodes of the -tree correspond to the variables and equations in LABEL:eqn:linsys partitioned using . Non-leaf nodes of the -tree, however, correspond to the auxiliary variables and equations. In LABEL:sec:comp, when we explain the extended sparsification in the general case, we will introduce the variables and equations corresponding to the non-leaf nodes of the -tree.

An example of an -tree is depicted in LABEL:fig:HTree. A parent-child edge between two nodes is shown by a dashed line, whereas interaction edges are shown by solid lines. In the rest of the paper, we use edge and interaction edge interchangeably, while parent-child edges are explicitly mentioned.

###### Definition 5.

(adjacent clusters) Consider a sparse matrix and a nested sequence of partitionings . Two leaf clusters and are called adjacent (or neighbors) iff or are non-zero blocks (i.e., nodes and in the adjacency graph of with partitioning are connected). Two clusters and (not necessarily with the same level) are adjacent iff a leaf descendant555Cluster is a descendant of cluster iff . cluster of is adjacent to a leaf descendant cluster of .

###### Definition 6.

(well-separated nodes in the -tree) Nodes and in the vertex set of an -tree (which can be red, black, or super-nodes at different levels) are well-separated if their corresponding clusters are not adjacent. An interaction edge that connects two well-separated nodes is called a well-separated edge. Well-separated edges do not exist initially in the -tree, and are created as a result of elimination.

Low-rank interaction is typically observed in physical systems when two clusters are sufficiently far apart from each other (e.g., well-separated clusters in fast multipole method). We can replace this condition by an equivalent distance requirement in the graph. However, for simplicity we weakened the requirement and declare that two clusters are well-separated only if they are not adjacent (LABEL:def:wellSep). If the partitioning of the domain is adequate, this simple definition is sufficient to approximate the more accurate separation requirement based on distance. In many cases, well-separated interactions are low-rank, but there is no guarantee. It depends on various details not studied in this work, such as the shape of the clusters.

## 4 Algorithm

In the previous section we defined the -tree, which is used to represent graphically the extended system, similar to the example provided in LABEL:fig:keyIdea:sub3. In this section, we explain the details of the algorithm to compute a hierarchical representation of an approximate LU factorization of matrix . Note that while is sparse, the and matrices are typically dense. However, we assume that they can be represented using a hierarchical tree (through the extended sparsification). We could find the matrices and through an elimination process, and compute their hierarchical representation using an extended sparsification. However, in order to achieve linear complexity, we perform elimination and the extended sparsification (compression) together.

The algorithm presented in this paper takes advantage of similar technique as in the inverse fast multipole method (IFMM) [19]. The IFMM can be used to compute the hierarchical representation of LU factorization of a dense matrix which is given in hierarchical form (i.e., FMM matrix). Essentially, in the IFMM, we are given an -tree that has interaction edges at all levels. This -tree represents a dense matrix.

In the sparse case, however, the -tree initially has interaction edges only at the leaf level. For both of the sparse and dense cases, we start with an -tree that represents matrix , and end up with an -tree representing an approximate LU factorization of . In LABEL:alg:fact the overall factorization scheme is introduced. Various sub-algorithms are explained afterwards. In addition, a step by step example of the elimination process on the -tree and the corresponding extended matrix is presented in LABEL:sec:appA. Similar to the standard LU factorization, after the elimination process we are able to efficiently compute through forward and backward substitutions.

### 4.1 Initializing the H-tree

The Initialize() function in LABEL:alg:fact consists of computing nested partitionings and form the -tree with depth , as defined in LABEL:def:Htree. An example of an -tree and the corresponding matrix is depicted in LABEL:fig:pictorial_alg0. Leaf nodes and interaction edges of the -tree initially represent the given linear system of equations LABEL:eqn:linsys with partitioning . Through the elimination process, we extend the system of equations, and use non-leaf nodes of the -tree to represent the new variables and equations.

### 4.2 Forming super-nodes

The outer-loop in LABEL:alg:fact is over different levels from the bottom to the top of the tree. At each level, we start by merging red-siblings into super-nodes. In LABEL:alg:fact this process is denoted by the function MergeRedNodes(). This process results in a coarser representation of the linear system. We substitute interactions (i.e., edges) between any two pairs of red-nodes ( and ) with an interaction between super-nodes ( and ) as follows.

 Mat(es[i]j→s[i]j′)=⎛⎜⎝% Mat(er0[i]j→r0[i]j′)Mat(er1[i]j→r0[i]j′)Mat(er0[i]j→r1[i]j′)Mat(er1[i]j→r1[i]j′)⎞⎟⎠ (8)

Similarly, the variable and right hand side vectors corresponding to a super-node is formed by concatenating the variables and right hand sides of its constituting red-nodes.

 Var(s[i]j)=concatenate(Var(r0[i]j), Var(r1[i]j)) (9)
 RHS(s[i]j)=concatenate(RHS(r0[i]j), RHS(r1[i]j)) (10)

The process of merging red-nodes to form the super-nodes is illustrated in LABEL:fig:merge. The merging process is also depicted in LABEL:fig:pictorial_alg1 in the example provided in LABEL:sec:appA.

### 4.3 Compressing well-separated edges

The next sub-algorithm to consider is the compression. In LABEL:alg:fact this process is denoted by the function Compress(). During the compression, well-separated interactions of a super-node are pushed to the parent level nodes which represent a set of auxiliary variables. This is essentially the extended sparsification method which we discussed in the example in LABEL:sec:keyIdea (i.e., replacing the well-separated edges in LABEL:fig:keyIdea:sub2 by the sequence of edges between red and black nodes at the parent level as shown in LABEL:fig:keyIdea:sub3).

Assume we are at level , and about to apply Compress(). Also, assume is of size (i.e., corresponds to variables and equations), and interacts with (i.e., has an edge to) well-separated nodes with sizes , respectively. As it will be clear later, the nodes for can either be a red-node (at the parent level) or a super-node (at the same level). Assume blocks are associated to the outgoing well-separated edges from to , respectively, i.e., , where is an by matrix. Similarly, are associated to the incoming well-separated edges to , i.e., , where is an by matrix. This is depicted schematically in LABEL:fig:comp (left).

Similar to the example in LABEL:sec:keyIdea, we assume well-separated edges can be approximated using a low-rank factorization. We compress all well-separated interactions of a super-node together. We then introduce auxiliary variables, and replace the well-separated edges by new edges between the auxiliary variables (i.e., going from the left configuration to the right configuration in LABEL:fig:comp).

In order to compress all well-separated edges together, we form a temporary matrix by vertically concatenating as well as . Use a low-rank approximation method (e.g., SVD) and write:

 ⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣A1⋮AtB1⋮Bt⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦≃⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣R1⋮RtQ1⋮Qt⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦V⊺, (11)

where and are by matrices for , and is an by matrix. is the rank in the above low-rank approximation. From LABEL:eqn:lra we can write

 Ak≃RkV⊺ and B⊺k=VQ⊺kfork=1,2,…,t (12)

contributes to the equation corresponding to a node with a term like for . We can rewrite this term as

 Ak⋅Var(s[i]j)≃(RkV⊺)⋅Var(s[i]j)=Rk(V⊺⋅Var(s[i]j)) (13)

Similarly, the contribution of in the equation corresponding to the node can be written as

 t∑k=1B⊺k⋅Var(pk)≃t∑k=1(VQ⊺k)⋅Var(pk)=Vt∑k=1Q⊺k⋅Var(pk) (14)

Now, we apply the extended sparsification, and introduce new variables and equations as follows

 \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0Var(P(b[i]j)):=V⊺⋅Var(s[i]j)⇒V⊺⋅Var(s[i]j)−Var(P(b[i]j))=0 (15)
 \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0Var(b[i]j):=t∑k=1Q⊺k⋅Var(pk)⇒t∑k=1Q⊺k⋅Var(pk)−Var(b[i]j)=0 (16)

In the above, we defined two new vector of variables, and , each of size (similar to the vectors and in the example of LABEL:sec:keyIdea). We assign equations LABEL:eqn:interpol and LABEL:eqn:anterpol to the black-node and the red-node , respectively. Therefore,

 RHS(b[i]j)=RHS(P(b[i]j))=0 (17)

Now, we apply the following edge updates:

• Remove edges from to and vice versa for .

• Add edges from to with blocks for .

• Add edges from to with blocks for .

• Add an edge from to and vice versa with blocks and , respectively.

• Add an edge from to and vice versa with blocks (minus identity matrix of size ).

Therefore, in the compression process all well-separated edges connected to the super-node are substituted with edges to/from , as shown in LABEL:fig:comp. Note that each step of the compression process described here is in some sense “half” of the extended sparsification step as described in LABEL:sec:keyIdea. For example, in LABEL:fig:pictorial_alg2 there is a well-separated interaction between and . After the compression step, this interaction is substituted with a well-separated edge between and (LABEL:fig:pictorial_alg3). Ultimately, when we are about to eliminate , we further compress this edge and make connection between and (LABEL:fig:pictorial_alg6), which is now the same as the example in LABEL:fig:keyIdea.

As a result of the compression process on a super-node , we defined new auxiliary variables, corresponding to two nodes and , each of size . Note that if the matrix is symmetric, for . Therefore, we would not need to concatenate ’s in LABEL:eqn:lra, and only half of the above calculations are required.

### 4.4 Elimination

After compressing all well-separated edges, we apply the standard elimination. As a result of the compression process, the super-node is only connected to its original neighbor nodes. This is a key property of the algorithm that preserves the sparsity of the matrix, which results in a slightly larger system of equations (a constant number times the original size of the matrix). The elimination process for a node is explained in LABEL:sec:elimination. We first eliminate the super-node , and then eliminate its black-parent . In the example provided in LABEL:sec:appA, LABEL:fig:pictorial_alg7, LABEL:fig:pictorial_alg5, LABEL:fig:pictorial_alg4 and LABEL:fig:pictorial_alg2 illustrate the graph and matrix after the elimination process.

### 4.5 Solve

After the factorization part is completed, we can solve for multiple right hand sides. The solve process consists of two steps: a forward and a backward traversal of all nodes. This is identical to the standard forward and backward substitutions in the LU factorization. In the forward traversal we visit all nodes in the order they have been eliminated, and in the backward traversal we visit nodes in the exact reverse order. The solve process is introduced in LABEL:alg:solve.

Note that in the factorization part we introduced auxiliary variables and equations (i.e., all variables and equations associated to non-leaf nodes). We denote this extended system of equations by , which is equivalent to the system LABEL:eqn:linsys up to the accuracy of LABEL:eqn:lra (i.e., eliminating the auxiliary variables from the extended system recovers the original system of equations). In the solve part, we have to solve for all variables (original and auxiliary variables, i.e., ) even though we are just interested in the original variables, . The number of extra variables is limited (see LABEL:lem:lincomp), and is of the same order as the number of original variables.

The solve algorithm begins with SetRHS(), that is to set the right hand side of all nodes in the -tree. As explained in LABEL:def:Htree, leaf red-nodes of the -tree correspond to the original equation LABEL:eqn:linsys; therefore, the right hand side of each leaf red-node is a sub-vector of determined by the leaf-partitioning of the -tree, . Based on LABEL:eqn:rhszero, the right hand side of every non-leaf red-node and black-node is . The right hand sides of super-nodes are computed by LABEL:eqn:concatRHS.

After setting the right hand side vectors, we apply functions SolveL() and SolveU() to all super-nodes and black-nodes. The function SolveL() (see LABEL:alg:solveL) is applied through a forward traversal, and updates the right hand side vectors. The function SolveU() (see LABEL:alg:solveU) is applied through a backward traversal, and solve for the variables of each node. After solving for variables of a super-node, we split the solution between its two constituting red-nodes (denoted by function SplitVar()) according to LABEL:eqn:concatX. When LABEL:alg:solve is completed, the solution vector is formed by concatenating variable vectors of all leaf red-nodes.

In LABEL:alg:solveU and LABEL:alg:solveL OutGoingEdges() and InComingEdges(), respectively, denote the set of all outgoing and incoming edges of a node in the -tree. The function Order() returns the order of elimination of a node , i.e., if in LABEL:alg:fact a node is eliminated after a node , then Order()Order().

## 5 Linear complexity

In this section we show that the block sparsity of the extended matrix is preserved through the elimination process. Therefore, the factorization algorithm has provable linear complexity provided that the block sizes (and thus the rank of the low-rank approximations) are bounded.

In LABEL:def:wellSep we defined well-separated nodes in an -tree. In this section, we generalize this concept, and define distance of nodes in a given -tree.

###### Definition 7.

(distance of nodes in -tree) Consider nodes and in the vertex set of an -tree of a sparse matrix with sequence of nested partitionings . Using LABEL:def:Htree, and correspond to clusters and for some , , and . Assume , and is a descendant of cluster . The distance between nodes and is defined as the length of (i.e., number of edges) the minimum path between nodes and in the adjacency graph of with partitioning .

###### Corollary 8.

Nodes and in the vertex set of an -tree are well-separated iff their distance is greater than 1.

Note that other criteria are possible to define well-separated nodes (see LABEL:sec:conclusion).

###### Theorem 9.

(preservation of sparsity) In LABEL:alg:fact, we never create an edge between two nodes with distance greater than 2.

###### Proof.

Elimination can results in connecting nodes at large distances. However, in LABEL:alg:fact, before applying elimination we remove all edges to nodes with distance larger than 1 (i.e., the well-separated edges). Therefore, after eliminating a node we create edges between nodes with distance at most 2.

LABEL:lem:sparse shows that for each node we need to process at most edges, where and are, respectively, the maximum number of super-nodes at distance 1 and 2 from a super-node. Note that and depend on the original matrix sparsity pattern, and are independent of the size of the matrix. To establish linear complexity of the factorization, we need to bound the size of the nodes (i.e., number of variables associated to them) in the -tree.

For matrices arising from the discretization of a PDE, well separated edges correspond to the interaction of points that are physically far from each other. Therefore, if the Green’s function of the associated PDE is smooth enough, one can expect a well-separated interaction to be numerically low-rank. We provide numerical evidence in LABEL:sec:numerical to support this argument. Note that the low-rank property of well-separated nodes depends on the quality of partitioning and the definition of well-separation. These are topics for followup studies.

For general sparse matrices we can guarantee the linear complexity through bounding the rank growth. This is explained in the next theorem.

###### Theorem 10.

(linear complexity condition) Consider to be the maximum size of super-nodes at level of an -tree with depth resulted from LABEL:alg:fact such that

 \definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0di≤αl−i\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0dland\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0dl=O(n/2l)=O(1), (18)

where is a constant number. Also assume and , the maximum number of super-nodes at distance 1 and 2 from a super-node, are quantities. Under these conditions the cost of the algorithm is linear with respect to the problem size.

###### Proof.

For a given super-node at level the compression cost is , and the elimination cost is . Note that the required memory scales with for each node. Ignoring the constant factors and , the order of the total cost of factorization is as follows:

 (19)

Plug LABEL:eqn:geometric in LABEL:eqn:cost:

 factorization cost=O(l∑i=12i−1α3(l−i)\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0d3l)=O(2l−1\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0d3ll−1∑i′=0(α32)i′)change of variable: i′=l−i=O(2l\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0d3l) (20)

Note that for we have . Furthermore, is the number of variables in super-nodes at the leaf level which is . Therefore:

 factorization cost=O(n\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0d2l),factorization memory=O(n\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0dl) (21)

## 6 Numerical results

We have implemented the algorithm described in LABEL:sec:alg in C++. The code (we call it LoRaSp666Low Rank Sparse solver.) can be downloaded from bitbucket.org/hadip/lorasp. We use Eigen [29] as the backend for linear algebra calculations, and SCOTCH [48] for graph partitioning. We present results for various benchmarks, where LoRaSp is used as a direct solver, or as a preconditioner in conjunction with an iterative solver.

### 6.1 LoRaSp as a stand-alone solver

In this section we employ LoRaSp as a stand-alone solver. The accuracy of the solver depends on the accuracy of the low-rank approximations during the compression step as explained in LABEL:sec:comp. Any low-rank approximation method can be used for the compression. Here, we use SVD. For every well-separated interaction, we first compute the SVD, and then truncate the singular values at some point. There are many possible criteria to truncate singular values. We discuss some possible criteria. LABEL:fig:decay shows the decay of singular values for blocks corresponding to the interaction between randomly chosen well-separated nodes at different levels of an -tree. The tree corresponds to a matrix obtained from the second-order uniform discretization of the Poisson equation:

 (22)

The domain is a three-dimensional unit cube. The matrix size is 32,768, and the depth of the corresponding -tree is 11. Evidently, singular values have exponential decay at different levels of the tree. The zero (up to machine precision) singular values are not shown in the plot.

To demonstrate the linear complexity of the method, we considered a sequence of problems with a growing number of variables. Consider the following sequence of uniform discretization of the domain in LABEL:eqn:Pois: , , , , , , and . The matrix size is increased by a factor of 2 in the consecutive problems. Hence, to keep the size of the leaf super-nodes constant among all problems, we consider -trees with depth for this sequence of problems, respectively. In general, the depth of -tree should scale linearly with , where is the size of matrix.

Well-separated edges corresponding to a block , as shown in LABEL:eqn:lra, with singular-values , are compressed by keeping only the singular-values that satisfy:

 σkσ0≥ϵ (23)

Smaller values of lead a to more accurate approximation of each block, and consequently a more accurate approximation of the final solution. For a given linear system , the precision of any solution is quantified by the relative error and relative residual defined as follows

 error=∥x−~x∥2∥x∥2,residual=∥A~x−b∥2∥b∥2 (24)

LABEL:fig:directSolve_varyN:sub2 shows that smaller values of (i.e., more accurate low-rank approximations) result in a more accurate estimation of the solution to the linear system in the cost of larger factorization and solve times, as shown in LABEL:fig:directSolve_varyN:sub1. For a constant , the time spent for the factorization and solve parts are asymptotically linear with the problem size. Note that for smaller values of the linear scaling is achieved for larger values of . In addition, the error and residual of the estimated solution for a fixed barely change with the problem size (see LABEL:fig:directSolve_varyN:sub2).

As it is clear from LABEL:fig:directSolve_varyN, we can obtain more accurate solutions by decreasing the parameter in LABEL:eqn:cutOff1. To show the convergence of the solver, we picked a fixed problem size, and measured the accuracy of the estimated solution as decreases. In addition, for comparison purposes, we consider a 2D variation of LABEL:eqn:Pois which is discretized using a finite volume approach with Voronoi tessellation. The points are drawn from a random uniform distribution in the interval. The discretization results in a matrix , where is a diagonal matrix with inverse of the Voronoi cells on the diagonal, and is a symmetric matrix. We apply the factorization directly to . Note that the average number of non-zeros per row for a matrix corresponding to a 2D Voronoi discretization is 7, which is the same as for a uniform second order 3D discretization.777We can show this by double counting the angles in a 2D Voronoi tessellation, once through points, and once through triangles of the corresponding Delaunay triangulations.

In LABEL:fig:convergence the convergence of the solution for a 3D Poisson problem with (corresponding to a