A Robust Hierarchical Solver for Ill-conditioned Systems with Applications to Ice Sheet Modeling
Abstract
A hierarchical solver is proposed for solving sparse ill-conditioned linear systems in parallel. The solver is based on a modification of the LoRaSp method, but employs a deferred-compression technique, which provably reduces the approximation error and significantly improves efficiency. Moreover, the deferred-compression technique introduces minimal overhead and does not affect parallelism. As a result, the new solver achieves linear computational complexity under mild assumptions and excellent parallel scalability. To demonstrate the performance of the new solver, we focus on applying it to solve sparse linear systems arising from ice sheet modeling. The strong anisotropy associated with the thin structure of ice sheets creates serious challenges for existing solvers. To address the anisotropy, we additionally developed a customized partitioning scheme for the solver, which captures the strong-coupling direction accurately. In general, the partitioning can be computed algebraically with existing software packages, and thus the new solver is generalizable to other sparse linear systems. Our results show that ice sheet problems of about 300 million degrees of freedom have been solved in just a few minutes using a thousand processors.
keywords:
Hierarchical matrix, Sparse matrix, Ice sheet modeling, Parallel computing

1 Introduction
This paper considers the problem of solving large sparse linear systems, which is a fundamental building block but also often a computational bottleneck in many science and engineering applications. In particular, we target linear systems that result from the numerical discretization of elliptic partial differential equations (PDEs), including the Laplace, Stokes and Helmholtz equations (the latter in the low- and middle-frequency regime), using local schemes such as finite differences or finite elements. One challenge arises when the condition number of the problem is large, and existing solvers become inefficient. Existing solvers fall into three classes. The first class is sparse direct solvers Davis et al. (2016), which leverage efficient ordering schemes to perform Gaussian elimination. However, for a three-dimensional problem of size N they generally require O(N^2) computation and O(N^{4/3}) storage. These large costs seriously limit the application of sparse direct solvers to truly large-scale problems. The second class is iterative solvers such as the conjugate gradient method and the multigrid method. These methods may need only O(N) work and storage per iteration. However, the number of iterations required to achieve convergence can be quite large when solving ill-conditioned linear systems. Preconditioning is essential to improve the conditioning and convergence.
The third class of methods, which is the focus here, is hierarchical solvers. These methods compute an approximate factorization of a discretized elliptic PDE by taking advantage of the fact that the discretization matrix (and its inverse) has certain hierarchical low-rank structures, including ℋ-matrices Hackbusch (1999); Hackbusch and Khoromskij (2000), ℋ²-matrices Hackbusch and Börm (2002); Hackbusch (2015) and hierarchically semiseparable (HSS) matrices Xia et al. (2010); Chandrasekaran et al. (2006), among others Amestoy et al. (2015); Aminfar et al. (2016). By exploiting this data sparsity of the underlying physical problem, hierarchical solvers have been shown to achieve linear or quasi-linear complexity. However, their efficiency deteriorates when highly ill-conditioned problems are encountered, because the ranks, and hence the costs, of the approximations must increase dramatically in order to maintain the same accuracy in the final solution. Note that hierarchical solvers can be used either as a direct solver (high accuracy) or as a preconditioner (low accuracy) for iterative methods. Our focus is on the latter.
In this paper, we introduce the deferred-compression technique to improve the efficiency of hierarchical solvers for sparse ill-conditioned linear systems and demonstrate this improvement by implementing it in a particular hierarchical solver named LoRaSp Pouransari et al. (2017). In hierarchical solvers such as LoRaSp, off-diagonal matrix blocks are compressed with low-rank approximations if they satisfy the strong admissibility condition Hackbusch and Börm (2002). In our new solver, these compressible matrix blocks are first scaled by the Cholesky factors of the corresponding diagonal blocks before low-rank approximations are applied. This extra scaling step provably reduces errors in the subsequent step of forming the Schur complement. In addition, it increases the likelihood that the Schur complement remains symmetric positive definite (SPD) even with crude low-rank approximations when the original input matrix is SPD. For many practical applications, using the deferred-compression technique to preserve the SPD property of the underlying physical problem is crucial.
Previous deferred-compression work Xia and Gu (2010); Xia and Xin (2017); Xing and Chow (2018) focused on HSS matrices, which are a type of weakly admissible (as opposed to strongly admissible) hierarchical matrix. These approaches are not directly applicable to strongly admissible hierarchical matrices (e.g., ℋ²-matrices) and solvers based on them such as LoRaSp. Furthermore, prior deferred-compression efforts concentrated on solving dense linear systems, where incorporating the deferred-compression step leads to an extra amount of computation, and no corresponding parallel solver was developed. Compared to the previously published papers, our deferred-compression technique is novel in three ways:

we target hierarchical solvers specialized for strongly admissible hierarchical matrices, and develop an associated general error analysis; the previous analysis can be recovered as a special case of our new analysis.

we propose a new solver for sparse linear systems for which the computational complexity is proved to be O(N) under some mild assumptions. This nearly optimal complexity implies that we can solve large problems with minimal asymptotic computational cost (up to constants).

we show that incorporating the deferredcompression scheme into the LoRaSp solver does not change the data and task dependencies in the parallel LoRaSp solver Chen et al. (2016). Therefore, we can take advantage of the existing parallel algorithm to solve largescale problems efficiently (on distributedmemory machines).
In order to demonstrate the performance of our new solver, this paper addresses the challenges of solving linear systems from a real-world problem, ice sheet modeling, where the solution of discretized linear systems remains the computational bottleneck. Ice sheet modeling is an essential component needed to estimate future sea-level rise due to climate change. As noted in Solomon (2007); Stocker (2014) from the Intergovernmental Panel on Climate Change (IPCC), modern ice sheet models must continue to introduce advanced features such as adaptive mesh refinement at sub-kilometer resolutions, optimization, data assimilation and uncertainty quantification for the treatment of numerous unknown model inputs. These advances will likely introduce further computational burdens requiring improvements in the linear solver, which must be repeatedly invoked over the course of a simulation. Given that current ice sheet simulations already consume resources on thousands of processing units on modern supercomputers and can involve up to billions of unknown variables, there is a pressing need for efficient linear solvers to reduce simulation costs and prepare for the potentially larger, more sophisticated simulations of the future.
However, many existing solvers deliver rather disappointing performance on problems from ice sheet modeling. The most prominent challenge comes from the anisotropic nature of ice sheet models, where the thin vertical scale of the domain is tiny relative to the horizontal scale. This extraordinary contrast is also reflected in the dramatically different magnitudes of entries in the discretization matrix, where large values correspond to strong vertical coupling and tiny ones to weak horizontal coupling. This weak coupling gives rise to oscillatory eigenvectors associated with small eigenvalues and a poorly conditioned linear system. This can be seen from a simplified model, −ε u_xx − u_yy = f with ε ≪ 1, for which the standard five-point finite difference discretization on a regular n × n grid produces a matrix with many small eigenvalues

λ_{j,k} = 4ε sin²(jπ/(2(n+1))) + 4 sin²(kπ/(2(n+1)))

for all values of j and small values of k. Further, the Neumann boundary condition imposed on the top surface and on some bottom parts of the domain gives rise to problematic linear systems with nearly singular matrix blocks. Physically, the bottom Neumann boundary condition models large ice shelves, which are vast areas of floating ice connected to land-based ice sheets and are common in Antarctica. The resulting Green’s function decays much more slowly along the vertical direction than that for non-sliding ice at a frozen ice interface Tuminaro et al. (2016), again contributing to the poor performance of many existing solvers.
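The effect of the anisotropy can be made concrete with the following minimal sketch (illustrative only; the grid size and the value of ε are arbitrary choices), which evaluates the analytic eigenvalues of the five-point discretization of the model problem −ε u_xx − u_yy with Dirichlet boundaries and counts the near-zero eigenvalues:

```python
import numpy as np

def aniso_laplacian_eigs(n, eps):
    """Analytic eigenvalues of the five-point stencil for the model
    problem -eps*u_xx - u_yy on an n-by-n Dirichlet grid:
    lam_{j,k} = 4*eps*sin^2(j*pi/(2(n+1))) + 4*sin^2(k*pi/(2(n+1)))."""
    j = np.arange(1, n + 1)
    s2 = 4.0 * np.sin(j * np.pi / (2 * (n + 1))) ** 2
    lam = eps * s2[:, None] + s2[None, :]   # lam[j-1, k-1]
    return np.sort(lam.ravel())

# with eps << 1, every horizontal mode j pairs with each small vertical
# mode k to give a near-zero eigenvalue, so the count of small
# eigenvalues grows like n -- unlike the isotropic case
aniso = aniso_laplacian_eigs(31, 1e-4)
iso = aniso_laplacian_eigs(31, 1.0)
n_small_aniso = int(np.sum(aniso < 0.05))
n_small_iso = int(np.sum(iso < 0.05))
```

For ε = 10⁻⁴ the count of near-zero eigenvalues is an order of magnitude larger than in the isotropic case, which is exactly the clustering that slows unpreconditioned iterative methods.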
The two solvers (preconditioners) commonly used in ice sheet modeling are the incomplete LU factorization (ILU) and the algebraic multigrid method (AMG). Although a customized ILU with a specific ordering scheme performs reasonably well for the Greenland ice sheet problem, its performance deteriorates significantly for the Antarctic ice sheet problem. The reason is that ice sheets in the Antarctic problem contain a substantial fraction of floating ice shelves, modeled by imposing Neumann boundary conditions, which leads to the aforementioned ill-conditioned linear systems. Another possible approach to solving the ice sheet linear systems is some form of AMG. However, standard AMG methods (e.g., the smoothed aggregation AMG solver Vaněk et al. (1996)) do not generally converge on realistic ice sheet simulations. While some specialized AMG techniques have been successfully developed (e.g., a customized matrix-dependent AMG solver Tuminaro et al. (2016) using tailored semi-coarsening schemes), these approaches required significant nontrivial multigrid adaptations to address ice sheet simulations. Such adaptations are not generally provided in available AMG packages.
To solve the particular linear systems arising in ice sheet modeling efficiently, our new solver introduces one customization. The typical meshes employed for ice sheet models are generated by first creating a two-dimensional unstructured horizontal mesh and then extruding this mesh into the vertical dimension to create a three-dimensional grid. This mesh structure is leveraged when building clusters for the hierarchical solver. In particular, the (two-dimensional unstructured) non-extruded mesh is first partitioned with a general graph partitioner, and the horizontal partition is then extended along the third/extruded direction such that mesh vertices lying on the same vertical line belong to the same cluster. Since extruded meshes appear frequently in geophysical problems such as atmospheric and oceanic circulation and oil and gas modeling, our new solver along with the “extruded partitioning” algorithm can be applied to other engineering simulations involving thin structures. Compared with the ILU and AMG methods used for ice sheet modeling, our new solver is robust in the sense that the iteration count stays nearly constant when it is used as a preconditioner for solving linear systems associated with ice sheet modeling, and it is general-purpose in that it can be applied as a “black-box” method with a general partitioning scheme available in several existing software packages, such as METIS/ParMETIS Karypis and Kumar (1998), Scotch Chevalier and Pellegrini (2008) and Zoltan Boman et al. (2012), though a special partitioner can also be easily incorporated. Moreover, while it is challenging to parallelize the ILU and AMG methods on modern many-core architectures such as the GPU, our new solver, similar to other hierarchical solvers, is mainly based on dense linear algebra subroutines and thus can potentially be accelerated using many-core processors.
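The extruded partitioning idea can be sketched in a few lines, assuming the extruded 3D mesh indexes its vertices column by column (an assumption of this sketch, not a requirement of the solver); the toy horizontal partition below stands in for the output of a general graph partitioner such as METIS:

```python
import numpy as np

def extruded_partition(horizontal_part, n_layers):
    """Extend a partition of the 2D horizontal mesh to the extruded
    3D mesh.

    horizontal_part[i] is the cluster id of horizontal vertex i; the 3D
    vertex (i, layer) is assumed to be indexed as i*n_layers + layer,
    so every vertex on the same vertical line inherits the cluster of
    its horizontal footprint.
    """
    return np.repeat(np.asarray(horizontal_part), n_layers)

# toy horizontal mesh of 6 vertices split into 2 clusters
# (in practice this comes from METIS/ParMETIS, Scotch or Zoltan)
hpart = [0, 0, 1, 1, 0, 1]
part3d = extruded_partition(hpart, n_layers=4)
```

Because all strongly coupled vertical neighbors end up in the same cluster, the strong-coupling direction never crosses a cluster boundary.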
To summarize, the paper presents a parallel hierarchical solver for sparse illconditioned linear systems using the deferredcompression technique, and in particular, our work makes the following three major contributions:

Error analysis of the deferred-compression scheme for hierarchical solvers based on strongly admissible hierarchical matrices (e.g., ℋ²-matrices).

A parallel/distributed-memory hierarchical solver for sparse ill-conditioned linear systems.

Application and analysis of the preconditioner for an ice sheet modeling problem, including numerical comparisons with ILU (a high-performance implementation in the Trilinos IFPACK package).
The remainder of this paper is organized as follows. Section 2 introduces the deferred-compression technique and provides an error analysis. Section 3 then presents the algorithm of our new solver. Next, Section 4 briefly summarizes the first-order-accurate Stokes approximation model of ice sheets and introduces the “extruded partitioning” algorithm. Finally, in Section 5 numerical results are given demonstrating the performance and scalability of our new solver for ice sheet modeling and also for general problems from the SuiteSparse Matrix Collection (https://sparse.tamu.edu/).
2 Deferred-compression Scheme
This section presents the algorithm for deferred compression and the corresponding error analysis, targeted at hierarchical solvers that are based on strongly admissible hierarchical matrices. These solvers employ low-rank approximations to compress off-diagonal matrix blocks that satisfy the strong-admissibility condition. A rigorous definition of the strong-admissibility condition can be found in Hackbusch and Börm (2002). From a high-level perspective, the strong-admissibility condition states that one block-row in an (appropriately partitioned) strongly admissible hierarchical matrix includes a diagonal block corresponding to “self-interaction,” a full-rank off-diagonal block corresponding to “neighbor or near-field interaction,” and a (numerically) low-rank off-diagonal block corresponding to “well-separated or far-field interaction.” Therefore, a strongly admissible hierarchical matrix A can be partitioned as the following block matrix

A = [ A_ss  A_sn  A_sw
      A_ns  A_nn  A_nw
      A_ws  A_wn  A_ww ],

where “s” is the set of row/column indexes that we seek to eliminate via a Cholesky factorization, “n” stands for the set of indexes for which A_sn and A_ns are full rank, and “w” is used to denote the low-rank blocks A_sw and A_ws. We further assume that A is SPD in this paper, so A_ns = A_sn^T, A_ws = A_sw^T, and A_wn = A_nw^T.
Below we first review the classical Cholesky factorization and introduce some notation; then we analyze the errors in forming (approximate) Schur complements when the A_ss block is eliminated with and without the deferred-compression scheme. To measure error, we use the matrix norm (a.k.a. 2-norm or operator norm), denoted by ‖·‖.
Cholesky factorization
To carry out one step of (block) Cholesky factorization on the A_ss block in A, we define the following three matrices: the Cholesky factor L_s of A_ss = L_s L_s^T, the elimination matrix

E = [ L_s^{-1}         0  0
      −A_ns A_ss^{-1}  I  0
      −A_ws A_ss^{-1}  0  I ],

and the transformed matrix

M = E A E^T = [ I  0
                0  S ].

The (exact) Schur complement S is found in the lower 2×2 block of M as

S = [ A_nn − A_ns A_ss^{-1} A_sn    A_nw − A_ns A_ss^{-1} A_sw
      A_wn − A_ws A_ss^{-1} A_sn    A_ww − A_ws A_ss^{-1} A_sw ].    (1)
To actually compute the Cholesky factorization of the whole matrix A, the Schur complement S needs to be further factorized; we skip this here since it is not relevant for the current discussion.
Without the deferred-compression scheme
Suppose the low-rank matrix block A_ws can be decomposed as

A_ws = Q B + E_c,    ‖E_c‖ ≤ ε,    (2)

where Q is an orthogonal matrix and ε is a small prescribed tolerance. This kind of decomposition can be computed using, e.g., a rank-revealing QR factorization (RRQR) or a truncated SVD. Dropping the E_c term in A leads to the compressed matrix

Ā = [ A_ss   A_sn   (Q B)^T
      A_ns   A_nn   A_nw
      Q B    A_wn   A_ww ],

where the low-rank approximation Q B can be exploited to compute an approximate factorization of Ā at a lower cost. Applying one step of the Cholesky factorization on the A_ss block in Ā, with the elimination matrix Ē defined as E above but with A_ws replaced by Q B, yields M̄ = Ē Ā Ē^T. As a result, M̄ contains S̄, an approximation of S with error S̄ − S:

S̄ = [ A_nn − A_ns A_ss^{-1} A_sn    A_nw − A_ns A_ss^{-1} (Q B)^T
      A_wn − Q B A_ss^{-1} A_sn     A_ww − Q B A_ss^{-1} (Q B)^T ],    (3)

S̄ − S = [ 0                     A_ns A_ss^{-1} E_c^T
           E_c A_ss^{-1} A_sn    E_c A_ss^{-1} A_sw + A_ws A_ss^{-1} E_c^T − E_c A_ss^{-1} E_c^T ].    (4)
Proposition 1.

Assume Eq. (2) holds. Then the error between the two Schur complements, namely S in M and S̄ in M̄, is given by Eq. (4). Moreover, the following error estimates hold:

‖S̄_nw − S_nw‖ ≤ ε ‖A_ns‖ / σ_min(A_ss),

‖S̄_wn − S_wn‖ ≤ ε ‖A_sn‖ / σ_min(A_ss),

‖S̄_ww − S_ww‖ ≤ (2 ε ‖A_ws‖ + ε²) / σ_min(A_ss),

where S̄_nw, S̄_wn and S̄_ww stand for the (1,2), (2,1) and (2,2) blocks of S̄, and σ_min(A_ss) is the smallest singular value of A_ss.
Proof.
The first part of the proposition (Eq. (4)) was shown above, so we derive the three error bounds as follows. For the (2,1) block,

‖S̄_wn − S_wn‖ = ‖E_c A_ss^{-1} A_sn‖ ≤ ‖E_c‖ ‖A_ss^{-1}‖ ‖A_sn‖ ≤ ε ‖A_sn‖ / σ_min(A_ss),    (5)

where Eq. (2) and ‖A_ss^{-1}‖ = 1/σ_min(A_ss) are used, and σ_min(·) denotes the smallest singular value of a matrix. The bound for the (1,2) block follows in the same way. For the (2,2) block,

‖S̄_ww − S_ww‖ ≤ ‖E_c‖ ‖A_ss^{-1}‖ ‖A_sw‖ + ‖A_ws‖ ‖A_ss^{-1}‖ ‖E_c^T‖ + ‖E_c‖ ‖A_ss^{-1}‖ ‖E_c^T‖ ≤ (2 ε ‖A_ws‖ + ε²) / σ_min(A_ss),

where we used the equality ‖X‖ = ‖X^T‖ for any matrix X. ∎
For ill-conditioned problems such as linear systems from ice sheet modeling, the diagonal matrix block A_ss can be nearly singular, so σ_min(A_ss) is very small. As a result, the error S̄ − S can be large. Worse still, due to this large error the approximate Schur complement may become indefinite, and the Cholesky factorization of its diagonal blocks can break down. This leads to a poor approximation of the exact Schur complement S, an SPD matrix.
The above error analysis extends to all hierarchical solvers based on strongly admissible hierarchical matrices (with potentially minor modifications) and shows that the low-rank truncation error ε needs to decrease at least as fast as σ_min(A_ss) to maintain the same error tolerance on S̄.
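This amplification is easy to reproduce numerically. The sketch below is a synthetic illustration (random blocks, not an ice sheet matrix), using the notation of this section: A_ss is a nearly singular SPD diagonal block, A_ws is truncated at rank k, and the error is measured in the (2,1) block of the Schur complement:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
# "s" diagonal block: SPD with a controllably tiny smallest singular
# value, mimicking the nearly singular blocks in ice sheet systems
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A_ss = U @ np.diag(np.linspace(1e-6, 1.0, n)) @ U.T
A_sn = rng.standard_normal((n, n))
A_wn = rng.standard_normal((n, n))
A_ws = rng.standard_normal((n, n))

# truncated SVD of A_ws: keep rank k, truncation error eps = sv[k]
k = 10
Uw, sv, Vt = np.linalg.svd(A_ws, full_matrices=False)
eps = sv[k]
A_ws_lr = Uw[:, :k] @ np.diag(sv[:k]) @ Vt[:k]

# (2,1) block of the Schur complement: exact vs. compressed
S_wn = A_wn - A_ws @ np.linalg.solve(A_ss, A_sn)
S_wn_lr = A_wn - A_ws_lr @ np.linalg.solve(A_ss, A_sn)
err = np.linalg.norm(S_wn - S_wn_lr, 2)
# the truncation error eps is amplified by up to 1/sigma_min(A_ss),
# so err ends up orders of magnitude larger than eps itself
```

With σ_min(A_ss) = 10⁻⁶, the observed Schur complement error is many orders of magnitude larger than the truncation tolerance, matching the ε/σ_min(A_ss) bound.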
With the deferred-compression scheme
Before compressing the off-diagonal matrix block A_ws directly, we first scale it by the inverse of the Cholesky factor of A_ss. Specifically, the Cholesky factorization A_ss = L_s L_s^T of the diagonal block is used to scale the first block row and column of A as follows

Â = D^{-1} A D^{-T} = [ I               L_s^{-1} A_sn    L_s^{-1} A_sw
                        A_ns L_s^{-T}   A_nn             A_nw
                        A_ws L_s^{-T}   A_wn             A_ww ],

where D = diag(L_s, I, I). Then the A_ws L_s^{-T} block is compressed with a low-rank approximation. Similar to Eq. (2), assume a low-rank decomposition of A_ws L_s^{-T} as

A_ws L_s^{-T} = Q B + E_c,    ‖E_c‖ ≤ ε,    (6)
where Q is orthogonal, ε is a small prescribed tolerance, and the decomposition is computed by a truncated SVD, so that B E_c^T = 0. One way to relate Eq. (6) to Eq. (2) is the following: multiplying Eq. (6) on the right by L_s^T gives A_ws = Q (B L_s^T) + E_c L_s^T, i.e., a decomposition of the form of Eq. (2) in which the truncation is carried out in the inner product defined by the SPD matrix A_ss (through the scaling by L_s) rather than in the standard inner product.
Replacing A_ws L_s^{-T} by Q B in Â leads to the compressed matrix

Ã = [ I               L_s^{-1} A_sn    (Q B)^T
      A_ns L_s^{-T}   A_nn             A_nw
      Q B             A_wn             A_ww ].

Carrying out one step of Cholesky factorization on the (now identity) “s” block in Ã with

Ẽ = [ I               0  0
      −A_ns L_s^{-T}  I  0
      −Q B            0  I ]

produces the following Schur complement, another approximation of S:

S̃ = [ A_nn − A_ns A_ss^{-1} A_sn    A_nw − A_ns L_s^{-T} (Q B)^T
      A_wn − Q B L_s^{-1} A_sn      A_ww − Q B (Q B)^T ].    (7)
Proposition 2.

Assume Eq. (6) holds. Then the error between the two Schur complements, namely S in M and S̃, is the following

S̃ − S = [ 0                    A_ns L_s^{-T} E_c^T
           E_c L_s^{-1} A_sn    E_c E_c^T ].    (8)

Moreover, the following error estimates hold:

‖S̃_nw − S_nw‖ ≤ ε ‖A_ns L_s^{-T}‖ ≤ ε ‖A_nn‖^{1/2},

‖S̃_wn − S_wn‖ ≤ ε ‖L_s^{-1} A_sn‖ ≤ ε ‖A_nn‖^{1/2},

and S̃_ww − S_ww = E_c E_c^T with ‖S̃_ww − S_ww‖ ≤ ε²,

where S̃_nw, S̃_wn and S̃_ww stand for the (1,2), (2,1) and (2,2) blocks in S̃.
Proof.

Eq. (8) follows from comparing Eq. (7) with Eq. (1), using L_s^{-1} A_sw = (Q B + E_c)^T and, for the (2,2) block, the orthogonality B E_c^T = 0 of the truncated SVD. The bounds then follow from ‖E_c‖ ≤ ε and the fact that Â = D^{-1} A D^{-T} is SPD with identity (1,1) block, which implies ‖A_ns L_s^{-T}‖ ≤ ‖A_nn‖^{1/2}. ∎
As the above proposition shows, the approximate Schur complement computed with the deferred-compression scheme is much more accurate than that computed without it, especially when the problem is highly ill-conditioned. In other words, if the error tolerance is fixed, our new solver can deploy a (much) larger truncation error ε, reducing the setup/factorization cost of a hierarchical solver significantly. In our numerical experiments we will show that our new hierarchical solver performs better than the original LoRaSp solver. In particular, the new error bound is much smaller than the old one and does not depend on σ_min(A_ss). Furthermore, S̃_ww − S_ww is now symmetric positive semi-definite, which implies the following.
Corollary 1.
The S̃_ww block, i.e., the (2,2) block in S̃, is SPD.
Proof.
The following equality holds according to Eq. (8):

S̃_ww = S_ww + E_c E_c^T.

Since the original matrix A is SPD, the exact Schur complement S and its S_ww block are both SPD. It is also obvious that E_c E_c^T is a symmetric positive semi-definite matrix. Therefore, S̃_ww is SPD. ∎
In general, the compressed matrix itself is not necessarily SPD for an arbitrary truncation tolerance ε. However, we observe that the matrix remains SPD for much larger ε (i.e., at lower cost) with the deferred-compression scheme than with the original algorithm.
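The contrast between the two schemes can be reproduced with a small synthetic experiment (random blocks; as a simplification of this sketch, the “n” block is omitted and A_ww is constructed so that the exact Schur complement is the identity):

```python
import numpy as np

rng = np.random.default_rng(0)
ns, nw, k = 30, 30, 10
# SPD test matrix [[A_ss, A_ws^T], [A_ws, A_ww]] with a nearly
# singular A_ss; A_ww is chosen so the exact Schur complement is I
U, _ = np.linalg.qr(rng.standard_normal((ns, ns)))
A_ss = U @ np.diag(np.linspace(1e-6, 1.0, ns)) @ U.T
A_ws = rng.standard_normal((nw, ns))
A_ww = A_ws @ np.linalg.solve(A_ss, A_ws.T) + np.eye(nw)

S_ww = np.eye(nw)                          # exact Schur complement

# (a) no deferred compression: truncate the SVD of A_ws directly
Uw, sv, Vt = np.linalg.svd(A_ws, full_matrices=False)
A_ws_lr = Uw[:, :k] @ np.diag(sv[:k]) @ Vt[:k]
S_plain = A_ww - A_ws_lr @ np.linalg.solve(A_ss, A_ws_lr.T)

# (b) deferred compression: scale by the Cholesky factor of A_ss
# first, then truncate the SVD of A_ws @ L^{-T}
L = np.linalg.cholesky(A_ss)
T = np.linalg.solve(L, A_ws.T).T           # A_ws @ inv(L).T
Ut, st, Vtt = np.linalg.svd(T, full_matrices=False)
T_lr = Ut[:, :k] @ np.diag(st[:k]) @ Vtt[:k]
S_dc = A_ww - T_lr @ T_lr.T

err_plain = np.linalg.norm(S_plain - S_ww, 2)
err_dc = np.linalg.norm(S_dc - S_ww, 2)    # equals st[k]**2
# S_dc = S_ww + E_c E_c^T stays SPD, while S_plain goes indefinite
# because the truncation error is amplified by 1/sigma_min(A_ss)
```

In this run the scaled variant commits an error equal to the square of the dropped singular value and preserves positive definiteness, while direct truncation produces a far larger, indefinite perturbation.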
Overall, the differences between computing an approximate Schur complement of A with and without the deferred-compression (DC) scheme are summarized in the following table.

                          Without DC               With DC
Matrix                    A                        Â = D^{-1} A D^{-T}
Low-rank approximation    A_ws ≈ Q B               A_ws L_s^{-T} ≈ Q B
Approximate Schur compl.  S̄, Eq. (3)               S̃, Eq. (7)
Error bound               O(ε / σ_min(A_ss))       O(ε)
(2,2) block               may be indefinite        always SPD
3 Improved LoRaSp Solver
In this section, we complete the algorithm description of our new hierarchical solver, obtained by implementing the deferred-compression technique in the original LoRaSp solver. Our goal is to solve an (ill-conditioned) SPD linear system

A x = b,    (9)

and our solver is based on a clustering of the unknown variables in Eq. (9).
Matrix Partitioning
Define G(A) as the (undirected) graph corresponding to the symmetric matrix A: vertices in G(A) correspond to row/column indexes in A, and an edge exists between vertices i and j if A_ij ≠ 0. A clustering of the unknown variables in Eq. (9) is equivalent to a partitioning of the graph G(A). Graph partitioning is a well-studied problem and can be computed algebraically using techniques such as spectral partitioning and multilevel methods in existing high-performance packages, such as METIS/ParMETIS Karypis and Kumar (1998), Scotch Chevalier and Pellegrini (2008) and Zoltan Boman et al. (2012).
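As a concrete illustration of the graph G(A) (a toy sketch; production codes would call METIS or a similar partitioner through its library interface):

```python
import numpy as np

def matrix_graph_edges(A):
    """Edge set of the undirected graph G(A) of a symmetric matrix:
    vertices are row/column indexes; {i, j} is an edge iff
    A[i, j] != 0 for i != j."""
    n = A.shape[0]
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if A[i, j] != 0}

# 1D Laplacian on 4 points: the graph is the path 0 - 1 - 2 - 3
A = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  2.]])
edges = matrix_graph_edges(A)
```

Partitioning this graph into clusters is exactly what induces the block structure used in the elimination below.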
Our hierarchical solver computes an approximate factorization of A by compressing fill-in blocks generated during Gaussian elimination. The key observation is that the fill-in blocks have low-rank structures, i.e., their singular values decay fast. Intuitively, the inverse of a diagonal block in the discretization matrix corresponds to the discrete Green’s function of a local elliptic PDE, which has numerically low-rank off-diagonal matrix blocks. The same low-rank property also carries over to the Schur complement Amestoy et al. (2015); Aminfar et al. (2016); Pouransari et al. (2017); Xia et al. (2010); Ho and Ying (2016).
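A toy numerical check of this low-rank property, using the 1D Laplacian, whose inverse has off-diagonal blocks of exactly rank one (in 2D/3D the numerical ranks are larger than one but still small and slowly growing):

```python
import numpy as np

# 1D Laplacian (tridiagonal) as a toy elliptic discretization
n = 200
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
Ainv = np.linalg.inv(A)

# off-diagonal block of the inverse (discrete Green's function)
# between two well-separated index clusters
block = Ainv[:50, 150:]
s = np.linalg.svd(block, compute_uv=False)
# singular values drop to machine precision immediately:
# the numerical rank of the well-separated block is tiny
rank = int(np.sum(s > 1e-10 * s[0]))
```

The same experiment on a 2D Laplacian gives a rank that grows only logarithmically with the block size, which is what makes the low-rank compression in the solver effective.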
3.1 Scaled Low-rank Elimination
Let C = {c_1, c_2, …} denote a clustering of all unknown variables in Eq. (9), and without loss of generality, assume that the matrix A is partitioned and ordered accordingly, e.g., the first block row/column corresponds to c_1. Two clusters c_i and c_j are defined as “neighbors” if the matrix block A_{c_i c_j} ≠ 0. In other words, the neighbors of a cluster form the set of adjacent clusters in G(A).
To apply the “scaled low-rank elimination” step to the first cluster c_1, we partition the matrix A in the familiar way of Section 2, where the “s” block corresponds to c_1, the “n” block corresponds to the neighbors of c_1 and the “w” block corresponds to the rest. Based on our definition of neighbors above, A_ws = A_sw^T = 0. In this case, and generally whenever A_ws = 0, the “scaled low-rank elimination” step reduces to a normal block Cholesky factorization.
As in Section 2, denote E_1 as the matrix corresponding to one step of block Cholesky factorization applied to c_1, and denote A^{(1)} as the resulting transformed matrix containing the Schur complement, i.e.,

A^{(1)} = E_1 A E_1^T.

Again, we can partition A^{(1)} into a 3×3 block matrix, where the “s” block now corresponds to the next cluster c_2, the “n” block corresponds to the neighbors of c_2 and the “w” block includes all remaining vertices. In general A_ws ≠ 0 now, because this block contains fill-in generated from the previous elimination of c_1. To simplify notation, we will drop the superscript of matrix blocks in A^{(1)}.
The “scaled low-rank elimination” step involves three operators: the scaling operator H, the sparsification operator V and the Gaussian elimination operator G. The scaling operator is defined as

H = diag(L_s^{-1}, I, I),    (10)

where A_ss = L_s L_s^T is the Cholesky factorization of the current “s” diagonal block.

After the scaling operator is applied, the off-diagonal block A_ws L_s^{-T} in H A H^T is compressed with a low-rank approximation, as in Eq. (6). This compression step is exactly the same as in the deferred-compression scheme. Instead of eliminating the “s” block directly, the next step applies the sparsification operator

V = diag(V_s, I, I),    (11)

where V_s = [V_c  V_f] is the orthogonal matrix collecting the right singular vectors of A_ws L_s^{-T} (retained directions V_c first), and introduces a zero block as below:

(A_ws L_s^{-T}) V_s = [ A_ws L_s^{-T} V_c    A_ws L_s^{-T} V_f ] ≈ [ A_ws L_s^{-T} V_c    0 ].

Notice that the “s” diagonal block remains the identity after this rotation, and the retained coupling block A_ws L_s^{-T} V_c has as many columns as the rank of the low-rank approximation in Eq. (6).
After the sparsification step, the cluster of unknown variables can be split into “coarse” unknown variables (the V_c directions) and “fine” unknown variables (the V_f directions), where the fine variables involve no fill-in. Then the fine variables are eliminated, which does not propagate any existing fill-in (no level-2 fill-in is introduced).
The Gaussian elimination operator

G,    (12)

the block unit lower-triangular matrix corresponding to one step of block Cholesky factorization on the “fine” variables, eliminates the “fine” unknown variables: after applying G, the fine variables are decoupled from the rest of the system (their diagonal block is the identity), and the blocks among the remaining variables are updated with the corresponding Schur complement contributions.
Last, we introduce an auxiliary permutation operator, P, which permutes the rows and columns corresponding to the “fine” variables to the end; its matrix representation is the associated permutation matrix.    (13)

Finally, the “scaled low-rank elimination” operator for a cluster is defined as the composition of the four operators above, which scales the cluster, selects the coarse directions and eliminates the fine DOFs. To summarize, applying this composite operator to both sides of the matrix produces a smaller SPD system in which the fine DOFs of the cluster have been eliminated.
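One elimination step can be sketched densely as follows (illustrative only: the neighbor and well-separated blocks are lumped into a single “rest” block, and a truncated SVD stands in for the low-rank approximation of Eq. (6)):

```python
import numpy as np

def scaled_lowrank_elimination(A, ns, k):
    """One scaled low-rank elimination step on the leading ns-by-ns
    cluster of the SPD matrix A, keeping rank k.

    Dense two-block sketch of the H (scaling), V (sparsification) and
    G/P (elimination/permutation) operators; returns the reduced SPD
    matrix on [coarse DOFs of the cluster] + [rest].
    """
    A_ss = A[:ns, :ns]
    A_rs = A[ns:, :ns]
    A_rr = A[ns:, ns:]

    # H: scale by the Cholesky factor so the diagonal block becomes I
    L = np.linalg.cholesky(A_ss)
    T = np.linalg.solve(L, A_rs.T).T        # A_rs @ inv(L).T

    # V: SVD-based rotation of the cluster; only k rotated ("coarse")
    # directions retain coupling to the rest, the remaining ns - k
    # ("fine") directions are dropped (truncation error s_t[k])
    U_t, s_t, _ = np.linalg.svd(T, full_matrices=False)
    T_coarse = U_t[:, :k] * s_t[:k]

    # G + P: the fine DOFs now have an identity diagonal and no
    # coupling, so eliminating them leaves this reduced system
    top = np.hstack([np.eye(k), T_coarse.T])
    bot = np.hstack([T_coarse, A_rr])
    return np.vstack([top, bot])

# toy SPD matrix
rng = np.random.default_rng(1)
Gm = rng.standard_normal((60, 80))
A = Gm @ Gm.T / 80 + 0.1 * np.eye(60)
R = scaled_lowrank_elimination(A, ns=20, k=5)
```

Because the dropped coupling only removes a positive semi-definite term from the exact Schur complement, the reduced matrix R is guaranteed to stay SPD, which is the point of the deferred-compression ordering of operations.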
3.2 Entire Algorithm
We have introduced the “scaled low-rank elimination” step for one cluster. The algorithm repeatedly applies this step to all clusters in C. This process is equivalent to computing an approximate factorization of the input SPD matrix A, subject to the error of the low-rank approximations. After all clusters are processed, one is left with a linear system consisting of the “coarse” unknown variables, and we can apply the same idea to this coarse system recursively. The entire algorithm is shown in Algorithm 1.
Similar to sparse direct solvers, Algorithm 1 outputs an approximate factorization of the original matrix A, which is used to solve the linear system A x = b. Since the scaling and sparsification operators are block diagonal matrices, the elimination operator is a triangular matrix and P is a permutation matrix, the solve phase follows the standard forward and backward substitution, as shown in Algorithm 2.