A Robust Hierarchical Solver for Ill-conditioned Systems with Applications to Ice Sheet Modeling

Chao Chen chenchao.nk@gmail.com Leopold Cambier lcambier@stanford.edu Erik G. Boman egboman@sandia.gov Sivasankaran Rajamanickam srajama@sandia.gov Raymond S. Tuminaro rstumin@sandia.gov Eric Darve darve@stanford.edu Stanford University Sandia National Laboratories
Abstract

A hierarchical solver is proposed for solving sparse ill-conditioned linear systems in parallel. The solver is based on a modification of the LoRaSp method, but employs a deferred-compression technique, which provably reduces the approximation error and significantly improves efficiency. Moreover, the deferred-compression technique introduces minimal overhead and does not affect parallelism. As a result, the new solver achieves linear computational complexity under mild assumptions and excellent parallel scalability. To demonstrate the performance of the new solver, we focus on applying it to solve sparse linear systems arising from ice sheet modeling. The strong anisotropic phenomena associated with the thin structure of ice sheets create serious challenges for existing solvers. To address the anisotropy, we additionally develop a customized partitioning scheme for the solver, which captures the strong-coupling direction accurately. In general, the partitioning can be computed algebraically with existing software packages, and thus the new solver is generalizable for solving other sparse linear systems. Our results show that ice sheet problems with about 300 million degrees of freedom can be solved in just a few minutes using a thousand processors.

keywords:
Hierarchical matrix, Sparse matrix, Ice sheet modeling, Parallel computing
journal: Journal of Computational Physics

1 Introduction

This paper considers the problem of solving large sparse linear systems, a fundamental building block but also often a computational bottleneck in many science and engineering applications. In particular, we target linear systems that result from the numerical discretization of elliptic partial differential equations (PDEs), including the Laplace, Stokes and Helmholtz equations (in the low- and middle-frequency regimes), using local schemes such as finite differences or finite elements. One challenge arises when the condition number of the problem is large and existing solvers become inefficient. Existing solvers fall into three classes. The first class is sparse direct solvers Davis et al. (2016), which leverage efficient ordering schemes to perform Gaussian elimination. However, they generally require $\mathcal{O}(N^2)$ computation and $\mathcal{O}(N^{4/3})$ storage for solving a three-dimensional problem of size $N$. These large costs seriously limit the application of sparse direct solvers to truly large-scale problems. The second class is iterative solvers such as the conjugate gradient method and the multigrid method. These methods may need only $\mathcal{O}(N)$ work and storage per iteration. However, the number of iterations required to achieve convergence can be quite large when solving ill-conditioned linear systems. Preconditioning is essential to improve the conditioning and convergence.

The third class of methods, which is the focus here, is hierarchical solvers. These methods compute an approximate factorization of a discretized elliptic PDE by taking advantage of the fact that the discretization matrix (and its inverse) has certain hierarchical low-rank structures, including $\mathcal{H}$-matrices Hackbusch (1999); Hackbusch and Khoromskij (2000), $\mathcal{H}^2$-matrices Hackbusch and Börm (2002); Hackbusch (2015) and hierarchically semiseparable (HSS) matrices Xia et al. (2010); Chandrasekaran et al. (2006), among others Amestoy et al. (2015); Aminfar et al. (2016). By exploiting this data sparsity of the underlying physical problem, hierarchical solvers have been shown to achieve linear or quasi-linear complexity. However, their efficiency deteriorates when highly ill-conditioned problems are encountered, because the rank (and hence the cost) of the approximations must increase dramatically in order to maintain the same accuracy in the final solution. Note that hierarchical solvers can be used either as direct solvers (high accuracy) or as preconditioners (low accuracy) for iterative methods. Our focus is on the latter.

In this paper, we introduce the deferred-compression technique to improve the efficiency of hierarchical solvers for sparse ill-conditioned linear systems and demonstrate this improvement by implementing it in a particular hierarchical solver named LoRaSp Pouransari et al. (2017). In hierarchical solvers such as LoRaSp, off-diagonal matrix blocks are compressed with low-rank approximations if they satisfy the strong admissibility condition Hackbusch and Börm (2002). In our new solver, these compressible matrix blocks are first scaled by the Cholesky factors of the corresponding diagonal blocks before low-rank approximations are applied. This extra scaling step provably reduces errors in the subsequent step of forming the Schur complement. In addition, it increases the likelihood that the Schur complement remains symmetric positive definite (SPD) under crude low-rank approximations when the original input matrix is SPD. For many practical applications, using the deferred-compression technique to preserve the SPD property of the underlying physical problem is crucial.

Previous deferred-compression work Xia and Gu (2010); Xia and Xin (2017); Xing and Chow (2018) focused on HSS matrices, which are a type of weakly admissible (as opposed to strongly admissible) hierarchical matrix. These approaches are not directly applicable to strongly admissible hierarchical matrices (e.g., $\mathcal{H}^2$-matrices) such as those underlying the LoRaSp solver. Furthermore, prior deferred-compression efforts concentrated on solving dense linear systems, where incorporating the deferred-compression step leads to an extra amount of computation, and no corresponding parallel solver was developed. Compared to the previously published papers, our deferred-compression technique is novel in three ways:

  1. we target hierarchical solvers specialized for strongly admissible hierarchical matrices, and develop an associated general error analysis; the previous analysis can be recovered as a special case of our new analysis.

  2. we propose a new solver for sparse linear systems for which we prove linear, i.e., $\mathcal{O}(N)$, computational complexity under some mild assumptions. This nearly optimal complexity implies that we can solve large problems with minimal asymptotic computational cost (up to constants).

  3. we show that incorporating the deferred-compression scheme into the LoRaSp solver does not change the data and task dependencies in the parallel LoRaSp solver Chen et al. (2016). Therefore, we can take advantage of the existing parallel algorithm to solve large-scale problems efficiently (on distributed-memory machines).

In order to demonstrate the performance of our new solver, this paper addresses the challenges of solving linear systems from a real-world problem—ice sheet modeling—where the solution of discretized linear systems remains the computational bottleneck. Ice sheet modeling is an essential component needed to estimate future sea-level rise due to climate change. As noted in Solomon (2007); Stocker (2014) from the Intergovernmental Panel on Climate Change (IPCC), modern ice sheet models must continue to introduce advanced features such as adaptive mesh refinement at sub-kilometer resolutions, optimization, data assimilation and uncertainty quantification for the treatment of numerous unknown model inputs. These advances will likely introduce further computational burdens, requiring improvements in the linear solver, which must be repeatedly invoked over the course of a simulation. Given that current ice sheet simulations already consume resources on thousands of processing units on modern supercomputers and can involve up to billions of unknown variables, there is a pressing need for efficient linear solvers to reduce simulation costs and to prepare for potentially larger, more sophisticated simulations in the future.

However, many existing solvers turn out to deliver rather disappointing performance for solving problems from ice sheet modeling. The most prominent challenge comes from the anisotropic nature of ice sheet models, where the thin vertical scale of the domain is tiny relative to the horizontal scale. This extraordinary contrast is also reflected in the dramatically different magnitudes of entries in the discretization matrix, where large values correspond to strong vertical coupling and tiny ones to weak horizontal coupling. This weak coupling gives rise to oscillatory eigenvectors associated with small eigenvalues and a poorly conditioned linear system. This can be seen from a simplified model $-u_{xx} - \epsilon\, u_{yy} = f$, where $\epsilon \ll 1$, for which the standard five-point finite-difference discretization on a regular $n \times n$ grid produces a matrix with many small eigenvalues

$$\lambda_{ij} = 4\sin^2\!\left(\frac{i\pi}{2(n+1)}\right) + 4\epsilon\,\sin^2\!\left(\frac{j\pi}{2(n+1)}\right)$$

for all values of $j$ and small values of $i$. Further, the Neumann boundary condition imposed on the top surface and some bottom parts of the domain gives rise to problematic linear systems with nearly singular matrix blocks. Physically, the bottom Neumann boundary condition models large ice shelves, which are vast areas of floating ice connected to land-based ice sheets and are common to Antarctica. The resulting Green's function decays much more slowly along the vertical direction than that for non-sliding ice at a frozen ice interface Tuminaro et al. (2016), again contributing to the poor performance of many existing solvers.
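As a quick numerical illustration (a self-contained sketch in Python/NumPy; the grid size n and coupling strength eps below are arbitrary illustrative choices, not values from the paper), one can form the five-point matrix for this model and count its small eigenvalues:

import numpy as np

n, eps = 32, 1e-4                # illustrative grid size and weak-coupling strength
I = np.eye(n)
T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D second-difference matrix

# Five-point discretization of -u_xx - eps * u_yy on an n-by-n grid (Dirichlet BCs)
A = np.kron(T, I) + eps * np.kron(I, T)

# Spectrum is lambda_ij = 4 sin^2(i pi/(2(n+1))) + 4 eps sin^2(j pi/(2(n+1)))
w = np.linalg.eigvalsh(A)
print("smallest/largest eigenvalue:", w[0], w[-1])
print("eigenvalues below 1e-2:", int(np.sum(w < 1e-2)))  # n of them, one per j

The n eigenvalues reported as small correspond to the lowest mode in the strongly coupled direction paired with every mode in the weakly coupled direction, matching the clustering of near-zero eigenvalues described above.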

The two solvers (preconditioners) commonly used in ice sheet modeling are the incomplete LU factorization (ILU) and the algebraic multigrid method (AMG). Although a customized ILU with a specific ordering scheme performs reasonably well for the Greenland ice sheet problem, its performance deteriorates significantly for the Antarctic ice sheet problem. The reason is that ice sheets in the Antarctic problem contain a substantial fraction of floating ice shelves, modeled by imposing Neumann boundary conditions, which leads to the aforementioned ill-conditioned linear systems. Another possible approach to solving the ice sheet linear systems is some form of AMG. However, standard AMG methods (e.g., the smoothed aggregation AMG solver Vaněk et al. (1996)) do not generally converge on realistic ice sheet simulations. While some specialized AMG techniques have been successfully developed (e.g., a customized matrix-dependent AMG solver Tuminaro et al. (2016)) using tailored semi-coarsening schemes, these approaches required significant non-trivial multigrid adaptations to address ice sheet simulations. Such adaptations are not generally provided with available AMG packages.

To solve the particular linear systems arising from ice sheet modeling efficiently, our new solver introduces one customization. Specifically, the typical meshes employed for ice sheet models are generated by first creating a two-dimensional unstructured horizontal mesh and then extruding this mesh into the vertical dimension to create a three-dimensional grid. This mesh structure is leveraged when building clusters for the hierarchical solver. In particular, the (two-dimensional unstructured) non-extruded mesh is first partitioned with a general graph partitioner, and then the horizontal partition is extended along the third/extruded direction such that mesh vertices lying on the same vertical line belong to the same cluster (see the sketch after this paragraph). Since extruded meshes appear frequently in geophysical problems such as atmospheric and oceanic circulation, oil and gas modeling, etc., our new solver along with the "extruded partitioning" algorithm can be applied to other engineering simulations involving thin structures. Compared with the ILU and AMG methods used for ice sheet modeling, our new solver is robust in the sense that the number of iterations stays nearly constant when it is used as a preconditioner for linear systems associated with ice sheet modeling, and it is general-purpose in that it can be applied as a "black-box" method with a general partitioning scheme available in several existing software packages, such as METIS/ParMETIS Karypis and Kumar (1998), Scotch Chevalier and Pellegrini (2008) and Zoltan Boman et al. (2012), though a special partitioner can also be easily incorporated. Moreover, it is challenging to parallelize the ILU and AMG methods on modern many-core architectures such as the GPU. Our new solver, similar to other hierarchical solvers, is mainly based on dense linear algebra subroutines and thus can potentially be accelerated using many-core processors.
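The following sketch shows the core of the extruded-partitioning idea in Python/NumPy (illustrative only; in practice the horizontal labels would come from METIS/ParMETIS, Scotch or Zoltan, and the vertex-numbering convention below is our assumption):

import numpy as np

def extruded_partition(horizontal_labels, n_layers):
    # horizontal_labels[i] is the cluster of 2D mesh vertex i; the 3D vertex
    # (i, layer) is assumed to be numbered layer * n_horizontal + i.
    # Replicating the 2D labels puts every vertex on a vertical line
    # into the same cluster.
    return np.tile(np.asarray(horizontal_labels), n_layers)

# Toy usage: 6 horizontal vertices in 2 clusters, extruded into 3 layers
labels2d = [0, 0, 1, 1, 0, 1]          # e.g., output of a graph partitioner
labels3d = extruded_partition(labels2d, n_layers=3)
print(labels3d.reshape(3, -1))         # each column (a vertical line) is constant

This way the strongly coupled vertical lines are never split across clusters, which is exactly the property the anisotropy calls for.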

To summarize, the paper presents a parallel hierarchical solver for sparse ill-conditioned linear systems using the deferred-compression technique, and in particular, our work makes the following three major contributions:

  1. Error analysis of the deferred-compression scheme for hierarchical solvers based on strongly admissible hierarchical matrices (e.g., $\mathcal{H}^2$-matrices).

  2. A parallel/distributed-memory hierarchical solver for sparse ill-conditioned linear systems.

  3. Application and analysis of the preconditioner for an ice sheet modeling problem, including numerical comparisons with ILU (a high-performance implementation in the Trilinos IFPACK package).

The remainder of this paper is organized as follows. Section 2 introduces the deferred-compression technique and provides an error analysis. Following that, the algorithm of our new solver is presented in Section 3. Next, Section 4 briefly summarizes the first-order-accurate Stokes approximation model of ice sheets and introduces the "extruded partitioning" algorithm. Finally, in Section 5 numerical results are given demonstrating the performance and scalability of our new solver for ice sheet modeling and also for general problems from the SuiteSparse Matrix Collection (https://sparse.tamu.edu/).

2 Deferred-compression Scheme

This section presents the algorithm for deferred compression and the corresponding error analysis, targeted at hierarchical solvers that are based on strongly admissible hierarchical matrices. These solvers employ low-rank approximations to compress off-diagonal matrix blocks that satisfy the strong-admissibility condition. A rigorous definition of the strong-admissibility condition can be found in Hackbusch and Börm (2002). From a high-level perspective, the strong-admissibility condition states that one block-row in an (appropriately partitioned) strongly admissible hierarchical matrix includes a diagonal block corresponding to "self-interaction," a full-rank off-diagonal block corresponding to "neighbor or near-field interaction," and a (numerically) low-rank off-diagonal block corresponding to "well-separated or far-field interaction." Therefore, a strongly admissible hierarchical matrix can be partitioned as the following block matrix

$$A = \begin{pmatrix} A_{ss} & A_{sn} & A_{sw} \\ A_{ns} & A_{nn} & A_{nw} \\ A_{ws} & A_{wn} & A_{ww} \end{pmatrix},$$

where "s" is a set of row/column indexes that we seek to eliminate via a Cholesky factorization, "n" stands for the set of indexes for which $A_{sn}$ and $A_{ns}$ are full rank, and "w" is used to denote the low-rank blocks $A_{sw}$ and $A_{ws}$. We further assume that $A$ is SPD in this paper, so $A_{ns} = A_{sn}^T$, $A_{ws} = A_{sw}^T$, and $A_{wn} = A_{nw}^T$.

Below we first review the classical Cholesky factorization and introduce some notation; then we analyze the errors in forming (approximate) Schur complements when the $A_{ss}$ block is eliminated with and without the deferred-compression scheme. To measure error, we use the matrix norm (a.k.a. 2-norm or operator norm), denoted by $\|\cdot\|$.

Cholesky factorization

To carry out one step of (block) Cholesky factorization on the $A_{ss}$ block in $A$, we define the following three matrices

$$G_s G_s^T = A_{ss}, \qquad L_s = \begin{pmatrix} G_s & 0 & 0 \\ A_{ns}G_s^{-T} & I & 0 \\ A_{ws}G_s^{-T} & 0 & I \end{pmatrix}, \qquad \mathcal{A} = L_s^{-1} A L_s^{-T},$$

where $G_s$ is the Cholesky factor of $A_{ss}$. The (exact) Schur complement $S$ is found in the lower block matrix of $\mathcal{A}$ as

$$\mathcal{A} = \begin{pmatrix} I & 0 \\ 0 & S \end{pmatrix}, \qquad S = \begin{pmatrix} A_{nn} - A_{ns}A_{ss}^{-1}A_{sn} & A_{nw} - A_{ns}A_{ss}^{-1}A_{sw} \\ A_{wn} - A_{ws}A_{ss}^{-1}A_{sn} & A_{ww} - A_{ws}A_{ss}^{-1}A_{sw} \end{pmatrix}. \qquad (1)$$

To actually compute the Cholesky factorization of the whole matrix $A$, the Schur complement $S$ needs to be further factorized; we skip this here since it is not relevant to the current discussion.

Without deferred-compression scheme

Suppose the low-rank matrix block $A_{sw}$ can be decomposed as

$$A_{sw} = Q B_{sw} + E_{sw}, \qquad \|E_{sw}\| \le \varepsilon, \qquad (2)$$

where $Q$ is an orthogonal matrix and $\varepsilon$ is a small prescribed tolerance. This kind of decomposition can be computed using, e.g., a rank-revealing QR factorization (RRQR). Dropping the $E_{sw}$ term in $A$ leads to the compressed matrix $\bar{A}$ as follows

$$\bar{A} = \begin{pmatrix} A_{ss} & A_{sn} & QB_{sw} \\ A_{ns} & A_{nn} & A_{nw} \\ B_{sw}^TQ^T & A_{wn} & A_{ww} \end{pmatrix},$$

where the low-rank approximation $QB_{sw}$ can be exploited to compute an approximate factorization of $A$ at a lower cost. Apply one step of the Cholesky factorization on the $A_{ss}$ block in $\bar{A}$ with

$$\bar{L}_s = \begin{pmatrix} G_s & 0 & 0 \\ A_{ns}G_s^{-T} & I & 0 \\ B_{sw}^TQ^TG_s^{-T} & 0 & I \end{pmatrix},$$

where $G_sG_s^T = A_{ss}$. As a result, $\bar{L}_s^{-1}\bar{A}\bar{L}_s^{-T}$ contains $\bar{S}$, an approximation of $S$ with error $\mathcal{E}$:

$$\bar{S} = S + \mathcal{E}, \qquad (3)$$
$$\mathcal{E} = \begin{pmatrix} 0 & A_{ns}A_{ss}^{-1}E_{sw} \\ E_{sw}^TA_{ss}^{-1}A_{sn} & E_{sw}^TA_{ss}^{-1}A_{sw} + A_{ws}A_{ss}^{-1}E_{sw} - E_{sw}^TA_{ss}^{-1}E_{sw} \end{pmatrix}. \qquad (4)$$
Proposition 1.

Assume Eq. (2) holds. Then the error between the two Schur complements, namely $S$ in $\mathcal{A}$ and $\bar{S}$ in $\bar{L}_s^{-1}\bar{A}\bar{L}_s^{-T}$, takes the form in Eq. (4). Moreover, the following error estimates hold

  1. $\|\mathcal{E}_{nw}\| \le \dfrac{\varepsilon\,\|A_{ns}\|}{\sigma_{\min}(A_{ss})}$,

  2. $\|\mathcal{E}_{wn}\| \le \dfrac{\varepsilon\,\|A_{ns}\|}{\sigma_{\min}(A_{ss})}$,

  3. and $\|\mathcal{E}_{ww}\| \le \dfrac{\varepsilon\,(2\|A_{sw}\| + \varepsilon)}{\sigma_{\min}(A_{ss})}$,

where $\mathcal{E}_{nw}$, $\mathcal{E}_{wn}$ and $\mathcal{E}_{ww}$ stand for the (1,2) block, (2,1) block and (2,2) block in $\mathcal{E}$.

Proof.

The first part of the proposition is already shown above, so we derive the three error bounds as follows.

$$\|\mathcal{E}_{nw}\| = \|A_{ns}A_{ss}^{-1}E_{sw}\| \le \|A_{ns}\|\,\|A_{ss}^{-1}\|\,\|E_{sw}\| \le \frac{\varepsilon\,\|A_{ns}\|}{\sigma_{\min}(A_{ss})}, \qquad (5)$$

where Eq. (2), $\|E_{sw}\| \le \varepsilon$ and $\|A_{ss}^{-1}\| = 1/\sigma_{\min}(A_{ss})$ are used, and $\sigma_{\min}(\cdot)$ denotes the smallest singular value of a matrix. The bound on $\mathcal{E}_{wn} = \mathcal{E}_{nw}^T$ follows by symmetry. For the (2,2) block,

$$\|\mathcal{E}_{ww}\| \le \|E_{sw}^TA_{ss}^{-1}A_{sw}\| + \|A_{ws}A_{ss}^{-1}E_{sw}\| + \|E_{sw}^TA_{ss}^{-1}E_{sw}\| \le \frac{\varepsilon\,(2\|A_{sw}\| + \varepsilon)}{\sigma_{\min}(A_{ss})},$$

where we used the equality $\|XX^T\| = \|X\|^2$ for any matrix $X$ (applied to $X = E_{sw}^TG_s^{-T}$, so that $\|E_{sw}^TA_{ss}^{-1}E_{sw}\| = \|G_s^{-1}E_{sw}\|^2 \le \varepsilon^2/\sigma_{\min}(A_{ss})$). ∎

For ill-conditioned problems such as linear systems from ice sheet modeling, the diagonal matrix block $A_{ss}$ can be nearly singular, so $\sigma_{\min}(A_{ss})$ is very small. As a result, the error $\mathcal{E}$ can be large. Worse still, due to this large error the approximate Schur complement $\bar{S}$ may become indefinite, and the Cholesky factorization of diagonal blocks can break down. This leads to a poor approximation of the exact Schur complement $S$, an SPD matrix.

The above error analysis extends to all hierarchical solvers based on strongly admissible hierarchical matrices (with potentially minor modifications) and shows that the low-rank truncation error $\varepsilon$ needs to decrease at least as fast as $\sigma_{\min}(A_{ss})$ to maintain the same error tolerance on $S$.

With deferred-compression scheme

Before compressing the off-diagonal matrix block $A_{sw}$ directly, we first scale it by the inverse of the Cholesky factor of $A_{ss}$. Specifically, the Cholesky factorization $A_{ss} = G_sG_s^T$ of the diagonal block is used to scale the first block row and column of $A$ as follows:

$$A^{sc} = D_s^{-1} A D_s^{-T} = \begin{pmatrix} I & G_s^{-1}A_{sn} & G_s^{-1}A_{sw} \\ A_{ns}G_s^{-T} & A_{nn} & A_{nw} \\ A_{ws}G_s^{-T} & A_{wn} & A_{ww} \end{pmatrix},$$

where $D_s = \mathrm{diag}(G_s, I, I)$. Then the $G_s^{-1}A_{sw}$ block is compressed with a low-rank approximation. Similar to Eq. (2), assume a low-rank decomposition of $G_s^{-1}A_{sw}$ as

$$G_s^{-1}A_{sw} = \hat{Q}\hat{B}_{sw} + \hat{E}_{sw}, \qquad \|\hat{E}_{sw}\| \le \hat{\varepsilon}, \qquad (6)$$

where $\hat{Q}$ is orthogonal and $\hat{\varepsilon}$ is a small prescribed tolerance. One way to relate Eq. (6) to Eq. (2) is the following. Define $\tilde{Q} = G_s\hat{Q}$; then Eq. (6) is equivalent to $A_{sw} = \tilde{Q}\hat{B}_{sw} + G_s\hat{E}_{sw}$, where $\tilde{Q}$ is orthogonal in terms of the inner product defined by the SPD matrix $A_{ss}^{-1}$, i.e., $\tilde{Q}^TA_{ss}^{-1}\tilde{Q} = I$.

Replacing $G_s^{-1}A_{sw}$ by $\hat{Q}\hat{B}_{sw}$ in $A^{sc}$ leads to the compressed matrix $\bar{A}^{sc}$ as follows

$$\bar{A}^{sc} = \begin{pmatrix} I & G_s^{-1}A_{sn} & \hat{Q}\hat{B}_{sw} \\ A_{ns}G_s^{-T} & A_{nn} & A_{nw} \\ \hat{B}_{sw}^T\hat{Q}^T & A_{wn} & A_{ww} \end{pmatrix},$$

where the (1,1) block is the identity because of the scaling. Carrying out one step of Cholesky factorization on this block in $\bar{A}^{sc}$ with

$$\hat{L}_s = \begin{pmatrix} I & 0 & 0 \\ A_{ns}G_s^{-T} & I & 0 \\ \hat{B}_{sw}^T\hat{Q}^T & 0 & I \end{pmatrix}$$

produces the following Schur complement $\hat{S}$, another approximation of $S$:

$$\hat{S} = S + \hat{\mathcal{E}}. \qquad (7)$$
Proposition 2.

Assume Eq. (6) holds. Then the error between the two Schur complements, namely $S$ in $\mathcal{A}$ and $\hat{S}$ in $\hat{L}_s^{-1}\bar{A}^{sc}\hat{L}_s^{-T}$, is the following

$$\hat{\mathcal{E}} = \hat{S} - S = \begin{pmatrix} 0 & A_{ns}G_s^{-T}\hat{E}_{sw} \\ \hat{E}_{sw}^TG_s^{-1}A_{sn} & \hat{E}_{sw}^T\hat{E}_{sw} \end{pmatrix}. \qquad (8)$$

Moreover, the following error estimates hold

  1. $\|\hat{\mathcal{E}}_{nw}\| \le \hat{\varepsilon}\,\|A_{nn}\|^{1/2}$,

  2. $\|\hat{\mathcal{E}}_{wn}\| \le \hat{\varepsilon}\,\|A_{nn}\|^{1/2}$,

  3. and $\|\hat{\mathcal{E}}_{ww}\| \le \hat{\varepsilon}^2$,

where $\hat{\mathcal{E}}_{nw}$, $\hat{\mathcal{E}}_{wn}$ and $\hat{\mathcal{E}}_{ww}$ stand for the (1,2) block, (2,1) block and (2,2) block in $\hat{\mathcal{E}}$.

Proof.

We first show Eq. (8) as follows: the (1,2) block of $\hat{S}$ is

$$A_{nw} - A_{ns}G_s^{-T}\hat{Q}\hat{B}_{sw} = A_{nw} - A_{ns}G_s^{-T}\big(G_s^{-1}A_{sw} - \hat{E}_{sw}\big) = S_{nw} + A_{ns}G_s^{-T}\hat{E}_{sw},$$

and the (2,2) block is

$$A_{ww} - \hat{B}_{sw}^T\hat{Q}^T\hat{Q}\hat{B}_{sw} = A_{ww} - \hat{B}_{sw}^T\hat{B}_{sw} = S_{ww} + \hat{E}_{sw}^T\hat{E}_{sw},$$

where Eq. (6) and the orthogonality of $\hat{Q}$ ($\hat{Q}^T\hat{Q} = I$ and $\hat{Q}^T\hat{E}_{sw} = 0$) are used. Next, we can prove the three error bounds easily:

$$\|\hat{\mathcal{E}}_{nw}\| \le \hat{\varepsilon}\,\|A_{ns}G_s^{-T}\| = \hat{\varepsilon}\,\|A_{ns}A_{ss}^{-1}A_{sn}\|^{1/2} \le \hat{\varepsilon}\,\|A_{nn}\|^{1/2}, \qquad \|\hat{\mathcal{E}}_{ww}\| = \|\hat{E}_{sw}^T\hat{E}_{sw}\| = \|\hat{E}_{sw}\|^2 \le \hat{\varepsilon}^2,$$

where the last inequality in the first estimate holds because $A_{nn} - A_{ns}A_{ss}^{-1}A_{sn}$, a Schur complement of the SPD matrix $A$, is SPD; the bound on $\hat{\mathcal{E}}_{wn} = \hat{\mathcal{E}}_{nw}^T$ follows by symmetry, which finishes the proof. ∎

As the above proposition shows, the approximate Schur complement computed with the deferred-compression scheme is much more accurate than that computed without it, especially when the problem is highly ill-conditioned. In other words, if the error tolerance is fixed, our new solver can deploy a (much) larger truncation error, reducing the setup/factorization cost of a hierarchical solver significantly. For example, in our numerical experiments we will show that our new hierarchical solver (with deferred compression) performs better than the original LoRaSp solver (without it). In particular, the error $\hat{\mathcal{E}}$ is an order of magnitude smaller than $\mathcal{E}$ and does not depend on $\sigma_{\min}(A_{ss})$. Furthermore, $\hat{\mathcal{E}}_{ww} = \hat{E}_{sw}^T\hat{E}_{sw}$ is now symmetric positive semi-definite, which implies the following.

Corollary 1.

The $\hat{S}_{ww}$ block, i.e., the (2,2) block in $\hat{S}$, is SPD.

Proof.

The following equality holds according to Eq. (8):

$$\hat{S}_{ww} = S_{ww} + \hat{E}_{sw}^T\hat{E}_{sw}.$$

Since the original matrix $A$ is SPD, the exact Schur complement $S$ and its (2,2) block $S_{ww}$ are both SPD. It is also obvious that $\hat{E}_{sw}^T\hat{E}_{sw}$ is a symmetric positive semi-definite matrix. Therefore, $\hat{S}_{ww}$ is SPD. ∎

In general, the matrix $\hat{S}$ itself is not necessarily an SPD matrix for an arbitrary tolerance $\hat{\varepsilon}$. However, we observe that the matrix remains SPD for a much larger tolerance (lower cost) with the deferred-compression scheme than in the original algorithm.
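The effect of the two propositions can be reproduced with a small self-contained experiment (a sketch in Python/NumPy/SciPy; the block sizes, truncation rank and the scaling used to make $A_{ss}$ nearly singular are all our illustrative choices, not an experiment from the paper): compress $A_{sw}$ directly versus compressing the scaled block $G_s^{-1}A_{sw}$, then form the exact Schur complement of each compressed matrix.

import numpy as np
from scipy.linalg import cholesky, solve_triangular, svd

rng = np.random.default_rng(0)
ns, nn, nw, k = 40, 40, 60, 15        # block sizes and truncation rank (illustrative)
N = ns + nn + nw

X = rng.standard_normal((N, N))
A = X @ X.T + 1e-2 * np.eye(N)        # SPD test matrix
A[:ns, :] *= 1e-4                     # scale the "s" rows ...
A[:, :ns] *= 1e-4                     # ... and columns, so A_ss is nearly singular

s, r = slice(0, ns), slice(ns, N)     # "r" lumps the n and w index sets
w = slice(ns + nn, N)

def schur(Asw):
    # Exact Schur complement after substituting a (possibly compressed) A_sw
    B = A.copy()
    B[s, w], B[w, s] = Asw, Asw.T
    return B[r, r] - B[r, s] @ np.linalg.solve(B[s, s], B[s, r])

def rank_k(M, k):
    U, sig, Vt = svd(M, full_matrices=False)
    return (U[:, :k] * sig[:k]) @ Vt[:k]

S_exact = schur(A[s, w])
S_plain = schur(rank_k(A[s, w], k))              # compress A_sw directly
G = cholesky(A[s, s], lower=True)                # A_ss = G G^T
Y = solve_triangular(G, A[s, w], lower=True)     # G^{-1} A_sw
S_dc = schur(G @ rank_k(Y, k))                   # compress the scaled block

print("error without DC:", np.linalg.norm(S_plain - S_exact, 2))
print("error with    DC:", np.linalg.norm(S_dc - S_exact, 2))
print("ww block SPD with DC?", bool(np.linalg.eigvalsh(S_dc[nn:, nn:]).min() > 0))

Consistent with the analysis, the deferred-compression error does not blow up as $\sigma_{\min}(A_{ss})$ shrinks, and the $ww$ block of $\hat{S}$ stays SPD (Corollary 1).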

Overall, the differences between computing an approximate Schur complement of $A$ with and without the deferred-compression scheme are summarized in Table 1.

|                            | Without DC                                                   | With DC                                              |
|----------------------------|--------------------------------------------------------------|------------------------------------------------------|
| Matrix                     | $\bar{A}$                                                    | $\bar{A}^{sc}$                                       |
| Low-rank approximation     | Eq. (2): $A_{sw} \approx QB_{sw}$                            | Eq. (6): $G_s^{-1}A_{sw} \approx \hat{Q}\hat{B}_{sw}$ |
| Schur complement           | Eq. (3)                                                      | Eq. (7)                                              |
| $ww$ block                 | may be indefinite                                            | always SPD                                           |
| Error for $nw$/$wn$ blocks | $\varepsilon\,\|A_{ns}\|/\sigma_{\min}(A_{ss})$              | $\hat{\varepsilon}\,\|A_{nn}\|^{1/2}$                |
| Error for $ww$ block       | $\varepsilon\,(2\|A_{sw}\|+\varepsilon)/\sigma_{\min}(A_{ss})$ | $\hat{\varepsilon}^2$                              |

Table 1: Differences between computing an approximate Schur complement of $A$ with and without the deferred-compression (DC) scheme. "$nw$/$wn$/$ww$ block" refers to the corresponding blocks in the computed (approximate) Schur complements.

3 Improved LoRaSp Solver

In this section, we complete the algorithm description of our new hierarchical solver, obtained by implementing the deferred-compression technique in the original LoRaSp solver. Our goal is to solve an (ill-conditioned) SPD linear system

$$Ax = b, \qquad (9)$$

and our solver is based on a clustering of the unknown variables in Eq. (9).

Matrix Partitioning

Define $G(A) = (V, E)$ as the (undirected) graph corresponding to the symmetric matrix $A$: vertices in $V$ correspond to row/column indexes in $A$, and an edge exists between vertices $i$ and $j$ if $A_{ij} \neq 0$. A clustering of unknown variables in Eq. (9) is equivalent to a partitioning of the graph $G(A)$. Graph partitioning is a well-studied problem and can be computed algebraically using techniques such as spectral partitioning and multilevel methods in existing high-performance packages, such as METIS/ParMETIS Karypis and Kumar (1998), Scotch Chevalier and Pellegrini (2008) and Zoltan Boman et al. (2012).
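As a minimal illustration of the spectral approach (a dense sketch in Python/NumPy of our own; production packages use multilevel algorithms and sparse data structures):

import numpy as np

def spectral_bisection(A):
    # Split the graph of the symmetric matrix A into two clusters using the
    # sign of the Fiedler vector (eigenvector of the graph Laplacian for the
    # second-smallest eigenvalue) -- the basic spectral-partitioning idea.
    W = (np.abs(A) > 0).astype(float)
    np.fill_diagonal(W, 0.0)                  # drop self-loops
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian
    lam, V = np.linalg.eigh(L)                # eigenvalues in ascending order
    return (V[:, 1] >= 0).astype(int)         # sign pattern of the Fiedler vector

# Toy usage: a 6-vertex path graph splits into its two halves
A = np.eye(6) + np.eye(6, k=1) + np.eye(6, k=-1)
print(spectral_bisection(A))                  # e.g., [0 0 0 1 1 1]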

Our hierarchical solver computes an approximate factorization of $A$ by compressing fill-in blocks generated during Gaussian elimination. The key observation is that the fill-in blocks have low-rank structures, i.e., their singular values decay fast. Intuitively, the inverse of a diagonal block in the discretization matrix corresponds to the discrete Green's function of a local elliptic PDE, which has numerically low-rank off-diagonal matrix blocks. The same low-rank property also carries over to the Schur complement Amestoy et al. (2015); Aminfar et al. (2016); Pouransari et al. (2017); Xia et al. (2010); Ho and Ying (2016).
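This low-rank property is easy to observe numerically (a small self-contained check in Python/NumPy with an illustrative grid, not an experiment from the paper): the block of the inverse 2D Laplacian coupling two well-separated clusters has rapidly decaying singular values.

import numpy as np

n = 24                                   # 2D Laplacian on an n-by-n grid (illustrative)
I = np.eye(n)
T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A = np.kron(T, I) + np.kron(I, T)
Ainv = np.linalg.inv(A)                  # discrete Green's function (dense, demo only)

# Indices of two well-separated clusters: the left and right thirds of the grid
left  = [i * n + j for i in range(n) for j in range(n // 3)]
right = [i * n + j for i in range(n) for j in range(2 * n // 3, n)]

sv = np.linalg.svd(Ainv[np.ix_(left, right)], compute_uv=False)
print("block size:", len(left), "x", len(right))
print("numerical rank at 1e-8:", int(np.sum(sv > 1e-8 * sv[0])))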

Below, we first illustrate applying the deferred-compression technique and the "low-rank elimination" step (called "scaled low-rank elimination" in the following) to one cluster of unknown variables in Section 3.1. Then we present the whole algorithm in Section 3.2 and the complexity analysis in Section 3.3.

3.1 Scaled Low-rank Elimination

Let $\mathcal{C} = \{c_1, c_2, \ldots, c_m\}$ denote a clustering of all unknown variables in Eq. (9), and without loss of generality, assume that the matrix $A$ is partitioned and ordered accordingly, e.g., the first block row/column corresponds to $c_1$. Two clusters $c_i$ and $c_j$ are defined as "neighbors" if the matrix block $A_{c_ic_j} \neq 0$. In other words, the neighbors of a cluster are the set of adjacent clusters in $G(A)$; a small routine for computing these neighbor sets is sketched below.
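For concreteness, the neighbor relation can be computed directly from the sparsity pattern (a minimal sketch in Python/SciPy; the function name and the toy matrix are ours):

import numpy as np
import scipy.sparse as sp

def cluster_neighbors(A, labels):
    # Clusters c_i and c_j are neighbors if the block A_{c_i c_j} is nonzero,
    # i.e., some edge of the graph of A connects the two clusters.
    A = sp.coo_matrix(A)
    nbrs = {c: set() for c in np.unique(labels)}
    for i, j in zip(A.row, A.col):
        if labels[i] != labels[j]:
            nbrs[labels[i]].add(labels[j])
    return nbrs

# Toy usage: a path graph 0-1-2-3-4-5 with clusters {0,1}, {2,3}, {4,5}
A = sp.diags([1.0, 1.0], [-1, 1], shape=(6, 6)) + sp.eye(6)
print(cluster_neighbors(A, np.array([0, 0, 1, 1, 2, 2])))  # {0:{1}, 1:{0,2}, 2:{1}}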

To describe the "scaled low-rank elimination" step, we partition the matrix $A$ in the familiar way

$$A = \begin{pmatrix} A_{ss} & A_{sn} & A_{sw} \\ A_{ns} & A_{nn} & A_{nw} \\ A_{ws} & A_{wn} & A_{ww} \end{pmatrix},$$

where the "s" block corresponds to $c_1$, the "n" block corresponds to the neighbors of $c_1$ and the "w" block corresponds to the rest. Based on our definition of neighbors above, $A_{sw} = A_{ws}^T = 0$. In this case, and generally whenever $A_{sw} = 0$, the "scaled low-rank elimination" step reduces to a normal block Cholesky factorization.

As in Section 2, denote by $L_1$ the matrix corresponding to one step of block Cholesky factorization and by $S^{(1)}$ the Schur complement, i.e.,

$$A = L_1 \begin{pmatrix} I & 0 \\ 0 & S^{(1)} \end{pmatrix} L_1^T.$$

Again, we can partition $S^{(1)}$ into the following block matrix

$$S^{(1)} = \begin{pmatrix} S^{(1)}_{ss} & S^{(1)}_{sn} & S^{(1)}_{sw} \\ S^{(1)}_{ns} & S^{(1)}_{nn} & S^{(1)}_{nw} \\ S^{(1)}_{ws} & S^{(1)}_{wn} & S^{(1)}_{ww} \end{pmatrix},$$

where the "s" block corresponds to $c_2$, the "n" block corresponds to the neighbors of $c_2$ and the "w" block includes all remaining vertices. Assume $S^{(1)}_{sw} \neq 0$, which contains fill-in generated from the previous elimination of $c_1$. To simplify notation, we will drop the superscript of matrix blocks in $S^{(1)}$.

The "scaled low-rank elimination" step involves three operators: a scaling operator $D_s$, a sparsification operator $Q_s$ and a Gaussian elimination operator $E_s$. The scaling operator is defined as follows

$$D_s = \begin{pmatrix} G_s & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & I \end{pmatrix}, \qquad (10)$$

where $A_{ss} = G_sG_s^T$ is the Cholesky factorization.

After the scaling operator is applied, the off-diagonal block $G_s^{-1}A_{sw}$ in $D_s^{-1}AD_s^{-T}$ is compressed with a low-rank approximation, as in Eq. (6). This compression step is exactly the same as in the deferred-compression scheme. Instead of eliminating the "s" block directly, the next step applies the sparsification operator

$$Q_s = \begin{pmatrix} \begin{pmatrix} \hat{Q} & \hat{Q}_\perp \end{pmatrix} & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & I \end{pmatrix}, \qquad (11)$$

where $\begin{pmatrix} \hat{Q} & \hat{Q}_\perp \end{pmatrix}$ is a square orthogonal matrix whose leading columns come from Eq. (6), and introduces a zero block as below

$$Q_s^T\,\bar{A}^{sc}\,Q_s = \begin{pmatrix} I & 0 & \hat{Q}^TG_s^{-1}A_{sn} & \hat{B}_{sw} \\ 0 & I & \hat{Q}_\perp^TG_s^{-1}A_{sn} & 0 \\ A_{ns}G_s^{-T}\hat{Q} & A_{ns}G_s^{-T}\hat{Q}_\perp & A_{nn} & A_{nw} \\ \hat{B}_{sw}^T & 0 & A_{wn} & A_{ww} \end{pmatrix},$$

where $\bar{A}^{sc}$ denotes the scaled and compressed matrix, and the first two block rows/columns correspond to the splitting of "s" induced by $\begin{pmatrix} \hat{Q} & \hat{Q}_\perp \end{pmatrix}$. Notice that the leading identity has the same size as the number of columns in $\hat{Q}$, i.e., the rank of the low-rank approximation in Eq. (6).

After the sparsification step, a cluster of unknown variables can be split into "coarse" unknown variables (associated with $\hat{Q}$) and "fine" unknown variables (associated with $\hat{Q}_\perp$), where the fine part involves no fill-in. Then the fine part is eliminated, which does not propagate any existing fill-in (no level-2 fill-in is introduced).

The Gaussian elimination operator

$$E_s = \begin{pmatrix} I & 0 & 0 & 0 \\ 0 & I & 0 & 0 \\ 0 & A_{ns}G_s^{-T}\hat{Q}_\perp & I & 0 \\ 0 & 0 & 0 & I \end{pmatrix} \qquad (12)$$

eliminates the "fine" unknown variables as follows

$$E_s^{-1}\big(Q_s^T\,\bar{A}^{sc}\,Q_s\big)E_s^{-T} = \begin{pmatrix} I & 0 & \hat{Q}^TG_s^{-1}A_{sn} & \hat{B}_{sw} \\ 0 & I & 0 & 0 \\ A_{ns}G_s^{-T}\hat{Q} & 0 & \tilde{A}_{nn} & A_{nw} \\ \hat{B}_{sw}^T & 0 & A_{wn} & A_{ww} \end{pmatrix},$$

where $\tilde{A}_{nn} = A_{nn} - A_{ns}G_s^{-T}\hat{Q}_\perp\hat{Q}_\perp^TG_s^{-1}A_{sn}$.

Last, we introduce an auxiliary permutation operator, $P_s$, to permute the rows and columns corresponding to the "coarse" unknown variables to the end. $P_s$ is defined such that conjugation by it reorders the blocks from (coarse, fine, n, w) to (fine, n, w, coarse):

$$P_s^T\,M\,P_s \;\text{has block order (fine, n, w, coarse) when } M \text{ has block order (coarse, fine, n, w)}. \qquad (13)$$

Finally, define the "scaled low-rank elimination" operator $W_s = D_sQ_sE_sP_s$, which scales $A$, sparsifies the cluster "s", and selects and eliminates the fine DOFs in it. To summarize, we have derived

$$A \approx W_s \begin{pmatrix} I & 0 \\ 0 & A' \end{pmatrix} W_s^T,$$

where the identity corresponds to the eliminated fine DOFs and $A'$ is the matrix over the remaining (neighbor, well-separated and coarse) unknown variables. A dense-matrix sketch of this entire step is given below.
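The complete step can be prototyped on a dense SPD matrix in a few lines (a sketch in Python/NumPy/SciPy under our own simplifying assumptions: ordering (s, n, w), SVD in place of RRQR, a fixed rank k, and illustrative sizes):

import numpy as np
from scipy.linalg import cholesky, solve_triangular, svd

def scaled_lowrank_elimination(A, ns, nn, k):
    # One "scaled low-rank elimination" step; ordering is (s, n, w).
    G = cholesky(A[:ns, :ns], lower=True)                    # A_ss = G G^T (Eq. 10)
    Yn = solve_triangular(G, A[:ns, ns:ns+nn], lower=True)   # G^{-1} A_sn
    Yw = solve_triangular(G, A[:ns, ns+nn:], lower=True)     # G^{-1} A_sw

    U, sig, _ = svd(Yw, full_matrices=True)                  # basis for Eq. (6)/(11)
    Qc, Qf = U[:, :k], U[:, k:]                              # coarse / fine directions
    Bw = Qc.T @ Yw                                           # retained s-w coupling
    Bn = Qc.T @ Yn                                           # coarse-n coupling

    # Fine DOFs now have identity diagonal and couple only to "n" (the dropped
    # fine-w coupling has norm sig[k], the truncation error); eliminate them (Eq. 12)
    Snn = A[ns:ns+nn, ns:ns+nn] - (Yn.T @ Qf) @ (Qf.T @ Yn)

    # Reassemble with the k coarse DOFs permuted to the end (Eq. 13)
    Anw, Aww = A[ns:ns+nn, ns+nn:], A[ns+nn:, ns+nn:]
    return np.block([[Snn,   Anw, Bn.T],
                     [Anw.T, Aww, Bw.T],
                     [Bn,    Bw,  np.eye(k)]])

# Toy usage on a random SPD matrix (sizes arbitrary)
rng = np.random.default_rng(1)
X = rng.standard_normal((80, 80))
A = X @ X.T + 80 * np.eye(80)
print(scaled_lowrank_elimination(A, ns=16, nn=24, k=5).shape)   # (69, 69)

Repeating this step over all clusters and then recursing on the coarse DOFs gives exactly the structure of Algorithm 1 in the next subsection.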

3.2 Entire Algorithm

We have introduced the "scaled low-rank elimination" step for one cluster. The algorithm repeatedly applies this step to all clusters in $\mathcal{C}$. This process is equivalent to computing an approximate factorization of the input SPD matrix $A$, subject to the error of the low-rank approximations. After all clusters are processed, one is left with a linear system consisting of the "coarse" unknown variables, and we can apply the same idea to this coarse system recursively. The entire algorithm is shown in Algorithm 1.

1:procedure Hierarchical_Factor($A$)
2:     if the size of $A$ is small enough then
3:         Factorize $A$ with the conventional Cholesky factorization
4:         return
5:     end if
6:     Partition the graph of $A$ and obtain vertex clusters $c_1, c_2, \ldots, c_m$
7:         ▷ $m$ is chosen to get roughly constant cluster sizes
8:     $A_0 \leftarrow A$
9:     for $i = 1$ to $m$ do
10:         $A_i \leftarrow$ Scaled_LowRank_Elimination($A_{i-1}$, $c_i$, $\hat{\varepsilon}$)
11:     end for
12:     Extract $A_c$ from the block diagonal matrix $A_m$
13:         ▷ $A_c$ is the Schur complement for the coarse DOFs
14:     Hierarchical_Factor($A_c$)
15:         ▷ Recursive call with a smaller matrix
16:     return
17:         ▷ The approximate factorization is not written out explicitly
18:end procedure
19:
20:procedure Scaled_LowRank_Elimination($A_{i-1}$, $c_i$, $\hat{\varepsilon}$)
21:     Extract the blocks $A_{ss}$, $A_{sn}$ and $A_{sw}$ associated with $c_i$ from $A_{i-1}$
22:     Compute the scaled low-rank elimination operator $W_i = D_iQ_iE_iP_i$ based on $c_i$ and $\hat{\varepsilon}$
23:         ▷ $D_i$, $E_i$ and $P_i$ are defined in Eq. 10, Eq. 12 and Eq. 13; $Q_i$ in Eq. 11
24:     $A_i \leftarrow W_i^{-1}A_{i-1}W_i^{-T}$
25:         ▷ $A_i$ has the same size as $A_{i-1}$
26:     return $A_i$
27:end procedure
28: ▷ Notation: $x \leftarrow y$ means assign the value $y$ to $x$, whereas $x = y$ means they are equivalent
Algorithm 1 Hierarchical solver: factorization phase

Similar to sparse direct solvers, Algorithm 1 outputs an approximate factorization of the original matrix $A$, which is used to solve the linear system $Ax = b$. Since the $D_i$ and $Q_i$ are block diagonal matrices, the $E_i$ are triangular matrices and the $P_i$ are permutation matrices, the solve phase follows the standard forward and backward substitution, as shown in Algorithm 2.

1:procedure Hierarchical_Solve($F$, $b$)    ▷ $F$ is the factorization from Algorithm 1
2:     $y \leftarrow$ Forward_Substitution($F$, $b$)
3:     $x \leftarrow$ Backward_Substitution($F$, $y$)
4:     return $x$
5:end procedure
6: