# Parallel memory-efficient all-at-once algorithms for the sparse matrix triple products in multigrid methods

###### Abstract

Multilevel/multigrid methods are among the most popular approaches for solving large sparse linear systems of equations, typically arising from the discretization of partial differential equations. One critical step in multilevel/multigrid methods is to form coarse matrices through a sequence of sparse matrix triple products. A commonly used approach computes each triple product explicitly in two steps, where each step performs a sparse matrix-matrix multiplication. This approach works well for many applications and achieves good computational efficiency, but it has a high memory overhead since auxiliary matrices must be stored temporarily to accomplish the calculations. In this work, we propose two new algorithms that construct a coarse matrix in one pass through the input matrices without involving any auxiliary matrices, thereby saving memory. The new approaches are referred to as “all-at-once” and “merged all-at-once” (a modified version of “all-at-once”) since the new algorithms carry out the two sparse matrix-matrix multiplications simultaneously; the traditional method is denoted as “two-step”. The all-at-once and the merged all-at-once algorithms are implemented based on hash tables in PETSc as part of this work, with careful attention to performance in terms of both compute time and memory usage. In the new methods, the first sparse matrix-matrix multiplication is implemented using a row-wise algorithm, and the second is based on an outer product. We numerically show that the proposed algorithms and their implementations scale nearly perfectly in both compute time and memory usage with up to 32,768 processor cores for a model problem with 27 billion unknowns. The scalability is also demonstrated for a realistic neutron transport problem with over 2 billion unknowns on a supercomputer using 10,000 processor cores. Compared with the traditional two-step method, the all-at-once and the merged all-at-once algorithms consume much less memory for both the model problem and the realistic neutron transport problem, while maintaining computational efficiency.


Fande Kong^{1}

Keywords

Sparse matrix triple product, sparse matrix-matrix multiplication, parallel processing, multilevel/multigrid methods, neutron transport equations

^{1} Computational Framework, Idaho National Laboratory

Corresponding author: Fande Kong, 2525 Fremont Ave, Idaho Falls, ID 83402

Email: fande.kong@inl.gov

## Introduction

Multigrid/multilevel methods, including geometric multigrid (GMG) and algebraic multigrid (AMG), are among the most popular approaches for solving a large sparse linear system of equations, Ax=b, arising from the discretization of partial differential equations (smith2004domain; stuben2000algebraic). The methods involve generating a sequence of linear systems of decreasing size during the setup phase, and iteratively improving the solution to the original system using the sequence of coarse systems in the solve phase. One critical component of multigrid/multilevel methods is to form a coarse operator C through a sparse matrix triple product of the restriction P^{T}, the fine operator A, and the interpolation P. Here P^{T} is the transpose of P when using the Galerkin method, and P can be generated either geometrically (kong2016highly; kong2017scalable; kong2018scalability; kong2018simulation) or algebraically (kong2018fully; kong2019highly). It is challenging to design an efficient parallel algorithm, and to develop the corresponding software, for the sparse matrix triple product since both memory efficiency and compute-time efficiency must be considered. Besides multigrid/multilevel methods, the sparse matrix triple product is also used in other areas such as Schur complement algorithms, so developing an efficient triple product algorithm has become an active research topic.

Let us briefly review a few of these developments; interested readers are referred to (ballard2016reducing; bulucc2012parallel; mccourt2013efficient) for more complete literature reviews. Many previous works have considered parallel algorithms for sparse matrix-matrix multiplications, both for general-purpose use (akbudak2014simultaneous) and for particular applications (challacombe2000general). In (ballard2016reducing), the authors propose and analyze a sparse version of the SUMMA algorithm, which was originally designed for dense matrices. (borvstnik2014sparse) uses a similar idea to convert a dense matrix algorithm (Cannon's algorithm) to a parallel sparse version for quantum-chemical applications, with special optimizations and tuning. In the context of multigrid/multilevel methods, (tuminaro2000parallel) concerns smoothed aggregation algebraic multigrid on distributed-memory supercomputers, where a row-wise algorithm is used for the sparse matrix-matrix multiplications of the two-step matrix triple product. (mccourt2013efficient) proposes a matrix coloring technique to cast a sparse matrix-matrix multiplication as a sparse-matrix dense-matrix operation, and the method is used in multigrid/multilevel methods. (ballard2016hypergraph) studies a hypergraph model to represent the communication costs of parallel algorithms for general sparse matrix-matrix multiplications, where a Galerkin triple product is employed as a case study; the authors conclude that the row-wise algorithm is communication efficient for the first matrix-matrix multiplication, but inefficient for the second one.

In most of the literature, the sparse matrix triple product, C=P^{T}AP, is typically formed using two separate sparse matrix-matrix multiplications, A\cdot P and P^{T}\cdot(AP) (ballard2016hypergraph; ballard2016reducing; petsc-user-ref). This two-step method works well for many applications, but it is difficult to use for certain memory-intensive applications such as neutron transport problems (kong2018fully; kong2019highly), since the two-step method needs to create auxiliary matrices, causing a high memory overhead, in order to carry out the calculations efficiently. To overcome this difficulty, we develop and study two all-at-once algorithms (including a merged version) that form the coarse operator C in one pass through P^{T}, A and P without creating any auxiliary matrices. Here a row-wise algorithm is employed for the first matrix-matrix multiplication, and an outer product is adopted for the second. Note that in the new all-at-once approaches the two matrix-matrix multiplications are carried out simultaneously, which is discussed in more detail in Section 3. Compared with the traditional two-step method, the new all-at-once algorithms and their implementations reduce the memory usage by a factor of 9 for a model problem and by a factor of 2.5 for a realistic neutron transport application, while maintaining good computational efficiency. We numerically show that the proposed algorithms work efficiently for both the model problem and the realistic problem with up to 27 billion unknowns on supercomputers using up to 32,768 processor cores.
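To make the one-pass idea concrete, the following is a minimal serial Python sketch (not the parallel PETSc implementation) of the all-at-once triple product. Matrices are stored as dicts of rows, a stand-in for the hash tables mentioned above; all names are illustrative. Each row of AP is formed row-wise, immediately scattered into C via an outer product with the corresponding row of P, and then discarded, so AP is never stored as a matrix.

```python
def triple_product_all_at_once(A, P, m):
    """One-pass sketch of C = P^T * A * P.

    A: n x n sparse matrix as dict-of-rows {i: {k: value}}
    P: n x m sparse matrix as dict-of-rows {k: {j: value}}
    m: number of coarse unknowns (columns of P)
    """
    C = {c: {} for c in range(m)}
    for i, arow in A.items():
        # Step 1 (row-wise): form row i of AP as a linear combination
        # of the rows of P selected by the nonzero columns of A(i,:).
        ap_row = {}
        for k, a_ik in arow.items():
            for j, p_kj in P.get(k, {}).items():
                ap_row[j] = ap_row.get(j, 0.0) + a_ik * p_kj
        # Step 2 (outer product): scatter row i of AP into C, scaled by
        # column i of P^T, i.e. by the entries of row i of P. The row of
        # AP is discarded after this loop, so AP is never materialized.
        for c, p_ic in P.get(i, {}).items():
            crow = C[c]
            for j, v in ap_row.items():
                crow[j] = crow.get(j, 0.0) + p_ic * v
    return C
```

Fusing the two multiplications this way trades the auxiliary matrix AP of the two-step method for a single temporary row, which is the source of the memory savings reported above.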

The rest of this paper is organized as follows. In Section 2, a row-wise matrix-matrix multiplication that serves as part of the calculations in a triple product is described in detail. Two new all-at-once algorithms based on hash tables for the sparse matrix triple products in multigrid/multilevel methods are described in Section 3, and numerical results for a model problem and a realistic neutron transport problem with up to 32,768 processor cores are given and discussed in Section 4. The results are summarized and conclusions are drawn in Section 5.

## Sparse matrix-matrix multiplications

There are different types of algorithms for sparse matrix-matrix multiplications. The algorithms are generally classified into 1D, 2D, and 3D according to the partitioning scheme used to assign the computations of the matrices to the processor cores. Based on the current infrastructure in PETSc, we consider 1D algorithms only in this work. In the 1D approaches, each processor core owns a chunk of the rows of a matrix under a row partition, or a chunk of the columns under a column partition. The 1D matrix-matrix multiplication algorithms are further divided into row-wise, column-wise, and outer-product variants. A matrix in PETSc is distributed row-wise, so a row-wise algorithm is chosen. Next we describe the row-wise algorithm that is employed in the first matrix-matrix multiplication of our new all-at-once approaches.

Let A\in\mathbb{R}^{n\times n} and P\in\mathbb{R}^{n\times m}, where n is associated with the number of unknowns on the fine mesh, and m is determined by the coarse mesh. The matrix-matrix multiplication is defined such that the entry in the ith row and jth column of the output matrix C, denoted as C(i,j), is calculated as follows

C(i,j)=\sum_{k=1}^{n}A(i,k)P(k,j).

The atomic task in the row-wise algorithm is the calculation of a row of C, that is,

C(i,:)=\sum_{k=1}^{n}A(i,k)P(k,:).
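This atomic task can be sketched in serial Python, with A in CSR format and each resulting row of C accumulated in a hash table (a dict), mirroring the hash-table-based implementation described above; the function name and data layout are illustrative, not PETSc API.

```python
def spmm_rowwise(A_indptr, A_indices, A_data, P_rows):
    """Row-wise sparse matrix-matrix product C = A * P.

    A is given in CSR form (indptr, indices, data); P_rows maps a row
    index k to a dict {j: P(k, j)}. Returns C as a list of per-row
    dicts, one hash table per row of A.
    """
    C = []
    n_rows = len(A_indptr) - 1
    for i in range(n_rows):
        crow = {}
        # C(i,:) = sum over nonzeros A(i,k) of A(i,k) * P(k,:)
        for idx in range(A_indptr[i], A_indptr[i + 1]):
            k = A_indices[idx]
            a_ik = A_data[idx]
            for j, p_kj in P_rows.get(k, {}).items():
                crow[j] = crow.get(j, 0.0) + a_ik * p_kj
        C.append(crow)
    return C
```

In the parallel setting, the rows P(k,:) with nonlocal k must first be communicated to the owner of row i, which is the communication pattern the row-wise algorithm is built around.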

Here the ith row of C, C(i,:), is a linear combination of the rows of P corresponding to the nonzero columns of the ith row of A, some of which are not local. In the row-wise algorithm, the matrix is equally distributed to the processors in block rows, where each processor owns n_{l}=n/np consecutive rows, as shown in the block partition below. Here np is the number of processor cores. For simplicity of discussion, we assume n is divisible by np. In reality, n may not be divisible by np, in which case some processor cores own a few more rows than others. Let A_{l} denote the rows of A owned by the lth processor core, that is,

A_{l}=A(ln_{l}:(l+1)n_{l}-1,:).
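For the case where n is not divisible by np, one common convention (PETSc's default layout behaves similarly) gives each of the first n mod np ranks one extra row. A minimal sketch, with an illustrative function name:

```python
def ownership_range(n, np_, rank):
    """Return the half-open range [start, end) of rows owned by `rank`
    in a 1D block-row partition of n rows over np_ processes; the first
    n % np_ ranks each own one extra row when n is not divisible by np_."""
    base, extra = divmod(n, np_)
    if rank < extra:
        start = rank * (base + 1)
        return start, start + base + 1
    start = extra * (base + 1) + (rank - extra) * base
    return start, start + base
```

For example, 10 rows over 4 processes are split into blocks of sizes 3, 3, 2, 2, and the divisible case reduces to the n_{l}=n/np blocks used in the text.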

For simplicity of description, we also define the pth block column of A_{l} as A_{lp}=A_{l}(:,pn_{l}:(p+1)n_{l}-1), consistent with the block-row partition of A shown below.

A=\left[\begin{array}{c}A_{1}\\ \hline A_{2}\\ \hline\vdots\\ \hline A_{np}\end{array}\right].