Sparse approximate matrix multiplication in a fully recursive distributed task-based parallel frameworkJune 21, 2019

# Sparse approximate matrix multiplication in a fully recursive distributed task-based parallel framework††thanks: June 21, 2019

Anton G. Artemov Division of Scientific Computing, Department of Information Technology, Uppsala University, Box 337, SE-751 05 Uppsala, Sweden (anton.artemov@it.uu.se).
###### Abstract

In this paper we consider parallel implementations of approximate multiplication of large matrices with exponential decay of elements. Such matrices arise in computations related to electronic structure calculations and some other fields of science. Commonly, sparsity is introduced by truncation of input matrices. In turn, the sparse approximate multiplication algorithm [M. Challacombe and N. Bock, arXiv preprint 1011.3534, 2010] performs truncation of sub-matrix products. We consider these two methods and their combination, i.e. truncation of both input matrices and sub-matrix products. Implementations done using the Chunks and Tasks programming model and library [E. H. Rubensson and E. Rudberg, Parallel Comput., 40:328â343, 2014] are presented and discussed. Error bounds are derived. A comparison between the three methods in terms of performance is done on a model problem. The algorithms are also applied to matrices coming from large chemical systems with atoms.

## 1 Inroduction

Although dense matrix-matrix multiplication is a well-known procedure, its cubic complexity makes it very resource-demanding or even infeasible to use for large matrices. Much effort has been invested in finding better algorithms with reduced complexities, see for instance [43], [18]. However, when it comes to sparse matrices, the methods used in the community are very similar in terms of performance due to data formats.

As a rule, in sparse matrix algorithms, matrices are stored in special formats, such as compressed sparse rows (CSR) or compressed sparse columns (CSC) [42]. The drawback of these formats is that once one of them is chosen, operations with matrix transpose become difficult to perform in parallel. Using the idea of blocking, Buluç et al. [15] suggested compressed sparse blocks (CSB) format, which solves this problem, at least partially. Another application of blocking technique is the blocked CSR format (BCSR), which works better for matrices with zeros occurring in blocks [28]. All these storage formats and sparse matrix algorithms are designed for common-sense sparse matrices, i.e. such that have a few non-zero elements per row.

The problem of our interest falls somewhere in between dense and sparse matrix multiplication. In electronic structure calculations based on the Hartree–Fock method or the Kohn–Sham density functional theory, a key component is computation of a density matrix for a given Hamiltonian . A popular method is to perform a polynomial approximation of the function where is the Heaviside step function and is the chemical potential. This can be done in different ways. The two commonly used approaches are recursive application of low-order polynomials (density matrix purification, [32, 27]) and construction of Chebyshev polynomials [24, 5, 31]. Depending on the implementation, the core operation is often a matrix-matrix multiplication, thus it is the crucial component for overall performance and scaling of the code.

Matrices in electronic structure calculations have an important property of decay of elements. A matrix is said to obey an exponential decay with respect to if for some constants and or an algebraic decay if , where is a distance function defined on the index set of the matrix. The distance function corresponds to physical distance between some a priori chosen basis functions. It is convenient to look at such chemical system as at a spatial graph, where distance function is defined on the edges. Under certain assumptions small matrix elements can be ignored, as usually done to matrices arising in linearly scaling electronic structure calculations [30, 16, 38, 12]. In this context linear scaling means that the computational time is directly proportional to the size of the considered chemical system. Such matrices are not sparse in the common sense, since they may have from several to several thousands non-zero elements per row, depending on the type of basis functions used, and thus standard sparse matrix algorithms are not applicable here. However, the special structure of matrices with decay allows to achieve linear scaling, hence the name for the group of methods. Ideally, with linearly increasing computational power, it should give constant computational time. However, this is difficult to achieve due to communication costs, which start to play a significant role at some point.

The number of non-zero elements per row and decay properties dictate also that the storage formats listed above are unlikely to be usable in this context, because their strong point, i.e. compression, does not work well enough in this case. One alternative is introduced by Mohr et al. [31], where they use the so-called segment storage format (SSF). This format is based on the idea of grouping together consecutive non-zero entries in segments. Another approach is to exploit the hierarchical structure of a matrix, i.e. treat it as a matrix of matrices [36]. Similar ideas are used in [9, 35]. This representation is natural not only for the matrices which arise in electronic structure calculations, but in a wider class of problems including, but not limited to integral equations, partial differential equations, matrix equations and many more, see [7, 26] for extensive study. However, the notion of a hierarchical matrix used by Hackbusch [26] is different, since it is a data-sparse approximation of matrices, which are not necessarily sparse by construction.

Multiplication of matrices, which have more non-zero elements per row, than standard sparse matrices, or, in other words, nearly sparse matrices, is a challenging problem, especially since in general the sparsity pattern is not known in advance. In this case, load balancing becomes a substantial difficulty.

One of the first parallel implementations is done by Challacombe [16]. The author uses conventional MPI and a special data format with distributed blocked compressed sparse rows. The load balancing and locality issues are addressed by introducing a space filling curve, which is a heuristic solution for the travelling salesman problem. The efficiency is demonstrate up to 128 cores.

Better results are achieved by using a space filling curve ordering with two-dimensional matrix decompositions, for instance VandeVondele et al. [44] report scalability of atoms on 46656 cores.

An alternative approach is suggested by Buluç and Gilbert [14]. They address load balancing issues by randomly permuting rows and columns, which leads to high efficiencies in the general case. This approach is successfully applied in electronic structure calculations by Borštnik et al. [11].

Dawson and Nakajima [19] employ communication-avoiding 3D multiplication algorithm by Ballard et al. [6] in their NTPoly code. The load balancing is addressed in the same manner using permutations as in [14].

Rubensson and Rudberg [35] present a way of multiplying sparse matrices in parallel, preserving an important property of locality. The approach is based on a hierarchical representation of matrices and a task-based parallel programming model called Chunks and Tasks [34]. The load balancing issue is addressed by a scheduler, which utilizes a work stealing concept. A more detailed description of the model is presented in Section LABEL:cht_model.

Commonly, in electronic structure calculations, sparsity is maintained by truncation of small elements before and sometimes after multiplication. To his end, different strategies of choosing those elements exist, see for example the discussions in [23, 37]. Unlike the majority of methods, the algorithm considered in this article performs truncation of sub-matrix products to reduce the complexity of operations.

In this article we consider different aspects of parallel implementations of the regular multiplication of truncated matrices, sparse approximate matrix multiplication (SpAMM) and their combination in a fully recursive task-based parallel environment called Chunks and Tasks [34]. The main contributions of the article are the error analysis and estimations and the combined technique, which has reduced communication, but preserves the accuracy of the two original methods.

The article is organized as follows. Section LABEL:the_algorithm contains a brief description of the sparse approximate matrix multiplication algorithm (SpAMM), existing implementations. Section LABEL:error_control contains derivations related to the error control. In Section LABEL:inplementation_cht we describe our implementations within the Chunks and Tasks programming model, briefly discuss the programming model itself and the leaf level library. Section LABEL:results contains description of the experimental set-up, benchmarks and their results discussion. Some conclusions are given in Section LABEL:conclusion.

## 2 The SpAMM algorithm

The sparse approximate matrix multiplication algorithm introduced in [17, 9] belongs to a family of -body solvers, which use hierarchical approximations of sub-problems to reduce complexity, similarly to the famous fast multipole method by Greengard and Rokhlin [25]. The algorithm exploits a hierarchical representation of the matrix and the decay property, which in this case is preserved for all levels in the hierarchy, to skip multiplication of small sub-matrices. The decay property is preserved from level to level because the distance function does not change. Assuming that the matrices are represented as quad-trees such that

 At=(At+10,0At+10,1At+11,0At+11,1),

where is the level in the hierarchy and

 ∥At∥F= ⎷1∑i,j=0∥At+1i,j∥2F,

the SpAMM algorithm can be written recursively as in Algorithm LABEL:SPAMM_pseudocode. The squared Frobenius norm is kept along with each matrix, and, since it is additive, it can be conveniently computed recursively starting from the lowest level. The parameter determines truncation of sub-matrix products, i.e. in the product space, whereas commonly truncation is applied on the input matrices, i.e. in the vector space. (see for example [33]).

### 2.1 Serial implementations of SpAMM

Following the naive serial recursive implementation in [17], the highly optimized serial version of the algorithm is implemented by the authors of the algorithm in [9] using the technique of symbolic multiplication with linkless trees and SSE intrinsics. This implementation easily outperformed vendor-provided sgemm (single precision gemm routine of BLAS [21]) routines both from the Intel Math Kernel Library and AMD’s Core Math Library and even demonstrated lower error with than sgemm had. The authors noted that sgemm has stagnating performance with increasing matrix size due to memory bandwidth saturation, whereas SpAMM has increasing performance.

### 2.2 Parallel implementations of SpAMM

Bock et al. [10] present two parallel implementations of the SpAMM algorithm. The first one, which uses the OpenMP application programming interface, exploits parallel quad-tree traversal using untied task, i.e. a task that could be executed by more that one thread simultaneously. The implementation demonstrates good parallel scaling behavior with efficiency up to 80 %. However, the authors notice that at some point load balancing becomes an issue with decreasing the quad-tree height.

The second implementation by the same authors targets distributed memory machines and is done using the Charm++ parallel programming model [29]. The implementation is tested on a large cluster with more than 24000 cores. A presence of modest serial component is noted, which is likely to be attributed to the Charm++ runtime. The load balancing problem is addressed by Charm++ built-in load balancer, which migrates chares (actors in the Charm++ terminology) between the nodes taking into account communication costs.

## 3 Error control

As it was noted earlier, the SpAMM algorithm performs truncation of sub-matrix products, i.e. operates in the product space. Truncation inevitably brings some error in the product matrix. The authors of the original algorithm note that the dependency between the parameter in the algorithm and resulting error norm of the product matrix is not known, see [9, p. C84]. In this section we derive such relation.

In Section LABEL:intro we mentioned the decay property of matrices, which are common in electronic structure calculations. This property and ignoring matrix elements smaller than a certain threshold value is exploited to derive the error estimates. We limit ourselves to the case of matrices with exponential decay.

Define first a distance function. We use the same framework as in [39]. The function is a distance function on the index set if for any

In other words, is a pseudo-metric defined on

Rubensson et al. [39] consider a sequence of matrices with exponential decay with the same constants and and associated distance functions . Under the assumption where is the set of all vertices within a distance away from the -th vertex and is its cardinality, and are some constants independent of , they showed that for any and for any the matrix contains at most elements greater than in magnitude and that each row and column of each has the number of entries greater than in magnitude bounded by a constant independent of , see Theorem 4 in [39]. The assumption means that the number of vertices situated within a distance away from a vertex is finite, which holds for the underlying physical systems. The theorem states that each row and column of such a matrix has at most significant elements, i.e. elements with magnitude greater than , where is a constant independent of . The rest of the elements are not significant, but not necessarily zeros. We directly exploit these results to derive our estimations.

First we consider the error introduced by truncation of the input matrices, i.e. truncation in the vector space.

###### Lemma 1.

Let and be sequences of matrices with a common associated distance function for each . Assume

 |[An]i,j|≤c1e−αdn(i,j),|[Bn]i,j|≤c2e−αdn(i,j) (1)

for all where and are positive constants independent of . Assume also that there exist constants and such that

 |Ndn(i,R)|≤γRβ,∀i=1,…,n,∀R>0. (2)

Let and where

 (3)

are the truncated matrices, is some given parameter. Then, the error matrix has the following element-wise property:

 |[En]i,j|=O(τ),∀i,j=1,…,n.

###### Proof.

Any element of a matrix product is computed as follows:

 [Cn]i,j=n∑k=1[An]i,k[Bn]k,j, (4)

which is a scalar product of the -th row of and the -th column of . At the same time, any element of the product of the truncated matrices and is computed as follows:

 [~Cn]i,j=n∑k=1[ϕn]ikj,[ϕn]ikj={0,if |[An]i,k|<τor|[Bn]k,j|<τ,i,k[Bn]k,jotherwise.

Then any element of the error matrix has an absolute value

 |[En]i,j|=|n∑k=1([An]i,k[Bn]k,j−[ϕn]ikj)|,∀i,j=1,…,n.

This quantity attains zero if no information is lost when computing a given element, i.e. no elements truncated to zero, and attains its maximum when all information is lost when performing multiplication, i.e. . The worst-case scenario is when every element in the error matrix attains its maximum, i.e. and thus

Let us consider the matrix . We have

 |[Cn]i,j|=|n∑k=1[An]i,k[Bn]k,j|≤n∑k=1|[An]i,k||[Bn]k,j|. (5)

Since both and possess on the exponential decay property with respect to a common distance function by Theorem 4 in [39] for any both matrices have at most significant elements (i.e. greater than in magnitude) in every row or column, where is a constant independent of . Let us use .

When estimating the absolute value in (LABEL:c_trunc_mul_element_abs_value), three possible types of summands are to be taken into account:

The case where both multipliers are greater than in magnitude is not included since it falls outside the worst-case scenario. The estimation of the product of the absolute values for the first two types is due to exponential decay property, There are not more than summands of type 1 and 2 for any element (Theorem 4 from [39]). The rest summands are of type 3. We split the index set into two sets, and Without loss of generality, we assume that and Then

 |[En]|i,j=|[Cn]|i,j≤∑k∈In1|[An]i,k||[Bn]k,j|+∑k∈In2|[An]i,k||[Bn]k,j|=2κ∑k=1|[An]i,k||[Bn]k,j|S1+n∑k=2κ+1|[An]i,k||[Bn]k,j|S2=S1+S2.

We consider next the partial sums and separately, starting with

 S1=2κ∑k=1|[An]i,k||[Bn]k,j|≤2κcτ. (6)

To estimate , one should take into account that all summands in this expression are of type 3, i.e. both :

 S2=n∑k=2κ+1|[An]i,k|<τ|[Bn]k,j|<τn∑k=2κ+1|[Bn]k,j|≤c2τn∑k=2κ+1e−αdn(k,j). (7)

It can be shown that the last sum in (LABEL:S_2_trunc_mul_almost_done) is bounded by a constant independent of

 n∑k=2κ+1e−αdn(k,j)≤n∑k=1e−αdn(k,j)≤∞∑r=1(|Ndn(j,r)|−|Ndn(j,r−1)|)e−α(r−1)≤∞∑r=1(|Ndn(j,r)|)e−α(r−1)≤∞∑r=1γrβe−α(r−1),

where the last series is a convergent one, which can be demonstrated with d’Alembert test [41]. Let denote its sum, then

 S2

By combining (LABEL:S_1_truncmul_done) and (LABEL:S_2_truncmul_done), we obtain

 |[En]i,j|<δ1τ+δ2τ=(δ1+δ2)τ=δτ,∀i,j=1,…,n,
 |[En]i,j|=O(τ)∀i,j=1,…,n.

The same machinery can be applied to the SpAMM algorithm. It is straightforward to see that the only difference is the cases to be considered when computing an element of the product matrix.

###### Lemma 2.

Let and be sequences of matrices with a common associated distance function for each . Assume that each and each satisfy the exponential decay property (LABEL:eq:exp_decay_property) with constants and and (LABEL:eq:num_vertices_within_R) holds for some constants and . Let and where is some given parameter. Then, the error matrix has the following element-wise property:

 |[En]i,j|=O(τ),∀i,j=1,…,n.

###### Proof.

Any element of a matrix product is computed as:

 [Cn]i,j=n∑k=1[An]i,k[Bn]k,j,

which is nothing but a scalar product of the -th row of and the -th column of . At the same time, any element of a SpAMM product is computed as follows:

 [¯Cn]i,j=n∑k=1[χn]ikj,[χn]ikj={0,if |[An]i,k||[Bn]k,j|<τ,i,k[Bn]k,jotherwise.

Then any element of the error matrix has an absolute value

 |[En]i,j|=|n∑k=1([An]i,k[Bn]k,j−[χn]ikj)|,∀i,j=1,…,n.

This quantity attains zero if no SpAMM condition has been met when computing a given element, i.e. the information is preserved, and attains its maximum when all information is lost when performing SpAMM, i.e. . The worst-case scenario is when every element in the error matrix attains its maximum, i.e. and thus

Let us consider the matrix . We have

 |[Cn]i,j|=|n∑k=1[An]i,k[Bn]k,j|≤n∑k=1|[An]i,k||[Bn]k,j|. (9)

Since both and have exponential decay with respect to a common distance function by Theorem 4 in [39] for any both matrices have at most significant elements (i.e. greater than in magnitude) in every row or column, where is a constant independent of . Let us use where is a SpAMM parameter. Note that since we are at the worst case scenario.

When estimating the absolute value in (LABEL:c_spamm_element_abs_value), three possible types of summands are to be taken into account:

1. Both and

2. Either or

3. Both and

There are not more than summands of type 1 and 2 for any element due to Theorem 4 from [39]. The rest summands are of type 3. We split the index set to two sets, and Without loss of generality, we assume that and The further proof is technical and very similar to the proof of Lemma LABEL:lemma_truncmul_error_elementwise_property, therefore it is omitted.

One can also combine the two approaches, i.e. perform the SpAMM algorithm on truncated matrices. We assume that the truncation parameter and the SpAMM are the same. In this case, we can again apply the machinery used in Lemmas LABEL:lemma_truncmul_error_elementwise_property and LABEL:lemma_spamm_error_elementwise_property.

###### Lemma 3.

Let and be sequences of matrices with a common associated distance function for each . Assume that each and each satisfy the exponential decay property (LABEL:eq:exp_decay_property) with constants and and (LABEL:eq:num_vertices_within_R) holds for some constants and . Let and where and are the truncated matrices (LABEL:truncated_matrices) and is some given parameter. Then, the error matrix has the following element-wise property:

 |[En]i,j|=O(τ),∀i,j=1,…,n.

###### Proof.

Let be the true product of and and be the result of the SpAMM algorithm applied on matrices and truncated with the same as used by SpAMM. Then, the error matrix satisfies

 ∣∣[En]i,j∣∣=∣∣[Cn]i,j−[^Cn]i,j∣∣=∣∣[Cn]i,j−[~Cn]i,j+[~Cn]i,j−[^Cn]i,j∣∣≤∣∣[Cn]i,j−[~Cn]i,j∣∣+∣∣[~Cn]i,j−[^Cn]i,j∣∣,∀i,j=1,…,n. (10)

The first expression in (LABEL:eq:split_error) is the element-wise error introduced by truncation of matrices before multiplication, and its bound is derived in Lemma LABEL:lemma_truncmul_error_elementwise_property. The second expression in (LABEL:eq:split_error) is the error introduced by the SpAMM algorithm assuming that the matrices have been already truncated, and its bound is derived in Lemma LABEL:lemma_spamm_error_elementwise_property. Combination of those results gives us

 ∣∣[En]i,j∣∣=O(τ).

In the proofs of Lemmas LABEL:lemma_truncmul_error_elementwise_property, LABEL:lemma_spamm_error_elementwise_property and LABEL:lemma_tuncated_spamm_error_elementwise_property we considered the worst case scenarios, where the true product matrix becomes the error matrix. The exponential decay property of the product matrix [39] can be exploited further to estimate the sum of the squared absolute values of all elements below certain threshold in magnitude.

###### Lemma 4.

Let be a sequence of matrices with associated distance functions and assume that every matrix satisfies (LABEL:eq:exp_decay_property) with constants and Assume also that (LABEL:eq:num_vertices_within_R) holds for some constants and Then, for any and every ,

 ∑i,j:|[Cn]i,j|≤ε|[Cn]i,j|2=O(n)⋅O(εp),p<2.

###### Proof.

One should consider two cases, in which insignificant elements can exist: and This is due to inequality sign in (LABEL:eq:exp_decay_property). Note that it is implicitly assumed that

Let us first consider the latter case, i.e. We directly use the exponential decay property (LABEL:eq:exp_decay_property). here. Then

 ∑i,j:dn(i,j)≥rε|[Cn]i,j|2≤c2∑i,j:dn(i,j)≥rεe−2αdn(i,j)=c2∑j∑i:dn(i,j≥rεe−2αdn(i,j)≤c2∑j∞∑r=rε+1(|Ndn(j,r)|−|Ndn(j,r−1)|)e−2α(r−1)≤c2∑j∞∑r=rε+1|Ndn(j,r)|e−2α(r−1)≤c2∑j∞∑r=rε+1γrβe−2α(r−1)≤c2n∞∑r=rε+1γrβe−2α(r−1).

One can set then

 c2n∞∑r=rε+1γrβe−2α(r−1)=c2n∞∑p=0γ(p+rε+1)βe−2α(p+rε)=c2ne−2αrε∞∑p=0γ(p+rε+1)βe−2αp (11)

The series in (LABEL:series_insignificant) converges due to d’Alembert test [41]. Let be the sum of that series. It is a constant independent of . Note that thus

 ∑i,j:dn(i,j)≥rε|[Cn]i,j|2≤nε2ν.

Since is independent of , and

 limε→0ε2ε2=1,

then

 ∑i,j:dn(i,j)≥rε|[Cn]i,j|2=O(n)⋅O(ε2). (12)

Now one has to treat the other part, i.e. the elements, for which the distance functions is less than using the fact that the elements are smaller than in magnitude. This can be done as follows:

 ∑i,j: dn(i,j)

Since and are constants independent of , for any and

 limε→0ε2γ(1αlncε)βεp=0,∀p<2,

one concludes that

 ∑i,j:dn(i,j)

Combining together (LABEL:sum_insignificant_outside_re) and (LABEL:sum_insignificant_within_re), we obtain

 ∑i,j:|[Cn]i,j|≤ε|[Cn]i,j|2=O(n)⋅(O(ε2)+O(εp))=O(n)⋅O(εp), p<2.

The result of Lemma LABEL:lemma_sum_insignificant should be interpreted as follows: the sum of all elements smaller than in magnitude is proportional to when is fixed and grows, but decays slower than when is fixed.

Now we combine Lemmas LABEL:lemma_truncmul_error_elementwise_property, LABEL:lemma_spamm_error_elementwise_property, LABEL:lemma_tuncated_spamm_error_elementwise_property and LABEL:lemma_sum_insignificant to show the behavior of the error matrix Frobenius norm for all the methods.

###### Theorem 5.

Let and be sequences of matrices with a common associated distance function for each . Assume that each and each satisfy the exponential decay property (LABEL:eq:exp_decay_property) with constants and and (LABEL:eq:num_vertices_within_R) holds for some constants and . Let and where and are the truncated matrices (LABEL:truncated_matrices). Then, the error matrix has the following property:

 ∥En∥F=O(n1/2)⋅O(τp/2),p<2.

###### Proof.

In the worst-case scenario, and thus We know by theorem 5 from [39] that satisfies the exponential decay property with respect to the same distance function as and Thus by theorem 4 from [39] has at most significant elements . In order to estimate the Frobenius norm of the error matrix, we use lemma LABEL:lemma_sum_insignificant to compute partial sum for all insignificant elements with and lemma LABEL:lemma_truncmul_error_elementwise_property to the rest of elements, i.e. significant ones:

 ∥En∥2F=∥Cn∥2F=∑i,j:|[Cn]i,j|≤τ|[Cn]i,j|2+∑i,j:|[Cn]i,j|>τ|[Cn]i,j|2=O(n)⋅O(τp)+O(n)⋅(O(τ))2=O(n)⋅(O(τp)+O(τ2))=O(n)⋅O(τp),p<2.

By taking the square root from both sides of the previous relation, we arrive at

 ∥En∥F=√O(n)⋅O(τp)=O(n1/2)⋅O(τp/2),p<2.

The result of Theorem LABEL:theorem_truncmul_error should be read as follows: the Frobenius norm of the error matrix grows as for fixed and decays strictly slower than when is fixed.

The SpAMM algorithm has the same behavior of the error, as in the multiplication of truncated matrices.

###### Theorem 6.

Let and be sequences of matrices with a common associated distance function for each . Assume that each and each satisfy the exponential decay property (LABEL:eq:exp_decay_property) with constants and and (LABEL:eq:num_vertices_within_R) holds for some constants and . Let and where is some given parameter. Then, the error matrix has the following property:

 ∥En∥F=O(n1/2)⋅O(τp/2),p<2.

The proof is very similar to the proof of Theorem LABEL:theorem_truncmul_error and is based on combination of Lemmas LABEL:lemma_spamm_error_elementwise_property and LABEL:lemma_sum_insignificant and therefore is omitted.

If one performs truncation in both vector and product spaces, i.e. applies the SpAMM algorithm on truncated matrices, then the norm of the error matrix still behaves similarly.

###### Theorem 7.

Let and be sequences of matrices with a common associated distance function for each . Assume that each and each satisfy the exponential decay property (LABEL:eq:exp_decay_property) with constants and and (LABEL:eq:num_vertices_within_R) holds for some constants and . Let and where are the truncated matrices (LABEL:truncated_matrices) and is some given parameter. Then, the error matrix has the following property:

 ∥En∥F=O(n1/2)⋅O(τp/2),p<2.

The proof is very similar to the proof of Theorems LABEL:theorem_truncmul_error and LABEL:theorem_spamm_error and is based on combination of Lemmas LABEL:lemma_tuncated_spamm_error_elementwise_property and LABEL:lemma_sum_insignificant and therefore is omitted.

Based on the Theorems LABEL:theorem_truncmul_error and LABEL:theorem_spamm_error we see the error matrix norm for both regular multiplication of truncated matrices and the SpAMM algorithm has a similar behavior. Even if one applies SpAMM on truncated matrices, the behavior of the error matrix norm does not become qualitatively worse. We show in Section LABEL:results that in a distributed environment it is profitable to combine the two algorithms into a new technique, referred to as hybrid.

## 4 Implementation details

### 4.1 The Chunks and Tasks programming model

The Chunks and Tasks model, developed by Rubensson and Rudberg [34], is defined by a C++ API. The basis of the model comprises two basic abstractions - a chunk and a task. A chunk represents a piece of data whereas a task represents a piece of work. The parallelism is exploited in both work and data dimensions.

The user prepares a chunk and registers it to the library. Then, a chunk identifier is obtained and the modification of the chunk is no longer possible (similarly to the Concurrent Collections model [13]). This identifier can be provided as input to a task or be a part of another chunk. This allows to build hierarchies of chunks and the runtime takes responsibility of managing such hierarchies.

The work is expressed in terms of tasks. A task has one or more input chunk identifiers and a single output chunk identifier. Access to chunk data is possible only as input to a task, thus one cannot execute a task without having all input chunks available. The user is free to start other tasks inside a task, giving rise to dynamic parallelism.

A proof-of-the-concept runtime library [34] is implemented in C++ and internally parallelized with MPI and Pthreads. A Chunks and Tasks program is executed by a group of worker processes. To address load balancing, the implementation uses a concept of work-stealing, similarly to XKaapi [22], Cilk [8] and StarPU [4].

The set of rules introduced by the model is intended to simplify parallel programming for distributed memory machines by reducing the chance of getting a deadlock to zero. Moreover, immutable data makes it impossible to get a race condition of any kind. However, it takes a certain effort for a conventional MPI programmer to shift the programming paradigm.

### 4.2 Matrix representation

Similarly to [36] we employ a hierarchical matrix representation. Not only is it naturally suitable for recursive algorithms like SpAMM or inverse factorization [3], but it is easily mapped onto the Chunks and Tasks model. Rubensson and Rudberg [35] present a matrix library implemented using the model (CHTML). A matrix is viewed as a quadtree, and all non-leaf nodes are chunks which contain identifiers of child nodes. Matrices containing no non-zero elements are not kept in the memory, they are represented by a special identifier value. Matrix elements are kept at the leaf level only. Actual computations take place only if the leaf level is reached. This approach allows to skip computations on zero sub-matrices and thus exploit sparsity patterns dynamically. Figure LABEL:matrix_representation illustrates the hierarchical approach.

The leaf level matrices can be organized in different ways. One could simply keep them dense, or as a more advanced approach, use a block-sparse representation, where non-zero blocks are kept in a 2-D array, see [35] for details and discussion. In any case, when implementing the leaf-level functionality, one can assume that the matrix fits into a single computational node. For efficient computations, leaf level routines should rely on BLAS (Basic Linear Algebra Subroutines [21]).

In this work we continue developing the Chunks and Tasks matrix library by adding new functionality.

### 4.3 Matrix chunk type

In the previous works [35, 3] we use a chunk type which directly corresponds to the hierarchical matrix representation described in above. Now it has to be modified in order to be able to keep the Frobenius norm of the corresponding matrix as internal information. We choose to make updates of the internal information only explicitly, otherwise it is set to zero by default and never touched. For instance, the internal information should be updated before registering the SpAMM task or any other task, which uses the information. However, one arrives at a contradiction with the rules of the Chunks and Tasks programming model described in Section LABEL:cht_model: once a chunk is registered to the library, it cannot be changed, while the update operation implies changing the chunk.

To overcome this difficulty, we implement a special task type for updating internal information. In our case it is a simple double floating point number, but in general, it might be a more complex data structure. The task builds a copy of the chunk hierarchy, starting from the leaves, and registers parent chunks only after the children chunks provide their Frobenius norms. After that the original chunks can be deleted. Another option is to update the internal information when constructing the chunk or the hierarchy.

### 4.4 SpAMM task type

The SpAMM Algorithm LABEL:SPAMM_pseudocode is expressed in terms of the Chunks and Tasks matrix library. The main difference from regular matrix-matrix multiplication introduced by Rubensson and Rudberg [35] is that it accepts input chunks of another type, checks Frobenius norms of multipliers and if their norm product is small enough, the output chunk is set to null. Otherwise the same task is registered for sub-matrices and the process continues recursively until reaching the leaf level. The parameter is provided as one of the input chunks.

### 4.5 Leaf level library

As we discussed in Section LABEL:matrix_representation, CHTML can be used together with different leaf level representations. The previous works [35, 3] use a block-sparse leaf level library. However, it does not suite the SpAMM algorithm, since it is not consistent with the hierarchical structure used in the algorithm.

In order to be fully consistent, we develop a new leaf level library, namely hierarchical blocks-sparse library, which relies on a hierarchical matrix representation as the name suggests. At each level a matrix contains four C++ smart pointers to point at children matrices, one regular pointer to point at the parent, 1-D array where actual numbers are kept and own sizes. A null pointer represents a zero block of a matrix. Matrices containing no non-zero elements are only allowed at the top level, in the intermediate levels such matrices are represented as null pointers. A matrix is not allowed to have a child of equal or larger size. The lowest level matrices do not have children, they only keep actual numbers and have a parent pointer.

The choice of smart pointers is done to simplify routines where removal of sub-matrices is necessary. A parent pointer is chosen to be a regular one to avoid a state of a cyclic reference, which may lead to memory leaks.

The parent pointer is utilized as follows: using this pointer, for any matrix one can determine which of the parent’s children the current matrix is. By going up in the hierarchy using a sequence of parent pointers, it is possible to construct a sequence of digits from 0 to 3, which uniquely identifies a matrix in the hierarchy at any level. Using this code, it is easy to get the coordinates of the top left corner of the current matrix in the ”big picture”, which might be useful for some algorithms.

The hierarchy does not go down to single matrix elements. It stops when reaching a certain predefined block size. A block is kept if it contains at least one non-zero value. So, in some sense it exploits blocking ideas similar to block-sparse leaf matrix in [35] and [14].

The most important advantage of the hierarchical block-sparse representation over the previously used block-sparse format is that is handles ”not so dense” (nearly sparse) matrices more efficiently than the previously used format, since there is no need to iterate and look for a non-zero block when accessing it. This is confirmed by experimental results in Section LABEL:results.

The list of operations currently supported by the hierarchical block-sparse matrix library includes:

1. where in this case, only the upper triangle of the matrix is stored and utilized;

2. where either or is symmetric;

3. where is symmetric;

4. ;

5. ;

6. where is the inverse Cholesky factor, is a symmetric positive-definite matrix;

Corresponding BLAS operations are called at the lowest level of the hierarchy. As one can see, the list of operations in consistent with the functionality of the Chunks and Tasks matrix library, which can be found in [35], except the last one. This is done for compatibility purposes. The last operation corresponds to the SpAMM task type.

Basically, the hierarchical block-sparse leaf matrix library is a small-scale re-creation of the Chunks and Tasks matrix library (to a certain extent). It exploits similar structure, but, since the computation happens within a single computational node, several operations can be improved.

Let us consider how the regular multiplication is done in the Chunks and Tasks matrix library. The reader can find description in Algorithm LABEL:MM_pseudocode. At line 1 the input is checked and if any of input matrices are zero (represented as special NULL identifier), then the output matrix identifier is also set to NULL. Then, lines 2–4 describe leaf-level operations, lines 5–14 describe higher level operations. The code is organized in this manner due to one of imposed limitations: a chunk content can be accessed only as an input to a task, i.e. only a single level of the hierarchy is visible at a time. In other words, if both of the two input matrices are not NULL, then 13 tasks (8 multiplications, 4 additions, creating a big matrix out of child identifiers) will be anyway registered, unless the lowest level is involved. Registration of tasks definitely leads to overhead.

In a shared memory environment multiplication of hierarchical matrices can be optimized. Consider an illustrative example of such multiplication in (LABEL:nonoptimal_MM). As one can see, although both input matrices are not NULL, the resulting matrix is NULL. So, it is not worth to multiply those matrices at all. However, we could judge this only when seeing the whole hierarchy, but not the two matrix identifiers.

Since the leaf level library works in a shared memory, it is indeed possible to see the whole hierarchy at once and predict if it is worth to start multiplication of two matrices. This check should be done recursively, since, it might happen that the resulting matrix does have a non-zero entry, see (LABEL:nonoptimal_MM2). In this case, prediction also should be done when multiplying by . If those matrices give non-zero product, then the procedure is worth to perform.

In our implementation, such prediction is utilized not only for multiplication routine, but also for the others, including the SpAMM routine, where the conclusion is based not only on the layout of children and recursive checks, but also on the norms of corresponding matrices. This allows our library to be competitive in terms of performance with the block-sparse library used in the previous works. More details can be found in Section LABEL:results.

Though predictor routines are called before every call to multiplication and thus extra traversals of the quad-tree are done, the overhead of such prediction is small compared to the overhead of memory allocation for new matrix objects, which anyway would not be used. Prediction is cheap since it involves only pointer arithmetic and logical operations. Note that these predictor routines allow us to avoid the prohibited states in matrices discussed above.

 [A0NULLNULLNULL]×[NULLNULLNULLB3]=[NULLNULLNULLNULL] (14)
 [A0A1NULLNULL]×[NULLNULLNULLB3]=[NULLA1×B3NULLNULL] (15)

### 4.6 Two-level algorithm

The full implementation then looks as follows: at the level of the Chunks and Tasks library, the task-based version of Algorithm LABEL:SPAMM_pseudocode is utilized. The matrices are split into relatively large blocks so that one task could occupy a whole computational node. Inside those blocks the matrices are kept in hierarchies as well using the hierarchical blocks-sparse leaf level library. When the task code eventually arrives at the leaf level, a corresponding SpAMM method of the leaf library is called. Actual computations are performed there using the prediction routines and the resulting block products are computed. Then, a chunk hierarchy of the resulting product is built from bottom.

## 5 Results

### 5.1 Experimental setup

In this section we describe the set of experiments which are carried out in order to determine performance of the leaf level library and the approximate multiplications algorithms.

All experiments are done at PDC high performance computing center located at KTH Royal Institute of Technology in Stockholm. The target machine is the Beskow cluster, which is a Cray machine equipped with 2060 nodes each containing 2 16-core Intel Xeon E5-2698v3 CPUs combined with 64 gigabytes of RAM. The CPUs operate at 2.3 GHz frequency. The nodes are connected in Dragonfly topology using the Cray Aries high speed network. The code is compiled with GCC 7.3.0 g++ compiler, Cray MPICH 7.7.0 and OpenBLAS 0.2.20 [2]. The OpenBLAS library is built from the source code with disabled multi-threading. This is highly recommended by the developers in case an application uses multi-threading somewhere else, which is the case with the Chunks and Tasks library.

### 5.2 Performance of the leaf level library

In this benchmark the aim is to investigate the performance of the new hierarchical block-sparse matrix library and compare it with the existing block-sparse library. For this, we perform matrix-matrix multiplication at the leaf level, and the Chunks and Tasks library is not used at all. Two matrices of size