Parallelizing Sequential Sweeping on Structured Grids – Fully Parallel SOR/ILU Preconditioners for Structured n-Diagonal Matrices
Abstract
There are a variety of computational algorithms that need sequential sweeping, i.e., sweeping based on a specific order, on a structured grid, e.g., preconditioning (smoothing) by SOR or ILU methods and the solution of the eikonal equation by the fast sweeping algorithm. Due to their sequential nature, parallel implementations of these algorithms usually lead to a loss of efficiency, e.g., a significant decay of the convergence rate. Therefore, there is interest in parallelizing sequential sweeping procedures while simultaneously keeping the efficiency of the original method. This paper aims to parallelize sequential sweeping algorithms on structured grids, with emphasis on SOR and ILU preconditioners. The presented method can be regarded as an overlapping domain decomposition method combined with a multifrontal sweeping procedure. The implementation of the method in one and two dimensions is discussed in detail. The extension to higher dimensions and to general structured n-diagonal matrices is outlined. Introducing the notion of alternately block upper-lower triangular matrices, the convergence theory is established in general cases. Numerical results on model problems show that, unlike related alternative parallel methods, the convergence rate and efficiency of the presented method are close to those of the original sequential method. Numerical results also support the successful use of the presented method as a cache-efficient solver in sequential computations.
September 25, 2019
Keywords: domain decomposition, overlapping decomposition, parallel Gauss-Seidel, parallel sweeping.
65F10, 65H10, 65N06, 65N22, 65Y05
1 Introduction
Structured grids are commonly used in scientific computing for the numerical solution of partial differential equations (PDEs). Although unstructured-grid methods manage domain complexity in a more consistent manner, in certain applications structured grids are still the best choice, mainly due to their advantages in terms of memory consumption, efficiency and ease of implementation. Moreover, with the invention of immersed boundary methods, which make it possible to manage complex geometries on structured grids, there has recently been significant interest in the application of structured grids in scientific computing. Progress in supercomputer power also changes the bias in favor of structured grids, as they are usually more appropriate for parallel computing.
Some numerical algorithms on structured grids use sequential sweeping: a grid-by-grid procedure in a mandatory order. In fact, using the conventional version of these sequential algorithms, it is not possible to perform computations simultaneously. This limits our ability to use the full power of multiprocessor shared- and/or distributed-memory machines.
Since the advent of parallel computers, several parallel versions of the SOR method have been developed. Most variants of these algorithms have been developed using multicolor ordering schemes or domain decomposition techniques [adams1982mcs, oa1985lar, adams1986scb, neumann1987cpm, white1987mpi, melhem1987tei, melhem1988mrs, ashcraft1988vif, block1990bcs, ortega1991ocg, ii1993oma, stotland1997opc, xie1999nps]. In contrast to domain decomposition methods, a multicolor SOR method usually has a faster convergence rate, but it is usually more difficult to apply to elliptic boundary value problems in complex domains [xie2006nbp]. To improve the convergence rate of domain decomposition algorithms, a novel mesh partition strategy was proposed in [xie1999nps], which led to an efficient parallel SOR. This method has the same asymptotic rate of convergence as the Red-Black SOR [xie1999nps]. Although multicolor SOR algorithms are also usually implemented based on a domain decomposition strategy, Xie and Adams's algorithm [xie1999nps] takes less interprocessor data communication time than a multicolor SOR. A blocked version of Xie and Adams's parallel SOR algorithm [xie1999nps] is suggested in Xie [xie2006nbp]. According to the presented results, the rate of convergence and the parallel performance of this version are better than those of the non-blocked version.
The main drawback of the above-mentioned parallel versions of SOR is a lower rate of convergence compared to the original method. This is mainly due to the enforced decoupling between equations needed to provide some degree of concurrency in the computations. For example, in the m-color SOR method, an increase in the number of colors decreases the rate of convergence. So a key to achieving an efficient parallel SOR method is to minimize the degree of decoupling due to parallelization.
Other techniques, such as the pipelining of computation and communication and an optimal schedule for a feasible number of processors, have also been applied to develop parallel SOR methods when the related coefficient matrix is banded [missirlis1987spi, robert1988csp, eisenstat1989csp, niethammer1989smp, comput1995pgs, amodio1996pis]. These techniques can isolate parts of the SOR method which can be implemented in parallel without changing the sequential SOR. The convergence rate of these methods is the same as that of the original one, but their application is limited to banded-structure matrices.
The present study aims to suggest a new method to parallelize sequential sweeping on structured grids. The specific focus is on the SOR method, which can be regarded as an iterative preconditioner, a smoother or even a linear solver. Without loss of generality, the latter case is considered in this study to simplify the arguments. Since there is a direct correspondence between SOR and ILU preconditioners on structured grids, the presented method can be used to develop parallel ILU methods as well.
The rest of this paper is organized as follows. Sections 2 and 3 present the implementation of the method in one and two spatial dimensions respectively. The extension to three spatial dimensions is commented on in Section 4. Considering the connection between SOR and ILU preconditioners on structured grids, Section 5 discusses the application of the presented approach to develop ILU(p) preconditioners on structured grids. The convergence theory is established in Section 6. Section 7 extends the utility of the presented method to general structured n-diagonal matrices. Section 8 is devoted to numerical results on the convergence and performance of the method. Finally, Section 9 closes this paper by summarizing the results and outlining possible applications and extensions of the presented method.
2 Parallel SOR sweeping in one dimension
Consider the following scalar 1D elliptic PDE:
(1) 
where , , is , . Also, it is assumed that the field variables , and possess the sufficient regularity required by the applied numerical method to ensure the existence and uniqueness of the discrete solution. Assume the spatial domain is discretized into intervals with grid points and , where .
Without loss of generality, and for ease of presentation, we discretize (1) with a second-order finite difference method as follows:
(2) 
where the subscript denotes the value of a field variable at (e.g. ),
. Using the boundary conditions and , it is possible to write the above equations in the following matrix form,
(3) 
where and . It is evident that matrix is irreducibly diagonally dominant, which ensures the convergence of stationary iterative methods like SOR. The SOR method using the natural row-wise ordering (left-to-right sweeping) generates the following sequence of iterations from a given initial guess ,
(4) 
where is the left-to-right (LR) sweeping relaxation factor and the superscript denotes the iteration number. An alternative SOR method results from reversing the sweeping direction (right-to-left sweeping):
(5) 
where is the right-to-left (RL) sweeping relaxation factor. Applying (4) and (5) alternately during each two consecutive iterations leads to the classical symmetric SOR (SSOR) method. It is clear that these procedures are sequential, in the sense that the update of a grid point must be performed after that of its left (or right) neighbor. In fact, it is not possible to relax more than one grid point at a time in the original version of the SOR method.
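As a concrete illustration, the two opposing sweeps (4), (5) and their SSOR combination can be sketched for a model tridiagonal system (the 1D Poisson problem with homogeneous Dirichlet boundaries). The concrete stencil, the zero-based indexing and the function names below are our own assumptions, not the paper's notation:

```python
import numpy as np

def sor_sweep_1d(u, rhs, omega, direction):
    """One SOR sweep for the 3-point stencil 2*u[i] - u[i-1] - u[i+1] = rhs[i]
    (1D Poisson model problem, homogeneous Dirichlet boundaries).
    direction = +1 sweeps left-to-right (cf. (4)), -1 right-to-left (cf. (5))."""
    n = len(u)
    order = range(n) if direction > 0 else range(n - 1, -1, -1)
    for i in order:
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        gs = 0.5 * (rhs[i] + left + right)         # Gauss-Seidel value
        u[i] = (1.0 - omega) * u[i] + omega * gs   # SOR relaxation
    return u

def ssor_1d(rhs, omega=1.5, iters=200):
    """Symmetric SOR: alternate the two sweep directions every iteration."""
    u = np.zeros_like(rhs)
    for _ in range(iters):
        sor_sweep_1d(u, rhs, omega, +1)
        sor_sweep_1d(u, rhs, omega, -1)
    return u
```

With omega = 1 each sweep reduces to a Gauss-Seidel pass; alternating the two directions gives the SSOR iteration described above.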
The goal of a parallel implementation of the SOR method is to make concurrent computation possible while keeping the convergence rate of the original SOR method as much as possible. The concept of domain decomposition is used here to parallelize the SOR method. Assume the spatial domain is decomposed into a number of non-overlapping subdomains, i.e., , and (a splitting of the grid points).
For convenience, we first describe our method for two (approximately equal) subdomains. As illustrated in figure 1, the discretized spatial domain is decomposed into subdomains and , where and . If (4) is used in and the computation begins from the left boundary, and, simultaneously, (5) is used in and the calculation is performed from the right boundary toward the left, then the computations in the two subdomains are independent during the first iteration. In our algorithm, during the next iteration, (5) and (4) are used in and respectively, and the computations are started from grid points and in the corresponding subdomains. Therefore, we have the following equation at grid point ,
(6) 
while the following equation should be used at grid point ,
(7) 
Equations (6) and (7) can be written in the following matrix form:
(8) 
The above system of equations can be easily inverted to compute and . Having computed the new values of grid points and by the mentioned procedure, the computations for the remaining grid points are performed trivially according to the corresponding update formulae. This sequence of computations is repeated for each pair of consecutive iterations until the convergence criterion is satisfied. The flow chart of this procedure is displayed in figure 1. In this plot, the and symbols denote the application of (4) and (5) respectively; the symbol denotes the application of (8).
The extension of this algorithm to more than two subdomains is straightforward and can be described as follows (cf. figure 2):

Division of the spatial computational domain into the desired number of subdomains, based on the desired load-balancing constraints.

Determination of the sweeping direction for each subdomain. The sweeping direction of each subdomain should be opposite to that of its neighbors. For example, we use the LR direction for odd subdomains and the RL direction for even subdomains. These sweeping directions are inverted after each iteration.

Updating the starting node of each subdomain with (8) and the remaining nodes with either (4) or (5), based on the corresponding sweeping direction. If the starting node is located at a physical boundary, the prescribed boundary value is used there (there is no need to use (8) in this case).
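The steps above can be emulated serially for the simplest case of two subdomains and Gauss-Seidel relaxation (relaxation factors equal to unity), again on the hypothetical 1D Poisson model problem; in this case the interface system analogous to (8) reduces to the 2x2 solve shown below. Function and variable names are illustrative assumptions:

```python
import numpy as np

def parallel_gs_1d(f, h, iters=800):
    """Serial emulation of the two-subdomain parallel sweep for -u'' = f
    with homogeneous Dirichlet boundaries (Gauss-Seidel case, omega = 1).
    Subdomain 1 holds indices [0, m), subdomain 2 holds [m, n);
    sweep directions alternate each iteration, as described above."""
    n = len(f)
    m = n // 2
    u = np.zeros(n)
    rhs = h * h * f

    def nb(i):  # (left, right) neighbour values; zero at physical boundaries
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        return left, right

    def relax(i):  # Gauss-Seidel update for the 3-point stencil
        left, right = nb(i)
        u[i] = 0.5 * (rhs[i] + left + right)

    for it in range(iters):
        if it % 2 == 0:
            # outward iteration: subdomain 1 sweeps L->R from the left
            # boundary, subdomain 2 sweeps R->L from the right boundary
            old_left = u[m - 1]        # value subdomain 2 would see in a
            for i in range(0, m):      # truly concurrent run
                relax(i)
            for i in range(n - 1, m, -1):
                relax(i)
            _, right = nb(m)
            u[m] = 0.5 * (rhs[m] + old_left + right)
        else:
            # inward iteration: the two starting nodes m-1 and m are
            # mutually coupled -> solve the local 2x2 system
            # [ 2 -1; -1 2 ] [u_{m-1}; u_m] = [b1; b2]
            b1 = rhs[m - 1] + nb(m - 1)[0]   # uses old value at m-2
            b2 = rhs[m] + nb(m)[1]           # uses old value at m+1
            u[m - 1] = (2.0 * b1 + b2) / 3.0
            u[m] = (b1 + 2.0 * b2) / 3.0
            for i in range(m - 2, -1, -1):   # subdomain 1 sweeps R->L
                relax(i)
            for i in range(m + 1, n):        # subdomain 2 sweeps L->R
                relax(i)
    return u
```

In a real message-passing implementation, the two subdomain loops run on different processors and only the interface values are exchanged each iteration.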
Now let us look at the properties of the presented parallel algorithm and its connection to other domain decomposition methods. It is obvious that this algorithm is an overlapping domain decomposition method whose overlap regions change alternately. As the method has some similarities to the overlapping Schwarz algorithm, let us mention some differences between our algorithm and the Schwarz method. In contrast to the Schwarz method, the size of the overlap regions here changes alternately. Subsystems of equations are solved exactly only at the overlapped regions (alternately, of course); within the internal regions of the subdomains, the equations are solved approximately (by a one-pass SOR iteration). In the Schwarz algorithm, subdomain boundaries are treated via virtual boundary conditions of Dirichlet, Neumann, Robin or mixed type. When the relaxation factors are equal to unity (Gauss-Seidel iterations), our numerical treatment of subdomain boundaries is equivalent to Neumann-type (flux) boundary conditions; in the other cases, our internal boundary conditions are equivalent to a mixture of Dirichlet and Neumann ones. In the latter case, the relaxation factors and play the role of blending factors. In fact, our method tries to keep a partial coupling between subdomains during iterations, at the expense of a negligible computational cost. It is worth mentioning that, due to the alternation of sweeping directions, we expect good smoothing properties from the presented method. Later we will observe how such enforced local coupling leads to a convergence rate competitive with that of the original SOR.
Note that the procedure for higher-order stencils or other numerical methods is conceptually the same. We avoid further remarks in this regard to save space.
3 Parallel SOR sweeping in two dimensions
In this section we extend the presented parallel sweeping algorithm to two-dimensional problems. We consider the following scalar 2D elliptic PDE:
(9) 
where , , is a diagonal matrix with diagonal entries , . For the sake of convenience, it is assumed that . As before, sufficient regularity is assumed for the field variables.
Suppose that is divided into and Cartesian grids along the and directions respectively. The grid points are given by and ; where , , , and . As before, the finite difference approximation of (9) is used here to describe the presented method (with a second-order central differencing scheme). Higher-order finite difference methods, as well as other numerical methods (like FVM or FEM), can also be employed. The only requirement of our method is the discretization of the PDE on a structured grid. We use to denote the finite difference approximation of . The approximation of (9) with the mentioned scheme has the following form,
(10) 
for and ; where ,
and the mid-grid-point coefficients are computed by harmonic averaging, e.g.,
Using the corresponding boundary conditions, it is possible, as before, to write the above equations in matrix form. However, the presented method is matrix-free and there is no need to write the equations explicitly in matrix form at this point.
There are a variety of sweeping directions for solving (10) with the SOR method. Considering a rectangular computational domain, it is possible to start the sweeping procedure from one corner toward the corresponding opposite corner. Therefore, there are four sweep directions. Using geographic directions (east, west, north, south), these can be denoted SW-to-NE, NE-to-SW, SE-to-NW and NW-to-SE. There are various sweeping strategies for each of the mentioned directions. These strategies are shown in figure 3 for the SW-to-NE and NE-to-SW directions on a structured grid. Every sweeping strategy should preserve causality, i.e., it should be performed along a causal direction. Causality here means that the order of updates should be such that the new values of at least one half of the neighboring nodes of the current grid point are known prior to its update. It is possible to use distinct relaxation factors along each sweeping direction. We denote the relaxation parameter of each direction by a subscript according to its starting corner, i.e., , , , .
The SOR update (from iteration to ) for grid point takes one of the following forms,
(11)  
(12)  
(13)  
(14) 
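A minimal sketch of causal visit orders for the four corner-to-corner directions follows; the five-point stencil, the zero-based (i, j) indices and the direction labels are our own conventions, chosen to match the geographic naming above:

```python
def sweep_order(direction, nx, ny):
    """Causal row-by-row visit order for a five-point stencil on an
    nx-by-ny grid. direction is one of 'SW-NE', 'NE-SW', 'SE-NW',
    'NW-SE' (start corner to opposite corner). Along the sweep, an
    updated node always has the new values of the two neighbours
    behind the front, which is the causality requirement."""
    xr = range(nx) if direction in ('SW-NE', 'NW-SE') else range(nx - 1, -1, -1)
    yr = range(ny) if direction in ('SW-NE', 'SE-NW') else range(ny - 1, -1, -1)
    return [(i, j) for j in yr for i in xr]
```

Other causal strategies (e.g., wavefronts along anti-diagonals) visit the same nodes in a different order while preserving the same dependency rule.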
It is clear that, due to the required causality condition, the above procedures are sequential in nature. To parallelize this sequential procedure, we first split the computational domain into (almost) equally sized non-overlapping subdomains (based on load-balancing constraints). To describe our algorithm, we consider a Cartesian decomposition in this section, i.e., , where and are the numbers of subdomains along the and spatial axes respectively. For the sake of convenience, it is assumed that is decomposed into a array of subdomains (cf. figure 4).
As implied previously, a key to achieving a parallel SOR with little convergence-rate decay is to minimize the degree of decoupling due to parallelization as much as possible. In the original SOR, when a point is updated, the remaining points sense its new value due to the causality of the sweeping strategy.
To keep the causality of sweeping partially, a multifrontal strategy is employed here for the purpose of parallelization. Consider the mentioned Cartesian splitting of the computational domain. According to figure 5, we use frontal sweeping along directions , , and for subdomains 1 through 4 respectively. The order of updating is shown in the plots. It is clear that, to update node #1 of each subdomain, we need two unknown values (new values) from two neighboring subdomains. A shaded region in the figure shows this coupling at iteration . Therefore a local system of equations should be solved for this purpose. Assuming node #1 of (at iteration , left side of the figure) is denoted by and using global indices, this local system of equations has the following form (application of (12), (14), (13) and (11) for through respectively),
(15) 
This system of equations is nonsingular and can be easily inverted. After the update of the corner node (node #1), we should update the nodes on the (virtual) boundaries of the subdomains. As shown in figure 5, node #2 of each subdomain is coupled to node #2 of a neighboring subdomain. Assuming node #1 of is denoted by and using global indices, using (12) and (13) the following system of equations should be solved to compute the new values of node #2 in and , in fact and (cf. figure 5),
(16) 
Note that the new values of and are placed on the right-hand side of (16), as they are known after the solution of (15). In the same manner the other remaining boundary nodes are updated. Then, the internal nodes are updated using the classic SOR method along the corresponding sweeping direction.
The next sweep (iteration ) for the mentioned decomposition is performed along the corresponding opposite directions, i.e., , , and for subdomains 1 through 4 respectively (cf. figure 5). In this specific example there is no need to solve any local systems of equations for this sweep; but in general one usually needs to solve such systems at the coupling points of subdomains, which are determined based on the sweeping directions. Figure 6 shows the alternation of sweeping directions during each four consecutive iterations for a array of subdomains. In the following subsections some implementation issues of our algorithm are discussed in detail.
As a final remark in this section, it should be mentioned that the extension of our algorithm to nine-point or higher-order central stencils is straightforward. As the extent of the computational molecule increases in a higher-order method, the size of the local systems to be solved increases accordingly. Following the same strategy, it is possible to apply the presented method to other numerical methods, e.g., FVM and FEM, on a structured grid. Note that every structured grid has a one-to-one correspondence with a Cartesian grid. Therefore, the presented method can be used directly for non-Cartesian structured grids too. To save space, further details in this regard are omitted in this study.
3.0.1 Domain decomposition
The first aspect of our parallel algorithm is the division of the computational grid into parts. This is how the full computational task is divided among the various processors. Each processor works on a portion of the grid, and any time a processor needs information from another one, a message is passed (in a message-passing protocol; no message passing is needed on shared-memory machines).
To achieve good parallel performance, we would like an optimal load balance and the least communication between processors. Consider load balancing first. Assuming a homogeneous parallel machine, one would like each processor to do the same amount of work during every iteration. That way, each processor is always working and never sitting idle. It makes sense to partition the grid such that each processor gets an equal (or nearly equal) number of nodes to work on. The second criterion requires minimizing the number of edge cuts and the degree of adjacency of each subdomain (the number of neighboring processors). So the domain decomposition can be cast as a constrained optimization problem in which the communication cost is minimized under the load-balancing constraint. In the case of a regular Cartesian grid, this optimization problem can be solved easily at negligible cost.
The cost of communication is composed of two kinds of elapsed time: one proportional to the communicated data size (a function of the network communication speed) and one due to the network latency, which is independent of the data size. Therefore, dealing with high-latency networks, we may prefer to decrease the degree of adjacency in our communication graph, at the expense of a larger communicated data size. On the other hand, dealing with low-latency networks, we may neglect the effect of network latency. It is worth mentioning that, using an overlapped communication and computation strategy, it is possible to decrease the cost of communication.
The above discussion holds in general for every parallel algorithm. However, one should consider other criteria when the convergence rate of the parallel algorithm is a function of the domain decomposition topology (the number of subdomains and their geometries).
In the present study, increasing the number of subdomains along each spatial direction is equivalent to decreasing the degree of coupling in that direction. Roughly speaking, this usually decreases the convergence rate. However, this conclusion is not correct in general. In some cases, a limited degree of decoupling not only does not decrease the convergence rate but actually improves it. As an example, consider the solution of the Poisson equation in a square domain. Also, assume that the center of the domain is the center of symmetry of the exact solution. Now, let us decompose the domain into equal subdomains. Due to the mentioned symmetry, the parallel algorithm performs better than the original SOR method (as it is more consistent with the mentioned symmetry of the actual solution). This example implies that decoupling does not always decrease the convergence rate. Therefore, one may expect that an algorithm with partial coupling (like ours) may even outperform the original SOR in practice.
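For a regular Cartesian grid, the constrained optimization mentioned above can be solved by plain enumeration. The following sketch (our own formulation) picks the processor split that minimizes the total internal boundary length, a simple proxy for the communicated data volume, assuming near-equal tiles already satisfy the load-balancing constraint:

```python
def best_cartesian_split(nx, ny, nprocs):
    """Pick (px, py) with px * py == nprocs minimising the total length
    of internal subdomain boundaries on an nx-by-ny grid."""
    best, best_cut = None, float('inf')
    for px in range(1, nprocs + 1):
        if nprocs % px:
            continue
        py = nprocs // px
        # (px - 1) vertical cuts of length ny, (py - 1) horizontal of length nx
        cut = (px - 1) * ny + (py - 1) * nx
        if cut < best_cut:
            best, best_cut = (px, py), cut
    return best
```

For a square grid this prefers a square processor array, while for strongly anisotropic grids it degenerates toward a one-dimensional split, in line with the perimeter-minimization argument above.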
3.0.2 Sweeping directions
After the domain decomposition, it is essential to determine the
sweeping directions for each subdomain during 4 consecutive
iterations. For this purpose it is sufficient to use the following
rules (lie 1D version of algorithm):

Along each spatial direction, the sweeping direction of a subdomain should be opposite to that of its neighboring subdomains (cf. figure 6).

The sweeping directions (along all spatial directions) should be reversed during every pair of iterations (cf. figure 6).

After each pair of iterations, the sweeping is performed along a new diagonal, e.g., after two consecutive iterations along the diagonal , the direction of sweeping for the next two iterations is changed to the diagonal (cf. figure 6).
Using the above-mentioned rules, the sweeping directions are generated automatically without any complexity; e.g., figure 6 was produced by applying the above rules within a nested loop using a few lines of TikZ scripts (www.texample.net/tikz/).
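The three rules can indeed be encoded in a few lines. The sketch below is our own formulation (the sign convention and function name are assumptions): it returns the sweep direction of subdomain (I, J) at iteration `it` as a pair of signs along the two axes, +1 meaning increasing index and -1 decreasing index:

```python
def sweep_schedule(I, J, it):
    """Sweep direction (sx, sy) for subdomain (I, J) at iteration `it`.
    Rule 1: neighbours are opposite along each axis (checkerboard in I, J).
    Rule 2: both signs flip between the two iterations of a pair.
    Rule 3: after each pair, only the y-sign pattern flips, which
    switches the sweep to the other diagonal."""
    pair, flip = (it // 2) % 2, it % 2
    sx = (-1) ** (I + flip)
    sy = (-1) ** (J + flip + pair)
    return sx, sy
```

Iterating `it` over 0..3 reproduces a four-iteration cycle of the kind shown in figure 6.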
3.0.3 Data structure
In the sequential SOR, a two-dimensional array is sufficient to store the global data. This needs a nested loop to sweep the whole domain. Since in the presented parallel SOR method we do not have a regular order of updating for the subdomain boundary nodes, it is preferable to store the order of updates for these nodes. To achieve better performance, all data (the mentioned indices and field variables) are stored in a one-dimensional array (sometimes called vectorization). In fact, in this way we have a one-to-one mapping from 2D indexing to 1D indexing, e.g., node is stored at location in the one-dimensional array. In this way, the locations of nodes and are the location of node plus 1 and values respectively.
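As a concrete sketch of the mapping just described (assuming zero-based indices and a row-wise layout of width nx, which are our conventions):

```python
def flat_index(i, j, nx):
    """Map 2D node (i, j) on a grid of width nx to its 1D storage
    location in the vectorised layout described above."""
    return i + j * nx

# In the flat array, the east/west neighbours of a node sit at +/-1
# and the north/south neighbours at +/-nx from its location.
```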
To update nodes which are located at the boundaries between neighboring subdomains, data from the corresponding neighboring subdomains are required. Therefore, we add two rows/columns of halo points around each subdomain, which are renewed (via communication) prior to a new iteration (in the case of higher-order schemes, the width of the halo region should be increased accordingly). In the present implementation, every subdomain has at most 8 neighbors, i.e., 4 first-degree neighbors (in contact via an edge) and 4 second-degree neighbors (in contact via a corner). Figure 7 schematically shows the communicated data during the and sweeps for a subdomain with 8 active neighbors. In this figure, yellow-shaded regions contain native nodes and gray-shaded regions include the required data which should be communicated. It is clear that it is possible to overlap a portion of the communications with computations to improve performance.
4 Extension to three dimensions
Following the geometric procedure discussed in the previous section, it is straightforward to extend the presented method to threedimensional (3D) cases. The main ingredients of such extensions are briefly addressed here.
In the 3D case, there are eight alternative frontal sweeping directions. Adding front and back directions to the mentioned geographic directions (east, west, north, south), these eight sweep directions run from each of the eight corners toward the corresponding opposite corner. There are eight relaxation parameters accordingly: , , , , , , and . After the domain decomposition, the sweeping directions are determined for each subdomain for every eight consecutive iterations using the rules mentioned in 3.0.2. Using a Cartesian decomposition topology, each subdomain has at most 26 neighbors, i.e., 6 first-degree neighbors (in contact via a face), 12 second-degree neighbors (in contact via an edge) and 8 third-degree neighbors (in contact via a corner).
As in the 2D version of the algorithm, some local systems of equations need to be solved at the subdomain coupling nodes. Now, consider a subdomain with 26 active neighbors. At the starting corner, a system of linear equations should be solved. Note that this system is not dense (it includes 32 nonzero entries). Then, along each of the three coupling edges, a series of local systems of linear equations should be solved in an appropriate order. In the same manner, on each of the three corresponding coupling faces, a series of local systems of equations should be solved based on an appropriate update order. Having updated the coupling nodes, the remaining nodes are updated using the classic SOR method along the corresponding sweep direction.
5 Parallel incomplete LU preconditioners
The SOR method is not usually used as an iterative solver in practice. Rather, it is commonly used as a smoother in multigrid methods or as a preconditioner in Krylov subspace methods. In these cases, a few iterations of the SOR method are applied during each step of preconditioning (or smoothing). Another kind of preconditioner extensively used in Krylov subspace methods is the incomplete LU (ILU) preconditioner. However, due to the sequential nature of incomplete factorization and of the forward/backward substitution procedure, the application of ILU methods in parallel is challenging.
In this section, we show that the presented parallel method can be used equivalently to develop parallel ILU preconditioners. A simple observation shows that a single pass of a symmetric Gauss-Seidel method on an n-diagonal matrix (precisely, a matrix including only a few nonzero diagonals with a symmetric sparsity pattern) is equivalent to ILU(0) preconditioning (cf. section 10.3 of [saad2003ims]). Therefore, the presented parallel algorithm can be used as a parallel ILU(0) preconditioner as well. Similarly, ILU(p) preconditioning for such structured matrices is equivalent to the application of our method to a higher-order method (cf. section 10.3 of [saad2003ims]), which makes it sensible to use our method to parallelize ILU(p) preconditioners as well.
Later, we will show that our method can be applied to n-diagonal matrices arising from the discretization of PDEs on structured grids. Therefore, the presented strategy can also be regarded as a parallel ILU(p) preconditioner for such n-diagonal matrices.
6 Convergence analysis
In this section we first prove the convergence of the presented parallel algorithm in one and two spatial dimensions. Following the same line of proof, the extension of the theory to higher dimensions is straightforward.
6.1 Monodimensional convergence analysis
To simplify the analysis, it is worthwhile to write the presented method in an algebraic form. Without loss of generality, assume that the spatial domain is decomposed into subdomains, where is an even integer, and that all subdomains include grid points (note that boundary nodes are not considered here, i.e., ). Also, assume that the sweeping direction of the first subdomain is at the starting iteration. We can decompose matrix in (3) as follows,
(17) 
where denotes an identity matrix here,
for ,
for ,
Note that matrices and should be assembled in and such that their positive diagonals match the main diagonals of and . Assume that the eigenvalues of matrices and are denoted by vectors and respectively. The following lemma determines and explicitly.
Lemma. and are strictly positive definite and we have,
Proof. Considering the structures of and , it is easy to verify that and are block lower and block upper triangular matrices respectively. Note that the definitions of both the block lower triangular and the block upper triangular matrix hold for each of and . Therefore, their eigenvalues are equal to the union of the eigenvalues of their diagonal blocks. The diagonal blocks have a simple bidiagonal structure, and their eigenvalues are equal to their main diagonal entries. Considering the fact that completes the proof.
Theorem. The presented parallel one-dimensional algorithm is convergent if
Proof. Considering the decomposition (17) of matrix , with simple algebra we can write the presented one-dimensional parallel iterative algorithm in the following matrix form ,
(18) 
where . From (18) we have,
(19) 
in which,
(20) 
To prove the convergence of the iterations (19), it remains to show that the spectral radius of the iteration matrix, denoted by here, is strictly less than unity (cf. [saad2003ims]). Now, let us define the matrix , which is equivalent to (has the same spectral radius),
(21) 
Using (6.1), can be written as follows,
(22) 
Let us define the following matrices
Assume that the eigenvalues of , , and are denoted by the following vectors respectively,
It is easy to show that,
From standard linear algebra we have,
where the norm operator is understood as the Euclidean norm here,
Therefore, if,
which completes the proof.
Corollary. The presented one-dimensional parallel Gauss-Seidel algorithm () is unconditionally convergent.
Corollary. When , the convergence criterion of the presented one-dimensional parallel SOR algorithm is similar to that of the sequential SOR method, i.e., . Moreover, in this case.
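The sequential Gauss-Seidel benchmark implicit in these statements can be checked numerically on the 1D Poisson model matrix, for which the classical result rho = cos^2(pi*h) holds; the matrix, its size and the splitting below are illustrative assumptions, not the paper's general setting:

```python
import numpy as np

n = 15
h = 1.0 / (n + 1)
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # 1D Poisson matrix
M = np.tril(A)                                        # D + L (Gauss-Seidel splitting)
T = np.eye(n) - np.linalg.solve(M, A)                 # iteration matrix
rho = max(abs(np.linalg.eigvals(T)))                  # spectral radius
# classical result for this model problem: rho = cos^2(pi * h) < 1
```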
Using the same procedure, it is easy to show that the coefficient matrix corresponding to a higher-order method (with more extended computational molecules) admits a decomposition like (17), where and are two strictly positive definite block lower and block upper triangular matrices respectively. It is then straightforward to prove the convergence similarly.
6.2 Twodimensional convergence analysis
Following the previous section, it is possible to prove the convergence of the presented method in two spatial dimensions. To illustrate this, the convergence of the presented two-dimensional algorithm for the discretized equation (10) is discussed in this section.
The extension of the one-dimensional proof to higher dimensions is based on the tensor product properties of structured grids. The standard five-point discretization of (9) on a structured grid (ignoring Dirichlet boundary nodes) leads to the following system of linear equations (the effects of the Dirichlet boundary nodes are included in the right-hand-side vector),
where the strictly positive definite -dimensional matrix has the following structure (row-wise ordering is assumed for the grid points),
where each of , and is a multidiagonal (tri- or mono-diagonal) matrix. For ,