# Modified Block BiCGSTAB for Lattice QCD

###### Abstract

We present results for application of block BiCGSTAB algorithm modified by the QR decomposition and the SAP preconditioner to the Wilson-Dirac equation with multiple right-hand sides in lattice QCD on a lattice at almost physical quark masses. The QR decomposition improves convergence behaviors in the block BiCGSTAB algorithm suppressing deviation between true residual and recursive one. The SAP preconditioner applied to the domain-decomposed lattice helps us minimize communication overhead. We find remarkable cost reduction thanks to cache tuning and reduction of number of iterations.

###### keywords:

Lattice gauge theory, Lattice Dirac equation, Multiple right-hand sides, Block Krylov subspace, Domain decomposition^{†}

^{†}journal: Computer Physics Communications

## 1 Introduction

Lattice QCD simulations initiated 30 years ago stand finally at the point where one can obtain results of physical observables at the physical up, down and strange quark masses PACSCS10 (). The next steps would be refinement of the results reducing the systematic errors and challenge to computationally difficult problems, e.g. calculation of disconnected diagrams. A main difficulty in lattice QCD simulations is that solution of Dirac equation, which have to be repeated many times both in configuration generation and measurement of physical observables on given configurations, is computationally expensive near the physical up and down quark masses. In the measurement of physical observables, however, computational cost may be reduced by block Krylov subspace methods BCG (), since its expensive part is the multiple right-hand side problem. (This is not the case for configuration generation.) One can expect that block Krylov subspace methods make convergence faster with the aid of better search vectors generated from wider Krylov subspace enlarged by number of right-hand side vectors in comparison with non-blocked method. Another possible ingredient to improve performance is an efficient use of memory bandwidth in implementation of block matrix-vector multiplication.

Since the Dirac matrix in lattice QCD is non-Hermitian, we might expect the block BiCGSTAB algorithm BBCGSTAB () is applicable in a straightforward way. One problem in block Krylov subspace methods, however, is that the true residual stops decreasing at some point, while the recursive one continues to decrease. Recently, three of the authors have proposed a new algorithm named block BiCGGR, which showed significant improvement for this problem BBiCGGR1 (); BBiCGGR2 (); BBiCGGR3 ().

In this paper we improve block BiCGSTAB algorithm with two modifications. First one is the QR decomposition, which is known to improve the numerical accuracy in block CG BCG (); GSBCG () and also useful for block BiCGSTAB algorithm GSBiCGSTAB_theo (). Second one is Schwarz alternating procedure (SAP) preconditioner proposed by Lüscher SAP (), which is applied to the domain-decomposed lattice. We can minimize communication overhead with the SAP preconditioner.

## 2 Algorithm

### 2.1 Modified Block BiCGSTAB

We consider to solve the linear systems with the multiple right-hand sides expressed as

(1) |

where is an complex sparse non-Hermitian matrix. and are complex rectangular matrices given by

(2) | |||||

(3) |

In the case of the Wilson-Dirac equation the matrix dimension is given by with the volume of a hypercubic four-dimensional lattice. is the number of the right-hand-side vectors which is called source vectors in lattice QCD. is 12 in the simplest case and (perhaps more in some case) for the stochastic method.

The matrix-vector multiplication for the Wilson-Dirac equation is written as

(4) |

(5) |

where and contain 12 complex numbers at site , are the gamma matrices,
are link variables of SU(3) matrix and is hopping parameter.
Computation of requires
1320^{1}^{1}11296 flops if is non-relativistic representation.
floating point number operations and 360 words per site.
This means the value of Flops/Byte is about 0.9 (0.45)
in the single (double) precision.
It should be difficult to obtain high performance
on recent computer architecture.

In block Krylov subspace methods, Eq. (5) can be expressed as

(6) |

with . An important point is that same 8 link variables around site are used in common for with and their size is just 576 (1152) bytes in the single (double) precision, which are small enough to be retained in low level cache, for example L1 cache (if there is). This allows us more efficient usage of cached data than repeating independent matrix-vector multiplications. Figure 1 illustrates how Flops/Byte increases as is enlarged. For an effective use of cache, we arrange loop for the index of () in the most inner level with running first in memory allocation for vectors.

Pseudocode for modified block BiCGSTAB algorithm is described in Algorithm 2.1. Note that preconditioner at lines 4.2 and 4.6 in Algorithm 2.1 must be implemented by a stationary iterative method in this algorithm since the common preconditioning should be applied to all right-hand sides, though nonstational iterative methods are often used in lattice QCD mix2 (). The orthogonalization of improves numerical accuracy since each span works effectively to search approximated solutions. We employ modified Gram-Schmidt method for the QR decomposition. Even when non-block BiCGSTAB fails to converge, modified block BiCGSTAB may converge by adding dummy right-hand side vectors if they can play a supplementary role GSBiCGSTAB_theo (). We also present a memory saving version in Algorithm 2.1.

### 2.2 Preconditioning

In this work we employ the -improved
^{2}^{2}2The leading cut-off error in terms of the lattice spacing is removed. Wilson fermions.
After Jacobi preconditioning (division of + clover term),
the matrix is decomposed as
the following blocked matrix form,

(7) |

where the subscript and denote the even and odd domains, respectively. The SAP preconditioner is computed as

(8) |

where () is an approximation for () obtained by SSOR method SSOR ()

(9) |

with

(10) |

is the forward hopping term and is the backward one. We perform SAP preconditioning in the single precision for effective use of memory bandwidth and network bandwidth between nodes.

It is known that “sloppy” precision can be used in right preconditioning, but not in left one. Suppose calculation of at line 4.6 in Algorithm 2.1 is performed with “sloppy” precision in -th iteration. Numerical errors for , , and may be expressed as

(11) | |||||

(12) | |||||

(13) | |||||

(14) |

These yield

(15) |

which satisfies the exact relation between approximate solutions and residuals in -th iteration. For the case that both at line 4.2 and at line 4.6 are computed with “sloppy” precision one can also reproduce the above relation with the following formulae:

(16) | |||||

(17) | |||||

(18) | |||||

(19) | |||||

(20) | |||||

(21) | |||||

(22) | |||||

(23) |

## 3 Numerical test

### 3.1 Choice of parameters

We test modified block BiCGSTAB employing a so-called “local source”, , with for color-spin components. We use statistically independent 10 configurations generated at almost the physical point, and , in 2+1 flavor lattice QCD with the nonperturbatively -improved Wilson quark action and the Iwasaki gauge action iwasaki () at on a lattice PACSCS10 (). We choose the hopping parameter for the Wilson-Dirac equation and with domain size for the SAP preconditioning following Ref. PACSCS10 (). Parameters for SSOR method are also fixed with and . The stopping criterion is set to be with .

### 3.2 Test environment

Numerical test is performed on 16 nodes of a large-scale cluster system called T2K-Tsukuba. The machine consists of 648 compute nodes providing 95.4Tflops of computing capability. Each node consists of quad-socket, 2.3GHz Quad-Core AMD Opteron Model 8356 processors whose on-chip cache sizes are 64KBytes/core, 512KBytes/core, 2MB/chip for L1, L2, L3, respectively. Each processor has a direct connect memory interface to an 8GBytes DDR2-667 memory and three hypertransport links to connect other processors. All the nodes in the system are connected through a full-bisectional fat-tree network consisting of four interconnection links of 8GBytes/sec aggregate bandwidth with Infiniband. For numerical test we modify the lattice QCD simulation program LDDHMC/ver1.04b12.31 developed by PACS-CS Collaboration LDDHMC ().

### 3.3 Results

Figure 2 shows a representative case for residual norm as a function of number of iterations for modified block BiCGSTAB. We observe one of important features of block Krylov subspace methods that the number of iterations required for convergence decreases as the block size is increased.

time[s] | T(gain) | NMVM | NM(gain) | |
---|---|---|---|---|

3827(755) | 1 | 17146(3326) | 1 | |

2066(224) | 1.9 | 12942(1379) | 1.3 | |

1619(129) | 2.4 | 10652(832) | 1.6 | |

1145(99) | 3.3 | 9343(835) | 1.8 | |

1040(87) | 3.7 | 7888(663) | 2.2 | |

705(70) | 5.4 | 6106(633) | 2.8 |

The results are summarized in Table 1. Second column is total time to solve the Wilson-Dirac equation for all 12 colour-spin components at one local source. In case of , for example, 12 right-hand side vectors are divided into two blocks: and . Third column is gain factor of time compared with case. Fourth and fifth columns are number of matrix-vector multiplication (NMVM) and its gain factor, respectively. Modified block BiCGSTAB with is about 5 times faster than case. The iteration number is reduced by a factor of three. Additional speed-up by a factor of two is obtained by cache tuning. This is roughly consistent with the enhancement of Flops/Byte from 1.05 at to 1.95 at plotted in Fig. 1.

## 4 Conclusions

In this paper, we have carried out numerical test for block BiCGSTAB with two modifications of the QR decomposition and the SAP preconditioner in lattice QCD at almost physical quark masses. The QR decomposition successfully removes the problem in block BiCGSTAB that is the deviation between the true and the recursive residuals. We find remarkable cost reduction at large due to smaller number of iterations and efficient cache usage. Further gain could be expected in calculations of disconnected diagram and reweighting factor, where larger value of is required. One concern is that numerical cost for modified Gram-Schmidt method increases as .

## Acknowledgments

Numerical calculations for the present work have been carried out on the T2K-Tsukuba computer at the University of Tsukuba. This work is supported in part by Grants-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science and Technology (Nos. 20800009, 20105002, 22244018 ).

## References

- (1) PACS-CS Collaboration, S. Aoki et al., Physical Point Simulation in 2+1 Flavor Lattice QCD, Phys. Rev. D 81 (2010) 074503.
- (2) D. P. O’Leary, The block conjugate gradient algorithm and related methods, Linear Algebra Appl., 29 (1980), pp. 293-322; A. A. Nikishin and A. Y. Yeremin, Variable block CG algorithms for solving large sparse symmetric positive-definite linear systems on parallel computers, SIAM J. Matrix Anal. Appl., 16 (1995), No. 4, pp. 1135-1153
- (3) A. El Guennouni, K. Jbilou, H. Sadok, A Block Version of BiCGSTAB for Linear Systems with Multiple Right-Hand Sides, Elec. Trans. Numer. Anal. 16 (2003) 129.
- (4) H. Tadano, T. Sakurai, Y. Kuramashi, Block BiCGGR: A New Block Krylov Subspace Method for Computing High Accuracy Solutions JSIAM Lett. 1 (2009) 44.
- (5) T. Sakurai, H. Tadano, Y. Kuramashi, Application of block Krylov subspace algorithms to the Wilson-Dirac equation with multiple right-hand sides in lattice QCD Comput. Phys. Commun. 181, (2010) 113.
- (6) H. Tadano, Y. Kuramashi, T. Sakurai, Application of preconditioned block BiCGGR to the Wilson-Dirac equation with multiple right-hand sides in lattice QCD, Comput. Phys. Commun. 181, (2010) 883.
- (7) A. A. Dubrulle, Retooling the method of block conjugate gradients, Elec. Trans. Numer. Anal. 12 (2001) 216.
- (8) H. Tadano et al., in preparation.
- (9) M. Lüscher, Lattice QCD and the Schwarz alternating procedure, JHEP 0305, 052 (2003); Comput. Phys. Commun. 165, (2005) 119.
- (10) S. Dürr et al., Scaling study of dynamical smeared-link clover fermions, Phys. Rev. D 79 (2009) 014501.
- (11) S. Fischer et al., A Parallel SSOR Preconditioner for Lattice QCD, Comput. Phys. Commun. 98 (1996) 20, hep-lat/9602019.
- (12) Y. Iwasaki, Renormalization group analysis of lattice theories and improved lattice action; 2, four-dimensional non-Abelian SU(N) gauge model, preprint, UTHEP-118 (Dec. 1983), unpublished.
- (13) PACS-CS Collaboration, S. Aoki et al., 2+1 Flavor Lattice QCD toward the Physical Point, Phys. Rev. D 79 (2009) 034503.