An update on the BQCD Hybrid Monte Carlo program
Abstract
We present an update of BQCD, our Hybrid Monte Carlo program for simulating lattice QCD. BQCD is one of the main production codes of the QCDSF collaboration and is used by CSSM and in some Japanese finite temperature and finite density projects. Since the first publication of the code at Lattice 2010 the program has been extended in various ways. New features of the code include: dynamical QED, action modifications for computing matrix elements via the Feynman-Hellmann theorem, more trace measurements (e.g. for chemical potential reweighting), a more flexible integration scheme, polynomial filtering, term-splitting for RHMC, and a portable implementation of performance-critical parts employing SIMD.
Taylor Ryan Haar^1, Yoshifumi Nakamura^2, and Hinnerk Stüben^3 (speaker, e-mail: hinnerk.stueben@uni-hamburg.de)
^1 CSSM, Department of Physics, The University of Adelaide, Adelaide, SA 5005, Australia
^2 RIKEN Advanced Institute for Computational Science, Kobe, Hyogo 650-0047, Japan
^3 Universität Hamburg, Regionales Rechenzentrum, 20146 Hamburg, Germany

1 Introduction
BQCD is a Hybrid Monte Carlo program for simulating lattice QCD with dynamical Wilson fermions. It was first published at Lattice 2010 [1] and has been used by several groups: the QCDSF-UKQCD collaboration [2, 3, 4, 5, 6, 7], CSSM [6, 7, 8, 9], Japanese finite density [10, 11] and finite temperature [12] projects, and the RQCD collaboration [13].
Here we report on extensions and optimizations that have been made in the meantime and give an update on compute performance. The code and a manual can be downloaded from [14]. New features of the program are:

Actions: hopping term with chemical potential, clover-improved Wilson action plus a CPT-breaking term, QCD+QED, QCD+Axion. See section 2.

Algorithms: polynomial filtering, a generalized multi-scale integration scheme, truncated RHMC, Zolotarev approximation. See section 4.

Compute performance optimizations: explicit vectorization with SIMD intrinsics, improvement of MPI communication. See section 5.
2 Actions
2.1 Gauge actions
Implemented are the Wilson gauge action and an improved gauge action; see [1].
2.2 Fermion actions
2.3 QCD+QED
The program can simulate QCD+QED [5] using the action
S = S_g + S_\gamma + \sum_f S_f ,   (3)
where S_g is an SU(3) gauge action and S_\gamma is the non-compact U(1) gauge action
S_\gamma = \frac{\beta_\gamma}{2} \sum_x \sum_{\mu<\nu} \left[ A_\mu(x) + A_\nu(x+\hat\mu) - A_\mu(x+\hat\nu) - A_\nu(x) \right]^2 ,   (4)
and the fermion action for flavour f is
S_f = \bar\psi_f \, D[V^{(f)}] \, \psi_f ,   (5)
where V^{(f)}_\mu(x) = e^{i q_f A_\mu(x)} \tilde U_\mu(x), q_f is the electric charge of flavour f, and \tilde U_\mu(x) is a singly iterated stout link.
2.4 QCD+Axion
The program can simulate QCD+Axion [15] using the action
(6) 
is scalar action for the axion field
(7) 
and the fermion action for flavour in the case of Wilson fermions is
(8) 
where
(9) 
3 Measurements
The following quantities can be measured with BQCD: plaquettes (square and rectangular), topological charge (cooling method), Polyakov loop, Wilson flow, traces of the fermion matrix, quark determinant with chemical potential, smallest and largest eigenvalues of the Dirac matrix, and meson and baryon propagators.
4 Algorithms
In addition to nested integrators for multiple time scales, a generalized integration scheme [8] has been implemented. Here, separate integration schemes for the individual action terms are superimposed onto a single time-step evolution, which allows the integration step size of each action term to be chosen completely independently of the others.
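As an illustration of the superimposed scheme, the following hypothetical Python sketch builds the merged schedule of force-evaluation times for a leapfrog-like integrator; the function names and data structures are ours, not BQCD's:

```python
# Sketch: superimpose independent step sizes for several action terms
# onto a single time-step evolution (hypothetical, not BQCD's actual code).
from fractions import Fraction

def merged_schedule(trajectory_length, steps_per_term):
    """Return the sorted union of force-evaluation times for all terms.

    steps_per_term maps a term name to its number of steps, so term i is
    integrated with its own step size tau / n_i, independent of the others.
    """
    times = set()
    tau = Fraction(trajectory_length)
    for name, n in steps_per_term.items():
        h = tau / n
        # the force of this term acts at the midpoint of each of its steps
        times.update(h / 2 + k * h for k in range(n))
    return sorted(times)

# Example: gauge force on a fine scale, fermion force on a coarse one.
schedule = merged_schedule(1, {"gauge": 4, "fermion": 2})
```

The merged schedule is simply the union of each term's update times, which is what makes the per-term step sizes independent.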
Rational Hybrid Monte Carlo (RHMC) is implemented with rational approximations obtained from the Remez algorithm. When approximating the inverse square root, an alternative rational function is available, namely the Zolotarev optimal rational approximation (see e.g. [16] for an explanation).
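Both the Remez and Zolotarev constructions yield a rational function in partial-fraction form, R(x) = a_0 + sum_i a_i / (x + b_i), so that all shifted inverses can be obtained with one multi-shift solve. A small self-contained check of the partial-fraction form, with illustrative coefficients (not actual RHMC coefficients):

```python
# Evaluate a rational function in partial-fraction (multi-shift) form,
#   R(x) = a0 + sum_i a_i / (x + b_i).
# Coefficients below are illustrative only, not real RHMC output.

def rational(x, a0, a, b):
    return a0 + sum(ai / (x + bi) for ai, bi in zip(a, b))

# Check the partial-fraction identity
#   (x + 3) / ((x + 1)(x + 2)) = 2/(x + 1) - 1/(x + 2)
x = 1.0
direct = (x + 3) / ((x + 1) * (x + 2))
pf = rational(x, 0.0, [2.0, -1.0], [1.0, 2.0])
assert abs(direct - pf) < 1e-12
```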
Alongside Hasenbusch filtering, there are two new filtering methods, one of which applies exclusively to RHMC:

Polynomial filtering applies to both RHMC and standard HMC fermions. It uses a polynomial filter P(K), based on the exact identity K^{-1} = P(K) [K P(K)]^{-1}, to split the fermion action into several terms:
S = \phi_1^\dagger P(K) \phi_1 + \phi_2^\dagger [K P(K)]^{-1} \phi_2 .   (10)
Term-splitting for RHMC splits the sum in the rational approximation R(K) = \alpha_0 + \sum_{i=1}^{n} \alpha_i (K + \beta_i)^{-1} into several partial sums, giving the action
S = \sum_j \phi_j^\dagger R_j(K) \phi_j ,   (11)
where
R_j(K) = \sum_{i=n_{j-1}+1}^{n_j} \alpha_i (K + \beta_i)^{-1} ,   (12)
with \alpha_0 included in the first term and the shifts \beta_i ordered decreasingly.
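The polynomial filter relies on the fact that, for any polynomial P, the relation 1/K = P(K) · [K P(K)]^{-1} holds exactly; a scalar illustration with a toy filter polynomial (not one of BQCD's actual filters):

```python
# Scalar illustration of the identity behind polynomial filtering:
#   K^{-1} = P(K) * [K P(K)]^{-1}   exactly, for any polynomial P.
# The filter P below is a toy choice for illustration.

def P(x):
    return 3.0 - 3.0 * x + x * x   # arbitrary short "filter" polynomial

def filtered_inverse(x):
    # cheap filter term times the filtered correction term
    return P(x) * (1.0 / (x * P(x)))

for x in (0.5, 1.7, 3.0):
    assert abs(filtered_inverse(x) - 1.0 / x) < 1e-12
```

The point of the split is that the cheap filter term can be integrated on a fine time scale while the expensive correction term sits on a coarse one.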
BQCD has a wide range of iterative solvers: CG, BiCGStab, GMRES, GCRO-DR, multi-shift CG, and block multi-shift CG.
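For reference, the simplest entry in this list can be sketched in a few lines; the following is a minimal pure-Python conjugate gradient for a small symmetric positive-definite system (illustrative only, BQCD's solvers operate on the lattice Dirac operator):

```python
# Minimal conjugate-gradient (CG) solver for A x = b with A symmetric
# positive definite, in pure Python (illustration, not BQCD code).

def cg(A, b, tol=1e-12, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x  (x = 0 initially)
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol * tol:
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cg(A, b)   # exact solution is (1/11, 7/11)
```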
5 Optimization of compute performance
5.1 SIMD vectorization and MPI
In addition to parallelization with MPI and OpenMP, a third level of parallel implementation was introduced for the solvers: SIMD vectorization with SIMD intrinsic functions. The SIMD implementation is generic, i.e. it works for any SIMD vector length. To achieve this, the data layout of the arrays had to be changed: all arrays (for gauge, spin-colour and clover fields) now have SIMD vectors as their innermost structure. In Fortran notation the gauge field is defined in the following way
old: complex(8) :: u(3, 3, volume/2)
new: real(8) :: u(SIMDsize, re:im, 3, 3, volume/2 / SIMDsize)
and the layout of the spin-colour field is
old: complex(8) :: a(4, 3, volume/2)
new: real(8) :: a(SIMDsize, re:im, 2, 3, volume/2 / SIMDsize, 2)
where the four spin components of the spin-colour field are split into two pairs, which optimizes MPI communication in one direction. The clover arrays, for which a packed format is used, were changed accordingly.
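The effect of the layout change can be illustrated with hypothetical 0-based index helpers (BQCD expresses this directly through the Fortran array declarations above): the innermost index now runs over SIMD lanes, i.e. over neighbouring lattice sites.

```python
# Sketch of the structure-of-arrays layout change for SIMD
# (hypothetical index helpers, purely for illustration).
SIMD = 4  # SIMD vector length, e.g. 4 doubles for 256-bit vectors

def old_index(c1, c2, reim, site):
    # old: complex(8) :: u(3, 3, volume/2) -> re/im fastest, then colour
    return reim + 2 * (c1 + 3 * (c2 + 3 * site))

def new_index(c1, c2, reim, site):
    # new: real(8) :: u(SIMDsize, re:im, 3, 3, volume/2 / SIMDsize)
    # fastest index runs over SIMD lanes, i.e. over neighbouring sites
    lane, block = site % SIMD, site // SIMD
    return lane + SIMD * (reim + 2 * (c1 + 3 * (c2 + 3 * block)))
```

Both mappings are bijections onto the same contiguous range; only the ordering changes, so that a SIMD load picks up the same matrix element from SIMDsize consecutive sites.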
At the single-core level the SIMD code is about two times faster than the corresponding Fortran code. With this speedup, computations become increasingly dominated by communication, and improving the MPI communication becomes important. Hence, the following MPI optimizations were made:

The overhead introduced by the reduction to two-component spinors was minimized. Previously the projection was done for the whole local volume; now it is done only for boundary sites, and the projection in one of the directions is no longer needed.

All MPI buffers are consecutive in memory and aligned to SIMD vector boundaries.

Communication can overlap with computation. This is implemented with MPI plus OpenMP, where the master thread communicates while the other threads compute.
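The overlap pattern can be sketched as follows, with hypothetical Python threads standing in for MPI + OpenMP and a simple function standing in for the halo exchange:

```python
# Sketch of communication/computation overlap (hypothetical; BQCD uses
# MPI + OpenMP, with the master thread communicating while others compute).
import threading

halo = {}

def communicate(send_buf):
    # stand-in for the MPI halo exchange performed by the master thread
    halo["recv"] = [2.0 * v for v in send_buf]

def compute_interior(data):
    # interior sites need no halo data, so they can be updated concurrently
    return [v + 1.0 for v in data]

send = [1.0, 2.0]
interior = [10.0, 20.0]

t = threading.Thread(target=communicate, args=(send,))
t.start()                            # "master" communicates ...
result = compute_interior(interior)  # ... while the others compute
t.join()                             # synchronize before boundary work
boundary = halo["recv"]              # boundary sites may now be updated
```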
In tables 1 and 2 performance figures are listed for machines and lattices that are currently used in production. The optimized code runs between 1.3 and 1.7 times faster.
                      lattice                            lattice
              Fortran          SIMD              Fortran          SIMD
          per core overall per core overall  per core overall per core overall
 #cores    Mflop/s Tflop/s  Mflop/s Tflop/s   Mflop/s Tflop/s  Mflop/s Tflop/s
   1536        930     1.4     1557     2.4       785     1.2     1028     1.6
   3072        949     2.9     1516     4.7       795     2.4     1231     3.8
   6144       1222     7.5     1558     9.6       876     5.4     1419     8.7
   9216       1253    11.5     1775    16.4         –       –        –       –
  12288       1274    15.7     1678    20.6       927    11.4     1572    19.3
                      lattice                            lattice
              Fortran          SIMD              Fortran          SIMD
          per core overall per core overall  per core overall per core overall
 #cores    Mflop/s Tflop/s  Mflop/s Tflop/s   Mflop/s Tflop/s  Mflop/s Tflop/s
   8192        539     4.4      912     7.5       525     4.3      752     6.2
  16384          –       –        –       –       596     9.8      783    12.8
  32768          –       –        –       –       503    16.5      771    25.3
5.2 QUDA
BQCD can run on GPUs by employing the QUDA library [17]. QUDA provides a BQCD interface to its CG and multi-shift CG solvers.
Acknowledgements
We would like to thank Gerrit Schierholz, Roger Horsley, Waseem Kamleh, Paul Rakow and James Zanotti for support, stimulating discussions and bug reports. The computations were performed on a Cray XC40 of the North-German Supercomputing Alliance (HLRN) and the IBM BlueGene/Q at the Jülich Supercomputing Centre (JSC).
References
 [1] Y. Nakamura, H. Stüben, PoS LATTICE2010, 040 (2010), 1011.0199
 [2] H. Stüben (UKQCD, QCDSF), Nucl. Phys. Proc. Suppl. 94, 273 (2001), hep-lat/0011045
 [3] M. Göckeler et al. (QCDSF), PoS LAT2007, 041 (2007), 0712.3525
 [4] N. Cundy et al., Phys. Rev. D79, 094507 (2009), 0901.3302
 [5] R. Horsley et al., JHEP 04, 093 (2016), 1509.00799
 [6] A.J. Chambers et al. (QCDSF/UKQCD, CSSM), Phys. Rev. D90, 014510 (2014), 1405.3019
 [7] A.J. Chambers et al., Phys. Rev. D92, 114517 (2015), 1508.06856
 [8] T. Haar, W. Kamleh, J. Zanotti, Y. Nakamura, Comput. Phys. Commun. 215, 113 (2017), 1609.02652
 [9] S. Hollitt, P. Jackson, R. Young, J. Zanotti, PoS INPC2016, 272 (2017)
 [10] X.Y. Jin, Y. Kuramashi, Y. Nakamura, S. Takeda, A. Ukawa, Phys. Rev. D88, 094508 (2013), 1307.7205
 [11] X.Y. Jin, Y. Kuramashi, Y. Nakamura, S. Takeda, A. Ukawa, Phys. Rev. D92, 114511 (2015), 1504.00113
 [12] X.Y. Jin, Y. Kuramashi, Y. Nakamura, S. Takeda, A. Ukawa, Phys. Rev. D96, 034523 (2017), 1706.01178
 [13] G.S. Bali, S. Collins, A. Cox, A. Schäfer, Phys. Rev. D96, 074501 (2017), 1706.01247
 [14] https://www.rrz.uni-hamburg.de/bqcd
 [15] G. Schierholz, Y. Nakamura, Dynamical QCD+Axion simulation: First results, in Proceedings, 35th International Symposium on Lattice Field Theory (Lattice 2017), Granada, Spain, to appear in EPJ Web Conf.
 [16] T.W. Chiu, T.H. Hsieh, C.H. Huang, T.R. Huang, Phys. Rev. D66, 114502 (2002), hep-lat/0206007
 [17] M.A. Clark, R. Babich, K. Barros, R.C. Brower, C. Rebbi, Comput. Phys. Commun. 181, 1517 (2010), 0911.3191