lattice2016

# Platform independent profiling of a QCD code

## 1 Introduction

Numerical simulations in lattice QCD are large scale. Besides constant algorithmic development, they require massive parallelization and consume a substantial share of the resources of the leading supercomputing centers in Europe and worldwide. To exploit ever-changing HPC architectures efficiently, one continually needs to adapt and optimize existing codes for new hardware. To develop a lattice QCD code that scales well on diverse platforms, the developer needs not only access to different machines but also the ability to perform large experimental campaigns in a fairly short period of time. It would therefore be advantageous to design a universal procedure for benchmarking native code executions on a wide range of supercomputing platforms. We propose to use SimGrid [2], an adaptable simulator for parallel systems, in combination with the openQCD [1] code, in order to measure the scalability of the latter without accessing the target machine. The procedure presented here can be applied in the same way to other up-to-date QCD codes.

The aim of this setup is to speed up the benchmarking of the chosen QCD code on an unknown machine and to enable benchmarking of platforms to which the user does not yet have access. Moreover, many of the world's supercomputing centers, after awarding computer time, leave the choice of a suitable machine within the given center to the users. Estimating the optimal machine and the appropriate number of nodes in these situations, in a timely manner and without wasting computational resources, is another possible application of this procedure.

In Section 2, we briefly describe the program package openQCD and perform a conventional native set of benchmarks for a pure gauge simulation code. The SimGrid simulation framework is introduced in Section 3. We outline a procedure for simulating openQCD programs using SimGrid in Section 4. We report our preliminary results in Section 5 and discuss the next steps envisioned for this study in Section 6. Finally, we conclude and address possible future applications of this work in Section 7.

## 2 openQCD: benchmarking and scaling

The simulation package openQCD contains programs for generating both pure gauge and dynamical QCD configurations, the latter with a choice of improved Wilson fermion action. Although the original motivation was to create a portable simulation tool for the recently introduced open boundary conditions [3], the programs in this package allow for several different choices of boundary conditions¹: open, periodic, Schrödinger Functional (SF), and open-SF. The simulation program is based on the Hybrid Monte Carlo algorithm [6] and supports parallelization in 0, 1, 2, 3 or 4 dimensions. All the programs in this package are highly optimized for machines with modern Intel or AMD processors, but will run correctly on any system that complies with the ISO C89 and MPI 1.2 standards. A version of the code optimized for BlueGene/Q machines is available from [7]. The code is open source under the GPL license [1].
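To make the parallelization concrete: if each MPI process holds a local lattice and the processes are arranged on a four-dimensional grid, the global lattice extent in each direction is the product of the local extent and the number of processes in that direction. The following is a minimal sketch of that arithmetic. The function name and the example sizes are illustrative assumptions, not values taken from openQCD or from this study.

```python
from math import prod


def global_lattice(local, grid):
    """Return the global lattice extents for a given local lattice and process grid.

    local: extents of the lattice held by one process, one per direction
    grid:  number of processes in each direction (1 means "not parallelized")
    """
    if len(local) != 4 or len(grid) != 4:
        raise ValueError("expected four extents, one per space-time direction")
    return tuple(l * n for l, n in zip(local, grid))


# Hypothetical example: an 8^4 local lattice, parallelized in 2 dimensions.
local = (8, 8, 8, 8)
grid = (1, 1, 2, 2)
print("processes:", prod(grid))
print("global lattice:", global_lattice(local, grid))
```

In a weak scaling test the local lattice is kept fixed while the process grid grows, so the global lattice grows with the number of processes.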

We apply the first set of benchmarks to the program ym1, which generates ensembles of SU(3) gauge configurations. Exactly which theory is simulated depends on the parameters passed to the program. For the scaling tests shown in Figure ?, we chose the tree-level improved Symanzik gauge action and open-SF boundary conditions. The scaling tests were performed on the Piz Daint² machine of the CSCS supercomputing center.

## 3 SimGrid simulation framework

SimGrid is a versatile simulation framework for distributed systems that supports studies in different fields, such as grid, cloud, peer-to-peer and volunteer computing. In the HPC context, and more precisely for MPI applications, SimGrid provides the SMPI module [8], which can be used in two different modes: offline and online. In the offline mode, a trace of a previous native execution on a supercomputer is replayed in the simulator. Quick and accurate predictions can be obtained, but such results are hard to extrapolate faithfully to different scenarios. The main reason is that modern hardware and software stacks are so complex that even a minor change to a platform or application parameter can cause a substantial change in the overall execution. The online mode, on the other hand, executes the application directly and relies only on the target platform description, without any traces. Simulations in this mode can therefore reflect modifications to the application or machine setup much better. Although such an approach is much harder to implement and validate, its predictive capabilities are more advanced. There is a plethora of MPI simulators used in the HPC community ([9], [10], and many more), but we have chosen SimGrid because this type of study needs its faithful online simulations with accurate modeling of network contention.

## 4 Methodology for profiling openQCD with a SimGrid simulator

The workflow for conducting simulations of openQCD is presented in Figure 1. In the initial phase, the target machine needs to be carefully calibrated. For this, one needs to reserve only a few nodes of a possibly large supercomputer. These nodes are first benchmarked to capture their characteristics, typically the number of processing units and their speed. Then, a different set of benchmarks containing a wide range of MPI calls is performed to measure the network characteristics and its behavior under contention. The obtained values are used by SimGrid when instantiating its communication models. These coefficients, together with the previously captured platform details, are saved in a separate XML file.
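As an illustration of what such a file may contain, the sketch below builds a small cluster-style platform description. The element and attribute names are assumptions modeled on SimGrid's cluster tag and should be checked against the DTD of the installed SimGrid release, which may also require a DOCTYPE header. The numerical values are placeholders, not calibration results from this study.

```python
import xml.etree.ElementTree as ET


def make_platform(n_nodes, speed, bandwidth, latency):
    """Build a cluster-style platform description and return it as an XML string.

    All tag and attribute names are assumptions modeled on SimGrid's
    cluster description; verify them against the installed release.
    """
    if n_nodes < 1:
        raise ValueError("need at least one node")
    platform = ET.Element("platform", {"version": "4.1"})
    zone = ET.SubElement(platform, "zone", {"id": "world", "routing": "Full"})
    ET.SubElement(zone, "cluster", {
        "id": "calibrated-cluster",       # hypothetical identifier
        "prefix": "node-",
        "suffix": "",
        "radical": f"0-{n_nodes - 1}",    # host index range
        "speed": speed,                   # per-core compute speed
        "bw": bandwidth,                  # link bandwidth
        "lat": latency,                   # link latency
    })
    return ET.tostring(platform, encoding="unicode")


# Placeholder values standing in for measured calibration data.
print(make_platform(8, speed="1Gf", bandwidth="10Gbps", latency="1us"))
```

Generating the file from a script rather than by hand keeps the calibrated values and the node count in one place, which helps when many platform variants are needed.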

Once this platform description file is generated, no further access to the supercomputer is needed. One simply compiles the openQCD application with SimGrid's SMPI instead of a classical MPI library. The resulting program can then be run on a single core of a commodity laptop and provide faithful performance predictions. The openQCD application “thinks” that it is executed on a large scale machine using many nodes, while it is actually run inside a single SimGrid process.
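The compile and run steps can be sketched as the two command lines assembled below. The tools `smpicc` and `smpirun` and the options `-np`, `-platform` and `-hostfile` belong to SMPI, and `-i` is the input-file option of ym1. All file names are hypothetical, and in practice one would more likely point the compiler variable of the openQCD Makefile at `smpicc` than list sources by hand. The function only builds the argument lists and does not execute anything.

```python
import shlex


def smpi_commands(sources, n_ranks, platform_file, hostfile,
                  binary="ym1_smpi", input_file="ym1.in"):
    """Return (compile_cmd, run_cmd) as argument lists for an SMPI build and run.

    File names and the optimization flag are illustrative assumptions.
    """
    if n_ranks < 1:
        raise ValueError("need at least one MPI rank")
    compile_cmd = ["smpicc", "-O2", "-o", binary, *sources]
    run_cmd = [
        "smpirun", "-np", str(n_ranks),
        "-platform", platform_file,
        "-hostfile", hostfile,
        f"./{binary}", "-i", input_file,
    ]
    return compile_cmd, run_cmd


compile_cmd, run_cmd = smpi_commands(["ym1.c"], 8, "platform.xml", "hostfile.txt")
print(shlex.join(compile_cmd))
print(shlex.join(run_cmd))
```

The returned lists can be passed to `subprocess.run` on a machine where SimGrid is installed.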

Since almost no actual computations and no data transfers are performed, the simulation is typically much faster than the native execution and consumes much less memory. This allows for running many profiling experiments at low cost and fully assessing the influence of an application input parameter (e.g. lattice size or bare gauge coupling in the case of a lattice QCD code). Since the platform description file can easily be edited by hand, one may also explore “what-if” scenarios by changing the number of nodes, the number of processing units, the network latency and bandwidth, etc. This can provide invaluable information to openQCD developers and users, who face the challenging task of correctly dimensioning the problem size they want to study and of determining the optimal machine for such a workload.
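A “what-if” scenario of this kind amounts to changing a few attributes of the platform description and rerunning the simulation. The sketch below shows one way to derive such variants programmatically. The embedded XML is a hypothetical, simplified platform file whose tag and attribute names are assumed to follow SimGrid's cluster description, and the values are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical baseline platform; names and values are placeholders.
BASE = """<platform version="4.1">
  <zone id="world" routing="Full">
    <cluster id="c" prefix="node-" suffix="" radical="0-7"
             speed="1Gf" bw="10Gbps" lat="1us"/>
  </zone>
</platform>"""


def what_if(xml_text, **changes):
    """Return a copy of the platform description with cluster attributes changed."""
    root = ET.fromstring(xml_text)
    cluster = root.find(".//cluster")
    if cluster is None:
        raise ValueError("no <cluster> element found")
    for name, value in changes.items():
        cluster.set(name, str(value))
    return ET.tostring(root, encoding="unicode")


scenarios = {
    "more nodes": what_if(BASE, radical="0-63"),
    "slower network": what_if(BASE, lat="10us", bw="1Gbps"),
}
for name, xml_text in scenarios.items():
    print(name, "->", ET.fromstring(xml_text).find(".//cluster").attrib)
```

Each variant can be written to its own file and used in place of the calibrated platform description, while the baseline stays untouched.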

## 5 Initial evaluation of ym1

We used SimGrid to simulate the pure Yang-Mills simulation program (ym1) of openQCD. We compared the time needed to run four trajectories in the generation of pure gauge configurations using the Symanzik gauge action and open-SF boundary conditions, on a local lattice. The first tests were run on 4 and 8 processors (machines with 2 dodeca-core Haswell Intel Xeon 2.5 GHz processors and 128 GB RAM) of the PlaFRIM computing cluster, corresponding to the global lattice sizes of and . The actual execution times of the ym1 program were compared to the execution times predicted by SimGrid. The outcome is shown in Figure ?. The native openQCD code demonstrates very good performance: in these weak scaling experiments the timings of the two executions are very similar. SimGrid provides an accurate estimate in the case of 4 processors, but for 8 processors the prediction largely overestimates the overall execution time.
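The comparison between native and simulated runs can be quantified with two simple numbers: the relative error of the prediction and the weak scaling efficiency of the native runs. The sketch below computes both. The timings are made-up placeholders chosen only to mimic the qualitative pattern described above. They are not the measurements of this study.

```python
def relative_error(predicted, native):
    """Signed relative error of a predicted time with respect to the native time."""
    if native <= 0:
        raise ValueError("native time must be positive")
    return (predicted - native) / native


def weak_scaling_efficiency(t_base, t_n):
    """Weak scaling efficiency: time of the smallest run divided by time of the larger run."""
    if t_n <= 0:
        raise ValueError("time must be positive")
    return t_base / t_n


# Placeholder wall-clock times in seconds, keyed by number of processors.
native = {4: 100.0, 8: 102.0}
simulated = {4: 101.0, 8: 140.0}

for n in sorted(native):
    err = relative_error(simulated[n], native[n])
    eff = weak_scaling_efficiency(native[4], native[n])
    print(f"{n} processors: prediction error {err:+.1%}, native efficiency {eff:.2f}")
```

An efficiency close to 1 for the native runs together with a large positive error for the larger run would point to the simulated communication costs rather than to the application itself.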

To understand the source of this discrepancy, we intend to introduce timestamps into the openQCD code. By comparing the traces of the native and SimGrid executions, we expect to detect and later correct the error in our models, which should make good estimates for large scale executions achievable. Currently, we suspect that the issues we encountered were caused by faulty network drivers and problems with the MPI libraries on the PlaFRIM cluster at the time of benchmarking, which led to an inaccurate generation of the communication models. However, these assumptions need to be verified empirically.
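Such a trace comparison could look as follows. Assume each run produces a list of timestamped events of the form (phase label, start time, end time). Summing the durations per phase and comparing the two runs then shows where the simulated time deviates most. The trace format and the phase names are hypothetical. They are not an existing openQCD or SimGrid output format.

```python
def phase_durations(events):
    """Sum the durations per phase label from (label, t_start, t_end) events."""
    totals = {}
    for label, start, end in events:
        if end < start:
            raise ValueError(f"event '{label}' ends before it starts")
        totals[label] = totals.get(label, 0.0) + (end - start)
    return totals


def largest_discrepancy(native_events, simulated_events):
    """Return (phase, simulated minus native time) for the phase that differs most."""
    native = phase_durations(native_events)
    simulated = phase_durations(simulated_events)
    diffs = {
        label: simulated.get(label, 0.0) - native.get(label, 0.0)
        for label in set(native) | set(simulated)
    }
    worst = max(diffs, key=lambda label: abs(diffs[label]))
    return worst, diffs[worst]


# Hypothetical traces with times in seconds.
native_trace = [("gauge_update", 0.0, 1.0), ("halo_exchange", 1.0, 1.5)]
simulated_trace = [("gauge_update", 0.0, 1.0), ("halo_exchange", 1.0, 3.0)]
print(largest_discrepancy(native_trace, simulated_trace))
```

If the communication phases dominate the discrepancy, this supports the suspicion that the communication models are the part that needs recalibration.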

## 6 Future prospects

For practical applications in lattice QCD, one would need to go beyond testing ym1, the program for generating pure gauge configurations. In the case of openQCD, this would mean obtaining a good match between the native and simulated execution times of the qcd1 program. Providing faithful estimates for the dynamical gauge configuration generation program requires accurate predictions of the Dirac matrix inversions. However, since SimGrid has already been used successfully to simulate solvers for sparse linear systems [11], we expect a smooth transition to more complex simulation programs such as qcd1.

The presented study was mostly exploratory, as we wanted to evaluate the feasibility of the simulation approach for the openQCD application and to identify the biggest challenges. Once our methodology has been extensively validated on a set of small scale experiments and has become more robust, we plan to generalize our approach in order to use it for performance predictions on arbitrary large scale machines. To accomplish this challenging task, it is mandatory to obtain accurate calibrations of other supercomputers. This prerequisite is shared by many researchers in the SimGrid community, and SimGrid developers are therefore working on automating the calibration process and collecting platform description files for many supercomputers, such as Stampede and Blue Waters in the USA, MareNostrum in Spain, CSCS in Switzerland, etc.

## 7 Summary and Outlook

We have proposed a procedure for intermediate profiling of a QCD code using the SimGrid simulator, which is particularly convenient for simulating parallel applications based on the MPI programming paradigm. We tested our methodology on the ym1 program of openQCD, although the same procedure can easily be applied to other lattice QCD codes. Our initial prototype works and provides encouraging results, but still needs some improvement to reach the desired accuracy, which would allow for experiments on a much larger scale. We expect this improvement to come from the refinement of the communication modeling that we are currently pursuing.

In the future, with a more robust and automated solution, we envisage applying the proposed approach to multiple use cases. First, it can help users who are preparing computer time requests for a machine they do not yet have access to. Additionally, it may be useful for choosing, in a timely manner, the most suitable machine within an already awarded computer time allocation. Furthermore, this procedure is not limited to currently existing machines. It can even be applied to machines that are still in the design phase, once features such as machine topology, processor speed and communication characteristics become known. Nevertheless, such results would have to be interpreted carefully, as unknown systems could exhibit some as yet undiscovered (and thus not modeled) phenomena.

Finally, the source code and raw data generated during this study are publicly available³ to anyone interested in learning more details about this work.

### Acknowledgments

This work was supported by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID s642. Some experiments presented in this paper were carried out using the PlaFRIM experimental testbed, being developed under the Inria PlaFRIM development action with support from LABRI and IMB and other entities: Conseil Regional d’Aquitaine, FeDER, Universite de Bordeaux and CNRS (see https://plafrim.bordeaux.inria.fr/).

### Footnotes

1. Extending the code to include QED field dynamics and an additional choice of C* boundary conditions [4] is in progress.
2. As of December 2016, the Piz Daint machine has been upgraded. The benchmarks performed in this work correspond to the configuration of June 2016.
3. http://gitlab.inria.fr/stanisic/openqcd-simgrid/

### References

1. http://luscher.web.cern.ch/luscher/openQCD/.
2. Versatile, scalable, and accurate simulation of distributed applications and platforms.
Henri Casanova, Arnaud Giersch, Arnaud Legrand, Martin Quinson, and Frédéric Suter. Journal of Parallel and Distributed Computing, 74(10), June 2014.
3. Lattice QCD without topology barriers.
Martin Luscher and Stefan Schaefer. JHEP, 07:036, 2011.
4. C periodic and G periodic QCD at finite temperature.
U. J. Wiese. Nucl. Phys., B375:45–66, 1992.
5. Charged hadrons in local finite-volume QED+QCD with C* boundary conditions.
Biagio Lucini, Agostino Patella, Alberto Ramos, and Nazario Tantalo. JHEP, 2015.
6. Hybrid Monte Carlo.
S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Phys. Lett., B195:216–222, 1987.
7. http://hpc.desy.de/simlab/codes/openqcd_bgopt/.
BlueGene/Q optimized openQCD.
8. Toward Better Simulation of MPI Applications on Ethernet/TCP Networks.
Paul Bedaride, Augustin Degomme, Stéphane Genaud, Arnaud Legrand, George Markomanolis, Martin Quinson, Mark Stillwell, Lee, Frédéric Suter, and Brice Videau. In Intl. Workshop on Perf. Modeling, Benchmarking and Simul. of HPC Systems, November 2013.
9. The Structural Simulation Toolkit.
A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield, M. Weston, R. Risen, J. Cook, P. Rosenfeld, E. Cooper-Balis, and B. Jacob. SIGMETRICS Perform. Eval. Rev., 38(4):37–42, March 2011.
10. BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines.
Gengbin Zheng, Gunavardhan Kakulapati, and Laxmikant Kalé. In Proc. of the 18th International Parallel and Distributed Processing Symposium (IPDPS), April 2004.
11. Fast and Accurate Simulation of Multithreaded Sparse Linear Algebra Solvers.
Luka Stanisic, Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche, Arnaud Legrand, Florent Lopez, and Brice Videau. In The 21st IEEE International Conference on Parallel and Distributed Systems, Melbourne, Australia, December 2015.