Optimization of FASTEST-3Dfor Modern Multicore Systems

Optimization of FASTEST-3D
for Modern Multicore Systems

Christoph Scheit, Georg Hager, Jan Treibig, Stefan Becker, Gerhard Wellein    Christoph Scheit and Stefan Becker

Institute of Process Machinery and Systems Engineering, University of Erlangen-Nuremberg
D-91058 Erlangen
Email: sh@ipat.uni-erlangen.de
   Georg Hager, Jan Treibig and Gerhard Wellein
Erlangen Regional Computing Center (RRZE), University of Erlangen-Nuremberg
D-91058 Erlangen
Email: georg.hager@fau.de

FASTEST–3D is an MPI-parallel finite-volume flow solver based on block-structured meshes that has been developed at the University of Erlangen-Nuremberg since the early 1990s. It can be used to solve the laminar or turbulent incompressible Navier-Stokes equations. Up to now its scalability was strongly limited by a rather rigid communication infrastructure, which led to a dominance of MPI time already at small process counts.

This paper describes several optimizations to increase the performance, scalability, and flexibility of FASTEST–3D. First, a node-level performance analysis is carried out in order to pinpoint the main bottlenecks and identify sweet spots for energy-efficient execution. In addition, a single-precision version of the solver for the linear equation system arising from the discretization of the governing equations is devised, which significantly increases the single-core performance. Then the communication mechanisms in FASTEST–3D are analyzed and a new communication strategy based on non-blocking calls is implemented. Performance results with the revised version show significantly increased single-node performance and considerably improved communication patterns along with much better parallel scalability. In this context we discuss the concept of “acceptable parallel efficiency” and how it influences the real gain of the optimizations. Scaling measurements are carried out on a modern petascale system. The obtained improvements are of major importance for the use of FASTEST–3D on current high-performance computer clusters and will help to perform simulations with much higher spatial and temporal resolution to tackle turbulent flow in technical applications.

I Introduction and related work

The numerical solution of turbulent, unsteady flow problems requires vast amounts of compute resources in order to get reasonable accuracy and time to solution. This problem can be tackled in two complementing ways: advanced numerical simulation algorithms and highly efficient, scalable implementations.

This paper deals with FASTEST–3D, a finite volume solver based on co-located, block-structured meshes. It originated from the Institute of Fluid Mechanics (LSTM) at the University of Erlangen-Nuremberg is being developed since the early 1990s. Today different versions of FASTEST–3D exist at the University of Erlangen-Nuremberg, the Technical University of Darmstadt, and the University of Freiberg. The code is used to solve the laminar or turbulent incompressible Navier-Stokes equations. Time evolution is based either on implicit schemes like Crank-Nicolson or on explicit low-storage multi-stage Runge-Kutta time advancement schemes. The linear equation system resulting from the discretization of the governing equations is solved using Stone’s [1] strongly implicit method (SIP), which is based on an incomplete factorization. Several - and large-eddy simulation (LES) models are available for the simulation of turbulent flow. Different coupling interfaces exists in order to simulate, e.g., fully coupled fluid-structure interaction [2, 3], one-way coupled acoustic interaction [4], and flow with chemical reactions. Parallelization in FASTEST–3D is based either on shared memory or on domain decomposition. The latter approach is also used for the MPI domains in the hybrid MPI/OpenMP version of FASTEST–3D. During the past years specific parallelization and optimization strategies have been implemented in order to improve the performance of FASTEST–3D on different high performance systems including mixed/shared-memory parallelization strategies [5, 6]. Most of those optimizations have focused on vector computers with a low number of processors and a relatively high workload per process, hence the communication overhead was relatively small. This is different on massively parallel systems with a relatively small workload per process, which leads to increased communication overhead, causing poor parallel efficiency. In this paper we focus on the improvement of single-core (and thus single-node) performance by work reduction and the optional use of single-precision arithmetic, and on the optimization of communication performance by using non-bocking point-to-pint MPI functions, to the effect of a much-improved massively parallel scalability.

This work is organized as follows: In Sect. II we give an overview of the hard- and software used and the benchmark cases. Section III shows an analysis of the FASTEST–3D code in terms of function profiles and communication patterns. These results are used in Sect. IV for the optimization of single-core execution and the reduction of communication overhead. In Sect. V we demonstrate the achieved improvements in performance and scalability on a petascale-class system. Finally we give a summary and an outlook to future work in Sect. VI.

Ii Test bed and benchmark cases

Ii-a Test systems

The scaling results in this paper were obtained on SuperMUC [7], a federal system installed at Leibniz Supercomputing Centre (LRZ) in Garching, Germany. SuperMUC is available to scientists across Europe as a tier-0 system within PRACE (Partnership for Advanced Computing in Europe). It consists of islands with compute nodes each. A compute node comprises two Intel Xeon E5-2680 (“Sandy Bridge”) eight-core processors, with an achievable aggregate memory bandwidth of about 80 GByte/s (as obtained with the standard STREAM benchmarks). The islands are fully non-blocking QDR InfiniBand fat trees, with a 4:1 oversubscribed fat tree across islands.

The sequential function profiles in Sect. III-A were taken on the LiMa cluster at Erlangen Regional Computing Center (RRZE) at the University of Erlangen-Nuremberg. A compute node on LiMa consists of two Intel Xeon X5650 (“Westmere”) six-core processors, with an achievable aggregate memory bandwidth of about 40 GByte/s.

The Intel compiler in version 12.1 and the Intel MPI library in version 4.1 were used for all benchmarks on both systems.

Ii-B Test problems

Fig. 1: Boundary conditions and domain for test case I, the forward–facing step
Fig. 2: Isosurfaces of turbulent pressure fluctuations for the flow over a forward-facing step computed via DNS; isovalues: 9/-9 Pa

Three different test problems were considered for the purpose of analysis:

  • flow over a forward-facing step [8]

  • turbulent flow in a plane channel, cf. Moser et al. [9]

  • Taylor-Green Problem [10]

For test case I, the simulation of the flow over a forward-facing step, we used three different grids consisting of a total number of , , and about control volumes, respectively. This case was used as an example of a real configuration with grid sizes according to real-world large-eddy ( Mio. control volumes) and direct numerical simulations ( and Mio. control volumes). Figure 1 shows the setup and Fig. 2 shows computed isosurfaces for this problem. Test case II was used for validation of both the non-blocking communication strategy and especially the single precision version of the linear equation solver. This test considers the fully developed turbulent flow in a two-dimensional plane channel with periodic boundary conditions in stream- and span-wise directions and no-slip wall boundary conditions at the bottom and top wall. Test case III was also used for verification and to evaluate the single-socket performance (see Sect. IV-A below), since this problem is perfectly load-balanced including the boundary conditions (periodicity in and direction and symmetry at bottom and top).

Iii Systematic code analysis of FASTEST–3D

As a first step for the implementation of different communication, parallelization and optimization strategies, a function profile of the original version of FASTEST–3D was taken along with a basic investigation of its requirements towards the hardware (such as memory bandwidth) and its communication patterns. For this analysis we used the likwid tools [11], the GNU profiler gprof, and the Intel Trace Analyzer and Collector [12].

Iii-a Function profile

Tables I and II show function profiles for the serial version of FASTEST–3D using test case I (forward-facing step) with implicit and explicit time discretization, respectively. Only the ten subroutines consuming most of the total runtime are listed.

time % self [s] calls name
25.83 56.48 7500 resforward3d
22.53 49.26 945 celuvw
11.54 25.22 1200 lu3d
9.03 19.74 900 celp2
8.67 18.96 7500 backward3d
3.80 8.31 100 calcp
3.45 7.54 1916 exvec
2.96 6.48 300 coefadd
2.96 6.47 105 caluvw
2.32 5.07 205 calcdp
TABLE I: Sequential function profile, implicit time discretization, test case I
time % self [s] calls name
32.93 26.93 3600 resforward3d
16.68 13.64 900 flx1
11.26 9.21 3600 backward3d
9.34 7.64 135 celuvw_exp
7.74 6.33 300 lu3d
6.49 5.31 5 calcp_exp
5.09 4.16 215 calcdp_exp
1.77 1.45 1529 exsca
1.47 1.20 5 runge_kutta
1.22 1.00 1800 flxnb1
TABLE II: Sequential function Profile, explicit time discretization, test case I

In addition, Table III shows the profile for the explicit time discretization using six MPI processes on one socket of LiMa, to obtain a profile representative for parallel execution on the first bottleneck (the socket) without the adverse effects of communication overhead. The routines to solve the linear equation systems arising from the discretization of the Navier-Stokes equation are resforward3d, lu3d and backward3d. These routines together consume between and of the total execution time. This fraction depends on the settings for the linear equation solver and on the time discretization scheme used: In the case of an explicit time discretization only one equation for the pressure correction has to be solved, while in the implicit time discretization equation systems for the three Cartesian components of the momentum equations have to be solved in addition. Another factor is the linear equation solver, i.e., the number of iterations per correction step. However, based on this analysis one can identify the linear equation solver as the most time-consuming part for both parallel and serial execution. Hence, any serial code optimization should first try to improve the performance of this part of FASTEST–3D. One issue which can not be seen from the profile tables is that the arrays used to store the coefficients of the linear equation system are reused and overwritten by other, unrelated subroutines of FASTEST–3D. Hence, even if the coefficients for the equations do not change (as for the pressure correction equation in the explicit time discretization), the coefficients have to be recalculated. Since the explicit time discretization is of special interest for DNS and LES, a redesign of the solver should avoid unnecessary recalculation of the coefficients. Finally, it should be mentioned that the equation system for the pressure correction equation is symmetric, but this symmetry has not been exploited so far.

time % self [s] calls name
37.45 125.03 8100 resforward3d_pp
14.52 48.47 8100 backward3d_pp
11.26 37.59 1800 flx1
9.01 30.07 645 calcdp_exp
6.78 22.65 15 calcp_exp
5.72 19.09 270 celuvw_exp
5.06 16.89 900 lu3d_pp
1.85 6.19 2400 setval
1.60 5.33 15 runge_kutta
1.45 4.85 3600 flxnb1
TABLE III: Function profile using six MPI processes on one socket of LiMa for explicit time discretization of test case I

Another possibility to gain more insight is the use of likwid-perfctr [11], a simple command-line tool to measure hardware performance metrics on x86 processors under Linux. It also features a lightweight API which enables restricting the measurements to certain parts of the code. This allows, e.g., to get the memory bandwidth used while executing a specific loop in the code. The result for a serial run of FASTEST–3D is shown in Tab. IV. In principle, by this analysis a piece of code can be identified as being either memory bound or compute bound. Note that a single core cannot saturate the memory bandwidth of a socket on modern multicore chips even when running strongly memory-bound code [13], for which FASTEST–3D is a typical example. It can be seen from Tab. IV that the overall memory bandwidth is in the range of  GByte/s using the serial version, with individual routines drawing between and . The latter is also roughly the maximum bandwidth obtained by a simple sequential streaming benchmark such as STREAM. With Threads on one socket of LiMa, an average bandwidth of over  GByte/s for the whole application can be reached, which is close to the limit of  GByte/s. Multiplying the values of the memory bandwidth in Tab. IV with the number of threads, most of the measured code sections would easily exceed the maximal available memory bandwidth. As a consequence, FASTEST–3D shows the typical saturation behavior of a bandwidth-limited code on a multicore chip. See Sect. IV-A below for some further analysis.

Routine Mem. BW [MByte/s]
Fastest-3D 6212
calcp_exp 10894
calcp_exp (2) 6728
flx1 3845
calcdp_exp 12570
calcdp_exp(2) 13334
lu3d 4204
backward3D 7025
resforward3D 7116
celuvw_exp 2490
TABLE IV: Per-routine analysis of memory bandwidth for the serial version (test case I) on LiMa

Iii-B Communication pattern

An analysis of the communication pattern has been performed using Intel Trace Analyzer and Collector (ITAC). Typical screen shots of a run with the original code for test case II are shown in Fig. 3 for a total number of 96 processes. A single process per node was run in this experiment so that intra-socket saturation effects could be ruled out. As can be seen from Fig. 3(a) more time is spent for communication (or rather within the MPI library) than for computation. The number of control volumes (CVs) per process for this problem is , which is a realistic number for a production run of FASTEST–3D, given enough parallel resources. In a production environment one would require a minimum acceptable parallel efficiency of (single node baseline) at problem sizes of approximately CVs per process. The parallel efficiency is defined as


where and are the execution times on a single node and on nodes, respectively. Since the simulations can not always be run on a single node due to the problem not fitting into main memory, an alternative version for the parallel efficiency will also been used below, where the baseline is not a single node but nodes:


Using this definition, the base configuration already contains a certain amount of (inter-node) parallel overhead.

(a) Overview of total amount of communication time vs. computation time
(b) Event time-line for a short time interval
(c) Distribution of communication time on different MPI functions
Fig. 3: ITAC analysis: Overview, event timeline, and time spent per MPI routine using one MPI process per node on the LiMa cluster for test case II and 96 processes

Based on the measurements obtained from ITAC, the parallel efficiency for this configuration is already below the acceptable limit of . From Fig. 3(b) it can be seen that the blocking MPI_Recv and MPI_Send functions are used for point-to-point communication between adjacent sub-domains/processes. While some processes return relatively quickly from the call to MPI_Recv, others spend much longer in the call until a matching MPI_Send has been issued. This leads to a partial serialization of the communication, since the blocking calls to send/receive functions are issued sequentially on each process, and successive calls can not be issued until the current call returns. This kind of partial communication serialization is a well-known pattern [14] that may be overcome by using non-blocking point-to-point send/receive calls (MPI_Isend/MPI_Irecv). Also a large part of the communication time is lost in the collective function MPI_Allreduce. This is a consequence of the artificial load imbalance caused by the serialization described above. Moreover, reduction operations are mainly used in FASTEST–3D in order to compute the residual. By keeping those evaluations to a minimum, the overhead from MPI_Allreduce can be reduced substantially.

Iv Code optimization

Using the results of the code analysis and the conclusions drawn from it, a revised version of FASTEST–3D was developed. In this section we give a short overview of the major changes that were implemented to improve the single-core performance and the scaling properties of FASTEST–3D on multicore systems. The correctness of the improved code is shown for the important case of a turbulent, fully developed channel flow.

Iv-a Improving the single-core performance

From the measurements of the memory bandwidth and the runtime profiles it was concluded that a relatively large part of the overall runtime was spent solving the linear equation system and that FASTEST–3D is a mostly memory-bound application. This is true for both explicit and implicit time discretization, but the focus will be in the following on the explicit time discretization, which is of special interest for this work and direct numerical simulations performed with FASTEST–3D at the moment. Three possibilities to improve the performance of the explicit time discretization have been identified:

  1. avoiding unnecessary re-computation of the coefficients of the pressure correction equation or the incomplete factorization matrices (ILU) constructed in Stone’s implicit method [1]

  2. exploiting the symmetry of the pressure correction equation

  3. using single precision to solve the linear equation system

Regarding the first point, the use of an implicit time discretization scheme was a very common use case for FASTEST–3D, and hence three equation systems resulting for the momentum and one equation system resulting for the pressure correction have to be solved. This is done in a segregated fashion in FASTEST–3D. By this the coefficient arrays can be shared by all equations and thus are overwritten by each other. For the explicit time discretization this is not the case, since only one equation system is considered. Furthermore, the coefficients of this equation system do not change and need to be calculated only once. Consequently this is also true for the ILU factorization. Once computed, the resulting decomposition can be used for all iterations and time steps without the need to recompute it again. This fact was not exploited in the original version of FASTEST–3D.

Concerning the second point, only the equation system for the pressure correction equation is symmetric, since the convective operator in the momentum equation is discretized using a deferred correction method and a blend of central and upwind interpolation schemes yielding non-symmetric equation systems. Hence, exploiting this property is most advantageous for the explicit time discretization and the adoption for non-symmetric equation systems has still to be investigated but is not the focus of the present work. Taking advantage of this symmetry in the pressure correction equation is expected to be beneficial, since the subroutine in question, resforward3d, is memory bound.

The third measure implemented to improve the single core performance is beneficial for both implicit and explicit time discretization without restrictions and reduces the amount of data which has to be loaded while solving the equation system. Since the rest of the algorithm is still performed in double precision, an additional overhead for data conversion is generated, slightly reducing the possible gain in performance.

All of the above strategies were implemented in a redesign of the linear equation solver and have changed the way the solver had to be interfaced. Figures 4 and 5 show the structure and call sequence of the SIP solver before and after the redesign. In the original version, a call to the SIP solver always implied a call to the subroutine for ILU factorization. Then the linear equation system was relaxed until a given number of iterations was reached. Finally, the pressure and velocity field was corrected using the computed pressure correction and the outer loop was checked for convergence, i.e., if the error in mass conservation is below a given threshold.

The implementation of the new version was based on a modular concept. The solver allocates and deallocates memory for the coefficients, which are not overwritten any more and thus do not need to be recomputed in between. Furthermore, the computation of the ILU factorization is done in advance to the actual solver routine. Before performing forward and backward substitution, the right hand side and solution vectors have to be converted from double to single precision. After the relaxation, only the solution vector is converted back to double precision for the subsequent pressure and velocity correction.

Fig. 4: Original version of SIP solver
Fig. 5: Optimized version of SIP solver

In order to take the symmetry of the linear equation system into account, the calculation of the residual, which is part of the forward substitution step, was changed so that only the values of the east, north and top coefficients are used (see Listing 1).

do k = 2, nk-1
  do i = 2, ni-1
    do j = 2, nj-1
      res(j,i,k) =
          ae(j,i,k) * fi(j,i+1,k)
       + ae(j,i-1,k) * fi(j,i-1,k)
       + an(j,i,k) * fi(j+1,i,k)
       + an(j-1,i,k) * fi(j-1,i,k)
       + at(j,i,k) * fi(j,i,k+1)
       + at(j,i,k-1) * fi(j,i,k-1)
       + su(j,i,k) - ap(j,i,k) * fi(j,i,k)
Listing 1: Residual calculation for a symmetric system matrix

The effect of the different single-core optimizations has been measured on one socket of SuperMUC. The dashed line in Fig. 6 shows the inverse runtime per timestep of the code version using non-blocking communication (see the next section) and the original version of the SIP solver. The solid line shows the performance of the non-blocking code version in combination with all single core optimizations applied in the new version of the SIP solver. To evaluate the influence of the single core optimizations, additional measurements for the same case running with eight MPI processes were taken. They are shown with two additional scattered data points in Fig. 6. By performing the relaxation of the equation system using single precision arrays, a reduction in runtime by  % is achieved. This is in good agreement with the function profile shown in Tab. III, where the two subroutines resforward3d and backward3d consume about  % of the overall runtime at double precision. Reducing the size of the data set by a factor of two leads to a speedup of two for those memory-bound routines. Avoiding the recomputation of the ILU factorization yields another reduction by  %, again in agreement with the function profile (see Tab. III) measured on LiMa. The last change, exploiting the symmetry in the system matrix, improves runtime by another  %. The solid line in Fig. 6 combines all optimizations.

Fig. 6: Scaling on one socket of an Intel Xeon Sandy Bridge node, non-blocking code version with additional code optimizatinos applied (test case III)

Only a slight increase in performance can be gained using eight instead of four cores on one socket. However, this also indicates the FASTEST–3D is not completely memory bound, since some parts of the code still benefit from the additional number of cores.

Iv-B Improving communication performance

The data exchange at block boundaries in FASTEST–3D is based on transfer tables. These tables are set up in a preprocessing step and stored in a binary file which is read by FASTEST–3D after a simulation is started. Each process stores only information about the send/receive operations it is involved in. Also no separate tables for process-local and remote data transfer are present. Finally, all communication routines are wrappers around the basic MPI routines, providing a consistent interface no matter if the underlying communication library is MPI, no communication (in case of a serial run), or Parallel Virtual Machine (PVM). In addition, FASTEST–3D uses different exchange routines for scalars, vectors, and tensors. Figure 7 shows the basic flow of the blocking data exchange in the original version:

Fig. 7: Blocking data exchange at block boundaries in FASTEST–3D.

Each process performs a loop over all its patches (block boundaries) that need to exchange data with their neighboring blocks, regardless of whether the neighboring block is located on the same process. If the patch between two neighboring blocks is on the same process, no MPI routines are called and only a local copy operation is performed. Otherwise the data is sent to the neighboring process. Since MPI_Send is a blocking call, the sending process has to wait until the data is received at least for larger messages. This leads to a partial serialization, as described above, and also applies to data copied to diagonal neighbors into block edges and corner points.

In order to perform data exchange based on non-blocking MPI calls a module was created that, after an initialization process, sets up derived data transfer tables from the original tables by splitting these in data transfers between blocks on the same process and remote data transfers for remote sends and remote receives. For each kind of transfer an according exchange routine was provided. To ensure that all data is received by the correct target process the MPI message tag is used, which is incremented after each data transfer at block boundaries. In addition we exploit the property that if process A sends two or more messages to process B, the messages arrive in the same order as they were sent. As can be seen in Fig. 8, first the calls to MPI_Irecv are made, than all send statements are issued, and finally MPI_Waitall is called to ensure that all outstanding non-blocking sends and receives are finished. In a final step the data is copied from the receive buffers to the arrays which store the variables. In contrast to the original algorithm no data which was sent is available in the arrays which store the actual data since all data remains in the receive buffer until all send and receive statements have returned. Hence, the data at the block corner points and block edges is always data from the previous exchange operation. As long as all variables converge and a sufficient number of operations is done, this is no problem since the difference between variables from two consecutive exchange operations is very small at convergence. Also, the data at corners/edges is only needed for the solution of the equation system for the diffusive fluxes on the right hand side of the momentum equation. In order to have all data correctly exchanged at the same time level, a three step approach would be necessary, first exchanging corner points, then edges, and finally the inner points of block patches. After each exchange step a call to MPI_Waitall would also be required, resulting in complex exchange routines and preprocessing as well as in additional overhead.

Fig. 8: Non-Blocking data exchange at block boundaries in FASTEST–3D.

Iv-C Verification

In order to verify the correctness of the new implementation, test problem II is compared to the original implementation. All three versions (original, with non-blocking calls, and non-blocking plus single-core optimizations) were used to compute the mean velocity profile for a turbulent channel flow (see Fig. 9). The velocity in direction () and the distance from the wall () are normalized by the friction velocity with dynamic viscosity , and the kinematic viscosity divided by the friction velocity, respectively, which is indicated by the superscript . All three variants are very close to each other and no clear advantage can be seen for either of them in terms of the quality of the solution. Compared to the DNS data of Moser et al. [9] a small deviation in the log-law region for all the three variants can be observed; this may be due to the lower approximation order for the derivatives in the governing equations compared to Moser et al. [9], who used highly accurate spectral methods while keeping the same grid sizes. On the other hand, the FASTEST–3D version with non-blocking calls and a single-precision solver performs best using the same hardware with respect to the time to solution.

Fig. 9: Comparison of original, non-blocking, and non-blocking version with single core performance optimizations of FASTEST–3D with respect to DNS data of Moser et al. [9]

V Performance and Scaling Results

In this section we present selected strong scaling measurements on SuperMUC for different code versions of FASTEST–3D with 16 processes per node (ppn). Results are reported for the reciprocal of the averaged time spent per time step () with runtime per time step and parallel efficiency following the definition in Eq. (2). Performance is plotted vs. the number of compute nodes, since this is the smallest allocatable unit on any modern cluster system. All results presented here are based on test case I, the flow over the forward facing step, which is a real-world problem and not only useful for benchmarking purposes.

V-a Scaling Results on SuperMUC

Scaling measurements were performed for domain sizes of control volumes per process down to for the coarsest grid, down to control volumes for the grid consisting of Mio. control volumes and from Mio. down to control volumes per process for the largest grid. The inverse of the averaged runtime per timestep and the parallel efficiency are shown in Figures 10 and Fig. 11 for the blocking and non-blocking versions of FASTEST–3D at the smallest grid size. Dashed lines correspond to the original code version with blocking MPI calls and the original version of the SIP solver. Solid lines denote the version with non-blocking MPI calls, and dotted lines correspond to the code version using non-blocking communication and the improved SIP solver. As can be seen from Fig. 10, the best performance is obtained for the non-blocking version with the improved SIP solver, but the non-blocking version with the original SIP solver shows better scaling because the slower linear system solver spends more time performing calculations between two subsequent communication steps. Using the optimized version of the SIP solver along with non-blocking communication, nearly the same performance as for the non-blocking communication with the original SIP solver can be obtained using only half of the resources ( vs. nodes). Hence, if double precision is not necessary, using the optimized SIP solver along with non-blocking communication is the best choice in terms of overall cost since less resources for nearly the same time to solution are needed. The horizontal solid line in Fig. 11 denotes the threshold of minimum acceptable parallel efficiency ( %).

Obviously the optimized version of the SIP solver is not efficient beyond nodes at this small problem size. It is remarkable that the original version of FASTEST–3D is not efficient at all even for very small node counts (), which is due to its severe communication inefficiencies. Comparing the performance at the limit of acceptable efficiency of , the new version performs about ten times better for the test problem. We regard this as the more relevant gain, as opposed to the factor of achieved when comparing both versions at the same number of nodes (but vastly different parallel efficiencies). Whether  % is the right threshold to apply is certainly debatable, but the overall conclusions are largely independent of the actual limit.

Fig. 10: Inverse runtime per timestep vs. nodes on SuperMUC, 16 ppn, test case I,  CVs, strong scaling
Fig. 11: Parallel efficiency vs. nodes on SuperMUC, 16 ppn, test case I,  CVs, strong scaling

In addition to the scaling runs at a moderate problem size, additional measurements for an existing setup of a direct numerical simulation (DNS) with approximately control volumes were done on SuperMUC  comparing the new version with the original version of FASTEST–3D for a real simulation. Scaling results are shown in Figures 12 and 13. Note that, in contrast to the measurements for the coarse grid, scaling has been performed here for more than nodes, which means that for the last data point in Figures 12 and 13 two network islands have been used, leading to a considerable drop in bisection bandwidth per node. Results are reported for the code version using non-blocking communication with the original SIP solver (solid line) and the original code version based on blocking communication (dashed line). Again, considering the inverse runtime, the code version using non-blocking communication shows much better parallel scaling. While for the non-blocking version the parallel efficiency drops below  % for more than nodes, this is the case for the version using blocking calls already for about nodes (see Fig 13). Comparing both versions at approximately  % parallel efficiency, the non-blocking version is about times faster using only twice the amount of resources at the same level of parallel efficiency.

Fig. 12 also shows two additional scattered data points denoting the performance using a naive core-rank mapping which does not take the neighborhood relation between parallel subdomains for MPI task placement into account. In case of the non-blocking communication, neglecting the neighborhood relations causes a drop in performance of about  %. For the version using blocking communication this effect does not play any role since the parallel efficiency for the last data point is already very low.

Fig. 12: Inverse runtime per time step vs. nodes on SuperMUC, 16 ppn, test case I,  CVs, strong scaling
Fig. 13: Parallel efficiency on SuperMUC vs. nodes, 16 ppn, test case I,  CVs, strong scaling
Fig. 14: Inverse runtime per time step vs. nodes on SuperMUC, 16 ppn, test case I,  CVs, strong scaling
Fig. 15: Parallel efficiency on SuperMUC vs. nodes, 16 ppn, test case I,  CVs, strong scaling

Finally, Figures 14 and 15 show performance and efficiency data for the largest set of CVs. The optimized code version at double precision can scale efficiently to node counts which are unreachable by the original code (beyond nodes, i.e., cores). The single-precision SIP solver can still be of use even at these large node counts. Note that the parallel efficiency shown in Fig. 15 is relative to a -node baseline due to the size of the problem.

Vi Conclusions and Outlook

We have implemented several optimizations in FASTEST–3D, a parallel finite-volume flow solver based on co-located, block-structured meshes, to improve its sequential performance and parallel scalability on modern multicore systems. By work-avoiding strategies and employing a single-precision linear solver, the single-node performance could be improved by about  %. At the MPI-parallel level, employing non-blocking MPI for block boundary exchange resulted in a massive improvement of strong parallel scalability, not because of overlapping communication with computation but due to eliminating partial communication serialization. For the purpose of comparing the cost of computations we have introduced the concept of “minimum acceptable parallel efficiency.” At an efficiency limit of  % we could achieve between x and x speedup over the original version, depending on the problem size. The code in its optimized form is now ready for massively parallel, strongly scaled simulations on current high-performance cluster platforms.

Future work may include a more thorough study of the effects of rank-core mapping on the scaling properties, and a detailed analysis of the hybrid (MPI+OpenMP) version of FASTEST–3D, which is expected to show different communication and convergence behavior compared to the pure MPI version studied here.


This work was supported by the Bavarian Competence Network for Scientific High Performance Computing in Bavaria (KONWIHR).


  • [1] Herbert L. Stone, “Iterative Solution of Implicit Approximations of Multidimensional Partial Differential Equations,” SIAM J. Numer. Anal., vol. 5, no. 3, pp. 530–558, Sep. 1968.
  • [2] M. Glück, “Ein Beitrag zur numerischen Simulation von Fluid-Struktur-Interaktion – Grundlagenuntersuchungen und Anwendung auf Membrantragwerke,” Ph.D. dissertation, Univeristy of Erlangen-Nuremberg, Germany, 2002.
  • [3] F. Schäfer, S. Müller, T. Uffinger, S. Becker, J. Grabinger, and M. Kaltenbacher, “Numerical and Experimental Investigations on the Fluid-Structure-Acoustic Interaction of the Flow Past a Thin Flexible Structure,” AIAA Journal, vol. 48, no. 4, pp. 738–748, 2009.
  • [4] I. Ali, M. Escobar, M. Kaltenbacher, and S. Becker, “Time Domain Computation of Flow Induced Sound,” Comput. Fluids, vol. 37, no. 4, pp. 349–359, May 2008.
  • [5] Frank Deserno, “Basic Optimisation Strategies for CFD-Codes,” Regionales Rechenzentrum Erlangen, Tech. Rep., 11 Jul. 2003.
  • [6] C. Bartels, M. Breuer, K. Wechsler, and F. Durst, “Computational Fluid Dynamics Applications on Parallel-Vector Computers: Computations of Stirred Vessel Flows,” Computers & Fluids, vol. 31, no. 1, pp. 69–97, Jan. 2002.
  • [7] “SuperMUC petascale system.” [Online]. Available: http://www.lrz.de/services/compute/supermuc
  • [8] C. Scheit, A. Esmaeili, and S. Becker, “Direct Numerical Simulation of Flow over a Forward-Facing Step - Flow Structure and Aeroacoustic Source Regions,” in Conf. Mod. Fluid Flow, J. Vad, Ed., vol. 2, Sep. 2012, pp. 891–898.
  • [9] R. D. Moser, J. Kim, and N. N. Mansour, “Direct numerical simulation of turbulent channel flow up to Re_tau = 590,” Phys. Fluids, vol. 11, no. 4, pp. 943–945, 1999.
  • [10] J. Kim and P. Moin, “Application of a Fractional-Step Method to Incompressible Navier-Stokes Equations,” J. Comp. Physics, vol. 59, pp. 308–323, 1985.
  • [11] Jan Treibig, Georg Hager, Gerhard Wellein, and Michael Meier, “LIKWID Performance Tools,” Innovatives Supercomputing in Deutschland, vol. 1, no. 8, pp. 50–53, 2010.
  • [12] http://software.intel.com/en-us/intel-trace-analyzer.
  • [13] G. Hager, J. Treibig, J. Habich, and G. Wellein, “Exploring performance and power properties of modern multicore chips via simple machine models,” submitted. [Online]. Available: http://arxiv.org/abs/1208.2908
  • [14] Georg Hager and Gerhard Wellein, Introduction to High Performance Computing for Scientists and Engineers.   Chapman & Hall/CRC Press, 2 Jul. 2010.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description