Optimization of FASTEST3D
for Modern Multicore Systems
Abstract
FASTEST–3D is an MPIparallel finitevolume flow solver based on blockstructured meshes that has been developed at the University of ErlangenNuremberg since the early 1990s. It can be used to solve the laminar or turbulent incompressible NavierStokes equations. Up to now its scalability was strongly limited by a rather rigid communication infrastructure, which led to a dominance of MPI time already at small process counts.
This paper describes several optimizations to increase the performance, scalability, and flexibility of FASTEST–3D. First, a nodelevel performance analysis is carried out in order to pinpoint the main bottlenecks and identify sweet spots for energyefficient execution. In addition, a singleprecision version of the solver for the linear equation system arising from the discretization of the governing equations is devised, which significantly increases the singlecore performance. Then the communication mechanisms in FASTEST–3D are analyzed and a new communication strategy based on nonblocking calls is implemented. Performance results with the revised version show significantly increased singlenode performance and considerably improved communication patterns along with much better parallel scalability. In this context we discuss the concept of “acceptable parallel efficiency” and how it influences the real gain of the optimizations. Scaling measurements are carried out on a modern petascale system. The obtained improvements are of major importance for the use of FASTEST–3D on current highperformance computer clusters and will help to perform simulations with much higher spatial and temporal resolution to tackle turbulent flow in technical applications.
I Introduction and related work
The numerical solution of turbulent, unsteady flow problems requires vast amounts of compute resources in order to get reasonable accuracy and time to solution. This problem can be tackled in two complementing ways: advanced numerical simulation algorithms and highly efficient, scalable implementations.
This paper deals with FASTEST–3D, a finite volume solver based on colocated, blockstructured meshes. It originated from the Institute of Fluid Mechanics (LSTM) at the University of ErlangenNuremberg is being developed since the early 1990s. Today different versions of FASTEST–3D exist at the University of ErlangenNuremberg, the Technical University of Darmstadt, and the University of Freiberg. The code is used to solve the laminar or turbulent incompressible NavierStokes equations. Time evolution is based either on implicit schemes like CrankNicolson or on explicit lowstorage multistage RungeKutta time advancement schemes. The linear equation system resulting from the discretization of the governing equations is solved using Stone’s [1] strongly implicit method (SIP), which is based on an incomplete factorization. Several  and largeeddy simulation (LES) models are available for the simulation of turbulent flow. Different coupling interfaces exists in order to simulate, e.g., fully coupled fluidstructure interaction [2, 3], oneway coupled acoustic interaction [4], and flow with chemical reactions. Parallelization in FASTEST–3D is based either on shared memory or on domain decomposition. The latter approach is also used for the MPI domains in the hybrid MPI/OpenMP version of FASTEST–3D. During the past years specific parallelization and optimization strategies have been implemented in order to improve the performance of FASTEST–3D on different high performance systems including mixed/sharedmemory parallelization strategies [5, 6]. Most of those optimizations have focused on vector computers with a low number of processors and a relatively high workload per process, hence the communication overhead was relatively small. This is different on massively parallel systems with a relatively small workload per process, which leads to increased communication overhead, causing poor parallel efficiency. In this paper we focus on the improvement of singlecore (and thus singlenode) performance by work reduction and the optional use of singleprecision arithmetic, and on the optimization of communication performance by using nonbocking pointtopint MPI functions, to the effect of a muchimproved massively parallel scalability.
This work is organized as follows: In Sect. II we give an overview of the hard and software used and the benchmark cases. Section III shows an analysis of the FASTEST–3D code in terms of function profiles and communication patterns. These results are used in Sect. IV for the optimization of singlecore execution and the reduction of communication overhead. In Sect. V we demonstrate the achieved improvements in performance and scalability on a petascaleclass system. Finally we give a summary and an outlook to future work in Sect. VI.
Ii Test bed and benchmark cases
Iia Test systems
The scaling results in this paper were obtained on SuperMUC [7], a federal system installed at Leibniz Supercomputing Centre (LRZ) in Garching, Germany. SuperMUC is available to scientists across Europe as a tier0 system within PRACE (Partnership for Advanced Computing in Europe). It consists of islands with compute nodes each. A compute node comprises two Intel Xeon E52680 (“Sandy Bridge”) eightcore processors, with an achievable aggregate memory bandwidth of about 80 GByte/s (as obtained with the standard STREAM benchmarks). The islands are fully nonblocking QDR InfiniBand fat trees, with a 4:1 oversubscribed fat tree across islands.
The sequential function profiles in Sect. IIIA were taken on the LiMa cluster at Erlangen Regional Computing Center (RRZE) at the University of ErlangenNuremberg. A compute node on LiMa consists of two Intel Xeon X5650 (“Westmere”) sixcore processors, with an achievable aggregate memory bandwidth of about 40 GByte/s.
The Intel compiler in version 12.1 and the Intel MPI library in version 4.1 were used for all benchmarks on both systems.
IiB Test problems
Three different test problems were considered for the purpose of analysis:

flow over a forwardfacing step [8]

turbulent flow in a plane channel, cf. Moser et al. [9]

TaylorGreen Problem [10]
For test case I, the simulation of the flow over a forwardfacing step, we used three different grids consisting of a total number of , , and about control volumes, respectively. This case was used as an example of a real configuration with grid sizes according to realworld largeeddy ( Mio. control volumes) and direct numerical simulations ( and Mio. control volumes). Figure 1 shows the setup and Fig. 2 shows computed isosurfaces for this problem. Test case II was used for validation of both the nonblocking communication strategy and especially the single precision version of the linear equation solver. This test considers the fully developed turbulent flow in a twodimensional plane channel with periodic boundary conditions in stream and spanwise directions and noslip wall boundary conditions at the bottom and top wall. Test case III was also used for verification and to evaluate the singlesocket performance (see Sect. IVA below), since this problem is perfectly loadbalanced including the boundary conditions (periodicity in and direction and symmetry at bottom and top).
Iii Systematic code analysis of FASTEST–3D
As a first step for the implementation of different communication, parallelization and optimization strategies, a function profile of the original version of FASTEST–3D was taken along with a basic investigation of its requirements towards the hardware (such as memory bandwidth) and its communication patterns. For this analysis we used the likwid tools [11], the GNU profiler gprof, and the Intel Trace Analyzer and Collector [12].
Iiia Function profile
Tables I and II show function profiles for the serial version of FASTEST–3D using test case I (forwardfacing step) with implicit and explicit time discretization, respectively. Only the ten subroutines consuming most of the total runtime are listed.
time %  self [s]  calls  name 

25.83  56.48  7500  resforward3d 
22.53  49.26  945  celuvw 
11.54  25.22  1200  lu3d 
9.03  19.74  900  celp2 
8.67  18.96  7500  backward3d 
3.80  8.31  100  calcp 
3.45  7.54  1916  exvec 
2.96  6.48  300  coefadd 
2.96  6.47  105  caluvw 
2.32  5.07  205  calcdp 
time %  self [s]  calls  name 

32.93  26.93  3600  resforward3d 
16.68  13.64  900  flx1 
11.26  9.21  3600  backward3d 
9.34  7.64  135  celuvw_exp 
7.74  6.33  300  lu3d 
6.49  5.31  5  calcp_exp 
5.09  4.16  215  calcdp_exp 
1.77  1.45  1529  exsca 
1.47  1.20  5  runge_kutta 
1.22  1.00  1800  flxnb1 
In addition, Table III shows the profile for the explicit time discretization using six MPI processes on one socket of LiMa, to obtain a profile representative for parallel execution on the first bottleneck (the socket) without the adverse effects of communication overhead. The routines to solve the linear equation systems arising from the discretization of the NavierStokes equation are resforward3d, lu3d and backward3d. These routines together consume between and of the total execution time. This fraction depends on the settings for the linear equation solver and on the time discretization scheme used: In the case of an explicit time discretization only one equation for the pressure correction has to be solved, while in the implicit time discretization equation systems for the three Cartesian components of the momentum equations have to be solved in addition. Another factor is the linear equation solver, i.e., the number of iterations per correction step. However, based on this analysis one can identify the linear equation solver as the most timeconsuming part for both parallel and serial execution. Hence, any serial code optimization should first try to improve the performance of this part of FASTEST–3D. One issue which can not be seen from the profile tables is that the arrays used to store the coefficients of the linear equation system are reused and overwritten by other, unrelated subroutines of FASTEST–3D. Hence, even if the coefficients for the equations do not change (as for the pressure correction equation in the explicit time discretization), the coefficients have to be recalculated. Since the explicit time discretization is of special interest for DNS and LES, a redesign of the solver should avoid unnecessary recalculation of the coefficients. Finally, it should be mentioned that the equation system for the pressure correction equation is symmetric, but this symmetry has not been exploited so far.
time %  self [s]  calls  name 

37.45  125.03  8100  resforward3d_pp 
14.52  48.47  8100  backward3d_pp 
11.26  37.59  1800  flx1 
9.01  30.07  645  calcdp_exp 
6.78  22.65  15  calcp_exp 
5.72  19.09  270  celuvw_exp 
5.06  16.89  900  lu3d_pp 
1.85  6.19  2400  setval 
1.60  5.33  15  runge_kutta 
1.45  4.85  3600  flxnb1 
Another possibility to gain more insight is the use of likwidperfctr [11], a simple commandline tool to measure hardware performance metrics on x86 processors under Linux. It also features a lightweight API which enables restricting the measurements to certain parts of the code. This allows, e.g., to get the memory bandwidth used while executing a specific loop in the code. The result for a serial run of FASTEST–3D is shown in Tab. IV. In principle, by this analysis a piece of code can be identified as being either memory bound or compute bound. Note that a single core cannot saturate the memory bandwidth of a socket on modern multicore chips even when running strongly memorybound code [13], for which FASTEST–3D is a typical example. It can be seen from Tab. IV that the overall memory bandwidth is in the range of GByte/s using the serial version, with individual routines drawing between and . The latter is also roughly the maximum bandwidth obtained by a simple sequential streaming benchmark such as STREAM. With Threads on one socket of LiMa, an average bandwidth of over GByte/s for the whole application can be reached, which is close to the limit of GByte/s. Multiplying the values of the memory bandwidth in Tab. IV with the number of threads, most of the measured code sections would easily exceed the maximal available memory bandwidth. As a consequence, FASTEST–3D shows the typical saturation behavior of a bandwidthlimited code on a multicore chip. See Sect. IVA below for some further analysis.
Routine  Mem. BW [MByte/s] 

Fastest3D  6212 
calcp_exp  10894 
calcp_exp (2)  6728 
flx1  3845 
calcdp_exp  12570 
calcdp_exp(2)  13334 
lu3d  4204 
backward3D  7025 
resforward3D  7116 
celuvw_exp  2490 
IiiB Communication pattern
An analysis of the communication pattern has been performed using Intel Trace Analyzer and Collector (ITAC). Typical screen shots of a run with the original code for test case II are shown in Fig. 3 for a total number of 96 processes. A single process per node was run in this experiment so that intrasocket saturation effects could be ruled out. As can be seen from Fig. 3(a) more time is spent for communication (or rather within the MPI library) than for computation. The number of control volumes (CVs) per process for this problem is , which is a realistic number for a production run of FASTEST–3D, given enough parallel resources. In a production environment one would require a minimum acceptable parallel efficiency of (single node baseline) at problem sizes of approximately CVs per process. The parallel efficiency is defined as
(1) 
where and are the execution times on a single node and on nodes, respectively. Since the simulations can not always be run on a single node due to the problem not fitting into main memory, an alternative version for the parallel efficiency will also been used below, where the baseline is not a single node but nodes:
(2) 
Using this definition, the base configuration already contains a certain amount of (internode) parallel overhead.
Based on the measurements obtained from ITAC, the parallel efficiency for this configuration is already below the acceptable limit of . From Fig. 3(b) it can be seen that the blocking MPI_Recv and MPI_Send functions are used for pointtopoint communication between adjacent subdomains/processes. While some processes return relatively quickly from the call to MPI_Recv, others spend much longer in the call until a matching MPI_Send has been issued. This leads to a partial serialization of the communication, since the blocking calls to send/receive functions are issued sequentially on each process, and successive calls can not be issued until the current call returns. This kind of partial communication serialization is a wellknown pattern [14] that may be overcome by using nonblocking pointtopoint send/receive calls (MPI_Isend/MPI_Irecv). Also a large part of the communication time is lost in the collective function MPI_Allreduce. This is a consequence of the artificial load imbalance caused by the serialization described above. Moreover, reduction operations are mainly used in FASTEST–3D in order to compute the residual. By keeping those evaluations to a minimum, the overhead from MPI_Allreduce can be reduced substantially.
Iv Code optimization
Using the results of the code analysis and the conclusions drawn from it, a revised version of FASTEST–3D was developed. In this section we give a short overview of the major changes that were implemented to improve the singlecore performance and the scaling properties of FASTEST–3D on multicore systems. The correctness of the improved code is shown for the important case of a turbulent, fully developed channel flow.
Iva Improving the singlecore performance
From the measurements of the memory bandwidth and the runtime profiles it was concluded that a relatively large part of the overall runtime was spent solving the linear equation system and that FASTEST–3D is a mostly memorybound application. This is true for both explicit and implicit time discretization, but the focus will be in the following on the explicit time discretization, which is of special interest for this work and direct numerical simulations performed with FASTEST–3D at the moment. Three possibilities to improve the performance of the explicit time discretization have been identified:

avoiding unnecessary recomputation of the coefficients of the pressure correction equation or the incomplete factorization matrices (ILU) constructed in Stone’s implicit method [1]

exploiting the symmetry of the pressure correction equation

using single precision to solve the linear equation system
Regarding the first point, the use of an implicit time discretization scheme was a very common use case for FASTEST–3D, and hence three equation systems resulting for the momentum and one equation system resulting for the pressure correction have to be solved. This is done in a segregated fashion in FASTEST–3D. By this the coefficient arrays can be shared by all equations and thus are overwritten by each other. For the explicit time discretization this is not the case, since only one equation system is considered. Furthermore, the coefficients of this equation system do not change and need to be calculated only once. Consequently this is also true for the ILU factorization. Once computed, the resulting decomposition can be used for all iterations and time steps without the need to recompute it again. This fact was not exploited in the original version of FASTEST–3D.
Concerning the second point, only the equation system for the pressure correction equation is symmetric, since the convective operator in the momentum equation is discretized using a deferred correction method and a blend of central and upwind interpolation schemes yielding nonsymmetric equation systems. Hence, exploiting this property is most advantageous for the explicit time discretization and the adoption for nonsymmetric equation systems has still to be investigated but is not the focus of the present work. Taking advantage of this symmetry in the pressure correction equation is expected to be beneficial, since the subroutine in question, resforward3d, is memory bound.
The third measure implemented to improve the single core performance is beneficial for both implicit and explicit time discretization without restrictions and reduces the amount of data which has to be loaded while solving the equation system. Since the rest of the algorithm is still performed in double precision, an additional overhead for data conversion is generated, slightly reducing the possible gain in performance.
All of the above strategies were implemented in a redesign of the linear equation solver and have changed the way the solver had to be interfaced. Figures 4 and 5 show the structure and call sequence of the SIP solver before and after the redesign. In the original version, a call to the SIP solver always implied a call to the subroutine for ILU factorization. Then the linear equation system was relaxed until a given number of iterations was reached. Finally, the pressure and velocity field was corrected using the computed pressure correction and the outer loop was checked for convergence, i.e., if the error in mass conservation is below a given threshold.
The implementation of the new version was based on a modular concept. The solver allocates and deallocates memory for the coefficients, which are not overwritten any more and thus do not need to be recomputed in between. Furthermore, the computation of the ILU factorization is done in advance to the actual solver routine. Before performing forward and backward substitution, the right hand side and solution vectors have to be converted from double to single precision. After the relaxation, only the solution vector is converted back to double precision for the subsequent pressure and velocity correction.
In order to take the symmetry of the linear equation system into account, the calculation of the residual, which is part of the forward substitution step, was changed so that only the values of the east, north and top coefficients are used (see Listing 1).

The effect of the different singlecore optimizations has been measured on one socket of SuperMUC. The dashed line in Fig. 6 shows the inverse runtime per timestep of the code version using nonblocking communication (see the next section) and the original version of the SIP solver. The solid line shows the performance of the nonblocking code version in combination with all single core optimizations applied in the new version of the SIP solver. To evaluate the influence of the single core optimizations, additional measurements for the same case running with eight MPI processes were taken. They are shown with two additional scattered data points in Fig. 6. By performing the relaxation of the equation system using single precision arrays, a reduction in runtime by % is achieved. This is in good agreement with the function profile shown in Tab. III, where the two subroutines resforward3d and backward3d consume about % of the overall runtime at double precision. Reducing the size of the data set by a factor of two leads to a speedup of two for those memorybound routines. Avoiding the recomputation of the ILU factorization yields another reduction by %, again in agreement with the function profile (see Tab. III) measured on LiMa. The last change, exploiting the symmetry in the system matrix, improves runtime by another %. The solid line in Fig. 6 combines all optimizations.
Only a slight increase in performance can be gained using eight instead of four cores on one socket. However, this also indicates the FASTEST–3D is not completely memory bound, since some parts of the code still benefit from the additional number of cores.
IvB Improving communication performance
The data exchange at block boundaries in FASTEST–3D is based on transfer tables. These tables are set up in a preprocessing step and stored in a binary file which is read by FASTEST–3D after a simulation is started. Each process stores only information about the send/receive operations it is involved in. Also no separate tables for processlocal and remote data transfer are present. Finally, all communication routines are wrappers around the basic MPI routines, providing a consistent interface no matter if the underlying communication library is MPI, no communication (in case of a serial run), or Parallel Virtual Machine (PVM). In addition, FASTEST–3D uses different exchange routines for scalars, vectors, and tensors. Figure 7 shows the basic flow of the blocking data exchange in the original version:
Each process performs a loop over all its patches (block boundaries) that need to exchange data with their neighboring blocks, regardless of whether the neighboring block is located on the same process. If the patch between two neighboring blocks is on the same process, no MPI routines are called and only a local copy operation is performed. Otherwise the data is sent to the neighboring process. Since MPI_Send is a blocking call, the sending process has to wait until the data is received at least for larger messages. This leads to a partial serialization, as described above, and also applies to data copied to diagonal neighbors into block edges and corner points.
In order to perform data exchange based on nonblocking MPI calls a module was created that, after an initialization process, sets up derived data transfer tables from the original tables by splitting these in data transfers between blocks on the same process and remote data transfers for remote sends and remote receives. For each kind of transfer an according exchange routine was provided. To ensure that all data is received by the correct target process the MPI message tag is used, which is incremented after each data transfer at block boundaries. In addition we exploit the property that if process A sends two or more messages to process B, the messages arrive in the same order as they were sent. As can be seen in Fig. 8, first the calls to MPI_Irecv are made, than all send statements are issued, and finally MPI_Waitall is called to ensure that all outstanding nonblocking sends and receives are finished. In a final step the data is copied from the receive buffers to the arrays which store the variables. In contrast to the original algorithm no data which was sent is available in the arrays which store the actual data since all data remains in the receive buffer until all send and receive statements have returned. Hence, the data at the block corner points and block edges is always data from the previous exchange operation. As long as all variables converge and a sufficient number of operations is done, this is no problem since the difference between variables from two consecutive exchange operations is very small at convergence. Also, the data at corners/edges is only needed for the solution of the equation system for the diffusive fluxes on the right hand side of the momentum equation. In order to have all data correctly exchanged at the same time level, a three step approach would be necessary, first exchanging corner points, then edges, and finally the inner points of block patches. After each exchange step a call to MPI_Waitall would also be required, resulting in complex exchange routines and preprocessing as well as in additional overhead.
IvC Verification
In order to verify the correctness of the new implementation, test problem II is compared to the original implementation. All three versions (original, with nonblocking calls, and nonblocking plus singlecore optimizations) were used to compute the mean velocity profile for a turbulent channel flow (see Fig. 9). The velocity in direction () and the distance from the wall () are normalized by the friction velocity with dynamic viscosity , and the kinematic viscosity divided by the friction velocity, respectively, which is indicated by the superscript . All three variants are very close to each other and no clear advantage can be seen for either of them in terms of the quality of the solution. Compared to the DNS data of Moser et al. [9] a small deviation in the loglaw region for all the three variants can be observed; this may be due to the lower approximation order for the derivatives in the governing equations compared to Moser et al. [9], who used highly accurate spectral methods while keeping the same grid sizes. On the other hand, the FASTEST–3D version with nonblocking calls and a singleprecision solver performs best using the same hardware with respect to the time to solution.
V Performance and Scaling Results
In this section we present selected strong scaling measurements on SuperMUC for different code versions of FASTEST–3D with 16 processes per node (ppn). Results are reported for the reciprocal of the averaged time spent per time step () with runtime per time step and parallel efficiency following the definition in Eq. (2). Performance is plotted vs. the number of compute nodes, since this is the smallest allocatable unit on any modern cluster system. All results presented here are based on test case I, the flow over the forward facing step, which is a realworld problem and not only useful for benchmarking purposes.
Va Scaling Results on SuperMUC
Scaling measurements were performed for domain sizes of control volumes per process down to for the coarsest grid, down to control volumes for the grid consisting of Mio. control volumes and from Mio. down to control volumes per process for the largest grid. The inverse of the averaged runtime per timestep and the parallel efficiency are shown in Figures 10 and Fig. 11 for the blocking and nonblocking versions of FASTEST–3D at the smallest grid size. Dashed lines correspond to the original code version with blocking MPI calls and the original version of the SIP solver. Solid lines denote the version with nonblocking MPI calls, and dotted lines correspond to the code version using nonblocking communication and the improved SIP solver. As can be seen from Fig. 10, the best performance is obtained for the nonblocking version with the improved SIP solver, but the nonblocking version with the original SIP solver shows better scaling because the slower linear system solver spends more time performing calculations between two subsequent communication steps. Using the optimized version of the SIP solver along with nonblocking communication, nearly the same performance as for the nonblocking communication with the original SIP solver can be obtained using only half of the resources ( vs. nodes). Hence, if double precision is not necessary, using the optimized SIP solver along with nonblocking communication is the best choice in terms of overall cost since less resources for nearly the same time to solution are needed. The horizontal solid line in Fig. 11 denotes the threshold of minimum acceptable parallel efficiency ( %).
Obviously the optimized version of the SIP solver is not efficient beyond nodes at this small problem size. It is remarkable that the original version of FASTEST–3D is not efficient at all even for very small node counts (), which is due to its severe communication inefficiencies. Comparing the performance at the limit of acceptable efficiency of , the new version performs about ten times better for the test problem. We regard this as the more relevant gain, as opposed to the factor of – achieved when comparing both versions at the same number of nodes (but vastly different parallel efficiencies). Whether % is the right threshold to apply is certainly debatable, but the overall conclusions are largely independent of the actual limit.
In addition to the scaling runs at a moderate problem size, additional measurements for an existing setup of a direct numerical simulation (DNS) with approximately control volumes were done on SuperMUC comparing the new version with the original version of FASTEST–3D for a real simulation. Scaling results are shown in Figures 12 and 13. Note that, in contrast to the measurements for the coarse grid, scaling has been performed here for more than nodes, which means that for the last data point in Figures 12 and 13 two network islands have been used, leading to a considerable drop in bisection bandwidth per node. Results are reported for the code version using nonblocking communication with the original SIP solver (solid line) and the original code version based on blocking communication (dashed line). Again, considering the inverse runtime, the code version using nonblocking communication shows much better parallel scaling. While for the nonblocking version the parallel efficiency drops below % for more than nodes, this is the case for the version using blocking calls already for about nodes (see Fig 13). Comparing both versions at approximately % parallel efficiency, the nonblocking version is about times faster using only twice the amount of resources at the same level of parallel efficiency.
Fig. 12 also shows two additional scattered data points denoting the performance using a naive corerank mapping which does not take the neighborhood relation between parallel subdomains for MPI task placement into account. In case of the nonblocking communication, neglecting the neighborhood relations causes a drop in performance of about %. For the version using blocking communication this effect does not play any role since the parallel efficiency for the last data point is already very low.
Finally, Figures 14 and 15 show performance and efficiency data for the largest set of CVs. The optimized code version at double precision can scale efficiently to node counts which are unreachable by the original code (beyond nodes, i.e., cores). The singleprecision SIP solver can still be of use even at these large node counts. Note that the parallel efficiency shown in Fig. 15 is relative to a node baseline due to the size of the problem.
Vi Conclusions and Outlook
We have implemented several optimizations in FASTEST–3D, a parallel finitevolume flow solver based on colocated, blockstructured meshes, to improve its sequential performance and parallel scalability on modern multicore systems. By workavoiding strategies and employing a singleprecision linear solver, the singlenode performance could be improved by about %. At the MPIparallel level, employing nonblocking MPI for block boundary exchange resulted in a massive improvement of strong parallel scalability, not because of overlapping communication with computation but due to eliminating partial communication serialization. For the purpose of comparing the cost of computations we have introduced the concept of “minimum acceptable parallel efficiency.” At an efficiency limit of % we could achieve between x and x speedup over the original version, depending on the problem size. The code in its optimized form is now ready for massively parallel, strongly scaled simulations on current highperformance cluster platforms.
Future work may include a more thorough study of the effects of rankcore mapping on the scaling properties, and a detailed analysis of the hybrid (MPI+OpenMP) version of FASTEST–3D, which is expected to show different communication and convergence behavior compared to the pure MPI version studied here.
Acknowledgments
This work was supported by the Bavarian Competence Network for Scientific High Performance Computing in Bavaria (KONWIHR).
References
 [1] Herbert L. Stone, “Iterative Solution of Implicit Approximations of Multidimensional Partial Differential Equations,” SIAM J. Numer. Anal., vol. 5, no. 3, pp. 530–558, Sep. 1968.
 [2] M. Glück, “Ein Beitrag zur numerischen Simulation von FluidStrukturInteraktion – Grundlagenuntersuchungen und Anwendung auf Membrantragwerke,” Ph.D. dissertation, Univeristy of ErlangenNuremberg, Germany, 2002.
 [3] F. Schäfer, S. Müller, T. Uffinger, S. Becker, J. Grabinger, and M. Kaltenbacher, “Numerical and Experimental Investigations on the FluidStructureAcoustic Interaction of the Flow Past a Thin Flexible Structure,” AIAA Journal, vol. 48, no. 4, pp. 738–748, 2009.
 [4] I. Ali, M. Escobar, M. Kaltenbacher, and S. Becker, “Time Domain Computation of Flow Induced Sound,” Comput. Fluids, vol. 37, no. 4, pp. 349–359, May 2008.
 [5] Frank Deserno, “Basic Optimisation Strategies for CFDCodes,” Regionales Rechenzentrum Erlangen, Tech. Rep., 11 Jul. 2003.
 [6] C. Bartels, M. Breuer, K. Wechsler, and F. Durst, “Computational Fluid Dynamics Applications on ParallelVector Computers: Computations of Stirred Vessel Flows,” Computers & Fluids, vol. 31, no. 1, pp. 69–97, Jan. 2002.
 [7] “SuperMUC petascale system.” [Online]. Available: http://www.lrz.de/services/compute/supermuc
 [8] C. Scheit, A. Esmaeili, and S. Becker, “Direct Numerical Simulation of Flow over a ForwardFacing Step  Flow Structure and Aeroacoustic Source Regions,” in Conf. Mod. Fluid Flow, J. Vad, Ed., vol. 2, Sep. 2012, pp. 891–898.
 [9] R. D. Moser, J. Kim, and N. N. Mansour, “Direct numerical simulation of turbulent channel flow up to Re_tau = 590,” Phys. Fluids, vol. 11, no. 4, pp. 943–945, 1999.
 [10] J. Kim and P. Moin, “Application of a FractionalStep Method to Incompressible NavierStokes Equations,” J. Comp. Physics, vol. 59, pp. 308–323, 1985.
 [11] Jan Treibig, Georg Hager, Gerhard Wellein, and Michael Meier, “LIKWID Performance Tools,” Innovatives Supercomputing in Deutschland, vol. 1, no. 8, pp. 50–53, 2010.
 [12] “http://software.intel.com/enus/inteltraceanalyzer.”
 [13] G. Hager, J. Treibig, J. Habich, and G. Wellein, “Exploring performance and power properties of modern multicore chips via simple machine models,” submitted. [Online]. Available: http://arxiv.org/abs/1208.2908
 [14] Georg Hager and Gerhard Wellein, Introduction to High Performance Computing for Scientists and Engineers. Chapman & Hall/CRC Press, 2 Jul. 2010.