Performance Optimisation of Smoothed Particle Hydrodynamics Algorithms for Multi/Many-Core Architectures

Performance Optimisation of Smoothed Particle Hydrodynamics Algorithms for Multi/Many-Core Architectures

Fabio Baruffa,(Γ) Luigi Iapichino,(Γ) Nicolay J. Hammer and Vasileios Karakasis Current address: CSCS - Swiss National Supercomputing Centre, Lugano, Switzerland Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities, Garching b. München, Germany.
Email: {fabio.baruffa, luigi.iapichino, nicolay.hammer},

We describe a strategy for code modernisation of Gadget, a widely used community code for computational astrophysics. The focus of this work is on node-level performance optimisation, targeting current multi/many-core Intel® architectures. We identify and isolate a sample code kernel, which is representative of a typical Smoothed Particle Hydrodynamics (SPH) algorithm. The code modifications include threading parallelism optimisation, change of the data layout into Structure of Arrays (SoA), auto-vectorisation and algorithmic improvements in the particle sorting. We obtain shorter execution time and improved threading scalability both on Intel Xeon® ( on Ivy Bridge) and Xeon Phi™ ( on Knights Corner) systems. First few tests of the optimised code result in faster execution on second generation Xeon Phi (Knights Landing), thus demonstrating the portability of the devised optimisation solutions to upcoming architectures.

Performance optimisation SPH OpenMP vectorisation Intel Xeon Intel Xeon Phi KNC KNL

=0mu plus 1mu

I Introduction

Among the different branches of physics, astrophysics has a long-standing tradition in the use of numerical simulations. In particular, in the evolution of the cosmic large-scale structure, one has to deal with a problem which is inherently multi-dimensional, non-linear, and presents no clear spatial symmetry to be exploited. Although analytical models can provide some general insight, e.g. on scaling relations of galaxy clusters [1], this field of study has early moved to an extensive use of computational models [2]. Like in many research areas with a similar evolution, a number of codes widely used in the community have a development history dating back to the last ten to twenty years. As a consequence, the modernisation of such codes for their effective use on modern computer architectures has become mandatory. Especially for applications with a wide community, it would be preferable from the user’s perspective to optimise an existing code, rather than to develop new numerical tools from scratch.

The computational astrophysics code Gadget [3] is a cosmological, fully hybrid MPI + OpenMP parallelised TreePM-MHD-SPH code. In this scheme, both gas and dark matter are discretised by means of particles. In the TreePM approach, the gravity is split into a long-range part, which is computed by sorting the particles onto a mesh (PM) and then solving the Poisson equation via FFT methods, and a short-range part, where a direct sum of the forces between particles is performed. For the short-range part, forces at intermediate distances can be approximated by grouping particles and taking the moments of these groups (tree method). Additional physics modules use different sub-grid models, to properly treat processes which are far below the resolution limit in galaxy simulations.

A basic version of the code has been publicly released as Gadget-2 in 2005111Website: . Since then, many groups have started independent development lines (see for example the comparison project described in [4]), some of them leading also to algorithmic improvements (e. g. [5]). In this work, we describe a strategy of code modernisation at node level, performed on the code version dubbed as P-Gadget3 [3, 6]. Our code optimisation work has as target multi-core and many-core Intel® architectures. We will show that such modern systems expose new performance bottlenecks even on an optimised code which is successfully used on large-scale HPC systems. In particular, node-level performance plays now a crucial role, because both threading parallelism and vectorisation have an ever increasing importance for keeping pace with Moore’s law.

We want to stress a basic principle leading our development: we perform code modifications which are ‘as non-invasive as possible’. Our intention is to ensure the portability of the code across different computer architectures, and the readability for the developers. Furthermore the group of general users, which often have basic programming skills and are more interested on the physics implementation, have to be able to modify the code without coping with performance questions. For all these reasons, our attempts will focus on minimally invasive code modifications, relying on compiler auto-vectorisation and refraining from intrinsics instructions.

In order to have a simpler testbed for our development, an OpenMP-only code kernel has been isolated from P-Gadget3. The selected kernel is part of the Subfind algorithm ([7, 8]; Section II-B), originally developed as a standalone tool and later embedded into Gadget for inline data analysis. Besides being representative of analogous code parts in P-Gadget3, this kernel is also relevant for the vast category of halo finding algorithms in computational astrophysics [9].

In this paper we focus on the node-level performance optimisation of the Subfind algorithm, and describe the steps to analyse and improve it. The proposed solutions for removing the performance bottlenecks can be considered as a proof of concept for a future modernisation of the whole code. Some otherwise important topics like MPI performance or parallel I/O go beyond the scope of the present work and will be addressed elsewhere.

The structure of this work is the following: in Section II, the kernel and its data serialisation are described, and in Section III the hardware and software environment of the tests is presented. The different optimisations applied to the kernel and the related performance improvements are reported in Section IV. In Section V, it is shown how our strategy of code modernisation improves the performance on the Intel Xeon Phi™ of second generation, code-named Knights Landing (thereafter KNL). The conclusions are finally drawn in Section VI.

Ii Numerical algorithms

Ii-a Particle data and particle interaction scheme in Gadget

A detailed description of the algorithms and workflow of the Gadget code would go beyond the scope of this paper; we refer the reader to [3] for a thorough overview of Gadget-2 (whose general structure is largely shared by P-Gadget3). Here we recall only the information which is essential for the current work.

In Gadget, the particles belonging to the same MPI process (so-called local particles) are organised using a Barnes-Hut octal tree represented as a flat array in memory. The actual particles are not stored in the tree, but just their position in a global particle array per node. This array stores all the local particles contiguously in memory, with each particle being represented by a large C struct. We will get back to the particle data layout and its potential optimisation in Section IV-B.

Both gravitational force computation and search of particle neighbours are performed with the tree algorithm by ‘traversing’ it, in the sense of the hierarchical multipole expansion, implying decisions on opening or not opening tree nodes ([3] and references therein). In the following, we refer mainly to the use of tree traversal in the neighbour search: in this case, for any given particle, its neighbours are searched in the tree, and subsequently for each neighbour the particle interactions are calculated. This particle interaction scheme, consisting of neighbour search and particle interactions, is present in all major code components and comprises most of the total execution time.

Ii-B The Subfind algorithm

Subfind [7, 8] is a tool for identifying gravitationally bound particle groups (physically representing dark matter haloes) in cosmological simulations, based on the friends-of-friends algorithm [10]. Additionally, Subfind has the capability of detecting gravitationally bound substructures within parent groups. In the detection of substructure, a crucial role is played by the computation of density associated to particles, because the identification of substructures is based on the overdensity with respect to the background value. This computation of density at the spatial location of the particle is done according to the standard SPH procedure, namely by kernel interpolation over the particle nearest neighbours:


In (1), is the mass associated with the particle , is the kernel function, and is the kernel smoothing length for particle . In our runs, the Wedland C6 kernel [11] will be adopted as kernel function, using 295 neighbours.

Equation (1) is implemented in the code (Listing LABEL:loop-implemented) as a for loop over the n neighbours of a given particle. For every n, the index j in the particle data structure is retrieved in line 2 (implications for the memory access of this data structure are discussed in Section IV-B), and in line 3 the distance of particle n from the given particle is checked. For a neighbour particle n within a smoothing length from the given particle, two inlined functions are evaluated. They return as output a value for the kernel function (w in line 6), which is then used for the density sum. We stress here that the code described in Listing LABEL:loop-implemented finds application not only in the Subfind algorithm but in several instances of Gadget, therefore its performance optimisation is of general interest for a wide variety of SPH codes. Moreover, as pointed out in [12], this algorithm has access patterns which are pretty similar to those arising from the use of Verlet lists in Molecular Dynamics, hence this work is relevant even beyond astrophysics.

1for (n = 0, n < neighbouring particles){
2  j = ngblist[n];
3  if (particle n within smoothing length){
4    inlined_function1(..., &w);
5    inlined_function2(..., &w);
6    rho += Particle[j].Mass * w;
7  }
8  // other particle interaction computations
Listing 1: Pseudocode for the density computation (Equation 1), typical of SPH schemes. Code and variables are described in the text.

From the Subfind algorithm, only the kernel subfind_density, responsible for the density computation described above, has been isolated through the serialisation procedure, motivated and discussed in the next section. This kernel presents both phases of the particle interactions introduced in Section II-A. In every Gadget component, the workload fraction of the two phases is different. The first phase is nearly the same for all the components, while the second one varies from kernel to kernel, depending on the physics interactions to be calculated. For subfind_density the second phase has less workload, but on the other hand it consists mainly of the code in Listing LABEL:loop-implemented, which has a fundamental importance in every SPH implementation. Because of this, we decided to focus on the optimisation of this kernel, even though in this case the workload of the particle interactions in Listing LABEL:loop-implemented is subdominant with respect to the neighbour search (cf. Section IV-B).

Ii-C Isolating a representative code kernel

The key concept behind the isolation of subfind_density is a simple serialisation library, implemented as a wrapper to the C standard libraries fwrite() and fread(), adding additional functionality, such as checksumming. The isolated subfind_density can be then invoked independently of the main application, but with the exact same input and environment as it had originally. The first step to achieve this is to identify which data is actually accessed by the routines of interest and dump them on the disk, to be later used ‘as-is’ by the isolated kernel. For the Subfind kernel one has to dump the complete particle array, the particle tree data structures and the global configuration variables (held in a global data structure), plus a few additional variables. This sums up to  MB for the physics workload under examination (a simulation evolving dark matter particles and the same number of gas particles).

As a design choice, the isolated kernel has been implemented within a simple front-end driver, completely integrated with the build framework of Gadget. The driver deserialises the input data from the disk, feeds the kernel and finally dumps and verifies the result. By this it simulates the original algorithm behaviour, as far as data structure, variable and function definitions are concerned. The resulting quick prototyping and testing of different optimisation solutions is of great help and importance in the code modernisation process. Moreover we notice that, for simplicity and for focusing on the node-level performance, every reference in the original source code to the MPI parallelism has been removed from the kernel, reducing it to pure OpenMP shared memory parallelism.

Iii Systems and software environment

The results presented here are based on tests performed on the SuperMIC cluster at LRZ222 The system is equipped with 32 nodes, each containing two 8-core Intel Xeon® E5-2650v2 processors (code-named Ivy-Bridge, henceforth IVB) @ 2.6 GHz. To each host node, two Intel Xeon Phi Knights Corner (KNC) coprocessors 5110P with 60 cores @ 1.1 GHz are attached. For all KNC tests presented here, the kernel is executed in native mode.

We use the Intel® C++ Compiler 16.0. Compilation flags, if not otherwise specified, are:

  • -xhost for generating vectorised code on IVB;

  • -mmic for generating vectorised code on KNC.

  • -no-vec -no-simd for scalar code on IVB;

  • -mmic -no-vec -no-simd for scalar code on KNC;

  • In all the cases, we use the compiler flags -O2 -qopenmp.

For the multithreaded version of the code (OpenMP), the thread execution has been bound to physical processing units (thread pinning) using the thread affinity interface of the Intel® OpenMP Runtime Library. On IVB the assignment of consecutive threads is done on physical cores which are close in topology to each other, using multiple assignments (simultaneous multithreading, SMT) only if all physical cores are filled. On KNC an even distribution throughout the cores has been guaranteed333In terms of the environment variable KMP_AFFINITY, these assignments (tuned by optimal performance) are equivalent to granularity=fine, compact,1,0 on IVB and to granularity=fine,scatter on KNC.. SMT is used both on IVB (two threads per core, resulting in tests with 32 threads) and on KNC, where up to four threads per core (a maximum of 240 threads in total) are available. Especially on this system, the efficient utilisation of multiple threads per core is crucial for getting performance [13].

The application analysis is performed using tools from the Intel Parallel Studio XE 2017. In particular, the multithreading issues have been identified using VTune™ Amplifier XE, and the vectorisation prototyping has been investigated with Intel® Advisor. Furthermore, when generating vectorised code we make extensive use of the vectorisation reports, produced with the compiler flag -qopt-report=5. The reports provide a good insight on the optimisations performed automatically by the compiler. In Section IV-B, the cache performance has been studied with LIKWID [14].

Iv Performance results

Iv-a Threading parallelism

Before focusing on the particle interaction phase, the OpenMP implementation of the subfind_density kernel (similar to that of other components of P-Gadget3) will be analysed and improved in this section. The version of the kernel we start our optimisation from, thereafter dubbed as original, has been profiled with 8 threads using VTune Amplifier XE on an IVB node. The profiling using the analysis type Basic hotspots shows a severe OpenMP parallelisation overhead, with more than 40% of the execution time spent in threads spinning on locks.

1more_particles = partlist.length;
2while (more_particles) {
3  int i = 0;
4  while (!error && i < partlist.length) {
5   #pragma omp parallel
6   {
7     #pragma omp critical
8     {
9       p = partlist[i++];
10     }
11     if (!must_compute(p))
12       continue;
13     ngblist = find_neighbours(p);
14     sort(ngblist);
15     for (auto n : select(ngblist, K))
16       compute_interactions(p, n);
17   }
18  }
19  more_particles = mark_for_recomputation(partlist);
Listing 2: Parallel particle interaction scheme in Gadget (pseudocode in C++11 notation).

Taking a closer look at the original algorithm (schematically drawn in Listing LABEL:alg-subfind-orig), it is easy to spot the potential bottleneck at the lock set when accessing the particle list. More specifically, within the list of particles partlist, each thread gets the next particle to process within the critical code section (line 7), and checks only then whether it is actually needed to be processed (line 11). If this is the case, then it proceeds with the neighbour finding and particle sorting (lines 13 and 14; sorting will be discussed in detail in Section IV-D) and the particle interaction phase (line 16, basically the code in Listing LABEL:loop-implemented), otherwise tries to get the next particle in the list. The reason behind this is that, at the end of each iteration on particles, a set of them is marked for recomputation (line 19), if they have a number of neighbours smaller than 295. For these particles, the smoothing length is adjusted, and the algorithm is repeated until no particles are left. The number of particles to be recomputed in each iteration decreases exponentially, almost halving in every iteration. For our workload, a total of 18 iterations of the outermost loop (line 2) is required. The problem with the locking scheme of this algorithm is that, at later iterations, the particle list is locked and unlocked constantly since most of the particles need not be recomputed, creating a tremendous parallelisation overhead.

Having identified the problem, one can devise more than a single solution. In a first optimisation attempt, a ‘todo’ particle list, which contains only particles that need to be recomputed in each iteration, has been introduced. Instead of iterating the whole particle list every time, the iteration is performed only over this ‘todo’ list. This intermediate solution fixes the problem of unnecessary locking and unlocking the particle list, since it is guaranteed that the acquired particle needs to be recomputed. This new list has to be updated at the end of each iteration, but this incurs negligible overhead, since it can be easily integrated in the process of marking the particles for recomputation. This improved strategy is quite efficient on IVB, but still shows limitations for very large thread counts on KNC. The key efficiency problem still stems from the common particle queue, from which all threads try to fetch the next particle to compute.

A tempting approach for better fixing this problem is to split the particle queue into equal parts, and assign each part to a separate thread for processing. This approach, however, is problematic since the work associated with each particle is not known beforehand. Indeed, trying to implement such a static partitioning scheme leads to severe load imbalance among the threads and worse performance than the version with lock contention fix. In order to deal with the unavoidable intrinsic variations in the amount of work associated with each particle (caused for instance by accessing neighbours from memory), the individual work items must be scheduled dynamically among the threads. The most straightforward way to implement this takes advantage of the OpenMP’s work sharing capabilities, by means of a simple omp for loop using dynamic scheduling (Listing LABEL:alg-subfind-lockless). We notice here at line 1 the particle list todo_partlist, consisting of particles flagged for being recomputed, as described above. Additionally, to suppress lock contention completely, the critical section inside the loop must be avoided. To achieve this, we have restructured the loop by moving all premature exit condition checking (compare Listing LABEL:alg-subfind-orig, line 4, with Listing LABEL:alg-subfind-lockless, lines 6–7) outside the loop (i.e., at line 13 of Listing LABEL:alg-subfind-lockless).

1todo_partlist = partlist.length;
2while (todo_partlist) {
3  error = 0;
4  #pragma omp parallel for schedule(dynamic)
5  for (auto p : todo_partlist) {
6    if (something_is_wrong)
7      error = 1;
8    ngblist = find_neighbours(p);
9    sort(ngblist);
10    for (auto n : select(ngblist, K))
11      compute_interactions(p, n);
12  }
13  // ... check for any errors
14  todo_partlist = mark_for_recomputation(partlist);
Listing 3: Like Listing LABEL:alg-subfind-orig, but with the work sharing implemented through a omp for loop with dynamic scheduling.

We will dub the kernel version described by Listing LABEL:alg-subfind-lockless as lockless hereafter, because the explicit locks are removed (although they may still be used by the OpenMP implementation). The improvement of performance produced by this new work sharing strategy is shown firstly by the dramatic decrease of OpenMP overhead: the fraction of execution time spent on thread spinning in this improved version is , to be compared with of the original version.

Fig. 1: Parallel speed-up of the subfind_density kernel on IVB (lines with circles) and KNC (lines with triangles), in a comparison of the original and lockless code versions. The different runs are indicated by lines and symbols as in the legend, while the dashed black line denotes ideal speed-up. All data are relative, and normalised to the one-thread performance of the corresponding architecture.

Another useful information about the performance improvement can be obtained by a thread scaling analysis of the kernel (Figure 1). In the original kernel version, the lock contention problem is also the source of a slowdown observed on the scaling plot with increasing thread number. In particular, moving beyond a single socket on IVB has a large negative impact on performance, since locking cost becomes quite expensive when synchronising across sockets in NUMA systems. Conversely, the code optimisation in the lockless version increased the scalability of the kernel, especially on the Xeon Phi.

More specifically, the speed-up has improved by on a IVB socket (8 threads), reaching parallel efficiency, and a by significant on KNC with 60 threads ( parallel efficiency after the optimisation). Important to note, the kernel performance in the lockless version can efficiently scale across the sockets. Furthermore, it can benefit from the use of SMT on IVB, which was not the case for the original version. In addition on KNC, with the lockless version the performance is improved also when moving beyond one thread per core.

Since the single thread performance has not changed, these results automatically translate into a time to solution speed-up. Concerning the performance comparison between Xeon and Xeon Phi, for the original kernel version KNC was slower than IVB, while for lockless kernel it is only (comparing 60 threads on KNC and 8 threads on IVB). Optimising the work sharing among the OpenMP threads has thus allowed KNC performance to nearly meet half of the performance of a single IVB socket.

Iv-B Enabling vectorisation: redesign of the particle data layout

After having improved the work sharing strategy of subfind_density, the next logical step is to investigate its potential for vectorisation. However a performance bottleneck on the data structure has to be removed first. This section is dedicated to this topic.

In order to obtain timing information and vector optimisation hints at loop level, the lockless version is compiled on IVB and profiled using Intel Advisor. We observe in the Loop Metrics that of the total runtime is spent in scalar loops. Among them, the most expensive part in terms of runtime ( is given by the tree-walk computation needed for the neighbour search. Its current implementation involves a while loop, which is not a suitable candidate for being automatically vectorised by the compiler. A modification of this part of the algorithm would be a major code restructuring, which is out of the scope of the current work. However, an essential part of the neighbour finding phase is the particle sorting, which is subject of the algorithmic improvement in Section IV-D.

The second most expensive loop ( of the total runtime) is instead more suitable for our vectorisation analysis. This corresponds to the computation described in Listing LABEL:loop-implemented, whose general importance has been introduced in Section II-B. In the lockless kernel version, this loop is not vectorised due to the data layout of the code, which prevents spatial data locality. For this reason, before dedicating any effort to vectorisation, an essential prerequisite is a redesign of the data structure.

As mentioned in Section II-A, in P-Gadget3 the particle data are organised as a C struct, and the collection of all of them is implemented as array of structures (AoS for brevity). This kind of data organisation is typical for particle codes because it naturally allows the addition of further physical quantities into the algorithmic scheme with minimal coding overhead. In P-Gadget3, and consequently in subfind_density, the particle data structure can be quite large, especially in problems involving a large number of different physical variables. In our test case, the particle data structure object has indeed 224 bytes per particle.

It is well known that accessing a particle data structure organised as AoS hinder efficient cache utilisation. In the case of subfind_density, a VTune Amplifier XE General exploration analysis on IVB shows that the kernel is back-end bound, a situation typically hinting to data cache misses. The very large particle struct incurs memory accesses with large strides, even if the neighbours are accessed completely sequentially in the particle array.

Moreover the utilisation of Single Instruction Multiple Data (SIMD) registers is suboptimal with this data layout. According to the general auto-vectorisation guide [15], the Intel C++ Compiler can automatically generate a vectorised code version for loops computing on AoS. Presumably, the performance of such code is very low, due to the non-unit strided memory access. However in our specific case the loop was not even vectorised by the compiler, because of the unfavourable data layout.

The reasons given above motivate the change of the data structure into a structure of arrays (henceforth SoA), in order to promote better data locality and auto-vectorisation, and thus to explore the vector capabilities of the tested hardware. For this change of data layout, a minimally invasive approach has been chosen: we assume that the code framework surrounding the kernel is still making use of the AoS data structure implementation (e.g. for the MPI communication), and that the new SoA data structure is only exposed in loops over particle neighbours. This choice thus restricts the code modifications that are required, and makes the consistency checks simpler, at the cost of additional routines for copying data from AoS to SoA and vice-versa.

Among the many variables defined in the former particle data structure Particle (cf. Listing LABEL:loop-implemented, line 6), only those used in subfind_density are converted into the new SoA particle data structure ParticleSoA. This results in a more compact particle data structure (60 bytes per particle). We implemented software gather and scatter routines, in order to keep the new improvement consistent with the kernel infrastructure. The gather function is called before the computation to copy the necessary data from Particle into ParticleSoA. After the computation, using the SoA layout, the data is copied back (scatter) to the AoS Particle. The gather and scatter routines include threading parallelism. The overhead of these routines sum up to at most of the total runtime, both on IVB and KNC. We dub the resulting new kernel version as SoA. The performance improvements introduced in this version are clearly visible from the better memory data access. In the initial analysis using LIKWID we observe a large fraction of cache misses (), which are lowered to when the SoA particle data structure is used.

Fig. 2: Node-level performance, defined as the inverse of the execution time, on IVB and KNC at three different stages of our optimisation process, as indicated in the legend. The performance is normalised to the best original IVB run (bar on the left-hand side); higher bars correspond to higher performance.

An overview of the node-level performance improvements, including the benefit associated with the introduction of the SoA, is shown in Figure 2. Here the inverse of the execution time is shown and the best result on the IVB node (or on the KNC) is reported, including SMT whenever applicable. Changing the data structure to SoA gives an additional improvement in performance on IVB, while a significant impact () is apparent on KNC. Compared with the gain between the original and the lockless version, the improvement introduced by the SoA version seems less remarkable (leaving aside the benefit for memory access). However, the main goal for introducing the SoA data structure, namely enabling auto-vectorisation, has been achieved. The results in Figure 2 still refer to the kernel compiled for generating scalar code (i.e. options -no-vec -no-simd). Results for vectorised code are discussed in Section IV-C.

Iv-C Optimising vectorisation

With the change in data layout described in the Section IV-B, the compiler is finally able to automatically vectorise our target loop, previously identified as the one shown in Listing LABEL:loop-implemented. Now we want to evaluate the speed-up introduced by vectorising this loop on different systems, and possible further improvements.

From the SIMD perspective, probably the most problematic point in Listing LABEL:loop-implemented is the relatively large part embraced in the if statement starting at line 3. If the compiler implements the if condition through masking, all subsequent statements inside the condition (and function calls) need to be masked, too. To reduce the masking overhead, one should put the condition as low as possible in the call tree. Therefore we have moved the if statement inside inlined_function1. When the if condition is not fulfilled, w = 0.0 is returned in line 6 of Listing LABEL:loop-implemented, making this implementation equivalent to the initial one. This new kernel version is dubbed as vectorised.

We measure the runtime spent in the target loop to quantify the vectorisation speed-up, defined as the ratio between the time spent in the scalar loop and the vectorised versions:


The corresponding (architecture-dependent) vector efficiency is defined as the ratio of and , which is the loop vector length. For the loop under investigation, on IVB and 8 on KNC, since double precision FP values are used in the computation.

In the SoA version the loop was vectorised, as confirmed by the compiler report. However no performance benefits with respect to the scalar code are measured, neither on IVB nor on KNC, resulting in . An improvement is observed between the kernel versions SoA and vectorised, which supports our assumption of a masking bottleneck. More specifically, for IVB (), while for KNC (). Although very promising, the reported speed-ups are still well below the ideal vector speed-up. This is due to several reasons. First, the use of auto-vectorisation is generally less efficient than programming with intrinsics. Second, the way the particle data structure is accessed (line 2 in Listing LABEL:loop-implemented) leads to strided memory access, as also been confirmed by the Memory Access Pattern analysis of Intel Advisor. Furthermore additional optimisation like memory-aligned data access is missing (as shown by the compiler report) and definitely deserves attention in future work.

Iv-D Further algorithmic improvements

Examining again the execution time profile of subfind_density with the previous optimisations reveals that a significant amount of time () is spent sorting the neighbour list. The original algorithm (cf. Listing LABEL:alg-subfind-lockless, line 9) first sorts the resulting neighbour list according to their distance from the current particle and then selects the first neighbours ( in our configuration). However, in this specific kernel the sorting information is not used beyond that point, meaning that the only purpose of sorting is to select the nearest neighbours from the neighbour list. This selection can be achieved in linear scaling time with the number of particles, instead of the sorting time, by only partially sorting the neighbours.

An array of elements, on which an ordering relation is defined, can be partially sorted in linear time, so that for each . For our purpose, we have used the QUICKSELECT algorithm [16], which is based on QUICKSORT. Although its worst-case performance is of as in case of QUICKSORT, in the average case it is of . Therefore, it is a good candidate for substituting the complete sorting in our case. Our initial experiments comparing the two algorithms (our straightforward QUICKSELECT implementation vs. C’s QUICKSORT) showed that the difference in average complexity pays off for very large arrays where the performance difference is at least one order of magnitude. However, the neighbour list that needs to be sorted in the Subfind kernel is relatively small (500–1000 particles), but this operation is executed nearly a million times in our test. The small neighbour list tightens the margin of the performance improvement we get by replacing QUICKSORT in the Subfind kernel. However, the replacement still brings to a significant improvement of 41% on IVB (16 threads) and 27% on KNC (120 threads) compared to the vectorised kernel version described in the previous section. The kernel version discussed in this section represents the final step of the optimisation process presented in this work, and will be henceforth indicated as optimised.

V Performance on Intel Knights Landing

The optimisations described in Section IV have the merit of being inspired by a general approach (code modernisation, [17]), which can be used for porting and improving the performance on modern computer architectures. In this section we verify that the code improvements described in this work lead to a reproducible performance gain also on newer hardware, which was not the original target of our project. More specifically, we show here results of tests performed on Intel Xeon processor E5-2697v3 (code-named Haswell, HSW), with 28 cores @  GHz, and on the Intel Xeon Phi 7210 with 64 cores @  GHz (code-named Knights Landing, KNL).

The software environment is the same as described in Section III. For HSW, the same compilation flags of IVB were used. In the case of KNL, a full exploration of memory and cluster modes available on this system [18, 19] would go beyond the scope of this paper, and will be subject of forthcoming work. Our settings are Flat for memory mode and All–to–All for cluster mode. The compilation flags on KNL are: -xMIC-AVX512 -O2 -qopenmp. For simplicity of use, we decided to allocate on the High-Bandwidth Memory as preferred option, and falling back to DDR memory if needed444This is achieved at runtime by numactl -preferred 1.. The thread pinning has been set similarly to KNC (KMP_AFFINITY=scatter).

Fig. 3: Parallel speed-up of subfind_density on different architectures (see legend). The dotted lines correspond to the original version of the kernel, while the solid lines to the optimised version. All data are relative, and normalised to the one-thread performance of the corresponding architecture.

In Figure 3 the scaling with the number of threads of the optimised version of the kernel is compared with the original version, for HSW and KNL. As a term of comparison, the line for the optimised KNC speed-up is added.

The original kernel shows poor scaling also on the newest architectures, similar to what initially seen in Figure 1. The optimised version, on the other hand, reproduces the performance improvement observed on IVB and KNC. Between the two versions we measure a speed-up of on HSW (14 threads), and a significant improvement on KNL (64 threads). On more than two threads per core the scaling for KNL is lower than the one for KNC, indicating a different architectural need for SMT between the two generations of Xeon Phi. Final results on time to solution will be presented in the next section, but we mention here for completing this discussion that the single-thread execution time on KNL is faster than on KNC for the optimised kernel.

Finally, the auto-vectorisation of the target loop (Section IV-C) has a remarkable performance on KNL, where we measure (, having ). The reason for the improvement over IVB and KNC has not been explored yet. We speculate that it might be caused by the use of the Intel® AVX512 instruction set, featuring more efficient gathers and mask manipulation instructions.

Vi Conclusions and future work

Many community codes have been originally designed as a serial application, and grown by merely adding different layers of parallelism. Code modernisation is important for these codes, in order to be prepared for next generation processors, and to immediately profit from the current hardware. In this work we presented node-level performance optimisation of algorithms used in P-Gadget3, having as target Intel Xeon and Xeon Phi architectures. The described strategies and the resulting code modifications, such as prevention of the lock contention, SoA data layout and vectorisation improvement are general, and applicable to a broad range of similar codes. The performance improvements are significant both in terms of threading scalability (Figures 1 and 3) and execution time.

Concerning these final diagnostics, a summary of obtained results on all tested architectures is shown in Figure 4. We can see an improvement in runtime by factors of and on IVB and HSW, respectively. It has been verified that performance on the latter system benefits also from the compiler generation of Fused Multiply-Add (FMA) instructions.

Fig. 4: Execution time of subfind_density on different architectures, indicated in the legend, for the kernel versions original and optimised. The data refer to tests performed on 8 threads (one socket) for IVB, 14 threads (one socket) for HSW, 240 threads (i.e. four threads per core) for KNC and 128 threads (two threads per core) for KNL.

The unsatisfactory result of the original kernel on KNC highlights a known feature of this system, namely its sensitivity to parallelism bottlenecks. On the other hand, KNC benefits considerably (and more than Xeon) from the code improvements, with a total speed-up of . The performance gap between KNC and IVB then decreases from to . The results for KNL show that this system is more tolerant concerning performance issues than KNC. Indeed, on KNL the optimised kernel is faster than original, and faster than the result obtained on the HSW architecture. We stress once more that KNL, despite of its good performance, was not the original target of this project. Early work on Intel Xeon E5-2699v4 (Broadwell) further confirms these findings (total speed-up: ) and thus the portability of the presented approach.

The relevance of the presented work goes beyond subfind_density, and impacts the modernisation of the Gadget code and of similar community applications. Our path for minimally invasive, portable and readable optimisation is a clean and straightforward strategy, for the wide Gadget community to be adopted in their own development lines. Future work will certainly include a plan to backport our optimisation solutions to the most important parts of P-Gadget3. This task is demanding, because it is going to touch the current MPI implementation of the code and its underlying data structure, defined as AoS. Moreover, after the first tests shown here, more work will be devoted to KNL-specific features (use of MCDRAM, Intel AVX512 instruction set, clustering modes) and their potential for optimisation. On the long term, one must not exclude further modifications, like refactoring the tree traversal algorithm, in a fashion which is more favourable for execution on many-core systems with large vector registers.


The research of F. B., L. I. and V. K.  has been partly supported by the Intel Parallel Computing Center (Intel PCC) ExScaMIC – Extreme Scaling on MIC/x86 at LRZ and Technical University of Munich (TUM). Thanks to N. Tchipev (TUM) for valuable suggestions on the draft, to K. Dolag, M. Petkova and A. Ragagnin for their support in the use of Gadget, and to H. Bockhorst and G. Zitzlsberger (Intel) for consulting and careful reading of the manuscript. Intel, Xeon, Xeon Phi and VTune are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.


  • [1] N. Kaiser, “Evolution and clustering of rich clusters,” MNRAS, vol. 222, pp. 323–345, 1986.
  • [2] V. Springel, “Larger, faster, better: Current trends in cosmological simulations,” Astronomische Nachrichten, vol. 333, pp. 515–522, 2012.
  • [3] ——, “The cosmological simulation code GADGET-2,” MNRAS, vol. 364, pp. 1105–1134, 2005.
  • [4] F. Sembolini, G. Yepes, F. R. Pearce, A. Knebe, S. T. Kay, C. Power, W. Cui, A. M. Beck, S. Borgani, C. Dalla Vecchia, R. Davé, P. J. Elahi, S. February, S. Huang, A. Hobbs, N. Katz, E. Lau, I. G. McCarthy, G. Murante, D. Nagai, K. Nelson, R. D. A. Newton, V. Perret, E. Puchwein, J. I. Read, A. Saro, J. Schaye, R. Teyssier, and R. J. Thacker, “nIFTy galaxy cluster simulations - I. Dark matter and non-radiative models,” MNRAS, vol. 457, pp. 4063–4080, 2016.
  • [5] G. Bazin, K. Dolag, and N. Hammer, “Gadget3: Numerical Simulation of Structure Formation in the Universe,” inSiDE, vol. 11, no. 2, 2013.
  • [6] A. M. Beck, G. Murante, A. Arth, R.-S. Remus, A. F. Teklu, J. M. F. Donnert, S. Planelles, M. C. Beck, P. Förster, M. Imgrund, K. Dolag, and S. Borgani, “An improved SPH scheme for cosmological simulations,” MNRAS, vol. 455, pp. 2110–2130, 2016.
  • [7] V. Springel, S. D. M. White, G. Tormen, and G. Kauffmann, “Populating a cluster of galaxies - I. Results at z=0,” MNRAS, vol. 328, pp. 726–750, 2001.
  • [8] K. Dolag, S. Borgani, G. Murante, and V. Springel, “Substructures in hydrodynamical cluster simulations,” MNRAS, vol. 399, pp. 497–514, 2009.
  • [9] P. Behroozi, A. Knebe, F. R. Pearce, P. Elahi, J. Han, H. Lux, Y.-Y. Mao, S. I. Muldrew, D. Potter, and C. Srisawat, “Major mergers going Notts: challenges for modern halo finders,” MNRAS, vol. 454, pp. 3020–3029, 2015.
  • [10] A. Klypin, S. Gottlöber, A. V. Kravtsov, and A. M. Khokhlov, “Galaxies in N-Body Simulations: Overcoming the Overmerging Problem,” ApJ, vol. 516, pp. 530–551, 1999.
  • [11] W. Dehnen and H. Aly, “Improving convergence in smoothed particle hydrodynamics simulations without pairing instability,” MNRAS, vol. 425, pp. 1068–1082, 2012.
  • [12] A. Ragagnin, N. Tchipev, M. Bader, K. Dolag, and N. J. Hammer, “Exploiting the Space Filling Curve Ordering of Particles in the Neighbour Search of Gadget3,” in Parallel Computing: On the Road to Exascale, ser. Advances in Parallel Computing, vol. 27.   IOS Press, 2016, pp. 411–420.
  • [13] A. Vladimirov, R. Asai, and V. Karpusenko, Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors.   Colfax International, 2015.
  • [14] J. Treibig, G. Hager, and G. Wellein, “LIKWID: Lightweight Performance Tools,” ArXiv e-prints, 2011, 1104.4874.
  • [15] (2012) A Guide to Auto-vectorization with Intel C++ Compilers. Intel Developer Zone. [Online]. Available:
  • [16] C. A. R. Hoare, “Algorithm 65: find,” Communications of the ACM, vol. 4, no. 7, 1961.
  • [17] (2015) What is Code Modernization? Intel Developer Zone. [Online]. Available:
  • [18] A. Vladimirov and R. Asai. (2016) Clustering modes in Knights Landing processors: developer’s guide. Colfax International. [Online]. Available:
  • [19] R. Asai. (2016) MCDRAM as High-Bandwidth Memory (HBM) in Knights Landing processors: developer’s guide. Colfax International. [Online]. Available:
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description