A Systematic Comparison of Dynamic Load Balancing Algorithms for Massively Parallel Rigid Particle Dynamics

Abstract

As compute power increases over time, larger and more complex simulations become possible. However, it becomes increasingly difficult to use the available computational resources efficiently. Especially in particle-based simulations with a spatial domain partitioning, large load imbalances can arise because the simulation is dynamic, so that a static domain partitioning may not be suitable. This can deteriorate the overall runtime of the simulation significantly. Sophisticated load balancing strategies must therefore be designed to alleviate this problem. In this paper we conduct a systematic evaluation of the performance of six different load balancing algorithms. Our tests cover a wide range of simulation sizes and employ one of the largest supercomputers available. In particular, we carefully study the runtime and memory complexity of all components of the simulation, since identifying bottlenecks and predicting the scaling behaviour is essential when progressing to extreme scale simulations. Scaling experiments are shown for up to over one million processes. The performance of each algorithm is analyzed with respect to the quality of the load balancing and its runtime costs. For all tests, the waLBerla multiphysics framework is employed.

keywords:
Dynamic Domain Partitioning, Dynamic Load Balancing, Rigid Body Dynamics, Discrete Element Method, Non-Smooth Granular Dynamics

1 Introduction

Rigid body dynamics is widely used for the simulation of rigid particles. Well known methods in this field are the Discrete Element Method (DEM) Cundall1979 () and methods based on non-smooth granular dynamics Stewart2000 (); Preclik2015 (); Preclik2017a (). Together with the increasing compute power of today's supercomputers, scenarios comprising billions of non-spherical particles and contacts become possible Preclik2015 (); Eibl2018 (). To achieve simulations of this size, carefully engineered software with a focus on highly parallel algorithms is needed. The typical parallelization strategy for particle simulations is to partition the simulation domain into subdomains and assign them to different processes Beazley1995 (); Plimpton1995 (). One important goal of this initial domain partitioning is to achieve an equal workload for all cores. However, since the simulated system is dynamic and particles may migrate between subdomains, the workload can shift during the simulation. This leads to load imbalances that can slow down the whole simulation. To overcome this problem, the domain partitioning must be adapted dynamically throughout the simulation and/or the subdomains must be reassigned to different processes. Many simulation frameworks have therefore adopted load balancing, and results have been published for simulations of various sizes.

1.1 Related Work

Compared to rigid body dynamics, molecular dynamics simulations differ in some aspects; the load balancing problem, however, is closely related. Therefore we also consider methods proposed in the context of molecular dynamics here. A slightly dated but still relevant review of different methods suitable for load balancing can be found in Hendrickson2000 (). Owen et al. Owen2000 () use load balancing based on the ParMetis Karypis1998a () graph partitioning library and show results for simulations with up to 6 cores. Deng et al. Deng2000 () present a dynamic load balancing approach for molecular dynamics simulations which deforms the domain partitioning at runtime. The initial rectangular grid is optimized by moving the corners of all subdomains individually in space to adjust to the simulation. A good quality of the partitioning is reported, which improves the runtime performance, but no claims about the global quality of the runtime improvement are made. Scaling results for up to cores are shown by Begau and Sutmann Begau2015 () for an optimized version. Another popular approach to domain partitioning is Orthogonal Recursive Bisection (ORB) Warren1993 () or Recursive Coordinate Bisection Berger1987 (). These methods recursively subdivide the simulation space by separating planes, trying to keep the workload on both sides of each separating plane equal. Adaptations of these methods for particle simulations have been reported Fleissner2007 (); Fleissner2008 (); Shojaaee2012 (). Fattebert et al. Fattebert2012 () present a detailed analysis of their load balancing approach based on a domain partitioning with Voronoi cells implemented in ddcMD. A steepest descent algorithm is used to adapt the Voronoi cells dynamically. Good scaling results are shown for up to MPI ranks. A kd-tree based load balancing approach as implemented in the molecular dynamics software package ls1 mardyn Heinecke2015 () is reported to show good performance for up to 1024 MPI ranks Niethammer2014 (). GROMACS Hess2008 (), a well-known molecular dynamics framework, uses cells to partition the domain. These cells are adapted based on time measurements carried out during the simulation. Unfortunately, no more detailed measurement-based analysis of the load balancing algorithm is available in Hess2008 (). The LIGGGHTS DEM framework Berger2015 () uses a Cartesian grid of subdomains, and the dynamic load balancing is performed with a recursive multi-sectioning algorithm based on previous split locations and aggregated particle sums. Measurements of the load imbalance with different MPI/OpenMP configurations with up to threads are presented Berger2015 (). Building on this work, Cintra et al. Cintra_2016 (); Cintra_2016a () demonstrate a hybrid OpenMP/MPI parallelization implemented in DEMOOP. An RCB based algorithm is used for domain partitioning and various methods are applied for particle sorting and distribution onto threads. Scaling results are shown for up to 64 cores for different test setups. However, the parallel efficiency most of the time remains below . Markauskas and Kaceniauskas Markauskas2015 () present a comparison of the RCB implemented in the Zoltan library Boman2012 () and the multilevel k-way algorithm of the ParMetis library Karypis1998a () with regard to dynamic load balancing. Comparisons with up to cores are conducted using the DEMMMAT_PAR code. A summary of the largest simulations with load balancing as reported by various authors is collected in Tab. 1.

1.2 Contribution and Outline

reference cores MPI ranks load balancing algorithm
Owen et al. Owen2000 () 6 subdomain deformation
IMD Begau2015 () subdomain deformation
Fleissner et al. Fleissner2007 (); Fleissner2008 () 16 ORB based
ls1 mardyn Niethammer2014 () based on kd-trees
DEMOOP Cintra_2016 (); Cintra_2016a () RCB based
ddcMD Fattebert2012 () based on Voronoi cells
DEMMMAT_PAR Markauskas2015 () Zoltan & ParMetis
LIGGGHTS Berger2015 () recursive multi-sectioning
our contribution octree based & various algorithms
Tab. 1: Largest simulations with load balancing as reported by various authors.

Although load balancing is widely used, only a few authors evaluate the performance gained by load balancing in detail. In this paper we use a systematic setup which enables us to give a-priori estimates of the expected performance gain from load balancing. We then evaluate different load balancing algorithms for rigid body dynamics simulations and compare the results with our expectations. Our goal is also to assess the usability of these algorithms in high performance computing. This includes problem sizes which were run on cores with MPI ranks in parallel. To our knowledge this exceeds previously published results significantly, see Tab. 1. Additionally, we analyze the runtime and resource costs that have to be paid for these algorithms. All experiments were conducted on one of today's largest supercomputers, the Juqueen supercomputer located in Jülich, Germany. Unfortunately this supercomputer was shut down in May 2018.

The remainder of this paper is structured as follows. In Sec. 2 we describe our load balancing environment comprising the domain partitioning (Sec. 2.1), the general load balancing pipeline (Sec. 2.2) and all load balancing algorithms selected for testing (Sec. 2.3). This is followed by a description of the computing environment in Sec. 3.1, the metrics we employ to compare the algorithms in Sec. 3.2 and our test setup in Sec. 3.3. Subsequently, the evaluation of the performance of the algorithms is presented in Sec. 3.4 and Sec. 3.5. Sec. 4 summarizes the insight gained and discusses possible future lines of investigation.

2 Our approach

Our approach targets millions of processes and beyond. The most important design criterion is therefore the strict locality of data and algorithms. Most data needed in a rigid particle dynamics simulation can be stored locally, with only additional data from neighboring subdomains. This includes particle data, collision data, and all data needed for resolving collisions. However, the domain partitioning is typically stored on every process. This results in memory requirements that grow like O(p), with p being the number of processes. A memory complexity like this limits the number of processes that can be used before all memory is depleted. Therefore the runtime and memory complexity is an essential concern that must be addressed already in the early stages of the algorithm and software design process. In the following sections we outline our approach, which leads to excellent scalability also for very large simulations Schornbaum2016 (); Schornbaum2017 (); Eibl2018 ().

2.1 Domain Partitioning

We use a distributed forest of octrees to organize our simulation domain. To generate the partitioning, we decompose the simulation domain into a grid of equally shaped brick-like domains. Each of these bricks acts as the root of an octree. Depending on the setup we then start to refine each octree individually by subdividing the corresponding subdomain at its center into eight equal subdomains. Each of these subdomains forms a branch of the octree. The procedure is then repeated for every branch. The subdomains used for the simulation are the leaves of the octree. We also impose additional constraints on our data structure to make it possible to handle and refine the domain efficiently during a simulation. A parent node is always split exactly at its center, and neighboring subdomains can differ by at most one level of refinement. This helps to keep the number of neighbors of each subdomain bounded. A more detailed description of this data structure and its distributed implementation can be found in Schornbaum2016 (). Building upon this, the load balancing pipeline is described in the next section.
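To make the structure of this partitioning concrete, the following minimal sketch (our own C++ illustration, not the waLBerla implementation; all names are placeholders) builds one octree node per brick, splits a node exactly at its center into eight children, and checks the refinement constraint between two leaves.

#include <array>
#include <cstdlib>
#include <memory>

struct AABB { double min[3], max[3]; };                  // axis-aligned subdomain box

struct OctreeNode {
    AABB box{};
    int  level = 0;
    std::array<std::unique_ptr<OctreeNode>, 8> children; // all empty -> this node is a leaf

    bool isLeaf() const { return children[0] == nullptr; }

    // Split exactly at the center into eight equally sized child subdomains.
    void refine() {
        const double c[3] = { 0.5 * (box.min[0] + box.max[0]),
                              0.5 * (box.min[1] + box.max[1]),
                              0.5 * (box.min[2] + box.max[2]) };
        for (int i = 0; i < 8; ++i) {
            children[i] = std::make_unique<OctreeNode>();
            children[i]->level = level + 1;
            for (int d = 0; d < 3; ++d) {
                const bool upper = (i >> d) & 1;
                children[i]->box.min[d] = upper ? c[d] : box.min[d];
                children[i]->box.max[d] = upper ? box.max[d] : c[d];
            }
        }
    }
};

// 2:1 balance constraint: neighboring leaves may differ by at most one level
// of refinement, which bounds the number of neighbors of each subdomain.
bool satisfiesTwoToOneBalance(const OctreeNode& a, const OctreeNode& b) {
    return std::abs(a.level - b.level) <= 1;
}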

2.2 Load Balancing Pipeline

The actual load balancing pipeline is a three step process. First, weights are calculated for every subdomain. Two weights are needed for the subsequent load balancing process. One is the computational weight, which quantifies the work that must be performed to advance all particles within a subdomain by one time step. Typically this number is closely related to the number of collisions that need to be resolved in that subdomain, since the time needed for collision detection and collision resolution scales essentially with the number of contacts. The second weight is the communication weight, which describes how much communication is involved in keeping the subdomain in sync with its neighbors. The exact computation of the weights in our setup will be described in Sec. 3.3.
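A minimal sketch of such weights (illustrative C++, not the waLBerla interface; the function names and the choice of proxies are assumptions based on the description above): the computational weight is approximated by the number of local particles, the communication weight by the area of the interface shared with a neighboring subdomain.

#include <algorithm>
#include <cstddef>

struct AABB { double min[3], max[3]; };

// Computational weight: proxy for the contacts to be resolved in this subdomain.
double computationalWeight(std::size_t numLocalParticles) {
    return static_cast<double>(numLocalParticles);
}

// Communication weight: area of the face shared by two axis-aligned subdomains.
double communicationWeight(const AABB& a, const AABB& b) {
    double overlap[3];
    for (int d = 0; d < 3; ++d)
        overlap[d] = std::max(0.0, std::min(a.max[d], b.max[d]) -
                                   std::max(a.min[d], b.min[d]));
    // Face neighbors touch in exactly one dimension (zero overlap there);
    // the product of the two remaining extents is the shared interface area.
    if (overlap[0] == 0.0) return overlap[1] * overlap[2];
    if (overlap[1] == 0.0) return overlap[0] * overlap[2];
    return overlap[0] * overlap[1];
}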

When all weights have been determined, we refine the octree in areas with a high workload and coarsen it in areas with a low workload. This is done to maintain a compromise: on the one hand we must keep the number of subdomains small, on the other hand we must generate workload portions small enough that the workload can be balanced efficiently. Since only whole subdomains can be moved between processes, the granularity of the workload limits the achievable workload balance. Note that the octree is not refined or coarsened for physical reasons, as the simulation accuracy is not related to the subdomain size.
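A minimal sketch of the refine/coarsen decision (illustrative C++; the names and thresholds are assumptions, in our setup the decision is based on the number of particles per subdomain as described in Sec. 3.3):

#include <cstddef>

enum class Action { Keep, Refine, Coarsen };

// Subdomains with many particles are marked for splitting, (almost) empty ones
// for merging; everything in between keeps its current refinement level.
Action refinementDecision(std::size_t particlesInSubdomain,
                          std::size_t refineThreshold,
                          std::size_t coarsenThreshold) {
    if (particlesInSubdomain > refineThreshold)  return Action::Refine;
    if (particlesInSubdomain < coarsenThreshold) return Action::Coarsen;
    return Action::Keep;
}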

In the final step, the leaf nodes (subdomains) are distributed among the available processes. Here, different algorithms can be chosen. The goal of these algorithms is to ensure that the workload of every process is similar. In the following we briefly describe the algorithms used for this evaluation.

2.3 Load Balancing Algorithms

In this paper we compare several load balancing algorithms for their suitability in large scale parallel particle simulations. The first class of algorithms is based on space filling curves (SFCs). An SFC maps the 3D space onto a 1D curve Bader2012 (). This is used to construct a linear ordering of all subdomains. The load balancing itself then searches for cuts within this linear ordering to generate equal partitions. The partitioning must be computed with respect to the aggregated load of each partition, defined as the sum of the loads of all octree leaves in it. The partitions are then distributed among the processes. Morton Morton1966 () and Hilbert Campbell2003 () space filling curves are used in this article. A property of these SFCs is the conservation of spatial locality. This is especially important in communication sensitive applications, like rigid body dynamics, since it keeps the communication distances small. But this advantage comes at a cost. SFCs inherently need global information about all weights and subdomains. To maintain consistency across all processes, all processes have to agree on the same subdomain distribution. This can only be achieved by broadcasting all weights to all processes. As for a non-distributed domain partitioning, this will eventually lead to a memory problem, since the memory needed here grows like O(p) on every process.
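A minimal sketch of SFC based balancing (illustrative C++ with assumed names, not the waLBerla implementation): each subdomain is given a Morton key by interleaving the bits of its grid coordinates, the subdomains are sorted along the curve, and the curve is cut into pieces of roughly equal aggregated load. In the distributed setting every process would first have to gather all (key, weight) pairs, which is exactly the memory cost discussed above.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Interleave the bits of three 10-bit grid coordinates into a Morton key.
std::uint64_t mortonKey(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    auto spread = [](std::uint64_t v) {
        std::uint64_t r = 0;
        for (int i = 0; i < 10; ++i) r |= ((v >> i) & 1u) << (3 * i);
        return r;
    };
    return spread(x) | (spread(y) << 1) | (spread(z) << 2);
}

struct Subdomain { std::uint64_t key; double weight; };

// Returns the target process of every subdomain, indexed by its position along
// the curve, by cutting the curve at multiples of the average load.
std::vector<int> partitionAlongCurve(std::vector<Subdomain> subdomains, int numProcesses) {
    std::sort(subdomains.begin(), subdomains.end(),
              [](const Subdomain& a, const Subdomain& b) { return a.key < b.key; });
    double total = 0.0;
    for (const auto& s : subdomains) total += s.weight;
    const double target = total / numProcesses;

    std::vector<int> owner(subdomains.size());
    double accumulated = 0.0;
    for (std::size_t i = 0; i < subdomains.size(); ++i) {
        accumulated += subdomains[i].weight;
        owner[i] = std::min(numProcesses - 1, static_cast<int>(accumulated / target));
    }
    return owner;
}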

The second class of algorithms is based on load diffusion. Here, load balancing is achieved by an iterative algorithm which tries to smooth out load imbalances. The algorithm computes load gradients between processes and transfers subdomains from highly loaded processes to processes with less load Cybenko1989 (). In principle, the number of iterations has to be adjusted together with the number of subdomains to achieve a consistently good load balance. However, if the algorithm is run regularly, very large load imbalances should not occur and a fixed, low number of iterations should be sufficient. This class of algorithms has shown promising results in very large scale fluid dynamics simulations Schornbaum2017 (). The major benefit of this algorithm is its strict locality, which makes it a good candidate for extremely parallel simulations. However, due to this strict locality and since it is an iterative method, the quality of the load balance might be worse than for other algorithms. In particular, there is no guarantee that neighboring subdomains are kept together, which might significantly hurt performance in communication sensitive applications.
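A minimal sketch of one diffusion sweep (serial emulation in C++ for illustration; alpha, the container layout and the downhill-only rule are assumptions, not the waLBerla implementation): every process compares its load with its neighbors and pushes a fraction of the positive load difference towards the less loaded side. Repeating this a fixed, small number of times smooths imbalances using neighbor information only; in the real algorithm the computed flow is approximated by migrating whole subdomains.

#include <cstddef>
#include <vector>

// loads[i]     : current load of process i
// neighbors[i] : processes adjacent to process i in the subdomain graph
void diffusionIteration(std::vector<double>& loads,
                        const std::vector<std::vector<int>>& neighbors,
                        double alpha = 0.25) {
    std::vector<double> flowOut(loads.size(), 0.0);
    std::vector<double> flowIn(loads.size(), 0.0);
    for (std::size_t i = 0; i < loads.size(); ++i)
        for (int j : neighbors[i])
            if (loads[i] > loads[j]) {                       // push load downhill only
                const double f = alpha * (loads[i] - loads[j]);
                flowOut[i] += f;
                flowIn[static_cast<std::size_t>(j)] += f;
            }
    for (std::size_t i = 0; i < loads.size(); ++i)
        loads[i] += flowIn[i] - flowOut[i];
}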

The last class consists of graph based algorithms, as provided by the ParMetis Karypis1998a () graph partitioning framework. In particular we use the Kway algorithm, which is based on a multilevel k-way partitioning algorithm Karypis1997 (); Schloegel2000 (). The Geom_Kway algorithm uses a space filling curve to compute the initial partitioning and then applies the Kway algorithm. Finally, we also compare with Adaptive_Repart, a unified repartitioning algorithm that combines remapping and diffusion-based repartitioning schemes Schloegel2000a ().
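For the graph based algorithms, the subdomains and their weights have to be expressed as a distributed graph. The following hedged sketch (our own illustration; the types and names are assumptions and several ParMetis parameters, e.g. the vertex distribution array, are omitted) assembles the local part of such a graph in CSR form, with the computational weight as vertex weight and the communication weight as edge weight, which is the input format consumed by the ParMetis partitioning routines.

#include <cstddef>
#include <cstdint>
#include <vector>

using idx_t = std::int64_t;  // assumption: ParMetis configured with 64-bit indices

struct LocalSubdomain {
    idx_t computationalWeight;                 // e.g. number of particles
    std::vector<idx_t> neighborGlobalIds;      // globally numbered neighbor subdomains
    std::vector<idx_t> communicationWeights;   // e.g. interface area per neighbor
};

struct CsrGraph {
    std::vector<idx_t> xadj, adjncy, vwgt, adjwgt;  // names follow the ParMetis manual
};

CsrGraph buildCsrGraph(const std::vector<LocalSubdomain>& local) {
    CsrGraph g;
    g.xadj.push_back(0);
    for (const auto& s : local) {
        g.vwgt.push_back(s.computationalWeight);
        for (std::size_t n = 0; n < s.neighborGlobalIds.size(); ++n) {
            g.adjncy.push_back(s.neighborGlobalIds[n]);
            g.adjwgt.push_back(s.communicationWeights[n]);
        }
        g.xadj.push_back(static_cast<idx_t>(g.adjncy.size()));
    }
    return g;
}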

3 Performance Results

In this section we introduce the software framework and the computer used for the performance evaluation. We then describe the metrics used for our comparison as well as the simulation setup. Following this introduction, we present a detailed analysis of the performance data we gathered.

3.1 Framework and Supercomputer

All algorithms introduced in Sec. 2.3 are implemented in the waLBerla multiphysics framework, which is freely available at www.walberla.net and licensed under the GPLv3. Previous results obtained with this framework show very good scalability for rigid body dynamics simulations Preclik2015 (); Eibl2018 (). All tests are executed on the Juqueen supercomputer located at the Jülich Supercomputing Centre (JSC) and ranked 22nd in the TOP500 list² as of November 2017. This BlueGene/Q system has 28,672 compute nodes Gilge2014 (); Wautelet2014 (). Each node is equipped with a single IBM PowerPC A2 processor that has 18 cores clocked at 1.6 GHz, of which only 16 cores are available for computing. Each processor supports 4-way simultaneous multithreading (SMT). The available memory per node is 16 GB of RAM. The interconnect fabric is a 5D torus topology which features a bandwidth of 2 GB/s per link and direction Chen2012 (). This computer reached its end of lifetime in May 2018.

3.2 Metrics

Since the performance of a parallel code is always limited by the slowest process, our first metric targets exactly this process. We define the max load per process by

L_max = max_i L_i,

with L_i being the accumulated load of all subdomains located on process i. As the load is used as an input parameter for all load balancing algorithms, this quantity should be close to the average load if the load balancing algorithm is working correctly.
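A minimal sketch of how this metric can be gathered at runtime (illustrative C++/MPI, assuming the local load has already been accumulated on each process):

#include <mpi.h>

// Reduce the per-process accumulated load to the global maximum (the metric above).
double maxLoadPerProcess(double localLoad, MPI_Comm comm = MPI_COMM_WORLD) {
    double maxLoad = 0.0;
    MPI_Allreduce(&localLoad, &maxLoad, 1, MPI_DOUBLE, MPI_MAX, comm);
    return maxLoad;
}

// Average load across all processes, for comparison with the maximum.
double averageLoadPerProcess(double localLoad, MPI_Comm comm = MPI_COMM_WORLD) {
    double sum = 0.0;
    int numProcesses = 1;
    MPI_Allreduce(&localLoad, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    MPI_Comm_size(comm, &numProcesses);
    return sum / numProcesses;
}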

With the second metric we want to measure the performance gained by using load balancing. The performance gain is defined by

G = t_before / t_after,

with t_before and t_after being the time needed to complete one time step before and after load balancing, respectively. To reduce the sensitivity to fluctuations, we always average over 100 time steps. It is also important to make sure that the setup does not change drastically within the measurement period, to exclude unwanted variability in this metric. We take care of this with a specifically designed setup which is described in Sec. 3.3.
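A minimal sketch of the corresponding measurement (illustrative C++; the step function is a placeholder for advancing the simulation by one time step):

#include <chrono>
#include <functional>

// Average wall-clock time per time step over n steps.
double averageStepTime(const std::function<void()>& step, int n = 100) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) step();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / n;
}

// Performance gain as defined above: G = t_before / t_after.
double performanceGain(double tBefore, double tAfter) {
    return tBefore / tAfter;
}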

Finally, we also examine the time needed for the actual load balancing pipeline. This includes the time needed to refine/coarsen the domain, the balancing process itself, and the migration of all subdomains. For the load balancing to be usable, this time should be on the order of at most a few regular time steps. Otherwise the load balancing will dominate the overall runtime of the simulation.

3.3 Simulation Setup

Fig. 1: 2d cross section along the plane of the simulation setup. This setup is used to measure the performance of different load balancing algorithms. The simulation domain is confined by solid walls. The gravity is acting towards the lower left edge. All particles are arranged along that edge in a stable hexagonal close packing lattice. All particles are in contact with their neighbors. This assures that the setup will not change when the simulation is integrated in time. The simulation setup extends uniformly in the direction.

As the benchmark scenario, a particle configuration is chosen that does not move in time. This may seem to contradict the need for dynamic load balancing. However, we choose this setup deliberately to obtain a quantitative evaluation of the performance of the load balancing algorithms. It allows us to compare the runtimes before and after load balancing without the influence of a specific particle configuration. The setup itself is kept simple to enable us to give an a-priori estimate of the maximum achievable performance gain. With this setup we deliberately eliminate many factors which would be difficult to control and which could otherwise bias our measurements.

The general simulation domain is a box confined by solid walls. The box is filled to a certain height with a large number of spherical particles. Since the gravity points from the center of the box towards one edge, parallel to the axis, we start the filling process at this edge. The spherical particles are arranged in a hexagonal close packing (hcp) lattice. All particles are in contact with their neighboring particles. This makes sure that this particle configuration is already optimal and therefore the particles stay at rest. In this configuration the arrangement of particles and domain partitions is uniform in the -direction so that we deal essentially with a two dimensional situation. The simulation size can be adapted by growing the domain along the axis without changing the properties of the setup. A visualization of the setup with a half filled box is shown in Fig. 1. For the time integration of the system a non-smooth granular dynamics algorithm is used Preclik2015 ().
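As an illustration of this initial condition, the following sketch (our own C++ illustration, not the waLBerla setup code) generates sphere centers on a hexagonal close packing lattice of spheres with radius r, using the standard closed-form hcp coordinates, so that every particle touches its neighbors.

#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Standard hcp lattice coordinates for spheres of radius r: nearest neighbors
// are at distance 2r, i.e. all particles are in contact.
std::vector<Vec3> hcpLattice(int nx, int ny, int nz, double r) {
    std::vector<Vec3> centers;
    centers.reserve(static_cast<std::size_t>(nx) * ny * nz);
    for (int k = 0; k < nz; ++k)
        for (int j = 0; j < ny; ++j)
            for (int i = 0; i < nx; ++i)
                centers.push_back({
                    r * (2.0 * i + ((j + k) % 2)),
                    r * std::sqrt(3.0) * (j + (k % 2) / 3.0),
                    r * 2.0 * std::sqrt(6.0) / 3.0 * k });
    return centers;
}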

(a) before
(b) after
Fig. 2: Domain partitioning before and after the load balancing pipeline. Again a 2d cross section along the plane of the simulation setup is shown. The simulation domain is partially filled with spherical particles. The particles are located at the lower left edge, as indicated by the lighter shading. All subdomains are colored to represent the process they belong to. The initial partitioning uses 128 processes and 128 subdomains such that a 1:1 mapping is possible. The grid partitioning of the simulation domain is . Before the load balancing, every octree is refined exactly once (see subfigure a)). During the refinement/coarsening step of the load balancing pipeline, empty subdomains in the top right corner get merged and subdomains in the lower left get split. Note also the medium sized subdomains in the middle, which are needed to fulfill the refinement constraint: neighboring subdomains are allowed to differ by at most one level of refinement. For subfigure b), a Hilbert space filling curve is used to calculate the distribution of the subdomains to the processes.

This scenario permits no efficient domain partitioning if a balanced octree is used. This can be seen in Fig. 2a). The initial domain partitioning has exactly as many subdomains as processes available. This suboptimal initial configuration is chosen deliberately. In real world applications, chances are that the user cannot choose a perfect domain partitioning for the setup, due to restrictions by the framework or simply because the perfect partitioning is not known. It is also possible, and becomes more likely as the simulation proceeds, that the particle configuration changes such that no static domain partitioning is appropriate. Therefore, the load balancing pipeline must find the best possible partitioning at runtime, regardless of the initial user input. In our experiment, the initial setup is advanced by 100 time steps and the runtime is measured. This measurement acts as a baseline to judge the speedup that can be achieved by the load balancing pipeline.

The first step of the load balancing pipeline is the weight assignment for each subdomain. This step is independent of the load balancing algorithm. Since the particles are arranged on an hcp lattice, we assume a contact number of 12 and a direct relation between the number of particles and the contacts that have to be resolved. Therefore we use the number of particles per subdomain as its computational weight. For the ParMetis based algorithms we use the area of the interface between two adjacent subdomains as an estimate for the communication weight. The next step is also independent of the load balancing algorithm used: the octree gets refined/coarsened to achieve a better granularity for the subsequent load balancing step. The refined domain partitioning is shown in Fig. 2b). A simple method based on thresholds on the number of particles per subdomain is used to decide whether a refinement must be performed. After the refinement, the load balancing algorithm is executed. Due to the special considerations taken into account while designing the test scenario, the diffusive load balancing is guaranteed to converge in all simulations within a fixed number of iterations. We therefore keep the number of iterations constant throughout all simulations. With the new domain partitioning, the simulation is then advanced again for 100 time steps while taking time measurements.

To assess the suitability for highly parallel applications, we conduct a weak scaling Hager2010 () study, i.e. we increase the computational effort by the same factor by which we increase the computational power. Therefore we extend the original simulation domain in the direction and at the same time we increase the number of cores used. The domain partition in the plane has cubic subdomains. Results for two scenarios with a different initial imbalance are shown in Sec. 3.4 and Sec. 3.5.

3.4 Medium Size Problem

(a) maximum load per process
(b) performance gain
Fig. 3: On the left side the quality of the load balancing pipeline is shown. The right side shows a comparison of three different load balancing algorithms. The performance is given relative to the performance before the load balancing. All data is evaluated for different simulation sizes.

The first evaluation uses a simulation box, one eighth of which is filled with particles. This means that at the start only one eighth of the processes are associated with subdomains that contain particles. Empty subdomains impose almost no workload, so essentially all work is done by one eighth of the processes. The size of the particles is chosen such that almost particles are located in each subdomain when it is completely filled. To cope with the limited amount of memory available per node, the SMT feature is not used, i.e. only one process is spawned per core.

Every time step of the simulation consists of computational work and communication. In theory, by using all available processes efficiently, the maximum load per process can be reduced by a factor of 8, which should lead to a performance gain of 8. However, due to the refinement process, the total communication weight increases by a factor of 2 (eight times as many subdomains, each with a quarter of the original surface area). Together with an increase by a factor of 8 in the available network resources, this should lead to a performance gain of 4 for the communication part. Since the time spent in communication also depends on the mapping of the processes onto the machine, the actual performance can vary in both directions. However, a good load balancing algorithm that respects spatial locality can optimize the mapping by placing adjacent subdomains on neighboring ranks. The overall performance gain depends on the relation between the time spent in computation and communication, which is not clearly identifiable since communication is interleaved with computation in our implementation.
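The communication estimate above can be condensed into one line (restating the factors from the text, in LaTeX notation):

% Refinement yields 8x more subdomains with 1/4 of the original surface area,
% so the communication volume doubles, while the network resources grow by 8.
\[
  \text{volume factor} = 8 \cdot \tfrac{1}{4} = 2, \qquad
  \text{gain}_{\mathrm{comm}} = \frac{8}{2} = 4 .
\]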

We evaluate the performance of the Hilbert space filling curve, the diffusive algorithm, and the Geom_Kway algorithm of the ParMetis library for this scenario. We start by analyzing the load distribution after the load balancing. Note that in an ideal case with perfect load distribution, the number of particles per process would be approximately one eighth of the particles of a completely filled subdomain. Note also the granularity of the load balancing: all subdomains which are completely filled with particles are refined into 8 smaller subdomains. Since only whole subdomains can be transferred between processes, this limits the accuracy of the load balancing. Reaching the optimal load balance is almost impossible, as one misplaced block changes the load of a process by roughly 100%. With this in mind, all tested load balancing algorithms achieve a very good load balance for all setup sizes, overloading the slowest process by only one subdomain. Fig. 3a) displays the maximum load per process after load balancing. We also do not observe differences in the load balancing quality achieved by the different algorithms. With this information we can refine our estimate of the performance gain of the computational part: the load of the most heavily loaded processes is reduced by a factor of 4 instead of the ideal factor of 8. This performance gain is now equivalent to the performance gain of the communication. We therefore assume that the total performance gain is 4.

The performance gain is displayed in Fig. 3b). The best performing algorithm is the Hilbert space filling curve, which reaches a performance gain of almost 7 for processes. With increasing domain size the performance gain of all algorithms worsens and converges to 4, as expected. One possible reason for the deviating performance in the beginning is a fundamental change in the simulation setup: for higher process numbers a larger simulation domain is needed, which is achieved by growing the simulation domain along the axis and thereby introducing more subdomains in that direction. The domain partitioning therefore changes from an essentially 2D setup to a 3D configuration which requires inter-node communication in that direction as well. Even though the scenario was chosen carefully, there are still additional factors which might influence the overall performance. The initial refinement step introduces more subdomains, which will lead to longer communication paths if they are assigned to distant processes. Also, the internal data structures might show different performance characteristics for a different number of particles. To pinpoint the exact source of the deviating performance, an even simpler setup together with more detailed time measurements should be used. This will be part of future work.

3.5 Large Size Problem

(a) maximum number of particles per process
(b) relative performance after load balancing
Fig. 4: On the left side the maximum load per process is shown. The right side shows a comparison of six different load balancing algorithms. Data points are collected for different simulation sizes.

For the second benchmark we use a half filled simulation box and fewer particles per subdomain. The number of particles for a completely filled subdomain is approximately in this setup. The smaller number of particles allows us to use four processes per core on the Juqueen supercomputer (using the full SMT capabilities), since less memory is required per subdomain. From a performance point of view this is the optimal configuration for the rigid body dynamics simulation Eibl2017 (). Furthermore, this allows us to study the scalability up to one million processes.

In this scenario, a maximal performance gain of 2 can theoretically be achieved. We compare Hilbert and Morton space filling curves, the diffusive algorithm, as well as Geom_Kway, Kway, and Adaptive_Repart from the ParMetis library.

Fig. 4a) shows again the maximum load per process. The average workload per process stays the same in this scenario. However, due to the finer granularity of the subdomains relative to the per-process load, the balancing quality is closer to the optimum. As can be seen in Fig. 4a), all algorithms achieve the same balancing quality, except Adaptive_Repart. This algorithm has problems getting close to the optimum when the number of processes increases. The assumption on the maximum performance gain of the computational part can now be refined for this scenario as well: the slowest process holds five refined subdomains instead of the ideal four, which reduces the computational gain from 2 to 8/5 = 1.6. The communication, however, will not be improved at all (twice the amount of communication and twice the amount of resources).

The performance gain is shown in Fig. 4b). All algorithms were run in all scenarios; missing data points indicate that the simulation ran out of memory for that configuration. The same general behaviour as for the medium size problem can be observed. In the beginning, the performance gain is better than expected. It then converges towards the predicted value for the best performing algorithms. This indicates that the computational workload is dominating in this configuration. For small problem sizes all algorithms achieve a relatively good performance gain, with the SFCs yielding the best results. For large scenarios the SFCs reach a performance gain slightly above the predicted value. The Geom_Kway and Kway algorithms perform slightly worse than the SFCs and they can only be used for up to processes. The diffusive algorithm keeps losing performance up to processes. Beyond this, the performance stabilizes at a lower level. The diffusive algorithm is the only algorithm that can be used for all problem sizes. The Adaptive_Repart algorithm performs worst of all, yielding only a small gain at its maximal number of processes.

The general performance of the SFCs is comparable to the previous evaluation (see Sec. 3.4), and the same reasons apply here. Since the Geom_Kway algorithm is partly based on SFCs, a similar performance can be expected. Unfortunately, no improved performance compared to the plain SFCs can be observed with this more complicated algorithm. The plain Kway algorithm produces similar results to the Geom_Kway algorithm. The other ParMetis algorithm, Adaptive_Repart, shows poor performance even for moderate process numbers. All ParMetis based algorithms run out of memory quickly. The reason for this lies in the implementation of the ParMetis library; an analysis of the implementation to pinpoint the actual cause is outside the scope of this article. The SFCs implemented natively in the framework run out of memory at a later point, since special attention was paid to the memory requirements during the implementation. Nevertheless, all space filling curves run out of memory at a certain process number since they rely on an allgather operation that requires O(p) memory per process, as explained in Sec. 2.3. This limits the applicability of SFCs for extreme scale parallel simulations. A better implementation might reduce the memory requirement by a constant factor, but the asymptotic complexity will remain quadratic. The diffusive algorithm has no problems with memory requirements since it is strictly local (it only stores information about, and communicates with, its next neighbors). However, the performance gain is worse compared to the SFCs. One likely reason for this is the fact that the diffusive algorithm does not maintain spatial locality. The diffusive process will often separate adjacent subdomains onto distant processors, which requires communication over longer distances. Especially in Juqueen's torus topology this will reduce the performance significantly when the data must be sent across multiple nodes.

Fig. 5: Runtime of various load balancing algorithms in a loglog plot. A weak scaling experiment shows the evolution of the runtime as more and more processes are used.

Finally, we compare the runtime of the load balancing algorithms. The results are displayed in Fig. 5. Geom_Kway and the native Kway from the ParMetis library show quadratic runtime complexity. Although this is not significant for small process numbers, it will become a problem when progressing to extreme scale. The third algorithm of the ParMetis library, Adaptive_Repart, shows linear runtime complexity but runs out of memory early. For the SFCs, a linear runtime complexity is expected since every process must compute the partitioning for all p processes. The runtime of the allgather operation can be neglected as it can be implemented with logarithmic complexity Bruck1997 (). The diffusive algorithm exhibits a linear increase for small process numbers, but asymptotically reaches a constant runtime complexity for large process numbers. This is also expected since it communicates only with next neighbors; in the case of the octree structure used in our implementation, every process has only a constant number of communication partners. The runtime of the complete load balancing pipeline for all load balancing algorithms ranges between and . Compared to a typical time step duration before load balancing, the runtime of the load balancing pipeline is not negligible, so the load balancing should only be rerun when necessary.

4 Conclusion

In this paper we have evaluated what performance gain can be expected by using different load balancing algorithms in rigid particle dynamics simulations. We have used a carefully selected simulation setup that allowed us to give a-priori estimates on the expected performance gain. Different load balancing algorithms, namely graph based algorithms from the ParMetis library, a diffusion based algorithm, and space filling curves, have been evaluated and compared to our predictions. We have also studied the resources required by these algorithms with respect to memory and runtime. All these experiments were executed with different problem sizes to evaluate the suitability of these algorithms for extreme scale parallel environments.

Only the diffusive algorithm was usable up to the maximum number of processes used in this evaluation. All other algorithms exhausted the available memory at a smaller number of processes. Also with respect to runtime, the diffusive algorithm was the only one showing a constant runtime complexity. With respect to this criterion, the SFCs performed second best with a linear runtime complexity, while the Kway algorithm with its quadratic complexity came in last.

However, the performance gain achieved by the diffusive algorithm is worse than that of the SFC based algorithms. This leads to the conclusion that for small to medium size parallel applications, SFCs are currently the best choice. At very large scale, currently only the diffusive algorithm is applicable; however, its performance is suboptimal. It is also clearly shown that the maximal performance gain is limited by the forest-of-octrees partitioning, regardless of the balancing algorithm used. Since a perfectly equal load distribution is highly unlikely to be reached due to the granularity of the load balancing, the maximal performance gain will always be less than the theoretical optimum.

The measurements shown in this paper also indicate that there are other factors influencing the load balancing which were not considered here. Future investigations should pinpoint the exact cause for performance deviations of all algorithms to identify areas for improvement. The diffusive algorithm is already a promising candidate for load balancing in extremely parallel environments. However, the performance gain by diffusive balancing must be improved to become competitive with SFCs for medium size problems.

Acknowledgment

The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUQUEEN at Jülich Supercomputing Centre (JSC).

The authors would like to acknowledge the support through the Cluster of Excellence Engineering of Advanced Materials (EAM).

Declaration of interest: none

Footnotes

  1. journal: Computer Physics Communications
  2. https://www.top500.org/

References

  1. P. A. Cundall, O. D. L. Strack, A discrete numerical model for granular assemblies, Géotechnique 29 (1) (1979) 47–65. doi:10.1680/geot.1979.29.1.47.
  2. D. E. Stewart, Rigid-body dynamics with friction and impact, SIAM Review 42 (1) (2000) 3–39. doi:10.1137/s0036144599360110.
  3. T. Preclik, U. Rüde, Ultrascale simulations of non-smooth granular dynamics, Computational Particle Mechanics 2 (2) (2015) 173–196. doi:10.1007/s40571-015-0047-6.
  4. T. Preclik, S. Eibl, U. Rüde, The maximum dissipation principle in rigid-body dynamics with inelastic impacts, Computational Mechanics (2017) 1–16. doi:10.1007/s00466-017-1486-0.
  5. S. Eibl, U. Rüde, A local parallel communication algorithm for polydisperse rigid body dynamics, submitted to Parallel Computing. arXiv:1802.02765.
  6. D. M. Beazley, P. S. Lomdahl, N. Gronbech-Jensen, R. Giles, P. Tamayo, Parallel algorithms for short-range molecular dynamics, Annual reviews of computational physics 3 (1995) 119–175. doi:10.1142/9789812830647_0004.
  7. S. Plimpton, Fast parallel algorithms for short-range molecular dynamics, Journal of Computational Physics 117 (1) (1995) 1–19. doi:10.1006/jcph.1995.1039.
  8. B. Hendrickson, K. Devine, Dynamic load balancing in computational mechanics, Computer methods in applied mechanics and engineering 184 (2) (2000) 485–500.
  9. D. Owen, Y. Feng, K. Han, D. Peric, Dynamic domain decomposition and load balancing in parallel simulation of finite/discrete elements, in: ECCOMAS 2000, Barcelona Spain, 2000.
  10. G. Karypis, V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM Journal on scientific Computing 20 (1) (1998) 359–392. doi:10.1137/s1064827595287997.
  11. Y. Deng, R. F. Peierls, C. Rivera, An adaptive load balancing method for parallel molecular dynamics simulations, Journal of Computational Physics 161 (1) (2000) 250–263. doi:10.1006/jcph.2000.6501.
  12. C. Begau, G. Sutmann, Adaptive dynamic load-balancing with irregular domain decomposition for particle simulations, Computer Physics Communications 190 (2015) 51–61. doi:10.1016/j.cpc.2015.01.009.
  13. M. S. Warren, J. K. Salmon, A parallel hashed oct-tree n-body algorithm, in: Proceedings of the 1993 ACM/IEEE conference on Supercomputing - Supercomputing '93, ACM, ACM Press, 1993, pp. 12–21. doi:10.1145/169627.169640.
  14. M. J. Berger, S. H. Bokhari, A partitioning strategy for nonuniform problems on multiprocessors, IEEE Transactions on Computers C-36 (5) (1987) 570–580. doi:10.1109/tc.1987.1676942.
  15. F. Fleissner, P. Eberhard, Load balanced parallel simulation of particle-fluid dem-sph systems with moving boundaries, Parallel Computing: Architectures, Algorithms and Applications 38 (2007) 37–44.
  16. F. Fleissner, P. Eberhard, Parallel load-balanced simulation for short-range interaction particle methods with hierarchical particle grouping based on orthogonal recursive bisection, International Journal for Numerical Methods in Engineering 74 (4) (2008) 531–553. doi:10.1002/nme.2184.
  17. Z. Shojaaee, M. R. Shaebani, L. Brendel, J. Török, D. E. Wolf, An adaptive hierarchical domain decomposition method for parallel contact dynamics simulations of granular materials, Journal of Computational Physics 231 (2) (2012) 612–628. doi:10.1016/j.jcp.2011.09.024.
  18. J.-L. Fattebert, D. F. Richards, J. N. Glosli, Dynamic load balancing algorithm for molecular dynamics based on voronoi cells domain decompositions, Computer Physics Communications 183 (12) (2012) 2608–2615. doi:10.1016/j.cpc.2012.07.013.
  19. A. Heinecke, W. Eckhardt, M. Horsch, H.-J. Bungartz, Supercomputing for Molecular Dynamics Simulations: Handling Multi-Trillion Particles in Nanofluidics, Springer, 2015.
  20. C. Niethammer, S. Becker, M. Bernreuther, M. Buchholz, W. Eckhardt, A. Heinecke, S. Werth, H.-J. Bungartz, C. W. Glass, H. Hasse, et al., ls1 mardyn: The massively parallel molecular dynamics code for large systems, Journal of Chemical Theory and Computation 10 (10) (2014) 4455–4464. doi:10.1021/ct500169q.
  21. B. Hess, C. Kutzner, D. Van Der Spoel, E. Lindahl, GROMACS 4: algorithms for highly efficient, load-balanced, and scalable molecular simulation, Journal of Chemical Theory and Computation 4 (3) (2008) 435–447. doi:10.1021/ct700301q.
  22. R. Berger, C. Kloss, A. Kohlmeyer, S. Pirker, Hybrid parallelization of the LIGGGHTS open-source DEM code, Powder Technology 278 (2015) 234–247. doi:10.1016/j.powtec.2015.03.019.
  23. D. T. Cintra, R. B. Willmersdorf, P. R. M. Lyra, W. W. M. Lira, A hybrid parallel DEM approach with workload balancing based on HSFC, Engineering Computations 33 (8) (2016) 2264–2287. doi:10.1108/EC-01-2016-0019.
  24. D. T. Cintra, R. B. Willmersdorf, P. R. M. Lyra, W. W. M. Lira, A parallel DEM approach with memory access optimization using HSFC, Engineering Computations 33 (8) (2016) 2463–2488. doi:10.1108/EC-07-2015-0203.
  25. D. Markauskas, A. Kačeniauskas, The comparison of two domain repartitioning methods used for parallel discrete element computations of the hopper discharge, Advances in Engineering Software 84 (2015) 68–76. doi:10.1016/j.advengsoft.2014.12.002.
  26. E. G. Boman, Ü. V. Çatalyürek, C. Chevalier, K. D. Devine, The zoltan and isorropia parallel toolkits for combinatorial scientific computing: Partitioning, ordering and coloring, Scientific Programming 20 (2) (2012) 129–150. doi:10.1155/2012/713587.
  27. F. Schornbaum, U. Rüde, Massively parallel algorithms for the lattice boltzmann method on NonUniform grids, SIAM Journal on Scientific Computing 38 (2) (2016) C96–C126. doi:10.1137/15M1035240.
  28. F. Schornbaum, U. Rüde, Extreme-scale block-structured adaptive mesh refinement, SIAM Journal on Scientific Computing 40 (3) (2018) C358–C387. doi:10.1137/17m1128411.
  29. M. Bader, Space-filling curves: an introduction with applications in scientific computing, Vol. 9, Springer Science & Business Media, 2012.
  30. G. M. Morton, A computer oriented geodetic data base and a new technique in file sequencing, IBM Ltd., Ottawa, Canada, 1966.
  31. P. M. Campbell, K. D. Devine, J. E. Flaherty, L. G. Gervasio, J. D. Teresco, Dynamic octree load balancing using space-filling curves, Williams College Department of Computer Science, Tech. Rep. CS-03-01.
  32. G. Cybenko, Dynamic load balancing for distributed memory multiprocessors, Journal of Parallel and Distributed Computing 7 (2) (1989) 279–301. doi:10.1016/0743-7315(89)90021-X.
  33. G. Karypis, A coarse-grain parallel multilevel k-way partitioning algorithm, in: Proc. of 8th SIAM Conf. on Parallel Processing for Sci. Comp., 1997.
  34. K. Schloegel, G. Karypis, V. Kumar, Parallel multilevel algorithms for multi-constraint graph partitioning, in: Euro-Par 2000 Parallel Processing, Springer, Springer Berlin Heidelberg, 2000, pp. 296–310. doi:10.1007/3-540-44520-x_39.
  35. K. Schloegel, G. Karypis, V. Kumar, A unified algorithm for load-balancing adaptive scientific simulations, in: ACM/IEEE SC 2000 Conference (SC'00), IEEE, IEEE, 2000, pp. 59–59. doi:10.1109/sc.2000.10035.
  36. M. Gilge, et al., IBM System Blue Gene Solution Blue Gene/Q Application Development, IBM Redbooks, 2014.
  37. P. Wautelet, M. Boiarciuc, J. Dupays, S. Giuliani, M. Guarrasi, G. Muscianisi, M. Cytowski, Best Practice Guide – Blue Gene/Q, v1.1.1 Edition (2014).
  38. D. Chen, N. Eisley, P. Heidelberger, R. Senger, Y. Sugawara, S. Kumar, V. Salapura, D. Satterfield, B. Steinmacher-Burow, J. Parker, The IBM blue gene/q interconnection fabric, IEEE Micro 32 (1) (2012) 32–43. doi:10.1109/mm.2011.96.
  39. G. Hager, G. Wellein, Introduction to high performance computing for scientists and engineers, CRC Press, 2010. doi:10.1201/ebk1439811924.
  40. S. Eibl, T. Preclik, U. Rüde, JUQUEEN Extreme Scaling Workshop 2017, no. FZJ-JSC-IB-2017-01 in JSC Internal Report, 2017, p. 47.
  41. J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, D. Weathersby, Efficient algorithms for all-to-all communications in multiport message-passing systems, IEEE Transactions on Parallel and Distributed Systems 8 (11) (1997) 1143–1156. doi:10.1109/71.642949.