Improving Locality of Unstructured Mesh Algorithms on GPUs

Improving Locality of Unstructured Mesh Algorithms on GPUs


[Summary] To most efficiently utilize modern parallel architectures, the memory access patterns of algorithms must make heavy use of the cache architecture: successively accessed data must be close in memory (spatial locality) and one piece of data must be reused as many times as possible (temporal locality).

In this work we analyse the performance of unstructured mesh algorithms on GPUs, specifically the use of the shared memory and two-layered colouring to cache the data. We also look at different block layouts to analyse the trade-off between data reuse and the amount of synchronisation.

We developed a standalone library that can transparently reorder the operations done and data accessed by a kernel, without modifications to the algorithm by the user. Using this, we performed measurements on relevant scientific kernels from different applications, such as Airfoil, Volna, Bookleaf, Lulesh and miniAero; using Nvidia Pascal and Volta GPUs. We observed significant speedups ().

finite volume, finite element, race condition, GPU

Research Article\newcolumntypeL[1]>\arraybackslashm#1 \newcolumntypeR[1]>\arraybackslashm#1 1]András Attila Sulyok* 1]Gábor Dániel Balogh 1]István Zoltán Reguly 2]Gihan R Mudalige \corres*Attila András Sulyok,

1 Introduction

The computational needs of scientific computations are increasing much faster than the clock frequency of modern CPUs. To execute these codes with sufficient performance the parallelisation of workloads could be the solution. In the past ten years since the first release of the CUDA language extension for C/C++ the GPUs became widely used for high performance and scientific computations. CUDA provides a low level abstraction commonly referred to as Single Instruction Multiple Threads (SIMT), which gives us a fine grained control over GPU architectures.
A significant class of scientific applications operate on unstructured meshes, which are represented as sets and explicit connections between them. These applications are operating on large sets and usually execute similar computations for every set element. For processing the elements, they can access data on the set which they operate on, or using indirections data defined on other sets. In the latter case multiple threads may try to modify the same data, leading to race conditions. This algorithmic pattern is the focus of our work: operations on sets that read and most importantly increment data indirectly on other sets. These operations are common in discretised PDE solver, such as Finite Volumes and Finite Elements. The adoption of GPUs for these kind of computations could lead to considerable speedups due to the highly parallel execution of the code. However, the straightforward parallelisation of the scientific algorithms may not lead to optimal performance on GPUs. The SIMT abstraction exposes a number of techniques that can improve performance. One of these are the explicit management of the memory system of the GPU.
The proper usage of the memory system for simulations is inevitable in order to get good performance. The latency of global memory accesses is a bottleneck for a lot of kernels in modern scientific computations. To improve performance we have to lower the impact of the memory transactions, since we can’t fully hide the latency with computations, we can either increase the number of memory transactions in flight or decrease the number of memory transactions with high latency. To achieve the latter goal we can use shared memory for blocks as an explicitly managed cache, which has much lower latency than global memory. However, to get good performance we have to ensure that we can use these memories efficiently from all threads in the blocks.

Most codes do one of two things: they use a straightforward parallelisation relying on either colouring or large temporary datasets to avoid two threads writing the same data that run simultaneously, but these end up with poor data locality for most of the cases. The only advantage of this approach is that the parallelisation is simple, which means that it has lower overhead at the beginning to plan the execution of the kernel. However, if we spend more time for planning the execution strategy for the kernel, for example by altering the order in which the elements are processed we can significantly improve data locality. Better memory locality means that we can use cache lines more efficiently (with one read transaction we can read multiple pieces of data that we can use for the simulation) so that we end up with fewer memory transactions, which means less time spent waiting for data.
Reordering and partitioning algorithms are already used for maximising data reuse in CPUs and minimising communication in distributed memory systems. In our work we use them to improve the data locality inside CUDA blocks, to get better performance. To get an idea how the best performance could be achieved we make the following contributions:

  1. We adopt a caching mechanism on the GPU that loads indirectly accessed elements into shared memory. We use hierarchical colouring to avoid data races.

  2. We design a reordering algorithm based on partitioning that increases data reuse within a thread block, also increasing shared memory utilisation.

  3. We implement a library that parallelises applications on unstructured meshes and is capable of reordering the threads to increase efficiency.

  4. We analyse the performance of various parallelisation and data layout approaches on several representative applications.

The rest of the paper is structured as follows: the rest of Section 1 introduces the basic concepts of unstructured meshes and the CUDA architecture, and also discusses some related works, Section 2 describes the used algorithms and the motivation behind them, Section 3 shows the structure of our library, then Section 4 shows the test-cases and the measurements we performed. Finally, in Section 5, we draw some conclusions from the measurements.

1.1 Background

Unstructured meshes

In the scientific and high performance communities concurrent operations on different meshes are commonly used; meshes can be used as discretisations to numerically solve partial differential equations (PDEs). Usually these simulations require millions of elements to get correct results. Structurally, we can differentiate between structured and unstructured meshes.
Structured meshes are defined by blocks. Inside the blocks we have elements and implicit connectivity between them. The blocks are organized in a structured manner meaning that for every element in a block we can get all the neighboring elements simply from indexing arithmetic without any further information. Structured applications consist of loops over blocks which access datasets defined on blocks and the access patterns required for an element is described by stencils. A part of a block and a stencil describing access shown in Figure 1. Most operations on structured grids can be easily parallelised even on GPUs since all nodes can be processed simultaneously.

Figure 1: Structured mesh and stencil describing access pattern

Unstructured meshes are defined by sets (e.g. nodes, edges, faces…), data defined on the sets, and explicit connectivity information between sets (mapping tables). For determining the neighbours of a set element the index isn’t enough, we need external information. Thus there is an overhead associated with the reading of the mapping tables to determine the neighboring elements, but we gain significant performance when we need finer resolution for interesting parts of the mesh while a rougher resolution is sufficient for the most of the mesh (e.g. in the case of PDE-s where the gradient changes slowly). In such a case with structured meshes we need to compute unnecessary points otherwise we wouldn’t get correct results in areas where detail is necessary.

For unstructured meshes the computations on the mesh are given as operations on sets. Basically a loop over the elements of a set, executing the same operations for every element, while accessing data directly on the iteration set or indirectly through a mapping. An example for a mesh and a mapping from edges to cells are shown in Figure 2. For kernels that only access data directly or only read data indirectly but write the data on the iteration set, the whole iteration space could run simultaneously. However, for kernels which indirectly increment data, we need to ensure that there are no race conditions during the execution.

The parallelisation of such unstructured mesh operations is much harder than for structured mesh codes: where the elements are writing data indirectly we can not tell which elements could be computed at the same time from compile time information, as it is driven by the mapping table. One of the most commonly used approaches is to use a runtime colouring of elements to parallelise the computation.

Figure 2: Unstructured mesh, the arrow represents the mapping tells is connected to and .

In our case the sets are represented as consecutive indexes from zero to the size of the set. The mapping between two sets are represented as a mapping table, an array which stores the index of set elements in the second set of the mapping (referred as to-set) for every set element of the first set (from-set). For the example in Figure 2 a part of the mapping from edges to cells is shown in Figure 3. In this way we can access the index of those elements which are connected to the current element of the from-set from other sets (the to-sets of the mappings).

Figure 3: A part of the mapping from edges to cells.

In our case we used some restrictions for the applications, but out work is general enough that we are able to represent most applications in a way that doesn’t violate our restrictions. The first is that there is no data access through multiple mappings. This means every data that we access is either accessed directly with the index of the set we iterate on or it is accessed with an index from one of to-sets of the accessible mappings get with using the index of the current element. This restriction does not exclude applications using nested indirections, since we can create a mapping that contains the indexes that we access through multiple mapping. The second restriction is that the result of the operations on the sets are independent from the order of processing the elements of the sets. This restriction is necessary to ensure that we can parallelise the operations, because there is no guarantee for execution order in the parallelised versions.

1.2 The CUDA Architecture and Programming Model

GPUs were originally purpose-built processors, targeted solely at graphics processing, but gradually these pipelines became more and more flexible. In 2007, NVIDIA released CUDA (Compute Unified Device Architecture) which for the first time allowed general-purpose programming on GPUs - CUDA is both a hardware architecture and an extension of the C/C++ language, and the two are closely related.

NVIDIA GPUs are sold as separate accelerator cards that can be connected to a CPU system via the PCI-Express bus. As such, it has its own memory, and processing unit, and is driven by the CPU, which can explicitly transfer data between its own memory and GPU memory, and can launch GPU “kernels”. GPU kernels are tasks issued by the CPU and executed entirely on the GPU, mapping work to the processing units. The processing unit part of the device has a hierarchical design: arithmetic-logic units (ALUs) are organised into scalar multiprocessors (64-192 cores), which have their own instruction counters, registers and locally shared memory, and work independently. The ALUs, also called CUDA cores, are organised much like vector units on CPUs: in groups of 32 (a warp), they always execute the same instructions, allowing for certain lanes to be masked out to support branching in code.

The CUDA programming model, a Single Instruction Multiple Thread (SIMT) model, follows the architectural features closely: a kernel launch consists of a number of threads, each with its own unique index, all executing the same code but on potentially different pieces of data. Threads are grouped into thread blocks, within which threads can synchronise and communicate via shared memory or register exchange instructions. Thread blocks share the same view of “global memory”, which is the dedicated GPU memory. Groups of 32 consecutive threads are assigned to warps, which execute the same instruction in lockstep - therefore if any branching occurs in code, certain threads in the warp will have to be deactivated, resulting in a loss of parallelism. Similarly, memory load and store instructions are grouped so they can be served by cache lines - if the memory addresses for different threads in the warp lie on some of the same cache lines, these requests can be served by fewer transactions. Cache lines loaded are temporarily stored in caches, but these chips have very small L1 caches, therefore they cannot be solely relied upon for mitigating the cost of bad memory access patterns.

Accesses to shared memory have similar concerns albeit for different reasons: shared memory is organised into banks, and when threads in the warp access different addresses that fall into the same bank, that requires multiple transactions - these are called shared memory bank conflicts.

1.3 Related work

Algorithms defined on unstructured meshes have been around for a long time - discretisations such as finite volumes or finite elements often rely on these meshes to deliver high-quality results. Indeed there are thousands of papers detailing such algorithms, and a large number of commercial, government, and academic research codes; to name a few OpenFOAM [1], Rolls-Royce Hydra [2], FUN3D [3]. All of these codes use unstructured meshes in some form or shape, and since they are often used for large experiments, the efficiency of their implementation is an important concern. The problem of effectively parallelising applications using unstructured meshes is far from trivial and a number of solutions can be found in the literature.

There are a large number of papers discussing the efficient implementation and parallelisation in classical CPU architectures [4, 5], FPGAs [6, 7], and of course GPUs, which is the focus of our work.

There are libraries and domain specific languages that target unstructured mesh computations on GPUs, such as Liszt [8], OP2 [9], or more specialised ones such as the OCCA[10] open source library by Recacle et al. [11], which looks at efficiently solving elliptic problems on unstructured hexahedral meshes. Some of these libraries do use shared memory for improving data locality, however advanced techniques, such as reordering and partitioning studied in this work are not evaluated.

In work done by Castro et al. [12] on implementing path-conservative Roe type high-order finite volume schemes to simulate shallow flows, they also used auxiliary accumulators to avoid race conflicts while incrementing. Wu et al. [13] introduce caching using the shared memory with partitioning (clustering), but they do not use colouring, instead, they use a duplication method similar to that of Lulesh and miniAero, as described below. Fu et al. [14] also create contiguous patches (blocks) in the mesh to be loaded into shared memory, although they partition the nodes (the to-set) not the elements (the from-set of the mapping), further, they do not load all data into shared memory, only which is inside the patch. Writing the result back to shared memory is done by a binary search for the column index and atomic adds.

Mudalige et al. [15] work on the OP2 library, which is a basis of the present work. In the CUDA implementation, they use shared memory for caching with hierarchical colouring, but they do not reorder the threads to increase data reuse.

Parallel to these works the US Department of Energy labs have released a set of proxy applications that represent large internal production codes, showing some of the computational and algorithmic challenges that they face. In the Lulesh [16] and the miniAero [17] applications, two methods are proposed to address race conflicts: the first is to allocate large temporary array where the intermediate results (the increments) are placed, avoiding any race conditions, and to use a separate kernel to gather the results; the second is to use atomics. Both lead to increased warp divergence and high data latencies; and the use of the temporary array also leads to much more data being moved.

In our work, instead of porting any specific computation, we present a general approach to accelerate unstructured mesh applications, and in particular the indirect increment algorithmic pattern, on GPUs.

2 Parallelisation on GPUs

In this section we outline the techniques used to effectively optimise unstructured mesh applications on GPUs. We briefly show a naïve solution, then continue with describing various improvements others used, and then show our approach.

2.1 Traditional parallelisation approaches

On a GPU, groups of threads run at the same time, in lockstep, so it is not efficient to run computations of different length on different threads. Because of this, the usual practice is for each thread to take responsibility for one element on which computation is carried out (ie. the iteration set or from-set), because then the number of computations is fixed: the amount of data involved is fixed in the dimension of the mapping and the data arrays.

Care must be taken when writing parallel code to avoid data races when different threads modify the same data. It could happen that the threads are scheduled so that both reads occur before the writes. For example, if the threads increment by and , respectively, then the final result could be an increment of (when one increment happens fully before the other), (when the second thread writes last), (when the first thread writes last). There are several approaches to tackle this.

A solution would be to colour each thread according to the indirect data it writes, so that no two threads with the same colour write the same data (Figure c). Then do multiple kernel launches corresponding to the colours, so there is no concurrent writes. We call this the global colouring approach. The problem with this is that there is no data reuse: when multiple elements write the same data, they are scheduled for execution in different launches. Since these operations also tend to read data through the same mappings, there is no data reuse in the reads either. Worsening the issue is low cache line utilisation: elements of the same colour are not neighbours in the graph, and therefore unlikely to be stored in consecutive memory locations.

Another approach is to serialise the writing back of the result by means of locks or atomic additions (Figure a). This is quite costly on the GPU, since the whole warp has to wait at the synchronisation step leading to warp divergence.

Yet another solution is the use of a large temporary array that stores the results for each thread separately, avoiding race conditions by formulation (Figure b). But after the computation finishes, another kernel is required to gather the results corresponding to one data point. This suffers from the problem of not knowing how many values one thread has to gather, and the warps could diverge significantly, and memory access patterns are less than ideal (they can be good either for the write or the read, not both). Also, the size of the helper array is the number of cells multiplied by the dimension of the mapping. As a result, it can be quite large, for example, in Lulesh, it is in our measurements (where is the size of the from-set in Lulesh), compared to the array defined on nodes where these values will ultiamtely end up, which is roughly the same as the number of elements themselves.

Figure 4: Visualisation of methods to avoid data races between three edges writing the same cell in parallel.

Note that in most codes there are two arrays for every data that is updated: one for storing the result of the previous iteration and one to hold the values computed in the current iteration.

Array-of-Structure (AOS) to Structure-of-Array (SOA) transformation

On GPUs multiple threads execute the same code at the same time. This means that the consecutive threads in a warp are reading the memory at the same time, hence the layout of the data in the memory is an important factor. We used two different layouts in our measurements. The first is the Array-of-Structures layout which means that the data associated with one element is in consecutive places in the array (and thus in memory). The AoS layout shown in Figure a. The other layout is the Structure-of-Arrays where we store the components of elements consecutively e.g. the first data component of the elements are in the beginning of the array followed by the second, etc. as shown on Figure b.
Although, in most cases the SoA gives better performance on GPUs, the AoS layout is still commonly used on CPU architectures with large caches. In the case of AoS layout consecutive threads read data from different places of the memory and thus it is possible that we move unnecessary data with cache lines, if not all components are used. However with the SoA layout the threads read data next to each other which means that the data needed by consecutive threads are probably in the same cache line and we get coalesced memory transactions. The data layouts and the corresponding memory reads are shown in Figure 5. However, when indirections are involved, these access patterns become more complicated - even with the SoA pattern, consecutive threads may not be reading consecutive values in memory, and therefore cache line utilisation degrades. The choice of data layout in unstructured mesh computations is therefore highly non-trivial, as we show later.

Figure 5: AoS and SoA layout example with data access pattern in the case of threads read the first two data of the points one after another.

2.2 Shared memory approach

Our approach exploits the shared memory on the GPU, which has much lower access latencies than the global memory, but is only shared within thread blocks. The idea is to collect the data necessary to the computation and first load it into shared memory. During computation, the indirect accesses are to the shared memory, and the result is also stored there. After every computation has finished, the global memory is updated with the contents of the shared memory.

Note that fetching the data from the global memory (and also writing the result back) can be done by the threads independently of which thread will use which pieces of data for the computations. Particularly, fetching can be done in the order the data is laid out in memory, ensuring the utilisation of cache lines as much as possible. Also, if AoS layout is used, data can be read in contiguous chunks as large as the dimension of the data.

To address data races, we use two-layered colouring or hierarchical colouring[15]: we colour the blocks and the threads within a block as well. The former is to avoid data races when thread blocks write the result back to global memory and the latter is to avoid threads writing their results into shared memory at the same time. At the end of computations in each thread, the intermediate results are placed in a register array. After every computation has finished, the contents of this array are stored in shared memory in a controlled manner: one thread colour at a time, with block-level synchronisations in-between. For every block colour, we launch a new kernel, just like in global colouring.

Note that the data that is loaded into shared memory is not necessarily only the data that is updated. The partitioning is done with regard to the former, the colouring with regard to the latter. In fact, all of the indirect accessed data can be loaded if it fits into shared memory (or does not cause such a low occupancy that offsets the benefits of high reuse).

The steps done by a kernel are described in Algorithm 2.2. The indirect data accessed by the block is fetched from global to shared memory (which is identified in a preprocessing phase). This operation consists of two nested loops: an iteration over the data points, and within that, an iteration over the data corresponding to the data point. For the SoA layout, only the outer loop needs to be parallelised, since this will cause parallel read operations to access memory addresses next to each other. For the same reason, if AoS layout is used, both parallel loops need to be parallelised (collapsed into one).

The data layout in shared memory is always SoA: our measurements showed a consistent degradation in performance when switching to AoS layout, due to the spatial locality described in Section 2.1.1: it leads to fewer bank conflicts.

After the data is loaded into shared memory (ensured by a \lstinline!__syncthreads! operation), the intermediate result of the thread is computed by the main body of the kernel, and placed into registers. The most common operation is incrementing: the threads increment the data from the previous iteration with their intermediate results.

Next the threads update the result in shared memory with their increments. Note that this is not the same data array as the one used as the input for the computations, since those are values calculated by a previous kernel or iteration. Finally the updated data is written back to global memory.


tid = blockIdx.x * blockDim.x + threadIdx.x \Statebid = blockIdx.x \ForAlld indirect_data \Stateshared[shared_ind(d)] = global_indirect[d] \Stateshared_out[shared_ind(d)] = global_indirect_out[d] \EndFor\State__syncthreads() \Stateresult = computation(shared[mapping[tid]], global_direct[tid]) \State__syncthreads() \Forc = 1 num_thread_colours \Ifc == thread_colours[tid] \Statewrite result back to shared_out \EndIf\State__syncthreads() \EndFor\ForAlld indirect_data \Statewrite shared_out back to global_indirect_out \EndFor Algorithm to use the shared memory to preload indirect data accessed within a thread block. \lstinline!global_indirect! holds the data indirectly read, \lstinline!global_indirect_out! holds the result of the iteration.

There is no need to keep two separate arrays in shared memory for storing the input (\lstinline!global_indirect!) and the output (\lstinline!global_indirect_out!) data: the same memory space can be used. After the computations, a new loop clears the memory, the \lstinline!result! array is written to shared memory, and from there, \lstinline!global_indirect_out! is updated.

One other benefit from using the shared memory with hierarchical colouring is data reuse within the block: each piece of data has to be loaded from global memory only once, but can be used by multiple threads. However, the greater the reuse, the more thread colours we have: the number of colours is no less than the number of threads writing the same data. Since the number of synchronisations also grows with the number of thread colours (more precisely, it is the number of colours plus two, one before and one after the computation if the input and the increment are stored separately in shared memory), there is a trade-off between the number of synchronisation and data reuse. Our measurements showed, that if the kernel is memory-bound, the greater data reuse leads to increased performance, but the trade-off is non-trivial, as we will demonstrate in the Measurements section.

2.3 Increasing data reuse

Our work’s key contribution is the reordering of the elements in the from-set (which map to the threads), so we can control how CUDA thread blocks are formed and how much data reuse can be achieved. We consider mainly the shared memory approach, where the benefit of data reuse is twofold: it decreases the number of global memory transactions and decreases the size of shared memory needed, which leads to greater occupancy.

We have looked at two different approaches: the first is the Gibbs-Poole-Stockmeyer algorithm, the second is partitioning.

Gibbs-Poole-Stockmeyer-based reordering

For serial implementations (typically on CPUs) of computations on graphs, the Gibbs-Poole-Stockmeyer (GPS, [18]) is a heuristic algorithm that increases spatial and temporal locality when traversing the nodes. It does this by renumbering the nodes, and with this, changing the order of traversal. (In this case, the edges are the elements of the from-set of the mapping, and the nodes form the to-set.)

It does this by going through the nodes in a breadth first manner from two distant starting points, and then renumbers the nodes so that the levels of the resulting spanning trees will constitute contiguous blocks in the new permutation.

After renumbering its points, which by design improves spatial locality, we order the edges of the graph lexicographically, so that consecutive threads (or spatial iterations in serial implementations) have a higher chance of accessing the same points, which improves temporal locality (data reuse).

When the operations on the edges are executed in parallel rather than sequentially, this increase in data reuse is also present, and the sharing approach can benefit from this.

The algorithm can be generalised to meshes by transforming each element into a fully connected graph of its points and then taking the union of these. An example of this is shown on Figure 6.

Figure 6: An example of converting a mesh (shown in LABEL:sub@fig:mesh2grapha, with mapping dimension 4) to a graph (on Figure LABEL:sub@fig:mesh2graphb) for the GPS algorithm.

There are several straightforward generalisations to handle multiple sets and mappings (e.g. vertices, edges, cells and their connections). The first is to assume that all the mappings describe a similar topology, so the elements can be reordered based on one of the mappings (as described above), then reorder the points accessed through the other mappings by, for example, a greedy method. Another approach could be to reorder every data set separately, and then reorder the elements based on the new order of the accessed points, combining the separate data sets (and corresponding mappings) in some way. Since the mappings in the applications we measured are very similar topologically (in fact, except for Airfoil, there is only only one mapping in each application), we used the first method.

However, the algorithm fails to take into account that on the GPU the threads are grouped into blocks, and data reuse can only realistically be exploited within blocks. The next proposed algorithm will address this limitation.

Partition based reordering

To increase data reuse within a block is equivalent to decreasing data shared between the blocks, more specifically, to decrease the number of times the same data is loaded in different blocks. (With the sharing approach, data needs to be loaded only once per block.) So the task is to partition the elements into blocks of approximately the same size in such a way that when these blocks are assigned to CUDA thread blocks, the common data used (loaded into shared memory) by different blocks is minimised.

Let be a graph constructed from the original mapping, where the points are the threads, and there is an edge between them if and only if they access the same data, and let be a partition of this graph with blocks. This works even with multiple mappings.

If there is a set of blocks that access the same piece of data, then they form a clique in in the sense that between any pair of blocks and (where ), there is an edge of between and such that . Note that the cliques have edges, which is a monotone increasing function in , since (there is at least one block writing each data point, otherwise it is of no relevance).

That means that partitioning using the usual objective, ie. minimising the number of edges in the block is a good heuristic for maximising data reuse.

We chose the k-way recursive partitioning algorithm used by the METIS[19] library to partition the graph . It is a hierarchical partitioning algorithm: it first coarsens the graph by collapsing nodes, then partitions using the recursive bisection algorithm, then, while progressively uncoarsening the graph, locally optimises the cuts.

The algorithm tries to maintain equal block sizes in the resulting partition, however, there are some differences. A tuning parameter is the load imbalance factor, which can be used to specify the tolerance for this difference. (It is called load imbalance because METIS is originally used for distributing computation, ie. load, in a distributed memory system.) It is defined as , where is the number of blocks and is the size of the th block. Due to the local optimisation in the uncoarsening phase, it is impractical to set this parameter to (meaning the block sizes must be exactly the same). We found that a tolerance of works well in practice.

In our library, the block size is a tuning parameter, which specifies the actual block size of the launched GPU kernels, naturally, the number of working threads cannot exceed it. Therefore, we calculate a new block size () and tolerance () with margins for the imbalance: {align} S’ &= ⌊Sl ⌋
l’ &= S + ϵS’, where is the original block size, is the original load imbalance parameter and is an empirical tuning parameter to create as large blocks (within the limit) as possible.

The variable number of working threads also incurs a slight overhead of two global loads for each thread: this number and the starting offset in the arrays. We found this overhead to be minimal in practice.

Due to the way global loads and stores work on the GPU, what actually affects performance is not the number of data points accessed, but rather the number of cache lines of size 32 bytes that are accessed. We used a simple heuristic reordering of data points that takes this into account, it is described in Algorithm 2.3.2.

The idea is to group data points together that are read/written by the same set of blocks, especially a set of more than one block, since any inefficiencies in that case will count more than once. To achieve this, we simply sort the data points by the number of partitions that write them, and within this, the indices of the partitions themselves.


p indirect_data \Statepoint_to_block[p] = vector of all blocks that write p \Statesort point_to_block[p] \EndFor\Statesort point_to_block by first the size of vector, then the contents of the vector (in lexicographical order) \Statereorder data points in the order created by the sort Reorder the indirect data based on the thread blocks

2.4 Optimisations

There are a few optimisations we introduced to further improve performance.

During the load to and write from shared memory, in the case where the subsequent threads access addresses that are next to each other (the case when both loops are parallelised, as described in Subsection 2.2), we can make use of CUDA’s built-in vector types (\lstinline!float2!, \lstinline!float4! and \lstinline!double2!) to load and write larger chunks (16 bytes) of data at a time.

To reduce warp divergence when updating the shared memory with the increments, the threads can be sorted within a block by their colours. After this, threads with the same colour will be next to each other, so warps will have fewer colours, hence less divergence in average.

To allow the compiler to apply some reordering optimisations that it wouldn’t be able to do otherwise, the data pointers on the GPU is marked \lstinline!__restrict__! and \lstinline!const! where applicable. The former tells the compiler that the pointers do not alias one another, ie. they do not point to the same memory space. The latter enables the compiler to place the data in texture cache that has lower latency than the global memory.

3 Library implementation

In this section we describe the design considerations for the reordering library. The goal was to minimise the need for the users to modify their computation algorithms.

The basic architecture is shown on Figure 7. The flow of the application is as follows. The user first initialises the library and specifies the data and mapping sources (typically files). After some schedule and memory layout planning that optionally contains a reordering or partitioning step, the library calls the computation code in a loop, and the result is then saved.

Figure 7: The architecture of the reordering library.

3.1 User kernels

The concept of a computation is in the form of a loop (or kernel) body. The same code can be used as in the serial algorithm, with the restrictions described in Section 1.1.1 and with the modification that, since the data layout is not necessarily the usual AoS, the accesses need to use the stride parameters (supplied to the kernel by the library) that are defined to be the distance between two consecutive data component (e.g. this is in the case of AoS).

Four types of loops are supported: one serial and three parallel. The parallelisations are done by (1) OpenMP on the CPU, (2) CUDA with global colouring and (3) CUDA using the shared memory apporach with hierarchical colouring on the GPU. The CPU versions are not optimised and only there for testing and verification purposes.

The user kernel consists of two main levels. In the first (upper), the pointers to the data arrays are acquired, typecast and the result is written back into global (GPU or CPU memory). This is similar in the differently parallelised loops (e.g. there is usually no difference between the serial and the OpenMP versions), but there are small variations in the use of GPU shared memory, the synchronisation steps and the calculation of the loop variable. By loop variable, we mean the index from the from-set. The second (lower) level is the calculation itself that should be the same for all loop forms.

The read and written data is supplied to the kernel by means of \lstinline!void! pointers marked \lstinline!__restrict__! on the GPU (and \lstinline!const!, where applicable). Since the kernel is written by the user, it knows about the type of the data and can use type conversion, furthermore, the order of the pointers is the same as it was given to the library when loading the data.

The data is automatically copied to the device before the beginning of the loop when running on the GPU, and the result is copied back to the host after. The pointers also point to the location in the appropriate address space.

The directly accessed data is always in SoA form, while the layout of the indirectly accessed data is a user supplied compile time parameter. The shared memory in our kernels is in SoA form, but that is in the hand of the user. The layout of the mappings is AoS, except when using hierarchical colouring.

The synchronisation is done by the library using multiple kernel launches in the case of OpenMP and global colouring. The hierarchical colouring has some additional synchronisation steps, as described by Algorithm 2.2 in Section 2.2.

The loop variable is supplied runtime by the library on the CPU and can be calculated (e.g. for the global colouring, it is \lstinline!blockDim.x * blockIdx.x + threadIdx.x!) on the GPU. Since the blocks in the shared memory approach are not necessarily of the same size (they may be formed by partitioning), that kernel looks up its loop variable from an array.

Note that these modifications and variations to the user kernel can be easily automated by means of a source-to-source translator tool.

3.2 Execution planning

The serial loop is trivial to set up: it is a simple for loop over the from-set elements with no concurrency. For the parallel loops, we colour the elements to avoid data races.

Two kinds of colouring algorithms is used: global colouring is used for the OpenMP and the first CUDA parallelisation, while hierarchical or two-layer colouring is used for the shared memory approach. Both are variants of the basic greedy graph colouring algorithm.

The global colouring is a direct generalisation of the greedy graph colouring algorithm, as shown in Algorithm 3.2: for all elements of the from-set, its colour will be one of those that are used but none of its corresponding indirect points have. Which of the possible colours is used depends on the implementation of the \lstinline!chooseColour! function; in the OpenMP and global colouring parallelisations, the colour with the least amount of from-set elements assigned to it (so far) is chosen.

The threads with different colours are started in different kernel launches. The mappings and the direct accessed data arrays are reordered according to their colour. This avoids introducing another indirection (which would map from their index in the current colour to the actual from-set element that it should be working on).


\Stateinitialise with empty sets \State \ForAll \State \State \If \Stateinsert a new colour into \State \EndIf\State = chooseColour() \State \ForAll \State \EndFor\EndFor Global colouring

The hierarchical colouring reorders the threads according to the given partition and maps the threads to blocks. The block sizes are limited by a parameter (given by the user), if one block in the partition is larger, it is divided into two.

The same algorithm described above is then used to colour the thread blocks, iterating over the indirectly written data points of all threads in the blocks in the union operation and the inner for loop of Algorithm 3.2.

After that, the indices of the points that need to be loaded to shared memory are determined for each block and each thread is coloured within the block. Since here the number of thread colours is not important, the first possible colour with the lowest colour index is assigned.

The threads are then sorted according to their colour to reduce warp divergence, as described in Section 2.4.

The mappings and the direct accessed data are transformed into SoA layout to increase spatial locality. Similarly to the global colouring, these are then reordered according so they can be directly indexed by the loop variable.


Before the execution of the loop an optional step is the reordering of from- or to-set elements using the GPS or the partitioning algorithms described in Section 2.3.

We use the Scotch library[20] for the GPS algorithm, and the METIS[19] library for the multilevel k-way partitioning algorithm (using 64 bit integers for indices). We also tried the recursive bisection algorithm in the METIS library, but the result was orders of magnitude worse, and the partitioning algorithm in the Scotch library, but that failed to stay within tolerance (and half of the blocks in the partition were empty).

It must be noted that these algorithms were developed for distributing workloads on computing clusters that typically have much larger block size to total size ratio. This also caused the reordering (partitioning) phase to be quite long: even minutes; but this is a one-off cost: the reordering can be saved from the library and reused many times later. Further improvement could be achieved by using the parallel versions of the METIS library: ParMETIS[21] and the alpha version mt-Metis[22] libraries. Partitioning algorithms targeting specifically small partition sizes required for CUDA thread blocks are a target of future research.

4 Measurements

4.1 Used applications


Airfoil is a benchmark application, representative of large industrial CFD applications. It is a non-linear 2D inviscid airfoil code that uses an unstructured grid and a finite-volume discretisation to solve the 2D Euler equations using a scalar numerical dissipation. The algorithm iterates towards the steady state solution, in each iteration using a control volume approach, meaning the change in the mass of a cell is equal to the net flux along the four edges of the cell, which requires indirect connections between cells and edges. Airfoil is implemented using the OP2 domain specific language [15], where two versions exist, one implemented with OP2’s C/C++ API and the other using OP2’s Fortran API [9, 23].

The application consists of five parallel loops: save_soln, adt_calc, res_calc, bres_calc and update. The save_soln loop iterates through cells and is a simple loop accessing two arrays directly. It basically copies every four state variables of cells from the first array to the second one. The adt_calc kernel also iterates on cells and it computes the local area/timestep for every single cell. For the computation it reads values from nodes indirectly and writes in a direct way. There are some computationally expensive operations (such as square roots) performed in this kernel. The res_calc loop is the most complex loop with both indirect reads and writes; it iterates through edges, and computes the flux through them. It is called 2000 times during the total execution of the application and performs about 100 floating-point operations per mesh edge. The bres_calc loop is similar to res_calc but computes the flux for boundary edges. Finally update is a direct kernel that includes a global reduction which computes a root mean square error over the cells and updates the state variables.
In our work we focus on res_calc since it has indirect increments and about the 70% of the time is spent in this kernel on GPUs with global colouring. The tests are executed on a mesh containing 720 thousand edges and 2.8 million cells.


Volna is a shallow water simulation capable of handling the complete life-cycle of a tsunami (generation, propagation and run-up along the coast) [24]. The simulation algorithm works on unstructured triangular meshes and uses the finite volume method. Volna is written in C/C++ and converted to use the OP2 library[15]. For Volna top three kernels where most time is spent: computeFluxes, SpaceDiscretization and NumericalFluxes. In the computeFluxes kernel there are indirect reads and direct writes, in NumericalFluxes there are indirect reads with direct writes and a global reduction for calculating the minimum timestep and in SpaceDiscretization there are indirect reads and indirect increments. We focus on the SpaceDiscretization kernel since the 60% of the time spent inside this step on GPU with global colouring.

Tests are executed in single precision, on a mesh containing 2.4 million triangular cells, simulating a tsunami run-up to the US pacific coast. The kernel itself iterates over the edges and increments the cells.


BookLeaf is a 2D unstructured mesh Lagrangian hydrodynamics application from the UK Mini-App Consortium [25]. It uses a low order finite element method with an arbitrary Lagrangian-Eulerian method. Bookleaf is written entirely in Fortran 90 and has been ported to use the OP2 API and library. Bookleaf has a large number of kernels with different access patterns such as indirect increments similar to increments inside res_calc in Airfoil. For testing we used the SOD testcase with a 4 million element mesh. The top three kernels with the highest runtimes are getq_christiensen1, getacc_scatter, gather. In gather and getq_christiensen1 have indirect reads and direct writes, and the other two kernels have only direct reads and writes. Among these there is only one kernel (getacc_scatter) with indirect increments so we tested our optimisations on this kernel.


MiniAero [17] is a mini-application for the evaulation of programming models and hardware for next generation platforms from the Mantevo suite [26]. MiniAero is an explicit (using 4th order Runge-Kutta) unstructured finite volume code that solves the compressible Navier-Stokes equations. Both inviscid and viscous terms are included. The viscous terms can be optionally included or excluded. For miniAero meshes are created in code and are simple 3D hex8 meshes. These meshes are generated on the host and then moved to the device. While the meshes generated in code are structured, the code itself uses unstructured mesh data structures and access patterns. This mini-application uses the Kokkos library [27].

For miniAero we tested the compute_face_flux kernel that computes the flux contributions of the faces and increments it with the appropriate cell flux values. The original code (depending on a compile time parameter) either uses the auxiliary apply_cell_flux kernel that does the actual incrementing by gathering the intermediate results from a large temporary array, or uses atomics to do it within the kernel. Both the atomics and the work of the auxiliary kernel was substituted in our code by colouring.


Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH, [16]) represents a typical hydrocode representing the Shock Hydrodynamics Challenge Problem that originally defined and implemented by Lawrence Livermore National Lab as one of five challenge problems in the DARPA UHPC program and has since become a widely studied proxy application in DOE co-design efforts for exascale.

LULESH is a highly simplified application, hard-coded to only solve a simple Sedov blast problem that has an analytic solution [28] – but represents the numerical algorithms, data motion, and programming style typical in scientific C or C++ based applications at the Lawrence Livermore National Laboratory. LULESH approximates the hydrodynamics equations discretely by partitioning the spatial problem domain into a collection of volumetric elements defined by a mesh.

Like miniAero, the mesh itself is structured (and generated in the code), but the algorithm doesn’t take this into account and accesses the data through an eight-dimensional mapping for the hex8 (brick) elements. In our measurements, we used a mesh where the the number of elements (from-set) was and the number of nodes (to-set) .

For Lulesh, we tested the IntegrateStressForElems kernel that calculates the forces in the nodes. The original CUDA version of the code contracted this kernel with CalcFBHourglassForceForElems; the only modifications we did to this code for our measurements is to remove these parts from the kernel.

4.2 Experimental setup

For testing we used NVIDIA Tesla P100 GPU, Intel(R) Xeon(R) CPU E5-1660 (3.20GHz base frequency, 1 socket with 8 cores) with Ubuntu 16.04. We used the nvcc compiler with CUDA 9.0 (V9.0.176). The parameters of the P100 GPU are shown in Table 1.

P100 (GP100)
Streaming Multiprocessors (SM) 56
Max Thread Block Size 1024
Max Warps / Multiprocessor 64
Max Threads / Multiprocessor 2048
Max Thread Blocks / Multiprocessor 32
Max Registers / Thread 255
Shared Memory Size / SM 64 KB
Max 32 - bit Registers / SM 65536
Memory Size 16 GB
L2 Cache Size 4096 KB
Table 1: Important informations about NVIDIA Tesla P100 GPU [29]

When comparing performances, we used the achieved bandwidth that is the key performance metric with memory-bound applications such as our test applications. It is calculated by the following formula:

where iterates over the datasets, is if the data is read and written, otherwise, is the size of the dataset, is the overall runtime of the kernel and is the number of iterations.

We also collected other relevant metrics that describe the observations, such as

  • data reuse factor (the average number of time an indirectly accessed data point is accessed),

  • the number of read/write transactions from/to grobal memory, which is closely related to the data reuse factor but is affected by memory access patterns, and therefore cache line utilisation,

  • the occupancy reflecting the number of threads resident on the SM versus the maximum - the higher this is, the better chance of hiding the latency of compute/memory operations and synchronisation

  • the percentage of stalls occurring because of data requests, execution dependencies, or synchronisation,

  • the number of block colours; the higher it is, the less work in a single kernel launch, which tends to lead to lower utilisation of the GPU,

  • the number of thread colours; the higher this is the more synchronisations are required to apply the increments in shared memory — but also strongly correlates with data reuse,

  • warp execution efficiency (ratio of the average active threads per warp to the maximum number of threads per warp).

Studying performance and these metrics help us understand and explain why ceratin variants are better than others.

4.3 Measurement results

Analysis of Airfoil

First we analyse the Airfoil application from the OP2 library — this is the most well understood and thoroughly studied example, therefore we go into more detail here — for later kernels and applications we then identify key differences.

Figure 8 show the effect of reordering on the global colouring approach in the Airfoil application (res_calc kernel). Since no shared memory was used, only the block size affected the occupancy of the kernels, which was high enough not to be the bottleneck (e.g. at 480 block size, it was with AoS, with SoA layout): accordingly there are relatively small variations in performance along the different block sizes.

The SoA layout, as mentioned in Section 2.1.1, improves on the performance, since the threads in a warp access data addresses that are near each other. This can also be seen in the number of global memory read transactions (Table 2) as it is roughly of that with AoS layout. This is even strengthened by GPS renumbering, which places data points that are accessed in consecutive threads close to each other ( reduction in global read transactions compared to AoS).

Reordering none GPS partition
Indirect data layout AOS SOA SOA SOA
Achieved Occupancy
Warp Execution Efficiency
Device Memory Read Transactions
Device Memory Write Transactions
Number of Colours
Table 2: Collected performance metrics of the global colouring implementation of the res_calc kernel. Block size is .

The partition based reordering is primarily intended for the hierarchical colouring; it groups threads that access the same data together, while the global colouring puts them into different kernel launches, eliminating any chance for spatial reuse, therefore reducing performance.

Figure 8: res_calc bandwidth on a dataset with cells with global colouring. Note that the y-axis doesn’t start at the origin.

The measurement on Figure 9 used the hierarchical colouring and the shared memory approach, that, coupled with the fact that it uses more registers ( (SoA) and (AoS)) compared to global colouring ( and ), leads to larger variations in occupancy and performance as the block size changes; for example, between block sizes and ( versus ).

The key goal of this strategy is to better exploit data reuse by using the shared memory; the results of which show immediately in the number of global transactions as well (Table 3): at block size 480, there is roughly decrease in global read and write transactions, leading to three times the performance.

These also show that the reordering using partitioning is indeed effective. With a block size of 448, data reuse increased from with the reference version, to , leading to the performance gain over the version without reordering (AoS layout). This is also consistent with the number of global transactions: there is a decrease in the number of reads and decrease in the number of writes, and a decrease in the percentage of stalls occurring because of data requests: with partitioning, without.

With the increased reuse, the number of thread colours is also larger ( versus ) and this leads to more synchronisation: with reordering, of the stalls were caused by synchronisation, compared to .

This is further illustrated by Figure 10 that shows the relative speedup compared to the original OP2 version. In this case, the original version also used the shared memory approach, so the performance gain is caused by the reordering.

Figure 9: res_calc bandwidth on a dataset with cells with hierarchical colouring. Note that the y-axis doesn’t start at the origin.
Figure 10: res_calc kernel speedup compared to the original code. Done on a dataset with cells. The block sizes are shown in parentheses.

Although the number of block colours increased with reordering, it is only a problem when the size of the dataset is so small that it decreases the achieved occupancy (there are not enough threads in one kernel launch to use the available resources on the GPU).

In the shared memory approach, the AoS version uses adjacent threads to load adjacent components of data points, therefore it can exploit the higher spatial locality better: the numbers of global read and write transactions are smaller by and , respectively. Note that the data reuse metric is the same between the to versions, because it is based on the number of datapoints (it is a metric of temporal locality).

Despite the higher locality, in most cases, the AoS version also uses more registers (sometimes the compiler can optimise the register count of both versions to the same number), and this may lead to lower occupancy. Hence it is not always obvious which is the better choice.

Reordering no no partition partition
Indirect data layout AOS AOS AOS SOA
Block size 480 448 448 448
Achieved Occupancy
Device Memory Read Transactions (total)
Device Memory Write Transactions (total)
Issue Stall Reasons (Synchronization)
Issue Stall Reasons (Data Request)
Number of Block Colours
Reuse Factor
Number of Thread Colours
Table 3: Collected performance metrics of the hierarchical colouring implementation of the res_calc kernel.

When running on the newer Volta GPU architecture, the results are the same (Figure 11), except that the absolute value of the bandwidths are (understandably) higher.

Figure 11: res_calc bandwidth on a dataset with cells with hierarchical colouring, on Volta architecture. Note that the y-axis doesn’t start at the origin.

Analysis of Volna

In measurements of the Volna application (SpaceDiscretization kernel) with global colouring, the effects of the partitioning are reversed, as is shown on Figure 12, because the blocks formed by the colouring algorithm roughly correspond to those formed by the partitioning algorithm (Table 4).

Figure 12: SpaceDiscretization kernel bandwidth on a mesh with edges using global colouring. Note that the y-axis doesn’t start at the origin.

In hierarchical colouring (Figure 13) the reordering by partitioning again improves performance: it increases reuse from to and decreases the number of global transactions by for reads and for writes (Table 5). The larger reduction in writes can be explained by the fact that the calculation does not read the values of the indirect data from the previous iteration, only the direct data.

Again, since the AoS version uses adjacent threads to load adjacent components of data points, and also because one thread loads single precision values into shared memory using the built-in vector type \lstinline!float4!, more data can be transferred at the same time, thus there is a and reduction in global memory transfers for reads and writes, respectively, leading to increased performance ( versus ).

Due to the low register counts () and the fact that volna uses float (and int) datatypes compared to the doubles in Airfoil, the occupancy was quite high (around ) in both the global and hierarchical colouring approaches. This explains the lack of dependence on the block size as shown by the figures.

Compared to Airfoil, the number of thread colours increases less when partitioning: from to , hence, the synchronisation overhead is also less: the percentage of stalls caused by synchronisation increases from to just . Of course, with high occupancy, the latency caused by synchronisation can be hidden by running warps from other blocks.

Figure 13: SpaceDiscretization kernel bandwidth on a mesh with edges using hierarchical colouring. Note that the y-axis doesn’t start at the origin.

As can be seen from Figure 14, our implementation is more than two times faster than the original OP2 version.

Reordering no no GPS partition
Indirect data layout AOS SOA SOA SOA
Block size 448 448 448 448
Achieved Occupancy
Device Memory Read Transactions
Device Memory Write Transactions
Number of Colours
Table 4: Collected performance metrics of the global colouring implementation of the SpaceDiscretization kernel.
Reordering no no partition partition
Indirect data layout AOS SOA AOS SOA
Block size 448 448 448 448
Achieved Occupancy
Warp Execution Efficiency
Device Memory Read Transactions
Device Memory Write Transactions
Issue Stall Reasons (Synchronization)
Issue Stall Reasons (Data Request)
Number of Block Colours
Reuse factor
Average cache lines/block
Number of Thread Colours
Table 5: Collected performance metrics of the hierarchical colouring implementation of the SpaceDiscretization kernel.
Figure 14: SpaceDiscretization kernel speedup compared to the original code, done on a mesh with edges. The block sizes are shown in parentheses.

Analysis of BookLeaf

The measurements on the BookLeaf application (specifically the getacc_scatter kernel), as can be seen on Figures 15, 16, 17, also benefits from the use of the shared memory and the partitioning of the mapping.

The register count and occupancy are also similar to those with Airfoil ( registers for the shared memory approach and and for SoA and AoS global colouring, respectively, achieving occupancy or around ), this leads to the variations in performance along different block sizes.

With partitioning, the number of thread colours increased from to (Table 6), this leads to the increased stalls from synchronisation: from to , while the reuse factor increases (from to ), which is comparable to that of Airfoil, this explains the smaller increase in performance (only , compared to the increase in Airfoil).

The higher data reuse leads to and decrease of the number of global transactions, for reads and writes, respectively. This large difference between reads and writes is also because, like SpaceDiscretization, getacc_scatter does not read values indirectly.

Figure 15: getacc_scatter kernel bandwidth on a mesh with edges using global colouring.
Figure 16: getacc_scatter kernel bandwidth on a mesh with edges using hierarchical colouring. Note that the y-axis doesn’t start at the origin.
Figure 17: getacc_scatter kernel speedup compared to the original code, done on a mesh with edges. The block sizes are shown in parentheses.
Reordering no no partition partition
Indirect data layout AOS SOA AOS SOA
Block size 320 320 320 320
Achieved Occupancy
Warp Execution Efficiency
Device Memory Read Transactions
Device Memory Write Transactions
Issue Stall Reasons (Synchronization)
Issue Stall Reasons (Data Request)
Number of Block Colours
Reuse Factor
Average Cache Lines/Block
Number of Thread Colours
Table 6: Collected performance metrics of the hierarchical colouring implementation of the getacc_scatter kernel.

Analysis of LULESH

With LULESH (Figures 18 and 19), the partitioning approach actually worsened performance.

The IntegrateStressForElems kernel uses a mapping with 8 dimensions (compared to the 2 or 3 as in the case of the previous ones) that has high connectivity, so the number of colours is quite high: , and in the global coloring versions (for the different reorderings), and , and in the hierarchical colouring versions. The number of thread colours was also quite high: in the non-reordered ( in the GPS) and in the partitioned version: this is a much higher increase compared to the previous applications (Table 7).

The other aspect in which LULESH is different is that it uses a high amount of registers (, except for the AOS global colouring version, which uses ), significantly decreases occupancy: with block size 320, the AoS version achieved and the SoA version achieved around .

Because of these two reasons, the synchronisation overhead ( stalls were from synchronisation on the partitioned mesh) couldn’t be hidden: there were no warps from other blocks to be scheduled in place of the stalled ones because there was only one block running on each multiprocessor.

In terms of reuse, although the gain (from to ) is larger compared to Airfoil, but still not enough to offset the overhead of synchronisation: in this case, the original ordering of the mesh is better. The difference in achieved occupancy also means that the SoA version with two blocks per multiprocessor performs better.

Figure 18: IntegrateStressForElems kernel bandwidth on a mesh with cells using global colouring.
Figure 19: IntegrateStressForElems kernel bandwidth on a mesh with cells using hierarchical colouring. Note that the y-axis doesn’t start at the origin.

Nevertheless, as shown on Figure 20, our hierarchical colouring algorithm performs significantly better than the original, two-step implementation, and also uses much less memory. Although the original needs less synchronisation, the resulting warp divergence also cannot be hidden if the occupancy is this low.

Figure 20: IntegrateStressForElems kernel speedup compared to the original (gathering) code, done on a mesh with cells. The block sizes are shown in parentheses.
Reordering no no partition partition
Indirect data layout AOS SOA AOS SOA
Block size 320 320 320 320
Achieved Occupancy
Warp Execution Efficiency
Device Memory Read Transactions (total)
Device Memory Write Transactions (total)
Issue Stall Reasons (Synchronization)
Issue Stall Reasons (Data Request)
Number of Block Colours
Reuse Factor
Average Cache Lines/Block
Number of Thread Colours
Table 7: Collected performance metrics of the hierarchical colouring implementation of the IntegrateStressForElems kernel.

Analysis of MiniAero

The compute_face_flux kernel is the most computationally intensive among the ones we tested: it uses registers in hierarchical ( for SoA) and in global colouring. Also, it achieves (with block size and reordered by GPS) of peak double precision efficiency, compared to the in Airfoil (Table 8). It also uses square root operations and several divides that can’t efficiently fill the pipelines at such low occupancy.

The amount of data indirectly accessed by the kernel is also large: each thread accesses data points indirectly, each holding double precision values. If all of that is loaded into shared memory, the size of it exceeds the hardware limits with block sizes larger than ; the GPS reordered version didn’t run with any block size (Figure 21). The other measurements were carried out by only loading the incremented data into shared memory.

The mesh also has a complex structure ( and block colours for GPS reordered and partitioned versions, respectively) and the original ordering was far from optimal: we couldn’t run the non-reordered version, because the number of block colours exceeded the implementation limit of the library, which is .

As with LULESH, only one block was running at a time on each multiprocessor. Although the synchronisation overhead wasn’t that high ( and thread colours in the GPS reordered and partitioned versions, respectively), the costly operations prevented high performance gains in the case of the partitioned version (Figure 22).

The original Kokkos implementation either used atomic adds or the two-step gathering approach depending on compilation parameters. Our implementation outperformed both (Figure 23), for the same reason as in case of LULESH.

Figure 21: The compute_face_flux kernel bandwidth on a mesh with faces. The kernel didn’t fit into the shared memory with block size larger than 288 or if not partitioned because the large amount of shared memory needed.
Figure 22: The compute_face_flux kernel bandwidth on a mesh with faces. The shared memory was only used to cache the increments, reducing the need for large shared memory size. The kernel didn’t fit into the shared memory with block size larger than 384 or if not reordered because the large amount of shared memory needed.
Figure 23: The compute_face_flux kernel speedup compared to the original code that was using atomic adds, on a mesh with faces. The shared memory was only used to cache the increments, reducing the need for large shared memory size. The last bar shows the relative performance of the original code with the gathering approach.
Reordering GPS GPS partition partition
Indirect data layout AOS SOA AOS SOA
Block size 384 384 384 384
Achieved Occupancy
Warp Execution Efficiency
Device Memory Read Transactions
Device Memory Write Transactions
Issue Stall Reasons (Synchronization)
Issue Stall Reasons (Data Request)
Issue Stall Reasons (Execution Dependency)
FLOP Efficiency(Peak Double)
Number of Block Colours
Reuse Factor
Average Cache Lines/Block
Number of Thread Colours
Table 8: Collected performance metrics of the hierarchical colouring implementation of the compute_face_flux kernel.

5 Conclusion

We investigated the methods of accelerating parallel scientific computations on unstructured meshes. We specifically looked at improving the performance of memory-bound implementations executing on GPUs.

We designed a reordering algorithm which uses k-way recursive partitioning to improve data reuse and with it, performance of unstructured mesh applications accelerated on GPU platforms.

We implemented a library that can automatically (without major modifications from the part of the user) parallelise serial user code, avoiding data races using global and hierarchical colouring. It uses optimisations specifically targeting GPUs, such as caching in shared memory, reordering by colours to reduce warp divergence and using vector types to more efficiently utilise available global memory bandwidth.

Using this library, we analysed the performance of our algorithms on a number of representative unstructured mesh applications varying a number of parameters, such as the different thread block sizes and data layouts (Array of Structs versus Struct of Arrays).

When comparing the performance of global and hierarchical colouring (shared memory caching) approach, the shared memory approach consistently performed better, since it could exploit the temporal locality in indirectly accessed data by avoiding data races in shared memory with synchronisation within thread blocks rather than different kernel launches.

We also analysed the performance of reordering based on GPS renumbering and partitioning. The former improves global colouring with increasing spatial reuse, while the latter can significantly improve the shared memory approach by increasing data reuse within thread blocks, which results in smaller shared memory and fewer global memory transactions.

We have shown that there is a trade-off between high data reuse and large numbers of thread colours in hierarchical colouring that is especially pronounced when the achieved occupancy is low: the more thread colours a block has, the more synchronisations it will need, the latency of which can be hard to hide when there are few eligible warps.

Using our methods, we were able to achieve performance gains of (Airfoil), (Volna), (Bookleaf), (Lulesh) and (miniAero) over the original implementations. These results significantly advance the state of the art, demonstrating that the algorithmic patterns used in most current implementations (particularly in case of US DoE codes represented by LULESH and MiniAero) could be significantly improved upon by the adoption of two-level colouring schemes and partitioning for increased data reuse.

When carrying out this work, it had become clear that partitioning algorithms in traditional libraries such as Metis and Scotch weren’t particularly well suited for producing such small partition sizes. As potential future work, we wish to explore algorithms that are better optimised for this purpose. The performance of these partitioning algorithms was also low - parallelising this could be another interesting challenge. Finally, we are planning to integrate these algorithms into the OP2 library, so they can be automatically deployed on applications that already use the OP2 library, such as Airfoil, BookLeaf, Volna or Rolls-Royce Hydra.


This paper was supported by the János Bólyai Research Scholarship of the Hungarian Academy of Sciences. The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013).


  1. OpenCFD . OpenFOAM - The Open Source CFD Toolbox - User’s Guide. OpenCFD Ltd.United Kingdom1.4 ed.2007.
  2. Moinier Pierre, Muller JD, Giles Michael B. Edge-based multigrid and preconditioning for hybrid grids. AIAA journal. 2002;40(10):1954–1960.
  3. Biedron Robert T, Carlson Jan-Reneé, Derlaga Joseph M, et al. FUN3D Manual: 13.1. 2017;.
  4. Mavriplis Dimitri J. Parallel performance investigations of an unstructured mesh Navier-Stokes solver. The International Journal of High Performance Computing Applications. 2002;16(4):395–407.
  5. Jin Hao-Qiang, Frumkin Michael, Yan Jerry. The OpenMP implementation of NAS parallel benchmarks and its performance. 1999;.
  6. Nagy Zoltán, Nemes Csaba, Hiba Antal, et al. Accelerating unstructured finite volume computations on field-programmable gate arrays. Concurrency and Computation: Practice and Experience. 2014;26(3):615–643.
  7. Akamine Takayuki, Inakagata Kenta, Osana Yasunori, Fujita Naoyuki, Amano Hideharu. Reconfigurable out-of-order mechanism generator for unstructured grid computation in computational fluid dynamics. In: :136–142IEEE; 2012.
  8. DeVito Zachary, Joubert Niels, Palacios Francisco, et al. Liszt: a domain specific language for building portable mesh-based PDE solvers. In: :9ACM; 2011.
  9. Giles Mike, Mudalige Gihan, Reguly István. OP2 Airfoil Example. 2012;.
  10. OCCA
  11. Remacle J-F, Gandham R, Warburton Tim. GPU accelerated spectral finite elements on all-hex meshes. Journal of Computational Physics. 2016;324:246–257.
  12. Castro Manuel J, Ortega Sergio, Asuncion Marc, Mantas José M, Gallardo José M. GPU computing for shallow water flow simulation based on finite volume schemes. Comptes Rendus Mécanique. 2011;339(2-3):165–184.
  13. Wu Bo, Zhao Zhijia, Zhang Eddy Zheng, Jiang Yunlian, Shen Xipeng. Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on gpu. In: :57–68ACM; 2013.
  14. Fu Zhisong, Lewis T James, Kirby Robert M, Whitaker Ross T. Architecting the finite element method pipeline for the GPU. Journal of computational and applied mathematics. 2014;257:195–211.
  15. Mudalige GR, Giles MB, Reguly I, Bertolli C, Kelly PHJ. OP2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures. In: :1–12IEEE; 2012.
  16. Karlin Ian, Keasler Jeff, Neely Rob. LULESH 2.0 Updates and Changes. LLNL-TR-641973: ; 2013.
  17. miniAero CFD Mini-Application
  18. Norman E. Gibbs Jr., Stockmeyer Paul K.. An Algorithm for Reducing the Bandwidth and Profile of a Sparse Matrix. SIAM Journal on Numerical Analysis. 1976;13.
  19. Karypis George, Kumar Vipin. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing. 1998;20(1):359–392.
  20. Pellegrini François, Roman Jean. Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs. In: :493–498Springer; 1996.
  21. Karypis George, Schloegel Kirk, Kumar Vipin. Parmetis: Parallel graph partitioning and sparse matrix ordering library. Version 1.0, Dept. of Computer Science, University of Minnesota. 1997;.
  22. LaSalle Dominique, Karypis George. Efficient nested dissection for multicore architectures. In: :467–478Springer; 2015.
  23. OP2 github repository
  24. Dutykh Denys, Poncet Raphaël, Dias Frédéric. The VOLNA code for the numerical modeling of tsunami waves: Generation, propagation and inundation. European Journal of Mechanics-B/Fluids. 2011;30(6):598–615.
  25. Uk mini-app consortium
  26. Heroux Michael A, Doerfler Douglas W, Crozier Paul S, et al. Improving Performance via Mini-applications. SAND2009-5574: Sandia National Laboratories; 2009.
  27. Edwards H. Carter, Trott Christian R., Sunderland Daniel. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing. 2014;74(12):3202 - 3216. Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.
  28. Hydrodynamics Challenge Problem, Lawrence Livermore National Laboratory. LLNL-TR-490254: ; .
  29. NVIDIA Corp.. NVIDIA Tesla P100. v01.1: ; 2016.
This is a comment super asjknd jkasnjk adsnkj
The feedback cannot be empty
Comments 0
The feedback cannot be empty
Add comment

You’re adding your first comment!
How to quickly get a good reply:
  • Offer a constructive comment on the author work.
  • Add helpful links to code implementation or project page.