Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs

Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs

Abstract

The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient implementation for massively parallel computing, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third generation nVidia GPU hardware, also known as ‘Kepler’. We provide a review of previous optimisation strategies and analyse data read/write times for different memory types. In LBM, the time propagation step (known as streaming), involves shifting data to adjacent locations and is central to parallel performance; here we examine three approaches which make use of different hardware options. Two of which make use of ‘performance enhancing’ features of the GPU; shared memory and the new shuffle instruction found in Kepler based GPUs. These are compared to a standard transfer of data which relies instead on optimised storage to increase coalesced access. It is shown that the more simple approach is most efficient; since the need for large numbers of registers per thread in LBM limits the block size and thus the efficiency of these special features is reduced. Detailed results are obtained for a D3Q19 LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter case the use of a read-only data cache is explored, and peak performance of over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The appearance of a periodic bottleneck in the solver performance is also reported, believed to be hardware related; spikes in iteration-time occur with a frequency of around 11Hz for both GPUs, independent of the size of the problem.

1 Introduction

The implementation of Lattice Boltzmann method (LBM) solvers on Graphics Processing Units (GPUs) is becoming increasingly popular due to the intrinsic parallelizable nature of the algorithm. A growing literature exists in this area, though with frequent hardware changes there is a need to constantly review the means of obtaining optimal performance. As a derivative of Lattice Gas Cellular Automata (LGCA), LBM defines macroscopic flow as the collective behaviour of underlying microscopic interactions [?]. The LBM improves upon LGCA by describing each point in space using a mesoscopic particle distribution function rather than an individual particle, reducing statistical noise within the method. LBM has been used extensively in literature over the past decade and is now regarded as a powerful and efficient alternative to the classical Navier-Stokes solvers (see [?] for a complete overview). Despite early suggestions in the literature to the contrary, Shan et al. [?] formally demonstrated how LBM can return exact Navier-Stokes even for high Reynolds and high Mach number flows.

Algorithmically, the method consists of a local relaxation (‘collide’) and a linear advection (‘stream’) of a number of discrete components of the distribution function, rendering the method highly parallelizable. Implementation of the LBM on GPUs can be traced back to 2003, when Li et al. [?] mapped a 2D LBM algorithm to texture and rasterization operations within the graphics pipeline. Since then, a variety of both two and three dimensional models have been implemented and GPU based LBM algorithms have been proposed for a range of applications; e.g. free surface [?], thermal [?], biomedical [?].

Another early attempt was made by Ryoo et al. [?], who tested a CUDA port of a simple LBM code from the SPEC CPU2006 benchmark [?]. This was followed by a more in-depth optimization reported by Tölke and Krafczyk [?], who implemented a 3D model with a reduced 13 component distribution function on G80 generation GPUs, specifically designed to increase maximum throughput. They used shared memory to avoid costly misaligned access to the RAM of the GPU when performing advection of the distribution function (although in general, using less components of will reduce accuracy). A more complex split propagation method was proposed by [?], in which the data is first shifted along the contiguous dimension within shared memory for each block, before the perpendicular shifts are performed in global memory1. This approach demonstrated high levels of efficiency, but necessitated careful handling of data entering/leaving each block. Habich et al. [?] extended this work to the D3Q19 model (the same model presented in this work).

Obrecht et al. [?] identified that, contrary to previous attempts to avoid misalignment at all cost, the cost of a misaligned access in shared memory was actually similar to that caused by a global memory exchange; thus they proceeded to investigate the potential for performance improvement brought about by avoiding the use of shared memory altogether. Indeed, previous advances in nVidia hardware, first from compute capability 1.0 to 1.3, and then on to 2.0, substantially improved the efficiency of misaligned memory transactions; this had the important consequence that the use of shared memory was no longer so crucial. Furthermore, [?] reported the cost of a misaligned read to be less than a misaligned write; an observation leads one to prefer the ‘pull’ algorithm to the ‘push’, as discussed in Section 4. A peak performance of 516 Million Lattice Updates Per Second (MLUPS)2 was reported with a maximum throughput of 86%. In addition, it is noted that previous works using Shared Memory were highly optimised, and thus obtaining substantial extra performance would not be trivial. More importantly, the implementation of more complex code to handle e.g. multiple distribution functions or extra body force terms would only be possible with a significant reduction of performance. Indeed, the present work is building towards an efficient GPU implementation of the Immersed Boundary Method with LBM reported in [?], and thus this reasoning is of high relevance to our work (see [?]).

A comprehensive series of further work by Obrecht et al. [?] focused on compute 2.x capable hardware in their development of a multi-GPU implementation of a Hybrid thermal LBM scheme based on D3Q19, and did not make use of shared memory. Later, Habich et al. [?], also presented implementations for ‘Fermi’ generation GPUs without the use of shared memory.

In the present work, particular attention is paid to a comparison of three methods of performing the advection operation; the first which performs the propagation directly in the GPUs RAM (DRAM), a second that utilizes a shared memory space as an intermediate buffer and a third, new, method that performs the propagation locally within a group of 32 threads without using any intermediate memory using the shuffle instruction new to GPUs based on the Kepler architecture[?]. The results of this comparison are then used to implement an efficient 3D LBM solver on first and second generation Kepler GPUs (compute capabilities 3 and 3.5 respectively). In the case of second generation Kepler GPUs a read-only cache is also enabled to provide a small, but measurable, improvement in performance. In the following sections the architecture and programming model for CUDA based GPUs is introduced, before the mathematical description and specific form of the LBM used is given. Analysis of the three propagation techniques is then performed, along with consideration of other implementation aspects and estimation of the maximum achievable performance of a LBM solver based on memory requirements.

2GPU Computing

Figure 1: Typical GPU and CPU architectures
Figure 1: Typical GPU and CPU architectures
Figure 2: Blocks & Threads in CUDA .
Figure 2: Blocks & Threads in CUDA .
Figure 3: Typical Kepler GPU Architecture
Figure 3: Typical Kepler GPU Architecture

Modern GPUs use a Unified Shader Architecture [?], in which blocks of processing cores (capable of performing operations within all parts of the graphics pipeline) are favored over hardware in which the architecture matches the flow of the graphics pipeline. For the purpose of this paper it is sufficient to note that the graphics pipeline takes the vertices of triangles as inputs and performs several vertex operations, such as spatial transformations and the application of a lighting model, to create a ‘scene’. The scene is then rasterized to create a 2D image which can then have textures placed over the component pixels to create the final image. For a more comprehensive introduction to the graphics pipeline and how older GPUs matched their architecture to it see [?].

The architecture of a generic Unified Shader based GPU is shown alongside that of a generic CPU in Figure 1. Processing cores can be seen arranged into rows with small amounts of cache and control hardware; the combination of all three is known as a Streaming Multiprocessor (SMX). Comparison with a generic CPU highlights the following key differences:

  1. GPUs sacrifice larger amounts of cache and control units for a greater number of processing cores.

  2. The cores of a SMX take up less die space than those of a CPU, and as a result are required to be more simple in design.

These attributes render the GPU suitable for performing computation on large datasets where little control is needed, i.e. the same task is performed across the entire dataset. Indeed, it is this aspect which has generated interest in GPU computing for the Lattice Boltzmann Method, as identical independent operations are performed across the majority of the fluid domain, boundary conditions being the obvious exception.

2.1 Threads and Blocks of Processing

Code written for nVidia GPUs is generally parallelized at two levels; the computation is divided into blocks which contain component parallel threads (see Figure 2). A single block of threads is allocated to a SMX at any one time, with the component threads divided into groups of 32 called ‘warps’. The threads within a warp are executed in parallel, and all threads within the warp must execute the same instruction or stall. Figure 2 shows an example of threads and blocks being allocated two-dimensionally; the division of threads and blocks can be performed in n dimensions, depending on the problem.

2.2nVidia Kepler Architecture GPUs

In this paper two Kepler architecture GPUs are tested; the K5000M and the K20c. Details of the general hardware layout for both devices is provided in Figure 3. The K5000M is based on the first generation GK104 Kepler architecture, and as such is limited in it’s double precision performance, with a ratio of 24:1 between single and double precision peak performance. Processing cores are grouped into blocks of 192, known as Streaming Multiprocessors (SMX), in order to share access to 64KB of configurable cache/shared memory, and four instruction schedulers capable of dispatching two instructions per clock cycle3. Seven SMXs make up the K5000M, and these SMXs share access to 512KB of L2 cache and 4GB of DRAM. In the configuration used in this paper, the K5000M is clocked at 601MHz, giving a theoretical peak single precision performance of 1.62 TFLOPS. Peak DRAM bandwidth was measured as 64.96GB/sec using the benchmarking program included in the CUDA SDK.

The K20c is based on the newer GK110 architecutre, which is largely the same as a GK104, the most significant difference being extra double precision units within each SMX to improve the single/double precision performance ratio to 3:1. A K20c contains 13 SMXs, which share access to 1.25MB of L2 cache and 4.7GB of DRAM. The theoretical peak single precision performance of the configuration used in this work is 3.5 TFLOPS, and measured bandwidth was 157.89GB/sec.

3 The Lattice Boltzmann Method

3.1 Numerical method

In this study the Lattice Boltzmann Method is used to simulate fluid flow, this method is based on microscopic models and mesoscopic kinetic equations; in contrast to Navier-Stokes which is in terms of macroscale variables. The Boltzmann equation for the probability distribution function is given as follows:

where are the space coordinates, and is the particle velocity. The collision operator is simplified using the ‘BGK’ single time relaxation approach found in [?], in which context, it is assumed that local particle distributions relax to an equilibrium state, in time :

The discretized form of Equation 1 is obtained via Taylor series expansion following [?], as shown in Equation 3, where refers to the discrete directions of . The dimensionality of the model and spatial discretization of is given in the ’DmQn’ format, in which the lattice has m dimensions and has n discrete directional components. In the current work the D3Q19 model is used, in which the discrete velocity is defined according to Eqn. ? and is visualized in Figure 4. Since spatial and temporal discretization in the lattice are set to unity, the lattice speed .

Figure 4: The D3Q19 Lattice.
Figure 4: The D3Q19 Lattice.

The equilibrium function can be obtained by Taylor series expansion of the Maxwell-Boltzmann equilibrium distribution [?] :

In Equation 4, is the speed of sound and the coefficients of are , and .

Macroscopic quantities (moments of the distribution function) are obtained as follows:

The multi-scale expansion of Equation 3 neglecting terms of O() and using Eqns. Equation 5 and Equation 6 returns the Navier-Stokes equations to second order accuracy [?], allowing the LBM to be used for fluid simulations.

4Implementation

4.1Memory Arrangement

The present solver is parallelized such that one thread will perform the complete LBM algorithm at one spatial location in the fluid domain. Each thread stores values of , , u and information about whether or not the current location is a boundary in a struct within register space to minimize high latency access to DRAM once initially loaded. Within DRAM it is common practice to ‘flatten’ multiple dimension arrays into a single dimension, as the extra de-referencing operations required add to the already large latency of accessing off-chip memory. This is extended to combining and flattening the components of f into a single array, such that is addressed as f[i*Nx*Ny*Nz+z*Ny*Nx+y*Nx+x]. Storing f in order of i and then by spatial coordinates will cause neighbouring threads within a warp to access contiguous memory in f. If these accesses are aligned within a 128-byte segment (see Section ?) the data transactions can be grouped into a single larger transaction; i.e. resulting in a coalesced access.

The ordering of the data in is known as storing in a ‘Structure of Arrays’ (SoA) format; without the collapsing of the different components of into a single array this can be seen as a structure containing 19 arrays, each one corresponding to all of the spatial locations of one component of . The opposite ‘Array of Structures’ (AoS) arrangement is shown in Figure 5 for clarity, which corresponds to an a single array with one element per spatial location, each containing a data structure to store the 19 components of . Once has been read into a core, an analogy can be drawn between this format and the ‘array’ of GPU cores, each containing their own local structure. While AoS is shown to be preferable for serial CPU implementations [?], SoA is necessary to improve coalesced access to global memory with GPU versions.

Figure 5: Arrays of Structures and Structures of Arrays.
Figure 5: Arrays of Structures and Structures of Arrays.

Without the presence of macroscopic values, a single point in the lattice requires 19 loads from global memory, and 19 stores back to global memory during an iteration of the LBM algorithm 4. In single precision this yields 152 bytes of data to be transferred for each lattice point. Using the measured bandwidth of 65GB/sec for the K5000M from Section 2.2, the theoretical limit is 459 MLUPS. For the K20c this limit is 1115 MLUPS. If macroscopic values are required, then an extra four storage operations are needed and the maximum theoretical performance drops to 415 MLUPS (1009 MLUPS for the K20c).

4.2Independent LBM algorithm

The conventional LBM is typically broken down into several steps as described in Algorithm ?, which is typically the form taken for CPU implementations. This algorithm poses some locality problems if it is to be used in a highly parallel fashion. If a thread is launched for each location then the non-local operation will require a synchronization across the domain (as shown) before the boundary conditions, and u can be calculated.

Instead, one of the two re-orderings presented in [?] can be used. These are described as ‘push’ or ‘pull’ algorithms, depending on whether the streaming operation (which causes misaligned access to DRAM, see Section ?) occurs at the end () or beginning () of the algorithm. Both algorithms remove the need for an additional synchronization by placing the synchronization point at the end of an iteration, where a synchronization implicitly occurs as the loop (or kernel when implemented in CUDA) across the domain exits. This also eliminates the requirement to store and u in DRAM, unless they are required for post-processing, as they are only used in enforcing the boundary conditions and the calculation of .

4.3Read/Write Memory speed

As stated in the Introduction, [?] examined the cost of misaligned reads and writes for compute 1.3 devices and reported that the former were more efficient than the latter; motivating their preference for the ‘pull’ algorithm. In what follows, we provide results of a similar experiment for the more recent compute 3.0 and 3.5 devices. aligned and misaligned read and writes to several large vectors were performed to mimic the behaviour of the ‘push’ and ‘pull’ algorithms, and the DRAM bandwidth achieved is measured.

Multiple vectors are used to provide improved Instruction Level Parallelism (ILP), which is a strategy to mask some memory latency by allowing a single thread to launch several independent memory requests before previous requests have returned [?]. This helps make use of the dual-instruction dispatching feature of each warp scheduler; two instructions can only be dispatched in a single clock cycle only if they are from the same warp and the instructions are independent. In the full implementation of the LBM solver ILP is used across the various components of when streaming, and also when performing the collision operation. For compute 3.5 devices the read-only data cache (previously only accessible through the use of textures) is also considered.

Figure 6: Effective bandwidth on K20c for aligned/misaligned reads and writes
Figure 6: Effective bandwidth on K20c for aligned/misaligned reads and writes
Figure 7: Effective bandwidth on K20c for aligned/misaligned reads and writes
Figure 7: Effective bandwidth on K20c for aligned/misaligned reads and writes

Figure ? provides a comparison of read and write bandwidth on the K5000m, for both coalesced and misaligned access to data. Figure 7 displays the corresponding results for the K20c, in addition to times for Read-Only data access. Results show that misaligned reads incur a smaller penalty than misaligned writes when compared with their aligned access counterparts on the K5000m, and are therefore more efficient (99% of aligned access bandwidth versus 96%). On the K20c the read-only cache is required to maintain the efficiency of misaligned reads, efficiency drops to 88% without it. The use of read-only cache does have a small detrimental effect on the performance of aligned reading from memory due to the overhead of passing through extra hardware. Overall, it is clear that the pull algorithm is preferable.

The re-ordered pull algorithm is used, following [?] and [?] and shown in Algorithm ?. The pull algorithm takes it’s name from the operation , which is used instead of ; i.e. data is loaded directly into it’s new location.

Register Usage for LBM

A single SMX in compute 3.x devices contains 65536 registers, and so the trade off between block size and grid size is best understood as follows:

The D3Q19 solver presented in this chapter is uses approximately 45 registers/thread, depending on the boundary conditions imposed.

If a block size of 1024 were used, a total of 45K threads would be needed for each block, and thus only one block could be launched, causing the remaining 20K registers to go unused. The authors in [?] recommend block sizes of no more than 256 threads for this reason.

Access Patterns for The Streaming Operation

Access to global memory is most efficient when threads within a warp access data in the same 128-byte segment, when this occurs the 32 requests from each thread in the warp are coalesced into a single request. Within the streaming operation, alignment to a 128-byte segment is dependent on the value of the the component of . When it is zero these coalesced accesses are guaranteed, as propagation of values in the and directions move access to a different segment without any data misalignment. When the component of has a non-zero value, misaligned access to the segments will occur, and two memory requests will be required per warp; one to load values from 31 addresses in the same segment and a second to load a value from the previous (if ) or next (if ) segment.

Figure 8: Using shuffle instruction to propagate values
Figure 8: Using shuffle instruction to propagate values
Figure 9: Using shuffle instruction to propagate values
Figure 9: Using shuffle instruction to propagate values
Figure 10: Using shuffle instruction to propagate values
Figure 10: Using shuffle instruction to propagate values
Figure 11: Effective bandwidth on K20c for offest-by-one DRAM reads
Figure 11: Effective bandwidth on K20c for offest-by-one DRAM reads
Figure 12: Effective bandwidth on K20c for offest-by-one DRAM reads
Figure 12: Effective bandwidth on K20c for offest-by-one DRAM reads

Current work on LBM solvers for GPUs either accepts this extra memory transaction (see, for example [?]), or utilises shared memory to perform propagation in the x direction, either performing global memory accesses to propagate values between blocks or matching the x dimension of the domain to the size of a block [?].

New instructions present within Kepler architecture GPUs allow for propagation within a warp without the use of shared memory, and only require shared memory to propagate values between warps (See Figure 10). In order to determine the most efficient method for use within an LBM solver, the performance tests in Section 4.2 are repeated for misaligned reading using the three different methods. Figures ? and Figure 12 present this comparison for both hardware derivatives, in which aligned reads from DRAM are also shown to provide a reference point, as they represent optimal memory behaviour. For both GPUs simple misalignment in DRAM is observed to be the more efficient method, achieving at least a 7.6% improvement over shared memory and 4.7% improvement over shuffle memory on a K5000m, rising to 17.6% and 17.1% on a K20c. The extra synchronization and intermediate registers required for the use of shared memory or the shuffle instruction lowers the achieved bandwidth. Still, for reasonable block sizes (block sizes much above 256 would be impossible to implement in a full LBM code due to the number of registers consumed) the shuffle instruction slightly outperforms the use of shared memory alone.

5Validation of the LBM code

To validate the LBM solver, a lid driven cavity case was performed at three Reynolds numbers {100, 400, 1000} and compared against reference data from [?]. In each case a cubic domain is created with a non-zero x velocity on the top wall and non-slip conditions on every other wall. The boundary conditions from [?] are used to create the stationary and moving boundaries, and scaling is controlled according to

Figure ? displays profiles of velocity extracted from the centre of the 3D domain and the resolutions used. For higher Reynolds number computations the domain sizes are increased to reflect the higher levels of resolution required. In all cases the results demonstrate a convergence for subsequent increase in lattice size and reference results are shown to be reproduced.

Figure 13: Re=100
Figure 13: Re=100
Figure 14: Re=400
Figure 14: Re=400
Figure 15: Re=1000
Figure 15: Re=1000

6Performance

One hundred iterations of the lid driven cavity test case were performed over a variety of block sizes for domains up to a size of in single precision on K5000m and K20c GPUs, with the domain size limited by DRAM size. The blocks are kept as dimension dominant as possible to facilitate improved cache use, with the obvious exception that threads must not be allocated to indices outside of the computational domain. In this event multiple rows of the highest common denominator between the domain and the desired block size are used. Tables Table 1 and Table 2 show the mean performance and standard deviation of the 100 iterations in MLUPS. Peak performance is 420 MLUPS on the K5000m and 1036 MLUPS on the K20c.

Figure ? displays the performance of the present LBM solver compared to implementations on previous hardware generations found in work by Obrecht et al.[?] (compute 1.3), Rinaldi et al.[?] (compute 1.3) and Astorino et al.[?] (compute 2.0), which are also tabulated in Table 3. All results are reported for single precision calculations and include the calculation of boundary condition values.

Figure 16: Performance scaled for bandwidth
Figure 16: Performance scaled for bandwidth
Figure 17: Performance scaled for bandwidth
Figure 17: Performance scaled for bandwidth

Figure 17 scales the performance relative to the measured bandwidth of the device (theoretical bandwidth is used for Astorino et al. as a measured bandwidth is not reported) and it can be seen that the present implementation is as efficient (slightly more in the case of the K5000M) than the current best implementation found elsewhere in literature. It is worth noting that, following the examination of the use of shared memory in section ?, the implementations found in Astorino et al [?] and Rinaldi et al [?] both make use of shared memory.

The L2 cache is used exclusively within the LBM solver when misaligned accesses in the stream operation occur, every other operation is fully coalesced and has no re-use of fetched data. The hit rate for L2 cache will therefore depend on the ratio of misaligned to aligned accesses in the stream operation, which will be constant, except for misaligned accesses that would fall outside the boundaries of the domain, and are therefore not performed. This non-constant reduction in misaligned accesses (and therefore L2 cache use) is proportional to the ratio of boundary points to interior domain points, and will tend towards zero (and therefore constant L2 cache use) as the dimension increases and the ratio of misaligned to aligned accesses decreases, as shown in Figure 18.

Figure 18: L2 cache hit rates
Figure 18: L2 cache hit rates

Analysing the performance results of solver reveals interesting behaviour in the variation of performance. One would expect truly random variation to manifest itself in the form of a Gaussian-like distribution centred about the mean value. Whilst, at smaller domain sizes this is true, the vast majority of results exhibit periodic performance variation. A fourier analysis of the results was performed to ascertain the amount of periodic contribution to performance variation, and the frequency at which it occurs. Below sample results for a domain of size are presented.

Figure 19: Fourier analysis - domain size = 192^3
Figure 19: Fourier analysis - domain size =

As Figure 19 shows, there is clearly a frequency component to the variation in performance. In this case the maximum contribution is found at a frequency of 11.71Hz. Performing the same analysis across all of the test cases yields the following results:

No concrete explanation for this variation in performance can be provided, as it is likely a hardware level issue. Likely candidates are power and/or thermal management strategies of the GPU. The low-level analysis required to determine the cause of this variation is beyond the scope of the present work, although further study of this behaviour should be conducted as it has implications for all GPU programming.

7Conclusion

This work has demonstrated the optimization and validation of a 3D GPU-based Lattice Boltzmann solver on Kepler architecture GPUs. The use of shared memory, and an intrinsic memory-less intra-warp shuffle operation, have been shown to be ineffective at improving the performance of the memory-intensive streaming operation, in spite of the fact that their use increases the number of coalesced accesses to DRAM. Instead, a ‘naive’ implementation that has misaligned access to DRAM is found to be more effective due to its lower register usage and no need for any additional control flow. Using this information an efficient Lattice Boltzmann solver was programmed and benchmarked on GK104 and GK110 generation GPUs, achieving a peak performance of up to 1036 MLUPS on GK110 GPUs. The findings of this work have already been applied in the design of an interactive two dimensional LBM solver, where the high performance of the fluid solver allows boundary conditions to be captured from real-world geometry using an infrared depth sensor while maintaining real-time fluid flow evolution and visualization [?].

Table 1: Mean performance and in MLUPS for the 3D LBM solver on K5000m

Block Size

32

321.74 331.07 344.9 343.37 346.63 348.17 358.02
0.64 0.21 0.1 0.05 0.03 0.08 0.03

64

419.56 416.49 413.47 401.63 413.29 392.25 412.17
2.64 1.31 1.3 0.91 0.85 0.72 0.67

96

416.09 418.21 411.79 400.82 413.84 391.49 409.74
2.29 1.36 1.18 0.94 0.94 0.72 0.66

128

419.93 416.07 415.15 400.76 413.51 391.73 413.01
2.49 1.47 1.37 0.88 0.86 0.71 0.7

160

416.18 416 412.04 402.63 411.97 391.68 409.59
2.38 1.31 1.17 0.89 0.93 0.71 0.61

192

420.38 418.89 413.81 400.33 415.62 392.66 410.82
3.17 1.35 1.46 0.89 0.96 0.71 0.7

224

415.23 415.96 412.41 400.68 412.06 390.92 409.93
2.62 1.3 1.29 0.89 0.84 0.71 0.64

256

420.36 416.6 416.27 400.75 414.52 394.93 414.47
2.37 1.28 1.57 0.93 0.96 0.74 0.7
Table 2: Mean performance and in MLUPS for the 3D LBM solver on K20c

Block Size

32

674.75 701.66 725.22 714.54 723.45 722.35 739.12
1.32 0.61 1.21 0.23 0.3 0.14 0.11

64

909.79 1006.73 979.22 937.35 1036.41 944.2 991.28
3.48 2.02 1.07 1.92 1.5 1.65 0.88

96

904.91 981.86 957.39 918.37 1008.64 990.82 962.44
4.64 1.39 1.76 1.78 0.5 3.67 1.1

128

912.9 1007.04 981.79 935.4 1035.36 997.7 990.23
3.62 2.22 1.43 1.96 0. 1.74 1.09

160

913.96 962.88 986.46 912.16 985.47 987.15 1001.31
3.17 1.29 3.6 1.6 0.77 0.56 4.9

192

905.86 957.74 974.51 911.49 977.51 972.65 996.21
2.67 0.69 0.81 2.3 0.29 0.4 0.61

224

869.78 893.35 949.74 900.33 936.42 938.65 961.36
2.24 0.75 1.1 2.66 0.31 0.44 0.64

256

902.12 948.44 990.39 912.45 974.05 977.94 1020.59
3.78 1.2 1.22 1.63 0.41 0.31 0.7
Table 3: Performance expressed in MLUPS for K5000m and K20c GPUs, compared against existing work

Domain Size

K5000m K20c Astorino(2011) Obrecht(2011) Rinaldi(2012)
420 914 273 471 273
419 1007 296 512 296
416 990 314 481 314
403 937 313 482 313
Table 4: Performance normalized against DRAM bandwidth (MLUPS/GB/sec)

Domain Size

K5000m K20c Astorino(2011) Obrecht(2011) Rinaldi(2012)
6.4714 5.7886 1.539 5.233 2.579
6.4485 6.3781 1.669 5.689 2.726
6.4081 6.2727 1.77 5.344 2.284
5.1981 5.9368 1.764 5.356 2.61
Table 5: Frequency components of 3D LBM performance on K5000m

Domain Size

Peak Frequency Component (Hz)

K5000M 11.28 15.98 9.41 11.87 11.71 11.14 11.36
K20c N/A 10.72 9.445 11.15 11.01 11.31 11.56

Footnotes

  1. This paper also provides a useful introduction to LBM implementation in CUDA.
  2. This is a common performance measure based on the number of lattice points that can be updated every second.
  3. Provided the instructions are from the same warp and independent in nature.
  4. The stores are actually performed on a redundant copy of to ensure data dependency is not violated, at the end of each timestep the pointers to the original and redundant copy of are swapped.
10185
This is a comment super asjknd jkasnjk adsnkj
""
The feedback cannot be empty
Submit
Cancel
Comments 0
""
The feedback cannot be empty
   
Add comment
Cancel

You’re adding your first comment!
How to quickly get a good reply:
  • Offer a constructive comment on the author work.
  • Add helpful links to code implementation or project page.