Instructions’ Latencies Characterization for NVIDIA GPGPUs

Abstract

The last decade has seen a shift in the computer systems industry where heterogeneous computing has become prevalent. Nowadays, Graphics Processing Units (GPUs) are in a variety of systems, from supercomputers to mobile phones and tablets. They are not only used for graphics operations but rather as general-purpose special hardware (GPGPUs) to boost the performance of compute-intensive applications. However, the portion of hardware characteristics that remains undisclosed beyond what vendors provide is not small. In this paper, we propose a very low overhead and portable analysis for exposing the hidden latency of each individual instruction executing in the pipeline and the different access latencies of the various memory hierarchies at the microarchitecture level. We also show the impact of the optimizations the CUDA compiler can apply on the various latencies. We run our evaluation on seven different high-end NVIDIA GPUs from five different generations/architectures, namely: Kepler, Maxwell, Pascal, Volta, and Turing. We believe that this work would help architects obtain an accurate characterization of the latencies of these GPUs, which would subsequently help in modeling the hardware accurately. In addition, it would make application developers more aware of how to optimize their applications.

GPGPUs, Latency, PTX, Benchmarking, High-Level Optimizations, Turing, CUDA

I Introduction

Graphics Processing Units (GPUs) were originally designed to accelerate graphics operations. Yet, nowadays they have become one of the most crucial hardware components of computing systems. Over the last decade, GPUs have evolved into powerful co-processors that perform general, non-specialized calculations that would typically be performed by the CPU. Thus, General Purpose Graphics Processing Units (GPGPUs) are now a fundamental component in any high-performance computing (HPC) system due to the high ability of these architectures to perform complex computations efficiently. The emergence of AI, machine learning, deep learning, and Bitcoin mining has pushed the popularity and versatility of GPGPUs far beyond gaming. According to the recent ranking of the top 500 most powerful non-distributed computer systems in the world (TOP500 List) [1], 56 percent of the additional flops were a result of NVIDIA Tesla GPUs running on those supercomputers. This is mainly due to the high computational power of recent GPUs. For instance, the NVIDIA Tesla V100 GPU is capable of delivering peak computational rates of 7.8 TFLOPS for double-precision floating-point (FP64) performance and 15.7 TFLOPS for single-precision (FP32) performance.

Over the last decade, NVIDIA has introduced seven different GPU generations/architectures [2, 3, 4, 5, 6, 7, 8]. Each architecture has its own microarchitecture and hardware characteristics. However, the portion of these characteristics that remains undisclosed beyond what GPU vendors have documented is not small. Hence, researchers have proposed different micro-benchmarks written in programming languages such as CUDA [9] or OpenCL [10] to understand the hidden characteristics of the hardware for almost every GPU generation/architecture [11, 12, 13, 14, 15]. Similarly, there are several works that develop assembly tool-chains providing direct access to the hardware using real machine-dependent opcodes [16, 17, 18, 19]. These tool-chains usually provide more accurate results than micro-benchmarks written in a relatively high-level language such as CUDA, as they rely on low-level assembly language, but they are not portable across different generations of GPUs.

With each release of a new generation, a new version of the CUDA (nvcc) compiler [20] is usually released. NVIDIA has been constantly improving the CUDA compiler in terms of the techniques used to optimize the code. One type of code optimization is machine-dependent optimization, which is performed after the target code has been generated, when the code is transformed according to the target machine architecture. These optimizations affect the execution of the individual instructions found in the ISA.

In this paper, we propose a low overhead and portable analysis to demystify the latency of the different instructions executing in the pipeline and of the different memory hierarchies found in various NVIDIA GPUs. We use parallel thread execution (PTX) [21] to perform our analysis. PTX is a pseudo-assembly language used in NVIDIA’s CUDA programming environment. Alternatively, PTX can be described as a low-level parallel thread execution virtual machine that provides a stable programming model and instruction set for general-purpose parallel programming. PTX provides a machine-independent ISA; thus, the code is portable across different CUDA runtimes and GPUs. Using an assembly-like language such as PTX allows us to control the exact sequence of instructions executing in the pipeline and the type of memory being accessed with very low overhead. Since compiler optimizations affect the instructions, we also show the effect of the CUDA compiler optimizations on the execution of all instructions.
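As a brief illustration of the level of control an assembly-like layer gives, the sketch below embeds a single PTX add.u32 instruction in a CUDA kernel through inline assembly. This is only a minimal, hypothetical example (the kernel and variable names are ours, not from the paper); the measurements in this work are written as standalone PTX kernels, as shown later in Section IV.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Minimal sketch: a single PTX add.u32 emitted exactly as written via inline
    // assembly. Names (addOne, d_out) are illustrative only.
    __global__ void addOne(unsigned int *d_out, unsigned int a, unsigned int b) {
        unsigned int c;
        asm volatile("add.u32 %0, %1, %2;" : "=r"(c) : "r"(a), "r"(b));
        *d_out = c;
    }

    int main() {
        unsigned int *d_out, h_out;
        cudaMalloc((void **)&d_out, sizeof(unsigned int));
        addOne<<<1, 1>>>(d_out, 2u, 3u);
        cudaMemcpy(&h_out, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
        printf("%u\n", h_out);  // expected output: 5
        cudaFree(d_out);
        return 0;
    }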

To the best of our knowledge, no prior work provides a detailed and exhaustive analysis of the latencies of the different GPU instructions. Moreover, no prior work discusses the effect of compiler optimizations on every single instruction executing in the pipeline. For this reason, we believe that this work is important, especially with the aggressive emergence of various technologies that rely on GPUs. There are multiple reasons why this characterization is important. First, it can give programmers a more concrete understanding of the underlying hardware. Knowing the underlying microarchitecture helps GPU developers optimize their applications’ performance. Since the execution time of each kernel determines the application’s overall performance, the programmer needs to be concerned with the execution time of every single instruction when writing high-performance code. Hence, it is critical to utilize hardware resources efficiently in order to achieve high performance. Second, GPU software modeling frameworks and cycle-accurate simulators [22, 23, 24] depend on published instruction latencies in order to have an accurate model. Volkov [25] argued that inaccurate arithmetic instruction latencies, whose errors are small but may accumulate to large numbers, can have a high impact on the accuracy of the performance estimated by these models. Because no work in the literature provides an in-depth characterization of GPU instruction latencies, researchers have had to collect the latencies along with other specifications from less academic sources such as graphics card databases and online reviews, especially for newer generations such as Pascal [6] and Volta [7]. Third, knowing the effect of CUDA compiler optimizations on the instructions can guide GPU architects and code developers in choosing which type of optimization is needed and when.

Contributions. To summarize, this paper makes the following contributions:

  • We provide a low-overhead and portable method to estimate the latencies of GPU instructions as they go through the pipeline. In addition, we show the overhead of accessing each level of the memory hierarchy in modern GPUs.

  • We demonstrate the effect of high-level optimization levels found in the CUDA (nvcc) compiler on different instructions.

  • We provide an exhaustive comparison of all the instructions found in the PTX ISA.

  • We run our evaluation on seven different high-end NVIDIA GPUs from five different GPU generations, including the recently released Turing architecture.

The rest of this paper is organized as follows: Section II discusses relevant related work; whereas, Section III explains the general architecture of NVIDIA GPUs; Section IV shows our methodology; while Section V shows our results; and finally, Section VI concludes the paper.

II Related Work

Fig. 1: Typical NVIDIA GPU architecture. The number of SMXs and the computational resources inside them vary with the generation and the compute capability of the GPU.

Studying the hardware microarchitecture to uncover its hidden characteristics has been an active area of research for many years. Several micro-benchmarks were designed with the aim of dissecting the underlying CPU or GPU architecture. Furthermore, various studies have looked into tuning the application’s source code to achieve better performance [26, 27, 28], but this task is tedious and requires a deep understanding of the underlying architecture. Hence, simulators, profilers, and optimization tools [22, 29, 30, 31, 32, 33] were introduced to aid the architecture design space exploration. In this section, we discuss some of the related work in these areas in more detail.

Micro-benchmarks: Wong et al. [11] used micro-benchmarking to measure the latencies of some instructions and the characteristics of the TLB and caches of an early NVIDIA Tesla generation GPU (GeForce GT200). The work in [34] measured the GPU kernel start-up costs and arithmetic throughput to optimize dense linear algebra on the (GeForce 8800 GTX) GPU, which was released in 2006. In [12], the authors investigated the memory hierarchy of three different NVIDIA GPU generations, targeting their cache mechanisms and latencies. Jia et al. [35] studied the microarchitectural details of the NVIDIA Volta (Tesla V100) GPU architecture through micro-benchmarks and instruction set disassembly. The authors of [36] used four different NVIDIA GPU generations to study the relevance of data placement optimizations for the different GPU memories.

In summary, our work has much lower overhead compared to micro-benchmarking approaches, similar to the assembly tool-chains [16, 17, 18, 19] introduced in Section I. Hence, the results are more accurate. In addition, the same code runs across different NVIDIA GPU generations without sacrificing ease of use or accuracy. On the other hand, micro-benchmarks written in CUDA need to be designed specifically for each architecture and need to be updated manually with the emergence of each newer GPU generation.

Compiler optimizations: Chakrabarti et al. [37] described the effect of some CUDA compiler optimizations on computations written in CUDA running on GPUs. In [38], the authors applied auto-tuning techniques to CUDA compiler parameters using the OpenTuner [39] framework and compared the optimizations achieved by auto-tuning with the high-level optimization levels (-O0, -O1, -O2, and -O3) found in the compiler. Yang et al. [40] proposed a GPU compiler framework that focuses on optimizing the memory usage of the application. They tested their framework on older NVIDIA GPUs (GeForce 8800 GTX and GeForce GTX 280).

In summary, we follow the same line of research, but we focus on the effect of the high-level optimization levels found in the CUDA compiler on individual instructions executing in the pipeline and on the access overhead of the different memories found in modern GPUs.

III NVIDIA GPU Architecture Overview

A typical heterogeneous compute node nowadays consists of multicore CPU sockets connected to one or more GPUs. A GPU is currently not a standalone platform but rather a co-processor hosted by a CPU. Figure 1 shows a typical GPU architecture. The host (CPU) and the GPU board are both connected to a PCIe bus. This means that the CPU sees the GPU as a PCIe device; thus, it can allocate and transfer data to specific areas of the device memory. This includes the global, constant, and texture memories in CUDA terminology.

The GPU architecture is built around an array of Streaming Multiprocessors (SMX), each of which can be seen as a standalone processor that can manage thousands of concurrent threads in a single-instruction, multiple-threads (SIMT) fashion. Each SMX has a number of CUDA cores containing fully pipelined integer Arithmetic Logic Units (ALUs) and floating-point units (FPU32), each capable of executing one 32-bit integer or floating-point operation per cycle. It also includes Double-Precision Units (DPU) for 64-bit computations, Special Function Units (SFU) that execute intrinsic instructions, and Load/Store units (LD/ST) for the calculation of source and destination memory addresses. In addition to the computational resources, each SMX is coupled with a certain number of warp schedulers, instruction dispatch units, and instruction buffer(s), along with texture and shared memory units.

CUDA Memory Model. Both GPUs and CPUs use similar principles in memory hierarchy design. The key difference is that in GPUs the memory hierarchy is more exposed, which gives the programmer more explicit control over its behavior. Each memory space in the GPU has a different scope, lifetime, and caching behavior. Global, constant, and texture memories reside in the device memory; thus, they have high access latencies, and their contents have the same lifetime as the running application. On the other hand, shared memory contents have the same lifetime as a thread block in a CUDA kernel, with much lower access latency.

Global Memory is the largest, and most commonly used memory. It can be accessed by all threads from any SMX. The content of the global memory is cached in two levels. There is one small L1 cache per SMX and one L2 cache shared by all SMXs (per device).

Local Memory is used for register spilling. Any variables in a kernel that cannot fit into registers are spilled to local memory. Local memory data are cached in the same way as global memory data.

Constant Memory is used for data that will not change over the course of a kernel execution and is cached in a dedicated per-SMX read-only cache. The logical constant space that can be allocated on the device memory is 64 KB across the different compute capabilities.

Texture Memory was originally designed for traditional graphics applications, but it can now be used as a read-only memory that can improve performance and reduce memory traffic when reads have certain access patterns. It is a dedicated per-SMX read-only memory, like constant memory.

Shared Memory is a programmable memory that is used for communication among threads in a block. It is an on-chip, per-SMX memory that has high bandwidth and low access latency.

In Kepler, Volta, and Turing, the L1 data cache and the shared memory physically share the same space, while in Maxwell and Pascal the L1 data cache is separate from the shared memory and is combined with the texture cache.
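To ground the memory spaces described above, the following is a minimal CUDA sketch (hypothetical kernel and variable names, not taken from the paper) showing how global, shared, constant, and potentially spilled local memory appear to the programmer:

    #include <cuda_runtime.h>

    // Illustrative only: how the CUDA memory spaces discussed above are declared.
    __constant__ float coeff[16];            // constant memory: read-only, cached per SMX

    __global__ void memSpaces(const float *g_in, float *g_out) {  // g_in/g_out: global memory
        __shared__ float tile[256];          // shared memory: on-chip, per thread block
        float local_buf[4];                  // may live in registers or spill to local memory

        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        tile[threadIdx.x] = g_in[tid];       // global -> shared (assumes blockDim.x <= 256)
        __syncthreads();

        local_buf[0] = tile[threadIdx.x] * coeff[0];
        g_out[tid] = local_buf[0];           // write back to global memory
    }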

IV Methodology

Fig. 2: Overview of the proposed compilation workflow.
1   .visible .entry Add(
2       .param .u64 Add_param_0,
3       .param .u64 Add_param_1,
4       .param .u64 Add_param_2
5   ){
6       .reg .b32   %r<7>;
7       .reg .b64   %rd<4>;
8
9       ld.param.u64    %rd1, [Add_param_0];
10      ld.param.u64    %rd2, [Add_param_1];
11      ld.param.u64    %rd3, [Add_param_2];
12
13      ld.global.u32   %r4, [%rd1];
14      ld.global.u32   %r5, [%rd2];
15
16      mov.u32         %r1, %clock;
17      add.u32         %r6, %r4, %r5;
18      mov.u32         %r2, %clock;
19      sub.s32         %r3, %r2, %r1;
20
21      st.global.u32   [%rd3], %r3;
22
23      ret;
24  }
Fig. 3: The latency of the unsigned add instruction using PTX.
1   .visible .entry globalMem(
2       .param .u64 globalMem_param_0,
3       .param .u64 globalMem_param_1
4   ){
5       .reg .b32   %r<6>;
6       .reg .b64   %rd<3>;
7
8       ld.param.u64    %rd1, [globalMem_param_0];
9       ld.param.u64    %rd2, [globalMem_param_1];
10
11      mov.u32         %r1, %clock;
12      ld.global.u32   %r4, [%rd1 + 4];
13      mov.u32         %r2, %clock;
14      sub.s32         %r3, %r2, %r1;
15
16      mov.u32         %r1, %clock;
17      ld.global.u32   %r5, [%rd1 + 8];
18      mov.u32         %r2, %clock;
19      sub.s32         %r4, %r2, %r1;
20
21      st.global.u32   [%rd2], %r3;
22      st.global.u32   [%rd2 + 4], %r4;
23      ret;
24  }
Fig. 4: The latency of accessing the device memory and L1/L2 using PTX.

In this section, we describe our implementation approach. The instruction timing model (Section IV-A) is written in PTX [21]. PTX is a virtual assembly ISA that is forward-compatible across all NVIDIA architectures and generations. PTX allows us to control the exact sequence of low-level instructions executed without any loop or other CUDA overhead. Since PTX is a virtual ISA, it gets translated to the machine assembly ISA that actually executes on the GPU, known as Source And Assembly (SASS). SASS is only forward-compatible within the same major family (Fermi, Kepler, etc.). SASS is not open-sourced, its instructions and characteristics are not well-documented, and disassembling it requires the CUDA Binary Utilities [41] and reverse-engineering tools.

Figure 2 shows the compilation workflow, which relies on the CUDA nvcc compiler [20]. The instrumented PTX source code, which contains the instruction timing model, is first compiled with the PTX optimizing assembler (ptxas) to produce a device CUDA binary file (.cubin) in SASS. This binary file is then placed in a fatbinary (.fatbin), which gets embedded in the host input source code file. The embedded fatbinary is inspected by the CUDA runtime system whenever the device code is launched by the host program to obtain an appropriate fatbinary image for the required GPU family. A single object file (.obj) containing the host and device code is then generated and linked to produce an executable file.

IV-A Timing Model

To determine each operation’s latency, we read the clock register before and after the execution of the instruction. The clock() function provides a per-multiprocessor counter that is incremented every clock cycle. Sampling this counter at the beginning and at the end of the operation and taking the difference between the two samples gives us the exact number of cycles the operation takes to finish execution. Reading the clock register in PTX is translated to a register read followed by a dependent operation in SASS. Thus, we calculate the clock function overhead in order to subtract it later from the time obtained for each operation.
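For readers more familiar with CUDA C, the same timing idea can be sketched as follows. This is a simplified, hypothetical analogue (illustrative names); the actual measurements in this paper are written directly in PTX, as in Figure 3, to keep tight control over the instruction sequence:

    // Simplified CUDA-level sketch of the timing model: time a single add.
    __global__ void timeAdd(const unsigned int *in, unsigned int *out,
                            unsigned int *cycles) {
        unsigned int a = in[0], b = in[1];

        clock_t start = clock();                   // read the per-SM clock counter
        unsigned int c = a + b;                    // instruction under test
        clock_t stop = clock();                    // read the counter again

        out[0] = c;                                // dependent use so the add is not removed
        cycles[0] = (unsigned int)(stop - start);  // raw latency; clock overhead subtracted later
    }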

Figure 3 shows an example of obtaining the latency of the unsigned integer add instruction. Two scalar variables are passed to the kernel and loaded into registers (lines 9 to 14). In line 16, we read the clock register, followed by the add instruction, and then read the clock register again (line 18). The clock overhead (Section V-B1) is then subtracted from the difference between the two clock register values to obtain the exact number of cycles the hardware took to execute the instruction, in this case the unsigned add instruction.

Figure 4 shows an example of obtaining the latency of accessing the device (global) memory and the cache memories of the GPU. The same approach as in Figure 3 is used, but this time we pass a vector to the kernel so that we can also measure the cache hit latencies. In line 12, the load instruction goes all the way to fetch the block from global memory since the cache is cold. This gives us the access time of the global memory. We leverage the option provided by the CUDA compiler to enable or disable the L1 cache at compile time. We compile the application twice. First, with both the L1 and L2 caches enabled, the block fetched from global memory by the load in line 12 is placed in the L2 and L1 caches; hence, when loading a value from the same block again (line 17), it hits in the L1 cache, which gives us the L1 hit latency. We then do the same while disabling the L1 cache and forcing the application to use the L2 cache only, which gives us the L2 hit latency. We make sure not to read the exact same value again in line 17 but rather a new value from the same cache block, so that the compiler does not optimize the load away and replace it with a regular mov instruction.
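A rough CUDA-level sketch of this measurement is shown below (hypothetical names). The L1 toggle mentioned above is commonly exposed through the ptxas load-cache-modifier flags, -Xptxas -dlcm=ca (cache in L1 and L2) versus -Xptxas -dlcm=cg (cache in L2 only); note that this CUDA C version only approximates the PTX code of Figure 4, which controls the instruction order exactly:

    // Illustrative sketch: time a cold global load, then a second load from the
    // same cache line. Compile once with -Xptxas -dlcm=ca and once with
    // -Xptxas -dlcm=cg to separate the L1 and L2 hit latencies.
    __global__ void timeLoads(const unsigned int *vec, unsigned int *cycles) {
        clock_t t0 = clock();
        unsigned int v0 = vec[1];                  // cold miss: fetched from global memory
        clock_t t1 = clock();

        clock_t t2 = clock();
        unsigned int v1 = vec[2];                  // different word, same cache line: cache hit
        clock_t t3 = clock();

        cycles[0] = (unsigned int)(t1 - t0);
        cycles[1] = (unsigned int)(t3 - t2);
        cycles[2] = v0 + v1;                       // dependent use keeps both loads live
    }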

Since we only care about the latencies of individual instructions and not the overall throughput, the kernels were executed with only one thread per warp. To show the effect of the compiler optimizations, we compile the code using the high-level optimization flags found in the CUDA compiler (-O0, -O1, -O2, -O3). To make sure that the hardware executes the instructions and that nothing gets optimized out by the compiler at the -O3 level, we perform a dependent dummy operation on the output of each instruction. The compiler also sometimes reorders the kernel’s instructions when translating from PTX to SASS, which can move the instruction out of the clock timing block, so we added memory and thread barriers to make sure that the code is translated as is and the instruction stays inside the clock timing block.
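The two guards mentioned above can be illustrated roughly as follows (a hypothetical CUDA sketch; in the paper the guards are placed in the PTX source itself):

    // Illustrative guards: a barrier/fence around the timed region and a dependent
    // dummy use of the result so the timed instruction cannot be optimized away.
    __global__ void timeMulGuarded(const unsigned int *in, unsigned int *out,
                                   unsigned int *cycles) {
        unsigned int a = in[0], b = in[1];

        __syncthreads();                            // thread barrier before the timed region
        clock_t start = clock();
        unsigned int c = a * b;                     // instruction under test
        clock_t stop = clock();
        __threadfence();                            // memory fence after the timed region

        out[0] = c ^ a;                             // dependent dummy operation on the result
        cycles[0] = (unsigned int)(stop - start);
    }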

V Evaluation

| Configuration | K40m | TITAN X | P100 | V100 | TITAN RTX |
|---|---|---|---|---|---|
| Graphics Processor | | | | | |
| Architecture | Kepler | Maxwell | Pascal | Volta | Turing |
| Chip | GK110B | GM200 | GP100 | GV100 | TU102 |
| Compute Capability | 3.5 | 5.2 | 6.0 | 7.0 | 7.5 |
| Clock Speeds | | | | | |
| GPU Clock | 745 MHz | 1000 MHz | 1190 MHz | 1246 MHz | 1350 MHz |
| Memory Clock | 1502 MHz | 1753 MHz | 715 MHz | 876 MHz | 1750 MHz |
| Memories | | | | | |
| Memory Size | 12 GB | 12 GB | 16 GB | 16 GB | 24 GB |
| Memory Type | GDDR5 | GDDR5 | HBM2 | HBM2 | GDDR6 |
| Memory Bus | 384 bit | 384 bit | 4096 bit | 4096 bit | 384 bit |
| Memory Bandwidth | 288.4 GB/s | 336.6 GB/s | 732.2 GB/s | 897.0 GB/s | 672.0 GB/s |
| L1 Size | 16 KB | 48 KB | 24 KB | 128 KB | 64 KB |
| L2 Size | 1536 KB | 3 MB | 4 MB | 6 MB | 6 MB |
| Theoretical Performance (TFLOPS) | | | | | |
| FP16 (half) | NA | NA | 19.05 | 28.26 | 32.62 |
| FP32 (float) | 5.046 | 6.691 | 9.526 | 15.7 | 16.31 |
| FP64 (double) | 1.682 | 0.2061 | 4.763 | 7.8 | 0.5098 |
| Texture Rate (GTexel/s) | 210.2 | 209.1 | 297.7 | 441.6 | 509.8 |
| SMX Level | | | | | |
| # {Cores, SMX} | {2880, 15} | {3072, 24} | {3584, 56} | {5120, 80} | {4608, 73} |
| # {SP, DP, SFU} | {192, 64, 32} | {128, 4, 32} | {64, 32, 16} | {64, 32, 4} | {64, 2, 4} |
| # LD/ST | 32 | 32 | 16 | 16 | 16 |
TABLE I: Target GPUs Configurations.
Fig. 5: Clock Overhead
Instruction | Optimized | Non Optimized
(columns within each group, left to right: K40m / K80c, TITAN X, P100, TITAN V / V100, TITAN RTX)
(1) Integer Arithmetic Instructions
add / sub / min / max 9 6 6 4 4 16 15 15 15 15
mul / mad 9 13 13 4 4 16 87 85 15 15
{s} div (regular) 134 141 144 125 117 791 1020 1039 815 785
{s} div (irregular) 164 160 163 129 121 791 1020 1039 815 785
{s} div (average) 149 150 153 127 119 791 1020 1039 815 785
{s} rem 132 141 144 125 114 751 955 1017 770 740
abs 16 13 13 8 8 32 30 30 30 45
{u} div (regular) 123 127 130 120 112 608 856 851 619 589
{u} div (irregular) 140 146 149 125 116 608 856 851 619 589
{u} div (average) 131 136 139 122 114 608 856 851 619 589
{u} rem 116 127 130 117 109 576 826 821 590 560
(2) Logic and Shift Instructions
and / or / not / xor 9 6 6 4 4 16 15 15 15 15
cnot 18 6 12 8 8 48 45 45 45 45
shl/shr 9 6 6 4 4 16 15 15 15 15
(3) Floating Single Precision Instructions
add / sub / min / max 9 6 6 4 4 16 15 15 15 15
mul / mad / fma 9 6 6 4 4 16 15 15 15 15
div (regular) 151 / 150 135 167 123 152 661 / 629 725 671 638 546
div (irregular) 686 / 479 765 649 280 303 661 / 629 725 671 638 546
div (average) 418 / 314 450 408 201 227 661 / 629 725 671 638 546
(4) Double Precision Instructions
add / sub / min / max 10 48 8 8 40 16 52 15 15 48
mul / mad / fma 10 48 8 8 40 16 52 15 15 54
div (average) 445 / 428 709 545 159 540 1588 / 1338 1821 1399 945 1202
(5) Half Precision Instructions
add / sub NA NA 6 6 6 NA NA 15 15 15
mul NA NA 6 6 6 NA NA 15 15 15
fma NA NA 6 6 6 NA NA 15 15 15
(6) Multi Precision Instructions
add.cc / addc / sub.cc 9 6 6 4 4 16 15 15 15 15
subc 18 12 12 8 8 32 30 30 30 30
mad.cc/madc 9 13 13 4 4 16 87 85 15 15
(7) Special Mathematical Instructions
rcp 377 / 298 347 266 60 92 459 / 429 534 395 316 315
sqrt 432 / 352 360 282 60 96 465 / 431 540 399 330 330
fast approximate sqrt 49 47 35 31 31 304 285 540 270 270
fast approximate rsqrt 40 34 35 31 31 288 270 270 270 270
fast approximate sin/cos 18 15 15 11 13 32 30 30 30 30
fast approximate lg2 40 34 35 31 31 288 270 270 270 270
fast approximate ex2 49 40 41 22 32 256 240 240 225 225
copysign 21 20 20 8 7 80 75 75 75 75
(8) Integer Intrinsic Instructions
mul24() / mad24() 22 21 21 12 12 48 118 116 75 75
mulhi() 9 18 18 12 8 16 85 86 32 17
mul64hi() 226 106 118 123 123 896 1419 1420 578 578
sad() 9 6 6 4 4 16 15 15 15 15
popc() 9 13 13 15 15 32 45 45 45 45
clz() 20 19 18 5 21 32 30 30 30 30
bfe() / bfi() 9 6 6 4 4 16 15 15 15 15
bfind() / bbrev() 9 6 6 15 15 48 45 45 45 45
TABLE II: The latency (in cycles) of the various ALU instructions.
Fig. 6: Different memory units' access overhead. (a) Global memory and L1/L2 cache access latencies. (b) Texture memory and texture cache access latencies.

In this section, we illustrate the differences in characteristics between the various GPUs tested in this paper. We then present the results of running our tool on these GPUs.

V-A Target GPUs

We run our evaluation on seven different high-end GPUs from five different generations (Kepler [4], Maxwell [5], Pascal [6], Volta [7], and Turing [8]). Table I depicts the differences in configuration and theoretical performance among the five main GPUs evaluated. The two additional GPUs are the K80c and the TITAN V, which are from the same generations as the K40m (Kepler) and the V100 (Volta), respectively. We used these two additional GPUs to verify whether the results change with the particular GPU model within the same architecture. Most of the configuration parameters are collected from NVIDIA’s white papers and from non-academic sources such as graphics card databases and online reviews.

V-B Evaluation Results

We divide the instructions into two categories: instructions that use the computational units of the GPU, which we call ALU Instructions, and data movement instructions that use the different memories of the GPU, which we call Memory Instructions. We used CUDA version 9.0 [42] for all our assessments, except for the TITAN RTX GPU, which supports only CUDA version 10.0 [43]. We run the code with the different optimization levels found in the compiler. Due to space constraints, we only provide the results of -O3 and -O0, which we denote as Optimized and Non Optimized, respectively. The other two optimization levels have almost the same results as the Optimized results provided here. In order to see whether different CUDA versions have an effect on our results, we also used CUDA version 10.0 [43] on the Volta GPUs and compared the results with those obtained using CUDA version 9.0.

V-B1 Clock Overhead

We first calculate the clock function overhead. Reading the clock register is processed on the hardware as a move instruction followed by a dependent operation. Figure 5 shows the difference in clock overhead between the different GPU architectures. It also shows the effect of the optimization levels on reading the clock register.
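The overhead itself can be sampled with two back-to-back clock reads and nothing in between, roughly as in the hypothetical CUDA sketch below (the paper performs this measurement in PTX):

    // Illustrative sketch: measure the cost of the timing scaffolding itself.
    __global__ void clockOverhead(unsigned int *cycles) {
        clock_t start = clock();                   // first clock read
        clock_t stop  = clock();                   // second clock read, nothing in between
        cycles[0] = (unsigned int)(stop - start);  // reported clock overhead
    }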

V-B2 ALU Instructions

Table II shows the instruction latencies for the different NVIDIA GPUs. We run all the instructions found in the latest released PTX version, 6.4 [21]. We divide the instructions into eight different categories and group together the instructions whose latencies are the same. We also group the GPUs that are from the same generation, and if both have the same result for an instruction, we write only one number in that entry of the table.

We noticed that some instructions, such as div and rem, have different results when operating on signed and unsigned numbers. We denote these cases by {s} and {u}, respectively. In addition, all instructions have the same latency regardless of the operand values, except the div instruction, whose latency depends on the data values. This is mainly because it gets optimized and changed by the compiler into shift operations when the divisor is a power of two. Thus, we denote by (regular), (irregular), and (average) the cases where the divisor is a power of two, where it is not a power of two, and the average of the two cases, respectively.
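The regular/irregular distinction is essentially a strength-reduction effect: an unsigned division by a power of two can be lowered to a shift, while other divisors expand into a longer instruction sequence. A hedged, compile-time illustration is shown below (hypothetical kernel; the exact SASS emitted varies by architecture and compiler version):

    // Illustrative only: division by a power-of-two divisor can be strength-reduced.
    __global__ void divExample(const unsigned int *in, unsigned int *out) {
        unsigned int x = in[0];
        out[0] = x / 16u;   // power-of-two divisor: may compile down to a shift (x >> 4)
        out[1] = x / 10u;   // non-power-of-two divisor: expands to a longer sequence
    }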

Half Precision (FP16) instructions were first supported with the release of the Pascal architecture GPUs. This was driven by the prevalence of approximate computing, especially for machine/deep learning acceleration [44, 45]. Although the P100, V100, and RTX have the same numbers for the FP16 instructions, the Turing architecture has higher theoretical performance than the other two, as shown in Table I.

Multi Precision instructions combine the use of different numerical precisions in a computational method. This offers significant computational speedups by performing operations in half-precision format while storing minimal information in single precision to retain as much information as possible, and it is used in critical parts of neural networks [46]. If the architecture does not support half precision, single precision only is used. Table II shows that the Volta and Turing architectures have the best results across all the generations.
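As a hedged illustration of what such mixed-precision code looks like at the CUDA level (illustrative names; half arithmetic intrinsics require compute capability 5.3 or higher, and the paper measures the corresponding PTX instructions directly):

    #include <cuda_fp16.h>

    // Illustrative mixed-precision kernel: multiply in FP16, accumulate in FP32.
    __global__ void mixedPrecDot(const __half *a, const __half *b, float *acc, int n) {
        float sum = 0.0f;                        // accumulator kept in single precision
        for (int i = 0; i < n; ++i) {
            __half p = __hmul(a[i], b[i]);       // half-precision multiply
            sum += __half2float(p);              // convert and accumulate in single precision
        }
        acc[0] = sum;
    }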

The ALU instruction latencies have decreased significantly from the Kepler architecture to its successor, the Maxwell architecture, except for the div instruction. The Maxwell and Pascal architectures have very close results; nevertheless, Maxwell has lower double-precision performance. The Turing architecture has the best results for the Integer Arithmetic Instructions, but it exhibits very high latencies in double-precision operations. For half- and multi-precision instructions, Turing and Volta have almost the same results, but Turing has higher throughput. Hence, approximate computing applications can achieve very good performance on this architecture. On the other hand, Volta GPUs are unbeatable in single- and double-precision floating-point performance.

We ran the same evaluation using CUDA compiler version 10.0 on the Volta architecture GPUs to see whether different CUDA versions affect the optimizations applied to the instructions. Table III shows the instructions that experienced differences in latency between the two versions. From these results, we can confidently conclude that CUDA compiler version 10.0 yields lower latencies and thus better optimizations.

V-B3 Memory Instructions

The global memory access overhead is shown in Figure 6(a), and the texture memory access overhead is shown in Figure 6(b). The figures show that the difference in access latencies across generations is nearly unnoticeable. NVIDIA has focused more on increasing the bandwidth of the main memory, the memory interconnect (bus), and the texture rate rather than the access latency, since the latency is tolerated by thread-level parallelism. Despite the fact that Kepler has nearly the same access latency for the main and texture memories as Volta, Volta has more than double the memory bandwidth and texture rate of Kepler, as shown in Table I. The figures also show that the non-optimized access latency of the texture memory is nearly double that of the optimized version, which is not the case for the global memory.

The L1 data cache has a higher access latency on Maxwell and Pascal, which is consistent with the fact that it shares the same physical space with the texture cache. The L2 data cache can sometimes experience very high access latency due to bank conflicts, which can lead to memory divergence and force many requests to queue up for long periods of time [47]. However, with the introduction of each new NVIDIA architecture, the sizes of the L1 and L2 caches increase, which can alleviate bank conflicts.

Table IV shows the access latencies of the shared memory and the constant cache. Both have very low latencies compared to the other memories. Constant cache accesses are optimized by the CUDA compiler to have almost the same overhead as a register-to-register instruction.

| Instruction | CUDA Version 9.0 | CUDA Version 10.0 |
|---|---|---|
| Floating Single Precision Instructions | | |
| div (regular) | 123 | 116 |
| div (irregular) | 280 | 266 |
| div (average) | 201 | 191 |
| Double Precision Instructions | | |
| div | 159 | 135 |
| Integer Intrinsic Instructions | | |
| mul64hi() | 123 | 85 |
| popc() | 15 | 5 |
| bfind() / bbrev() | 15 | 5 |
TABLE III: Optimization effect of different CUDA compiler versions on Volta (TITAN V / V100) GPUs.
| Memory Unit | K40m | TITAN X | P100 | V100 | RTX |
|---|---|---|---|---|---|
| Shared Memory (Optimized) | 26 | 24 | 25 | 18 | 21 |
| Shared Memory (Non Optimized) | 55 | 53 | 54 | 49 | 37 |
| Constant Memory (Optimized) | 16 | 20 | 12 | 8 | 8 |
| Constant Memory (Non Optimized) | 80 | 145 | 71 | 70 | 71 |
TABLE IV: Shared & Constant Memories Access Latency.

VI Conclusion

In this paper, we benchmark the undisclosed instruction latencies and the memory access overheads of various NVIDIA GPGPUs. We also show the effect of the different optimization levels found in the CUDA (nvcc) compiler on the individual instructions. We run our evaluation on seven different NVIDIA GPUs from five different GPU architectures. Our results show that the instruction latencies have mostly decreased from Kepler to Turing. These results should help architects and programmers optimize both the hardware and the software, and understand the impact and sensitivity of applications on the various GPU architectures.

References

  1. Top500. [Online]. Available: https://www.top500.org/
  2. NVIDIA Tesla GPU Architecture, 2008. [Online]. Available: https://www.nvidia.com/docs/IO/55506/GeForce_GTX_200_GPU_Technical_Brief.pdf
  3. NVIDIA Fermi GPU Architecture, 2009. [Online]. Available: https://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
  4. NVIDIA Kepler GPU Architecture, 2017. [Online]. Available: https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
  5. NVIDIA Maxwell GPU Architecture, 2014. [Online]. Available: https://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf
  6. NVIDIA Pascal GPU Architecture, 2016. [Online]. Available: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
  7. NVIDIA Volta GPU Architecture, 2017. [Online]. Available: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
  8. NVIDIA Turing GPU Architecture, 2018. [Online]. Available: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
  9. J. Cheng and M. Grossman, Professional CUDA C Programming. Wrox Press Ltd, 2014.
  10. J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A parallel programming standard for heterogeneous computing systems,” Computing in Science Engineering, vol. 12, no. 3, pp. 66–73, May 2010.
  11. H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, “Demystifying gpu microarchitecture through microbenchmarking,” in 2010 IEEE International Symposium on Performance Analysis of Systems Software (ISPASS), March 2010, pp. 235–246. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ISPASS.2010.5452013
  12. X. Mei and X. Chu, “Dissecting gpu memory hierarchy through microbenchmarking,” IEEE Trans. Parallel Distrib. Syst, vol. 28, no. 1, pp. 72–86, Jan. 2017. [Online]. Available: https://doi.org/10.1109/TPDS.2016.2549523
  13. X. Mei and X. Chu, “A micro-benchmark suite for amd gpus,” in 2010 39th International Conference on Parallel Processing Workshops, 2010, pp. 387–396.
  14. X. Yan, X. Shi, L. Wang, and H. Yang, “An opencl micro-benchmark suite for gpus and cpus,” J. Supercomput., vol. 69, no. 2, pp. 693–713, Aug. 2014. [Online]. Available: http://dx.doi.org/10.1007/s11227-014-1112-2
  15. P. Gera, H. Kim, H. Kim, S. Hong, V. George, and C. C. Luk, “Performance characterisation and simulation of intel’s integrated gpu architecture,” in 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2018, pp. 139–148.
  16. H. Yunqing, asfermi: An assembler for the NVIDIA Fermi Instruction Set, 2011. [Online]. Available: https://github.com/hyqneuron/asfermi
  17. S. Gray, MaxAs: Assembler for NVIDIA Maxwell architecture, 2011. [Online]. Available: https://github.com/NervanaSystems/maxas
  18. X. Zhang, G. Tan, S. Xue, J. Li, K. Zhou, and M. Chen, “Understanding the gpu microarchitecture to achieve bare-metal performance tuning,” in Proceedings of the 22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’17. New York, NY, USA: ACM, 2017, pp. 31–43. [Online]. Available: http://doi.acm.org/10.1145/3018743.3018755
  19. W. J. van der Laan, Decuda, 2008. [Online]. Available: https://github.com/laanwj/decuda
  20. CUDA Compiler Driver NVCC, 2019. [Online]. Available: https://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf
  21. Parallel Thread Execution ISA version 6.4, 2019. [Online]. Available: https://docs.nvidia.com/cuda/pdf/ptx_isa_6.4.pdf
  22. A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing cuda workloads using a detailed gpu simulator,” in 2009 IEEE International Symposium on Performance Analysis of Systems and Software, April 2009, pp. 163–174.
  23. Y. Arafa, A. A. Badawy, G. Chennupati, N. Santhi, and S. Eidenbenz, “Ppt-gpu: Scalable gpu performance modeling,” IEEE Computer Architecture Letters, vol. 18, no. 1, pp. 55–58, Jan 2019.
  24. M. Samadi, D. A. Jamshidi, J. Lee, and S. Mahlke, “Paraprox: Patternbased approximation for data parallel applications,” in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’14. New York, NY, USA: ACM, 2014, pp. 35–50. [Online]. Available: http://doi.acm.org/10.1145/2541940.2541948
  25. V. Volkov, “A microbenchmark to study gpu performance models,” in Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’18. New York, NY, USA: ACM, 2018, pp. 421–422. [Online]. Available: http://doi.acm.org/10.1145/3178487.3178536
  26. G. Tan, L. Li, S. Triechle, E. Phillips, Y. Bao, and N. Sun, “Fast implementation of dgemm on fermi gpu,” in SC ’11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2011, pp. 1–11.
  27. J. Lai and A. Seznec, “Performance upper bound analysis and optimization of sgemm on fermi and kepler gpus,” in Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Feb 2013, pp. 1–10.
  28. N. Maruyama and T. Aoki, “Optimizing stencil computations for nvidia kepler gpus,” in Proceedings of the 1st International Workshop on High Performance Stencil Computations, January 2014.
  29. Y. Arafa, A.-H. A. Badawy, G. Chennupati, N. Santhi, and S. Eidenbenzi, “Ppt-gpu: Performance prediction toolkit for gpus identifying the impact of caches: Extended abstract,” in Proceedings of the International Symposium on Memory Systems, ser. MEMSYS ’18. New York, NY, USA: ACM, 2018, pp. 301–302. [Online]. Available: http://doi.acm.org/10.1145/3240302.3270315
  30. M. Kambadur, S. Hong, J. Cabral, H. Patil, C. Luk, S. Sajid, and M. A. Kim, “Fast computational gpu design with gt-pin,” in 2015 IEEE International Symposium on Workload Characterization, Oct 2015, pp. 76–86.
  31. M. Stephenson, S. K. Sastry Hari, Y. Lee, E. Ebrahimi, D. R. Johnson, D. Nellans, M. O’Connor, and S. W. Keckler, “Flexible software profiling of gpu architectures,” in Proceedings of the 42Nd Annual International Symposium on Computer Architecture, ser. ISCA ’15. New York, NY, USA: ACM, 2015, pp. 185–197. [Online]. Available: http://doi.acm.org/10.1145/2749469.2750375
  32. D. Shen, S. L. Song, A. Li, and X. Liu, “Cudaadvisor: Llvm-based runtime profiling for modern gpus,” in Proceedings of the 2018 International Symposium on Code Generation and Optimization, ser. CGO 2018. New York, NY, USA: ACM, 2018, pp. 214–227. [Online]. Available: http://doi.acm.org/10.1145/3168831
  33. NVIDIA Visual Profiler. User’s guide, 2014. [Online]. Available: http://docs.nvidia.com/cuda/profiler-users-guide/
  34. V. Volkov and J. W. Demmel, “Benchmarking gpus to tune dense linear algebra,” in SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, Nov 2008, pp. 1–11.
  35. Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting the NVIDIA volta GPU architecture via microbenchmarking,” CoRR, vol. abs/1804.06826, 2018. [Online]. Available: http://arxiv.org/abs/1804.06826
  36. M. A. S. Bari, L. Stoltzfus, P.-H. Lin, C. Liao, M. Emani, and B. M. Chapman, “Is data placement optimization still relevant on newer gpus?,” 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 83–96, 2018.
  37. G. Chakrabarti, V. Grover, B. Aarts, X. Kong, M. Kudlur, Y. Lin, J. Marathe, M. Murphy, and J.-Z. Wang, “Cuda: Compiling and optimizing for a gpu platform,” Procedia Computer Science, vol. 9, pp. 1910 – 1919, 2012, proceedings of the International Conference on Computational Science, ICCS 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050912003304
  38. P. Bruel, M. Amaris, and A. Goldman, “Autotuning cuda compiler parameters for heterogeneous applications using the opentuner framework,” Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e3973, 2017, e3973 cpe.3973. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.3973
  39. J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U. O’Reilly, and S. Amarasinghe, “Opentuner: An extensible framework for program autotuning,” in 2014 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT), Aug 2014, pp. 303–315.
  40. Y. Yang, P. Xiang, J. Kong, and H. Zhou, “A gpgpu compiler for memory optimization and parallelism management,” in Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’10. New York, NY, USA: ACM, 2010, pp. 86–97. [Online]. Available: http://doi.acm.org/10.1145/1806596.1806606
  41. CUDA Binary Utilities, 2019. [Online]. Available: https://docs.nvidia.com/cuda/pdf/CUDA_Binary_Utilities.pdf
  42. CUDA Toolkit Documentation v9.0, 2018. [Online]. Available: https://docs.nvidia.com/cuda/archive/9.0/
  43. CUDA Toolkit Documentation v10.0, 2018. [Online]. Available: https://docs.nvidia.com/cuda/archive/10.0/
  44. A. Yazdanbakhsh, D. Mahajan, H. Esmaeilzadeh, and P. Lotfi-Kamran, “Axbench: A multiplatform benchmark suite for approximate computing,” IEEE Design Test, vol. 34, no. 2, pp. 60–68, April 2017.
  45. N. Ho and W. Wong, “Exploiting half precision arithmetic in nvidia gpus,” in 2017 IEEE High Performance Extreme Computing Conference(HPEC), Sep. 2017, pp. 1–7.
  46. P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. García, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed precision training,” CoRR, vol. abs/1710.03740, 2017. [Online]. Available: http://arxiv.org/abs/1710.03740
  47. R. Ausavarungnirun, S. Ghose, O. Kayıran, G. H. Loh, C. R. Das, M. T. Kandemir, and O. Mutlu, “Holistic management of the gpgpu memory hierarchy to manage warp-level latency tolerance,” CoRR, vol. abs/1804.11038, 2018. [Online]. Available: http://arxiv.org/abs/1804.11038