
Techniques for Shared Resource Management in Systems with Throughput Processors


Copyright © 2017 Rachata Ausavarungnirun

Acknowledgements

First and foremost, I would like to thank my parents, Khrieng and Ruchanee Ausavarungnirun, for their endless encouragement, love, and support. In addition to my family, I would like to thank my advisor, Prof. Onur Mutlu, for providing me with a great research environment. He taught me many important aspects of research and shaped me into the researcher I am today.

I would like to thank all my committee members, Prof. James Hoe, Dr. Gabriel Loh, Prof. Chris Rossbach, and Prof. Kayvon Fatahalian, who provided valuable feedback on my research and spent a lot of their time and effort to help me complete this dissertation. Special thanks to Prof. James Hoe, my first mentor at CMU, who taught me all the basics since my sophomore year. Prof. Hoe introduced me to many interesting research projects within CALCM. Thanks to Dr. Gabriel Loh for his guidance, which helped me tremendously during the first four years of my PhD. Thanks to Prof. Chris Rossbach for being a great mentor, providing me with guidance, feedback, and support for my research. Both Dr. Loh and Prof. Rossbach provided me with lots of real-world knowledge from industry, which further enhanced the quality of my research. Lastly, thanks to Prof. Kayvon Fatahalian for his knowledge and valuable comments on my GPU research.

All members of SAFARI have been like a family to me. This dissertation was completed thanks to lots of support and feedback from them. Donghyuk Lee has always been a good friend and a great mentor. His work ethic is something I always look up to. Thanks to Kevin Chang for all the valuable feedback throughout my PhD. Thanks to Yoongu Kim and Lavanya Subramanian for teaching me about several DRAM-related topics. Thanks to Samira Khan and Saugata Ghose for their guidance. Thanks to Hongyi Xin and Yixin Luo for their positive attitudes and their friendship. Thanks to Vivek Seshadri and Gennady Pekhimenko for their research insights. Thanks to Chris Fallin and Justin Meza for all their help, especially during the early years of my PhD. They provided tremendous help when I was preparing for my qualification exam. Thanks to Nandita Vijaykumar for all the GPU-related discussions. Thanks to Hanbin Yoon, Jamie Liu, Ben Jaiyen, Chris Craik, Kevin Hsieh, Yang Li, Amirali Bouroumand, Jeremie Kim, Damla Senol and Minesh Patel for all their interesting research discussions.

In addition to people in the SAFARI group, I would like to thank Onur Kayiran and Adwait Jog, who have been great colleagues and have provided me with valuable discussions on various GPU-related research topics. Thanks to Mohammad Fattah for a great collaboration on Network-on-chip research. Thanks to Prof. Reetu Das for her input on my Network-on-chip research projects. Thanks to Eriko Nurvitadhi and Peter Milder, both of whom were my mentors during my undergrad years. Thanks to John and Claire Bertucci for their fellowship support. Thanks to Dr. Pattana Wangaryattawanich and Dr. Polakit Teekakirikul for their friendship and mental support. Thanks to several members of the Thai Scholar community as well as several members of the Thai community in Pittsburgh for their friendship. Thanks to support from AMD, Facebook, Google, IBM, Intel, Microsoft, NVIDIA, Qualcomm, VMware, Samsung, SRC, and support from NSF grant numbers 0953246, 1065112, 1147397, 1205618, 1212962, 1213052, 1302225, 1302557, 1317560, 1320478, 1320531, 1409095, 1409723, 1423172, 1439021 and 1439057.

Lastly, I would like to give special thanks to my wife, Chatchanan Doungkamchan, for her endless love, support, and encouragement. She understands me and helps me through every hurdle I have faced. Her work ethic and the care she gives to her research motivate me to work harder to become a better researcher. She provides me with the perfect environment that allows me to focus on improving myself and my work, while making sure neither of us is burned out from overworking. I could not have completed any of the work in this dissertation without her support.

Abstract

The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime example of throughput processors that can deliver high performance for applications ranging from typical graphics applications to general-purpose data parallel (GPGPU) applications. However, this success has been accompanied by new performance bottlenecks throughout the memory hierarchy of GPU-based systems. This dissertation identifies and eliminates performance bottlenecks caused by major sources of interference throughout the memory hierarchy.

Specifically, we provide an in-depth analysis of inter- and intra-application as well as inter-address-space interference that significantly degrade the performance and efficiency of GPU-based systems.

To minimize such interference, we introduce changes to the memory hierarchy for systems with GPUs that allow the memory hierarchy to be aware of both CPU and GPU applications’ characteristics. We introduce mechanisms to dynamically analyze different applications’ characteristics and propose four major changes throughout the memory hierarchy.

First, we introduce Memory Divergence Correction (MeDiC), a cache management mechanism that mitigates intra-application interference in GPGPU applications by allowing the shared L2 cache and the memory controller to be aware of the GPU’s warp-level memory divergence characteristics. MeDiC uses this warp-level memory divergence information to give more cache space and more memory bandwidth to warps that benefit most from utilizing such resources. Our evaluations show that MeDiC significantly outperforms multiple state-of-the-art caching policies proposed for GPUs.

Second, we introduce the Staged Memory Scheduler (SMS), an application-aware CPU-GPU memory request scheduler that mitigates inter-application interference in heterogeneous CPU-GPU systems. SMS creates a fundamentally new approach to memory controller design that decouples the memory controller into three significantly simpler structures, each of which has a separate task. These structures operate together to greatly improve both system performance and fairness. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus on inter-application scheduling decisions. These two stages enforce high-level policies regarding performance and fairness. As a result, the last stage is simple logic that deals only with the low-level DRAM commands and timing. SMS is also configurable: it allows the system software to trade off between the quality of service provided to the CPU versus GPU applications. Our evaluations show that SMS not only reduces inter-application interference caused by the GPU, thereby improving heterogeneous system performance, but also provides better scalability and power efficiency compared to multiple state-of-the-art memory schedulers.

Third, we redesign the GPU memory management unit to efficiently handle new problems caused by the massive address translation parallelism present in GPU computation units in multi-GPU-application environments. Running multiple GPGPU applications concurrently induces significant inter-core thrashing on the shared address translation/protection units (e.g., the shared Translation Lookaside Buffer, or TLB), a new phenomenon that we call inter-address-space interference. To reduce this interference, we introduce Multi Address Space Concurrent Kernels (MASK). MASK introduces TLB-awareness throughout the GPU memory hierarchy and introduces TLB- and cache-bypassing techniques to increase the effectiveness of a shared TLB.

Finally, we introduce Mosaic, a hardware-software cooperative technique that further increases the effectiveness of the TLB by modifying the memory allocation policy in the system software. Mosaic introduces a high-throughput method to support large pages in multi-GPU-application environments. The key idea is to ensure that memory allocation preserves address space contiguity, which allows pages to be coalesced without any data movement. Our evaluations show that the MASK-Mosaic combination provides a simple mechanism that eliminates the performance overhead of address translation in GPUs without significant changes to GPU hardware, thereby greatly improving GPU system performance.

The key conclusion of this dissertation is that a combination of GPU-aware cache and memory management techniques can effectively mitigate the memory interference on current and future GPU-based systems as well as other types of throughput processors.


Chapter \thechapter Introduction


A throughput processor is a type of processor that consists of numerous simple processing cores. Throughput processors allow applications to achieve very high throughput by executing a massive number of compute operations on these processing cores in parallel within a single cycle [388, 370, 87, 158, 84, 369, 48, 364, 410, 411, 389, 148, 13, 26, 5, 432, 61, 179, 80, 181, 344, 278, 307, 308, 310, 311, 312, 315, 7, 8, 427]. These throughput processors incorporate a variety of processing paradigms, such as vector processors, which utilize a specific execution model called the Single Instruction Multiple Data (SIMD) model that allows one instruction to operate on multiple pieces of data [388, 370, 87, 158, 84, 369, 48, 364]; processors that utilize a technique called fine-grained multithreading, which allows the processor to issue instructions from different threads every cycle [410, 411, 389, 148, 13, 26]; and processors that utilize both techniques [5, 432, 61, 179, 80, 181, 344, 278, 307, 308, 310, 311, 312, 315, 7, 8, 427]. One of the most prominent throughput processors available in modern computing systems that utilizes both the SIMD execution model and fine-grained multithreading is the Graphics Processing Unit (GPU). This dissertation uses GPUs as an example class of throughput processors.

GPUs have enormous parallel processing power due to the large number of computational units they provide. Modern GPU programming models exploit this processing power using a large amount of thread-level parallelism. GPU applications can be broken down into thousands of threads, allowing GPUs to use an execution model called SIMT (Single Instruction Multiple Thread), which enables the GPU cores to tolerate dependencies and long memory latencies. The thousands of threads within a GPU application are clustered into work groups (or thread blocks), where each thread block consists of a collection of threads that are run concurrently. Within a thread block, threads are further grouped into smaller units, called warps [251] or wavefronts [11]. Every cycle, each GPU core executes a single warp. Each thread in a warp executes the same instruction (i.e., is at the same program counter) in lockstep, which is an example of the SIMD (Single Instruction, Multiple Data) [116] execution model. This highly-parallel SIMT/SIMD execution model allows the GPU to complete several hundreds to thousands of operations every cycle.
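To make this thread hierarchy concrete, the following CUDA sketch launches a grid of thread blocks whose threads the hardware implicitly groups into 32-thread warps; the kernel, array names, and launch configuration are illustrative assumptions, not code from the workloads studied in this dissertation.

\begin{verbatim}
// Each thread processes one element; the hardware groups every 32
// consecutive threads of a block into a warp that executes in lockstep.
__global__ void scale(float *data, float factor, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  int warp_in_block = threadIdx.x / warpSize;       // which warp of the block
  (void)warp_in_block;  // shown only to illustrate the warp grouping
  if (tid < n)
    data[tid] *= factor;  // same instruction, different data (SIMT)
}

int main() {
  const int n = 1 << 20;
  float *d_data;
  cudaMalloc(&d_data, n * sizeof(float));
  // 256 threads per block -> 8 warps per block; enough blocks to cover n.
  scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
  cudaDeviceSynchronize();
  cudaFree(d_data);
  return 0;
}
\end{verbatim}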

GPUs are present in many modern systems. These GPU-based systems range from traditional discrete GPUs [310, 311, 312, 315, 7, 8, 427, 278, 344] to heterogeneous CPU-GPU architectures [5, 432, 61, 179, 80, 181, 344, 278, 307, 308]. In all of these systems with GPUs, resources throughout the memory hierarchy (e.g., core-private and shared caches, main memory, the interconnect, and the memory management units) are shared across multiple threads and processes that execute concurrently on both the CPUs and the GPUs.

1 Resource Contention and Memory Interference Problem in Systems with GPUs

Due to the limited shared resources in these systems, applications oftentimes are not able to achieve their ideal throughput (as measured by, e.g., computed instructions per cycle). Shared resources become the bottleneck and create inefficiency because accesses from one thread or application can interfere with accesses from other threads or applications in any of the shared resources, leading to both bandwidth and space contention and, ultimately, to lower performance. The main goal of this dissertation is to analyze and mitigate the major memory interference problems throughout the shared resources in the memory hierarchy of current and future systems with GPUs.

We focus on three major types of memory interference that occur in systems with GPUs: 1) intra-application interference among different GPU threads, 2) inter-application interference that is caused by both CPU and GPU applications, and 3) inter-address-space interference during address translation when multiple GPGPU applications concurrently share the GPUs.


Intra-application interference is a type of interference that originates from GPU threads within the same GPU application. When a GPU executes a GPGPU application, the threads that are scheduled to run on the GPU cores execute concurrently. Even though these threads belong to the same kernel, they contend for shared resources, causing interference to each other [36, 247, 78, 79]. This intra-application interference leads to the significant slowdown of threads running on GPU cores and lowers the performance of the GPU.


Inter-application interference is a type of interference that is caused by concurrently-executing CPU and GPU applications. It occurs in systems where a CPU and a GPU share the main memory system. This type of interference is especially prevalent in recent heterogeneous CPU-GPU systems [181, 179, 176, 178, 62, 61, 432, 307, 80, 344, 278, 33, 209, 207, 187], which integrate a GPU on the same die as the CPU cores. Due to the GPU’s ability to execute a very large number of parallel threads, GPU applications typically demand significantly more memory bandwidth than typical CPU applications. Unlike GPU applications, which are designed to tolerate long memory latencies by employing massive amounts of multithreading [33, 179, 181, 61, 432, 310, 311, 312, 315, 7, 8, 9, 80, 344, 278, 427, 307, 308], CPU applications typically have much lower tolerance to latency [33, 220, 221, 398, 400, 399, 402, 292, 293, 103, 234]. The high bandwidth consumption of the GPU applications heavily interferes with the progress of other CPU applications that share the same hardware resources.


Inter-address-space interference arises due to the address translation process in an environment where multiple GPU applications share the same GPU, e.g., a shared GPU in a cloud infrastructure. We discover that when multiple GPGPU applications concurrently use the same GPU, the address translation process creates additional contention at the shared memory hierarchy, including the Translation Lookaside Buffers (TLBs), caches, and main memory. This particular type of interference can severely slow down all applications and the system when multiple GPGPU applications are executed concurrently on a system with GPUs.

While previous works propose mechanisms to reduce interference and improve the performance of GPUs (see Chapter \thechapter for detailed analyses of these previous works), these approaches 1) focus only on a subset of the shared resources, such as the shared cache or the memory controller, and 2) generally do not take into account the characteristics of the applications executing on the GPUs.

2 Thesis Statement and Our Overarching Approach: Application Awareness

With the understanding of the causes of memory interference, our thesis statement is that a combination of GPU-aware cache and memory management techniques can mitigate memory interference caused by GPUs on current and future systems with GPUs. To this end, we propose to mitigate memory interference in current and future GPU-based systems via GPU-aware and GPU-application-aware resource management techniques. We redesign the memory hierarchy such that each component in the memory hierarchy is aware of the GPU applications’ characteristics. The key idea of our approach is to extract important features of different applications in the system and use them to manage memory hierarchy resources much more intelligently. These key features consist of, but are not limited to, memory access characteristics, utilization of the shared cache, usage of the shared main memory, and demand for the shared TLB. Exploiting these features, we introduce modifications to the shared cache, the memory request scheduler, the shared TLB, and the GPU memory allocator to reduce the amount of inter-application, intra-application, and inter-address-space interference based on applications’ characteristics. We give a brief overview of our major new mechanisms in the rest of this section.

2.1 Minimizing Intra-application Interference

Intra-application interference occurs when multiple threads in the GPU contend for the shared cache and the shared main memory. Memory requests from one thread can interfere with memory requests from other threads, leading to low system performance. As a step to reduce this intra-application interference, we introduce Memory Divergence Correction (MeDiC) [36], a cache and memory controller management scheme that is designed to be aware of the different types of warps that access the shared cache, and to selectively prioritize warps that benefit the most from utilizing the cache. This new mechanism first characterizes different types of warps based on how much benefit they receive from the shared cache. To effectively characterize the warp type, MeDiC uses the memory divergence patterns, i.e., the diversity in how long each load and store instruction in the warp takes to be serviced. We observe that GPGPU applications exhibit different levels of heterogeneity in their memory divergence behavior at the shared L2 cache within the GPU. As a result, (1) some warps benefit significantly from the cache, while others make poor use of it; (2) the divergence behavior of a warp tends to remain stable for long periods of the warp’s execution; and (3) the impact of memory divergence can be amplified by the high queuing latencies at the L2 cache.

Based on the heterogeneity in warps’ memory divergence behavior, we propose a set of techniques, collectively called Memory Divergence Correction (MeDiC), that reduce the negative performance impact of memory divergence and cache queuing. MeDiC uses warp divergence characterization to guide three warp-aware components in the memory hierarchy: (1) a cache bypassing mechanism that exploits the latency tolerance of warps that do not benefit from using the cache, to both alleviate queuing delay and increase the hit rate for warps that benefit from using the cache, (2) a cache insertion policy that prevents data from warps that benefit from using the cache from being prematurely evicted, and (3) a memory controller that prioritizes the few requests received from warps that benefit from using the cache, to minimize stall time. Our evaluation shows that MeDiC is effective at exploiting inter-warp heterogeneity and delivers significant performance and energy improvements over the state-of-the-art GPU cache management technique [247].
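The sketch below conveys the flavor of warp-type-driven cache management in host-side pseudo-C++; the warp classification thresholds, structure names, and interfaces are illustrative assumptions for exposition and do not reproduce MeDiC’s actual hardware design, which Chapter \thechapter describes in detail.

\begin{verbatim}
#include <cstdint>

// Illustrative warp types based on how much a warp benefits from the L2 cache.
enum class WarpType { MostlyHit, MostlyMiss, Balanced };

struct WarpStats {
  uint32_t l2_hits = 0;
  uint32_t l2_accesses = 0;
};

// Hypothetical classifier: warps with very high (or very low) hit ratios are
// treated differently from balanced warps. The thresholds are arbitrary here.
WarpType classify(const WarpStats &s) {
  if (s.l2_accesses < 64) return WarpType::Balanced;      // not enough samples yet
  float hit_ratio = float(s.l2_hits) / float(s.l2_accesses);
  if (hit_ratio > 0.8f) return WarpType::MostlyHit;
  if (hit_ratio < 0.2f) return WarpType::MostlyMiss;
  return WarpType::Balanced;
}

// Bypass decision in the spirit of warp-aware management: requests from warps
// that rarely hit skip the L2 cache, reducing queuing delay and preserving
// cache space for warps that do benefit from it.
bool should_bypass_l2(const WarpStats &s) {
  return classify(s) == WarpType::MostlyMiss;
}

// A memory scheduler could likewise prioritize requests from mostly-hit warps
// so that their few outstanding misses are serviced quickly.
bool prioritize_in_memory_controller(const WarpStats &s) {
  return classify(s) == WarpType::MostlyHit;
}
\end{verbatim}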

2.2 Minimizing Inter-application Interference

Inter-application interference occurs when multiple processor cores (CPUs) and a GPU integrated together on the same chip share the off-chip DRAM (and perhaps some caches). In such a system, requests from the GPU can heavily interfere with requests from the CPUs, leading to low system performance and starvation of cores. Even though previously-proposed application-aware memory scheduling policies designed for CPU-only scenarios (e.g., [357, 293, 220, 221, 292, 234, 398, 400, 399, 402, 103]) can be applied to a CPU-GPU heterogeneous system, we observe that the GPU requests occupy a significant portion of the request buffer space and thus reduce the visibility of CPU cores’ requests to the memory controller, leading to lower system performance. Increasing the request buffer space requires complex logic to analyze applications’ characteristics, assign priorities for each memory request, and enforce these priorities when the GPU is present. As a result, these past proposals for application-aware memory scheduling in CPU-only systems cannot deliver good performance on a CPU-GPU heterogeneous system at low hardware complexity (as we show in this dissertation).

To minimize the inter-application interference in CPU-GPU heterogeneous systems, we introduce a new memory controller called the Staged Memory Scheduler (SMS) [33], which is both application-aware and GPU-aware. Specifically, SMS is designed to facilitate GPU applications’ high bandwidth demand, improving performance and fairness significantly. SMS introduces a fundamentally new approach that decouples the three primary tasks of the memory controller into three significantly simpler structures that together improve system performance and fairness. The first stage, called the Batch Formation stage, groups requests based on row-buffer locality. This grouping allows the second stage, called the Batch Scheduler stage, to focus mainly on inter-application scheduling decisions. These two stages collectively enforce high-level policies regarding performance and fairness, and therefore the last stage can get away with using simple per-bank FIFO queues (no further command reordering within each bank) and straightforward logic that deals only with the low-level DRAM commands and timing. This last stage is called the DRAM Command Scheduler stage.
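The sketch below illustrates this staged decoupling in simplified form; the data structures, the source-id convention (the GPU is assumed to be source 0), and the batch-picking policy are assumptions for exposition rather than the exact SMS hardware, which Chapter \thechapter presents.

\begin{verbatim}
#include <cstdint>
#include <deque>
#include <vector>

struct Request { int source_id; uint64_t row; int bank; };
struct Batch   { int source_id; std::vector<Request> reqs; };  // same source, same row

// Stage 1: Batch Formation -- group consecutive requests from one source that
// target the same DRAM row, so each batch enjoys row-buffer locality.
struct BatchFormation {
  std::vector<Batch> ready_batches;
  void enqueue(const Request &r) {
    if (!ready_batches.empty() &&
        ready_batches.back().source_id == r.source_id &&
        !ready_batches.back().reqs.empty() &&
        ready_batches.back().reqs.back().row == r.row) {
      ready_batches.back().reqs.push_back(r);       // extend the current batch
    } else {
      ready_batches.push_back({r.source_id, {r}});  // start a new batch
    }
  }
};

// Stage 2: Batch Scheduler -- pick which application's batch is serviced next.
// A real policy weighs CPU latency sensitivity against GPU bandwidth demand;
// here we simply prefer (or avoid) CPU batches, assuming the GPU is source 0.
int pick_batch(const std::vector<Batch> &batches, bool prefer_cpu) {
  for (size_t i = 0; i < batches.size(); ++i)
    if (prefer_cpu == (batches[i].source_id != 0)) return int(i);
  return batches.empty() ? -1 : 0;
}

// Stage 3: DRAM Command Scheduler -- simple per-bank FIFOs with no further
// reordering; only low-level command/timing handling remains (omitted here).
struct DramCommandScheduler {
  std::vector<std::deque<Request>> bank_fifo;
  explicit DramCommandScheduler(int num_banks) : bank_fifo(num_banks) {}
  void issue(const Batch &b) {
    for (const Request &r : b.reqs) bank_fifo[r.bank].push_back(r);
  }
};
\end{verbatim}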

Our evaluation shows that SMS is effective at reducing inter-application interference. SMS delivers superior performance and fairness compared to state-of-the-art memory schedulers [357, 293, 220, 221], while providing a design that is significantly simpler to implement and that has significantly lower power consumption.

2.3 Minimizing Inter-address-space Interference

Inter-address-space interference occurs when the GPU is shared among multiple GPGPU applications in large-scale computing environments [311, 310, 312, 315, 9, 31, 421, 191]. Much of the inter-address-space interference problem in a contemporary GPU lies within the memory system, where multi-application execution requires virtual memory support to manage the address spaces of each application and to provide memory protection. We observe that when multiple GPGPU applications spatially share the GPU, a significant amount of inter-core thrashing occurs on the shared TLB within the GPU. We observe that this contention at the shared TLB is high enough to prevent the GPU from successfully hiding memory latencies, which causes TLB contention to become a first-order performance concern.

Based on our analysis of the TLB contention in a modern GPU system executing multiple applications, we introduce two mechanisms. First, we design Multi Address Space Concurrent Kernels (MASK). The key idea of MASK is to 1) extend the GPU memory hierarchy to efficiently support address translation via the use of multi-level TLBs, and 2) use translation-aware memory and cache management techniques to maximize throughput in the presence of inter-address-space contention. MASK uses a novel token-based approach to reduce TLB miss overheads by limiting the number of threads that can use the shared TLB, and its L2 cache bypassing mechanisms and address-space-aware memory scheduling reduce the inter-address-space interference. We show that MASK restores much of the thread-level parallelism that was previously lost due to address translation.
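The following sketch conveys the token idea at a high level; the per-address-space token counters and function names are illustrative assumptions, not MASK’s actual implementation (Chapter \thechapter presents the real design).

\begin{verbatim}
#include <cstdint>
#include <vector>

// Illustrative token pool: each address space is granted a limited number of
// tokens; only warps holding a token may install entries in the shared TLB.
struct TokenPool {
  std::vector<int> tokens_left;                 // one counter per address space
  TokenPool(int num_address_spaces, int tokens_per_space)
      : tokens_left(num_address_spaces, tokens_per_space) {}

  bool try_acquire(int address_space) {
    if (tokens_left[address_space] > 0) { --tokens_left[address_space]; return true; }
    return false;
  }
  void release(int address_space) { ++tokens_left[address_space]; }
};

// A warp without a token bypasses the shared TLB (its translation is serviced
// without displacing other address spaces' entries), which limits thrashing.
bool may_use_shared_tlb(TokenPool &pool, int address_space) {
  return pool.try_acquire(address_space);
}
\end{verbatim}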

Second, to further minimize the inter-address-space interference, we introduce Mosaic. Mosaic significantly decreases inter-address-space interference at the shared TLB by increasing TLB reach via support for multiple page sizes, including very large pages. To enable multi-page-size support, we make two key observations. First, we observe that the vast majority of memory allocations and memory deallocations are performed en masse by GPGPU applications in phases, typically soon after kernel launch or before kernel exit. Second, long-lived memory objects that usually increase fragmentation and induce complexity in CPU memory management are largely absent in the GPU setting. These two observations make it relatively easy to predict the memory access patterns of GPGPU applications and simplify the task of detecting when a memory region can benefit from using large pages.

Based on the prediction of the memory access patterns, Mosaic 1) modifies GPGPU applications’ memory layout in system software to preserve address space contiguity, which allows the GPU to splinter and coalesce pages very quickly without moving data, and 2) periodically performs memory compaction while still preserving address space contiguity to avoid memory bloat and data fragmentation. Our prototype shows that Mosaic is very effective at reducing inter-address-space interference at the shared TLB and reduces the shared TLB miss rate to less than 1% on average (down from 25.4% in the baseline shared TLB).
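The sketch below illustrates the contiguity-preserving allocation idea in simplified form; the 4KB/2MB page sizes, structures, and function names are assumptions for exposition and do not reproduce Mosaic’s actual allocator (Chapter \thechapter covers the real mechanism).

\begin{verbatim}
#include <cstdint>
#include <vector>

constexpr uint64_t SMALL_PAGE = 4 * 1024;        // 4KB small page (illustrative)
constexpr uint64_t LARGE_PAGE = 2 * 1024 * 1024; // 2MB large-page frame (illustrative)
constexpr int PAGES_PER_FRAME = int(LARGE_PAGE / SMALL_PAGE);

// Each large-page frame is dedicated to one application, and its small pages
// are handed out contiguously, so a full frame can later be promoted to a
// large page by updating page table entries only -- no data copying required.
struct LargePageFrame {
  int owner_app;
  int pages_used;
};

struct ContiguityPreservingAllocator {
  std::vector<LargePageFrame> frames;

  // Allocate one small page for `app`, always from a frame owned by `app`.
  // Returns the index of the frame the page came from.
  int alloc_small_page(int app) {
    for (size_t i = 0; i < frames.size(); ++i)
      if (frames[i].owner_app == app && frames[i].pages_used < PAGES_PER_FRAME) {
        ++frames[i].pages_used;
        return int(i);
      }
    LargePageFrame fresh;                        // open a fresh frame for this app
    fresh.owner_app = app;
    fresh.pages_used = 1;
    frames.push_back(fresh);
    return int(frames.size()) - 1;
  }

  // A frame whose small pages are all allocated (and, by construction, all
  // belong to one application) can be coalesced into a single large page.
  bool can_coalesce(int frame_idx) const {
    return frames[frame_idx].pages_used == PAGES_PER_FRAME;
  }
};
\end{verbatim}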

In summary, MASK incorporates TLB-awareness throughout the memory hierarchy and introduces TLB- and cache-bypassing techniques to increase the effectiveness of a shared TLB. Mosaic provides a hardware-software cooperative technique that modifies the memory allocation policy in the system software and introduces a high-throughput method to support large pages in multi-GPU-application environments. The MASK-Mosaic combination provides a simple mechanism to eliminate the performance overhead of address translation in GPUs without requiring significant changes in GPU hardware. These techniques work together to significantly improve system performance, IPC throughput, and fairness over the state-of-the-art memory management technique [343].

3 Contributions

We make the following major contributions:

  • We provide in-depth analyses of three different types of memory interference in systems with GPUs. Each of these three types of interference significantly degrades the performance and efficiency of GPU-based systems. To minimize memory interference, we introduce mechanisms to dynamically analyze different applications’ characteristics and propose four major changes throughout the memory hierarchy of GPU-based systems.

  • We introduce Memory Divergence Correction (MeDiC). MeDiC is a mechanism that minimizes intra-application interference in systems with GPUs. MeDiC is the first work that observes that the different warps within a GPGPU application exhibit heterogeneity in their memory divergence behavior at the shared L2 cache, and that some warps do not benefit from the few cache hits that they have. We show that this memory divergence behavior tends to remain consistent throughout long periods of execution for a warp, allowing for fast, online warp divergence characterization and prediction. MeDiC takes advantage of this warp characterization via a combination of warp-aware cache bypassing, cache insertion and memory scheduling techniques. Chapter \thechapter provides the detailed design and evaluation of MeDiC.

  • We demonstrate how the GPU memory traffic in heterogeneous CPU-GPU systems can cause severe inter-application interference, leading to poor performance and fairness. We propose a new memory controller design, the Staged Memory Scheduler (SMS), which delivers superior performance and fairness compared to three state-of-the-art memory schedulers [357, 220, 221], while providing a design that is significantly simpler to implement. The key insight behind SMS’s scalability is that the primary functions of sophisticated memory controller algorithms can be decoupled into different stages in a multi-level scheduler. Chapter \thechapter provides the design and the evaluation of SMS in detail.

  • We perform a detailed analysis of the major problems in state-of-the-art GPU virtual memory management that hinder high-performance multi-application execution. We discover a new type of memory interference, which we call inter-address-space interference, that arises from a significant amount of inter-core thrashing on the shared TLB within the GPU. We also discover that the TLB contention is high enough to prevent the GPU from successfully hiding memory latencies, which causes TLB contention to become a first-order performance concern in GPU-based systems. Based on our analysis, we introduce Multi Address Space Concurrent Kernels (MASK). MASK extends the GPU memory hierarchy to efficiently support address translation through the use of multi-level TLBs, and uses translation-aware memory and cache management to maximize IPC (instruction per cycle) throughput in the presence of inter-application contention. MASK restores much of the thread-level parallelism that was previously lost due to address translation. Chapter \thechapter analyzes the effect of inter-address-space interference and provides the detailed design and evaluation of MASK.

  • To further minimize the inter-address-space interference, we introduce Mosaic. Mosaic further increases the effectiveness of the TLB by providing a hardware-software cooperative technique that modifies the memory allocation policy in the system software. Mosaic introduces a low-overhead method to support large pages in multi-GPU-application environments. The key idea of Mosaic is to ensure that memory allocation preserves address space contiguity, which allows pages to be coalesced without any data movement. Our prototype shows that Mosaic significantly increases the effectiveness of the shared TLB in a GPU and further reduces inter-address-space interference. Chapter \thechapter provides the detailed design and evaluation of Mosaic.

4 Dissertation Outline

This dissertation is organized into eight chapters. Chapter \thechapter presents background on modern GPU-based systems. Chapter \thechapter discusses related prior work on resource management, whose techniques can potentially be applied to reduce interference in GPU-based systems. Chapter \thechapter presents the design and evaluation of MeDiC. MeDiC is a mechanism that minimizes intra-application interference by redesigning the shared cache and the memory controller to be aware of different types of warps. Chapter \thechapter presents the design and evaluation of SMS. SMS is a GPU-aware and application-aware memory controller design that minimizes inter-application interference. Chapter \thechapter presents a detailed analysis of the performance impact of inter-address-space interference. It then proposes MASK, a mechanism that minimizes inter-address-space interference by introducing TLB-awareness throughout the memory hierarchy. Chapter \thechapter presents the design of Mosaic. Mosaic provides a hardware-software cooperative technique that reduces inter-address-space interference by lowering contention at the shared TLB. Chapter \thechapter provides a summary of common principles and lessons learned. Chapter \thechapter provides a summary of this dissertation as well as future research directions that are enabled by this dissertation.


Chapter \thechapter The Memory Interference Problem in Systems with GPUs

We first provide background on the architecture of a modern GPU, and then we discuss the bottlenecks that highly-multithreaded applications can face either when executed alone on a GPU or when executing with other CPU or GPU applications.

5 Modern Systems with GPUs

In this section, we provide a detailed explanation of the GPU architecture that is available on modern systems. Section 5 discusses a typical modern GPU architecture [310, 311, 312, 315, 7, 8, 427, 278, 344, 5, 432, 61, 179, 80, 181, 307, 308] as well as its memory hierarchy. Section 6 discusses the design of a modern CPU-GPU heterogeneous architecture [432, 61, 179, 181] and its memory hierarchy. Section 7 discusses the memory management unit and support for address translation.

5.1 GPU Core Organization

A typical GPU consists of several GPU cores called shader cores (sometimes called streaming multiprocessors, or SMs). As shown in Figure 1, a GPU core executes SIMD-like instructions [116]. Each SIMD instruction can potentially operate on multiple pieces of data in parallel, and each data piece is operated on by a different thread of control; hence the name SIMT (Single Instruction Multiple Thread). Threads that execute the same instruction (i.e., are at the same Program Counter) are grouped into a warp. Multiple warps are grouped into a thread block. Every cycle, a GPU core fetches an available warp (a warp is available if none of its threads are stalled), and issues an instruction associated with those threads (in the example from Figure 1, this instruction is from Warp D). In this way, a GPU can potentially retire as many instructions as the number of cores multiplied by the number of threads per warp, enabling high instruction-per-cycle (IPC) throughput. More detail on GPU core organization can be found in [121, 129, 271, 120, 46, 436, 150, 385].

Figure 1: Organization of threads, warps, and thread blocks.
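A minimal sketch of the issue logic described above, assuming a simplified warp state and a round-robin policy; real GPU warp schedulers track considerably more state (scoreboards, barriers, instruction buffers) and use more sophisticated policies.

\begin{verbatim}
#include <cstdint>
#include <vector>

struct Warp {
  uint64_t pc = 0;        // shared program counter for all threads in the warp
  bool stalled = false;   // e.g., waiting on an outstanding memory request
};

// Every cycle, the core picks one available warp (none of its threads are
// stalled) and issues the instruction at that warp's program counter; all
// threads of the warp execute it in lockstep.
int pick_warp(const std::vector<Warp> &warps, int last_issued) {
  int n = int(warps.size());
  for (int i = 1; i <= n; ++i) {                 // simple round-robin policy
    int candidate = (last_issued + i) % n;
    if (!warps[candidate].stalled) return candidate;
  }
  return -1;                                     // no warp is ready: the core idles
}
\end{verbatim}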

5.2 GPU Memory Hierarchy

When there is a load or store instruction that needs to access data from the main memory, the GPU core sends a memory request to the memory hierarchy, which is shown in Figure 2. This hierarchy typically contains a private data cache, and an interconnect (typically a crossbar) that connects all the cores with the shared data cache. If the target data is present neither in the private nor the shared data cache, a memory request is sent to the main memory in order to retrieve the data.


GPU Cache Organization and Key Assumptions. Each core has its own private L1 data, texture, and constant caches, as well as a software-managed scratchpad memory [311, 310, 251, 312, 315, 8, 11, 423]. In addition, the GPU also has several shared L2 cache slices and memory controllers. Because there are several methods to design the GPU memory hierarchy, we assume the baseline that decouples the memory channels into multiple memory partitions. A memory partition unit combines a single L2 cache slice (which is banked) with a designated memory controller that connects the GPU to off-chip main memory. Figure 2 shows a simplified view of how the cores (or SMs), caches, and memory partitions are organized in our baseline GPU.

Figure 2: Overview of a modern GPU architecture.

GPU Main Memory Organization. Similar to systems with CPUs, a GPU uses DRAM (organized as hierarchical two-dimensional arrays of bitcells) as main memory. Reading or writing data to DRAM requires that a row of bitcells from the array first be read into a row buffer. This is required because the act of reading the row destroys the row’s contents, and so a copy of the bit values must be kept (in the row buffer). Reads and writes operate directly on the row buffer. Eventually, the row is “closed”, whereby the data in the row buffer is written back into the DRAM array. Accessing data already loaded in the row buffer, also called a row buffer hit, incurs a shorter latency than when the corresponding row must first be “opened” from the DRAM array. A modern memory controller must orchestrate the sequence of commands to open, read, write, and close rows. Servicing requests in an order that increases the row-buffer hit rate tends to improve overall throughput by reducing the average latency to service requests. The memory controller is also responsible for enforcing a wide variety of timing constraints imposed by modern DRAM standards (e.g., DDR3), such as limiting the rate of page-open operations ($t_{FAW}$) and ensuring a minimum amount of time between writes and reads ($t_{WTR}$). More detail on timing constraints and DRAM operation can be found in [222, 240, 241, 239, 254, 374, 71, 238, 70, 154, 155].
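The sketch below models the row-buffer behavior described above in simplified form; the latency values are placeholders rather than parameters from any specific DRAM standard.

\begin{verbatim}
#include <cstdint>

// Simplified per-bank row-buffer model: a request to the currently open row
// is a row-buffer hit; otherwise the old row must be closed (precharged) and
// the new row opened (activated) before the column access can proceed.
struct Bank {
  bool row_open = false;
  uint64_t open_row = 0;
};

// Placeholder latencies in DRAM cycles (illustrative, not from a datasheet).
constexpr int COL_ACCESS = 15;   // read/write from the row buffer
constexpr int PRECHARGE  = 15;   // close the currently open row
constexpr int ACTIVATE   = 15;   // open (copy) a row into the row buffer

int access_latency(Bank &bank, uint64_t row) {
  if (bank.row_open && bank.open_row == row)
    return COL_ACCESS;                           // row-buffer hit
  int latency = (bank.row_open ? PRECHARGE : 0)  // row-buffer conflict or closed row
              + ACTIVATE + COL_ACCESS;
  bank.row_open = true;
  bank.open_row = row;
  return latency;
}
\end{verbatim}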

Each two-dimensional array of DRAM cells constitutes a bank, and a group of banks forms a rank. All banks within a rank share a common set of command and data buses, and the memory controller is responsible for scheduling commands such that each bus is used by only one bank at a time. Operations on multiple banks may occur in parallel (e.g., opening a row in one bank while reading data from another bank’s row buffer) so long as the buses are properly scheduled and any other DRAM timing constraints are honored. A memory controller can improve memory system throughput by scheduling requests such that bank-level parallelism or BLP (i.e., the number of banks simultaneously busy responding to commands) is higher [293, 237]. A memory system implementation may support multiple independent memory channels (each with its own ranks and banks) [287, 42] to further increase the number of memory requests that can be serviced at the same time. A key challenge in the implementation of modern, high-performance memory controllers is to effectively improve system performance by maximizing both row-buffer hits and BLP while simultaneously providing fairness among multiple CPUs and the GPU [33].


Key Assumptions. We assume the memory controller consists of a centralized memory request buffer. Additional details of the memory controller design can be found in Sections 14, 27, and 33.

5.3 Intra-application Interference within GPU Applications

While many GPGPU applications can tolerate a significant amount of memory latency due to their parallelism and the SIMT execution model, many previous works (e.g., [425, 193, 297, 192, 46, 436, 150, 120, 121, 74, 359, 360, 271]) observe that GPU cores often stall for a significant fraction of time. One significant source of these stalls is contention at the shared GPU memory hierarchy [297, 271, 359, 425, 193, 192, 207, 74, 36]. The large amount of parallelism in GPU-based systems creates a significant amount of contention on the GPU’s memory hierarchy. Even though all threads in the GPU execute code from the same application, data accesses from one warp can interfere with data accesses from other warps. This interference comes in several forms, such as additional cache thrashing and queuing delays at both the shared cache and shared main memory, which combine to lower the performance of GPU-based systems. We call this interference intra-application interference.

Memory divergence, where the threads of a warp reach a memory instruction and some of the threads’ memory requests take longer to service than the requests from other threads [297, 271, 74, 36], further exacerbates the effect of intra-application interference. Since all threads within a warp operate in lockstep due to the SIMD execution model, the warp cannot proceed to the next instruction until the slowest request within the warp completes and all threads are ready to continue execution.
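As an illustration of memory divergence, consider the hypothetical CUDA gather kernel below (not taken from the workloads studied here): each lane loads from a data-dependent address, so some lanes may hit in the cache while others miss, and the warp can only continue once the slowest lane’s load returns.

\begin{verbatim}
// Data-dependent gather: each lane in a warp loads from a potentially
// different cache line. Lanes whose loads hit in the cache finish quickly,
// but the warp proceeds only once the slowest lane's load has returned.
__global__ void gather(const float *table, const int *indices,
                       float *out, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n)
    out[tid] = table[indices[tid]];   // divergent memory access within the warp
}
\end{verbatim}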

Chapter \thechapter provides detailed analyses on how to reduce intra-application interference at the shared cache and the shared main memory.


6 GPUs in CPU-GPU Heterogeneous Architectures

Aside from using off-chip discrete GPUs, modern architectures integrate a GPU on the same chip as the CPU cores [181, 179, 176, 178, 62, 61, 432, 307, 80, 344, 278, 33, 209, 187]. Figure 3 shows the design of these recent heterogeneous CPU-GPU architectures. As shown in Figure 3, parts of the memory hierarchy are shared across both CPU and GPU applications.

Figure 3: The memory hierarchy of a heterogeneous CPU-GPU architecture.

Key Assumptions. We make two key assumptions for the design of heterogeneous CPU-GPU systems. First, we assume that the GPUs and the CPUs do not share the last-level caches. Second, we assume that the memory controller is the first point in the memory hierarchy where CPU applications and GPU applications share resources. We apply multiple memory scheduler designs as baselines for our evaluations. Additional details of these baseline designs can be found in Sections 19.5 and 20.

6.1 Inter-application Interference across CPU and GPU Applications

As illustrated in Figure 3, the main memory is a major shared resource among cores in modern chip multiprocessor (CMP) systems. Memory requests from multiple cores interfere with each other at the main memory and create inter-application interference, which is a significant impediment to individual application and system performance. Previous works on CPU-only application-aware memory scheduling [292, 293, 220, 221, 398, 103, 234] have addressed the problem by being aware of application characteristics at the memory controller and prioritizing memory requests to improve system performance and fairness. This approach of application-aware memory request scheduling has provided good system performance and fairness in multicore systems.

As opposed to CPU applications, GPU applications are not very latency sensitive, as there are a large number of independent threads to cover long memory latencies. However, the GPU requires a significant amount of bandwidth, far exceeding that of even the most memory-intensive CPU applications. As a result, a GPU memory scheduler [251, 311, 315] typically needs a large request buffer that is capable of request coalescing (i.e., combining multiple requests for the same block of memory into a single combined request [251]). Furthermore, since GPU applications are bandwidth intensive, often with streaming access patterns, a policy that maximizes the number of row-buffer hits is effective for GPUs to maximize overall throughput. Hence, a memory scheduler that can improve the effective DRAM bandwidth, such as the FR-FCFS scheduler with a large request buffer [357, 454, 46, 445], tends to perform well for GPUs.

This conflicting preference between CPU applications and GPU applications (CPU applications benefit from lower memory request latency, while GPU applications benefit from higher DRAM bandwidth) further complicates the design of the memory request scheduler for CPU-GPU heterogeneous systems. A design that favors lowering the latency of CPU requests is undesirable for GPU applications, while a design that favors providing high bandwidth is undesirable for CPU applications.

In this dissertation, Chapter \thechapter provides an in-depth analysis of this inter-application interference and provides a method to mitigate the interference in CPU-GPU heterogeneous architectures.

7 GPUs in Multi-GPU-application Environments

Recently, a newer set of analytic GPGPU applications, such as the Netflix movie recommendation system [25] or a stock market analyzer [345], require a closely connected, highly virtualized, shared environment. These applications, which benefit from the amount of parallelism GPUs provide, do not need to use all the resources in the GPU to maximize their performance. Instead, these emerging applications benefit from concurrency: running a few of these applications together, each sharing some resources on the GPU. NVIDIA GRID [311, 315] and AMD FirePro [9] are two examples of products that spatially share GPU resources across multiple applications.

Figure 4 shows the high-level design of how a GPU can be spatially shared across two GPGPU applications. In this example, the GPU contains multiple shared page table walkers, which are responsible for translating virtual addresses into physical addresses. This design also contains two levels of translation lookaside buffers (TLBs), which cache the virtual-to-physical translations. This design allows the GPU to co-schedule kernels, even from different applications, concurrently, because address translation enables memory protection across multiple GPGPU applications.

Figure 4: A GPU design showing two concurrent GPGPU applications concurrently sharing the GPUs.

Key Assumptions. The page table walker can be placed at different locations in the GPU memory hierarchy. The GPU MMU design proposed by Power et al. places parallel page table walkers between the private L1 and the shared L2 caches [343]. Alternative designs place the page table walker at the Input-Output Memory Management Unit (IOMMU), which directly connects to the main memory [5, 432, 61, 179, 80, 181, 344, 278, 307, 308, 310, 311, 312, 315, 7, 8, 427, 344], and another GPU MMU design, proposed by Cong et al., uses the CPU’s page table walker to perform GPU page walks [83]. We find that placing parallel page table walkers at the shared L2 cache provides the best performance. Hence, we assume the baseline proposed by Power et al., which utilizes per-core private TLBs and places the page table walkers at the shared L2 cache [343].
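The sketch below shows, at a high level, the translation path assumed in this baseline (per-core private L1 TLB, shared L2 TLB, then a page table walk); the structures and functions are simplified illustrations and not the exact design of [343].

\begin{verbatim}
#include <cstdint>
#include <unordered_map>

using VirtPage = uint64_t;
using PhysPage = uint64_t;

// Simplified TLB: a map from (address space, virtual page) to physical page.
struct Tlb {
  std::unordered_map<uint64_t, PhysPage> entries;
  static uint64_t key(int asid, VirtPage vp) { return (uint64_t(asid) << 48) | vp; }
  bool lookup(int asid, VirtPage vp, PhysPage &pp) const {
    auto it = entries.find(key(asid, vp));
    if (it == entries.end()) return false;
    pp = it->second;
    return true;
  }
  void insert(int asid, VirtPage vp, PhysPage pp) { entries[key(asid, vp)] = pp; }
};

// Stub page table walk (identity mapping). A real walk is multi-level and
// itself issues memory requests, which is why TLB misses are so costly.
PhysPage walk_page_table(int /*asid*/, VirtPage vp) { return vp; }

PhysPage translate(Tlb &l1_private, Tlb &l2_shared, int asid, VirtPage vp) {
  PhysPage pp;
  if (l1_private.lookup(asid, vp, pp)) return pp;     // per-core private TLB hit
  if (l2_shared.lookup(asid, vp, pp)) {               // shared TLB hit
    l1_private.insert(asid, vp, pp);
    return pp;
  }
  pp = walk_page_table(asid, vp);                     // shared page table walkers
  l2_shared.insert(asid, vp, pp);
  l1_private.insert(asid, vp, pp);
  return pp;
}
\end{verbatim}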

7.1 Inter-address-space Interference on Multiple GPU Applications

While concurrently executing multiple GPGPU applications that have complementary resource demands can improve GPU utilization, these applications also share two critical resources: the shared address translation unit and the shared TLB. We find that when multiple applications spatially share the GPU, there is a significant amount of thrashing on the shared TLB within the GPU, because multiple applications from different address spaces contend at the shared TLB, the page table walker, and the shared L2 data cache. We define this phenomenon as inter-address-space interference.

The amount of parallelism on GPUs further exacerbates the performance impact of inter-address-space interference. We find that the address translation triggered by a single TLB miss typically stalls tens of warps. As a result, a small number of outstanding TLB misses can cause a significant number of warps to become unschedulable, which in turn limits the GPU’s most essential latency-hiding capability. We observe that providing address translation in GPUs reduces GPU performance to 47.3% of that of an ideal GPU with no address translation, which is a significant performance overhead. As a result, it is even more crucial to mitigate this inter-address-space interference throughout the GPU memory hierarchy in multi-GPU-application environments. Chapters \thechapter and \thechapter provide detailed design descriptions of the two mechanisms we propose that can be used to reduce this inter-address-space interference.

Chapter \thechapter Related Works on Resource Management in Systems with GPUs

Several previous works have proposed mechanisms to address the memory interference problem in systems with GPUs. These previous proposals address certain parts of the memory hierarchy. In this chapter, we first provide background on the GPU’s execution model. Then, we provide breakdowns of previous works on GPU resource management throughout the memory hierarchy, as well as the differences between these previous works and the techniques presented in this dissertation.

8 Background on the Execution Model of GPUs

Modern-day GPUs employ two main techniques to enable their parallel processing power: SIMD, which executes a single instruction on multiple pieces of data, and fine-grain multithreading, which prevents the GPU cores from stalling by issuing instructions from different threads every cycle. This section provides background on previous machines and processors that apply similar techniques.

8.1 SIMD and Vector Processing

The SIMD execution model, which includes vector processing, has been used by several machines in the past. Slotnik et al.’s Solomon Computer [388], Senzig and Smith [370], Crane and Guthens [87], Hellerman [158], the CDC 7600 [84], the CDC STAR-100 [369], the Illiac IV [48], and the Cray I [364] are examples of machines that employ a vector processor. In modern systems, Intel MMX [177, 336] and Intel SSE [179] also apply SIMD in order to improve performance. As an alternative to using one instruction to operate on multiple pieces of data, VLIW [115] compilers generate code for a parallel machine that allows multiple instructions to operate on multiple pieces of data concurrently in a single cycle. The Intel i860 [137] and Intel Itanium [268] are examples of processors with VLIW technology.

8.2 Fine-grained Multithreading

Fine-grain multithreading, a technique that allows the processor to issue instructions from different threads every cycle, is the key component that enables the latency-hiding capability of modern-day GPUs. The CDC 6600 [410, 411], Denelcor HEP [389], MASA [148], APRIL [13], and Tera MTA [26] are examples of machines that utilize fine-grain multithreading.

9 Background on Techniques to Reduce Interference of Shared Resources

Several techniques have been proposed to reduce interference at the shared cache, the shared off-chip main memory, and the shared interconnect. In this section, we provide a brief discussion of these works.

9.1 Cache Bypassing Techniques


Hardware-based Cache Bypassing Techniques. Several hardware-based cache bypassing mechanisms have been proposed in both CPU and GPU setups. Li et al. propose PCAL, a bypassing mechanism that addresses the cache thrashing problem by throttling the number of threads that time-share the cache at any given time [247]. The key idea of PCAL is to limit the number of threads that get to access the cache. Li et al. [246] propose a cache bypassing mechanism that allows only threads with high reuse to utilize the cache. The key idea is to use locality filtering based on the reuse characteristics of GPGPU applications, with only high reuse threads having access to the cache. Xie et al. [439] propose a bypassing mechanism at the thread block level. In their mechanism, the compiler statically marks whether thread blocks prefer caching or bypassing. At runtime, the mechanism dynamically selects a subset of thread blocks to use the cache, to increase cache utilization. Chen et al. [78, 79] propose a combined warp throttling and bypassing mechanism for the L1 cache based on the cache-conscious warp scheduler [359]. The key idea is to bypass the cache when resource contention is detected. This is done by embedding history information into the L2 tag arrays. The L1 cache uses this information to perform bypassing decisions, and only warps with high reuse are allowed to access the L1 cache. Jia et al. propose an L1 bypassing mechanism [188], whose key idea is to bypass requests when there is an associativity stall. Dai et al. propose a mechanism to bypass cache based on a model of a cache miss rate [89].

There are also several other CPU-based cache bypassing techniques. These techniques include using additional buffers to track cache statistics in order to predict cache blocks that have high utility, based on reuse count [195, 127, 446, 215, 106, 76, 435, 252], reuse distance [99, 146, 76, 114, 326, 443, 124, 434], the behavior of the cache block [185], or the miss rate [82, 414].



Software-based Cache Bypassing Techniques. Because GPUs allow software to specify whether or not to utilize the cache [316, 317], software-based cache bypassing techniques have also been proposed to improve system performance. Li et al. [245] propose a compiler-based technique that performs cache bypassing using a method similar to PCAL [247]. Xie et al. [438] propose a mechanism that allows the compiler to perform cache bypassing for global load instructions. Both of these mechanisms apply bypassing to all loads and stores that utilize the shared cache, without requiring additional characterization at the compiler level. Mekkat et al. [270] propose a bypassing mechanism for when a CPU and a GPU share the last-level cache. Their key idea is to bypass GPU cache accesses when CPU applications are cache sensitive, which is not applicable to GPU-only execution.

9.2 Cache Insertion and Replacement Policies

Many works have proposed different insertion policies for CPU systems (e.g., [347, 379, 183, 184]). Dynamic Insertion Policy (DIP) [183] and Dynamic Re-Reference Interval Prediction (DRRIP) [184] are insertion policies that account for cache thrashing. The downside of these two policies is that they are unable to distinguish between high-reuse and low-reuse blocks in the same thread [379]. The Bi-modal Insertion Policy [347] dynamically characterizes the cache blocks being inserted. None of these works on cache insertion and replacement policies [347, 379, 183, 184] take warp type characteristics or memory divergence behavior into account.

9.3 Cache and Memory Partitioning Techniques

Instead of mitigating the interference problem between applications by scheduling requests at the memory controller, Awasthi et al. propose a mechanism that spreads data in the same working set across memory channels in order to increase memory level parallelism [42]. Muralidhara et al. propose memory channel partitioning (MCP) to map applications to different memory channels based on their memory-intensities and row-buffer locality to reduce inter-application interference [287]. Mao et al. propose to partition GPU channels and only allow a subset of threads to access each memory channel [266]. In addition to channel partitioning, several works also propose to partition DRAM banks [437, 171, 255] and the shared cache [401, 350] to improve performance. These partitioning techniques are orthogonal to our proposals and can be combined to improve the performance of GPU-based systems.

9.4 Memory Scheduling on CPUs

Memory scheduling algorithms improve system performance by reordering memory requests to deal with the different constraints and behaviors of DRAM. The first-ready-first-come-first-serve (FR-FCFS) [357] algorithm attempts to schedule requests that result in row-buffer hits (first-ready), and otherwise prioritizes older requests (FCFS). FR-FCFS increases DRAM throughput, but it can cause fairness problems by under-servicing applications with low row-buffer locality. Ebrahimi et al. [103] propose PAM, a memory scheduler that prioritizes critical threads in order to improve the performance of multithreaded applications. Ipek et al. propose a self-optimizing memory scheduler that improves system performance with reinforcement learning [405]. Mukundan and Martinez propose MORSE, a self-optimizing reconfigurable memory scheduler [285]. Lee et al. propose two prefetch-aware memory scheduling designs [234, 237]. Stuecheli et al. [397] and Lee et al. [236] propose memory schedulers that are aware of writeback requests. Seshadri et al. [372] propose to simplify the implementation of row-locality-aware writeback by exploiting the dirty-block index. Several application-aware memory scheduling algorithms [282, 220, 221, 292, 293, 398, 402] have been proposed to balance both performance and fairness. Parallelism-aware Batch Scheduling (PAR-BS) [293] batches requests based on their arrival times (older requests batched first). Within a batch, applications are ranked to preserve bank-level parallelism (BLP) within an application’s requests. Kim et al. propose ATLAS [220], which prioritizes applications that have received the least memory service. As a result, applications with low memory intensities, which typically attain low memory service, are prioritized. However, applications with high memory intensities are deprioritized and hence slowed down significantly, resulting in unfairness. Kim et al. further propose TCM [221], which addresses the unfairness problem in ATLAS. TCM first clusters applications into low and high memory-intensity clusters based on their memory intensities. TCM always prioritizes applications in the low memory-intensity cluster; among the high memory-intensity applications, it shuffles request priorities to prevent unfairness. Ghose et al. propose a memory scheduler that takes into account the criticality of each load and prioritizes loads that are more critical to CPU performance [131]. Subramanian et al. propose MISE [402], a memory scheduler that estimates the slowdowns of applications and prioritizes applications that are likely to be slowed down the most. Subramanian et al. also propose BLISS [398, 400], a mechanism that separates applications into a group that interferes with other applications and a group that does not, and prioritizes the latter group to increase performance and fairness. Xiong et al. propose DMPS, a memory scheduler with a ranking based on latency sensitivity [440]. Liu et al. propose LAMS, a memory scheduler that prioritizes requests based on the latency of servicing each memory request [256].
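As a concrete reference point, the following is a minimal sketch of the FR-FCFS policy described at the start of this subsection, assuming a flat per-bank request queue; a real scheduler must also respect DRAM timing constraints and operate across many banks and channels.

\begin{verbatim}
#include <cstdint>
#include <vector>

struct MemRequest {
  uint64_t arrival_time;
  uint64_t row;
};

// FR-FCFS: among queued requests, prefer the oldest one that hits the
// currently open row ("first-ready"); if none exists, fall back to the
// oldest request overall (FCFS).
int fr_fcfs_pick(const std::vector<MemRequest> &queue,
                 bool row_open, uint64_t open_row) {
  int oldest = -1, oldest_hit = -1;
  for (int i = 0; i < int(queue.size()); ++i) {
    if (oldest < 0 || queue[i].arrival_time < queue[oldest].arrival_time)
      oldest = i;
    if (row_open && queue[i].row == open_row &&
        (oldest_hit < 0 || queue[i].arrival_time < queue[oldest_hit].arrival_time))
      oldest_hit = i;
  }
  return (oldest_hit >= 0) ? oldest_hit : oldest;   // -1 if the queue is empty
}
\end{verbatim}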

9.5 Memory Scheduling on GPUs

Since GPU applications are bandwidth intensive, often with streaming access patterns, a policy that maximizes the number of row-buffer hits is effective for maximizing overall GPU throughput. As a result, FR-FCFS with a large request buffer tends to perform well for GPUs [46]. In view of this, previous work [445] designed mechanisms to reduce the complexity of row-hit-first (FR-FCFS) scheduling. Jeong et al. propose a QoS-aware memory scheduler that guarantees the performance of GPU applications by prioritizing graphics applications over CPU applications until the system can guarantee that a frame will be rendered within its deadline, and prioritizes CPU applications afterward [187]. Jog et al. [194] propose CLAM, a memory scheduler that identifies critical memory requests and prioritizes them in the main memory.

Aside from CPU-GPU heterogeneous systems, Usui et al. propose SQUASH [416] and DASH [417], which are accelerator-aware memory controller designs that improve the performance of systems with CPUs and hardware accelerators. Zhao et al. propose FIRM, a memory controller design that improves the performance of systems with persistent memory [450].

9.6 DRAM Designs

Aside from memory scheduling and memory partitioning techniques, previous works propose new designs that are capable of reducing memory latency in conventional DRAM [69, 71, 70, 240, 241, 239, 238, 222, 242, 276, 367, 151, 160, 165, 210, 391, 262, 75, 338, 382, 67, 320, 452, 431, 22, 21, 72, 289, 295] as well as non-volatile memory [275, 227, 273, 442, 351, 348, 231, 233, 232, 274]. Previous works on bulk data transfer [143, 144, 198, 65, 448, 371, 172, 451, 189, 374, 260, 71] and in-memory computation [17, 20, 112, 125, 145, 329, 373, 265, 95, 377, 375, 164, 163, 60, 395, 119, 132, 200, 224, 322, 404, 447, 23, 218, 130, 59, 43, 126, 346, 330, 376] can be used to improve the effective DRAM bandwidth. Techniques to reduce the overhead of DRAM refresh [254, 419, 53, 250, 16, 296, 253, 211, 212, 213, 214, 327, 349, 217, 44, 15, 321, 349] can be applied to improve the performance of GPU-based systems. Data compression techniques can also be applied to main memory to increase the effective DRAM bandwidth [335, 334, 332, 333, 425]. These techniques can be used to mitigate the performance impact of memory interference and improve the performance of GPU-based systems. They are orthogonal to, and can be combined with, the techniques proposed in this dissertation.

Previous works on data prefetching can also be used to mitigate high DRAM latency [234, 380, 237, 299, 394, 229, 24, 45, 64, 88, 104, 197, 166, 196, 85, 105, 101, 291, 294, 290, 152, 235, 153]. However, these techniques generally increase DRAM bandwidth consumption, which can lead to lower GPU performance.

Recent works [422, 424] propose cross-layer abstractions that enable the programmer to better manage GPU memory system resources by expressing semantic information about high-level data structures.

9.7 Interconnect Contention Management

Aside from the shared cache and the shared off-chip main memory, the on-chip interconnect is another shared resource in the GPU memory hierarchy. While this dissertation does not focus on contention in the shared on-chip interconnect, many previous works provide mechanisms to reduce such contention. These include works on hierarchical on-chip network designs [34, 35, 353, 449, 149, 354, 138, 92, 147, 98], low-cost router designs [219, 34, 35, 286, 2, 223, 139], bufferless interconnect designs [47, 133, 156, 225, 283, 109, 68, 110, 111, 318, 319, 389, 26, 161], and Quality-of-Service-aware interconnect designs [142, 141, 140, 93, 94, 113, 91, 279].

10 Background on Memory Management Unit and Address Translation Designs

Aside from the caches and the main memory, the memory management unit (MMU) is another important component in the memory hierarchy. The MMU provides address translation for applications running on the GPU. When multiple GPGPU applications run concurrently, the MMU also provides memory protection across the different virtual address spaces that concurrently use GPU memory. This section first introduces previous works on the concurrent execution of GPGPU applications. Then, we provide background on previous works on TLB designs that aid address translation.

10.1 Background on Concurrent Execution of GPGPU Applications

\paragraphbe

Concurrent Kernels and GPU Multiprogramming. The opportunity to improve utilization with concurrency is well recognized, but previous proposals [323, 430, 248] do not support memory protection. Adriaens et al. [4] observe the need for spatial sharing across protection domains, but do not propose or evaluate a design. NVIDIA GRID [159] and AMD FirePro [9] support static partitioning of hardware to allow kernels from different VMs to run concurrently; because the partitions are determined at startup, this causes fragmentation and under-utilization. The goal of our proposal, MASK, is flexible dynamic partitioning of shared resources. NVIDIA's Multi Process Service (MPS) [314] allows multiple processes to launch kernels on the GPU, but the service provides no memory protection or error containment. Xu et al. [441] propose Warped-Slicer, a mechanism that allows multiple applications to spatially share a GPU core. Similar to MPS, Warped-Slicer provides no memory protection, and is thus not suitable for supporting multiple applications in a multi-tenant cloud setting.

\paragraphbe

Preemption and Context Switching. Preemptive context switching is an active research area [409, 129, 430]. Current architectural support [251, 315] will likely improve in future GPUs. Preemption and spatial multiplexing are complementary to the goal of this dissertation, and exploring techniques to combine them is future work.

\paragraphbe

GPU Virtualization. Most current hypervisor-based full virtualization techniques for GPGPUs [206, 406, 413] must support a virtual device abstraction without the dedicated hardware support for VDI found in GRID [159] and FirePro [9]. Key components missing from these proposals include support for dynamic partitioning of hardware resources and efficient techniques for handling over-subscription. The performance overheads incurred by some of these designs argue strongly for hardware assists such as those we propose. By contrast, API-remoting solutions such as vmCUDA [429] and rCUDA [97] provide near-native performance, but require modifications to guest software and sacrifice both isolation and compatibility.

\paragraphbe

Other Methods to Enable Virtual Memory. Vesely et al. analyze support for virtual memory in heterogeneous systems [420], finding that the cost of address translation in GPUs is an order of magnitude higher than in CPUs, and that high-latency address translation limits the GPU's latency-hiding capability and hurts performance (an observation in line with our own findings). We show additionally that thrashing due to interference further slows down applications sharing the GPU. Our proposal, MASK, is capable not only of reducing interference between multiple applications, but also of reducing the TLB miss rate in single-application scenarios. We expect that our techniques are applicable to CPU-GPU heterogeneous systems.

Direct segments [51] and redundant memory mappings [201] reduce address translation overheads by mapping large contiguous regions of virtual memory to contiguous physical memory, which increases the reach of TLB entries. These techniques are complementary to those in MASK, and may eventually become relevant in GPU settings as well.

\paragraphbe

Demand Paging in GPUs. Demand paging is an important functionality for memory virtualization that is challenging for GPUs [420]. Recent works, including AMD's hUMA [12] and NVIDIA's Pascal architecture [453, 315], support demand paging in GPUs. As we identify in Mosaic, these techniques can be costly in a GPU environment.

10.2 TLB Designs

\paragraphbe

GPU TLB Designs. Previous works have explored the design space for TLBs in heterogeneous systems with GPUs [83, 343, 342, 420], and the adaptation of x86-like TLBs to a heterogeneous CPU-GPU setting [343]. Key elements in these designs include probing the TLB after L1 coalescing to reduce the number of parallel TLB requests, shared concurrent page table walks, and translation caches to reduce main memory accesses. Our proposal, MASK, owes much to these designs, but we show empirically that contention patterns at the shared L2 TLB require additional support to accommodate cross-context contention. Cong et al. propose a TLB design similar to our baseline GPU-MMU design [83]. However, this design utilizes the host (CPU) MMU to perform page walks, which is inapplicable in the context of multi-application GPUs. Pichai et al. [342] explore TLB designs for heterogeneous CPU-GPU systems, and add TLB awareness to the existing CCWS GPU warp scheduler [359], which enables parallel TLB access at the L1 cache level, similar in concept to the Powers design [343]. Warp scheduling is orthogonal to our work: incorporating a TLB-aware CCWS warp scheduler into MASK could further improve performance.

\paragraphbe

CPU TLB Designs. Bhattacharjee et al. examine shared last-level TLB designs [57] as well as page walk cache designs [54], proposing a mechanism that accelerates multithreaded applications by sharing translations between cores. However, these proposals are likely to be less effective for multiple concurrent GPGPU applications, because translations are not shared between virtual address spaces. Barr et al. propose SpecTLB [50], which speculatively predicts address translations to avoid the TLB miss latency. Speculatively predicting address translations can be complicated and costly in GPUs, because there can be multiple concurrent TLB misses to many different TLB entries.

\paragraphbe

Mechanisms to Support Multiple Page Sizes. TLB miss overheads can be reduced by accelerating page table walks [49, 54], by reducing their frequency [122], or by reducing the number of TLB misses (e.g., through prefetching [56, 199, 368], prediction [325], or structural changes to the TLB [408, 340, 339] or the TLB hierarchy [55, 263, 393, 19, 18, 201, 51, 123]). Multipage mapping techniques [408, 340, 339] map multiple pages with a single TLB entry, improving TLB reach by a small factor (e.g., 8 or 16); much greater improvements to TLB reach are needed to deal with modern memory sizes. Direct segments [51, 123] extend standard paging with a large segment that maps the majority of an address space to a contiguous physical memory region, but they require application modifications and are limited to workloads whose data fits within a single large segment. Redundant memory mappings (RMM) [201] extend TLB reach by mapping ranges of virtually and physically contiguous pages in a range TLB.
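For reference, TLB reach (the amount of memory that the TLB can map at once) is the product of the number of TLB entries and the page size. The numbers below are an illustrative example of why larger pages matter, not measurements from this dissertation:
\[
\text{TLB reach} = \text{number of entries} \times \text{page size},
\qquad \text{e.g., } 512 \times 4\,\text{KB} = 2\,\text{MB}
\quad \text{vs.} \quad 512 \times 2\,\text{MB} = 1\,\text{GB}.
\]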

A number of related works propose hardware support to recover and expose address space contiguity. GLUE [341] groups contiguous, aligned small page translations under a single speculative large page translation in the TLB. Speculative translations (similar to SpecTLB [50]) can be verified by off-critical-path page table walks, reducing effective page-table walk latency. GTSM [96] provides hardware support to leverage the address space contiguity of physical memory even when pages have been retired due to bit errors. Were such features to become available, hardware mechanisms for preserving address space contiguity could reduce the overheads induced by proactive compaction, which is a feature we introduce in our proposal, Mosaic.

The policies and mechanisms used to implement transparent large page support in Mosaic are informed by a wealth of previous research on operating system support for large pages on CPUs. Navarro et al. [298] identify contiguity-awareness and fragmentation reduction as primary concerns for large page management, proposing reservation-based allocation and deferred promotion of base pages to large pages. These ideas are widely used in modern operating systems [412]. Ingens [228] eschews reservation-based allocation in favor of utilization-based promotion, using a bit vector that tracks the spatial and temporal utilization of base pages and implementing promotion and demotion asynchronously rather than in a page fault handler. These basic ideas heavily inform Mosaic's design, which attempts to emulate these same policies in hardware. In contrast to Ingens, Mosaic can rely on dedicated hardware to provide access frequency and distribution information, and need not infer it by sampling access bits whose granularity may be a poor fit for the page size.

Gorman et al. [134] propose a placement policy for an operating system's physical page allocator that mitigates fragmentation and promotes address space contiguity by grouping pages according to relocatability. Subsequent work [135] proposes a software-exposed interface for applications to explicitly request large pages, such as libhugetlbfs [249]. These ideas are complementary to the ideas presented in this dissertation. Mosaic can plausibly benefit from similar policies simplified to be hardware-implementable, and we leave that investigation as future work.

Chapter \thechapter Reducing Intra-application Interference with Memory Divergence Correction

Graphics Processing Units (GPUs) have enormous parallel processing power to leverage thread-level parallelism. GPU applications can be broken down into thousands of threads, allowing GPUs to use fine-grained multithreading [410, 390] to prevent GPU cores from stalling due to dependencies and long memory latencies. Ideally, there should always be available threads for GPU cores to continue execution, preventing stalls within the core. GPUs also take advantage of the SIMD (Single Instruction, Multiple Data) execution model [116]. The thousands of threads within a GPU application are clustered into work groups (or thread blocks), with each thread block consisting of multiple smaller bundles of threads that are run concurrently. Each such thread bundle is called a wavefront [11] or warp [251]. In each cycle, each GPU core executes a single warp. Each thread in a warp executes the same instruction (i.e., is at the same program counter). Combining SIMD execution with fine-grained multithreading allows a GPU to complete several hundred operations every cycle in the ideal case.

In the past, GPUs strictly executed graphics applications, which naturally exhibit large amounts of concurrency. In recent years, with tools such as CUDA [313] and OpenCL [216], programmers have been able to adapt non-graphics applications to GPUs, writing these applications to have thousands of threads that can be run on a SIMD computation engine. Such adapted non-graphics programs are known as general-purpose GPU (GPGPU) applications. Prior work has demonstrated that many scientific and data analysis applications can be executed significantly faster when programmed to run on GPUs [77, 396, 157, 63].

While many GPGPU applications can tolerate a significant amount of memory latency due to their parallelism and the use of fine-grained multithreading, many previous works (e.g., [425, 193, 297, 192]) observe that GPU cores still stall for a significant fraction of time when running many other GPGPU applications. One significant source of these stalls is memory divergence, where the threads of a warp reach a memory instruction, and some of the threads’ memory requests take longer to service than the requests from other threads [297, 271, 74]. Since all threads within a warp operate in lockstep due to the SIMD execution model, the warp cannot proceed to the next instruction until the slowest request within the warp completes, and all threads are ready to continue execution. Figures 5a and 5b show examples of memory divergence within a warp, which we will explain in more detail soon.

Figure 5: Memory divergence within a warp. (a) and (b) show the heterogeneity between mostly-hit and mostly-miss warps, respectively. (c) and (d) show the change in stall time from converting mostly-hit warps into all-hit warps, and mostly-miss warps into all-miss warps, respectively.

In this work, we make three new key observations about the memory divergence behavior of GPGPU warps:

\paragraphbe

Observation 1: There is heterogeneity across warps in the degree of memory divergence experienced by each warp at the shared L2 cache (i.e., the percentage of threads within a warp that miss in the cache varies widely). Figure 5 shows examples of two different types of warps, with eight threads each, that exhibit different degrees of memory divergence:

  • Figure 5a shows a mostly-hit warp, where most of the warp’s memory accesses hit in the cache (1). However, a single access misses in the cache and must go to main memory (2). As a result, the entire warp is stalled until the much longer cache miss completes.

  • Figure 5b shows a mostly-miss warp, where most of the warp’s memory requests miss in the cache (3), resulting in many accesses to main memory. Even though some requests are cache hits (4), these do not benefit the execution time of the warp.

\paragraphbe

Observation 2: A warp tends to retain its memory divergence behavior (e.g., whether or not it is mostly-hit or mostly-miss) for long periods of execution, and is thus predictable. As we show in Section 13, this predictability enables us to perform history-based warp divergence characterization.

\paragraphbe

Observation 3: Due to the amount of thread parallelism within a GPU, a large number of memory requests can arrive at the L2 cache in a small window of execution time, leading to significant queuing delays. Prior work observes high access latencies for the shared L2 cache within a GPU [386, 385, 433], but does not identify why these latencies are so high. We show that when a large number of requests arrive at the L2 cache, both the limited number of read/write ports and backpressure from cache bank conflicts force many of these requests to queue up for long periods of time. We observe that this queuing latency can sometimes add hundreds of cycles to the cache access latency, and that non-uniform queuing across the different cache banks exacerbates memory divergence.

Based on these three observations, we aim to devise a mechanism that has two major goals: (1) convert mostly-hit warps into all-hit warps (warps where all requests hit in the cache, as shown in Figure 5c), and (2) convert mostly-miss warps into all-miss warps (warps where none of the requests hit in the cache, as shown in Figure 5d). As we can see in Figure 5a, the stall time due to memory divergence for the mostly-hit warp can be eliminated by converting only the single cache miss (2) into a hit. Doing so requires additional cache space. If we convert the two cache hits of the mostly-miss warp (Figure 5b, 4) into cache misses, we can cede the cache space previously used by these hits to the mostly-hit warp, thus converting the mostly-hit warp into an all-hit warp. Though the mostly-miss warp is now an all-miss warp (Figure 5d), it incurs no extra stall penalty, as the warp was already waiting on the other six cache misses to complete. Additionally, now that it is an all-miss warp, we predict that its future memory requests will also not be in the L2 cache, so we can simply have these requests bypass the cache. In doing so, the requests from the all-miss warp can completely avoid unnecessary L2 access and queuing delays. This decreases the total number of requests going to the L2 cache, thus reducing the queuing latencies for requests from mostly-hit and all-hit warps, as there is less contention.

We introduce Memory Divergence Correction (MeDiC), a GPU-specific mechanism that exploits memory divergence heterogeneity across warps at the shared cache and at main memory to improve the overall performance of GPGPU applications. MeDiC consists of three different components, which work together to achieve our goals of converting mostly-hit warps into all-hit warps and mostly-miss warps into all-miss warps: (1) a warp-type-aware cache bypassing mechanism, which prevents requests from mostly-miss and all-miss warps from accessing the shared L2 cache (Section 13.2); (2) a warp-type-aware cache insertion policy, which prioritizes requests from mostly-hit and all-hit warps to ensure that they all become cache hits (Section 13.3); and (3) a warp-type-aware memory scheduling mechanism, which prioritizes requests from mostly-hit warps that were not successfully converted to all-hit warps, in order to minimize the stall time due to divergence (Section 13.4). These three components are all driven by an online mechanism that can identify the expected memory divergence behavior of each warp (Section 13.1).

This dissertation makes the following contributions:

  • We observe that the different warps within a GPGPU application exhibit heterogeneity in their memory divergence behavior at the shared L2 cache, and that some warps do not benefit from the few cache hits that they have. This memory divergence behavior tends to remain consistent throughout long periods of execution for a warp, allowing for fast, online warp divergence characterization and prediction.

  • We identify a new performance bottleneck in GPGPU application execution that can contribute significantly to memory divergence: due to the very large number of memory requests issued by warps in GPGPU applications that contend at the shared L2 cache, many of these requests experience high cache queuing latencies.

  • Based on our observations, we propose Memory Divergence Correction, a new mechanism that exploits the stable memory divergence behavior of warps to (1) improve the effectiveness of the cache by favoring warps that take the most advantage of the cache, (2) address the cache queuing problem, and (3) improve the effectiveness of the memory scheduler by favoring warps that benefit most from prioritization. We compare MeDiC to four different cache management mechanisms, and show that it improves performance by 21.8% and energy efficiency by 20.1% across a wide variety of GPGPU workloads compared to a state-of-the-art GPU cache management mechanism [247].

11 Background

We first provide background on the architecture of a modern GPU, and then we discuss the bottlenecks that highly-multithreaded applications can face when executed on a GPU. These applications can be compiled using OpenCL [216] or CUDA [313], either of which converts a general purpose application into a GPGPU program that can execute on a GPU.

11.1 Baseline GPU Architecture

A typical GPU consists of several shader cores (sometimes called streaming multiprocessors, or SMs). In this work, we set the number of shader cores to 15, with 32 threads per warp in each core, corresponding to the NVIDIA GTX480 GPU based on the Fermi architecture [310]. The GPU we evaluate can issue up to 480 concurrent memory accesses per cycle [415]. Each core has its own private L1 data, texture, and constant caches, as well as a scratchpad memory [311, 310, 251]. In addition, the GPU also has several shared L2 cache slices and memory controllers. A memory partition unit combines a single L2 cache slice (which is banked) with a designated memory controller that connects to off-chip main memory. Figure 6 shows a simplified view of how the cores (or SMs), caches, and memory partitions are organized in our baseline GPU.

Figure 6: Overview of the baseline GPU architecture.

11.2 Bottlenecks in GPGPU Applications

Several previous works have analyzed the benefits and limitations of using a GPU for general purpose workloads (other than graphics purposes), including characterizing the impact of microarchitectural changes on applications [46] or developing performance models that break down performance bottlenecks in GPGPU applications [243, 383, 162, 257, 264, 136]. All of these works show benefits from using a throughput-oriented GPU. However, a significant number of applications are unable to fully utilize all of the available parallelism within the GPU, leading to periods of execution where no warps are available for execution [425].

When there are no available warps, the GPU cores stall, and the application stops making progress until a warp becomes available. Prior work has investigated two problems that can delay some warps from becoming available for execution: (1) branch divergence, which occurs when a branch in the same SIMD instruction resolves into multiple different paths [46, 436, 150, 120, 297], and (2) memory divergence, which occurs when the simultaneous memory requests from a single warp spend different amounts of time retrieving their associated data from memory [297, 271, 74]. In this work, we focus on the memory divergence problem; prior work on branch divergence is complementary to our work.

12 Motivation and Key Observations

We make three new key observations about memory divergence (at the shared L2 cache). First, we observe that the degree of memory divergence can differ across warps. This inter-warp heterogeneity affects how well each warp takes advantage of the shared cache. Second, we observe that a warp’s memory divergence behavior tends to remain stable for long periods of execution, making it predictable. Third, we observe that requests to the shared cache experience long queuing delays due to the large amount of parallelism in GPGPU programs, which exacerbates the memory divergence problem and slows down GPU execution. Next, we describe each of these observations in detail and motivate our solutions.

12.1 Exploiting Heterogeneity Across Warps

We observe that different warps have different amounts of sensitivity to memory latency and cache utilization. We study the cache utilization of a warp by determining its hit ratio, the percentage of memory requests that hit in the cache when the warp issues a single memory instruction. As Figure 7 shows, the warps from each of our three representative GPGPU applications are distributed across all possible ranges of hit ratio, exhibiting significant heterogeneity. To better characterize warp behavior, we break the warps down into the five types shown in Figure 8 based on their hit ratios: all-hit, mostly-hit, balanced, mostly-miss, and all-miss.

Figure 7: L2 cache hit ratio of different warps in three representative GPGPU applications (see Section 14 for methods).
Figure 8: Warp type categorization based on the shared cache hit ratios. Hit ratio values are empirically chosen.

This inter-warp heterogeneity in cache utilization provides new opportunities for performance improvement. We illustrate two such opportunities by walking through a simplified example, shown in Figure 9. Here, we have two warps, A and B, where A is a mostly-miss warp (with three of its four memory requests being L2 cache misses) and B is a mostly-hit warp with only a single L2 cache miss (request B0). Let us assume that warp A is scheduled first.

Figure 9: (a) Existing inter-warp heterogeneity, (b) exploiting the heterogeneity with MeDiC to improve performance.

As we can see in Figure 9a, the mostly-miss warp A does not benefit at all from the cache: even though one of its requests (A3) hits in the cache, warp A cannot continue executing until all of its memory requests are serviced. As the figure shows, using the cache to speed up only request A3 has no material impact on warp A’s stall time. In addition, while requests A1 and A2 do not hit in the cache, they still incur a queuing latency at the cache while they wait to be looked up in the cache tag array.

On the other hand, the mostly-hit warp B can be penalized significantly. First, since warp B is scheduled after the mostly-miss warp A, all four of warp B’s requests incur a large L2 queuing delay, even though the cache was not useful to speed up warp A. On top of this unproductive delay, since request B0 misses in the cache, it holds up execution of the entire warp while it gets serviced by main memory. The overall effect is that despite having many more cache hits (and thus much better cache utility) than warp A, warp B ends up stalling for as long as or even longer than the mostly-miss warp A stalled for.

To remedy this problem, we set two goals (Figure 9b):

1) Convert the mostly-hit warp B into an all-hit warp. By converting B0 into a hit, warp B no longer has to stall on any memory misses, which enables the warp to become ready to execute much earlier. This requires a little additional space in the cache to store the data for B0.

2) Convert the mostly-miss warp A into an all-miss warp. Since a single cache hit has no effect on warp A’s execution, we convert A3 into a cache miss. This frees up the cache space A3 was using, and thus creates cache space for storing B0. In addition, warp A’s requests can now skip accessing the cache and go straight to main memory, which has two benefits: A0–A2 complete faster because they no longer experience the cache queuing delay that they incurred in Figure 9a, and B0–B3 also complete faster because they must queue behind a smaller number of cache requests. Thus, bypassing the cache for warp A’s requests allows both warps to stall for less time, improving GPU core utilization.

To realize these benefits, we propose to (1) develop a mechanism that can identify mostly-hit and mostly-miss warps; (2) design a mechanism that allows mostly-miss warps to yield their ineffective cache space to mostly-hit warps, similar to how the mostly-miss warp A in Figure 9a turns into an all-miss warp in Figure 9b, so that warps such as the mostly-hit warp B can become all-hit warps; (3) design a mechanism that bypasses the cache for requests from mostly-miss and all-miss warps such as warp A, to decrease warp stall time and reduce lengthy cache queuing latencies; and (4) prioritize requests from mostly-hit warps across the memory hierarchy, at both the shared L2 cache and at the memory controller, to minimize their stall time as much as possible, similar to how the mostly-hit warp B in Figure 9a turns into an all-hit warp in Figure 9b.

A key challenge is how to group warps into different warp types. In this work, we observe that warps tend to exhibit stable cache hit behavior over long periods of execution. A warp consists of several threads that repeatedly loop over the same instruction sequences. This results in similar hit/miss behavior at the cache level across different instances of the same warp. As a result, a warp measured to have a particular hit ratio is likely to maintain a similar hit ratio throughout a lengthy phase of execution. We observe that most CUDA applications exhibit this trend.

Figure 10 shows the hit ratio over a duration of one million cycles, for six randomly selected warps from our CUDA applications. We also plot horizontal lines to illustrate the hit ratio cutoffs that we set in Figure 8 for our mostly-hit (70%) and mostly-miss (20%) warp types. Warps 1, 3, and 6 spend the majority of their time with high hit ratios, and are classified as mostly-hit warps. Warps 1 and 3 do, however, exhibit some long-term (i.e., 100k+ cycles) shifts to the balanced warp type. Warps 2 and 5 spend a long time as mostly-miss warps, though they both experience a single long-term shift into balanced warp behavior. As we can see, warps tend to remain in the same warp type at least for hundreds of thousands of cycles.

Figure 10: Hit ratio of randomly selected warps over time.

As a result of this relatively stable behavior, our mechanism, MeDiC (described in detail in Section 13), samples the hit ratio of each warp and uses this data for warp characterization. To account for the long-term hit ratio shifts, MeDiC resamples the hit ratio every 100k cycles.

12.2 Reducing the Effects of L2 Queuing Latency

Unlike CPU applications, GPGPU applications can issue as many as hundreds of memory instructions per cycle. All of these memory requests can arrive concurrently at the L2 cache, which is the first shared level of the memory hierarchy, creating a bottleneck. Previous works [46, 386, 433, 385] point out that the latency for accessing the L2 cache can take hundreds of cycles, even though the nominal cache lookup latency is significantly lower (only tens of cycles). While they identify this disparity, these earlier efforts do not identify or analyze the source of these long delays.

We make a new observation that identifies an important source of the long L2 cache access delays in GPGPU systems. L2 bank conflicts can cause queuing delays that differ from one bank to another, leading to a disparity in cache access latencies across banks. As Figure 11a shows, even if every cache access within a warp hits in the L2 cache, each access can incur a different cache latency due to non-uniform queuing, and the warp has to stall until the slowest cache access retrieves its data (i.e., memory divergence can occur). For each set of simultaneous requests issued by an all-hit warp, we define its inter-bank divergence penalty to be the difference between the fastest cache hit and the slowest cache hit, as depicted in Figure 11a.

Figure 11: Effect of bank queuing latency divergence in the L2 cache: (a) example of the impact on stall time of skewed queuing latencies, (b) inter-bank divergence penalty due to skewed queuing for all-hit warps, in cycles.

In order to confirm this behavior, we modify GPGPU-Sim [46] to accurately model L2 bank conflicts and queuing delays (see Section 14 for details). We then measure the average and maximum inter-bank divergence penalty observed only for all-hit warps in our different CUDA applications, shown in Figure 11b. We find that on average, an all-hit warp has to stall for an additional 24.0 cycles because some of its requests go to cache banks with high access contention.
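As a concrete illustration of this metric, the C++ sketch below computes the inter-bank divergence penalty of one all-hit warp from the service latencies of its simultaneous requests. The function and variable names are our own and are not taken from the simulator.

#include <algorithm>
#include <cstdint>
#include <vector>

// Inter-bank divergence penalty for an all-hit warp, as defined above: the
// difference between the slowest and the fastest cache hit among the
// simultaneous requests issued by the warp. Names are illustrative only.
uint32_t inter_bank_divergence_penalty(const std::vector<uint32_t>& hit_latencies) {
    if (hit_latencies.empty()) return 0;
    auto [min_it, max_it] = std::minmax_element(hit_latencies.begin(),
                                                hit_latencies.end());
    return *max_it - *min_it;  // extra stall caused purely by skewed queuing
}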

To quantify the magnitude of queue contention, we analyze the queuing delays for a two-bank L2 cache where the tag lookup latency is set to one cycle. We find that even with such a small cache lookup latency, a significant number of requests experience tens, if not hundreds, of cycles of queuing delay. Figure 12 shows the distribution of these delays for BFS [63], across all of its individual L2 cache requests. BFS contains one compute-intensive kernel and two memory-intensive kernels. We observe that requests generated by the compute-intensive kernel do not incur high queuing latencies, while requests from the memory-intensive kernels suffer from significant queuing delays. On average, across all three kernels, cache requests spend 34.8 cycles in the queue waiting to be serviced, which is quite high considering the idealized one-cycle cache lookup latency.

Figure 12: Distribution of per-request queuing latencies for L2 cache requests from BFS.

One naive solution to the L2 cache queuing problem is to increase the number of banks, without reducing the number of physical ports per bank and without increasing the size of the shared cache. However, as shown in Figure 13, the average performance improvement from doubling the number of banks to 24 (i.e., 4 banks per memory partition) is less than 4%, while the improvement from quadrupling the banks is less than 6%. There are two key reasons for this minimal performance gain. First, while more cache banks can help to distribute the queued requests, these extra banks do not change the memory divergence behavior of the warp (i.e., the warp hit ratios remain unchanged). Second, non-uniform bank access patterns still remain, causing cache requests to queue up unevenly at a few banks.1

Figure 13: Performance of GPGPU applications with different number of banks and ports per bank, normalized to a 12-bank cache with 2 ports per bank.

12.3 Our Goal

The goal of MeDiC is to improve cache utilization and reduce cache queuing latency by taking advantage of the heterogeneity between different types of warps. To this end, we create a mechanism that (1) tries to eliminate mostly-hit and mostly-miss warps by converting as many of them as possible to all-hit and all-miss warps, respectively; (2) reduces the queuing delay at the L2 cache by bypassing requests from mostly-miss and all-miss warps, such that each L2 cache hit experiences a much lower overall L2 cache latency; and (3) prioritizes mostly-hit warps in the memory scheduler to minimize the amount of time they stall due to a cache miss.

13 MeDiC: Memory Divergence Correction

In this section, we introduce Memory Divergence Correction (MeDiC), a set of techniques that take advantage of the memory divergence heterogeneity across warps, as discussed in Section 12. These techniques work independently of each other, but act synergistically to provide a substantial performance improvement. In Section 13.1, we propose a mechanism that identifies and groups warps into different warp types based on their degree of memory divergence, as shown in Figure 8.

As depicted in Figure 14, MeDiC uses (1) warp type identification to drive three different components: (2) a warp-type-aware cache bypass mechanism (Section 13.2), which bypasses requests from all-miss and mostly-miss warps to reduce the L2 queuing delay; (3) a warp-type-aware cache insertion policy (Section 13.3), which works to keep cache lines from mostly-hit warps while demoting lines from mostly-miss warps; and (4) a warp-type-aware memory scheduler (Section 13.4), which prioritizes DRAM requests from mostly-hit warps as they are highly latency sensitive. We analyze the hardware cost of MeDiC in Section 15.5.

Figure 14: Overview of MeDiC: (1) warp type identification logic, (2) warp-type-aware cache bypassing, (3) warp-type-aware cache insertion policy, (4) warp-type-aware memory scheduler.

13.1 Warp Type Identification

In order to take advantage of the memory divergence heterogeneity across warps, we must first add hardware that can identify the divergence behavior of each warp. The key idea is to periodically sample the hit ratio of a warp, and to classify the warp’s divergence behavior as one of the five types in Figure 8 based on the observed hit ratio (see Section 12.1). This information can then be used to drive the warp-type-aware components of MeDiC. In general, warps tend to retain the same memory divergence behavior for long periods of execution. However, as we observed in Section 12.1, there can be some long-term shifts in warp divergence behavior, requiring periodic resampling of the hit ratio to potentially adjust the warp type.

Warp type identification through hit ratio sampling requires hardware within the cache to periodically count the number of hits and misses each warp incurs. We append two counters to the metadata stored for each warp, which represent the total number of cache hits and cache accesses for the warp. We reset these counters periodically, and set the bypass logic to operate in a profiling phase for each warp after this reset.2 During profiling, which lasts for the first 30 cache accesses of each warp, the bypass logic (which we explain in Section 13.2) does not make any cache bypassing decisions, to allow the counters to accurately characterize the current memory divergence behavior of the warp. At the end of profiling, the warp type is determined and stored in the metadata.
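The sketch below illustrates one way the per-warp sampling state described above could be organized. The 30-access profiling window and the 100,000-cycle resampling period follow the text; the exact hit-ratio bins are our assumption, based on the 70% and 20% cutoffs discussed in Section 12.1, and all names are illustrative rather than the actual hardware design.

#include <cstdint>

// Illustrative per-warp sampling state; the hit-ratio bins are assumed from
// the 70% / 20% cutoffs of Figure 8, not taken verbatim from the hardware.
enum class WarpType { AllHit, MostlyHit, Balanced, MostlyMiss, AllMiss };

struct WarpDivergenceState {
    uint32_t hits = 0;      // cache hits observed since the last reset
    uint32_t accesses = 0;  // cache accesses observed since the last reset
    WarpType type = WarpType::Balanced;

    static constexpr uint32_t kProfileAccesses = 30;      // profiling phase length
    static constexpr uint64_t kResamplePeriod  = 100000;  // cycles between resets

    void record_access(bool hit) {
        accesses++;
        if (hit) hits++;
        if (accesses == kProfileAccesses) type = classify();  // end of profiling
    }

    bool profiling() const { return accesses < kProfileAccesses; }

    WarpType classify() const {
        double ratio = accesses ? double(hits) / accesses : 0.0;
        if (ratio >= 0.999) return WarpType::AllHit;
        if (ratio >= 0.7)   return WarpType::MostlyHit;   // mostly-hit cutoff
        if (ratio <= 0.001) return WarpType::AllMiss;
        if (ratio <= 0.2)   return WarpType::MostlyMiss;  // mostly-miss cutoff
        return WarpType::Balanced;
    }

    void reset() { hits = 0; accesses = 0; }  // invoked every kResamplePeriod cycles
};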

13.2 Warp-type-aware Shared Cache Bypassing

Once the warp type is known and a warp generates a request to the L2 cache, our mechanism first decides whether to bypass the cache based on the warp type. The key idea behind warp-type-aware cache bypassing, as discussed in Section 12.1, is to convert mostly-miss warps into all-miss warps, as they do not benefit greatly from the few cache hits that they get. By bypassing these requests, we achieve three benefits: (1) bypassed requests can avoid L2 queuing latencies entirely, (2) other requests that do hit in the L2 cache experience shorter queuing delays due to the reduced contention, and (3) space is created in the L2 cache for mostly-hit warps.

The cache bypassing logic must make a simple decision: if an incoming memory request was generated by a mostly-miss or all-miss warp, the request is bypassed directly to DRAM. This is determined by reading the warp type stored in the warp metadata from the warp type identification mechanism. A simple 2-bit demultiplexer can be used to determine whether a request is sent to the L2 bank arbiter, or directly to the DRAM request queue.
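A minimal sketch of this routing decision is shown below, assuming the WarpType enum from the sketch in Section 13.1. The enum and function names are ours; in hardware this is the small demultiplexer described above rather than software control flow.

// Sketch of the per-request routing decision; names are illustrative only.
enum class Route { L2BankArbiter, DramRequestQueue };

Route route_request(WarpType warp_type, bool warp_in_profiling_phase) {
    // No bypassing decisions are made during the profiling phase (Section 13.1).
    if (warp_in_profiling_phase) return Route::L2BankArbiter;
    // Requests from mostly-miss and all-miss warps go straight to DRAM.
    if (warp_type == WarpType::MostlyMiss || warp_type == WarpType::AllMiss)
        return Route::DramRequestQueue;
    return Route::L2BankArbiter;
}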

\paragraphbe

Dynamically Tuning the Cache Bypassing Rate. While cache bypassing alleviates queuing pressure at the L2 cache banks, it can have a negative impact on other portions of the memory partition. For example, bypassed requests that were originally cache hits now consume extra off-chip memory bandwidth, and can increase queuing delays at the DRAM queue. If we lower the number of bypassed requests (i.e., reduce the number of warps classified as mostly-miss), we can reduce DRAM utilization. After examining a random selection of kernels from three applications (BFS, BP, and CONS), we find that the ideal number of warps classified as mostly-miss differs for each kernel. Therefore, we add a mechanism that dynamically tunes the hit ratio boundary between mostly-miss warps and balanced warps (nominally set at 20%; see Figure 8). If the cache miss rate increases significantly, the hit ratio boundary is lowered.3
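The sketch below shows one possible form of this dynamic tuning. The dissertation specifies only the nominal 20% boundary and the direction of adjustment; the epoch structure, the step size, and the threshold for a "significant" miss-rate increase are our assumptions.

#include <algorithm>

// Illustrative tuner: if bypassing raises the L2 miss rate noticeably, lower
// the mostly-miss boundary so that fewer warps are classified as mostly-miss.
struct BypassBoundaryTuner {
    double boundary = 0.20;       // nominal mostly-miss / balanced cutoff
    double prev_miss_rate = 0.0;  // miss rate observed in the previous epoch

    void on_epoch_end(double miss_rate) {
        const double kSignificantIncrease = 0.02;  // assumed threshold
        const double kStep = 0.05;                 // assumed adjustment step
        if (miss_rate > prev_miss_rate + kSignificantIncrease)
            boundary = std::max(0.0, boundary - kStep);  // bypass fewer warps
        prev_miss_rate = miss_rate;
    }
};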

\paragraphbe

Cache Write Policy. Recent GPUs support multiple options for the L2 cache write policy [310]. In this work, we assume that the L2 cache is write-through [385], so our bypassing logic can always assume that DRAM contains an up-to-date copy of the data. For write-back caches, previously-proposed mechanisms [146, 384, 270] can be used in conjunction with our bypassing technique to ensure that bypassed requests get the correct data. For correctness, fences and atomic instructions from bypassed warps still access the L2 for cache lookup, but are not allowed to store data in the cache.

13.3 Warp-type-aware Cache Insertion Policy

Our cache bypassing mechanism frees up space within the L2 cache, which we want to use for the cache misses from mostly-hit warps (to convert these memory requests into cache hits). However, even with the new bypassing mechanism, other warps (e.g., balanced, mostly-miss) still insert some data into the cache. In order to aid the conversion of mostly-hit warps into all-hit warps, we develop a warp-type-aware cache insertion policy, whose key idea is to ensure that for a given cache set, data from mostly-miss warps are evicted first, while data from mostly-hit warps and all-hit warps are evicted last.

To ensure that a cache block from a mostly-hit warp stays in the cache for as long as possible, we insert the block closer to the MRU position. A cache block requested by a mostly-miss warp is inserted closer to the LRU position, making it more likely to be evicted. To track the status of these cache blocks, we add two bits of metadata to each cache block, indicating the warp type.4 These bits are then appended to the replacement policy bits. As a result, a cache block from a mostly-miss warp is more likely to get evicted than a block from a balanced warp. Similarly, a cache block from a balanced warp is more likely to be evicted than a block from a mostly-hit or all-hit warp.
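One simple way to realize this policy is to map the warp type to an insertion position in the replacement stack, as in the sketch below. The exact positions (beyond "closer to MRU" and "closer to LRU") are our assumptions, and the sketch assumes the WarpType enum from Section 13.1.

// Insertion position in an LRU stack of `assoc` ways (0 = MRU, assoc-1 = LRU).
// Positions other than the MRU/LRU extremes are illustrative choices.
unsigned insertion_position(WarpType warp_type, unsigned assoc) {
    switch (warp_type) {
        case WarpType::AllHit:
        case WarpType::MostlyHit:  return 0;          // near MRU: keep longest
        case WarpType::MostlyMiss:
        case WarpType::AllMiss:    return assoc - 1;  // near LRU: evict first
        default:                   return assoc / 2;  // balanced: middle of stack
    }
}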

13.4 Warp-type-aware Memory Scheduler

Our cache bypassing mechanism and cache insertion policy work to increase the likelihood that all requests from a mostly-hit warp become cache hits, converting the warp into an all-hit warp. However, due to cache conflicts, or due to poor locality, there may still be cases when a mostly-hit warp cannot be fully converted into an all-hit warp, and is therefore unable to avoid stalling due to memory divergence as at least one of its requests has to go to DRAM. In such a case, we want to minimize the amount of time that this warp stalls. To this end, we propose a warp-type-aware memory scheduler that prioritizes the occasional DRAM request from a mostly-hit warp.

The design of our memory scheduler is very simple. Each memory request is tagged with a single bit, which is set if the memory request comes from a mostly-hit warp (or an all-hit warp, in case the warp was mischaracterized). We modify the request queue at the memory controller to contain two different queues ((4) in Figure 14), where a high-priority queue contains all requests that have their mostly-hit bit set to one. The low-priority queue contains all other requests, whose mostly-hit bits are set to zero. Each queue uses FR-FCFS [357, 454] as the scheduling policy; however, the scheduler always selects requests from the high priority queue over requests in the low priority queue.5
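The two-queue structure can be sketched as follows, applying the FR-FCFS rule within whichever queue is selected. The container types and names are illustrative; in hardware the priority comparison would be folded into the existing scheduler logic rather than implemented with software queues.

#include <cstddef>
#include <cstdint>
#include <deque>

// Minimal request record, matching the fields used in the FR-FCFS sketch above.
struct SchedRequest {
    uint64_t row;           // DRAM row targeted by the request
    uint64_t arrival_time;  // cycle at which the request arrived
};

struct WarpTypeAwareScheduler {
    std::deque<SchedRequest> high_priority;  // mostly-hit bit == 1
    std::deque<SchedRequest> low_priority;   // mostly-hit bit == 0

    void enqueue(const SchedRequest& req, bool from_mostly_hit_warp) {
        (from_mostly_hit_warp ? high_priority : low_priority).push_back(req);
    }

    // Picks the next request: the high-priority queue is always drained first,
    // and FR-FCFS (row hits first, then oldest) is applied within each queue.
    bool schedule(uint64_t open_row, SchedRequest& out) {
        std::deque<SchedRequest>& q =
            !high_priority.empty() ? high_priority : low_priority;
        if (q.empty()) return false;
        std::size_t best = 0;
        for (std::size_t i = 1; i < q.size(); ++i) {
            bool i_hit = (q[i].row == open_row);
            bool b_hit = (q[best].row == open_row);
            if (i_hit != b_hit) { if (i_hit) best = i; }
            else if (q[i].arrival_time < q[best].arrival_time) best = i;
        }
        out = q[best];
        q.erase(q.begin() + static_cast<std::ptrdiff_t>(best));
        return true;
    }
};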

14 Methodology

We model our mechanism using GPGPU-Sim 3.2.1 [46]. Table 1 shows the configuration of the GPU. We modified GPGPU-Sim to accurately model cache bank conflicts, and added the cache bypassing, cache insertion, and memory scheduling mechanisms needed to support MeDiC. We use GPUWattch [244] to evaluate power consumption.

System Overview: 15 cores, 6 memory partitions
Shader Core Config.: 1400 MHz, 9-stage pipeline, GTO scheduler [359]
Private L1 Cache: 16KB, 4-way associative, LRU, L1 misses are coalesced before accessing L2, 1-cycle latency
Shared L2 Cache: 768KB total, 16-way associative, LRU, 2 cache banks and 2 interconnect ports per memory partition, 10-cycle latency
DRAM: GDDR5 1674 MHz, 6 channels (one per memory partition), FR-FCFS scheduler [357, 454], 8 banks per rank, burst length 8
Table 1: Configuration of the simulated system.
# Application AH MH BL MM AM
1 Nearest Neighbor (NN) [309] 19% 79% 1% 0.9% 0.1%
2 Convolution Separable (CONS) [309] 9% 1% 82% 1% 7%
3 Scalar Product (SCP) [309] 0.1% 0.1% 0.1% 0.7% 99%
4 Back Propagation (BP) [77] 10% 27% 48% 6% 9%
5 Hotspot (HS) [77] 1% 29% 69% 0.5% 0.5%
6 Streamcluster (SC) [77] 6% 0.2% 0.5% 0.3% 93%
7 Inverted Index (IIX) [157] 71% 5% 8% 1% 15%
8 Page View Count (PVC) [157] 4% 1% 42% 20% 33%
9 Page View Rank (PVR) [157] 18% 3% 28% 4% 47%
10 Similarity Score (SS) [157] 67% 1% 11% 1% 20%
11 Breadth-First Search (BFS) [63] 40% 1% 20% 13% 26%
12 Barnes-Hut N-body Simulation (BH) [63] 84% 0% 0% 1% 15%
13 Delaunay Mesh Refinement (DMR) [63] 81% 3% 3% 1% 12%
14 Minimum Spanning Tree (MST) [63] 53% 12% 18% 2% 15%
15 Survey Propagation (SP) [63] 41% 1% 20% 14% 24%
Table 2: Evaluated GPGPU applications and the characteristics of their warps.
\paragraphbe

Modeling L2 Bank Conflicts. In order to analyze the detailed caching behavior of applications in modern GPGPU architectures, we modified GPGPU-Sim to accurately model banked caches.6 Within each memory partition, we divide the shared L2 cache into two banks. When a memory request misses in the L1 cache, it is sent to the memory partition through the shared interconnect. However, it can only be sent if there is a free port available at the memory partition (we dual-port each memory partition). Once a request arrives at the port, a unified bank arbiter dispatches the request to the request queue for the appropriate cache bank (which is determined statically using some of the memory address bits). If the bank request queue is full, the request remains at the incoming port until the queue is freed up. Traveling through the port and arbiter consumes an extra cycle per request. In order to prevent a bias towards any one port or any one cache bank, the simulator rotates which port and which bank are first examined every cycle.

When a request misses in the L2 cache, it is sent to the DRAM request queue, which is shared across all L2 banks as previously implemented in GPGPU-Sim. When a request returns from DRAM, it is inserted into one of the per-bank DRAM-to-L2 queues. Requests returning from the L2 cache to the L1 cache go through a unified memory-partition-to-interconnect queue (where round-robin priority is used to insert requests from different banks into the queue).
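The sketch below illustrates the kind of bank-mapping and per-bank queuing logic this modeling requires: the bank index is derived statically from the line address, and a request stalls at the incoming port when its bank queue is full. The line size, the queue-depth handling, and all names are our assumptions about one reasonable implementation, not the actual GPGPU-Sim code.

#include <array>
#include <cstddef>
#include <cstdint>
#include <queue>

// Illustrative model of banked L2 request queues within one memory partition.
struct L2BankModel {
    static constexpr unsigned kLineBytes = 128;  // assumed cache line size
    static constexpr unsigned kNumBanks  = 2;    // two banks per memory partition

    std::array<std::queue<uint64_t>, kNumBanks> bank_queues;

    // Bank index taken statically from low-order line-address bits.
    static unsigned bank_of(uint64_t addr) {
        return static_cast<unsigned>((addr / kLineBytes) % kNumBanks);
    }

    // Returns false if the target bank queue is full, in which case the
    // request remains stalled at the incoming memory-partition port.
    bool try_enqueue(uint64_t addr, std::size_t max_queue_depth) {
        std::queue<uint64_t>& q = bank_queues[bank_of(addr)];
        if (q.size() >= max_queue_depth) return false;
        q.push(addr);
        return true;
    }
};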

\paragraphbe

GPGPU Applications. We evaluate our system across multiple GPGPU applications from the CUDA SDK [309], Rodinia [77], MARS [157], and Lonestar [63] benchmark suites.7 These applications are listed in Table 2, along with the breakdown of warp characterization. The dominant warp type for each application is marked in bold (AH: all-hit, MH: mostly-hit, BL: balanced, MM: mostly-miss, AM: all-miss; see Figure 8). We simulate 500 million instructions for each kernel of our applications, though some kernels complete before reaching this instruction count.

\paragraphbe

Comparisons. In addition to the baseline results, we compare each individual component of MeDiC with state-of-the-art policies. We compare our bypassing mechanism with three different cache management policies. First, we compare to PCAL [247], a token-based cache management mechanism. PCAL limits the number of threads that get to access the cache by using tokens. If a cache request is a miss, it causes a replacement only if the warp has a token. PCAL, as modeled in this work, first grants tokens to the warp that recently used the cache, then grants any remaining tokens to warps that access the cache in order of their arrival. Unlike the original proposal [247], which applies PCAL to the L1 caches, we apply PCAL to the shared L2 cache. We sweep the number of tokens per epoch and use the configuration that gives the best average performance. Second, we compare MeDiC against a random bypassing policy (Rand), where a percentage of randomly-chosen warps bypass the cache every 100k cycles. For every workload, we statically configure the percentage of warps that bypass the cache such that Rand yields the best performance. This comparison point is designed to show the value of warp type information in bypassing decisions. Third, we compare to a program counter (PC) based bypassing policy (PC-Byp). This mechanism bypasses requests from static instructions that mostly miss (as opposed to requests from mostly-miss warps). This comparison point is designed to distinguish the value of tracking hit ratios at the warp level instead of at the instruction level.

We compare our memory scheduling mechanism with the baseline first-ready, first-come first-serve (FR-FCFS) memory scheduler [357, 454], which is reported to provide good performance on GPU and GPGPU workloads [445, 74, 33]. We compare our cache insertion with the Evicted-Address Filter [379], a state-of-the-art CPU cache insertion policy.

\paragraphbe

Evaluation Metrics. We report performance results using the harmonic average of the IPC speedup (over the baseline GPU) of each kernel of each application.8 Harmonic speedup was shown to reflect the average normalized execution time in multiprogrammed workloads [107]. We calculate energy efficiency for each workload by dividing the IPC by the energy consumed.
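Concretely, with $S_i$ denoting the IPC speedup of kernel $i$ over the baseline GPU and $N$ the number of kernels in a workload, the reported metric is the standard harmonic speedup (notation is ours):
\[
\text{Harmonic speedup} = \frac{N}{\sum_{i=1}^{N} \frac{1}{S_i}},
\qquad
S_i = \frac{\mathrm{IPC}^{\text{mechanism}}_i}{\mathrm{IPC}^{\text{baseline}}_i}.
\]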

15 Evaluation

15.1 Performance Improvement of MeDiC

Figure 15 shows the performance of MeDiC compared to the various state-of-the-art mechanisms (EAF [379], PCAL [247], Rand, PC-Byp) from Section 14,9 as well as the performance of each individual component in MeDiC.

Figure 15: Performance of MeDiC.

Baseline shows the performance of the unmodified GPU using FR-FCFS as the memory scheduler [357, 454]. EAF shows the performance of the Evicted-Address Filter [379]. WIP shows the performance of our warp-type-aware insertion policy by itself. WMS shows the performance of our warp-type-aware memory scheduling policy by itself. PCAL shows the performance of the PCAL bypassing mechanism proposed by Li et al. [247]. Rand shows the performance of a cache bypassing mechanism that performs bypassing decisions randomly on a fixed percentage of warps. PC-Byp shows the performance of the bypassing mechanism that uses the PC as the criterion for bypassing instead of the warp type. WByp shows the performance of our warp-type-aware bypassing policy by itself.

From these results, we draw the following conclusions:

  • Each component of MeDiC individually provides significant performance improvement: WIP (32.5%), WMS (30.2%), and WByp (33.6%). MeDiC, which combines all three mechanisms, provides a 41.5% performance improvement over Baseline, on average. MeDiC matches or outperforms its individual components for all benchmarks except BP, where MeDiC has a higher L2 miss rate and lower row buffer locality than WMS and WByp.

  • WIP outperforms EAF [379] by 12.2%. We observe that the key benefit of WIP is that cache blocks from mostly-miss warps are much more likely to be evicted. In addition, WIP reduces the cache miss rate of several applications (see Section 15.3).

  • WMS provides significant performance gains (30.2%) over Baseline, because the memory scheduler prioritizes requests from warps that have a high hit ratio, allowing these warps to become active much sooner than they do in Baseline.

  • WByp provides an average 33.6% performance improvement over Baseline, because it is effective at reducing the L2 queuing latency. We show the change in queuing latency and provide a more detailed analysis in Section 15.3.

  • Compared to PCAL [247], WByp provides 12.8% better performance, and full MeDiC provides 21.8% better performance. We observe that while PCAL reduces the amount of cache thrashing, the reduction in thrashing does not directly translate into better performance. We observe that warps in the mostly-miss category sometimes have high reuse, and acquire tokens to access the cache. This causes less cache space to become available for mostly-hit warps, limiting how many of these warps become all-hit. However, when high-reuse warps that possess tokens are mainly in the mostly-hit category (PVC, PVR, SS, and BH), we find that PCAL performs better than WByp.

  • Compared to Rand,10 MeDiC performs 6.8% better, because MeDiC is able to make bypassing decisions that do not increase the miss rate significantly. This leads to lower off-chip bandwidth usage under MeDiC than under Rand. Rand increases the cache miss rate by 10% for the kernels of several applications (BP, PVC, PVR, BFS, and MST). We observe that in many cases, MeDiC improves the performance of applications that tend to generate a large number of memory requests, and thus experience substantial queuing latencies. We further analyze the effect of MeDiC on queuing delay in Section 15.3.

  • Compared to PC-Byp, MeDiC performs 12.4% better. We observe that the overhead of tracking the PC becomes significant, and that thrashing occurs as two PCs can hash to the same index, leading to inaccuracies in the bypassing decisions.

We conclude that each component of MeDiC, and the full MeDiC framework, are effective. Note that each component of MeDiC addresses the same problem (i.e., memory divergence of threads within a warp) using different techniques on different parts of the memory hierarchy. For the majority of workloads, one optimization is enough. However, we see that for certain high-intensity workloads (BFS and SSSP), the congestion is so high that we need to attack divergence on multiple fronts. Thus, MeDiC provides better average performance than all of its individual components, especially for such memory-intensive workloads.

15.2 Energy Efficiency of MeDiC

MeDiC provides significant GPU energy efficiency improvements, as shown in Figure 16. All three components of MeDiC, as well as the full MeDiC framework, are more energy efficient than all of the other works we compare against. MeDiC is 53.5% more energy efficient than Baseline. WIP itself is 19.3% more energy efficient than EAF. WMS is 45.2% more energy efficient than Baseline, which uses an FR-FCFS memory scheduler [357, 454]. WByp and MeDiC are more energy efficient than all of the other evaluated bypassing mechanisms, with 8.3% and 20.1% more efficiency than PCAL [247], respectively.

Figure 16: Energy efficiency of MeDiC.

For all of our applications, the energy efficiency of MeDiC is better than or equal to Baseline, because even though our bypassing logic sometimes increases energy consumption by sending more memory requests to DRAM, the resulting performance improvement outweighs this additional energy. We also observe that our insertion policy reduces the L2 cache miss rate, allowing MeDiC to be even more energy efficient by not wasting energy on cache lookups for requests of all-miss warps.

15.3 Analysis of Benefits

Impact of MeDiC on Cache Miss Rate. One possible downside of cache bypassing is that the bypassed requests can introduce extra cache misses. Figure 17 shows the cache miss rate for Baseline, Rand, WIP, and MeDiC.

Figure 17: L2 Cache miss rate of MeDiC.

Unlike Rand, MeDiC does not increase the cache miss rate over Baseline for most of our applications. The key factor behind this is WIP, the insertion policy in MeDiC. We observe that WIP on its own provides significant cache miss rate reductions for several workloads (SCP, PVC, PVR, SS, and DMR). For the two workloads (BP and BFS) where WIP increases the miss rate (5% for BP, and 2.5% for BFS), the bypassing mechanism in MeDiC is able to contain the negative effects of WIP by dynamically tuning how aggressively bypassing is performed based on the change in cache miss rate (see Section 13.2). We conclude that MeDiC does not hurt the overall L2 cache miss rate.
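To make the interaction between WIP and the bypassing mechanism more concrete, the sketch below illustrates one way such miss-rate-driven throttling could be structured. It is not the exact mechanism of Section 13.2; the class name, epoch granularity, and tolerance value are illustrative assumptions.

```python
# Hypothetical sketch of miss-rate-driven bypass throttling. The tolerance,
# epoch structure, and names are illustrative, not MeDiC's actual parameters.

class BypassThrottle:
    def __init__(self, tolerance=0.02):
        self.tolerance = tolerance        # acceptable miss-rate increase
        self.reference_miss_rate = None   # miss rate of the first sampled epoch
        self.bypass_enabled = True

    def end_of_epoch(self, misses, accesses):
        miss_rate = misses / max(accesses, 1)
        if self.reference_miss_rate is None:
            self.reference_miss_rate = miss_rate
            return
        # Throttle bypassing whenever the observed miss rate rises noticeably
        # above the reference; re-enable it once the miss rate recovers.
        self.bypass_enabled = miss_rate <= self.reference_miss_rate + self.tolerance

    def should_bypass(self, warp_is_mostly_or_all_miss):
        # Only mostly-miss and all-miss warps are bypass candidates.
        return self.bypass_enabled and warp_is_mostly_or_all_miss
```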

Impact of MeDiC on Queuing Latency. Figure 18 shows the average L2 cache queuing latency for WByp and MeDiC, compared to the Baseline queuing latency. For most workloads, WByp reduces the queuing latency significantly (up to 8.7x in the case of PVR). This reduction results in significant performance gains for both WByp and MeDiC.

Figure 18: L2 queuing latency for warp-type-aware bypassing and MeDiC, compared to Baseline L2 queuing latency.

There are two applications where the queuing latency increases significantly: BFS and SSSP. We observe that when cache bypassing is applied, the GPU cores retire instructions at a much faster rate (2.33x for BFS, and 2.17x for SSSP). This increases the pressure at each shared resource, including a sharp increase in the rate of cache requests arriving at the L2 cache. This additional backpressure results in higher L2 cache queuing latencies for both applications.

When all three mechanisms in MeDiC (bypassing, cache insertion, and memory scheduling) are combined, we observe that the queuing latency reduces even further. This additional reduction occurs because the cache insertion mechanism in MeDiC reduces the cache miss rate. We conclude that in general, MeDiC significantly alleviates the L2 queuing bottleneck.

Impact of MeDiC on Row Buffer Locality. Another possible downside of cache bypassing is that it may increase the number of requests serviced by DRAM, which in turn can affect DRAM row buffer locality. Figure 19 shows the row buffer hit rate for WMS and MeDiC, compared to the Baseline hit rate.

Figure 19: Row buffer hit rate of warp-type-aware memory scheduling and MeDiC, compared to Baseline.

Compared to Baseline, WMS has a negative effect on the row buffer locality of six applications (NN, BP, PVR, SS, BFS, and SSSP), and a positive effect on seven applications (CONS, SCP, HS, PVC, BH, DMR, and MST). We observe that even though the row buffer locality of some applications decreases, the overall performance improves, as the memory scheduler prioritizes requests from warps that are more sensitive to long memory latencies. Additionally, prioritizing requests from warps that send a small number of memory requests (mostly-hit warps) over warps that send a large number of memory requests (mostly-miss warps) allows more time for mostly-miss warps to batch requests together, improving their row buffer locality. Prior work on GPU memory scheduling [33] has observed similar behavior, where batching requests together allows GPU requests to benefit more from row buffer locality.

15.4 Identifying Reuse in GPGPU Applications

While WByp bypasses warps that have low cache utility, it is possible that some cache blocks fetched by these bypassed warps get accessed frequently. Such a frequently-accessed cache block may be needed later by a mostly-hit warp, and thus leads to an extra cache miss (as the block bypasses the cache). To remedy this, we add a mechanism to MeDiC that ensures all high-reuse cache blocks still get to access the cache. The key idea, building upon the state-of-the-art mechanism for block-level reuse [379], is to use a Bloom filter to track the high-reuse cache blocks, and to use this filter to override bypassing decisions. We call this combined design MeDiC-reuse.
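The following sketch shows the general shape of such a reuse override, assuming a simple software model of a Bloom filter; the hash functions, filter size, and function names are placeholders rather than the configuration evaluated for MeDiC-reuse.

```python
# Illustrative model of a Bloom-filter-based reuse override in the spirit of
# the EAF cache [379]. Sizes and hash functions are placeholders.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.bits = [False] * num_bits
        self.num_hashes = num_hashes

    def _indices(self, block_addr):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{block_addr}:{i}".encode(), digest_size=8)
            yield int.from_bytes(digest.digest(), "little") % len(self.bits)

    def insert(self, block_addr):
        for idx in self._indices(block_addr):
            self.bits[idx] = True

    def maybe_contains(self, block_addr):
        # Can return false positives, which is the aliasing problem noted above.
        return all(self.bits[idx] for idx in self._indices(block_addr))

def should_bypass(warp_is_mostly_or_all_miss, block_addr, reuse_filter):
    if reuse_filter.maybe_contains(block_addr):
        return False                      # likely high-reuse block: cache it
    return warp_is_mostly_or_all_miss     # otherwise use warp-type bypassing
```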

Figure 20 shows that MeDiC-reuse suffers 16.1% performance degradation over MeDiC. There are two reasons behind this degradation. First, we observe that MeDiC likely implicitly captures blocks with high reuse, as these blocks tend to belong to all-hit and mostly-hit warps. Second, we observe that several GPGPU applications contain access patterns that cause severe false positive aliasing within the Bloom filter used to implement EAF and MeDiC-reuse. This leads to some low reuse cache accesses from mostly-miss and all-miss warps taking up cache space unnecessarily, resulting in cache thrashing. We conclude that MeDiC likely implicitly captures the high reuse cache blocks that are relevant to improving memory divergence (and thus performance). However, there may still be room for other mechanisms that make the best of block-level cache reuse and warp-level heterogeneity in making caching decisions.

Figure 20: Performance of MeDiC with Bloom filter based reuse detection mechanism from the EAF cache [379].

15.5 Hardware Cost

MeDiC requires additional metadata storage in two locations. First, each warp needs to maintain its own hit ratio. This can be done by adding 22 bits to the metadata of each warp: two 10-bit counters to track the number of L2 cache hits and the number of L2 cache accesses, and 2 bits to store the warp type. To efficiently account for overflow, the two counters that track L2 hits and L2 accesses are shifted right when the most significant bit of the latter counter is set. Additionally, the metadata for each cache line contains two bits, in order to annotate the warp type for the cache insertion policy. The total storage needed in the cache is 2 × NumCacheLines bits. In all, MeDiC comes at a cost of 5.1 kB, or less than 1% of the L2 cache size.
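As a concrete illustration of this bookkeeping, the sketch below models the two 10-bit counters and the shift-on-overflow rule described above; the class structure and method names are ours, not hardware from the original design.

```python
# Sketch of per-warp hit-ratio tracking. Field widths follow the text above
# (two 10-bit counters, 2-bit warp type); the Python structure is illustrative.

COUNTER_BITS = 10
MSB = 1 << (COUNTER_BITS - 1)

class WarpStats:
    def __init__(self):
        self.hits = 0
        self.accesses = 0
        self.warp_type = 0  # 2-bit encoding of the warp type

    def record_access(self, was_hit):
        self.accesses += 1
        if was_hit:
            self.hits += 1
        # When the access counter's most significant bit is set, shift both
        # counters right so neither overflows and the ratio is preserved.
        if self.accesses & MSB:
            self.accesses >>= 1
            self.hits >>= 1

    def hit_ratio(self):
        return self.hits / self.accesses if self.accesses else 0.0
```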

To evaluate the trade-off of storage overhead, we evaluate a GPU where this overhead is converted into additional L2 cache space for the baseline GPU. We conservatively increase the L2 capacity by 5%, and find that this additional cache capacity does not improve the performance of any of our workloads by more than 1%. As we discuss in the chapter, contention due to warp interference and divergence, and not due to cache capacity, is the root cause behind the performance bottlenecks that MeDiC alleviates. We conclude that MeDiC can deliver significant performance improvements with very low overhead.

16 MeDiC: Conclusion

Warps from GPGPU applications exhibit heterogeneity in their memory divergence behavior at the shared L2 cache within the GPU. We find that (1) some warps benefit significantly from the cache, while others make poor use of it; (2) such divergence behavior for a warp tends to remain stable for long periods of the warp’s execution; and (3) the impact of memory divergence can be amplified by the high queuing latencies at the L2 cache.

We propose Memory Divergence Correction (MeDiC), whose key idea is to identify memory divergence heterogeneity in hardware and use this information to drive cache management and memory scheduling, by prioritizing warps that take the greatest advantage of the shared cache. To achieve this, MeDiC consists of three warp-type-aware components for (1) cache bypassing, (2) cache insertion, and (3) memory scheduling. MeDiC delivers significant performance and energy improvements over multiple previously proposed policies, and over a state-of-the-art GPU cache management technique. We conclude that exploiting inter-warp heterogeneity is effective, and hope future works explore other ways of improving systems based on this key observation.

Chapter \thechapter Reducing Inter-application Interference with Staged Memory Scheduling

As the number of cores continues to increase in modern chip multiprocessor (CMP) systems, the DRAM memory system is becoming a critical shared resource. Memory requests from multiple cores interfere with each other, and this inter-application interference is a significant impediment to individual application and overall system performance. Previous work on application-aware memory scheduling [220, 221, 292, 293] has addressed the problem by making the memory controller aware of application characteristics and appropriately prioritizing memory requests to improve system performance and fairness.

Recent systems [62, 176, 307] present an additional challenge by introducing integrated graphics processing units (GPUs) on the same die with CPU cores. GPU applications typically demand significantly more memory bandwidth than CPU applications due to the GPU’s capability of executing a large number of parallel threads. GPUs use single-instruction multiple-data (SIMD) pipelines to concurrently execute multiple threads, where a batch of threads running the same instruction is called a wavefront or warp. When a wavefront stalls on a memory instruction, the GPU core hides this memory access latency by switching to another wavefront to avoid stalling the pipeline. Therefore, there can be thousands of outstanding memory requests from across all of the wavefronts. This is fundamentally more memory intensive than CPU memory traffic, where each CPU application has a much smaller number of outstanding requests due to the sequential execution model of CPUs.

Recent memory scheduling research has focused on memory interference between applications in CPU-only scenarios. These past proposals are built around a single centralized request buffer at each memory controller (MC). The scheduling algorithm implemented in the memory controller analyzes the stream of requests in the centralized request buffer to determine application memory characteristics, decides on a priority for each core, and then enforces these priorities. Observable memory characteristics may include the number of requests that result in row-buffer hits, the bank-level parallelism of each core, memory request rates, overall fairness metrics, and other information. Figure 21(a) shows the CPU-only scenario where the request buffer only holds requests from the CPUs. In this case, the memory controller sees a number of requests from the CPUs and has visibility into their memory behavior. On the other hand, when the request buffer is shared between the CPUs and the GPU, as shown in Figure 21(b), the large volume of requests from the GPU occupies a significant fraction of the memory controller’s request buffer, thereby limiting the memory controller’s visibility of the CPU applications’ memory behaviors.

Figure 21: Limited visibility example. (a) CPU-only information, (b) Memory controller’s visibility, (c) Improved visibility

One approach to increasing the memory controller’s visibility across a larger window of memory requests is to increase the size of its request buffer. This allows the memory controller to observe more requests from the CPUs to better characterize their memory behavior, as shown in Figure 21(c). For instance, with a large request buffer, the memory controller can identify and service multiple requests from one CPU core to the same row such that they become row-buffer hits. However, with a small request buffer, as shown in Figure 21(b), the memory controller may not even see these requests at the same time because the GPU’s requests have occupied the majority of the entries.

Unfortunately, very large request buffers impose significant implementation challenges including the die area for the larger structures and the additional circuit complexity for analyzing so many requests, along with the logic needed for assignment and enforcement of priorities. Therefore, while building a very large, centralized memory controller request buffer could lead to good memory scheduling decisions, the approach is unattractive due to the resulting area, power, timing and complexity costs.

In this work, we propose the Staged Memory Scheduler (SMS), a decentralized architecture for memory scheduling in the context of integrated multi-core CPU-GPU systems. The key idea in SMS is to decouple the various functional requirements of memory controllers and partition these tasks across several simpler hardware structures which operate in a staged fashion. The three primary functions of the memory controller, which map to the three stages of our proposed memory controller architecture, are:

  1. Detection of basic within-application memory characteristics (e.g., row-buffer locality).

  2. Prioritization across applications (CPUs and GPU) and enforcement of policies to reflect the priorities.

  3. Low-level command scheduling (e.g., activate, precharge, read/write), enforcement of device timing constraints (e.g., tFAW, tWTR), and resolving resource conflicts (e.g., data bus arbitration).

Our specific SMS implementation makes widespread use of distributed FIFO structures to maintain a very simple implementation, but at the same time SMS can provide fast service to low memory-intensity (likely latency sensitive) applications and effectively exploit row-buffer locality and bank-level parallelism for high memory-intensity (bandwidth demanding) applications. While SMS provides a specific implementation, our staged approach for memory controller organization provides a general framework for exploring scalable memory scheduling algorithms capable of handling the diverse memory needs of integrated CPU-GPU systems of the future.

This work makes the following contributions:

  • We identify and present the challenges posed to existing memory scheduling algorithms due to the highly memory-bandwidth-intensive characteristics of GPU applications.

  • We propose a new decentralized, multi-stage approach to memory scheduling that effectively handles the interference caused by bandwidth-intensive applications, while simplifying the hardware implementation.

  • We evaluate our approach against four previous memory scheduling algorithms [357, 220, 293, 221] across a wide variety of workloads and CPU-GPU systems, and show that it provides better performance and fairness. As an example, our evaluations on a CPU-GPU system show that SMS improves system performance by 41.2% and fairness by 4.8× across 105 multi-programmed workloads on a 16-CPU/1-GPU, four-memory-controller system, compared to the best previous memory scheduler, TCM [221].

17 Background

In this section, we review DRAM organization and discuss how past research has attempted to deal with the challenges of providing performance and fairness for modern memory systems.

17.1 Main Memory Organization

DRAM is organized as two-dimensional arrays of bitcells. Reading or writing data to DRAM requires that a row of bitcells from the array first be read into a row buffer. This is required because the act of reading the row destroys the row’s contents, and so a copy of the bit values must be kept (in the row buffer). Reads and writes operate directly on the row buffer. Eventually the row is “closed” whereby the data in the row buffer are written back into the DRAM array. Accessing data already loaded in the row buffer, also called a row buffer hit, incurs a shorter latency than when the corresponding row must first be “opened” from the DRAM array. A modern memory controller (MC) must orchestrate the sequence of commands to open, read, write and close rows. Servicing requests in an order that increases row-buffer hits tends to improve overall throughput by reducing the average latency to service requests. The MC is also responsible for enforcing a wide variety of timing constraints imposed by modern DRAM standards (e.g., DDR3), such as limiting the rate of page-open operations (tFAW) and ensuring a minimum amount of time between writes and reads (tWTR).
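For intuition, the toy model below captures the latency distinction between a row-buffer hit, an access to a closed bank, and a row-buffer conflict. The timing values are placeholders (they happen to mirror the tRCD/tCAS/tRP settings listed later in Table 4), and the code is not part of any evaluated simulator.

```python
# Toy model of row-buffer behavior: a hit reads from the open row, a conflict
# must precharge the open row and activate the new one first. Timings are
# illustrative placeholders.

T_CAS, T_RCD, T_RP = 8, 8, 8  # column access, activate, precharge (ns)

class Bank:
    def __init__(self):
        self.open_row = None

    def access(self, row):
        if self.open_row == row:       # row-buffer hit
            return T_CAS
        if self.open_row is None:      # bank closed: activate, then read
            self.open_row = row
            return T_RCD + T_CAS
        self.open_row = row            # conflict: precharge + activate + read
        return T_RP + T_RCD + T_CAS
```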

Each two dimensional array of DRAM cells constitutes a bank, and a group of banks form a rank. All banks within a rank share a common set of command and data buses, and the memory controller is responsible for scheduling commands such that each bus is used by only one bank at a time. Operations on multiple banks may occur in parallel (e.g., opening a row in one bank while reading data from another bank’s row buffer) so long as the buses are properly scheduled and any other DRAM timing constraints are honored. A memory controller can improve memory system throughput by scheduling requests such that bank-level parallelism or BLP (i.e., the number of banks simultaneously busy responding to commands) is increased. A memory system implementation may support multiple independent memory channels (each with its own ranks and banks) to further increase the number of memory requests that can be serviced at the same time. A key challenge in the implementation of modern, high-performance memory controllers is to effectively improve system performance by maximizing both row-buffer hits and BLP while simultaneously providing fairness among multiple CPUs and the GPU.

17.2 Memory Scheduling

Accessing off-chip memory is one of the major bottlenecks in microprocessors. Requests that miss in the last level cache incur long latencies, and as multi-core processors increase the number of CPUs, the problem gets worse because all of the cores must share the limited off-chip memory bandwidth. The large number of requests greatly increases contention for the memory data and command buses. Since a bank can only process one command at a time, the large number of requests also increases bank contention where requests must wait for busy banks to finish servicing other requests. A request from one core can also cause a row buffer containing data for another core to be closed, thereby reducing the row-buffer hit rate of that other core (and vice-versa). All of these effects increase the latency of memory requests by both increasing queuing delays (time spent waiting for the memory controller to start servicing a request) and DRAM device access delays (due to decreased row-buffer hit rates and bus contention).

The memory controller is responsible for buffering and servicing memory requests from the different cores and the GPU. Typical implementations make use of a memory request buffer to hold and keep track of all in-flight requests. Scheduling logic then decides which requests should be serviced, and issues the corresponding commands to the DRAM devices. Different memory scheduling algorithms may attempt to service memory requests in an order different than the order in which the requests arrived at the memory controller, in order to increase row-buffer hit rates, bank level parallelism, fairness, or achieve other goals.

17.3 Memory Scheduling in CPU-only Systems

Memory scheduling algorithms improve system performance by reordering memory requests to deal with the different constraints and behaviors of DRAM. The first-ready first-come-first-serve (FR-FCFS) [357] algorithm attempts to schedule requests that result in row-buffer hits (first-ready), and otherwise prioritizes older requests (FCFS). FR-FCFS increases DRAM throughput, but it can cause fairness problems by under-servicing applications with low row-buffer locality. Several application-aware memory scheduling algorithms [220, 221, 292, 293] have been proposed to balance both performance and fairness. Parallelism-aware Batch Scheduling (PAR-BS) [293] batches requests based on their arrival times (older requests are batched first). Within a batch, applications are ranked to preserve bank-level parallelism (BLP) within each application’s requests. More recently, ATLAS [220] proposes prioritizing applications that have received the least memory service. As a result, applications with low memory intensities, which typically attain little memory service, are prioritized. However, applications with high memory intensities are deprioritized and hence slowed down significantly, resulting in unfairness. The most recent work on application-aware memory scheduling, Thread Cluster Memory scheduling (TCM) [221], addresses this unfairness problem. TCM first clusters applications into low and high memory-intensity clusters based on their memory intensities. TCM always prioritizes applications in the low memory-intensity cluster; among the high memory-intensity applications, it shuffles request priorities to prevent unfairness.
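To make the FR-FCFS baseline concrete, the snippet below sketches its selection rule over a request queue; the request representation and function name are assumptions for illustration, not code from any of the cited schedulers.

```python
# Minimal sketch of FR-FCFS selection: prefer a request whose row is already
# open in its bank (first-ready); among the candidates, pick the oldest (FCFS).

def frfcfs_pick(request_queue, open_rows):
    """request_queue: list of dicts with 'bank', 'row', 'arrival' keys.
    open_rows: dict mapping bank id -> currently open row (or absent)."""
    row_hits = [r for r in request_queue if open_rows.get(r["bank"]) == r["row"]]
    candidates = row_hits if row_hits else request_queue
    return min(candidates, key=lambda r: r["arrival"]) if candidates else None
```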

17.4 Characteristics of Memory Accesses from GPUs

A typical CPU application only has a relatively small number of outstanding memory requests at any time. The size of a processor’s instruction window bounds the number of misses that can be simultaneously exposed to the memory system. Branch prediction accuracy limits how large the instruction window can be usefully increased. In contrast, GPU applications have very different access characteristics, generating many more memory requests than CPU applications. A GPU application can consist of many thousands of parallel threads, where memory stalls on one group of threads can be hidden by switching execution to one of the many other groups of threads.

Figure 22: GPU memory characteristic. (a) Memory-intensity, measured by memory requests per thousand cycles, (b) Row buffer locality, measured by the fraction of accesses that hit in the row buffer, and (c) Bank-level parallelism.

Figure 22(a) shows the memory request rates for a representative subset of our GPU applications and the most memory-intensive SPEC2006 (CPU) applications, measured as memory requests per thousand cycles (see Section 19.5 for the simulation methodology) when each application runs alone on the system. The raw bandwidth demands of the GPU applications are often multiple times higher than those of the SPEC benchmarks. Figure 22(b) shows the row-buffer hit rates (also called row-buffer locality or RBL). The GPU applications show consistently high levels of RBL, whereas the SPEC benchmarks exhibit more variability. The GPU programs have high levels of spatial locality, often due to access patterns related to large sequential memory accesses (e.g., frame buffer updates). Figure 22(c) shows the BLP for each application, with the GPU programs consistently making use of far more banks at the same time.

In addition to the high-intensity memory traffic of GPU applications, there are other properties that distinguish GPU applications from CPU applications. The TCM [221] study observed that CPU applications with streaming access patterns typically exhibit high RBL but low BLP, while applications with less uniform access patterns typically have low RBL but high BLP. In contrast, GPU applications have both high RBL and high BLP. The combination of high memory intensity, high RBL and high BLP means that the GPU will cause significant interference to other applications across all banks, especially when using a memory scheduling algorithm that preferentially favors requests that result in row-buffer hits.

17.5 What Has Been Done in the GPU?

As opposed to CPU applications, GPU applications are not very latency sensitive as there are a large number of independent threads to cover long memory latencies. However, the GPU requires a significant amount of bandwidth far exceeding even the most memory-intensive CPU applications. As a result, a GPU memory scheduler [251] typically needs a large request buffer that is capable of request coalescing (i.e., combining multiple requests for the same block of memory into a single combined request [313]). Furthermore, since GPU applications are bandwidth intensive, often with streaming access patterns, a policy that maximizes the number of row-buffer hits is effective for GPUs to maximize overall throughput. As a result, FR-FCFS with a large request buffer tends to perform well for GPUs [46]. In view of this, previous work [445] designed mechanisms to reduce the complexity of row-hit first based (FR-FCFS) scheduling.
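As a rough illustration of request coalescing, the sketch below merges per-thread addresses that fall into the same memory block into a single request; the block size and data layout are assumptions of this sketch, not the coalescing rules of any particular GPU.

```python
# Illustrative coalescing: requests to the same block are merged into one
# memory request that remembers which threads are waiting for the data.

BLOCK_SIZE = 128  # bytes; placeholder value

def coalesce(thread_addresses):
    merged = {}
    for thread_id, addr in enumerate(thread_addresses):
        block = (addr // BLOCK_SIZE) * BLOCK_SIZE
        merged.setdefault(block, []).append(thread_id)
    # One request per distinct block, carrying the list of waiting threads.
    return [{"block_addr": b, "threads": t} for b, t in merged.items()]
```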

18 Challenges with Existing Memory Controllers

18.1 The Need for Request Buffer Capacity

The results from Figure 22 showed that GPU applications have very high memory intensities. As discussed in Section 6.1, the large number of GPU memory requests occupy many of the memory controller’s request buffer entries, thereby making it very difficult for the memory controller to properly determine the memory access characteristics of each of the CPU applications. Figure 23 shows the performance impact of increasing the memory controller’s request buffer size for a variety of memory scheduling algorithms (full methodology details can be found in Section 19.5) for a 16-CPU/1-GPU system. By increasing the size of the request buffer from 64 entries to 256 entries, previously proposed memory controller algorithms can gain up to 63.6% better performance due to this improved visibility.

Figure 23: Performance at different request buffer sizes

18.2 Implementation Challenges in Providing Request Buffer Capacity

The results above show that when the memory controller has enough visibility across the global memory request stream to properly characterize the behaviors of each core, a sophisticated algorithm like TCM can be effective at making good scheduling decisions. Unfortunately, implementing a sophisticated algorithm like TCM over such a large scheduler introduces very significant implementation challenges. For all algorithms that use a centralized request buffer and prioritize requests that result in row-buffer hits (FR-FCFS, PAR-BS, ATLAS, TCM), associative logic (CAMs) will be needed for each entry to compare its requested row against currently open rows in the DRAM banks. For all algorithms that prioritize requests based on rank/age (FR-FCFS, PAR-BS, ATLAS, TCM), a large comparison tree is needed to select the highest ranked/oldest request from all request buffer entries. The size of this comparison tree grows with request buffer size. Furthermore, in addition to this logic for reordering requests and enforcing ranking/age, TCM also requires additional logic to continually monitor each CPU’s last-level cache MPKI rate (note that a CPU’s instruction count is not typically available at the memory controller), each core’s RBL which requires additional shadow row buffer index tracking [100, 102], and each core’s BLP.

Apart from the logic required to implement the policies of the specific memory scheduling algorithms, all of these memory controller designs need additional logic to enforce DDR timing constraints. Note that different timing constraints will apply depending on the state of each memory request. For example, if a memory request’s target bank currently has a different row loaded in its row buffer, then the memory controller must ensure that a precharge (row close) command is allowed to issue to that bank (e.g., has tRAS elapsed since the row was opened?), but if the row is already closed, then different timing constraints will apply. For each request buffer entry, the memory controller will determine whether or not the request can issue a command to the DRAM based on the current state of the request and the current state of the DRAM system. That is, every request buffer entry (i.e., all 256) needs an independent instantiation of the DDR compliance-checking logic (including data and command bus availability tracking). This type of monolithic memory controller effectively implements a large out-of-order scheduler; note that typical instruction schedulers in modern out-of-order processors only have about 32-64 entries [117]. Even after accounting for the clock speed differences between CPU core and DRAM command frequencies, it is very difficult to implement a fully-associative, age-ordered/prioritized, out-of-order scheduler with 256-512 entries [324].

19 The Staged Memory Scheduler

The proposed Staged Memory Scheduler (SMS) is structured to reflect the primary functional tasks of the memory scheduler. Below, we first describe the overall SMS algorithm, explain additional implementation details, step through the rationale for the design, and then walk through the hardware implementation.

19.1 The SMS Algorithm

Batch Formation. The first stage of SMS consists of several simple FIFO structures, one per source (i.e., a CPU core or the GPU). Each request from a given source is initially inserted into its respective FIFO upon arrival at the memory controller. A batch is simply one or more memory requests from the same source that access the same DRAM row. That is, all requests within a batch, except perhaps for the first one, would be row-buffer hits if scheduled consecutively. A batch is complete or ready when an incoming request accesses a different row, when the oldest request in the batch has exceeded a threshold age, or when the FIFO is full. Ready batches may then be considered by the second stage of the SMS.
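The sketch below models this per-source batch formation logic; the capacity and age threshold are free parameters rather than SMS's configured values, and the data structure is a simplification of the hardware FIFO.

```python
# Sketch of a per-source batch formation FIFO. A batch becomes ready when a
# request to a different row arrives, the oldest request exceeds an age
# threshold, or the FIFO fills up. Parameter values are illustrative.

from collections import deque

class BatchFormationFIFO:
    def __init__(self, capacity, age_threshold):
        self.fifo = deque()
        self.capacity = capacity
        self.age_threshold = age_threshold
        self.last_row = None
        self.batch_ready = False

    def enqueue(self, row, now):
        if self.last_row is not None and row != self.last_row:
            self.batch_ready = True      # a new row closes the current batch
        self.fifo.append({"row": row, "arrival": now})
        self.last_row = row
        if len(self.fifo) >= self.capacity:
            self.batch_ready = True      # FIFO full

    def tick(self, now):
        if self.fifo and now - self.fifo[0]["arrival"] >= self.age_threshold:
            self.batch_ready = True      # oldest request exceeded the age threshold
```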

Batch Scheduler. The batch formation stage has combined individual memory requests into batches of row-buffer hitting requests. The next stage, the batch scheduler, deals directly with batches, and therefore need not worry about scheduling to optimize for row-buffer locality. Instead, the batch scheduler can focus on higher-level policies regarding inter-application interference and fairness. The goal of the batch scheduler is to prioritize batches from applications that are latency critical, while making sure that bandwidth-intensive applications (e.g., the GPU) still make reasonable progress.

The batch scheduler operates in two states: pick and drain. In the pick state, the batch scheduler considers each FIFO from the batch formation stage. For each FIFO that contains a ready batch, the batch scheduler picks one batch based on a balance of shortest-job-first (SJF) and round-robin principles. For SJF, the batch scheduler chooses the core (or GPU) with the fewest total memory requests across all three stages of the SMS. SJF prioritization reduces average request service latency, and it tends to favor latency-sensitive applications, which tend to have fewer total requests. The other component of the batch scheduler is a round-robin policy that simply cycles through each of the per-source FIFOs, ensuring that high memory-intensity applications receive adequate service. Overall, the batch scheduler chooses the SJF policy with a probability p, and the round-robin policy otherwise.

After picking a batch, the batch scheduler enters a drain state where it forwards the requests from the selected batch to the final stage of the SMS. The batch scheduler simply dequeues one request per cycle until all requests from the batch have been removed from the selected batch formation FIFO. At this point, the batch scheduler re-enters the pick state to select the next batch.
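A simplified model of the pick state is sketched below; the probability parameter corresponds to the SJF probability discussed in Section 19.2, while the class structure and the 0.9 default are illustrative assumptions.

```python
# Sketch of the batch scheduler's pick state: with probability p_sjf choose
# the ready source with the fewest in-flight requests (SJF); otherwise use a
# round-robin pointer over sources that have a ready batch.

import random

class BatchScheduler:
    def __init__(self, num_sources, p_sjf=0.9):
        self.num_sources = num_sources
        self.p_sjf = p_sjf
        self.rr_pointer = 0

    def pick(self, ready, inflight):
        """ready[s]: source s has a ready batch; inflight[s]: its request count."""
        candidates = [s for s in range(self.num_sources) if ready[s]]
        if not candidates:
            return None
        if random.random() < self.p_sjf:
            return min(candidates, key=lambda s: inflight[s])  # shortest job first
        for offset in range(self.num_sources):                 # round-robin
            s = (self.rr_pointer + offset) % self.num_sources
            if ready[s]:
                self.rr_pointer = (s + 1) % self.num_sources
                return s
```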

DRAM Command Scheduler. The last stage of the SMS is the DRAM command scheduler (DCS). The DCS consists of one FIFO queue per DRAM bank (e.g., eight banks/FIFOs for DDR3). The drain phase of the batch scheduler places the memory requests directly into these FIFOs. Note that because batches are moved into the DCS FIFOs one batch at a time, any row-buffer locality within a batch is preserved within a DCS FIFO. At this point, any higher-level policy decisions have already been made by the batch scheduler, therefore, the DCS can simply focus on issuing low-level DRAM commands and ensuring DDR protocol compliance.

On any given cycle, the DCS only considers the requests at the head of each of the per-bank FIFOs. For each request, the DCS determines whether that request can issue a command based on the request’s current row-buffer state (i.e., is the row buffer already open with the requested row, closed, or open with the wrong row?) and the current DRAM state (e.g., time elapsed since a row was opened in a bank, data bus availability). If more than one request is eligible to issue a command, the DCS simply arbitrates in a round-robin fashion.
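The per-bank structure of the DCS can be summarized with the sketch below; the eligibility callback stands in for the full DDR timing logic, and all names are illustrative.

```python
# Sketch of the DRAM command scheduler (DCS): one FIFO per bank, only the
# head of each FIFO is considered, and eligible heads are served round-robin.

class DramCommandScheduler:
    def __init__(self, num_banks=8):
        self.bank_fifos = [[] for _ in range(num_banks)]
        self.rr_pointer = 0

    def issue(self, can_issue):
        """can_issue(bank, request) -> bool models the DDR timing checks."""
        num_banks = len(self.bank_fifos)
        for offset in range(num_banks):
            bank = (self.rr_pointer + offset) % num_banks
            fifo = self.bank_fifos[bank]
            if fifo and can_issue(bank, fifo[0]):
                self.rr_pointer = (bank + 1) % num_banks
                return bank, fifo.pop(0)   # issue the request at the head
        return None                        # nothing eligible this cycle
```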

19.2 Additional Algorithm Details

Batch Formation Thresholds. The batch formation stage holds requests in the per-source FIFOs until a complete batch is ready. This could unnecessarily delay requests as the batch will not be marked ready until a request to a different row arrives at the memory controller, or the FIFO size has been reached. This additional queuing delay can be particularly devastating for low-intensity, latency-sensitive applications.

SMS considers an application’s memory intensity in forming batches. For applications with low memory intensity (< 1 MPKC), SMS completely bypasses the batch formation and batch scheduler, and forwards requests directly to the DCS per-bank FIFOs. For these highly latency-sensitive applications, such a bypass policy minimizes the delay to service their requests. Note that this bypass operation will not interrupt an on-going drain from the batch scheduler, which ensures that any separately scheduled batches maintain their row-buffer locality.

For medium memory-intensity (1-10 MPKC) and high memory-intensity (> 10 MPKC) applications, the batch formation stage uses age thresholds of 50 and 200 cycles, respectively. That is, regardless of how many requests are in the current batch, when the oldest request’s age exceeds the threshold, the entire batch is marked ready (and consequently, any new requests that arrive, even if accessing the same row, will be grouped into a new batch). Note that while TCM uses the MPKI metric to classify memory intensity, SMS uses misses per thousand cycles (MPKC), since per-application instruction counts are not typically available in the memory controller. While it would not be overly difficult to expose this information, this is one more implementation overhead that SMS avoids.
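Putting the thresholds above together, a compact sketch of the intensity classification and the resulting batching policy might look as follows (the dictionary-based return format is purely illustrative):

```python
# Sketch of MPKC-based classification: < 1 MPKC bypasses batching entirely;
# medium and high intensity use 50- and 200-cycle age thresholds respectively.

def batching_policy(mpkc):
    if mpkc < 1:
        return {"bypass": True, "age_threshold": None}   # low intensity
    if mpkc <= 10:
        return {"bypass": False, "age_threshold": 50}    # medium intensity
    return {"bypass": False, "age_threshold": 200}       # high intensity
```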

Global Bypass. As described above, low memory-intensity applications can bypass the entire batch formation and scheduling process and proceed directly to the DCS. Even for high memory-intensity applications, if the memory system is lightly loaded (e.g., if this is the only application running on the system right now), then the SMS will allow all requests to proceed directly to the DCS. This bypass is enabled whenever the total number of in-flight requests (across all sources) in the memory controller is less than sixteen requests.

Round-Robin Probability. As described above, the batch scheduler uses a probability p to schedule batches with the SJF policy, and uses the round-robin policy otherwise. Scheduling batches in a round-robin order ensures fair progress for high memory-intensity applications. Our experimental results show that setting p to 90% (i.e., using the round-robin policy for the remaining 10% of batch picks) provides a good performance-fairness trade-off for SMS.

19.3 SMS Rationale

In-Order Batch Formation. It is important to note that batch formation occurs in the order of request arrival. This potentially sacrifices some row-buffer locality as requests to the same row may be interleaved with requests to other rows. We considered many variations of batch formation that allowed out-of-order grouping of requests to maximize the length of a run of row-buffer hitting requests, but the overall performance benefit was not significant. First, constructing very large batches of row-buffer hitting requests can introduce significant unfairness as other requests may need to wait a long time for a bank to complete its processing of a long run of row-buffer hitting requests [205]. Second, row-buffer locality across batches may still be exploited by the DCS. For example, consider a core that has three batches accessing row X, row Y, and then row X again. If X and Y map to different DRAM banks, say banks A and B, then the batch scheduler will send the first and third batches (row X) to bank A, and the second batch (row Y) to bank B. Within the DCS’s FIFO for bank A, the requests for the first and third batches will all be one after the other, thereby exposing the row-buffer locality across batches despite the requests appearing “out-of-order” in the original batch formation FIFOs.

In-Order Batch Scheduling. Due to contention and back-pressure in the system, it is possible that a FIFO in the batch formation stage contains more than one valid batch. In such a case, it could be desirable for the batch scheduler to pick one of the batches not currently at the head of the FIFO. For example, the bank corresponding to the head batch may be busy while the bank for another batch is idle. Scheduling batches out of order could decrease the service latency for the later batches, but in practice it does not make a big difference and adds significant implementation complexity. It is important to note that even though batches are dequeued from the batch formation stage in arrival order per FIFO, the request order between the FIFOs may still slip relative to each other. For example, the batch scheduler may choose a recently arrived (and formed) batch from a high-priority (i.e., latency-sensitive) source even though an older, larger batch from a different source is ready.

In-Order DRAM Command Scheduling. For each of the per-bank FIFOs in the DCS, the requests are already grouped by row-buffer locality (because the batch scheduler drains an entire batch at a time), and globally ordered to reflect per-source priorities. Further reordering at the DCS would likely just undo the prioritization decisions made by the batch scheduler. Like the batch scheduler, the in-order nature of each of the DCS per-bank FIFOs does not prevent out-of-order scheduling at the global level. A CPU’s requests may be scheduled to the DCS in arrival order, but the requests may get scattered across different banks, and the issue order among banks may slip relative to each other.

19.4 Hardware Implementation

The staged architecture of SMS lends directly to a low-complexity hardware implementation. Figure 24 illustrates the overall hardware organization of SMS.

Figure 24: The design of SMS

Batch Formation. The batch formation stage consists of little more than one FIFO per source (CPU or GPU). Each FIFO maintains an extra register that records the row index of the last request, so that any incoming request’s row index can be compared to determine if the request can be added to the existing batch. Note that this requires only a single comparator (used only once at insertion) per FIFO. Contrast this to a conventional monolithic request buffer where comparisons on every request buffer entry (which is much larger than the number of FIFOs that SMS uses) must be made, potentially against all currently open rows across all banks.

Batch Scheduler. The batch scheduling stage consists primarily of combinatorial logic to implement the batch picking rules. When using the SJF policy, the batch scheduler only needs to pick the batch corresponding to the source with the fewest in-flight requests, which can be easily performed with a tree of MIN operators. Note that this tree is relatively shallow since it only grows as a function of the number of FIFOs. Contrast this to the monolithic scheduler where the various ranking trees grow as a function of the total number of entries.

DRAM Command Scheduler. The DCS stage consists of the per-bank FIFOs. The logic to track and enforce the various DDR timing and power constraints is identical to the case of the monolithic scheduler, but the scale is drastically different. The DCS’s DDR command-processing logic only considers the requests at the head of each of the per-bank FIFOs (eight total for DDR3), whereas the monolithic scheduler requires logic to consider every request buffer entry (hundreds).

Overall Configuration and Hardware Cost. The final configuration of SMS that we use in this dissertation consists of the following hardware structures. The batch formation stage uses ten-entry FIFOs for each of the CPU cores, and a twenty-entry FIFO for the GPU. The DCS uses a fifteen-entry FIFO for each of the eight DDR3 banks. For sixteen cores and a GPU, the aggregate capacity of all of these FIFOs is 300 requests, although at any point in time, the SMS logic can only consider or act on a small subset of the entries (i.e., the seventeen at the heads of the batch formation FIFOs and the eight at the heads of the DCS FIFOs). In addition to these primary structures, there are a small handful of bookkeeping counters. One counter per source is needed to track the number of in-flight requests; each counter is trivially managed as it only needs to be incremented when a request arrives at the memory controller, and then decremented when the request is complete. Counters are also needed to track per-source MPKC rates for memory-intensity classification, which are incremented when a request arrives, and then periodically reset. Table 3 summarizes the amount of hardware overhead required for each stage of SMS.

Storage | Description | Size
Storage Overhead of Stage 1: Batch Formation
CPU FIFO queues | A CPU core’s FIFO queue | 10 entries per core (16 cores)
GPU FIFO queue | The GPU’s FIFO queue | 20 entries
MPKC counters | Count per-core MPKC | bits per core
Last request’s row index | Stores the row index of the last request to each FIFO | bits per FIFO
Storage Overhead of Stage 2: Batch Scheduler
CPU memory request counters | Count the number of outstanding memory requests of each CPU core | bits per core
GPU memory request counter | Counts the number of outstanding memory requests of the GPU | bits
Storage Overhead of Stage 3: DRAM Command Scheduler
Per-bank FIFO queues | One FIFO queue per DRAM bank | 15 entries per bank (8 banks)
Table 3: Hardware storage required for SMS

19.5 Experimental Methodology

We use an in-house cycle-accurate simulator to perform our evaluations. For our performance evaluations, we model a system with sixteen x86 CPU cores and a GPU. For the CPUs, we model three-wide out-of-order processors with a cache hierarchy including per-core L1 caches and a shared, distributed L2 cache. The GPU does not share the CPU caches. Table 4 shows the detailed system parameters for the CPU and GPU cores. The parameters for the main memory system are listed in Table 4. Unless stated otherwise, we use four memory controllers (one channel per memory controller) for all experiments. In order to prevent the GPU from taking the majority of request buffer entries, we reserve half of the request buffer entries for the CPUs. To model the memory bandwidth of the GPU accurately, we perform coalescing on GPU memory requests before they are sent to the memory controller [251].

Parameter Setting
CPU Clock Speed 3.2GHz
CPU ROB 128 entries
CPU L1 cache 32KB Private, 4-way
CPU L2 cache 8MB Shared, 16-way
CPU Cache Rep. Policy LRU
GPU SIMD Width 800
GPU Texture units 40
GPU Z units 64
GPU Color units 16
Memory Controller Entries 300
Channels/Ranks/Banks 4/1/8
DRAM Row buffer size 2KB
DRAM Bus 128 bits/channel
tRCD/tCAS/tRP 8/8/8 ns
tRAS/tRC/tRRD 20/27/4 ns
tWTR/tRTP/tWR 4/4/6 ns
Table 4: Simulation parameters.

Workloads. We evaluate our system with a set of 105 multiprogrammed workloads, each simulated for 500 million cycles. Each workload consists of sixteen SPEC CPU2006 benchmarks and one GPU application selected from a mix of video games and graphics performance benchmarks. For each CPU benchmark, we use PIN [355, 261] with PinPoints [328] to select the representative phase. For the GPU application, we use an industrial GPU simulator to collect memory requests with detailed timing information. These requests are collected after having first been filtered through the GPU’s internal cache hierarchy, therefore we do not further model any caches for the GPU in our final hybrid CPU-GPU simulation framework.

We classify CPU benchmarks into three categories (Low, Medium, and High) based on their memory intensities, measured as last-level cache misses per thousand instructions (MPKI). Table 5 shows the MPKI for each CPU benchmark. Benchmarks with less than 1 MPKI are low memory-intensity, those between 1 and 25 MPKI are medium memory-intensity, and those greater than 25 MPKI are high memory-intensity. Based on these three categories, we randomly choose a number of benchmarks from each category to form workloads consisting of seven intensity mixes: L (All low), ML (Low/Medium), M (All medium), HL (High/Low), HML (High/Medium/Low), HM (High/Medium), and H (All high). The GPU benchmark is randomly selected for each workload without any classification.

Name MPKI Name MPKI Name MPKI
tonto 0.01 sjeng 1.08 omnetpp 21.85
povray 0.01 gobmk 1.19 milc 21.93
calculix 0.06 gromacs 1.67 xalancbmk 22.32
perlbench 0.11 h264ref 1.86 libquantum 26.27
namd 0.11 bzip2 6.08 leslie3d 38.13
dealII 0.14 astar 7.6 soplex 52.45
wrf 0.21 hmmer 8.65 GemsFDTD 63.61
gcc 0.33 cactusADM 14.99 lbm 69.63
sphinx3 17.24 mcf 155.30
Table 5: L2 Cache Misses Per Kilo-Instruction (MPKI) of 26 SPEC 2006 benchmarks.

Performance Metrics. In an integrated CPU-GPU system like the one we evaluate, we measure system performance using the CPU+GPU Weighted Speedup (Eqn. 1), which is the sum of the CPU weighted speedup [108, 107] and the GPU speedup multiplied by the weight of the GPU. In addition, we measure Unfairness [93, 220, 221, 418] using the maximum slowdown across all the CPU cores (Eqn. 2). We report the harmonic mean instead of the arithmetic mean for Unfairness in our evaluations, since slowdown is an inverse metric of speedup.

\textrm{CPU+GPU Weighted Speedup} = \sum_{i=1}^{N_{CPU}} \frac{IPC_i^{shared}}{IPC_i^{alone}} + Weight_{GPU} \times \frac{IPC_{GPU}^{shared}}{IPC_{GPU}^{alone}} \quad (1)
\textrm{Unfairness} = \max_{i} \frac{IPC_i^{alone}}{IPC_i^{shared}} \quad (2)
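For reference, the metrics above might be computed per workload as in the sketch below, assuming per-application IPCs measured alone and in the shared system; treating the GPU speedup as an IPC ratio, and the gpu_weight parameter, are assumptions of this sketch.

```python
# Sketch of the evaluation metrics: CPU+GPU weighted speedup and unfairness
# (maximum slowdown across CPU cores). Inputs are per-application IPC values.

def weighted_speedup(cpu_shared, cpu_alone, gpu_shared, gpu_alone, gpu_weight):
    cpu_ws = sum(s / a for s, a in zip(cpu_shared, cpu_alone))
    return cpu_ws + gpu_weight * (gpu_shared / gpu_alone)

def unfairness(cpu_shared, cpu_alone):
    # Maximum slowdown across the CPU cores.
    return max(a / s for s, a in zip(cpu_shared, cpu_alone))
```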

20 Qualitative Comparison with Previous Scheduling Algorithms

In this section, we compare SMS qualitatively to previously proposed scheduling policies and analyze the basic differences between SMS and these policies. The fundamental difference between SMS and previously proposed memory scheduling policies for CPU only scenarios is that the latter are designed around a single, centralized request buffer which has poor scalability and complex scheduling logic, while SMS is built around a decentralized, scalable framework.

20.1 First-Ready FCFS (FR-FCFS)

FR-FCFS [357] is a commonly used scheduling policy in commodity DRAM systems. An FR-FCFS scheduler prioritizes requests that result in row-buffer hits over row-buffer misses, and otherwise prioritizes older requests. Since FR-FCFS unfairly prioritizes applications with high row-buffer locality to maximize DRAM throughput, prior work [220, 293, 221, 292, 281] has observed that it leads to low system performance and high unfairness.

20.2 Parallelism-aware Batch Scheduling (PAR-BS)

PAR-BS [293] aims to improve fairness and system performance. In order to prevent unfairness, it forms batches of outstanding memory requests and prioritizes the oldest batch, to avoid request starvation. To improve system throughput, it prioritizes applications with a smaller number of outstanding memory requests within a batch. However, PAR-BS has two major shortcomings. First, batching could cause older GPU requests and requests of other memory-intensive CPU applications to be prioritized over latency-sensitive CPU applications. Second, as previous work [220] has also observed, PAR-BS does not take into account an application’s long-term memory-intensity characteristics when it assigns application priorities within a batch. This could cause memory-intensive applications’ requests to be prioritized over latency-sensitive applications’ requests within a batch.

20.3 Adaptive per-Thread Least-Attained-Serviced Memory Scheduling (ATLAS)

ATLAS [220] aims to improve system performance by prioritizing requests of applications with lower attained memory service. This improves the performance of low memory-intensity applications, as they tend to have low attained service. However, ATLAS has the disadvantage of not preserving fairness. Previous work [220, 221] has shown that simply prioritizing low memory-intensity applications leads to significant slowdowns of memory-intensive applications.

20.4 Thread Cluster Memory Scheduling (TCM)

TCM [221] is the state-of-the-art application-aware memory scheduler that provides both high system throughput and fairness. It groups applications into either latency-sensitive or bandwidth-sensitive clusters based on their memory intensities. In order to achieve high system throughput and low unfairness, TCM employs a different prioritization policy for each cluster. To improve system throughput, a fraction of the total memory bandwidth is dedicated to the latency-sensitive cluster, and applications within that cluster are ranked based on memory intensity, with the least memory-intensive application receiving the highest priority. On the other hand, TCM minimizes unfairness by periodically shuffling applications within the bandwidth-sensitive cluster to avoid starvation. This approach provides both high system performance and fairness in CPU-only systems. In an integrated CPU-GPU system, however, the GPU generates a significantly larger number of memory requests than the CPUs and fills up the centralized request buffer. As a result, the memory controller lacks the visibility into CPU memory requests needed to accurately determine each application’s memory access behavior. Without this visibility, TCM makes incorrect and non-robust clustering decisions, which classify some applications with high memory intensity into the latency-sensitive cluster. These misclassified applications cause interference not only to low memory intensity applications, but also to each other. Therefore, TCM causes some degradation in both system performance and fairness in an integrated CPU-GPU system. As described in Section 18, increasing the request buffer size is a simple and straightforward way to gain more visibility into CPU applications’ memory access behaviors. However, this approach is not scalable, as we show in our evaluations (Section 21). In contrast, SMS provides much better system performance and fairness than TCM with the same number of request buffer entries and lower hardware cost.

21 Experimental Evaluation of SMS

We present the performance of five memory scheduler configurations: FR-FCFS, ATLAS, PAR-BS, TCM, and SMS on the 16-CPU/1-GPU four-memory-controller system described in Section 19.5. All memory schedulers use 300 request buffer entries per memory controller; this size was chosen based on the results in Figure 23 which showed that performance does not appreciably increase for larger request buffer sizes. Results are presented in the workload categories as described in Section 19.5, with workload memory intensities increasing from left to right.

Figure 25 shows the system performance (measured as weighted speedup) and fairness of the previously proposed algorithms and SMS, averaged across 15 workloads for each of the seven categories (105 workloads in total). Compared to TCM, which is the best previous algorithm for both system performance and fairness, SMS provides a 41.2% system performance improvement and a 4.8× fairness improvement. Therefore, we conclude that SMS provides better system performance and fairness than all previously proposed scheduling policies, while incurring much lower hardware cost and simpler scheduling logic.

Figure 25: System performance, and fairness for 7 categories of workloads (total of 105 workloads)

Based on the results for each workload category, we make the following major observations. First, SMS consistently outperforms previously proposed algorithms (given the same number of request buffer entries) in terms of both system performance and fairness across most of the workload categories. Second, in the “H” category with only high memory-intensity workloads, SMS underperforms by 21.2%/20.7%/22.3% compared to ATLAS/PAR-BS/TCM, but SMS still provides 16.3% higher system performance compared to FR-FCFS. The main reason for this behavior is that ATLAS/PAR-BS/TCM improve performance by unfairly prioritizing certain applications over others, which is reflected in their poor fairness results. For instance, we observe that TCM misclassifies some of these high memory-intensity applications into the low memory-intensity cluster, which starves requests of applications in the high memory-intensity cluster. On the other hand, SMS preserves fairness in all workload categories by using its probabilistic round-robin policy, as described in Section 19. As a result, SMS provides 7.6×/7.5×/5.2× better fairness relative to ATLAS/PAR-BS/TCM, respectively, for the high memory-intensity category.

21.1 Analysis of CPU and GPU Performance

Figure 26: CPUs and GPU Speedup for 7 categories of workloads (total of 105 workloads)

In this section, we study the performance of the CPU system and the GPU system separately. Figure 26 shows CPU-only weighted speedup and GPU speedup. Two major observations are in order. First, SMS gains a 1.76× improvement in CPU system performance over TCM. Second, SMS achieves this 1.76× CPU performance improvement while delivering similar GPU performance to the FR-FCFS baseline. The results show that TCM (and the other algorithms) end up allocating far more bandwidth to the GPU, at significant performance and fairness cost to the CPU applications. SMS appropriately deprioritizes the memory-bandwidth-intensive GPU application in order to enable higher CPU performance and overall system performance, while preserving fairness. Previously proposed scheduling algorithms, on the other hand, allow the GPU to hog memory bandwidth and significantly degrade system performance and fairness (Figure 25).

21.2 Scalability with Cores and Memory Controllers

Figure 27: SMS vs TCM on a 16 CPU/1 GPU, 4 memory controller system with varying the number of cores
Figure 28: SMS vs TCM on a 16 CPU/1 GPU system with varying the number of channels

Figure 27 compares the performance and fairness of SMS against TCM (averaged over 75 workloads) with the same number of request buffers, as the number of cores is varied. We make the following observations. First, SMS continues to provide better system performance and fairness than TCM. Second, the system performance and fairness gains increase significantly as the number of cores, and hence the memory pressure, increases. SMS’s performance and fairness benefits are likely to become more significant as core counts in future technology nodes increase.

Figure 28 shows the system performance and fairness of SMS compared against TCM as the number of memory channels is varied. For this, and all subsequent results, we perform our evaluations on 60 workloads from categories that contain high memory-intensity applications. We observe that SMS scales better as the number of memory channels increases. As the performance gain of TCM diminishes when the number of memory channels increases from 4 to 8 channels, SMS continues to provide performance improvement for both CPU and GPU.

21.3 Sensitivity to SMS Design Parameters

Effect of Batch Formation. Figure 29 shows the system performance and fairness of SMS as the maximum batch size varies. When the batch scheduler can forward individual requests to the DCS, system performance and fairness drop significantly, by 12.3% and 1.9× respectively, compared to using a maximum batch size of ten. The reasons are twofold. First, intra-application row-buffer locality is not preserved without forming requests into batches, and this causes performance degradation due to longer average service latencies. Second, the GPU’s and high memory-intensity applications’ requests generate a lot of interference by destroying each other’s, and most importantly latency-sensitive applications’, row-buffer locality. With a reasonable maximum batch size (from ten onwards), intra-application row-buffer locality is well-preserved with reduced interference, providing good system performance and fairness. We have also observed that most CPU applications rarely form batches that exceed ten requests. This is because the in-order request stream rarely has such a long sequence of requests all to the same row, and the timeout threshold also prevents batches from becoming too large. As a result, increasing the batch size beyond ten requests does not provide any extra benefit, as shown in Figure 29.

Figure 29: SMS sensitivity to batch Size
Figure 30: SMS sensitivity to DCS FIFO Size

DCS FIFO Size. Figure 30 shows the sensitivity of SMS to the size of the per-bank FIFOs in the DRAM Command Scheduler (DCS). Fairness degrades as the size of the DCS FIFOs is increased. As the size of the per-bank DCS FIFOs increases, the batch scheduler tends to move more batches from the batch formation stage to the DCS FIFOs. Once batches are moved to the DCS FIFOs, they cannot be reordered anymore. So even if a higher-priority batch were to become ready, the batch scheduler cannot move it ahead of any batches already in the DCS. On the other hand, if these batches were left in the batch formation stage, the batch scheduler could still reorder them. Overall, it is better to employ smaller per-bank DCS FIFOs that leave more batches in the batch formation stage, enabling the batch scheduler to see more batches and make better batch scheduling decisions, thereby reducing starvation and improving fairness. The FIFOs only need to be large enough to keep the DRAM banks busy.

21.4 Case Studies

In this section, we study some additional workload setups and design choices. In view of simulation bandwidth and time constraints, we reduce the simulation time to 200M cycles for these studies.

\paragraphbe

Case study 1: CPU-only Results. In the previous sections, we showed that SMS effectively mitigates inter-application interference in a CPU-GPU integrated system. In this case study, we evaluate the performance of SMS in a CPU-only scenario. Figure 31 shows the system performance and fairness of SMS on a 16-CPU system with exactly the same system parameters as described in Section 19.5, except that the system does not have a GPU. We present results only for workload categories with at least some high memory-intensity applications, as the performance and fairness of the other workload categories are quite similar to TCM. We observe that, on average across workloads in the “H” category, SMS degrades performance by only 4% compared to TCM, while improving fairness by 25.7%. SMS’s performance degradation mainly comes from the “H” workload category (only high memory-intensity applications): as discussed in our main evaluations, TCM mis-classifies some high memory-intensity applications into the low memory-intensity cluster, starving requests of applications classified into the high memory-intensity cluster. Therefore, TCM gains performance at the cost of fairness. On the other hand, SMS prevents this starvation/unfairness with its probabilistic round-robin policy, while still maintaining good system performance.

Figure 31: System performance and fairness on a 16 CPU-only system.
\paragraphbe

Case study 2: Always Prioritizing CPU Requests over GPU Requests. Our results in the previous sections show that SMS achieves its system performance and fairness gains by appropriately managing the GPU request stream. In this case study, we consider modifying previously proposed policies by always deprioritizing the GPU. Specifically, we implement variants of the FR-FCFS and TCM scheduling policies, CFR-FCFS and CTCM, where the CPU applications’ requests are always selected over the GPU’s requests. Figure 32 shows the performance and fairness of FR-FCFS, CFR-FCFS, TCM, CTCM and SMS scheduling policies, averaged across workload categories containing high-intensity applications. Several conclusions are in order. First, by protecting the CPU applications’ requests from the GPU’s interference, CFR-FCFS improves system performance by 42.8% and fairness by 4.82x as compared to FR-FCFS. This is because the baseline FR-FCFS is completely application-unaware and it always prioritizes the row-buffer hitting requests of the GPU, starving other applications’ requests. Second, CTCM does not improve system performance and fairness much compared to TCM, because baseline TCM is already application-aware. Finally, SMS still provides much better system performance and fairness than CFR-FCFS and CTCM because it deprioritizes the GPU appropriately, but not completely, while preserving the row-buffer locality within the GPU’s request stream. Therefore, we conclude that SMS provides better system performance and fairness than merely prioritizing CPU requests over GPU requests.
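To illustrate how simply prioritizing the CPU can be grafted onto FR-FCFS, the comparator below orders requests by CPU-over-GPU first, then row-buffer hit, then age. This is a sketch under assumptions: the request fields are hypothetical simplifications of a real memory controller’s state, not the exact CFR-FCFS implementation.

\begin{verbatim}
#include <tuple>

// Hypothetical, simplified view of a pending memory request.
struct MemRequest {
    bool from_gpu;      // true if issued by the GPU
    bool row_hit;       // true if it hits the currently open DRAM row
    unsigned long age;  // older requests have a larger age value
};

// Returns true if 'a' should be scheduled before 'b' under CFR-FCFS:
// CPU requests always beat GPU requests; ties are broken by FR-FCFS
// (row-buffer hits first, then oldest first).
bool cfr_fcfs_before(const MemRequest& a, const MemRequest& b) {
    return std::make_tuple(!a.from_gpu, a.row_hit, a.age) >
           std::make_tuple(!b.from_gpu, b.row_hit, b.age);
}
\end{verbatim}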

Figure 32: Performance and fairness when always prioritizing CPU requests over GPU requests

22 SMS: Conclusion

While many advancements in memory scheduling policies have been made to deal with multi-core processors, the integration of GPUs on the same chip as the CPUs has created new system design challenges. This work has demonstrated how the inclusion of GPU memory traffic can cause severe difficulties for existing memory controller designs in terms of performance and especially fairness. In this dissertation, we propose a new approach, Staged Memory Scheduler, that delivers superior performance and fairness compared to state-of-the-art memory schedulers, while providing a design that is significantly simpler to implement. The key insight behind SMS’s scalability is that the primary functions of sophisticated memory controller algorithms can be decoupled, leading to our multi-stage architecture. This research attacks a critical component of a fused CPU-GPU system’s memory hierarchy design, but there remain many other problems that warrant further research. For the memory controller specifically, additional explorations will be needed to consider interactions with GPGPU workloads. Co-design and concerted optimization of the cache hierarchy organization, cache partitioning, prefetching algorithms, memory channel partitioning, and the memory controller are likely needed to fully exploit future heterogeneous computing systems, but significant research effort will be needed to find effective, practical, and innovative solutions.

Chapter \thechapter Reducing Inter-address-space Interference with a TLB-aware Memory Hierarchy

Graphics Processing Units (GPUs) provide high throughput by exploiting a high degree of thread-level parallelism. A GPU executes hundreds of threads concurrently, where the threads are grouped into multiple warps. The GPU executes each warp in lockstep (i.e., each thread in the warp executes the same instruction concurrently). When one or more threads of a warp stall, the GPU hides the latency of this stall by scheduling and executing another warp. This high throughput provided by a GPU creates an opportunity to accelerate applications from a wide range of domains (e.g., [25, 77, 396, 157, 63, 3, 203, 90, 230, 284, 258, 267, 301]).


GPU compute density continues to increase to support demanding applications. For example, emerging GPU architectures are expected to provide as many as 128 streaming multiprocessors (i.e., GPU cores) per chip in the near future [31, 421]. While the increased compute density can help many individual general-purpose GPU (GPGPU) applications, it exacerbates a growing need to share the GPU cores across multiple applications in order to fully utilize the large amount of GPU resources. This is especially true in large-scale computing environments, such as cloud servers, where diverse demands for compute and memory exist across different applications. To enable efficient GPU utilization in the presence of application heterogeneity, these large-scale environments rely on the ability to virtualize the GPU compute resources and execute multiple applications concurrently on a single GPU [180, 10, 6, 174].

The adoption of GPUs in large-scale computing environments is hindered by the primitive virtualization support in contemporary GPUs [5, 432, 61, 179, 80, 181, 344, 278, 307, 308, 310, 311, 312, 315, 7, 8, 427]. While hardware virtualization support has improved for integrated GPUs [5, 432, 61, 179, 80, 181, 344, 278, 307, 308], where the GPU cores and CPU cores are on the same chip and share the same off-chip memory, virtualization support for discrete GPUs [310, 311, 312, 315, 7, 8, 427, 278, 344], where the GPU is on a different chip than the CPU and has its own memory, is insufficient. Despite poor existing support for virtualization, discrete GPUs are likely to be more attractive than integrated GPUs for large-scale computing environments, as they provide the highest-available compute density and remain the platform of choice in many domains [3, 25, 157, 63, 77].

Two alternatives for virtualizing discrete GPUs are time multiplexing and spatial multiplexing. Modern GPU architectures support time multiplexing using application preemption [251, 315, 409, 129, 311, 430], but this support currently does not scale well because each additional application increases contention for the limited GPU resources (Section 23.1). Spatial multiplexing allows us to share a GPU among concurrently-executing applications much as we currently share multi-core CPUs, by providing support for multi-address-space concurrency (i.e., the concurrent execution of application kernels from different processes or guest VMs). By efficiently and dynamically managing application kernels that execute concurrently on the GPU, spatial multiplexing avoids the scaling issues of time multiplexing. To support spatial multiplexing, GPUs must provide architectural support for both memory virtualization and memory protection.


We find that existing techniques for spatial multiplexing in modern GPUs (e.g., [323, 311, 315, 306]) have two major shortcomings. They either (1) require significant programmer intervention to adapt existing programs for spatial multiplexing; or (2) sacrifice memory protection, which is a key requirement for virtualized systems. To overcome these shortcomings, GPUs must utilize memory virtualization [182], which enables multiple applications to run concurrently while providing memory protection. While memory virtualization support in modern GPUs is also primitive, in large part due to the poor performance of address translation, several recent efforts have worked to improve address translation within GPUs [343, 342, 453, 420, 83]. These efforts introduce translation lookaside buffer (TLB) designs that improve performance significantly when a single application executes on a GPU. Unfortunately, as we show in Section 24, even these improved address translation mechanisms suffer from high performance overheads during spatial multiplexing, as the limited capacities of the TLBs become a source of significant contention within the GPU.


In this chapter, we perform a thorough experimental analysis of concurrent multi-application execution when state-of-the-art address translation techniques are employed in a discrete GPU (Section 25). We make three key observations from our analysis. First, a single TLB miss frequently stalls multiple warps at once, and incurs a very high latency, as each miss must walk through multiple levels of a page table to find the desired address translation. Second, due to high contention for shared address translation structures among the multiple applications, the TLB miss rate increases significantly. As a result, the GPU often does not have enough warps that are ready to execute, leaving GPU cores idle and defeating the GPU’s latency hiding properties. Third, contention between applications induces significant thrashing on the shared L2 TLB and significant interference between TLB misses and data requests throughout the entire GPU memory system. With only a few simultaneous TLB miss requests, it becomes difficult for the GPU to find a warp that can be scheduled for execution, which defeats the GPU’s basic fine-grained multithreading techniques [410, 411, 390, 389] that are essential for hiding the latency of stalls.

Based on our extensive experimental analysis, we conclude that address translation is a first-order performance concern in GPUs when multiple applications are executed concurrently. Our goal in this work is to develop new techniques that can alleviate the severe address translation bottleneck in state-of-the-art GPUs.

To this end, we propose Multi-Address Space Concurrent Kernels (MASK), a new GPU framework that minimizes inter-application interference and address translation overheads during concurrent application execution. The overarching idea of MASK is to make the entire memory hierarchy aware of TLB requests. MASK takes advantage of locality across GPU cores to reduce TLB misses, and relies on three novel mechanisms to minimize address translation overheads. First, TLB-Fill Tokens provide a contention-aware mechanism to reduce thrashing in the shared L2 TLB, including a bypass cache to increase the TLB hit rate. Second, our TLB-request-aware L2 Bypass mechanism provides contention-aware cache bypassing to reduce interference at the L2 cache between address translation requests and data demand requests. Third, our Address-space-aware DRAM Scheduler provides a contention-aware memory controller policy that prioritizes address translation requests over data demand requests to mitigate high address translation overheads. Working together, these three mechanisms are highly effective at alleviating the address translation bottleneck, as our results show (Section 26).


Our comprehensive experimental evaluation shows that, via the use of TLB-request-aware policies throughout the memory hierarchy, MASK significantly reduces (1) the number of TLB misses that occur during multi-application execution; and (2) the overall latency of the remaining TLB misses, by ensuring that page table walks are serviced quickly. As a result, MASK greatly increases the average number of threads that can be scheduled during long-latency stalls, which in turn improves system throughput (weighted speedup [107, 108]) by 57.8%, improves IPC throughput by 43.4%, and reduces unfairness by 22.4% over a state-of-the-art GPU memory management unit (MMU) design [343]. MASK provides performance within only 23.2% of an ideal TLB that always hits.

This chapter makes the following major contributions:

  • To our knowledge, this is the first work to (1) provide a thorough analysis of GPU memory virtualization under multi-address-space concurrency, (2) show the large impact of address translation on latency hiding within a GPU, and (3) demonstrate the need for new techniques to alleviate contention caused by address translation due to multi-application execution in a GPU.

  • We propose MASK [39, 41, 40], a new GPU framework that mitigates address translation overheads in the presence of multi-address-space concurrency. MASK consists of three novel techniques that work together to increase TLB request awareness across the entire GPU memory hierarchy. MASK (1) significantly improves system performance, IPC throughput, and fairness over a state-of-the-art GPU address translation mechanism; and (2) provides practical support for spatially partitioning a GPU across multiple address spaces.

23 Background

There is an increasingly pressing need to share the GPU hardware among multiple applications to improve GPU resource utilization. As a result, recent work [37, 323, 311, 315, 306, 4, 251] enables support for GPU virtualization, where a single physical GPU can be shared transparently across multiple applications, with each application having its own address space.16 Much of this work relies on traditional time and spatial multiplexing techniques that are employed by CPUs, and state-of-the-art GPUs contain elements of both types of techniques [406, 413, 429]. Unfortunately, as we discuss in this section, existing GPU virtualization implementations are too coarse-grained: they employ fixed hardware policies that leave system software without mechanisms that can dynamically reallocate GPU resources to different applications, which are required for true application-transparent GPU virtualization.

23.1 Time Multiplexing

Most modern systems time-share (i.e., time multiplex) the GPU by running kernels from multiple applications back-to-back [311, 251]. These designs are optimized for the case where no concurrency exists between kernels from different address spaces. This simplifies memory protection and scheduling at the cost of two fundamental trade-offs. First, kernels from a single address space usually cannot fully utilize all of the GPU’s resources, leading to significant resource underutilization [207, 209, 323, 430, 191, 425]. Second, time multiplexing limits the ability of a GPU kernel scheduler to provide forward-progress or QoS guarantees, which can lead to unfairness and starvation [362].

While kernel preemption [409, 129, 430, 311, 315] could allow a time-sharing scheduler to avoid a case where one GPU kernel unfairly uses up most of the execution time (e.g., by context switching at a fine granularity), such preemption support remains an active research area in GPUs [409, 129]. Software approaches [430] sacrifice memory protection. NVIDIA’s Kepler [311] and Pascal [315] architectures support preemption at the thread block and instruction granularity, respectively. We empirically find that neither granularity is effective at minimizing inter-application interference.


To illustrate the performance overhead of time multiplexing, Figure 33 shows how the execution time increases when we use time multiplexing to switch between multiple concurrently-executing processes, as opposed to executing the processes back-to-back without any concurrent execution. We perform these experiments on real NVIDIA K40 [303, 311] and NVIDIA GTX 1080 [304] GPUs. Each process runs a GPU kernel that interleaves basic arithmetic operations with loads and stores into shared and global memory. We observe that as more processes execute concurrently, the overhead of time multiplexing grows significantly. For example, on the NVIDIA GTX 1080, time multiplexing between two processes increases the total execution time by 12%, as opposed to executing one process immediately after the other process finishes. When we increase the number of processes to 10, the overhead of time multiplexing increases to 91%. On top of this high performance overhead, we find that inter-application interference pathologies (e.g., the starvation of one or more concurrently-executing application kernels) are easy to induce: an application kernel from one process consuming the majority of shared memory can easily cause application kernels from other processes to never get scheduled for execution on the GPU. While we expect preemption support to improve in future hardware, we seek a multi-application concurrency solution that does not depend on it.

Figure 33: Increase in execution time when time multiplexing is used to execute processes concurrently on real GPUs.

23.2 Spatial Multiplexing

Resource utilization can be improved with spatial multiplexing [4], as the ability to execute multiple application kernels concurrently (1) enables the system to co-schedule kernels that have complementary resource demands, and (2) can enable independent progress guarantees for different kernels. Examples of spatial multiplexing support in modern GPUs include (1) application-specific software scheduling of multiple kernels [323]; and (2) NVIDIA’s CUDAstream support [311, 315, 306], which co-schedules kernels from independent “streams” by merging them into a single address space. Unfortunately, these spatial multiplexing mechanisms have significant shortcomings. Software approaches (e.g., Elastic Kernels [323]) require programmers to manually time-slice kernels to enable their mapping onto CUDA streams for concurrency. While CUDAstream supports the flexible partitioning of resources at runtime, merging kernels into a single address space sacrifices memory protection, which is a key requirement in virtualized settings.


True GPU support for multiple concurrent address spaces can address these shortcomings by enabling hardware virtualization. Hardware virtualization allows the system to (1) adapt to changes in application resource utilization or (2) mitigate interference at runtime, by dynamically allocating hardware resources to multiple concurrently-executing applications. NVIDIA and AMD both offer products [159, 9] with partial hardware virtualization support. However, these products simplify memory protection by statically partitioning the hardware resources prior to program execution. As a result, these systems cannot adapt to changes in demand at runtime, and, thus, can still leave GPU resources underutilized. To efficiently support the dynamic sharing of GPU resources, GPUs must provide memory virtualization and memory protection, both of which require efficient mechanisms for virtual-to-physical address translation.

24 Baseline Design


We describe (1) the state-of-the-art address translation mechanisms for GPUs, and (2) the overhead of these translation mechanisms when multiple applications share the GPU [343]. We analyze the shortcomings of state-of-the-art address translation mechanisms for GPUs in the presence of multi-application concurrency in Section 25, which motivates the need for MASK.


State-of-the-art GPUs extend the GPU memory hierarchy with translation lookaside buffers (TLBs) [343]. TLBs (1) greatly reduce the overhead of address translation by caching recently-used virtual-to-physical address mappings from a page table, and (2) help ensure that memory accesses from application kernels running in different address spaces are isolated from each other. Recent works [343, 342] propose optimized TLB designs that improve address translation performance for GPUs.

We adopt a baseline based on these state-of-the-art TLB designs, whose memory hierarchy makes use of one of two variants for address translation: (1) PWCache, a previously-proposed design that utilizes a shared page walk cache after the L1 TLB [343] (Figure 34a); and (2) SharedTLB, a design that utilizes a shared L2 TLB after the L1 TLB (Figure 34b). The TLB caches translations that are stored in a multi-level page table (we assume a four-level page table in this work). We extend both TLB designs to handle multi-address-space concurrency. Both variants incorporate private per-core L1 TLBs, and all cores share a highly-threaded page table walker. For PWCache, on a miss in the L1 TLB (1 in Figure 34a), the GPU initiates a page table walk (2), which probes the shared page walk cache (3). Any page walk requests that miss in the page walk cache go to the shared L2 cache and (if needed) main memory. For SharedTLB, on a miss in the L1 TLB (4 in Figure 34b), the GPU checks whether the translation is available in the shared L2 TLB (5). If the translation misses in the shared L2 TLB, the GPU initiates a page table walk (6), whose requests go to the shared L2 cache and (if needed) main memory.17
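The two baseline translation flows can be summarized in the short control-flow sketch below. This is a simplified model with placeholder probe functions, not the simulator’s code; it only captures which structure is consulted at each step, and all function names are assumptions of this example.

\begin{verbatim}
#include <cstdint>

// Placeholder probes for the translation structures; a real model consults
// tag arrays and accounts for latency. All functions here are illustrative stubs.
static bool probe_l1_tlb(uint64_t /*vpn*/)               { return false; }
static bool probe_page_walk_cache(uint64_t /*pte_addr*/) { return false; }
static bool probe_shared_l2_tlb(uint64_t /*vpn*/)        { return false; }
static uint64_t pte_address(uint64_t vpn, int level)     { return vpn ^ (uint64_t)level; } // stub
static void access_l2_and_dram(uint64_t /*addr*/)        { /* shared L2 cache, then DRAM */ }

// PWCache variant: an L1 TLB miss triggers a page table walk; each of the four
// dependent walk steps first probes the shared page walk cache and only goes
// to the shared L2 cache / main memory on a page-walk-cache miss.
static void translate_pwcache(uint64_t vpn) {
    if (probe_l1_tlb(vpn)) return;                 // L1 TLB hit
    for (int level = 0; level < 4; ++level) {      // four-level page table
        uint64_t step = pte_address(vpn, level);
        if (!probe_page_walk_cache(step))
            access_l2_and_dram(step);
    }
}

// SharedTLB variant: an L1 TLB miss probes the shared L2 TLB; only on an L2
// TLB miss does the walker issue its dependent steps to the L2 cache / DRAM.
static void translate_sharedtlb(uint64_t vpn) {
    if (probe_l1_tlb(vpn)) return;                 // L1 TLB hit
    if (probe_shared_l2_tlb(vpn)) return;          // shared L2 TLB hit
    for (int level = 0; level < 4; ++level)
        access_l2_and_dram(pte_address(vpn, level));
}
\end{verbatim}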

Figure 34: Two variants of the baseline GPU design.

Figure 35 compares the performance of both baseline variants (PWCache, depicted in Figure 34a, and SharedTLB, depicted in Figure 34b), running two separate applications concurrently, to an ideal scenario where every TLB access is a hit (see Table 6 for our simulation configuration, and Section 27 for our methodology). We find that both variants incur a significant performance overhead (45.0% and 40.6% on average) compared to the ideal case.18 In order to retain the benefits of sharing a GPU across multiple applications, we first analyze the shortcomings of our baseline design, and then use this analysis to develop our new mechanisms that improve TLB performance to make it approach the ideal performance.

Figure 35: Baseline designs vs. ideal performance.

25 Design Space Analysis


To improve the performance of address translation in GPUs, we first analyze and characterize the translation overhead in a state-of-the-art baseline (see Section 24), taking into account especially the performance challenges induced by multi-address-space concurrency and contention. We first analyze how TLB misses can limit the GPU’s ability to hide long-latency stalls, which directly impacts performance (Section 25.1). Next, we discuss two types of memory interference that impact GPU performance: (1) interference introduced by sharing GPU resources among multiple concurrent applications (Section 25.2), and (2) interference introduced by sharing the GPU memory hierarchy between address translation requests and data demand requests (Section 25.3).

25.1 Effect of TLB Misses on GPU Performance

GPU throughput relies on fine-grained multithreading [410, 411, 390, 389] to hide memory latency.19 We observe a fundamental tension between address translation and fine-grained multithreading. The need to cache address translations at a page granularity, combined with application-level spatial locality, increases the likelihood that address translations fetched in response to a TLB miss are needed by more than one warp (i.e., by many threads). Even with the massive levels of parallelism supported by GPUs, we observe that a small number of outstanding TLB misses can result in the warp scheduler not having enough ready warps to schedule, which in turn limits the GPU’s essential latency-hiding mechanism.


Figure 36 illustrates a scenario for an application with four warps, where all four warps execute on the same GPU core. Figure 36a shows how the GPU behaves when no virtual-to-physical address translation is required. When Warp A performs a high-latency memory access (1 in Figure 36), the GPU core does not stall since other warps have schedulable instructions (Warps B–D). In this case, the GPU core selects an active warp (Warp B) in the next cycle (2), and continues issuing instructions. Even though Warps B–D also perform memory accesses some time later, the accesses are independent of each other, and the GPU avoids stalling by switching to a warp that is not waiting for a memory access (3, 4). Figure 36b depicts the same four warps when address translation is required. Warp A misses in the TLB (indicated in red), and stalls (5) until the virtual-to-physical translation finishes. In Figure 36b, due to spatial locality within the application, the other warps (Warps B–D) need the same address translation as Warp A. As a result, they too stall (6, 7, 8). At this point, the GPU no longer has any warps that it can schedule, and the GPU core stalls until the address translation request completes. Once the address translation request completes (9), the data demand requests of the warps are issued to memory. Depending on the available memory bandwidth and the parallelism of these data demand requests, the data demand requests from Warps B–D can incur additional queuing latency (10, 11, 12). The GPU core can resume execution only after the data demand request for Warp A is complete (13).


Three phenomena harm performance in this scenario. First, warps stalled on TLB misses reduce the availability of schedulable warps, which lowers GPU utilization. In Figure 36, no available warp exists while the address translation request is pending, so GPU utilization goes down to 0% for a long time. Second, address translation requests, which are a series of dependent memory requests generated by a page walk, must complete before a pending data demand request that requires the physical address can be issued, which reduces the GPU’s ability to hide latency by keeping many memory requests in flight. Third, when the address translation data becomes available, all stalled warps that were waiting for the translation consecutively execute and send their data demand requests to memory, resulting in additional queuing delay for data demand requests throughout the memory hierarchy.

Figure 36: Example bottlenecks created by TLB misses.

To illustrate how TLB misses significantly reduce the number of ready-to-schedule warps in GPU applications, Figure 37 shows the average number of concurrent page table walks (sampled every 10K cycles) for a range of applications, and Figure 38 shows the average number of stalled warps per active TLB miss, in the SharedTLB baseline design. Error bars indicate the minimum and maximum values. We observe from Figure 37 that more than 20 outstanding TLB misses can perform page walks at the same time, all of which contend for access to address translation structures. From Figure 38, we observe that each TLB miss can stall more than 30 warps out of the 64 warps in the core. The combined effect of these observations is that TLB misses in a GPU can quickly stall a large number of warps within a GPU core. The GPU core must wait for the misses to be resolved before issuing data demand requests and resuming execution. Hence, minimizing TLB misses and the page table walk latency is critical.

Figure 37: Average number of concurrent page walks.
Figure 38: Average number of warps stalled per TLB miss.
\para

Impact of Large Pages. A large page size can significantly improve the coverage of the TLB [37]. However, a TLB miss on a large page stalls many more warps than a TLB miss on a small page. We find that with a 2MB page size, the average number of stalled warps increases to close to 100% [37], even though the average number of concurrent page table walks never exceeds 5 misses per GPU core. Regardless of the page size, there is a strong need for mechanisms that mitigate the high cost of TLB misses.

25.2 Interference at the Shared TLB


When multiple applications are concurrently executed, the address translation overheads discussed in Section 25.1 are exacerbated due to inter-address-space interference. To study the impact of this interference, we measure how the TLB miss rates change once another application is introduced. Figure 39 compares the 512-entry L2 TLB miss rate of four representative workloads when each application in the workload runs in isolation to the miss rate when the two applications run concurrently and share the L2 TLB. We observe from the figure that inter-address-space interference increases the TLB miss rate significantly for most applications. This occurs because when the applications share the TLB, address translation requests often induce TLB thrashing. The resulting thrashing (1) hurts performance, and (2) leads to unfairness and starvation when applications generate TLB misses at different rates in the TLB (not shown).

Figure 39: Effect of interference on the shared L2 TLB miss rate. Each set of bars corresponds to a pair of co-running applications (e.g., “3DS_HISTO” denotes that the 3DS and HISTO benchmarks are run concurrently).

25.3 Interference Throughout the Memory Hierarchy

\para

Interference at the Shared Data Cache. Prior work [36] demonstrates that while cache hits in GPUs reduce the consumption of off-chip memory bandwidth, the cache hits result in a lower load/store instruction latency only when every thread in the warp hits in the cache. In contrast, when a page table walk hits in the shared L2 cache, the cache hit has the potential to help reduce the latency of other warps that have threads which access the same page in memory. However, TLB-related data can interfere with and displace cache entries housing regular application data, which can hurt the overall GPU performance.

Hence, a trade-off exists between prioritizing address translation requests vs. data demand requests in the GPU memory hierarchy. Based on an empirical analysis of our workloads, we find that translation data from page table levels closer to the page table root are more likely to be shared across warps, and typically hit in the cache. We observe that, for a 4-level page table, the data cache hit rates of address translation requests across all workloads are 99.8%, 98.8%, 68.7%, and 1.0% for the root, first, second, and third levels of the page table, respectively. This means that address translation requests for the deepest page table levels often do not utilize the cache well. Allowing shared structures to cache page table entries from only the page table levels closer to the root could alleviate the interference between low-hit-rate address translation data and regular application data.

\para

Interference at Main Memory. Figure 40 characterizes the DRAM bandwidth used by address translation and data demand requests, normalized to the maximum bandwidth available, for our workloads where two applications concurrently share the GPU. Figure 41 compares the average latency of address translation requests and data demand requests. We see that even though address translation requests consume only 13.8% of the total utilized DRAM bandwidth (2.4% of the maximum available bandwidth), their average DRAM latency is higher than that of data demand requests. This is undesirable because address translation requests usually stall multiple warps, while data demand requests usually stall only one warp (not shown). The higher latency for address translation requests is caused by the FR-FCFS memory scheduling policy [357, 454], which prioritizes accesses that hit in the row buffer. Data demand requests from GPGPU applications generally have very high row buffer locality [33, 433, 207, 445], so a scheduler that cannot distinguish address translation requests from data demand requests effectively de-prioritizes the address translation requests, increasing their latency, and thus exacerbating the effect on stalled warps.

Figure 40: DRAM bandwidth utilization of address translation requests and data demand requests for two-application workloads.
Figure 41: Latency of address translation requests and data demand requests for two-application workloads.

25.4 Summary and Our Goal


We make two important observations about address translation in GPUs. First, address translation can greatly hinder a GPU’s ability to hide latency by exploiting thread-level parallelism, since a single TLB miss can stall multiple warps. Second, during concurrent execution, multiple applications generate inter-address-space interference throughout the GPU memory hierarchy, which further increases the TLB miss latency and memory latency. In light of these observations, our goal is to alleviate the address translation overhead in GPUs in three ways: (1) increasing the TLB hit rate by reducing TLB thrashing, (2) decreasing interference between address translation requests and data demand requests in the shared L2 cache, and (3) decreasing the TLB miss latency by prioritizing address translation requests in DRAM without sacrificing DRAM bandwidth utilization.

26 Design of MASK


To improve support for multi-application concurrency in state-of-the-art GPUs, we introduce MASK. MASK is a framework that provides memory protection support and employs three mechanisms in the memory hierarchy to reduce address translation overheads while requiring minimal hardware changes, as illustrated in Figure 42. First, we introduce TLB-Fill Tokens, which regulate the number of warps that can fill (i.e., insert entries) into the shared TLB in order to reduce TLB thrashing, and utilize a small TLB bypass cache to hold TLB entries from warps that are not allowed to fill the shared TLB due to not having enough tokens (1). Second, we design a TLB-request-aware L2 Bypass mechanism, which significantly increases the shared L2 data cache utilization and hit rate by reducing interference from TLB-related data that does not have high temporal locality (2). Third, we design an Address-space-aware DRAM Scheduler to further reduce interference between address translation requests and data demand requests (3). In this section, we describe the detailed design and implementation of MASK. We analyze the hardware cost of MASK in Section 28.4.

Figure 42: MASK design overview.

26.1 Enforcing Memory Protection


Unlike previously-proposed GPU sharing techniques that do not enable memory protection [191, 430, 311, 251, 315, 409, 129], MASK provides memory protection by allowing different GPU cores to be assigned to different address spaces. MASK uses per-core page table root registers (similar to the CR3 register in x86 systems [173]) to set the current address space on each core. The page table root register value from each GPU core is also stored in a page table root cache for use by the page table walker. If a GPU core’s page table root register value changes, the GPU core conservatively drains all in-flight memory requests in order to ensure correctness. We extend each L2 TLB entry with an address space identifier (ASID). TLB flush operations target a single GPU core, flushing the core’s L1 TLB, and all entries in the L2 TLB that contain the matching address space identifier.
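The sketch below illustrates this bookkeeping: ASID-tagged L2 TLB entries, a per-core page table root register, and a per-core flush that invalidates only matching entries. The structure and function names are illustrative assumptions for this example, not MASK’s actual implementation.

\begin{verbatim}
#include <cstdint>
#include <vector>

struct TlbEntry {
    bool     valid = false;
    uint16_t asid  = 0;       // address space identifier added by MASK
    uint64_t vpn   = 0;
    uint64_t ppn   = 0;
};

struct GpuCore {
    uint64_t page_table_root = 0;   // per-core CR3-like register
    std::vector<TlbEntry> l1_tlb;   // private L1 TLB entries (simplified)
};

// A flush targets a single core: clear its private L1 TLB and invalidate
// all shared L2 TLB entries whose ASID matches the core's address space.
void flush_core(GpuCore& core, uint16_t asid, std::vector<TlbEntry>& shared_l2_tlb) {
    for (auto& e : core.l1_tlb) e.valid = false;
    for (auto& e : shared_l2_tlb)
        if (e.valid && e.asid == asid) e.valid = false;
}

// Changing a core's address space: drain in-flight requests (not modeled
// here), install the new page table root, then flush stale translations.
void switch_address_space(GpuCore& core, uint16_t old_asid, uint64_t new_root,
                          std::vector<TlbEntry>& shared_l2_tlb) {
    // drain_in_flight_requests(core);  // done conservatively in MASK
    core.page_table_root = new_root;
    flush_core(core, old_asid, shared_l2_tlb);
}
\end{verbatim}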

26.2 Reducing L2 TLB Interference

Sections 25.1 and 25.2 demonstrate the need to minimize TLB misses, which induce long-latency stalls. MASK addresses this need with a new mechanism called TLB-Fill Tokens (1 in Figure 42). To reduce inter-address-space interference at the shared L2 TLB, we use an epoch- and token-based scheme to limit the number of warps from each GPU core that can fill (and therefore contend for) the L2 TLB. While every warp can probe the shared L2 TLB, only warps with tokens can fill the shared L2 TLB. Page table entries (PTEs) requested by warps without tokens are only buffered in a small TLB bypass cache. This token-based mechanism requires two components: (1) a component to determine the number of tokens allocated to each application, and (2) a component that implements a policy for assigning tokens to warps within an application.

When a TLB request arrives at the L2 TLB controller, the GPU probes the tags of both the shared L2 TLB and the TLB bypass cache in parallel. A hit in either the TLB or the TLB bypass cache yields a TLB hit.

\para

Determining the Number of Tokens. Every epoch,20 MASK tracks (1) the L2 TLB miss rate for each application and (2) the total number of warps in each core. After the first epoch,21 the initial number of tokens for each application is set to a predetermined fraction of the total number of warps per application.

At the end of any subsequent epoch, for each application, MASK compares the application’s shared L2 TLB miss rate during the current epoch to its miss rate from the previous epoch. If the miss rate increases by more than 2%, this indicates that shared TLB contention is high at the current token count, so MASK decreases the number of tokens allocated to the application. If the miss rate decreases by more than 2%, this indicates that shared TLB contention is low at the current token count, so MASK increases the number of tokens allocated to the application. If the miss rate change is within 2%, the TLB contention has not changed significantly, and the token count remains unchanged.

\para

Assigning Tokens to Warps. Empirically, we observe that (1) the different warps of an application tend to have similar TLB miss rates; and (2) it is beneficial for warps that already have tokens to retain them, as it is likely that their TLB entries are already in the shared L2 TLB. We leverage these two observations to simplify the token assignment logic: our mechanism assigns tokens to warps, one token per warp, in an order based on the warp ID (i.e., if there are $n$ tokens, the $n$ warps with the lowest warp ID values receive tokens). This simple heuristic is effective at reducing TLB thrashing, as contention at the shared L2 TLB is reduced based on the number of tokens, and highly-used TLB entries that are requested by warps without tokens can still fill the TLB bypass cache and thus still take advantage of locality.
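A compact sketch of the epoch-based token adjustment and the warp-ID-based assignment described above is given below. The names, and the step of adjusting by one token per epoch, are assumptions made for illustration; the 2% thresholds follow the rule stated in the text.

\begin{verbatim}
#include <cstddef>

// Per-application state for TLB-Fill Tokens (illustrative field names).
struct AppTokenState {
    double prev_miss_rate = 0.0;   // shared L2 TLB miss rate in the previous epoch
    size_t tokens         = 0;     // tokens currently allocated to this application
    size_t total_warps    = 0;     // total warps of this application on the GPU
};

// Applied at the end of each epoch, following the +/-2% miss-rate rule.
// The one-token adjustment step is an assumption of this sketch.
void adjust_tokens_at_epoch_end(AppTokenState& app, double curr_miss_rate) {
    const double delta = curr_miss_rate - app.prev_miss_rate;
    if (delta > 0.02 && app.tokens > 0) {
        --app.tokens;                         // contention rose: hand out fewer tokens
    } else if (delta < -0.02 && app.tokens < app.total_warps) {
        ++app.tokens;                         // contention fell: allow more fills
    }                                         // within +/-2%: leave the count alone
    app.prev_miss_rate = curr_miss_rate;
}

// Token assignment: the 'tokens' warps with the lowest warp IDs hold tokens,
// so warps that already hold tokens tend to keep them across epochs.
bool warp_has_token(size_t warp_id, const AppTokenState& app) {
    return warp_id < app.tokens;
}
\end{verbatim}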

\para

TLB Bypass Cache. While TLB-Fill Tokens can reduce thrashing in the shared L2 TLB, a handful of highly-reused PTEs may be requested by warps with no tokens, which cannot insert the PTEs into the shared L2 TLB. To address this, we add a TLB bypass cache, which is a small 32-entry fully-associative cache. Only warps without tokens can fill the TLB bypass cache in our evaluation. To preserve consistency and correctness, MASK flushes all contents of the TLB and the TLB bypass cache when a PTE is modified. Like the L1 and L2 TLBs, the TLB bypass cache uses the LRU replacement policy.
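Combining the token check with the bypass cache, a simplified fill/lookup path for the shared L2 TLB might look like the following. This is a sketch under assumptions: the map-based storage and the single-key encoding of (ASID, VPN) are simplifications, and LRU replacement and the 32-entry capacity limit are omitted.

\begin{verbatim}
#include <cstdint>
#include <unordered_map>

// Simplified model of MASK's shared L2 TLB plus the TLB bypass cache.
// Keys are assumed to combine (ASID, VPN); capacity/LRU management is omitted.
class SharedL2TlbWithBypass {
public:
    explicit SharedL2TlbWithBypass(uint32_t tokens_per_app) : tokens_(tokens_per_app) {}

    // Probe: in hardware, the shared L2 TLB and the bypass cache are checked
    // in parallel; a hit in either structure counts as an L2 TLB hit.
    bool lookup(uint64_t key, uint64_t& ppn) const {
        auto it = l2_tlb_.find(key);
        if (it != l2_tlb_.end()) { ppn = it->second; return true; }
        auto bit = bypass_cache_.find(key);
        if (bit != bypass_cache_.end()) { ppn = bit->second; return true; }
        return false;
    }

    // Fill after a page walk: only warps holding a token (here, the lowest
    // warp IDs per application) insert into the shared L2 TLB; warps without
    // a token fill the small bypass cache instead.
    void fill(uint32_t warp_id, uint64_t key, uint64_t ppn) {
        if (warp_id < tokens_) l2_tlb_[key] = ppn;
        else                   bypass_cache_[key] = ppn;
    }

    // A PTE modification conservatively flushes both structures.
    void flush_on_pte_update() { l2_tlb_.clear(); bypass_cache_.clear(); }

private:
    uint32_t tokens_;
    std::unordered_map<uint64_t, uint64_t> l2_tlb_;        // (ASID,VPN) -> PPN
    std::unordered_map<uint64_t, uint64_t> bypass_cache_;  // 32 entries in MASK
};
\end{verbatim}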

26.3 Minimizing Shared L2 Cache Interference

We find that a TLB miss generates shared L2 cache accesses with varying degrees of locality. Translating addresses through a multi-level page table (e.g., the four-level table used in MASK) can generate dependent memory requests at each level. This causes significant queuing latency at the shared L2 cache, corroborating observations from previous work [36]. Page table entries in levels closer to the root are more likely to be shared, and thus reused, across threads than entries near the leaves.


To address both interference and queuing delays due to address translation requests at the shared L2 cache, we introduce a TLB-request-aware L2 Bypass mechanism (2 in Figure 42). To determine which address translation requests should bypass (i.e., skip probing and filling) the L2 cache, we leverage our insights from Section 25.3. Recall that page table entries closer to the leaves have poor cache hit rates (i.e., the number of cache hits over all cache accesses). We make two observations from our detailed study on the page table hit rates at each page table level (see our technical report [40]). First, not all page table levels have the same hit rate across workloads (e.g., the level 3 hit rate for the MM_CONS workload is only 58.3%, but is 94.5% for RED_RAY). Second, the hit rate behavior can change over time. This means that a scheme that statically bypasses address translation requests for a certain page table level is not effective, as such a scheme cannot adapt to dynamic hit rate behavior changes. Because of the sharp drop-off in the L2 cache hit rate of address translation requests after the first few levels, we can simplify the mechanism that determines when address translation requests should bypass the L2 cache: we compare the L2 cache hit rate of each page table level for address translation requests to the L2 cache hit rate of data demand requests. We impose L2 cache bypassing for address translation requests from a particular page table level when the hit rate of address translation requests to that page table level falls below the hit rate of data demand requests. The shared L2 TLB has counters to track the cache hit rate of each page table level. Each memory request is tagged with a three-bit value that indicates its page walk depth, allowing MASK to differentiate between request types. These bits are set to zero for data demand requests, and to 7 for any depth higher than 6.
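The per-level bypass decision reduces to comparing hit-rate counters, as in the sketch below. The counter structure, the reset policy, and the clamping of depths above four are assumptions for illustration, not MASK’s exact hardware.

\begin{verbatim}
#include <algorithm>
#include <array>
#include <cstdint>

// Hit-rate counters, reset periodically (e.g., every epoch). Names are illustrative.
struct HitRateCounter {
    uint64_t hits = 0, accesses = 0;
    double rate() const { return accesses ? double(hits) / double(accesses) : 1.0; }
};

struct L2BypassState {
    std::array<HitRateCounter, 4> walk_level;   // one counter per page-table level
    HitRateCounter data;                        // data demand requests
};

// Each memory request carries a 3-bit walk depth: 0 marks a data demand
// request, 1..4 identify the page-table level of an address translation
// request (deeper values are clamped here; MASK tags depths above 6 as 7).
bool should_bypass_l2(const L2BypassState& s, unsigned walk_depth) {
    if (walk_depth == 0) return false;          // data demand requests never bypass
    unsigned level = std::min(walk_depth, 4u) - 1;
    // Bypass when this level's translation hit rate falls below the data hit rate.
    return s.walk_level[level].rate() < s.data.rate();
}
\end{verbatim}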

26.4 Minimizing Interference at Main Memory


There are two types of interference that occur at main memory: (1) data demand requests can interfere with address translation requests, as we saw in Section 25.3; and (2) data demand requests from multiple applications can interfere with each other. MASK’s memory controller design mitigates both forms of interference using an Address-space-aware DRAM Scheduler (3 in Figure 42).

The Address-space-aware DRAM Scheduler breaks the traditional DRAM request buffer into three separate queues. The first queue, called the Golden Queue, is a small FIFO queue.22 Address translation requests always go to the Golden Queue, while data demand requests go to one of the two other queues (the size of each queue is similar to the size of a typical DRAM request buffer). The second queue, called the Silver Queue, contains data demand requests from one selected application. The last queue, called the Normal Queue, contains data demand requests from all other applications. The Golden Queue is used to prioritize TLB misses over data demand requests. The Silver Queue allows the GPU to (1) avoid starvation when one or more applications hog memory bandwidth, and (2) improve fairness when multiple applications execute concurrently [281, 33]. When one application unfairly hogs DRAM bandwidth in the Normal Queue, the Silver Queue can process data demand requests from another application that would otherwise be starved or unfairly delayed.

Our Address-space-aware DRAM Scheduler always prioritizes requests in the Golden Queue over requests in the Silver Queue, which are always prioritized over requests in the Normal Queue. To provide higher priority to applications that are likely to be stalled due to concurrent TLB misses, and to minimize the time that bandwidth-heavy applications have access to the Silver Queue, each application takes turns being assigned to the Silver Queue based on two per-application metrics: (1) the number of concurrent page walks, and (2) the number of warps stalled per active TLB miss. The number of data demand requests each application can add to the Silver Queue, when the application gets its turn, is given by Equation 3. After application $i$ reaches its quota, the next application ($i+1$) is then allowed to send its requests to the Silver Queue, and so on. Within both the Silver Queue and the Normal Queue, FR-FCFS [357, 454] is used to schedule requests.

\begin{equation}
\textit{quota}_i = \textit{thresh}_{\textit{silver}} \times \frac{\textit{concurrentPW}_i \times \textit{stalledWarps}_i}{\sum_{j=1}^{N_{\textit{apps}}} \textit{concurrentPW}_j \times \textit{stalledWarps}_j}
\tag{3}
\end{equation}

To track the number of outstanding concurrent page walks ($\textit{concurrentPW}_i$ in Equation 3), we add a 6-bit counter per application to the shared L2 TLB.23 This counter tracks the number of concurrent TLB misses. To track the number of warps stalled per active TLB miss ($\textit{stalledWarps}_i$ in Equation 3), we add a 6-bit counter to each TLB MSHR entry, which tracks the maximum number of warps that hit in the entry. The Address-space-aware DRAM Scheduler resets all of these counters every epoch (see Section 26.2).

We find that the number of concurrent address translation requests that go to each memory channel is small, so our design has an additional benefit of lowering the page table walk latency (because it prioritizes address translation requests) while minimizing interference.
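As a rough sketch of the three-queue arbitration, the code below enqueues translation requests into the Golden Queue, applies the Silver Queue quota for the currently favored application, and drains the queues in strict priority order. The rotation of the Silver Queue across applications and the FR-FCFS ordering within each queue are only indicated in comments; all names are illustrative assumptions, not the exact MASK hardware.

\begin{verbatim}
#include <cstdint>
#include <deque>

struct DramRequest {
    bool is_translation;   // page walk / address translation request
    int  app_id;           // owning application (address space)
};

struct AddressSpaceAwareScheduler {
    std::deque<DramRequest> golden;   // address translation requests (FIFO)
    std::deque<DramRequest> silver;   // data requests of the currently favored app
    std::deque<DramRequest> normal;   // data requests of all other apps

    int silver_app = 0;               // application currently assigned to Silver
    uint64_t silver_quota = 0;        // per-turn quota, e.g., from Equation 3
    uint64_t silver_used = 0;

    void enqueue(const DramRequest& r) {
        if (r.is_translation) {
            golden.push_back(r);                         // translations go Golden
        } else if (r.app_id == silver_app && silver_used < silver_quota) {
            silver.push_back(r);                         // favored app, within quota
            ++silver_used;
        } else {
            normal.push_back(r);                         // everyone else
        }
        // When silver_used reaches silver_quota, a real implementation rotates
        // silver_app to the next application and resets silver_used.
    }

    // Strict priority: Golden > Silver > Normal. Within Silver and Normal, a
    // real implementation applies FR-FCFS rather than the plain FIFO used here.
    std::deque<DramRequest>* next_queue() {
        if (!golden.empty()) return &golden;
        if (!silver.empty()) return &silver;
        if (!normal.empty()) return &normal;
        return nullptr;
    }
};
\end{verbatim}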

26.5 Page Faults and TLB Shootdowns

Address translation inevitably introduces page faults. Our design can be extended to use techniques from previous works, such as performing copy-on-write for handling page faults [343], and either exception support [272] or demand paging techniques [453, 315, 14] for major faults. We leave this as future work.


Similarly, TLB shootdowns are required when a GPU core changes its address space or when a page table entry is updated. Techniques to reduce TLB shootdown overhead [361, 58, 444] are well-explored and can be used with MASK.

27 Methodology


To evaluate MASK, we model the NVIDIA Maxwell architecture [312], and the TLB-fill bypassing, cache bypassing, and memory scheduling mechanisms in MASK, using the Mosaic simulator [37], which is based on GPGPU-Sim 3.2.2 [46]. We heavily modify the simulator to accurately model the behavior of CUDA Unified Virtual Addressing [312, 315], as described below. Table 6 provides the details of our baseline GPU configuration. Our baseline uses the FR-FCFS memory scheduling policy [357, 454], based on findings from previous works [33, 445, 73] which show that FR-FCFS provides good performance for GPGPU applications compared to other, more sophisticated schedulers [220, 221]. We have open-sourced our modified simulator online [366].

\toprule
\multicolumn{2}{c}{GPU Core Configuration} \\
\midrule
System Overview & 30 cores, 64 execution units per core. \\
Shader Core & 1020 MHz, 9-stage pipeline, 64 threads per warp, GTO scheduler [359]. \\
Page Table Walker & Shared page table walker, traversing 4-level page tables. \\
\midrule
\multicolumn{2}{c}{Cache and Memory Configuration} \\
\midrule
Private L1 Cache & 16KB, 4-way associative, LRU, L1 misses are coalesced before accessing L2, 1-cycle latency. \\
Private L1 TLB & 64 entries per core, fully associative, LRU, 1-cycle latency. \\
Shared L2 Cache & 2MB total, 16-way associative, LRU, 16 cache banks, 2 ports per cache bank, 10-cycle latency. \\
Shared L2 TLB & 512 entries total, 16-way associative, LRU, 2 ports, 10-cycle latency. \\
Page Walk Cache & 8KB, 16-way associative, 10-cycle latency. \\
DRAM & GDDR5 1674 MHz [170], 8 channels, 8 banks per rank, 1 rank, FR-FCFS scheduler [357, 454], burst length 8. \\
\bottomrule
Table 6: Configuration of the simulated system.
\para

TLB and Page Table Walker Model. We accurately model both TLB design variants discussed in Section 24. We employ the non-blocking TLB implementation used by Pichai et al. [342]. Each core has a private L1 TLB. The page table walker is shared across threads, and admits up to 64 concurrent threads for walks. On a TLB miss, a page table walker generates a series of dependent requests that probe the L2 cache and main memory as needed. We faithfully model the multi-level page walks.

\para

Workloads. We randomly select 27 applications from the CUDA SDK [309], Rodinia [77], Parboil [396], LULESH [203, 204], and SHOC [90] suites. We classify these benchmarks based on their L1 and L2 TLB miss rates into one of four groups, as shown in Table 7. For our multi-application results, we randomly select 35 pairs of applications, avoiding pairs where both applications have a low L1 TLB miss rate (i.e., lower than 20%) and a low L2 TLB miss rate (i.e., lower than 20%), since these applications are relatively insensitive to address translation overheads. The application that finishes first is relaunched to keep the GPU core busy and maintain memory contention.

\toprule
L1 TLB Miss Rate & L2 TLB Miss Rate & Benchmark Name \\
\midrule
Low & Low & LUD, NN \\
Low & High & BFS2, FFT, HISTO, NW, QTC, RAY, SAD, SCP \\
High & Low & BP, GUP, HS, LPS \\
High & High & 3DS, BLK, CFD, CONS, FWT, LUH, MM, MUM, RED, SC, SCAN, SRAD, TRD \\
\bottomrule
Table 7: Categorization of workloads.

We divide the 35 application-pairs into three workload categories based on the number of applications that have both high L1 and L2 TLB miss rates, as high TLB miss rates at both levels indicate a high amount of pressure on the limited TLB resources. n-HMR contains application-pairs where n applications in the workload have both high L1 and L2 TLB miss rates.

\para

Evaluation Metrics. We report performance using weighted speedup [107, 108], a commonly-used metric to evaluate the performance of a multi-application workload [33, 209, 221, 220, 293, 292, 287, 94, 93, 402, 401, 417]. Weighted speedup is defined as $\sum_{i} \frac{IPC_{i}^{shared}}{IPC_{i}^{alone}}$, where $IPC_{i}^{alone}$ is the IPC of application $i$ when it runs on the same number of GPU cores but does not share GPU resources with any other application, and $IPC_{i}^{shared}$ is the IPC of application $i$ when it runs concurrently with other applications. We report the unfairness of each design using maximum slowdown, defined as $\max_{i} \frac{IPC_{i}^{alone}}{IPC_{i}^{shared}}$ [104, 33, 93, 220, 221, 418, 398, 400, 402, 401, 417].
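For concreteness, the two metrics can be computed from per-application IPC values as follows. This is a direct transcription of the standard definitions; the vector layout is an assumption of this example.

\begin{verbatim}
#include <algorithm>
#include <vector>

// ipc_alone[i]  : IPC of application i running alone on the same number of cores.
// ipc_shared[i] : IPC of application i running concurrently with the others.
double weighted_speedup(const std::vector<double>& ipc_alone,
                        const std::vector<double>& ipc_shared) {
    double ws = 0.0;
    for (size_t i = 0; i < ipc_alone.size(); ++i)
        ws += ipc_shared[i] / ipc_alone[i];
    return ws;
}

// Unfairness: the maximum per-application slowdown.
double maximum_slowdown(const std::vector<double>& ipc_alone,
                        const std::vector<double>& ipc_shared) {
    double max_sd = 0.0;
    for (size_t i = 0; i < ipc_alone.size(); ++i)
        max_sd = std::max(max_sd, ipc_alone[i] / ipc_shared[i]);
    return max_sd;
}
\end{verbatim}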

\para

Scheduling and Partitioning of Cores. We assume an oracle GPU scheduler that finds the best partitioning of the GPU cores for each pair of applications. For each pair of applications that are concurrently executed, the scheduler partitions the cores according to the best weighted speedup for that pair, found by an exhaustive search over all possible static core partitionings. Neither the L2 cache nor main memory is partitioned. All applications can use all of the shared L2 cache and the main memory.
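The oracle partitioning amounts to an exhaustive sweep over static splits, as sketched below; the evaluation hook that returns the weighted speedup of a given split is hypothetical and stands in for running the simulation with that core assignment.

\begin{verbatim}
#include <functional>
#include <utility>

// Exhaustive search over all static core partitionings of a GPU with
// 'total_cores' cores (30 in our configuration), keeping the split that
// maximizes weighted speedup for the application pair. 'eval' is a
// hypothetical hook that simulates the pair with (cores_a, cores_b).
std::pair<int, int> best_partition(int total_cores,
                                   const std::function<double(int, int)>& eval) {
    std::pair<int, int> best = {1, total_cores - 1};
    double best_ws = eval(best.first, best.second);
    for (int a = 2; a < total_cores; ++a) {
        double ws = eval(a, total_cores - a);
        if (ws > best_ws) { best_ws = ws; best = {a, total_cores - a}; }
    }
    return best;
}
\end{verbatim}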

\para

Design Parameters. MASK exposes two configurable parameters: one for TLB-Fill Tokens, \changesVIand one for the Address-space-aware DRAM Scheduler. A sweep over the range of possible values reveals less than 1% performance variance, as \changesVIIITLB-Fill Tokens are effective at reconfiguring the total number of tokens to a steady-state value (Section 26.2). In our evaluation, we set the TLB-Fill Tokens parameter to 80%. We \changesVIset the DRAM scheduler parameter empirically.

28 Evaluation

We compare the performance of MASK against \changesIVfour GPU designs. The first, called Static, uses a static spatial partitioning of resources, where an oracle is used to partition GPU cores, but the shared L2 cache and memory channels are partitioned equally \changesVacross applications. This design is intended to capture key design aspects of NVIDIA GRID [159] and AMD FirePro [9], \changesII\changesVIbased on publicly-available information. The second design, called PWCache, models the \changesIVpage walk cache baseline design we discuss in Section 24. The third design, called SharedTLB, models the \changesVshared L2 TLB baseline design we discuss in Section 24. The \changesIVfourth \changesVdesign, Ideal, represents a hypothetical GPU where every single TLB access is a TLB hit. In addition to these designs, we report \changesIthe performance of the individual components of MASK: TLB-Fill Tokens (MASK-TLB), TLB-request-aware L2 Bypass (MASK-Cache), and Address-space-aware DRAM Scheduler (MASK-DRAM).

28.1 Multiprogrammed Performance

    \changesI

    Figure 43 compares \changesIIthe \changesVaverage performance by workload category of Static, \changesIVPWCache, SharedTLB, and Ideal to MASK and \changesIVthe three individual components of MASK. \changesIWe make \changesIVtwo observations from \changesIIFigure 43. First, compared to \changesIVSharedTLB, which is the best-performing baseline, MASK \changesIimproves the weighted speedup by 57.8% \changesIon average. Second, we find that MASK performs only 23.2% worse than Ideal \changesIV(where all accesses to the L1 TLB are hits). \changesIIThis demonstrates that MASK reduces a large portion of the TLB miss overhead.

    Figure 43: Multiprogrammed workload \changesIXperformance, \changesIIgrouped by workload category.
    \para

Individual Workload Performance. \changesIVFigures 44, 45, and 46 compare the weighted speedup of \changesIeach individual multiprogrammed workload for MASK, \changesIand the individual performance of its three components \changesI(MASK-TLB, MASK-Cache, and MASK-DRAM), against Static, PWCache, and SharedTLB for the 0-HMR (Figure 44), 1-HMR (Figure 45), and 2-HMR (Figure 46) workload categories. Each group of bars in Figures 44, 45, and 46 represents a pair of co-scheduled benchmarks. \changesIVWe make two observations from the figures. \changesVFirst, compared to Static, where resources are statically partitioned, MASK provides better performance, because when an application stalls for concurrent TLB misses, \changesIVit no longer needs a large amount of other shared resources, such as DRAM bandwidth. During such stalls, other applications can utilize these resources. When multiple GPGPU applications run concurrently \changesIVusing MASK, TLB misses from two or more applications can be staggered, increasing the likelihood that there will be heterogeneous and complementary \changesIVresource demands. \changesVSecond, MASK provides significant performance improvements over \changesVboth PWCache and SharedTLB regardless of the workload type (i.e., 0-HMR to 2-HMR). \changesIVThis indicates that MASK is effective at reducing the address translation overhead both when TLB contention is high and when TLB contention is \changesIVrelatively low.

    Figure 44: \changesIV\changesVIIIPerformance of multiprogrammed workloads in the 0-HMR workload category.
    Figure 45: \changesIV\changesVIIIPerformance of multiprogrammed workloads in the 1-HMR workload category.
    Figure 46: \changesIV\changesVIIIPerformance of multiprogrammed workloads in the 2-HMR workload category.
    \changesIV

    Our technical report [40] provides additional analysis on the aggregate throughput (system-wide IPC). In the report, we show that MASK provides 43.4% better aggregate throughput compared to SharedTLB.

    Figure 47 compares \changesVIIthe unfairness \changesIVof MASK to \changesIVthat of \changesVStatic, PWCache, and SharedTLB. We make two observations. First, compared to statically partitioning resources (Static), MASK provides \changesVbetter fairness \changesVIIIby allowing both applications to access all shared resources. Second, compared to SharedTLB, which is the baseline that \changesVIIprovides the best fairness, MASK reduces unfairness by 22.4% on average. As the number of tokens for each application changes based on the \changesIVL2 TLB miss rate, applications that benefit more from the shared L2 TLB are more likely to get more tokens, causing applications that do not benefit from shared L2 TLB space to yield that shared L2 TLB space to other applications. Our application-aware token distribution mechanism and \changesITLB-fill bypassing mechanism work in tandem to reduce the amount of \changesIVshared L2 TLB thrashing observed in Section 25.2.

    Figure 47: Multiprogrammed workload unfairness.
    \para

    Individual Application Analysis. MASK provides better throughput for \changesIVall individual applications sharing the GPU due to reduced TLB miss rates for each application \changesII(shown in our technical report [40]). The per-application L2 TLB miss rates are reduced by over 50% on average, which is in line with the \changesIVreduction in system-wide \changesIVL2 TLB miss rates \changesI(see Section 28.2). Reducing the number of TLB misses \changesIVvia \changesIVour \changesITLB-fill bypassing policy (Section 26.2), and reducing the latency of TLB misses \changesIVvia our shared L2 bypassing (Section 26.3) and TLB- and application-aware DRAM scheduling (Section 26.4) \changesVIIIpolicies, enables significant performance improvement.

In some cases, running two applications concurrently provides \changesIVbetter \changesIVperformance as well as lower unfairness than running \changesIeach application alone (e.g., \changesIVfor the \changesVIIIRED_BP and RED_RAY workloads in Figure 45, and the \changesVIIISC_FWT workload \changesIVin Figure 46). We attribute \changesIsuch cases to substantial improvements (more than 10%) of two factors: a lower L2 \changesVcache queuing latency for bypassed \changesIIaddress translation requests, and a higher L1 \changesVcache hit rate \changesVof data demand requests when applications share the L2 \changesVcache and main memory with other applications.

    \changesI

    We conclude that MASK is effective at \changesVreducing the address translation \changesIVoverheads in modern GPUs, and thus \changesVIIIat improving both performance and fairness, \changesIIby introducing \changesIIaddress translation request awareness throughout the \changesIVGPU memory hierarchy.

    28.2 Component-by-Component Analysis

    \changesI

    This section characterizes MASK’s underlying mechanisms (MASK-TLB, MASK-Cache, and MASK-DRAM). Figure 43 shows the average performance improvement of each individual component of MASK compared to Static, \changesIVPWCache, SharedTLB, and MASK. \changesIVWe summarize our key findings here, and provide a more detailed analysis \changesVIIIin our technical report [40].

    \para

Effectiveness of TLB-Fill Tokens. MASK uses TLB-Fill Tokens to reduce thrashing. We compare TLB hit rates for Static, \changesIVSharedTLB, and MASK-TLB. \changesIThe hit rates for \changesIStatic and \changesIVSharedTLB are substantially similar. MASK-TLB increases \changesVshared L2 TLB hit rates by 49.9% on \changesIVaverage over SharedTLB [40], \changesVbecause \changesVIIthe TLB-Fill Tokens mechanism reduces the number of warps utilizing the shared L2 TLB entries, in turn reducing the miss rate. \changesIVThe TLB bypass cache stores frequently-used TLB entries that cannot be filled in the traditional TLB. Measurement of the \changesIVaverage TLB bypass cache hit rate \changesIV(66.5%) confirms this \changesIVconclusion [40].
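The fill policy this paragraph refers to can be summarized with the following sketch. For brevity, the sketch tracks tokens at application (ASID) granularity, whereas the actual mechanism in Section 26.2 assigns tokens to warps; all structure and method names are illustrative.

\begin{verbatim}
// Simplified fill policy for TLB-Fill Tokens (sketch; see Section 26.2).
// A requester that holds a token may insert translations into the shared
// L2 TLB; requesters without tokens instead fill a small bypass cache so
// they do not thrash the shared structure.
#include <cstdint>
#include <unordered_set>

struct TLBEntry { uint64_t vpn; uint64_t ppn; uint16_t asid; };

class SharedL2TLBWithTokens {
  public:
    void grant_token(uint16_t asid)  { token_holders_.insert(asid); }
    void revoke_token(uint16_t asid) { token_holders_.erase(asid); }

    void on_fill(const TLBEntry& entry) {
        if (token_holders_.count(entry.asid))
            insert_into_shared_tlb(entry);    // normal fill path
        else
            insert_into_bypass_cache(entry);  // 32-entry fully-associative cache
    }

  private:
    void insert_into_shared_tlb(const TLBEntry&)   { /* set-associative insert */ }
    void insert_into_bypass_cache(const TLBEntry&) { /* small CAM insert */ }

    std::unordered_set<uint16_t> token_holders_;  // ASIDs currently holding tokens
};
\end{verbatim}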

    \para

    Effectiveness of TLB-request-aware L2 Bypass. MASK uses TLB-request-aware L2 Bypass with the goal of prioritizing \changesIIaddress translation requests. \changesIWe measure the average \changesVIL2 cache hit rate for \changesIIaddress translation requests. \changesIWe find that for \changesIIaddress translation requests that fill into the shared L2 cache, TLB-request-aware L2 Bypass is \changesIVvery effective \changesVat selecting which blocks to cache, resulting in \changesIVan \changesIIaddress translation request hit rate that is higher than 99% for all of our workloads. At the same time, TLB-request-aware L2 Bypass minimizes the \changesIimpact of long L2 cache queuing latency [36], leading to \changesIVa 43.6% performance improvement compared to \changesIVSharedTLB (as shown in Figure 43).
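The block selection logic can be sketched as follows. The per-walk-level hit and access counters mirror the hardware described in Section 28.4; comparing each level's hit rate against the overall L2 hit rate is an assumed threshold choice for this sketch, and the class layout is illustrative.

\begin{verbatim}
// Sketch of TLB-request-aware L2 Bypass (illustrative; see Section 26.3).
// Per page-walk-level hit/access counters decide whether translation requests
// from that level should be cached in the shared L2 or bypass it.
#include <array>
#include <cstdint>

class L2BypassPolicy {
  public:
    static constexpr int kWalkLevels = 4;   // assumed 4-level page table

    void record_access(int level, bool hit) {
        accesses_[level]++; total_accesses_++;
        if (hit) { hits_[level]++; total_hits_++; }
    }

    // Bypass the L2 cache for walk levels whose hit rate is below the
    // overall L2 hit rate (assumed threshold choice for this sketch).
    bool should_bypass(int level) const {
        if (accesses_[level] == 0 || total_accesses_ == 0) return false;
        double level_rate   = double(hits_[level]) / accesses_[level];
        double overall_rate = double(total_hits_)  / total_accesses_;
        return level_rate < overall_rate;
    }

  private:
    std::array<uint64_t, kWalkLevels> hits_{};
    std::array<uint64_t, kWalkLevels> accesses_{};
    uint64_t total_hits_ = 0;
    uint64_t total_accesses_ = 0;
};
\end{verbatim}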

    \para

    Effectiveness of Address-space-aware DRAM Scheduler. To characterize the performance impact of MASK’s DRAM scheduler, we compare \changesIVthe DRAM bandwidth utilization and average DRAM latency \changesIVof (1) address translation requests and (2) data demand requests for the baseline designs and MASK, \changesIVand \changesVIIImake two observations. \changesIVFirst, we find that MASK is effective at reducing the DRAM latency of address translation requests, which contributes to \changesVIIthe \changesV22.7% performance improvement \changesVIIof MASK-DRAM over SharedTLB, as shown in Figure 43. In cases where the DRAM latency is high, \changesVour DRAM \changesVIIscheduling policy reduces the latency of \changesIIaddress translation requests by up to 10.6% (\changesVIIISCAN_SAD), while increasing DRAM bandwidth utilization by up to 5.6% (\changesVIIISCAN_HISTO). \changesVIISecond, we find that when an application is suffering severely from interference due to another concurrently-executing application, the Silver Queue significantly reduces the latency of data demand requests from the suffering application. For example, when \changesIVthe Silver Queue is employed, SRAD from the \changesVIIISCAN_SRAD application-pair performs 18.7% better, while both SCAN and CONS from \changesVIIISCAN_CONS \changesVIIperform 8.9% and 30.2% \changesVIIbetter, respectively. \changesIVOur technical report [40] \changesVprovides a more detailed analysis of the impact of our Address-space-aware DRAM Scheduler.
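The queue priorities discussed above can be summarized with the sketch below. Only the Golden over Silver over Normal priority order and the queue sizes are taken from the design; the \texttt{next\_request()} interface and the plain FIFO order within each queue are simplifications (the actual scheduler applies FR-FCFS within each queue).

\begin{verbatim}
// Sketch of the Address-space-aware DRAM Scheduler's queue selection
// (illustrative; see Section 26.4). Address translation requests go to the
// Golden Queue, data requests from an application suffering heavy
// interference go to the Silver Queue, and all other requests go to the
// Normal Queue. Higher-priority queues are drained first.
#include <cstdint>
#include <deque>
#include <optional>

struct DramRequest { uint64_t addr; uint16_t asid; bool is_translation; };

class AddressSpaceAwareScheduler {
  public:
    void enqueue(const DramRequest& req, bool app_is_suffering) {
        if (req.is_translation)     golden_.push_back(req);
        else if (app_is_suffering)  silver_.push_back(req);
        else                        normal_.push_back(req);
    }

    // Pick the next request to issue; a real scheduler would apply FR-FCFS
    // within each queue rather than plain FIFO order.
    std::optional<DramRequest> next_request() {
        for (auto* q : {&golden_, &silver_, &normal_}) {
            if (!q->empty()) {
                DramRequest req = q->front();
                q->pop_front();
                return req;
            }
        }
        return std::nullopt;
    }

  private:
    std::deque<DramRequest> golden_;   // 16 entries in the evaluated design
    std::deque<DramRequest> silver_;   // 64 entries
    std::deque<DramRequest> normal_;   // 192 entries
};
\end{verbatim}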

    \changesI

    We conclude that each component of MASK provides complementary performance improvements by introducing \changesIIaddress-translation-aware policies at different \changesVmemory hierarchy levels.

    28.3 Scalability and Generality

    This section evaluates the scalability of MASK and provides evidence that the design generalizes well across different \changesIVarchitectures. \changesIVWe summarize our key findings here, \changesVIIand provide a more detailed analysis in our technical report [40].

    \para

    Scalability. We compare the performance of \changesIVSharedTLB, which is the best-performing state-of-the-art baseline design, and MASK, normalized to \changesIVIdeal performance, as the number of concurrently-running applications \changesVincreases from one to five. In general, as the application count increases, contention for shared resources (e.g., shared L2 TLB, \changesIVshared \changesVL2 cache) draws the performance for both \changesIVSharedTLB and MASK further from \changesIVthe performance of Ideal. \changesIHowever, MASK maintains a consistent performance advantage relative to \changesIVSharedTLB, as shown in Table 8. The performance gain \changesIVof MASK relative to SharedTLB is more pronounced at higher levels of \changesIImulti-application concurrency because (1) \changesIVthe shared L2 TLB becomes heavily contended as the number of concurrent applications increases, and (2) MASK is effective at reducing the amount of contention at the heavily-contended shared TLB.

    Number of Applications                        1        2        3        4        5
    SharedTLB performance normalized to Ideal   47.1%    48.7%    38.8%    34.2%    33.1%
    MASK performance normalized to Ideal        68.5%    76.8%    62.3%    55.0%    52.9%
    Table 8: \changesIVNormalized performance of SharedTLB and MASK as the number of concurrently-executing applications increases.
    \para

    Generality. MASK is an architecture-independent design: our techniques \changesIVare applicable to any \changesIVSIMT machine [310, 311, 312, 315, 7, 8, 427, 278, 344]. \changesITo demonstrate this, we evaluate \changesIVour two baseline variants (PWCache and SharedTLB) and MASK on \changesIVtwo additional GPU architectures: the \changesIVGTX480 (Fermi architecture [310]), and an integrated GPU architecture [343, 5, 432, 61, 179, 80, 181, 344, 278, 307, 308], as shown in Table 9. \changesIVWe make three key conclusions. \changesIVFirst, address translation leads to significant performance overhead in both \changesVIIIPWCache and SharedTLB. Second, MASK provides \changesVIIIa \changesV46.9% \changesIVaverage performance improvement over \changesIVPWCache and \changesVIIIa \changesV29.1% average performance improvement over SharedTLB on the Fermi architecture, getting to within 22% of the \changesIVperformance of Ideal. \changesIVThird, on the integrated GPU configuration used in previous work [343], we find that MASK provides \changesVIIIa 23.8% performance improvement over PWCache and \changesVIIIa 68.8% performance improvement over SharedTLB, and gets within 35.5% of the performance of Ideal.

    Relative Performance Fermi Integrated GPU [343]
    PWCache 53.1% 52.1%
    SharedTLB 60.4% 38.2%
    MASK 78.0% 64.5%
    Table 9: \changesIVAverage performance of PWCache, SharedTLB, and MASK, normalized to Ideal.
    \changesIV

    We conclude that MASK is effective \changesVat \changesVII(1) reducing the \changesVperformance overhead of address translation, and \changesV\changesVII(2) significantly improving system performance over both \changesVIIIthe PWCache and SharedTLB \changesVIIIdesigns, regardless of the GPU architecture.

    \para

    Sensitivity to L1 and L2 TLB Sizes. \changesIWe evaluate the benefit of MASK over many different TLB sizes in our technical \changesIVreport [40]. \changesIVWe make two observations. First, MASK is effective at \changesVreducing (1) TLB thrashing at the shared L2 TLB, and \changesV(2) the latency of address translation requests regardless of TLB size. Second, as we increase the shared L2 TLB size from 64 to 8192 entries, MASK outperforms SharedTLB for all TLB sizes except \changesVthe 8192-entry shared L2 TLB. At 8192 entries, MASK and SharedTLB perform equally, because the working set fits completely within the 8192-entry shared L2 TLB.

    \para

    Sensitivity to Memory Policies. We study the sensitivity of MASK to (1) main memory row policy, and (2) memory scheduling policies. We find that for \changesIVall of our baselines and for MASK, \changesIVperformance with an open-row policy [220] is similar (within 0.8%) to the performance with a \changesVclosed-row policy, which is used in various \changesVCPUs [178, 175, 181]. Aside from the FR-FCFS scheduler [357, 454], we \changesIuse MASK \changesIin conjunction with another state-of-the-art GPU memory scheduler [191], and \changesIfind that \changesIIwith this scheduler, MASK \changesIimproves performance by 44.2% over \changesIVSharedTLB. We conclude that MASK is effective across different memory policies.

    \para

    Sensitivity to Different Page \changesVSizes. \yellowWe evaluate the performance of MASK with 2MB large pages assuming an ideal page fault latency [40, 32] \changesII(not shown). \changesIVWe provide two observations. First, even with the larger page size, SharedTLB continues to experience high contention during address translation, causing its average performance to fall 44.5% short of \changesIVIdeal. Second, we find that using MASK allows the GPU to perform within 1.8% of Ideal.

    28.4 Hardware Overheads

To support memory protection, each L2 TLB \changesIVentry has a 9-bit address space identifier (ASID), \changesIVwhich translates to \changesVan overhead of 7% of the L2 TLB size in total.

    \changesII

    At each core, our TLB-Fill Tokens mechanism uses (1) two 16-bit counters \changesIIto track the \changesIVshared L2 TLB hit rate, with one counter tracking the number of \changesIVshared L2 TLB hits, and the other counter tracking the number of \changesIVshared L2 TLB misses; (2) a 256-bit vector addressable by warp ID \changesIVto track the number of active warps, where each bit is set when a warp uses the shader core for the first time, and is reset every epoch; and (3) an 8-bit incrementer that tracks the total number of unique warps executed by the core (i.e., its counter value is incremented each time a bit is set in the bit vector).
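A structural sketch of this per-core bookkeeping is shown below; the bit widths follow the description above, the field names are ours, and the assumption that the hit/miss counters reset each epoch is noted in the code.

\begin{verbatim}
// Per-core bookkeeping for TLB-Fill Tokens (structural sketch; widths follow
// the hardware description above, names are illustrative).
#include <bitset>
#include <cstdint>

struct TLBFillTokensCoreState {
    uint16_t shared_l2_tlb_hits   = 0;   // 16-bit hit counter
    uint16_t shared_l2_tlb_misses = 0;   // 16-bit miss counter

    std::bitset<256> active_warps;       // one bit per warp ID, reset every epoch
    uint8_t unique_warp_count = 0;       // incremented when a warp's bit is first set

    void on_warp_issue(unsigned warp_id) {
        if (!active_warps.test(warp_id)) {
            active_warps.set(warp_id);
            ++unique_warp_count;         // total unique warps seen this epoch
        }
    }

    void on_shared_l2_tlb_access(bool hit) {
        if (hit) ++shared_l2_tlb_hits; else ++shared_l2_tlb_misses;
    }

    void end_of_epoch_reset() {
        active_warps.reset();
        unique_warp_count = 0;
        // Assumption: the hit/miss counters are also reset each epoch.
        shared_l2_tlb_hits = shared_l2_tlb_misses = 0;
    }
};
\end{verbatim}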

    We augment the shared cache with \changesIa 32-entry fully-associative content addressable memory (CAM) for the bypass cache, 30 15-bit token \changesIcounters, \changesIVand 30 1-bit direction registers to record whether the token count increased or decreased during the previous epoch. \changesIVThese structures allow the GPU to distribute tokens \changesIVamong up to 30 concurrent applications. In total, we add \changesII706 bytes \changesIof storage (\changesII13 bytes per core in the L1 TLB, and 316 bytes \changesIVtotal in the shared L2 TLB), which adds \changesII1.6% to the baseline L1 TLB \changesIIsize and 3.8% \changesIVto the \changesIIbaseline L2 TLB size (\changesVin \changesVIIaddition to the 7% overhead due to the ASID bits).

    TLB-request-aware L2 Bypass uses ten 8-byte counters per core to track \changesIVL2 cache hits and \changesIVL2 cache accesses per level. The resulting 80 bytes add less than 0.1% \changesIIto the baseline shared L2 cache size. Each \changesIVL2 cache and memory request requires an additional 3 bits \changesVto specify the page walk level, as we discuss in Section 26.3.

    \changesIV

    For each memory channel, \changesVIIIour Address-space-aware DRAM Scheduler contains a 16-entry FIFO queue for the Golden Queue, a 64-entry memory request buffer \changesVIfor the Silver Queue, and a 192-entry memory \changesVrequest buffer for the Normal Queue. This adds an extra 6% of storage overhead to the DRAM request queue per memory controller.

    \para

    Area and Power Consumption. We compare the area and power consumption of MASK to \changesIVPWCache and SharedTLB using CACTI [288]. \changesVIIPWCache and SharedTLB have near-identical area and power consumption, as we size the page walk cache and shared L2 TLB (see Section 24) such that they both use the same total area. We find that MASK introduces a negligible overhead to both baselines, consuming less than 0.1% additional area and 0.01% additional power in each baseline. We provide a detailed analysis of area and power consumption in our technical report [40].


    29 MASK: Conclusion

    Spatial multiplexing support, which allows multiple applications to run concurrently, is needed to efficiently deploy GPUs in a large-scale computing environment. \changesII\changesIVUnfortunately, due to the primitive existing support for memory virtualization, many of the performance benefits of spatial multiplexing are lost in state-of-the-art GPUs. We perform a detailed analysis of state-of-the-art mechanisms for memory virtualization, and find that current address translation mechanisms (1) are highly susceptible to interference across the different address spaces of applications \changesIVin the shared TLB structures, which leads to a high number of page table walks; and (2) undermine the fundamental latency-hiding techniques of GPUs, by often stalling hundreds of threads at once. To alleviate these problems, we propose MASK, a new memory hierarchy designed \changesVcarefully to support multi-application concurrency at low overhead. MASK consists of three major components \changesIIin different parts of the memory \changesVIIIhierarchy, all of which incorporate \changesIIaddress translation request awareness. These three components work together to lower inter-application interference during address translation, and improve L2 cache utilization \changesIVand memory latency for \changesIIaddress translation requests. \changesVMASK improves performance by 57.8%, on average across a wide range of multiprogrammed workloads, over the state-of-the-art. \changesIVWe conclude that MASK provides a promising and effective substrate for multi-application execution on GPUs, and hope future work builds on the mechanism we provide and open source [366].

    Chapter \thechapter Reducing Inter-address-space Interference with Mosaic

    Graphics Processing Units (GPUs) are used for an ever-growing range of application domains due to steady increases in GPU compute density and continued improvements \changesIin programming tools [313, 216, 12]. The growing adoption of GPUs has in part been due to better high-level language support [313, 363, 403, 66], which has improved GPU programmability. \changesIRecent support within GPUs for memory virtualization features, such as a unified virtual address space [310, 12], demand paging [315], and preemption [315, 9], can provide fundamental improvements \changesIthat can ease programming. \changesIIIIIThese features allow developers to exploit \changesIkey benefits \changesIthat have long been taken for granted in CPUs (e.g., application portability, multi-application execution). Such familiar features can dramatically improve programmer productivity and further boost GPU adoption. However, \changesIa number of challenges have kept GPU memory virtualization from achieving performance similar to \changesIthat in CPUs [420, 269]. In this work, we focus on two fundamental challenges: (1) the address translation challenge, and (2) the demand paging challenge.

    \paragraphbe\changesI

    Address Translation Challenge. \changesIMemory virtualization relies on page tables to store virtual-to-physical address translations. \changesIConventionally, systems store one translation for every base page (e.g., a 4KB page). To translate \changesIa virtual address on demand, a series of serialized memory accesses are required to traverse \changesI(i.e., \changesIIIIIwalk) the page table [343, 342]. These serialized accesses clash with the \changesIsingle-instruction multiple-thread (SIMT) execution model\changesI [297, 251, 120] used by GPU-based systems, which relies on high degrees of concurrency through \changesIIIIIthread-level parallelism (TLP) to hide long memory latencies during GPU execution. \changesITranslation lookaside buffers (TLBs) can reduce the latency of address translation by caching recently-used \changesIaddress translation information. Unfortunately, as application working sets and DRAM capacity have increased in recent years, state-of-the-art GPU TLB designs [343, 342] suffer due to inter-application interference and stagnant TLB sizes. Consequently, GPUs have poor TLB reach, \changesIi.e., the TLB covers only a small fraction of the physical memory \changesIIIworking set of an application. Poor TLB reach is particularly detrimental \changesIwith the SIMT execution model, as a single TLB miss can stall \changesIhundreds of threads at once, undermining TLP within a GPU and significantly reducing performance [420, 269].

    \changesI

    Large pages (e.g., the 2MB or 1GB pages in modern CPUs [179, 181]) can significantly reduce the overhead of address translation. A major constraint for TLB reach is the small, fixed number of translations that a TLB can hold. If we store one translation for every large page instead of one translation for every base page, the TLB can cover a \changesIIImuch larger fraction of \changesIthe virtual address space using the same number of page translation entries. Large pages have been supported by CPUs for decades [381, 387], and large page support is emerging for GPUs [343, 453, 342]. However, large pages increase the risk of internal fragmentation, where a portion of the large page is unallocated (or unused). Internal fragmentation occurs because it is often difficult for an application to completely utilize large contiguous regions of memory. This fragmentation leads to \changesI(1) memory bloat, \changesIwhere a much greater amount of physical memory is allocated than the amount of memory that the application needs; and \changesI(2) longer memory access latencies, due to \changesIIIa lower effective TLB reach and more page faults [228].
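For a sense of scale, using the 512-entry shared L2 TLB from Table 6 purely as an illustrative size, the TLB reach with base pages versus large pages is
\[ 512 \times 4\,\text{KB} = 2\,\text{MB} \qquad \text{versus} \qquad 512 \times 2\,\text{MB} = 1\,\text{GB}, \]
a $512\times$ increase in the amount of memory covered by the same number of TLB entries.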

    \paragraphbe\changesI

    Demand Paging Challenge. \changesIFor discrete GPUs (i.e., \changesIGPUs that are not in the same package/die as the CPU), demand paging can incur significant overhead. With demand paging, an application can request data that is not currently resident in GPU memory. This triggers a page fault, which requires a \changesIlong-latency data transfer for an entire page over the system I/O bus, \changesIwhich, in today’s systems, is also called the PCIe bus [331]. \changesIA single page fault can cause multiple threads to stall at once, \changesIas threads often access data in the same page due to data locality. As a result, the page fault can significantly reduce the amount of TLP that the GPU can exploit, and the long latency of a page fault harms performance [453].

    \changesI

    Unlike address translation, which benefits from \changesIlarger pages, demand paging benefits from smaller pages. Demand paging for large pages requires a greater amount of data to be transferred over the system I/O bus during a page fault than for conventional base pages. The larger data transfer size increases the transfer \changesItime significantly, due to the long latency and limited bandwidth of the system I/O bus. This, in turn, significantly increases the amount of time that GPU threads stall, and can further decrease the amount of TLP. To make matters worse, as the size of a page increases, there is a greater probability that an application does not need all of the data in the page. As a result, threads may stall for a longer time without gaining \changesIany further benefit in return.
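To make the magnitude concrete, assume, purely for illustration, an effective PCIe 3.0 x16 bandwidth of roughly 12 GB/s. Transferring a 2MB large page then takes on the order of
\[ \frac{2\,\text{MB}}{12\,\text{GB/s}} \approx 170\,\mu\text{s}, \]
versus well under a microsecond of transfer time for a 4KB base page, before accounting for the fixed per-fault latency. The exact numbers depend on the system, but the $512\times$ difference in transfer size is inherent to the page size choice.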

    \paragraphbe\changesI

    Page Size Trade-Off. \changesIWe find that memory virtualization in state-of-the-art GPU systems has a fundamental trade-off due to the page size choice. \changesIA \changesIlarger page size reduces address translation stalls by increasing TLB reach and reducing the number of high-latency TLB misses. \changesIIn contrast, a smaller page size reduces demand paging stalls by decreasing the amount of unnecessary data transferred over the system I/O bus [343, 453]. \changesIWe can relax the page size trade-off by using multiple page sizes \changesItransparently to the application, and, thus, to the \changesIprogrammer. \changesIIn a system that supports multiple page sizes, several base pages that are contiguous in both virtual and physical memory can be coalesced (i.e., combined) into a single large page, and a large page can be splintered (i.e., split) into multiple base \changesIpages. \changesIIIWith multiple page sizes, and \changesIIIIIthe ability to change \changesIvirtual-to-physical mappings dynamically, the \changesIGPU system can support good TLB reach \changesIby using large pages for address translation, while providing better demand paging performance \changesIby using base pages for data transfer.

    \changesI

    Application-transparent \changesIsupport for multiple page sizes has \changesIproven challenging for CPUs [228, 298]. \changesIA key property of memory virtualization is to enforce memory protection, where a distinct virtual address space (i.e., a memory protection domain) is allocated to an individual application or \changesIa virtual machine, and memory is shared safely (i.e., only with explicit permissions \changesIfor accesses across \changesIdifferent address spaces). In order to \changesIensure that memory protection guarantees are not violated, coalescing operations can combine contiguous physical base pages into a single physical large page only if all base pages belong to the same virtual address space.

    \changesI

Unfortunately, in both CPU and state-of-the-art GPU memory managers, existing memory access patterns and allocation mechanisms make it difficult to find regions of physical memory where base pages can be coalesced. We show an example of this in Figure 48a, which illustrates how a state-of-the-art GPU memory manager [343] allocates memory for two applications. Within a single large page frame (i.e., a contiguous piece of physical memory that is the size of a large page and whose starting address is page aligned), the GPU memory manager allocates base pages from both Applications 1 and 2 (callout 1 in the figure). As a result, the memory manager \changesIcannot coalesce the base pages into a large page (callout 2) without first migrating \changesIsome of the base pages, which would incur a high latency.

    (a) \changesIState-of-the-art GPU memory management [343].
    (b) \changesIMemory management with Mosaic.
    Figure 48: \changesIPage allocation and coalescing behavior \changesIof GPU memory managers: (a) state-of-the-art [343], (b) Mosaic.
    \changesI

    We make a key observation about the memory behavior of contemporary general-purpose GPU (GPGPU) applications. The vast majority of memory allocations in \changesIIIIIGPGPU applications are performed en masse (i.e., a large number of pages are allocated at the same time). The en masse memory allocation presents us with an opportunity: with so many pages being allocated at once, we can rearrange how we allocate the \changesIbase pages to ensure that (1) \changesIall of the base pages allocated within a large page frame belong to the \changesIsame virtual address space, and (2) base pages that are contiguous in virtual memory are allocated to a contiguous portion of physical memory and aligned within the large page frame. Our goal in this work is to develop an application-transparent memory manager that performs such memory allocation, and uses this allocation \changesIIIproperty to efficiently support multiple page sizes in order to improve TLB reach \changesI\changesIIIIIand efficiently support demand paging.

    \changesI

To this end, we present Mosaic, a \changesInew GPU memory manager that uses our \changesIkey observation to provide application-transparent support for multiple page sizes in GPUs while avoiding high overhead for coalescing and splintering pages. \changesIThe key idea of Mosaic is to (1) transfer data to GPU memory at \changesIthe small base page (e.g., 4KB) granularity, (2) allocate physical base pages in a way that avoids the need to migrate data during coalescing, and (3) use \changesIIIIIa simple coalescing \changesIIIIImechanism to combine base pages into large pages \changesI(e.g., 2MB) and thus increase TLB reach. Figure 48b shows a high-level overview of \changesIhow Mosaic allocates and coalesces pages. Mosaic consists of three key design components: (1) Contiguity-conserving Allocation (CocoA), a memory allocator which provides a soft guarantee that \changesIall of the base pages within the same large page range belong to only a single application \changesI(callout 3 in the figure); (2) Lazy-Coalescer, \changesIa page size \changesIIIselection mechanism \changesIthat \changesIIImerges base pages into a large page immediately after allocation (callout 4), and thus does \changesIIInot need to monitor base pages to make coalescing decisions or migrate base pages; and (3) Contiguity-aware Compaction (CaC), a memory compaction mechanism that transparently \changesImigrates data to avoid \changesIinternal fragmentation within a large page frame, \changesIwhich frees up large page \changesIframes for CocoA.
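As a concrete, heavily simplified illustration of CocoA's soft guarantee, the sketch below allocates base pages only from large page frames that are free or already owned by the requesting application; it omits the requirement that virtually contiguous base pages be placed at matching offsets within the frame, and its data structures and names are ours, not Mosaic's actual implementation.

\begin{verbatim}
// Simplified sketch of Contiguity-conserving Allocation (CocoA): base pages
// for an application are placed only in large page frames that are free or
// already owned by that application, so every large page frame ends up
// holding base pages from a single address space.
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

constexpr int kBasePagesPerLargeFrame = 512;   // 2MB frame / 4KB base pages

struct LargePageFrame {
    int      owner_asid = -1;                  // -1 means the frame is free
    int      used_base_pages = 0;
    uint64_t base_addr = 0;
};

class CocoAAllocator {
  public:
    explicit CocoAAllocator(std::vector<LargePageFrame> frames)
        : frames_(std::move(frames)) {}

    // Allocate one base page for `asid`; returns its physical address.
    std::optional<uint64_t> allocate_base_page(int asid) {
        // Prefer a partially-filled frame already owned by this application.
        for (auto& f : frames_)
            if (f.owner_asid == asid && f.used_base_pages < kBasePagesPerLargeFrame)
                return take_page(f, asid);
        // Otherwise claim a free frame for this application.
        for (auto& f : frames_)
            if (f.owner_asid == -1)
                return take_page(f, asid);
        return std::nullopt;   // would trigger Contiguity-aware Compaction (CaC)
    }

  private:
    uint64_t take_page(LargePageFrame& f, int asid) {
        f.owner_asid = asid;
        uint64_t addr = f.base_addr + uint64_t(f.used_base_pages) * 4096;
        ++f.used_base_pages;
        return addr;
    }

    std::vector<LargePageFrame> frames_;
};
\end{verbatim}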

    \paragraphbe\changesI

    Key Results. We evaluate Mosaic using 235 workloads. Each workload consists of multiple GPGPU applications from a wide range of benchmark suites. Our evaluations show that compared to a contemporary GPU that uses only 4KB base pages, a GPU with Mosaic reduces address translation overheads \changesIwhile efficiently achieving the benefits of demand paging, thanks to its use of multiple page sizes. When we compare to a \changesIIIGPU with a state-of-the-art memory manager \changesI(see Section 31.1), we find that \changesIIIa GPU with Mosaic provides an average speedup of \changesI55.5% and 29.7% for homogeneous and heterogeneous multi-application workloads, respectively, \changesIand comes within 6.8% and 15.4% of the performance \changesIof \changesIIIa GPU with an ideal TLB, where all TLB requests are hits. \changesIThus, by alleviating the page size trade-off between address translation and demand paging overhead, Mosaic improves the efficiency and practicality of multi-application execution on the GPU.

    This chapter makes the following contributions:

    • We analyze fundamental trade-offs \changesIon choosing the correct page size to optimize \changesIboth address translation (which benefits from larger pages) and demand paging (which benefits from smaller pages). \changesIBased on our analyses, we motivate the need for application-transparent support of \changesImultiple page sizes in a GPU.

    • We present Mosaic, \changesIa \changesInew GPU memory manager that \changesIefficiently supports multiple page sizes. Mosaic uses a novel \changesImechanism to allocate contiguous virtual pages to contiguous physical pages in the GPU memory, and exploits this property to \changesIcoalesce \changesIIIcontiguously-allocated base pages into a large page for address translation with low overhead and no \changesIIIdata migration, while still using base pages during demand paging.

    • We show that Mosaic’s application-transparent support for \changesImultiple page sizes effectively improves TLB reach \changesIwhile efficiently achieving the benefits of demand paging. \changesIOverall, Mosaic improves the average performance of homogeneous and heterogeneous multi-application workloads by 55.5% and 29.7%, respectively, over a state-of-the-art GPU memory manager.

    30 Background

    We first provide necessary background on contemporary GPU architectures. In Section 30.1, we discuss the GPU execution model. In Section 30.2, we discuss state-of-the-art support for GPU memory virtualization.

    30.1 GPU Execution Model

    GPU applications use fine-grained multithreading\changesI [410, 411, 390, 389]. \changesIA GPU application is made up of thousands of threads. These threads are clustered into thread blocks (also known as work groups), where each thread block consists of multiple smaller bundles of threads that execute concurrently. Each such thread bundle is known as a warp, or a wavefront. \changesIEach thread within the warp executes the same instruction at the same program counter value. The GPU avoids stalls due to dependencies and long memory latencies by taking advantage of thread-level parallelism (TLP), where the GPU swaps out warps that have dependencies or are waiting on memory with other warps that are ready to execute.

    A GPU consists of multiple streaming multiprocessors (SMs), also known as shader cores. Each SM executes one warp at a time using the single-instruction, multiple-thread (SIMT) execution model\changesI [297, 251, 120]. Under SIMT, all of the threads within a warp are executed in lockstep. Due to lockstep execution, a warp stalls when any one thread within the warp \changesIhas to stall. \changesIThis means that a warp is unable to proceed to the next instruction until the slowest thread in the warp completes the current instruction.

    The GPU memory hierarchy typically consists of multiple levels of memory. In contemporary GPU architectures, each SM has a private data cache, and has access to one or more shared \changesIIIIImemory partitions through an interconnect (typically a crossbar). A \changesIIIIImemory partition combines a single slice of the banked L2 cache with \changesIIIa memory controller that connects the GPU to off-chip main memory (DRAM). \changesIMore detailed information about the GPU memory hierarchy can be found in \changesIIIIII[192, 193, 425, 423, 36, 332, 209, 194, 32, 358, 190].

    30.2 Virtualization Support in GPUs

    Hardware-supported memory virtualization relies on address translation to map each virtual memory address to a physical address within the GPU memory. Address translation uses page-granularity virtual-to-physical mappings that are stored within a multi-level page table. To look up a mapping within the page table, the GPU performs a page table walk, where a page table walker traverses through each level of the page table in main memory until the walker locates the page table entry for the requested mapping in the last level of the table. GPUs with virtual memory support \changesIIIhave translation lookaside buffers (TLBs), which cache page table entries and avoid the need to perform a page table walk for the cached entries, thus reducing the address translation latency.

    The introduction of address translation hardware into the GPU memory hierarchy puts TLB misses on the critical path of application execution, as a TLB miss invokes a page table walk that can stall multiple threads and degrade performance significantly. (We study the impact of TLB misses and page table walks in Section 31.1.) A GPU uses multiple TLB levels \changesIto reduce the number of TLB misses, typically including private per-SM L1 TLBs and a shared L2 TLB [453, 342, 343]. Traditional address translation mechanisms perform memory mapping using a base page size of 4KB. Prior work for integrated GPUs (i.e., GPUs that are in the same package or die as the CPU) has found that using a larger page size can improve address translation performance by improving TLB reach (i.e., the maximum \changesIIIfraction of memory that can be accessed using the cached TLB entries) [453, 342, 343]. For a TLB that holds a fixed number of page table entries, using the large page (e.g., a page with a size of 2MB or greater) as the granularity for mapping greatly increases the TLB reach, and thus reduces the TLB miss rate, compared to using the base page granularity. While memory hierarchy designs for widely-used GPU architectures from NVIDIA, AMD, and Intel are not publicly available, it is widely accepted that contemporary GPUs support TLB-based address translation and, in some models, large page sizes [9, 310, 311, 312, 269]. To simplify translation hardware in a GPU that uses multiple page sizes (i.e., both base pages and large pages), \changesIwe assume that \changesIIIIIIeach TLB level contains two separate sets of entries\changesI [201, 122, 202, 340, 341, 325], where one \changesIIIIIIset of entries stores only base page translations, while the other \changesIIIIIIset of entries stores only large page translations.
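The sketch below illustrates a lookup in one such split TLB level, conceptually probing the large-page and base-page entry sets for a translation; the fixed-capacity, set-associative entry arrays are replaced by maps, and the class layout is illustrative only.

\begin{verbatim}
// Sketch of a TLB level with separate base-page and large-page entry sets
// (illustrative). A lookup probes both sets; a hit in either provides the
// translation, with the page size determining how many address bits the
// entry covers.
#include <cstdint>
#include <optional>
#include <unordered_map>

constexpr uint64_t kBasePageSize  = 4ull << 10;    // 4KB
constexpr uint64_t kLargePageSize = 2ull << 20;    // 2MB

class SplitTLB {
  public:
    std::optional<uint64_t> translate(uint64_t vaddr) const {
        // Probe the large-page entries (cover 2MB-aligned regions).
        if (auto it = large_.find(vaddr / kLargePageSize); it != large_.end())
            return it->second + (vaddr % kLargePageSize);
        // Then probe the base-page entries.
        if (auto it = base_.find(vaddr / kBasePageSize); it != base_.end())
            return it->second + (vaddr % kBasePageSize);
        return std::nullopt;               // TLB miss: start a page table walk
    }

    void fill_base (uint64_t vpn, uint64_t frame_addr) { base_[vpn]  = frame_addr; }
    void fill_large(uint64_t vpn, uint64_t frame_addr) { large_[vpn] = frame_addr; }

  private:
    // Maps stand in for the fixed-capacity, set-associative entry arrays.
    std::unordered_map<uint64_t, uint64_t> base_;   // base-page entries
    std::unordered_map<uint64_t, uint64_t> large_;  // large-page entries
};
\end{verbatim}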

State-of-the-art GPU memory virtualization provides support for demand paging [343, 453, 14, 12, 315]. With demand paging, not all of the memory used by a GPU application needs to be transferred to \changesIthe GPU memory at the beginning of application execution. Instead, during application execution, when a GPU thread issues a memory request to a page that has not yet been allocated in \changesIthe GPU memory, the GPU issues a page fault, at which point the data for that page is transferred over the off-chip system I/O bus (e.g., the PCIe bus [331] in contemporary systems) from the CPU memory to the GPU memory. The transfer requires a long latency due to its use of an off-chip bus. Once the transfer completes, the GPU runtime allocates a physical GPU memory \changesIaddress to the page, and the thread can complete its memory request.

    31 A Case for Multiple Page Sizes

    Despite increases in DRAM capacity, TLB capacity (i.e., the number of cached page table entries) has not kept pace, and \changesIthus TLB reach has been declining. As a result, address translation overheads have started to significantly increase the execution time of many large-memory workloads [51, 123, 343,