SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks
Abstract.
Going deeper and wider in neural architectures improves accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need to change to less desired network architectures, or nontrivially dissect a network across multiple GPUs. These distract DL practitioners from concentrating on their original machine learning tasks. We present SuperNeurons: a dynamic GPU memory scheduling runtime to enable network training far beyond the GPU DRAM capacity. SuperNeurons features 3 memory optimizations, Liveness Analysis, Unified Tensor Pool, and Cost-Aware Recomputation; together they effectively reduce the network-wide peak memory usage down to the maximal memory usage among layers. We also address the performance issues in those memory-saving techniques. Given the limited GPU DRAM, SuperNeurons not only provisions the necessary memory for the training, but also dynamically allocates the memory for convolution workspaces to achieve high performance. Evaluations against Caffe, Torch, MXNet and TensorFlow have demonstrated that SuperNeurons trains at least 3.2432× deeper networks than current ones with the leading performance. Particularly, SuperNeurons can train ResNet2500, which has $10^4$ basic network layers, on a 12GB K40c.
1. Introduction
Deep Neural Networks (DNNs) are efficient at modeling complex nonlinearities thanks to the unparalleled representation power from millions of parameters. This implies that scaling up neural networks is an effective approach to improve the generalization performance. The Deep Learning (DL) community now widely acknowledges that going either deeper or wider on the nonlinear architecture improves the quality of image recognition tasks. For example, the 9-layer AlexNet won the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) with a top-5 error of 17%. GoogLeNet (Inception v1) lowered the top-5 error rate to 6.67% with 22 inception units in the 2014 ILSVRC, and ResNet further reduced the error rate to 3.57% in the 2015 ILSVRC with 152 residual units.
While DL practitioners are enthusiastically seeking deeper and wider nonlinear networks, the limited size of GPU DRAM becomes a major restriction. Training a deep network is inherently a computing-intensive task. Almost every AI lab today, in academia or industry, deploys network training on GPUs to meet the demand for high performance (bahrampour2016comparative). Data must reside in GPU DRAM for GPU computing, but the largest commercial GPU DRAM so far is 24 GB. This is still far from sufficient to accommodate a deep neural network. For example, the latest Inception v4 has 515 basic layers consuming 44.3 GB of memory in training. The deeper or wider we go, the higher the memory usage will be. Therefore, this trend toward depth leaves the fixed GPU DRAM severely insufficient.
Major DL frameworks, such as Caffe or MXNet, have tried to alleviate the GPU memory shortage with several static memory reduction techniques. Those techniques, due to their static nature, are not well tuned to address the new data and dependency variations in nonlinear networks. For example, Caffe and Torch do not fully support data flow analysis on nonlinear neural networks; the strategy of trading computation for memory in MXNet is limited because it ignores the memory variations across network layers. These limitations have motivated us to propose a dynamic approach for the emerging deep nonlinear neural architectures.
In this paper, we present the first dynamic GPU memory scheduling runtime for training deep nonlinear neural networks. The runtime allows DL practitioners to explore a much deeper and wider model beyond the physical limitations of GPU memory. It uses tensors as the fundamental scheduling units to be consistent with the layer-wise computations enforced in the DL performance primitives of cuDNN (chetlur2014cudnn). The runtime seamlessly orchestrates tensor placement, movement, allocation and deallocation so that the underlying memory operations are entirely transparent to users.
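Our runtime guarantees the minimal peak memory usage, $peak_m = \max_i(l_i)$, at the layer-wise granularity. We denote the memory usage of the $i$-th layer as $l_i$, and use a superscript, e.g. $l_i^f$ or $l_i^b$, to mark the forward/backward pass. The peak memory usage during the forward and backward computations is denoted as $peak_m$. First, Liveness Analysis recycles tensors that are no longer needed to reduce $peak_m$ from the baseline $\sum_{i=1}^{N} l_i^f + \sum_{i=1}^{N} l_i^b$ to $\sum_{i=1}^{N} l_i^f + l_N^b$ (defined in Sec. 3). Secondly, Unified Tensor Pool (UTP) offloads tensors in compute-intensive layers, referred to as checkpoints, to the external physical memory. This further reduces $peak_m$ from $\sum_{i=1}^{N} l_i^f + l_N^b$ to $\sum_{i \in [1,N],\, i \notin \mathit{checkpoints}} l_i^f + l_N^b$. Finally, Cost-Aware Recomputation drops the forward results of cheap-to-compute, or non-checkpoint, layers and reconstructs them to reduce $peak_m$ from $\sum_{i \in [1,N],\, i \notin \mathit{checkpoints}} l_i^f + l_N^b$ to $\max_i(l_i)$. The final $peak_m = \max_i(l_i)$ indicates that the largest computable network is bounded by the maximal memory usage among layers.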
Our runtime also features three performance optimizations to improve the efficiency of Liveness Analysis and UTP. First, GPUs require memory allocations to create tensors and deallocations to free them. Thus, the frequent allocation and deallocation of large tensors incurs non-negligible overhead in Liveness Analysis (wang2016blasx). The runtime amortizes this cost by directly reusing memory segments from a large preallocated memory pool, managed by a heap-based GPU memory management utility. Secondly, UTP swaps tensors among different physical memory spaces, and modern GPUs are equipped with independent Direct Memory Access (DMA) engines, which expose opportunities to hide communications under computations. The runtime therefore overlaps communications with computations. However, the overlapping opportunity is limited given the fixed amount of computation. Thirdly, we propose an LRU-based Tensor Cache built on GPU DRAM to minimize total communications through tensor reuse.
This paper claims the following contributions:

We demonstrate the new memory scheduling challenges in nonlinear neural networks, and discuss the key limitations of existing approaches.

We design and implement SuperNeurons to enable DL practitioners to explore deep neural networks; and the largest computable network of SuperNeurons is only bounded by the max memory usage among layers.

By dynamically allocating memory for convolution workspaces, SuperNeurons delivers the leading performance among state-of-the-art DL systems on the GPU.
2. Background and Motivation
2.1. Challenges for Processing Super Deep Neural Networks
Traditional Convolutional Neural Networks (CNN) (lecun1998gradient; krizhevsky2012imagenet; simonyan2014very) are typically composed of several basic building layers, including Convolution (CONV), Pooling (POOL), Activation (ACT), Softmax, Fully Connected (FC), Local Response Normalization (LRN), Batch Normalization (BN), and Dropout. For linear CNNs, these layers are independent and connected to their neighbors in a sequential manner. Recently, several deep nonlinear neural architectures have been proposed to further improve the state-of-the-art accuracy on the 1K ImageNet recognition challenge, e.g., Inception v4 (szegedy2017inception), ResNet (he2016deep), and DenseNet (huang2016densely). These prominent network designs (especially the one that solves the classic vanishing gradient problem (bengio1994learning)) lay the algorithmic foundation for DL practitioners to harness the unparalleled representation power of super deep nonlinear neural architectures. For example, the latest Inception v4 delivers 95% top-5 accuracy with 515 basic building layers, while ResNet151 (where 151 is the number of convolutional units) achieves 94.3% top-5 accuracy with 567 layers. In Figure 1, we illustrate two classic types of nonlinear connections: fan and join. Compared with the linear connection pattern, the sparse fan-out connection (Figure 1(a)) avoids one huge, computationally inefficient dense layer (szegedy2015going), while the join connection prevents gradients from quickly vanishing in the backpropagation (he2016deep).
Training these super deep and complex nonlinear neural architectures is a computation-intensive task. Due to their DL-driven novel architecture designs and massive parallelism, GPUs have been widely adopted in today’s industry and academia for efficient neural network training. However, there are two critical issues for efficiently training these newly developed super deep nonlinear neural architectures: limited GPU resident memory and a high degree of variation in computational dependencies.
Challenge I: Limited GPU Resident Memory. The prominent deep neural architectures share a common feature: high memory demand and computation intensity. Figure 2 illustrates the network-wide memory usage of several recent DNNs in training, with and without convolution workspaces or buffers. Among them, AlexNet and VGG are linear networks while the others are nonlinear. We can observe that the nonlinear networks demand a significant amount of GPU memory, e.g., ResNet152 and Inception v4 require up to 18.5 GB and 44.3 GB, respectively, at a batch size of only 32. These sizes are either similar to or surpass the resident memory sizes of commercial GPUs on the market today. For instance, the newest generations of NVIDIA Pascal and Volta GPUs only have 16GB with HBM2 enabled (e.g., P100 and V100), while the one with the most memory available in the recent generations is Maxwell P40 with 24GB GDDR5. This limitation poses a major bottleneck for deep learning practitioners exploring deep and wide neural architectures (szegedy2015going; pleiss2017memory; szegedy2017inception). The most straightforward solution is to split the network across GPUs, i.e. Model Parallelism. However, splitting the computations of either a network or a layer incurs excessive intra-network and intra-layer communications that drastically deteriorate the performance. For example, recent work has suggested the deficiency of applying model parallelism for deep neural networks: it compromises at least 40 speed when training a network with 1.3 billion parameters from 36 GPUs to 64 GPUs (coates2013deep). To address the performance issues of Model Parallelism, Data Parallelism has been widely adopted in today’s mainstream deep learning frameworks such as Caffe (jia2014caffe), TensorFlow (abadi2016tensorflow), Torch (collobert2002torch), and MXNet (chen2015MXNet). In this model, each GPU holds a network replica, and each GPU computes one sub-gradient with a sub-batch.
Subsequently, all sub-gradients are aggregated into one global gradient to update the network parameters (wang2016efficient). Although this process does not incur intra-network or intra-layer communications besides the necessary gradient exchanges, it requires the network training to fit in the limited GPU DRAM. In this paper, we focus on addressing the GPU memory shortage for training deep neural networks under the data parallelism model while taking the training performance into design considerations.
Challenge II: Variations in Computational Dependencies for Nonlinear Networks. Nonlinear networks exhibit a high degree of dependency variation, while linear networks follow a fixed sequential execution pattern with predictable data dependencies (rhu2016vdnn). Fig. 3 illustrates the data dependency graphs for linear (a) and nonlinear (b and c) neural architectures. One typical training iteration consists of two phases: forward and backward propagation. For linear networks, data is sequentially propagated in the forward pass, and a layer’s backward computation is simply contingent upon the previous layer, as illustrated in Fig. 3(a). Thus their computation and dependency patterns are static regardless of the total number of layers involved.
However, nonlinear networks exhibit a high degree of variation in computational dependencies. Fig. 3(b) and 3(c) show two simple examples of join and fan nonlinear connections. A join connection forwards a layer’s output tensor to another layer, creating a dependency between the two layers. For example, the join in Fig. 3(b) forwards the output tensor of the DATA layer to the FC layer in the forward pass. The dependency of join-based nonlinear networks is nondeterministic, as any two layers can be connected with a join, e.g., in DenseNet. A fan connection creates multiple branches in the execution flow: the DATA layer forks into two branches and joins them before the FC layer. Separate branches, each with a different number of layers, have to finish before joining back into the original branch, which makes the execution sequence nonlinear. Although the two basic nonlinear scenarios shown here are intuitive, a typical deep nonlinear network today has hundreds of joins and fans interwoven, resulting in a complex network architecture. These significantly complicate runtime resource management compared to the static computational pattern of linear networks. Therefore, the memory scheduling of deep nonlinear neural networks demands a dynamic solution to effectively address these variations in both the execution flow and the computation dependencies.
2.2. Limitations of GPU Memory Management in Mainstream Deep Learning Frameworks
Several static memory reduction techniques have been implemented in today’s deep learning frameworks to address the GPU memory shortage at the data parallelism level. For example, Caffe and Torch directly reuse the forward data tensors for the backward data propagation, which saves up to of memory on a linear network (MXNet_graph). Although this technique works well on linear networks, it requires extra tensors to hold the future dependencies when training nonlinear networks, which limits its effectiveness and efficiency. Also, these frameworks still have to fit the entire network into GPU DRAM without leveraging NUMA architectures, and this level of reuse is arguably not adequate for contemporary deep nonlinear neural networks. MXNet and TensorFlow are built with a Directed Acyclic Graph (DAG) execution engine (wu2015hierarchical). Users explicitly define the computation flow and tensor dependencies, which provide the necessary information for the DAG engine to analyze the life span of tensors. Both systems then free tensors that are no longer needed in order to save memory. MXNet also implements a per-layer recomputation strategy that is similar to Resilient Distributed Datasets (RDD) in Spark. Basically, it frees the tensors produced by cheap-to-compute layers in the forward pass, and recomputes the freed dependencies for the backward pass by doing another forward pass. However, this method neglects the nonuniform memory distribution of network layers, consequently demanding large and unnecessary memory usage. TensorFlow swaps long-lived data tensors from GPU DRAM to CPU DRAM, but it fails to optimize data communications between the two (e.g., by utilizing pinned data transfer), which compromises at least of communication speed.
More importantly, none of the aforementioned DL frameworks utilizes a dynamic scheduling policy that provisions the necessary memory space for deep nonlinear network training while at the same time optimizing the training speed given the existing GPU DRAM resource. In other words, these static memory-saving techniques aggressively reduce the GPU memory usage at the expense of speed. Users either painstakingly tune the performance or suffer from insufficient memory during the execution. Additionally, these frameworks either have no optimization strategy or adopt a naive method for allocating the convolution workspace (see Section 3.5), which is a decisive factor in CNN training speed on the GPU. In summary, these challenges motivate us to design a dynamic scheduling runtime that provisions the necessary memory for training while maximizing the memory for convolution workspaces to optimize the training speed.
3. Design Methodologies
This section elaborates on three memory optimization techniques and their related performance issues in SuperNeurons. From a high-level perspective, SuperNeurons provisions the necessary memory space for training while maximizing the speed by seeking convolution workspaces within the constraint of the native GPU memory size.
Notations and Baseline Definition: To facilitate the analysis of the proposed techniques, we denote the forward memory usage of the $i$-th layer as $l_i^f$, and the backward as $l_i^b$. We denote the peak memory usage as $peak_m$. We use the naive network-wide tensor allocation strategy as the baseline, which allocates an independent tensor for each memory request. Thus, the $peak_m$ of the baseline is $\sum_{i=1}^{N} l_i^f + \sum_{i=1}^{N} l_i^b$. We also denote the maximal memory usage among layers as $l_{peak} = \max_i(l_i)$, where $i \in [1, N]$, and $N$ represents the network length. $t_i$ represents the $i$-th tensor.
First, Liveness Analysis reduces the baseline $peak_m$ to $\sum_{i=1}^{N} l_i^f + l_N^b$ by recycling free tensors during backpropagation, demonstrating up to 50% memory saving. This technique is guaranteed to work on various nonlinear architectures, and it is constructed in $O(N^2)$. Liveness Analysis involves frequent memory operations on large memory chunks, while native memory utilities, e.g. cudaMalloc and cudaFree, incur nontrivial overhead. We address this issue with a preallocated heap managed by the runtime.
Secondly, Unified Tensor Pool (UTP) further reduces $peak_m$ to $\sum_{i \in [1,N],\, i \notin \mathit{checkpoints}} l_i^f + l_N^b$, where checkpoints represent the compute-intensive layers such as FC and CONV. UTP provides a consolidated memory abstraction over external memory pools to supply memory for training. Instead of using naive on-demand data transfers, it hides communications under computations. Because the overlapping opportunity is limited given the fixed amount of computation, UTP further introduces a Tensor Cache built on the GPU to reduce communications.
Finally, Cost-Aware Recomputation reduces $peak_m$ to $\max_i(l_i)$, the minimum at the layer-wise granularity. The method keeps track of memory distributions among checkpoints to minimize the extra computations while ensuring $peak_m \le l_{peak}$.
3.1. Prerequisites
A typical DNN layer computes on a 4-dimensional tensor indexed by batches (N), image channels (C), height (H) and width (W) (Fig. 4). Since cuDNN operates at the layer granularity, we use tensors as the basic memory scheduling unit.
Alg. 1 describes how SuperNeurons constructs execution steps for nonlinear neural architectures. The input is the first network layer; Alg. 1 then recursively explores the subsequent layers in Depth-First Search (DFS) order, except when it reaches a join, where all prior layers must finish before proceeding. This behavior is achieved by a counter in each layer that tracks the input dependencies (lines 5-6 in Alg. 1).
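The route construction described above can be illustrated with a minimal Python sketch. It is not the paper's Alg. 1: the function name `build_route`, the adjacency-dict input and the layer names are assumptions made for illustration. Each layer carries a counter of unfinished inputs, and the depth-first walk only enters a layer once that counter reaches zero, so a join waits for all of its branches.

```python
from collections import defaultdict

def build_route(first, edges):
    """Return an execution order for a layer graph (DFS with join counters).

    `edges` maps a layer name to the layers that consume its output.
    """
    # Count how many inputs each layer is still waiting for.
    pending = defaultdict(int)
    for dsts in edges.values():
        for dst in dsts:
            pending[dst] += 1

    route = []

    def visit(layer):
        route.append(layer)
        for nxt in edges.get(layer, []):
            pending[nxt] -= 1
            # A join is entered only after all of its inputs have executed.
            if pending[nxt] == 0:
                visit(nxt)

    visit(first)
    return route

# Hypothetical network: DATA fans out into two branches that join at FC.
edges = {
    "DATA": ["CONV1", "POOL1"],
    "CONV1": ["CONV2"],
    "CONV2": ["FC"],
    "POOL1": ["FC"],
    "FC": ["SOFTMAX"],
}
route = build_route("DATA", edges)
print(route)
```

In this toy graph the walk finishes the CONV1/CONV2 branch, stops at FC because one input is still pending, returns to run POOL1, and only then executes FC. The recursion is kept for brevity; a network with thousands of layers would need an explicit stack.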
Fig.6 demonstrates an example execution route for a nonlinear network constructed by Alg.1. Each box represents a network layer indexed from to . Note that this network has two fan structures (layer and layer ) nested together. Alg.1 successfully identifies layers and as the prerequisites for executing .
3.2. Liveness Analysis and Its Related Issues
Liveness analysis enables different tensors to reuse the same physical memory at different time partitions. Our runtime implements a simple yet effective variant of the traditional data flow analysis, constructed in $O(N^2)$, for various nonlinear neural networks. The general procedures are as follows:

We construct an $in$ and an $out$ set for every layer to track the live tensors before and after the layer, which costs $O(N)$, where $N$ is the network length.

The runtime populates a layer’s $in$ and $out$ sets by checking the dependencies of subsequent layers. It eliminates tensors in $in$ from $out$ if no subsequent layers need them. The cost is $O(N^2)$, as the checks cost $N-1$, $N-2$, $\ldots$, $2$, $1$, respectively.
Fig. 5 demonstrates the detailed procedures of Liveness Analysis on the network shown in Fig. 3(c). It explicitly lists the content of the $in$ and $out$ sets at each step. For instance, for FC7, . It needs to create tensor to finalize the current computation. Since and are no longer needed after FC7, runtime eliminates them from FC7’s $out$ set (step:7).
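To make the in/out bookkeeping concrete, here is a minimal Python sketch under stated assumptions. The step list, the tensor names (`t0`, `g1`, ...) and the function name `liveness` are hypothetical and do not reproduce Fig. 5. The sketch also swaps in a precomputed last-use index instead of re-checking all subsequent layers at every step; this yields the same sets at lower cost, but it is not the procedure as the text describes it.

```python
def liveness(steps):
    """Compute live-tensor sets for an execution route.

    `steps` is a list of (name, uses, creates) tuples in execution order.
    Returns a list of (name, in_set, out_set) tuples.
    """
    # Index of the last step that touches each tensor.
    last_use = {}
    for idx, (_, uses, creates) in enumerate(steps):
        for t in list(uses) + list(creates):
            last_use[t] = idx

    live = set()
    result = []
    for idx, (name, uses, creates) in enumerate(steps):
        in_set = set(live)           # tensors live before the step
        live |= set(creates)         # tensors created by the step
        # Drop every tensor that no later step needs.
        live = {t for t in live if last_use[t] > idx}
        result.append((name, in_set, set(live)))
    return result

# Hypothetical 3-layer linear network, forward then backward.
steps = [
    ("DATA_fwd", [], ["t0"]),
    ("CONV_fwd", ["t0"], ["t1"]),
    ("FC_fwd", ["t1"], ["t2"]),
    ("FC_bwd", ["t1", "t2"], ["g1"]),
    ("CONV_bwd", ["t0", "g1"], ["g0"]),
]
sets = liveness(steps)
for name, in_set, out_set in sets:
    print(name, sorted(in_set), sorted(out_set))
```

The first in set and the last out set are both empty, which mirrors the observation that every tensor is allocated and released within a single training iteration.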
Liveness Analysis reduces the baseline $peak_m$ to $\sum_{i=1}^{N} l_i^f + l_N^b$. In order to simplify the analysis, let’s assume identical memory usage on every layer, i.e. $l_i^f = l_i^b$ where $i \in [1, N]$. In network training, the results of the forward pass are needed by the backward propagation (wang2017accelerating; chetlur2014cudnn). (Not all layers require the previous forward output for the backpropagation; again, we simplify the case for the analysis.) Therefore, the total forward memory usage at step $k$ is $\sum_{i=1}^{k} l_i^f$, where $k \le N$. During the backpropagation, Liveness Analysis frees $l_i^f$ and $l_i^b$ where $i \in (k, N]$ at the backward step $k$, since there are no future dependencies on them, as demonstrated in Fig. 5. Therefore, the total backward memory usage at step $k$ is $\sum_{i=1}^{k} l_i^f + l_k^b$ with $k \le N$. Since $l_i^f \ge 0$, the $peak_m$ is $\sum_{i=1}^{N} l_i^f + l_N^b$. Therefore, Liveness Analysis saves up to 50% memory from the baseline.
3.2.1. Toward a High Performance Liveness Analysis
Both the empty initial $in$ set at step 0 and the empty final $out$ set at step 11 in Fig. 5 demonstrate that Liveness Analysis frequently stashes and frees tensors on the fly in a training iteration. A typical training phase consists of millions of iterations, and such intense memory operations incur nontrivial overhead if the native cudaMalloc and cudaFree are used (wang2016blasx). According to the experiment, ResNet50 wastes of the training time on memory allocations/deallocations with cudaMalloc and cudaFree. To alleviate this performance issue, we implement a fast heap-based GPU memory pool utility. The core concept is to remove the allocation/deallocation overhead by preallocating a big chunk of GPU memory as a shared memory pool. We then divide the entire GPU memory pool into 1KB blocks as the basic storage unit. The memory pool contains a list of allocated memory nodes and a list of empty memory nodes. Each node in the two lists contains a memory address, the number of occupied blocks, and a node ID. For an allocation request, the memory pool finds the first node with enough free memory in the empty list. After that, it updates the empty list and creates a new node in the allocated list to track the current allocation. For a deallocation request, the memory pool locates the node in the allocated list with the ID-to-node hash table, and then places the node back into the empty list.
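The allocation and deallocation paths described above can be sketched in a few lines of Python. This is a host-side model only: block offsets stand in for GPU addresses, the class and method names are hypothetical, and the merging of adjacent empty nodes on free is an added assumption that the text does not state.

```python
BLOCK = 1024  # bytes per block, the 1KB unit used in the text

class MemoryPool:
    """First-fit pool over a preallocated region, tracked in 1KB blocks."""

    def __init__(self, total_bytes):
        # Empty list: (start_block, num_blocks), kept sorted by start.
        self.empty = [(0, total_bytes // BLOCK)]
        # Allocated list as an ID-to-node hash table.
        self.allocated = {}
        self.next_id = 0

    def malloc(self, nbytes):
        need = -(-nbytes // BLOCK)  # ceiling division to whole blocks
        for i, (start, size) in enumerate(self.empty):
            if size >= need:  # first node with enough free memory
                if size == need:
                    self.empty.pop(i)
                else:
                    self.empty[i] = (start + need, size - need)
                node_id = self.next_id
                self.next_id += 1
                self.allocated[node_id] = (start, need)
                return node_id, start * BLOCK  # node ID and byte offset
        raise MemoryError("memory pool exhausted")

    def release(self, node_id):
        # Locate the node through the hash table and return it to the empty list.
        start, size = self.allocated.pop(node_id)
        self.empty.append((start, size))
        self.empty.sort()
        # Assumption: merge adjacent empty nodes to limit fragmentation.
        merged = []
        for s, n in self.empty:
            if merged and merged[-1][0] + merged[-1][1] == s:
                merged[-1] = (merged[-1][0], merged[-1][1] + n)
            else:
                merged.append((s, n))
        self.empty = merged

pool = MemoryPool(8 * BLOCK)
a, off_a = pool.malloc(1500)   # rounds up to 2 blocks
b, off_b = pool.malloc(1024)   # 1 block
pool.release(a)
c, off_c = pool.malloc(2048)   # reuses the blocks freed by `a`
print(off_a, off_b, off_c)
```

Because every request is served from the preallocated region, no call to the native allocator is needed inside a training iteration, which is the overhead the pool is meant to remove.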
3.3. Unified Tensor Pool(UTP) and Its Related Issues
If the depth of a neural network goes to , the ImageNet training still consumes at least GB memory. Therefore, Liveness Analysis alone is inadequate for the emerging deep nonlinear neural architectures. We provide Unified Tensor Pool (UTP) to further alleviate the GPU DRAM shortage by asynchronously transferring tensors to and from external memory. UTP is a consolidated memory pool abstraction for tensor allocations/deallocations that uses various external physical memories such as CPU DRAM, the DRAM of other GPUs, or remote CPU/GPU DRAM. In this paper, we focus on using local CPU DRAM as the external pool because of its fast and efficient interconnect, but the abstraction also applies to the other cases shown in Fig. 7. UTP manages tensor placement, movement, allocation and deallocation, so that the underlying memory management is entirely transparent to DL practitioners.
3.3.1. Basic UTP Memory Management: Memory Offloading and Prefetching
Not all the layers are suitable for Offloading and Prefetching. We define transferring tensors from GPU to external physical pools as Offloading, and the reversed operation as Prefetching. Fig.(a)a and Fig.(b)b demonstrate that POOL, ACT, BN and LRN all together occupy over of the total memory, while their computations only account for an average of of the entire workload. Thus, offloading these layers incurs a great overhead due to the insufficient overlapping of communications and computations. It is also not fruitful to offload on Dropout, Softmax and FC layers since they only use less than 1% of the total memory. Therefore, we only offload the tensors from CONV layers.
Offloading: the runtime asynchronously transfers the forward outputs of CONV layers to the preallocated pinned CPU memory. It records an event for this data transfer and frees the tensor’s GPU memory once the event is completed. The runtime has an independent thread running in the background to check the events of memory copies; this enables GPU-to-CPU data transfers to overlap with the forward computations from the current CONV layer to the next one.
Prefetching: the runtime asynchronously brings the offloaded and soon-to-be-reused tensors back to GPU DRAM. At any CONV layer in the backward pass, the runtime asynchronously fetches the tensors required by the previous CONV layer. This enables the CPU-to-GPU data transfer to overlap with the backward computation from the current CONV layer to the previous one.
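Offloading and Prefetching reduce $peak_m$ after Liveness Analysis to $\sum_{i \in [1,N],\, i \notin \mathit{checkpoints}} l_i^f + l_N^b$, where $\mathit{checkpoints}$ is the set of offloaded layers. Since layers in $\mathit{checkpoints}$ are offloaded, the total memory consumption at each backward step $k$ is $\sum_{i \in [1,k],\, i \notin \mathit{checkpoints}} l_i^f + l_k^b$, where $k \le N$. The memory usage of each layer is nonnegative, thus $peak_m$ is $\sum_{i \in [1,N],\, i \notin \mathit{checkpoints}} l_i^f + l_N^b$.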
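The cost model can be checked numerically with a short Python sketch. It assumes the reading used in this section: the baseline keeps every forward and backward tensor, Liveness Analysis keeps the forward tensors of layers 1..k plus the backward tensor of layer k at backward step k, and Offloading additionally removes the forward tensors of checkpoint layers. The function names and the layer sizes are hypothetical.

```python
def peak_baseline(fwd, bwd):
    """Every forward and backward tensor stays allocated."""
    return sum(fwd) + sum(bwd)

def peak_liveness(fwd, bwd):
    """Backward step k holds forward tensors 1..k plus backward tensor k."""
    return max(sum(fwd[:k + 1]) + bwd[k] for k in range(len(fwd)))

def peak_offload(fwd, bwd, checkpoints):
    """Same as peak_liveness, with checkpoint layers moved off the GPU."""
    kept = [0 if i in checkpoints else f for i, f in enumerate(fwd)]
    return max(sum(kept[:k + 1]) + bwd[k] for k in range(len(fwd)))

# Hypothetical network: 10 layers of 100 MB each, every other layer a checkpoint.
fwd = [100] * 10
bwd = [100] * 10
checkpoints = {0, 2, 4, 6, 8}

base = peak_baseline(fwd, bwd)
live = peak_liveness(fwd, bwd)
off = peak_offload(fwd, bwd, checkpoints)
print(base, live, off)
```

With identical layers the liveness peak is (N + 1) / (2N) of the baseline, which approaches the 50% bound as N grows. With uneven layers the maximum can fall on an earlier backward step.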
3.3.2. Caching Tensors on GPU DRAM
Since the overlapping opportunity is limited given the fixed amount of computation in an iteration, the aforementioned on-demand Prefetching/Offloading protocol can quickly exhaust it. Nowadays CPU-to-GPU data movements over PCIe, GPU-to-GPU data movements over the same PCIe switch, and GPU-to-remote-GPU movements over GPUDirect RDMA deliver practical speeds of 8 GB/s, 10 GB/s, and 6 GB/s, respectively, but transferring gigabytes of data in each training iteration still incurs nontrivial overhead. Therefore, this on-demand tensor transfer protocol must be optimized. SuperNeurons proposes a Tensor Cache to exploit the temporal locality of tensors. It caches tensors in GPU DRAM to maximize their reuse and to minimize global communications. With the cache in place, Prefetching and Offloading only trigger data transfers when GPU DRAM is insufficient.
We adopt the Least Recently Used (LRU) tensor replacement policy to build the Tensor Cache. Since backpropagation follows a head-to-tail and then tail-to-head computation pattern, the most recently used tensors are the earliest to be reused, as suggested in Fig. 5. This motivates us to design the Tensor Cache with a simple variant of LRU. While other, more sophisticated cache replacement policies might better fit the scenario, a thorough discussion of them falls outside the scope of this paper.
Alg. 2 demonstrates the three key operations of the proposed LRU. 1) The LRU.in function places a tensor into LRU. Each tensor has a lock, and a tensor cannot be removed from LRU while it is locked. A layer locks its dependent tensors during its calculations. LRU is implemented as a list with the front as the Most Frequently Used (MFU) position and the tail as the opposite. 2) The LRU.out function removes enough bytes for a new tensor. It offloads the unlocked Least Recently Used tensors to CPU RAM until there is enough free memory for the new one. 3) The Check function decides what operation to apply to a tensor. It takes in a tensor and checks whether the tensor is in LRU based on the object address (line 2). If it is found, we place the tensor at the MFU position, i.e. the list front (line 9), and return the tensor’s GPU address. This is the hit scenario. If it is not found, we call LRU.out to free enough memory for the new tensor before inserting it into LRU; this is the miss scenario.
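The three operations can be sketched in Python as follows. This is an illustrative model rather than Alg. 2 itself: the class name `TensorCache`, the byte-count bookkeeping and the `offloaded` list that stands in for transfers to CPU RAM are all assumptions. An `OrderedDict` plays the role of the list, with its last entry as the most recently used position.

```python
from collections import OrderedDict

class TensorCache:
    """LRU cache of tensors with per-tensor locks (host-side model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.lru = OrderedDict()  # tensor id -> size; last entry = most recent
        self.locked = set()
        self.offloaded = []       # ids evicted to (simulated) CPU RAM

    def lock(self, tid):
        self.locked.add(tid)

    def unlock(self, tid):
        self.locked.discard(tid)

    def _out(self, nbytes):
        """Evict unlocked least-recently-used tensors until nbytes fit."""
        for tid in list(self.lru):
            if self.used + nbytes <= self.capacity:
                break
            if tid in self.locked:
                continue  # a locked tensor cannot be removed
            self.used -= self.lru.pop(tid)
            self.offloaded.append(tid)
        if self.used + nbytes > self.capacity:
            raise MemoryError("not enough unlocked tensors to evict")

    def check(self, tid, nbytes):
        """Return 'hit' if the tensor is cached, otherwise insert it ('miss')."""
        if tid in self.lru:
            self.lru.move_to_end(tid)  # refresh recency
            return "hit"
        self._out(nbytes)
        self.lru[tid] = nbytes
        self.used += nbytes
        return "miss"

cache = TensorCache(capacity=300)
results = [cache.check(t, 100) for t in ("a", "b", "c", "a")]
cache.lock("c")
results.append(cache.check("d", 100))  # forces one eviction
print(results, cache.offloaded, list(cache.lru))
```

In the usage above the second access to `a` is a hit that refreshes its position, so the later miss on `d` evicts `b`, the least recently used unlocked tensor.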
3.4. Cost-Aware Recomputation
POOL, ACT, LRN and BN all together use an average of memory, while their forward computations only account for less than of the total time. This exposes additional memory savings with a fraction of performance loss by recomputing the forward dependencies in the backpropagation. Basically, the runtime frees the tensors in cheaptocompute layers such as POOL for reconstructions. In general, there are memorycentric and speedcentric strategies for the recomputation for memory.
The speedcentric strategy keeps the recomputed tensors so that other backward layers can directly reuse them. Fig.(a)a denotes the procedures in red. At the backward step on , it performs a forward pass from to to get dependencies for . It keeps so that they can be reused for the backward computation on and . MXNet (chen2016training, ) adopts this strategy. It incurs the least additional computations, but is . will exceed if is within the segment.
The memorycentric strategy always recomputes the dependencies for each backward layer. In contrast to the speedcentric one, it fully exploits the memorysaving opportunity by freeing the recomputed intermediate results. For example, it recomputes for , while it recomputes again for as demonstrated by the blue lines in Fig.(b)b. The stays at guaranteed to be , but the strategy incurs additional computations.
We present a new Cost-Aware Recomputation that leverages the advantages of both methods. It is motivated by the observation that the memory costs of most recomputation segments are $\le l_{peak}$. That implies we can achieve the least recomputation, as in the speed-centric strategy, while still guaranteeing the memory usage to be $\le l_{peak}$, as in the memory-centric strategy. The general procedures of Cost-Aware Recomputation are as follows:

The runtime iterates over all the layers to find $l_{peak}$ as the threshold.
Table 1 summarizes the extra recomputation for the two basic strategies and Cost-Aware Recomputation. Our cost-aware method ensures $peak_m$ to be consistent with the memory-centric strategy, while the extra recomputations are comparable to the speed-centric strategy.
CostAware Recomputation finally reduces to . Previously, Liveness Analysis and Offloading jointly reduce the to . Since noncheckpoints layers will be freed for recomputations, only the nearest checkpoint layer exists in the GPU memory. Thus, . During the recomputations, can be either or depending what recomputation strategies to use. Whereas, CostAware Recomputation guarantees (see analyses above). Thus, the final network wide , which is the minimal achievable at the layerwise granularity.
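Table 1: Extra recomputations and $peak_m$ of the speed-centric, memory-centric and cost-aware strategies.

             speed-centric        memory-centric       cost-aware
Network      extra   peak_m       extra   peak_m       extra   peak_m
AlexNet      14      993.018      23      886.23       17      886.23
ResNet50     84      455.125      118     401          85      401
ResNet101    169     455.125      237     401          170     401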
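The threshold idea of Sec. 3.4 can be sketched as a per-segment decision. The rule below is one reading of the text, not the authors' exact procedure: a recomputation segment (the non-checkpoint layers between two checkpoints) uses the speed-centric strategy when keeping all of its recomputed tensors stays within the threshold, and falls back to the memory-centric strategy otherwise. The function name, the segment sizes and the way extra forward passes are counted (n for speed-centric, n(n+1)/2 for memory-centric on a segment of n layers) are assumptions, so the counts are not expected to match Table 1.

```python
def plan_recompute(segments, l_peak):
    """Choose a recomputation strategy for each segment.

    `segments` holds, per segment, the forward sizes of its
    non-checkpoint layers. Returns (strategy, extra_forwards) pairs.
    """
    plan = []
    for seg in segments:
        n = len(seg)
        if sum(seg) <= l_peak:
            # Keeping every recomputed tensor fits under the threshold:
            # recompute the segment once and reuse the results.
            plan.append(("speed", n))
        else:
            # Too large to keep: recompute the needed prefix for each
            # backward layer and free it afterwards.
            plan.append(("memory", n * (n + 1) // 2))
    return plan

# Hypothetical sizes in MB; the threshold is the largest single-layer usage.
segments = [[10, 20], [300, 300, 300]]
l_peak = 500
plan = plan_recompute(segments, l_peak)
print(plan)
```

Only the second segment exceeds the threshold, so only that segment pays the higher recomputation cost.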
3.5. Finding the Best Convolution Algorithm under the Memory Constraint
The speed of CONV layers significantly impacts the training as it accounts for over of total computing time (Fig.8). cuDNN provides several convolution algorithms, e.g. using FFT, Winograd and GEMM, for different contexts. Some of them, FFT in particular, require temporary convolution workspaces to deliver the maximal speed, as demonstrated in Fig. 2. Therefore, memory is also a critical factor for high-performance training.
We implement a dynamic strategy for allocating convolution workspaces. It is dynamic because the memory left for convolution workspaces constantly changes at every step according to Liveness Analysis, UTP and Cost-Aware Recomputation. Since convolution workspaces do not affect the functionality, the allocations of functional tensors such as data and parameters are prioritized. The runtime then steps into each layer to profile the free bytes left in GPU DRAM after those memory techniques have been applied. With the free-bytes information at individual layers, the runtime benchmarks all the memory-feasible convolution algorithms to pick the fastest one. Note that the runtime skips convolution algorithms that require more memory than it can provide. Each layer selects the fastest algorithm under the remaining GPU DRAM, which maximizes the performance of CONV layers and of the entire training.
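The selection rule in this paragraph reduces to a filter followed by a minimum. The Python sketch below assumes the candidate list has already been benchmarked; the algorithm names, workspace sizes and timings are made-up values, and `pick_conv_algo` is a hypothetical helper rather than a cuDNN call.

```python
def pick_conv_algo(candidates, free_bytes):
    """Pick the fastest algorithm whose workspace fits in the free memory.

    `candidates` is a list of (name, workspace_bytes, time_ms) tuples.
    """
    # Skip algorithms that need more workspace than is currently free.
    feasible = [c for c in candidates if c[1] <= free_bytes]
    if not feasible:
        raise MemoryError("no convolution algorithm fits in the free memory")
    # Among the feasible ones, take the lowest measured time.
    return min(feasible, key=lambda c: c[2])

MB = 1 << 20
# Hypothetical benchmark results for one CONV layer.
candidates = [
    ("GEMM", 0, 9.0),
    ("WINOGRAD", 64 * MB, 5.0),
    ("FFT", 512 * MB, 3.5),
]

tight = pick_conv_algo(candidates, free_bytes=128 * MB)
roomy = pick_conv_algo(candidates, free_bytes=1024 * MB)
print(tight[0], roomy[0])
```

Because the free bytes differ from step to step, the same layer can end up with different algorithms under different memory pressure, which is why the choice is made per layer at run time.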
4. Evaluations
In this section, we present the results of our experimental studies that evaluate each memory and performance technique in SuperNeurons. We also perform end-to-end evaluations against TensorFlow, MXNet, Caffe and Torch on various neural networks to justify the design.
4.1. Components Evaluations
4.1.1. Memory Optimizations
We use the naive network-wide tensor allocation strategy as the baseline. Thus, the $peak_m$ of the baseline is $\sum_{i=1}^{N} l_i^f + \sum_{i=1}^{N} l_i^b$, where $N$ is the network length (defined in Sec. 3). Since cuDNN operates at the layer-wise granularity, $peak_m$ is bounded by the maximal memory usage among layers, i.e. $l_{peak}$.
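Liveness Analysis reduces the baseline's peak memory usage to the memory stashed by the forward layers up to and including the peak layer plus the memory stashed by the backward computation of that layer.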
Fig.(a)a demonstrates how Liveness Analysis affects memory usages and live tensor counts at each forward/backward step on AlexNet (footnote 3: the structure of AlexNet is CONV1→RELU1→LRN1→POOL1→CONV2→RELU2→LRN2→POOL2→CONV3→RELU3→CONV4→RELU4→CONV5→RELU5→POOL5→FC1→RELU6→Dropout1→FC2→RELU7→Dropout2→FC3→Softmax). Since AlexNet has 23 layers, there are 23 forward steps and 23 backward steps. The central vertical line separates forward and backward, each of which contains 23 computational steps. The baseline allocates 36 data tensors consuming 2189.437MB, while Liveness Analysis uses up to 17 tensors with a peak memory usage of 1489.355MB. This demonstrates a 31.9% improvement over the baseline in terms of peak memory usage. It is also observable that the location of the peak memory usage is not necessarily consistent with the peak tensor count. This confirms our claim that memory usage is unevenly distributed across network layers.
To verify the cost model of Liveness Analysis, we delve into the memory usages of the peak layer. Fig.(a)a suggests that the 32nd step reaches the peak memory usage. This corresponds to the backward POOL5 in AlexNet, whose layer index (counting from 0) is 14 because of 46 - 32 = 14. The forward layers up to and including POOL5 stash 5 tensors, consuming 1409.277MB, while the backward POOL5 stashes 3 tensors, consuming 80.078MB. Therefore, 1409.277 + 80.078 = 1489.355MB, which is consistent with the measured peak memory usage.
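The step-to-layer mapping and the sum used in this check can be reproduced with a short Python sketch. The layer order is taken from the AlexNet footnote; the helper name `backward_layer_at` and the 1-based step numbering are assumptions made for illustration.

```python
# Layer order taken from the AlexNet footnote in this section.
ALEXNET = [
    "CONV1", "RELU1", "LRN1", "POOL1", "CONV2", "RELU2", "LRN2", "POOL2",
    "CONV3", "RELU3", "CONV4", "RELU4", "CONV5", "RELU5", "POOL5", "FC1",
    "RELU6", "Dropout1", "FC2", "RELU7", "Dropout2", "FC3", "Softmax",
]


def backward_layer_at(step: int, layers=ALEXNET) -> str:
    """Map a 1-based step in the backward half to the layer it computes."""
    total_steps = 2 * len(layers)          # 23 forward + 23 backward steps
    if not len(layers) < step <= total_steps:
        raise ValueError("step is not in the backward half")
    return layers[total_steps - step]      # zero-based layer index


forward_stash_mb = 1409.277    # forward layers up to and including POOL5
backward_stash_mb = 80.078     # backward POOL5
peak_mb = forward_stash_mb + backward_stash_mb

print(backward_layer_at(32))   # POOL5
print(round(peak_mb, 3))       # 1489.355
```

The same mapping places the 39th step at the backward POOL2 and the 44th step at the backward LRN1, which matches the steps quoted for the other two techniques.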
Prefetching and Offloading further reduces the peak memory usage obtained after Liveness Analysis. Fig.(b)b demonstrates the updated memory usages and live tensor counts after Prefetching/Offloading is applied on top of Liveness Analysis. We set CONV layers as checkpoints for offloading. The new peak memory usage is 1132.155MB at the 39th step, i.e. the backward POOL2. This is a further reduction of 357.2MB from the previous peak memory usage, or a total improvement of 48.29% over the baseline's peak memory usage. The peak shifts from POOL5 to POOL2 because of the number of CONV layers ahead of them. CONV1, CONV2, CONV3, and CONV4 are located before POOL5, and they consume 221.56MB, 142.38MB, 49.51MB and 49.51MB, respectively. The runtime offloads CONV layers to CPU RAM and prefetches CONV5. This leads the new memory usage of POOL5 to be 1489.355 - 221.56 - 142.38 - 49.51 = 1075.9MB, which is less than the measured new peak memory usage of 1132.155MB at POOL2.
To verify the updated cost model, we compare the live tensor count calculated from the model with the actual measurement. There are 2 checkpoints, CONV1 and CONV2, before POOL2, and the runtime prefetches CONV2 in the backward pass. As a result, the calculated live tensor count at POOL2 is 10 (measured live tensors before POOL2) - 1 (CONV1) = 9. This is the same as our actual measurement of 9 tensors at POOL2. Therefore, the updated cost model after Prefetching/Offloading is still valid.
Finally, Cost-Aware Recomputation reduces the peak memory usage to the maximal memory usage among layers. In theory, this is the minimal peak memory usage at the layer-wise granularity, as cuDNN needs to stash at least the tensors of a layer to compute it. Fig.(c)c demonstrates the stepwise memory usages and live tensor counts with all three techniques. By iterating through every layer, we profile that the maximal per-layer memory usage occurs at the backward LRN1. Fig.(c)c demonstrates a peak memory usage of 886MB at the 44th step, which is the backward of LRN1. Therefore, the three proposed memory saving techniques successfully reduce the peak memory usage from the network-wide total to the maximal memory usage among layers.
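As a quick arithmetic check, the peak memory usages reported in this subsection can be turned into savings relative to the baseline. The sketch below only reuses the numbers quoted in the text; the helper name `reduction_pct` is hypothetical.

```python
BASELINE_MB = 2189.437

# Peak memory usage reported in this section after each technique is added.
STAGES = [
    ("Liveness Analysis", 1489.355),
    ("+ Prefetching/Offloading", 1132.155),
    ("+ Cost-Aware Recomputation", 886.0),
]


def reduction_pct(peak_mb: float, baseline_mb: float = BASELINE_MB) -> float:
    """Percentage of the baseline peak memory usage that is saved."""
    return 100.0 * (baseline_mb - peak_mb) / baseline_mb


for name, peak in STAGES:
    print(f"{name:28s} {peak:9.3f} MB  {reduction_pct(peak):5.2f}% saved")
```

The second row reproduces the 48.29% improvement quoted for Prefetching/Offloading.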
4.1.2. Speed Optimizations
The runtime is equipped with a GPU Memory Pool and a Tensor Cache to improve the performance of the memory techniques, and with a dynamic strategy for allocating convolution workspaces to accelerate the training speed. More specifically, the GPU Memory Pool amortizes the nontrivial overhead of high-frequency memory allocations/deallocations in Liveness Analysis, and the Tensor Cache enables tensor reuse to minimize data transfers in Prefetching/Offloading. Fig.(c)c demonstrates that the GPU free space changes dynamically at each forward and backward step due to the 3 memory techniques. The runtime allocates convolution workspaces within the free memory at each step. As a result, the performance is optimized at individual layers under different stepwise memory constraints.
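Table 2. Training speed (img/s) with cudaMalloc/cudaFree (CUDA) versus the GPU Memory Pool (Ours).
img/s   | AlexNet | VGG16 | InceptionV4 | ResNet50 | ResNet101 | ResNet152
CUDA    | 359.4   | 12.1  | 6.77        | 21.5     | 11.3      | 7.46
Ours    | 401.6   | 14.4  | 10.0        | 32.9     | 18.95     | 13.2
speedup | 1.12x   | 1.19x | 1.48x       | 1.53x    | 1.68x     | 1.77x

Table 3. Communications (GB) in Prefetching/Offloading at different batch sizes, with and without Tensor Cache.
Communications in GB | 256  | 384  | 512  | 640  | 896  | 1024
Without Tensor Cache | 2.56 | 3.72 | 4.88 | 6.03 | 8.35 | 9.50
Tensor Cache         | 0    | 0    | 0    | 0    | 0    | 0.88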
GPU Memory Pool amortizes the nontrivial overhead of intensive memory operations in Liveness Analysis by preallocating a big chunk of GPU memory. Table 2 illustrates the performance improvement of using the GPU Memory Pool over cudaMalloc and cudaFree. Linear networks such as AlexNet and VGG involve far fewer memory operations than nonlinear ones such as InceptionV4 and ResNet due to their limited depth. Therefore, the speedups on nonlinear networks (ResNet50 to ResNet152 and InceptionV4) are more significant than those on linear networks (AlexNet, VGG).
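The idea of serving many small requests from one preallocated chunk can be sketched as follows. This is a simplified simulation over byte offsets, assuming a first-fit policy with hole merging; the `MemoryPool` class and its methods are hypothetical and do not describe the data structure SuperNeurons actually uses.

```python
class MemoryPool:
    """First-fit allocator over one preallocated region (simulated offsets)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.free = [(0, capacity)]   # sorted list of (offset, size) holes
        self.used = {}                # offset -> size of live allocations

    def alloc(self, size: int) -> int:
        """Return the offset of a block of `size` bytes, or raise MemoryError."""
        for i, (off, hole) in enumerate(self.free):
            if hole >= size:
                if hole == size:
                    del self.free[i]
                else:
                    self.free[i] = (off + size, hole - size)
                self.used[off] = size
                return off
        raise MemoryError(f"no free block of {size} bytes")

    def release(self, off: int) -> None:
        """Return a block to the pool and merge it with adjacent holes."""
        size = self.used.pop(off)
        self.free.append((off, size))
        self.free.sort()
        merged = [self.free[0]]
        for o, s in self.free[1:]:
            last_o, last_s = merged[-1]
            if last_o + last_s == o:
                merged[-1] = (last_o, last_s + s)
            else:
                merged.append((o, s))
        self.free = merged

    def free_bytes(self) -> int:
        return sum(s for _, s in self.free)


pool = MemoryPool(1 << 20)   # stands in for one large up-front GPU allocation
a = pool.alloc(4096)
b = pool.alloc(8192)
pool.release(a)
c = pool.alloc(2048)         # reuses the hole left by `a`
print(a, b, c, pool.free_bytes())
```

Each `alloc` and `release` here is plain bookkeeping on the host, which is why a pool of this kind can be much cheaper than calling the device allocator for every tensor.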
Tensor Cache intends to reduce unnecessary data transfers in Prefetching/Offloading. Specifically, the offloading is unnecessary if a network can fit into the GPU DRAM. In Table 3, we can see that Tensor Cache successfully avoids communications at batch sizes from 256 to 896, while the communications in the scenario without Tensor Cache increase linearly with the batch size. The training performance will deteriorate if communications outweigh computations. Fig.11 demonstrates up to 33.33% performance loss without using Tensor Cache. It is also noticeable that the speedup on linear networks (AlexNet, VGG16) is less significant than on nonlinear ones (ResNet50 to ResNet152, Inception). In general, the computation intensity of a linear network layer is far higher than that of a nonlinear one. Because their communications can overlap with computations in Prefetching/Offloading, Tensor Cache does not provide a comparable speedup for AlexNet and VGG16.
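The effect described above, no traffic while everything fits and transfers only under memory pressure, can be illustrated with a small simulation. The sketch assumes a least-recently-used eviction policy, which this section does not specify; the `TensorCache` class, the `traffic_for` helper and the tensor sizes are hypothetical.

```python
from collections import OrderedDict


class TensorCache:
    """Keeps tensors resident on the GPU and offloads only when space runs out."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.resident = OrderedDict()   # name -> size, least recently used first
        self.resident_bytes = 0
        self.offloaded = set()          # tensors currently held in host memory
        self.transferred = 0            # traffic in both directions

    def access(self, name: str, size: int) -> None:
        """Make a tensor resident, evicting least recently used ones if needed."""
        if name in self.resident:
            self.resident.move_to_end(name)   # cache hit: no transfer
            return
        if size > self.capacity:
            raise MemoryError(f"{name} does not fit in the cache")
        while self.resident_bytes + size > self.capacity:
            victim, victim_size = self.resident.popitem(last=False)
            self.resident_bytes -= victim_size
            self.offloaded.add(victim)
            self.transferred += victim_size   # offload to host
        if name in self.offloaded:
            self.offloaded.remove(name)
            self.transferred += size          # prefetch back from host
        self.resident[name] = size
        self.resident_bytes += size


def traffic_for(capacity: int, sizes: dict) -> int:
    """Units moved during one forward pass followed by one backward pass."""
    cache = TensorCache(capacity)
    order = list(sizes)
    for name in order + order[::-1]:
        cache.access(name, sizes[name])
    return cache.transferred


SIZES = {"t0": 4, "t1": 4, "t2": 4}   # arbitrary units, illustrative only
print(traffic_for(12, SIZES))         # everything fits: 0
print(traffic_for(8, SIZES))          # one tensor too many: 12
```

With enough capacity the backward pass finds every tensor still resident, which corresponds to the zero entries in Table 3.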
Dynamic Convolution Workspace Allocation intends to optimize each layer's training speed together with the 3 memory techniques. Convolution workspaces are critical to high performance, while the free memory for convolution workspaces constantly changes at different computing steps, as demonstrated in Fig.(c)c. The runtime picks the fastest memory-feasible convolution algorithm at a particular step.
Fig.(a)a and Fig.(b)b demonstrate that the runtime automatically reduces CONV workspaces to accommodate functional tensors as the batch size increases. Specifically, the runtime prioritizes the functional tensor allocations at batch 300 under a 3 GB memory pool (Fig.(b)b), while it provisions the most workspace for the maximal speed at batch 100 (Fig.(a)a). In general, a higher speed is observable with more convolution workspaces. Fig.(c)c and Fig.(d)d demonstrate that the training speed (images per second) increases from 203 img/s to 240 img/s with additional CONV workspaces.
4.2. Going Deeper and Wider
Our primary goal is to enable ML practitioners to explore deeper and wider neural architectures within the limited GPU DRAM. In this section, we conduct end-to-end comparisons to TensorFlow, MXNet, Caffe and Torch with several mainstream linear networks (AlexNet, VGG16) and nonlinear ones (ResNet50 to ResNet152, Inception V4) under the same experiment setup.
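Table 4. Deepest trainable ResNet on each framework.
Depth  | Caffe | MXNet | Torch | TensorFlow | SuperNeurons
ResNet | 148   | 480   | 152   | 592        | 1920

Table 5. Largest batch size reachable by each framework before the GPU out-of-memory error.
peak batch  | Caffe | MXNet | Torch | TensorFlow | SuperNeurons
AlexNet     | 768   | 768   | 1024  | 1408       | 1792
VGG16       | 48    | 64    | 48    | 80         | 224
InceptionV4 | 16    | N/A   | N/A   | 64         | 240
ResNet50    | 24    | 80    | 32    | 128        | 384
ResNet101   | 16    | 48    | 16    | 80         | 256
ResNet152   | 16    | 32    | 16    | 48         | 176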
We increase the batch size to go wider. Table 5 presents the largest batch reachable by different frameworks before the GPU out-of-memory error. SuperNeurons consistently outperforms the mainstream frameworks on both linear and nonlinear networks. On average, it handles 1.8947x larger batches than the second best. SuperNeurons can train ResNet101 at a batch size of 256, which is 3x larger than the second best, TensorFlow.
Fig.13 demonstrates the memory requirements corresponding to the peak batches in Table 5. The translation is nonlinear because of the convolution workspaces. We calculate the memory requirement from the per-layer memory usages, where the memory usage of a layer is the sum of the memory usages of all tensors in that layer. It is observable that SuperNeurons handles up to a 19.8x larger model than Caffe.
We add layers to go deeper. Table 4 demonstrates that SuperNeurons trains 12.9730x, 12.6316x, 4.0000x, and 3.2432x deeper ResNet than Caffe, Torch, MXNet, and TensorFlow, respectively. In particular, SuperNeurons can train a ResNet up to 2500 residual units having approximately basic layers at the batch size of 1 on a 12GB GPU.
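The depth ratios quoted above follow directly from the depths in Table 4, as the short Python check below shows. The dictionary and the helper name `depth_ratio` are introduced only for this illustration.

```python
# Deepest trainable ResNet per framework, copied from Table 4.
MAX_DEPTH = {"Caffe": 148, "MXNet": 480, "Torch": 152,
             "TensorFlow": 592, "SuperNeurons": 1920}


def depth_ratio(framework: str, ours: str = "SuperNeurons") -> float:
    """How many times deeper `ours` trains than `framework`."""
    return MAX_DEPTH[ours] / MAX_DEPTH[framework]


for fw in ("Caffe", "Torch", "MXNet", "TensorFlow"):
    print(f"{fw:10s} {depth_ratio(fw):.4f}x")
```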
The training speed is measured by the processed images per second. Fig.14 presents an end-to-end training speed comparison of SuperNeurons to mainstream DL systems. SuperNeurons consistently demonstrates the leading speed on various linear networks (AlexNet, VGG16) and nonlinear ones (ResNet, Inception V4). The performance largely results from the abundant supply of convolution workspaces saved by the dynamic GPU memory scheduler. We can also observe that the speed slowly deteriorates as the batch size increases. This is because of the growing communications caused by more frequent tensor swapping between CPU and GPU DRAM. The performance is the worst when the GPU memory can only accommodate one network layer. In that case, the runtime has to constantly offload the current layer before proceeding to the next one.
5. Related Work
Several solutions have been proposed to address the GPU DRAM shortage for training large-scale neural networks. Model Parallelism provides a straightforward solution to the large network training. DistBelief (10) partitions a network across multiple machines so that each machine holds a segment of the original network. Coates et al. (8) discuss another partition scheme on multiple GPUs. Model Parallelism demands huge intra-network communications for synchronizations. Therefore, most DL systems parallelize the training with Data Parallelism for high performance (15; 2; 5; 9). In this paper, we focus on the GPU DRAM shortage issue for Data Parallelism.
Under Data Parallelism, vDNN (20) proposes a prefetching and offloading technique to utilize the CPU DRAM as an external buffer for the GPU. It tries to overlap communications with computations by asynchronously swapping the data between CPU and GPU amid the back-propagation. The performance of this method largely depends on the communication/computation ratio. Some layers such as POOL are very cheap to compute, while the GPU processing speed is several orders of magnitude faster than the PCIe x16 bus. In nonlinear networks, the performance will quickly deteriorate once computations are inadequate to overlap with communications. Chen et al. (6) also introduce a recomputation strategy to trade computations for memory. However, their method fails to fully exploit the memory saving opportunities and the computation efficiency because it ignores the memory variations among layers.
Removing the parameter redundancy also reduces the memory usage. For example, network pruning (11; 12) removes near-zero parameters, and quantization (24) or precision reduction (16) utilizes low-precision floats to save memory. Although the parameter reduction has immense benefits in deploying neural networks on embedded systems, parameters only account for a negligible portion of the memory usage in training. Therefore, these approaches are of limited benefit for training.
6. Conclusion
In this paper, we focus on the GPU memory scheduling problem for training deep neural networks, and we propose a novel dynamic scheduling runtime to tackle the issue. The runtime features three memory techniques to reduce the peak memory usage to the maximal memory usage among layers, which is the minimum at the layer-wise granularity. We also propose several performance optimizations to guarantee high performance. Evaluations against state-of-the-art DL frameworks have demonstrated the effectiveness and efficiency of the proposed dynamic scheduling runtime. It creates new opportunities for DL practitioners to explore deeper and wider neural architectures, and the accuracy record is waiting to be refreshed with even deeper and wider designs.
7. Acknowledgements
This research is funded in part by the DARPA Award 1643D3MFP040 and gifts from Google, VMware, Mellanox, and Oracle. Zenglin Xu and Jianmian Ye were supported by a grant from the Natural Science Foundation of China (No. 61572111), and a Fundamental Research Fund for the Central Universities of China (No. ZYGX2016Z003).
References
 (1) Mxnet’s graph representation of neural networks. http://mxnet.io/architecture/note_memory.html.
 (2) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: A system for large-scale machine learning. In OSDI (2016), vol. 16, pp. 265–283.
 (3) Bahrampour, S., Ramakrishnan, N., Schott, L., and Shah, M. Comparative study of caffe, neon, theano, and torch for deep learning.
 (4) Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5, 2 (1994), 157–166.
 (5) Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
 (6) Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016).
 (7) Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
 (8) Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Ng, A. Deep learning with cots hpc systems. In International Conference on Machine Learning (2013), pp. 1337–1345.
 (9) Collobert, R., Bengio, S., and Mariéthoz, J. Torch: a modular machine learning software library. Tech. rep., Idiap, 2002.
 (10) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep networks. In Advances in neural information processing systems (2012), pp. 1223–1231.
 (11) Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. Eie: efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture (2016), IEEE Press, pp. 243–254.
 (12) Hassibi, B., and Stork, D. G. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in neural information processing systems (1993), pp. 164–171.
 (13) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (2016), pp. 770–778.
 (14) Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten, L. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016).
 (15) Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia (2014), ACM, pp. 675–678.
 (16) Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., Jerger, N. E., and Moshovos, A. Proteus: Exploiting numerical precision variability in deep neural networks. In Proceedings of the 2016 International Conference on Supercomputing (2016), ACM, p. 23.
 (17) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (2012), pp. 1097–1105.
 (18) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
 (19) Pleiss, G., Chen, D., Huang, G., Li, T., van der Maaten, L., and Weinberger, K. Q. Memory-efficient implementation of densenets. arXiv preprint arXiv:1707.06990 (2017).
 (20) Rhu, M., Gimelshein, N., Clemons, J., Zulfiqar, A., and Keckler, S. W. vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on (2016), IEEE, pp. 1–13.
 (21) Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
 (22) Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI (2017), pp. 4278–4284.
 (23) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (2015), pp. 1–9.
 (24) Vanhoucke, V., Senior, A., and Mao, M. Z. Improving the speed of neural networks on cpus. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop (2011), vol. 1, p. 4.
 (25) Wang, L., Wu, W., Bosilca, G., Vuduc, R., and Xu, Z. Efficient communications in training large scale neural networks. arXiv preprint arXiv:1611.04255 (2016).
 (26) Wang, L., Wu, W., Xu, Z., Xiao, J., and Yang, Y. Blasx: A high performance level-3 blas library for heterogeneous multi-gpu computing. In Proceedings of the 2016 International Conference on Supercomputing (2016), ACM, p. 20.
 (27) Wang, L., Yang, Y., Min, R., and Chakradhar, S. Accelerating deep neural network training with inconsistent stochastic gradient descent. Neural Networks (2017).
 (28) Wu, W., Bouteiller, A., Bosilca, G., Faverge, M., and Dongarra, J. Hierarchical dag scheduling for hybrid distributed systems. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International (2015), IEEE, pp. 156–165.
 (29) Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. Spark: Cluster computing with working sets. HotCloud 10, 1010 (2010), 95.