µ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching
Abstract
NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used in deep learning. Specifically, cuDNN implements several equivalent convolution algorithms, whose performance and memory footprint may vary considerably depending on the layer dimensions. When an algorithm is automatically selected by cuDNN, the decision is performed on a per-layer basis, and thus it often resorts to slower algorithms that fit the workspace size constraints. We present µ-cuDNN, a transparent wrapper library for cuDNN, which divides layers’ mini-batch computation into several micro-batches. Based on Dynamic Programming and Integer Linear Programming, µ-cuDNN enables faster algorithms by decreasing the workspace requirements. At the same time, µ-cuDNN keeps the computational semantics unchanged, so that it safely decouples statistical efficiency from hardware efficiency. We demonstrate the effectiveness of µ-cuDNN over two frameworks, Caffe and TensorFlow, achieving speedups of 1.63x for AlexNet and 1.21x for ResNet-18 on a P100-SXM2 GPU. These results indicate that using micro-batches can seamlessly increase the performance of deep learning, while maintaining the same memory footprint.
I Introduction
Prevalent Deep Neural Networks (DNNs) are becoming increasingly deep and are trained with large batch sizes. Specifically, state-of-the-art DNNs contain hundreds of layers [1, 2] and utilize batch sizes on the order of thousands [3, 4, 5].
Large batches are also favored by distributed data-parallel deep learning frameworks, because they improve the utilization of accelerators, as well as efficiently hiding the communication of parameter gradients behind the computation. Consequently, the batch size per accelerator (e.g., GPU) should be large to achieve better scaling. Since the memory usage of a DNN is nearly proportional to the layer size and the batch size, the accelerator memory tends to be used at full capacity in most real-world cases.
This “limited memory scenario” also manifests in cuDNN [6], a deep learning kernel library for NVIDIA GPUs. cuDNN provides a variety of computational primitives for deep neural networks, and is widely used in deep learning frameworks such as Caffe [7] and others [8, 9, 10]. cuDNN provides up to eight different algorithms to perform convolutions, each of which requires a different temporary storage (workspace) scheme. To guide users in determining the best algorithm for a given maximum workspace size, cuDNN provides a function cudnnGetConvolution*Algorithm (where * is one of the convolution types: Forward, BackwardData, or BackwardFilter) that benchmarks all the algorithms and chooses the best one, either with respect to computation time or memory usage. However, if the workspace size requested by a fast algorithm is even one byte larger than what is provided, cuDNN resorts to a slower algorithm that requires less workspace. In fact, the performance impact can be as large as 4.51x in the second convolutional layer of AlexNet, as shown in Fig. 1.
In this paper, we propose µ-cuDNN, a transparent wrapper for cuDNN that attempts to mitigate the aforementioned inefficiency. In order to utilize fast convolution algorithms with a limited workspace size, µ-cuDNN automatically divides a layer’s mini-batch computation into several micro-batches and performs multiple convolutions sequentially. µ-cuDNN decouples the statistical efficiency (the accuracy/loss improvement obtained from a fixed number of parameter updates) from the hardware efficiency (the time required to perform a fixed number of parameter updates), improving only the latter. Using micro-batches, µ-cuDNN improves the utilization of the accelerators without incurring any reduction in training accuracy.
The contributions of this paper are as follows:

We present a method to automatically divide mini-batch training into several “micro-batches”, so that faster algorithms can be utilized under tight workspace constraints.

We propose two different workspace allocation policies, which enable the optimization of multiple convolutional layers with interdependencies.

We evaluate µ-cuDNN over two different deep learning frameworks, Caffe and TensorFlow, showing that it can mitigate the inefficiency of cuDNN even with state-of-the-art Convolutional Neural Networks (CNNs), such as AlexNet and ResNet.
II The Anatomy of Convolutional Neural Networks
Convolution operations in Convolutional Neural Networks (CNNs) apply multiple filters to a batch of channels of two-dimensional data (Algorithm 1, Fig. 2). In particular, input and output tensors are represented as four-dimensional tensors with dimensions (N, C, H, W), where N is the mini-batch size, C is the number of channels, and H and W represent the image height and width, respectively. Similarly, the filter tensor is represented as a four-dimensional (K, C, R, S) tensor, where K is the number of output channels and R and S represent the kernel height and width.
The two-dimensional convolution is composed of seven nested loops (Algorithm 1). The innermost three loops compute the actual convolution, where one element of the input tensor x is multiplied and accumulated into one element of the output tensor y. The remaining loops iterate over all elements of y. The key observation, which enables solving the problem described in Section I, is that there is no dependency between different iterations of the mini-batch loop. This is intuitive, because in training or inference we compute parameter gradients or outputs with respect to different data samples, which is equivalent to computing different CNNs concurrently. This observation motivates us to apply loop tiling to the mini-batch loop, so that we can reduce the resident workspace size.
The only exception to the inter-sample independence is the computation of parameter gradients:

∂L/∂W = Σ_{n=1}^{N} ∂ℓ_n/∂W = Σ_{n=1}^{N} x_n ⊛ ∂ℓ_n/∂y_n,

where L and ℓ_n are the loss functions with respect to a mini-batch and a single sample n, respectively, and ⊛ is the convolution operation [12]. The semantics of this computation are, however, preserved by the loop splitting, as long as the iterations are performed sequentially, accumulating into the same gradient tensor.
In cuDNN, there are three operations related to the two-dimensional convolution: Forward for the forward computation (Fig. 2), BackwardData for computing neuron errors in backpropagation, and BackwardFilter for computing parameter gradients in backpropagation.
Although Forward and BackwardData can directly be divided into several micro-batches, BackwardFilter cannot, since there are output dependencies on the accumulated parameter gradient tensor ∂L/∂W. However, we can still divide the loop by running BackwardFilter multiple times while accumulating the results, i.e., using the output scaling feature of cuDNN. Therefore, loop splitting can be achieved by repeating cuDNN kernels one or more times for any convolution-related operation, regardless of the underlying method.
III µ-cuDNN
µ-cuDNN is a transparent C++ wrapper library for cuDNN, which can easily be integrated into most deep learning frameworks [7, 13, 8, 10]. The key concept of µ-cuDNN is that it automatically divides a mini-batch into several smaller batches (referred to as “micro-batches” in this paper) and optimizes their sizes, in order to utilize faster convolution algorithms (Fig. 3).
III-A µ-cuDNN Methodology
The µ-cuDNN library employs one of two workspace utilization policies to optimize micro-batches for convolution kernels (Fig. 4):

Workspace Reuse (WR): WR allocates one workspace per layer, sharing the space among the internal micro-batches. In this scheme, each layer is assumed to use its workspace exclusively, hence the total size of the workspaces is proportional to the number of convolutional layers.

Workspace Division (WD): WD allocates one workspace per network, and assigns different segments of it to each convolutional layer. WD enables small groups of convolution operations, as in the Inception module [14], to run concurrently with larger workspaces. In WD, the actual workspace is managed by µ-cuDNN rather than by the deep learning framework. This is because conventional frameworks allocate each workspace separately, lacking a global view of the entire network’s workspace requirements.
WR and WD both rely on the parameters of one or more convolution kernel(s), the mini-batch size, and the maximum workspace size. The output of µ-cuDNN is a division of the mini-batch into “micro-configurations”: each micro-configuration is a pair of a convolution algorithm and a micro-batch size. In this paper, we define the “configuration” of a segmented convolution kernel as a list of micro-configurations. For example, if a kernel with a mini-batch size of 256 is equally divided into four micro-batches and each of them uses algorithm A, the configuration is represented as [(A, 64), (A, 64), (A, 64), (A, 64)]. Also, we define the concatenation of two lists X and Y as X ⧺ Y, such that [(A, 64)] ⧺ [(B, 32), (B, 32)] = [(A, 64), (B, 32), (B, 32)].
III-B WR Algorithm
The goal of the WR policy is to minimize T(B), the total execution time for a mini-batch size of B, using Dynamic Programming (DP):

T(n) = min( t(n), min_{1 ≤ m < n} [ t(m) + T(n − m) ] ),

where t(n) is the fastest execution time of one convolution kernel with a micro-batch size of n, within the workspace constraint. If the first term of the definition of T(n) is smaller than the second, µ-cuDNN does not have to divide the batch. Otherwise, it is beneficial to divide the batch into two or more parts, applying the process recursively (Fig. 5).
The key point of WR is that the optimal micro-configuration is deterministic and independent of the other kernels. This is because, in this case, we assume that multiple kernels do not run simultaneously.
The WR algorithm consists of three steps, where the mini-batch size is B and the user-given maximum workspace size is W_max:

For each µ = 1, 2, …, B, WR benchmarks all available convolution algorithms with a micro-batch size of µ and a maximum workspace size of W_max, using cuDNN. We define the fastest micro-configuration as c(µ) = (a(µ), µ) (where a(µ) is the fastest algorithm) and its execution time as t(µ).

For each µ = 1, 2, …, B, WR computes T(µ), the fastest execution time for a micro-batch size of µ, and C(µ), the corresponding configuration, as follows (where T(0) = 0 and C(0) is the empty list): T(µ) = min( t(µ), min_{1 ≤ ν < µ} [ t(ν) + T(µ − ν) ] ), with C(µ) = [c(µ)] if the first term is smaller, and C(µ) = [c(ν)] ⧺ C(µ − ν) for the minimizing ν otherwise. T(µ) and C(µ) are memoized and reused in subsequent iterations.

WR outputs the optimal configuration C(B).
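The three steps above can be sketched as a bottom-up DP in Python; the per-micro-batch benchmark results t(µ) below are hypothetical stand-ins for cuDNN timings within the workspace constraint.

```python
# WR dynamic program: T(n) = min(t(n), min_{1<=m<n} [t(m) + T(n-m)]).
# `t` is a list indexed by micro-batch size (t[0] unused), holding the fastest
# benchmarked kernel time for each size; the values here are illustrative only.
def wr_optimize(B, t):
    """Return (T(B), configuration), where the configuration is the list of
    micro-batch sizes whose times sum to T(B)."""
    best_time = [0.0] * (B + 1)
    best_conf = [[] for _ in range(B + 1)]
    for n in range(1, B + 1):
        best_time[n], best_conf[n] = t[n], [n]    # undivided candidate
        for m in range(1, n):
            cand = t[m] + best_time[n - m]        # split off a micro-batch of size m
            if cand < best_time[n]:
                best_time[n] = cand
                best_conf[n] = [m] + best_conf[n - m]
    return best_time[B], best_conf[B]
```

For instance, with t = [0, 2, 3, 8, 10] (sizes 1 to 4), dividing a batch of 4 into two micro-batches of size 2 costs 3 + 3 = 6, beating the undivided time of 10.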
III-C WD Algorithm
In the WD scheme, the configurations of multiple convolution kernels are optimized jointly, while keeping the total workspace size below the limit that users specify. WD is therefore a more complex problem than WR, since the configuration of each convolution kernel is no longer independent of the others, due to the total workspace size constraint.
To solve this problem, we formulate a 0-1 Integer Linear Programming (ILP) optimization (Fig. 6). Given the set of kernels K and the set C_k of available configurations of each kernel k ∈ K, WD is solved by minimizing Equation 1:

minimize    Σ_{k ∈ K} Σ_{c ∈ C_k} t_{k,c} x_{k,c}                     (1)
subject to  Σ_{k ∈ K} Σ_{c ∈ C_k} m_{k,c} x_{k,c} ≤ M                 (2)
            Σ_{c ∈ C_k} x_{k,c} = 1       for all k ∈ K               (3)
            x_{k,c} ∈ {0, 1}              for all k ∈ K, c ∈ C_k      (4)

where m_{k,c} and t_{k,c} are the workspace size and execution time of kernel k with configuration c, respectively. Equation 2 limits the total workspace size to the user-specified size M. µ-cuDNN uses configuration c on kernel k if and only if x_{k,c} = 1, and exactly one configuration is selected for each kernel k, according to the constraints in Equations 3 and 4.
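A brute-force stand-in for the 0-1 ILP of the WD algorithm, workable only for the small configuration sets that remain after pruning; the (time, workspace) pairs are invented for illustration, and a real solver such as GLPK replaces the exhaustive search.

```python
from itertools import product

# Pick exactly one configuration per kernel, minimizing total execution time
# subject to a global workspace cap M. `kernels` is a list of per-kernel
# configuration lists, each configuration a hypothetical (time, workspace) pair.
def wd_solve(kernels, M):
    """Return (total_time, chosen_configs), or None if no selection fits in M."""
    best = None
    for choice in product(*kernels):          # one configuration per kernel
        ws = sum(c[1] for c in choice)
        if ws > M:                            # global workspace constraint
            continue
        t = sum(c[0] for c in choice)
        if best is None or t < best[0]:
            best = (t, list(choice))
    return best
```

With two kernels and a 10-unit cap, the solver trades workspace between them: giving 8 units to the first kernel (time 4 instead of 10) and 2 to the second is the optimum.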
III-C1 Desirable Configuration Selection
The challenge of the above ILP-based algorithm is that if all possible configurations are evaluated (i.e., all combinations of micro-batch counts, micro-batch sizes, and algorithms), the search space contains on the order of |A|^B configurations in total (where A is the set of algorithms and B is the mini-batch size), which makes the problem impractically large.
Here, we compute a Pareto front to remove undesirable configurations from the set of all possible configurations, without losing any optimal solutions. The resulting Pareto front is then input to the ILP (Equation 1) to solve the entire problem.
First, we modify the DP algorithm from WR (Section III-B) to output a set of configurations, rather than only the fastest one:

S(µ) = P( s(µ) ∪ { X ⧺ Y | X ∈ S(ν), Y ∈ S(µ − ν), 1 ≤ ν < µ } ),

where s(µ) is the set of available micro-configurations of micro-batch size µ, and P is a pruning function described below. Note that S(µ) contains the output of the WR algorithm as one of its elements; that is, C(µ) ∈ S(µ), and T(C(µ)) ≤ T(d) for any d ∈ S(µ).
Second, we define the “desirable configuration set” D as a Pareto front in the two-dimensional (execution time, workspace size) space (Fig. 7):

D = { d ∈ S | ∀d′ ∈ S : T(d′) < T(d) ⇒ M(d′) > M(d) },

where T(d) and M(d) are the execution time and required workspace size of a configuration d, respectively. This definition implies that any d ∈ D is the fastest configuration among the elements of S that use a workspace of size M(d) or less. Conversely, if an element e ∈ S is not in D, there is an element of D that is at least as fast as e and requires no more workspace, hence there is no reason to choose e; we call such elements “undesirable”.
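The pruning function P can be sketched as a single sweep that keeps the Pareto front in (time, workspace) space; the input pairs below are hypothetical.

```python
# Keep only the "desirable" configurations: after sorting by time, a
# configuration survives only if it needs strictly less workspace than every
# faster configuration, i.e. it lies on the Pareto front.
def pareto_front(configs):
    """configs: list of (time, workspace) pairs -> desirable subset."""
    front = []
    min_ws = float("inf")
    for t, ws in sorted(configs):     # increasing time; ties by smaller workspace
        if ws < min_ws:               # strictly less workspace than all faster ones
            front.append((t, ws))
            min_ws = ws
    return front
```

For example, a configuration with time 7 and workspace 60 is dropped when a faster one (time 6) already fits in 40 units.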
The pruning drastically reduces the number of variables in Equation 1, and enables solving the ILP for state-of-the-art deep CNNs in practical time. For instance, the maximum number of desirable configurations among the AlexNet layers we examined in Section IV-D was 68, which is much smaller than the exponential order. Fig. 8 illustrates a Pareto front of one convolutional layer of AlexNet.
The validity of the pruning algorithm, i.e., that the optimal solution of the ILP never includes an undesirable configuration, is proved as follows:
Proof.
Suppose that an optimal solution X = {x_{k,c}} of the ILP contains an undesirable configuration c of a kernel k (i.e., x_{k,c} = 1 and c ∉ D_k). According to the definition of desirable sets, there is a configuration c′ of k such that t_{k,c′} ≤ t_{k,c} and m_{k,c′} ≤ m_{k,c}, with at least one inequality strict. Setting x_{k,c} = 0 and x_{k,c′} = 1 yields another feasible solution, since the total workspace size does not increase and exactly one configuration is still selected for kernel k. This solution is at least as fast as X; hence restricting the search space to desirable configurations never excludes an optimal solution. ∎
III-D µ-cuDNN Implementation
To enable µ-cuDNN, the only modification that needs to be made to the code is to replace the cuDNN handle type cudnnHandle_t with UcudnnHandle_t. The µ-cuDNN handle object is an opaque type that wraps the original type, such that users can call any cuDNN function with it. When a convolution operation or benchmarking function is called on the handle, the µ-cuDNN library internally computes the optimal configurations, and returns a virtual algorithm ID and a required workspace size of zero. This mechanism enables users to call cuDNN with minimal modification to the original code. For example, the number of lines to be modified to introduce µ-cuDNN to Caffe (v1.0) is approximately three.
The implementation of µ-cuDNN is based on overloading a subset of cuDNN functions, where the memory of the µ-cuDNN handle object is structured to behave as the cuDNN internal handle for all other calls. We define a cast operator from the µ-cuDNN handle to the cuDNN handle so that the framework adopts this behavior automatically. Using this technique, µ-cuDNN delegates most of the functions to cuDNN, but overrides the functions related to convolutional layers.
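The delegation mechanism can be mimicked in Python with attribute forwarding; the `FakeCudnn` class below is a stand-in for the real cuDNN handle, and the real implementation relies on a C++ cast operator rather than dynamic dispatch.

```python
# A wrapper handle that overrides a small subset of operations and transparently
# delegates everything else to the wrapped handle, analogous to UcudnnHandle_t.
class FakeCudnn:
    """Stand-in for the underlying cuDNN handle (hypothetical methods)."""
    def pooling_forward(self):
        return "cudnn pooling"
    def convolution_forward(self):
        return "cudnn convolution"

class MicroBatchHandle:
    def __init__(self, inner):
        self._inner = inner
    def convolution_forward(self):
        # overridden: here the wrapper would split the mini-batch and loop
        return "micro-batched convolution"
    def __getattr__(self, name):
        # any method not overridden is forwarded, like the C++ cast operator
        return getattr(self._inner, name)
```

Only convolution-related calls are intercepted; everything else reaches cuDNN unchanged.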
The optimization algorithm in µ-cuDNN is based on the methodology described in Section III-A. In practice, µ-cuDNN provides a “batch size policy”, which determines which micro-batch sizes are benchmarked in step 1 of the WR algorithm, as follows:

all uses all batch sizes 1, 2, …, B. Although this always finds the optimal solution, it takes O(B) benchmarking time.

powerOfTwo uses only power-of-two batch sizes 1, 2, 4, …. This saves a considerable amount of time, since it only costs O(log B) benchmarking time.

undivided uses only the original mini-batch size B. In WR, this option always selects the same configuration as cuDNN, hence it is only useful to evaluate the overhead of µ-cuDNN.
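The three policies above can be sketched as candidate-size generators for step 1 of the WR algorithm; whether powerOfTwo additionally benchmarks a non-power-of-two B is an assumption of this sketch.

```python
# Candidate micro-batch sizes per batch size policy (illustrative sketch).
def candidate_sizes(B, policy):
    if policy == "all":
        return list(range(1, B + 1))     # O(B) benchmark runs, optimal result
    if policy == "powerOfTwo":
        sizes = []
        m = 1
        while m < B:
            sizes.append(m)
            m *= 2
        sizes.append(B)                  # assumed: always include the undivided size
        return sizes                     # O(log B) benchmark runs
    if policy == "undivided":
        return [B]                       # behaves exactly like plain cuDNN
    raise ValueError(policy)
```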
These policies can be specified via an environment variable or through a special library function of µ-cuDNN. Furthermore, µ-cuDNN supports parallel micro-configuration evaluation via an environment variable, in which the aforementioned micro-batches are distributed to different GPUs on the same computing node and benchmarked concurrently. This function assumes that the node contains multiple homogeneous GPUs.
µ-cuDNN caches the optimized configurations and the benchmark results in memory and in an optional file-based database, respectively, to skip unnecessary recomputation. This is especially beneficial for networks that replicate convolutional layers of the same size, such as ResNet [2]. In addition, the file-based caching enables offline benchmarking, as well as sharing the results among a homogeneous GPU cluster via a network file system.
III-E Implementation of WD Optimization
To perform WD optimization, µ-cuDNN must know the number of convolutional layers and the corresponding layer parameters in advance, i.e., before running any kernel. In the current cuDNN API, however, the parameters are passed one layer at a time, and thus there is no way to obtain all the parameters collectively from deep learning frameworks.
To overcome this issue, we assume that the deep learning framework calls cudnnGetConvolution*Algorithm once for each layer prior to the computation of the entire network (e.g., training or inference). This is the most straightforward use of the cuDNN interface, since memory (including workspace) is usually allocated before initiating computation. Due to the specific implementation of Caffe, we add a µ-cuDNN library call after network initialization, which causes subsequent cudnnGetConvolution*Algorithm calls to be ignored.
When cudnnGetConvolution*Algorithm is called, µ-cuDNN pushes the kernel parameters to an internal list and returns a dummy result. Note that the returned results satisfy the semantics of the cuDNN interface, so the framework will not raise errors and will not allocate its own workspaces. When cudnnConvolution* is called for the first time, µ-cuDNN executes the optimization algorithm (namely, WD). We use the GNU Linear Programming Kit (GLPK) [15] as the ILP solver.
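The deferred-optimization flow can be sketched as follows; `optimize` stands in for the WD solver, and the dummy return value mirrors the virtual algorithm ID and zero workspace size returned to the framework.

```python
# Get-algorithm calls only record kernel parameters and return a dummy answer;
# the first convolution call triggers the (here abstracted) WD optimization
# over all recorded kernels. Names and the plan format are illustrative.
class DeferredOptimizer:
    def __init__(self, optimize):
        self._optimize = optimize    # callable: list of kernel params -> plan
        self._kernels = []
        self._plan = None

    def get_convolution_algorithm(self, params):
        """Record the kernel; return a dummy (algo ID, workspace size) pair."""
        self._kernels.append(params)
        return (0, 0)                # virtual algorithm, zero required workspace

    def convolution(self, params):
        if self._plan is None:       # first kernel launch: run WD exactly once
            self._plan = self._optimize(self._kernels)
        return self._plan[params]    # execute with the chosen configuration
```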



TABLE I

                     TSUBAME-KFC/DL     TSUBAME 3            DGX-1
CPU (Intel Xeon)     E5-2620 ×2         E5-2680 v4 ×2        E5-2698 v4 ×2
GPU (NVIDIA Tesla)   K80 ×4             P100-SXM2 ×4         V100-SXM2 ×8
                     8.73 SP TFlop/s    10.6 SP TFlop/s      15.7 SP TFlop/s
                     24 GiB GDDR5       16 GiB HBM2          16 GiB HBM2
                     (480 GiB/s BW)     (732 GiB/s BW)       (900 GiB/s BW)
OS                   CentOS 7.3.1611    SUSE Linux           Ubuntu 16.04.3
                                        Enterprise Server
                                        12 SP2
CUDA                 8.0.61             8.0.44               9.0
cuDNN                6.0                6.0                  7.0.5
GLPK                 4.63               4.63                 N/A
Caffe                1.0                1.0                  NVCaffe
                                                             v0.16.5 [16]
TensorFlow           N/A                1.4.1                N/A
IV Performance Evaluation
We evaluate the performance of µ-cuDNN on three different GPU architectures: NVIDIA Tesla K80 [17], P100-SXM2 [18], and V100-SXM2 [19], on the TSUBAME-KFC/DL, TSUBAME 3, and DGX-1 supercomputers, respectively. The specifications of these supercomputers are listed in Table I.
Throughout the evaluation, we use the single-precision floating point format and store tensors in the NCHW storage order. We use three different deep learning frameworks for the evaluation: Caffe [7], its NVIDIA branch (NVCaffe) [16], and TensorFlow [8]. All of them support recent versions of cuDNN (6 or 7). We use a built-in benchmarking command (Caffe’s “time” command) or an official benchmarking script (from the TensorFlow models repository [20]) to measure the execution time of the forward and backward passes, and report the sum of both passes together. In the following sections, unless explicitly mentioned, each forward-backward pass is measured 50 times on Caffe and 100 times on TensorFlow.
For neural networks, we use AlexNet [1], ResNet [2], and DenseNet [21]. For the evaluations on Caffe, we use the AlexNet model defined in Caffe, and ResNet-18 and ResNet-50 from NVCaffe. We modify the data prefetching size from 4 to 16 for AlexNet and ResNet-18 on TSUBAME 3. For the evaluations on TensorFlow, we use the definitions in an official benchmarking repository [22].
As for the workspace limit, unless explicitly mentioned, we use 8 MiB and 64 MiB per layer, which are the default workspace size limits of Caffe and Caffe2 [13], respectively. In addition, we use 512 MiB of workspace per layer to investigate the case where sufficiently large workspace is provided. To shorten the benchmarking time, we use several GPUs on the same node with the parallel evaluation function of µ-cuDNN mentioned in Section III-D.
IV-A Convolution Kernel Optimization Using WR
Fig. 9 shows the execution time of forward convolution (cudnnConvolutionForward) of the “conv2” layer of AlexNet on the P100-SXM2. With a workspace size of 64 MiB, the GEMM (GEneral Matrix-Matrix multiply)-based algorithm is the one chosen by cuDNN, requiring only 4.3 KiB of workspace if the mini-batch is not divided. On the other hand, FFT-based convolution [12] is more efficient, although it requires an excessive amount of workspace (213 MiB) to store the images and filters in the frequency domain. µ-cuDNN with the powerOfTwo option successfully enables the use of FFT within the workspace size constraints, using 48.9 MiB for micro-batches of size 32.
The all option also enables µ-cuDNN to use Winograd convolution [23], an algorithm that is especially efficient for small convolution kernels, achieving a 2.33x speedup over undivided in total.
IV-B CNN Optimization Using WR
We evaluate WRbased optimization on two different deep learning frameworks: Caffe and TensorFlow.
IV-B1 Caffe
Fig. 10 shows timing breakdowns of Caffe running AlexNet on three different GPUs. We only highlight convolutional layers, since the others (e.g., pooling) are out of the scope of this paper.
One important observation from Fig. 10 is that the performance improvement of µ-cuDNN over cuDNN (which is equivalent to the undivided option) is significant when a moderate amount of workspace is set by users. For instance, if the workspace size per kernel is 64 MiB, µ-cuDNN with the all option achieves a 1.81x speedup with respect to the entire iteration, and 2.10x with respect to convolutions alone, over undivided on the K80. This is because µ-cuDNN successfully enables cuDNN to use faster algorithms, as in the example from Section IV-A. Similar speedups are achieved on the P100-SXM2 (1.40x for the entire iteration, and 1.63x for convolutions alone) and on the V100-SXM2 (1.47x for the entire iteration, and 1.63x for convolutions alone).
In the case where the workspace size is limited to 8 MiB, µ-cuDNN cannot attain any performance improvement, because even if the mini-batch is finely divided, the specified workspace is too small to exploit faster algorithms. Indeed, on the P100-SXM2, only one kernel with the all option increases the utilization of the workspace over undivided.
On the other hand, when the workspace size limit is very large (512 MiB) on the K80 and P100-SXM2 GPUs, the performance difference between µ-cuDNN and cuDNN is negligible. This is because there is no benefit from dividing the mini-batch, as all algorithms fit within the workspace constraints. However, this workspace limit consumes a considerable amount of memory: while the undivided option consumes 2.87 GiB in total, all with a 64 MiB limit consumes only 0.70 GiB, albeit with a 4% overhead caused by the choice of micro-batch algorithms.
From the viewpoint of time-to-optimization, including kernel benchmarking and solving the DP, powerOfTwo considerably outperforms all. In particular, with 64 MiB of workspace on the P100-SXM2, all takes 34.16 s, whereas powerOfTwo takes 3.82 s. This result and Fig. 10 imply that powerOfTwo is a reasonable choice to quickly test the computational efficiency of new CNNs. Generally, the overhead of µ-cuDNN is negligible with respect to the entire training time, in which the forward and backward passes are repeated hundreds of thousands of times.
IV-B2 TensorFlow
Fig. 11 presents timing breakdowns of AlexNet, ResNet-50, and DenseNet-40 on the P100-SXM2.
We set the (input width, output width) to (224, 1000) for AlexNet and ResNet-50, or (32, 10) for DenseNet-40, which are used for training on the ILSVRC2012 classification dataset [24] and the CIFAR dataset [25], respectively. We also set the growth rate k of DenseNet-40, the number of feature maps added by each convolutional layer, to 40 to obtain better computational efficiency.
Since TensorFlow 1.4.1 does not pass any workspace limits to cuDNN via cuDNN’s benchmarking functions before the actual convolutions, we manually provide workspace limits of 8, 64, and 512 MiB to µ-cuDNN. µ-cuDNN with a workspace limit of 64 MiB achieves a 1.24x speedup for AlexNet, and 1.06x for ResNet-50. These results demonstrate that µ-cuDNN has good performance portability between different deep learning frameworks that depend on cuDNN.
IV-C Memory Consumption Using WR
Fig. 12 shows the per-layer memory usage of AlexNet and ResNet-18 on the P100-SXM2. In Fig. 12, we set a per-layer workspace limit of 512 MiB for cuDNN, and 64 MiB for µ-cuDNN, where the slowdown due to the decreased memory limit is negligible (1.17x). These figures clearly show that µ-cuDNN can cut down per-layer memory consumption by up to 3.43x on AlexNet and 2.73x on ResNet-18.
IV-D CNN Optimization Using WD
Fig. 13 shows the benchmark results of using the WD algorithm. The adjoined bars have the same total workspace limit: for example, since AlexNet has five convolutional layers and each layer has three kernels (Forward, BackwardData, BackwardFilter), we place the result with a 120 MiB WD workspace next to that of 8 MiB WR workspaces.
In Fig. 13, we can see that the training time decreases as the workspace constraint increases, for both WR and WD. At the same time, WD manages the global memory requirements better, attaining higher performance with the same overall memory footprint (see Fig. 14 for a breakdown). Specifically, when 120 MiB of workspace in total is provided for AlexNet, the execution time with WD optimization and the all option is 1.24x faster than WR with the undivided option for the entire iteration (or 1.38x for convolution). WD also outperforms the baseline with 960 MiB of total workspace, which can use 8 times more workspace memory, by 1.24x in total execution time.
Furthermore, even for ResNet-50, which has 10 times more convolutional layers than AlexNet, WD achieves a 1.05x speedup for the entire iteration (or 1.14x for convolution) with 2,544 MiB of total workspace, outperforming the original version (which consumes 5,088 MiB) in terms of memory footprint as well. In addition, the ILP for ResNet-50 is still small enough to solve in practical time: when the workspace limit is set to 5,088 MiB, the number of 0-1 variables is 562, and the GLPK solver takes 5.46 ms to solve it.
The main reason that WD outperforms WR is that in WR, if µ-cuDNN fails to find better algorithms and micro-batch sizes that fully utilize the assigned workspace, it must abandon that workspace slot and cannot allocate it to other kernels. In WD, on the other hand, the differing desirable workspace sizes of different kernels (Fig. 8) are implicitly considered in the ILP-based optimization framework. Therefore, µ-cuDNN can assign proportionally larger workspaces to time-consuming layers, if the kernels are expected to be considerably faster with a larger workspace.
In Fig. 14, µ-cuDNN with the WD policy devotes most of the workspace to “conv2” and “conv3” (93.7%), which are the most time-consuming layers in the baseline (WR, undivided). In contrast, µ-cuDNN does not allocate more than 3 MiB of workspace to “conv4” and “conv5”, although it lists some configurations that are faster and desirable compared to the baseline. For instance, the fastest configuration of conv5 (forward), which uses FFT-based convolution with two micro-batches, is 1.29x faster than the baseline, but uses 109 MiB of workspace. This observation implies that WD does not unnecessarily allocate workspace to a specific layer, but chooses the best combination, as defined by the ILP.
V Related Work
Li et al. [26] propose a heuristic to automatically tune each tensor memory layout to utilize either GEMM-based or FFT-based convolution efficiently. The proposed heuristic is, however, based on the authors’ performance observations using conventional convolutional layers and a specific GPU architecture, and thus there is no guarantee that the algorithm always provides the best memory layout for every deep neural network and GPU architecture. In contrast, since µ-cuDNN uses dynamic programming and integer linear programming, it is mathematically guaranteed that µ-cuDNN provides the best performance that the library can produce, provided that each convolution is independent of the others.
Rhu et al. [27] propose a memory management technique that offloads neuron activations, parameters, and errors from GPU memory to CPU memory during forward/backward propagation, so that larger models can be trained under the same memory constraint. However, as Fig. 12 shows, even with such memory-efficient implementations or similar memory management techniques [28], µ-cuDNN is expected to reduce the peak memory usage of each layer.
Zlateski et al. [29] propose ZNNi, an FFT-based convolution algorithm, and mention a micro-batching technique to reduce the temporary memory used by the FFT. µ-cuDNN, however, generalizes the scheme so that micro-batching can be applied to any convolution algorithm, obtaining the best computational performance for the given layer configurations, while maintaining high portability between existing deep learning frameworks.
VI Conclusion
In this paper, we proposed µ-cuDNN, a wrapper library for cuDNN that divides the mini-batch to utilize high-performance convolution algorithms under a limited amount of workspace memory. We have shown that µ-cuDNN works well even with recent CNNs, which are composed of many convolutional layers, and can easily be integrated into existing deep learning frameworks.
The performance of µ-cuDNN demonstrated in our work suggests that other layer types can be optimized as well, provided they can be decomposed and computed by different algorithms. This is because µ-cuDNN does not exploit any special properties of the convolution operator, apart from gradient accumulation.
In addition, the result of the WD optimization (Fig. 14) provides us with the insight that allocating the same workspace memory to each convolutional layer is not necessarily effective, and that dynamic, adaptive assignment performs better. This observation should be beneficial for advanced deep learning frameworks that dynamically manage GPU memory to store tensors such as neuron data, weights, and their gradients, enabling further memory optimization.
Acknowledgment
This research was supported by the ETH Postdoctoral Fellowship (for T. B. N.), the Student Summer Research Fellowship (for Y. O.), and JST CREST Grant Numbers JPMJCR1303 and JPMJCR1687, Japan. Part of this work was conducted as research activities of the AIST-TokyoTech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL).
References
 [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems 25 (NIPS 2012), Dec 2012.
 [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2016), Jun 2016, pp. 770–778.
 [3] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,” CoRR, vol. abs/1706.02677, Jun 2017, http://arxiv.org/abs/1706.02677.
 [4] T. Akiba, S. Suzuki, and K. Fukuda, “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes,” CoRR, vol. abs/1711.04325, Nov 2017, https://arxiv.org/abs/1711.04325.
 [5] S. L. Smith, P.-J. Kindermans, and Q. V. Le, “Don’t Decay the Learning Rate, Increase the Batch Size,” CoRR, vol. abs/1711.00489, Nov 2017, https://arxiv.org/abs/1711.00489.
 [6] NVIDIA. NVIDIA cuDNN. https://developer.nvidia.com/cudnn. Accessed on 2017-11-23.
 [7] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional Architecture for Fast Feature Embedding,” arXiv preprint arXiv:1408.5093, 2014.
 [8] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” https://www.tensorflow.org/, Nov 2015.
 [9] Theano Development Team, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv e-prints, vol. abs/1605.02688, May 2016, http://arxiv.org/abs/1605.02688.
 [10] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a Next-Generation Open Source Framework for Deep Learning,” in Proceedings of the Workshop on Machine Learning Systems (LearningSys) in the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS 2015), Dec 2015.
 [11] A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” arXiv preprint, vol. abs/1404.5997, Apr 2014, http://arxiv.org/abs/1404.5997.
 [12] M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional networks through FFTs,” in International Conference on Learning Representations (ICLR 2014), Apr 2014.
 [13] Facebook. Caffe2. https://caffe2.ai/. Accessed on 2017-11-23.
 [14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2015), Jun 2015.
 [15] A. Makhorin. GLPK (GNU Linear Programming Kit). https://www.gnu.org/software/glpk/. Accessed on 2017-11-23.
 [16] NVIDIA. NVIDIA Caffe. https://github.com/NVIDIA/caffe. Accessed on 2017-11-23.
 [17] ——. Tesla K80 HPC and Machine Learning Accelerator. http://www.nvidia.com/object/teslak80.html. Accessed on 2017-11-23.
 [18] ——. Tesla P100 Most Advanced Data Center Accelerator. http://www.nvidia.com/object/teslap100.html. Accessed on 2017-11-23.
 [19] ——. NVIDIA Tesla V100. https://www.nvidia.com/enus/datacenter/teslav100/. Accessed on 2018-3-1.
 [20] The TensorFlow Authors. tensorflow/models. https://github.com/tensorflow/models. Accessed on 2018-3-1.
 [21] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2017), Jul 2017.
 [22] The TensorFlow Authors. tensorflow/benchmarks. https://github.com/tensorflow/benchmarks. Accessed on 2018-3-1.
 [23] A. Lavin and S. Gray, “Fast Algorithms for Convolutional Neural Networks,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2016), Jun 2016, pp. 4013–4021.
 [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 [25] A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,” https://www.cs.toronto.edu/~kriz/learningfeatures2009TR.pdf, Tech. Rep., Apr 2009.
 [26] C. Li, Y. Yang, M. Feng, S. Chakradhar, and H. Zhou, “Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’16). Piscataway, NJ, USA: IEEE Press, Nov 2016, pp. 54:1–54:12.
 [27] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, “vDNN: Virtualized deep neural networks for scalable, memoryefficient neural network design,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2016), Oct 2016.
 [28] K. Shirahata, Y. Tomita, and A. Ike, “Memory reduction method for deep neural network training,” in 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP 2016), Sep 2016.
 [29] A. Zlateski, K. Lee, and H. S. Seung, “ZNNi: Maximizing the Inference Throughput of 3D Convolutional Networks on CPUs and GPUs,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’16), Nov 2016, pp. 854–865.