μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching

Yosuke Oyama1, Tal Ben-Nun2, Torsten Hoefler2, Satoshi Matsuoka3,1
1Department of Mathematical and Computing Science, Tokyo Institute of Technology, Tokyo, Japan
2Department of Computer Science, ETH Zurich, Zurich, Switzerland
3RIKEN Center for Computational Science, Hyogo, Japan
oyama.y.aa@m.titech.ac.jp, {talbn,htor}@inf.ethz.ch, matsu@acm.org
Abstract

NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used in deep learning. Specifically, cuDNN implements several equivalent convolution algorithms, whose performance and memory footprint may vary considerably depending on the layer dimensions. When an algorithm is automatically selected by cuDNN, the decision is performed on a per-layer basis, and thus it often resorts to slower algorithms that fit the workspace size constraints. We present μ-cuDNN, a transparent wrapper library for cuDNN, which divides layers' mini-batch computation into several micro-batches. Based on Dynamic Programming and Integer Linear Programming, μ-cuDNN enables faster algorithms by decreasing the workspace requirements. At the same time, μ-cuDNN keeps the computational semantics unchanged, so that it safely decouples statistical efficiency from hardware efficiency. We demonstrate the effectiveness of μ-cuDNN over two frameworks, Caffe and TensorFlow, achieving speedups of 1.63x for AlexNet and 1.21x for ResNet-18 on the P100-SXM2 GPU. These results indicate that using micro-batches can seamlessly increase the performance of deep learning, while maintaining the same memory footprint.

I Introduction

Prevalent Deep Neural Networks (DNNs) are becoming increasingly deep and are trained with large batch sizes. Specifically, state-of-the-art DNNs contain hundreds of layers [1, 2], and utilize batch sizes on the order of thousands [3, 4, 5].

Large batches are also favored by distributed data-parallel deep learning frameworks, because they improve the utilization of accelerators and allow the communication of parameter gradients to be hidden behind computation efficiently. Consequently, the batch size per accelerator (e.g., GPU) should be large to achieve better scaling. Since the memory usage of a DNN is nearly proportional to the layer size and the batch size, accelerator memory tends to be used at full capacity in most real-world cases.

This "limited memory scenario" also manifests in cuDNN [6], a deep learning kernel library for NVIDIA GPUs. cuDNN provides a variety of computational primitives for deep neural networks, and is widely used in deep learning frameworks such as Caffe [7] and others [8, 9, 10]. cuDNN provides up to eight different algorithms to perform convolutions, each of which requires a different temporary storage (workspace) scheme. To help users determine the best algorithm for a given maximum workspace size, cuDNN provides a function cudnnGetConvolution*Algorithm (where * is one of the convolution types Forward, BackwardData, and BackwardFilter) that benchmarks all the algorithms and chooses the best one, either with respect to computation time or memory usage. However, if the workspace size requested by a fast algorithm is even one byte larger than the provided limit, cuDNN resorts to a slower algorithm that requires less workspace. In fact, the performance impact can be as large as 4.51x in the second convolutional layer of AlexNet, as shown in Fig. 1.

(a) Execution time of all layers.
(b) Workspace size vs. execution time of conv2; the two markers represent the "Best" and the "-1 byte" cases, respectively.
Fig. 1: Execution time of cuDNN 7.0.1 forward convolution of single-column AlexNet [11] with different workspace sizes. The "Best" case always chooses the fastest algorithm regardless of workspace size, while in the "-1 byte" case the maximum workspace size is limited to 1 byte less than what the best algorithm requires.
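As a minimal sketch of how this interface is typically used (assuming the cuDNN 6/7 API; descriptor setup is elided and the 64 MiB limit is illustrative), a framework queries the algorithm and its workspace requirement per layer as follows:

  // Sketch: ask cuDNN (v6/v7) for the fastest forward-convolution algorithm that
  // fits a given per-layer workspace limit. Descriptor creation and setup for
  // x, w, conv and y are elided; the 64 MiB budget is illustrative.
  #include <cudnn.h>

  cudnnConvolutionFwdAlgo_t pickAlgorithm(cudnnHandle_t handle,
                                          cudnnTensorDescriptor_t xDesc,
                                          cudnnFilterDescriptor_t wDesc,
                                          cudnnConvolutionDescriptor_t convDesc,
                                          cudnnTensorDescriptor_t yDesc) {
    const size_t workspaceLimit = 64UL << 20;  // 64 MiB per-layer budget
    cudnnConvolutionFwdAlgo_t algo;
    // Fastest algorithm whose workspace fits within the limit.
    cudnnGetConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                        CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT,
                                        workspaceLimit, &algo);
    // Workspace actually required by the chosen algorithm.
    size_t workspaceSize = 0;
    cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc, yDesc,
                                            algo, &workspaceSize);
    // If a faster algorithm needs workspaceLimit + 1 bytes, cuDNN silently falls
    // back to a slower one; this is the inefficiency that mu-cuDNN targets.
    return algo;
  }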

In this paper, we propose μ-cuDNN, a transparent wrapper for cuDNN that attempts to mitigate the aforementioned inefficiency. In order to utilize fast convolution algorithms with a limited workspace size, μ-cuDNN automatically divides a layer's mini-batch computation into several micro-batches and performs multiple convolutions sequentially. μ-cuDNN decouples the statistical efficiency (speed of accuracy/loss improvement with a fixed number of parameter updates) from the hardware efficiency (speed of computation with a fixed number of parameter updates), improving only the latter. Using micro-batches, μ-cuDNN improves the utilization of the accelerators without incurring any reduction in training accuracy.

The contributions of this paper are as follows:

  • We present a method to automatically divide mini-batch training into several “micro-batches”, so that faster algorithms are utilized with tight workspace constraints.

  • We propose two different workspace allocation policies, which enable optimization of multiple convolutional layers with inter-dependencies.

  • We evaluate μ-cuDNN over two different deep learning frameworks, Caffe and TensorFlow, showing that it can mitigate the inefficiency of cuDNN even with state-of-the-art Convolutional Neural Networks (CNNs), such as AlexNet and ResNet.

II The Anatomy of Convolutional Neural Networks

Convolution operations in Convolutional Neural Networks (CNNs) apply multiple filters to a batch of channels of two-dimensional data (Algorithm 1, Fig. 2). In particular, input and output tensors are represented as four-dimensional tensors with dimensions $(N, C, H, W)$, where $N$ is the mini-batch size, $C$ is the number of channels, and $H$ and $W$ represent image height and width, respectively. Similarly, the filter tensor $F$ is represented as a four-dimensional $(K, C, V, U)$ tensor, where $K$ is the number of output channels and $V$ and $U$ represent kernel height and width.

1:  for (n = 0; n < N; n++)          // Mini-batch loop
2:   for (k = 0; k < K; k++)         // Output channel loop
3:    for (h = 0; h < H; h++)        // Height loop
4:     for (w = 0; w < W; w++)       // Width loop
5:      for (c = 0; c < C; c++)      // Input channel loop
6:       for (u = 0; u < U; u++)     // Kernel width loop
7:        for (v = 0; v < V; v++)    // Kernel height loop
8:          y[n, k, h, w] += F[k, c, v, u] * x[n, c, h + v, w + u];
Algorithm 1 Pseudo-code of two-dimensional convolution.
Fig. 2: Two-dimensional convolution. Each element of $y$ is the sum of element-wise products between a partial area of $x$ and one filter from $F$.

The two-dimensional convolution is composed of seven nested loops (Algorithm 1). The innermost three loops compute the actual convolution, where one element of the input tensor $x$ is multiplied and accumulated into one element of the output tensor $y$. The remaining loops iterate over all elements of $y$. The key observation with respect to the problem described in Section I is that there is no dependency between different iterations of the mini-batch loop. This is intuitive because in training or inference we compute parameter gradients or outputs with respect to different data samples, so it is equivalent to computing different CNNs concurrently. This observation motivates us to apply loop tiling to the mini-batch loop, so that we can reduce the resident workspace size.

The only exception to the inter-sample independence is the computation of parameter gradients:

\[ \frac{\partial \mathcal{L}}{\partial F} \;=\; \sum_{i=1}^{N} \frac{\partial \ell_i}{\partial F} \;=\; \sum_{i=1}^{N} x_i \star \frac{\partial \ell_i}{\partial y_i}, \]

where $\mathcal{L}$ and $\ell_i$ are the loss function with respect to a mini-batch and a single sample respectively, and $\star$ is the convolution operation [12]. The semantics of this computation are, however, not violated by the loop splitting, as long as each of the iterations is performed sequentially and the partial gradients are accumulated in order.

In cuDNN, there are three operations related to the two-dimensional convolution: Forward for the forward computation (Fig. 2), BackwardData for computing neuron errors in back-propagation, and BackwardFilter for computing parameter gradients in back-propagation.

Although Forward and BackwardData can directly be divided into several micro-batches, BackwardFilter cannot, since there are output dependencies on the accumulated parameter-gradient tensor $\partial\mathcal{L}/\partial F$. However, we can still divide the loop by running BackwardFilter multiple times while accumulating the results, i.e., setting the output scaling factor of cuDNN to accumulate into the output. Therefore, loop splitting can be achieved by repeating cuDNN kernels one or more times for any convolution-related operation, regardless of the underlying method.
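For illustration, the BackwardFilter case can be sketched as follows (a minimal sketch, not μ-cuDNN's actual implementation: descriptor setup, algorithm selection, and error checking are elided, and contiguous NCHW float tensors with N divisible by microN are assumed):

  // Sketch: the filter gradient dw over a mini-batch of size N is computed by
  // running cudnnConvolutionBackwardFilter on micro-batches of size microN and
  // accumulating the partial results via the output scaling factor beta.
  // xDesc/dyDesc describe a single micro-batch of size microN; xSampleSize and
  // dySampleSize are the number of float elements per sample.
  #include <cudnn.h>

  void backwardFilterMicroBatched(cudnnHandle_t handle,
                                  cudnnTensorDescriptor_t xDesc, const float* x,
                                  cudnnTensorDescriptor_t dyDesc, const float* dy,
                                  cudnnConvolutionDescriptor_t convDesc,
                                  cudnnConvolutionBwdFilterAlgo_t algo,
                                  void* workspace, size_t workspaceSize,
                                  cudnnFilterDescriptor_t dwDesc, float* dw,
                                  int N, int microN,
                                  size_t xSampleSize, size_t dySampleSize) {
    const float alpha = 1.0f;
    for (int n = 0; n < N; n += microN) {
      // beta = 0 overwrites dw for the first micro-batch; beta = 1 accumulates
      // the remaining ones into it. Sequential execution preserves the
      // semantics of the undivided kernel.
      const float beta = (n == 0) ? 0.0f : 1.0f;
      cudnnConvolutionBackwardFilter(handle, &alpha,
                                     xDesc,  x  + n * xSampleSize,
                                     dyDesc, dy + n * dySampleSize,
                                     convDesc, algo, workspace, workspaceSize,
                                     &beta, dwDesc, dw);
    }
  }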

III μ-cuDNN

μ-cuDNN is a transparent C++ wrapper library for cuDNN, which can easily be integrated into most deep learning frameworks [7, 13, 8, 10]. The key concept of μ-cuDNN is that it automatically divides a mini-batch into several smaller batches (referred to as "micro-batches" in this paper) and optimizes their sizes, so as to utilize faster convolution algorithms (Fig. 3).

Fig. 3: The conceptual timeline of μ-cuDNN. "@256" means that the computation is executed with a batch size of 256. μ-cuDNN splits one convolution operation into one or more convolutions over disjoint subsets of the mini-batch.

III-A μ-cuDNN Methodology

The μ-cuDNN library employs one of two workspace utilization policies to optimize micro-batches for convolution kernels (Fig. 4):

  • Workspace Reuse (WR): WR allocates one workspace per layer, sharing the space among the layer's micro-batches. In this scheme, each layer is assumed to use its workspace exclusively, hence the total size of the workspaces is proportional to the number of convolutional layers.

  • Workspace Division (WD): WD allocates one workspace per network, and assigns a different segment of it to each convolutional layer. WD enables small groups of convolution operations, as in the Inception module [14], to run concurrently with larger workspaces. In WD, the actual workspace is managed by μ-cuDNN rather than the deep learning framework, because conventional frameworks allocate each workspace separately and lack a global view of the entire network's workspace requirements.

WR and WD both rely on the parameters of one or more convolution kernels, the mini-batch size, and the maximum workspace size. The output of μ-cuDNN is a division of the mini-batch into micro-batches, together with a "micro-configuration" for each of them: a pair $\langle b, A \rangle$ of a micro-batch size $b$ and a convolution algorithm $A$. In this paper, we define the "configuration" of a segmented convolution kernel as a list of micro-configurations. For example, if a kernel with a mini-batch size of 256 is equally divided into four micro-batches and each of them uses algorithm $A$, the configuration is represented as $[\langle 64, A \rangle, \langle 64, A \rangle, \langle 64, A \rangle, \langle 64, A \rangle]$. We also define the concatenation of two configurations, denoted $\oplus$, such that $[\langle 64, A \rangle] \oplus [\langle 192, A' \rangle] = [\langle 64, A \rangle, \langle 192, A' \rangle]$.

Fig. 4: Overview of WR and WD. μ-cuDNN optimizes micro-batch sizes and internally calls cuDNN functions, via the cuDNN interfaces.
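To make the notation concrete, a possible in-memory representation is sketched below (type and function names are illustrative, not μ-cuDNN's actual internals; only the forward algorithm type is shown for brevity):

  // Illustrative data structures for micro-configurations and configurations.
  #include <cudnn.h>
  #include <vector>

  struct MicroConfig {
    int batchSize;                   // micro-batch size b, e.g. 64
    cudnnConvolutionFwdAlgo_t algo;  // algorithm A used for this micro-batch
    float time;                      // benchmarked execution time [ms]
    size_t workspace;                // required workspace [bytes]
  };

  // A configuration is a list of micro-configurations whose batch sizes sum to
  // the mini-batch size, e.g. [<64,A>, <64,A>, <64,A>, <64,A>] for a mini-batch of 256.
  using Configuration = std::vector<MicroConfig>;

  // Concatenation of two configurations (the "⊕" operation defined above).
  Configuration concat(const Configuration& a, const Configuration& b) {
    Configuration c = a;
    c.insert(c.end(), b.begin(), b.end());
    return c;
  }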

III-B WR Algorithm

The goal of the WR policy is to minimize $T(B)$, the total execution time for a mini-batch size of $B$, using Dynamic Programming (DP):

\[ T(b) = \min \begin{cases} t(b) \\ \min_{1 \le b' < b} \left\{ T(b') + T(b - b') \right\} \end{cases} \]

where $t(b)$ is the fastest execution time of one convolution kernel with a micro-batch size of $b$, within the workspace constraint. If the first row of the definition of $T$ is smaller than the second, μ-cuDNN does not have to divide the batch. Otherwise, it is beneficial to divide the batch into two or more parts, applying the process recursively (Fig. 5).

The key point of WR is that the optimal micro-configuration for each micro-batch size is deterministic and independent of other kernels, because we assume that multiple kernels do not run simultaneously.

The WR algorithm consists of three steps, where the mini-batch size is $B$ and the user-given maximum workspace size is $S$:

  1. For each micro-batch size $b \in \{1, \ldots, B\}$, WR benchmarks all available convolution algorithms with a micro-batch size of $b$ and a maximum workspace size of $S$, using cuDNN. We define the fastest micro-configuration as $\langle b, A_b \rangle$ (where $A_b$ is the fastest algorithm) and its execution time as $t(b)$.

  2. For each $b \in \{1, \ldots, B\}$, WR computes $T(b)$, the fastest execution time for a micro-batch size of $b$, and $C(b)$, the corresponding configuration, using the recurrence above (where $C(b) = [\langle b, A_b \rangle]$ if the batch is not divided, and $C(b) = C(b') \oplus C(b - b')$ for the best split $b'$ otherwise). $T(b)$ and $C(b)$ are memoized and reused in further iterations.

  3. WR outputs the optimal configuration $C(B)$.

Fig. 5: DP-based optimization of WR. Here we assume that convolution algorithm 4 with a micro-batch size of 60 (micro-configuration $\langle 60, 4 \rangle$) achieves better computational efficiency, hence it is repeatedly used.
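A compact sketch of this dynamic program, building on the Configuration structure sketched in Section III-A (benchmarkFastest stands in for the cuDNN benchmarking of step 1 and is an assumed helper):

  // Sketch of the WR dynamic program: T[b] is the fastest time for a batch of
  // size b, C[b] the corresponding configuration. benchmarkFastest(b) stands in
  // for step 1: benchmarking all cuDNN algorithms for micro-batch size b under
  // the workspace limit S and returning the fastest micro-configuration.
  #include <functional>
  #include <vector>

  Configuration optimizeWR(int B, const std::function<MicroConfig(int)>& benchmarkFastest) {
    std::vector<float> T(B + 1, 0.0f);
    std::vector<Configuration> C(B + 1);
    for (int b = 1; b <= B; ++b) {
      // First row of the recurrence: keep the batch of size b undivided.
      const MicroConfig best = benchmarkFastest(b);
      T[b] = best.time;
      C[b] = {best};
      // Second row: try every split b = b' + (b - b'), reusing memoized results.
      for (int bp = 1; bp < b; ++bp) {
        if (T[bp] + T[b - bp] < T[b]) {
          T[b] = T[bp] + T[b - bp];
          C[b] = concat(C[bp], C[b - bp]);
        }
      }
    }
    return C[B];  // optimal configuration for the full mini-batch
  }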

III-C WD Algorithm

In the WD scheme, configurations for multiple convolution kernels are optimized jointly, while keeping the total workspace size below the limit that users specify. WD is therefore a more complex problem than WR, since the configuration of each convolution kernel is no longer independent of the others, due to the total workspace size constraint.

To solve this problem, we formulate a 0-1 Integer Linear Programming (ILP) based optimization algorithm (Fig. 6). Given the set of kernels $K$ and the set of available configurations $C_k$ of each kernel $k \in K$, WD is solved by minimizing Equation 1 subject to Equations 2-4:

\[ \text{minimize} \quad \sum_{k \in K} \sum_{c \in C_k} T_{k,c} \, x_{k,c} \tag{1} \]
\[ \text{subject to} \quad \sum_{c \in C_k} x_{k,c} = 1 \quad (\forall k \in K) \tag{2} \]
\[ \sum_{k \in K} \sum_{c \in C_k} S_{k,c} \, x_{k,c} \le \hat{S} \tag{3} \]
\[ x_{k,c} \in \{0, 1\} \quad (\forall k \in K, \; \forall c \in C_k) \tag{4} \]

where $S_{k,c}$ and $T_{k,c}$ are the workspace size and execution time of kernel $k$ with configuration $c$, respectively. Equation 3 limits the total workspace size to the user-specified size $\hat{S}$. μ-cuDNN uses configuration $c$ on kernel $k$ if and only if $x_{k,c} = 1$, and exactly one configuration is selected for each kernel $k$, according to the constraint in Equation 2.

III-C1 Desirable Configuration Selection

The challenge of the above ILP-based algorithm is that if all possible configurations are evaluated (i.e., all combinations of micro-batch divisions and algorithms), the search space is exponential, containing on the order of $|A|^{B}$ configurations in total (where $A$ is the set of available algorithms and $B$ is the mini-batch size), which makes the problem impractically large.

Fig. 6: ILP-based optimization of WD. The problem is equivalent to stacking "time × memory" rectangles of configurations diagonally and obtaining the minimum total width (execution time), provided that the total height (workspace) is lower than $\hat{S}$. Each configuration is composed of one or more micro-configurations such as $\langle 64, A \rangle$.

Here we compute a Pareto front to remove undesirable configurations from the set of all possible configurations, without discarding any optimal solutions. The resulting Pareto front is then input to the ILP (Equations 1-4) to solve the entire problem.

First, we modify the DP algorithm from WR (Section III-B) to output a set of configurations, rather than only the fastest configuration, as follows:

\[ \mathcal{C}(b) = P\left( \{ [\langle b, A \rangle] \mid \langle b, A \rangle \in M(b) \} \;\cup \bigcup_{1 \le b' < b} \{ c' \oplus c'' \mid c' \in \mathcal{C}(b'), \; c'' \in \mathcal{C}(b - b') \} \right) \]

where $M(b)$ is the set of available micro-configurations of micro-batch size $b$, and $P$ is a pruning function described below. Note that $\mathcal{C}(B)$ contains the output of the WR algorithm as one of its elements, since the WR configuration is the fastest for its workspace requirement and is therefore never pruned.

Second, we define the "desirable configuration set" $D(\mathcal{C})$ of a configuration set $\mathcal{C}$ as a Pareto front in the two-dimensional (execution time × workspace size) space (Fig. 7):

\[ D(\mathcal{C}) = \{ c \in \mathcal{C} \mid \forall c' \in \mathcal{C} : \; S(c') \le S(c) \Rightarrow T(c) \le T(c') \} \]

where $T(c)$ and $S(c)$ are the execution time and required workspace size of a configuration $c$, respectively. This definition implies that any $c \in D(\mathcal{C})$ is the fastest configuration among all elements of $\mathcal{C}$ that use a workspace of size $S(c)$ or less. Conversely, if an element $c$ is not in $D(\mathcal{C})$, there is an element $c' \in D(\mathcal{C})$ that is faster than $c$ and requires no more workspace, hence there is no reason to choose $c$; it is "undesirable".

Fig. 7: The concept of the desirable set. Here $c$ cannot be in $D(\mathcal{C})$ because a $c'$ exists for which the condition is not satisfied.
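A minimal sketch of this pruning step (assuming the Configuration structure sketched in Section III-A; totalTime and totalWorkspace are illustrative helpers, with the workspace of a configuration taken as the maximum over its micro-batches, since they run sequentially and share one workspace):

  // Sketch of the "desirable set" computation: keep only configurations that
  // are Pareto-optimal in (execution time, workspace size).
  #include <algorithm>
  #include <limits>
  #include <vector>

  float totalTime(const Configuration& c) {
    float t = 0.0f;
    for (const MicroConfig& m : c) t += m.time;  // micro-batches run back-to-back
    return t;
  }

  size_t totalWorkspace(const Configuration& c) {
    size_t s = 0;
    for (const MicroConfig& m : c) s = std::max(s, m.workspace);  // shared, reused workspace
    return s;
  }

  std::vector<Configuration> desirableSet(std::vector<Configuration> all) {
    // Sort by ascending workspace, breaking ties by ascending time.
    std::sort(all.begin(), all.end(),
              [](const Configuration& a, const Configuration& b) {
                const size_t wa = totalWorkspace(a), wb = totalWorkspace(b);
                return wa != wb ? wa < wb : totalTime(a) < totalTime(b);
              });
    // Sweep: keep a configuration iff it is faster than every configuration
    // that uses the same or less workspace.
    std::vector<Configuration> pareto;
    float best = std::numeric_limits<float>::infinity();
    for (const Configuration& c : all) {
      if (totalTime(c) < best) {
        pareto.push_back(c);
        best = totalTime(c);
      }
    }
    return pareto;
  }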

The pruning drastically reduces the number of variables in Equation 1, and enables solving the ILP for state-of-the-art deep CNNs in practical time. For instance, the maximum number of desirable configurations over the AlexNet layers we examine in Section IV-D was 68, which is far smaller than the exponential order. Fig. 8 illustrates a Pareto front of one convolutional layer of AlexNet.

Fig. 8: Desirable configurations (i.e. a Pareto front) of AlexNet’s “conv2” layer (Forward) on P100-SXM2 with a maximum workspace size of 120 MiB, and a mini-batch size of 256. Colored bars corresponding to data points represent the division of the mini-batch and the chosen micro-batch algorithms. For example, the top-left point divides the mini-batch into two micro-batches of 128 and utilizes the FFT_TILING algorithm.

The validity of the pruning, i.e., that the optimal solution of the ILP never includes an undesirable configuration, is proved as follows:

Proof.

Suppose that an optimal solution $X = \{x_{k,c}\}$ of the ILP, where $X$ is the set of variable symbols of the ILP, contains an undesirable configuration $\hat{c}$ of a kernel $\hat{k}$, i.e., $x_{\hat{k},\hat{c}} = 1$ and $\hat{c} \notin D_{\hat{k}}$, where $D_{\hat{k}} \subseteq C_{\hat{k}}$ is the desirable set of kernel $\hat{k}$. According to the definition of desirable sets, there is a configuration $c'$ of $\hat{k}$ such that $T_{\hat{k},c'} < T_{\hat{k},\hat{c}}$ and $S_{\hat{k},c'} \le S_{\hat{k},\hat{c}}$. According to Equation 2, $x_{\hat{k},c} = 0$ for all $c \ne \hat{c}$.

Let $X' = \{x'_{k,c}\}$ be defined as

\[ x'_{k,c} = \begin{cases} 1 & \text{if } k = \hat{k} \text{ and } c = c' \\ 0 & \text{if } k = \hat{k} \text{ and } c \ne c' \\ x_{k,c} & \text{otherwise.} \end{cases} \]

$X'$ satisfies Equation 2 for $\hat{k}$ as

\[ \sum_{c \in C_{\hat{k}}} x'_{\hat{k},c} = x'_{\hat{k},c'} = 1, \]

and Equation 3 as

\[ \sum_{k \in K} \sum_{c \in C_k} S_{k,c} \, x'_{k,c} = \sum_{k \in K} \sum_{c \in C_k} S_{k,c} \, x_{k,c} - S_{\hat{k},\hat{c}} + S_{\hat{k},c'} \le \sum_{k \in K} \sum_{c \in C_k} S_{k,c} \, x_{k,c} \le \hat{S}. \]

Similarly, by replacing $S$ with $T$ in the inequality above, the objective value of $X'$ is shown to be lower than that of $X$, hence $X'$ is a better solution of the ILP. This contradicts the supposition that $X$ is optimal. ∎

III-D μ-cuDNN Implementation

To enable μ-cuDNN, the only modification that needs to be performed to the code is to replace the cuDNN handle type cudnnHandle_t with UcudnnHandle_t. The μ-cuDNN handle object is an opaque type that wraps the original type, such that users can call any cuDNN function with it. When a convolution operation or benchmarking function is called with the μ-cuDNN handle object, the μ-cuDNN library internally computes the optimal configuration, and returns a virtual algorithm ID and a required workspace size of zero. This mechanism enables users to adopt μ-cuDNN with minimal modification to the original code. For example, the number of lines to be modified to introduce μ-cuDNN into Caffe (v1.0) is approximately three.
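For illustration, framework code after the integration might look as follows (a sketch: the surrounding function is ours, descriptor setup is elided, and the μ-cuDNN header declaring UcudnnHandle_t is assumed to be included):

  // Sketch of framework code after integration: only the handle type changes.
  // mu-cuDNN intercepts this convolution call, splits the mini-batch into its
  // optimized micro-batches, and invokes cuDNN internally.
  #include <cudnn.h>

  void convolveLayer(UcudnnHandle_t handle,  // was: cudnnHandle_t handle
                     const float* alpha,
                     cudnnTensorDescriptor_t xDesc, const void* x,
                     cudnnFilterDescriptor_t wDesc, const void* w,
                     cudnnConvolutionDescriptor_t convDesc,
                     cudnnConvolutionFwdAlgo_t algo,
                     void* workspace, size_t workspaceSize,
                     const float* beta,
                     cudnnTensorDescriptor_t yDesc, void* y) {
    // Identical to the original cuDNN call site.
    cudnnConvolutionForward(handle, alpha, xDesc, x, wDesc, w, convDesc,
                            algo, workspace, workspaceSize, beta, yDesc, y);
  }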

The implementation of μ-cuDNN is based on overloading a subset of cuDNN functions, where the μ-cuDNN handle type is structured so that it behaves as the cuDNN handle for all other calls. We define a cast operator from the μ-cuDNN handle to the cuDNN handle so that the framework adopts this behavior automatically. Using this technique, μ-cuDNN delegates most functions to cuDNN, but overrides the functions related to convolutional layers.
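A conceptual sketch of this mechanism is shown below (not the actual μ-cuDNN source): the implicit cast operator delegates non-overloaded calls to cuDNN, while an exact-match overload intercepts convolution calls.

  // Conceptual sketch of the interception mechanism (not the actual mu-cuDNN source).
  #include <cudnn.h>

  class UcudnnHandle_t {
   public:
    // Any cuDNN function that is not overloaded receives the wrapped handle via
    // this implicit conversion and behaves exactly as plain cuDNN.
    operator cudnnHandle_t() const { return handle_; }
   private:
    cudnnHandle_t handle_;
  };

  // Exact-match overload for the wrapper handle: preferred by C++ overload
  // resolution over the C function (which would require a user-defined
  // conversion), so convolution calls are intercepted here and can be split
  // into micro-batches before delegating to cuDNN.
  cudnnStatus_t cudnnConvolutionForward(const UcudnnHandle_t& handle,
                                        const void* alpha,
                                        cudnnTensorDescriptor_t xDesc, const void* x,
                                        cudnnFilterDescriptor_t wDesc, const void* w,
                                        cudnnConvolutionDescriptor_t convDesc,
                                        cudnnConvolutionFwdAlgo_t algo,
                                        void* workspace, size_t workspaceSize,
                                        const void* beta,
                                        cudnnTensorDescriptor_t yDesc, void* y);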

The optimization algorithm in μ-cuDNN is based on the methodology described in Section III-A. In practice, μ-cuDNN provides a "batch size policy", which determines which micro-batch sizes are benchmarked in step 1 of the WR algorithm, as follows:

  • all uses all batch sizes $\{1, 2, \ldots, B\}$. Although this always finds the optimal solution, it requires $O(B)$ benchmark runs and is therefore time-consuming.

  • powerOfTwo uses only power-of-two batch sizes $\{1, 2, 4, \ldots, B\}$. This saves a considerable amount of benchmarking time, since only $O(\log B)$ micro-batch sizes are evaluated.

  • undivided uses only the original mini-batch size $B$. In WR, this option always selects the same configuration as cuDNN, hence it is only useful for evaluating the overhead of μ-cuDNN.

These policies can be specified via an environment variable or through a special library function of μ-cuDNN. Furthermore, μ-cuDNN supports parallel micro-configuration evaluation via an environment variable, in which the aforementioned micro-batches are distributed to different GPUs on the same computing node and benchmarked concurrently. This feature assumes that the node contains multiple homogeneous GPUs.

μ-cuDNN caches the optimized configurations and the benchmark results in memory and in an optional file-based database, respectively, to skip unnecessary recomputation. This is especially beneficial for networks that replicate convolutional layers of the same size, such as ResNet [2]. In addition, the file-based caching enables offline benchmarking, as well as sharing the results among a homogeneous GPU cluster via a network file system.

III-E Implementation of WD Optimization

To perform WD optimization, μ-cuDNN must know the number of convolutional layers and the corresponding layer parameters in advance, i.e., before running any kernel. In the current cuDNN API, however, the parameters are passed one layer at a time, and thus there is no way to obtain all the parameters collectively from deep learning frameworks.

To overcome this issue, we assume that the deep learning framework calls cudnnGetConvolution*Algorithm once for each layer prior to the computation of the entire network (e.g., training, inference). This is the most straightforward use of the cuDNN interface, as memory (including workspace) is usually allocated before initiating computation. Due to the specific implementation of Caffe, we add a μ-cuDNN library call after network initialization, which ignores subsequent cudnnGetConvolution*Algorithm calls.

When cudnnGetConvolution*Algorithm is called, μ-cuDNN pushes the kernel parameters to an internal list and returns a dummy result. Note that the returned results satisfy the semantics given by the cuDNN interface, so the framework will not raise errors and will not allocate its own workspaces. When cudnnConvolution* is called for the first time, μ-cuDNN executes the optimization algorithm (namely, WD). We use the GNU Linear Programming Kit (GLPK) [15] as the ILP solver.
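A conceptual sketch of this deferred-optimization flow (names and the dummy algorithm ID are illustrative, not μ-cuDNN's actual internals; it builds on the wrapper handle sketched above):

  // Conceptual sketch of the WD flow: algorithm queries are recorded and
  // answered with a dummy result; the ILP-based optimization runs lazily at
  // the first convolution call, when all kernels are known.
  #include <cudnn.h>
  #include <vector>

  struct KernelParams {
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
  };

  static std::vector<KernelParams> g_pendingKernels;
  static bool g_optimized = false;

  // Intercepted overload of the algorithm query for the wrapper handle.
  cudnnStatus_t cudnnGetConvolutionForwardAlgorithm(
      const UcudnnHandle_t& handle, cudnnTensorDescriptor_t xDesc,
      cudnnFilterDescriptor_t wDesc, cudnnConvolutionDescriptor_t convDesc,
      cudnnTensorDescriptor_t yDesc, cudnnConvolutionFwdPreference_t preference,
      size_t memoryLimit, cudnnConvolutionFwdAlgo_t* algo) {
    g_pendingKernels.push_back({xDesc, yDesc, wDesc, convDesc});
    *algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM;  // dummy "virtual" algorithm ID
    return CUDNN_STATUS_SUCCESS;  // the workspace-size query likewise reports zero
  }

  // Called at the beginning of the first intercepted cudnnConvolution* call.
  void ensureOptimized() {
    if (g_optimized) return;
    // Per kernel: benchmark micro-batches, build desirable sets, then solve the
    // ILP (e.g., with GLPK) under the global workspace budget.
    // runWDOptimization(g_pendingKernels);  // placeholder for the steps above
    g_optimized = true;
  }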

 

                      TSUBAME-KFC/DL      TSUBAME 3               DGX-1
CPU (Intel Xeon)      E5-2620 ×2          E5-2680 v4 ×2           E5-2698 v4 ×2
GPU (NVIDIA Tesla)    K80 ×4              P100-SXM2 ×4            V100-SXM2 ×8
  - Peak performance  8.73 SP TFlop/s     10.6 SP TFlop/s         15.7 SP TFlop/s
  - Memory            24 GiB GDDR5        16 GiB HBM2             16 GiB HBM2
                      (480 GiB/s BW)      (732 GiB/s BW)          (900 GiB/s BW)
OS                    CentOS 7.3.1611     SUSE Linux Enterprise   Ubuntu 16.04.3
                                          Server 12 SP2
CUDA                  8.0.61              8.0.44                  9.0
cuDNN                 6.0                 6.0                     7.0.5
GLPK                  4.63                4.63                    N/A
Caffe                 1.0                 1.0                     NVCaffe v0.16.5 [16]
TensorFlow            N/A                 1.4.1                   N/A

Table I: Evaluation Environment Specification.

IV Performance Evaluation

We evaluate the performance of μ-cuDNN on three different GPU architectures, NVIDIA Tesla K80 [17], P100-SXM2 [18], and V100-SXM2 [19], on the TSUBAME-KFC/DL, TSUBAME 3, and DGX-1 supercomputers, respectively. The specifications of these supercomputers are listed in Table I.

Throughout the evaluation, we use single-precision floating point and store tensors in the NCHW storage order. We use three different deep learning frameworks: Caffe [7], its NVIDIA branch (NVCaffe) [16], and TensorFlow [8]. All of them support recent versions of cuDNN (6 or 7). We use a built-in benchmarking command (Caffe's "time" command) or an official benchmarking script (from the TensorFlow models repository [20]) to measure the execution time of forward and backward passes, and report the sum of the two. In the following sections, unless explicitly mentioned, each forward-backward pass is measured 50 times on Caffe and 100 times on TensorFlow.

For neural networks, we use AlexNet [1], ResNet [2], and DenseNet [21]. For the evaluation on Caffe, we use the AlexNet model defined in Caffe, and ResNet-18 and ResNet-50 from NVCaffe. We increase the data prefetching size from 4 to 16 for AlexNet and ResNet-18 on TSUBAME 3. For the evaluation on TensorFlow, we use the model definitions from the official benchmarking repository [22].

As for the workspace limit, unless explicitly mentioned, we use 8 MiB and 64 MiB per layer, which are the default workspace size limits of Caffe and Caffe2 [13], respectively. In addition, we use 512 MiB of workspace per layer to investigate the case where a sufficiently large workspace is provided. To shorten the benchmarking time, we use several GPUs on the same node with the parallel evaluation function of μ-cuDNN mentioned in Section III-D.

Fig. 9: Benchmark results of forward convolution of AlexNet’s “conv2” layer on P100-SXM2. We use 64 MiB workspace size and a mini-batch size of 256. Numbers on each rectangle represent micro-batch sizes.
(a) K80
(b) P100-SXM2
(c) V100-SXM2
Fig. 10: Benchmark results of AlexNet on three different GPUs with different workspace sizes (8, 64, 512 MiB). The labels “u”, “p” and “a” represent undivided, powerOfTwo, and all, respectively. We use a mini-batch size of 256 on K80 and P100-SXM2, and 1024 on V100-SXM2.

IV-A Convolution Kernel Optimization Using WR

Fig. 9 shows the execution time of forward convolution (cudnnConvolutionForward) of the "conv2" layer of AlexNet on P100-SXM2. With a workspace size of 64 MiB, the GEMM (GEneral Matrix-Matrix multiply)-based algorithm is the one chosen by cuDNN, requiring only 4.3 KiB of workspace if the mini-batch is not divided. On the other hand, FFT-based convolution [12] is more efficient, although it requires an excessive amount of workspace (213 MiB) to store the images and filters in the frequency domain. μ-cuDNN with the powerOfTwo option successfully enables the use of FFT within the workspace constraint, using 48.9 MiB over micro-batches of size 32.

The all option additionally enables μ-cuDNN to use Winograd convolution [23], an algorithm that is especially efficient for small convolution kernels, achieving a 2.33x speedup over undivided in total.

IV-B CNN Optimization Using WR

We evaluate WR-based optimization on two different deep learning frameworks: Caffe and TensorFlow.

IV-B1 Caffe

Fig. 10 shows timing breakdowns of Caffe running AlexNet on three different GPUs. Note that we only highlight convolutional layers, since the other layers (e.g., pooling) are out of the scope of this paper.

One important observation from Fig. 10 is that the performance improvement of μ-cuDNN over cuDNN (which is equivalent to undivided) is significant when a moderate amount of workspace is set by the user. For instance, if the workspace size per kernel is 64 MiB, μ-cuDNN with the all option achieves a 1.81x speedup over undivided on K80 with respect to the entire iteration, and 2.10x with respect to convolutions alone. This is because μ-cuDNN successfully enables cuDNN to use faster algorithms, as in the example from Section IV-A. Similar speedups are achieved on P100-SXM2 (1.40x for the entire iteration, 1.63x for convolutions alone) and on V100-SXM2 (1.47x for the entire iteration, 1.63x for convolutions alone).

In the case where the workspace size is limited to 8 MiB, μ-cuDNN cannot attain any performance improvement, because even if the mini-batch is finely divided, the specified workspace is too small to exploit. Indeed, on P100-SXM2, only one kernel under the all option increases its workspace utilization over undivided.

On the other hand, when the workspace size limit is very large (512 MiB) on the K80 and P100-SXM2 GPUs, the performance difference between cuDNN and μ-cuDNN is negligible. This is because there is no benefit from dividing the mini-batch, as all algorithms fit into the workspace constraint. However, this workspace limit consumes a considerable amount of memory: while the undivided option consumes 2.87 GiB of workspace in total, all with a 64 MiB limit consumes only 0.70 GiB, at the cost of a 4% overhead caused by the choice of micro-batch algorithms.

In terms of optimization time, including kernel benchmarking and solving the DP, powerOfTwo considerably outperforms all. In particular, with a 64 MiB workspace on P100-SXM2, all takes 34.16 s, whereas powerOfTwo takes 3.82 s. This result and Fig. 10 imply that powerOfTwo is a reasonable choice for quickly testing the computational efficiency of new CNNs. In general, the overhead of μ-cuDNN is negligible with respect to the entire training time, in which the forward and backward passes are repeated hundreds of thousands of times.

(a) AlexNet
(b) ResNet-50
(c) DenseNet-40 (k = 40)
Fig. 11: Benchmark results of different CNNs on P100-SXM2 with different workspace sizes (8, 64, 512 MiB), using TensorFlow framework. We use a mini-batch size of 256 for AlexNet and DenseNet, and 64 for ResNet-50.

IV-B2 TensorFlow

Fig. 11 presents timing breakdowns of AlexNet, ResNet-50, and DenseNet-40 on P100-SXM2.

We set the input and output dimensions to those used for training on the ILSVRC2012 classification dataset [24] for AlexNet and ResNet-50, and on the CIFAR dataset [25] for DenseNet-40. We also set k, the number of feature maps of each convolutional layer of DenseNet-40, to 40 to obtain better computational efficiency.

Since TensorFlow 1.4.1 does not pass any workspace limits to μ-cuDNN via cuDNN's benchmarking functions before the actual convolutions, we manually provide workspace limits of 8, 64, and 512 MiB to μ-cuDNN. μ-cuDNN with a workspace limit of 64 MiB achieves a 1.24x speedup for AlexNet and 1.06x for ResNet-50. These results show that μ-cuDNN has good performance portability across deep learning frameworks that depend on cuDNN.

IV-C Memory Consumption Using WR

Fig. 12 shows the per-layer memory usage of AlexNet and ResNet-18 on P100-SXM2. Here, we set a per-layer workspace limit of 512 MiB for cuDNN and 64 MiB for μ-cuDNN, where the slowdown due to the smaller memory limit is negligible (1.17x). The figure clearly shows that μ-cuDNN reduces per-layer memory consumption by up to 3.43x on AlexNet and 2.73x on ResNet-18.

(a) AlexNet (cuDNN)
(b) AlexNet (μ-cuDNN)
(c) ResNet-18 (cuDNN)
(d) ResNet-18 (μ-cuDNN)
Fig. 12: Per-layer breakdowns of memory consumption of AlexNet and ResNet-18 on P100-SXM2. For simplicity, we only show the memory usage of the unique convolutional and fully-connected layers in one forward propagation. We use a mini-batch size of 256 for AlexNet and 128 for ResNet-18. We set a per-layer workspace limit of 512 MiB for cuDNN and 64 MiB for μ-cuDNN. Each bar segment of "WS (μ-cuDNN)" represents the maximum workspace size of the layer.

IV-D CNN Optimization Using WD

Fig. 13 shows the benchmark results using the WD algorithm. Adjoined bars have the same total workspace limit: for example, since AlexNet has five convolutional layers and each layer has three kernels (Forward, BackwardData, BackwardFilter), we place the result with a 120 MiB WD workspace (5 × 3 × 8 MiB) next to that of 8 MiB WR workspaces.

(a) AlexNet
(b) ResNet-50
Fig. 13: Benchmark results of AlexNet and ResNet-50 on P100-SXM2 with different workspace sizes and policies (WR and WD). We use a mini-batch size of 256 for AlexNet and 32 for ResNet-50. Note that the adjoined bars have the same workspace limit in total.

Fig. 13 shows that the training time decreases as the workspace constraint increases under both WR and WD. At the same time, WD manages the global memory requirements better, attaining higher performance with the same overall memory footprint (see Fig. 14 for a breakdown). Specifically, when 120 MiB of workspace in total is provided for AlexNet, WD optimization with the all option is 1.24x faster than WR with the undivided option for the entire iteration (1.38x for convolutions). WD also outperforms, by 1.24x in total execution time, the baseline configured with 960 MiB of total workspace, which may use 8 times more memory for workspace.

Fig. 14: Assigned workspace division of AlexNet on P100-SXM2. “F”, “BF”, “BD” represent kernel types (Forward, BackwardFilter, BackwardData respectively). We use a mini-batch size of 256 for AlexNet. We set a workspace limit of 8 MiB for WR, and a total workspace limit of 120 MiB for WD.

Furthermore, even for ResNet-50, which has 10 times more convolutional layers than AlexNet, WD achieves a 1.05x speedup for the entire iteration (1.14x for convolutions) with 2,544 MiB of total workspace, outperforming the original version (which consumes 5,088 MiB) in terms of memory footprint as well. In addition, the ILP for ResNet-50 is still small enough to solve in practical time: when the workspace limit is set to 5,088 MiB, the number of 0-1 variables is 562, and the GLPK solver takes 5.46 ms.

The main reason WD outperforms WR is that, in WR, if μ-cuDNN fails to find better algorithms and micro-batch sizes that fully utilize the assigned workspace, it must abandon that workspace slot and cannot allocate it to other kernels. In WD, on the other hand, the different desirable workspace sizes of different kernels (Fig. 8) are implicitly considered by the ILP-based optimization. Therefore, μ-cuDNN can assign proportionally larger workspaces to time-consuming layers, if the corresponding kernels are expected to be considerably faster with a larger workspace.

In Fig. 14, μ-cuDNN with the WD policy devotes most of the workspace (93.7%) to "conv2" and "conv3", which are the most time-consuming layers in the baseline (WR, undivided). In contrast, μ-cuDNN does not allocate more than 3 MiB of workspace to "conv4" and "conv5", although it lists some desirable configurations that are faster than the baseline. For instance, the fastest configuration of conv5 (Forward), which uses FFT-based convolution with two micro-batches, is 1.29x faster than the baseline, but it requires 109 MiB of workspace. This observation implies that WD does not unnecessarily allocate workspace to a specific layer, but chooses the best combination as defined by the ILP.

V Related Work

Li et al. [26] propose a heuristic to automatically tune the memory layout of each tensor to utilize either GEMM-based or FFT-based convolution efficiently. The proposed heuristic is, however, based on the authors' performance observations of conventional convolutional layers on a specific GPU architecture, and thus there is no guarantee that it always provides the best memory layout for any deep neural network and GPU architecture. In contrast, since μ-cuDNN uses dynamic programming and integer linear programming, it is mathematically guaranteed that μ-cuDNN provides the best performance the library can produce, provided that each convolution is independent of the others.

Rhu et al. [27] propose a memory management technique that offloads neuron activations, parameters, and errors from GPU memory to CPU memory during forward- and backward-propagation, so that larger models can be trained under the same memory constraint. However, as Fig. 12 shows, even with such a memory-efficient implementation or similar memory management techniques [28], μ-cuDNN is expected to reduce the peak memory usage of each layer.

Zlateski et al. [29] propose ZNNi, an FFT-based convolution algorithm, and mention a micro-batching technique to reduce the temporary memory usage of FFT. μ-cuDNN, however, generalizes this scheme so that micro-batching can be applied to any convolution algorithm, obtaining the best computational performance for the given layer configurations while maintaining high portability across existing deep learning frameworks.

VI Conclusion

In this paper, we proposed μ-cuDNN, a wrapper library for cuDNN that divides the mini-batch to utilize high-performance convolution algorithms within a limited amount of workspace memory. We have shown that μ-cuDNN works well even with recent CNNs, which are composed of many convolutional layers, and that it can easily be integrated into existing deep learning frameworks.

The performance of μ-cuDNN demonstrated in our work suggests that other layer types can be optimized as well, provided they can be decomposed and computed by different algorithms. This is because μ-cuDNN does not use any special properties of the convolution operator, apart from gradient accumulation.

In addition, the results of the WD optimization (Fig. 14) provide the insight that allocating the same workspace memory to each convolutional layer is not necessarily effective, and that dynamic, adaptive assignment performs better. This observation should benefit advanced deep learning frameworks that dynamically manage GPU memory to store tensors such as neuron data, weights, and their gradients, enabling further memory optimization.

Acknowledgment

This research was supported by the ETH Postdoctoral Fellowship (for T. B. N.), the Student Summer Research Fellowship (for Y. O.), and JST CREST Grant Numbers JPMJCR1303 and JPMJCR1687, Japan. Part of this work was conducted as research activities of the AIST-TokyoTech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL).

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems 25 (NIPS 2012), Dec 2012.
  • [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2016), Jun 2016, pp. 770–778.
  • [3] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,” CoRR, vol. abs/1706.02677, Jun 2017, http://arxiv.org/abs/1706.02677.
  • [4] T. Akiba, S. Suzuki, and K. Fukuda, “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes,” CoRR, vol. abs/1711.04325, Nov 2017, https://arxiv.org/abs/1711.04325.
  • [5] S. L. Smith, P.-J. Kindermans, and Q. V. Le, “Don’t Decay the Learning Rate, Increase the Batch Size,” CoRR, vol. abs/1711.00489, Nov 2017, https://arxiv.org/abs/1711.00489.
  • [6] NVIDIA. NVIDIA cuDNN. https://developer.nvidia.com/cudnn. Accessed on 2017-11-23.
  • [7] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional Architecture for Fast Feature Embedding,” arXiv preprint arXiv:1408.5093, 2014.
  • [8] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” https://www.tensorflow.org/, Nov 2015.
  • [9] Theano Development Team, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv e-prints, vol. abs/1605.02688, May 2016, http://arxiv.org/abs/1605.02688.
  • [10] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a Next-Generation Open Source Framework for Deep Learning,” in Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS 2015), Dec 2015.
  • [11] A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” CoRR, vol. abs/1404.5997, Apr 2014, http://arxiv.org/abs/1404.5997.
  • [12] M. Mathieu, M. Henaff, and Y. Lecun, “Fast training of convolutional networks through FFTs,” in International Conference on Learning Representations (ICLR 2014), Apr 2014.
  • [13] Facebook. Caffe2. https://caffe2.ai/. Accessed on 2017-11-23.
  • [14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2015), vol. 07-12-June, Jun 2015.
  • [15] A. Makhorin. GLPK (GNU Linear Programming Kit). https://www.gnu.org/software/glpk/. Accessed on 2017-11-23.
  • [16] NVIDIA. NVIDIA Caffe. https://github.com/NVIDIA/caffe. Accessed on 2017-11-23.
  • [17] ——. Tesla K80 HPC and Machine Learning Accelerator. http://www.nvidia.com/object/tesla-k80.html. NVIDIA. Accessed on 2017-11-23.
  • [18] ——. Tesla P100 Most Advanced Data Center Accelerator. http://www.nvidia.com/object/tesla-p100.html. Accessed on 2017-11-23.
  • [19] ——. NVIDIA Tesla V100. https://www.nvidia.com/en-us/data-center/tesla-v100/. Accessed on 2018-3-1.
  • [20] The TensorFlow Authors. tensorflow/models. https://github.com/tensorflow/models. Accessed on 2018-3-1.
  • [21] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2017), Jul 2017.
  • [22] The TensorFlow Authors. tensorflow/benchmarks. https://github.com/tensorflow/benchmarks. Accessed on 2018-3-1.
  • [23] A. Lavin and S. Gray, “Fast Algorithms for Convolutional Neural Networks,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2016), Jun 2016, pp. 4013–4021.
  • [24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, no. 3, pp. 211–252, 2015.
  • [25] A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,” https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf, Tech. Rep., Apr 2009.
  • [26] C. Li, Y. Yang, M. Feng, S. Chakradhar, and H. Zhou, “Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’16).   Piscataway, NJ, USA: IEEE Press, Nov 2016, pp. 54:1–54:12.
  • [27] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, “vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2016), Oct 2016.
  • [28] K. Shirahata, Y. Tomita, and A. Ike, “Memory reduction method for deep neural network training,” in 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP 2016), Sep 2016.
  • [29] A. Zlateski, K. Lee, and H. S. Seung, “ZNNi: Maximizing the Inference Throughput of 3D Convolutional Networks on CPUs and GPUs,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’16), Nov 2016, pp. 854–865.