A Comparative Measurement Study of
Deep Learning as a Service Framework
Big data powered Deep Learning (DL) and its applications have blossomed in recent years, fueled by three technological trends: a large amount of digitized data openly accessible, a growing number of DL software frameworks in open source and commercial markets, and a selection of affordable parallel computing hardware devices. However, no single DL framework, to date, dominates in terms of performance and accuracy even for baseline classification tasks on standard datasets, making the selection of a DL framework an overwhelming task. This paper takes a holistic approach to conduct an empirical comparison and analysis of four representative DL frameworks, with three unique contributions. First, given a selection of CPU-GPU configurations, we show that for a specific DL framework, different configurations of its hyper-parameters may have a significant impact on both the performance and the accuracy of DL applications. Second, the optimal configuration of hyper-parameters for one DL framework (e.g., TensorFlow) often does not work well for another DL framework (e.g., Caffe or Torch) under the same CPU-GPU runtime environment. Third, we also conduct a comparative measurement study on the resource consumption patterns of the four DL frameworks and their performance and accuracy implications, including CPU and memory usage, and their correlations to varying settings of hyper-parameters under different configuration combinations of hardware and parallel computing libraries. We argue that this measurement study provides an in-depth empirical comparison and analysis of four representative DL frameworks, and offers practical guidance for service providers in deploying and delivering DL as a Service (DLaaS) and for application developers and DLaaS consumers in selecting the right DL frameworks for the right DL workloads.
Big data powered Deep Learning (DL) systems and applications are gaining huge popularity recently in numerous fields, such as image classification, object detection, machine translation and NLP. We witness two emerging trends: (1) a growing number of DL software frameworks in both open source and commercial markets, represented by TensorFlow, Caffe, Torch, Theano, CNTK, Keras and PyTorch, and (2) an increasing volume and variety of DL applications with diverse datasets and domain-specific DL problems. It is widely recognized that choosing the right DL framework for the right applications becomes a daunting task for many researchers, developers and domain scientists. Although there are some existing DL benchmarking efforts, most of them have centered on studying different CPU-GPU configurations and their impact on different DL frameworks with standard datasets [8, 9, 10, 11]. Even under the same CPU-GPU configuration, no single DL framework dominates in performance and accuracy for standard datasets, such as MNIST, CIFAR and ImageNet. Little effort has been devoted to systematically studying the impacts and correlations of various hardware configurations, parallel computing libraries, and DL hyper-parameters on both the performance and the accuracy of DL frameworks for different datasets and DL applications, or to how system resources, e.g., CPU and memory, are consumed and contribute to the performance and accuracy of DL frameworks and applications.
Bearing these problems in mind, in this paper we take a holistic approach to design and conduct a comparative measurement study of four representative DL frameworks, focusing on how they optimize their performance and accuracy using popular DL datasets and workloads. This paper makes three original contributions. First, although adding GPU devices can significantly shorten the model training time for all DL frameworks, our empirical study shows that careful selection of hardware configurations, parallel computing libraries, and hyper-parameters can have a significant impact on both the performance and the accuracy of DL frameworks for any given DL workload under a fixed CPU-GPU hardware configuration. Second, the optimal configuration of hyper-parameters is highly dependent on a number of factors, such as the choice of the specific DL framework, the specifics of the datasets and the learning tasks, the structural complexity of the deep neural networks, and their specific parallel computation implementation libraries. Thus, the optimal settings of hyper-parameters for one DL framework (e.g., TensorFlow) often do not work well for another DL framework (e.g., Caffe or Torch), or for different datasets or different learning tasks under the same DL framework, with the dual goals of high performance and high accuracy. This makes the tuning of hyper-parameters substantially more challenging. Third, we analyze the resource consumption patterns, such as CPU and memory usage, under different configuration combinations of hardware, parallel computing libraries, and hyper-parameters. We show the resource consumption implications of different hardware configurations, different parallel computing libraries, and different configurations of hyper-parameters, including the default configurations used by existing open source DL frameworks.
For example, we show that the mini-batch (batch) size and the number of iterations (#Iterations or #Epochs) are the two hyper-parameters that have significant impact on both the performance and the accuracy of the four DL frameworks in our study. Furthermore, the learning rate can have significant impact on the accuracy of the DL frameworks for many datasets and learning tasks. Although larger batch sizes and a larger number of iterations directly correlate with the training time for all DL frameworks, their impact on accuracy varies across DL frameworks and often exhibits a non-linear relationship. Furthermore, a larger number of GPUs and a higher capacity of memory and CPU may not result in shorter training time and better accuracy. The comparative analysis and insights obtained from our in-depth empirical measurement study provide three unique benefits to the big-data services and DL as a service (DLaaS) communities: (1) it provides evidence and recommendations for DLaaS developers to further enhance the parallel computing libraries and the programmable capabilities of DL frameworks to better leverage the advancement of GPU hardware capabilities and capacities; (2) it offers practical guidelines for service providers to effectively deploy and deliver DLaaS to their diverse applications and consumers; and (3) it steers DLaaS users and application developers to select the right DL frameworks for the right DL workloads.
2 Reference Model for DL Frameworks
All DL frameworks encapsulate a chosen deep neural network (DNN) to learn and generate a DNN model over the training dataset $D$. Thus, a DL framework can be abstracted as a function $y = F_W(x)$, where $W$ denotes the model parameters, $x$ represents an $n$-dimension input and $y$ denotes the output, an $m$-dimension vector. The DNN typically contains a large number of model parameters ($W$), such as the neuron weights, as well as hyper-parameters. Hyper-parameters primarily include the batch size, the number of training iterations and the learning rate (LR). For example, Inception-v3 contains almost 25 million model parameters. Thus, in practice, $W$ is first initialized with a set of values (random or fixed), then tuned by the training phase.
Mathematically, for a training dataset $D$, the objective of training is to minimize the average loss over all samples $x_i \in D$ in each iteration:

$$L(W) = \frac{1}{|D|} \sum_{i=1}^{|D|} l(W, x_i)$$
where $W$ represents the neuron weights, $L(W)$ indicates the average loss over all $x_i \in D$, and $l(W, x_i)$ is the loss on a data sample $x_i$ (computed by the feed-forward process). Since $|D|$ can be very large, in practice, a stochastic approximation of this objective is to use a mini-batch (batch). By using $b$ samples, the average loss can be estimated as:

$$L(W) \approx \frac{1}{b} \sum_{i=1}^{b} l(W, x_i)$$
where $b$ denotes the batch size for this iteration. In particular, a typical way to select these $b$ samples is to put $b$ continuous samples from the shuffled $D$ into this batch. After $\lceil |D| / b \rceil$ iterations, the entire $D$ is traversed, forming an epoch. For each epoch, no overlapping between batches exists and $D$ needs to be reshuffled. The weights are then updated with stochastic gradient descent (SGD):

$$W_{t+1} = W_t - \eta \nabla L(W_t)$$
where $\eta$ represents the learning rate, controlling the extent of each update.
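For illustration, the mini-batch SGD procedure formalized above can be sketched in plain Python. This is a minimal sketch for exposition: the function name `sgd_train` and its gradient callback are our own, not an API of any of the four frameworks.

```python
import random

def sgd_train(dataset, weights, grad_fn, lr=0.01, batch_size=64, epochs=12):
    """Mini-batch SGD: each epoch reshuffles D, partitions it into
    non-overlapping batches, and applies one update W <- W - lr * grad
    per batch, where grad estimates the full-dataset gradient."""
    n = len(dataset)
    iters_per_epoch = (n + batch_size - 1) // batch_size   # ceil(|D| / b)
    for _ in range(epochs):
        random.shuffle(dataset)                            # reshuffle D per epoch
        for i in range(iters_per_epoch):
            batch = dataset[i * batch_size:(i + 1) * batch_size]
            # average the per-sample gradients over the batch
            grads = [grad_fn(weights, x) for x in batch]
            avg = [sum(col) / len(batch) for col in zip(*grads)]
            weights = [w - lr * g for w, g in zip(weights, avg)]
    return weights
```

For example, with a hypothetical quadratic loss $l(w, x) = (w - x)^2$ (gradient $2(w - x)$), the learned weight converges to the dataset mean, and the batch size and #Epochs jointly determine the number of weight updates performed.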
Most DL frameworks, such as TensorFlow, Caffe, Torch, Theano, CNTK, Keras and PyTorch, adopt a similar layered software architecture and provide APIs to enable users to configure the DNN model and the training methods (optimizers). Figure 1 shows an overview of the reference architecture used by most DL frameworks. Most of the existing DL frameworks are implemented on top of popular parallel computing libraries: BLAS (Basic Linear Algebra Subprograms) libraries, such as OpenBLAS, MKL, and cuBLAS; NCCL (NVIDIA Collective Communications Library); OpenMP/MPI [21, 22]; and Eigen. The network and storage tier sits on top of the bare-metal hardware and connects to the parallel computing libraries. LeNet, AlexNet, VGG and ResNet are some popular neural network (NN) models offered as user-configurable NN options by most DL frameworks. For example, Caffe, Torch and Theano all provide options for AlexNet, LeNet, VGG, and ResNet. The Google Inception network is an extension of LeNet.
3 Methodology and Baselines
High-performance DL systems and applications demand both low latency and high accuracy simultaneously. With a large number of parameters in DL, it is very difficult to set these parameters, particularly the hyper-parameters, to achieve this goal. We take a holistic approach to conduct a systematic empirical study on four representative DL frameworks: TensorFlow by Google, Caffe by BVLC, Torch by the Lua community, and Theano (a Python library). As shown in Figure 1, the implementation and the runtime execution of different DL frameworks may differ in a number of ways, depending on their concrete implementation and configuration choices, including (1) the static components chosen as parts of their runtime execution systems at the coding and implementation phase, such as the specific ML libraries and parallel computing libraries; (2) the flexible components of their implementation systems, which are configurable prior to runtime, such as the concrete DNN structures and the hyper-parameters, e.g., the mini-batch size, the number of iterations, and the learning rate; and (3) the hardware platform compositions, such as with or without GPU devices, the number of GPU cards used, and the type of network and storage runtime, such as using InfiniBand or Ethernet to connect computing nodes.
Several existing research efforts have shown the impact of different hardware platforms on the performance of DL frameworks [10, 25, 11, 8], and compared the performance of different DL frameworks with respect to their DNN structures and their default configuration settings [9, 26]. Thus, in this paper, we focus our empirical measurement study and comparison on the characterization and analysis of DL frameworks in terms of how they respond to different configurations of their hyper-parameters, different types of datasets, and different choices of parallel computing libraries. Both the runtime performance, in terms of training time and testing time, and the accuracy of the DNN models produced by the DL frameworks for prediction are measured in each set of experiments and reported with respect to their baseline configurations, which are the default settings of DNN structures and hyper-parameters recommended by the developers of each DL framework. We indicate those configuration settings of parallel computing libraries and hyper-parameters that can provide improved performance and/or improved accuracy over the recommended baseline defaults.
The profiling tools used to measure the performance of the DL frameworks in this study include sysstat and Nvidia SMI. sysstat is a collection of utilities to monitor system performance and usage, including the usage statistics for CPU and memory. Nvidia SMI is a command line tool on top of the NVIDIA Management Library (NVML) for management and monitoring of NVIDIA GPU devices.
3.1 Hardware Platforms
All experiments are primarily conducted on an Intel Xeon E5-1620 server (Server-1) with a 3.6GHz 4-core CPU, DDR3 1600MHz 8GB x 4 (32GB) memory, a 256GB SSD, and an Nvidia GeForce GTX 1080 Ti with 3584 CUDA cores and 11GB GDDR5X onboard memory, connected to the host via a PCIe 2.0 port, installed with Ubuntu 16.04 LTS, CUDA 8.0 and cuDNN 6.0. The same set of experiments was also repeated on a more powerful Intel Xeon server (Server-2), which has two Intel Xeon E5-2650 v4 2.2GHz CPUs, each with 12 physical cores, composing a NUMA architecture with DDR4 2400MHz 16GB x 12 (192GB) memory, a 3TB SSD, and two Nvidia Tesla M60 cards, each with 2 GPUs, each GPU having 2048 CUDA cores and 8GB GDDR5 memory, installed with Ubuntu 16.04 LTS, CUDA 9.0 and cuDNN 7.0. The two M60 GPU cards are linked to the host via PCIe 3.0 ports. For both platforms, TensorFlow by default uses Eigen, and the other DL frameworks are compiled with OpenBLAS. The default platform is the Intel Xeon E5-1620 server for all experiments reported in this paper unless stated otherwise.
We choose the three most popular and classic datasets: MNIST, CIFAR-10, and ImageNet (ILSVRC2012). The first two datasets are the representative datasets used by almost all mainstream DL frameworks to tune and publish their default configurations. MNIST consists of 70,000 gray-scale images of ten handwritten digits, each image being 28x28 in size. CIFAR-10 consists of 60,000 color images in 10 classes, each being 32x32 in size. Table I shows the raw dataset size and the in-memory size for MNIST and CIFAR-10. The in-memory size is calculated under the assumption that the samples are stored in float32 matrices while labels are stored as ints. ImageNet has 1,000 categories; the vast majority of its categories have about 1,300 training samples, and each category has 50 testing samples. The raw training dataset size is 140GB, while the testing data is 6.3GB.
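The in-memory estimate described above can be reproduced with a short calculation. The sketch below assumes float32 (4-byte) pixels and 4-byte int labels; the helper name `in_memory_mb` is ours, for illustration only.

```python
def in_memory_mb(n_samples, *dims, float_bytes=4, label_bytes=4):
    """Estimate the in-memory size (MB) of a dataset stored as float32
    sample matrices plus one int label per sample."""
    per_sample = float_bytes
    for d in dims:
        per_sample *= d                      # bytes per sample matrix
    total_bytes = n_samples * (per_sample + label_bytes)
    return total_bytes / (1024 * 1024)

# MNIST: 70,000 gray-scale 28x28 images; CIFAR-10: 60,000 32x32 RGB images
mnist_mb = in_memory_mb(70_000, 28, 28)      # ~210 MB
cifar_mb = in_memory_mb(60_000, 32, 32, 3)   # ~703 MB
```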
[Table I: dataset statistics, with columns Dataset, Category, Raw Size (MB), In-memory (MB)]
[Table II: default configurations for MNIST; base learning rates include 0.0001, 0.01, 0.05, 0.1]
[Table III: default configurations for CIFAR-10; base learning rates include 0.1, 0.001, 0.01]
[Table IV and Table V: default DNN structures; 1st layer (conv) with ReLU/Tanh and MaxPooling, 2nd layer (conv) with ReLU/Tanh and MaxPooling, 3rd layer (fc)]
3.3 Baseline Comparison
Table II and Table III show the primary hyper-parameter settings in the default configurations of TensorFlow, Caffe, Torch and Theano for MNIST and CIFAR-10, respectively. Given that the Theano release has only the default setting for MNIST, we use a popular third-party Theano implementation on CIFAR-10. To control the iterative training process, some DL frameworks use the notion of #Epochs, such as Torch and Theano, whereas others use the notion of #Iterations, such as TensorFlow and Caffe. We list both #Epochs and #Iterations for ease of comparison. For example, for MNIST, TensorFlow sets its max iterations to 20,000 and Caffe sets it to 10,000, while Torch and Theano set their #Epochs to 12 and 200, respectively. TensorFlow and Theano reserve 5,000 and 10,000 samples respectively from the original training dataset as validation samples. Thus, the corresponding #Epochs can be calculated as follows:

$$\#Epochs = \frac{\#Iterations \times batch\ size}{|D_{train}| - |D_{validation}|}$$
Thus, we obtain the corresponding #Epochs for TensorFlow and Caffe. Torch does not specify a default setting of #Epochs for either dataset. We conducted experiments on Torch using Server-1 by varying the #Epochs on both datasets, and found that 12 epochs for MNIST and 45 epochs for CIFAR-10 are the optimal #Epochs; they are thus used as its baseline.
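The #Iterations-to-#Epochs conversion used above can be captured in a small helper (the function name is ours; the example figures follow Caffe's MNIST defaults discussed in this paper: 10,000 iterations at batch size 64 over the 60,000-sample training set, with no validation split):

```python
def iterations_to_epochs(iterations, batch_size, train_size, validation_reserve=0):
    """One epoch is one full pass over the effective training set, i.e.,
    the original training set minus any samples reserved for validation."""
    effective_train_size = train_size - validation_reserve
    return iterations * batch_size / effective_train_size

# Caffe's MNIST default: 10,000 iterations, batch size 64, no validation split
caffe_mnist_epochs = iterations_to_epochs(10_000, 64, 60_000)   # ~10.67 epochs
```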
From Table II and Table III, the four DL frameworks adopt different hyper-parameters for the MNIST and CIFAR-10 datasets. For example, the SGD algorithm is used by all four frameworks for CIFAR-10, but TensorFlow sets Adam as its default optimizer for MNIST, whereas Caffe, Torch and Theano still use SGD for MNIST. All four frameworks adopt different base learning rates (LR) and batch sizes for each of the two datasets. For TensorFlow, the LR, batch size and #Iterations increase significantly for CIFAR-10 compared to MNIST, with the LR and #Iterations being 100x and 50x larger, respectively. Similarly, Caffe increases its batch size to 100 and uses a two-phase training for CIFAR-10 to gain high accuracy, while Torch and Theano both decrease the LR and batch size in order to learn more finely.
In fact, the four DL frameworks also set their default DNN structures differently for MNIST and CIFAR-10, as shown in Table IV and Table V respectively. conv denotes the convolution layer and fc the fully-connected layer. The four frameworks adopt a similar DNN structure with 2 convolution layers for MNIST but use significantly different settings for CIFAR-10. These baseline comparisons imply that the performance and accuracy of DL frameworks can be sensitive to the type of datasets and the specific configurations of their hyper-parameters. The sensitivity to datasets leads to different optimal settings of the hyper-parameters for different datasets under the same DL framework (data dependency). The sensitivity to the specific implementation choices (recall Figure 1) leads to the divergence of different DL frameworks on the optimal settings of the hyper-parameters for the same dataset (framework dependency).
3.4 Empirical Study Objectives
The above baseline comparison and analysis motivate us to conduct empirical measurement and comparative study on the four DL frameworks with three complementary objectives. First, we will compare the effectiveness and efficiency of the four DL frameworks using their default baseline configurations for MNIST and CIFAR-10 on different configurations of hardware and parallel computing libraries. Second, we will characterize and analyze the optimal settings of those hyper-parameters that are critical to both performance and accuracy, including both the impact of individual hyper-parameters and the impact of the combo-settings of two or more hyper-parameters. Third, we will measure and characterize CPU and memory resource consumption by different DL frameworks for different hardware configurations, parallel computing libraries and hyper-parameter settings. We believe that this comprehensive empirical characterization and analysis of the DL frameworks can serve as practical guidance for researchers, developers, domain scientists and big data professionals in both choosing a DL framework most effective for their big datasets and learning tasks, and tuning the DL frameworks for high performance and high accuracy.
[Table VI and Table VII: baseline results on Server-1 and Server-2, with columns Framework, Dataset, #GPUs, #Iterations, Batch Size, Training Time (s), Testing Time (s), Accuracy (%)]
4 Experimental Comparison and Analysis
4.1 Impact of Hardware Configurations
4.1.1 Baseline Experiments
We first compare the baseline performance of all four DL frameworks using their default settings on MNIST and CIFAR-10 on both CPU and GPU, and the results are shown in Table VI on Server-1 and Table VII on Server-2. We make four interesting observations.
First, regarding accuracy, under both CPU and GPU on Server-1 and Server-2, TensorFlow (TF) and Torch achieved higher accuracy for MNIST, even though Torch has a smaller #Epochs, kernel size and batch size than TensorFlow. But for CIFAR-10, TensorFlow has the highest accuracy, followed by Caffe, then Torch, with Theano the worst, though Theano spent the least training and testing time. Also, for CIFAR-10, Torch-GPU has a slightly lower accuracy (65.96%) than Torch-CPU (66.20%); one reason might be that Torch uses SpatialConvolutionMap on CPU and SpatialConvolutionMM on GPU, due to the lack of the corresponding implementation on GPU. Overall, all frameworks achieve much lower accuracy on the content-rich CIFAR-10 dataset than on the simple gray-scale MNIST dataset, due to the low entropy of MNIST, which makes DNN feature extraction easier. Specifically, for MNIST, all frameworks obtain a comparably high accuracy, with Caffe having the lowest accuracy on CPU, albeit with shorter training and testing time, while for CIFAR-10, all frameworks achieve a markedly lower accuracy.
Second, regarding training and testing time, the GPU significantly shortens the training and testing time for all frameworks on both datasets and both servers, with Theano seeing a slightly smaller speedup; moreover, Caffe achieves higher accuracy on GPU than on CPU for both datasets. For CIFAR-10, the highest accuracy of TensorFlow comes at the cost of significantly longer training time on both CPU and GPU.
Third, for CIFAR-10, Torch-CPU and Theano-CPU achieved higher accuracy, 66.20% and 56.04% respectively, than Torch-GPU (65.96%) and Theano-GPU (54.49%) on Server-1. This observation indicates that even though the GPU accelerates training and testing, it may not lead to higher accuracy.
Fourth, comparing the memory and CPU of the two servers, Server-2 has much larger memory (192GB) and a higher-performance CPU (E5-2650 v4) than Server-1 (32GB, E5-1620). However, Caffe achieved higher runtime performance on Server-1 than on Server-2 for both MNIST and CIFAR-10, e.g., 512.18s (Server-1) versus 785.22s (Server-2) for training on MNIST. Theano also obtained shorter training and testing time and achieved higher accuracy on Server-1 than on Server-2 for MNIST. These observations indicate that higher capacity of memory and CPU may not result in shorter training/testing time and better accuracy, and we conjecture that scalable advancement in DL software frameworks that can leverage more powerful hardware capabilities is an important and critical challenge for scaling DLaaS for big data powered DL workloads.
4.1.2 Impact of Varying #GPUs
DL frameworks continue to improve over time for better support of multiple GPUs. TensorFlow comes with a default implementation for ImageNet with multi-GPU support. Caffe inherently supports multiple GPUs (enabled for this set of experiments). However, TensorFlow, Torch and Theano do not implement support for multiple GPUs for MNIST and CIFAR-10 in their default configurations. Thus, we conduct a set of experiments to show the impact of #GPUs, varied from 1 to 2 to 4, on the runtime performance and accuracy of TensorFlow and Caffe. For TensorFlow, the neural network structure is Inception-v3, which costs 5 billion multiply-adds per inference and uses almost 25 million parameters. We set the batch size for each GPU to 32 as recommended in , with the other parameters set to their defaults. For Caffe, the parameters are set to their defaults with a fixed batch size per GPU: 64 for MNIST and 100 for CIFAR-10.
Table VIII shows the experimental results. We observe that the training time increases as the total batch size grows for both TensorFlow and Caffe, primarily due to the synchronization overhead of multiple GPUs. The testing time remains similar because the testing is performed on only one GPU. Furthermore, the accuracy of TensorFlow on ImageNet and the accuracy of Caffe on CIFAR-10 both increase as the total batch size increases given the fixed #Iterations. However, the accuracy of Caffe on MNIST fluctuates when we vary the number of GPUs, and the accuracy of Caffe on MNIST with 2 or 4 GPUs is lower than its accuracy with 1 GPU. This shows that more GPUs for MNIST may not result in shorter training time and higher accuracy. This set of experiments also shows that for content-rich and color-rich datasets like CIFAR-10 and ImageNet, more GPUs can improve the accuracy of the trained DNN models, indicating that larger batch sizes bundled with multiple GPUs hold the potential to improve the accuracy at little cost in training time, particularly for content-rich datasets. Thus, developing dataset-feature-aware configuration of GPUs and deep neural network structures can be a complementary dimension for benchmarking and performance tuning.
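The synchronous data-parallel scheme underlying these multi-GPU experiments can be sketched as follows, as a simplified model with plain lists (the function names are ours): each GPU processes its own fixed-size mini-batch, so the effective batch size grows linearly with #GPUs, and the per-step gradient averaging is the synchronization point that adds overhead.

```python
def total_batch_size(per_gpu_batch, n_gpus):
    """Effective batch size under data parallelism with a fixed per-GPU batch."""
    return per_gpu_batch * n_gpus

def data_parallel_step(weights, per_gpu_grads, lr):
    """One synchronous update: average the gradients computed independently
    on each GPU's mini-batch, then apply a single weight update."""
    n_gpus = len(per_gpu_grads)
    avg = [sum(g[i] for g in per_gpu_grads) / n_gpus
           for i in range(len(weights))]
    return [w - lr * a for w, a in zip(weights, avg)]
```

For instance, `total_batch_size(32, 4)` yields the effective batch of 128 implied by the 4-GPU TensorFlow setting above.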
[Table VIII: impact of varying #GPUs, with columns Framework, Setting, Training Time (s), Testing Time (s), Accuracy (%)]
4.1.3 CPU vs. CPU with GPU
Our baseline experiments have shown that the hardware configuration of CPU with GPUs typically outperforms the CPU server without GPU by an order of magnitude, denoted by CPU ≺ GPU for presentation brevity. However, our empirical measurements have shown that in some scenarios, the CPU without GPUs, with proper optimizations, may outperform the CPU with GPUs in terms of training and testing time as well as accuracy, denoted by CPU ≻ GPU. Table IX shows the results of one such set of measurements. GPU-1 and GPU-2 denote the default GPU configurations on Server-1 and Server-2, while on Server-2, we replace OpenBLAS, the default for Caffe and Theano on CPU, with MKL as the parallel computing library, denoted as MKL (CPU-2). We observe that the performance of Caffe with MKL (CPU-2) shows significant improvement over GPU-1 and GPU-2: the training time of Caffe with MKL on Server-2 improves over that with GPU-1 by 1.64x and over GPU-2 by 2.30x. Furthermore, Caffe achieved an accuracy of 99.18%, slightly higher than GPU-1 (99.13%) and GPU-2 (99.15%). Similar observations are made for Theano. With MKL, Theano MKL (CPU-2) also achieved higher runtime performance than on GPU-2, with the training time and testing time reduced from 1,597.86s and 0.52s (GPU-2) to 691.75s and 0.28s respectively. These observations show the opportunities and the potential of parallel computation optimizations for CPU without GPUs. It also shows the potential of better DL software frameworks for a given hardware configuration platform, be it CPU or CPU with GPUs.
4.2 Impact of Parallel Computing Libraries
Mini-batching enables the partitioning of large datasets into many small batches such that massively parallel processing is possible for deep neural networks (DNNs) with huge feature maps. As a result, all DL frameworks are compute-intensive. The runtime performance of DL frameworks thus relies highly on the performance of the underlying parallel computing libraries, as shown in the previous set of experiments. In this section, we primarily study the impact of different parallel computing libraries and their configurations on both the runtime performance and the accuracy of different DL frameworks. For the two hardware platforms (Server-1 and Server-2), the DL frameworks are compiled natively with Eigen, OpenBLAS, or MKL. Eigen is used as the default only by TensorFlow, while Caffe, Torch and Theano use OpenBLAS as their default parallel computing library. Eigen is tightly integrated into TensorFlow, while OpenBLAS is a popular open-source, optimized BLAS (Basic Linear Algebra Subprograms) library. MKL is developed by Intel to better utilize the computation power of Intel CPUs for optimal computing performance. This set of experiments compares the use of MKL as the parallel computing library with the use of the default parallel computing libraries in all four DL frameworks on both Server-1 and Server-2. We also vary the environment variable OPENBLAS_NUM_THREADS in OpenBLAS (abbreviated as #THREADS), which sets the number of threads involved in parallel computation (default: unset), for Caffe, Torch and Theano on Server-2. Two key system performance indicators, CPU usage and memory utilization, are also measured in this set of experiments.
4.2.1 Impact of Parallel Computing Libraries
First, since MKL is well optimized by Intel for the Intel CPUs used by both Server-1 and Server-2, we indeed observe better performance with MKL compared to the scenarios where OpenBLAS is used as the default parallel computing library, as in Caffe, Torch, and Theano. However, although MKL achieved much higher runtime performance than OpenBLAS in Caffe, Torch and Theano, TensorFlow with Eigen as its default parallel computing library significantly outperforms TensorFlow with MKL for training, by 5.47x on Server-1 and 5.08x on Server-2. We conjecture that the high performance of Eigen in TensorFlow is due to their tight integration; even though Eigen may not be as well optimized for Intel CPUs as MKL is, replacing Eigen with MKL in TensorFlow leads to a significant performance reduction. However, the choice of Eigen or MKL has low impact on the accuracy of TensorFlow-trained DNN models. Also, the accuracy of an individual DL framework with different parallel computing libraries varies within 0.10% on Server-1 and 0.18% on Server-2. This observation manifests the feasibility of changing the parallel computing library to attain better runtime performance with negligible impact on accuracy.
Second, we measure the total CPU usage as the overall utilization of the CPU. High CPU utilization typically implies better runtime performance. However, the highest total CPU usage of a DL framework does not necessarily correspond to the highest runtime performance, because the performance of the parallel computing library for a DL framework is a dominating performance factor. For TensorFlow, its total CPU usage with MKL is 99.12% on Server-1 and 99.47% on Server-2, which is much higher than with Eigen, 80.15% on Server-1 and 50.03% on Server-2. However, TensorFlow with MKL has much worse training performance than TensorFlow with Eigen. Moreover, Caffe, Torch and Theano with OpenBLAS (their default option) achieved the highest total CPU usage on both Server-1 and Server-2 respectively, but suffer from the worst training time. Meanwhile, it is also observed that shorter training time often corresponds to lower memory usage, possibly corresponding to smaller memory footprints.
[Table: resource utilization, with columns CPU Usage (% AVG) and Memory (MB)]
4.2.2 Impact of #THREADS in Parallel Computing
Server-2 has 24 physical cores, that is, 48 logical cores with HT (Hyper-Threading) enabled. Table XII shows the experimental results on Server-2 obtained by varying the #THREADS of OpenBLAS from 4 to 48 for Caffe, Torch and Theano. The experimental results further confirm our previous observation of the trivial impact of the specific parallel computing library on accuracy. We observe a small accuracy difference when varying #THREADS for each of these three DL frameworks, i.e., the accuracy difference is within 0.14% for Caffe and 0.06% for Torch, with no difference for Theano, demonstrating the tiny impact of different thread-count configurations in the OpenBLAS parallel computing library. For CPU usage, as the #THREADS increases, the total CPU usage and the percentage of CPU usage for the system also increase. Hence, the performance degradation with larger #THREADS is likely caused by the increased overhead of thread synchronization. In addition, the optimal setting of #THREADS is 24 for Caffe, 4 for Torch, and 8 for Theano on the same hardware platform, showing that the implementations of DL software frameworks may have a larger impact on parallel computing performance. In particular, Torch with #THREADS=4 in OpenBLAS achieved shorter training time (1,224.94s) than with the Intel-optimized MKL (1,555.17s), demonstrating that OpenBLAS with a proper configuration can outperform MKL. In summary, tight integration of the parallel computing library and its configurations with DL frameworks, such as TensorFlow with Eigen, or Torch with the optimal OpenBLAS #THREADS configuration, is highly recommended for performance tuning of DL frameworks.
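In practice, the #THREADS setting must be applied before the DL framework, and hence the BLAS library, is loaded, since OpenBLAS reads the variable once at load time. A minimal sketch follows; the helper name is ours, OPENBLAS_NUM_THREADS is the variable varied in this section, and MKL_NUM_THREADS/OMP_NUM_THREADS are the analogous controls for MKL and OpenMP.

```python
import os

def set_blas_threads(n):
    """Set BLAS thread-count environment variables; call this before importing
    the DL framework so the values are visible when the library loads."""
    for var in ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS", "OMP_NUM_THREADS"):
        os.environ[var] = str(n)

set_blas_threads(24)   # e.g., the optimal #THREADS found for Caffe on Server-2
```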
4.3 Impact of Hyper-parameter: #Epochs (#Iterations)
Recalling the comparative analysis of the baseline configurations of the DL frameworks presented earlier, we compared their training time performance and accuracy based on their default settings for hyper-parameters and DNN structures. In this section, we run a set of experiments and tune the settings of individual hyper-parameters to identify the optimal settings for each of the four DL frameworks. We first study the impact of #Epochs (or #Iterations) and keep the framework-dependent and dataset-dependent default settings for the other parameters, as shown in Tables II-V.
From the baseline experiments on CIFAR-10 on both CPU and GPU, TensorFlow shows the highest accuracy at the cost of significantly longer training time (up to 4x on CPU and 7x on GPU) compared to the other frameworks, and the experiment takes about 3 days to complete. By Table III, the default #Epochs of TensorFlow on CIFAR-10 is 2,560 with a batch size of 128, which is significantly larger than the other frameworks and equivalent to 1,000,000 #Iterations according to Formula (4). We want to reduce the training time by cutting down the #Epochs TensorFlow has set for CIFAR-10 while preserving its high accuracy. Interestingly, we found that by reducing the #Epochs by 10 times to 256 epochs, we can maintain very similar accuracy for CIFAR-10 on both CPU and GPU, as shown in Table XIII. Our empirical results are consistent with those reported by TensorFlow for 256 epochs (~86% accuracy). It is interesting to note that TensorFlow's recommended default setting for #Epochs is 10 times 256, but its accuracy increase is only 0.30% on CPU and 0.60% on GPU. This translates to significantly longer training time: about 197,495.33s (54.86 hours) more on CPU and 11,117.83s (3.09 hours) more on GPU. Specifically, the training time of 2,560 epochs is 10.11x the training time of 256 epochs on CPU and 9.18x on GPU. On one hand, long training time spent once to train a model with higher prediction accuracy is well worth the cost if the real application deployment is highly sensitive to accuracy. On the other hand, for mission-critical applications with training time requirements, training for 256 epochs might be more attractive for two reasons: (1) the testing times of 2,560 epochs and 256 epochs are similar on both CPU and GPU, and (2) the accuracy difference is only in the range of 0.30% to 0.60% for both CPU and GPU.
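The accuracy/time trade-off quoted above can be made concrete as a marginal cost, i.e., the extra training time paid per percentage point of accuracy gained. The helper name is ours; the figures are the GPU numbers from this paragraph (11,117.83s of additional training for a 0.60% accuracy gain).

```python
def marginal_cost_per_point(extra_seconds, accuracy_gain_pct):
    """Extra training time (seconds) per percentage point of accuracy gained."""
    return extra_seconds / accuracy_gain_pct

# 2,560 vs. 256 epochs on GPU: +11,117.83 s of training for +0.60% accuracy
gpu_cost = marginal_cost_per_point(11_117.83, 0.60)   # ~18,530 s per point
```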
Caffe has shorter training time, but the prediction accuracy of its trained DNN model is lower than TensorFlow and Torch on MNIST and lower than TensorFlow on CIFAR-10 on both CPU and GPU. From Tables II and III, Caffe has the smallest default #Iterations (#Epochs) on both datasets. We next investigate the impact of larger #Iterations on the accuracy of Caffe. Table XIV reports the measurement results, from which we make two observations. First, for MNIST, training for 20,000 iterations (21.33 epochs) shows the highest accuracy of 99.22%, which is 0.09% higher than Caffe's default, at an acceptable cost of an additional 97.18s of training. It is also interesting to note that the accuracies at 15,000 iterations (16 epochs) and 240,000 iterations (256 epochs) are lower than the default by 0.09% and 0.08% respectively, even with longer training times of 48.2s and 2,236.91s respectively. One probable reason for the lower accuracy at 15,000 iterations is that the training is trapped in a local optimum instead of the global one, while the lower accuracy at 240,000 iterations is likely due to over-fitting, i.e., the trained model is over-fitted to the training dataset and loses its generalization on the testing dataset. Second, for CIFAR-10, we use the default #Epochs of TensorFlow to cap the maximum #Iterations at 1,280,000 (i.e., 2,560 epochs), which increases the accuracy of Caffe by 2.41% at the cost of 255.53× more training time than the default. Also, 500,000 iterations increase the prediction accuracy by 1.38% at the cost of 99.72× the training time, considerably lower than that of 1,280,000 iterations. However, a careful examination of Table XIV also indicates that the correlation between the training time and the prediction accuracy is non-linear and more complex.
In summary, this set of experiments indicates that Caffe can benefit from a larger #Epochs (i.e., longer training time) in some cases, though a larger #Epochs does not necessarily guarantee higher accuracy, because local optima and over-fitting may lower the accuracy with more iterations. For a content-rich dataset like CIFAR-10, a larger #Epochs helps improve the accuracy of Caffe.
The next set of experiments is designed to study the impact of hyper-parameter #Epochs (#Iterations) on the performance of Torch in terms of accuracy and training time. We keep the default for the DNN structure and other hyper-parameters and vary the #Epochs on MNIST and CIFAR-10 for both CPU and GPU platforms.
Figures 2 (MNIST) and 3 (CIFAR-10) show the results; we use the left y-axis for accuracy and the right y-axis for training time to facilitate the analysis of the relationship between them. Overall, the training time increases as the #Epochs increases in all cases. For accuracy, Figures 2(a) and 3(a) show that the accuracy of Torch on CPU first increases rapidly to reach the optimal value and then stays stable or slightly drops, probably due to over-fitting. For MNIST, the peak accuracy of 99.24% is first achieved at the 12th epoch with a CPU training time of 9,647.34s. For CIFAR-10, the peak accuracy of 66.20% is first obtained at the 45th epoch, with a CPU training time of 54,830.26s. However, Torch-CPU already reaches an accuracy of 66.16% at the 20th epoch; as the #Epochs increases from 20 to 45, the accuracy increases only slightly, by 0.04%, from 66.16% to 66.20%. Thus, at a loss of only 0.04% relative to the peak accuracy at the 45th epoch, stopping at the 20th epoch saves 16,561.59s of training time, approximately 30% of the training time for 45 epochs.
Figures 2(b) and 3(b) show that Torch experiences more accuracy fluctuations on GPU for both MNIST and CIFAR-10, with the peak GPU accuracy of 99.22% first reached at the 12th epoch for MNIST (training time 338.46s) and the peak GPU accuracy of 65.96% first reached at the 45th epoch for CIFAR-10 (training time 1,906.56s). Comparing Torch on CPU and GPU, training on GPU is about 28× faster than on CPU with a small loss of accuracy. These experiments indicate that the accuracy and training time curves on CPU are smoother than those on GPU, and they also illustrate why Torch sets the default #Epochs=12 for MNIST and #Epochs=45 for CIFAR-10 (recall Tables II and III). For CIFAR-10, Torch also achieves 0.24% higher accuracy on CPU than on GPU, though the CPU training time is 27.76× longer than its GPU counterpart. Thus, we have shown the impact of #Epochs on the accuracy and training time of Torch and found its optimal #Epochs on MNIST and CIFAR-10.
We report the experiments on the impact of varying #Epochs on the performance of Theano on CPU in Figure 4. Both the training time and the accuracy increase as the #Epochs increases, though the accuracy responds much more slowly to the growth of #Epochs. Theano first achieves its peak accuracy of 99.11% at the 600th epoch, showing that Theano may benefit from a larger #Epochs to improve accuracy.
In summary, all experiments conducted to study the impact of #Epochs consistently confirm two observations for all four frameworks: (1) the training time is proportional to the #Epochs, independent of the dataset or framework choice, and (2) a larger #Epochs does not guarantee higher model accuracy, but a peak accuracy plateau can be found.
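Observation (2) suggests a simple post-hoc heuristic for picking #Epochs from a measured accuracy curve: take the earliest epoch whose accuracy falls within a small tolerance of the peak, as the Torch-CPU discussion above does informally (66.16% at epoch 20 vs. the 66.20% peak at epoch 45). A sketch, with a hypothetical per-epoch accuracy curve:

```python
def earliest_near_peak(accuracies, tolerance=0.05):
    """Return the 1-based epoch whose accuracy first comes within
    `tolerance` (percentage points) of the peak accuracy."""
    peak = max(accuracies)
    for epoch, acc in enumerate(accuracies, start=1):
        if acc >= peak - tolerance:
            return epoch

# Hypothetical accuracy curve (%): rises fast, then plateaus.
curve = [62.1, 64.8, 65.9, 66.16, 66.18, 66.20, 66.19]
print(earliest_near_peak(curve))  # 4
```

Training up to the returned epoch trades a bounded accuracy loss for a large reduction in training time, mirroring the 20-vs-45-epoch trade-off above.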
4.4 Impact of Hyper-parameter: Batch Size
The batch size is another important hyper-parameter for all DL frameworks. To understand its role in the performance of DL frameworks, we vary only the batch size in this set of experiments while keeping the DNN structure and the other hyper-parameters at their defaults.
Table XV shows the training time (per epoch and total), testing time and accuracy as the batch size increases. First, as the batch size increases exponentially, the training time and the accuracy both start to drop. Since all other hyper-parameters remain at their default configuration, one possible interpretation is that the larger batch size adds more complexity to the learning process and fewer features may be found per iteration. Moreover, a larger batch size implies a drop in the total #Iterations (#Iters) for a given default #Epochs, based on Formula (4). However, when the batch size grows large enough, the training time becomes higher than that of the default batch size, which is likely due to the use of virtual memory for the large batch (see Section 4.7 for more discussion).
Table XV. Columns: Batch Size | #Iters | #Epochs | Training Time (s) | Testing Time (s) | Accuracy (%)
We observe very similar behavior for Caffe. By Table XVI, as the batch size increases, both the accuracy and the training time decrease, though the testing time remains almost unchanged. Caffe has the lowest accuracy (60.30%) at its largest batch size setting, indicating that the training process may not converge well.
Table XVI. Columns: Batch Size | #Iters | #Epochs | Training Time (s) | Testing Time (s) | Accuracy (%)
Recall Tables II and III: Torch has the smallest default batch sizes among the four DL frameworks. Figures 2(c) and 3(c) show the results of varying the batch size for Torch on MNIST and CIFAR-10 respectively. For MNIST, we vary the batch size from 1 to 50. As the batch size increases from 1 to 5, the accuracy increases quickly; it reaches the peak accuracy of 99.22% at the batch size of 10 and then drops very slowly from 10 to 50, while the training time decreases quickly at first and then more slowly as the batch size increases beyond 10. The training time at the batch size of 1 is 1,765.57s, 3.47× that at the batch size of 10 (509.07s). For CIFAR-10, Figure 3(c) shows that the batch size of 1 is optimal: as soon as the batch size increases, the prediction accuracy starts to drop, while the training time drops sharply as the batch size increases from 1 to 10 and then at a much slower pace until the batch size reaches 100. This set of experiments confirms that Torch achieves its highest accuracy at the batch size of 10 for MNIST and 1 for CIFAR-10. We conclude with three interesting observations. (1) For CIFAR-10, the accuracy at a slightly larger batch size is 65.50%, only 0.19% lower than the optimal, while enjoying a training time reduction of 179.92s, about 11% of the training time at the optimal batch size. (2) Torch shows its worst accuracy with the batch size of 1 on MNIST while achieving its best accuracy with it on CIFAR-10, showing that its default configuration is highly dataset dependent. This is because when the batch size is 1, the training is purely stochastic, implying that the features extracted from each batch are partial. Moreover, the LR on MNIST (0.1) is much larger than that for CIFAR-10 (0.01); a higher LR may lead to over-fitting on partial features more easily for a low-entropy dataset such as MNIST. (3) Table XVII shows the experimental results for much larger batch sizes. When the batch size increases from 10 to 1000, the accuracy drops slightly and the training time drops to 19.21% of the original.
However, at even larger batch sizes, the training time increases instead; one reason could be the higher memory overhead. When we further increased the batch size, Torch crashed on both platforms.
Table XVII. Columns: Batch Size | #Iters | #Epochs | Training Time (s) | Testing Time (s) | Accuracy (%)
Table XVIII shows the measurements of Theano as the batch size is varied from 50 to 500 and 5,000. Theano adopts the early-stopping technique to address over-fitting; thus, with the batch size of 50, it stopped early at the 178th epoch with the highest accuracy. The training time per epoch drops when the batch size increases from 500 to 5,000 but increases when the batch size increases from 50 to 500. As the batch size increases, the accuracy declines. Also, Theano produces the same accuracy on both single-GPU and multi-GPU platforms for the same batch size settings, demonstrating good accuracy stability.
Table XVIII. Columns: Batch Size | #Iters | #Epochs | Training Time (s) | Testing Time (s) | Accuracy (%)
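The patience-style early stopping described above can be sketched as follows; note this is an illustrative approximation, not Theano's actual implementation, and the patience value and validation scores below are hypothetical:

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return the 1-based epoch at which training stops: when the
    validation accuracy has not improved for `patience` epochs."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stop early
    return len(val_accuracies)  # ran to completion

# Hypothetical validation accuracies (%): peak at epoch 4,
# then three non-improving epochs trigger the stop.
scores = [97.0, 98.1, 98.9, 99.0, 98.9, 98.8, 98.7]
print(early_stopping_epoch(scores))  # 7
```

The key design point, as in the text, is that the stopping criterion watches a held-out validation set rather than the training or testing set.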
4.5 Impact of Tuning Multiple Hyper-parameters
We have studied the impact of a single hyper-parameter, such as #Epochs or the batch size, on the performance of DL frameworks by varying one hyper-parameter while keeping the default settings for the rest (recall Section 4.1.1). Though a larger #Epochs may improve the accuracy at the cost of training time, and a larger batch size may decrease the accuracy, the interplay of #Epochs and the batch size is much more complex. We dedicate the next set of experiments to studying the impact of tuning multiple hyper-parameters and to understanding whether such tuning may improve the accuracy of DL frameworks.
Based on the empirical analysis of the single hyper-parameters #Iterations (#Epochs) and batch size, we choose two batch size settings and #Iterations of 200, 2,000 and 20,000 to study the impact of tuning multiple hyper-parameters. Table XIX shows that the larger batch size with 20,000 iterations achieves much better accuracy, the same as the default (99.28%), than with 2,000 iterations (98.65%). Also, by tuning the two hyper-parameters together, TensorFlow achieves its highest accuracy of 99.34% with the larger batch size and #Iterations combination, better than the 99.28% of its default configuration for MNIST, though at a high cost of about 84.31× longer training time. Overall, this set of experiments indicates that a larger batch size needs much longer training to converge and achieve higher accuracy; thus, a fixed #Epochs is not suitable for larger batch sizes.
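The conclusion that a fixed #Epochs is unsuitable for larger batch sizes follows directly from Formula (4): holding the #Iterations (i.e., the number of weight updates) constant while growing the batch size requires proportionally more epochs. A quick arithmetic sketch, assuming MNIST's 60,000 training images:

```python
def epochs_for_fixed_iterations(iterations, train_size, batch_size):
    """#Epochs needed to run `iterations` weight updates (Formula (4))."""
    return iterations * batch_size / train_size

# With 20,000 iterations held fixed, a 100x larger batch size
# requires 100x more epochs over MNIST's 60,000 images.
print(epochs_for_fixed_iterations(20000, 60000, 50))    # ~16.7 epochs
print(epochs_for_fixed_iterations(20000, 60000, 5000))  # ~1666.7 epochs
```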
Similarly, we choose two larger batch size settings for Caffe. Table XX shows that with 10,000 iterations, Caffe can improve on the accuracy (99.03%) of its default configuration and achieve higher accuracies of 99.09% and 99.05% respectively, at the cost of much longer training time (e.g., 4,199.61s compared to 512.18s for the default batch size of 64). This also indicates that a larger batch size may help increase the resistance to over-fitting.
In this set of experiments, a larger batch size is chosen, similar to Caffe and TensorFlow. Torch uses #Epochs to control the training process; its defaults for MNIST are #Epochs=12 and batch size 10. Figure 5(c) shows the accuracy measurements when varying the #Epochs under the larger batch size. With the larger batch size, Torch achieves much better accuracy as the #Epochs increases. In particular, combined with the two largest #Epochs settings, Torch achieved higher accuracies of 99.31% and 99.32% respectively, compared with 99.22% using its defaults for MNIST (recall Tables II and VI). Overall, this set of experiments indicates that with a larger #Epochs and a larger batch size, Torch can improve its accuracy; a larger batch size also tends to be more resilient to over-fitting.
As for Theano, recall its default batch size and #Epochs in Table II. In this set of experiments, we fix a larger batch size and vary the #Epochs up to 1,000. The experimental results are shown in Figure 5(d). From 200 to 1,000 epochs, the accuracy increases continuously. From 800 to 1,000 epochs, Theano achieves its highest accuracy of 99.13%, compared to the accuracy of 99.05% when using its default setting of 200 epochs (recall the Server-1 Theano-GPU result for MNIST).
Table XXI. Columns: Framework | Batch Size | #Iterations | #Epochs | Learning Rate | Training Time (s) | Testing Time (s) | Accuracy (%)
4.6 The Impact of Learning Rate
The above experiments show that a larger batch size combined with a larger #Iterations (#Epochs) may achieve better accuracy than the default configurations. They also indicate that, even with much longer training time, a larger batch size is more resistant to over-fitting. Furthermore, sufficient #Iterations (#Epochs) are necessary for achieving the desired accuracy, while too few iterations (or epochs) with a larger batch size could lead to under-fitting, hurting the accuracy. We conjecture that seeking a balance between accuracy and training time is desirable for many DL applications. These observations motivate us to study the impact of the learning rate (LR).
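The LR enters training as the step size of the SGD weight update, w ← w − η∇L(w): it scales every update but adds no per-iteration compute, which foreshadows why LR changes affect accuracy but not training time in the results below. A minimal one-dimensional sketch (the quadratic loss is illustrative only):

```python
def sgd_minimize(grad, w0, lr, steps):
    """Plain SGD: repeatedly step against the gradient."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Minimize L(w) = (w - 3)^2, whose gradient is 2(w - 3).
grad = lambda w: 2 * (w - 3.0)
print(sgd_minimize(grad, 0.0, 0.1, 100))  # converges near 3.0
print(sgd_minimize(grad, 0.0, 1.5, 100))  # diverges: LR too large
```

Both runs cost the same 100 gradient evaluations, yet one converges and one diverges, analogous to the NaN accuracy Caffe reports under an improper LR.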
Through experiments, we found settings of hyper-parameters that outperform the defaults w.r.t. accuracy and even runtime performance. Specifically, the batch size, #Iterations, LR and accuracy found for TensorFlow, Caffe, Torch and Theano are shown in bold in Table XXI, which compares three hyper-parameters (batch size, #Iterations, LR) and their accuracy for the four DL frameworks. Recall Figure 5, which shows the accuracy and training time for the four DL frameworks when varying the #Iterations (#Epochs) with the batch size and LR set to the bold values in Table XXI for the corresponding framework. We highlight two observations. First, the training time is proportional to the #Iterations (#Epochs). Second, the accuracy tends to increase until it reaches a plateau as the #Epochs increases.
From Table XXI, we also observe that by varying the batch size and the LR, we can obtain improved accuracy over the corresponding default configuration for each DL framework. Concretely, for TensorFlow, the configuration of batch_size=5000, #Iterations=60000 and LR=0.001 achieved the highest accuracy of 99.45%, compared with 99.22% for TensorFlow's default setting. This was obtained through a progressive measurement study: we first changed the LR from 0.0001 (default) to 0.001 to study the impact of LR, and then changed the #Iterations from its default (20,000) to 60,000. For Caffe, Torch and Theano, we conducted similar experiments. Theano employs an early-stopping mechanism to combat over-fitting: during training, Theano monitors the model performance on a validation dataset (neither the training nor the testing dataset), and if the model performance fails to improve sufficiently, or even degrades with further training, the early-stopping mechanism is triggered to stop the training. Therefore, in the #Epochs column, the values within parentheses represent the configured settings, while the values before them are the epochs Theano actually executed. For example, 230 (800) represents the setting of #Epochs=800 where Theano stopped training at the 230th epoch via its early-stopping mechanism. From the experimental results in Table XXI, we highlight several interesting observations.
(1) Accuracy measures the utility of a trained DNN model, and tuning hyper-parameters can lead to accuracy improvements: TensorFlow, Caffe, Torch and Theano obtained accuracy improvements of 0.23%, 0.05%, 0.10% and 0.07% respectively. It is widely acknowledged in the machine learning (ML) community that even a small improvement in accuracy can have significant impact on the utility of the trained DNN model, as demonstrated in recent ML literature on newly proposed DL algorithms [29, 30, 31]. For instance, improvements over conventional algorithms as small as 0.02% were reported at CVPR 2012 and at ICML 2013, and such a small percentage is non-trivial when taken over a large dataset of 100,000 samples.
(2) Accuracy is sensitive to the setting of LR, while different LR settings show little impact on the training and testing time. For all four DL frameworks, slight changes in LR led to significant accuracy variance. For TensorFlow, when the LR is changed from 0.001 to 0.01, the accuracy of its trained DNN drops from 99.45% to 98.59%. For Caffe, when the LR is changed from 0.1 to 0.01, the accuracy of its trained DNN drops by 0.18%, from 99.18% to 99.00%. It is also worth noting that for LR=0.1 with batch_size=64 and LR=0.5 with batch_size=6400, the Caffe-trained DNN failed to converge, denoted by NaN for accuracy, due to the improper setting. Table XXI also shows that the training time and testing time remain almost the same for each of the four DL frameworks when we vary only the LR. For example, the training times of TensorFlow with batch_size=5000 and #Iterations=60000 for LR=0.0001, 0.001 and 0.01 are 8,484.47s, 8,379.51s and 8,331.56s respectively, a small variance. This observation also indicates that tuning the LR may lead to higher accuracy with negligible runtime performance degradation.
(3) The training time per epoch is highly dependent on the batch size, and the total training time is proportional to the #Iterations under a fixed batch size. All four frameworks manifest similar training time per epoch for a specific batch size. For example, the training time per epoch for Caffe with batch_size=64 is 9.10s–9.16s, while it is 7.66s–7.72s with batch_size=6400. Similar observations hold for the other frameworks, indicating that the training time is fairly predictable for DL frameworks: under a fixed batch size, the total training time is proportional to the #Iterations (#Epochs).
(4) The impact of combined hyper-parameters cannot be inferred from the optimal settings of individual hyper-parameters. For example, the default configuration of TensorFlow can achieve higher accuracy with either a larger #Iterations (60,000, 99.38%) or a larger LR (0.001, 99.26%). However, when we combine the larger #Iterations (60,000) and LR (0.001) and compare this combination to the default setting, the accuracy drops from 99.22% to 99.11%. This indicates the complexity of finding the optimal settings for multiple hyper-parameters, which is another reason why tuning and benchmarking DL frameworks is more challenging than for conventional big data processing systems.
(5) In addition to improving the accuracy over the default hyper-parameters, we observe that two sets of hyper-parameters, in Torch and Theano, outperform the defaults on both accuracy and training time. Specifically, for Torch, batch_size=100, #Epochs=12 and LR=0.4 surpassed its default configuration with a shorter training time (209.64s vs. 338.46s) and a higher accuracy (99.24% vs. 99.22%). For Theano, the combination of batch_size=500, #Epochs=200 and LR=0.25 outperforms the default configuration with a training time of 301.19s vs. 560.49s and an accuracy of 99.09% vs. 99.06%. Notably, these two combinations also reduce the training time to approximately 60% of that of the default settings.
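The multi-hyper-parameter search behind Table XXI can be organized as a small grid search over (batch size, LR, #Iterations). The sketch below is generic: it assumes a caller-supplied train_and_eval function, and the toy score table merely reuses a few TensorFlow numbers from this section rather than rerunning any training:

```python
import itertools

def grid_search(train_and_eval, batch_sizes, learning_rates, iterations):
    """Evaluate every (batch size, LR, #Iterations) combination and
    return the best configuration together with its accuracy."""
    best_cfg, best_acc = None, float("-inf")
    for cfg in itertools.product(batch_sizes, learning_rates, iterations):
        acc = train_and_eval(*cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc

# Toy stand-in for a real training run, echoing the TensorFlow
# accuracies reported above for batch_size=5000, #Iterations=60000.
scores = {(5000, 0.0001, 60000): 99.38,
          (5000, 0.001, 60000): 99.45,
          (5000, 0.01, 60000): 98.59}
cfg, acc = grid_search(lambda b, lr, it: scores.get((b, lr, it), 0.0),
                       [5000], [0.0001, 0.001, 0.01], [60000])
print(cfg, acc)  # (5000, 0.001, 60000) 99.45
```

Observation (4) is why the exhaustive product matters: per-parameter optima do not compose, so the combinations must be evaluated jointly.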
4.7 Impact of CPU and Memory Resource
We have shown the feasibility of tuning individual hyper-parameters and tuning multiple hyper-parameters for improving the accuracy of the DL frameworks over their default configurations. In this section, we examine how different DL frameworks respond to different batch sizes with respect to their CPU and memory resource usage patterns, given that larger batch sizes may demand more CPU processing and consume more memory resource. Table XXII shows the CPU and memory usage measurement results with varying batch sizes on MNIST for all four frameworks.
Table XXII. Columns: Framework | Batch Size | CPU Usage (% AVG) | Memory (MB)
4.7.1 TensorFlow
We make two interesting observations on the CPU usage of TensorFlow (TF). First, as the batch size increases from 50 to 500 and 5,000, the CPU usage of TF (total) increases accordingly with almost no %iowait, because the corresponding maximum memory usage for all three batch size settings fits within the 32GB memory of Server-1. However, when the batch size is increased to 50,000, the percentage of CPU usage in user mode (%user) drops significantly, the %iowait increases, and the maximum memory consumed is slightly over 31GB, very close to the physical memory capacity. The increased %iowait reflects heavy disk reads/writes during execution, indicating that memory swapping occurs, which degrades overall system performance and leads to much longer training time. This also implies that choosing an adequate batch size with respect to the physical memory capacity is critical for ensuring high performance of DL frameworks.
4.7.2 Caffe
From Table XXII, measuring the performance of Caffe as its batch size varies from the default of 64 to 640, 6,400 and 60,000, we observe that the total CPU usage decreases as the batch size increases. However, the average and maximum memory consumption respond differently as the batch size increases to 640, 6,400 and 60,000. For example, the maximum memory for Caffe is 11,009.01MB, about 11GB, when the batch size is increased to 60,000. Compared with TensorFlow, Caffe uses much less memory to keep the whole batch of the training dataset in memory, while TensorFlow runs out of memory at the batch size of 50,000, possibly due to its data-flow centric processing model, which introduces more intermediate data structures. In comparison, for the default batch size of 64, the maximum memory usage is about 484.38MB, sufficient to accommodate the in-memory data size of 209.62MB (179.67MB + 29.95MB) for MNIST. When the batch size is increased to 60,000, the in-framework data expands to 50.21× ((11009.01−484.38)/209.62), compounded by the fact that the feature maps of the training dataset may occupy a large amount of memory, making the execution of Caffe memory-intensive. One take-away from this empirical analysis is the potential for improving the CPU usage for large batch sizes, as higher CPU usage accounts for faster training and a shorter time to completion.
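The MNIST in-memory figures quoted above can be reproduced with back-of-the-envelope arithmetic, assuming each example is stored as 784 float32 pixel values plus one float32 label:

```python
def dataset_mb(num_examples, features_per_example, bytes_per_value=4):
    """In-memory size in MB (MiB) of a dense float32 dataset."""
    return num_examples * features_per_example * bytes_per_value / 2**20

# MNIST: 28x28 = 784 pixels + 1 label per example, stored as float32.
train_mb = dataset_mb(60000, 784 + 1)
test_mb = dataset_mb(10000, 784 + 1)
print(round(train_mb, 2), round(test_mb, 2))  # 179.67 29.95
```

Under this storage assumption the arithmetic matches the 179.67MB + 29.95MB figures cited in the text.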
4.7.3 Torch & Theano
Similarly, the CPU usage drops as the batch size increases for Torch and Theano, even though the memory usage increases significantly for Torch when the batch size changes from 10 to 10,000 and for Theano when the batch size changes from 50 to 500.
Intuitively, the larger batch size will increase the workload for parallel computations, which should consume more CPU, and thus the CPU usage is expected to increase. However, we observe from Table XXII that the CPU usage drops as the batch size increases for all four DL frameworks. This further indicates that the CPU usage is not efficient for larger batch sizes, and optimizations that can further improve CPU usage may further speed up the training process. Moreover, our experiments also show that a large batch size increases the memory usage and reduces the training time, demonstrating the feasibility of space and time tradeoff. These observations further indicate that improving the CPU and memory usage hold the potential to further optimize the performance.
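The CPU utilization breakdown used in this section (%user, %iowait) was collected with sysstat; the same quantities can be derived by differencing two samples of the aggregate "cpu" line in Linux's /proc/stat, as this sketch shows (the two sample lines below are hypothetical):

```python
def parse_cpu_line(line):
    """Parse an aggregate 'cpu' line from /proc/stat into named
    fields (cumulative jiffies since boot)."""
    fields = line.split()
    names = ["user", "nice", "system", "idle", "iowait",
             "irq", "softirq", "steal"]
    return dict(zip(names, map(int, fields[1:1 + len(names)])))

def cpu_percentages(before, after):
    """%user and %iowait over the interval between two samples."""
    total = sum(after[k] - before[k] for k in before)
    return {k: 100.0 * (after[k] - before[k]) / total
            for k in ("user", "iowait")}

# Two hypothetical /proc/stat samples taken one interval apart.
s0 = parse_cpu_line("cpu 1000 0 200 5000 100 0 0 0")
s1 = parse_cpu_line("cpu 1800 0 300 5500 300 0 0 0")
print(cpu_percentages(s0, s1))  # {'user': 50.0, 'iowait': 12.5}
```

A rising %iowait share between samples, as in the TensorFlow batch_size=50000 case above, signals that the training process is stalled on disk I/O (e.g., swapping) rather than computing.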
5 Related Work and Conclusion
We have presented a systematic approach to the empirical analysis and characterization of four popular DL frameworks, TensorFlow, Caffe, Torch and Theano, on three representative datasets, MNIST, CIFAR-10 and ImageNet. This paper makes three unique contributions. First, some existing benchmarking efforts for DL frameworks [35, 36] suffer from two inherent problems: (1) they measure the average time for forward and backward passes, matrix multiplications, or layer-wise performance, and lack an overall performance characterization; and (2) they do not include accuracy comparisons. Although some recent efforts [9, 34] provide end-to-end DL benchmarking, they focus only on the training phase or on specific DL tasks; none to date has taken a holistic approach to studying the impact of hardware configurations, parallel computing libraries, and hyper-parameters on the performance of DL frameworks with respect to both accuracy and training time. Second, to the best of our knowledge, this study is the first to identify the opportunities of configuring parallel computing libraries and of tuning individual and multiple hyper-parameters for improving the training time and accuracy of DL frameworks. Third but not least, to gain a deeper understanding of the impact of hyper-parameters and the choice of parallel computing libraries on the accuracy and training time of DL frameworks, we provide a systematic analysis of the CPU and memory usage patterns for different parallel computing libraries and different batch sizes, and of their impact on accuracy and training efficiency. We conjecture that the comparative measurement study and analysis presented in this paper will provide empirical guidance for service providers to deliver high-performance DLaaS with better performance and accuracy tuning tools, and at the same time help application developers and end-users select the right DL frameworks for the right DL workloads.
This research is partially sponsored by National Science Foundation under CISE SAVI/RCN (1402266, 1550379), CNS (1421561), CRISP (1541074), SaTC (1564097) programs, an REU supplement (1545173), an IBM Faculty Award, and gifts, grants, or contracts from Fujitsu, HP, Intel, and Georgia Tech Foundation through the John P. Imlay, Jr. Chair endowment. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding agencies and companies mentioned above.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” CoRR, vol. abs/1603.04467, 2016. [Online]. Available: http://arxiv.org/abs/1603.04467
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22Nd ACM International Conference on Multimedia, ser. MM ’14. New York, NY, USA: ACM, 2014, pp. 675–678. [Online]. Available: http://doi.acm.org/10.1145/2647868.2654889
-  R. Collobert and K. Kavukcuoglu, “Torch7: A matlab-like environment for machine learning,” in NIPS Workshop, 2011.
-  F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio, “Theano: new features and speed improvements,” CoRR, vol. abs/1211.5590, 2012. [Online]. Available: http://arxiv.org/abs/1211.5590
-  F. Seide and A. Agarwal, “Cntk: Microsoft’s open-source deep-learning toolkit,” in Proceedings of the 22Nd ACM SIGKDD, ser. KDD ’16. New York, NY, USA: ACM, 2016, pp. 2135–2135. [Online]. Available: http://doi.acm.org/10.1145/2939672.2945397
-  Keras Developers, “Keras: The python deep learning library,” https://keras.io/, 2017, [Online; accessed 04-Dec-2017].
-  PyTorch Developers, “Tensors and dynamic neural networks in python with strong gpu acceleration.” https://pytorch.org/, 2017, [Online; accessed 04-Dec-2017].
-  A. A. Awan, H. Subramoni, and D. K. Panda, “An in-depth performance characterization of cpu- and gpu-based dnn training on modern architectures,” in Proceedings of the Machine Learning on HPC Environments, ser. MLHPC’17. New York, NY, USA: ACM, 2017, pp. 8:1–8:8. [Online]. Available: http://doi.acm.org/10.1145/3146347.3146356
-  C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, “Dawnbench: An end-to-end deep learning benchmark and competition,” in NIPS ML Systems Workshop, 2017.
-  S. Shi, Q. Wang, P. Xu, and X. Chu, “Benchmarking state-of-the-art deep learning software tools,” in 2016 7th International Conference on Cloud Computing and Big Data (CCBD), Nov 2016, pp. 99–104.
-  S. Shams, R. Platania, K. Lee, and S. J. Park, “Evaluation of deep learning frameworks over different hpc architectures,” in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), June 2017, pp. 1389–1396.
-  Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
-  A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” CoRR, vol. abs/1512.00567, 2015. [Online]. Available: http://arxiv.org/abs/1512.00567
-  L. Bottou, “Stochastic gradient descent tricks,” in Neural networks: Tricks of the trade. Springer, 2012, pp. 421–436.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
-  X. Zhang, Q. Wang, and W. Saar, “Openblas: An optimized blas library,” http://www.openblas.net/, 2017, [Online; accessed 04-Dec-2017].
-  Intel Corporation, “The fastest and most-used math library for intel-based systems,” https://software.intel.com/en-us/mkl, 2018, [Online; accessed 04-July-2018].
-  Nvidia Corporation, “Ai computing leadership from nvidia,” https://www.nvidia.com/en-us/, 2018, [Online; accessed 17-August-2018].
-  L. Dagum and R. Menon, “Openmp: an industry standard api for shared-memory programming,” IEEE Computational Science and Engineering, vol. 5, no. 1, pp. 46–55, Jan 1998.
-  M. P. Forum, “Mpi: A message-passing interface standard,” Knoxville, TN, USA, Tech. Rep., 1994.
-  G. Guennebaud, B. Jacob et al., “Eigen v3,” http://eigen.tuxfamily.org, 2010.
-  Lua Community, “The programming language lua,” https://www.lua.org/home.html, 2018, [Online; accessed 03-Apr-2018].
-  H. Kim, H. Nam, W. Jung, and J. Lee, “Performance analysis of cnn frameworks for gpus,” in 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2017, pp. 55–64.
-  L. Liu, Y. Wu, W. Wei, W. Cao, S. Sahin, and Q. Zhang, “Benchmarking deep learning frameworks: Design considerations, metrics and beyond,” in 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), July 2018, pp. 1258–1269.
-  G. Sebastien, “sysstat - system performance tools for the linux operating system,” https://github.com/sysstat/sysstat, 2017, [Online; accessed 04-Dec-2017].
-  S. Dieleman, “Reslab theano tutorial (10 february 2015),” https://github.com/benanne/theano-tutorial, 2017, [Online; accessed 02-Apr-2018].
-  Z. Luo, L. Liu, J. Yin, Y. Li, and Z. Wu, “Deep learning of graphs with ngram convolutional neural networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 10, pp. 2125–2139, Oct 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” ArXiv e-prints, Feb. 2015.
-  D. Mishkin and J. Matas, “All you need is a good init,” arXiv preprint arXiv:1511.06422, 2015.
-  D. Cireşan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” arXiv preprint arXiv:1202.2745, 2012.
-  L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in International Conference on Machine Learning, 2013, pp. 1058–1066.
-  H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Phanishayee, B. Schroeder, and G. Pekhimenko, “TBD: Benchmarking and Analyzing Deep Neural Network Training,” ArXiv e-prints, Mar. 2018.
-  S. Chintala, “Easy benchmarking of all publicly accessible implementations of convnets,” https://github.com/soumith/convnet-benchmarks, 2017, [Online; accessed 28-Jan-2018].
-  Baidu Research, “Benchmarking deep learning operations on different hardware,” https://github.com/baidu-research/DeepBench, 2017, [Online; accessed 28-Jan-2018].