Benchmarking the Performance and Power of AI Accelerators for AI Training
Deep learning has become widely used in complex AI applications. Yet, training a deep neural network (DNN) model requires a huge amount of computation, which takes a long time and consumes a lot of energy. Nowadays, many-core AI accelerators (e.g., GPUs and TPUs) are designed to improve AI training performance. However, processors from different vendors differ considerably in performance and power consumption. To investigate the differences among several popular off-the-shelf processors (i.e., Intel CPU, NVIDIA GPU, AMD GPU and Google TPU) in training DNNs, we carry out a detailed benchmark study on the performance and power (when possible) of these processors when training a representative set of DNNs, including three classical convolutional neural networks (CNNs), a recurrent neural network (LSTM), Deep Speech 2, and Transformer. We try to understand the impact of hardware, vendor’s software library, and deep learning framework on the final training performance. Our evaluation results are valuable in two directions, for end-users and for vendors. For end-users, the evaluation results provide a guide for selecting a proper accelerator for training DNN models. For vendors, the advantages and disadvantages revealed in our evaluation results could be useful for future architecture design and software library optimization.
AI Accelerator, Deep Learning, GPU, TPU, Convolutional Neural Networks, Recurrent Neural Networks, Transformer, Deep Speech 2
Recent years have witnessed the fast development of deep neural networks (DNNs) [lecun2015deep], which have been widely used in many AI applications, such as image recognition [krizhevsky2012imagenet][he2015deep], object detection [girshick2015fast][redmon2016you], speech to text tasks [hinton2012deep], etc. However, training these DNN models requires a considerable amount of computational resources [lecun2015deep][dean2012large].
Graphics Processing Units (GPUs) [luebke2006gpgpu] are among the most popular hardware for accelerating the training of DNNs. Unlike a conventional CPU, a typical GPU is equipped with thousands of cores and memory bandwidth of hundreds of gigabytes per second [luebke2006gpgpu], which significantly accelerates the training and inference of DNNs compared to the traditional CPU. Since 2016, Google has launched a new generation of computing device, the TPU [jouppi2017datacenter], followed by Cloud TPU v2 and Cloud TPU v3 [ying2018image]. The generations of TPUs differ mainly in performance and memory capacity. Benefiting from its extraordinary parallel computing capability, the TPU cloud service has greatly accelerated the progress of artificial intelligence and its related applications. TPUs have achieved better performance than many other AI accelerators [jouppi2017datacenter].
Meanwhile, the development of optimized software keeps pace with the hardware. On CPUs, there exist highly optimized high-performance libraries like MKL and MKLDNN [wang2014intel][cyphers2018intel]. On NVIDIA GPUs, researchers and industry have made cuDNN [chetlur2014cudnn], cuBLAS and other CUDA-based libraries able to achieve nearly the peak performance of GPU cards. On AMD GPUs, ROCm (https://rocm.github.io/dl.html) is also actively developed to support high-performance deep learning. For TPUs, TensorFlow [abadi2015tensorflow] is highly optimized and backed by a large development community.
However, AI accelerators of different generations and from different vendors vary widely in performance, power and energy consumption. For example, training time can differ even between NVIDIA and AMD GPUs of similar capacity. In terms of performance, existing benchmarks include software comparisons [bahrampour2016comparative][shi2018performance], hardware comparisons [wei2019benchmarking] and combined software and hardware comparisons [shi2016benchmarking] in training DNNs. In addition, vendors provide their own benchmarks to demonstrate performance with their own highly optimized libraries or configurations, so these results may not be fairly comparable.
For server deployment, power and energy consumption are vital for DNN training, since lower energy consumption directly reduces the long-term electricity bill. By combining performance and energy, one can scale the hardware configuration with dynamic voltage and frequency scaling (DVFS) techniques [mei2017survey][wang2018gpgpu] to save energy. Tang et al. [tang2019impact] have evaluated energy with DVFS on GPUs in training DNNs. Wang et al. [eppminer] propose a benchmark suite for both performance and energy, but they focus only on traditional algorithms and not on deep learning. Furthermore, performance and energy data together are critical to job scheduling algorithms [chau2017energy] that save energy while preserving the computing efficiency of tasks.
In summary, existing benchmarks consider either performance only, or energy only for particular accelerators and algorithms. Furthermore, there is little study on AMD GPUs, although AMD researchers have also developed a deep learning ecosystem, ROCm, for users. To this end, in this paper, we benchmark many popular AI accelerators, including an Intel CPU, NVIDIA GPUs, an AMD GPU and Google TPUs, in terms of multiple metrics including performance, power and energy. On one hand, our evaluation results give end-users a guide on how to choose proper AI accelerators to train their own DNN models under different considerations. For example, end-users can compare the cost of cloud-based GPUs and TPUs for a specific model, and choose the cheaper one to train the model. On the other hand, the problems revealed by the evaluation results could be helpful for further optimization of hardware or software design. For example, GPU library engineers can gain insight into why some operations do not utilize the computing resources well. The experimental numbers for performance, power and energy can also be used by job scheduling algorithms for energy conservation [chau2017energy][mei2017energy], which must ensure that a task finishes within the expected time (related to performance) without consuming too much power (related to energy).
To make the evaluation thorough, we first evaluate the performance of low-level operations on the aforementioned accelerators, and then we evaluate the performance, power and energy of end-to-end training of currently popular DNNs from different AI areas, including CNNs [krizhevsky2012imagenet][he2015deep], LSTM [gers1999learning], Deep Speech 2 [amodei2016deep] and Transformer [vaswani2017attention]. Our major findings are shown in Table 1.
|Section||Key Factor||Metric||Main Findings|
|4.1.2||Tensor size||PERF||Larger tensor sizes have higher workloads for accelerators and they generally achieve higher throughput.|
|4.1.2||Software for TPU||PERF||With different input tensors, TPU nearly fully utilizes the computing resource under the TensorFlow framework.|
|4.1.2||Software for NVIDIA GPUs||PERF||CUDA has approximately the same utilization as TensorFlow on NVIDIA GPUs. Both are highly optimized.|
|4.1.2||Software for GPUs||PERF||Matrix multiplication has been well optimized on GPUs, while convolution still has room for further optimization.|
|4.2.1||Multi-threading on CPU||PERF||In order to fully utilize the benefit of multi-threading, some CPU cores should be allocated to data pre-processing during training.|
|4.2.2||Mini-batch size||PERF||Mini-batch size should be large enough to fully utilize the computational resources of accelerators.|
|4.2.2||Low-bit precision||PERF||FP16 generally achieves higher throughput than FP32, especially on CNNs with Tesla V100 using Tensor Cores. However, on NLP models, it has no obvious improvement compared to FP32.|
|4.2.2||GPU vendor||PERF||NVIDIA GPUs achieve higher throughput and have wider software support than the AMD GPU.|
|4.2.3||Latest TPU||PERF||TPU V3-8 has about 1.5× higher throughput than TPU V2-8.|
|4.2.4||TPU vs GPU||PERF||TPU V3-8 achieves more than 3× higher throughput than Tesla V100 on CNNs, while it has only about 1.5× on Transformer.|
|4.3.1||NVIDIA GPU Model||POW||Tesla P100 has the lowest power consumption on CNNs, while Titan X(Pascal) has the highest power consumption on all models among NVIDIA GPUs.|
|4.3.1||GPU vendor||POW||AMD GPU has the lowest power consumption on LSTM and Inception v3, while it consumes much higher power on ResNet50 and VGG16 than NVIDIA GPUs.|
|4.3.2||Mini-batch size||NRG||Larger mini-batch size consumes more energy on CNNs|
|4.3.2||GPU Model||NRG||NVIDIA Tesla V100 has the lowest energy consumption among evaluated GPUs in Mixed Precision training.|
The rest of this paper is organized as follows. Section 2 introduces some background knowledge related to DNNs, AI accelerators and training algorithms. Section 3 describes our experimental designs and setups, including hardware configurations and DNNs. Section 4 demonstrates our experimental results and analysis of AI accelerators with different training tasks. Some related work is introduced in Section 5. We finally conclude the paper in Section 6.
2.1 Deep Models
In different areas of AI applications, there exist various types of deep architectures achieving state-of-the-art results. In image classification and object detection tasks, convolutional neural networks (CNNs) are the main architectures for extracting image features automatically, among which the VGG [simonyan2014very], ResNet [he2015deep] and Inception [szegedy2016rethinking] architectures are widely used. These CNNs also achieved very good results in the popular ImageNet challenge [deng2009imagenet] on the tasks of image classification and object detection. In the area of natural language processing, the recurrent neural network (RNN), especially LSTM [press2016using], has been one of the successful models with a long history of development. In recent years, Deep Speech 2 [amodei2016deep] was proposed with state-of-the-art results on speech recognition tasks, and attention-based models including Transformer [vaswani2017attention] and BERT [devlin2018bert] have achieved very good scores in many machine translation tasks.
2.2 AI Accelerators
There are many newly developed AI accelerators. In this paper, we mainly focus on the widely available processors such as CPUs, GPUs and TPUs. We will investigate FPGAs in our future work.
CPUs are the traditional processors used in computers, but they were not good at highly parallel and computing-intensive tasks. In the era of deep learning, the main CPU vendor designs its many-core CPUs for these kinds of tasks. For example, the Intel Xeon processor [regnier2004eta] is a powerful CPU with high computing FLOPS among Intel CPUs. This scalable processor was reported to outperform an NVIDIA GPU in deep learning inference on the ResNet-50 model (https://intel.ly/2k4Bxh2).
NVIDIA and AMD GPUs.
GPUs are designed for highly parallel applications in terms of the number of computing cores and the memory access speed, and their peak FLOPS has increased rapidly in the last ten years. In Table 2, we list the parameter details of four recent GPUs: three (Tesla V100, P100, and Titan X(Pascal)) from NVIDIA and one (Radeon VII) from AMD. It can be seen that the peak FP32 computing performance is around 10 TFLOPS or higher, which is around 5× higher than that of CPUs. In particular, the well-known Tesla V100 GPU is built on the GV100 chip with the Volta architecture, which makes it suitable for both general-purpose GPU computation and neural-network-specific computation.
|Product Name||Tesla V100||Tesla P100||Titan X(Pascal)||Radeon VII|
|Core Clock||1245 MHz||1190 MHz||1417 MHz||1400 MHz|
|Boost Clock||1380 MHz||1329 MHz||1531 MHz||1750 MHz|
|Memory Clock||877 MHz||715 MHz||1251 MHz||1000 MHz|
|Memory Bus Width||4096 bit||4096 bit||384 bit||4096 bit|
|Memory Bandwidth||897.0 GB/s||732.2 GB/s||480.4 GB/s||1 TB/s|
|FP16 Computing||28.26 TFLOPS||19.05 TFLOPS||-||26.88 TFLOPS|
|FP32 Computing||14.13 TFLOPS||9.526 TFLOPS||10.97 TFLOPS||13.44 TFLOPS|
Tensor Processing Units (TPUs) are Google’s custom-designed machine learning application-specific integrated circuits (ASICs). Each TPU device has 4 chips and each chip consists of 2 cores, so a TPU device contains 8 cores. Each core has scalar, vector and matrix (MXU) units and is connected to the on-chip high bandwidth memory (HBM). There are two types of TPUs: TPU v2 and TPU v3. For TPU v2, each core has one MXU and 8 GB of HBM, while for TPU v3, each core has two MXUs and 16 GB of HBM. TPUs support the bfloat16 format, which has a wider range of values than float16 with the same 16-bit storage. TPU v2 with 8 cores (TPU v2-8) and TPU v3 with 8 cores (TPU v3-8) have peak bfloat16 computing capacities of 180 TFLOPS and 420 TFLOPS respectively. Additionally, a TPU v2 Pod is assembled from 64 TPU v2 devices, containing 512 TPU v2 cores. A TPU v3 Pod provides a maximum of 256 TPU v3 devices, for a total of 2048 TPU v3 cores.
bfloat16 and float16
The MXU in each TPU core executes 16K multiply-accumulate operations per cycle. In addition, the MXU supports mixed precision training, i.e., its inputs and outputs are 32-bit floating point values while it can use bfloat16 for activations and gradients. Compared with IEEE half-precision floating point (fp16), bfloat16 has a wider range of values because it has one sign bit, eight exponent bits, and seven mantissa bits plus one implicit mantissa bit, as shown in Fig. 3. Using bfloat16 helps reduce the size of data in memory, making larger models fit in the same amount of memory, while ensuring no degradation of converged accuracy.
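The trade-off between the two 16-bit formats can be illustrated with a small sketch. It is not taken from the evaluated software: the helper names are hypothetical, and bfloat16 is emulated by simply truncating the low 16 bits of the float32 encoding, whereas real hardware may round instead.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Assumption: plain truncation; actual accelerators may round to nearest.
    """
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits32 & 0xFFFF0000))[0]


def to_float16(x: float) -> float:
    """Round-trip x through IEEE half precision; out-of-range values become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")


if __name__ == "__main__":
    for value in (0.1, 100000.0):
        print(value, "bfloat16:", to_bfloat16(value), "float16:", to_float16(value))
```

A value such as 100000.0 exceeds the float16 range but remains finite in bfloat16, while a value such as 0.1 is represented more precisely in float16 because of its longer mantissa.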
2.3 Training Algorithms.
The Stochastic Gradient Descent (SGD) [chen2016revisiting] algorithm and its variants are widely used in training deep models. Mini-batch SGD [li2014efficient] is a variant of SGD that divides the entire data set into multiple subsets (mini-batches) and iteratively updates the model parameters according to the first-order gradients computed on the current mini-batch of data. A single iteration can be divided into several steps, as shown in Fig. 4: it starts with reading data from the computer’s disk into the CPU’s memory, and it ends with the update of the parameters. Training repeats this iteration until a termination criterion is met. We generally use the average iteration time to measure the training performance of a particular software and hardware combination.
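As a minimal sketch of the update rule described above, the following pure-Python example fits a one-dimensional linear model with mini-batch SGD. The model, the learning rate, the batch size and the function name `minibatch_sgd` are illustrative assumptions and are unrelated to the DNNs evaluated in this paper.

```python
import random


def minibatch_sgd(data, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b on a list of (x, y) pairs with mini-batch SGD."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the mini-batches in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # First-order gradients of the mean squared error on this mini-batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # Parameter update that ends the iteration.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Synthetic, noise-free data generated from y = 3x + 1.
samples = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(20)]
w, b = minibatch_sgd(samples)
```

Each pass through the inner loop corresponds to one training iteration in the sense used here; the average duration of that loop body is what the iteration time measures.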
Mixed Precision Training.
The mixed precision [micikevicius2017mixed][jia2018highly] training technique is a very successful training algorithm that uses only 16-bit floating point numbers for the forward and backward computations during training, so that the hardware resources can be better utilized. (Mixed precision mainly exploits FP16 for computation during the forward and backward passes; its performance reflects the performance of Tensor Cores on accelerators.) Typically, in mixed precision, FP32 master weights and loss scaling are adopted to avoid the instability that FP16 precision might trigger. The training process is also shown in Fig. 4.
In this section, we introduce the methodology of our evaluation for comparing performance, power and energy among multiple accelerators. We first present the selected hardware settings and DNNs from different AI areas. Then we describe our evaluation methods.
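The role of loss scaling can be sketched numerically. The example below is an illustration under stated assumptions rather than the procedure of any particular framework: the scale factor of 1024 and the gradient value are hypothetical, and Python floats stand in for the FP32 master copy.

```python
import struct


def fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (in-range inputs only)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 1024.0  # hypothetical constant loss scale
true_grad = 1e-8     # a gradient too small for FP16 to represent

# Without scaling, the gradient underflows to zero when stored in FP16.
unscaled = fp16(true_grad)

# With scaling, the FP16 value survives and is unscaled in higher precision.
scaled = fp16(true_grad * LOSS_SCALE) / LOSS_SCALE

# The update is applied to the higher-precision master weight.
master_weight = 1.0
learning_rate = 0.1
master_weight -= learning_rate * scaled
```

Here `unscaled` is exactly zero, so the update would be lost, whereas `scaled` stays within about one percent of the true gradient.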
3.1 Hardware Setup
As we would like to evaluate the most commonly used accelerators for training DNNs, we select many-core processors from four vendors including Intel, NVIDIA, AMD and Google. For each vendor, we select one to three processors for evaluation. The details of the selected accelerators are listed in Table 3, which presents the key parameters that are related to the performance of accelerators.
|Vendor||Accelerator Model||Memory||Theoretical FLOPS||Memory Bandwidth||Memory Type||CPU|
|Intel||Xeon Platinum 8163||48GB||3.84 T(FP32)||119.21 GB/s||DDR4||-|
|NVIDIA||Titan X(Pascal)||12GB||11 T(FP32)||480.4 GB/s||GDDR5X||i7-7820X|
|NVIDIA||Tesla P100||16GB||9.5 T(FP32)||732.2 GB/s||HBM2||i7-6800K|
|NVIDIA||Tesla V100||16GB||112 T(Tensor Core)||897.0 GB/s||HBM2||i7-6800K|
|AMD||Radeon VII||16GB||13.44 T(FP32)||1 TB/s||HBM2||i7-4790|
|Google||TPU v2-8||64GB||180 T (bfloat16)||600 GB/s||HBM||-|
|Google||TPU v3-8||128GB||420 T (bfloat16)||900 GB/s||HBM||-|
3.2 Evaluated Operations and DNNs
DNN models are mainly built by stacking many layers, which generally invoke two main resource-consuming operators (ops): matrix multiplication (Matmul) and 2D convolution (Conv2d). DNNs also contain activation layers that are element-wise operators, but these operators are much faster than Matmul and Conv2d, so we mainly evaluate the performance of Matmul and Conv2d on the selected accelerators. To evaluate the performance of ops, the input data are synthetic tensors of different FLOPs. (In this paper, FLOPS is a performance metric denoting FLOating Point Operations per Second, while FLOPs is a workload metric that represents the total number of FLOating Point Operations.) For the Matmul operator, tensor dimensions range from 256×256 to 8192×8192; for the Conv2d operator, inputs and filters are selected based on real-world models under different batch sizes, from 32 to 256. To ensure the utilization of accelerators, we choose to show results for small, medium and large input tensors, which are listed in Table 4.
|Matmul Shape||Matmul I||Matmul II||Matmul III|
|N(2048, 2048)||N(4096, 4096)||N(8192, 8192)|
|Conv2d Shape||Conv2d I||Conv2d II||Conv2d III|
|F(256, 224, 224, 3)||F(128, 112, 112, 64)||F(256,56,56,128)|
|K(7, 7, 3, 64) S(2,2)||K(3, 3, 64, 128) S(1,1)||K(3, 3, 128, 256) S(1,1)|
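The workload (FLOPs) behind the shapes in Table 4 can be estimated with the usual multiply-accumulate counting, as in the sketch below. It rests on assumptions not stated in the table: N(n, n) is read as the product of two n×n matrices, F and K are read as NHWC input and HWIO filter shapes, the convolution uses "same" padding, and one multiply-accumulate counts as two floating point operations.

```python
import math


def matmul_flops(n: int) -> int:
    """FLOPs of multiplying two n x n matrices (2 operations per multiply-accumulate)."""
    return 2 * n ** 3


def conv2d_flops(inp, kernel, stride) -> int:
    """FLOPs of a 2D convolution forward pass.

    inp:    (batch, height, width, in_channels)          -- assumed NHWC
    kernel: (k_height, k_width, in_channels, out_channels)
    stride: (stride_h, stride_w)
    Assumption: 'same' padding, so output size = ceil(input size / stride).
    """
    batch, height, width, c_in = inp
    k_h, k_w, k_in, c_out = kernel
    if k_in != c_in:
        raise ValueError("kernel input channels must match the input tensor")
    out_h = math.ceil(height / stride[0])
    out_w = math.ceil(width / stride[1])
    return 2 * batch * out_h * out_w * k_h * k_w * c_in * c_out


if __name__ == "__main__":
    for n in (2048, 4096, 8192):
        print(f"Matmul N({n}, {n}): {matmul_flops(n) / 1e9:.1f} GFLOPs")
    conv1 = conv2d_flops((256, 224, 224, 3), (7, 7, 3, 64), (2, 2))
    print(f"Conv2d I: {conv1 / 1e9:.1f} GFLOPs")
```

Dividing such a FLOPs count by the measured execution time gives the achieved FLOPS that the operator results are reported in.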
To cover a comprehensive range of AI applications, we choose DNN models from image classification with CNNs, language modeling with LSTM, speech recognition with Deep Speech 2 and the recent state-of-the-art language model Transformer. For CNNs, we choose ResNet-50 [he2015deep], Inception V3 [szegedy2016rethinking], and VGG16 [simonyan2014very] on the ImageNet [deng2009imagenet] dataset. For LSTM, we select the typical 2-layer LSTM on the PTB [marcus1994penn] dataset. For the Deep Speech 2 architecture, we train the model on the AN4 dataset. For Transformer, we train the model on the WMT14 EN-DE dataset. The details of the DNN configurations are shown in Table 5 (the FLOPs for Deep Speech 2 are measured for a single time step).
|Networks||# Params (million)||Theoretical MACs (GFLOPs)||Datasets||# Samples||Input Size|
|2-Layer LSTM||66||0.102||PTB||42K||seq length: 20|
|Deep Speech 2||27||0.122/utterance||AN4||948||audio length: 100-400|
|Transformer||65||46.08||WMT14 EN-DE||36M||seq length: 512|
3.3 Evaluation Methods
To give readers a comprehensive view of different tasks and AI accelerators, we use performance, power and energy as the evaluation metrics. For the performance measurements, we measure the average iteration time over 1000 iterations with a given input mini-batch size, and then we calculate the accelerator’s performance in training a particular DNN as the throughput in samples per second (Samples/s). Note that, for display purposes, the unit of samples in all figures is images for CNNs, the number of sentences divided by 10 for LSTM, utterances for Deep Speech 2, and the number of tokens divided by 100 for Transformer. For the power measurement, we sample the system power (in Watts) every 50 ms during the training process using the built-in interfaces provided by the NVIDIA Management Library [NVML] on NVIDIA GPUs and the ROCm System Management Library [rocm-power] on the AMD GPU. We use the average power in Watts as the metric for the power measurement. The energy is derived directly from the measured performance and power. The metric details are defined in Table 6.
|Metric||Definition||Unit|
|Performance||Throughput in processing samples during the training process||Samples per second|
|Power||The electrical energy consumed per second when training with a given mini-batch size||Watt|
|Energy||The electrical energy cost of the computing device to process a sample||J per sample|
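The three metrics relate to each other as sketched below. The function names and all numbers are illustrative assumptions rather than measurements from this paper; in practice the power samples would come from the vendor management libraries mentioned above.

```python
def throughput(batch_size, iteration_times_s):
    """Samples per second, from the average time of the measured iterations."""
    avg_iteration_time = sum(iteration_times_s) / len(iteration_times_s)
    return batch_size / avg_iteration_time


def average_power(power_samples_w):
    """Average of the periodically sampled power readings, in Watts."""
    return sum(power_samples_w) / len(power_samples_w)


def energy_per_sample(avg_power_w, samples_per_s):
    """Joules needed to process one sample: Watts divided by samples per second."""
    return avg_power_w / samples_per_s


# Hypothetical measurements: mini-batch of 64, 0.25 s per iteration,
# and three power readings taken during training.
perf = throughput(64, [0.25, 0.25, 0.25, 0.25])
power = average_power([200.0, 210.0, 190.0])
energy = energy_per_sample(power, perf)
```

With these made-up inputs the sketch yields 256 samples per second, 200 W and roughly 0.78 J per sample, which shows why a faster accelerator can consume less energy per sample even at higher power.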
Measurement Software Tools for ops.
|Ops or DNNs||Accelerator||OS||Software||Libraries|
|Ops, Transformer||Intel CPU||Ubuntu||TensorFlow 1.14.0/ CUDA C++||MKL-2019.4-243|
|NVIDIA GPUs||CUDA-10.0, cuDNN-v7.4|
|CNNs, LSTM, Deep Speech 2||Intel CPU||Ubuntu||PyTorch 1.1||MKL-2019.4-243|
|NVIDIA GPUs||CUDA-10.0, cuDNN-v7.6|
|Google TPUs||-||TensorFlow 1.14.0||-|
Regarding the performance of operations, we use TensorFlow 1.14.0 on TPUs, the Intel CPU, and Tesla V100, and additionally CUDA C++ on NVIDIA GPUs for the op benchmark. The workloads of the two ops are shown in Table 4.
CUDA C++ Settings.
Developed by NVIDIA, CUDA is a software architecture that combines the advantages of the CPU in serial computing and the GPU in parallel computing to solve complex computing problems, especially in deep learning. The DeepBench suite (https://github.com/baidu-research/DeepBench.git) is a benchmarking platform developed by Baidu, which provides performance benchmarks in CUDA C++ on different platforms. We adopted the CUDA C++ code in DeepBench and made two major revisions to it. First, we removed the unnecessary kernels launched during the 300 training iterations to ensure the best performance of cuBLAS. Each op is looped for 300 iterations and its time is averaged over the middle 100 iterations. To make the comparison fair, the sampling strategy is the same for all operation evaluation experiments. Second, we fixed the forward algorithm at IMPLICIT_PRECOMP_GEMM while the input tensors vary, which eliminates differences in the actual forward FLOPs of ops in the experiment.
TensorFlow provides users with the necessary tools for profiling. By printing the timeline of the training process, users can easily get the exact time that TensorFlow spends in the GPU kernels that compute Conv2D and Matmul ops. In this experiment, we use the TensorFlow results on GPUs only as a control group for comparison with the performance of TPU, for which only TensorFlow is highly optimized. The main purpose of testing ops on the GPU with TensorFlow is to ensure that the switch between CUDA and TensorFlow on different AI accelerators is an irrelevant variable in the experiment. In other words, TensorFlow is efficient enough compared with CUDA with respect to calling and using NVIDIA GPU kernels.
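The sampling strategy of looping 300 times and averaging the middle 100 iterations can be sketched as a small timing harness. This is a Python illustration, not the revised DeepBench code: the function names are hypothetical, "middle" is assumed to mean the central window in iteration order, and `op` is assumed to block until the operation has finished (on a GPU this would require an explicit device synchronization).

```python
import time


def mean_of_middle(times, keep):
    """Average the central `keep` entries of `times`, in iteration order."""
    first = (len(times) - keep) // 2
    window = times[first:first + keep]
    return sum(window) / len(window)


def time_op(op, iterations=300, keep=100):
    """Run `op` repeatedly and return the mean time of the middle iterations.

    Skipping the first iterations hides warm-up effects; skipping the last
    ones keeps the window symmetric.
    """
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        op()  # assumed to return only after the work is complete
        times.append(time.perf_counter() - start)
    return mean_of_middle(times, keep)


if __name__ == "__main__":
    avg = time_op(lambda: sum(i * i for i in range(10_000)))
    print(f"average time per call: {avg * 1e3:.3f} ms")
```

Using the same window for every operator keeps the comparison consistent across accelerators, which is the fairness argument made above.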
Measurement Software Tools for DNNs.
For DNNs measurements, we evaluate CPU and GPUs (including NVIDIA and AMD) with PyTorch 1.1 in 1000 rounds, and the CPU and GPU related libraries are shown in Table 7. As TPU mainly supports TensorFlow, we measure the TPU training performance with TensorFlow for the best performance.
4 Experimental Results
In this section, we present the experimental results and discussions, including the performance of low-level mathematical operators (Subsection 4.1), performance of end-to-end training (Subsection 4.2), and power and energy consumption of end-to-end training (Subsection 4.3).
4.1 Low-level Operators Performance
We evaluate the AI Accelerators on the two major mathematical operators (i.e., matrix multiplication and 2D convolution) that are widely used in DNN training. We test the operators with small, medium, and large FLOPs, which represent computation under different workloads for accelerators as shown in Table 4.
4.1.1 CPU Results
Multi-threading and AVX are the two main techniques to exploit the many-core and SIMD capabilities of modern CPUs. To evaluate the performance of operators using an Intel-optimized version of TensorFlow with AVX512 enabled, we vary the number of threads from 1 to 12. The results of Intel Xeon 8163 are shown in Fig. 5. (The CPU instance used in our experiments has 24 vCPUs on a rented Alibaba Cloud ECS Compute Type c5, which is in fact a package of 12 physical cores, serving only half of the peak theoretical FLOPS.) For matrix multiplication, the computing performance is almost linear in the number of threads. However, the number of threads should not be larger than the number of physical cores of the CPU; otherwise, performance may degrade. Regarding the Conv2d operator, we observe that the highest achieved FLOPS is under 1.25 TFLOPS, which is 20% lower than the highest achieved FLOPS of Matmul. The differing results on Matmul and Conv2d indicate that the software optimization of Matmul is better than that of Conv2d, and there may exist further opportunities to improve the efficiency of Conv2d with multi-core and AVX techniques.
We also plot the CPU utilization achieved on these two operators in Fig. 6. The maximum utilization is up to 83% and 68% on Matmul and Conv2d respectively.
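Utilization here is the ratio of achieved FLOPS to the theoretical peak. A minimal sketch of that calculation follows; the function names and the one-second timing are hypothetical, and the peak of 1.92 TFLOPS assumes half of the 3.84 TFLOPS listed in Table 3, in line with the note on the rented instance.

```python
def achieved_tflops(flops: float, seconds: float) -> float:
    """Achieved throughput in TFLOPS for a workload of `flops` operations."""
    return flops / seconds / 1e12


def utilization(achieved: float, peak: float) -> float:
    """Percentage of the theoretical peak that was actually achieved."""
    return 100.0 * achieved / peak


# Hypothetical example: an 8192 x 8192 Matmul (2 * n^3 FLOPs) that takes 1 s.
PEAK_TFLOPS = 3.84 / 2  # assumption: half of the Table 3 peak is available
rate = achieved_tflops(2 * 8192 ** 3, 1.0)
print(f"{rate:.3f} TFLOPS, {utilization(rate, PEAK_TFLOPS):.1f}% of peak")
```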
4.1.2 GPU Results
On NVIDIA Tesla V100 and P100 GPUs, the performance of Matmul and Conv2d ops with the CUDA C++ implementation, converted into FLOPS, is shown in Fig. 7, together with the utilization. V100 achieves up to 97% utilization for Matmul ops and 92% for Conv2d ops with FP32 precision. The utilization of V100 with Tensor Cores is relatively low, which indicates that Tensor Cores are capable of handling heavier workloads and that there is room to improve their efficiency on lighter ones.
On GPUs and TPUs, the performance of Matmul and Conv2d is shown in the right part of Fig. 7. The performance of CUDA C++ and TensorFlow on Tesla V100 with Tensor Cores is approximately the same for Matmul ops, whereas TensorFlow’s implementation is slightly slower for Conv2d ops; the gap shrinks as the workload increases. For all accelerators, higher workloads can better utilize the computing resources to achieve higher throughput. Among GPUs, NVIDIA Tesla V100 has the highest performance on both operators. The results on Tesla V100, which supports FP16, indicate that 16-bit performance is better than its 32-bit counterpart, being nearly 2× higher. The utilization of different accelerators is also shown in Fig. 8. TPU V2 achieves nearly optimal throughput on both operators. On the Matmul operator, Tesla V100 with FP16 also achieves nearly 97% of peak performance. It reaches about 99% on the Conv2d operator, much better than Tesla V100 with Tensor Cores. Among GPUs, the NVIDIA Volta GPU generally has higher utilization than all other GPUs for both operators in FP32 precision. From the analysis above, we conclude that there remains room for optimization on Tesla V100 with Tensor Cores enabled.
4.2 End-to-end Training Performance
4.2.1 CPU Results
The evaluated CPU instance contains 12 physical cores and supports up to 8-way multiprocessing. Multi-threading is a key technique to utilize multiple cores for computation. We first evaluate how the number of threads affects training performance, which is shown in Fig. 9. It can be seen that on the 12-core CPU, 2-way multiprocessing (24 threads) generally achieves the best performance on CNNs and LSTM. However, the best number of threads for Deep Speech 2 in Fig. 9 is 8, which indicates that some workloads should not occupy all the computing resources. Note that in the training process, besides the forward and backward computations, data loading and pre-processing can also be computing-intensive. If all CPU cores are occupied by the forward and backward computations, the data pre-processing thread may lack CPU resources, so that the overall performance becomes even worse.
Note that doubling the number of threads (while staying within the number of physical cores) generally yields an improvement of around 10%-95%, which is much smaller than the expected doubling. The main reason is that the parallelism of end-to-end training on the CPU is mainly at the operator level, which means that multi-threading applies to the parallel operators (e.g., Matmul and Conv2d) but not to the other, serial operators. As we analyzed in Section 4.1, the performance improvement of these operators is around 90%-100% with a doubled number of threads, which means that the serial operators hold back the parallelism of DNNs.
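The effect of serial operators on thread scaling can be illustrated with Amdahl's law. This model is introduced here only as an illustration and is not part of the evaluation; the parallel fractions used below are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup over one worker when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def gain_from_doubling(parallel_fraction: float, threads: int) -> float:
    """Relative improvement obtained by going from `threads` to 2 * threads."""
    before = amdahl_speedup(parallel_fraction, threads)
    after = amdahl_speedup(parallel_fraction, 2 * threads)
    return after / before - 1.0


if __name__ == "__main__":
    for fraction in (1.0, 0.99, 0.9):
        gain = gain_from_doubling(fraction, 6)
        print(f"parallel fraction {fraction:.2f}: +{gain * 100:.0f}% from 6 to 12 threads")
```

Under this model, a workload that is fully parallel doubles its speed when the thread count doubles, while a workload with even 10% serial work gains well under half of that when going from 6 to 12 threads.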
4.2.2 GPU Results
Performance vs Mini-batch Size.
As introduced in the GPU architecture, there exist thousands of cores in current GPUs. When training DNNs, the GPU workload increases linearly with the mini-batch size. Therefore, we first select a representative GPU (i.e., Tesla V100) to demonstrate the performance with respect to the mini-batch size, which is displayed in Fig. 10. From the results, we can conclude that small mini-batch sizes may not occupy all the computing resources of accelerators, so that the accelerators’ computational power is not fully utilized. One should set a proper mini-batch size for particular DNNs and accelerators to achieve maximum performance. For example, on one hand, a mini-batch size of 4096 achieves about 30% higher throughput than a size of 2048 with Transformer. On the other hand, a mini-batch size of 32 performs very close to a size of 16 with Deep Speech 2. It is also clear that mixed precision benefits more as the batch size increases for CNNs, whereas there is no apparent improvement for NLP models. In the later discussion, we always use a mini-batch size that maximizes the performance on a particular accelerator.
The performance of end-to-end training with different DNNs on different GPUs (including NVIDIA and AMD) is shown in Fig. 11. We compare their performance in two applications (i.e., CNNs and NLP models).
The results of training CNNs on GPUs are shown in Fig. 12. It can be seen that Tesla V100 has the best performance with both FP32 and FP16 training among the tested GPUs. With the same numerical precision, NVIDIA V100 generally achieves 1.5-2× higher performance than the AMD Radeon VII GPU. Among NVIDIA GPUs, Tesla P100 and Titan X(Pascal) have very close performance, which is reasonable as these two GPUs have similar peak FLOPS as shown in Table 3. Comparing the two desktop-level GPUs, NVIDIA Titan X(Pascal) and AMD Radeon VII, we notice that Titan X(Pascal) achieves slightly higher performance than Radeon VII, while the peak FLOPS of Radeon VII is about 22% higher than that of Titan X(Pascal). This phenomenon indicates the importance of software optimizations for particular hardware, since highly optimized software can achieve nearly optimal FLOPS. Compared to the highly optimized cuDNN and CUDA libraries on NVIDIA GPUs, ROCm in the AMD software ecosystem is a more recent development.
Note that Deep Speech 2 cannot be successfully run on the Radeon VII GPU as some operators are not supported, so we exclude Radeon VII from the comparison on Deep Speech 2. Similar to the performance on CNNs, NVIDIA GPUs achieve higher performance than the AMD GPU. In particular, Titan X(Pascal) is nearly 1.8× faster than Radeon VII on the 2-layer LSTM. Among NVIDIA GPUs, Tesla V100 always has the best performance. However, different from the results on CNNs, Tesla V100 with mixed precision shows only a slight improvement over its FP32 counterpart on the NLP models. This marginal improvement indicates that the software library should be further optimized for NLP models.
4.2.3 TPU Results.
We first discuss the performance of the two versions of TPUs, which is shown in Fig. 14. On the three evaluated DNNs, TPU v3-8 is about 1.5-1.7× faster than TPU v2-8. However, the peak FLOPS of TPU v3-8 is around 2.3× that of TPU v2-8 as shown in Table 3, which indicates that utilization on TPU v3-8 is much lower than on TPU v2-8. The experimental results suggest that there is still room for software optimization to improve the performance of TPU v3-8.
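The utilization argument above can be made explicit by dividing the observed speedup by the ratio of peak FLOPS. The sketch below uses only the figures quoted in this section and in Table 3; the function name is hypothetical.

```python
def relative_utilization(observed_speedup: float, peak_new: float, peak_old: float) -> float:
    """Utilization of the newer device relative to the older one.

    A value of 1.0 would mean the speedup matches the increase in peak FLOPS.
    """
    return observed_speedup / (peak_new / peak_old)


# TPU v3-8 (420 TFLOPS peak) versus TPU v2-8 (180 TFLOPS peak).
low = relative_utilization(1.5, 420.0, 180.0)
high = relative_utilization(1.7, 420.0, 180.0)
print(f"relative utilization of TPU v3-8: {low:.2f} to {high:.2f}")
```

With speedups of 1.5 to 1.7 against a peak ratio of about 2.3, TPU v3-8 reaches roughly 64% to 73% of the utilization of TPU v2-8.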
4.2.4 Comparison Between GPU and TPU
In the previous subsections, we have seen that NVIDIA Tesla V100 outperforms all other evaluated GPUs, including the AMD GPU, and that TPUs also achieve very high throughput in training DNNs. Here we compare the performance of NVIDIA Tesla V100 and TPUs in training different models. The performance comparison is shown in Fig. 14. TPUs outperform the Tesla V100 GPU on all three evaluated models. On CNNs (ResNet-50 and Inception v3), TPU v2-8 and TPU v3-8 achieve more than 1.5× and 2× higher throughput than Tesla V100 with Tensor Cores, respectively. However, on the Transformer architecture, TPU v2-8 is very close to the Tesla V100 GPU, and TPU v3-8 is around 1.5× faster than Tesla V100.
4.2.5 Analysis of NVIDIA Tensor Core
NVIDIA Tesla V100 is known as the best GPU for training deep learning models, and training DNNs with it in mixed precision is the fastest GPU configuration in our experiments. The high performance of V100 is attributed to the unique design of Tensor Cores, together with the highly optimized CUDA and cuDNN libraries. A key feature of the Volta architecture is that the convolutional parts of a neural network are transformed into matrix calculations, on which Tensor Cores then conduct highly optimized parallel computation. With Tensor Cores enabled, V100 is more than two times faster than its own FP32 training and more than three times faster than P100, although the theoretical FLOPS suggests that the gain could have been higher. This mismatch is caused by the parts of mixed-precision training that require higher precision, as well as by the lower utilization of Tensor Cores compared with FP32/FP16. Nevertheless, the Tensor Cores deliver only about 2/3 of the performance of TPU v2-8, which can reach 180 TFLOPS compared with the 112 TFLOPS of V100 (PCIe).
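As a quick check of the "about 2/3" figure, the ratio of the two peak numbers quoted above can be worked out directly. Note that this compares peak TFLOPS only and says nothing about measured training throughput.

```latex
\[
\frac{\text{peak of V100 (PCIe)}}{\text{peak of TPU v2-8}}
  = \frac{112\ \text{TFLOPS}}{180\ \text{TFLOPS}}
  \approx 0.62 ,
\]
```

which is slightly below 2/3.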
4.3 End-to-end Training Power and Energy
Due to the limitations of power measurement on the CPU and TPUs, we only discuss the power and energy consumption of training DNNs on GPUs (the NVIDIA GPUs and the AMD GPU). We discuss power and energy separately in the following two subsections.
4.3.1 Power Consumption
Generally, desktop-level GPUs support an aggressive range of core clock scaling, which allows users to boost performance, while server-level GPUs only provide conservative options that take stability and energy efficiency as the first priority. The two desktop-level GPUs, Titan X (Pascal) and Radeon VII, feature core clocks of over 1400 MHz, higher than those of the two server-level GPUs, Tesla V100 and P100. Besides, the two server-level GPUs are equipped with HBM running at a low memory clock, which can further improve power efficiency. The measured power of different GPUs is shown in Fig. 15. Most measurements of the three NVIDIA GPUs (i.e., V100, P100 and Titan X (Pascal)) are around 200-300 watts, except on Deep Speech 2; among them, Titan X (Pascal) draws the highest power, over 250 watts. In contrast, the AMD GPU (Radeon VII) shows diverse power consumption across deep learning models (i.e., over 250 watts on ResNet-50 and VGG16, and below 200 watts on Inception v3, LSTM and Transformer).
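For readers who wish to collect comparable power readings on NVIDIA GPUs, the following is a minimal sketch of periodic power sampling. It assumes that `nvidia-smi` is installed and that the GPU reports `power.draw`; it is an illustration only and not necessarily the measurement procedure used for Fig. 15. The helper names `parse_power_watts`, `sample_power_watts` and `mean_power_watts` are hypothetical.

```python
import subprocess
import time
from typing import List


def parse_power_watts(smi_output: str) -> List[float]:
    """Parse one power reading (watts) per non-empty line of CSV output.

    Assumes numeric output; GPUs that report "[N/A]" would raise ValueError.
    """
    return [float(line.strip()) for line in smi_output.splitlines() if line.strip()]


def sample_power_watts(gpu_index: int = 0) -> float:
    """Query the instantaneous board power of one GPU through nvidia-smi."""
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=power.draw",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_power_watts(result.stdout)[0]


def mean_power_watts(num_samples: int = 10, interval_s: float = 1.0,
                     gpu_index: int = 0) -> float:
    """Average several readings taken while a training job is running."""
    readings = []
    for _ in range(num_samples):
        readings.append(sample_power_watts(gpu_index))
        time.sleep(interval_s)
    return sum(readings) / len(readings)

# Example (requires an NVIDIA GPU and driver):
#   avg = mean_power_watts(num_samples=30, interval_s=1.0)
```

The parsing step is kept separate from the query so that it can be checked without a GPU. An equivalent vendor tool would be needed for the AMD GPU.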
4.3.2 Energy Efficiency
The energy efficiency of an accelerator is affected by two factors: power consumption and processing throughput. First, as discussed in Section 4.3.1, the power consumption of the tested GPUs is usually stable across model training tasks, as the dominating GPU kernel functions in deep model training are similar. Second, increasing the batch size generally increases resource utilization and training throughput, which consequently improves energy efficiency. We then compare the energy consumption of different GPUs on different DNN models.
The energy consumption on CNNs with different GPUs is shown in Fig. 17(a). It is reasonable that the two server-level GPUs, Tesla V100 and P100, are mostly the top two in energy efficiency. In particular, V100 has much lower energy consumption than the other GPUs, since it delivers remarkably higher training throughput with a negligible power penalty. Another interesting finding is that larger batch sizes do not reduce the energy consumption of those CNNs and even lead to marginal increases. The reason is that the GPUs achieve similar training throughput on the three CNN models under the two batch sizes, as shown in Fig. 11.
The energy comparison of LSTM is shown in Fig. 17(b). The three NVIDIA GPUs achieve similar energy efficiency under different batch sizes, while the AMD GPU benefits from increasing batch sizes. Since the two GPU vendors provide different software platforms, CUDA and ROCm, the software implementation also results in different resource utilization when the batch size changes.
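The interaction of the two factors can be sketched as follows. This is a minimal illustration, assuming that energy is expressed per training sample as average power divided by throughput; the function name `energy_per_sample_joules` is hypothetical, and the input values are made-up numbers, not measurements from this study.

```python
def energy_per_sample_joules(avg_power_watts: float,
                             throughput_samples_per_s: float) -> float:
    """Energy spent per training sample: watts / (samples per second) = joules."""
    if throughput_samples_per_s <= 0:
        raise ValueError("throughput must be positive")
    return avg_power_watts / throughput_samples_per_s


# Hypothetical readings, NOT taken from Table 8.
small_batch = energy_per_sample_joules(250.0, 500.0)       # 0.5 J per sample
# Larger batch that doubles throughput at slightly higher power: energy drops.
large_batch = energy_per_sample_joules(260.0, 1000.0)      # 0.26 J per sample
# Larger batch with unchanged throughput: energy rises marginally.
flat_throughput = energy_per_sample_joules(260.0, 500.0)   # 0.52 J per sample
```

With stable power, a larger batch size lowers the energy per sample only if it actually raises throughput; if throughput stays flat, any extra power draw translates directly into higher energy.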
Deep Speech 2.
As the connectionist temporal classification (CTC) loss function of Deep Speech 2 is not supported on the AMD GPU, we exclude the AMD result for this case. Among the three NVIDIA GPUs, Tesla V100 is again the winner in energy efficiency. However, Tesla V100 does not achieve any energy efficiency improvement with increasing batch sizes, while P100 and Titan X (Pascal) achieve better energy efficiency with larger batch sizes.
Fig. 17(d) shows the energy consumption of Transformer on different GPUs. Since Transformer includes many GPU kernels that consume less power, the GPU utilization of Transformer is generally lower than other models. Larger batch sizes generally help improve the training throughput, which results in better energy efficiency for all the GPUs.
For ease of reference to the experimental results, all raw numbers of end-to-end training are shown in Table 8.
|Model||DNN||Resnet50||Inception V3||VGG16||2-Layer LSTM||Deep Speech 2||Transformer|
|Note: ’-’ means the item is currently unsupported.|
5 Related Work
Benchmarks are key tools for driving hardware and software toward better targets (e.g., performance and/or energy). In the era of deep learning, training tasks are computationally intensive and resource-consuming. The running time and energy consumption of AI accelerators are the two major concerns for practitioners. Since 2016, deep learning frameworks have been rapidly developed to support many types of processors and accelerators, such as CPUs, GPUs, and TPUs.
Researchers [harmonia2015, shi2016benchmarking] started to evaluate the performance of different deep learning frameworks on different GPUs. However, these works mainly focused on software-level evaluation in terms of performance. Later, the Stanford DAWN deep learning benchmark [coleman2017dawnbench] and MLPerf [mlperf2019] were developed to evaluate training and inference performance under different software and hardware platforms. These two open benchmark platforms have attracted many submissions from vendors, but they mainly focus on end-to-end training performance without analyzing the reasons behind the results. Shams et al. [icdcs2017] evaluated deep learning software tools over several hardware platforms, including a distributed environment. Recently, Wang et al. [wei2019benchmarking] proposed ParaDnn to measure the performance of various hardware, including Intel CPU, NVIDIA GPU and Google TPU, which is the closest to our work. Energy consumption is of great importance to servers that run resource-intensive tasks, but few studies have measured the power and energy consumption of DNN training tasks. One related work is [tang2019impact], which studied the impact of GPU dynamic voltage and frequency scaling (DVFS) on training performance and energy.
In this paper, we made a comprehensive evaluation of the training performance, power and energy consumption of various modern AI accelerators, including an AMD GPU, NVIDIA GPUs, and Google TPUs, covering a representative set of deep neural networks (convolutional neural networks, a recurrent neural network, Deep Speech 2, and Transformer). Our evaluation results provide several levels of comparison, including hardware performance, software utilization, diversity across deep models, power, and energy consumption, for end-users and hardware/software designers. In the future, we will extend our evaluation to more AI accelerators such as FPGAs and more training categories such as AutoML [he2019automl]. Another direction is to benchmark the performance and power of deep learning inference tasks on both server devices and edge/mobile devices.
The research was supported by Hong Kong RGC GRF grant HKBU 12200418. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X (Pascal) used for this research. We also gratefully acknowledge Google for providing TPUs to support our research in the TensorFlow Research Cloud (TFRC) program.