Benanza: Automatic μBenchmark Generation to Compute “Lower-bound” Latency and Inform Optimizations of Deep Learning Models on GPUs

Cheng Li*, Abdul Dakkak* University of Illinois Urbana-Champaign
Urbana, USA
{cli99,dakkak}@illinois.edu
   Jinjun Xiong IBM T. J. Watson Research Center
Yorktown Heights, USA
jinjun@us.ibm.com
   Wen-mei Hwu University of Illinois Urbana-Champaign
Urbana, USA
w-hwu@illinois.edu
Abstract

As Deep Learning (DL) models have been increasingly used in latency-sensitive applications, there has been a growing interest in improving their response time. An important avenue for such improvement is to profile the execution of these models and characterize their performance to identify possible optimization opportunities. However, current profiling tools lack the highly desired abilities to characterize ideal performance, identify sources of inefficiency, and quantify the benefits of potential optimizations. Such deficiencies have led to slow characterization/optimization cycles that cannot keep up with the fast pace at which new DL models are introduced.

We propose Benanza, a sustainable and extensible benchmarking and analysis design that speeds up the characterization/optimization cycle of DL models on GPUs. Benanza consists of four major components: a model processor that parses models into an internal representation, a configurable benchmark generator that automatically generates micro-benchmarks given a set of models, a database of benchmark results, and an analyzer that computes the “lower-bound” latency of DL models using the benchmark data and informs optimizations of model execution. The “lower-bound” latency metric estimates the ideal model execution on a GPU system and serves as the basis for identifying optimization opportunities in frameworks or system libraries. We used Benanza to evaluate ONNX models in MXNet, ONNX Runtime, and PyTorch on GPUs ranging from Kepler to the latest Turing, and identified optimizations in parallel layer execution, cuDNN convolution algorithm selection, framework inefficiency, layer fusion, and using Tensor Cores.


*The two authors contributed equally to this paper.

I Introduction

The past few years have seen a spur of deep learning (DL) innovations. These innovations span from DL models to software stack optimizations (e.g. frameworks such as MXNet or PyTorch, libraries such as cuDNN or MKL-DNN) and hardware stack improvements (e.g. CPU, GPU, FPGA). Among all the innovations, however, DL models are the most rapidly evolving and prolific. This is true in both academia [dean2018new] and industry [hazelwood2018applied], where models are tweaked and introduced on a weekly, daily, or even hourly basis.

Both industry and academia have invested heavily in developing benchmarks to characterize DL models and systems [mlperf, aimatrix, deepbench, dawnbench, deep500]. Characterization is followed by optimizations to improve the model performance. However, there is currently a gap between the benchmarking results and possible optimizations to perform. Researchers use profilers, such as nvprof [nvprof], Nsight [nsight], and VTune [vtune], to profile and get low-level GPU and CPU information. With ample knowledge of how models execute and utilize system resources, researchers manually identify bottlenecks and inefficiencies within model execution using the profilers. Researchers then make hypotheses of solutions, and try out different ideas to optimize the model execution — which may or may not pan out. This manual and ad-hoc process requires a lot of effort and expertise and slows down the turnaround time for model optimization and system tuning.

Figure 1: The GPU kernel time breakdown for all models (listed in Table I) on Tesla_V100 (Table III) using batch size 1. Both cuDNN and cuBLAS invoke child GPU kernel(s) asynchronously during model execution; we therefore measure the time of the kernels launched by the cuDNN and cuBLAS APIs rather than the time of the API calls themselves.

Thus there is a need for a systematic DL benchmarking and subsequent analysis design that can guide researchers to potential optimization opportunities and assess hypothetical execution scenarios. Since for GPUs model execution latency is determined by the hardware, framework, and system libraries (primarily cuDNN [cudnn] and cuBLAS [cublas] for DL), answers to the following questions are highly desired by researchers:

Q1: What is the potential latency speedup if optimizations are performed?

Q2: Are independent layers executed in parallel?

Q3: Are convolution layers using the optimal convolution algorithms?

Q4: Are there any inefficiencies or unexpected behavior in a framework?

Q5: Does the execution fuse layers, and what are the benefits?

Q6: Does the execution leverage Tensor Cores, and what are the benefits?

We motivate our design by answering these questions, while ensuring the sustainability and extensibility of the design.

To answer these questions, we first propose a new benchmarking metric: “lower-bound” latency. The “lower-bound” latency estimates the ideal latency of a DL model given a software and hardware stack, and is based on the following observations: (1) DL models are executed as layers in frameworks and thus layers form the performance building blocks of DL models. (2) Frameworks delegate execution of common layers to either cuDNN or cuBLAS (shown in Figure 1). The “lower-bound” latency is defined in terms of the latencies of the cuDNN and cuBLAS API functions invoked by model layers (framework overhead and memory transfers are ignored). We refine the “lower-bound” latency and define it under sequential execution mode (all layers are executed sequentially) and parallel execution mode (data-independent layers are executed asynchronously).
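Stated more formally (our own notation, consistent with the definitions above): let $\mathcal{L}$ be the layers of a model that map to cuDNN or cuBLAS functions, $t(\ell)$ the benchmarked latency of the function(s) invoked for layer $\ell$, and $G$ the model's layer graph. Then

$$
T_{\text{sequential}} = \sum_{\ell \in \mathcal{L}} t(\ell),
\qquad
T_{\text{parallel}} = \max_{p \,\in\, \text{paths}(G)} \sum_{\ell \in p} t(\ell),
$$

where the maximum is taken over the start-to-end paths of $G$, i.e. the parallel “lower-bound” is the latency of the critical path (Section III-D1).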

This paper presents Benanza (pronounced bonanza) — a sustainable and extensible benchmarking and analysis design. Benanza consists of a set of modular components: (1) a model processor to process input ONNX models into a set of unique layers (layers are considered the same if they have the same layer type, shape, and parameters), (2) a benchmark generator to automatically generate parameterized cuDNN and cuBLAS micro-benchmarks from the unique layers, (3) a performance database to store historical benchmark results, and (4) an analyzer to compute the “lower-bound” latency of DL models and inform potential optimizations (Q1-Q6).

Benanza is architected to be sustainable. The benchmarking workflow of Benanza is highly automated and minimizes the benchmark development and maintenance effort. Benanza uses the observation that DL models have repeated (i.e. non-unique) layers within and across models to decrease the time to benchmark. When a new model is introduced, only the layers that have not been benchmarked before (i.e. are not in the performance database) need to be benchmarked. Although the focus of the paper is on NVIDIA GPUs using cuDNN and cuBLAS, the proposed design is extensible and users can incorporate other benchmark runtimes that target other software libraries or hardware, such as framework APIs or MKL-DNN for CPUs.

In summary, this paper makes the following contributions:


  • We propose a “lower-bound” latency metric for DL models based on the observation that the latency of a DL model is bounded by the latencies of the cuDNN and cuBLAS API calls invoked by the model layers. This metric estimates the ideal latency of a model given a specific GPU hardware and software stack.

  • We present Benanza, a benchmarking and analysis design that automatically generates micro-benchmarks given a set of models, computes their “lower-bound” latencies using the benchmark data, and informs optimizations of their executions on GPUs. The sustainable and extensible design of Benanza enables it to cope with the fast evolution of DL innovations.

  • Using Benanza, we characterized the “lower-bound” latencies of the 30 ONNX models (shown in Table I) in MXNet, ONNX Runtime, and PyTorch on the 7 systems shown in Table III. We performed a comprehensive “lower-bound” latency analysis as we varied the model, execution mode, batch size, and system. E.g., when using parallel execution mode, up to (with a geometric mean of across models) latency speedup could be made to MXNet using batch size on the Tesla_V100 system.

  • We further identified optimization opportunities through Benanza in cuDNN convolution algorithm selection (up to geometric mean speedup across models), inefficiencies within MXNet (up to speedup across models) and PyTorch (up to speedup using batch size ), layer fusion, and Tensor Cores (up to and speedup for ResNet50-v1 respectively). We evaluated the above optimizations jointly and get up to speedup for ResNet50-v1 across systems and batch sizes.

  • The usage and evaluation of Benanza are posted at benanza.mlmodelscope.org for public inspection.

Figure 2: The Benanza design and workflow.

II Background and Motivation

II-A DL Model Execution and ONNX Format

A DL model is an execution graph where each vertex is a layer operator (e.g. convolution, activation, normalization, pooling, or softmax). These layer operators (or layers for short) are functions defined by a DL framework. A framework executes a model by traversing the model graph in topological order and enqueuing the layers into an execution queue. Although sequential evaluation is always valid, frameworks strive to execute data-independent layers within the queue in parallel. Through careful execution scheduling, a framework can overlap communication with computation, increase utilization, etc. Regardless of the execution strategy, however, layer execution latency is the limiting factor for model execution. Thus, layers are not only the building blocks with which developers define models, but also the atomic components that define a model’s performance characteristics.

Each framework provides its own API, layer definition semantics, model storage format, and model execution strategy. To increase interoperability between frameworks, there have been concerted efforts [onnx, nnef] to standardize layer definitions and model exchange formats. A leading effort is the Open Neural Network Exchange Format (ONNX), which has wide industry and framework backing. Frameworks such as Caffe2, CNTK, MXNet, Paddle, PyTorch, and TensorRT readily support ONNX, and converters exist for other frameworks such as TensorFlow and Caffe. To perform a fair comparison between frameworks (by evaluating them using the same ONNX model), and, more importantly, to make Benanza framework-agnostic, we choose ONNX as the model input format for Benanza. ONNX hosts all of its models publicly [onnxzoo], and we select the 30 vision models out of those available at the time of writing for evaluation (the models not selected are non-vision models). The selected models cover an array of tasks and are listed in Table I. We refer to these models by their IDs throughout the paper.

ID Name Task MACs # Layers Year
1 Arcface [DBLP:journals/corr/abs-1801-07698] FR 12.08G 412 2018
2 BVLC-Alexnet [alexnet] IC 656M 24 2012
3 BVLC-Caffenet [alexnet] IC 721M 24 2012
4 BVLC-Googlenet [googlenet] IC 1.59G 143 2014
5 BVLC-RCNN-ILSVRC13 [DBLP:journals/corr/GirshickDDM13] IC 718M 23 2013
6 Densenet-121 [DBLP:journals/corr/HuangLW16a] IC 2.87G 910 2016
7 DUC [DBLP:journals/corr/WangCYLHHC17] SS 34.94G 355 2017
8 Emotion Ferplus [DBLP:journals/corr/BarsoumZCZ16] ER 877M 52 2016
9 Inception-v1 [DBLP:journals/corr/IoffeS15] IC 1.44G 144 2015
10 Inception-v2 [DBLP:journals/corr/SzegedyVISW15] IC 2.03G 509 2015
11 LeNet [lecun1998gradient] HR 796K 12 2010
12 MobileNet-v2 [DBLP:journals/corr/HowardZCKWWAA17] IC 437M 155 2017
13 Resnet18-v1 [DBLP:journals/corr/HeZRS15] IC 1.82G 69 2015
14 Resnet18-v2 [DBLP:journals/corr/HeZR016] IC 1.82G 69 2016
15 Resnet34-v1 [DBLP:journals/corr/HeZRS15] IC 3.67G 125 2015
16 Resnet34-v2 [DBLP:journals/corr/HeZR016] IC 3.67G 125 2016
17 Resnet50-v1 [DBLP:journals/corr/HeZRS15] IC 3.87G 175 2015
18 Resnet50-v2 [DBLP:journals/corr/HeZR016] IC 4.10G 174 2016
19 Resnet101-v1 [DBLP:journals/corr/HeZRS15] IC 7.58G 345 2015
20 Resnet101-v2 [DBLP:journals/corr/HeZR016] IC 7.81G 344 2016
21 Resnet152-v1 [DBLP:journals/corr/HeZRS15] IC 11.30G 515 2015
22 Resnet152-v2 [DBLP:journals/corr/HeZR016] IC 11.53G 514 2016
23 Shufflenet [DBLP:journals/corr/ZhangZLS17] IC 127M 203 2015
24 Squeezenet-v1.1 [DBLP:journals/corr/IandolaMAHDK16] IC 352M 66 2016
25 Tiny Yolo-v2 [DBLP:journals/corr/RedmonF16] OD 3.13G 32 2016
26 Vgg16-BN [DBLP:journals/corr/SimonyanZ14a] IC 15.38G 54 2014
27 Vgg16 [DBLP:journals/corr/SimonyanZ14a] IC 15.38G 41 2014
28 Vgg19-bn [DBLP:journals/corr/SimonyanZ14a] IC 19.55G 63 2014
29 Vgg19 [DBLP:journals/corr/SimonyanZ14a] IC 19.55G 47 2014
30 Zfnet512 [DBLP:journals/corr/ZeilerF13] IC 1.48G 22 2013
Table I: The 30 ONNX models used are vision models which encompass image classification (IC), object detection (OD), face recognition (FR), emotion recognition (ER), semantic segmentation (SS), or hand digit recognition (HR) tasks.
Layer Type cuDNN / cuBLAS API Tensor Core Support
Convolution cudnnConvolutionForward
Activation cudnnActivationForward
BatchNorm cudnnBatchNormalizationForwardInference
Conv+Bias+Activation cudnnConvolutionBiasActivationForward
RNN cudnnRNNForwardInference
Dropout cudnnDropoutForward
Pooling cudnnPoolingForward
Softmax cudnnSoftmaxForward
Add cudnnAddTensor
Element-wise cudnnOpTensor
Rescale cudnnScaleTensor
GEMM cublas*Gemm / cublasGemmEx
GEMV cublasSgemv
Table II: Eleven layer types are supported by cuDNN and two layer types are supported by cuBLAS. Each API may have auxiliary functions to setup its arguments (e.g. cudnnSetTensor4dDescriptor to specify a tensor’s dimensions and cudnnSetConvolution2dDescriptor to configure the convolution API). The convolution, RNN, and GEMM APIs have Tensor Core support.

II-B cuDNN and cuBLAS


Figure 3: The percentage of layers supported by cuDNN and cuBLAS (also covered by Benanza) for each model in Table I.

Much like BLAS and LAPACK are the backbone of HPC computing, cuDNN and cuBLAS are the backbones of the GPU software stacks for DL. cuDNN is a GPU-accelerated library that provides highly tuned implementations of DL layers such as convolution, pooling, normalization, and activation. cuBLAS is a GPU-accelerated BLAS library that provides fast implementations of GEMM and GEMV. The DL layers supported by each API are listed in Table II. While there is a wide array of DL frameworks, common to all of them is the reliance on these primitives defined by cuDNN and cuBLAS. In fact, all major DL frameworks, such as MXNet, PyTorch, ONNX Runtime, and TensorFlow, rely on cuDNN/cuBLAS API functions for the implementation of common layers.

Figure 3 shows the percentage of layers supported by cuDNN and cuBLAS for each model in Table I. Most layers within DL models are covered by the cuDNN and cuBLAS APIs. The layers that are not supported are non-compute operators (such as concatenate, which joins two tensors across a specified axis) or datatype manipulations (such as reshape, which changes the dimensions of a tensor). For example, the cuDNN and cuBLAS functions support a comparatively low share of the layers within Inception-v2 (ID = 10). This is because Inception-v2 makes heavy use of unsqueeze — a tensor reshape layer; a sizable fraction of the layers in Inception-v2 are unsqueeze layers.

Given a specific DL software stack (e.g. framework, cuDNN, cuBLAS, and other CUDA libraries) and GPU hardware, the cuDNN and cuBLAS functions invoked by a model are fixed. Most common layers are supported by cuDNN and cuBLAS and the latency attributed to cuDNN and cuBLAS functions is significant with respect to the model’s end-to-end latency. Figure 1 shows that for the vision models, the time spent within the cuDNN and cuBLAS API calls dominates the model execution time. The “other” time is due to either memory operations, synchronization, the framework’s choice of not using cuDNN API for certain operations, or other framework code that is neither cuDNN nor cuBLAS.

Based on the above observations, we propose a “lower-bound” latency metric for DL models. The “lower-bound” metric is defined by the latencies of the cuDNN and cuBLAS functions executed for the model layers within a specific software/hardware stack. The “lower-bound” latency is computed under different execution scenarios to determine if optimizations can be made, pinpoint where optimizations are, and quantify the potential benefits of optimizations, as detailed in Section III.

III Benanza Design and Implementation

Benanza consists of four main components: Model Processor, Automatic Benchmark Generator, Performance Database, and Analyzer. The components are shown in Figure 2 and are used in the benchmarking and analysis workflows:


  • Benchmarking workflow: ❶ The Model Processor takes ONNX models, parses them, performs shape inference, and finds the set of unique layers within the models. Two layers are considered the same (non-unique) if they have the same operator type and parameters (i.e. they only differ in weight values). ❷ The Automatic Benchmark Generator then generates micro-benchmarks for each unique layer. The generated micro-benchmarks measure the latency (or the GPU kernel metrics if profiling mode is enabled) of the corresponding cuDNN or cuBLAS function calls for the layers. ❸ The micro-benchmarks are then run on systems of interest and the results are stored in the Performance Database.

  • Analysis workflow: ❹ The user runs the target model using a framework on a system of interest with utilities provided by Benanza to get the model execution profile (i.e. the end-to-end latency, cuDNN and cuBLAS logs, and Nsight profile). ❺ The user then specifies the model and system to Benanza. The model is parsed into layers and the Analyzer queries the latencies of each layer from the Performance Database (using the layers and system information provided) to compute the (Q1) “lower-bound” latency under different execution scenarios. By analyzing the model execution profile and the computed “lower-bound”, the Analyzer informs optimizations in: (Q2) parallel execution of independent layers, (Q3) convolution algorithm selection, (Q4) framework inefficiency, (Q5) layer fusion, and (Q6) Tensor Core usage.

III-A Benanza Model Processor

The ❶ Model Processor parses ONNX models into Benanza’s internal representation (IR). The IR wraps around the ONNX Protobuf and has the same layer coverage. Since ONNX models do not have layer shape information embedded (except for the input layers), shape inference [shapeinfer] is performed to determine the shape of each layer. Layers in the IR (which correspond to the ONNX nodes and are referred to simply as layers) are annotated with the inferred shapes. Benchmarks are generated for each layer using its type, shape, and parameter information.
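For illustration, the unique-layer computation can be sketched as follows (our own simplification; Benanza’s actual IR wraps the ONNX Protobuf and is not reproduced here): a canonical signature is built from each layer’s type, inferred shape, and parameters, and only the first layer with a given signature is kept for benchmarking.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Sketch: a layer as seen by the Model Processor after shape inference.
// Field contents are illustrative; the real IR stores structured attributes.
struct Layer {
  std::string type;    // e.g. "Conv", "Relu", "Gemm"
  std::string shape;   // serialized inferred input/output shapes
  std::string params;  // serialized attributes (stride, pad, dilation, ...)
};

// Two layers are the same (non-unique) if they share type, shape, and
// parameters, i.e. they only differ in weight values.
static std::string Signature(const Layer& l) {
  return l.type + "|" + l.shape + "|" + l.params;
}

std::vector<Layer> UniqueLayers(const std::vector<Layer>& layers) {
  std::unordered_set<std::string> seen;
  std::vector<Layer> unique;
  for (const auto& l : layers) {
    if (seen.insert(Signature(l)).second) {
      unique.push_back(l);  // only this representative is benchmarked
    }
  }
  return unique;
}
```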

Figure 4: The percentage of unique layers within the models.

We observe that layers with the same type, shape, and parameters (i.e. differing only in weight values) are repeated extensively within and across models. Figure 4 shows that most models have a low percentage of unique layers — indicating that layers are repeated extensively within the model. For example, ResNet50-v1 (ID=17) has 175 layers, but only a fraction of them are unique. The number of unique layers across models of similar architecture is also low. The ResNet*-v1 models (ID=13, 15, 17, 19, 21) are built from the same modules, and only a small number of their combined layers are unique. Across all models, only a small fraction of the total layers are unique. We exploit this layer repeatability to optimize benchmark generation and minimize the time to benchmark. Thus, the Model Processor unifies the repeated layers across the input models and produces a set of unique layers. The time saved can be used to explore benchmarks for other algorithms and data types (Sections III-B2 and III-B3).

III-B Automatic Benchmark Generator

The ❷ Automatic Benchmark Generator uses the set of unique layers (produced by the Model Processor) and generates C code that invokes the benchmark runtime using each layer’s type, shape, and parameter information.

III-B1 The Benchmark Runtime

Benanza provides a benchmark runtime that measures the latency of the cuDNN or cuBLAS API required to execute each layer (as shown in Table II). The runtime also sets up the function arguments for each API. The setup time is not included in the latency measurement. The runtime uses the Google Benchmark [googlebenchmark] library — a micro-benchmarking support library. The Google Benchmark library dynamically determines the number of iterations to run each benchmark and ensures that the reported latency results are statistically stable. Generated benchmarks are linked with the cuDNN/cuBLAS libraries, and are run on systems of interest.
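To make the generated code concrete, below is a hand-written sketch (not Benanza’s actual generated output) of what such a cuDNN convolution micro-benchmark could look like on top of Google Benchmark; the layer shape and algorithm are hard-coded placeholders, error checking is omitted, and, as described above, descriptor and buffer setup are excluded from the timed region.

```cpp
#include <benchmark/benchmark.h>
#include <cuda_runtime.h>
#include <cudnn.h>

// Sketch of a generated benchmark for one unique convolution layer
// (placeholder shape: N=1, C=64, H=W=56, K=64, 3x3 filter, pad 1, stride 1).
static void LAYER_CUDNN_CONV_FWD(benchmark::State& state) {
  cudnnHandle_t handle;
  cudnnCreate(&handle);

  cudnnTensorDescriptor_t x_desc, y_desc;
  cudnnFilterDescriptor_t w_desc;
  cudnnConvolutionDescriptor_t conv_desc;
  cudnnCreateTensorDescriptor(&x_desc);
  cudnnCreateTensorDescriptor(&y_desc);
  cudnnCreateFilterDescriptor(&w_desc);
  cudnnCreateConvolutionDescriptor(&conv_desc);

  // The layer's type, shape, and parameters are filled in from the unique-layer IR.
  cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 64, 56, 56);
  cudnnSetFilter4dDescriptor(w_desc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 64, 64, 3, 3);
  cudnnSetConvolution2dDescriptor(conv_desc, 1, 1, 1, 1, 1, 1,
                                  CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
  int n, c, h, w;
  cudnnGetConvolution2dForwardOutputDim(conv_desc, x_desc, w_desc, &n, &c, &h, &w);
  cudnnSetTensor4dDescriptor(y_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

  // One benchmark is instantiated per applicable algorithm (Section III-B2).
  cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
  size_t ws_size = 0;
  cudnnGetConvolutionForwardWorkspaceSize(handle, x_desc, w_desc, conv_desc,
                                          y_desc, algo, &ws_size);

  void *d_x, *d_w, *d_y, *d_ws;
  cudaMalloc(&d_x, 1UL * 64 * 56 * 56 * sizeof(float));
  cudaMalloc(&d_w, 64UL * 64 * 3 * 3 * sizeof(float));
  cudaMalloc(&d_y, (size_t)n * c * h * w * sizeof(float));
  cudaMalloc(&d_ws, ws_size);

  const float alpha = 1.0f, beta = 0.0f;
  for (auto _ : state) {  // only the API call and its child kernels are timed
    cudnnConvolutionForward(handle, &alpha, x_desc, d_x, w_desc, d_w, conv_desc,
                            algo, d_ws, ws_size, &beta, y_desc, d_y);
    cudaDeviceSynchronize();  // wait for the asynchronously launched kernel(s)
  }

  cudaFree(d_x); cudaFree(d_w); cudaFree(d_y); cudaFree(d_ws);
  cudnnDestroyConvolutionDescriptor(conv_desc);
  cudnnDestroyFilterDescriptor(w_desc);
  cudnnDestroyTensorDescriptor(x_desc);
  cudnnDestroyTensorDescriptor(y_desc);
  cudnnDestroy(handle);
}
BENCHMARK(LAYER_CUDNN_CONV_FWD)->UseRealTime();
BENCHMARK_MAIN();
```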

III-B2 Algorithm Instantiation

The convolution layers map to the cudnnConvolutionForward API (Table II). The convolution API takes one of the following algorithms as an argument: Implicit GEMM (IGEMM), Implicit PreComputed GEMM (IPGEMM), GEMM, Direct (DRCT), FFT, Tiled FFT (TFFT), Winograd (WING), and Winograd Non-Fused (WINGNF). These algorithms have different compute and memory characteristics [anderson2018optimal, ben2018demystifying]. The optimal algorithm to use depends on the system, layer shape, and layer parameters (e.g. filter size, stride, dilation, etc.) [cudnn]. For inference, most frameworks (e.g. MXNet, PyTorch, TensorFlow) rely on the cuDNN provided heuristic function (cudnnGetConvolutionForwardAlgorithm) to choose the convolution algorithm. The heuristic function suggests an algorithm given the layer’s shape, parameters, data type, system, etc. To explore the design space of algorithm selection, by default, for each layer Benanza generates benchmarks using all algorithms applicable to the layer.
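To illustrate how this design space could be enumerated (a sketch under the assumption that the handle and descriptors are configured as in the benchmark sketch of Section III-B1), one can iterate over the cudnnConvolutionFwdAlgo_t values and keep the algorithms that cuDNN accepts for the layer, for example those for which a workspace size can be computed:

```cpp
#include <vector>
#include <cudnn.h>

// Sketch: enumerate the candidate forward-convolution algorithms for one layer.
// handle, x_desc, w_desc, conv_desc, and y_desc are assumed to be configured
// for the layer as in the benchmark sketch of Section III-B1.
std::vector<cudnnConvolutionFwdAlgo_t> ApplicableAlgorithms(
    cudnnHandle_t handle, cudnnTensorDescriptor_t x_desc,
    cudnnFilterDescriptor_t w_desc, cudnnConvolutionDescriptor_t conv_desc,
    cudnnTensorDescriptor_t y_desc) {
  std::vector<cudnnConvolutionFwdAlgo_t> applicable;
  for (int a = 0; a < CUDNN_CONVOLUTION_FWD_ALGO_COUNT; ++a) {
    auto algo = static_cast<cudnnConvolutionFwdAlgo_t>(a);
    size_t ws_size = 0;
    cudnnStatus_t st = cudnnGetConvolutionForwardWorkspaceSize(
        handle, x_desc, w_desc, conv_desc, y_desc, algo, &ws_size);
    if (st == CUDNN_STATUS_SUCCESS) {
      applicable.push_back(algo);  // one micro-benchmark per accepted algorithm
    }
  }
  return applicable;
}
```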

III-B3 Data Type Support

Benanza can be configured to generate micro-benchmarks that target different data types. Both float16 and float32 are generated by default, but benchmarks can be instantiated for other data types. The float16 benchmarks use Tensor Cores when the API function (see Table II) and system (see Table III) supports it.
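As an illustration of what the float16 variant of the convolution benchmark might change relative to the float32 sketch in Section III-B1 (an assumption on our part, following cuDNN’s documented usage rather than Benanza’s actual generated code), the descriptors switch to CUDNN_DATA_HALF and the convolution descriptor opts in to Tensor Core math:

```cpp
// Sketch: float16 variant of the convolution benchmark of Section III-B1.
// Only the data-type-related calls change; shapes and the timing loop stay the same.
cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF, 1, 64, 56, 56);
cudnnSetFilter4dDescriptor(w_desc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW, 64, 64, 3, 3);
cudnnSetConvolution2dDescriptor(conv_desc, 1, 1, 1, 1, 1, 1,
                                CUDNN_CROSS_CORRELATION, CUDNN_DATA_HALF);
// Allow Tensor Core kernels where the algorithm and the GPU support them.
cudnnSetConvolutionMathType(conv_desc, CUDNN_TENSOR_OP_MATH);
cudnnSetTensor4dDescriptor(y_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF, n, c, h, w);
```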

III-B4 Layer Fusion Support

Benanza can be configured to generate micro-benchmarks that target the cuDNN fused API (cudnnConvolutionBiasActivationForward) to perform the convolution, bias, and activation layer sequence. Two fusion pattern rules are currently handled by Benanza: ConvBiasActivation and ConvBias. The ConvBiasActivation maps directly to the fused API. Fusing ConvBias is implemented through the fused API using CUDNN_ACTIVATION_IDENTITY as the activation function and requires cuDNN version . For older cuDNN versions, the ConvBias is implemented as two calls — a cudnnConvolutionForward followed by a cudnnAddTensor. Users can extend Benanza’s fusion support by registering new fusion patterns as the cuDNN fused API evolves.
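For reference, a sketch of how the ConvBias pattern might be expressed through the fused API is shown below (illustrative code under the same descriptor setup as Section III-B1, not Benanza’s generated output); with an identity activation and alpha2 = 0, the call computes conv(x) + bias. Per the cuDNN documentation, the identity-activation path is restricted to the IMPLICIT_PRECOMP_GEMM algorithm.

```cpp
// Sketch: Conv+Bias through cudnnConvolutionBiasActivationForward.
// handle, descriptors, buffers, and workspace are assumed to be set up as in
// Section III-B1; bias_desc describes a 1xKx1x1 tensor and d_bias holds its data.
cudnnActivationDescriptor_t act_desc;
cudnnCreateActivationDescriptor(&act_desc);
cudnnSetActivationDescriptor(act_desc, CUDNN_ACTIVATION_IDENTITY,
                             CUDNN_NOT_PROPAGATE_NAN, 0.0);

const float alpha1 = 1.0f;  // scales the convolution result
const float alpha2 = 0.0f;  // zero: the z input is ignored, leaving conv(x) + bias
cudnnConvolutionBiasActivationForward(
    handle, &alpha1, x_desc, d_x, w_desc, d_w, conv_desc,
    CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM, d_ws, ws_size,
    &alpha2, y_desc, d_y /* z reuses the output buffer */,
    bias_desc, d_bias, act_desc, y_desc, d_y);

cudnnDestroyActivationDescriptor(act_desc);
```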

III-B5 Integration with CUPTI

Benanza can be configured to generate benchmarks that integrate with low-level GPU profiler libraries such as NVIDIA’s CUPTI [cupti]. This allows Benanza to capture detailed GPU metrics [gpumetrics] of benchmarks such as flops, memory transfers, etc. In this mode, the user specifies the metrics of interest, the number of benchmark iterations for warm-up, and the number of iterations to measure. Benanza does not use Google Benchmark in this mode since a fixed, small number of profiling runs suffices for statistically stable measurement of the metrics. The profiling outputs (name, timing, and metric values of GPU kernels) are stored as metadata to the corresponding benchmark entry in the Performance Database.

III-C Performance Database

The ❸ benchmarking results are collected and published to Benanza’s Performance Database. Each entry within the database is indexed by the system, data type, and layer (type, shape, and parameter information). The Analyzer queries the database to get the benchmark latencies. If a query misses, a warning with information about the missing benchmark is issued to the user, and the user is asked whether they wish the Automatic Benchmark Generator to generate the missing benchmarks.
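The paper does not prescribe a storage schema; as a minimal illustration of the index described above, an entry could be keyed roughly as follows (type and field names are hypothetical):

```cpp
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical index for a Performance Database entry (names are illustrative).
// A query is a lookup on (system, data type, layer type, shape, parameters).
using BenchmarkKey = std::tuple<std::string,   // system, e.g. "Tesla_V100"
                                std::string,   // data type, e.g. "float32"
                                std::string,   // layer type, e.g. "Convolution"
                                std::string,   // serialized layer shape
                                std::string>;  // serialized layer parameters
// Each key maps to the measured latencies (one per instantiated algorithm);
// the Analyzer takes the minimum when computing the "lower-bound" latency.
using PerformanceDatabase = std::map<BenchmarkKey, std::vector<double>>;
```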

Figure 5: The first parallel module of Inception-v1 in Figure 8 visualized by the Benanza Analyzer. The layers are annotated with the name, type, and latency used for the “lower-bound” calculation. The critical path used in the parallel mode is highlighted in red.

III-D Benanza Analyzer

The ❹ user runs the target model using a framework on a system of interest with utilities provided by Benanza to get the model execution profile. The model execution profile contains information about the model’s end-to-end latency, cuDNN and cuBLAS logs, and Nsight profile (which contains cuDNN/cuBLAS API calls and function backtrace information). Capturing the model end-to-end latency requires the user to place the provided timing functions within their application code. To capture the usage of cuDNN and cuBLAS functions within a framework, Benanza launches the user code with the CUDNN_LOGINFO_DBG and CUBLAS_LOGINFO_DBG environment variables. These environment variables enable the cuDNN and cuBLAS loggers respectively. Utilities to run the user code using NVIDIA’s Nsight profiler are also provided. The results from Nsight are parsed and correlated with the cuDNN and cuBLAS logs.

The ❺ user then inputs the model execution profile along with the ONNX model, system, and data type. The model is parsed by the Model Processor into layers. Then, the Benanza Analyzer queries the Performance Database for the benchmark latencies of each layer using the user-specified system and data type (by default float32). Due to algorithm instantiation (Section III-B2), multiple benchmarks may exist for a layer. The Analyzer, therefore, selects the benchmark result that achieves the lowest latency. The following analyses are then performed:

III-D1 (Q1, Q2) Sequential and Parallel “Lower-Bound” Latency

DL models may contain layer sequences which can be executed independently in parallel. The sub-graph formed by these data-independent layer sequences is called a parallel module. For example, a parallel module in Inception-v1 is shown in Figure 5. A framework may execute the independent paths within the parallel module either sequentially or in parallel. Thus, the Analyzer computes the “lower-bound” latency of a model using two execution modes: sequential and parallel.

The sequential mode assumes that independent layers are executed sequentially, and the sequential “lower-bound” latency is therefore defined as the sum of each layer’s benchmark latency. The parallel strategy assumes that data-independent layers are executed in parallel. Therefore, the parallel “lower-bound” latency is defined by the model’s critical path — the simple path from the start to the end layer with the highest latency. Finding the critical path of a graph is a longest path problem and is NP-hard in general. Since a DL model forms a directed acyclic graph (DAG), the critical path can be framed as a shortest path problem [Sedgewick:2011:ALG:2011916]. To compute the critical path we construct a weighted DAG from the model graph where the edge weight between two nodes (layers) is the negative of the latency of the layer at the tail of the edge. Computing the shortest path from the start to the end layer of the constructed weighted DAG produces the critical path of the model. The parallel “lower-bound” latency is the sum of the layer latencies along the critical path. Benanza visualizes the critical path of the model (e.g. Figure 5), and the difference between the sequential and parallel “lower-bound” latencies indicates the potential benefit of executing independent layers in parallel. Other analyses performed by Benanza leverage the sequential and parallel “lower-bound” latencies, and the benefits can be calculated in terms of either the sequential or the parallel mode.
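As a sketch of this computation (our own illustration, assuming the layer graph is already available in topological order with per-layer benchmark latencies), the parallel “lower-bound” can equivalently be computed as a longest-path dynamic program over the DAG, which is what the negated-weight shortest-path formulation amounts to:

```cpp
#include <algorithm>
#include <vector>

// Sketch: parallel "lower-bound" latency as the longest (critical) path of a DAG.
// `order` is a topological ordering of layer ids, `succ[v]` lists v's successors,
// and `latency[v]` is the benchmarked latency of layer v.
double ParallelLowerBound(const std::vector<int>& order,
                          const std::vector<std::vector<int>>& succ,
                          const std::vector<double>& latency) {
  std::vector<double> finish(latency.size(), 0.0);  // best finish time per layer
  double critical = 0.0;
  for (int v : order) {
    finish[v] += latency[v];  // max predecessor finish time plus v's own latency
    critical = std::max(critical, finish[v]);
    for (int u : succ[v]) {   // relax outgoing edges in topological order
      finish[u] = std::max(finish[u], finish[v]);
    }
  }
  return critical;  // sum of the layer latencies along the critical path
}
```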

III-D2 (Q3) Convolution Algorithm Selection

The Analyzer uses the parsed cuDNN log in the model execution profile to determine if the cuDNN algorithm used by the framework for each layer is optimal (recall from Section III-B2 that benchmark results using all available algorithms for layers exist in the Performance Database). Cases where the algorithm choice is sub-optimal are reported to the user along with how much end-to-end latency improvement could be gained if algorithm selection was ideal. The user can act upon these suggestions by forcing the framework to use specific algorithms.

III-D3 (Q4) Framework Inefficiency Inspection

The expected cuDNN and cuBLAS API calls are known to the Analyzer from the “lower-bound” latency computation. The Analyzer compares the model execution profile against the expected execution to pinpoint inefficiencies within the framework. The user is presented with any deviation observed in cuDNN or cuBLAS API invocation parameters or their execution order. CUDA API functions and CUDA kernels executed between cuDNN or cuBLAS API calls are also presented to the user — along with their backtraces.

III-D4 (Q5) Layer Fusion Analysis

If the user enables benchmark generation for layer fusion (as described in Section III-B4), then the Analyzer can be used to determine the potential profitability if layer fusion is employed. The Analyzer traverses the model layers and looks for the fusion pattern rules (listed in Section III-B4). If one of these patterns is found, then the corresponding fused operation’s latency is queried from the database and is used in the “lower-bound” computation (in either sequential or parallel mode). If the benchmark is unavailable, or failed to run, then the latencies of the non-fused layers are used. The difference between the non-fused “lower-bound” latency and the fused “lower-bound” latency determines the profitability of layer fusion.

III-D5 (Q6) Tensor Core Analysis

The Analyzer determines if the target model execution utilizes Tensor Cores by looking at the kernel names in the model execution profile. Kernel names that match the regular expression _[ish]\d+* use Tensor Cores. By default, benchmarks targeting both float16 and float32 are generated. When benchmarks are run on systems with Tensor Core support, the difference between the “lower-bound” latency of float32 and float16 informs the profitability of using Tensor Cores and float16.

III-E Sustainability and Extensibility

Sustainability of Benanza is ensured by providing an automated benchmark generation and analysis workflow along with a continuously updated Performance Database. Benchmarking requires limited effort, as the micro-benchmarks are automatically generated, and the user only needs to compile and run the generated code on systems of interest. The Performance Database is continuously updated with new benchmark results. A key insight of the proposed design is that there is ample layer repeatability within and across models. This keeps the number of unique layers, and thus the number of Performance Database entries, in check over time. For new models, only the newly introduced unique layers are benchmarked.

For example, consider a scenario where all models in Table I except for ResNet*-v2 have already been benchmarked and the results are in the Performance Database. Using our design, benchmarking the ResNet*-v2 models requires measuring all the ResNet*-v2 layers that are not within the Performance Database. Evaluating this hypothetical scenario results in a reduction ( minutes) in benchmarking time on the Tesla_V100 system for batch size . The saving would be even larger on slower systems. By storing and reusing the micro-benchmark results in the Performance Database we minimize the time cost of running micro-benchmarks.

Benanza is extensible. As shown in Figure 2, Benanza is designed as a set of modular components. As new cuDNN functions are introduced, users update the Benanza runtime accordingly. For example, if a new cuDNN convolution algorithm is added, then the user can just add it to the list of algorithms to instantiate in the convolution benchmark implementation. If a new cuDNN/cuBLAS API or a fused API is added, then a user needs to add the benchmark implementation for the new API using the templates provided by Benanza as a basis. Users can also extend the Automatic Benchmark Generator to support other runtimes that target other software libraries or hardware, and leverage most of the other analysis components unmodified. These runtimes can target the frameworks’ Python or C++ API or other DL libraries (e.g. MIOpen [jeh2019miopen] on AMD GPUs, or MKL-DNN [mkldnn] on CPUs). Through the novel benchmarking and analysis design, Benanza copes well with the fast evolving pace of DL innovations.

IV Evaluation

We implemented Benanza and evaluated its design by answering Q1-Q6. We evaluated the 30 ONNX models (listed in Table I) in the MXNet (v), ONNX Runtime (v), and PyTorch (v) frameworks. Experiments were run on the systems listed in Table III. All systems use Ubuntu LTS, CUDA , cuDNN Version , and CUDA Driver . The micro-benchmarks were compiled with GCC . We first computed the float32 “lower-bound” latency in both sequential and parallel modes. Then we used the Analyzer to uncover and explore optimization opportunities — cuDNN heuristics, framework inefficiencies, layer fusion, and usage of Tensor Cores, and show their impact on the end-to-end latency. The reader is encouraged to explore further documentation and experimentation at benanza.mlmodelscope.org.

Name CPU GPU (Release Year) GPU Architecture GPU Memory Capacity, Bandwidth Theoretical FP32 TFLOPS Theoretical Tensor TFLOPS
Tesla_K80 (AWS P2) Intel Xeon CPU E5-2686 v4 Tesla K80 (2014) Kepler 12 GB, 480 GB/s 5.6
Tesla_M60 (AWS G3) Intel Core i9-7900X CPU Tesla M60 (2015) Maxwell 7 GB, 160.4 GB/s 4.8
TITAN_Xp Intel Xeon CPU E5-2686 v4 TITAN Xp (2017) Pascal 12 GB, 547.6 GB/s 12.2
TITAN_V Intel Core i7-7820X CPU TITAN V (2017) Volta 12 GB, 672 GB/s 14.9 110.0
Tesla_V100 (AWS P3) Intel Xeon CPU E5-2686 v4 Tesla V100 SXM2 (2018) Volta 16 GB, 900 GB/s 15.7 125.0
Quadro_RTX Intel Xeon CPU E5-2630 v4 Quadro RTX 6000 (2019) Turing 24 GB, 624 GB/s 16.3 130.5
Tesla_T4 (AWS G4) Intel Xeon Platinum 8259CL CPU Tesla T4 (2019) Turing 15 GB, 320 GB/s 8.1 65.0
Table III: We used 7 GPU systems for evaluation. The systems cover the past GPU generations (from Kepler to the latest Turing). Amazon cloud (AWS) is used for 4 of the systems and the other 3 are local machines. The Turing and Volta GPUs support Tensor Cores and their theoretical Tensor Core performance (Tensor TFLOPS) is listed.

IV-A End-to-end Latency vs. “Lower-Bound” Latency

We measured the end-to-end latency of the models using MXNet, ONNX Runtime, and PyTorch on the Tesla_V100 system. Figure 6 shows the latency difference across all models and Figure 7 compares the models across the frameworks. Due to the lack of support of some ONNX operators by ONNX Runtime [onnxruntime] and PyTorch [paszke2017pytorch], not all models run within these frameworks. As MXNet is the fastest in general, subsequent sections of the paper (with the exception of Section IV-C) focus on informing optimizations in MXNet.

Figure 6: The end-to-end latency of all ONNX models using batch size 1 with MXNet backend on Tesla_V100 in Table III.
Figure 7: The end-to-end latency of all ONNX models with MXNet, ONNX Runtime, and PyTorch backends (normalized to MXNet latency) using batch size 1 on Tesla_V100.

IV-A1 (Q1, Q2) Sequential Mode vs. Parallel Mode

The difference between the “lower-bound” latency and the end-to-end latency indicates the optimization opportunities in the framework and its use of the cuDNN and cuBLAS APIs. The “lower-bound” latency of a model normalized to the model’s end-to-end latency in a framework is referred to as normalized “lower-bound” latency. Figure 8 shows the normalized “lower-bound” latency using sequential and parallel modes in MXNet across all models using batch size 1 on the Tesla_V100 system.

The sequential normalized “lower-bound” latencies across models have a geometric mean of , thus a potential latency speedup of can be made to MXNet. The parallel “lower-bound” latency across models has a geometric mean of , indicating a potential latency speedup of . The difference between a model’s parallel and sequential “lower-bound” latency depends on the existence of parallel modules within the model and how compute-intensive the data-independent paths are. Models without parallel modules have the same parallel and sequential “lower-bound” latency. For models with compute-intensive parallel modules, such as the Inception models (ID=), the potential speedup of MXNet’s latency is , , and respectively. The sequential and parallel normalized “lower-bound” latencies of LeNet (ID=11) are both low because LeNet is a simple model with low latency ( as shown in Figure 6). For LeNet, the MXNet overhead and other non-compute portions are high, thus the normalized “lower-bound” latency is low.

The sequential “lower-bound” latencies of the models with parallel modules (e.g. the Inception and ResNet models) are closer to their end-to-end latencies (when compared to the parallel “lower-bound”). This suggests that parallel modules are mostly executed sequentially in MXNet, even though the independent layers could be run in parallel. We verified this sequential execution behavior in MXNet by inspecting the model execution profile. We evaluated the benefits of the subsequent optimizations in terms of the sequential “lower-bound” latency.

Figure 8: The sequential and parallel normalized “lower-bound” latency in MXNet using batch size 1 on Tesla_V100.
Figure 9: The end-to-end latency of ResNet50_v1 in MXNet across batch sizes and systems.
Figure 10: The cuDNN heuristic selects non-optimal convolution layer algorithms for ResNet50_v1 using batch size on Tesla_V100. Up to speedup can be achieved if selection was ideal.
Figure 11: The speedup achieved for ResNet50_v1 by applying the MXNet optimization described in Section IV-C1 across batch sizes and systems.
Figure 12: The normalized “lower-bound” latency of ResNet50-v1.
Figure 13: The geometric mean of the normalized “lower-bound” latencies of all models.
Figure 14: The end-to-end latency speedup for ResNet50-v1 if the cuDNN heuristic selections were optimal.
Figure 15: The geometric mean of the end-to-end latency speedup for all models by using the optimal convolution algorithm.
Figure 16: The end-to-end latency speedup for ResNet50-v1 if layer fusion was performed.
Figure 17: The “lower-bound” latency speedup if Tensor Cores (NCHW) were used for ResNet50-v1.
Figure 18: The “lower-bound” latency speedup for ResNet50-v1 if Tensor Cores (NHWC) were used.
Figure 19: The end-to-end latency speedup for ResNet50-v1 if Tensor Cores (NHWC) were used.
Figure 20: The end-to-end latency speedup for ResNet50-v1 if parallel execution, optimal algorithm selections, layer fusion, and Tensor Cores (NHWC) were used.

IV-A2 Batch Sizes and Systems

To demonstrate Benanza’s functions across batch sizes and systems, we evaluated the “lower-bound” latency of all models using different batch sizes from to on representative systems (shown in Table III). We select batch size , since some models cannot be run using batch sizes beyond due to GPU memory limitations. Figure 9 shows the end-to-end latency of ResNet50-v1 on all systems in log scale. As expected, latencies are inversely correlated with the compute capability of the system (e.g. theoretical FP32 TFLOPS in Table III). ResNet50-v1 has a higher latency on Quadro_RTX than on Tesla_V100, since Quadro_RTX is a desktop-grade GPU with a memory bandwidth of 624 GB/s whereas Tesla_V100 is a server-grade GPU with a memory bandwidth of 900 GB/s.

Figure 12 shows the (sequential) normalized “lower-bound” latency of ResNet50-v1 across batch sizes and systems. The figure suggests that ResNet50-v1’s optimization opportunities are system and batch size dependent. Both Tesla_V100 and TITAN_V are highly optimized to run ResNet50-v1 across batch sizes, since their normalized latencies are high — ranging from to . The normalized latencies for Tesla_T4 and Quadro_RTX are high for batch sizes to but drop beyond that. ResNet50-v1 is less optimized on the other systems and has low normalized “lower-bound” latencies.

The geometric mean of the normalized “lower-bound” latencies for all the models across systems and batch sizes is shown in Figure 13. Tesla_V100 and TITAN_V are still observed to have high normalized latencies (). A drop was still observed for Tesla_T4 and Quadro_RTX at batch size . Tesla_M60 and TITAN_Xp have normalized “lower-bound” latencies between and . The oldest GPU generation, Tesla_K80, is the least optimized and has the lowest normalized “lower-bound” latency.

Overall, the current software stack (latest MXNet, cuDNN, and CUDA libraries used in the evaluation) is more optimized on the recent GPU generations (Turing and Volta) and for smaller batch sizes. Compared to Volta, the software stack is less optimized for Turing. This is possibly because Turing was newly released, and we expect optimizations that target Turing to increase. Moreover, the low normalized “lower-bound” latencies on the older GPUs suggest that vendors prioritize optimizations for newer GPU generations over older ones.

Figure 21: The speedup achieved by removing unnecessary cuDNN API synchronizations in PyTorch on Tesla_V100 using batch size 1.

IV-B (Q3) cuDNN Convolution Heuristics

Using the Benanza Analyzer, we observed that the heuristics employed by cuDNN (and subsequently the frameworks) are not always optimal. For example, Figure 10 shows the convolution layer latencies using the algorithms informed by the cuDNN heuristics (labeled as cuDNN Heuristic) normalized to using the optimal algorithm (labeled as Ideal Algorithm) for ResNet50_v1 using batch size 32 on Tesla_V100. The algorithm choices are listed in Section III-B2. Figure 14 shows the end-to-end latency speedup for ResNet50_v1 across batch sizes and systems by using the optimal convolution algorithm for all convolution layers. Figure 15 shows the geometric mean of the end-to-end latency speedup for all models by using the optimal algorithms. At batch size , the speedup ranges between and across GPUs. Both cutting-edge and older GPU architectures can benefit from better cuDNN heuristics.

IV-C (Q4) Inefficiencies in Frameworks

We used Benanza to identify the inefficiencies in MXNet and PyTorch. We then implemented the optimizations informed by Benanza and show the latency speedup after the framework modifications.

IV-C1 MXNet ONNX Model Loader

We observed through the Analyzer that there are layers in the model execution profile where the cuDNN API arguments deviate from what is expected. An inspection of the Analyzer’s parsed Nsight profile pointed to an image_2d_pad_constant_kernel GPU kernel function being invoked before every convolutional layer. Non-zero padding leads to the observed deviation between the expected and actual cuDNN API calls. We inspected the MXNet source code and found that padding layers are inserted during the loading of ONNX models in MXNet. ONNX supports specifying asymmetric padding as a parameter in convolution layers, whereas MXNet does not. Therefore, MXNet must insert padding layers before convolution layers where asymmetric padding is used when loading ONNX models. However, the MXNet ONNX model loader adds padding layers before every convolution layer (regardless of the use of asymmetric padding). A non-intrusive optimization is to only insert padding layers if asymmetric padding is used. With this simple one-line optimization, we observed up to end-to-end latency speedup for ResNet50-v1 (shown in Figure 11).

IV-C2 PyTorch cuDNN Wrapper

Using Benanza we observed that there were excessive calls to cudaStreamWaitEvent between cuDNN API calls. Using the backtrace information from the Nsight profile, we identified the PyTorch source file that introduces these synchronizations. Upon further study of the source code, we found that all cuDNN functions are invoked by a cuDNN wrapper in PyTorch. The wrapper manages a pool of cuDNN handles and is designed to enable invoking cuDNN functions from different CPU threads. cuDNN functions managed by the same handle are synchronized and executed sequentially. In the current PyTorch (v), however, a single handle is used for inference, and thus forced synchronization occurs before each cuDNN function call. The synchronizations cause stalls on average between cuDNN functions, thus the latency saved through this optimization is a function of the number of layers in a model. We modified PyTorch to elide the cuDNN wrapper and only synchronize before and after performing inference, and Figure 21 shows the speedup achieved using batch size 1. MobileNet-v2 achieves a speedup, since it has low latency and a large number of layers.

IV-D (Q5) Layer Fusion

We used Benanza to evaluate the potential benefits of layer fusion. Figure 16 shows the end-to-end latency speedup from layer fusion for ResNet50-v1 across the systems. ResNet50-v1 has the layer sequence pattern ConvBiasBatchNormActivation. Benanza reports that the ConvBias sequence can be fused for better latency and performs the fusion analysis (Section III-D4). In all, () layers were fused and up to speedup was achieved over the end-to-end latency across systems for ResNet50-v1. By inspecting the model execution profile, we found no indication that MXNet, ONNX Runtime, or PyTorch perform layer fusion using the cuDNN fused API.

IV-E (Q6) Tensor Cores

We used Benanza to evaluate the potential benefits of using float16 and the Tensor Cores available on recent GPU architectures. While the cuDNN Tensor Core API supports both NHWC and NCHW layouts, NVIDIA recommends the use of NHWC. We use Benanza to generate benchmarks targeting both the NHWC and NCHW layouts and evaluated the “lower-bound” latency speedup, as shown in Figures 18 and 17 respectively. As expected, using NHWC achieves a higher speedup. Internally, the cuDNN API implements NCHW convolutions in terms of NHWC with an implicit transposition. As compute dominates (i.e. at larger batch sizes), the relative overhead of the transposition becomes small; hence, NCHW and NHWC have similar performance for larger batch sizes. Figure 19 shows the end-to-end latency speedup from using Tensor Cores (NHWC). TITAN_V achieves significant speedup (up to ). We can see that Tesla_T4 benefits most from Tensor Cores for smaller batch sizes (i.e. it might be best used for low-latency inference).

IV-F (Q1, Q2, Q3, Q5, Q6) Parallel Execution, Algorithm Selection, Layer Fusion, and Tensor Cores

Benanza can be used to perform the above analyses jointly. To demonstrate this, we analyzed the end-to-end latency speedup when using parallel execution of data-independent layers, optimal algorithm selection, layer fusion, and Tensor Cores (NHWC). Figure 20 shows the end-to-end latency speedup for ResNet50-v1 across batch sizes and systems. Up to a and speedup can be achieved by TITAN_V and Tesla_V100 respectively. We can surmise, from the previous analysis, that most of the benefit for TITAN_V is attributed to its use of Tensor Cores. Quadro_RTX and Tesla_T4 achieve marginal speedup over the Tensor Core results.

V Related Work

DL Benchmarking: There has been no shortage of work on developing benchmarks to characterize DL models. These DL benchmarks either take a model as a black box and measure the user-observable latency and throughput (end-to-end benchmarks) or delve deeper into models to characterize the layer or kernel performance (micro-benchmarks). The end-to-end benchmarks [mlperf, aimatrix, dawnbench] provide a corpus of models that are deemed to be of value to characterize for industry and research. Micro-benchmarks [deepbench, convbench, benchdnn, aimatrix] distill DL models into their layers or kernels, and are hand-curated. Micro-benchmarking enables easy measurement of layers within popular DL models and integrates easily with profiling tools. The work in [deep500] has recently been proposed to enable fair comparison of DL techniques at different levels of granularity. At the operator level, [deep500] takes ONNX models and generates micro-benchmarks that target the framework’s Python API to measure the latency of each operator. Benanza also takes ONNX models as input, but generates lower-level cuDNN and cuBLAS micro-benchmarks to compute the “lower-bound” latency of the model, and performs analysis. The authors are unaware of previous work that generates micro-benchmarks from model layers and couples them with an analysis workflow to inform optimizations.

Performance Advising: There is past work on using profiling to inform users of possible optimizations. These optimizations are performed at the compiler level [ashouri2019survey], or are plugins to code editors that inform proper usage of APIs [vandierendonck2010paralax, haj2019neurovectorizer]. Low-level profile reports and some suggestions on how to address bottlenecks are provided by profilers and IDEs such as NVIDIA’s Nvprof [nvprof], Intel’s VTune [vtune], Oracle’s Solaris Studio [solarisoracle], Microsoft’s Roslyn [ng2011roslyn], and IBM’s XL [du2015explore]. To the authors’ knowledge, there has been no work on applying or specializing optimization advising to the DL domain.

VI Conclusion

This paper presents Benanza, a sustainable and extensible DL benchmarking and analysis design that automatically generates layer-wise benchmarks for DL models to compute the “lower-bound” latency and inform optimizations on GPUs. We used Benanza to evaluate a set of models using different frameworks on GPUs, and pinpointed optimizations in parallel layer execution, cuDNN algorithm selection, framework inefficiency, layer fusion, and Tensor Core usage. The results show that Benanza fills a significant gap within the characterization/optimization cycle and can boost the productivity of DL model, framework, and library developers.

Acknowledgments

This work is supported by IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR) - a research collaboration as part of the IBM Cognitive Horizon Network.

References
