AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite

Abstract

Domain-specific software and hardware co-design is promising because it is much easier to achieve efficiency for fewer tasks. Agile domain-specific benchmarking speeds up the process, as it provides not only relevant design inputs but also relevant metrics and tools. Unfortunately, modern workloads such as big data, AI, and Internet services dwarf traditional ones in terms of code size, deployment scale, and execution path, and hence raise serious benchmarking challenges.

This paper proposes an agile domain-specific benchmarking methodology. Together with seventeen industry partners, we identify ten important end-to-end application scenarios, from which sixteen representative AI tasks are distilled as the AI component benchmarks. We propose the permutations of essential AI and non-AI component benchmarks as end-to-end benchmarks. An end-to-end benchmark is a distillation of the essential attributes of an industry-scale application. We design and implement a highly extensible, configurable, and flexible benchmark framework, on the basis of which we propose guidelines for building end-to-end benchmarks and present the first end-to-end Internet service AI benchmark.

The preliminary evaluation shows the value of our benchmark suite, AIBench, compared with MLPerf and TailBench for hardware and software designers, micro-architectural researchers, and code developers. The specifications, source code, testbed, and results are publicly available from the website http://www.benchcouncil.org/AIBench/index.html.

   


   

The Abstract and Section 1 (Introduction) were contributed by Jianfeng Zhan. Section 2 was contributed by Jianfeng Zhan, Lei Wang, Wanling Gao, and Fei Tang. Section 3 was contributed by Jianfeng Zhan. Section 4.1 was contributed by Chunjie Luo, Fei Tang, Zihan Jiang, Wanling Gao, Jianfeng Zhan, and seventeen industry partners. Section 4.2 and the component benchmarks were contributed by Wanling Gao, Chunjie Luo, Xingwang Xiong, Fei Tang, Zihan Jiang, Tianshu Hao, Fanda Fan, Xu Wen, Fan Zhang, Yunyou Huang, Jianan Chen, and Mengjia Du. Section 4.3 and the micro benchmarks were contributed by Wanling Gao and Daoyi Zheng. Section 4.4 was contributed by Wanling Gao, Fei Tang, Lei Wang, and Jianfeng Zhan. Section 5 was contributed by Fei Tang, Wanling Gao, Lei Wang, and Jianfeng Zhan. Section 6 was contributed by Jianfeng Zhan, Wanling Gao, Fei Tang, Lei Wang, and Chuanxin Lan. Section 7 and Section 8 were contributed by Jianfeng Zhan, Wanling Gao, and Lei Wang. Rui Ren and Chen Zheng provided testbed support.


BenchCouncil: International Open Benchmarking Council
Chinese Academy of Sciences
Beijing, China
http://www.benchcouncil.org/AIBench/index.html

Technical Report No. BenchCouncil-AIBench-2020

February 17, 2020

1 Introduction

As it is much easier to achieve more efficient algorithms, systems, and architectures for fewer tasks, domain-specific software and hardware co-design is widely explored. For example, Internet service giants such as Google, Facebook, and Alibaba each focus on a specific application domain, i.e., search engines, social networks, and e-commerce, respectively, and they are active practitioners. The ongoing AI accelerator boom is another witness to this trend. As AI advances have brought breakthroughs in processing images, video, speech, and audio [42], Internet service providers pervasively perform software and hardware AI co-design to augment their services [49, 32, 10, 39, 55]. This trend is also witnessed by advances in big data, where there are hundreds of single-purpose solutions in the form of NoSQL systems, NewSQL systems, or hardware accelerators.

Agile domain-specific benchmarking speeds up software and hardware co-design. Unfortunately, modern workloads dwarf traditional ones in terms of code size, deployment scale, and execution path, and hence raise serious benchmarking challenges. For example, traditional desktop workloads, e.g., data compression [9] and image manipulation [9], are about one hundred thousand lines of code and run on a single node. Web server workloads [5] are hundreds of thousands of lines of code and run on a small-scale cluster, i.e., dozens of nodes. However, for modern workloads, the runtime environment stacks alone (e.g., Spark [8], TensorFlow [10]) amount to millions of lines of code, and these workloads often run on a large-scale cluster, i.e., tens of thousands of nodes [16]. Moreover, modern Internet services adopt a microservice-based architecture, which is often distributed across different datacenters and consists of diverse AI and non-AI modules with very long and complex execution paths. Worst of all, the real-world data sets, workloads, and even AI models are hidden within the giant Internet service providers’ datacenters [32, 14], which further exacerbates the benchmarking challenges.

On one hand, hardware and software designers should consider the overall system’s effects. Using micro (interchangeable with kernel in this paper) or component benchmarks alone can lead to incorrect conclusions. For example, in Section 6.2.1, we found that, in terms of execution path alone, the end-to-end tail latency deteriorates by up to hundreds of times compared to the tail latency of a single AI component, which cannot be predicted by a state-of-the-art statistical model [24]. Here, end-to-end indicates the overall critical path. It may refer to the end-to-end (tail) latency of an online service, or even cover offline AI training when an AI model for online services is updated in real time, as discussed in Section 6.2.2.

On the other hand, it is usually difficult to justify porting a full-scale end-to-end application to a new computer system or architecture simply to obtain a benchmark number [29, 15]. For hardware designers, an end-to-end application is too large to run on simulators. In addition, evaluating a full-scale end-to-end application raises difficulties in the reproducibility and interpretability of performance data [28], and may lead to erroneous conclusions. Even after the overall critical-path information has been obtained, micro and component benchmarks remain a necessary part of the evaluation.

In other words, we believe a domain-specific benchmark suite should have three integrated parts. End-to-end benchmarks let software and hardware designers learn about the overall system information. Each end-to-end benchmark is a distillation of the essential attributes of an industry-scale application, and hence reduces the side effects of the latter’s huge code size, extreme deployment scale, and complex execution paths. The component benchmarks measure the achieved performance and quality targets for representative AI tasks, and provide diverse computation and memory access patterns for micro-architectural researchers. The micro benchmarks let code developers drill down to hotspot functions for performance optimization.

Figure 1: The Agile Domain-specific Benchmarking Methodology.

This paper proposes an agile domain-specific benchmarking methodology as shown in Fig. 1. Without losing its generality, we apply it in characterizing the AI and Internet services application domains. First, in cooperation with seventeen industry partners, we investigate their domain-specific benchmarking requirements, and extract ten important end-to-end application scenarios. Instead of using real-world applications, we propose the permutations of essential AI and non-AI tasks as end-to-end benchmarks.

Second, we identify sixteen representative AI tasks as the AI component benchmarks with both performance and quality targets. After profiling the sixteen AI component benchmarks, we identify and implement fourteen frequently appearing units of computation as the micro benchmarks.

Third, we present a highly extensible, configurable, and flexible benchmark framework, allowing researchers to create end-to-end applications by using different components commonly found in major application domains. On the basis of the framework, we propose guidelines on how to build end-to-end benchmarks, and design and implement the first end-to-end Internet service AI benchmark—E-commerce search intelligence.

The evaluation on a hybrid cluster consisting of 16 CPU nodes and 4 GPU nodes shows the value of AIBench compared with MLPerf and TailBench. We gain many insights for hardware and software designers, micro-architectural researchers, and code developers. Several important observations are as follows: (1) In serving the same request, different AI components incur significantly different latencies; the end-to-end tail latency deteriorates by dozens or even hundreds of times with respect to a single AI component, which cannot be predicted by a state-of-the-art statistical model [24]. (2) Internet service architects must make tradeoffs among service quality, model complexity, and model accuracy. (3) AI models are updated in real time in many end-to-end application scenarios, so offline training should be included in end-to-end benchmarking. (4) As they demonstrate distinct computation and memory patterns, diverse AI tasks should be included in the AI component benchmarks. (5) Drilling down to hotspot functions is helpful for code optimization.

The rest of this paper is organized as follows. Section 2 explains the motivation. Section 3 summarizes the methodology. Section 4 describes how to characterize the AI and Internet service application domains. Section 5 illustrates how to build an end-to-end benchmark. Section 6 performs evaluation. Section 7 summarizes the related work. Section 8 draws a conclusion.

2 Motivation

2.1 Why End-to-end Benchmarking Is Necessary

Modern Internet services process millions of user queries daily, so tail latency is of paramount importance to user experience [24]. However, a microservice-based architecture contains various AI and non-AI modules, and consequently forms long and complex execution paths. Existing AI benchmarking efforts mostly provide a few micro or component benchmarks, and thus fail to model the critical paths and the permutations of primary components of an industry-scale application.

The end-to-end tail latency can deteriorate by more than 100X compared to the tail latency of a single component. The end-to-end tail latency indicates the overall performance of the entire execution path, while the component tail latency only reports the performance of a single module. Our experiments in Section 6.2.1 show that the end-to-end tail latency deteriorates by dozens or even hundreds of times compared to the tail latency of a single component. For one AI component, recommendation, the difference is 13X, while for image classification the difference reaches 296X.

Debugging the performance of a single component benchmark alone does not touch the full execution path and fails to provide bottleneck information among the primary modules within a critical path. Considering the 90th percentile latency, we found that among the four AI-related components, the recommendation component occupies 72% of the execution time, while the image classification component occupies only 1.1%. This indicates that benchmarking a single AI component alone, without the overall critical path, does not make sense.
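To make the distinction between component and end-to-end tail latency concrete, the following minimal sketch computes a nearest-rank percentile over per-request latencies. The component names and latency values are hypothetical and chosen only for illustration; they are not measurements from AIBench, and summing component latencies assumes a strictly sequential critical path.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds (illustrative only).
# Request i visits every component, so index i refers to the same request in each list.
component_ms = {
    "query_parsing":  [1.0, 1.2, 0.9, 1.1, 5.0],
    "classification": [2.0, 2.1, 1.9, 2.2, 2.0],
    "recommendation": [20.0, 22.0, 19.0, 21.0, 80.0],
}

# Assumption: the critical path is sequential, so end-to-end latency is the per-request sum.
end_to_end_ms = [sum(values) for values in zip(*component_ms.values())]

p90_end_to_end = percentile(end_to_end_ms, 90)
p90_classification = percentile(component_ms["classification"], 90)
slowdown = p90_end_to_end / p90_classification
```

In this toy setting the slow requests of one component dominate the end-to-end tail, which a per-component measurement of another module would not reveal.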

2.2 Can a Statistical Model Predict the End-to-end Tail Latency?

One may argue that, after profiling the tail latency of many components, a statistical model can predict the end-to-end tail latency. Our answer is no. In Section 6.2.1, we use a state-of-the-art queuing-theory model [24] to estimate the end-to-end application’s latency and tail latency. Through the experimental evaluations, we find that the gap between the actual average latency and the theoretical one is 3.4 times, while the gap between the actual 99th percentile latency and the theoretical one is 8.1 times. Furthermore, the state-of-the-art queuing model [24] for tail latency treats the system as a whole, and is not suited to an end-to-end application, which needs to characterize the permutations of several or dozens of components.
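As a point of reference for what a theoretical latency estimate looks like, the sketch below uses the textbook M/M/1 queue. This choice is an assumption made here for illustration and is not necessarily the model of [24]. For an M/M/1 queue with arrival rate λ and service rate μ (λ < μ), the response time is exponentially distributed with rate μ − λ, so the mean is 1/(μ − λ) and the quantile at probability p is −ln(1 − p)/(μ − λ). The hypothetical `gap` helper expresses how many times two values differ, regardless of which is larger.

```python
import math

def mm1_mean_latency(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue (same time unit as the rates)."""
    if not 0 <= arrival_rate < service_rate:
        raise ValueError("need 0 <= arrival_rate < service_rate for a stable queue")
    return 1.0 / (service_rate - arrival_rate)

def mm1_quantile_latency(arrival_rate, service_rate, p):
    """Response-time quantile of an M/M/1 queue, e.g. p=0.99 for the 99th percentile."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0 <= arrival_rate < service_rate:
        raise ValueError("need 0 <= arrival_rate < service_rate for a stable queue")
    return -math.log(1.0 - p) / (service_rate - arrival_rate)

def gap(actual, theoretical):
    """Multiplicative gap between two positive values, independent of direction."""
    return max(actual, theoretical) / min(actual, theoretical)
```

For example, with a hypothetical load of 50 requests per second against a service rate of 100 requests per second, `mm1_mean_latency(50, 100)` gives 0.02 seconds, and `gap(measured, 0.02)` would report how far a measured mean departs from that estimate.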

2.3 Why Offline AI Training is also Essential in End-to-end Benchmarking

As witnessed by many of our industry partners, when an AI model is used for an online service, it has to be updated in real time. For example, one E-commerce giant requires that its AI models be updated every hour, and the updated model brings a gain of about 3% in click-through rate and millions in profit. In Section 6.2.2, the evaluation shows that offline training should be included in end-to-end benchmarking in order to make tradeoffs among model update interval, training overhead, and accuracy improvement.

3 Agile Domain-specific Benchmarking Methodology

As modern AI and Internet service workloads are not only diverse, but also fast changing and expanding, the traditional benchmark methodology that creates a new benchmark or proxy for every possible workload is prohibitively costly and even impossible [29]. Hence, an agile domain-specific benchmarking methodology is essential. Fig. 1 summarizes our methodology.

Step One. We investigate domain-specific benchmarking requirements with the industry partners. The input of this step is the candidate list of industry-scale applications. Simply copying the real-world applications is impossible for two reasons. First, the partners treat their real-world workloads, data sets, and models as confidential. Second, the massive code size, extreme deployment scale, and complex execution paths make it infeasible. So the purpose of this step is to understand the essential components of these applications and the permutations of different components.

Step Two. On the basis of the output from Step One, this step distills representative AI and non-AI tasks. Unlike a traditional task, each AI task, such as image classification, has both performance and quality targets [45]. Generally, an AI component specification defines a task either in a high-level language [64] or only algorithmically, in a paper-and-pencil approach [15]. We implement each task as a component benchmark. Each benchmark also provides a reference AI model, evaluation metrics, and a state-of-the-art quality target [45].

Step Three. According to the output of Step Two, we profile the full component benchmarks and drill down to frequently appearing and time-consuming units of computation. We implement those units of computation as micro benchmarks. Micro benchmarks are easily portable to new architectures and systems, and are beneficial to fine-grained profiling and tuning.
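One way to picture the selection in Step Three is the sketch below, which picks units of computation that take a noticeable share of runtime in several component benchmarks. The function name, the thresholds, and the profile layout are all hypothetical; the paper does not state the exact selection criteria.

```python
from collections import Counter

def select_hotspots(profiles, min_benchmarks=2, min_time_share=0.05):
    """Return operators that are time-consuming in enough benchmarks.

    profiles: dict mapping benchmark name -> dict of operator name -> seconds.
    An operator counts for a benchmark if it takes at least `min_time_share`
    of that benchmark's total time (assumed threshold). It is selected if it
    counts in at least `min_benchmarks` benchmarks (assumed threshold).
    """
    counts = Counter()
    for op_seconds in profiles.values():
        total = sum(op_seconds.values())
        if total <= 0:
            continue  # skip empty or degenerate profiles
        for op, seconds in op_seconds.items():
            if seconds / total >= min_time_share:
                counts[op] += 1
    return sorted(op for op, n in counts.items() if n >= min_benchmarks)
```

Given two illustrative profiles in which only a convolution operator is both frequent and expensive, the function would return that single operator as a micro benchmark candidate.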

Step Four. According to the outputs of Steps One and Two, we design and implement a reusable benchmark framework, including an AI and non-AI component library and the data input, online inference, offline training, and deployment tool modules.

Step Five. On the basis of the benchmark framework, we build end-to-end benchmarks. Each end-to-end benchmark models the permutation of several or tens of essential AI or non-AI components, reflecting complex interactions among different modules and depicting overall system’s performance. In addition, we propose domain-specific evaluation metrics.

4 The AIBench Design and Implementation

We first give a summary of the seventeen industry partners’ benchmarking requirements, and then identify the representative AI tasks (component benchmarks and micro benchmarks). Finally, we propose the reusable benchmark framework.

4.1 Seventeen Industry Partners’ Benchmarking Requirements

Collaborating with seventeen industry partners whose domains include search engines, e-commerce, social networks, news feeds, video, and so on, we extract the essential end-to-end application scenarios from their products or services.

The real-world applications are complex, and we only distill the permutations of primary AI and non-AI tasks. Table 1 summarizes the list of end-to-end application scenarios.

For example, the first scenario in Table 1, E-commerce search intelligence, is extracted from an E-commerce giant. A user is classified into one of several groups in order to provide personalized services. The results are ranked according to the relations between the queries and the products, and the ranking is adjusted by learning from the historical query and hit logs. Recommended products are also returned to the user along with the search results. We distill this industry-scale application into several AI tasks, such as classification, learning to rank, and recommendation, and non-AI tasks, such as query parsing, database operation, and indexing. Section 5.1 describes how to implement this benchmark on the basis of the reusable framework described in Section 4.4.

In general, end-to-end benchmarks concern the overall system’s effects, including quality-ensured response latency, tail latency, and latency-bounded throughput. An example of a quality-ensured requirement is that the deviation of a quality measure (e.g., accuracy) from its target is within 2%. Different application scenarios have domain-specific evaluation metrics. For example, several scenarios require that the AI model be updated in real time.
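The two notions above can be sketched as simple checks. This is a minimal illustration under stated assumptions: the 2% tolerance is interpreted as relative to the target (the text does not say whether it is relative or absolute), and both function names are hypothetical.

```python
def quality_ensured(measured_quality, target_quality, tolerance=0.02):
    """True if the measured quality deviates from the target by at most `tolerance`.

    Assumption: the tolerance is relative to the target value.
    """
    return abs(measured_quality - target_quality) <= tolerance * target_quality

def latency_bounded_throughput(runs, latency_bound_ms):
    """Highest throughput among runs whose tail latency stays within the bound.

    runs: iterable of (throughput_qps, tail_latency_ms) pairs, one per load level.
    Returns None if no run satisfies the bound.
    """
    eligible = [qps for qps, tail_ms in runs if tail_ms <= latency_bound_ms]
    return max(eligible) if eligible else None
```

With hypothetical load levels `[(100, 20), (200, 45), (300, 120)]` and a 50 ms bound, the latency-bounded throughput would be 200 queries per second, since the 300 queries per second run violates the bound.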

| End-to-End Application Scenario | Involved AI Task | Involved Non-AI Task | Data | Metrics | Model Update Frequency |
|---|---|---|---|---|---|
| E-commerce search intelligence | Classification; Learning to rank; Recommendation | Query parsing, Database operation, Indexing | User data, Product data, Query data | Precision, Recall, Latency | High |
| Language and dialogue translation | Text-to-Text translation; Speech recognition | Query parsing | Text, Speech | Accuracy, Latency | Low |
| Content-based image retrieval | Object detection; Classification; Spatial transformer; Image-to-Text | Query parsing, Indexing, Sort | Image | Precision, Recall, Latency | High |
| Web searching | Text summarization; Learning to rank; Recommendation | Query parsing, Indexing, Crawler, Sort, Hash | Product data, Query data | Precision, Recall, Latency | High |
| Facial authentication and payment | Face embedding; 3D face recognition | Encryption | Face image | Accuracy, Latency | Low |
| News feed | Recommendation | Database operation, Sort, Basic statistics, Filter | Text | Precision, Recall | High |
| Photo translation | Classification; Spatial transformer; Text-to-Text translation | Query parsing | Image, Text | Accuracy, BLEU, Latency | Low |
| Live streaming | Image generation; Image-to-Image | Video codec, Video capture | Image | Latency | Low |
| Video services | Image compression; Video prediction | Video codec | Video | Accuracy, Latency | Low |
| Online gaming | 3D object reconstruction; Image generation; Image-to-Image | Rendering | Image | Latency | Low |

Table 1: Domain-specific Benchmarking Requirements

4.2 Representative AI Tasks

To cover a wide spectrum of AI tasks, we thoroughly analyze the end-to-end application scenarios shown in Table 1. In total, we identify sixteen representative AI tasks. We implement each AI task on TensorFlow [10] and PyTorch [7] as an AI component benchmark. Table 2 summarizes the sixteen component benchmarks in AIBench.

Classification. This task is to extract different thematic classes within the input data like an image or text file. It is a typical task in Internet services or other application domains, and is widely used in multiple scenarios, like category prediction and spam detection.

Image Generation. This task poses an unsupervised learning problem: to mimic the distribution of the data and generate images. A typical scenario is image resolution enhancement, which is used to generate high-resolution images.

Text-to-Text Translation. This task translates text from one language to another, and is one of the most important fields of computational linguistics. It can be used to translate search queries and dialogue.

Image-to-Text. This task is to generate the description of an image automatically. It can be used to generate image captions or to perform optical character recognition.

Image-to-Image. This task is to convert an image from one representation to another one. It can be used to synthesize the images with different facial ages and simulate virtual makeup.

Speech Recognition. This task is to recognize and translate a spoken language into text. This task is beneficial for voice search and voice dialogue translation.

Face Embedding. This task is to transform a facial image into a vector in an embedding space. The typical scenarios are facial similarity analysis and face recognition.

3D Face Recognition. This task is to recognize the 3D facial information from multiple images from different angles. This task mainly focuses on three-dimensional images, and is beneficial to the facial similarity and facial authentication scenario.

Object Detection. This task is to detect the objects within an image. The typical scenarios include vertical search and video object detection.

Recommendation. This task is to provide recommendations. It is widely used for advertisement recommendation, community recommendation, and product recommendation.

Video Prediction. This task is to predict future video frames by predicting the transformations of previous frames. The typical scenarios are video compression and video encoding, for efficient video storage and transmission.

Image Compression. This task is to compress the images and reduce the redundancy [57]. The task is important for Internet services in terms of reducing data storage overhead and improving data transmission efficiency.

3D Object Reconstruction. This task is to predict and reconstruct 3D objects [62]. The typical scenarios are map search, light field rendering, virtual reality, and online gaming.

Text Summarization. This task is to generate a text summary, which is important for search results preview, headline generation, and keyword discovery.

Spatial Transformer. This task is to perform spatial transformations [36]. A typical scenario is spatially invariant image retrieval, so that an image can be retrieved even if it is extremely stretched.

Learning to Rank. This task is to learn the attributes of a searched content and rank the scores for the results, which is the key for a search engine service.

| No. | Component Benchmark | Algorithm | Data Set |
|---|---|---|---|
| DC-AI-C1 | Image classification | ResNet50 [33] | ImageNet [25], Cifar [41] |
| DC-AI-C2 | Image generation | WassersteinGAN [13] | LSUN [63] |
| DC-AI-C3 | Text-to-Text translation | Transformer [58] | WMT English-German [1] |
| DC-AI-C4 | Image-to-Text | Neural Image Caption Model [60] | Microsoft COCO [44] |
| DC-AI-C5 | Image-to-Image | CycleGAN [66] | Cityscapes [21] |
| DC-AI-C6 | Speech recognition | DeepSpeech2 [12] | Librispeech [51] |
| DC-AI-C7 | Face embedding | Facenet [54] | LFW [35], VGGFace2 [17] |
| DC-AI-C8 | 3D face recognition | 3D face models [59] | 77,715 samples from 253 face IDs |
| DC-AI-C9 | Object detection | Faster R-CNN [52] | Microsoft COCO [44] |
| DC-AI-C10 | Recommendation | Neural collaborative filtering [34] | MovieLens [31] |
| DC-AI-C11 | Video prediction | Motion-focused predictive models [27] | Robot pushing data set [27] |
| DC-AI-C12 | Image compression | Recurrent neural network [57] | ImageNet [25] |
| DC-AI-C13 | 3D object reconstruction | Convolutional encoder-decoder network [62] | ShapeNet data set [18] |
| DC-AI-C14 | Text summarization | Sequence-to-sequence model [48] | Gigaword data set [53] |
| DC-AI-C15 | Spatial transformer | Spatial transformer networks [36] | MNIST [43] |
| DC-AI-C16 | Learning to rank | Ranking distillation [56] | Gowalla [19] |

Table 2: Component Benchmarks in AIBench.

The AI tasks concern both performance and quality targets. The primary metrics include the samples processed per second, the wall clock time to train a model achieving a target quality (Time-to-quality) [20], the wall clock time to train the specified epochs, quality-ensured throughput, and the energy consumption to train a model achieving a target quality (Energy-to-quality) [20].
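As an illustration of two of the metrics listed above, the following minimal sketch derives time-to-quality and samples per second from a training log. The log layout and function names are hypothetical, and the sketch assumes a quality measure for which higher is better.

```python
def time_to_quality(history, target_quality):
    """Wall-clock time of the first checkpoint that reaches the target quality.

    history: iterable of (elapsed_seconds, quality) pairs in training order.
    Assumption: higher quality values are better (e.g., accuracy).
    Returns None if the target is never reached.
    """
    for elapsed_seconds, quality in history:
        if quality >= target_quality:
            return elapsed_seconds
    return None

def samples_per_second(num_samples, elapsed_seconds):
    """Throughput over a measured interval."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_samples / elapsed_seconds
```

Energy-to-quality could be computed the same way by logging cumulative energy instead of elapsed time at each checkpoint.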

4.3 The AIBench Micro Benchmarks

After profiling the sixteen component benchmarks, we identify fourteen frequently appearing units of computation. They are Convolution, Fully connected, Relu, Sigmoid, Tanh, MaxPooling, AvgPooling, CosineNorm, BatchNorm, Dropout, Element-wise multiply, Softmax, Data arrangement, and Memcpy. We implement them as a set of micro benchmarks using TensorFlow [10] and Pthreads.
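To show what a unit of computation and its timing harness look like, the sketch below implements two of the listed units, Relu and Softmax, in plain Python. It is only an illustration: the AIBench micro benchmarks are implemented with TensorFlow and Pthreads, and the `time_kernel` helper is a hypothetical stand-in for a real profiling tool.

```python
import math
import time

def relu(xs):
    """Element-wise max(x, 0)."""
    return [x if x > 0.0 else 0.0 for x in xs]

def softmax(xs):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def time_kernel(kernel, data, repeats=5):
    """Best wall-clock time in seconds over `repeats` runs of kernel(data)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(data)
        best = min(best, time.perf_counter() - start)
    return best
```

Taking the best of several runs is one common way to reduce noise from other activity on the machine; reporting the median is an equally reasonable choice.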

4.4 The AIBench Framework

Figure 2: Reusing Framework.

As shown in Fig. 2, the framework provides loosely coupled modules that can be easily configured. Currently, the AIBench framework includes data input, offline training, online inference, non-AI library, and deployment tool modules. On the basis of the AIBench framework, we can easily compose an end-to-end benchmark.

The data input module is responsible for feeding data into the other modules. It collects representative real-world data sets, which come not only from authoritative public websites but also from our industry partners after anonymization. The data schema is designed to maintain the real-world data characteristics while alleviating confidentiality concerns. Based on the data schema, a series of data generators are further provided to support large-scale data generation, such as user or product information. To cover a wide spectrum of data characteristics, we take into account diverse data types, e.g., structured, semi-structured, and unstructured, and different data sources, e.g., table, graph, text, image, audio, and video. Our framework integrates various open-source data storage systems, and supports large-scale data generation and deployment [47].

The offline training and online inference modules are provided to build an end-to-end benchmark. First, the offline training module chooses one or more component benchmarks by specifying the required benchmark ID, input data, and execution parameters such as batch size. Then the offline training module trains a model and provides the trained model to the online inference module. The online inference module loads the trained model onto the serving system, i.e., TensorFlow Serving. The non-AI library module provides the non-AI computations and database access, including query parsing, database operations, indexing, sort, crawler, hash, encryption, basic statistics, filter, video codec, video capture, and rendering. For a complex end-to-end application, the online inference, the non-AI library, and the offline training modules together constitute an overall critical path.
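The handoff from offline training to online inference described above can be sketched as follows. All class and function names here are hypothetical and do not reflect the actual AIBench code; the "model" is a plain callable standing in for a trained network, and the in-memory registry stands in for a serving system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class TrainingJob:
    benchmark_id: str      # component benchmark ID, e.g. "DC-AI-C10" from Table 2
    input_data: str        # hypothetical data set name
    batch_size: int = 32   # execution parameter (assumed default)

def offline_train(job: TrainingJob) -> Callable[[str], dict]:
    """Stand-in for real training: returns a callable that plays the trained model."""
    def model(request: str) -> dict:
        return {"benchmark_id": job.benchmark_id, "request": request}
    return model

@dataclass
class OnlineInference:
    """In-memory stand-in for a serving system holding one model per benchmark ID."""
    models: Dict[str, Callable[[str], dict]] = field(default_factory=dict)

    def load(self, benchmark_id: str, model: Callable[[str], dict]) -> None:
        self.models[benchmark_id] = model

    def infer(self, benchmark_id: str, request: str) -> dict:
        return self.models[benchmark_id](request)
```

A usage sequence would create a `TrainingJob`, pass it to `offline_train`, and register the result with `OnlineInference.load`, after which `infer` serves requests with the newly trained model; repeating the sequence models a model update.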

To ease deployment on a large-scale cluster, the framework provides deployment tools that contain two automated deployment templates using Ansible and Kubernetes. The Ansible templates support scalable deployment on physical or virtual machines, while the Kubernetes templates are used to deploy on a container cluster. A configuration file needs to be specified for installation and deployment, including module parameters such as the chosen benchmark ID and input data, and cluster parameters such as nodes, memory, and network information. With the deployment tools, a user does not need to know how to install and run each individual module.

5 Building End-to-end Benchmarks

In this section, we illustrate how to build end-to-end benchmarks, and later discuss the guideline.

5.1 The Design and Implementation of an E-commerce Search Intelligence

On the basis of the reusable framework, we implement the first end-to-end AI application benchmark, E-commerce search intelligence (E-commerce for short). This benchmark models the complete use case of a realistic E-commerce search intelligence, covering both text searching and image searching scenarios.

Figure 3: AIBench Implementation.

The E-commerce benchmark consists of four subsystems: online server, offline analyzer, query generator, and data storage, as shown in Fig. 3. Among them, online server receives the query requests and performs personalized searching and recommendation, integrating AI inference.

Offline analyzer chooses the appropriate AI component benchmarks and performs a training stage to generate a learning model. Offline analyzer is also responsible for building data indexes to accelerate data access.

Query generator simulates concurrent users and sends query requests to online server based on a specific configuration. Note that a query item provides either text or an image to reflect the different search habits of users. The configuration designates parameters such as concurrency, query arrival rate, distribution, user thinking time, and the ratio of text items to image items. The configurations simulate different query characteristics and satisfy multiple generation strategies. We implement our query generator based on JMeter [37].
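The role of the arrival rate and text-to-image ratio parameters can be illustrated with the sketch below, which produces a schedule of query arrivals. It assumes Poisson arrivals (exponentially distributed inter-arrival times), which is only one of the distributions such a configuration might select; the function name and parameters are hypothetical, and the actual generator is built on JMeter.

```python
import random

def generate_queries(num_queries, arrival_rate_qps, text_ratio, seed=0):
    """Return a list of (arrival_time_seconds, kind) pairs.

    Assumption: Poisson arrivals, i.e., exponential inter-arrival times with
    mean 1 / arrival_rate_qps. Each query is "text" with probability
    `text_ratio` and "image" otherwise.
    """
    rng = random.Random(seed)  # seeded for reproducible schedules
    now = 0.0
    schedule = []
    for _ in range(num_queries):
        now += rng.expovariate(arrival_rate_qps)
        kind = "text" if rng.random() < text_ratio else "image"
        schedule.append((now, kind))
    return schedule
```

Fixing the seed makes a run repeatable, which helps when comparing two system configurations under the same offered load.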

The data storage module stores all kinds of data. The user database saves all the attributes of user information. The product database holds all the attributes of the product information. The logs record the complete query histories. The text data contain the product description text or the user comments. The image and video data depict the appearance and usage of product vividly. The audio data store the voice search data and voice chat data. Overall, the data storage covers various data types including structured, unstructured, and semi-structured data, and diverse data sources, including table, text, image, audio and video.

To support scalable deployment on the clusters with different scales, each module is scalable and can be deployed on multiple nodes. Also, a series of data generators are provided to generate E-commerce data with different scales, through setting several parameters, e.g., the number of products and product attribute fields, the number of users and user attribute fields.

Online Server

Online server provides personalized searching and recommendations. Online server consists of four modules: search planner, recommender, searcher, and ranker.

Search planner is the entrance of online server. It is responsible for receiving the query requests from query generator, sending the requests to the other modules, and receiving the returned results. We use the Spring Boot framework [61] to implement search planner.

Recommender analyzes the query item and provides personalized recommendations according to the user information obtained from the user database. It first conducts query spelling correction and query rewriting, and then predicts the category to which the query item belongs based on two classification models, FastText [38] and ResNet50 [33]. FastText is used for text classification when the query item is text, and ResNet50 [33] is used for image classification when the query item is an image. Using a deep neural network proposed by Alibaba [49], query planner then conducts an inference process and uses the offline trained model to provide personalized recommendations. It returns two vectors: one is the probability vector of the predicted categories, and the other is the user preference score vector of product attributes, such as the user preference for brand and color. We use TensorFlow Serving [50] to provide text classification, image classification, and online recommendation services.

To guarantee scalability and service efficiency, searcher follows an industry-scale architecture. Searcher is deployed on several different clusters, and three clusters are the default configuration. The clusters hold the inverted indexes of product information in memory to guarantee high concurrency and low latency. According to the click-through rate and purchase rate, the products are divided into three popularity categories: high, medium, and low, whose proportions of the data volume are 15%, 50%, and 50%, respectively. Note that the high popularity category is a subset of the medium popularity category. The indexes of products with different popularity are stored in different clusters. Given a searching request, searcher searches these three clusters one by one until reaching a specific number of results. Generally, the cluster that holds low popularity products is rarely searched in a realistic scenario, so searcher adopts a different deployment strategy for each category. The cluster for high popularity contains more nodes and more backups to guarantee searching efficiency, while the cluster for low popularity deploys the fewest nodes and backups. We use Elasticsearch [30] to set up and manage searcher across the three clusters.
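The tiered lookup described above, in which clusters are searched one by one until enough results are found, can be sketched as follows. This is an in-memory illustration with hypothetical names: plain sets of product IDs stand in for the Elasticsearch clusters, and a predicate stands in for query matching. Because the high popularity category is a subset of the medium one, the sketch skips products it has already returned.

```python
def tiered_search(clusters, matches, wanted):
    """Search clusters in order (high to low popularity) until `wanted` results are found.

    clusters: list of sets of product IDs, ordered from high to low popularity.
    matches:  predicate deciding whether a product ID matches the query.
    """
    results = []
    seen = set()
    for index in clusters:
        for product_id in sorted(index):  # sorted only to make the order deterministic
            if product_id in seen or not matches(product_id):
                continue
            seen.add(product_id)
            results.append(product_id)
            if len(results) >= wanted:
                return results  # stop early: lower-popularity clusters are not touched
    return results
```

The early return is what makes the low popularity cluster rarely visited: it is reached only when the higher tiers cannot supply enough matches.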

Ranker uses the weight returned by recommender as an initial weight, and ranks the scores of products through a personalized L2R neural network [49]. Ranker uses TensorFlow serving [50] to implement product ranking.

Offline Analyzer

Offline analyzer is responsible for training models and building indexes to improve the online serving performance. It consists of three modules—AI offline trainer, job scheduler, and indexer.

AI offline trainer trains models using the data stored in the database. It digests the features of the product data, e.g., text, image, audio, and video. To improve the efficiency of online server, offline trainer chooses ten AI algorithms (component benchmarks) from the AIBench framework. The ten component benchmarks include classification for category prediction, recommendation for personalized recommendation, learning to rank for result scoring and ranking, image-to-text for image captioning, image-to-image and image generation for image resolution enhancement, face embedding for face detection within an image, spatial transformer for image rotating and resizing, object detection for detecting video data, and speech recognition for audio data recognition.

Job scheduler provides two kinds of training mechanisms: batch processing and streaming-like processing. In a realistic scenario, some models need to be updated frequently. For example, when a user searches for an item and clicks one product shown on the first page, the application immediately trains a new model based on the product that the user just clicked, and shows new recommendations on the second page. Our benchmark implementations consider this situation and adopt a streaming-like approach that updates the models every several seconds. For batch processing, the trainer updates the models every several hours.
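A minimal sketch of how the two update mechanisms could be expressed as per-model intervals. The names `UpdatePolicy` and `due_for_update`, and the concrete intervals (5 seconds and 3 hours), are assumptions for illustration only; the text specifies just "every several seconds" and "every several hours".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class UpdatePolicy:
    model: str
    interval_s: float  # minimum time between two training runs, in seconds


def due_for_update(policies: List[UpdatePolicy],
                   last_trained: Dict[str, float],
                   now: float) -> List[str]:
    """Return the models whose update interval has elapsed at time `now`."""
    due = []
    for policy in policies:
        # A model that was never trained is always due.
        last = last_trained.get(policy.model, float("-inf"))
        if now - last >= policy.interval_s:
            due.append(policy.model)
    return due


policies = [
    UpdatePolicy("recommendation", interval_s=5.0),                # streaming-like (assumed value)
    UpdatePolicy("image_classification", interval_s=3 * 3600.0),   # batch (assumed value)
]
last_trained = {"recommendation": 0.0, "image_classification": 0.0}

due_soon = due_for_update(policies, last_trained, now=10.0)
due_later = due_for_update(policies, last_trained, now=4 * 3600.0)
```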

Indexer builds indexes for product information. It provides three kinds of indexes: inverted indexes with a few product fields for searching, forward indexes with a few fields for ranking, and forward indexes with a majority of fields for summary generation.
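The difference between the three index kinds can be illustrated with a toy example. This is a sketch under assumptions: the product records, the field names, and the helper functions `build_inverted_index` and `build_forward_index` are hypothetical, and the real indexer builds its indexes in Elasticsearch rather than in Python dictionaries.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Toy product records; the field names are illustrative only.
products = {
    1: {"title": "red running shoe", "brand": "acme", "price": 59.0},
    2: {"title": "blue running jacket", "brand": "zenith", "price": 89.0},
}


def build_inverted_index(items: Dict[int, dict], field: str) -> Dict[str, Set[int]]:
    """Map each term of `field` to the set of product IDs that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for pid, record in items.items():
        for term in record[field].split():
            index[term].add(pid)
    return dict(index)


def build_forward_index(items: Dict[int, dict], fields: List[str]) -> Dict[int, dict]:
    """Map each product ID to the selected subset of its fields."""
    return {pid: {f: record[f] for f in fields} for pid, record in items.items()}


inverted = build_inverted_index(products, "title")                          # for searching
ranking_index = build_forward_index(products, ["price"])                    # few fields, for ranking
summary_index = build_forward_index(products, ["title", "brand", "price"])  # most fields, for summaries
```

The inverted index answers "which products contain this term", while the forward indexes answer "what are the fields of this product", which is why ranking and summary generation use the latter.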

5.2 Guidelines

We are implementing the other end-to-end benchmarks listed in Table 1. We suggest the following guidelines for building an end-to-end benchmark.

(1) Determine the essential AI and non-AI component benchmarks.

(2) For each component benchmark, find the valid input data and the data input module.

(3) Determine the valid permutation of AI and non-AI components.

(4) Specify the module-related configurations (benchmark ID, input data, execution parameters, and non-AI library) and the cluster-related configurations (node, memory, and network information).

(5) Specify the deployment strategy and write the scripts for the automated deployment tool.

(6) Train the AI models of the selected AI component benchmarks using the offline training module, and transfer the trained models to the online inference module.
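Guideline (4) can be made concrete with a small configuration sketch. The class names, field names, and example values below are hypothetical and do not reflect the configuration format of the AIBench framework; they only show one way to group module-related and cluster-related settings and to check them before deployment.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModuleConfig:
    benchmark_id: str                       # which component benchmark to run
    input_data: str                         # path or name of the input data set
    execution_params: Dict[str, object] = field(default_factory=dict)
    non_ai_library: str = ""                # e.g. a search or storage library


@dataclass
class ClusterConfig:
    nodes: List[str]
    memory_gb: int
    network: str


@dataclass
class EndToEndBenchmark:
    name: str
    modules: List[ModuleConfig]
    cluster: ClusterConfig

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if not self.modules:
            problems.append("no component benchmarks selected")
        ids = [m.benchmark_id for m in self.modules]
        if len(ids) != len(set(ids)):
            problems.append("duplicate benchmark ID")
        for m in self.modules:
            if not m.input_data:
                problems.append(f"{m.benchmark_id}: missing input data")
        if not self.cluster.nodes:
            problems.append("cluster has no nodes")
        return problems


# Hypothetical example values.
bench = EndToEndBenchmark(
    name="e-commerce-search",
    modules=[
        ModuleConfig("image_classification", "product_images", {"batch_size": 32}),
        ModuleConfig("recommendation", "user_clicks", non_ai_library="elasticsearch"),
    ],
    cluster=ClusterConfig(nodes=["node01", "node02"], memory_gb=32, network="1GbE"),
)
broken = EndToEndBenchmark("empty", [], ClusterConfig([], 0, ""))
```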

6 Evaluation

This section summarizes our evaluation using the AIBench end-to-end, component, and micro benchmarks. In Section 6.2, we explain why end-to-end benchmarking is necessary for both online server and offline trainer, and gain several insights that cannot be found using MLPerf [4] and TailBench [40]. In Section 6.3, we characterize the diverse and distinct computation and memory patterns of sixteen AI tasks, emphasizing the necessity of including diverse AI tasks for benchmarking, which is also ignored by MLPerf [4]. In Section 6.4, we drill down to the hotspot functions and analyze their execution stalls.

6.1 Experiment Setup

Node Configurations

We perform experiments on a 16-node CPU cluster and a 4-node GPU cluster. All the nodes are connected with a 1 Gb Ethernet network. Each CPU node is equipped with two Xeon E5645 processors and 32 GB memory. Each processor contains six physical out-of-order cores. Hyper-Threading is disabled. The OS of each node is CentOS 6.9 with Linux kernel version 3.11.10. The software versions are JDK 1.8.0, Python 3.6.8, and GCC 5.4. We perform offline training on four Nvidia Titan XP GPUs. Every Titan XP has 3840 Nvidia CUDA cores and 12 GB memory.

Performance Data Collection

We use the network time protocol (NTP) [46] to synchronize the cluster-wide clock. We use the profiling tool Perf [23] to collect the CPU micro-architectural data through the hardware performance monitoring counters (PMCs). For GPU profiling, we use the Nvidia profiling toolkit nvprof [6] to track the running performance of the GPU. To profile accuracy-ensured performance, we first adjust the parameters, e.g., batch size, to achieve the state-of-the-art quality target of that model on a given dataset, and then sample 1,000 epochs using the same parameter settings. For the GAN-based models, whose accuracy is hard to measure, we set the parameters according to the referenced paper and reproduce the results. We run each benchmark three times and report the average numbers.

6.2 The Necessity of End-to-end Benchmarking

This subsection demonstrates why end-to-end benchmarking is necessary for both online services and offline trainer in Section 6.2.1 and Section 6.2.2, respectively.

Figure 4: Latency of Online Server.

End-to-end Benchmarking is Necessary for Online Server

We deploy online server on the 16-node CPU cluster. Online server contains one query generator node (Jmeter 5.1.1), one search planer node (SpringBoot 2.1.3), two recommender nodes (TensorFlow Serving 1.14.0), nine searcher nodes (Elasticsearch 6.5.2), one ranker node (TensorFlow Serving 1.14.0), and two nodes for data storage (Neo4j 3.5.8 for the user database, Elasticsearch 6.5.2 for the product database).

The product database contains a hundred thousand products, each with 32 attribute fields. Query generator simulates 1000 users with a 30-second warm-up time. The users send query requests continuously, separated by a think time interval that follows a Poisson distribution. The proportions of text queries and image queries are 90% and 10%, respectively. In total, we collect the performance numbers until 20,000 query requests have finished. We train each AI task to achieve the quality target of the referenced paper.
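A minimal sketch of such a query mix is shown below. It is not the JMeter configuration used in the experiments. Two things are assumptions made here: think times are drawn from an exponential distribution (the inter-arrival distribution of a Poisson process), and the mean think time of 1.0 second is a placeholder, since the text does not state it. `generate_requests` is a hypothetical helper.

```python
import random
from typing import List, Tuple


def generate_requests(n: int, mean_think_s: float,
                      image_share: float = 0.10, seed: int = 0) -> List[Tuple[str, float]]:
    """Return n (query_kind, think_time_seconds) pairs.

    query_kind is "image" with probability `image_share`, otherwise "text".
    Think times are exponentially distributed with the given mean (assumption).
    """
    rng = random.Random(seed)  # fixed seed so the workload is reproducible
    requests = []
    for _ in range(n):
        kind = "image" if rng.random() < image_share else "text"
        think = rng.expovariate(1.0 / mean_think_s)
        requests.append((kind, think))
    return requests


requests = generate_requests(20_000, mean_think_s=1.0)
image_fraction = sum(1 for kind, _ in requests if kind == "image") / len(requests)
mean_think = sum(think for _, think in requests) / len(requests)
```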

Latency is an important metric for evaluating service quality. Fig. 4(a) shows the end-to-end latency of online server (see footnote 1). We find that the average, 90th percentile, and 99th percentile latencies of the entire execution path of the current implementation are 215.5, 843, and 1419 milliseconds, respectively.
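For reference, the three reported statistics can be computed from raw per-request latencies as sketched below. The nearest-rank percentile definition is an assumption; the text does not say which percentile definition the measurement tooling uses, and `percentile` and `summarize` are hypothetical helpers.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Average, 90th percentile, and 99th percentile latency."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "p90": percentile(latencies_ms, 90),
        "p99": percentile(latencies_ms, 99),
    }


# Synthetic latencies 1..100 ms, used only to exercise the functions.
summary = summarize(list(range(1, 101)))
```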

We further break down the latency of each module, including the recommender, searcher, search planer, and ranker modules, to identify the critical paths, as shown in Fig. 4(b). The latency of search planer is negligible, so we do not report it in Fig. 4(b). We find that recommender occupies the largest proportion of latency: 48 milliseconds, 60 milliseconds, and 317 milliseconds for the average, 90th percentile, and 99th percentile latency, respectively. In comparison, the latencies of searcher and ranker are both within 5 milliseconds. Although recommender and ranker both contain AI-related components, they incur significantly different latencies.

Furthermore, Fig. 4(c) drills down the latency breakdown of the recommender module to the component level, which includes query_parsing, user_DB_access, image_classifier, text_classifier, and recommendation. We find that user_DB_access (a non-AI component) and recommendation (an AI component) are the top two components that impact the latency. In particular, the average latency of the recommendation component takes up 60% of the average latency of the recommender module, and 13% of the total end-to-end latency of the online server subsystem. The 99th percentile latency of the recommendation component is 289 milliseconds, while the numbers for the recommender module and the whole subsystem are 317 milliseconds and 1419 milliseconds, respectively. The reasons why the end-to-end tail latency deteriorates by dozens or even hundreds of times with respect to a single component are: 1) a single component may not be on the critical path; 2) even when an AI component such as recommendation is on the critical path, there are cascading interaction effects with the other AI and non-AI components.

We also analyze the execution time ratio of the AI components vs. the non-AI components in online server. If we exclude the data preprocessing and communication latency, the time spent on the AI components and the non-AI components is 38 and 17 milliseconds for the average latency, respectively, which indicates that the AI components are an essential part of the critical path of an industry-scale end-to-end benchmark like the E-commerce benchmark.

Can a Statistical Model Predict the End-to-end Tail Latency? As an end-to-end benchmark is much more complex to use in a hardware or software evaluation, an intuitive question is whether we can use a statistical model to predict the end-to-end tail latency. The answer is NO!

The state-of-the-art work [24] uses the M/M/1 and M/M/K queuing models to calculate the p-th percentile latency. We repeat their work and choose the M/M/1 model to predict the latency, as we deploy only one instance of online server. In the M/M/1 model, the p-th percentile latency ($T_p$) and the average latency ($T_m$) can be calculated using the following formulas: $T_p = -\ln(1 - p/100) / (\mu - \lambda)$ and $T_m = 1 / (\mu - \lambda)$. Here $\mu$ is the service rate, with service times following the exponential distribution, and $\lambda$ is the arrival rate, with arrivals following the Poisson distribution.

We measure the service rate $\mu$ as 20 requests per second through the experiments. Then we set the arrival rate $\lambda$ to 1.0 requests per second (10 simulated users), 9.1 requests per second (100 simulated users), and 16.7 requests per second (200 simulated users), respectively. For these settings, the theoretical average latency is 53ms, 91ms, and 303ms, while the actual number is 123ms, 459ms, and 852ms, respectively. The average gap is 3.4 times. The theoretical 99th percentile latency is 242ms, 422ms, and 1394ms, while the actual number is 953ms, 5008ms, and 11980ms, respectively. The average gap is 8.1 times.
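The theoretical numbers and the gaps can be re-derived with a few lines of code. This is a sketch assuming the standard M/M/1 results, namely a mean latency of 1/(mu - lambda) and a p-th percentile latency of -ln(1 - p/100)/(mu - lambda); the function and variable names are ours. The predictions agree with the figures quoted above to within a millisecond or two of rounding, and the average gaps round to the same 3.4 and 8.1.

```python
import math
from typing import List


def mm1_mean_latency(mu: float, lam: float) -> float:
    """Mean sojourn time of an M/M/1 queue, in seconds."""
    if lam >= mu:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    return 1.0 / (mu - lam)


def mm1_percentile_latency(mu: float, lam: float, p: float) -> float:
    """p-th percentile sojourn time of an M/M/1 queue, in seconds."""
    if lam >= mu:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    return -math.log(1.0 - p / 100.0) / (mu - lam)


def average_gap(measured: List[float], predicted: List[float]) -> float:
    """Mean of the per-setting ratios measured / predicted."""
    return sum(m / q for m, q in zip(measured, predicted)) / len(measured)


MU = 20.0                           # measured service rate, requests per second
ARRIVAL_RATES = [1.0, 9.1, 16.7]    # requests per second
measured_avg_ms = [123, 459, 852]
measured_p99_ms = [953, 5008, 11980]

predicted_avg_ms = [1000 * mm1_mean_latency(MU, lam) for lam in ARRIVAL_RATES]
predicted_p99_ms = [1000 * mm1_percentile_latency(MU, lam, 99) for lam in ARRIVAL_RATES]

avg_gap = average_gap(measured_avg_ms, predicted_avg_ms)
p99_gap = average_gap(measured_p99_ms, predicted_p99_ms)
```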

The main reason for this huge gap is as follows. Executing an end-to-end benchmark is complex and uncertain, and the service rate does not follow the exponential distribution, so the M/M/1 model is far from the realistic situation. However, a more general model (such as the G/G/1 model) is difficult to use for calculating the tail latency. Furthermore, if we try to characterize the permutations of executing dozens of components in an end-to-end benchmark, we need a more sophisticated analytical model such as a queuing network model, for which calculating the tail latency is infeasible.

Tradeoff among Service Quality, Model Accuracy, and Model Complexity. The online inference module needs to load the trained model and conduct a forward computation to obtain the result. Usually, increasing the depth of a neural network model may improve the model accuracy, but it leads to a larger model size and a longer inference time. For comparison, we replace ResNet50 with ResNet152 in image_classifier. The model accuracy improves by 1.5%, while the end-to-end 99th percentile latency deteriorates by 9.7X. Hence, Internet service architects must make a tradeoff among service quality, model complexity, and model accuracy.

Tradeoff among Model Update Interval, Accuracy Improvement, and Training Overhead Using Offline Trainer

Updating AI models in real time is a significant domain-specific concern in many scenarios. We evaluate the real-time model update efficiency using offline training. We deploy offline trainer on four Titan XP GPUs.

We adopt an incremental learning method to update the models for online inference, and explore the relationship among the model update interval, training time overhead, and accuracy improvement. Our experiments show that, compared with the original training time and accuracy, 35% additional training time brings a 1.9% accuracy improvement for image_classifier, and 10% additional training time brings a 0.3% accuracy improvement for ranker.

Thus, offline training is an integral part of end-to-end benchmarking. It not only facilitates measuring the model update efficiency, but also provides guidance on how to choose an update interval that balances training overhead against accuracy improvement.

6.3 Why Diversity of AI Tasks Matters for Benchmarking?

We characterize distinct computation and memory patterns of the diverse AI tasks, emphasizing the necessity of including diverse AI tasks for benchmarking.

We characterize the sixteen component benchmarks of AIBench. The AIBench component benchmarks are deployed on the Titan XP GPUs, and we focus on a single GPU performance. The CUDA and Nvidia driver versions are 10.0 and 410.78, respectively.

We evaluate the PyTorch implementations with version 1.1.0. The data set for each benchmark is as follows: ImageNet (137 GB) for image classification and image compression; LSUN (42.8 GB) for image generation; VGGFace2 (36 GB) for face embedding; Microsoft COCO (13 GB) for image-to-text and object detection; MNIST (9.5 MB) for spatial transformer; Cityscapes (267 MB) for image-to-image; MovieLens (190 MB) for recommendation; Librispeech (59.3 GB) for speech recognition; Gowalla (107 MB) for learning to rank; WMT English-German (1.2 MB) for text-to-text translation; the Robot pushing data set (137 GB) for video prediction; the ShapeNet data set (6.8 GB) for 3D object reconstruction; the Gigaword data set (277 MB) for text summarization; and 3D face data (37 GB) for 3D face recognition.

The GPU architecture contains multiple streaming multiprocessors (SMs), each of which has a certain number of CUDA cores, registers, caches, warp schedulers, and so on. To characterize the AIBench component benchmarks from the perspectives of computation and memory access patterns, we choose five micro-architectural metrics: achieved_occupancy, ipc_efficiency, gld_efficiency, gst_efficiency, and dram_utilization. Achieved_occupancy represents the ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor [6]. Ipc_efficiency indicates the ratio of the executed instructions per cycle to the theoretical number [6]. Gld_efficiency means the ratio of the requested global memory load throughput to the required global memory load throughput [6]. Gst_efficiency means the ratio of the requested global memory store throughput to the required global memory store throughput [6]. Dram_utilization means the utilization level of the device memory relative to the peak utilization [6].

Fig. 5 presents the computation and memory characteristics of the sixteen AI benchmarks. We find that they have distinct computation and memory patterns not only under different scenarios, e.g., processing text, image, audio, and video, but also under different tasks of the same scenario, e.g., image classification and image generation. Thus, diverse AI tasks reflecting different computation and memory access patterns should be included in the AI benchmarks. Achieving a state-of-the-art quality target for each AI task incurs heavy training overhead; however, this does not justify including only a few benchmarks [64].

Figure 5: Computation and Memory Patterns of AIBench Components (1: achieved_occupancy; 2: ipc_efficiency; 3: gld_efficiency; 4: gst_efficiency; 5: dram_utilization).

6.4 Drill Down To Functional-level Code

Following the experiments in Section 6.3, we drill down to the hotspot functions and analyze their runtime breakdown and execution stalls for code optimization.

The overall execution performance of these component benchmarks varies in terms of IPC, which measures the executed instructions per cycle. From Fig. 5, we find that the IPC efficiency ranges from 0.25 (Learning_to_rank) to 0.77 (Text_to_Text translation). Some benchmarks like learning_to_rank have extremely low IPC compared with the other benchmarks. To discover the factors that impact the performance most, we first conduct a runtime breakdown analysis and decompose the benchmarks into hotspot kernels or functions, and then we examine the GPU execution efficiency in terms of the percentages of different stalls.

Figure 6: Runtime Breakdown of AIBench Components.

Runtime Breakdown

We use nvprof to trace the runtime breakdown and find the hotspot functions that occupy more than 80% of runtime in total. Since each run involves dozens of function calls, we single out the functions that occupy large proportions of runtime and classify them into several categories of kernels according to their computation logic. Through statistics, we find that the most time-consuming functions among all component benchmarks have much in common, and they are classified into eight categories of kernels, which are a subset of the AIBench micro benchmarks: data arrangement, convolution, general matrix multiply (gemm), batch normalization, element-wise operation, relu activation (see footnote 2), pooling, and memory copy, spanning from computation kernels to memory access kernels. Note that each kernel contains a group of functions that solve a similar problem. For example, a gemm kernel includes single or double precision floating-point general matrix multiply. Fig. 6 shows the runtime breakdown of the sixteen component benchmarks, using the average number of all involved functions within each micro benchmark. Note that the functions in the remaining 20% of runtime are not considered in this figure. Further, for each micro benchmark, we summarize typical functions that occupy a large proportion of runtime among the component benchmarks, as shown in Table 3. From Fig. 6, we find that learning_to_rank spends too much time on data arrangement operations, and the corresponding function call is maxwell_scudnn_128x32_stridedB_splitK_interior_nn with an IPC of 0.98. This is the reason why learning_to_rank has the lowest IPC of 0.99. We believe that the eight micro benchmarks and their corresponding functions are the optimization points not only for CUDA library optimizations but also for micro-architectural optimizations.
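The selection and grouping procedure can be sketched as follows. This is an illustration under assumptions, not the scripts used for Fig. 6: the substring rules are loosely derived from the function names in Table 3, the per-kernel runtimes are made-up numbers, `misc_kernel` is a hypothetical name, and the sketch reports each category's share of summed runtime rather than the per-function averages used in the figure.

```python
from typing import Dict, List

# (substring, category) rules, checked in order; illustrative only.
CATEGORY_RULES = [
    ("relu", "Relu"),
    ("winograd", "Convolution"),
    ("wgrad", "Convolution"),
    ("fft2d", "Convolution"),
    ("stridedB", "Data Arrangement"),
    ("gemm", "GEMM"),
    ("bn_", "BatchNorm"),
    ("batch_norm", "BatchNorm"),
    ("element-wise", "Element-wise"),
    ("Pool", "Pooling"),
    ("memcpy", "Memcpy"),
]


def classify(function_name: str) -> str:
    """Map a profiled function name to a micro benchmark category."""
    for needle, category in CATEGORY_RULES:
        if needle in function_name:
            return category
    return "Other"


def hotspots(times_ms: Dict[str, float], coverage: float = 0.80) -> List[str]:
    """Pick the most time-consuming functions until `coverage` of runtime is reached."""
    total = sum(times_ms.values())
    picked, covered = [], 0.0
    for name, t in sorted(times_ms.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= coverage * total:
            break
        picked.append(name)
        covered += t
    return picked


def breakdown(times_ms: Dict[str, float], names: List[str]) -> Dict[str, float]:
    """Share of the selected functions' runtime spent in each category."""
    totals: Dict[str, float] = {}
    for name in names:
        category = classify(name)
        totals[category] = totals.get(category, 0.0) + times_ms[name]
    selected_total = sum(totals.values())
    return {category: t / selected_total for category, t in totals.items()}


# Made-up runtimes in milliseconds for a single hypothetical benchmark run.
profile = {
    "maxwell_scudnn_128x32_stridedB_splitK_interior_nn": 40.0,
    "maxwell_sgemm_128x64_nt": 25.0,
    "maxwell_scudnn_128x128_relu_small_nn": 20.0,
    "CUDA memcpy HtoD": 8.0,
    "MaxPoolBackward": 4.0,
    "misc_kernel": 3.0,
}
hot = hotspots(profile)
shares = breakdown(profile, hot)
```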

| Micro Benchmark | Function Name |
|---|---|
| Data Arrangement | maxwell_scudnn_128x128_stridedB_splitK_interior_nn |
| | maxwell_scudnn_128x32_stridedB_splitK_interior_nn |
| | maxwell_scudnn_128x128_stridedB_interior_nn |
| Convolution | maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt |
| | wgrad_alg0_engine |
| | fft2d_r2c_32x32 |
| GEMM | maxwell_sgemm_128x64_nt |
| | maxwell_sgemm_128x64_nn |
| | sgemm_32x32x32_NN_vec |
| BatchNorm | cudnn::detail::bn_fw_tr_1C11_kernel_NCHW |
| | cudnn::detail::bn_bw_1C11_kernel_new |
| | batch_norm_backward_kernel |
| | at::native::batch_norm_backward_kernel |
| Relu | maxwell_scudnn_128x128_relu_small_nn |
| | maxwell_scudnn_128x128_relu_interior_nn |
| | maxwell_scudnn_128x32_relu_interior_nn |
| Element-wise | element-wise add kernel |
| | element-wise threshold kernel |
| | element-wise mul kernel |
| Pooling | MaxPoolBackward |
| | AvePoolForward |
| Memcpy | CUDA memcpy HtoD |
| | CUDA memcpy DtoD |

Table 3: Hotspot Functions.

Stall Analysis

Focusing on the above eight most time-consuming micro benchmarks, we evaluate the following stalls of these kernels. Instruction fetch stall (Inst_fetch) is the percentage of stalls because the next assembly instruction has not yet been fetched; Execution dependency stall (Exe_depend) is the percentage of stalls because an input required by the instruction is not yet available; Memory dependency stall (Mem_depend) is the percentage of stalls because a memory operation cannot be performed due to the required resources not being available or being fully utilized; Texture stall (Texture) is the percentage of stalls because the texture sub-system is fully utilized or has too many outstanding requests; Synchronization stall (Sync) is the percentage of stalls due to a syncthreads call; Constant memory dependency stall (Const_mem_depend) is the percentage of stalls because of an immediate constant cache miss; Pipe busy stall (Pipe_busy) is the percentage of stalls because a compute operation cannot be performed while the compute pipeline is busy; Memory throttle stall (Mem_throttle) is the percentage of stalls due to a large number of pending memory operations [6].

Figure 7: Stall Breakdown of the Hotspot Functions.

The breakdown of the eight stalls of the hotspot functions is shown in Fig. 7. The top two GPU execution stalls are memory dependency stalls and execution dependency stalls. For example, for the Element-wise benchmark, the memory dependency stalls occupy a very large proportion of 70%, resulting in a low IPC of about 0.86 on average. The memory dependency stalls may occur due to high cache miss rates, which leave the load/store resources unavailable. Possible optimization strategies include optimizing data alignment, data locality, and data access patterns. The execution dependency stalls may occur due to low instruction-level parallelism (ILP), and exploiting ILP may alleviate part of the execution dependency stalls.
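Ranking the stall reasons of one kernel is straightforward once the percentages are exported from the profiler. In the sketch below, only the 70% memory dependency share of the Element-wise benchmark comes from the text; every other percentage is a placeholder chosen so that the values sum to 100, and `dominant_stalls` is a hypothetical helper.

```python
from typing import Dict, List


def dominant_stalls(stalls_pct: Dict[str, float], top: int = 2) -> List[str]:
    """Return the names of the `top` largest stall reasons, largest first."""
    ranked = sorted(stalls_pct.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:top]]


# Stall breakdown for the Element-wise benchmark; only Mem_depend is from the text.
element_wise = {
    "Mem_depend": 70.0,
    "Exe_depend": 15.0,        # placeholder
    "Inst_fetch": 5.0,         # placeholder
    "Sync": 4.0,               # placeholder
    "Pipe_busy": 3.0,          # placeholder
    "Mem_throttle": 2.0,       # placeholder
    "Texture": 1.0,            # placeholder
    "Const_mem_depend": 0.0,   # placeholder
}
top_two = dominant_stalls(element_wise)
```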

7 Related Work

State-of-the-art and state-of-the-practice AI or Internet service benchmarks provide only a few micro or component benchmarks, as shown in Table 4. None of them distills representative and essential AI and non-AI components, let alone the permutations of different AI and non-AI components, to characterize industry-scale AI and Internet service applications.

AIBench | MLPerf | Fathom | DeepBench | DNNMark | DAWNBench | TBD

Benchmark Framework (Extensible)
Modular-design: ✓ ✓

End-to-End Application Benchmark
Online module: ✓
Offline module: ✓

Component Benchmark
Image classification | Train: ✓ ✓ ✓ ✓ ✓ | Infer: ✓ ✓ ✓ ✓
Image generation | Train: ✓ ✓ | Infer: ✓
Text-to-Text | Train: ✓ ✓ ✓ ✓ | Infer: ✓ ✓ ✓
Image-to-Text | Train: ✓ | Infer: ✓
Image-to-Image | Train: ✓ | Infer: ✓
Speech recognition | Train: ✓ ✓ ✓ ✓ | Infer: ✓ ✓ ✓ ✓
Face embedding | Train: ✓ | Infer: ✓
3D Face Recognition | Train: ✓ | Infer: ✓
Object detection | Train: ✓ ✓ ✓ | Infer: ✓ ✓
Recommendation | Train: ✓ ✓ ✓ | Infer: ✓
Video prediction | Train: ✓ | Infer: ✓
Image compression | Train: ✓ ✓ | Infer: ✓ ✓
3D object reconstruction | Train: ✓ | Infer: ✓
Text summarization | Train: ✓ | Infer: ✓
Spatial transformer | Train: ✓ | Infer: ✓
Learning to rank | Train: ✓ | Infer: ✓
Games | Train: ✓ ✓ ✓ | Infer: ✓
Memory network | Train: ✓ | Infer: ✓
Question answering | Train: ✓ | Infer: ✓

Micro Benchmark
Convolution: ✓ ✓ ✓
Fully connected: ✓ ✓ ✓
Element-wise op: ✓
Pooling: ✓ ✓
Normalization: ✓ ✓
Dropout: ✓ ✓
Softmax: ✓ ✓
Memory access: ✓
AllReduce: ✓
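Real-world Data sets and Software Stack

| | AIBench | MLPerf | Fathom | DeepBench | DNNMark | DAWNBench | TBD |
|---|---|---|---|---|---|---|---|
| Text data | 3 | 1 | 2 | N/A | N/A | 1 | 1 |
| Image data | 8 | 2 | 2 | N/A | N/A | 2 | 4 |
| 3D data | 2 | 0 | 0 | N/A | N/A | 0 | 0 |
| Audio data | 1 | 0 | 1 | N/A | N/A | 0 | 2 |
| Video data | 1 | 0 | 1 | N/A | N/A | 0 | 0 |
| Software Stack | 3 | 2 | 1 | 1 | 1 | 2 | 4 |

Table 4: AI Benchmark Comparison.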

MLPerf [3] is an ML benchmark suite targeting six AI tasks: image classification, object detection, speech recognition, translation, recommendation, and reinforcement learning. It provides both light-weight and heavy-weight implementations. In total, it includes seven benchmarks for training and five benchmarks for inference. The MLPerf training benchmark [45] proposes a series of benchmarking rules to eliminate the side effects of the stochastic nature of AI.

DAWNBench [20] is a benchmark and competition focusing on end-to-end performance, which means the training or inference time to achieve a state-of-the-art accuracy. It only focuses on two component benchmarks including image classification on CIFAR10 and ImageNet, and question answering on SQuAD.

Fathom [11] provides eight deep learning component benchmarks implemented with TensorFlow. Three of the eight benchmarks use different models for the image classification task. The Autoenc workload provides a variational autoencoder and can be used to reduce the dimension and compress images.

TBD Suite [65] is a benchmark suite for DNN training. It provides eight neural network models that cover six AI tasks. TailBench [40] is a benchmark suite consisting of eight latency-critical workloads.

DeepBench [2] consists of four operations involved in training deep neural networks, including three basic operations and recurrent layer operations. DNNMark [26] is a GPU benchmark suite that consists of a collection of deep neural network primitives. Both DeepBench and DNNMark ignore the quality target in benchmarking.

Additionally, for machine learning and deep learning evaluation, MLModelScope [22] proposes a specification for repeatable model evaluation and a runtime to measure experiments.

AIBench differs from the other benchmark suites in two significant ways. First, it proposes the permutations of essential AI and non-AI components as end-to-end benchmarks, and provides a reusable framework to speed up building end-to-end benchmarks. Second, it considers end-to-end benchmarks, component benchmarks, and micro benchmarks as three integrated parts. As a marked departure from the past, AIBench lets software and hardware designers learn about the overall system information (end-to-end benchmarks), provides diverse computation and memory access patterns (component benchmarks) as the design inputs for micro-architectural researchers, and drills down to hotspot functions (micro benchmarks) for code developers.

8 Conclusion

This paper proposes an agile domain-specific benchmarking methodology that speeds up software and hardware co-design. Together with seventeen industry partners, we identify ten end-to-end application scenarios, and distill sixteen representative AI tasks and fourteen time-consuming units of computation. We propose the permutations of the essential AI and non-AI tasks as the end-to-end benchmark to characterize industry-scale applications. We design and implement a reusable framework to facilitate agile end-to-end benchmark building. We build the first end-to-end benchmark to model E-commerce search intelligence. Our evaluation shows that the end-to-end benchmark, integrating both online service and offline training, provides overall system performance for hardware and software designers. The component benchmarks reflect diverse computation and memory access patterns, which are essential for micro-architectural researchers. The micro benchmarks represent hotspot functions, which are beneficial to code optimization.

Footnotes

  1. Compared with the real numbers of our industry partner, this latency is quite high. They have taken many measures to decrease the overall latency.
  2. Relu activation is an element-wise operation; here we treat Relu as a separate category considering its large proportion of runtime and its diverse CUDA functions.

References

  1. https://nlp.stanford.edu/projects/nmt/.
  2. “Deepbench,” https://svail.github.io/DeepBench/.
  3. “Mlperf,” https://mlperf.org.
  4. “Mlperf website,” https://www.mlperf.org.
  5. “Nginx website,” http://nginx.org/.
  6. “Nvidia profiling toolkit,” https://docs.nvidia.com/cuda/profiler-users-guide/index.html.
  7. “Pytorch,” http://pytorch.org.
  8. “Spark,” https://spark.apache.org/.
  9. “Speccpu 2017,” https://www.spec.org/cpu2017/.
  10. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
  11. R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks, “Fathom: reference workloads for modern deep learning methods,” in Workload Characterization (IISWC).   IEEE, 2016, pp. 1–10.
  12. D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, J. Chen, J. Chen, Z. Chen, M. Chrzanowski, A. Coates, G. Diamos, K. Ding, N. Du, E. Elsen, J. Engel, W. Fang, L. Fan, C. Fougner, L. Gao, C. Gong, A. Hannun, T. Han, L. V. Johannes, B. Jiang, C. Ju, B. Jun, P. LeGresley, L. Lin, J. Liu, Y. Liu, W. Li, X. Li, D. Ma, S. Narang, A. Ng, S. Ozair, Y. Peng, R. Prenger, S. Qian, Z. Quan, J. Raiman, V. Rao, S. Satheesh, D. Seetapun, S. Sengupta, K. Srinet, A. Sriram, H. Tang, L. Tang, C. Wang, J. Wang, K. Wang, Y. Wang, Z. Wang, Z. Wang, S. Wu, L. Wei, B. Xiao, W. Xie, Y. Xie, D. Yogatama, B. Yuan, J. Zhan, and Z. Zhu, “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International conference on machine learning, 2016, pp. 173–182.
  13. M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
  14. G. Ayers, J. H. Ahn, C. Kozyrakis, and P. Ranganathan, “Memory hierarchy for web search,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).   IEEE, 2018, pp. 643–656.
  15. D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, “The nas parallel benchmarks,” The International Journal of Supercomputing Applications, vol. 5, no. 3, pp. 63–73, 1991.
  16. L. A. Barroso and U. Hölzle, “The datacenter as a computer: An introduction to the design of warehouse-scale machines,” Synthesis Lectures on Computer Architecture, vol. 4, no. 1, pp. 1–108, 2009.
  17. Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018).   IEEE, 2018, pp. 67–74.
  18. A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
  19. E. Cho, S. A. Myers, and J. Leskovec, “Friendship and mobility: user movement in location-based social networks,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2011, pp. 1082–1090.
  20. C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, “Dawnbench: An end-to-end deep learning benchmark and competition,” Training, vol. 100, no. 101, p. 102, 2017.
  21. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
  22. A. Dakkak, C. Li, J. Xiong, and W.-m. Hwu, “Frustrated with replicating claims of a shared model? a solution,” arXiv preprint arXiv:1811.09737, 2019.
  23. A. C. De Melo, “The new linux perf tools,” in Slides from Linux Kongress, vol. 18, 2010.
  24. C. Delimitrou and C. Kozyrakis, “Amdahl’s law for tail latency,” Communications of the ACM, vol. 61, no. 8, pp. 65–72, 2018.
  25. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.   IEEE, 2009, pp. 248–255.
  26. S. Dong and D. Kaeli, “Dnnmark: A deep neural network benchmark suite for gpus,” in Proceedings of the General Purpose GPUs.   ACM, 2017, pp. 63–72.
  27. C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in Advances in neural information processing systems, 2016, pp. 64–72.
  28. W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, F. Tang, B. Xie, C. Zheng, X. Wen, X. He, H. Ye, and R. Ren, “Data motifs: A lens towards fully understanding big data and ai workloads,” Parallel Architectures and Compilation Techniques (PACT), 2018 27th International Conference on, 2018.
  29. W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, X. Wen, R. Ren, C. Zheng, X. He, H. Ye, H. Tang, Z. Cao, S. Zhang, and J. Dai, “Bigdatabench: A scalable and unified big data and ai benchmark suite,” arXiv preprint arXiv:1802.08254, 2018.
  30. C. Gormley and Z. Tong, Elasticsearch: the definitive guide: a distributed real-time search and analytics engine. O’Reilly Media, Inc., 2015.
  31. F. M. Harper and J. A. Konstan, “The movielens datasets: History and context,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, p. 19, 2016.
  32. K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang, “Applied machine learning at facebook: A datacenter infrastructure perspective,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).   IEEE, 2018, pp. 620–629.
  33. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  34. X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural collaborative filtering,” in Proceedings of the 26th international conference on world wide web.   International World Wide Web Conferences Steering Committee, 2017, pp. 173–182.
  35. G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” in Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition, 2008.
  36. M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
  37. “Apache JMeter,” Online, 2016. Available: http://jmeter.apache.org/ (visited 2017-04-25).
  38. A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov, “Fasttext.zip: Compressing text classification models,” arXiv preprint arXiv:1612.03651, 2016.
  39. N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. Mackean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture.   ACM, 2017, pp. 1–12.
  40. H. Kasture and D. Sanchez, “Tailbench: a benchmark suite and evaluation methodology for latency-critical applications,” in 2016 IEEE International Symposium on Workload Characterization (IISWC).   IEEE, 2016, pp. 1–10.
  41. A. Krizhevsky, V. Nair, and G. Hinton, “The cifar-10 dataset,” [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html, 2014.
  42. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  43. Y. LeCun, C. Cortes, and C. Burges, “Mnist handwritten digit database,” AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.
  44. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  45. P. Mattson, C. Cheng, C. Coleman, G. Diamos, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei, P. Bailis, V. Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock, X. Huang, B. Jia, D. Kang, D. Kanter, N. Kumar, J. Liao, G. Ma, D. Narayanan, T. Oguntebi, G. Pekhimenko, L. Pentecost, V. J. Reddi, T. Robie, T. St. John, C.-J. Wu, L. Xu, C. Young, and M. Zaharia, “Mlperf training benchmark,” arXiv preprint arXiv:1910.01500, 2019.
  46. D. L. Mills, “Network time protocol (NTP),” RFC 958, 1985.
  47. Z. Ming, C. Luo, W. Gao, R. Han, Q. Yang, L. Wang, and J. Zhan, “Bdgs: A scalable big data generator suite in big data benchmarking,” arXiv preprint arXiv:1401.5465, 2014.
  48. R. Nallapati, B. Zhou, C. Gulcehre, and B. Xiang, “Abstractive text summarization using sequence-to-sequence rnns and beyond,” arXiv preprint arXiv:1602.06023, 2016.
  49. Y. Ni, D. Ou, S. Liu, X. Li, W. Ou, A. Zeng, and L. Si, “Perceive your users in depth: Learning universal user representations from multiple e-commerce tasks,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.   ACM, 2018, pp. 596–605.
  50. C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, “Tensorflow-serving: Flexible, high-performance ml serving,” arXiv preprint arXiv:1712.06139, 2017.
  51. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2015, pp. 5206–5210.
  52. S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  53. A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for sentence summarization,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
  54. F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
  55. B. Smith and G. Linden, “Two decades of recommender systems at amazon.com,” IEEE Internet Computing, vol. 21, no. 3, pp. 12–18, 2017.
  56. J. Tang and K. Wang, “Ranking distillation: Learning compact ranking models with high performance for recommender system,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
  57. G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306–5314.
  58. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  59. R.-L. Vieriu, S. Tulyakov, S. Semeniuta, E. Sangineto, and N. Sebe, “Facial expression recognition under a wide range of head poses,” in 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), vol. 1.   IEEE, 2015, pp. 1–7.
  60. O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: Lessons learned from the 2015 mscoco image captioning challenge,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 4, pp. 652–663, 2017.
  61. P. Webb, D. Syer, J. Long, S. Nicoll, R. Winch, A. Wilkinson, M. Overdijk, C. Dupuis, and S. Deleuze, “Spring boot reference guide,” Part IV. Spring Boot features, vol. 24, 2013.
  62. X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, “Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision,” in Advances in Neural Information Processing Systems, 2016, pp. 1696–1704.
  63. F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.
  64. J. Zhan, L. Wang, W. Gao, and R. Ren, “Benchcouncil’s view on benchmarking ai and other emerging workloads,” Technical Report, 2019.
  65. H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Phanishayee, B. Schroeder, and G. Pekhimenko, “Tbd: Benchmarking and analyzing deep neural network training,” arXiv preprint arXiv:1803.06905, 2018.
  66. J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.