Domain-specific software and hardware co-design is promising, as it is much easier to achieve efficiency for fewer tasks. Agile domain-specific benchmarking speeds up the process, as it provides not only relevant design inputs but also relevant metrics and tools. Unfortunately, modern workloads like big data, AI, and Internet services dwarf traditional ones in terms of code size, deployment scale, and execution path, and hence raise serious benchmarking challenges.
This paper proposes an agile domain-specific benchmarking methodology. Together with seventeen industry partners, we identify ten important end-to-end application scenarios, from which sixteen representative AI tasks are distilled as the AI component benchmarks. We propose permutations of essential AI and non-AI component benchmarks as end-to-end benchmarks. An end-to-end benchmark is a distillation of the essential attributes of an industry-scale application. We design and implement a highly extensible, configurable, and flexible benchmark framework, on the basis of which we propose a guideline for building end-to-end benchmarks and present the first end-to-end Internet service AI benchmark.
The preliminary evaluation shows the value of our benchmark suite, AIBench, against MLPerf and TailBench for hardware and software designers, micro-architectural researchers, and code developers. The specifications, source code, testbed, and results are publicly available from the website http://www.benchcouncil.org/AIBench/index.html.
An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite
Abstract and Section 1 (Introduction) were contributed by Jianfeng Zhan. Section 2 was contributed by Jianfeng Zhan, Lei Wang, Wanling Gao, and Fei Tang. Section 3 was contributed by Jianfeng Zhan. Section 4.1 was contributed by Chunjie Luo, Fei Tang, Zihan Jiang, Wanling Gao, Jianfeng Zhan, and seventeen industry partners. Section 4.2 and component benchmarks were contributed by Wanling Gao, Chunjie Luo, Xingwang Xiong, Fei Tang, Zihan Jiang, Tianshu Hao, Fanda Fan, Xu Wen, Fan Zhang, Yunyou Huang, Jianan Chen, and Mengjia Du. Section 4.3 and micro benchmarks were contributed by Wanling Gao and Daoyi Zheng. Section 4.4 was contributed by Wanling Gao, Fei Tang, Lei Wang, and Jianfeng Zhan. Section 5 was contributed by Fei Tang, Wanling Gao, Lei Wang, and Jianfeng Zhan. Section 6 was contributed by Jianfeng Zhan, Wanling Gao, Fei Tang, Lei Wang, and Chuanxin Lan. Section 7 and Section 8 were contributed by Jianfeng Zhan, Wanling Gao, and Lei Wang. Rui Ren and Chen Zheng provided testbed support.
BenchCouncil: International Open Benchmarking Council
Chinese Academy of Sciences
Technical Report No. BenchCouncil-AIBench-2020
February 17, 2020
1 Introduction
As it is much easier to achieve more efficient algorithms, systems, and architectures for fewer tasks, domain-specific software and hardware co-design is widely explored. For example, Internet service giants like Facebook, Google, and Alibaba each focus on a specific application domain, i.e., search engines, social networks, and E-commerce, respectively, and they are active practitioners. The ongoing AI accelerator boom is another witness to this trend. As AI advancements have brought breakthroughs in processing images, video, speech, and audio, Internet service providers pervasively perform software and hardware AI co-design to augment their services [49, 32, 10, 39, 55]. This trend is also witnessed by big data advancements, and there are hundreds of single-purpose solutions in the form of NoSQL, NewSQL, or hardware accelerators.
Agile domain-specific benchmarking speeds up software and hardware co-design. Unfortunately, modern workloads dwarf traditional ones in terms of code size, deployment scale, and execution path, and hence raise serious benchmarking challenges. For example, traditional desktop workloads, e.g., data compression and image manipulation, are about one hundred thousand lines of code and run on a single node. Web server workloads are hundreds of thousands of lines of code and run on a small-scale cluster, i.e., dozens of nodes. However, for modern workloads, their runtime environment stacks (e.g., Spark, TensorFlow) alone are more than millions of lines of code, and these workloads often run on a large-scale cluster, i.e., tens of thousands of nodes. Moreover, modern Internet services adopt a microservice-based architecture, which is often distributed across different datacenters and consists of a diversity of AI and non-AI modules with very long and complex execution paths. Worst of all, the real-world data sets, workloads, or even AI models are hidden within the giant Internet service providers' datacenters [32, 14], which further exaggerates the benchmarking challenges.
On one hand, hardware and software designers should consider the overall system's effects. Using micro (interchangeable with kernel in this paper) or component benchmarks alone can lead to incorrect conclusions. For example, as discussed in Section 6.2.1, we found that in terms of the mere execution path, the end-to-end tail latency deteriorates by up to hundreds of times compared with a single AI component's tail latency, which cannot be predicted by a state-of-the-art statistical model. Hereby, end-to-end indicates the overall critical path. It may refer to the end-to-end (tail) latency of an online service, or even cover offline AI training when updating an AI model for online services in a real-time manner, as discussed in Section 6.2.2.
On the other hand, it is usually difficult to justify porting a full-scale end-to-end application to a new computer system or architecture simply to obtain a benchmark number [29, 15]. For hardware designers, an end-to-end application is too large to run on simulators. In addition, evaluating a full-scale end-to-end application raises difficulties in the reproducibility and interpretability of performance data, and may lead to error-prone conclusions. Even after gaining full knowledge of the overall critical path, micro and component benchmarks remain a necessary part of the evaluation.
In other words, we believe a domain-specific benchmark suite should have three integrated parts. End-to-end benchmarks let software and hardware designers learn about the overall system behavior. Each end-to-end benchmark is a distillation of the essential attributes of an industry-scale application, and hence reduces the side effects of the latter's huge code size, extreme deployment scale, and complex execution paths. Measuring the achieved performance and quality targets of representative AI tasks, the component benchmarks provide diverse computation and memory access patterns for micro-architectural researchers. Finally, micro benchmarks allow code developers to drill down to hotspot functions for performance optimization.
This paper proposes an agile domain-specific benchmarking methodology, as shown in Fig. 1. Without loss of generality, we apply it to characterizing the AI and Internet service application domains. First, in cooperation with seventeen industry partners, we investigate their domain-specific benchmarking requirements and extract ten important end-to-end application scenarios. Instead of using real-world applications, we propose permutations of essential AI and non-AI tasks as end-to-end benchmarks.
Second, we identify sixteen representative AI tasks as the AI component benchmarks, with both performance and quality targets. After profiling the sixteen AI component benchmarks, we identify and implement fourteen frequently appearing units of computation as the micro benchmarks.
Third, we present a highly extensible, configurable, and flexible benchmark framework, allowing researchers to create end-to-end applications using different components commonly found in major application domains. On the basis of the framework, we propose guidelines on how to build end-to-end benchmarks, and design and implement the first end-to-end Internet service AI benchmark: E-commerce search intelligence.
The evaluation on a hybrid cluster consisting of 16 CPU nodes and 4 GPU nodes shows the value of AIBench against MLPerf and TailBench. We gain many insights for hardware and software designers, micro-architectural researchers, and code developers. Several important observations are as follows: (1) In serving the same request, different AI components incur significantly different latencies; the end-to-end tail latency deteriorates by dozens or even hundreds of times with respect to a single AI component's, which cannot be predicted by a state-of-the-art statistical model. (2) Internet service architects must perform a tradeoff among service quality, model complexity, and model accuracy. (3) AI models are updated in a real-time manner in many end-to-end application scenarios, so offline training should be included in end-to-end benchmarking. (4) As they demonstrate distinct computation and memory patterns, diverse AI tasks should be included in the AI component benchmarks. (5) Drilling down to hotspot functions is helpful for code optimization.
The rest of this paper is organized as follows. Section 2 explains the motivation. Section 3 summarizes the methodology. Section 4 describes how to characterize the AI and Internet service application domains. Section 5 illustrates how to build an end-to-end benchmark. Section 6 performs evaluation. Section 7 summarizes the related work. Section 8 draws a conclusion.
2 Motivation
2.1 Why End-to-end Benchmarking Is Necessary
The end-to-end tail latency deteriorates by up to 100X compared with a single component's tail latency. The end-to-end tail latency indicates the overall performance of the entire execution path, while a component tail latency only reports the performance of a single module. Our experiments in Section 6.2.1 show that the end-to-end tail latency deteriorates by dozens or even hundreds of times compared with a single component's tail latency. For the recommendation AI component, the difference is 13X, while for image classification, the difference reaches up to 296X.
Debugging the performance of a single component benchmark alone does not touch the full execution path and fails to provide bottleneck information about the primary modules within a critical path. Considering the 90th percentile latency, we found that among the four AI-related components, the recommendation component occupies 72% of the execution time, while the image classification component occupies only 1.1%. This indicates that benchmarking a single AI component alone, without the overall critical path, does not make sense.
2.2 Can a Statistical Model Predict the End-to-end Tail Latency?
One may argue that after profiling many components' tail latencies, a statistical model can predict the end-to-end tail latency. Our answer is no. In Section 6.2.1, we use a state-of-the-art queueing model to evaluate the end-to-end application's latency and tail latency. Through the experimental evaluations, we find that the gap between the actual average latency and the theoretical one is 3.4 times, while the gap between the actual 99th percentile latency and the theoretical one is 8.1 times. Furthermore, the state-of-the-art queueing model for tail latency takes the system as a whole, and is not suited for an end-to-end application that needs to characterize the permutations of several or dozens of components.
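The intuition behind this gap can be illustrated with a small Monte Carlo sketch. A request traverses several components in sequence, each with a heavy-tailed latency distribution; a naive model that composes per-component 99th percentiles generally mispredicts the tail of the actual per-request sum. All distributions and parameters below are illustrative, not measured AIBench numbers.

```python
# Why per-component tail latencies do not predict the end-to-end tail:
# simulate a request passing through several components in sequence,
# each with a heavy-tailed (lognormal) latency. Parameters are made up.
import random

random.seed(42)

def percentile(samples, p):
    """Return the p-th percentile of a list of samples."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(p / 100.0 * len(s)))
    return s[idx]

N = 20000          # simulated requests
COMPONENTS = 4     # e.g., classification, ranking, recommendation, ...

# Per-component latency samples (ms), heavy-tailed.
component_samples = [
    [random.lognormvariate(1.0, 1.0) for _ in range(N)]
    for _ in range(COMPONENTS)
]

# Naive "statistical model": compose each component's own 99th percentile.
naive_p99 = sum(percentile(s, 99) for s in component_samples)

# Actual end-to-end latency: the per-request sum across components.
end_to_end = [sum(component_samples[c][i] for c in range(COMPONENTS))
              for i in range(N)]
actual_p99 = percentile(end_to_end, 99)
single_p99 = max(percentile(s, 99) for s in component_samples)

print(f"worst single-component P99: {single_p99:.1f} ms")
print(f"actual end-to-end P99:      {actual_p99:.1f} ms")
print(f"naively composed P99:       {naive_p99:.1f} ms")
```

The end-to-end P99 always dominates any single component's P99, while the naively composed figure diverges from the actual tail; a real service adds queueing and cross-component correlation on top of this, widening the gap further.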
2.3 Why Offline AI Training Is Also Essential in End-to-end Benchmarking
As witnessed by many of our industry partners, when an AI model is used for an online service, it has to be updated in a real-time manner. For example, one E-commerce giant demands that its AI models be updated every hour, and each updated model brings in about a 3% improvement in click-through rate and millions in profit. In Section 6.2.2, the evaluation shows that offline training should be included in end-to-end benchmarking for performing tradeoffs among model update interval, training overhead, and accuracy improvement.
3 Agile Domain-specific Benchmarking Methodology
As modern AI and Internet service workloads are not only diverse, but also fast-changing and expanding, the traditional benchmark methodology that creates a new benchmark or proxy for every possible workload is prohibitively costly and even impossible. Hence an agile domain-specific benchmarking methodology is essential. Fig. 1 summarizes our methodology.
Step One. We investigate domain-specific benchmarking requirements with the industry partners. The input of this step is a candidate list of industry-scale applications. Simply copying the real-world applications is impossible for two reasons. First, the partners treat their real-world workloads, data sets, and models as confidential. Second, the massive code size, extreme deployment scale, and complex execution paths make it infeasible. So the purpose of this step is to understand the essential components of these applications and the permutations of different components.
Step Two. On the basis of the output from Step One, this step distills representative AI and non-AI tasks. Different from traditional tasks, each AI task, like image classification, has both performance and quality targets. Generally, an AI component specification defines a task in a high-level language, only algorithmically, in a paper-and-pencil approach. We implement each task as a component benchmark. The benchmark also provides a reference AI model, evaluation metrics, and a state-of-the-art quality target.
Step Three. According to the output of Step Two, we profile the full component benchmarks and drill down to frequently appearing and time-consuming units of computation. We implement those units of computation as micro benchmarks. Micro benchmarks are easily portable to new architectures and systems, and are beneficial for fine-grained profiling and tuning.
Step Four. According to the outputs of Steps One and Two, we design and implement a reusable benchmark framework, including AI and non-AI component libraries, and the data input, online inference, offline training, and deployment tool modules.
Step Five. On the basis of the benchmark framework, we build end-to-end benchmarks. Each end-to-end benchmark models the permutation of several or tens of essential AI or non-AI components, reflecting the complex interactions among different modules and depicting the overall system's performance. In addition, we propose domain-specific evaluation metrics.
4 The AIBench Design and Implementation
We first give a summary of the seventeen industry partners' benchmarking requirements, and then identify the representative AI tasks (component benchmarks and micro benchmarks). Finally, we propose the reusable benchmark framework.
4.1 Seventeen Industry Partners’ Benchmarking Requirements
Collaborating with seventeen industry partners whose domains include search engines, E-commerce, social networks, news feeds, video, etc., we extract the essential end-to-end application scenarios from their products or services.
The real-world applications are complex, and we only distill the permutations of primary AI and non-AI tasks. Table 1 summarizes the list of end-to-end application scenarios.
For example, the first scenario in Table 1, E-commerce search intelligence, is extracted from an E-commerce giant. A user is classified into different groups to provide personalized services. The search results are ranked according to the relations between the queries and the products, and the ranking is adjusted by learning from the query history and click logs. The recommended products are also returned to the users along with the search results. We distill this industry-scale application into several AI tasks, like classification, learning to rank, and recommendation, and non-AI tasks, like query parsing, database operation, and indexing. Section 5.1 describes how to implement this benchmark on the basis of the reusable framework described in Section 4.
In general, end-to-end benchmarks concern the overall system's effects, including quality-ensured response latency, tail latency, and latency-bounded throughput. A quality-ensured performance example is that the quality (e.g., accuracy) deviation from the target is within 2%. Different application scenarios have domain-specific evaluation metrics. For example, several scenarios require that the AI model be updated in a real-time manner.
| End-to-End Application Scenario | Involved AI Tasks | Involved Non-AI Tasks | Data | Metrics | Model Update Frequency |
|---|---|---|---|---|---|
| E-commerce search intelligence | Classification; Learning to rank; Recommendation | Query parsing; Database operation; Indexing | User data; Product data; Query data | Precision, Recall, Latency | High |
| Language and dialogue translation | Text-to-Text translation; Speech recognition | Query parsing | Text; Speech | Accuracy, Latency | Low |
| Content-based image retrieval | Object detection; Classification; Spatial transformer; Image-to-Text | Query parsing; Indexing; Sort | Image | Precision, Recall, Latency | High |
| Web searching | Text summarization; Learning to rank; Recommendation | Query parsing; Indexing; Crawler; Sort; Hash | Product data; Query data | Precision, Recall, Latency | High |
| Facial authentication and payment | Face embedding; 3D face recognition | Encryption | Face image | Accuracy, Latency | Low |
| News feed | Recommendation | Database operation; Sort; Basic statistics; Filter | Text | Precision, Recall | High |
| Photo translation | Classification; Spatial transformer; Text-to-Text translation | Query parsing | Image; Text | Accuracy, BLEU, Latency | Low |
| Live streaming | Image generation; Image-to-Image | Video codec; Video capture | Image | Latency | Low |
| Video services | Image compression; Video prediction | Video codec | Video | Accuracy, Latency | Low |
| Online gaming | 3D object reconstruction; Image generation; Image-to-Image | Rendering | Image | Latency | Low |
4.2 Representative AI Tasks
To cover a wide spectrum of AI tasks, we thoroughly analyze the end-to-end application scenarios shown in Table 1. In total, we identify sixteen representative AI tasks. We implement each AI task on both TensorFlow and PyTorch as the AI component benchmarks. Table 2 summarizes the sixteen component benchmarks in AIBench.
Classification. This task is to extract different thematic classes within the input data like an image or text file. It is a typical task in Internet services or other application domains, and is widely used in multiple scenarios, like category prediction and spam detection.
Image Generation. This task aims to provide an unsupervised learning problem to mimic the distribution of data and generate images. The typical scenario includes image resolution enhancement, which is used to generate high-resolution image.
Text-to-Text Translation. This task translates text from one language to another, and is one of the most important fields of computational linguistics. It can be used to translate a search query or a dialogue.
Image-to-Text. This task generates the description of an image automatically. It can be used to generate image captions or recognize optical characters.
Image-to-Image. This task converts an image from one representation to another. It can be used to synthesize images with different facial ages and simulate virtual makeup.
Speech Recognition. This task is to recognize and translate a spoken language into text. This task is beneficial for voice search and voice dialogue translation.
Face Embedding. This task is to transform a facial image into a vector in an embedding space. The typical scenarios are facial similarity analysis and face recognition.
3D Face Recognition. This task is to recognize 3D facial information from multiple images taken from different angles. This task mainly focuses on three-dimensional images, and is beneficial to the facial similarity and facial authentication scenarios.
Object Detection. This task is to detect the objects within an image. The typical scenarios include vertical search and video object detection.
Recommendation. This task is to provide recommendations. It is widely used for advertisement recommendation, community recommendation, and product recommendation.
Video Prediction. This task is to predict future video frames by predicting the transformations of previous frames. The typical scenarios are video compression and video encoding, for efficient video storage and transmission.
Image Compression. This task is to compress the images and reduce the redundancy . The task is important for Internet services in terms of reducing data storage overhead and improving data transmission efficiency.
3D Object Reconstruction. This task is to predict and reconstruct 3D objects . The typical scenarios are maps search, light field rendering, virtual reality, and online gaming.
Text Summarization. This task is to generate a text summary, which is important for search results preview, headline generation, and keyword discovery.
Spatial Transformer. This task is to perform spatial transformations. A typical scenario is spatially invariant image retrieval, so that an image can be retrieved even if it is extremely stretched.
Learning to Rank. This task is to learn the attributes of a searched content and rank the scores for the results, which is the key for a search engine service.
| No. | Component Benchmark | Algorithm | Data Set |
|---|---|---|---|
| DC-AI-C1 | Image classification | ResNet50 | ImageNet, Cifar |
| DC-AI-C2 | Image generation | WassersteinGAN | LSUN |
| DC-AI-C3 | Text-to-Text translation | Transformer | WMT English-German |
| DC-AI-C4 | Image-to-Text | Neural Image Caption Model | Microsoft COCO |
| DC-AI-C5 | Image-to-Image | CycleGAN | Cityscapes |
| DC-AI-C6 | Speech recognition | DeepSpeech2 | Librispeech |
| DC-AI-C7 | Face embedding | Facenet | LFW, VGGFace2 |
| DC-AI-C8 | 3D face recognition | 3D face models | 77,715 samples from 253 face IDs |
| DC-AI-C9 | Object detection | Faster R-CNN | Microsoft COCO |
| DC-AI-C10 | Recommendation | Neural collaborative filtering | MovieLens |
| DC-AI-C11 | Video prediction | Motion-focused predictive models | Robot pushing data set |
| DC-AI-C12 | Image compression | Recurrent neural network | ImageNet |
| DC-AI-C13 | 3D object reconstruction | Convolutional encoder-decoder network | ShapeNet data set |
| DC-AI-C14 | Text summarization | Sequence-to-sequence model | Gigaword data set |
| DC-AI-C15 | Spatial transformer | Spatial transformer networks | MNIST |
| DC-AI-C16 | Learning to rank | Ranking distillation | Gowalla |
The AI tasks concern both performance and quality targets. The primary metrics include samples processed per second, the wall-clock time to train a model to a target quality (time-to-quality), the wall-clock time to train for a specified number of epochs, quality-ensured throughput, and the energy consumed to train a model to a target quality (energy-to-quality).
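The time-to-quality metric can be sketched as a small function over a training log: the first wall-clock instant at which the model meets the quality target. The log below is made-up illustrative data, not an AIBench measurement.

```python
# Illustrative sketch of the time-to-quality metric: the wall-clock time
# at which a training run first reaches a target quality.
def time_to_quality(log, target):
    """log: list of (elapsed_seconds, quality) per epoch, in order.
    Returns the first elapsed time at which quality >= target, or None
    if the run never reaches the target."""
    for elapsed, quality in log:
        if quality >= target:
            return elapsed
    return None

# Hypothetical log: (seconds, top-1 accuracy) after each epoch.
log = [(600, 0.41), (1200, 0.58), (1800, 0.69), (2400, 0.745), (3000, 0.761)]

print(time_to_quality(log, 0.74))  # → 2400
print(time_to_quality(log, 0.90))  # → None (target never reached)
```

Energy-to-quality follows the same shape with cumulative joules in place of elapsed seconds.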
4.3 The AIBench Micro Benchmarks
After profiling the sixteen component benchmarks, we identify fourteen frequently appearing units of computation: Convolution, Fully connected, Relu, Sigmoid, Tanh, MaxPooling, AvgPooling, CosineNorm, BatchNorm, Dropout, Element-wise multiply, Softmax, Data arrangement, and Memcpy. We implement them as a set of micro benchmarks using TensorFlow and Pthreads.
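The shape of such a micro benchmark can be sketched as follows: time one unit of computation (here, Relu) in isolation, with a warm-up call and averaged iterations. This is a pure-Python stand-in for illustration; the actual suite uses TensorFlow and Pthreads implementations.

```python
# A minimal micro-benchmark harness in the spirit of the AIBench micro
# benchmarks: isolate one unit of computation and time it.
import time

def relu(xs):
    """Element-wise Relu: max(0, x) over a list of numbers."""
    return [x if x > 0.0 else 0.0 for x in xs]

def bench(fn, data, iters=50):
    """Return average seconds per call over `iters` timed iterations."""
    fn(data)  # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        fn(data)
    return (time.perf_counter() - start) / iters

# Synthetic input vector with mixed signs.
data = [((i * 37) % 100) - 50.0 for i in range(10000)]
print(f"relu: {bench(relu, data) * 1e6:.1f} us/iter")
```

The same harness applies unchanged to the other units of computation (Sigmoid, MaxPooling, Memcpy, and so on) by swapping the benchmarked function.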
4.4 The AIBench Framework
As shown in Fig. 2, the framework provides loosely coupled modules that can be easily configured. Currently, the AIBench framework includes data input, offline training, online inference, non-AI library, and deployment tool modules. On the basis of the AIBench framework, we can easily compose an end-to-end benchmark.
The data input module is responsible for feeding data into the other modules. It collects representative real-world data sets, not only from authoritative public websites but also from our industry partners after anonymization. The data schemas are designed to maintain the real-world data characteristics, so as to alleviate confidentiality concerns. Based on the data schemas, a series of data generators is further provided to support large-scale data generation, e.g., of user or product information. To cover a wide spectrum of data characteristics, we take into account diverse data types, e.g., structured, semi-structured, and unstructured, and different data sources, e.g., table, graph, text, image, audio, and video. Our framework integrates various open-source data storage systems, and supports large-scale data generation and deployment.
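A schema-driven generator of this kind can be sketched as follows: synthetic records keep the real-world schema (field names, types, value ranges) without exposing real data. The schema and field names here are illustrative assumptions, not the framework's actual schema.

```python
# Hypothetical sketch of a schema-driven data generator: each field maps
# to a small generator function, and records are produced in bulk.
import random

random.seed(7)

# Illustrative product schema; field names and ranges are made up.
PRODUCT_SCHEMA = {
    "product_id": lambda i: i,
    "category": lambda i: random.choice(["clothing", "electronics", "books"]),
    "price": lambda i: round(random.uniform(1.0, 500.0), 2),
    "popularity": lambda i: random.choice(["high", "medium", "low"]),
}

def generate_products(n):
    """Generate n synthetic product records following PRODUCT_SCHEMA."""
    return [{field: gen(i) for field, gen in PRODUCT_SCHEMA.items()}
            for i in range(n)]

products = generate_products(1000)
print(products[0])
```

Scaling the benchmark's data volume then reduces to changing `n` (and adding analogous schemas for users, queries, and logs).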
The offline training and online inference modules are provided to build an end-to-end benchmark. First, the offline training module chooses one or more component benchmarks, through specifying the required benchmark ID, input data, and execution parameters like batch size. Then the offline training module trains a model and provides the trained model to the online inference module. The online inference module loads the trained model onto the serving system, i.e., TensorFlow Serving. The non-AI library module provides the non-AI computations and database accesses, including query parsing, database operations, indexing, sort, crawler, hash, encryption, basic statistics, filter, video codec, video capture, and rendering. For a complex end-to-end application, the online inference, non-AI library, and offline training modules together constitute the overall critical path.
To be easily deployed on a large-scale cluster, the framework provides deployment tools that contain two automated deployment templates, using Ansible and Kubernetes, respectively. The Ansible templates support scalable deployment on physical or virtual machines, while the Kubernetes templates are used to deploy on a container cluster. A configuration file needs to be specified for installation and deployment, including module parameters, like the chosen benchmark ID and input data, and cluster parameters, like node, memory, and network information. Through the deployment tools, a user does not need to know how to install and run each individual module.
5 Building End-to-end Benchmarks
In this section, we illustrate how to build an end-to-end benchmark, and then discuss the guidelines.
5.1 The Design and Implementation of an E-commerce Search Intelligence
On the basis of the reusable framework, we implement the first end-to-end AI application benchmark: E-commerce search intelligence (in short, E-commerce). This benchmark models the complete use case of a realistic E-commerce search service, covering both text search and image search scenarios.
The E-commerce benchmark consists of four subsystems: online server, offline analyzer, query generator, and data storage, as shown in Fig. 3. Among them, online server receives the query requests and performs personalized searching and recommendation, integrating AI inference.
Offline analyzer chooses the appropriate AI component benchmarks and performs a training stage to generate a learning model. Also, offline analyzer is responsible for building data indexes to accelerate data access.
Query generator simulates concurrent users and sends query requests to online server based on a specific configuration. Note that a query item provides either text or an image, to reflect the different search habits of users. The configuration designates parameters like concurrency, query arrival rate, distribution, user think time, and the ratio of text items to image items. The configurations simulate different query characteristics and satisfy multiple generation strategies. We implement our query generator based on JMeter.
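The load model can be sketched with Poisson arrivals and a configurable text/image mix. The rate and ratio below are illustrative defaults, not the benchmark's fixed settings; the real generator is built on JMeter rather than this stand-in.

```python
# Sketch of the query generator's load model: exponential inter-arrival
# times (a Poisson process) at a given rate, with a text/image query mix.
import random

random.seed(1)

def generate_queries(n, rate_per_s=100.0, image_ratio=0.2):
    """Return a list of (arrival_time_s, kind) tuples for n queries."""
    t = 0.0
    queries = []
    for _ in range(n):
        t += random.expovariate(rate_per_s)  # inter-arrival gap
        kind = "image" if random.random() < image_ratio else "text"
        queries.append((t, kind))
    return queries

qs = generate_queries(10000)
image_share = sum(1 for _, k in qs if k == "image") / len(qs)
print(f"duration: {qs[-1][0]:.1f}s, image share: {image_share:.2f}")
```

Other generation strategies (bursty arrivals, think-time pauses) amount to swapping the inter-arrival distribution.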
The data storage module stores all kinds of data. The user database saves all the attributes of user information. The product database holds all the attributes of product information. The logs record the complete query histories. The text data contain the product description text and user comments. The image and video data vividly depict the appearance and usage of products. The audio data store the voice search and voice chat data. Overall, the data storage covers various data types, including structured, unstructured, and semi-structured data, and diverse data sources, including table, text, image, audio, and video.
To support scalable deployment on clusters of different scales, each module is scalable and can be deployed on multiple nodes. Also, a series of data generators is provided to generate E-commerce data of different scales by setting several parameters, e.g., the number of products and product attribute fields, and the number of users and user attribute fields.
Online server provides personalized searching and recommendations. Online server consists of four modules, including search planer, recommender, searcher, and ranker.
Search planer is the entrance of online server. It is responsible for receiving the query requests from query generator, sending the requests to the other modules, and receiving the returned results. We use the Spring Boot framework to implement search planer.
Recommender analyzes the query item and provides personalized recommendations, according to the user information obtained from the user database. It first conducts query spelling correction and query rewriting, and then predicts the category of the query item based on two classification models: FastText and ResNet50. FastText is used for text classification when a query item is text, and ResNet50 is used for image classification when a query item is an image. Using a deep neural network proposed by Alibaba, recommender then conducts an inference process with the offline-trained model to provide personalized recommendations. It returns two vectors: the probability vector of the predicted categories, and the user preference score vector over product attributes, such as the user preference for brand, color, etc. We use TensorFlow Serving to provide the text classification, image classification, and online recommendation services.
To guarantee scalability and service efficiency, searcher follows an industry-scale architecture. Searcher is deployed on several different clusters; three clusters are the default configuration. The clusters hold the inverted indexes of product information in memory to guarantee high concurrency and low latency. According to the click-through rate and purchase rate, the products are divided into three popularity categories (high, medium, and low), whose proportions of the data volume are 15%, 50%, and 50%, respectively. Note that the high popularity category is a subset of the medium popularity category. The indexes of products with different popularity are stored in different clusters. Given a search request, searcher searches these three clusters one by one until reaching a specific number of results. Generally, the cluster that holds low popularity products is rarely searched in a realistic scenario, so searcher adopts a different deployment strategy for each category. The cluster for high popularity contains more nodes and more backups to guarantee search efficiency, while the cluster for low popularity deploys the fewest nodes and backups. We use Elasticsearch to set up and manage searcher on the three clusters.
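The tiered lookup can be sketched as follows: query the high-, then medium-, then low-popularity clusters, stopping as soon as enough hits are collected, so the low-popularity tier is rarely touched. The in-memory index dicts and product ids below are made up for illustration; the real searcher runs on Elasticsearch.

```python
# Sketch of the searcher's tiered cluster lookup with early stopping.
def tiered_search(query, clusters, limit):
    """clusters: list of index dicts ordered high -> medium -> low
    popularity. Returns up to `limit` distinct product ids, stopping as
    soon as enough are found (later tiers may repeat earlier ids, since
    high popularity is a subset of medium popularity)."""
    hits = []
    for index in clusters:
        for product_id in index.get(query, []):
            if product_id not in hits:
                hits.append(product_id)
                if len(hits) == limit:
                    return hits
    return hits

# Toy per-tier inverted indexes: term -> product ids.
high   = {"shoes": [1, 2, 3]}
medium = {"shoes": [1, 2, 3, 4, 5], "coat": [9]}
low    = {"shoes": [1, 2, 3, 4, 5, 6, 7], "kite": [42]}

print(tiered_search("shoes", [high, medium, low], limit=4))  # → [1, 2, 3, 4]
```

A common "shoes" query is satisfied by the high and medium tiers alone; only a rare query like "kite" ever reaches the low-popularity tier, which is why that tier can be deployed with the fewest nodes.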
Offline analyzer is responsible for training models and building indexes to improve the online serving performance. It consists of three modules: AI offline trainer, job scheduler, and indexer.
AI offline trainer trains models using the data stored in the database. Offline trainer digests the features of the product data, e.g., text, image, audio, and video. To power the efficiency of online server, offline trainer chooses ten AI algorithms (component benchmarks) from the AIBench framework: classification for category prediction, recommendation for personalized recommendation, learning to rank for result scoring and ranking, image-to-text for image captioning, image-to-image and image generation for image resolution enhancement, face embedding for face detection within an image, spatial transformer for image rotation and resizing, object detection for detecting objects in video data, and speech recognition for audio data recognition.
Job scheduler provides two training mechanisms: batch processing and streaming-like processing. In a realistic scenario, some models need to be updated frequently. For example, when a user searches for an item and clicks one of the products shown on the first page, the application immediately trains a new model based on the product the user just clicked and shows new recommendations on the second page. Our benchmark implementation considers this situation and adopts a streaming-like approach that updates the models every few seconds. For batch processing, the trainer updates the models every few hours.
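The two update mechanisms can be sketched as a simple scheduling rule. The function and interval values below are hypothetical illustrations, not AIBench's scheduler API.

```python
# Minimal sketch of the job scheduler's two training mechanisms:
# streaming-like updates every few seconds vs. batch updates every few hours.

STREAMING_INTERVAL_S = 5         # hypothetical: update every 5 seconds
BATCH_INTERVAL_S = 4 * 3600      # hypothetical: update every 4 hours

def next_update_time(last_update_s, mode):
    """Return the timestamp (seconds) of the next scheduled model update."""
    if mode == "streaming":
        return last_update_s + STREAMING_INTERVAL_S
    if mode == "batch":
        return last_update_s + BATCH_INTERVAL_S
    raise ValueError(f"unknown training mode: {mode}")
```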
Indexer builds indexes for product information. It provides three kinds of indexes: inverted indexes with a few fields of products for searching, forward indexes with a few fields for ranking, and forward indexes with the majority of fields for summary generation.
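The field layout of the three index kinds can be illustrated as follows. The field names are hypothetical; a real deployment would express this as Elasticsearch mappings.

```python
# Illustrative field subsets for the indexer's three kinds of indexes
# (hypothetical field names, not AIBench's actual schema).

INDEX_SCHEMAS = {
    # inverted index, a few fields, serves searching
    "search_inverted": ["product_id", "title", "category"],
    # forward index, a few fields, serves ranking
    "rank_forward": ["product_id", "click_rate", "purchase_rate"],
    # forward index, most fields, serves summary generation
    "summary_forward": ["product_id", "title", "category", "price",
                        "description", "image_url", "seller", "stock"],
}

def fields_for(purpose):
    """Look up which product fields an index of the given purpose carries."""
    return INDEX_SCHEMAS[purpose]
```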
We are implementing the other end-to-end benchmarks listed in Table 1, following these guidelines.
(1) Determine the essential AI and non-AI component benchmarks.
(2) For each component benchmark, find the valid input data and the data input module.
(3) Determine the valid permutation of AI and non-AI components.
(4) Specify the module-related configurations, i.e., benchmark ID, input data, execution parameters, and non-AI library, and the cluster-related configurations, i.e., node, memory, and network information.
(5) Specify the deployment strategy and write the scripts for the automated deployment tool.
(6) Train the AI models of the selected AI component benchmarks using the offline training module, and transfer the trained models to the online inference module.
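The configurations named in step (4) can be sketched as a structured description; the keys and values below are hypothetical illustrations, not AIBench's actual configuration format.

```python
# Hedged sketch of a module-related and cluster-related configuration,
# as distinguished in guideline step (4). All names and values are
# hypothetical examples.

benchmark_config = {
    "module": {
        "benchmark_id": "recommendation",        # which component benchmark
        "input_data": "/data/movielens",         # hypothetical data path
        "execution_parameters": {"batch_size": 256, "epochs": 10},
        "non_ai_library": "elasticsearch",
    },
    "cluster": {
        "nodes": 3,
        "memory_gb": 32,
        "network": "1GbE",
    },
}
```

An automated deployment tool (step 5) would consume such a description to place and launch each module.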
This section summarizes our evaluation using the AIBench end-to-end, component, and micro benchmarks. In Section 6.2, we explain why end-to-end benchmarking is necessary for both online server and offline trainer, and present several insights that cannot be obtained using MLPerf and TailBench. In Section 6.3, we characterize the diverse and distinct computation and memory patterns of the sixteen AI tasks, emphasizing the necessity of including diverse AI tasks for benchmarking, which MLPerf also ignores. In Section 6.4, we drill down to the hotspot functions and analyze their execution stalls.
6.1 Experiment Setup
We perform experiments on a 16-node CPU cluster and a 4-node GPU cluster. All nodes are connected with a 1 Gb Ethernet network. Each CPU node is equipped with two Xeon E5645 processors and 32 GB of memory; each processor contains six physical out-of-order cores, and Hyper-Threading is disabled. Each node runs CentOS Linux 6.9 with Linux kernel 3.11.10. The software versions are JDK 1.8.0, Python 3.6.8, and GCC 5.4. We perform offline training on four Nvidia Titan XP GPUs; each Titan XP has 3840 CUDA cores and 12 GB of memory.
Performance Data Collection
We use the network time protocol (NTP) to synchronize the cluster-wide clocks. We use the profiling tool Perf to collect CPU micro-architectural data through the hardware performance monitoring counters (PMCs). For GPU profiling, we use the Nvidia profiling toolkit nvprof to track GPU performance. To profile accuracy-ensured performance, we first adjust the parameters, e.g., batch size, to achieve the state-of-the-art quality target of each model on a given dataset, and then sample 1,000 epochs using the same parameter settings. For GAN-based models, whose accuracy is hard to measure, we set the parameters according to the referenced paper and reproduce its results. We run each benchmark three times and report the average numbers.
6.2 The Necessity of End-to-end Benchmarking
End-to-end Benchmarking is Necessary for Online Server
We deploy online server on the 16-node CPU cluster. Online server contains one query generator node (Jmeter 5.1.1), one search planer node (SpringBoot 2.1.3), two recommender nodes (TensorFlow Serving 1.14.0), nine searcher nodes (Elasticsearch 6.5.2), one ranker node (TensorFlow Serving 1.14.0), and two nodes for data storage (Neo4j 3.5.8 for the user database, Elasticsearch 6.5.2 for the product database).
The product database contains a hundred thousand products, each with 32 attribute fields. Query generator simulates 1000 users with a 30-second warm-up time. The users send query requests continuously, separated by think time intervals that follow a Poisson distribution. Note that the proportions of text queries and image queries are 90% and 10%, respectively. In total, we collect the performance numbers until 20,000 query requests have finished. We train each AI task to achieve the quality target of the referenced paper.
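The query generator's behavior can be sketched as below. This is an assumed sketch, not the actual JMeter configuration: we model Poisson request arrivals via exponentially distributed think times, with a hypothetical mean think time of one second, and a 90%/10% split between text and image queries.

```python
# Illustrative sketch of the query generator: exponentially distributed think
# times (Poisson arrivals) and a 90% text / 10% image query mix.
import random

random.seed(42)

MEAN_THINK_TIME_S = 1.0  # hypothetical mean think time

def next_query():
    """Draw the next user's think time and query type."""
    think_time = random.expovariate(1.0 / MEAN_THINK_TIME_S)
    kind = "text" if random.random() < 0.9 else "image"
    return think_time, kind

queries = [next_query() for _ in range(20_000)]
text_fraction = sum(1 for _, k in queries if k == "text") / len(queries)
```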
The latency is an important metric to evaluate service quality. Fig. 4(a) reports the average, 90th percentile, and 99th percentile latency of the online server subsystem.
We further perform the latency breakdown of each module to identify the critical paths, including the recommender, searcher, search planer, and ranker modules, as shown in Fig. 4(b). The latency of search planer is negligible, so we do not report it in Fig. 4(b). We find that recommender occupies the largest proportion of the latency: 48 milliseconds on average, 60 milliseconds at the 90th percentile, and 317 milliseconds at the 99th percentile. In comparison, the latencies of searcher and ranker are each within 5 milliseconds. Although recommender and ranker both contain AI-related components, they incur significantly different latencies.
Furthermore, Fig. 4(c) drills down the latency breakdown of the recommender module to the component level, including query_parsing, user_DB_access, image_classifier, text_classifier, and recommendation. We find that user_DB_access (a non-AI component) and recommendation (an AI component) are the top two components that impact the latency. In particular, the average latency of the recommendation component takes up 60% of the average latency of the recommender module and 13% of the total end-to-end latency of the online server subsystem. The 99th percentile latency of the recommendation component is 289 milliseconds, while the numbers for the recommender module and the whole subsystem are 317 milliseconds and 1419 milliseconds, respectively. The reasons that the end-to-end tail latency deteriorates by dozens or even hundreds of times with respect to a single component are: 1) a single component may not be on the critical path; 2) even when an AI component like recommendation is on the critical path, there are cascading interaction effects with the other AI and non-AI components.
We also analyze the execution time ratio of the AI components versus the non-AI components in online server. Excluding the data preprocessing and communication latency, the time spent on the AI components and the non-AI components is 38 and 17 milliseconds of the average latency, respectively, which indicates that the AI components are an essential part of the critical path of an industry-scale end-to-end benchmark like the E-commerce benchmark.
Can a Statistical Model Predict the End-to-end Tail Latency? Since an end-to-end benchmark is complex to use in hardware or software evaluation, an intuitive question is whether a statistical model can predict the end-to-end tail latency. The answer is no.
The state-of-the-art work uses the M/M/1 and M/M/K queuing models to calculate the p-th percentile latency. We repeat their work and choose the M/M/1 model to predict the latency, as we deploy only one instance of online server. In the M/M/1 model, the p-th percentile latency (W_p) and the average latency (W) can be calculated using the following formulas: W_p = ln(100/(100 - p)) / (μ - λ), W = 1/(μ - λ). μ is the service rate (service times follow the exponential distribution), and λ is the arrival rate (arrivals follow the Poisson distribution).
We measure the service rate μ to be 20 requests per second through the experiments. Then we set the arrival rate λ to 1.0 requests per second (10 simulated users), 9.1 requests per second (100 simulated users), and 16.7 requests per second (200 simulated users), respectively. For these settings, the theoretical average latencies are 53 ms, 91 ms, and 303 ms, while the actual numbers are 123 ms, 459 ms, and 852 ms, respectively; the average gap is 3.4 times. The theoretical 99th percentile latencies are 242 ms, 422 ms, and 1394 ms, while the actual numbers are 953 ms, 5008 ms, and 11980 ms, respectively; the average gap is 8.1 times.
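The theoretical numbers follow directly from the M/M/1 formulas; a quick sketch (not part of AIBench) recomputes them, with small differences from the quoted values coming only from rounding.

```python
# Recomputing the theoretical M/M/1 latencies using
# W = 1/(mu - lam) and W_p = ln(100/(100 - p)) / (mu - lam).
import math

MU = 20.0  # measured service rate, requests per second

def avg_latency_ms(lam):
    """Theoretical M/M/1 average latency in milliseconds."""
    return 1000.0 / (MU - lam)

def percentile_latency_ms(lam, p=99):
    """Theoretical M/M/1 p-th percentile latency in milliseconds."""
    return math.log(100.0 / (100.0 - p)) * 1000.0 / (MU - lam)
```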
The main reason for this huge gap is as follows. Executing an end-to-end benchmark is complex and uncertain, and the service rate does not follow an exponential distribution, so the M/M/1 model is far from the realistic situation. More general models (such as the G/G/1 model), however, are difficult to use for calculating the tail latency. Furthermore, characterizing the permutations of dozens of components executed in an end-to-end benchmark would require an even more sophisticated analytical model, such as a queuing network model, for which calculating the tail latency is infeasible.
Tradeoff among Service Quality, Model Accuracy, and Model Complexity. The online inference module needs to load the trained model and conduct a forward computation to obtain the result. Usually, increasing the depth of a neural network model may improve the model accuracy, but it leads to a larger model size and a longer inference time. For comparison, we replace ResNet50 with ResNet152 in image_classifier. The model accuracy improves by 1.5%, while the end-to-end 99th percentile latency deteriorates by 9.7X. Hence, Internet service architects must trade off service quality, model complexity, and model accuracy.
Tradeoff among Model Update Interval, Accuracy Improvement, and Training Overhead Using Offline Trainer
Updating AI models in real time is a significant domain-specific concern in many scenarios. We evaluate the real-time model update efficiency using offline trainer, which we deploy on four Titan XP GPUs.
We adopt an incremental learning method to update the models for online inference, and explore the relationship between the model update interval, training time overhead, and accuracy improvement. Our experiments show that, compared to the original training time and accuracy, 35% additional training time brings a 1.9% accuracy improvement for image_classifier, and 10% additional training time brings a 0.3% accuracy improvement for ranker.
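One simple way to compare these tradeoffs is accuracy gain per unit of extra training time; the "score" below is our illustrative metric, not one defined by AIBench, applied to the two measurements above.

```python
# Hedged sketch: accuracy improvement gained per percent of additional
# training time, using the numbers reported above.

def gain_per_overhead(extra_time_pct, accuracy_gain_pct):
    """Accuracy gain (percentage points) per percent of extra training time."""
    return accuracy_gain_pct / extra_time_pct

image_classifier_score = gain_per_overhead(35.0, 1.9)
ranker_score = gain_per_overhead(10.0, 0.3)
```

By this metric, image_classifier benefits more from frequent updates than ranker does, which is the kind of comparison an architect could use to pick an update interval.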
Thus, offline training is an integrated part of end-to-end benchmarking. It not only facilitates measuring the model update efficiency, but also provides a guidance on how to choose an optimal update interval to balance the tradeoff between training overhead and accuracy improvement.
6.3 Why Does the Diversity of AI Tasks Matter for Benchmarking?
We characterize distinct computation and memory patterns of the diverse AI tasks, emphasizing the necessity of including diverse AI tasks for benchmarking.
We characterize the sixteen component benchmarks of AIBench. The AIBench component benchmarks are deployed on the Titan XP GPUs, and we focus on single-GPU performance. The CUDA and Nvidia driver versions are 10.0 and 410.78, respectively.
We evaluate the PyTorch implementations, version 1.1.0. The data sets are as follows: ImageNet (137 GB) for image classification and image compression; LSUN (42.8 GB) for image generation; VGGFace2 (36 GB) for face embedding; Microsoft COCO (13 GB) for image-to-text and object detection; MNIST (9.5 MB) for spatial transformer; Cityscapes (267 MB) for image-to-image; MovieLens (190 MB) for recommendation; Librispeech (59.3 GB) for speech recognition; Gowalla (107 MB) for learning to rank; WMT English-German (1.2 MB) for text-to-text translation; the robot pushing data set (137 GB) for video prediction; the ShapeNet data set (6.8 GB) for 3D object reconstruction; the Gigaword data set (277 MB) for text summarization; and 3D face data (37 GB) for 3D face recognition.
A GPU contains multiple streaming multiprocessors (SMs), each of which has a certain number of CUDA cores, memory registers, memory caches, warp schedulers, etc. To characterize the AIBench component benchmarks from the perspective of computation and memory access patterns, we choose five micro-architectural metrics: achieved_occupancy, ipc_efficiency, gld_efficiency, gst_efficiency, and dram_utilization. Achieved_occupancy is the ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor. Ipc_efficiency is the ratio of the executed instructions per cycle to the theoretical peak. Gld_efficiency is the ratio of the requested global memory load throughput to the required global memory load throughput. Gst_efficiency is the ratio of the requested global memory store throughput to the required global memory store throughput. Dram_utilization is the utilization level of the device memory relative to the peak utilization.
Fig. 5 presents the computation and memory characteristics of the sixteen AI benchmarks. We find that they have distinct computation and memory patterns not only across different scenarios, e.g., processing text, image, audio, and video, but also across different tasks within the same scenario, e.g., image classification and image generation. Thus, diverse AI tasks reflecting different computation and memory access patterns should be included in AI benchmarks. Achieving a state-of-the-art quality target for each AI task incurs heavy training overhead; however, that does not justify including only a few benchmarks.
6.4 Drill Down to Function-level Code
Following the experiments in Section 6.3, we drill down to the hotspot functions and analyze their runtime breakdown and execution stalls for code optimization.
The overall execution performance of these component benchmarks varies in terms of IPC, which measures the instructions executed per cycle. From Fig. 5, we find that the IPC efficiency ranges from 0.25 (learning_to_rank) to 0.77 (text_to_text translation). Some benchmarks, like learning_to_rank, have extremely low IPC compared to the other benchmarks. To discover the factors that greatly impact performance, we first conduct a runtime breakdown analysis and decompose the benchmarks into hotspot kernels or functions; then we examine the GPU execution efficiency in terms of the percentages of different stalls.
We use nvprof to trace the runtime breakdown and find the hotspot functions that occupy more than 80% of runtime in total.
Since each run involves dozens of function calls, we single out the functions that occupy large proportions of runtime and classify them into several categories of kernels according to their computation logic.
Through these statistics, we find that the most time-consuming functions across all component benchmarks have much in common, and they fall into eight categories of kernels, which are a subset of the AIBench micro benchmarks: data arrangement, convolution, general matrix multiply (gemm), batch normalization, element-wise operation, relu activation, pooling, and memory copy (memcpy).
|Micro Benchmark|Function Name|
|---|---|
|Element-wise|element-wise add kernel; element-wise threshold kernel; element-wise mul kernel|
|Memcpy|CUDA memcpy HtoD; CUDA memcpy DtoD|
Focusing on the above eight most time-consuming micro benchmarks, we evaluate the following stalls of these kernels. Instruction fetch stall (Inst_fetch) is the percentage of stalls because the next assembly instruction has not yet been fetched; execution dependency stall (Exe_depend) is the percentage of stalls because an input required by the instruction is not yet available; memory dependency stall (Mem_depend) is the percentage of stalls because a memory operation cannot be performed due to the required resources being unavailable or fully utilized; texture stall (Texture) is the percentage of stalls because the texture sub-system is fully utilized; synchronization stall (Sync) is the percentage of stalls due to a syncthreads call; constant memory dependency stall (Const_mem_depend) is the percentage of stalls caused by an immediate constant cache miss; pipe busy stall (Pipe_busy) is the percentage of stalls because a compute operation cannot be performed while the compute pipeline is busy; memory throttle stall (Mem_throttle) is the percentage of stalls due to large pending memory operations.
The breakdown of the eight stalls of the hotspot functions is shown in Fig. 7. The top two GPU execution stalls are memory dependency stalls and execution dependency stalls. For example, for the Element-Wise benchmark, memory dependency stalls occupy a very large proportion (70%), resulting in a low IPC of about 0.86 on average. Memory dependency stalls may occur due to high cache misses, which make the load/store resources unavailable. Possible optimization strategies include improving data alignment, data locality, and data access patterns. Execution dependency stalls may occur due to low instruction-level parallelism; exploiting ILP may alleviate these stalls to a certain degree.
7 Related Work
State-of-the-art and state-of-the-practice AI or Internet service benchmarks provide only a few micro or component benchmarks, as shown in Table 4. None of them distill the representative and essential AI or non-AI components, let alone the permutations of different AI and non-AI components, in characterizing industry-scale AI and Internet service applications.
Table 4 compares the suites in terms of an extensible benchmark framework, end-to-end application benchmarks, coverage of AI tasks (e.g., speech recognition, 3D face recognition, 3D object reconstruction, text summarization, and learning to rank), and real-world data sets and software stacks.
MLPerf is an ML benchmark suite targeting six AI tasks: image classification, object detection, speech recognition, translation, recommendation, and reinforcement learning. It provides both light-weight and heavy-weight implementations. In total, it includes seven benchmarks for training and five benchmarks for inference. The MLPerf training benchmark proposes a series of benchmarking rules to eliminate the side effects of the stochastic nature of AI.
DAWNBench is a benchmark and competition focusing on end-to-end performance, i.e., the training or inference time to achieve a state-of-the-art accuracy. It focuses on only two component benchmarks: image classification on CIFAR10 and ImageNet, and question answering on SQuAD.
Fathom provides eight deep learning component benchmarks implemented with TensorFlow. Three of the eight benchmarks use different models for the image classification task. The Autoenc workload provides a variational autoencoder and can be used to reduce dimensionality and compress images.
TBD Suite is a benchmark suite for DNN training. It provides eight neural network models that cover six AI tasks. TailBench is a benchmark suite consisting of eight latency-critical workloads.
DeepBench consists of four operations involved in training deep neural networks: three basic operations and recurrent layer operations. DNNMark is a GPU benchmark suite consisting of a collection of deep neural network primitives. Both DeepBench and DNNMark ignore quality targets in benchmarking.
Additionally, for machine learning and deep learning evaluation, MLModelScope  proposes a specification for repeatable model evaluation and a runtime to measure experiments.
AIBench differs from the other benchmark suites in two significant ways. First, we propose the permutations of essential AI and non-AI components as end-to-end benchmarks, and provide a reusable framework to speed up building them. Second, we consider end-to-end benchmarks, component benchmarks, and micro benchmarks as three integrated parts. As a marked departure from the past, AIBench lets software and hardware designers learn overall system information (end-to-end benchmarks), provides diverse computation and memory access patterns (component benchmarks) as design inputs for micro-architectural researchers, and drills down to hotspot functions (micro benchmarks) for code developers.
8 Conclusion
This paper proposes an agile domain-specific benchmarking methodology that speeds up software and hardware co-design. Together with seventeen industry partners, we identify ten end-to-end application scenarios and distill sixteen representative AI tasks and fourteen time-consuming units of computation. We propose the permutations of essential AI and non-AI tasks as end-to-end benchmarks to characterize industry-scale applications. We design and implement a reusable framework to facilitate agile end-to-end benchmark building, and build the first end-to-end benchmark modeling E-commerce search intelligence. Our evaluation shows that the end-to-end benchmark, integrating both online serving and offline training, provides overall system performance for hardware and software designers; the component benchmarks reflect diverse computation and memory access patterns, essential for micro-architectural researchers; and the micro benchmarks represent hotspot functions, beneficial for code optimization.
- Compared with the real numbers from our industry partner, this number is quite high; they have taken many measures to decrease the overall latency.
- Relu activation is an element-wise operation; here we use a separate Relu category considering its large proportion and its diverse CUDA functions.
- “Deepbench,” https://svail.github.io/DeepBench/.
- “Mlperf,” https://mlperf.org.
- “Mlperf website,” https://www.mlperf.org.
- “Nginx website,” http://nginx.org/.
- “Nvidia profiling toolkit,” https://docs.nvidia.com/cuda/profiler-users-guide/index.html.
- “Pytorch,” http://pytorch.org.
- “Spark,” https://spark.apache.org/.
- “Speccpu 2017,” https://www.spec.org/cpu2017/.
- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
- R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks, “Fathom: reference workloads for modern deep learning methods,” in Workload Characterization (IISWC). IEEE, 2016, pp. 1–10.
- D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, J. Chen, J. Chen, Z. Chen, M. Chrzanowski, A. Coates, G. Diamos, K. Ding, N. Du, E. Elsen, J. Engel, W. Fang, L. Fan, C. Fougner, L. Gao, C. Gong, A. Hannun, T. Han, L. V. Johannes, B. Jiang, C. Ju, B. Jun, P. LeGresley, L. Lin, J. Liu, Y. Liu, W. Li, X. Li, D. Ma, S. Narang, A. Ng, S. Ozair, Y. Peng, R. Prenger, S. Qian, Z. Quan, J. Raiman, V. Rao, S. Satheesh, D. Seetapun, S. Sengupta, K. Srinet, A. Sriram, H. Tang, L. Tang, C. Wang, J. Wang, K. Wang, Y. Wang, Z. Wang, Z. Wang, S. Wu, L. Wei, B. Xiao, W. Xie, Y. Xie, D. Yogatama, B. Yuan, J. Zhan, and Z. Zhu, “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International conference on machine learning, 2016, pp. 173–182.
- M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
- G. Ayers, J. H. Ahn, C. Kozyrakis, and P. Ranganathan, “Memory hierarchy for web search,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 643–656.
- D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, “The nas parallel benchmarks,” The International Journal of Supercomputing Applications, vol. 5, no. 3, pp. 63–73, 1991.
- L. A. Barroso and U. Hölzle, “The datacenter as a computer: An introduction to the design of warehouse-scale machines,” Synthesis Lectures on Computer Architecture, vol. 4, no. 1, pp. 1–108, 2009.
- Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 67–74.
- A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “Shapenet: An information-rich 3d model repository,” arXiv preprint arXiv:1512.03012, 2015.
- E. Cho, S. A. Myers, and J. Leskovec, “Friendship and mobility: user movement in location-based social networks,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011, pp. 1082–1090.
- C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, “Dawnbench: An end-to-end deep learning benchmark and competition,” Training, vol. 100, no. 101, p. 102, 2017.
- M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
- A. Dakkak, C. Li, J. Xiong, and W.-m. Hwu, “Frustrated with replicating claims of a shared model? a solution,” arXiv preprint arXiv:1811.09737, 2019.
- A. C. De Melo, “The new linux perf tools,” in Slides from Linux Kongress, vol. 18, 2010.
- C. Delimitrou and C. Kozyrakis, “Amdahl’s law for tail latency,” Communications of the ACM, vol. 61, no. 8, pp. 65–72, 2018.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
- S. Dong and D. Kaeli, “Dnnmark: A deep neural network benchmark suite for gpus,” in Proceedings of the General Purpose GPUs. ACM, 2017, pp. 63–72.
- C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in Advances in neural information processing systems, 2016, pp. 64–72.
- W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, F. Tang, B. Xie, C. Zheng, X. Wen, X. He, H. Ye, and R. Ren, “Data motifs: A lens towards fully understanding big data and ai workloads,” Parallel Architectures and Compilation Techniques (PACT), 2018 27th International Conference on, 2018.
- W. Gao, J. Zhan, L. Wang, C. Luo, D. Zheng, X. Wen, R. Ren, C. Zheng, X. He, H. Ye, H. Tang, Z. Cao, S. Zhang, and J. Dai, “Bigdatabench: A scalable and unified big data and ai benchmark suite,” arXiv preprint arXiv:1802.08254, 2018.
- C. Gormley and Z. Tong, Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine. O'Reilly Media, Inc., 2015.
- F. M. Harper and J. A. Konstan, “The movielens datasets: History and context,” ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, p. 19, 2016.
- K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang, “Applied machine learning at facebook: A datacenter infrastructure perspective,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 620–629.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural collaborative filtering,” in Proceedings of the 26th international conference on world wide web. International World Wide Web Conferences Steering Committee, 2017, pp. 173–182.
- G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database forstudying face recognition in unconstrained environments,” in Workshop on faces in’Real-Life’Images: detection, alignment, and recognition, 2008.
- M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
- "Apache jmeter," http://jmeter.apache.org, 2017.
- A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov, “Fasttext.zip: Compressing text classification models,” arXiv preprint arXiv:1612.03651, 2016.
- N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. Mackean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 2017, pp. 1–12.
- H. Kasture and D. Sanchez, “Tailbench: a benchmark suite and evaluation methodology for latency-critical applications,” in 2016 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2016, pp. 1–10.
- A. Krizhevsky, V. Nair, and G. Hinton, "The cifar-10 dataset," online: http://www.cs.toronto.edu/kriz/cifar.html, vol. 55, 2014.
- Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
- Y. LeCun, C. Cortes, and C. Burges, "Mnist handwritten digit database," AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, p. 18, 2010.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
- P. Mattson, C. Cheng, C. Coleman, G. Diamos, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei, P. Bailis, V. Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock, X. Huang, B. Jia, D. Kang, D. Kanter, N. Kumar, J. Liao, G. Ma, D. Narayanan, T. Oguntebi, G. Pekhimenko, L. Pentecost, V. J. Reddi, T. Robie, T. St. John, C.-J. Wu, L. Xu, C. Young, and M. Zaharia, “Mlperf training benchmark,” arXiv preprint arXiv:1910.01500, 2019.
- D. L. Mills, “Network time protocol (ntp),” Network, 1985.
- Z. Ming, C. Luo, W. Gao, R. Han, Q. Yang, L. Wang, and J. Zhan, “Bdgs: A scalable big data generator suite in big data benchmarking,” arXiv preprint arXiv:1401.5465, 2014.
- R. Nallapati, B. Zhou, C. Gulcehre, and B. Xiang, “Abstractive text summarization using sequence-to-sequence rnns and beyond,” arXiv preprint arXiv:1602.06023, 2016.
- Y. Ni, D. Ou, S. Liu, X. Li, W. Ou, A. Zeng, and L. Si, “Perceive your users in depth: Learning universal user representations from multiple e-commerce tasks,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 596–605.
- C. Olston, N. Fiedel, K. Gorovoy, J. Harmsen, L. Lao, F. Li, V. Rajashekhar, S. Ramesh, and J. Soyke, “Tensorflow-serving: Flexible, high-performance ml serving,” arXiv preprint arXiv:1712.06139, 2017.
- V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
- S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
- A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summarization,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
- F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
- B. Smith and G. Linden, “Two decades of recommender systems at Amazon.com,” IEEE Internet Computing, vol. 21, no. 3, pp. 12–18, 2017.
- J. Tang and K. Wang, “Ranking distillation: Learning compact ranking models with high performance for recommender system,” in ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
- G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5306–5314.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
- R.-L. Vieriu, S. Tulyakov, S. Semeniuta, E. Sangineto, and N. Sebe, “Facial expression recognition under a wide range of head poses,” in 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), vol. 1. IEEE, 2015, pp. 1–7.
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 652–663, 2017.
- P. Webb, D. Syer, J. Long, S. Nicoll, R. Winch, A. Wilkinson, M. Overdijk, C. Dupuis, and S. Deleuze, “Spring Boot reference guide,” Part IV. Spring Boot features, vol. 24, 2013.
- X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, “Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision,” in Advances in Neural Information Processing Systems, 2016, pp. 1696–1704.
- F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.
- J. Zhan, L. Wang, W. Gao, and R. Ren, “BenchCouncil’s view on benchmarking AI and other emerging workloads,” Technical Report, 2019.
- H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Phanishayee, B. Schroeder, and G. Pekhimenko, “TBD: Benchmarking and analyzing deep neural network training,” arXiv preprint arXiv:1803.06905, 2018.
- J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.