Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications
Abstract
The application of deep learning techniques has resulted in remarkable improvements in machine learning models. This paper provides detailed characterizations of deep learning models used in many Facebook social network services. We present computational characteristics of our models, describe high-performance optimizations targeting existing systems, point out their limitations, and make suggestions for future general-purpose/accelerated inference hardware. We also highlight the need for better co-design of algorithms, numerics and computing platforms to address the challenges of workloads often run in data centers.
1 Introduction
Machine learning (ML), deep learning (DL) in particular, is used across many social network services. The high quality visual, speech, and language DL models must scale to billions of users of Facebook’s social network services [25].
The power consumption in data centers used to run these models has been rapidly increasing over time (footnote 1: the collective power consumption of data centers around the world would be ranked 4th, behind only China, the US and the EU [4]). A significant fraction of the future demand is expected to come from workloads corresponding to DL inference, as shown in Figure 1. The higher DL inference demand is due to the expanding range of corresponding applications and the steady improvement in the quality of DL models, which is often associated with an increase in compute and memory requirements [2].
In order to tackle this trend, a lot of research has been done on optimizing computing platforms for DL, including but not limited to [33, 18, 49, 1, 61, 50, 62, 25]. However, a great challenge has been the fast pace of changes in DL applications. For instance, the previously relevant AlexNet [40] is no longer representative of the computation characteristics of today’s computer vision (CV) systems. The rate of change in DL models is so fast that hardware optimized for old models can easily become inefficient for new models.
In order to provide flexibility, availability and low latency, many inference workloads still run on CPU servers. Even though we had direct access to the DL models and our optimizations so far were mostly targeted at these general-purpose processors, we also had difficulty keeping up with the rapid pace of innovation. However, our characterization suggests the following general needs from new DL hardware designs:

High memory bandwidth and capacity for embeddings

Support for powerful matrix and vector engines

Large on-chip memory for inference with small batches

Support for half-precision floating-point computation
These needs result from the characteristics of DL models that are important to us now (and are projected to be in the future), and from our experience in optimizing DL applications for current computing platforms and the limitations we encountered in doing so. In particular, we highlight a gap in characteristics between the models commonly studied by the systems community and the ones running in our data centers, with implications for future processor design.
2 Characterization of DL Inference
This section highlights characteristics of DL inference workloads that are of interest in our data centers. Section 2.1 describes DL models used in our social network services and discusses trends observed in their evolution over time. Section 2.2 presents detailed characteristics, focusing on aspects related to processor architecture design, and Section 2.3 details their common computational kernels.
2.1 Representative Models
We divide inference workloads into three categories. The first provides personalized feed, ranking or recommendations, based on previous user interactions. The second and third are used for content understanding, visual and natural language content, respectively. The latter infer information used for powering recommendations, integrity and security such as detecting objectionable content.
2.1.1 Ranking and Recommendation
Recommendation systems are one of the most common DL workloads in data centers with many applications like ads, feed, and search. Recommendation is usually formulated as an event-probability prediction problem, where an ML model predicts the probability of one or multiple events at the same time. The items associated with the most likely events are ranked higher and shown to the user [28].
Without going into a comprehensive scientific literature review, we point out that over time the ML models and recommendation systems have evolved to incorporate neural networks (NNs). The latter have progressed from matrix and tensor-based factorizations [19, 37] to autoencoder and neural collaborative filtering [27, 41, 59]. Further advances led to the development of more complex models, such as wide and deep as well as deep cross neural networks, which have been successfully applied in practice [13, 26, 70, 76].
These models usually use a combination of signals from dense and sparse features. The former are represented as a vector of real values, while the latter are often represented as indices of a one-hot encoded vector in a high-dimensional space. The sparse features are processed with embedding lookups that project sparse indices to a lower-dimensional space. As in Figure 2, the resulting embeddings are combined with the dense features to produce higher-order interactions, for example using a set of fully connected layers (FCs) or parameterless additive and multiplicative mixing [55].
The embedding tables can easily contain billions of parameters, while FCs usually have a modest number of parameters. The size of these models is often bound by the memory of the system at hand and can easily require a memory capacity exceeding tens of GBs.
These models often have to predict event-probabilities for multiple ad or feed candidates for a single user, usually within a time constraint of 100s of ms. These properties allow us to leverage batching to achieve high performance in FCs. However, the overall model’s execution tends to be memory bandwidth bound and is dominated by the embedding lookups. These lookups perform a large number of mostly random accesses across table columns, but read an entire column vector for each such random access. For more details, refer to the SparseLengthsSum operator in Caffe2.
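The pooled embedding lookup described above can be illustrated with a minimal NumPy sketch. It is only modeled on the semantics of a SparseLengthsSum-style operator and is not Caffe2's implementation; the function name, the one-vector-per-row table layout and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def sparse_lengths_sum(table, indices, lengths):
    """Sum-pool embedding vectors per segment (illustrative sketch).

    table   : (num_entries, dim) array, one embedding vector per row (assumed layout)
    indices : flat list of entry ids for all segments, concatenated
    lengths : number of ids belonging to each segment
    """
    table = np.asarray(table)
    indices = np.asarray(indices, dtype=np.int64)
    lengths = np.asarray(lengths, dtype=np.int64)
    if lengths.sum() != indices.size:
        raise ValueError("sum(lengths) must equal len(indices)")
    out = np.zeros((lengths.size, table.shape[1]), dtype=table.dtype)
    start = 0
    for seg, n in enumerate(lengths):
        # Each id triggers a mostly random access that reads one whole vector.
        out[seg] = table[indices[start:start + n]].sum(axis=0)
        start += n
    return out

# Toy table with 6 entries of dimension 2; real tables are far larger.
table = np.arange(12, dtype=np.float32).reshape(6, 2)
pooled = sparse_lengths_sum(table, indices=[0, 5, 2, 3], lengths=[2, 1, 1])
print(pooled)
```

Each output vector needs only one addition per element read, which is consistent with the very low arithmetic intensity reported for embeddings.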
Future Trends:

Larger Embeddings: Adding more sparse signals and increasing embedding dimensions tends to improve model quality. Therefore, we expect even larger embeddings to be used. This will further increase the pressure on memory and lead to systems with larger memory capacity, while putting more focus on distributed training and inference.
2.1.2 Computer Vision
CV models were some of the earliest to adopt DL techniques. They rely on convolutions that apply $C_i \times K_1 \times K_2$ filters on $B \times [F \times] C_i \times$ height ($H$) $\times$ width ($W$) input images with batch size $B$ and $C_i$ input channels, or video clips with $F$ frames, and produce a result with $C_o$ output channels.
Image Classification involves matching images to classes. Currently, ResNets are widely used for classification [26]. However, recently much larger ResNeXt models have shown state-of-the-art accuracy even with weakly supervised training on billions of Instagram images [43, 74]. For example, our ResNeXt-101-32x4d model contains 43M parameters and requires 8B multiply-add operations during inference, relying on group convolutions with G=32 and d=4 in its first residual block (footnote 2: in a group convolution, only the input channels in the same group are used for computing a given output channel. A group convolution with a total of $C_i$ input channels, $C_o$ output channels and G groups is essentially G independent convolutions, each with d=$C_i$/G input and $C_o$/G output channels. A special case where $C_i$=$C_o$=G, and consequently group size d=1, is called depthwise convolution). The largest configuration, ResNeXt-101-32x48d, contains 829M parameters and requires 153B multiply-add operations during inference, relying on group convolutions with G=32 and d=48 in its first residual block. It further improves the Top-1 validation accuracy on ImageNet-1K by 4% to 85.4% [43].
Object Detection involves identifying specific regions that contain objects of interest. One of the largest-scale object detection systems running in data centers today is the text detection part of the Rosetta system used to understand text in images [8]. It uses the Faster-RCNN-Shuffle model, which relies on the Faster-RCNN architecture [54] with the ResNet trunk replaced by ShuffleNet [75], which uses 3×3 depthwise convolutions and 1×1 group convolutions with d=4. Despite ShuffleNet’s efficiency, object detection tends to be more time consuming than classification for the following reasons.
First, detection requires high resolution for accuracy. Whereas 224×224 images are sufficient for classification, detection typically resizes images such that the maximum side is 800 while maintaining the aspect ratio. Therefore, a typical input of dimension 3×800×600 for object detection is 9.5× larger than a typical input for classification.
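As a quick check of the size ratio quoted above, comparing the element counts of a 3-channel 800×600 detection input and a 3-channel 224×224 classification input gives roughly the stated factor:

```latex
\frac{3 \times 800 \times 600}{3 \times 224 \times 224}
  = \frac{1{,}440{,}000}{150{,}528}
  \approx 9.57
```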
Second, Faster-RCNN employs a region-proposal based approach where the final convolutional block is batched over many proposals of smaller spatial resolution. In Rosetta, the activations tend to be of dimensions [25–100 proposals] × [544 or 1088 channels] × [7,14] × [7,14]. The spatial resolution is typically 7×7 or 14×14, with a large number of channels. Hence the number of proposals is a limiting factor in the number of objects that can be detected and is typically bounded due to computational cost.
Video Understanding has historically taken a frame-based approach, where sampled video frames are passed through image models. However, recently 3D convolutions gained wide adoption owing to higher accuracies given their ability to model the temporal in addition to the spatial domain [65]. Extensive studies have been done to analyze the performance vs. accuracy trade-off of vanilla 3D ResNets compared to factorized 3D convolutions as in Res(2+1)D [66], ResNeXt3D and ShuffleNet3D [3]. In particular, ResNeXt3D with depthwise convolutions, which factorizes the 3D convolution into channel and spatiotemporal dimensions, requires 3× fewer FLOPs than Res(2+1)D, which factorizes the 3D convolution across spatial and temporal dimensions. Further, trading off spatial resolution for increased clip length shows improved accuracy. In the future, increasing both the temporal and spatial resolution would be important for more complex video understanding tasks, such as object detection or action recognition.
Future Trends:

Model Exploration: There is an increasing trend to fine-tune the last few layers of a model specific to each application (such as adding additional categories for classification) while all applications share a common trunk. This leads to the inference cost per image increasing linearly as a factor of the computational cost of only the final few layers. These final convolutions typically have a large number of channels and work on much smaller spatial resolutions, which can be important optimization targets.

Convolution Types: Group/depthwise convolutions such as in ResNeXt and ShuffleNet, originally introduced for mobile inference, have increasingly been adopted in the data center due to accuracy and FLOP efficiency. Depthwise convolutions are memory bandwidth bound, while the majority of FLOPs are spent elsewhere: e.g. ResNeXt3D has 97.1% of all FLOPs in 1×1×1 convolutions.

Large Activations: Image understanding is moving beyond simple classification tasks into more complex domains such as object detection, action recognition, and human pose estimation, which require larger activations. For instance, tasks like object detection require higher resolution for accuracy, and video inference with more frames per clip demonstrates higher accuracy due to more temporal context. More CV tasks will see this trend, adding pressure on on-chip memory capacity and/or off-chip memory bandwidth.

Batch Size: Although CV inference typically does not have very strict latency requirements, small batches are still preferable in non-classification use cases. Whereas classification tasks perform well when the aspect ratios are distorted into a fixed shape like 224×224, doing so results in huge accuracy drops in complex tasks like object detection, making batching difficult. Moreover, owing to large activations, increasing batch size puts further pressure on on-chip memory capacity.
2.1.3 Language Models
Neural machine translation (NMT) has become the dominant approach to machine translation [34, 47, 64, 5]. It relies on the encoder-decoder approach, also called seq2seq. The encoder encodes the input sentence, while the decoder decodes the encoding into the target output sentence.
This approach has been successfully applied to many natural language processing tasks, such as summarization [46, 17], speech recognition [23], syntactic and semantic parsing [11, 69] as well as question answering and dialog systems [9, 52].
Encoder-decoder approaches vary according to the encoder and decoder implementation. A major challenge with NMT is the dependence of a translation segment on its position in the sentence. This motivated the reliance on recurrent neural networks (RNNs), as one can encode the statement’s position in the recurrent network during translation. This approach has shown successful results and is widely used in practice [64, 5]. In this approach, the encoder and the decoder are typically implemented using Gated Recurrent Unit (GRU) [12] or Long Short-Term Memory (LSTM) cells [29].


| Category | Model Types | Model Size (# params) | Batch Size (typical) | Max. Live Activations | Arith. intensity (weights) | Arith. intensity (act. & weights) | Latency (constraints) |
|---|---|---|---|---|---|---|---|
| Recommendation | FCs | 1–10M | 1–100 | >10K | 20–200 | 20–200 | 10s of ms |
| | Embeddings | >10 Billion | 1–100 | >10K | 1–2 | 1–2 | 10s of ms |
| Computer Vision | ResNet-50 | 25M | 1 image | 2M | avg. 303 / min. 100 | avg. 164 / min. 25 | No strict constraints |
| | ResNeXt-101-32x4–48 | 43–829M | 1 image | 2.4–29M | avg. 380 / min. 100 | avg. 188 / min. 28 | |
| | Faster-RCNN-Shuffle | 6M | 1 image | 13.2M | avg. 3.5K / min. 2.5K | avg. 145 / min. 4 | |
| | ResNeXt3D-101 | 21M | 1 clip | 58M | avg. 22K / min. 2K | avg. 172 / min. 6 | |
| Language | seq2seq (GRU/LSTM) | 100M–1B | 1–8 tokens | >100K | 2–20 | 2–20 | 10s of ms |

Future Trends:

Model Exploration: Results have shown that adding more layers and ensembles improves translation quality, but leads to larger NMT models [64]. Reranking the output of a model is a successful approach that can be used together with ensembles [60]. Also, multilingual models are an attractive way to scale one model to multiple languages but each multilingual model may need more capacity to handle multiple language pairs [32, 58].

Parallelism: While successful, RNN-based approaches impose dependencies on each translated word, making it difficult to utilize highly parallel computing architectures such as GPUs. Recognizing this has motivated NMT models that lift the time dependencies imposed by RNNs. In [20], both the encoder and decoder are implemented as stacked convolutions. In [68], the transformer model is introduced, which removes the need for recurrence or convolution altogether and instead relies only on the attention mechanism to improve achievable hardware parallelism at the expense of additional computation. Results from this work show that NMT training time can be significantly reduced while having the additional benefit of model generality. While these approaches benefit from improved parallelism in both the encoder and the decoder during training and in the encoder during inference, a typical inference generates an output sequentially using beam search. A more recent work has attempted to remove the time dependency in the decoder at inference time [24].

Batch Size: Inference with small batches is well suited to instant translation. However, large-scale inference can also be done offline for localization purposes. In that case, using larger batch sizes can be beneficial as throughput becomes more important than latency.
2.2 Compute Characteristics
Let the arithmetic intensity be defined as (# of operations needed to evaluate) / (# of elements incurred in data traffic) during the execution of the model. The compute, memory capacity, and memory bandwidth demands of our representative DL inference workloads are shown in Table 1. We report two arithmetic intensities: (i) assuming only weights are incurring the traffic, for example when all activations fit in a level closer to compute in the memory hierarchy, and (ii) assuming that both weights and activations are incurring traffic.
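To make the two definitions concrete, the following sketch computes both intensities for a single FC layer. The layer shape is hypothetical, and counting activation traffic as inputs read plus outputs written is an assumption of this sketch rather than something specified in the text.

```python
def fc_arithmetic_intensity(m, n, k):
    """Return (ops per weight element, ops per weight+activation element)
    for an FC layer that multiplies an (m x k) input by (k x n) weights."""
    ops = 2 * m * n * k            # one multiply and one add per MAC
    weights = n * k                # weight elements read
    activations = m * k + m * n    # assumed: inputs read + outputs written
    return ops / weights, ops / (weights + activations)

# Hypothetical FC layer with batch size 10 and 512 x 512 weights.
weights_only, weights_and_acts = fc_arithmetic_intensity(m=10, n=512, k=512)
print(weights_only, weights_and_acts)
```

For small batches the weight traffic dominates, so the two intensities are close, as in the FC and seq2seq rows of Table 1.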
For DL hardware designs, there are notable differences between DL workloads found in our representative sample and those commonly studied in the systems community.
First, embedding layers stand out with huge model sizes (more than 10s of GBs) and very low arithmetic intensities. Mathematically, the operation we perform on the embedding tables is a sparse-matrix times dense-matrix multiplication. The sparse matrix has >10 rows and >10M columns, each row with >10 nonzeros. This matrix is multiplied with a dense matrix with >10M rows and >10 columns.
The embedding lookup operation can be an interesting opportunity for applying emerging memory technologies and specialized hardware. On one hand, more expensive high-bandwidth memory (HBM) could be useful because it provides higher bandwidth, but unfortunately its current capacity is limited. On the other hand, non-volatile memory (NVM) is an economical alternative to DRAM for storing embeddings, but the associated bandwidth is too low to be practical out of the box. Further, the memory access pattern to embedding tables has low temporal locality, which makes caching challenging, while low spatial locality often results in underutilization (due to access granularity of 10s of bytes versus NVM block size). Nonetheless, several techniques have been proposed to mitigate these problems [16].
Second, recent models can benefit more from larger on-chip memory capacity. In a hypothetical accelerator with 100 TOP/s and 100 GB/s DRAM bandwidth, the performance projected by a roofline model improves with larger on-chip memory capacities, as shown in Figure 3 (footnote 3: we assume that the model parameters are stored as int8 numbers. We apply a roofline model for each layer, where each layer differs in whether it reads activations/weights from off- or on-chip memory based on a simple greedy on-chip memory allocation algorithm [72]). This is not only driven by larger models, such as NMT seq2seq and ResNeXt-101, but also by larger activations, such as 800×600 input images for ShuffleNet and videos for ResNeXt3D.
Notice that the FC layers in recommendation and NMT models use small batch sizes, so performance is bound by off-chip memory bandwidth unless parameters can fit on-chip. The batch size can be increased while maintaining latency with higher compute throughput of accelerators [33], but only up to a point due to other application requirements. The number of operations per weight in CV models is generally high, but the number of operations per activation is not as high (some layers in the ShuffleNet and ResNeXt3D models are as low as 4 or 6). This is why the performance of ShuffleNet and ResNeXt3D varies considerably with on-chip memory bandwidth, as shown in Figure 3. Had we only considered their minimum 2K operations per weight, we would expect that 1 TB/s of on-chip memory bandwidth is sufficient to saturate the peak 100 TOP/s compute throughput of the hypothetical accelerator. As the application would be compute bound with 1 TB/s of on-chip memory bandwidth, we would expect there to be no performance difference between 1 TB/s and 10 TB/s.
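The reasoning above can be reproduced with a single-level roofline sketch. It assumes int8 data (one byte per element, so operations per element equal operations per byte) and ignores the per-layer on-chip allocation used for Figure 3; the function name is hypothetical.

```python
def roofline_tops(ops_per_byte, bandwidth_tb_s, peak_tops=100.0):
    """Attainable throughput in TOP/s: min(peak, intensity * bandwidth)."""
    return min(peak_tops, ops_per_byte * bandwidth_tb_s)

# 2000 ops/byte ~ minimum ops per weight; 6 and 4 ~ minimum ops per activation.
for ops_per_byte in (2000, 6, 4):
    for bandwidth in (1.0, 10.0):  # on-chip memory bandwidth in TB/s
        attainable = roofline_tops(ops_per_byte, bandwidth)
        print(f"{ops_per_byte:>5} ops/B @ {bandwidth:>4} TB/s -> {attainable} TOP/s")
```

With 2K operations per weight the model is already compute bound at 1 TB/s, whereas layers with only 4 or 6 operations per activation stay bandwidth bound even at 10 TB/s.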
Third, common primitive operations are not just canonical multiplications of square matrices, but often involve tall-and-skinny matrices or vectors. These shapes arise from group/depthwise convolutions that have recently become popular in CV, and from small batch sizes in Recommendation/NMT models due to their latency constraints. Therefore, it is desirable to have a combination of matrix-matrix engines to execute the bulk of FLOPs from compute-intensive models in an energy-efficient manner and sufficiently powerful vector engines to handle the rest.
2.3 Computation Kernels
Let us now illustrate the time spent in different computational kernels on CPU in our data centers. Figure 4 shows that FCs are the most time consuming operations, followed by embedding lookups and tensor manipulations (footnote 4: “Tensor Manipulation” refers to concatenation (for combining dense and sparse features in Figure 2), splitting, slicing, and so on, which are good targets for the whole graph optimizations discussed in Section 3.3).
Following Caffe2 framework convention, the FC operator is defined as $XW^T$, with the $M \times K$ matrix $X$ and the $N \times K$ matrix $W$ as inputs, and $K$ being the inner reduction dimension. The convolutions can be logically transformed to matrix multiplications using im2col, which results in $M = B \times H_o \times W_o$ (batch size times output spatial positions), $N = C_o$ (output channels), and $K = C_i \times K_1 \times K_2$ (input channels times filter height and width), as shown in Figure 5.
We often refer to the number of rows $M$ as the effective batch size or batch/spatial dimension, and to $N$ as the output feature dimension. If $M$ or $N$ is small (as happens in FCs with small batches and in group/depthwise convolutions, respectively), the matrix-matrix multiplication becomes narrow and more closely resembles matrix-vector multiplication, with performance deteriorating accordingly from BLAS3 to BLAS2 levels. In this case a matrix-matrix multiplication engine is expected to have low utilization. This happens for small batch sizes (e.g., recommendation and NMT) and group convolution with few output channels per group (e.g., ShuffleNet).
The number of operations per weight read is proportional to the batch/spatial dimension, and the number of operations per activation read is proportional to the output feature dimension. If either is small, then performance is expected to be bound by memory bandwidth. When an $M \times K$ activation matrix is multiplied with a $K \times N$ weight matrix, we compute $2MNK$ operations while reading $KN$ weights, leading to $2M$ operations per weight. For example, when the batch/spatial dimension ($M$) is 10, the number of operations per weight is 20. In this case, if model parameters are stored as int8 numbers, then saturating a 100 TOP/s architecture would require 5 TB/s of memory bandwidth for reading weights. Similarly, the number of operations per activation is $2N$. With an output feature dimension of 8, 16 operations per activation would require 6.25 TB/s for reading input activations.
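The bandwidth figures in this paragraph follow from dividing the peak throughput by the operations performed per element read. The sketch below reproduces them and also shows the GEMM shape that im2col yields for a convolution; the helper names and the example convolution shape are hypothetical.

```python
def required_bandwidth_tb_s(peak_tops, ops_per_element, bytes_per_element=1):
    """Bandwidth (TB/s) needed to keep peak_tops busy at a given ops/element."""
    return peak_tops / ops_per_element * bytes_per_element

def im2col_gemm_shape(batch, out_h, out_w, c_in, c_out, k_h, k_w):
    """GEMM dimensions (M, N, K) of a convolution after im2col."""
    return batch * out_h * out_w, c_out, c_in * k_h * k_w

m = 10                                                 # batch/spatial dimension
weight_bw = required_bandwidth_tb_s(100, 2 * m)        # 2M ops per int8 weight
n = 8                                                  # output feature dimension
activation_bw = required_bandwidth_tb_s(100, 2 * n)    # 2N ops per int8 activation
print(weight_bw, activation_bw)

# Hypothetical 3x3 convolution: 64 -> 64 channels on a 56x56 output map.
print(im2col_gemm_shape(batch=1, out_h=56, out_w=56, c_in=64, c_out=64, k_h=3, k_w=3))
```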
Note that the overall arithmetic intensity of a DL model can be misleading and we should also look at its individual layers. For example, even though the depthwise convolutions in ShuffleNet and ResNeXt account for only 2% of total FLOPs, if a hypothetical accelerator can achieve 100 TOP/s for the other convolutions and only 2 TOP/s for the depthwise convolutions due to memory bandwidth limitations, time spent in the depthwise convolutions will be comparable to the others.
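The depthwise example can be checked with a few lines of arithmetic: time per layer type is proportional to its share of FLOPs divided by the throughput it attains. The function name is hypothetical and the numbers are the ones quoted in the paragraph.

```python
def time_share(flop_fractions, attained_tops):
    """Fraction of total run time spent in each group of layers."""
    times = [f / t for f, t in zip(flop_fractions, attained_tops)]
    total = sum(times)
    return [t / total for t in times]

# 98% of FLOPs at 100 TOP/s versus 2% of FLOPs at 2 TOP/s.
other_share, depthwise_share = time_share([0.98, 0.02], [100.0, 2.0])
print(other_share, depthwise_share)
```

Both shares come out close to one half, which is why a layer type that is negligible in FLOPs can still dominate the time budget.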
Finally, we point out that standard benchmarks, like DeepBench [6], typically give more emphasis on batch sizes larger than what is encountered in our use cases. They do not capture small reduction dimensions in depthwise convolutions, and big activation tensors in image detection and video models.
3 Performance Optimizations
DL inference workloads running on Facebook’s social network services need to serve billions of users with fluctuating capacity demand over time [25]. Therefore, the availability and flexibility of computing resources is important. In addition, many inference workloads require low latency. These are the reasons why, currently, most inference workloads run on CPUs. Even though accelerators can significantly reduce inference time and improve energy efficiency, CPUs will still be responsible for a good portion of DL inference, especially in cases where tight integration with business logic is needed.
3.1 Data Center Fleetwide DL Inference Profiling
Our data centers run diverse DL inference workloads. Table 1 lists representative models, but by no means covers all of our models; new models with new types of data and varying tensor shapes are always coming online. Therefore, it is important to continuously monitor DL model performance characteristics fleet-wide. DL operations typically utilize a large fraction of peak compute or memory bandwidth, depending on their arithmetic intensity, and are less limited by memory latency or control overheads compared to other typical data center workloads. They often involve regular compute and memory access patterns, lending themselves as good targets for analytical performance models.
For this purpose we have implemented the observer software design pattern: observers can be attached to individual operators and are executed at the start and end of each operator. We have developed a number of functions called by observers that track performance metrics for each operator’s execution (refer to Caffe2 operator cost inference functions for more details). When considered in conjunction with the layer’s full specification, such as layer type, input/output tensor shapes, and element types, we can understand whether a given layer execution should be memory-bandwidth or compute bound. Viewed differently, we can estimate the benefits of optimizing any specific operator. This is particularly useful as it gives us the necessary data to estimate the priority of a considered optimization.
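The following pure-Python sketch illustrates the observer idea in miniature. It is not the Caffe2 observer API: the class names, the `start`/`stop` hooks and the cost formulas are hypothetical stand-ins for an operator that notifies attached observers before and after it runs.

```python
import time

class TimingObserver:
    """Records wall time plus analytical FLOP and byte counts per run."""
    def __init__(self):
        self.records = []
        self._t0 = None

    def start(self, op):
        self._t0 = time.perf_counter()

    def stop(self, op):
        elapsed = time.perf_counter() - self._t0
        self.records.append({"op": op.name, "seconds": elapsed,
                             "flops": op.flops(), "bytes": op.bytes_moved()})

class FCOp:
    """Toy FC operator computing X W^T with X of shape (m, k), W of shape (n, k)."""
    def __init__(self, name, m, n, k, bytes_per_element=4):
        self.name, self.m, self.n, self.k = name, m, n, k
        self.bytes_per_element = bytes_per_element
        self.observers = []

    def flops(self):
        return 2 * self.m * self.n * self.k

    def bytes_moved(self):
        elements = self.m * self.k + self.n * self.k + self.m * self.n
        return elements * self.bytes_per_element

    def run(self, x, w):
        for obs in self.observers:
            obs.start(self)
        y = [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]
        for obs in self.observers:
            obs.stop(self)
        return y

op = FCOp("fc1", m=2, n=3, k=4)
observer = TimingObserver()
op.observers.append(observer)
y = op.run([[1.0] * 4] * 2, [[1.0] * 4] * 3)
print(observer.records[0])
```

Dividing the recorded FLOPs by the recorded bytes gives the arithmetic intensity of the call, which can then be compared against a roofline to judge whether the execution should be compute or bandwidth bound.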
In order to keep track of the accuracy and identify inefficiencies in the roofline models, we maintain detailed per-layer logs that measure execution time, memory bandwidth in GB/s and actual attained FLOP/s, which are derived from hardware performance counters for sampled DL operator executions. A telemetry agent running on each host collects this information and compares it with the given predictions across all of our data centers. Also, to set realistic goals for our optimization efforts, we developed a number of benchmarks tuned for each potential bottleneck.
3.2 Reduced Precision Inference
Reduced-precision inference has been shown to be effective at improving compute throughput within a power budget, especially on mobile platforms. However, applying reduced-precision inference in data centers is non-trivial.
First, while mobile platforms have widely adopted CV models such as ShuffleNet and MobileNet that trade off accuracy for a significant reduction in compute requirements [75, 30], DL inference in data centers prefers accurate but compute-intensive models like ResNet [26] and ResNeXt [74]. In particular, when DL inference is related to core services like feed or integrity/security, the accuracy loss should be very small. Usually a 1% change in accuracy compared with single-precision floating-point results is acceptable.
Also, while general-purpose CPUs have high availability in data centers, they have not yet adapted to the rapidly increasing compute demand of DL inference and hence lack good support for high-performance reduced-precision inference. This is exacerbated by the fact that high-performance and high-accuracy reduced-precision linear algebra libraries for CPUs are less mature than their higher-precision counterparts.
3.2.1 Performance Challenges
Current generations of x86 processors [31] provide conversion instructions between half- and single-precision floating-point numbers (vcvtph2ps and vcvtps2ph), but without native half-float (fp16) computation. They also require a sequence of instructions (vpmaddubsw + vpmaddwd + vpadd) to implement 8-bit integer multiplications with 32-bit accumulation, with marginally higher (33%) compute throughput than that of single-precision floating point (fp32) [56]. The compute throughput of 8-bit integer multiplications with 16-bit accumulation can be about twice as high as fp32, but this often results in significant accuracy drops unless combined with outlier-aware quantization, which will be described shortly. On the other hand, VNNI instructions provide higher-throughput int8 multiplications with 32-bit accumulation, but they are not available in current x86 microarchitectures [71]. As a result, we had to tailor optimization strategies based on the performance bottleneck.
If the performance is memory-bandwidth bound, then using fp16 when storing weights or using 8-bit multiplications with 32-bit accumulation (i8acc32) can increase the arithmetic intensity by up to a factor of 2 and 4, respectively. In this case, we can obtain speedups proportional to the memory bandwidth saving, even when we save nothing with respect to the number of instructions. For example, this happens in FCs with small batch sizes and group convolutions with a small number of channels per group (the extreme case being depthwise convolution with just one channel per group).
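A simple traffic model shows why the gains are bounded by 2× and 4×. The sketch assumes execution time is proportional to bytes moved, and that only the weights change precision while activations stay at 4 bytes; both are simplifying assumptions, and the layer shape is hypothetical.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}

def bandwidth_bound_speedup(m, n, k, weight_dtype, act_bytes=4):
    """Speedup over fp32 weights when time is proportional to bytes moved."""
    weight_elems = n * k
    act_elems = m * k + m * n          # inputs read + outputs written
    baseline = weight_elems * BYTES_PER_ELEMENT["fp32"] + act_elems * act_bytes
    reduced = weight_elems * BYTES_PER_ELEMENT[weight_dtype] + act_elems * act_bytes
    return baseline / reduced

# Hypothetical FC layer with batch size 1 and 1024 x 1024 weights.
fp16_speedup = bandwidth_bound_speedup(1, 1024, 1024, "fp16")
int8_speedup = bandwidth_bound_speedup(1, 1024, 1024, "int8")
print(round(fp16_speedup, 3), round(int8_speedup, 3))
```

The speedups approach 2× and 4× only when weight traffic dominates, which is the small-batch regime named in the text.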
We have designed and implemented a reduced-precision linear algebra library for DL inference called FBGEMM [36, 35]. Figure 6(a) plots the performance of our optimized fp16 and i8acc32 matrix multiplication (GEMM) in FBGEMM compared with Intel MKL’s fp32 GEMM. The experiments are performed on a single thread running on an Intel Xeon E5-2680 v4 with turbo mode off, using Intel MKL version 2017 update 3. Notice that for cases with low arithmetic intensity our fp16 and i8acc32 GEMMs obtain up to 2× and 4× speedups over MKL’s fp32 GEMM, respectively. For instance, applying our fp16 GEMM, we obtain up to 2× speedup in FC layers in a recommendation model, with a 15% overall latency reduction. Also, applying our i8acc32 GEMM, we obtain an overall 2.4× speedup in the Faster-RCNN-Shuffle used for our optical character recognition application.
If the performance is bound by the instruction throughput, then we use 8-bit multiplications with 16-bit accumulation and periodic spills to 32-bit accumulators (i8acc16), which can provide 2× the compute throughput of fp32. To avoid saturation and accuracy drops, we employ outlier-aware quantization that separates out weights with bigger magnitude as outliers [50]. Here, we consider a typical threshold for outliers, where a weight is not an outlier if it is representable with 7 bits (i.e. the value of the weight is between −64 and 63). We split the weight matrix into two parts, $W = W_{main} + W_{outlier}$, where $W_{main}$ is in 7 bits and $W_{outlier}$ contains the residual. The matrix multiplication, $XW^T$, is calculated in two stages, where $XW_{main}^T$ uses 16-bit accumulation, and $XW_{outlier}^T$ uses 32-bit accumulation. We find that $W_{outlier}$ becomes a sparse matrix, often with density less than 0.1%, especially when combined with symmetric quantization [39]. Since $W_{outlier}$ is sparse, $XW_{outlier}^T$ accounts for a small fraction of the total time. Figure 6(b) plots the performance of our i8acc16 GEMM compared with MKL GEMM in fp32, which achieves up to 2× speedup for matrix shapes with high arithmetic intensity. In particular, applying our i8acc16 GEMM to ResNet-50, we obtain a 1.7× speedup over MKL in fp32.
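The weight split can be sketched in NumPy as follows. This is not FBGEMM code: the function names are hypothetical, outliers are moved entirely into the residual matrix (one possible reading of the split), and both products are emulated in int32, so the sketch checks only that the decomposition is exact and does not model 16-bit accumulation or sparse kernels.

```python
import numpy as np

def split_outliers(w_int8, bits=7):
    """Split W into a 7-bit-representable part and a residual outlier part."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1    # -64 .. 63
    w = w_int8.astype(np.int32)
    in_range = (w >= lo) & (w <= hi)
    w_main = np.where(in_range, w, 0)
    w_outlier = w - w_main               # nonzero only where W is an outlier
    return w_main, w_outlier

def gemm_outlier_aware(x_uint8, w_int8):
    """Compute X W^T as X W_main^T + X W_outlier^T (int32 emulation)."""
    w_main, w_outlier = split_outliers(w_int8)
    x = x_uint8.astype(np.int32)
    dense_part = x @ w_main.T            # stands in for the 16-bit accumulation path
    sparse_part = x @ w_outlier.T        # stands in for the sparse 32-bit path
    return dense_part + sparse_part

rng = np.random.default_rng(0)
w = np.clip(np.rint(rng.normal(0, 20, size=(64, 128))), -128, 127).astype(np.int8)
x = rng.integers(0, 256, size=(4, 128), dtype=np.uint8)

w_main, w_outlier = split_outliers(w)
reference = x.astype(np.int32) @ w.astype(np.int32).T
result = gemm_outlier_aware(x, w)
print("outlier density:", np.count_nonzero(w_outlier) / w_outlier.size)
```

A production kernel would store the outlier matrix in a sparse format so that the second product costs time proportional to the number of outliers.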
Even though some of the applied optimizations are done to work around limitations of current x86 processors, they provide insight for future DL hardware optimizations. Our optimizations show it is useful to apply different quantization techniques depending on where the performance bottleneck lies. For example, quantization techniques that are primarily for saving storage and bandwidth should be tested with embedding layers, FCs with small batch size, and depthwise convolutions. Our experience with outlier-aware quantization shows that a high-performance sparse linear algebra engine will be helpful not only for pruned models but also for reducing the required precision of non-pruned models. For example, 6-bit quantized models can be computed in 4-bit for main values while the outlier part is computed with the 6-bit sparse engine.
3.2.2 Accuracy Challenges
Impressive progress has been made in low-precision DL inference, some of which considers even ternary or binary quantization [53, 77]. However, even 8-bit quantization has presented its own set of challenges in meeting the accuracy requirements of our DL workloads in data centers. The following five techniques were effective at meeting the accuracy requirements:

Fine-grain Quantization. Instead of having a single quantization parameter per tensor, applying quantization at a finer granularity is often required. Examples are per-output-feature quantization in FCs, per-output-channel quantization in convolutions, per-group quantization in group convolutions, or per-entry quantization in embedding tables.

Quantization-aware Training. We found that quantization-aware training, for example using fake quantization, is important for meeting the accuracy requirements. This aligns with a recent white paper [39] that shows the importance of per-channel quantization and quantization-aware training in quantizing CNNs for mobile platforms.

Selective Quantization. Unlike mobile platforms, which can strongly prefer end-to-end quantization, DL inference in data centers should be able to fall back to floating-point in accuracy-sensitive parts of DL models. We systematically profile errors introduced by quantization per layer and skip quantization when the error is too high. Examples include the first and last few layers of CNNs.

Outlier-aware Quantization. In addition to the outlier-aware quantization technique described previously for 16-bit accumulation, we can take advantage of the fact that the range of values can be confined much more once outliers are ignored. For example, instead of quantizing a tensor over its full range, from its minimum to its maximum value, we can quantize over a smaller range, chosen such that the L2 norm of the quantization error is minimized with respect to the distribution of values. Unlike weight tensors, activation tensors are not constant, so we collect the distribution of activation tensors by running with calibration inputs from the training data.

Net-aware Quantization. We can often further reduce the range we are quantizing for based on neighboring operators. For example, if an operator is only followed by ReLU, we can narrow down the range by excluding negative values.
For instance, using these techniques, a ResNet-50 model with int8 quantization (except softmax) achieves 75.6% Top-1 and 92.8% Top-5 accuracy for the ImageNet-1K validation set [15], which corresponds to only a 0.3% Top-1 and 0.1% Top-5 accuracy drop compared to the baseline fp32 model [22].
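The range selection behind outlier-aware quantization can be sketched as a search for the clipping range with the smallest squared quantization error. The code below is a minimal illustration under stated assumptions, not the production calibration code: `Range`, `quantizationError` and `chooseRange` are hypothetical names, the search is a brute-force grid over the raw values rather than over a collected histogram, and the grid resolution `steps` is arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Range {
  float lo;
  float hi;
};

// Sum of squared errors (squared L2 norm) when `values` are quantized
// uniformly to 2^bits levels over [r.lo, r.hi]; out-of-range values clamp.
double quantizationError(const std::vector<float>& values, Range r, int bits) {
  assert(r.hi > r.lo && bits >= 1 && bits <= 16);
  const int levels = (1 << bits) - 1;  // number of quantization steps
  const double lo = r.lo, hi = r.hi;
  const double scale = (hi - lo) / levels;
  double err = 0.0;
  for (float v : values) {
    const double clamped = std::min(std::max(static_cast<double>(v), lo), hi);
    const double dequant = lo + std::round((clamped - lo) / scale) * scale;
    err += (v - dequant) * (v - dequant);
  }
  return err;
}

// Grid search over ranges nested inside [min, max]; keeps the range with the
// smallest error. The full range is the starting point, so the result is
// never worse than plain min/max quantization.
Range chooseRange(const std::vector<float>& values, int bits, int steps = 64) {
  assert(!values.empty() && steps >= 1);
  const auto mm = std::minmax_element(values.begin(), values.end());
  const float lo = *mm.first;
  const float hi = *mm.second;
  Range best{lo, hi};
  if (!(hi > lo)) return best;  // constant tensor: nothing to search
  double bestErr = quantizationError(values, best, bits);
  for (int i = 0; i < steps; ++i) {
    for (int j = 0; i + j < steps; ++j) {
      Range cand{lo + (hi - lo) * i / steps, hi - (hi - lo) * j / steps};
      if (!(cand.hi > cand.lo)) continue;  // guard against float rounding
      const double e = quantizationError(values, cand, bits);
      if (e < bestErr) {
        bestErr = e;
        best = cand;
      }
    }
  }
  return best;
}
```

For activation tensors, `values` would come from calibration runs as described above. A net-aware variant could additionally raise `lo` to zero when the only consumer is a ReLU.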
3.2.3 Software Challenges:
Linear algebra operations for machine learning inference require optimizations that are quite different from those for high-performance scientific computing (i.e., HPC). The standard BLAS interface cannot provide the desired performance for the matrix shapes that are common in DL inference. Since the compute requirement in DL is rapidly changing, it can be premature to attempt to standardize a new linear algebra interface for DL, but it is worthwhile to discuss the associated requirements and challenges.
As shown in Figure 5, typical matrix shapes in DL inference are smaller and often tall and skinny, compared to those in typical HPC applications. High-performance matrix multiplications often “pack” a block of input matrices into a format friendly for vectorization and cache locality. For large enough square matrices, the overhead of packing can be amortized inside a single matrix multiplication adhering to the standard BLAS interface. However, for tall-skinny matrices, we need to amortize the packing overhead across multiple matrix multiplications for constant weight matrices, which requires a new interface that accepts a custom pre-packed matrix.
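The amortization argument can be made concrete with a small sketch in which the weight matrix is packed once and reused by many multiplications. The class `PackedB`, the function `gemmPrepacked`, the panel width `kPanel` and the fp32 data type are all assumptions made for illustration; they do not reproduce the FBGEMM packing format.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of B columns stored together in one panel (arbitrary for this sketch).
constexpr int kPanel = 4;

// Hypothetical pre-packed weight matrix. B is K x N, row-major on input.
// Packing stores B panel by panel; inside a panel, the kPanel values that
// belong to one k index are contiguous, which suits a vectorized inner loop.
class PackedB {
 public:
  PackedB(const std::vector<float>& b, int k, int n) : k_(k), n_(n) {
    assert(b.size() == static_cast<std::size_t>(k) * n);
    const int panels = (n + kPanel - 1) / kPanel;
    data_.assign(static_cast<std::size_t>(panels) * k * kPanel, 0.0f);
    for (int p = 0; p < panels; ++p) {
      for (int kk = 0; kk < k; ++kk) {
        for (int j = 0; j < kPanel; ++j) {
          const int col = p * kPanel + j;
          if (col < n) {
            data_[(static_cast<std::size_t>(p) * k + kk) * kPanel + j] =
                b[kk * n + col];
          }
        }
      }
    }
  }
  int k() const { return k_; }
  int n() const { return n_; }
  const float* panel(int p) const {
    return data_.data() + static_cast<std::size_t>(p) * k_ * kPanel;
  }

 private:
  int k_;
  int n_;
  std::vector<float> data_;
};

// C (M x N) = A (M x K) * B, reading B only through its packed form.
std::vector<float> gemmPrepacked(const std::vector<float>& a, int m,
                                 const PackedB& b) {
  assert(a.size() == static_cast<std::size_t>(m) * b.k());
  std::vector<float> c(static_cast<std::size_t>(m) * b.n(), 0.0f);
  const int panels = (b.n() + kPanel - 1) / kPanel;
  for (int p = 0; p < panels; ++p) {
    const float* panel = b.panel(p);
    for (int i = 0; i < m; ++i) {
      float acc[kPanel] = {0.0f};
      for (int kk = 0; kk < b.k(); ++kk) {
        const float av = a[i * b.k() + kk];
        for (int j = 0; j < kPanel; ++j) acc[j] += av * panel[kk * kPanel + j];
      }
      for (int j = 0; j < kPanel; ++j) {
        const int col = p * kPanel + j;
        if (col < b.n()) c[i * b.n() + col] = acc[j];
      }
    }
  }
  return c;
}
```

The `PackedB` constructor runs once per weight matrix, while `gemmPrepacked` runs once per inference request, so the packing cost no longer has to be recovered within a single small multiplication.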
A significant fraction of DL computation is not strictly matrix multiplication. For example, the convolution operator in CNNs can be expressed as im2col followed by matrix multiplication, but this often does not lead to the highest performance due to the duplication of input activations and the overhead of im2col. Therefore, it is important to promote convolution as a first-class citizen of the interface to enable the computation of direct convolutions without im2col. This will also enable algorithmic optimizations such as Winograd or FFT-based convolution, as in cuDNN, with automatic choice of the best algorithm for given tensor shapes. The native convolution interface is particularly important for group convolution with only a few channels per group. If we individually apply im2col followed by GEMM for each group, the reduction dimension and the output feature dimension are too small for efficient vectorization and parallelization. Note that even the FC layer cannot be implemented strictly with only a GEMM operation, as it involves a bias term which should be fused with the main GEMM to save memory bandwidth. It is also desirable to fuse other common operations such as ReLU.
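The duplication introduced by im2col, and the alternative of a direct convolution with a fused epilogue, can be seen in a deliberately small sketch. It assumes a single input channel, a single output channel, stride 1 and no padding; the function names are hypothetical and do not correspond to any library interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Input is H x W, kernel is R x S, output is (H-R+1) x (W-S+1), all row-major.

// im2col: every output pixel gets its own row of R*S input values, so each
// input value is copied up to R*S times.
std::vector<float> im2col(const std::vector<float>& in, int h, int w,
                          int r, int s) {
  assert(in.size() == static_cast<std::size_t>(h) * w && r <= h && s <= w);
  const int oh = h - r + 1, ow = w - s + 1;
  std::vector<float> cols(static_cast<std::size_t>(oh) * ow * r * s);
  std::size_t idx = 0;
  for (int y = 0; y < oh; ++y)
    for (int x = 0; x < ow; ++x)
      for (int i = 0; i < r; ++i)
        for (int j = 0; j < s; ++j)
          cols[idx++] = in[(y + i) * w + (x + j)];
  return cols;
}

// Convolution expressed as im2col followed by a matrix-vector product.
std::vector<float> convIm2col(const std::vector<float>& in, int h, int w,
                              const std::vector<float>& kernel, int r, int s) {
  assert(kernel.size() == static_cast<std::size_t>(r) * s);
  const std::vector<float> cols = im2col(in, h, w, r, s);
  const int pixels = (h - r + 1) * (w - s + 1);
  std::vector<float> out(pixels, 0.0f);
  for (int p = 0; p < pixels; ++p)
    for (int q = 0; q < r * s; ++q)
      out[p] += cols[static_cast<std::size_t>(p) * r * s + q] * kernel[q];
  return out;
}

// Direct convolution: reads the input in place and applies bias and ReLU
// before the result is written, so no intermediate buffer is materialized.
std::vector<float> convDirectBiasRelu(const std::vector<float>& in, int h,
                                      int w, const std::vector<float>& kernel,
                                      int r, int s, float bias) {
  assert(in.size() == static_cast<std::size_t>(h) * w);
  assert(kernel.size() == static_cast<std::size_t>(r) * s && r <= h && s <= w);
  const int oh = h - r + 1, ow = w - s + 1;
  std::vector<float> out(static_cast<std::size_t>(oh) * ow);
  for (int y = 0; y < oh; ++y) {
    for (int x = 0; x < ow; ++x) {
      float acc = 0.0f;
      for (int i = 0; i < r; ++i)
        for (int j = 0; j < s; ++j)
          acc += in[(y + i) * w + (x + j)] * kernel[i * s + j];
      out[y * ow + x] = std::max(0.0f, acc + bias);  // fused bias + ReLU
    }
  }
  return out;
}
```

Even for a 3x3 input and a 2x2 kernel, the im2col buffer holds 16 values against 9 in the input, and the gap grows with the kernel size.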
Reduced-precision fixed-point computation requires additional steps, such as handling the non-zero offsets used in asymmetric quantization and rescaling the 32-bit intermediate results of matrix multiplication, which should be fused with the main GEMM to save bandwidth. Google’s gemmlowp library [21] provides a well-designed interface for fusing an “output pipeline” with the main GEMM. However, gemmlowp does not provide a native convolution interface and is mostly optimized for ARM Neon and Intel x86 SSE4.1, not for AVX2 and AVX512.
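The extra steps can be written out for a single output element. With real values represented as $a = s_a (q_a - z_a)$ and $b = s_b (q_b - z_b)$, the sum $\sum_k (q_{a,k} - z_a)(q_{b,k} - z_b)$ expands to $\sum_k q_{a,k} q_{b,k} - z_b \sum_k q_{a,k} - z_a \sum_k q_{b,k} + K z_a z_b$. The sketch below applies this correction and then rescales to the output quantization. `QParams` and `quantizedDot` are hypothetical names, and the floating-point multiplier is an assumption made for brevity (integer-only pipelines typically use a fixed-point multiplier instead).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Affine quantization parameters: real = scale * (quantized - zeroPoint).
struct QParams {
  float scale;
  int32_t zeroPoint;
};

// One output element of a quantized matrix multiplication, including the
// zero-point correction, the rescaling of the 32-bit accumulator and an
// optional ReLU, all applied before the 8-bit result is produced.
uint8_t quantizedDot(const std::vector<uint8_t>& a, QParams pa,
                     const std::vector<int8_t>& b, QParams pb,
                     QParams pc, bool relu) {
  assert(a.size() == b.size() && pc.scale > 0.0f);
  int32_t acc = 0, sumA = 0, sumB = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    acc += static_cast<int32_t>(a[i]) * b[i];
    sumA += a[i];  // row sum of activations
    sumB += b[i];  // column sum of weights, constant per weight matrix
  }
  const int32_t k = static_cast<int32_t>(a.size());
  // sum_k (qa - za)(qb - zb) = acc - zb*sumA - za*sumB + K*za*zb
  const int32_t corrected = acc - pb.zeroPoint * sumA - pa.zeroPoint * sumB +
                            k * pa.zeroPoint * pb.zeroPoint;
  const double multiplier =
      static_cast<double>(pa.scale) * pb.scale / pc.scale;
  long q = pc.zeroPoint + std::lround(corrected * multiplier);
  if (relu) q = std::max<long>(q, pc.zeroPoint);  // real value 0 maps to zeroPoint
  q = std::min<long>(std::max<long>(q, 0), 255);
  return static_cast<uint8_t>(q);
}
```

The weight-side sum can be precomputed because the weights are constant, whereas the activation-side sum has to be produced at run time. Doing all of this while the accumulator is still in registers avoids writing and re-reading a 32-bit intermediate tensor.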
Intel MKL-DNN is another library that provides high-performance implementations of DL primitives on CPU. MKL-DNN implements advanced features such as Winograd convolution in int8. On the other hand, FBGEMM has features such as outlier-aware quantization. Therefore, some of our DL inference applications use FBGEMM and some others use MKL-DNN, depending on compute characteristics and operator availability. Low-precision linear algebra for DL inference is still a young field, and we believe it is healthy to have multiple complementary solutions that can experiment with different optimizations while adopting proven techniques from each other.
The code snippet below shows an example of our FBGEMM library interface: a templatized C++ function that performs a matrix multiplication for different data types. The salient features of this interface are the way it accepts a packed B matrix (usually weights that can be packed once and used multiple times) and also a parameter for packing matrix A. The packing of matrix A can be specialized and fused with memory-bandwidth-bound operations such as im2col, row-wise sum for asymmetric quantization, or depthwise convolution. The outProcess parameter is templatized to support various types of data processing on the output once the main GEMM part is finished (similar to gemmlowp’s output pipeline). As previously mentioned, many matrices in DL inference are tall-skinny, so the main kernels of matrix multiplication are dynamically generated just-in-time to take advantage of matrix-size-specific optimizations. The FBGEMM library is open source and integrated with the Caffe2 deep learning framework. For more complete examples, refer to the tests and benchmarks in our open source project.
template<
    typename T_PACK_A,
    typename T_PACK_B,
    typename T_C,
    typename OUT_FUNCTOR>
void gemmPacked(
    // packed inputs
    T_PACK_A& packA,
    T_PACK_B& packedB,
    // output
    T_C* C,
    uint32_t ldc,
    // post-processing functor, e.g. Relu
    OUT_FUNCTOR& outProcess);
3.3 Whole Graph Optimization
While it is important to optimize the performance of individual operators as outlined in the previous subsections, we can get additional significant performance improvements by looking at the DL graph as a whole and performing cross-operation optimizations. A few different optimizations fall into this category, including operator fusion, data movement elimination, operator scheduling, and threading for both inter- and intra-op parallelism. This section focuses on operator fusion, specifically quantifying the potential speedups of operator fusion. The realized speedup from operator fusion will heavily depend on the efficiency of the underlying fused kernel. Automatic generation of fused kernels is an active area of research, and early productization efforts are underway [57, 67, 10, 42]. However, it is still often necessary to write fused kernels manually. For this reason, we focus our efforts in two directions: 1) finding the top few opportunities where we will get the most gains from fusion for our models and that can be worth manual attention, and 2) finding a broader set of opportunities for compiler-generated kernels.
Our approach to identifying fusion opportunities is similar for both cases. We aim to identify subgraphs that occur commonly in our workloads across the entire fleet and are expected to have high speedup potential. We log the complete graphs annotated with operator dependencies, frequency, and input/output tensor shapes. We then run a frequent subgraph mining algorithm on the nets captured. The idea here is to find all subgraphs that are executed frequently enough and order them on the basis of speedup potential from fusion. To perform the ordering, we use the input/output dimensions of the operators to compute a simple roofline model for the subgraph being considered. Specifically, we compute the performance projected by the roofline model before and after fusion, and use the difference to estimate speedup potential. Note that we filter out some subgraphs based on specific operator pattern rules. For example, we rule out subgraphs with operators that are not data parallel and hence challenging to fuse. Finally, we run a top-k algorithm on the ordered subgraphs to return the top opportunities.
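The roofline-based ordering can be sketched as follows. The sketch makes several assumptions that are not taken from the text: a subgraph is a linear chain of operators, fusion keeps all intermediate tensors on chip so that only the first input, the weights and the last output touch memory, and candidates are scored by execution frequency times projected time saved. All type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Cost of one operator, derived from its input/output tensor shapes.
struct OpCost {
  double flops;
  double inputBytes;   // activations read from the previous operator
  double weightBytes;  // parameters read from memory
  double outputBytes;  // activations written for the next operator
};

struct Machine {
  double peakFlops;        // operations per second
  double peakBytesPerSec;  // memory bandwidth
};

// Roofline projection: time is bound by compute or by memory traffic.
double rooflineTime(double flops, double bytes, const Machine& m) {
  return std::max(flops / m.peakFlops, bytes / m.peakBytesPerSec);
}

// Each operator runs on its own and round-trips its tensors through memory.
double unfusedTime(const std::vector<OpCost>& chain, const Machine& m) {
  double t = 0.0;
  for (const OpCost& op : chain)
    t += rooflineTime(op.flops, op.inputBytes + op.weightBytes + op.outputBytes, m);
  return t;
}

// One fused kernel: intermediates are assumed never to leave the chip.
double fusedTime(const std::vector<OpCost>& chain, const Machine& m) {
  assert(!chain.empty());
  double flops = 0.0;
  double bytes = chain.front().inputBytes + chain.back().outputBytes;
  for (const OpCost& op : chain) {
    flops += op.flops;
    bytes += op.weightBytes;
  }
  return rooflineTime(flops, bytes, m);
}

struct Candidate {
  std::vector<OpCost> chain;
  double frequency;  // how often the subgraph executes across the fleet
};

// Indices of the k candidates with the largest frequency-weighted saving.
std::vector<std::size_t> topK(const std::vector<Candidate>& cands,
                              const Machine& m, std::size_t k) {
  std::vector<double> score(cands.size());
  std::vector<std::size_t> order(cands.size());
  for (std::size_t i = 0; i < cands.size(); ++i) {
    score[i] = cands[i].frequency *
               (unfusedTime(cands[i].chain, m) - fusedTime(cands[i].chain, m));
    order[i] = i;
  }
  k = std::min(k, order.size());
  std::partial_sort(order.begin(), order.begin() + k, order.end(),
                    [&score](std::size_t a, std::size_t b) {
                      return score[a] > score[b];
                    });
  order.resize(k);
  return order;
}
```

Under this model, fusing a compute-bound operator with a bandwidth-bound neighbor removes most of the neighbor's cost. Subgraphs excluded by the pattern rules would simply be filtered out before `topK` is called.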
With this analysis, we were able to find several opportunities for merging batched matrix multiplies with tensor manipulation operations. As analyzed in Figure 4, these tensor manipulation operations comprise about 17% of the overall DL inference CPU time. Most of these operations are memory bandwidth limited; merging them with compute bound operations resulted in a total of over 10% savings in run time.
4 Application Driven HW Codesign Directions
This section discusses implications of the DL model characteristics and their optimization for software and hardware codesign. We believe that the server-side DL workload optimizations should be considered as a codesign problem along three axes: DL models, numerics (quantization, Winograd/FFT convolution, and sparsity), and hardware platforms. Also, the process should be driven by DL models because of their rapid changes and diversity. We highlight a few relevant observations in this regard next.
Workload Diversity: DL is a fast-moving field, while the design space of inference hardware is huge. Therefore, one needs a fast turn-around loop with performance modeling capability to predict the benefits of various hardware and software co-optimizations based on workload characteristics captured from a wide range of up-to-date DL models. This study reveals the following characteristics of DL models running in our data centers. First, they have diverse compute patterns where matrices do not necessarily have “nice” square shapes. There are also many “long-tail” operators other than FC and convolutional layers. Therefore, in addition to matrix multiplication engines, hardware designers should consider general and powerful vector engines. Second, DL models in our data centers have diverse and sometimes conflicting demands on the memory subsystem. For example, due to larger activation matrices or matrices with tall-and-skinny shapes, recent CV and NMT models need bigger on-chip memory capacity to sustain high compute throughput without being bottlenecked by off-chip memory bandwidth. However, we should not solely rely on on-chip capacity to fit the entire model, because it is difficult to project the on-chip memory capacity demand of future models. Some of our current recommendation models are already too big to fit in on-chip memory. Recommendation models not only require a huge memory capacity but also high bandwidth.
Data Center Requirements: When codesigning inference hardware for data centers, it is important to keep in mind that data center server-side DL inference has different requirements from mobile/embedded/IoT devices. For example, some quantization and pruning techniques report 2–3% accuracy drops, but that is often too high for the data center environment, and such techniques are often not deployed. If quantization drops the accuracy of, say, a 32x32d model by more than 1% with less than 2× speedup, it can be more advantageous to just use the 32x16d model without quantization. In order to minimize accuracy drops from quantization, inference hardware for data centers should support per-channel quantization. It should also support fp16 or bfloat16 compute as a fallback in accuracy-sensitive parts such as the last layer of some DL models.
Service Disaggregation: DL applications have distinctive compute and memory characteristics compared to other typical data center workloads. Specifically, DL inference often utilizes a higher fraction of peak FLOPs, memory capacity, and bandwidth. As a result, other jobs on the same machine can suffer memory capacity and bandwidth pressure, or power limitation, e.g., a reduction in turbo frequency when AVX2 or AVX512 is used by a deep learning workload [38]. This reduces the performance of other important components such as business logic and has a detrimental effect on full system performance. Hence, a natural decision is to disaggregate DL inference into a separate tier (accelerated or not). Disaggregation also allows pooling requests from many front-end servers, increasing the batch size and hence compute efficiency. A challenge is that inference queries and results need to be transferred between the tiers over the network. Thus, the tier design, network bandwidth and latency, and compression techniques need to be carefully considered. For example, a hypothetical accelerator with 100 TOP/s compute throughput would require a few GB/s of PCIe and/or network bandwidth for the DL models listed in Table 1, unless image decompression can be done within the accelerator or on the same host.
DL Model and Hardware Codesign: It is important to codesign DL models to be aware of the cost associated with the required hardware resources. While power/energy budget, on-chip memory capacity, and off-chip memory bandwidth are typically the scarcer resources, research on efficient DL model architectures often only optimizes the number of floating-point operations. When the performance is bandwidth bound, adding more FLOPs without increasing the bandwidth consumption can be a good way to improve the accuracy while maintaining the performance. If adding 2× FLOPs to the FC part of a recommendation model and increasing the embedding dimension of its embedding table by 2× provide similar accuracy improvements, we would expect adding FLOPs to be the more economical direction. Recovering accuracy losses from quantization by making DL models wider is an example of hardware-cost-aware tradeoffs: int8 multiplication consumes more than 5× less energy and area compared to fp16 multiplication, hence there is ample room to recover the accuracy while maintaining the energy savings [14, 44]. NMT models with higher arithmetic intensity and parallelism, such as the transformer architecture, also illustrate hardware-cost-aware tradeoffs.
5 Related Work
Recently, Hazelwood et al. presented a holistic characterization of ML workloads in data centers, covering not just inference but also training and data acquisition [25]. The paper discussed unique challenges of ML workloads in data centers, including their diversity, huge data and compute capacity demand [25]. Our paper aims to provide insights useful for software/hardware codesign of DL applications, focusing on DL inference characteristics related to hardware design.
Hardware accelerators for server ML workloads have been actively studied in academia. Also, NVIDIA GPUs and Google TPUs have been successfully used in industry [48, 33]. In particular, Google TPU relies on a systolic array accelerator mainly targeted at 8-bit matrix-matrix multiplication, which is challenging to utilize for small batches and group/depthwise convolutions.
Also, Microsoft Brainwave is a matrix-vector accelerator for low-latency AI applications in data centers [18]. It consists of dot-product engines which perform, with the broadcast vector and its local matrix weights, dot-product operations in parallel. The salient features of Brainwave are model pinning and block floating point representation. The large on-chip memory of the FPGA is exploited to store weights on chip, avoiding off-chip DRAM accesses. Block floating point offers low-precision computation by enabling 4- or 5-bit multiplications of mantissas and 5-bit additions of shared exponents. However, it is not clear if architectures like Brainwave are general enough to efficiently target our diverse DL inference workloads.
Moreover, a number of techniques have been proposed to improve the energy efficiency of DL inference by taking advantage of reduced precision and sparsity. NVIDIA’s SCNN skips computation with zero input in matrix multiplications, thereby offering significant improvements in energy efficiency [49]. Akhlaghi et al. propose early stopping of convolution when the output, which is followed by ReLU, is expected to be non-positive [1]. Sharma et al. present a systolic array accelerator called BitFusion which supports variable bit precision [61]. Park et al. present a zero-aware 4-bit accelerator called OLAccel which applies reduced precision to the majority of data while keeping a small fraction of large-value data in high precision [50], a technique also used in the optimizations described in this paper. Fleischer et al. propose a multi-TOP/s AI core supporting a wide range of precision from fp16 (for training) to 1- or 2-bit (for inference) [62]. Our paper shows that, while low-precision and sparse computation can significantly improve energy efficiency, they should meet the accuracy requirements of server-side DL inference to be widely used in data centers.
Finally, we point out that a number of DL benchmarks are actively being developed [45, 63]. A benchmark framework has been presented where a model zoo of benchmark neural networks is provided and the performance of neural networks, optimized by users, is measured on real mobile devices remotely [63]. MLPerf aims at providing benchmarks for both server and mobile devices [45]. These benchmarks will facilitate system-level performance measurements and comparisons on diverse software platforms like TensorFlow [42] and PyTorch [51] as well as hardware architectures.
6 Conclusion
In the face of rapid innovation in deep learning and the increase in the computation and memory requirements of DL models, codesigning DL inference hardware for current and future DL models is an important but challenging problem. We believe our DL inference characterization and optimization experience can provide useful insights for DL inference hardware designs. We hope our paper can also contribute to the discussion around the software ecosystem, such as benchmarking suites, linear algebra interfaces optimized for DL, and compilers for optimizing and scheduling the whole graph of DL models, which are important parts of the codesign process.
7 Acknowledgements
We would like to thank Caffe2, AML and Glow team members for help with collecting information and reviewing this study.
References
 [1] V Akhlaghi, Amir Yazdanbakhsh, Kambiz Samadi, H Esmaeilzadeh, and RK Gupta. Snapea: Predictive early activation for reducing computation in deep convolutional neural networks. In ACM/IEEE International Symposium on Computer Architecture (ISCA), 2018.
 [2] Dario Amodei and Danny Hernandez. AI and Compute. https://blog.openai.com/ai-and-compute/, 2018. OpenAI.
 [3] Anonymous. An empirical study of groupwise convolution for video classification networks. In Under review, 2018.
 [4] Maria Avgerinou, Paolo Bertoldi, and Luca Castellazzi. Trends in data centre energy consumption under the european code of conduct for data centre energy efficiency. Energies, 10(10):1470, 2017.
 [5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 [6] Baidu. DeepBench. https://svail.github.io/DeepBench/.
 [7] Alex Beutel, Paul Covington, Sagar Jain, Can Xu, Jia Li, Vince Gatto, and Ed H Chi. Latent Cross: Making Use of Context in Recurrent Recommender Systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 46–54. ACM, 2018.
 [8] Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 71–79. ACM, 2018.
 [9] Daniel Braun, Adrian Hernandez-Mendez, Florian Matthes, and Manfred Langen. Evaluating Natural Language Understanding Services for Conversational Question Answering Systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 174–185. Association for Computational Linguistics, 2017.
 [10] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: end-to-end compilation stack for deep learning. In SysML Conference, 2018.
 [11] Jianpeng Cheng, Siva Reddy, Vijay Saraswat, and Mirella Lapata. Learning structured natural language representations for semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 44–55, Vancouver, Canada, July 2017. Association for Computational Linguistics.
 [12] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 [13] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198. ACM, 2016.
 [14] William Dally. High-performance hardware for machine learning. NIPS Tutorial, 2015.
 [15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [16] Assaf Eisenman, Maxim Naumov, Darryl Gardner, Misha Smelyanskiy, Sergey Pupyrev, Kim Hazelwood, Asaf Cidon, and Sachin Katti. Bandana: Using non-volatile memory for storing deep learning models. arXiv:1811.05922v2, 2018.
 [17] Angela Fan, David Grangier, and Michael Auli. Controllable abstractive summarization. arXiv preprint arXiv:1711.05217, 2017.
 [18] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel, Adam Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steven K. Reinhardt, Adrian M. Caulfield, Eric S. Chung, and Doug Burger. A Configurable Cloud-Scale DNN Processor for Real-Time AI. In ACM/IEEE International Symposium on Computer Architecture (ISCA), 2018.
 [19] Evgeny Frolov and Ivan Oseledets. Tensor methods and recommender systems. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(3):e1201, 2017.
 [20] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
 [21] Google. Low-precision matrix multiplication. https://github.com/google/gemmlowp/.
 [22] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 [23] Alex Graves, Abdelrahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pages 6645–6649. IEEE, 2013.
 [24] Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281, 2017.
 [25] Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, Pieter Noordhuis, Misha Smelyanskiy, Liang Xiong, and Xiaodong Wang. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on, pages 620–629. IEEE, 2018.
 [26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [27] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pages 173–182. International World Wide Web Conferences Steering Committee, 2017.
 [28] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, and Joaquin Quiñonero Candela. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pages 1–9. ACM, 2014.
 [29] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [30] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [31] Intel 64 and IA32 Architectures Optimization Reference Manual. https://software.intel.com/enus/download/intel64andia32architecturesoptimization/referencemanual.
 [32] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558, 2016.
 [33] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. In ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 1–12. IEEE, 2017.
 [34] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, 2013.
 [35] Daya Khudia, Protonu Basu, and Summer Deng. Open-sourcing FBGEMM for state-of-the-art server-side inference. https://code.fb.com/ml-applications/fbgemm/, 2018. Facebook.
 [36] Daya Khudia, Protonu Basu, Summer Deng, Jianyu Huang, Haixin Liu, Jongsoo Park, and Mikhail Smelyanskiy. Facebook GEMM library. https://github.com/pytorch/fbgemm, 2018. Facebook.
 [37] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
 [38] Vlad Krasnov. On the dangers of Intel’s frequency scaling. https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/.
 [39] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.
 [40] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [41] Oleksii Kuchaiev and Boris Ginsburg. Training deep autoencoders for collaborative filtering. arXiv preprint arXiv:1708.01715, 2017.
 [42] Chris Leary and Todd Wang. XLA: TensorFlow, compiled. TensorFlow Dev Summit, 2017.
 [43] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.
 [44] Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr. WRPN: wide reduced-precision networks. arXiv preprint arXiv:1709.01134, 2017.
 [45] MLPerf. https://mlperf.org.
 [46] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, and Bing Xiang. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023, 2016.
 [47] Graham Neubig. Neural machine translation and sequence-to-sequence models: A tutorial. arXiv preprint arXiv:1703.01619, 2017.
 [48] NVIDIA. Turing architecture whitepaper. 2018.
 [49] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ACM/IEEE International Symposium on Computer Architecture (ISCA), 2017.
 [50] Eunhyuk Park, Dongyoung Kim, and Sungjoo Yoo. An Energy-Efficient Neural Network Accelerator based on Outlier-Aware Low Precision Computation. In ACM/IEEE International Symposium on Computer Architecture (ISCA), 2018.
 [51] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. Pytorch: Tensors and dynamic neural networks in python with strong gpu acceleration, 2017.
 [52] Holger Quast and Robert Bosch. Speech Dialogue Systems and Natural Language Processing, pages 67–106. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
 [53] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 [54] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
 [55] Steffen Rendle. Factorization Machines. 2010.
 [56] Andres Rodriguez, Eden Segal, Etay Meiri, Evarist Fomenko, Young Jin Kim, Haihao Shen, and Barukh Ziv. Lower Numerical Precision Deep Learning Inference and Training. https://software.intel.com/enus/articles/lowernumericalprecisiondeeplearninginference/andtraining.
 [57] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Satish Nadathur, Jakob Olesen, Jongsoo Park, Artem Rakhov, and Misha Smelyanskiy. Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907, 2018.
 [58] Holger Schwenk and Matthijs Douze. Learning joint multilingual sentence representations with neural machine translation. arXiv preprint arXiv:1704.04154, 2017.
 [59] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. Autorec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, pages 111–112. ACM, 2015.
 [60] Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation, pages 371–376, Berlin, Germany, August 2016. Association for Computational Linguistics.
 [61] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Joon Kyung Kim, Vikas Chandra, and Hadi Esmaeilzadeh. Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks. arXiv preprint arXiv:1712.01507, 2017.
 [62] Vijayalakshmi Srinivasan, Bruce Fleischer, Sunil Shukla, Matthew Ziegler, Joel Silberman, Jinwook Oh, Jungwook Choi, Silvia Mueller, Ankur Agrawal, Tina Babinsky, Nianzheng Cao, Chia-Yu Chen, Pierce Chuang, Thomas Fox, George Gristede, Michael Guillorn, Howard Haynie, Michael Klaiber, Dongsoo Lee, Shih-Hsien Lo, Gary Maier, Michael Scheuermann, Swagath Venkataramani, Christos Vezyrtzis, Naigang Wang, Fanchieh Yee, Ching Zhou, Pong-Fei Lu, Brian Curran, Leland Chang, and Kailash Gopalakrishnan. A Scalable Multi-TeraOPS Deep Learning Processor Core for AI Training and Inference. In Proceedings of VLSI Symposium, 2018.
 [63] Fei Sun. Benchmark Data Driven Codesign for Neural Network Applications. In Design Automation Conference, 2018.
 [64] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 [65] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
 [66] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
 [67] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730, 2018.
 [68] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
 [69] Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pages 2773–2781, 2015.
 [70] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, page 12. ACM, 2017.
 [71] WikiChip. Cascade Lake Microarchitecture. https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake.
 [72] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.
 [73] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J Smola, and How Jing. Recurrent recommender networks. In Proceedings of the tenth ACM international conference on web search and data mining, pages 495–503. ACM, 2017.
 [74] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
 [75] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. arXiv preprint arXiv:1707.01083, 2017.
 [76] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1059–1068. ACM, 2018.
 [77] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.