High Performance Scalable FPGA Accelerator for Deep Neural Networks
1 Abstract
Low precision is the first-order knob for achieving higher Artificial Intelligence Tera Operations per Second (AI-TOPS). However, the algorithmic space for sub-8-bit precision compute is diverse, with disruptive changes happening frequently, making FPGAs a natural choice for Deep Neural Network inference. In this work we present an FPGA-based accelerator for CNN inference acceleration. We use INT8-2 compute (with 8-bit activations and 2-bit weights), which has recently shown promise in the literature and which no known ASIC, CPU, or GPU natively supports today. Using a novel Adaptive Logic Module (ALM) based design, as a departure from traditional DSP-based designs, we achieve a measured performance of 5 AI-TOPS on the Arria10 and project a performance of 76 AI-TOPS at 0.7 TOPS/W for the Stratix10. This exceeds known CPU and GPU performance and comes close to the best known ASIC (TPU) numbers, while retaining the versatility of the FPGA platform for other applications.
2 Introduction
Deep Neural Networks (DNNs) have achieved state-of-the-art accuracy for a variety of inference tasks in domains such as object detection, image classification, and speech recognition. However, this comes at a very high cost in computation (~8 GFlop per image classification using ResNet50), memory footprint (~100 MB), and bandwidth [20]. One way to circumvent this problem is to perform low-precision compute [14, 12, 8, 9, 18], so much so that 8-bit compute has today become the mainstay for inference [16], yielding 4x or higher benefit over FP32. At the same time, research is pushing the boundaries even up to ternary (-1, 0, +1) and binary operations, thereby significantly boosting the AI-TOPS (Artificial Intelligence Tera Operations per Second), loosely defined as the amount of lowest-precision compute at which acceptably accurate inference can be performed by a machine.
However, as we stride into the 8-bit and sub-8-bit compute domain to leverage higher performance, we discover significantly higher diversity among the options for low-precision compute. These vary not only in the specific details of the algorithm used (for example, GEMMLOWP vs. Dynamic Fixed Point), but also in the type of data used (FP or INT) and the precision of activations and weights. Indeed, one finds across the literature almost all combinations of precision pairs for activations and weights (for example, 8-bit activations and 2-bit weights [14], or 5-bit activations and 4-bit weights). Moreover, the precision of operations depends on the application and sometimes varies even between different operations (or layers) in the application, leading to a hybrid-precision scenario. Given this diversity of low-precision inference for DNNs, a thorough investigation of FPGA’s for DNN inference is required, where the precision can be arbitrarily altered for activations and weights across applications and parts of an application. Not surprising, then, is the wide array of FPGA offerings for 8-bit and sub-8-bit inference [7, 5, 6]. It may also be noted that while ASIC’s will always have higher compute density than FPGA’s, the fast-evolving algorithmic landscape makes things challenging for ASIC’s, whereas FPGA’s can be used for multiple different (non-DNN) applications, thereby providing a much higher value proposition.
An interesting sub-8-bit design point emerges when we have integer 8-bit activations and ternary weights. Here, the activations can fit into the on-chip memory of many modern FPGA’s like the Arria10 [1]. For inference, we can swap between input and output buffers for consecutive layers, thereby needing a memory footprint of 2-3 buffers for storing the activations of one data point. At the same time, weights can stream from memory at almost the lowest achievable bandwidth (the lowest being achieved for binary weights), or can even reside in memory in certain cases. This design choice is further buttressed by competitive accuracy results for state-of-the-art Convolutional Neural Networks (like ResNet50) on the ImageNet-1k dataset using INT8-2 and fine grained quantization (FGQ) [14]. FGQ using INT8-2 achieves 4% lower Top-1 accuracy.
From an FPGA design perspective, an INT8-2 design provides an additional interesting opportunity. Since ternary weights are used, all MAC (Multiply-and-Accumulate) operations are essentially replaced by additions. This allows us to explore the use of Adaptive Logic Modules (ALM’s) for the MAC operations, in contrast to the traditional use of DSP’s for neural network compute. This is critical because FPGA’s typically have equivalent or lower density of DSP Flops [1] compared to CPU’s or GPU’s [2]. Moreover, the majority of an FPGA floorplan consists of ALM’s and not DSP’s, so leveraging ALM’s for DNN compute becomes critical to achieving competitive AI-TOPS with respect to CPU’s and GPU’s.
In this work we present an FPGA design for INT8-2 computation using the Fine Grained Quantization (FGQ) method. We design IP components which are tailored to the FGQ method to minimize algorithmic overheads, and a systolic array based Processing Element optimized for INT8-2. Data locality is exploited to maximize the efficiency of the design. We conduct our experiments with ResNet50. The main contributions of this work are:
To the best of our knowledge, this is the first work which uses ALM’s for end-to-end NN compute.
An FPGA design co-optimized for the low-precision algorithm.
A power-performance projection of 0.7 AI-TOPS/W for inference on the Stratix10, exceeding that of any currently available CPU or GPU and closely matching ASIC’s like the TPU (1.2 TOPS/W).
3 Related Work
Related work can be summarized into two main categories. The first category focuses on recent research that improves accuracy of DNN’s with low precision data types. The second focuses on efficient hardware implementation of low precision DNN’s.
3.1 Accuracy of Low Precision DNN
DNN inference with low-precision data types is a well-researched topic [14]. Many prior works have investigated low precision for weights while keeping activations at full precision [21, 13]. A stochastic binarization scheme was shown to achieve state-of-the-art (SOTA) accuracies on smaller data sets (MNIST, CIFAR10, SVHN) in [21]. Near-SOTA accuracy with the AlexNet topology on the larger ImageNet data set was shown in [13]. The above works retained activations at full precision. However, to realize the full power and performance benefit of low-precision weights, activations also need to be at lower precision. Researchers have also demonstrated that the use of 8-4 bits gives reasonably high accuracy compared to full precision [17].
Binary Neural Networks (BNN) investigate the use of 1-bit values for the weights, where weights are constrained to either +1 or -1 [12]. BNN’s have been shown to achieve SOTA accuracy on smaller data sets (e.g., CIFAR10). XNOR-Net, which uses binarized weights and activations, shows a top-1 accuracy drop of 12% for AlexNet [13]. Thus, achieving high accuracy with binary networks is still a challenge. Ternary Neural Networks (TNN) investigate the use of 2-bit values for weights, where weights are constrained to +1, -1, and 0. A Top-1 error rate of 25.6% has been reported for ResNet50 models trained on ImageNet with 32-bit activations and ternary weights [4]. Recently, a Top-1 error of 29.24% for ResNet50 with 8-bit activations and ternary weights was achieved with the FGQ technique [14]. As this is the highest accuracy reported on ResNet with ImageNet for 8-2 thus far, we use the models (FGQ) presented in that work to further fine-tune the accuracy for the ResNet network with ImageNet.
3.2 Hardware Implementation of Low Precision DNN
DNN’s can be efficiently mapped onto an FPGA to achieve high performance, as shown in various prior works [15, 11, 3]. Several prior works have implemented DNN’s on an FPGA with a precision width of 16/32 bits. As a result, they use DSP’s for the core computation, failing to make maximum use of the abundant logic resources available on modern FPGA’s. Very few works have evaluated DNN implementations for topologies like ResNet on a modern FPGA such as the Stratix 10. Stratix 10 performance for a ternary DNN on the ResNet network was shown in [7]. However, the activations there are still constrained to 32 bits, thus not fully utilizing the ALM’s available. A ResNet implementation on the Arria10 FPGA with 16-bit precision was presented in [20]. The maximum throughput that could be achieved by this design was only 0.28 TOP/s. This is the first work that implements 8-2 on an FPGA with dynamic fixed point support to achieve high performance for the ResNet topology.
4 Ternary ResNet Network
4.1 ResNet Topology Overview
Deep residual networks (ResNet) have shown superior classification accuracy with fewer model parameters compared to previous DNN models [10]. The ResNet network has become the benchmark in industry and academia for DNN’s, so an efficient implementation of the network on hardware is important. A typical ResNet topology consists of 16 different ResNet Modules (RM), each consisting of convolution, max-pooling, batch norm, scale, and ReLU layers, as shown in Figure 1. ResNet topologies have varying kernel dimensions across layers. They also have skip connections, where the left and right branches of the ResNet are merged into an element-wise layer for summation. These characteristics make the topology highly irregular compared to previous DNN models such as AlexNet and VGG. Therefore, it is highly challenging to implement the ResNet topology on an FPGA and achieve high performance with fixed hardware and memory resources.
To achieve high accuracy on the ImageNet data set using ResNet models, the first layers (Conv1, Pool1, BN1) and last layers (Pool5, FC, Softmax), shown in Figure 1, are kept at higher precision with 8-bit activations (a) and 8-bit weights (w). Our FPGA accelerator is designed to support only 8a-2w, to keep the design simple and extract maximum performance from the FPGA. Since the first and last layers work at higher precision, they are run on the CPU. The other layers, which are implemented on the FPGA, operate only on convolutions, with the batch norm and scale layers fused into the convolution filters.
4.2 ResNet50 Model Optimization
An FGQ-based technique was applied to the ResNet50 network to achieve high accuracy for 8-bit activations and ternary weights [14]. The fine-grained quantization technique divides the learned weights into disjoint blocks of size N and then ternarizes them. There is a scaling factor (alpha) associated with each block of size N. It is shown that for N = 64, 99% of MAC operations can be replaced by ternary accumulations. This results in a potential 15x improvement in performance [14]. For ResNet, the numbers of input and output channels scale by multiples of 64. Thus we have taken N = 64 in our work. However, during inference, the batch norm and scale parameters in ResNet layers can be fused into the alpha scaling parameter, which was not considered in [14]. The fusion of these layers simplifies the implementation of the network in hardware.
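Below, we outline how the fusion can be carried out with the FGQ technique in this work. Let $\circledast$ denote the ternary convolution operation. Let $x$ denote the quantized activations. After fusing BN and scaling parameters, for a given block $W^{(i)}$ of FP32 weights and for a given ofm, we scale the FP32 weights by $\beta$ with a bias of $b$. Then, fused FP32 weights become
$$\tilde{W}^{(i)} = \beta W^{(i)}.$$
We ternarize this as $\tilde{W}^{(i)} \approx \hat{\alpha}_i \hat{W}^{(i)}$, where $\hat{\alpha}_i$ is a quantized scaling factor and $\hat{W}^{(i)}_j \in \{-1, 0, +1\}$, $\forall j$. Then, the (partial) output of ternary convolution, for the given ofm, is
$$y_i = \hat{\alpha}_i \left( x \circledast \hat{W}^{(i)} \right),$$
and the full output, for the given ofm, is
$$y = \sum_i y_i + b.$$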
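The fusion and block-wise ternarization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold rule in `ternarize_block` (a fixed fraction of the mean absolute weight) is a common ternarization heuristic chosen here for brevity and is not the FGQ procedure of [14], the scaling factor is left unquantized, and all function and parameter names are hypothetical.

```python
from typing import List, Tuple

BLOCK = 64  # block size N used in this work


def ternarize_block(w: List[float], threshold_ratio: float = 0.7) -> Tuple[float, List[int]]:
    """Approximate one block as alpha * t with t[j] in {-1, 0, +1}.

    Assumption: weights below threshold_ratio * mean(|w|) become 0 and
    alpha is the mean magnitude of the surviving weights.
    """
    mean_abs = sum(abs(v) for v in w) / len(w)
    thr = threshold_ratio * mean_abs
    t = [0 if abs(v) <= thr else (1 if v > 0 else -1) for v in w]
    kept = [abs(v) for v, s in zip(w, t) if s != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return alpha, t


def fuse_and_ternarize(weights: List[float], bn_scale: float) -> List[Tuple[float, List[int]]]:
    """Fold the per-ofm BN/scale factor into the FP32 weights, then ternarize per block."""
    fused = [bn_scale * v for v in weights]
    return [ternarize_block(fused[i:i + BLOCK]) for i in range(0, len(fused), BLOCK)]


def ternary_output(acts: List[int], blocks: List[Tuple[float, List[int]]], bias: float) -> float:
    """Full output for one ofm: sum of scaled partial sums plus the fused bias."""
    y = bias
    for i, (alpha, t) in enumerate(blocks):
        x = acts[i * BLOCK:(i + 1) * BLOCK]
        # Inside a block only additions and subtractions are needed.
        partial = sum(xj if tj == 1 else -xj for xj, tj in zip(x, t) if tj != 0)
        y += alpha * partial  # one multiplication per block of 64 weights
    return y
```

The point of the sketch is the operation count: each block of 64 weights costs 64 additions or subtractions and a single multiplication by the block's scaling factor.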
5 Scalable FPGA DNN Architecture
5.1 Architecture Description
The FPGA architecture provides a highly configurable SIMD engine. SIMD structures exist in the DNN and can be mapped to the Processing Elements (PE) present in our architecture. The accelerator is a scalable SIMD engine, where the data flow is optimized to maximize the number of operations performed for each byte of data fetched. A given layer type in the ResNet topology can occur many times across the entire topology. The hardware modules must therefore be flexible enough to be reused across layers, which keeps hardware resource usage to a minimum.
We design a tile-based spatial architecture consisting of many tiles and PE’s to perform MAC operations, as shown in Figure 2. The numbers of tiles and PE’s are easily configurable in our design. The parallelism in convolution is extracted across many output feature maps; within one input feature map, the weights are reused across all the PE’s in a tile. Each PE works on a new set of input pixel points, and each tile produces a different output feature map. We have fixed the number of tiles at 64 and the number of PE’s per tile at 4 (based on the total compute that can fit into the chip).
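To make the tile/PE split concrete, the following Python sketch counts the MAC operations issued per cycle and maps each (tile, PE) pair to an (OFM, pixel) pair. The 64 x 4 x 64 configuration comes from the text; the order in which the schedule walks pixels and OFMs is an assumption made for illustration, and `assignment` is a hypothetical helper rather than a description of the control logic.

```python
import math

TILES = 64         # tiles in the design; each tile owns one OFM at a time
PES_PER_TILE = 4   # PEs per tile; each PE owns one output pixel at a time
DOT_WIDTH = 64     # multiply-accumulates per PE per cycle (dot64)


def macs_per_cycle(tiles=TILES, pes=PES_PER_TILE, dot=DOT_WIDTH):
    """MAC operations issued per cycle when every PE is busy."""
    return tiles * pes * dot


def assignment(step, num_pixels, tiles=TILES, pes=PES_PER_TILE):
    """Map (tile, pe) -> (ofm, pixel) for one step of a simple schedule.

    Assumed order (not from the paper): sweep all pixel groups for the
    current group of OFMs, then move on to the next group of OFMs.
    """
    pixel_groups = math.ceil(num_pixels / pes)
    ofm_base = (step // pixel_groups) * tiles
    pixel_base = (step % pixel_groups) * pes
    return {
        (t, p): (ofm_base + t, pixel_base + p)
        for t in range(tiles)
        for p in range(pes)
        if pixel_base + p < num_pixels
    }
```

With the default configuration, `macs_per_cycle()` evaluates to 64 x 4 x 64 = 16384, which matches the 16K MAC/cycle figure quoted later in the paper.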
Different hardware components are listed below:
Load Store Unit (LSU): Manages data fetch and save from/to system memory through PCIe Avalon interface. LSU handles multiple outstanding requests.
Tile: Each tile contains a 1D array of PE’s. The number of PE’s per tile is parameterizable.
PE: Mapped to ALM’s in the FPGA, it performs the core compute.
IRAM Buffer: Mapped to block RAMs in the FPGA; stores IFMs/inputs.
BSRAM Buffer: Mapped to block RAMs in the FPGA; stores kernels/weights.
SSRAM Buffer: Mapped to block RAMs in the FPGA; stores scaling values.
BBSRAM Buffer: Mapped to block RAMs in the FPGA; stores bias values.
ORAM Buffer: Mapped to block RAMs in the FPGA; stores OFMs/outputs.
Element_Wise Buffer: Mapped to block RAMs in the FPGA; used to bring in previously computed OFM data from memory for element-wise addition.
5.2 Core Execution Pipeline
Our core compute engine, shown in Figure 4, is designed as a deep pipeline (20 stages). The different computing modules present inside the PE are also shown in Figure 4. All the computing elements are mapped onto ALM’s.
Dot64 Engine: The tile layout of our architecture is shown in Figure 3. Each PE performs a dot64 compute. A single dot64 engine block performs MAC computations on 64 pixels of IFM and 64 weights, resulting in a 15-bit output. The ternary multiplication is executed with a LUT-based multiplier, without the need for actual multiplier circuitry. If a weight is -1, the computation reduces to negating the input value. Each PE in the tile produces one pixel of the output feature maps (OFMs). A single dot64 operation is optimized and mapped onto ALM’s. A single dot64 logic block uses 660 ALM’s and operates at 660MHz with high packing efficiency. Optimizing the PE logic design is critical, since peak performance is directly related to the number of PE logic blocks that we can fit in our tile-based logic.
Scaling Engine: The output of dot64 is multiplied by a 16-bit scaling weight, resulting in a 31-bit output.
Accumulator/Bias unit: Each partially computed output pixel is accumulated in a 32-bit accumulator, resulting in a 32-bit pixel output. The bias value is also added to each of the computed output pixel values. The partially computed pixel values are fed back from the accumulator into the accumulator engine in the next cycle until a full OFM is computed. After the full OFM is computed, the value is written into the OSRAM Buffer.
Our architecture is fully pipelined, so that we feed new data into the different compute elements of the PE every cycle. There are 4 read channels, which read 64 bytes of data for inputs, weights, scaling values, and bias. They store the read values into the internal memory. The data is then read from the internal memory and fed into the compute engine. Memory controllers are responsible for requesting new input data, filling the memory buffers, draining the completed sections of output, and generating requests to system memory to write the data.
We have one output write channel to write the computed OFM pixel values into memory. If the layer consists of 128 IFM’s and 64 OFM’s, the partial outputs for the first set of 64 IFM’s and the 64 OFM’s are computed and stored in the OSRAM Buffer. The next set of 64 IFM’s is then read from memory and passed into the compute engine. During the accumulator stage, we read the previously computed partial outputs (from the previous 64 IFM’s) from memory, accumulate them with the currently computed values, and store the result back into memory.
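The dot64, scaling, and accumulate/bias stages can be modelled functionally as follows. This is a behavioural sketch, not the RTL: activations and scaling values are assumed to be signed 8-bit and 16-bit integers, the 32-bit accumulator wrap-around is not modelled, and the helper names are hypothetical. The assertions check the 15-bit and 31-bit intermediate widths stated in the text.

```python
def dot64(acts, weights):
    """64-element ternary dot product using only add, subtract, or skip."""
    assert len(acts) == len(weights) == 64
    total = 0
    for a, w in zip(acts, weights):
        if w == 1:
            total += a
        elif w == -1:
            total -= a  # a weight of -1 negates the input
        # a weight of 0 contributes nothing
    return total


def fits_signed(value, bits):
    """True if value is representable as a signed integer of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def output_pixel(act_blocks, weight_blocks, scales, bias):
    """One output pixel: accumulate scaled dot64 results over IFM blocks of 64."""
    acc = 0
    for acts, weights, scale in zip(act_blocks, weight_blocks, scales):
        d = dot64(acts, weights)
        assert fits_signed(d, 15)       # dot64 output width
        scaled = d * scale              # 16-bit scaling weight
        assert fits_signed(scaled, 31)  # scaling engine output width
        acc += scaled                   # partial sum carried to the next block
    return acc + bias
```

A layer with 128 IFM's corresponds to two entries in `act_blocks`: the first pass leaves a partial sum in `acc`, and the second pass adds to it, which mirrors the read-accumulate-write sequence described above.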
The DFP data format used in this work consists of a combination of an integer, $I$, and a shared exponent, $E_s$. In this work, we use a single shared exponent (8 bits) per layer for weights and inputs. Once all the OFM values are computed, we down-convert the 32-bit output pixel values to 8 bits, as outlined below.
We find the absolute (abs) max from all the OFM values.
From the max value, we determine the shift value for down conversion. The shift value is determined by using a Leading Zero Count (LZC) detector, as shown in the expression given below. P represents the number of bits used by integer elements in the OFM.
$$R_s = P - \mathrm{LZC}\left(\max_{k} \left| I^{ofm}_k \right|\right) \qquad (1)$$
The same shift value will be used across all the OFM pixel points.
With the shift value determined ($R_s$), we right shift and down-convert to 8 bits ($I^{ofm}_k \gg R_s$). Then, we round depending on the values of the round and bias bits. The first two bits after the right shift are the round and bias bits. The remaining seven bits, along with the sign bit, form the down-converted value. If both the round and bias bits are set (neither is 0), we add 1 to the down-converted output. The activation exponent for the layer is computed by summing the current activation exponent, the weight exponent, and the down-conversion shift computed previously, as shown in Equation 1 and Figure 6. The total exponent value is passed on and becomes the activation exponent for the next layer.
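The down-conversion steps above can be sketched in Python as follows. Several details are assumptions rather than facts from the paper: the shift is computed as the number of significant bits of the maximum magnitude minus the 7 magnitude bits of the output, rounding is applied to the magnitude (sign-magnitude style), the "round and bias bits" are read as the two bits immediately below the kept bits, and a result that rounds past 127 is clamped. Function names are hypothetical.

```python
from typing import List, Tuple


def leading_zero_count(value: int, width: int = 32) -> int:
    """Leading zeros of a non-negative value held in a width-bit word."""
    return width - value.bit_length()


def downconvert(ofm: List[int], act_exp: int, w_exp: int,
                width: int = 32, out_bits: int = 8) -> Tuple[List[int], int]:
    """Down-convert 32-bit OFM values to 8 bits with one shared shift."""
    mag_bits = out_bits - 1                      # 7 magnitude bits plus a sign bit
    max_abs = max(abs(v) for v in ofm)           # step 1: absolute maximum
    significant = width - leading_zero_count(max_abs, width)
    shift = max(0, significant - mag_bits)       # step 2: shared shift value
    out = []
    for v in ofm:
        mag = abs(v)
        kept = mag >> shift
        # Two bits just below the kept bits (zero-filled when shift < 2).
        below = ((mag << 2) >> shift) & 0b11
        if below == 0b11:                        # both bits set: round up
            kept = min(kept + 1, (1 << mag_bits) - 1)
        out.append(kept if v >= 0 else -kept)
    # New shared exponent: activation exponent + weight exponent + shift.
    return out, act_exp + w_exp + shift
```

For example, with a maximum magnitude of 1000 (10 significant bits) the shared shift is 3, so every value in the OFM is shifted right by 3 and the layer's exponent grows by 3.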
The down-converted output is written back to memory through the LSU. We combine the first 8-bit pixel from each of the 64 OFM’s (from 64 tiles), which results in one cache line, and write that block of data to memory. The same is done for the remaining OFM pixels.
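The write-back layout can be illustrated with a small packing routine: one 8-bit pixel from each of the 64 OFM's forms a 64-byte (512-bit) line. The byte order within the line (tile 0 first) is an assumption, and the function names are hypothetical.

```python
import struct


def pack_cache_line(pixels):
    """Pack one signed 8-bit pixel from each of 64 OFMs into 64 bytes."""
    assert len(pixels) == 64
    return struct.pack("64b", *pixels)


def unpack_cache_line(line):
    """Inverse of pack_cache_line."""
    return list(struct.unpack("64b", line))


def to_cache_lines(ofms):
    """ofms[t][p] is pixel p of the OFM produced by tile t.

    Returns one cache line per pixel position, each holding that pixel
    from all 64 OFMs (interleaving along the channel dimension).
    """
    num_pixels = len(ofms[0])
    return [pack_cache_line([ofms[t][p] for t in range(64)])
            for p in range(num_pixels)]
```

Interleaving along the channel dimension means the next layer can fetch all 64 input channels of a pixel with a single 512-bit read.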
ResNet also has element-wise layers, where the left and right convolution branches are merged and added together to get an 8-bit value. At any moment in time, our accelerator processes either a left or a right branch of the ResNet topology. For example, the left branch convolutions are performed first, and the outputs are stored in memory. Next, the right branch convolutions are executed in the pipeline. The computed OFM is stored in the OSRAM Buffer (32 bits per value). We then read the first 32-bit pixel from each of the 64 tiles (32x64) and feed it into the down-conversion logic block. We obtain the 8-bit down-converted output as explained above (8x64), resulting in a total of 512 bits. When the element-wise layer is initiated, we read the previously computed values through the LSU into the buffer. At every cycle, we feed 512 bits (8x64) of data from the buffer into the element-wise logic unit.
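Adding two 8-bit DFP’s produces an 8-bit output and a new shared exponent. In the case of the element-wise layer, we add the outputs of two convolution layers ($I^{l}$/$I^{r}$) together to produce an 8-bit output. Since the left and right branches have different exponent values, we find the greater of the two exponents, shift the operand with the smaller exponent right by the difference, and then add the two in the element-wise unit block. This results in 8-bit values, as shown in Equation 2. $E_s^{l}$ and $E_s^{r}$ are the shared exponents of the left and right branch layers.
$$I^{l+r} = \begin{cases} I^{l} + \left(I^{r} \gg (E_s^{l} - E_s^{r})\right), & E_s^{l} > E_s^{r} \\ I^{r} + \left(I^{l} \gg (E_s^{r} - E_s^{l})\right), & \text{otherwise} \end{cases} \qquad E_s^{l+r} = \max(E_s^{l}, E_s^{r}) \qquad (2)$$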
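A minimal Python sketch of the element-wise DFP addition is given below. It aligns both operands to the larger shared exponent and adds them. Saturating the sum to the signed 8-bit range is an assumption made to keep the result in 8 bits; the paper does not state how overflow is handled. The function name is hypothetical.

```python
def dfp_add(left, e_left, right, e_right):
    """Add two DFP tensors (integer lists with one shared exponent each)."""
    e_out = max(e_left, e_right)          # new shared exponent
    shift_l = e_out - e_left              # 0 for the operand with the larger exponent
    shift_r = e_out - e_right
    out = []
    for a, b in zip(left, right):
        s = (a >> shift_l) + (b >> shift_r)   # arithmetic shifts on Python ints
        out.append(max(-128, min(127, s)))    # assumed saturation to 8 bits
    return out, e_out
```

As a sanity check on the arithmetic: 64 with exponent -3 represents 8.0 and 16 with exponent -5 represents 0.5; the aligned sum is 68 with exponent -3, that is, 8.5.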
6 Memory Organization/Layer Programming
Based on the requirements of each layer, we construct a memory layout for each of the inputs, weights, and scaling values for efficient data distribution into the different memory buffers. The LSU controller handles the movement of data between memory and the on-chip buffers.
BSRAM: The kernel memory layout is arranged in system memory by combining the 2-bit pixels from 64 weights. Each pixel in the kernel (3x3) is therefore 128 bits. The kernel data is then distributed to the 64 tiles. The BSRAM buffer holds the weight values. In our design with 64 tiles, there is a separate BSRAM buffer per tile to feed the data into the PE’s of that tile. The distribute logic in the BSRAM distributes the data read from memory into the individual BSRAM buffers present in each of the tiles. To reduce the load on the distribute logic, we divide it into four distribute logic blocks, each of which feeds a set of 16 tiles. Each BSRAM buffer has a controller, which initiates reads from and writes to the buffer.
SSRAM: The scaling memory layout is arranged in a similar manner to the BSRAM, with each scaling value being 16 bits. The SSRAM buffer holds the scaling values. There is a separate SSRAM buffer per tile to feed the scaling values into each of the PE’s present on the tile. The SSRAM distribute logic distributes the data read from memory to the individual SSRAM buffers; it is divided into two distribute logic blocks, each of which feeds a set of 32 tiles.
BBSRAM: The bias memory layout contains one 32-bit value per OFM. The BBSRAM buffer holds the bias values. We add them to each of the computed OFM pixel points.
ISRAM: The ISRAM memory layout is organized in system memory by combining the 8-bit pixels from the 64 IFM’s (combined along the z-depth). Every pixel in our modified input image is therefore 512 bits. Internally, the 512 bits are read and distributed to each of the PE’s on the tiles. The ISRAM logic consists of 1 bank per PE (4 PE’s per tile); the ISRAM controller then feeds the data needed for computation into the corresponding PE based on the stride information.
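As an illustration of the BSRAM layout, the sketch below packs 64 ternary weights into a single 128-bit word, two bits per weight. The 2-bit encoding (two's complement: 00 for 0, 01 for +1, 11 for -1) and the placement of weight 0 in the least significant bits are assumptions; the paper only states that each weight occupies 2 bits. The names are hypothetical.

```python
# Assumed 2-bit two's complement encoding of a ternary weight.
ENCODE = {0: 0b00, 1: 0b01, -1: 0b11}
DECODE = {code: w for w, code in ENCODE.items()}


def pack_weights(weights):
    """Pack 64 ternary weights into one 128-bit integer (weight 0 in the LSBs)."""
    assert len(weights) == 64
    word = 0
    for i, w in enumerate(weights):
        word |= ENCODE[w] << (2 * i)
    return word


def unpack_weights(word):
    """Recover the 64 ternary weights from a 128-bit word."""
    return [DECODE[(word >> (2 * i)) & 0b11] for i in range(64)]
```

The same idea applies to the ISRAM layout with 8 bits per element, where 64 activations form a 512-bit word.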
ResNet contains a sequence of layers that need to be executed sequentially in our hardware. Based on the topology of the particular layer, our accelerator contains a set of registers that need to be programmed for DNN execution. Our register programming is divided into 2 parts:
1. Program the core registers: Provide the layer topology information, such as input and output feature map dimensions, kernel size, stride, number of channels, and number of tiles, which is stored in the configuration registers.
2. Program the LSU registers: Provide the data fetch information (the size of the data that needs to be fetched and the base address) through these registers. This set of registers specifies the quantity of IFM and kernel data to be fetched from memory, and the OFM data to be written to memory.
The control logic keeps track of the current number of executed layers, and loads the corresponding values into the registers for each layer upon completion of the current layer (transfer of data from layer 1 to layer 2).
7 Experimental Results
Our proposed DNN hardware accelerator for the ResNet50 topology was implemented on two of Intel’s state-of-the-art FPGA’s, the Arria10 GX1150 and the Stratix10. The complete DNN accelerator was written in RTL. We used the Intel Quartus Prime software and the EPE tool to estimate performance and power. For the Stratix10, we used the Quartus Early Beta Release to evaluate our network.
We retrained the ResNet50 ternary model using the fine-tuning method described in [14]. We optimized this model to suit the design parameters of the hardware and fused the batch norm parameters into the convolution layers. We validated the accuracy of the fused model using a modified version of Caffe with support for emulated ternary operations and achieved a top-1 accuracy of 71.1% for ResNet50.
We began our experiment by running convolutions with 8-bit activations and ternary weights on the Arria10 FPGA. The first and last layers of ResNet50 are executed on the CPU (8-8); they account for only 4.3% of the total flops (3.8 GMACs). The Altera Arria10 GX1150 contains 427K ALM’s and 2713 M20K RAM blocks. The different RAM blocks in our design, shown in Table 2, were mapped to M20K blocks in the FPGA. Table 1 shows the FPGA resource utilization for implementing a ResNet network on the Arria10 FPGA. Total ALM usage was 83% for mapping the entire ResNet with DFP support onto the Arria10. In Table 3, we provide the breakdown of ALM usage for one instance of each of the different logic/memory modules present in the design. Some of the modules were instantiated many times in the design.
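Table 1: FPGA resource utilization for ResNet on the Arria10.

| Precision | 8-2 |
| --- | --- |
| ALM usage | 83% |
| DSP usage | 0 |
| Freq (MHz) | 200 |
| Accuracy (top-1) | 71.10% |
| RAM (M20K) | 19% |
| Dot operation | 64 |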
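Table 2: RAM blocks in the design.

| Buffer | Dimension | Instances | |
| --- | --- | --- | --- |
| IRAM | 512 x 128 | 8 | 0.524 |
| BSRAM | 128 x 128 | 64 | 0.13 |
| ORAM | 128 x 1028 | 64 | 1.29 |
| BBSRAM | 32 x 128 | 64 | 0.2445 |
| ElementWise | 512 x 128 | 1 | 0.065 |
| SSRAM | 16 x 128 | 64 | 0.0163 |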
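Table 3: ALM usage for one instance of each logic/memory module.

| Modules | ALM Usage |
| --- | --- |
| Dot64 | 660 |
| Load Unit | 2143 |
| Store Unit | 342 |
| DownConversion | 108 |
| ElementWise | 36 |
| Memory_ctrl | 4476 |
| Memory_dist.logic | 7510 |
| Max_logic | 242 |
| Avalon_interface | 100 |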
We compared the performance obtained for ResNet50 on the Arria10 and the Stratix10. We believe our design can be further optimized, and thus we also make performance projections with more aggressive frequency targets (300MHz and 400MHz). Figures 7 and 8 show the TOP/s and GOP/s/Watt obtained for ResNet50 on the Arria10. Our design obtained 5 TOP/s on ResNet50. This is the highest performance number reported on the Arria10 for ResNet50. With an aggressive frequency target of 400MHz, which we believe is achievable, we can obtain a performance of 10 TOP/s.
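For context, the theoretical peak implied by the architecture parameters can be computed directly. The sketch below assumes the common convention of counting one MAC as two operations, and takes the 64 x 4 x 64 configuration and the 200MHz clock from earlier sections; `peak_tops` is a hypothetical helper, and the ratio to the reported numbers is only an arithmetic observation, not a claim about where the remaining cycles go.

```python
def peak_tops(freq_mhz, tiles=64, pes_per_tile=4, dot_width=64, ops_per_mac=2):
    """Theoretical peak in TOP/s if every PE issues a full dot64 every cycle."""
    macs_per_cycle = tiles * pes_per_tile * dot_width
    return ops_per_mac * macs_per_cycle * freq_mhz * 1e6 / 1e12


def utilization(measured_tops, freq_mhz):
    """Fraction of the theoretical peak represented by a reported number."""
    return measured_tops / peak_tops(freq_mhz)
```

Under these assumptions the peak is about 6.55 TOP/s at 200MHz and about 13.1 TOP/s at 400MHz, so the reported 5 TOP/s and projected 10 TOP/s both correspond to roughly 76% of peak.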
Stratix10 results are obtained for the S10_SG280 part (933K ALM’s) using Quartus tools. By studying performance on the lower-end S10 part, we project performance on the Stratix10’s highest-end part (S10_SG550), which has 1.8M ALM’s. The Stratix10 is an advanced FPGA device with HyperFlex support (HyperFlex can enable frequencies > 600MHz). On the S10 (SG550), we project a performance of 76 TOP/s and 0.78 TOP/s/Watt for ResNet50, as shown in Figures 9 and 10.
There has been extensive research on DNN implementation on FPGA’s. However, very few works have shown FPGA performance on state-of-the-art topologies, like ResNet50, with ternary weights and on the latest Intel FPGA’s. We compare our S10_SG280 performance estimates for ResNet50 with the results reported in [7] for the same S10 part. The ternary ResNet reported in [7] uses 2-bit weights and 32-bit activations, which differs from the 8-2 scheme presented in this work. Our S10 performance numbers show a >4x improvement in TOP/s compared to [7], as shown in Figure 11. High performance is achievable because a lower precision width uses less logic, which frees up space to pack more compute into our design. We also extract more performance from our design by fine-tuning the dot module without using any DSP’s. This results in lower ALMs/op and achieves high packing efficiency.
We also compare the TOP/s obtained for ResNet50 on the S10/A10 with other architectures, such as the Arria10 and the Xilinx Zynq Z7045 SoC, in Figure 11. A ResNet50 implementation on the Arria10 with 16-bit precision (A10 [20]) was shown to achieve 0.315 TOP/s [20]. Their entire design used 69% of the DSP’s and 30% of the ALM’s. By lowering the precision, we are able to use ALM’s effectively in our design and achieve more than a 15x performance improvement on the A10 compared to [20]. A recent ResNet50 implementation on the Xilinx Zynq Z7045 was shown to achieve 0.128 TOP/s [19]. That design used 16-bit fixed precision and can execute 256 MAC/cycle. Our design can perform 16K MAC/cycle, and our S10/A10 performance outdoes the Xilinx implementation comprehensively, as shown in Figure 11.
We also compare peak TOP/s obtained with the Titan X GPU for Ternary ResNet50 (32(a)2(w)) in [7]. The S10 (SG280) outperforms the Titan X GPU by more than 5x in performance. GPU’s are not yet designed to effectively implement low precision operations and thus FPGA’s outperform the GPU in performance achieved.
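Table 4: Recent NVIDIA GPU offerings for inferencing.

| GPU | TOP/s | Watt | Precision | TOP/s/Watt |
| --- | --- | --- | --- | --- |
| V100 | 125 | 300 | FP16 | 0.416667 |
| P4 | 22 | 75 | INT8 | 0.293333 |
| P40 | 47 | 250 | INT8 | 0.188 |
| P100 | 18.7 | 250 | FP16 | 0.0748 |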
Various recent NVIDIA GPU offerings for inferencing are shown in Table 4. Our A10 and S10 (SG280/SG550) platforms can achieve better performance/watt for ResNet50 than the theoretical TOP/s/Watt offered by these GPU devices.
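The last column of Table 4 is simply peak TOP/s divided by board power, which the following sketch reproduces and compares against the 0.78 TOP/s/Watt projected above for the S10 (SG550). The GPU figures are copied from Table 4; the dictionary layout and helper names are illustrative only.

```python
# (peak TOP/s, power in W) copied from Table 4
GPUS = {
    "V100": (125, 300),
    "P4": (22, 75),
    "P40": (47, 250),
    "P100": (18.7, 250),
}

S10_SG550_PROJECTED = 0.78  # TOP/s/Watt projected in this work for ResNet50


def tops_per_watt(tops, watts):
    """Efficiency as peak throughput divided by power."""
    return tops / watts


def gpus_below(threshold):
    """Names of GPUs whose theoretical TOP/s/Watt is below the threshold."""
    return sorted(name for name, (tops, watts) in GPUS.items()
                  if tops_per_watt(tops, watts) < threshold)
```

Note that the GPU numbers are theoretical peaks at each device's native precision, whereas the FPGA number is a projection for ResNet50, so the comparison is indicative rather than like for like.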
8 Conclusions
In this paper, we have designed a high-performance DNN accelerator for low-precision deep learning inference. There is a trade-off between accommodating more compute in hardware using low-precision data types and retaining accuracy compared to full precision. In this work, we fused a ResNet50 network and applied the FGQ technique to ternarize the weights. We achieved a top-1 accuracy of 71.1%. We have efficiently implemented a ResNet50 network with 8-bit activations and ternary weights using only ALM’s on the FPGA. We have optimized our core compute engine such that we can efficiently pack more compute into an FPGA. Our design is capable of performing 16K MAC operations per cycle. Our experimental results indicate that the performance achieved by our design on the Arria10 and Stratix10 for ResNet50 can outperform other hardware implementations of ResNet50 on FPGA’s, such as the Arria10 and Xilinx devices, as well as state-of-the-art GPU’s.
References
 [1] Altera Arria10 website. https://www.altera.com/products/fpga/arriaseries/arria10/overview.html.
 [2] NVIDIA GPU for inferencing. http://www.nvidia.com/object/accelerateinference.html.
 [3] Chen Zhang et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. FPGA, 2015.
 [4] Chenzhuo Zhu et al. Trained ternary quantization. CoRR, 2016.
 [5] D. J. M. Moss et al. High performance binary neural networks on the Xeon+FPGA platform. In FPL, 2017.
 [6] E. Nurvitadhi et al. Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC. 2016.
 [7] Eriko Nurvitadhi et al. Can FPGAs beat GPUs in accelerating next-generation deep neural networks? FPGA, 2017.
 [8] Itay Hubara et al. Quantized neural networks: Training neural networks with low precision weights and activations. CoRR, 2016.
 [9] Kaiming He et al. Deep residual learning for image recognition. CoRR, 2015.
 [10] Kaiming He et al. Deep residual learning for image recognition. CoRR, 2015.
 [11] M. Motamedi et al. Design space exploration of FPGA-based deep convolutional neural networks. In ASP-DAC, 2016.
 [12] Matthieu Courbariaux et al. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, 2016.
 [13] Mohammad Rastegari et al. XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR, 2016.
 [14] Naveen Mellempudi et al. Ternary neural networks with fine-grained quantization. 2017.
 [15] Naveen Suda et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. FPGA, 2016.
 [16] Norman P. Jouppi et al. In-datacenter performance analysis of a tensor processing unit. CoRR, 2017.
 [17] Philipp Gysel et al. Hardware-oriented approximation of convolutional neural networks. CoRR, 2016.
 [18] Suyog Gupta et al. Deep learning with limited numerical precision. CoRR, 2015.
 [19] Vinayak Gokhale et al. Snowflake: A model agnostic accelerator for deep convolutional neural networks. CoRR, 2017.
 [20] Y. Ma et al. End-to-end scalable FPGA accelerator for deep residual networks. In 2017 IEEE ISCAS, 2017.
 [21] Zhouhan Lin et al. Neural networks with few multiplications. CoRR, 2015.