High Performance Scalable FPGA Accelerator for Deep Neural Networks

Sudarshan Srinivasan    Pradeep Janedula    Saurabh Dhoble    Sasikanth Avancha    Dipankar Das   
Naveen Mellempudi
   Bharat Daga    Martin Langhammer    Gregg Baeckler    Bharat Kaul
Intel Corporation
{sudarshan.srinivasan}@intel.com

1 Abstract

Low precision is the first-order knob for achieving higher Artificial Intelligence Operations (AI-TOPS). However, the algorithmic space for sub-8-bit compute is diverse, with disruptive changes happening frequently, making FPGAs a natural choice for Deep Neural Network inference. In this work we present an FPGA-based accelerator for CNN inference acceleration. We use INT-8-2 compute (8-bit activations and 2-bit weights), which has recently shown promise in the literature and which no known ASIC, CPU, or GPU natively supports today. Using a novel Adaptive Logic Module (ALM) based design, as a departure from traditional DSP-based designs, we achieve a measured performance of 5 AI-TOPS on the Arria10 and project 76 AI-TOPS at 0.7 TOPS/W on the Stratix10. This exceeds known CPU and GPU performance and comes close to the best known ASIC (TPU) numbers, while retaining the versatility of the FPGA platform for other applications.

2 Introduction

Deep Neural Networks (DNNs) have achieved state-of-the-art accuracy for a variety of inference tasks in domains such as object detection, image classification, and speech recognition. However, this is achieved at a very high cost in computation (~8 GFLOPs to classify one image with ResNet50), memory footprint (~100 MB), and bandwidth [20]. One way to circumvent this problem is to perform low-precision compute [14, 12, 8, 9, 18], so much so that 8-bit compute has today become the mainstay for inference [16], yielding a 4x or higher benefit over FP32. At the same time, research is pushing the boundaries even to ternary (-1, 0, +1) and binary operations, thereby significantly boosting the AI-TOPS (Artificial Intelligence Tera Operations per second), loosely defined as the amount of lowest-precision compute at which acceptably accurate inference can be performed by a machine.

However, as we stride into the 8-bit and sub-8-bit compute domain to leverage higher performance, we discover significantly higher diversity amongst the options for low-precision compute. These vary not only in the specific details of the algorithm used (for example, GEMMLOWP vs. Dynamic Fixed Point), but also in the type of data used (FP or INT) and in the precision of activations and weights. Indeed, one finds across the literature almost all combinations of precision pairs for activations and weights (for example, 8-bit activations with 2-bit weights [14], or 5-bit activations with 4-bit weights). Moreover, the precision of operations depends on the application and sometimes varies even between different operations (or layers) within an application, leading to hybrid-precision scenarios. Given this diversity of low-precision inference for DNNs, a thorough investigation of FPGAs for DNN inference is warranted, since the precision can be arbitrarily altered for activations and weights across applications and across parts of an application. Not surprising, then, is the wide array of FPGA offerings for 8-bit and sub-8-bit inference [7, 5, 6]. It may also be noted that while ASICs will always have higher compute density than FPGAs, the fast-evolving algorithmic landscape makes things challenging for ASICs, while FPGAs can additionally be used for multiple different (non-DNN) applications, thereby providing a much higher value proposition.

An interesting sub-8-bit design point emerges when we have integer 8-bit activations and ternary weights. Here, the activations can fit into the on-chip memory of many modern FPGAs like the Arria-10 [1]. For inference, we can swap between input and output buffers for consecutive layers, thereby needing a memory footprint of only 2-3 buffers for storing the activations of one data point. At the same time, weights can stream from memory at almost the lowest achievable bandwidth (lowest for binary weights), or can even reside in on-chip memory in certain cases. This design choice is further buttressed by competitive accuracy results for state-of-the-art Convolutional Neural Networks (like ResNet50) on the ImageNet-1k dataset using INT-8-2 and fine-grained quantization (FGQ) [14]: FGQ with INT-8-2 achieves Top-1 accuracy only about 4% lower than the full-precision baseline.

From an FPGA design perspective, an INT-8-2 design provides an additional interesting opportunity. Since ternary weights are used, all MAC (Multiply-and-Accumulate) operations are essentially replaced by additions. This allows us to explore the use of Adaptive Logic Modules (ALMs) for the MAC operations, in contrast to the traditional use of DSPs for neural network compute. This is critical because FPGAs typically have an equivalent or lower density of DSP FLOPs [1] compared to CPUs or GPUs [2]. Moreover, a majority of an FPGA floorplan consists of ALMs rather than DSPs, so leveraging ALMs for DNN compute becomes critical to achieving AI-TOPS competitive with CPUs and GPUs.

In this work we present an FPGA design for INT-8-2 computation using the Fine Grained Quantization (FGQ) method. We design IP components tailored to the FGQ method to minimize algorithmic overheads, and a systolic-array-based Processing Element optimized for INT-8-2. Data locality is exploited to maximize the efficiency of the design. We conduct our experiments with ResNet50. The main contributions of this work are:

  • To the best of our knowledge, the first work that uses ALMs for end-to-end NN compute.

  • An FPGA design co-optimized for the low-precision algorithm.

  • A power-performance projection of 0.7 AI-TOPS/W for inference on the Stratix-10, exceeding that of any currently available CPU or GPU and closely matching ASICs like the TPU (1.2 TOPS/W).

3 Related Work

Related work can be summarized into two main categories. The first focuses on recent research that improves the accuracy of DNNs with low-precision data types; the second focuses on efficient hardware implementations of low-precision DNNs.

3.1 Accuracy of Low Precision DNN

DNN inference with low-precision data types is a well-researched topic [14]. Many prior works have investigated low precision for weights while keeping activations at full precision [21, 13]. A stochastic binarization scheme was shown to achieve state-of-the-art (SOTA) accuracy on smaller data sets (MNIST, CIFAR10, SVHN) in [21]. Near-SOTA accuracy with the AlexNet topology on the larger ImageNet data set was shown in [13]. The above works retained activations at full precision; however, to realize the full power and performance benefit of low-precision weights, activations also need to be at lower precision. Researchers have also demonstrated that the use of 8-4 bits achieves reasonably high accuracy compared to full precision [17].

Binary Neural Networks (BNNs) investigate the use of 1-bit weights, constrained to either +1 or -1 [12]. BNNs have been shown to achieve SOTA accuracy on smaller data sets (e.g., CIFAR10). XNOR-Net, which uses binarized weights and activations, suffers a Top-1 accuracy drop of 12% for AlexNet [13]. Thus, achieving high accuracy with binary networks is still a challenge. Ternary Neural Networks (TNNs) investigate the use of 2-bit weights, constrained to +1, -1, and 0. A Top-1 error rate of 25.6% has been reported for ResNet-50 models trained on ImageNet with 32-bit activations and ternary weights [4]. Recently, a Top-1 error of 29.24% for ResNet-50 with 8-bit activations and ternary weights was achieved with the FGQ technique [14]. As this is the highest accuracy reported on ResNet with ImageNet for 8-2 thus far, we use the FGQ models presented in that work and further fine-tune the accuracy of the ResNet network on ImageNet.

3.2 Hardware Implementation of Low Precision DNN

DNNs can be efficiently mapped onto an FPGA to achieve high performance, as shown in various prior works [15, 11, 3]. Several prior works have implemented DNNs on an FPGA with precision widths of 16/32 bits. As a result, they use DSPs for the core computation, failing to make maximum use of the abundant logic resources available on modern FPGAs. Very few works have evaluated implementations of topologies like ResNet on modern FPGAs such as the Stratix 10. Stratix 10 performance for a ternary DNN on the ResNet network was shown in [7]; however, the activations are still 32 bits wide, thus not fully utilizing the available ALMs. A ResNet implementation on the Arria10 FPGA with 16-bit precision was presented in [20], and the maximum performance achieved by that design was only 0.28 TOP/s. To the best of our knowledge, ours is the first work that implements 8-2 on an FPGA with dynamic fixed point support to achieve high performance for the ResNet topology.

4 Ternary ResNet Network

4.1 ResNet Topology Overview

Figure 1: ResNet 50 topology. RM denotes ResNet module.

Figure 2: Overall system design of our FPGA accelerator, containing various computing modules and memory elements.

Deep residual networks (ResNet) have shown superior classification accuracy with fewer model parameters compared to previous DNN models [10]. The ResNet network has become a benchmark in industry and academia for DNNs, so efficient implementation of the network on hardware is important. A typical ResNet topology consists of 16 different ResNet Modules (RM), each consisting of convolution, max-pooling, batch norm, scale, and ReLU layers, as shown in Figure 1. ResNet topologies have varying kernel dimensions across layers. They also have skip connections, where the left and right branches of the ResNet are merged into an element-wise layer for summation. These characteristics make the topology highly irregular compared to previous DNN models such as AlexNet and VGG. It is therefore highly challenging to implement the ResNet topology on an FPGA and achieve high performance with fixed hardware and memory resources.

To achieve high accuracy on the ImageNet data set using ResNet models, the first layers (Conv1, Pool1, BN1) and the last layers (Pool5, FC, Softmax), shown in Figure 1, are run at higher precision with 8-bit activations (a) and 8-bit weights (w). Our FPGA accelerator is designed to support only 8a-2w, to keep the design simple and extract maximum performance from the FPGA. Since the first and last layers work at higher precision, they are run on the CPU. The layers implemented on the FPGA are thus all convolutions, with the batch-norm and scale layers fused into the convolution filters.

4.2 ResNet-50 Model Optimization

An FGQ-based technique was applied to the ResNet 50 network to achieve high accuracy with 8-bit activations and ternary weights [14]. The fine-grained quantization technique divides the learned weights into sets of disjoint blocks of size N and then ternarizes them, with a scaling factor (alpha) associated with each block. It is shown that for N = 64, 99% of MAC operations can be replaced by ternary accumulations, resulting in a potential 15x improvement in performance [14]. For ResNet, the numbers of input and output channels are multiples of 64, so we take N = 64 in this work. However, during inference, the batch-norm and scale parameters of the ResNet layers can be fused into the alpha scaling parameter, which was not considered in [14]. Fusing these layers simplifies the implementation of the network on hardware.
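To make the block-wise scheme concrete, the following Python sketch ternarizes one block of N = 64 FP32 weights with a single scaling factor alpha. The threshold heuristic and the choice of alpha shown here are illustrative assumptions, not the exact FGQ procedure of [14].

```python
import numpy as np

def ternarize_block(w_fp32, threshold_ratio=0.7):
    """Ternarize one block of FP32 weights to {-1, 0, +1} with one scale alpha.

    threshold_ratio is an illustrative heuristic (not the exact FGQ rule):
    weights whose magnitude falls below threshold_ratio * mean(|w|) map to 0.
    """
    t = threshold_ratio * np.mean(np.abs(w_fp32))              # ternarization threshold
    w_tern = np.where(np.abs(w_fp32) > t, np.sign(w_fp32), 0.0)
    # One positive scaling factor per block: average magnitude of the kept weights.
    nnz = np.count_nonzero(w_tern)
    alpha = np.sum(np.abs(w_fp32[w_tern != 0])) / nnz if nnz else 0.0
    return alpha, w_tern.astype(np.int8)

# A block of N = 64 weights, matching the block size used in this work.
block = np.random.randn(64).astype(np.float32)
alpha, w_tern = ternarize_block(block)
print(alpha, w_tern[:8])
```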

Below, we outline how the fusion can be carried out with the FGQ technique in this work. Let $\otimes$ denote the ternary convolution operation and let $\hat{a}$ denote the quantized activations. After fusing the BN and scale parameters (with per-OFM mean $\mu$, standard deviation $\sigma$, scale $\gamma$, and shift $\beta$), for a given block of FP32 weights $W$ and for a given OFM, we scale the FP32 weights by $\gamma/\sigma$ with a bias of $b = \beta - \gamma\mu/\sigma$. The fused FP32 weights then become

$$W' = \frac{\gamma}{\sigma}\, W.$$

We ternarize this as $W' \approx \alpha \widehat{W}$, where $\alpha > 0$ is a quantized scaling factor and $\widehat{W} \in \{-1, 0, +1\}^{N}$ with $N = 64$. Then, the (partial) output of the ternary convolution, for the given OFM, is

$$\alpha \left(\widehat{W} \otimes \hat{a}\right),$$

and the full output, for the given OFM, is

$$\sum_{i} \alpha_i \left(\widehat{W}_i \otimes \hat{a}_i\right) + b,$$

where the sum runs over the weight blocks contributing to that OFM.
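A minimal Python sketch of this fusion, assuming per-OFM batch-norm statistics (mu, sigma) and scale/shift parameters (gamma, beta); the function and parameter names are ours, not the actual implementation. The fused weights would then be ternarized block-wise exactly as in the previous sketch.

```python
import numpy as np

def fuse_bn_scale(w_fp32, gamma, beta, mu, sigma, eps=1e-5):
    """Fold the batch-norm/scale parameters of one OFM into its weights and bias.
    After fusion, the BN math disappears into the per-block alpha scaling
    factors and a single per-OFM bias."""
    s = gamma / np.sqrt(sigma**2 + eps)   # per-OFM multiplicative factor gamma/sigma
    w_fused = s * w_fp32                  # W' = (gamma / sigma) * W
    bias = beta - s * mu                  # b  = beta - gamma * mu / sigma
    return w_fused, bias
```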

5 Scalable FPGA DNN Architecture

5.1 Architecture Description

Figure 3: Tile layout of our architecture, containing 4 PEs per tile, with weights and inputs fed into each PE.

The FPGA architecture we designed provides a highly configurable SIMD engine. SIMD structures exist in the DNN and can be mapped onto the Processing Elements (PEs) in our architecture. The accelerator is a scalable SIMD engine whose data flow is optimized to maximize the number of operations performed for each byte of data fetched. A layer type in the ResNet topology can occur many times across the entire network, so the hardware modules must be flexible enough to be re-used across layers, keeping the required hardware resources to a minimum.

We design a tile-based spatial architecture consisting of many tiles and PEs to perform MAC operations, as shown in Figure 2. The numbers of tiles and PEs are easily configurable in our design. The parallelism in convolution is extracted across output feature maps: each tile produces a different output feature map, so within a tile the weights are re-used across all of its PEs, while each PE works on a different set of input pixel positions. We fix the number of tiles to 64 and the number of PEs per tile to 4, based on the total compute that fits on the chip.
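The mapping can be summarized by the loop-nest sketch below (Python with NumPy): tiles parallelize over output feature maps while the PEs within a tile parallelize over output pixels, sharing the tile's weight block. The function and array names are illustrative, not the RTL interface.

```python
import numpy as np

NUM_TILES, PES_PER_TILE, BLOCK = 64, 4, 64

def scaled_dot64(acts, w_tern, alpha_q):
    """One dot64 followed by the scaling stage: 64 int8 activations against
    64 ternary weights, multiplied by a 16-bit quantized scale."""
    return int(np.dot(acts.astype(np.int32), w_tern.astype(np.int32))) * int(alpha_q)

def tile_array_step(ifm_pixels, weights, alphas, partial):
    """One step of the array: every tile owns one OFM, every PE one output pixel.

    ifm_pixels : [PES_PER_TILE, BLOCK] int8  -- one 64-deep activation block per PE
    weights    : [NUM_TILES, BLOCK]    int8  -- one ternary block per OFM (per tile)
    alphas     : [NUM_TILES]                 -- per-block quantized scaling factors
    partial    : [NUM_TILES, PES_PER_TILE]   -- running partial sums
    """
    for t in range(NUM_TILES):            # tiles parallelize over OFMs (shared inputs)
        for p in range(PES_PER_TILE):     # PEs parallelize over output pixels (shared weights)
            partial[t][p] += scaled_dot64(ifm_pixels[p], weights[t], alphas[t])
    return partial
```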

Different hardware components are listed below:

Load Store Unit (LSU): Manages data fetches and stores from/to system memory through the PCIe Avalon interface. The LSU handles multiple outstanding requests.

Tile: Each tile contains a 1-D array of PEs; the array size is parameterizable.

PE: Mapped to ALMs in the FPGA; it performs the core compute.

IRAM Buffer: Mapped to block RAMs in the FPGA; used to store IFMs/inputs.

BSRAM Buffer: Mapped to block RAMs in the FPGA; used to store kernels/weights.

SSRAM Buffer: Mapped to block RAMs in the FPGA; used to store scaling values.

BBSRAM Buffer: Mapped to block RAMs in the FPGA; used to store bias values.

ORAM Buffer: Mapped to block RAMs in the FPGA; used to store OFMs/outputs.

Element_Wise Buffer: Mapped to block RAMs in the FPGA; used to bring in previously computed OFM data from memory for element-wise addition.

Figure 4: Core compute engine pipeline

5.2 Core Execution Pipeline

Our core compute engine pipeline, shown in Figure 4, is designed with 20 deep pipeline stages. The different compute modules inside the PE are also shown in Figure 4. All the compute elements are mapped onto ALMs.

Dot64 Engine - The tile layout of our architecture is shown in Figure 3. Each PE performs a dot64 compute. A single dot64 engine block performs MAC computations on 64 pixels of IFM and 64 pixels of weights, producing a 15-bit output. The ternary multiplication is executed with a LUT-based multiplier, without the need for actual multiplier circuitry: if a weight is -1, the computation simplifies to negating the input value. Each PE in a tile produces one pixel of the output feature map (OFM). A single dot64 operation is optimized and mapped onto ALMs; it uses 660 ALMs and operates at 660 MHz with high packing efficiency. Optimizing the PE logic is critical since peak performance is directly related to the number of PE logic blocks that we can fit into our tile-based design.
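Functionally, the ternary "multiply" reduces to add, subtract, or skip, as the short model below illustrates; it is a bit-accurate sketch of the arithmetic, not the ALM netlist.

```python
def dot64_ternary(acts, w_tern):
    """Ternary dot product over a 64-element block: +1 adds the activation,
    -1 subtracts it (negation instead of multiplication), 0 skips it.
    acts are int8 activations, w_tern entries are in {-1, 0, +1};
    the result of a 64-wide block fits in roughly 15 bits."""
    acc = 0
    for a, w in zip(acts, w_tern):
        if w == 1:
            acc += int(a)
        elif w == -1:
            acc -= int(a)        # weight -1: negate the input, no multiplier needed
    return acc
```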

Scaling Engine - Output of dot64 is multiplied by a 16-bit scaling weight, resulting in a 31-bit output.

Accumulator/Bias unit - Each partially computed output pixel is accumulated in a 32-bit accumulator, resulting in a 32-bit pixel output. The bias value is also added to each computed output pixel value. The partially computed pixel values are fed back from the accumulator into the accumulator engine in the next cycle until the full OFM is computed, at which point the values are written into the OSRAM buffer.

Our architecture is fully pipelined, so new data is fed into the different compute elements of the PE every cycle. There are four read channels, which read 64 bytes of data each for inputs, weights, scaling values, and biases, and store the values in internal memory. The data is then read from internal memory and fed into the compute engine. Memory controllers are responsible for requesting new input data, filling the memory buffers, draining the completed sections of output, and generating requests to system memory to write the data back.

We have one output write channel to write the computed OFM pixel values to memory. If a layer has 128 IFMs and 64 OFMs, the partial outputs of the 64 OFMs are first computed from the first set of 64 IFMs and stored in the OSRAM buffer. The next set of 64 IFMs is then read from memory and passed into the compute engine. During the accumulator stage, we read the previously computed partial outputs (from the first 64 IFMs) from memory, accumulate them with the currently computed values, and store the results back into memory.
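The chunk-wise accumulation described above can be modeled as follows; the sketch assumes the layer's input channels are processed in groups of 64, with the running partial sum standing in for the OSRAM-resident partial output.

```python
import numpy as np

def accumulate_ifm_chunks(ifm_chunks, weight_chunks, alpha_chunks, bias):
    """Accumulate one output pixel over successive groups of 64 IFMs.
    Each 64-deep chunk contributes one scaled dot64; the running sum models
    the 32-bit partial output held between chunks."""
    partial = 0                                                   # 32-bit accumulator in hardware
    for acts, w_tern, alpha_q in zip(ifm_chunks, weight_chunks, alpha_chunks):
        dot = int(np.dot(np.asarray(acts, dtype=np.int32),
                         np.asarray(w_tern, dtype=np.int32)))     # dot64
        partial += dot * int(alpha_q)                             # 16-bit scaling stage
    return partial + int(bias)                                    # bias added once per OFM pixel
```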

Figure 5: Down conversion flow with Element-Wise unit

The DFP data format used in this work consists of a combination of an integer part, $I$, and a shared exponent, $E_s$, so that a tensor is represented by its integer elements scaled by $2^{E_s}$. In this work, we use a single shared exponent (8 bits) per layer for weights and inputs. Once all the OFM values are computed, we down-convert the 32-bit output pixel values to 8 bits, as outlined below.

We find the absolute (abs) max from all the OFM values.

From the max value, we determine the shift value for down-conversion. The shift value is determined using a Leading Zero Count (LZC) detector, as shown in the expression below, where $P$ is the number of bits used by the integer elements in the OFM:

$$\mathrm{shift} = \bigl(32 - \mathrm{LZC}(\max|\mathrm{OFM}|)\bigr) - P. \qquad (1)$$

The same shift value will be used across all the OFM pixel points.

With the shift value determined, we right-shift and down-convert to 8 bits. Then, we round depending on the values of the round and bias bits: the first two bits after the right shift are the round and bias bits, and the remaining seven bits, together with the sign bit, are our down-converted bits. If neither the round bit nor the bias bit is 0, we add 1 to the down-converted output. The activation exponent for the layer is computed by summing the current activation exponent, the weight exponent, and the down-conversion shift value computed previously, as shown in Equation 1 and Figure 6. The total exponent value is passed to the next layer, where it becomes that layer's activation exponent.
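The sketch below models the full down-conversion path, assuming the shift expression of Equation 1 and reading the round/bias bits as the first two bits dropped by the right shift; exact bit positions and saturation behavior are our interpretation of the text.

```python
import numpy as np

def down_convert(ofm_int32, act_exp, wgt_exp, p_bits=7):
    """Down-convert 32-bit OFM values to 8-bit DFP with one shared exponent.

    ofm_int32 : iterable of int32 OFM pixels
    act_exp   : shared exponent of the input activations
    wgt_exp   : shared exponent of the weights
    p_bits    : magnitude bits kept in the 8-bit integer (sign bit excluded)
    """
    abs_max = int(np.max(np.abs(ofm_int32)))
    sig_bits = abs_max.bit_length()                  # equals 32 - LZC(abs_max)
    shift = max(sig_bits - p_bits, 0)                # Eq. (1): make abs_max fit in p_bits

    out = np.empty(len(ofm_int32), dtype=np.int8)
    for i, v in enumerate(ofm_int32):
        mag, sign = abs(int(v)), (1 if v >= 0 else -1)
        q = mag >> shift
        if shift >= 2:
            round_bit = (mag >> (shift - 1)) & 1     # first dropped bit
            bias_bit = (mag >> (shift - 2)) & 1      # second dropped bit
            if round_bit and bias_bit:               # neither bit is 0 -> round up
                q += 1
        out[i] = sign * min(q, 2**p_bits - 1)        # clamp into the 8-bit range

    new_exp = act_exp + wgt_exp + shift              # shared exponent for the next layer
    return out, new_exp
```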

The down-converted output is written back to memory through the LSU. We combine the first 8-bit pixel from each of the 64 OFMs (from the 64 tiles) into one cache line and write that block of data to memory; the remaining OFM pixels are handled in the same way.

ResNet also has element-wise layers, where the left and right convolution branches are merged and added together to produce an 8-bit value. At any moment, our accelerator processes either the left or the right branch of the ResNet topology. For example, the left-branch convolutions are performed first and their outputs are stored in memory; the right-branch convolutions are then executed in the pipeline. The computed OFM is stored in the OSRAM buffer (32 bits per pixel). We then read the first 32-bit pixel from each of the 64 tiles (32x64 bits) and feed them into the down-conversion logic block, obtaining the 8-bit down-converted outputs as explained above (8x64 bits), for a total of 512 bits. When the element-wise layer is initiated, we read the previously computed values through the LSU into the element-wise buffer, and at every cycle we feed 512 bits (8x64) of data from that buffer into the element-wise logic unit.

Figure 6: Exponent computation flow

Adding two 8-bit DFP values produces an 8-bit output and a new shared exponent. In the element-wise layer, we add the outputs of the left and right convolution branches, $o_l$ and $o_r$, to produce an 8-bit output. Since the left and right branches have different shared exponents, $E_l$ and $E_r$, we first align the operand with the smaller exponent by right-shifting it to the greater of the two exponents, and then add the operands in the element-wise unit block. This results in 8-bit values, as shown in Equation 2:

$$o = \bigl(o_l \gg (E_{\max} - E_l)\bigr) + \bigl(o_r \gg (E_{\max} - E_r)\bigr), \qquad E_{\max} = \max(E_l, E_r). \qquad (2)$$
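A sketch of the element-wise addition with exponent alignment (Equation 2); aligning the operand with the smaller exponent by a right shift is our reading of the scheme, and the final clamp of the sum back to 8 bits is omitted.

```python
def eltwise_add_dfp(left, e_left, right, e_right):
    """Add two 8-bit DFP tensors that share exponents e_left and e_right.
    The operand with the smaller exponent is right-shifted by the exponent
    difference so both sides share max(e_left, e_right), then they are added."""
    e_out = max(e_left, e_right)
    out = []
    for l, r in zip(left, right):
        l_aligned = int(l) >> (e_out - e_left)
        r_aligned = int(r) >> (e_out - e_right)
        out.append(l_aligned + r_aligned)            # may still need a clamp back to 8 bits
    return out, e_out
```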

6 Memory Organization/Layer Programming

Based on the requirements of each layer, we construct a memory layout for the inputs, weights, and scaling values for efficient data distribution into the different memory buffers. The LSU controller manages the movement of data between system memory and the on-chip buffers.

BSRAM: The kernel memory layout is arranged in system memory by combining the 2-bit pixels from 64 weight filters, so each pixel position in a (3x3) kernel occupies 128 bits. The kernel data is then distributed to the 64 tiles. The BSRAM buffer holds the weight values; in our design with 64 tiles, there is a separate BSRAM buffer per tile to feed the PEs in that tile. Distribute logic reads the data from memory and distributes it into the individual BSRAM buffers of each tile; to reduce the load on this logic, we divide it into four distribute blocks, each feeding a set of 16 tiles. Each BSRAM buffer has a controller which initiates reads and writes to the buffer.
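For illustration, the kernel packing can be sketched as follows: the 2-bit codes of the same kernel position across 64 filters are packed into one 128-bit word. The 2-bit encoding (two's-complement) is an assumption, as the actual encoding is not specified in the text.

```python
def pack_weights_128(w_tern_64):
    """Pack 64 ternary weights (one per OFM filter) for a single kernel position
    into one 128-bit word, 2 bits per weight.
    Assumed encoding: 0 -> 0b00, +1 -> 0b01, -1 -> 0b11 (2-bit two's complement)."""
    word = 0
    for i, w in enumerate(w_tern_64):
        code = w & 0b11                  # two's-complement of -1, 0, +1 in 2 bits
        word |= code << (2 * i)
    return word                          # 128-bit value held as a Python int

packed = pack_weights_128([1, -1, 0, 1] * 16)
print(hex(packed))
```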

SSRAM: The scaling memory layout is arranged similarly to the BSRAM layout, with each scaling value occupying 16 bits. The SSRAM buffer holds the scaling values; there is a separate SSRAM buffer per tile to feed the scaling values to the PEs in that tile. The SSRAM distribute logic distributes the data read from memory to the individual SSRAM buffers and is divided into two distribute blocks, each feeding a set of 32 tiles.

BBSRAM: The bias memory layout contains one 32-bit value per OFM. The BBSRAM buffer holds the bias values, which are added to each computed OFM pixel.

ISRAM: The ISRAM memory layout is organized in system memory by combining the 8-bit pixels from 64 IFMs (along the z-depth), so every pixel position in the modified input image occupies 512 bits. Internally, the 512 bits are read and distributed to the PEs on the tiles. The ISRAM logic consists of one bank per PE (4 PEs per tile); the ISRAM controller then feeds the data needed for computation into the corresponding PE based on the stride information.
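The corresponding input packing along the z-depth can be sketched as follows; the array shapes and names are illustrative.

```python
import numpy as np

def pack_inputs_512(ifm, x, y):
    """Pack the int8 pixels at position (x, y) from 64 IFMs (channel-major,
    z-depth first) into one 512-bit word.  ifm has shape [64, H, W], dtype int8."""
    word = 0
    for c in range(64):
        word |= (int(ifm[c, y, x]) & 0xFF) << (8 * c)
    return word                                      # 64 x 8 bits = 512 bits

ifm = np.random.randint(-128, 128, size=(64, 56, 56), dtype=np.int8)
print(hex(pack_inputs_512(ifm, 0, 0)))
```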

ResNet contains a sequence of layers that must be executed sequentially on our hardware. Based on the topology of each layer, our accelerator contains a set of registers that must be programmed for DNN execution. Register programming is divided into two parts (an illustrative layer descriptor is sketched after the list):
1. Program the core registers: these provide the layer topology information, such as input and output feature map dimensions, kernel size, stride, number of channels, and number of tiles, stored in the configuration registers.
2. Program the LSU registers: these provide the data-fetch information (the size of the data to be fetched and the base address). They specify the amount of IFM and kernel data to be fetched from memory and the OFM data to be written to memory.
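For illustration, the per-layer programming can be thought of as filling a descriptor like the one below before launching the layer. The field names, widths, and addresses are hypothetical and do not reflect the actual register map.

```python
# Hypothetical per-layer descriptor mirroring the two register groups described above.
layer_descriptor = {
    "core": {                      # core registers: layer topology
        "ifm_dims":  (56, 56, 64), # input feature map H x W x C
        "ofm_dims":  (56, 56, 64), # output feature map H x W x C
        "kernel":    (3, 3),
        "stride":    1,
        "num_tiles": 64,
    },
    "lsu": {                       # LSU registers: data movement
        "ifm_base_addr":    0x0000_0000,
        "weight_base_addr": 0x0100_0000,
        "ofm_base_addr":    0x0200_0000,
        "ifm_bytes":        56 * 56 * 64,
        "weight_bytes":     3 * 3 * 64 * 64 // 4,   # 2-bit weights, 4 per byte
    },
}
```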

The control logic keeps track of the number of layers executed so far and, upon completion of the current layer, loads the corresponding register values for the next layer (e.g., transferring from layer 1 to layer 2).

7 Experimental Results

Our proposed DNN hardware accelerator for the ResNet50 topology was implemented on two of Intel's state-of-the-art FPGAs, the Arria10 GX1150 and the Stratix10. The complete DNN accelerator was written in RTL. We used the Intel Quartus Prime software and the EPE tool to estimate performance and power. For the Stratix10, we used the Quartus Early Beta Release to evaluate our network.

We retrained the ResNet50 ternary model using the fine-tuning method described in [14]. We optimized this model to suit the design parameters of the hardware and fused the batch-norm parameters into the convolution layers. We validated the accuracy of the fused model using a modified version of Caffe with support for emulated ternary operations and achieved a top-1 accuracy of 71.1% for ResNet50.

We began our experiments by running convolutions with 8-bit activations and ternary weights on the Arria10 FPGA. The first and last layers of ResNet50 are executed on the CPU (8-8) and account for only 4.3% of the total FLOPs (3.8 GMACs). The Arria10 GX1150 contains 427K ALMs and 2713 M20K RAM blocks; the different RAM buffers in our design, shown in Table 2, were mapped to M20K blocks in the FPGA. Table 1 shows the FPGA resource utilization when implementing the ResNet network on the Arria10: total ALM usage was 83% for mapping the entire ResNet with DFP support. In Table 3, we provide the breakdown of ALM usage for one instance of each of the different logic/memory modules in the design; some of the modules are instantiated many times.

Precision           8-2
ALM usage           83%
DSP usage           0
Frequency (MHz)     200
Accuracy (Top-1)    71.10%
RAM (M20K)          19%
Dot operation       64
Table 1: Arria10 resource utilization
Buffer        Dimension    Instances   Memory (MB)
IRAM          512 x 128    8           0.524
BSRAM         128 x 128    64          0.13
ORAM          128 x 1028   64          1.29
BBSRAM        32 x 128     64          0.2445
ElementWise   512 x 128    1           0.065
SSRAM         16 x 128     64          0.0163
Table 2: Resource usage of the different buffers
Module               ALM Usage
Dot64                660
Load Unit            2143
Store Unit           342
DownConversion       108
ElementWise          36
Memory_ctrl          4476
Memory_dist.logic    7510
Max_logic            242
Avalon_interface     100
Table 3: ALM usage for the different modules

We compared the performance obtained for ResNet50 on the Arria10 and the Stratix10. We believe our design can be further optimized, so we also make projections for more aggressive frequency targets (300 MHz and 400 MHz). Figures 7 and 8 show the TOP/s and GOP/s/Watt obtained for ResNet50 on the Arria10. Our design achieves 5 TOP/s on ResNet50, which is the highest performance number reported on the Arria10 for ResNet50. With an aggressive frequency target of 400 MHz, which we believe is achievable, we can obtain 10 TOP/s.

Figure 7: TOP/s obtained on the A10 for the ResNet50 network at varying aggressive frequency targets.
Figure 8: GOP/s/Watt obtained on the A10 for the ResNet50 network at varying aggressive frequency targets.

Stratix10 results are obtained for the S10_SG280 part (933K ALMs) using the Quartus tools. By studying performance on this lower-end S10 part, we project performance for the Stratix10's highest-end part (S10_SG550), which has 1.8M ALMs. The Stratix10 is an advanced FPGA device with HyperFlex support (HyperFlex can enable frequencies above 600 MHz). On the S10 (SG550), we project a performance of 76 TOP/s and 0.78 TOP/s/Watt for ResNet50, as shown in Figures 9 and 10.

Figure 9: Peak TOP/s obtained on the S10 for the ResNet50 network at varying aggressive frequency targets.
Figure 10: GOP/s/Watt obtained on the S10 for the ResNet50 network at varying aggressive frequency targets.

There has been extensive research on DNN implementations on FPGAs. However, very few works have reported FPGA performance on state-of-the-art topologies like ResNet50 with ternary weights on the latest Intel FPGAs. We compare our S10_SG280 performance estimates for ResNet50 with the results reported in [7] for the same S10 part. The ternary ResNet reported in [7] uses 2-bit weights and 32-bit activations, which differs from the 8-2 scheme presented in this work. Our S10 performance numbers show a >4x improvement in TOP/s compared to [7], as shown in Figure 11. The higher performance is achievable because the lower precision width uses less logic, freeing up space to pack more compute into our design. We also extract more performance by fine-tuning the dot module without using any DSPs, which results in fewer ALMs per operation and high packing efficiency.

We also compare the TOP/s obtained for ResNet50 on the S10/A10 with other architectures, such as the Arria10 design of [20] and the Xilinx Zynq Z-7045 SoC, in Figure 11. The ResNet50 implementation on the Arria10 with 16-bit precision (A10 [20]) achieves 0.315 TOP/s [20]; that design uses 69% of the DSPs and 30% of the ALMs. By lowering the precision, we are able to use the ALMs effectively and achieve more than 15x higher performance on the A10 compared to [20]. A recent ResNet50 implementation on the Xilinx Zynq Z-7045 achieves 0.128 TOP/s [19]; that design uses 16-bit fixed precision and can execute 256 MACs/cycle. Our design can perform 16K MACs/cycle, and our S10/A10 performance comprehensively outperforms the Xilinx design, as shown in Figure 11.

Figure 11: Performance/Watt comparison of our ResNet 50 results on A10/S10 with other prior work.

We also compare the peak TOP/s obtained with the Titan X GPU for ternary ResNet50 (32(a)-2(w)) reported in [7]. The S10 (SG280) outperforms the Titan X GPU by more than 5x. GPUs are not yet designed to implement low-precision operations effectively, and thus FPGAs outperform the GPU in achieved performance.

GPU    TOP/s   Power (W)   Precision   TOP/s/Watt
V100   125     300         FP16        0.416667
P4     22      75          INT8        0.293333
P40    47      250         INT8        0.188
P100   18.7    250         FP16        0.0748
Table 4: Theoretical peak TOP/s and TOP/s/Watt of recent NVIDIA GPUs for inference [2]

Various recent NVIDIA GPU offerings for inference are shown in Table 4. Our A10 and S10 (SG_280/SG_550) platforms achieve better performance/watt for ResNet50 than the theoretical TOP/s/Watt offered by these GPU devices.

8 Conclusions

In this paper, we have designed a high-performance DNN accelerator for low-precision deep learning inference. There is a trade-off between accommodating more compute in hardware using low-precision data types and retaining accuracy relative to full precision. In this work, we fused the batch-norm and scale layers of a ResNet50 network into its convolutions and applied the FGQ technique to ternarize the weights, achieving a top-1 accuracy of 71.1%. We efficiently implemented the ResNet50 network with 8-bit activations and ternary weights using only ALMs on the FPGA, and we optimized our core compute engine to pack more compute into the FPGA. Our design is capable of performing 16K MAC operations per cycle. Our experimental results indicate that the performance achieved by our design on the Arria10 and Stratix10 for ResNet50 outperforms other hardware implementations of ResNet50 on FPGAs, such as the Arria10 and Xilinx designs, as well as state-of-the-art GPUs.


References

  • [1] Altera Arria10 website. https://www.altera.com/products/fpga/arriaseries/arria10/overview.html.
  • [2] NVIDIA GPUs for inferencing. http://www.nvidia.com/object/accelerate-inference.html.
  • Chen Zhang et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. FPGA, 2015.
  • Chenzhuo Zhu et al. Trained ternary quantization. CoRR, 2016.
  • D. J. M. Moss et al. High performance binary neural networks on the Xeon+FPGA platform. In FPL, 2017.
  • E. Nurvitadhi et al. Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC. 2016.
  • Eriko Nurvitadhi et al. Can FPGAs beat GPUs in accelerating next-generation deep neural networks? FPGA, 2017.
  • Itay Hubara et al. Quantized neural networks: Training neural networks with low precision weights and activations. CoRR, 2016.
  • Kaiming He et al. Deep residual learning for image recognition. CoRR, 2015.
  • Kaiming He et al. Deep residual learning for image recognition. CoRR, 2015.
  • M. Motamedi et al. Design space exploration of FPGA-based deep convolutional neural networks. In ASP-DAC, 2016.
  • Matthieu Courbariaux et al. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, 2016.
  • Mohammad Rastegari et al. XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR, 2016.
  • Naveen Mellempudi et al. Ternary neural networks with fine-grained quantization. 2017.
  • Naveen Suda et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. FPGA, 2016.
  • Norman P. Jouppi et al. In-datacenter performance analysis of a tensor processing unit. CoRR, 2017.
  • Philipp Gysel et al. Hardware-oriented approximation of convolutional neural networks. CoRR, 2016.
  • Suyog Gupta et al. Deep learning with limited numerical precision. CoRR, 2015.
  • Vinayak Gokhale et al. Snowflake: A model agnostic accelerator for deep convolutional neural networks. CoRR, 2017.
  • Y. Ma et al. End-to-end scalable FPGA accelerator for deep residual networks. In 2017 IEEE ISCAS, 2017.
  • Zhouhan Lin et al. Neural networks with few multiplications. CoRR, 2015.