\section{Introduction} ConvNets power state-of-the-art solutions on a wide range of computer vision tasks. However, the high computational complexity of ConvNets hinders their deployment on embedded and mobile devices, where computational resources are limited. Using FPGAs to accelerate ConvNets has attracted significant research attention in recent years. FPGAs excel at low-precision computation, and their adaptability to new algorithms lends itself to supporting rapidly changing ConvNet models.
Despite recent efforts to use FPGAs to accelerate ConvNets, as \cite{kwon2018co} points out, there still exists a wide gap between accelerator architecture design and ConvNet model design. The computer vision community has focused primarily on improving the accuracy of ConvNets on target benchmarks, with only secondary attention to the computational cost of ConvNets. As a consequence, recent ConvNets have been trending toward more layers \cite{he2016identity}, more complex structures \cite{huang2017densely, zoph2017learning}, and more complicated operations \cite{yu2015multi}.
On the other hand, FPGA accelerator design has not leveraged the latest progress of ConvNets. Many FPGA designs still focus on networks trained on CIFAR10 \cite{krizhevsky2009learning}, a small dataset consisting of 32$\times$32 thumbnail images. Such a dataset is usually used for experimental purposes and is too small to have practical value. More recent designs aim to accelerate inefficient ConvNets such as AlexNet \cite{krizhevsky2012imagenet} or VGG16 \cite{simonyan2014very}, both of which have fallen out of use in state-of-the-art computer vision applications. In addition, we observe that in many previous designs, key application characteristics such as frames per second (FPS) are ignored in favor of simply counting GOPs, and accuracy, which is critical to applications, is often not even reported.
Specifically, we see a gap between ConvNet architectures and accelerator design in the following areas:
Inefficient ConvNet models: Many FPGA accelerators still target older, inefficient models such as AlexNet and VGG16, which require orders of magnitude more storage and computational resources than newer, efficient models that achieve the same accuracy. With an inefficient model, an accelerator with high throughput in terms of GOPs can actually have low inference speed in terms of FPS, where FPS is the more essential metric of efficiency. To achieve AlexNet-level accuracy, SqueezeNet \cite{iandola2016squeezenet} is 50$\times$ smaller than AlexNet; SqueezeNext \cite{gholami2018squeezenext} is 112$\times$ smaller; and ShiftNet-C \cite{wu2017shift}, with 1.6\% higher accuracy, is 77$\times$ smaller. However, not many designs target those efficient models. Additionally, techniques for accelerating older models may not generalize to newer ConvNets.
ConvNet structures: Most ConvNets are structured solely for better accuracy. Some are structured for optimal GPU efficiency, but few, if any, are designed for optimal FPGA efficiency. For example, the commonly used additive skip connection \cite{he2016deep} alleviates the difficulty of training deep ConvNets and significantly boosts accuracy. Despite its mathematical simplicity, the additive skip connection is difficult to implement efficiently on FPGAs: it involves adding the output data of a previous layer to the current layer, which requires either using on-chip memory to buffer the previous layer's output or fetching the output from off-chip memory. Both options are inefficient on FPGAs.
ConvNet operators: ConvNet models contain many different types of operators. Commonly used operators include 1$\times$1, 3$\times$3, and 5$\times$5 convolutions, 3$\times$3 max-pooling, etc. More recent models also contain depthwise, group, dilated, and factorized convolutions. Not all of these operators can be implemented efficiently on FPGAs. If a ConvNet contains many different types of operators, one must either allocate more dedicated compute units or make the compute unit more general. Either solution can lead to high resource requirements, limited parallelism, and more complicated control flow, and hardware development will require more engineering effort.
Quantization: ConvNet quantization has been widely used to convert weights and activations from floating-point to low-precision numbers to reduce the computational cost. However, many previous methods are not practically useful for FPGAs due to the following problems: 1) Quantization can lead to serious accuracy loss, especially if the network is quantized to low-precision numbers (fewer than 4 bits). Accuracy is vital for many computer vision applications; unfortunately, carefully reporting accuracy has not been the norm in the FPGA community. 2) Many previously presented quantization methods are only effective on large ConvNet models such as VGG16, AlexNet, and ResNet. Since those models are known to be redundant, quantizing them to low precision is much easier. We are not aware of any previous work tested on efficient models such as MobileNet or ShuffleNet. 3) Many methods do not quantize weights and activations directly to fixed-point numbers. Usually, quantized weights and activations are represented by fixed-point numbers multiplied by some shared floating-point coefficients. Such a representation requires more complicated computation than purely fixed-point operations and is therefore more expensive.
In this work, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracDeltaNet. Both the accelerator and the ConvNet are tailored to FPGAs and are optimized for ImageNet classification accuracy and inference speed (in terms of FPS). Our co-design approach produces a novel ConvNet architecture, DiracDeltaNet, based on ShuffleNetV2 \cite{ma2018shufflenet}, one of the state-of-the-art efficient models with small model size, low FLOP count, hardware-friendly skip connections, and competitive accuracy. We optimize the network by replacing all 3$\times$3 convolutions with shift operations \cite{wu2017shift} and 1$\times$1 convolutions, enabling us to implement a compute unit customized for 1$\times$1 convolutions for better efficiency. The name ``DiracDeltaNet'' comes from the fact that the network only convolves input feature maps with 1$\times$1 kernels. Such kernel functions can be seen as discrete 2D Dirac delta functions. We further quantize the network to 4-bit weights and 4-bit activations, exploiting the strengths of FPGAs, with less than 1\% accuracy drop. In short, DiracDeltaNet's small model size, low operation count, low precision, and simplified operators allow us to co-design a highly customized and efficient FPGA accelerator. Furthermore, the implementation took only one month of work by two people using High-Level Synthesis (HLS).
We trained DiracDeltaNet on ImageNet, implemented it on our accelerator architecture, Synetgy, and deployed it on a low-cost FPGA board (Ultra96). Our inference speed reaches 96.5 FPS, surpassing previous works with similar accuracy by at least 16.9$\times$. DiracDeltaNet on our accelerator also achieves 88.1\% top-5 classification accuracy, the highest among all previously reported embedded FPGA accelerators.
\section{Background} \subsection{Efficient ConvNet Models} For the task of image classification, improving accuracy on the ImageNet \cite{deng2009imagenet} dataset has been the primary focus of the computer vision community. For applications that are sensitive to accuracy, even a 1\% improvement in accuracy on ImageNet is worth doubling or tripling model complexity. As a concrete example, ResNet-152 \cite{he2016deep} achieves 1.36\% higher ImageNet accuracy than ResNet-50 at the cost of 3$\times$ more layers. In recent years, efficient ConvNet models have begun to receive more research attention. SqueezeNet \cite{iandola2016squeezenet} is one of the early models focusing on reducing parameter size. While SqueezeNet is designed for image classification, later models, including SqueezeDet \cite{wu2017squeezedet} and SqueezeSeg \cite{wu2017squeezeseg, wu2018squeezesegv2}, extend the scope to object detection and point-cloud segmentation. More recent models such as MobileNet \cite{howard2017mobilenets, sandler2018mobilenetv2} and ShuffleNet \cite{zhang1707shufflenet, ma2018shufflenet} further reduce model complexity. However, without a target computing platform in mind, most models designed for ``efficiency'' can only target intermediate proxies of efficiency, such as parameter size or FLOP count, instead of more salient efficiency metrics, such as speed and energy. Recent works also try to bring in hardware insight to improve actual efficiency: SqueezeNext \cite{gholami2018squeezenext} uses a hardware simulator to adjust the macro-architecture of the network for better efficiency; ShiftNet \cite{wu2017shift} proposes a hardware-friendly shift operator to replace expensive spatial convolutions; and AddressNet \cite{zhong2018rejecteccv} designs three shift-based primitives to accelerate GPU inference.
\subsection{ConvNet Quantization} ConvNet quantization aims to convert full-precision weights and activations of a network to low-precision representations to reduce computation and storage costs. Early works \cite{han2015deep, zhu2016trained} mainly focus on quantizing weights while still using full-precision activations. Later works \cite{rastegari2016xnor, zhou2016dorefa, choi2018pact, Zhuang2017progressive} quantize both weights and activations. Many previous works \cite{zhu2016trained, rastegari2016xnor, zhou2016dorefa} see serious accuracy loss if the network is quantized to low precision; normally, an accuracy loss of more than 1\% is already considered significant. Also, in many works \cite{zhu2016trained, choi2018pact}, quantized weights or activations are represented by low-precision numbers multiplied by some floating-point coefficients, which brings several challenges to hardware implementation. Last but not least, most previous works report quantization results on inefficient models such as VGG, AlexNet, and ResNet. Given that those models are redundant, quantizing them to lower precision is much easier. We have not yet seen any work that successfully applies quantization to efficient models.
\subsection{Hardware Designs} Most existing ConvNet hardware research has focused on improving the performance of either standalone convolution layers or a full-fledged, large ConvNet on large FPGA devices. \cite{zhang2015optimizing} quantitatively studies the computation throughput and memory bandwidth requirements of ConvNets. \cite{zhang2017improving, ma2017optimizing} present their own optimizations for ConvNets based on analytical performance models; they achieve high throughput on VGG16 using their proposed design methodology with OpenCL. \cite{zhang2017frequency} designs convolution in the frequency domain to reduce the compute intensity of ConvNets, demonstrating good power-performance results on VGG16, AlexNet, and GoogLeNet. \cite{nurvitadhi2017can} implements a ternary neural network on high-end Intel FPGAs and achieves higher performance/Watt than a Titan X GPU. Most of the works mentioned above and others \cite{li2016high, aydonat2017opencl, wei2017automated} target inefficient ConvNets on middle- to high-end FPGA devices. For compact ConvNets, \cite{umuroglu2017finn} demonstrates a binary neural network (BNN) FPGA design that performs CIFAR10 classification at 21906 frames per second (FPS) with 283~$\mu$s latency on a Xilinx ZC706 device; the BNN reports an accuracy of 80.1\%. \cite{zhao2017accelerating, nakahara2017fully} run BNNs on the smaller ZC7020 device. Although all three works achieve promising frame rates, they have not implemented larger neural networks for ImageNet classification. It should be noted that classification on the CIFAR10 dataset is orders of magnitude simpler than on ImageNet, since CIFAR10 contains 100$\times$ fewer classes, 26$\times$ fewer images, and 49$\times$ fewer pixels per image. Networks trained on CIFAR10 also have far lower complexity than those trained on ImageNet. In comparison, networks for ImageNet classification are closer to real-world applicability.
\cite{qiu2016going} first attempted to deploy VGG16 for ImageNet classification on the embedded ZC7020 device and achieved a frame rate of 4.45 FPS. Later, \cite{guo2017software} improved the frame rate to 5.7 FPS. However, these frame rates are still relatively low for real-time image classification tasks. \cite{blott2018finnr, jiao2017accelerating, qiu2016going} have achieved high frame rates on smaller devices; however, the accuracy of their networks is not on par with \cite{guo2017software} for ImageNet classification.
\section{ConvNet Design} \label{NNDesign} We discuss the ConvNet design in this section. The design of our ConvNet incorporates feedback from both the computer vision application and the hardware accelerator design. Specifically, an ideal ConvNet model for embedded FPGA acceleration should satisfy the following requirements: 1) the network should not contain too many parameters or FLOPs but should still maintain competitive accuracy; 2) the network structure should be hardware friendly to allow efficient scheduling; 3) the network's operator set should be simplified for efficient FPGA implementation; and 4) the network's weights and activations should be quantized to low-precision fixed-point numbers without much accuracy loss.
\subsection{ShuffleNetV2} We select ShuffleNetV2-1.0x \cite{ma2018shufflenet} as our starting point. ShuffleNetV2 is one of the state-of-the-art efficient models: it has a top-1 accuracy of 69.4\% on ImageNet (2\% lower than VGG16) but contains only 2.3M parameters (60$\times$ smaller than VGG16) and 146M FLOPs (109$\times$ smaller than VGG16).
The block-level structure of ShuffleNetV2 is illustrated in Fig.~\ref{fig:shufflenetblocks}. The input feature map of the block is first split into two parts along the channel dimension. The first branch does nothing to the input data and directly feeds it to the output. The second branch performs a series of 1$\times$1 convolutions, 3$\times$3 depthwise convolutions, and another 1$\times$1 convolution on the input. The outputs of the two branches are then concatenated along the channel dimension, and channel shuffle \cite{zhang1707shufflenet} is applied to exchange information between the branches. In downsampling blocks, 3$\times$3 depthwise convolutions with a stride of 2 are applied to both branches of the block to reduce the spatial resolution, and 1$\times$1 convolutions are used to double the channel size of the input feature maps. These blocks are cascaded to build a deep ConvNet. We refer readers to \cite{ma2018shufflenet} for a description of ShuffleNetV2's macro-structure.
We select ShuffleNetV2-1.0x not only for its small model size and low FLOP count but also because it uses concatenative skip connections instead of additive skip connections. Additive skip connections, illustrated in Fig.~\ref{fig:addvsconcat}(a), were first proposed in \cite{he2016deep}. They effectively alleviate the difficulty of training deep neural networks and therefore improve accuracy, and they are widely used in ConvNet designs. However, additive skip connections are not efficient on FPGAs: as illustrated in Fig.~\ref{fig:addvsconcat}(a), the data of both the skip branch and the residual branch need to be fetched on-chip to conduct the addition. Though the addition itself costs little computation, the data movement is expensive. Concatenative skip connections, illustrated in Fig.~\ref{fig:addvsconcat}(b), were first proposed in \cite{huang2017densely}. They have a similar positive impact on network training and accuracy. With concatenative skip connections, the data of the skip branch is already in off-chip DRAM, so we can concatenate the two branches simply by writing the residual-branch data next to the skip-branch data. This avoids the extra memory accesses of additive skip connections and alleviates the memory bandwidth pressure.
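As a toy illustration (plain Python lists standing in for DRAM buffers, not the accelerator implementation), the two skip styles differ as follows:

```python
def additive_skip(skip, residual):
    # Additive skip: every element of the skip branch must be read
    # back on-chip and summed with the residual branch.
    return [s + r for s, r in zip(skip, residual)]

def concatenative_skip(dram, residual):
    # Concatenative skip: the skip-branch data already sits in DRAM;
    # concatenation is one sequential write of the residual right
    # after it, with no re-read of the skip data.
    dram.extend(residual)
    return dram
```

The additive variant touches every skip element again, while the concatenative variant only appends.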
\subsection{DiracDeltaNet}
Based on ShuffleNetV2, we build DiracDeltaNet through the following modifications: 1) we replace all 3$\times$3 convolutions with shift operations and 1$\times$1 convolutions; 2) we reduce the kernel size of max-pooling from 3$\times$3 to 2$\times$2; 3) we modify the order of the channel shuffle. We replace all of the 3$\times$3 convolutions and 3$\times$3 depthwise convolutions with shift operations and 1$\times$1 convolutions. The motivation is that smaller convolution kernels require less reuse of the feature map, resulting in a simpler data movement schedule, control flow, and timing constraints. As pointed out by \cite{wu2017shift}, ConvNets rely on spatial convolutions (3$\times$3 convolutions and 3$\times$3 depthwise convolutions) to aggregate spatial information from neighboring pixels to the center position. However, spatial convolutions can be replaced by a more efficient operator called shift. The shift operator aggregates spatial information by copying nearby pixels directly to the center position. This is equivalent to shifting one channel of the feature map in a certain direction. When we shift different channels in different directions, the channels of the output feature map encode all the spatial information. A comparison between 3$\times$3 convolution and shift is illustrated in Fig. Document. A module containing a shift and a 1$\times$1 convolution is illustrated in Fig. Document. We directly replace 3$\times$3 depthwise convolutions with shift operations, as shown in Fig. Document. This direct replacement can lead to some accuracy loss. To mitigate this, we double the output filter number of the first 1$\times$1 convolution on the non-skip branch from Fig. Document. Nominally, doubling the output channel size increases both FLOP count and parameter size by a factor of 2. However, getting rid of 3$\times$3 convolutions allows us to design a compute unit customized for 1$\times$1 convolutions with higher execution efficiency than a comparable unit for 3$\times$3 depthwise convolutions.
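The shift operator described above can be sketched in a few lines of Python. The per-channel direction assignment below is an illustrative round-robin, not the assignment used in the actual network:

```python
def shift_channel(chan, direction):
    """Shift one H x W channel (list of lists) by one pixel with zero
    padding; direction in {'id', 'up', 'down', 'left', 'right'}."""
    h, w = len(chan), len(chan[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            si, sj = i, j  # source pixel copied to position (i, j)
            if direction == 'up':    si = i + 1   # contents move up
            if direction == 'down':  si = i - 1
            if direction == 'left':  sj = j + 1
            if direction == 'right': sj = j - 1
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = chan[si][sj]
    return out

def shift(fmap):
    """Apply a different direction to each channel (round-robin here),
    so the output channels jointly encode neighboring-pixel information."""
    dirs = ['id', 'up', 'down', 'left', 'right']
    return [shift_channel(c, dirs[k % 5]) for k, c in enumerate(fmap)]
```

A following 1$\times$1 convolution can then mix these shifted channels, mimicking what a spatial convolution computes.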
In the downsampling block, we directly replace the strided 3$\times$3 depthwise convolutions with a stride-2 2$\times$2 max-pooling. Unlike \cite{wu2017shift}, our shift operation only uses 4 cardinal directions (up, down, left, right) in addition to the identity mapping (no-shift). This simplifies our hardware implementation of the shift operation without hurting accuracy. The first stage of ShuffleNetV2 consists of a 3$\times$3 convolution with a stride of 2 and a filter number of 24, followed by a 3$\times$3 max-pooling with a stride of 2. We replace these two layers with a module consisting of a series of 1$\times$1 convolution, 2$\times$2 max-pooling, and shift operations, as shown in Table Document. Compared with the original 3$\times$3 convolution, our proposed module has more parameters (2144 vs. 648) and FLOPs (30.5M vs. 8.1M), but the implementation and execution cost of the proposed first stage is negligible compared to a 3$\times$3 convolution layer. After training the network, we find that this module gives nearly the same accuracy as the original 3$\times$3 convolution module. With our new module, we can eliminate the remaining 3$\times$3 convolutions from our network, enabling us to allocate more computational resources to 1$\times$1 convolutions and thereby increase parallelism and throughput. In addition to replacing all 3$\times$3 convolutions, we also reduce the max-pooling kernel size from 3$\times$3 to 2$\times$2. By using the same pooling kernel size as the stride, we eliminate the need to buffer extra data on the pooling kernel boundaries, thereby achieving better efficiency. Our experiments also show that reducing the max-pooling kernel size does not impact accuracy. We also modify the channel shuffle's order to make it more hardware efficient. ShuffleNetV2 uses a transpose operation to mix channels from the two branches. This is illustrated in Fig. Document(a), where blue and red rectangles represent channels from different branches. The transpose-based shuffle is not hardware friendly since it breaks the contiguous data layout.
Performing the channel shuffle in this manner requires multiple passes of memory reads and writes. We propose a more efficient channel shuffle, shown in Fig. Document(b): we perform a circular shift on the feature map along the channel dimension. The same number of channels is exchanged between the two branches while preserving the contiguity of the feature map and minimizing memory accesses.
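The two shuffle variants can be sketched as follows (the rotation amount in the circular variant is an illustrative choice):

```python
def transpose_shuffle(channels, groups=2):
    """ShuffleNetV2-style shuffle: reshape (groups, n), transpose,
    flatten. Neighboring output channels come from far-apart inputs,
    which breaks the contiguous data layout."""
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]

def circular_shuffle(channels, shift):
    """Hardware-friendly variant: circularly rotate the channel axis.
    Channels still cross the branch boundary, but the feature map stays
    contiguous, so it reduces to one address-offset change at write-back."""
    return channels[shift:] + channels[:shift]
```

With channels `[0..7]` split into two branches of four, a rotation by two moves two channels across the boundary while keeping every branch's data contiguous.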
We name the modified ShuffleNetV2-1.0x model DiracDeltaNet. The name comes from the fact that our network only contains 1$\times$1 convolutions. With a kernel size of 1, the kernel functions can be seen as discrete 2D Dirac delta functions. DiracDeltaNet's macro-structure is summarized in Table Document. Stages 2, 3, and 4 consist of chained DiracDeltaNet blocks, depicted in Fig. Document, with different feature map sizes, channel sizes, and strides. We adopt the training recipe and hyperparameters described in \cite{ma2018shufflenet}: we train DiracDeltaNet for 90 epochs with linear learning rate decay, an initial learning rate of 0.5, a batch size of 1024, and a weight decay of 4e-5. A comparison between ShuffleNetV2-1.0x and our DiracDeltaNet is summarized in Table Document.
\begin{tabular}{l|c|c|c|c|c}
\hline
Layer & Output size & KSize & Stride & \#Repeat & \#Out channels \\
\hline
Image & 224 & & & & 3 \\
Stage 2 & & & & & 128 \\
Stage 3 & & & & & 256 \\
Stage 4 & & & & & 512 \\
Conv5 & 7 & 1 & 1 & 1 & 1024 \\
GlobalPool & 1 & 7 & & 1 & 1024 \\
FC & 1 & & & & 1000 \\
\hline
\end{tabular}
\begin{tabular}{l|c|c|c|c}
\hline
 & MACs & \#Params & Top-1 acc & Top-5 acc \\
\hline
ShuffleNetV2-1.0x & 146M & 2.3M & 69.4\% & \\
DiracDeltaNet & 330M & 3.3M & 68.9\% & 88.7\% \\
\hline
\end{tabular}
\subsection{ConvNet Quantization}
To further reduce the cost of DiracDeltaNet, we apply quantization to convert floating-point weights and activations to low-precision integer values. For network weights, we follow DoReFa-Net \cite{zhou2016dorefa} to quantize full-precision weights as
\begin{equation}
w_q = 2\,Q_k\!\left(\frac{\tanh(w)}{2\max(|\tanh(w)|)} + \frac{1}{2}\right) - 1.
\end{equation}
Here, $w$ denotes the latent full-precision weight of the convolution kernel, and $Q_k(\cdot)$ is a function that quantizes its input in the range $[0, 1]$ to its nearest neighbor in $\{\frac{i}{2^k-1} \mid i = 0, \cdots, 2^k-1\}$. We follow PACT \cite{choi2018pact} to quantize each layer's activation as
\begin{equation}
y = \mathrm{PACT}(x) = \frac{|x| - |x - \alpha| + \alpha}{2}, \qquad
y_q = \mathrm{round}\!\left(y \cdot \frac{2^k - 1}{\alpha}\right) \cdot \frac{\alpha}{2^k - 1}.
\end{equation}
Here, $x$ is the activation of a layer, and $\mathrm{PACT}(\cdot)$ is a function that clips the activation to the range $[0, \alpha]$. $\alpha$ is a layer-wise trainable upper bound, determined by the training of the network. We observe that $\alpha$ can sometimes become negative during training, which affects the correctness of the PACT \cite{choi2018pact} function. To ensure that $\alpha$ is always positive and to increase training stability, we use the absolute value of the trainable parameter rather than its original value. $y$ is the clipped activation of the layer, and it is further quantized to $y_q$, a $k$-bit activation tensor. Note that activations from the same layer share the same floating-point coefficient $\frac{\alpha}{2^k-1}$, but activations from different layers can have different coefficients. This is problematic for the concatenative skip connection: if the coefficients of the two branches differ, we need to first cast both from fixed point to floating point, recalculate a coefficient for the merged activation, and quantize it again to new fixed-point numbers. This process is very inefficient. In our experiments, we notice that most of the layers in DiracDeltaNet have similar coefficient values. Therefore, we rewrite the equation above as
\begin{equation}
y_q = \mathrm{round}\!\left(y \cdot \frac{2^k - 1}{s}\right) \cdot \frac{s}{2^k - 1},
\end{equation}
where $s$ is a coefficient shared by the entire network. This step ensures that activations from different layers of the network are quantized and normalized to the same scale of $\frac{s}{2^k-1}$. As a result, we can concatenate activations from different layers directly without extra computation. Moreover, by using the same coefficient across the entire network, the convolution can be computed completely via fixed-point operations. The coefficient $s$ can be fixed before training or left trainable. A general rule is that $s$ should have a value similar to the $\alpha$'s of the different layers. Otherwise, if $s$ is either too small or too large, it can cause gradient vanishing or exploding problems in training, which leads to worse accuracy. In our network, we merge the PACT function and activation quantization into one module, named ActQuant. The input to ActQuant is the output of the 1$\times$1 convolutions. Since the input and weights of the convolution are both quantized into fixed-point integers, the output consists of integers as well. ActQuant is then implemented as a lookup table whose parameters are determined during training and fixed during inference. We follow \cite{Zhuang2017progressive} to quantize the network progressively from full precision to the desired low precision. The process is illustrated in Fig. Document, where the x-axis denotes the bit-width of weights and the y-axis denotes the bit-width of activations. We start from the full-precision network, train it to convergence, and follow a path to progressively reduce the precision of weights or activations. At each point, we fine-tune the network for 50 epochs with step learning rate decay. Formally, we denote each point in the grid as a quantization configuration $(w, a, \mathcal{N}_{w,a})$, where $w$ represents the bit-width of the weights, $a$ is the bit-width of the activations, and $\mathcal{N}_{w,a}$ is the network containing the correspondingly quantized parameters. The starting configuration is the full-precision network $(32, 32, \mathcal{N}_{32,32})$.
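The quantization steps described above can be sketched in pure Python as follows. This is a reference sketch, not the training code: `max_tanh_w` stands for the layer-wise $\max(|\tanh(w)|)$, and the hardware's rounding mode may differ from Python's `round`:

```python
import math

def quantize_unit(x, k):
    """Q_k: nearest neighbor of x (assumed in [0, 1]) in {i / (2^k - 1)}."""
    n = (1 << k) - 1
    return round(x * n) / n

def dorefa_weight(w, k, max_tanh_w):
    """DoReFa-style k-bit quantization of a latent float weight w into [-1, 1]."""
    x = math.tanh(w) / (2 * max_tanh_w) + 0.5
    return 2 * quantize_unit(x, k) - 1

def pact_quant_shared(x, alpha, k, s):
    """PACT clipping to [0, alpha], then quantization with the shared
    network-wide scale s instead of the per-layer alpha."""
    y = 0.5 * (abs(x) - abs(x - alpha) + alpha)  # clip to [0, alpha]
    n = (1 << k) - 1
    return round(y * n / s) * n ** -1 * s
```

With a single `s` for all layers, every quantized activation lives on the same grid, so concatenation needs no rescaling.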
Starting from this configuration, one can either go down to quantize the activations or go right to reduce the bit-width of the weights. More aggressive steps can be taken diagonally or even across several grid points. The two-stage and progressive optimization methods proposed in \cite{Zhuang2017progressive} can be represented as two paths in Fig. Document. In our work, we move diagonally in the quantization grid, using the network at each configuration to initialize the next, lower-precision configuration, and applying step learning rate decay fine-tuning to recover the accuracy loss due to quantization. After several epochs of fine-tuning at each step, we reach the desired low-precision configuration $(4, 4, \mathcal{N}_{4,4})$ with less than 1\% top-5 accuracy loss compared to its full-precision counterpart.
\begin{tabular}{l|c|c}
\hline
 & full & w4a4 \\
\hline
Top-1 Acc & 68.9\% & 68.3\% \\
Top-5 Acc & 88.7\% & 88.1\% \\
\hline
\end{tabular}
We use a pretrained ResNet-50 label-refinery \cite{bagherinezhad2018label} to boost the accuracy of the quantized model. Even with such low-precision quantization, our quantized model still preserves a very competitive top-5 accuracy of 88.1\%. Most previous quantization works \cite{choi2018pact, Zhuang2017progressive, zhou2016dorefa} are only effective on large models such as VGG16, AlexNet, or ResNet-50. Our quantization results are summarized in Table Document.
\section{Hardware Design}
As mentioned in Section~\ref{NNDesign}, we aggressively simplified ShuffleNetV2's operator set. Our modified network is mainly composed of the following operators:

\begin{itemize}
\item $1\times1$ convolution
\item $2\times2$ max-pooling
\item shift
\item shuffle and concatenation
\end{itemize}
Our accelerator, Synetgy, is tailored to support only the operators above. This allows us to design more specialized compute units with simpler control, which enables us to further improve hardware efficiency. The computation of the fully-connected layer can be mapped onto our convolution unit. The shuffle operation is not fully supported on the FPGA: a CPU-based memory copy is needed to maintain the memory layout. The remaining average-pooling layer, which is not supported on the FPGA, is offloaded to the ARM processor on the SoC platform. The benefits of the simplified operator set come from the algorithm-hardware co-design, which also increases the productivity of the hardware implementation: the accelerator implementation took only one month of work by two people using HLS.
\subsection{The Accelerator Architecture}
Fig. Document shows the overall accelerator architecture design. Our accelerator, highlighted in light yellow, can be invoked by the CPU to compute one Conv-Pooling-Shift-Shuffle subgraph at a time. The CPU provides supplementary support to the accelerator; both the FPGA and the CPU are used to run the network.
In the quantized DiracDeltaNet, weights are 4-bit, input and output activations are 4-bit, and the largest partial sum is 17-bit. The width of the partial sum is determined by the input feature bit-width and the largest channel size. Given that the largest channel size is 512, a partial sum from the convolution can reach $512 \times 15 \times 15 = 115{,}200$, which requires 17 bits to represent.
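This bound can be checked directly, under the assumption that both 4-bit operands are treated as unsigned magnitudes with maximum value 15:

```python
max_product = 15 * 15                 # largest 4-bit x 4-bit product
max_partial_sum = 512 * max_product   # accumulated over 512 input channels
bits_needed = max_partial_sum.bit_length()  # 115200 fits in 17 bits
```

Since $2^{16} = 65{,}536 < 115{,}200 < 131{,}072 = 2^{17}$, a 17-bit accumulator suffices.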
\subsubsection{Dataflow Architecture}
Our hardware design is based on the dataflow architecture template \cite{cheng2016high, vivado2018ug}. As illustrated in Fig. Document, we first extract a few process functions from the major operations, including convolution, max-pooling, shift, shuffle, and the memory load and store. We then chain them together using FIFOs with blocking reads and non-blocking writes. Note that a write blocks once the FIFO is full. All the process functions run concurrently, and the execution of each function is triggered by the arrival of data. Therefore, more task-level parallelism is explicitly exposed to the HLS tool in addition to instruction-level parallelism.
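As a software analogy (not the HLS code), chained Python generators behave like process functions connected by FIFOs: each stage fires as soon as an element arrives from its producer. The stage bodies below are placeholders for the real operations:

```python
def load(values):
    for v in values:
        yield v                    # memory-load process function

def conv_stage(stream, weight):
    for v in stream:
        yield v * weight           # stands in for the convolution unit

def pool_stage(stream):
    window = []
    for v in stream:
        window.append(v)
        if len(window) == 2:       # emit the max over a window of 2
            yield max(window)
            window = []

def store(stream):
    return list(stream)            # memory-store process function
```

Calling `store(pool_stage(conv_stage(load(data), w)))` streams elements through all stages without materializing intermediate buffers, mirroring how the FIFO-connected functions overlap in time.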
\subsubsection{Convolution Unit}
The notations used in this section are listed in Table Document. As shown in Fig. Document, given an input feature map of size $W \times H \times IC$ and a weight kernel of size $1 \times 1 \times IC \times OC$, the generated output feature map is of size $W \times H \times OC$ for a 1$\times$1 convolution. The 1$\times$1 convolution is essentially a matrix-matrix multiplication.
Although \cite{kwon2018co} suggests a weight-stationary dataflow for 1$\times$1 convolution dominant ConvNets, we find it not applicable to our design, as the bit-width of the weights is much smaller than that of the partial sums (4 bits vs. 17 bits). Transferring the partial sums on- and off-chip would incur more traffic on the memory bus. Therefore, we adopt an output-stationary dataflow, retaining the partial sums in the local register file until an output feature is produced.
Fig. Document shows how we schedule the workload onto the accelerator. Note that the nested loops starting at lines 17 and 19 are automatically unrolled. Weights are prefetched into on-chip BRAM. We first block our inputs so that the multiplications can be mapped onto the compute units at each iteration (lines 13-21). In every iteration, input features are fetched from DRAM and convolved with a block of weights to produce partial sums. Each iteration of the loop nest along the input channel dimension at line 12 takes a fixed number of cycles based on the Vivado HLS report. The partial sums are stored in registers, which can be accessed simultaneously in every cycle. The blocking parameters were tuned for the area-performance tradeoff: increasing them increases overall resource utilization but helps to reduce the total number of execution cycles. Based on the roofline model \cite{williams2009roofline}, the attainable throughput is the compute-to-communication (CTC) ratio multiplied by the bandwidth when the design is bandwidth bound. The CTC ratio of our compute unit grows with the output channel size (whose maximum is 512 in DiracDeltaNet): a larger output channel size implies a higher CTC ratio. According to our measurements, the maximum bandwidth of the DDR channel is 6 GB/s, which corresponds to 12 Giga input features per second (1 byte contains two 4-bit features) and determines the memory-bound throughput. For compute-bound problems, the attainable throughput depends on the compute capability. Based on this analysis, the convolution unit reaches the bandwidth bound before it hits the computation roofline.
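The output-stationary schedule can be sketched in software as follows. Here `ic_p` and `oc_p` stand in for the hardware unroll factors, whose actual values are not stated above, and the sketch assumes channel counts divisible by them:

```python
def conv1x1_blocked(weights, features, ic_p=4, oc_p=4):
    """1x1 convolution as a blocked matrix multiply.
    weights: OC x IC, features: IC x N (N = W*H pixels). Returns OC x N.
    The two innermost loops model the unrolled compute units; the acc
    list models the local partial-sum registers that stay on-chip until
    an output feature is complete (output-stationary)."""
    OC, IC = len(weights), len(weights[0])
    N = len(features[0])
    out = [[0] * N for _ in range(OC)]
    for oc0 in range(0, OC, oc_p):          # block of output channels
        for n in range(N):                  # one output pixel at a time
            acc = [0] * oc_p                # partial-sum registers
            for ic0 in range(0, IC, ic_p):  # stream input channels in
                for oc in range(oc_p):      # unrolled in hardware
                    for ic in range(ic_p):  # unrolled in hardware
                        acc[oc] += (weights[oc0 + oc][ic0 + ic]
                                    * features[ic0 + ic][n])
            for oc in range(oc_p):          # single write-back per output
                out[oc0 + oc][n] = acc[oc]
    return out
```

Each partial sum is written out exactly once, which is why the wide 17-bit accumulators never travel over the memory bus.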
\subsubsection{Conversion Unit}
The high bit-width to low bit-width conversion is performed immediately after the kernel computation. It is a step function with 16 intervals that converts the 17-bit partial sum to a 4-bit activation. The threshold values are different for each layer. All of the read-only threshold values are stored in on-chip BRAMs. An index number is specified by the user function to select which set of threshold values to use for the computation of the current layer. In hardware, this unit is implemented using 16 comparators, which are mapped onto a binary tree structure to reduce circuit latency.
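A functional sketch of the conversion step, using Python's `bisect` as the software analog of the comparator tree (logarithmic depth): the thresholds below are hypothetical, since the real values are learned per layer.

```python
import bisect

def convert(psum, thresholds):
    """Map a wide integer partial sum to a 4-bit code (0..15) by
    counting how many sorted thresholds it exceeds; binary search
    mirrors the comparator-tree structure of the hardware unit."""
    return bisect.bisect_right(thresholds, psum)

# Hypothetical, evenly spaced thresholds for one layer:
thresholds = [100 * i for i in range(1, 16)]  # 15 cuts -> 16 intervals
```

Selecting a different threshold list per layer corresponds to the BRAM index described above.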
\subsubsection{Pooling Unit}
We adopt the line buffer design described in \cite{zhao2017accelerating} to implement the max-pooling layer. In every iteration, rows of pixels are first fetched into the line buffers. Once the next pixel value is fetched, a sliding window is formed. Every 2 cycles, we compare the values in the sliding window, output the largest one, and fetch the next 2 values. It takes several iterations to finish the computation.
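Functionally, the pooling unit computes the following (a plain-Python sketch with the line-buffer mechanics abstracted away):

```python
def maxpool2x2_s2(chan):
    """2x2 max pooling with stride 2 over an H x W channel.
    Kernel size equals stride, so pooling windows never overlap and no
    extra boundary pixels need to be buffered across windows."""
    return [[max(chan[i][j], chan[i][j + 1],
                 chan[i + 1][j], chan[i + 1][j + 1])
             for j in range(0, len(chan[0]) - 1, 2)]
            for i in range(0, len(chan) - 1, 2)]
```

The non-overlapping windows are exactly why the 2$\times$2 kernel is cheaper than the original 3$\times$3, stride-2 pooling.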
\thesubsubsection Shift Unit
The line buffer design is also used for the shift operation. In the shift unit, the input images are first padded with 1 zero-valued pixel along the width and height dimensions. Rows of pixels are then buffered and a sliding window is formed. The shift direction differs across input channels and is calculated from the input channel index. After initialization, the unit produces one output pixel per cycle.
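Functionally, the shift operation displaces each channel by one of the nine offsets in a 3x3 neighborhood. A sketch of the semantics (the channel-to-direction assignment rule below is an assumption for illustration; the paper only states it is derived from the channel index):

```python
import numpy as np

def shift(ifm):
    """Shift operation: each input channel is displaced by one of the
    9 offsets of a 3x3 neighborhood, selected by channel index.
    Zero padding of 1 pixel keeps the spatial size unchanged.
    The `c % 9` assignment is illustrative, not the hardware's rule."""
    C, H, W = ifm.shape
    padded = np.pad(ifm, ((0, 0), (1, 1), (1, 1)))
    out = np.empty_like(ifm)
    for c in range(C):
        dy, dx = divmod(c % 9, 3)      # one of the 9 directions
        out[c] = padded[c, dy:dy + H, dx:dx + W]
    return out
```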
\thesubsubsection Shuffle Unit
Shuffle is implemented by changing the address offsets of output features during the write-back phase. Since the shuffle operation still requires concatenating the outputs of the previous DiracDeltaNet block with those of the current block, the CPU is used to copy the previous block's output to the shuffled addresses. This memory copy is performed concurrently with the computation of the current DiracDeltaNet unit.
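The net effect of the address-offset trick can be modeled as writing the two branches directly to interleaved channel offsets, instead of concatenating and then permuting. A sketch, assuming the standard two-group channel shuffle of ShuffleNetV2-style blocks:

```python
import numpy as np

def shuffle_writeback(branch_a, branch_b):
    """Channel shuffle realized as a write-back address pattern:
    one branch's channels land at even offsets (the accelerator's
    strided write-back), the other at odd offsets (the CPU memory
    copy). Equivalent to concat followed by a 2-group shuffle."""
    C, H, W = branch_a.shape
    out = np.empty((2 * C, H, W), dtype=branch_a.dtype)
    out[0::2] = branch_a   # accelerator output at even channel offsets
    out[1::2] = branch_b   # CPU-copied branch at odd channel offsets
    return out
```

No data is physically permuted after the fact; the permutation is absorbed into where each output is written.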
\thesubsubsection Fully Connected Unit
We do not design a dedicated unit to compute the FC layer. Instead, we map the FC computation onto our existing hardware convolution unit, for which the feature map size is 1 in the FC case. Since the convolution unit only supports 4-bit weights, the FC layer's computation is mapped in a bit-serial manner: the convolution unit processes each bit of the FC weights iteratively, and the bit shifts are realized by configuring the step function in the conversion unit.
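The bit-serial decomposition works because a w-bit weight is a sum of shifted bit-planes: w = Σ_b 2^b · bit_b(w). A software sketch, assuming unsigned weights for simplicity (the shift-and-accumulate shown explicitly here is what the configured step function absorbs on the device):

```python
def fc_bit_serial(x, weights, wbits=8):
    """Bit-serial mapping of an FC layer onto a unit that supports
    only low-precision weights: each weight bit-plane is a separate
    1-bit pass over the inputs, and passes combine via shift-and-add.
    x: list of input activations; weights: list of per-output rows.
    Unsigned weights are assumed for simplicity."""
    n_out = len(weights)
    acc = [0] * n_out
    for b in range(wbits):                      # one pass per bit-plane
        for o in range(n_out):
            # 1-bit weights: the "multiply" degenerates to a masked sum
            plane = sum(xi for xi, w in zip(x, weights[o]) if (w >> b) & 1)
            acc[o] += plane << b                # shift and accumulate
    return acc
```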
\thesubsection Software
We use the ARM processor to control the layer-based accelerator and to compute the last average-pooling layer, which is not supported by the accelerator. The host application runs on a full Linux system on the ARM CPU and controls the memory-mapped accelerator through the UIO driver interface. Xilinx's Python-based PYNQ APIs [xilinx2018pynq] are used for fast deployment of the host software on the Ultra96 board.
\thesubsection Experimental Results
We implement our accelerator, Synetgy, on the Ultra96 development board with a Xilinx Zynq UltraScale+ MPSoC targeted at embedded applications. Table Document shows the overall resource utilization of our implementation. We utilize 34% of the total LUTs on the FPGA, as the 4/4-bit multiplications are mapped onto LUTs. BRAMs are mainly used for implementing the FIFO channels, and DSPs for the address calculation of the AXI protocol. Our implementation runs at 250 MHz. Power measurements are obtained via a power monitor: we measured 5.3 W with no workload running on the programmable logic and 5.5 W peak on the Ultra96 power supply line when running our network.
LUT  FF  BRAM  DSP 
24130 (34.2%)  29867 (21.2%)  170 (78.7%)  37 (10.3%) 
  VGGSVD [qiu2016going]  AlexNet [liang2018fp]  VGG16 [suda2016throughput]  VGG16 [guo2017software]  DoReFa [jiao2017accelerating]  FINNR [blott2018finnr]  Ours 
Platform  Zynq XC7Z045  StratixV  StratixV  Zynq 7Z020  Zynq 7Z020  Zynq ZU3EG  Zynq ZU3EG 
Frame Rate (fps)  4.5  864.7  3.8  5.7  106.0  200.0  96.5 
Top1 Acc  64.64%  42.90%  66.58%  67.72%  46.10%  50.3%  68.30% 
Top5 Acc  86.66%  66.80%  87.48%  88.06%  73.10%  N/A  88.12% 
Precision  16b  16b  8/16b  8b  2b  1/2b  4/4b 
Throughput (GOPs)  136.97  1963.96  117.80  123  410.22  400  47.09 (avg) 
Frequency (MHz)  150  150  120  214  200  220  250 
Power (W)  3.0  26.2  19.1  3.0  2.3  10.2  5.5 
Batch Size  1  2  4  8  10  16 
Frame Rate (fps)  58.7  72.9  84.1  94.4  95.9  96.5 
We compare our accelerator against previous work in Table Document. As explained before, ConvNets for ImageNet classification are usually orders of magnitude more complex than those for CIFAR10 classification. Therefore, we only compare against accelerators targeting ConvNets for ImageNet classification with reasonable accuracy. Our work focuses on achieving competitive accuracy while improving the actual inference speed in terms of frames per second, and our experiments show that we achieve both goals. From the table, we make the following observations: 1) Synetgy achieves the highest top-1 and top-5 accuracy on ImageNet. The only previous work that comes close to our accuracy is [guo2017software], but its frame rate is 16.9x slower than ours. 2) Among the embedded accelerators whose top-1 accuracy is higher than 60%, which is a loose constraint, our model achieves the fastest inference speed. 3) Without the accuracy constraint, [liang2018fp, jiao2017accelerating, blott2018finnr] can run as fast as 864.7 frames per second, but their accuracy is considerably lower. 4) The peak attainable throughput of our accelerator is 418 GOPs, which is close to the theoretical compute roofline. Our average throughput (47.09 GOPs) is currently limited by low hardware utilization, mainly due to the software shuffle operations and the first convolution layer, whose input channel dimension of 3 is much smaller than the hardware tiling factor. Nevertheless, Synetgy still achieves a competitive frame rate, demonstrating the efficacy of our co-design methodology, and we see opportunities for significant frame-rate improvement through further algorithm-hardware co-design. The reported frame rate is achieved with the batch size set to 16: there is a fixed software overhead for invoking the poll-based hardware accelerator, while the computation latency of the DiracDeltaNet Block1 in Table Document is only 0.15 ms at batch size 1.
The latency of a single read of the accelerator control register is 0.40 ms, which is greater than the actual compute time. To minimize this software overhead, we increase the batch size so that more computation runs on the accelerator per invocation. Furthermore, the weights stored in on-chip BRAM are reused more when the batch size is increased. The frame rates of implementations with different batch sizes are summarized in Table Document. We break down the runtime of the whole heterogeneous system by bypassing one component at a time and measuring the resulting runtime; the result is shown in Table Document. The whole system runs at 95.9 FPS on ImageNet classification at a batch size of 10, including both hardware PE execution and software execution of average pooling and shuffle. We see from the table that the CPU-based memory copy for the shuffle operation significantly degrades performance, while all other non-conv components affect the overall performance only slightly. To further understand the efficiency of the various operators (1x1 conv, 2x2 max-pooling, shift, and shuffle) implemented on the FPGA and CPU, we measure the runtime of DiracDeltaNet blocks with different configurations on Synetgy. The result is summarized in Table Document. We test 2 blocks with different input feature map and channel sizes; note that the theoretical OPs of Block1 and Block2 are the same. As shown in the table, pooling and shift incur almost no performance drop, because the process functions performing these operations do not impose new bottlenecks on the dataflow pipeline. The software memory-copy latency of shuffle is more significant on Block1 than on Block2: the memory-copy overhead is proportional to the feature map size, while the total OPs remain the same, so the smaller feature map needs less time for the memory copy. The memory-copy overhead could possibly be alleviated by running bare-metal C code on the CPU.
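The batch-size effect follows from a simple amortization model: a fixed software cost is paid once per accelerator invocation, so larger batches spread it over more frames. A sketch of the model (the two constants in the test are the figures quoted above for a single block and a single register read, not the whole network, so the resulting rates are illustrative only):

```python
def effective_fps(per_frame_ms, overhead_ms, batch):
    """Amortization model for the poll-based invocation overhead:
    one accelerator call processes `batch` frames, paying the fixed
    software overhead once, so throughput rises toward the
    overhead-free rate as the batch size grows."""
    total_ms = overhead_ms + batch * per_frame_ms
    return 1000.0 * batch / total_ms
```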
  Runtime (ms)  Frame Rate (fps) 
Overall  104.3  95.9 
w/o sw avg pool  100.3  99.7 
w/o fc  104.0  96.1 
  104.2  96.0 
w/o sw shuffle  70.4  142.1 
hw only  65.7  152.2 
Runtime (ms)  
Block1  Block2  
feature map size  28  7 
in&out channel  128  512 
conv only  1.531  0.989 
conv+pool  1.530  0.993 
conv+shift  1.537  0.996 
conv+shuffle  4.409  1.636 
overall  4.364  1.441 
\thesection Conclusion and Future Works
In this paper, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracDeltaNet. Starting from ShuffleNetV2, we optimize the network's operators by replacing all the 3x3 convolutions with shift operations and 1x1 convolutions. This allows us to build a compute unit exclusively customized for 1x1 convolutions for better efficiency. We quantize the network's weights and activations to 4-bit fixed-point numbers with less than 1% accuracy loss; this quantization exploits the nature of FPGA hardware well. As a result, DiracDeltaNet has a small parameter size, low computational OPs, hardware-friendly skip connections, low precision, and simplified operators. These features allow us to implement a highly customized and efficient accelerator on an FPGA. We implement the network on the Ultra96 SoC system; the implementation took only two people one month using HLS tools. Our accelerator, Synetgy, achieves a top-5 accuracy of 88.1% on ImageNet, the highest among all previously published embedded FPGA accelerators. It also reaches an inference speed of 96.5 FPS, surpassing prior works with similar accuracy by at least 16.9x. While we see many more opportunities for optimization, we believe this demonstrates the efficacy of our co-design methodology. In future work, we will focus on further optimization: for example, we can add more layers to the dataflow architecture to improve the compute-to-communication ratio, and correspondingly adjust the network so that the computation subgraphs are more symmetric. {acks} We would like to thank all of the people who helped us realize this project, especially the anonymous reviewers, Kostadin Ilov, Rock Qu, Alessandro Pappalardo, Amir Gholaminejad, Peter Jin, Ravi Krishna, and Alvin Wan. The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S.
Department of Energy, under Award Number DE-AR0000849. This research was partially funded by the ADEPT Lab industrial sponsor Intel and by ADEPT Lab affiliates Google, Siemens, and SK Hynix. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.