\section{Introduction}

ConvNets power state-of-the-art solutions on a wide range of computer vision tasks. However, the high computational complexity of ConvNets hinders their deployment on embedded and mobile devices, where computational resources are limited. Using FPGAs to accelerate ConvNets has attracted significant research attention in recent years. FPGAs excel at low-precision computation, and their adaptability to new algorithms lets them support rapidly changing ConvNet models.

Despite recent efforts to use FPGAs to accelerate ConvNets, as \cite{kwon2018co} points out, there still exists a wide gap between accelerator architecture design and ConvNet model design. The computer vision community has been primarily focusing on improving the accuracy of ConvNets on target benchmarks with only secondary attention to the computational cost of ConvNets. As a consequence, recent ConvNets have been trending toward more layers \cite{he2016identity}, more complex structures \cite{huang2017densely, zoph2017learning}, and more complicated operations \cite{yu2015multi}.

On the other hand, FPGA accelerator design has not leveraged the latest progress of ConvNets. Many FPGA designs still focus on networks trained on CIFAR10 \cite{krizhevsky2009learning}, a small dataset consisting of 32$\times$32 thumbnail images. Such a dataset is typically used for experimental purposes and is too small to have practical value. More recent designs aim to accelerate inefficient ConvNets such as AlexNet \cite{krizhevsky2012imagenet} or VGG16 \cite{simonyan2014very}, both of which have fallen out of use in state-of-the-art computer vision applications. In addition, we observe that in many previous designs, key application characteristics such as frames per second (FPS) are ignored in favor of simply counting GOPs, and accuracy, which is critical to applications, is often not even reported.

Specifically, we see a gap between ConvNet architectures and accelerator design in the following areas:

\textbf{Inefficient ConvNet models:} Many FPGA accelerators still target older, inefficient models such as AlexNet and VGG16, which require orders-of-magnitude more storage and computation than newer, efficient models that achieve the same accuracy. With an inefficient model, an accelerator with high throughput in terms of GOPs can still have low inference speed in terms of FPS, and FPS is the more essential metric of efficiency. To achieve AlexNet-level accuracy, SqueezeNet \cite{iandola2016squeezenet} is 50x smaller than AlexNet; SqueezeNext \cite{gholami2018squeezenext} is 112x smaller; ShiftNet-C \cite{wu2017shift}, with 1.6% higher accuracy, is 77x smaller. However, few designs target these efficient models. Additionally, techniques for accelerating older models may not generalize to newer ConvNets.

\textbf{ConvNet structures:} Most ConvNets are structured solely for better accuracy. Some ConvNets are structured for optimal GPU efficiency, but few, if any, are designed for optimal FPGA efficiency. For example, the commonly used additive skip connection \cite{he2016deep} alleviates the difficulty of training deep ConvNets and significantly boosts accuracy. Despite its mathematical simplicity, the additive skip connection is difficult to efficiently implement on FPGAs. Additive skip connections involve adding the output data from a previous layer to the current layer, which requires either using on-chip memory to buffer the previous layer’s output or fetching the output from off-chip memory. Both options are inefficient on FPGAs.

\textbf{ConvNet operators:} ConvNet models contain many different types of operators. Commonly used operators include 1$\times$1, 3$\times$3, and 5$\times$5 convolutions, 3$\times$3 max-pooling, etc. More recent models also contain depth-wise, group, dilated, and factorized convolutions. Not all of these operators can be efficiently implemented on FPGAs. If a ConvNet contains many different types of operators, one must either allocate more dedicated compute units or make the compute unit more general. Either solution can potentially lead to high resource requirements, limited parallelism, and more complicated control flow. Hardware development will also require more engineering effort.

\textbf{Quantization:} ConvNet quantization has been widely used to convert weights and activations from floating point to low-precision numbers to reduce the computational cost. However, many of the previous methods are not practically useful for FPGAs due to the following problems: 1) Quantization can lead to serious accuracy loss, especially if the network is quantized to very low precision (less than 4 bits). Accuracy is vital for many computer vision applications, but carefully reporting accuracy has unfortunately not been the norm in the FPGA community. 2) Many of the previously presented quantization methods are only effective on large ConvNet models such as VGG16, AlexNet, and ResNet. Since those models are known to be redundant, quantizing them to low precision is much easier. We are not aware of any previous work tested on efficient models such as MobileNet or ShuffleNet. 3) Many methods do not quantize weights and activations directly to fixed-point numbers. Instead, quantized weights and activations are represented by fixed-point numbers multiplied by shared floating-point coefficients. Such representations require more complicated computation than purely fixed-point operations and are therefore more expensive.

In this work, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracDeltaNet. Both the accelerator and the ConvNet are tailored to FPGAs and are optimized for ImageNet classification accuracy and inference speed (in terms of FPS). Our co-design approach produces a novel ConvNet architecture, DiracDeltaNet, based on ShuffleNetV2 \cite{ma2018shufflenet}, one of the state-of-the-art efficient models with small model size, low FLOP count, hardware-friendly skip connections, and competitive accuracy. We optimize the network by replacing all 3$\times$3 convolutions with shift operations \cite{wu2017shift} and 1$\times$1 convolutions, enabling us to implement a compute unit customized for 1$\times$1 convolutions for better efficiency. The name “DiracDeltaNet” comes from the fact that the network only convolves input feature maps with 1$\times$1 kernels; such kernel functions can be seen as discrete 2D Dirac delta functions. We further quantize the network to 4-bit weights and 4-bit activations, exploiting the strengths of FPGAs, with less than 1% accuracy drop. In short, DiracDeltaNet’s small model size, low operation count, low precision, and simplified operators allow us to co-design a highly customized and efficient FPGA accelerator. Furthermore, the implementation only took two people one month using High-Level Synthesis (HLS).

We trained DiracDeltaNet on ImageNet, implemented it on our accelerator architecture, Synetgy, and deployed it on a low-cost FPGA board (Ultra96). Our inference speed reaches 96.5 FPS, surpassing previous works with similar accuracy by at least 16.9x. DiracDeltaNet running on our accelerator also achieves 88.1% top-5 classification accuracy, the highest among all previously reported embedded FPGA accelerators.

\section{Background}

\subsection{Efficient ConvNet Models}

For the task of image classification, improving accuracy on the ImageNet \cite{deng2009imagenet} dataset has been the primary focus of the computer vision community. For applications that are sensitive to accuracy, even a 1% improvement in accuracy on ImageNet is worth doubling or tripling model complexity. As a concrete example, ResNet152 \cite{he2016deep} achieves 1.36% higher ImageNet accuracy than ResNet50 at the cost of 3x more layers. In recent years, efficient ConvNet models have begun to receive more research attention. SqueezeNet \cite{iandola2016squeezenet} is one of the early models focusing on reducing the parameter size. While SqueezeNet is designed for image classification, later models, including SqueezeDet \cite{wu2017squeezedet} and SqueezeSeg \cite{wu2017squeezeseg, wu2018squeezesegv2}, extend the scope to object detection and point-cloud segmentation. More recent models such as MobileNet \cite{howard2017mobilenets, sandler2018mobilenetv2} and ShuffleNet \cite{zhang1707shufflenet, ma2018shufflenet} further reduce model complexity. However, without a target computing platform in mind, most models designed for “efficiency” can only target intermediate proxies of efficiency, such as parameter size or FLOP count, instead of more salient efficiency metrics, such as speed and energy. Recent works have also tried to bring in hardware insight to improve actual efficiency. SqueezeNext \cite{gholami2018squeezenext} uses a hardware simulator to adjust the macro-architecture of the network for better efficiency. ShiftNet \cite{wu2017shift} proposes a hardware-friendly shift operator to replace expensive spatial convolutions. AddressNet \cite{zhong2018rejecteccv} designs three shift-based primitives to accelerate GPU inference.

\subsection{ConvNet Quantization}

ConvNet quantization aims to convert full-precision weights and activations of a network to low-precision representations to reduce the computation and storage cost. Early works \cite{han2015deep, zhu2016trained} mainly focus on quantizing weights while still using full-precision activations. Later works \cite{rastegari2016xnor, zhou2016dorefa, choi2018pact, Zhuang2017progressive} quantize both weights and activations. Many previous works \cite{zhu2016trained, rastegari2016xnor, zhou2016dorefa} suffer serious accuracy loss when the network is quantized to low precision; normally, an accuracy loss of more than 1% is already considered significant. Also, in many works \cite{zhu2016trained, choi2018pact}, quantized weights or activations are represented by low-precision numbers multiplied with floating-point coefficients, which brings several challenges to hardware implementation. Last but not least, most previous works report quantization results on inefficient models such as VGG, AlexNet, and ResNet. Given that those models are redundant, quantizing them to lower precision is much easier. We have not yet seen any work that successfully applies quantization to efficient models.

\subsection{Hardware Designs}

Most existing ConvNet hardware research has focused on improving the performance of either standalone convolution layers or full-fledged, large ConvNets on large FPGA devices. \cite{zhang2015optimizing} quantitatively studies the computation throughput and memory bandwidth requirements of ConvNets. \cite{zhang2017improving, ma2017optimizing} present their own optimizations for ConvNets based on analytical performance models and achieve high throughput on VGG16 using their proposed design methodology with OpenCL. \cite{zhang2017frequency} performs convolution in the frequency domain to reduce the compute intensity of the ConvNet and demonstrates good power-performance results on VGG16, AlexNet, and GoogLeNet. \cite{nurvitadhi2017can} implements a ternary neural network on high-end Intel FPGAs and achieves higher performance per watt than a Titan X GPU. Most of the works mentioned above and others \cite{li2016high, aydonat2017opencl, wei2017automated} target inefficient ConvNets on middle- to high-end FPGA devices. For compact ConvNets, \cite{umuroglu2017finn} demonstrates a binary neural network (BNN) FPGA design that performs CIFAR10 classification at 21906 frames per second (FPS) with 283~$\mu$s latency on a Xilinx ZC706 device; the BNN reports an accuracy of 80.1%. \cite{zhao2017accelerating, nakahara2017fully} run BNNs on a smaller ZC7020 device. Although all three works achieve promising frame rates, they have not implemented larger neural networks for ImageNet classification. It should be noted that classification on CIFAR10 is orders of magnitude simpler than on ImageNet, since CIFAR10 contains 100x fewer classes, 26x fewer images, and 49x fewer pixels per image. Networks trained on CIFAR10 also have far lower complexity than those trained on ImageNet. In comparison, networks for ImageNet classification are closer to real-world applicability. \cite{qiu2016going} first attempted to deploy VGG16 for ImageNet classification on the embedded ZC7020 device and achieved a frame rate of 4.45 fps; \cite{guo2017software} later improved the frame rate to 5.7 fps. However, these frame rates are still relatively low for real-time image classification tasks. \cite{blott2018finnr, jiao2017accelerating, qiu2016going} achieve high frame rates on smaller devices; however, the accuracy of their networks is not on par with \cite{guo2017software} for ImageNet classification.

\section{ConvNet Design}
\label{NNDesign}

We discuss the ConvNet design in this section. The design of our ConvNet incorporates feedback from both the computer vision application and the hardware accelerator design. Specifically, an ideal ConvNet model for embedded FPGA acceleration should satisfy the following requirements: 1) the network should not contain too many parameters or FLOPs but should still maintain competitive accuracy; 2) the network structure should be hardware friendly to allow efficient scheduling; 3) the network's operation set should be simplified for efficient FPGA implementation; 4) the network's weights and activations should be quantized to low-precision fixed-point numbers without much accuracy loss.

\subsection{ShuffleNetV2}

We select ShuffleNetV2-1.0x \cite{ma2018shufflenet} as our starting point. ShuffleNetV2 is one of the state-of-the-art efficient models. It has a top-1 accuracy of 69.4% on ImageNet (2% lower than VGG16), but contains only 2.3M parameters (60x smaller than VGG16) and 146M FLOPs (109x smaller than VGG16).

The block-level structure of ShuffleNetV2 is illustrated in Fig.~\ref{fig:shufflenet-blocks}. The input feature map of the block is first split into two parts along the channel dimension. The first branch does nothing to the input data and directly feeds it to the output. The second branch performs a series of 1$\times$1 convolution, 3$\times$3 depth-wise convolution, and another 1$\times$1 convolution on the input. The outputs of the two branches are then concatenated along the channel dimension, and channel shuffle \cite{zhang1707shufflenet} is applied to exchange information between branches. In down-sampling blocks, depth-wise 3$\times$3 convolutions with a stride of 2 are applied to both branches to reduce the spatial resolution, and 1$\times$1 convolutions are used to double the channel size of the input feature maps. These blocks are cascaded to build a deep ConvNet. We refer readers to \cite{ma2018shufflenet} for the macro-structure of ShuffleNetV2.

We select ShuffleNetV2-1.0x not only because of its small model size and low FLOP count but also because it uses concatenative skip connections instead of additive skip connections. Additive skip connections, illustrated in Fig.~\ref{fig:add-vs-concat}(a), were first proposed in \cite{he2016deep}. They effectively alleviate the difficulty of training deep neural networks and therefore improve accuracy, and they are widely used in many ConvNet designs. However, additive skip connections are not efficient on FPGAs: as shown in Fig.~\ref{fig:add-vs-concat}(a), data from both the skip branch and the residual branch need to be fetched on-chip to perform the addition. Although the addition itself does not cost much computation, the data movement is expensive. Concatenative skip connections, illustrated in Fig.~\ref{fig:add-vs-concat}(b), were first proposed in \cite{huang2017densely}. They have a similar positive impact on training and accuracy. With concatenative skip connections, data from the skip branch already resides in off-chip DRAM, so we can concatenate the two branches simply by writing the residual branch data next to the skip branch data. This avoids the extra memory access of additive skip connections and alleviates the memory bandwidth pressure.
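To make the memory-layout argument concrete, the following numpy sketch (ours, not part of the original implementation) contrasts the two skip-connection styles; in the concatenative case the skip tensor is never re-read, only the residual tensor is written next to it.

\begin{verbatim}
import numpy as np

def concat_skip(skip, residual):
    # Concatenative skip: the result is formed by writing the residual
    # branch next to the skip branch along the channel axis. In DRAM this
    # is just an address offset for the writeback; the skip data stays put.
    return np.concatenate([skip, residual], axis=0)

def add_skip(skip, residual):
    # Additive skip: both operands must be fetched on-chip (or re-read from
    # DRAM) to compute the elementwise sum, which costs extra data movement.
    return skip + residual

skip = np.zeros((64, 28, 28), dtype=np.int8)      # illustrative shapes
residual = np.ones((64, 28, 28), dtype=np.int8)
assert concat_skip(skip, residual).shape == (128, 28, 28)
\end{verbatim}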

\subfigure[ShuffleNetV2 blocks \cite{ma2018shufflenet}.]{\includegraphics[width=.8]{figures/shufflenet-block.pdf}}
\subfigure[Our modified DiracDeltaNet blocks. We replace depth-wise convolutions with shift operations. In the downsampling blocks, we use stride-2 max-pooling and shift operations to replace stride-2 depth-wise convolutions. We also double the filter number of the first 1$\times$1 convolution on the non-skip branch in each module.]{\includegraphics[width=.8]{figures/dirac-block.pdf}}

Figure \thefigure: ShuffleNetV2 blocks vs. DiracDeltaNet blocks
\includegraphics[width=0.6]{figures/add-vs-concat.pdf}

Figure \thefigure: Additive Skip Connections vs. Concatenative Skip Connections. Rectangles represent data tensors.

\subsection{DiracDeltaNet}

Based on ShuffleNetV2, we build DiracDeltaNet through the following modifications: 1) we replace all 3$\times$3 convolutions with shift operations and 1$\times$1 convolutions; 2) we reduce the kernel size of max-pooling from 3$\times$3 to 2$\times$2; 3) we modify the order of channel shuffle.

We replace all 3$\times$3 convolutions and 3$\times$3 depth-wise convolutions with shift operations and 1$\times$1 convolutions. The motivation is that smaller convolution kernels require less reuse of the feature map, which leads to a simpler data movement schedule, simpler control flow, and easier timing closure. As pointed out by \cite{wu2017shift}, ConvNets rely on spatial convolutions (3$\times$3 convolutions and 3$\times$3 depth-wise convolutions) to aggregate spatial information from neighboring pixels to the center position. However, spatial convolutions can be replaced by a more efficient operator called shift. The shift operator aggregates spatial information by copying a nearby pixel directly to the center position, which is equivalent to shifting one channel of the feature map in a certain direction. When different channels are shifted in different directions, the output feature map encodes spatial information across its channels. A comparison between 3$\times$3 convolution and shift is illustrated in Fig.~\ref{fig:shift}, and a module containing a shift and a 1$\times$1 convolution is illustrated in Fig.~\ref{fig:shift-layer}.

For 3$\times$3 depth-wise convolutions, we directly replace them with shift operations, as shown in Fig.~\ref{fig:shufflenet-blocks}. This direct replacement can lead to some accuracy loss. To mitigate this, we double the output filter number of the first 1$\times$1 convolution on the non-skip branch. Nominally, doubling the output channel size increases both the FLOP count and the parameter size by a factor of 2. However, getting rid of 3$\times$3 convolutions allows us to design a compute unit customized for 1$\times$1 convolutions with higher execution efficiency than a comparable unit for 3$\times$3 depth-wise convolutions. In the downsampling block, we directly replace the strided 3$\times$3 depth-wise convolutions with stride-2 2$\times$2 max-pooling. Unlike \cite{wu2017shift}, our shift operation only uses 4 cardinal directions (up, down, left, right) in addition to the identity mapping (no shift). This simplifies the hardware implementation of the shift operation without hurting accuracy.

The first stage of ShuffleNetV2 consists of a 3$\times$3 convolution with a stride of 2 and 24 filters, followed by a 3$\times$3 max-pooling with a stride of 2. We replace these two layers with a module consisting of a series of 1$\times$1 convolution, 2$\times$2 max-pooling, and shift operations, as shown in Table~\ref{tab:macro}. Compared with the original 3$\times$3 convolution, our proposed module has more parameters (2144 vs. 648) and FLOPs (30.5M vs. 8.1M), but the implementation and execution cost of the proposed first stage on the accelerator is negligible compared to a 3$\times$3 convolution layer. After training the network, we find that this module gives nearly the same accuracy as the original 3$\times$3 convolution module. With this new module, we eliminate the remaining 3$\times$3 convolutions from our network, enabling us to allocate more computational resources to 1$\times$1 convolutions and thereby increase parallelism and throughput.

In addition to replacing all 3$\times$3 convolutions, we also reduce the max-pooling kernel size from 3$\times$3 to 2$\times$2. By using the same pooling kernel size as the stride, we eliminate the need to buffer extra data at the pooling kernel boundaries, thereby achieving better efficiency. Our experiments also show that reducing the max-pooling kernel size does not impact accuracy.
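As an illustration of the shift operator described above, here is a small numpy sketch (ours, for exposition only); the round-robin assignment of the five directions to channels is an assumption, not necessarily the exact assignment used in DiracDeltaNet.

\begin{verbatim}
import numpy as np

# Four cardinal moves plus the identity, as used in DiracDeltaNet.
DIRS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]   # (dy, dx)

def shift(x):
    """x: (C, H, W) feature map. Each channel is moved by one pixel in a
    direction chosen from its channel index; vacated positions become zero.
    A 1x1 convolution applied afterwards mixes the shifted channels."""
    c, h, w = x.shape
    out = np.zeros_like(x)
    for ch in range(c):
        dy, dx = DIRS[ch % len(DIRS)]       # assumed round-robin assignment
        src = x[ch, max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
        out[ch, max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = src
    return out
\end{verbatim}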
We also modify the channel shuffle's order to make it more hardware efficient. ShuffleNetV2 uses a transpose operation to mix channels from the two branches, as illustrated in Fig.~\ref{fig:shuffle}(a), where blue and red rectangles represent channels from different branches. Transpose-based shuffling is not hardware friendly because it breaks the contiguous data layout; performing channel shuffle in this manner requires multiple passes of memory reads and writes. We propose a more efficient channel shuffle, shown in Fig.~\ref{fig:shuffle}(b): we perform a circular shift of the feature map along the channel dimension. The same number of channels is exchanged between the two branches while the contiguity of the feature map is preserved, minimizing memory accesses.
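A numpy sketch of the two shuffle variants (ours, for exposition); the circular-shift amount of a quarter of the channels is chosen here so that the same number of channels crosses between the two halves as in the transpose-based shuffle, and is an illustrative assumption.

\begin{verbatim}
import numpy as np

def transpose_shuffle(x, groups=2):
    # ShuffleNetV2-style shuffle: reshape + transpose interleaves the two
    # branches but breaks channel contiguity, costing a gather/scatter pass.
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

def circular_shuffle(x):
    # HW-friendly variant: circular shift along the channel axis. Long
    # contiguous channel runs survive, so on the accelerator it reduces to
    # writing the output at a different address offset.
    c = x.shape[0]
    return np.roll(x, c // 4, axis=0)

x = np.arange(8 * 2 * 2).reshape(8, 2, 2)
print(transpose_shuffle(x)[:, 0, 0])   # interleaved channel order
print(circular_shuffle(x)[:, 0, 0])    # rotated channel order
\end{verbatim}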

\includegraphics[width=1.0]{figures/shuffle.pdf}

Figure \thefigure: Transpose Based Shuffle (ShuffleNetV2) vs. Our HW Efficient Shuffle (DiracDeltaNet)

We name the modified ShuffleNetV2-1.0x model DiracDeltaNet. The name comes from the fact that our network only contains 1$\times$1 convolutions: with a kernel size of 1, the kernel functions can be seen as discrete 2D Dirac delta functions. DiracDeltaNet's macro-structure is summarized in Table~\ref{tab:macro}. Stages 2, 3, and 4 consist of chained DiracDeltaNet blocks, as depicted in Fig.~\ref{fig:shufflenet-blocks}, with different feature map sizes, channel sizes, and strides. We adopt the training recipe and hyperparameters described in \cite{ma2018shufflenet}: we train DiracDeltaNet for 90 epochs with linear learning rate decay, an initial learning rate of 0.5, a batch size of 1024, and a weight decay of 4e-5. A comparison between ShuffleNetV2-1.0x and our DiracDeltaNet is summarized in Table~\ref{tab:vs}.
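A minimal PyTorch sketch of the training schedule above (the momentum value of 0.9 and the stand-in model are assumptions; everything else follows the hyperparameters stated in the text):

\begin{verbatim}
import torch

model = torch.nn.Conv2d(3, 32, kernel_size=1)   # stand-in for DiracDeltaNet
epochs = 90
optimizer = torch.optim.SGD(model.parameters(), lr=0.5,
                            momentum=0.9,        # assumed, following common practice
                            weight_decay=4e-5)
# Linear decay from the initial learning rate of 0.5 down to 0 over 90 epochs.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 - epoch / epochs)
\end{verbatim}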

Layer | Output size | Kernel size | Stride | #Repeat | Output channel
Image | 224 | - | - | - | 3
Conv1 | 224 | 1 | 1 | 1 | 32
Maxpool | 112 | 2 | 2 | 1 | 32
shift | 112 | 3 | 1 | 1 | 32
Conv2 | 112 | 1 | 1 | 1 | 64
Maxpool | 56 | 2 | 2 | 1 | 64
shift | 56 | 3 | 1 | 1 | 64
Stage 2 | 28 | - | 2 | 1 | 128
Stage 2 | 28 | - | 1 | 3 | 128
Stage 3 | 14 | - | 2 | 1 | 256
Stage 3 | 14 | - | 1 | 7 | 256
Stage 4 | 7 | - | 2 | 1 | 512
Stage 4 | 7 | - | 1 | 3 | 512
Conv5 | 7 | 1 | 1 | 1 | 1024
GlobalPool | 1 | 7 | - | 1 | 1024
FC | - | - | - | 1 | 1000
Table \thetable: Macro-structure of DiracDeltaNet
MACs #Params Top-1 acc Top-5 acc
ShuffleNetV2-1.0x 146M 2.3M 69.4% -
DiracDeltaNet 330M 3.3M 68.9% 88.7%
Table \thetable: ShuffleNetV2-1.0x vs. DiracDeltaNet
\includegraphics[width=1.0]{figures/shift.pdf}

Figure \thefigure: 3$\times$3 Convolution vs. Shift. In a 3$\times$3 convolution, pixels in a 3$\times$3 region are aggregated to compute one output pixel at the center position. In the shift operation, a neighboring pixel is directly copied to the center position.
\includegraphics[width=0.75]{figures/shift-layer.pdf}

Figure \thefigure: Using shift and 1$\times$1 convolutions to replace 3$\times$3 convolutions. This figure is from \cite{wu2017shift}.

\subsection{ConvNet Quantization}

To further reduce the cost of DiracDeltaNet, we apply quantization to convert floating-point weights and activations to low-precision integer values. For network weights, we follow DoReFa-Net \cite{zhou2016dorefa} to quantize full-precision weights as

\begin{equation}\label{eq:dorefa}
w_q = 2\,\mathrm{quantize}_k\!\left(\frac{\tanh(w)}{2\max(|\tanh(w)|)} + \frac{1}{2}\right) - 1
\end{equation}

Here, $w$ denotes the latent full-precision weight of the convolution kernel, and $\mathrm{quantize}_k(\cdot)$ is a function that quantizes its input in the range $[0, 1]$ to the nearest neighbor in $\{\frac{i}{2^k-1} \mid i = 0, \dots, 2^k-1\}$. We follow PACT \cite{choi2018pact} to quantize each layer's activation as

\begin{equation}\label{eq:pact}
y^l = \mathrm{PACT}(x^l) = \tfrac{1}{2}\left(|x^l| - |x^l - \alpha^l| + \alpha^l\right), \qquad
y_q^l = \mathrm{quantize}_k\!\left(\frac{y^l}{\alpha^l}\right) \cdot \alpha^l
\end{equation}

Here, $x^l$ is the activation of layer $l$, and $\mathrm{PACT}(\cdot)$ clips the activation to the range $[0, \alpha^l]$, where $\alpha^l$ is a layer-wise trainable upper bound determined by the training of the network. We observe that $\alpha^l$ can sometimes become negative during training, which breaks the PACT \cite{choi2018pact} function. To ensure the bound is always positive and to increase training stability, we use the absolute value of the trainable parameter rather than its original value. $y^l$ is the clipped activation of layer $l$, and it is further quantized to $y_q^l$, a k-bit activation tensor. Note that activations from the same layer share the same floating-point coefficient, but activations from different layers can have different coefficients. This is problematic for the concatenative skip connection: if the coefficients of the two branches are different, we need to first cast both activations from fixed point to floating point, re-calculate a coefficient for the merged activation, and quantize it again to new fixed-point numbers. This process is very inefficient. In our experiments, we notice that most of the layers in DiracDeltaNet have similar coefficient values. Therefore, we rewrite equation (\ref{eq:pact}) as

\begin{equation}\label{eq:shared}
y_q^l = \mathrm{quantize}_k\!\left(\frac{y^l}{\alpha}\right) \cdot \alpha
\end{equation}

where $\alpha$ is a coefficient shared by the entire network. This ensures that activations from different layers of the network are quantized and normalized to the same scale. As a result, we can concatenate activations from different layers directly without extra computation. Moreover, by using the same coefficient across the entire network, the convolution can be computed entirely with fixed-point operations. The coefficient $\alpha$ can either be fixed before training or left trainable. A general rule is that $\alpha$ should be close to the $\alpha^l$ values of the individual layers; if $\alpha$ is either too small or too large, it can cause gradient vanishing or exploding problems during training, which leads to worse accuracy.

In our network, we merge the PACT function and activation quantization into one module, which we name ActQuant. The input to ActQuant is the output of a 1$\times$1 convolution. Since the input and weights of the convolution are both quantized to fixed-point integers, the output is also an integer. ActQuant is therefore implemented as a look-up table whose parameters are determined during training and fixed during inference.

We follow \cite{Zhuang2017progressive} to quantize the network progressively from full precision to the desired low precision. The process is illustrated in Fig.~\ref{fig:grid}, where the x-axis denotes the bit-width of the weights and the y-axis denotes the bit-width of the activations. We start from the full-precision network, train it to convergence, and follow a path that progressively reduces the precision of the weights or the activations. At each point, we fine-tune the network for 50 epochs with step learning rate decay. Formally, we denote each point in the grid as a quantization configuration $(w, a, \mathcal{N}_{w,a})$, where $w$ is the bit-width of the weights, $a$ is the bit-width of the activations, and $\mathcal{N}_{w,a}$ is the network containing the quantized parameters. The starting configuration is the full-precision network. From a given configuration, one can either go down in the grid to quantize the activations or go right to reduce the bit-width of the weights; more aggressive steps can be taken diagonally or even across several grid cells. The two-stage and progressive optimization methods proposed in \cite{Zhuang2017progressive} can be represented as two paths in Fig.~\ref{fig:grid}. In our work, we start from the full-precision configuration, use it to initialize the next configuration on the path, and apply fine-tuning with step learning rate decay to recover the accuracy loss caused by quantization. After several epochs of fine-tuning, we obtain the desired lower-precision configuration with no accuracy loss. Following this procedure, we go diagonally in the quantization grid down to 4-bit weights and 4-bit activations with less than 1% top-5 accuracy loss compared to the full-precision counterpart.
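The following PyTorch-style sketch (ours; the straight-through estimator needed to train through the rounding is omitted) summarizes the weight and activation quantizers defined by equations (\ref{eq:dorefa}) and (\ref{eq:shared}):

\begin{verbatim}
import torch

def quantize_k(x, k=4):
    # Round a tensor in [0, 1] to the nearest multiple of 1/(2^k - 1).
    n = float(2 ** k - 1)
    return torch.round(x * n) / n

def quantize_weight(w, k=4):
    # DoReFa-Net style weight quantization (the weight equation above).
    w = torch.tanh(w) / (2 * torch.max(torch.abs(torch.tanh(w)))) + 0.5
    return 2 * quantize_k(w, k) - 1

def act_quant(x, alpha, k=4):
    # PACT clipping with a single network-wide coefficient alpha (the
    # shared-alpha equation above); |alpha| mirrors the absolute-value
    # trick described in the text.
    a = torch.abs(alpha)
    y = 0.5 * (torch.abs(x) - torch.abs(x - a) + a)   # clip to [0, a]
    return quantize_k(y / a, k) * a                   # k-bit code times shared scale
\end{verbatim}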

\includegraphics[width=0.9]{figures/grid.pdf}

Figure \thefigure: Quantization Grid
full w4a4
Top-1 Acc 68.9% 68.3%
Top-5 Acc 88.7% 88.1%
Table \thetable: Quantization Result on DiracDeltaNet

We use a pre-trained ResNet50 label-refinery \cite{bagherinezhad2018label} to boost the accuracy of the quantized model. Even with such low-precision quantization, our quantized model still preserves a very competitive top-5 accuracy of 88.1%, whereas most previous quantization works \cite{choi2018pact, Zhuang2017progressive, zhou2016dorefa} are only effective on large models such as VGG16, AlexNet, or ResNet50. Our quantization results are summarized in Table~\ref{tab:quant}.

\section{Hardware Design}

As mentioned in Section~\ref{NNDesign}, we aggressively simplified ShuffleNetV2's operator set. Our modified network is mainly composed of the following operators:

  • 1$\times$1 convolution

  • 2$\times$2 max-pooling

  • shift

  • shuffle and concatenation

Our accelerator, Synetgy, is tailored to support only the operators above. This allows us to design more specialized compute units with simpler control, which further improves hardware efficiency. The computation of the fully-connected layer is mapped onto our convolution unit. The shuffle operation is not fully supported on the FPGA: a CPU-based memory copy is needed to maintain the memory layout. The remaining average-pooling layer, which is not supported on the FPGA, is offloaded to the ARM processor on the SoC platform. The simplified operator set is a direct benefit of algorithm-hardware co-design, and it also increases the productivity of the hardware implementation: the accelerator implementation only took two people one month using HLS.

\subsection{The accelerator architecture}

Fig.~\ref{fig:arch} shows the overall accelerator architecture. Our accelerator, highlighted in light yellow, can be invoked by the CPU to compute one Conv-Pooling-Shift-Shuffle subgraph at a time. The CPU provides supplementary support to the accelerator; both the FPGA and the CPU are used to run the network.

\includegraphics[width=1]{figures/AccelAchitectureNew.pdf}

Figure \thefigure: Accelerator Architecture

In the quantized DiracDeltaNet, weights are 4-bit, input and output activations are 4-bit, and the largest partial sum is 17-bit. The width of the partial sum is determined by the operand bit-widths and the largest channel size: with up to 512 input channels, the accumulated dot product of 4-bit weights and 4-bit activations requires 17 bits to represent.
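As a rough check (assuming the 4-bit weights take signed values of magnitude at most 8 and the 4-bit activations are unsigned values up to 15):

\begin{equation*}
512 \times 15 \times 8 = 61440 < 2^{16},
\end{equation*}

so the accumulated magnitude fits in 16 bits, and one additional sign bit gives the 17-bit partial sum quoted above.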

Notation | Type | Description
WIDTH | variable | width of feature map
HEIGHT | variable | height of feature map
IC_TOTAL | variable | total input channel size
OC_TOTAL | variable | total output channel size
IC | constant: 32 | parallelism on input channel dimension
OC | constant: 32 | parallelism on output channel dimension

Table \thetable: Notations

\subsubsection{Dataflow Architecture}

Our hardware design is based on the dataflow architecture template \cite{cheng2016high, vivado2018ug}. As illustrated in Fig.~\ref{fig:arch}, we first extract a few process functions from the major operations, including convolution, max-pooling, shift, shuffle, and the memory load and store. We then chain them together using FIFOs with blocking reads and non-blocking writes (a write blocks only once the FIFO is full). All the process functions run concurrently, and the execution of each function is triggered by the arrival of data. Therefore, task-level parallelism is explicitly exposed to the HLS tool in addition to instruction-level parallelism.
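A software analogue of this dataflow template (ours, not the HLS source): each stage is a free-running process connected to its neighbors by bounded FIFOs, so stages overlap automatically and back-pressure is applied when a FIFO fills up.

\begin{verbatim}
import threading, queue

def stage(fn, q_in, q_out):
    while True:
        item = q_in.get()                  # blocking read from the input FIFO
        if item is None:                   # end-of-stream marker
            if q_out is not None:
                q_out.put(None)
            return
        result = fn(item)
        if q_out is not None:
            q_out.put(result)              # blocks only when the FIFO is full

conv, pool, shift, store = (lambda x: x,) * 4   # stand-ins for the real kernels
fifos = [queue.Queue(maxsize=4) for _ in range(4)]
for i, fn in enumerate((conv, pool, shift, store)):
    q_out = fifos[i + 1] if i + 1 < len(fifos) else None
    threading.Thread(target=stage, args=(fn, fifos[i], q_out), daemon=True).start()

for tile in range(8):                      # the "load" process feeds the pipeline
    fifos[0].put(tile)
fifos[0].put(None)
\end{verbatim}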

\subsubsection{Convolution Unit}

The notations used in this section are listed in Table~\ref{tab:notation}. As shown in Fig.~\ref{fig:conv1x1}, given an input feature map of size WIDTH $\times$ HEIGHT $\times$ IC_TOTAL and a weight kernel of size 1 $\times$ 1 $\times$ IC_TOTAL $\times$ OC_TOTAL, the 1$\times$1 convolution generates an output feature map of size WIDTH $\times$ HEIGHT $\times$ OC_TOTAL. The 1$\times$1 convolution is essentially a matrix-matrix multiplication.
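A quick numpy check of this equivalence (ours; the sizes are arbitrary examples):

\begin{verbatim}
import numpy as np

ic, oc, h, w = 64, 128, 28, 28
x = np.random.randn(ic, h, w)                 # input feature map
wt = np.random.randn(oc, ic)                  # 1x1 kernel, viewed as a matrix

y_conv = np.einsum('oi,ihw->ohw', wt, x)      # per-pixel dot products
y_mm = (wt @ x.reshape(ic, h * w)).reshape(oc, h, w)   # one matrix-matrix multiply
assert np.allclose(y_conv, y_mm)
\end{verbatim}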

\includegraphics[width=3in]{figures/1x1Conv.pdf}

Figure \thefigure: 1$\times$1 Convolution

Although \cite{kwon2018co} suggests a weight-stationary dataflow for ConvNets dominated by 1$\times$1 convolutions, we find it not applicable to our design because the bit-width of the weights is much smaller than that of the partial sums (4 bits vs. 17 bits): transferring the partial sums on and off chip would incur more traffic on the memory bus. Therefore, we adopt an output-stationary dataflow and retain the partial sums in the local register file until an output feature is produced.

\includegraphics[width=0.42]{figures/CodeSnippetNew.pdf}

Figure \thefigure: Pseudo Code for Kernel Compute Scheduling

Fig.~\ref{fig:schedule} shows how we schedule the workload onto the accelerator. Note that the nested loops starting at lines 17 and 19 are automatically unrolled. Weights are prefetched into on-chip BRAM. We first block our inputs so that IC $\times$ OC multiplications can be mapped onto the compute units in each iteration (lines 13-21). In every iteration, IC input features are fetched from DRAM; they are convolved with OC weight vectors of length IC and accumulated into OC partial sums. According to the Vivado HLS report, each iteration of the loop nest along the input channel dimension at line 12 completes its IC $\times$ OC 4-bit $\times$ 4-bit multiplications in a small, fixed number of cycles. The partial sums are stored in registers, which can be accessed simultaneously in every cycle. The parameters IC and OC were tuned for the area-performance tradeoff: increasing them increases overall resource utilization but reduces the total number of execution cycles.

Based on the roofline model \cite{williams2009roofline}, the attainable throughput is the compute-to-communication (CTC) ratio multiplied by the bandwidth when the design is bandwidth bound. The CTC ratio of our compute unit with respect to the input features is OC_TOTAL (at most 512 in DiracDeltaNet), since every input feature fetched from DRAM is reused across all output channels; a larger output channel size therefore means a higher CTC ratio. According to our measurement, the maximum bandwidth of the DDR channel is 6 GB/s, which corresponds to 12 Giga input features per second (one byte holds two 4-bit features). For compute-bound problems, the attainable throughput is determined by the compute capability, i.e., the IC $\times$ OC parallelism and the clock frequency. Based on this analysis, the convolution unit reaches the bandwidth bound before it hits the computation roofline.
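The following numpy sketch (ours, mirroring the loop structure in the pseudo-code figure rather than the HLS source) illustrates the output-stationary schedule: an IC $\times$ OC tile of multiplications per step, with partial sums held locally until a full output tile is written back.

\begin{verbatim}
import numpy as np

IC, OC = 32, 32                              # hardware parallelism (see notations table)

def conv1x1_output_stationary(x, wts):
    """x: IC_TOTAL x (WIDTH*HEIGHT) flattened features, wts: OC_TOTAL x IC_TOTAL.
    Assumes channel counts are multiples of the tile sizes."""
    ic_total, n_pix = x.shape
    oc_total = wts.shape[0]
    y = np.zeros((oc_total, n_pix), dtype=np.int64)
    for p in range(n_pix):                            # pixel loop
        for oc0 in range(0, oc_total, OC):            # output-channel tiles
            acc = np.zeros(OC, dtype=np.int64)        # partial sums stay in registers
            for ic0 in range(0, ic_total, IC):        # stream input-channel tiles
                acc += wts[oc0:oc0+OC, ic0:ic0+IC] @ x[ic0:ic0+IC, p]
            y[oc0:oc0+OC, p] = acc                    # one writeback per output tile
    return y

x = np.random.randint(0, 16, size=(64, 49))           # 4-bit activations
w = np.random.randint(-8, 8, size=(128, 64))          # 4-bit weights
assert np.array_equal(conv1x1_output_stationary(x, w), w @ x)
\end{verbatim}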

\includegraphics[width=0.45]{figures/LayoutNew.pdf}

Figure \thefigure: Input Layout in DRAM

\subsubsection{Conversion Unit}

The high-bitwidth to low-bitwidth conversion is performed immediately after the kernel computation. It is a step function with 16 intervals that converts a 17-bit partial sum into a 4-bit activation. The threshold values differ from layer to layer; all of the read-only threshold values are stored in on-chip BRAMs, and an index specified by the user function selects which set of thresholds to use for the current layer. In hardware, this unit is implemented with 16 comparators mapped onto a binary tree structure to reduce circuit latency.
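Functionally, the conversion unit behaves like the following lookup (a sketch of ours; np.searchsorted plays the role of the comparator tree, and the threshold values are made up for the example):

\begin{verbatim}
import numpy as np

def convert(psum, thresholds):
    """psum: integer partial sums; thresholds: sorted per-layer interval
    boundaries. The returned interval index is the 4-bit activation code."""
    return np.searchsorted(thresholds, psum, side='right').astype(np.uint8)

thresholds = np.arange(15) * 100                         # illustrative boundaries
print(convert(np.array([-5, 0, 250, 10000]), thresholds))  # interval indices 0, 1, 3, 15
\end{verbatim}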

\subsubsection{Pooling Unit}

We adopt the line buffer design described in \cite{zhao2017accelerating} to implement the max-pooling layer. In every iteration, a row of WIDTH pixels is first fetched into the line buffer. Once the next pixel values arrive, a 2$\times$2 sliding window is formed. Every 2 cycles, we compare the values in the sliding window, output the largest one, and fetch the next 2 values. This repeats until the whole feature map has been processed.

\subsubsection{Shift Unit}

The line buffer design is also used for the shift operation. In the shift unit, the input feature maps are first zero-padded by one pixel along the width and height dimensions. Rows of pixels are then buffered and a sliding window is formed. The shift direction differs between input channels and is calculated from the input channel index. After initialization, the unit produces one output pixel per cycle.

\subsubsection{Shuffle Unit}

Shuffle is implemented by changing the address offset of output features during the writeback phase. Since the shuffle operation still requires us to concatenate the outputs of the previous DiracDeltaNet block with the outputs of the current block, the CPU is used to copy the previous block's output to the shuffled addresses. This memory copy should be done concurrently with the computation of the current DiracDeltaNet block.

\subsubsection{Fully Connected Unit}

We do not design a dedicated unit to compute the FC layer. Instead, we map the FC computation onto the existing convolution unit, for which the feature map size is simply 1. Since the convolution unit only supports 4-bit weights, the FC layer's computation is mapped in a bit-serial-like manner: the convolution unit processes each bit of the FC weights iteratively, and the required bit shift is performed by configuring the step function in the conversion unit.
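A small numpy sketch of this bit-serial mapping (ours; the unsigned 8-bit FC weights and 4-bit activations are illustrative assumptions): the weight matrix is split into bit-planes, each bit-plane is processed like a 1-bit-weight convolution, and the partial results are recombined with left shifts, which the conversion unit's step function absorbs in hardware.

\begin{verbatim}
import numpy as np

def fc_bit_serial(x, w_uint8, n_bits=8):
    acc = np.zeros(w_uint8.shape[0], dtype=np.int64)
    for b in range(n_bits):
        w_b = (w_uint8 >> b) & 1          # one bit-plane of the weights
        acc += (w_b @ x) << b             # recombine with a bit shift
    return acc

w = np.random.randint(0, 256, size=(1000, 1024))   # unsigned 8-bit FC weights
x = np.random.randint(0, 16, size=1024)            # 4-bit activations
assert np.array_equal(fc_bit_serial(x, w), w @ x)
\end{verbatim}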

\subsection{Software}

We use the ARM processor to control the layer-based accelerator and to compute the final average-pooling layer, which is not supported by the accelerator. The host application runs on a full Linux system on the ARM CPU and controls the memory-mapped accelerator through the UIO driver interface. The Xilinx Python-based PYNQ APIs \cite{xilinx2018pynq} are used for fast deployment of the host software on the Ultra96 board.

\subsection{Experimental Results}

We implement our accelerator, Synetgy, on the Ultra96 development board, which features a Xilinx Zynq UltraScale+ MPSoC targeted at embedded applications. Table~\ref{tab:resource} shows the overall resource utilization of our implementation. We use 34% of the total LUTs on the FPGA, as the 4-bit $\times$ 4-bit multiplications are mapped onto LUTs. BRAMs are mainly used to implement the FIFO channels, and DSPs are used for the address calculation of the AXI protocol. Our implementation runs at 250 MHz. Power is measured with a power monitor: we measure 5.3 W with no workload running on the programmable logic and a maximum of 5.5 W on the Ultra96 power supply line when running our network.

LUT FF BRAM DSP
24130 (34.2%) 29867 (21.2%) 170 (78.7%) 37 (10.3%)
Table \thetable: Resource Usage
VGG-SVD~\cite{qiu2016going} AlexNet~\cite{liang2018fp} VGG16~\cite{suda2016throughput} VGG16~\cite{guo2017software} DoReFa~\cite{jiao2017accelerating} FINN-R~\cite{blott2018finnr} Ours
Platform Zynq XC7Z045 Stratix-V Stratix-V Zynq 7Z020 Zynq 7Z020 Zynq ZU3EG Zynq ZU3EG
Frame Rate (fps) 4.5 864.7 3.8 5.7 106.0 200.0 96.5
Top-1 Acc 64.64% 42.90% 66.58% 67.72% 46.10% 50.3% 68.30%
Top-5 Acc 86.66% 66.80% 87.48% 88.06% 73.10% N/A 88.12%
Precision 16b 16b 8-16b 8b 2b 1-2b 4-4b
Throughput (GOPs) 136.97 1963.96 117.80 123 410.22 400 47.09 (overall) / 418 (peak)
Frequency(MHz) 150 150 120 214 200 220 250
Power(W) 3.0 26.2 19.1 3.0 2.3 10.2 5.5
Table \thetable: Performance Comparison of Synetgy and Previous Works
Batch Size 1 2 4 8 10 16
Frame Rate (fps) 58.7 72.9 84.1 94.4 95.9 96.5
Table \thetable: Frame Rate on Different Batch Size

We compare our accelerator against previous work in Table~\ref{tab:comparison}. As explained above, ConvNets for ImageNet classification are usually orders of magnitude more complex than those for CIFAR10 classification, so we only compare against accelerators targeting ImageNet ConvNets with reasonable accuracy. Our work focuses on achieving competitive accuracy while improving the actual inference speed in terms of frames per second, and our experiments show that we achieve both goals. From the table, we make the following observations: 1) Synetgy achieves the highest top-1 and top-5 accuracy on ImageNet. The only previous work that comes close to our accuracy is \cite{guo2017software}, but its frame rate is 16.9x slower than ours. 2) Among the embedded accelerators whose top-1 accuracy is higher than 60%, which is a loose constraint, our design achieves the fastest inference speed. 3) Without the accuracy constraint, the speed of \cite{liang2018fp, jiao2017accelerating, blott2018finnr} can go as high as 864.7 frames per second, but their accuracy is rather low. 4) The peak attainable throughput of our accelerator is 418 GOPs, which is close to the theoretical compute roofline. Our average throughput (47.09 GOPs) is currently limited by low hardware utilization. The inefficiency mainly comes from the software shuffle operations and from the first convolution layer, whose input channel dimension of 3 is much smaller than the hardware tiling factor IC = 32. Nevertheless, Synetgy still achieves a competitive frame rate, demonstrating the efficacy of our co-design methodology, and we see opportunities for significant further frame rate improvement through continued algorithm-hardware co-design.

The reported frame rate is achieved with the batch size set to 16. There is a fixed software overhead for invoking the poll-based hardware accelerator. The computation latency of DiracDeltaNet Block1 in Table~\ref{tab:blocks} is 0.15 ms at batch size 1, while the latency of a single read of the accelerator control register is 0.40 ms, which is greater than the actual compute time. To minimize this software overhead, we increase the batch size so that more computation runs on the accelerator per invocation; in addition, the weights stored in on-chip BRAM are reused more when the batch size is increased. The frame rates for different batch sizes are summarized in Table~\ref{tab:batch}.

We break down the runtime of the whole heterogeneous system by bypassing one component at a time and measuring the resulting runtime. The results are shown in Table~\ref{tab:breakdown}. The whole system runs at 95.9 FPS on ImageNet classification at a batch size of 10, including both hardware PE execution and the software execution of average pooling and shuffle. The table shows that the CPU-based memory copy for the shuffle operation significantly degrades performance, while all other non-convolution components affect the overall performance only slightly.

To further understand the efficiency of the various operators (1$\times$1 convolution, 2$\times$2 max-pooling, shift, and shuffle) implemented on the FPGA and the CPU, we measure the runtime of DiracDeltaNet blocks with different configurations on Synetgy. The results are summarized in Table~\ref{tab:blocks}. We test 2 blocks with different input feature map and channel sizes; note that the theoretical OPs of Block1 and Block2 are the same. As shown in the table, pooling and shift incur almost no performance drop, because the process functions performing these operations do not impose new bottlenecks on the dataflow pipeline.
The software memory copy latency of shuffle is more significant for Block1 than for Block2. This is because the memory copy overhead is proportional to the size of the feature map (WIDTH $\times$ HEIGHT $\times$ channels), while the total OPs of the two blocks are the same; the block with the smaller feature map therefore needs less time for the memory copy. The memory copy overhead could possibly be alleviated by running bare-metal C code on the CPU.
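As a rough consistency check (ours), taking the conv-only rows of the block-level table below as the compute baseline, the shuffle copy time scales with the feature map volume roughly as expected:

\begin{equation*}
\frac{28^2 \times 128}{7^2 \times 512} = 4
\qquad \text{vs.} \qquad
\frac{4.409 - 1.531}{1.636 - 0.989} \approx 4.4 .
\end{equation*}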

Runtime (ms) Frame Rate (fps)
Overall 104.3 95.9
w/o sw avg pool 100.3 99.7
w/o fc 104.0 96.1
w/o PYNQ API call 104.2 96.0
w/o sw shuffle 70.4 142.1
hw only 65.7 152.2
Table \thetable: Runtime Latency for Different Functional Parts of the Whole System (Batch=10)
Runtime (ms)
Block1 Block2
feature map size 28 7
in&out channel 128 512
conv only 1.531 0.989
conv+pool 1.530 0.993
conv+shift 1.537 0.996
conv+shuffle 4.409 1.636
overall 4.364 1.441
Table \thetable: Runtime Analysis for the First and Last DiracDeltaNet Blocks in Different Operator Configurations (Batch=10)

\section{Conclusion and Future Work}

In this paper, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracDeltaNet. Based on ShuffleNetV2, we optimize the network's operators by replacing all 3$\times$3 convolutions with shift operations and 1$\times$1 convolutions, which allows us to build a compute unit exclusively customized for 1$\times$1 convolutions for better efficiency. We quantize the network's weights and activations to 4-bit fixed-point numbers with less than 1% accuracy loss; this quantization exploits the strengths of FPGA hardware. As a result, DiracDeltaNet has a small parameter size, a low operation count, hardware-friendly skip connections, low precision, and simplified operators, which together allow us to implement a highly customized and efficient accelerator on the FPGA. We implement the network on the Ultra96 SoC platform; the implementation only took two people one month using HLS tools. Our accelerator, Synetgy, achieves a top-5 accuracy of 88.1% on ImageNet, the highest among all previously published embedded FPGA accelerators, and reaches an inference speed of 96.5 FPS, surpassing prior works with similar accuracy by at least 16.9x. While we see many opportunities for further optimization, we believe these results demonstrate the efficacy of our co-design methodology.

For future work, we will focus on further optimization. For example, we can add more layers to the dataflow architecture to improve the compute-to-communication ratio; correspondingly, we will need to adjust the network so that the computation subgraphs are more symmetric.

\begin{acks}
We would like to thank all of the people who helped us realize this project, especially the anonymous reviewers, Kostadin Ilov, Rock Qu, Alessandro Pappalardo, Amir Gholaminejad, Peter Jin, Ravi Krishna, and Alvin Wan. The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849. The research was partially funded by ADEPT Lab industrial sponsor Intel, and ADEPT Lab affiliates Google, Siemens, and SK Hynix. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
\end{acks}
