SkyNet: A Hardware-Efficient Method for Object Detection and Tracking on Embedded Systems


1 Introduction

Edge AI applications require not only high inference accuracy from DNNs, but also aggressive inference speed, throughput, and energy efficiency to meet real-life demands. These applications rely on hardware-efficient DNN designs when they are deployed into embedded systems with extremely limited computation and memory resources. Recently, we have seen intensive studies on DNN accelerators in hardware, which attempt to take advantage of different hardware design styles, such as GPUs, FPGAs and ASICs, to improve the speed and efficiency of DNN inference and training processes Qiu et al. (2016); Chen et al. (2016); Zhang et al. (2017a); Jouppi et al. (2017); Franklin (2017); Zhang et al. (2018b); Li et al. (2019b); Chen et al. (2019).

Although hardware accelerators can be helpful, they are still limited by the available resources when handling varied real-life applications, especially in embedded systems, since most DNNs are not originally designed to be hardware-efficient. As a result, developers have started to focus their optimization efforts on the software side, compressing DNNs to lower their complexity, computation demands, and memory footprints. Recent research has demonstrated the possibility of using low bit-width data to represent the original floating-point parameters, for example with binary and ternary networks Courbariaux et al. (2016); Rastegari et al. (2016); Li et al. (2016); Tschannen et al. (2018); Wang et al. (2018a); Gope et al. (2019). These solutions replace hardware-intensive floating-point multiplications with logical operations, so that DNNs become more efficient on hardware platforms.

Figure 1: A top-down design flow for hardware-efficient DNN deployment on resource-constrained embedded systems. Challenges appear between steps 2 and 3, where iterative explorations are necessary to balance DNN accuracy and performance on the targeted devices.

Researchers have also investigated network pruning strategies to reduce the redundancy of DNN structures Han et al. (2015, 2016); Luo et al. (2017). In published pruning strategies, the relatively less important connections between DNN layers are discarded, and network retraining is then performed to regain accuracy. Significant reductions can be achieved on classic DNNs, such as AlexNet Krizhevsky et al. (2012) and VGG-16 Simonyan and Zisserman (2014). Since the major benefit of network compression comes from the fully-connected (FC) layers, keeping pruning effective on later DNNs with fewer FC layers (e.g., GoogleNet Szegedy et al. (2015) and ResNet He et al. (2016)) requires more sophisticated algorithms to be integrated into network pruning, such as evolutionary algorithms Dai et al. (2019), the alternating direction method of multipliers (ADMM) Ren et al. (2019), and iterative pruning Ding et al. (2018).

As most of the computations happen inside the convolutional (Conv) layers, previous works also attempt to reduce the computation complexity by using depth-wise separable Conv layers for image classification and ubiquitous keyword-spotting applications Howard et al. (2017); Zhang et al. (2017b). The depth-wise separable structure can effectively reduce the number of operations and provide more compact DNN designs for resource-constrained hardware. To further improve the DNN deployment on hardware, layer fusion is proposed in Alwani et al. (2016) to minimize data movements between on-chip and off-chip memory.

In general, a traditional design process for hardware-efficient DNNs can be summarized as in Figure 1, adopting the above-mentioned technologies. It is a top-down design flow which starts from step 1: selecting a reference DNN by concentrating on accuracy. Such DNNs are typically too complicated for the targeted embedded systems and need to be compressed using software and hardware optimizations in steps 2 and 3, respectively. Since software compression and hardware implementation are typically carried out separately, steps 2 and 3 are usually performed in an iterative manner to balance DNN accuracy and hardware performance on the targeted devices. Network retraining is also required to regain accuracy after compression, before step 4. Because of the iterative nature of this process, it is very challenging to cover both inference accuracy in software and deployment efficiency in hardware.

In this paper, we address the hardware-efficient DNN design problem by proposing SkyNet, which is designed following a bottom-up DNN design approach with comprehensive awareness of hardware constraints. The main contributions of this paper are summarized as follows:

  • We survey the latest low power object detectors for embedded systems and identify potential obstacles in top-down DNN design flows which may prevent further improvements in DNN accuracy and hardware efficiency.

  • We propose a bottom-up design strategy for hardware-efficient DNNs targeting both embedded GPUs and embedded FPGAs; using this design method, we propose SkyNet, which has comprehensive awareness of hardware limitations and overcomes the challenges of the top-down design flow.

  • We demonstrate SkyNet in DAC-SDC'19 on both the TX2 GPU and the Ultra96 FPGA with state-of-the-art accuracy. SkyNet achieved the highest overall score regarding accuracy, throughput, and energy efficiency, and won the first place award in both the GPU and FPGA tracks.

  • We extend SkyNet to object tracking. Using SkyNet as the backbone DNN, SiamRPN++ and SiamMask obtain 1.60X and 1.73X speedups with better or similar accuracy, and a 37.20X smaller parameter size, compared to the original ResNet-50 backbone when running on a 1080Ti GPU.

2 Related Work


Rank | GPU-Track | Reference DNN | Optimizations
'19 2nd | Thinker Xiong et al. (2019) | ShuffleNet + RetinaNet | ① ② ③ ⑨
'19 3rd | DeepZS Deng et al. (2019) | Tiny YOLO | software not clear; ⑨
'18 1st | ICT-CAS Lu et al. (2018) | Tiny YOLO | ① ② ③ ④; hardware not clear
'18 2nd | DeepZ Deng and Zhuo (2018) | Tiny YOLO | software not clear; ⑨
'18 3rd | SDU-Legend Zang et al. (2018) | YOLOv2 | ① ② ③ ⑨

Rank | FPGA-Track | Reference DNN | Optimizations
'19 2nd | XJTU Tripler Zhao et al. (2019) | ShuffleNetV2 + YOLO | ② ③ ⑤ ⑥ ⑧
'19 3rd | SystemsETHZ Kara and Alonso (2019) | SqueezeNet + YOLO | ① ② ③ ⑦
'18 1st | TGIIF Zeng et al. (2018) | SSD | ① ② ③ ⑤ ⑥
'18 2nd | SystemsETHZ Kara et al. (2018) | SqueezeNet + YOLO | ① ② ③ ⑦
'18 3rd | iSmart2 Hao et al. (2018) | MobileNet + YOLO | ① ② ③ ⑤ ⑦

Table 1: DAC-SDC winning entries from both GPU and FPGA tracks. They follow a top-down approach, from choosing reference DNNs to applying optimization strategies on the software and hardware sides, so that they compress DNNs with improved hardware efficiency. Optimizations include: ① input resizing, ② network pruning, ③ data quantization, and ④ TensorRT Vanholder (2016) on the software side; and ⑤ CPU-FPGA task partition, ⑥ double-pumped DSP, ⑦ fine-grained pipeline, ⑧ clock gating, and ⑨ multithreading on the hardware side.

Recent state-of-the-art object detectors feature DNN backbones to extract input features. Researchers initially proposed two-stage approaches which first output multiple region proposals for object candidates and then generate more accurate regions with corresponding class labels Dai et al. (2016); Lin et al. (2017a); He et al. (2017); Cheng et al. (2018a, b); Li et al. (2019c). To improve detection speed, one-stage approaches were proposed to simultaneously regress object locations and classes Sermanet et al. (2014); Redmon et al. (2016); Liu et al. (2016); Lin et al. (2017b); Shen et al. (2019); Tian et al. (2019). Object tracking also relies on the features extracted by DNN backbones, and recent Siamese-network-based trackers formulate tracking as feature cross-correlation between the exemplar image and the search region Tao et al. (2016); Valmadre et al. (2017); Wang et al. (2018b); Li et al. (2019a); Wang et al. (2019). These emerging methods make real-time object detection and tracking possible on desktop GPUs, but they still need aggressive compression before deployment onto embedded systems.

2.1 Low-Power Object Detectors

Nowadays, much attention is paid to delivering hardware-efficient designs for object detection instead of simply pursuing higher inference quality. To address the design difficulties of real-life applications, a low power object detection challenge was set up in DAC-SDC, targeting unmanned aerial vehicle (UAV) applications on embedded platforms such as the NVIDIA TX2 GPU, the Ultra96 FPGA, and the Xilinx Pynq-Z1 FPGA Xu et al. (2019). By examining the winning entries, we notice that all of them adopt one-stage detectors and share the similar top-down DNN design approach of Figure 1. As shown in Table 1, most of them start from well-established hardware-efficient DNNs, such as ShuffleNet Zhang et al. (2018a), SqueezeNet Iandola et al. (2016), and MobileNet Howard et al. (2017), and replace the image classifier with a YOLO Redmon et al. (2016); Redmon and Farhadi (2017) or RetinaNet Lin et al. (2017b) back-end for object detection. Other solutions directly adopt object detection algorithms such as SSD Liu et al. (2016) and YOLO. To deliver hardware-efficient DNNs, they employ input resizing and network pruning to lower the network complexity. Some of the GPU entries use the half-precision data format (16-bit) and TensorRT for improved throughput. More aggressive compression is found in FPGA designs because of their even tighter resource budgets: DNN parameters are quantized to around 8 bits, and some even down to 1 bit. They also adopt task partitioning (between the host CPU and the FPGA), double-pumped DSPs (with doubled working frequency in the DSP units), tailored pipelines, multithreading, and clock gating to boost hardware performance and energy efficiency.

Figure 2: (a) Accuracy results for the same AlexNet under different compression schemes: blue for parameter compression and green for FM compression. The legend shows the quantization details of each quantized model (#2 to #5), listing the bit precisions for the FMs across all layers and for the parameters in the Conv and FC layers. (b) BRAM usage of accelerators with the same architecture but 12- to 16-bit quantization for feature maps (FM12 to FM16) and different image resize factors. (c) DSP utilization of accelerators using different quantizations between weights (W) and feature maps (FMs), with the numbers indicating the bits allocated.

2.2 Hardware-Aware Neural Network Search

To deliver DNNs for edge devices, there has been growing interest in using neural architecture search (NAS) Tan et al. (2019); Wu et al. (2019); Howard et al. (2019); Cai et al. (2018) to automatically find resource-constrained DNNs targeting edge platforms. To find efficient networks for a specific platform, Tan et al. (2019) uses real-time latency measured by running models on the targeted device instead of a latency proxy. Limited by the number of available physical devices, Wu et al. (2019); Cai et al. (2018) use look-up tables (LUTs) to approximate the run-time of models on a specific device. To incorporate human knowledge, Howard et al. (2019) uses platform-aware NAS to search DNNs for a platform and manually adjusts the structure to make it more efficient. Compared to previous hardware-aware NAS methods that target a single platform, SkyNet can target both embedded GPU and embedded FPGA platforms and captures hardware limitations by using realistic hardware performance feedback instead of LUT approximation.

3 Motivations

To deliver an even better solution, we investigate the potential obstacles in the top-down design flow (Figure 1) which may hinder further improvements on DNN accuracy and efficiency. We summarize two challenges as follows:

  • It is difficult to balance the sensitivities of DNN configurations on software and hardware during model compression following the top-down approach.

  • It is difficult to select appropriate reference DNNs at the very beginning of the top-down flow because of the uncertain accuracy variations for a given real-life task.

The first challenge causes tedious iterative explorations between software and hardware optimizations. With similar hardware performance (e.g., throughput and latency), DNNs may have different accuracy results, as compression technologies are applied to different network components. We take data quantization as an example. In Figure 2 (a), the accuracy results vary significantly between parameter and intermediate feature map (FM) quantization. In this figure, the coordinates of a bubble's center represent accuracy and model compression ratio, while the area of a bubble shows data size in megabytes (MB). We scale up the FM bubbles for better visibility. By compressing the model from Float32 to fixed point, we reduce the parameter size by 22X (237.9MB → 10.8MB) and the FM size by 16X (15.7MB → 0.98MB), respectively. In this case, inference accuracy is more sensitive to FM precision.

Backbone DNN | # of Parameters | IoU
ResNet-18 | 11.18M | 0.61
ResNet-34 | 21.28M | 0.26
ResNet-50 | 23.51M | 0.32
VGG-16 | 14.71M | 0.25
SkyNet (ours) | 0.44M | 0.73
Table 2: Accuracy on the DAC-SDC dataset using ResNet, VGG, and SkyNet backbones and the same back-end for object detection.

On the other hand, DNNs with similar accuracy may behave very differently in hardware. To provide a quantitative analysis, Figure 2 (b) shows the BRAM (FPGA on-chip memory) usage under different input sizes and FM quantizations. By reducing the resize factor from 1.00 to 0.78, we can maintain nearly the same DNN accuracy (<1.0% drop) but save half the memory once the factor is smaller than 0.9. Similarly, Figure 2 (c) indicates that small changes may lead to very different DSP utilization. Taking the 16-bit FM (FM16) as an example, the required DSPs drop from 128 to 64 when weights are changed from 15-bit (W15) to 14-bit (W14).

Regarding the second challenge, it is difficult to select a reference DNN with a relatively high accuracy upper bound for a given task. DNNs with impressive accuracy on published datasets (e.g., CIFAR-10/100 and ImageNet) may not always be suitable. We evaluate the accuracy of popular DNNs on the DAC-SDC object detection dataset and list the results in Table 2. With the same box regression part, these DNNs show no clear connection between parameter size and inference accuracy after adequate training. Thus, it is not easy to select a promising reference model for a given task.

Figure 3: The proposed bottom-up DNN design flow to deliver hardware-efficient DNNs for embedded systems in three stages.

4 A Bottom-Up Design Approach

Motivated by the challenges discussed in Section 3, we propose a bottom-up approach to hardware-efficient DNN design for embedded systems. It is a three-stage approach, as shown in Figure 3.

4.1 Stage 1: Bundle Selection and Evaluation

This flow starts with building hardware-aware basic blocks, called Bundles. From the software perspective, a Bundle is a set of sequential DNN layers which can be repeatedly stacked to construct DNNs. From the hardware perspective, a Bundle is a set of IPs to be implemented on hardware. To capture hardware constraints, Bundles are evaluated on the targeted embedded systems to collect realistic latency (for both FPGA and GPU) and resource utilization (for FPGA) results.

In the first stage, we enumerate DNN components (such as Conv, pooling, and activation layers) and assemble them into Bundles. Each Bundle is then implemented and evaluated on the targeted hardware devices to obtain its hardware performance metrics. To estimate their potential accuracy contributions, we build DNN sketches with fixed front-end and back-end structures based on the given task, and insert one type of Bundle (with replications) in the middle. We limit each DNN sketch to one type of Bundle to guarantee its hardware efficiency. DNN sketches are then fast trained on the targeted dataset to find the ones with relatively high accuracy. Targeting object detection, for example, we can concatenate an input resizing unit (front-end) and a bounding box regression (back-end) with the selected Bundle to build a DNN sketch. The number of training epochs may vary across datasets: a 20-epoch training can distinguish sketches on the DAC-SDC dataset (with 100K images), while 5 epochs are enough on the CIFAR-10 dataset. Similar strategies in Jiang et al. (2019) distinguish candidates by a 25-epoch training on a subset of ImageNet. At last, the most promising Bundles on the Pareto curve are selected for the next stage.
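To make this stage concrete, below is a minimal PyTorch-style sketch of Bundle enumeration and DNN-sketch assembly. The component lists, channel widths, and the toy box-regression head are illustrative assumptions rather than the exact search space used in our experiments.

```python
import itertools
import torch.nn as nn

# Candidate components to enumerate into Bundles (illustrative subset).
CONVS = {"conv3": lambda c_in, c_out: nn.Conv2d(c_in, c_out, 3, padding=1),
         "dwconv3": lambda c_in, c_out: nn.Sequential(
             nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),   # depth-wise
             nn.Conv2d(c_in, c_out, 1))}                          # point-wise
ACTS = {"relu": nn.ReLU, "relu6": nn.ReLU6}

def make_bundle(conv_key, act_key, c_in, c_out):
    """One Bundle = conv block + BN + activation."""
    return nn.Sequential(CONVS[conv_key](c_in, c_out),
                         nn.BatchNorm2d(c_out), ACTS[act_key]())

def make_sketch(conv_key, act_key, reps=3, c0=48):
    """DNN sketch: fixed front/back-end, one Bundle type stacked in the middle."""
    layers, c_in = [], 3
    for i in range(reps):
        c_out = c0 * (2 ** i)                 # double channels per repetition
        layers += [make_bundle(conv_key, act_key, c_in, c_out), nn.MaxPool2d(2)]
        c_in = c_out
    layers.append(nn.Conv2d(c_in, 10, 1))     # toy back-end: box regression head
    return nn.Sequential(*layers)

# Enumerate all Bundle types; each sketch is then fast trained and profiled.
sketches = {(c, a): make_sketch(c, a) for c, a in itertools.product(CONVS, ACTS)}
print(len(sketches), "candidate sketches")
```

Each sketch produced this way is fast trained for a few epochs and profiled on the targeted device, and only the Bundles on the accuracy-performance Pareto curve move on to Stage 2.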

4.2 Stage 2: Hardware-Aware DNN Search

During the DNN search, the inputs include the software and hardware metrics (e.g., DNN accuracy and throughput) and the targeted hardware platforms, while the outputs are DNN candidates which meet the software and hardware requirements. To solve this multi-objective optimization problem, we propose a group-based particle swarm optimization (PSO) evolutionary algorithm to discover proper DNN candidates, since the literature has demonstrated the validity of evolutionary methods for discovering DNNs with state-of-the-art accuracy Real et al. (2019); Elsken et al. (2019). From the design methodology perspective, SkyNet can be extended to support other optimization algorithms to meet the needs of different scenarios.

In the proposed group-based PSO algorithm, each individual DNN is regarded as a particle, and all active DNNs during the search contribute to the swarm. Since we only use one type of Bundle in each DNN, DNNs composed of the same type of Bundle are considered a particle group. In order to maintain evolution stability, a DNN only evolves within its own group. We label the group optimal position as $P^i_{gbest}$ within the $i$-th group, meaning such a DNN has the best fitness value evaluated under the given conditions. We denote a DNN particle within group $i$ as $P^i_j$, and each $P^i_j$ has a pair of feature vectors $\langle V_{ch}, V_{pl} \rangle$ describing two hyper-parameters of the DNN structure: $V_{ch}$ represents the number of channels of each Bundle replication, and $V_{pl}$ describes the pooling positions between Bundles. Both feature vectors have dimension equal to the number of stacked Bundles in $P^i_j$, and both of them affect accuracy and hardware performance. To locate the best DNN candidates, we propose Algorithm 1 with the following major components:

Population generation. An initial network population (a set of DNN candidates) is generated with $g$ groups and $n$ networks per group. The search contains $I$ iterations, and in the $t$-th iteration all networks are fast trained for $e_t$ epochs, where $e_t$ increases with $t$.

Latency estimation. We perform a platform-specific latency estimation. For GPUs, we directly measure the inference latency on the training GPU, and scale the latency to the targeted GPU if the deployment GPU differs from the training one. For FPGAs, we follow a predefined IP-based DNN accelerator template Hao et al. (2019) for hardware performance evaluation. Layer-specific IPs are implemented in hardware and shared by the corresponding DNN layers. To maximize performance, the IPs are configured to fully consume the available resources. We then collect the end-to-end performance and resource overhead of each DNN from an FPGA high-level synthesis tool.
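For the GPU path, the latency measurement amounts to a warmed-up timing loop around inference. The sketch below illustrates this, where `scale_to_target` is an assumed calibration factor for scaling results from the training GPU to the deployment GPU.

```python
import time
import torch

def measure_latency_ms(model, input_shape=(1, 3, 160, 320),
                       warmup=10, iters=100, scale_to_target=1.0):
    """Average single-batch inference latency in milliseconds.

    scale_to_target is an assumed calibration factor applied when the
    training GPU differs from the deployment GPU (e.g., 1080Ti -> TX2).
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.eval().to(device)
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels and caches
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()     # wait for queued GPU work
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1000.0 * scale_to_target
```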

Fitness value. After network training and latency estimation, we calculate the fitness value for each network as:

$Fit^i_j = Acc^i_j \times \left( L_{tar} / Lat^i_j \right)^{\alpha}$ (1)

where $Acc^i_j$ is the validation accuracy of $P^i_j$ and $Lat^i_j$ represents its latency on hardware; $L_{tar}$ is the targeted latency. The parameter $\alpha$ is used to balance between network accuracy and hardware performance.

Velocity calculation and particle update. In standard PSO, the updated velocity of a particle is calculated every iteration based on the current velocity and the velocities toward the local and global best positions. Particles can move to a better position with assigned probabilities following the updated velocity. Similarly, in our case, DNNs in the same group update their positions (the network structures represented by the feature vectors) based on the current design, the local best design (the best one across all passing iterations), and the group best design. To determine the velocity toward the local best $P_{lbest}$ and the group best $P^i_{gbest}$, we compute the differences between the positions of the current design and the local/group best designs. Since each position is represented by $\langle V_{ch}, V_{pl} \rangle$, position differences can be captured by the mismatch of the layer expansion factors ($V_{ch}$) and pooling spots ($V_{pl}$), respectively. Then, with the velocities known, we evolve the current network by updating its position toward the local and the group best by a random percentage.

4.3 Stage 3: Feature Addition

More advanced DNN design features are added if hardware metrics allow. For example, we can include a bypass from low-level features to high-level features along with feature map reordering Redmon and Farhadi (2017) to improve small object detection. We can also replace ReLU with ReLU6 Sandler et al. (2018) to enhance hardware efficiency. More discussions are provided in the next section.

P ← InitialPopulation(g, n)
while t < I do
    FastTraining(P, e_t)
    GetFitnessVal(P)                       // evaluate all candidates
    for each group i do
        GroupRank(i)                       // rank candidates in group i
        P^i_gbest ← GroupBest(i)           // select the best one in group i
        pos_g ← GetPosition(P^i_gbest)     // get the group best position
        for each candidate P^i_j in group i do
            LocalRank(P^i_j)               // rank across all passing iterations
            P_lbest ← LocalBest(P^i_j)
            pos_l ← GetPosition(P_lbest)   // get the local best position
            pos_c ← GetPosition(P^i_j)     // get the current position
            v_l ← GetV(pos_c, pos_l)       // velocity toward the local best
            v_g ← GetV(pos_c, pos_g)       // velocity toward the group best
            Evolve(P^i_j, v_l, v_g)
        end for
    end for
    t ← t + 1
end while
Algorithm 1 The bottom-up DNN design with PSO
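A condensed Python rendering of the particle update step in Algorithm 1 is shown below. The integer-vector position encoding and the fixed mixing probabilities are simplifying assumptions for illustration; the actual search also re-trains and re-evaluates each evolved candidate.

```python
import random

def evolve(position, local_best, group_best, c_local=0.5, c_group=0.5):
    """Move a DNN particle toward its local and group best designs.

    A position is a pair of integer vectors (V_ch, V_pl): channel widths
    per Bundle and pooling placements. Each coordinate is copied from the
    local/group best with the given probabilities, so the 'velocity' is
    the coordinate-wise mismatch between positions.
    """
    new_pos = []
    for cur, loc, grp in zip(position, local_best, group_best):
        vec = list(cur)
        for k in range(len(vec)):
            r = random.random()
            if r < c_local:                 # step toward the local best
                vec[k] = loc[k]
            elif r < c_local + c_group:     # step toward the group best
                vec[k] = grp[k]
        new_pos.append(vec)
    return new_pos

# Example: channel widths and pooling flags for a 6-Bundle particle.
pos  = ([48, 96, 192, 384, 512, 96], [1, 1, 1, 0, 0, 0])
best = ([48, 96, 192, 384, 512, 48], [1, 1, 0, 1, 0, 0])
print(evolve(pos, best, best))
```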

5 SkyNet

Figure 4: SkyNet backbone (model C in Table 3) generated by stacking six of the selected Bundle (circled by the green dashed line) with DNN components DW-Conv3, PW-Conv1, BN, and ReLU6. The number of output channels is listed on top of each Bundle, denoted as Ch. Three 2×2 pooling layers are inserted. The bypass is highlighted in orange, which passes feature maps generated by Bundle #3 directly to the last Bundle. Feature map reordering is also performed along with the bypass.

5.1 SkyNet Architecture for Object Detection

Following the proposed flow, the best Bundle is selected as a combination of a 3×3 depth-wise Conv layer (DW-Conv3 Howard et al. (2017)), a 1×1 point-wise Conv layer (PW-Conv1), a batch normalization layer (BN Ioffe and Szegedy (2015)), and ReLU6. By repeatedly stacking this Bundle, we generate the three backbones shown in Table 3 for object detection in DAC-SDC. These networks share the same chain structure but differ in their feature map bypass configuration: model A has no bypass, while in models B and C the output feature maps of Bundle #3 are fed into Bundle #6. SkyNet also adapts the YOLO detector head by removing the classification output and using two anchors for bounding box regression.
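The selected Bundle translates directly into a small module. Below is a PyTorch sketch following the layer order of Figure 4 and the channel widths of Table 3; details not stated in the paper (e.g., in-place activations) are assumptions.

```python
import torch
import torch.nn as nn

class SkyNetBundle(nn.Module):
    """The selected Bundle: DW-Conv3 and PW-Conv1, each followed by BN + ReLU6."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),   # DW-Conv3
            nn.BatchNorm2d(c_in), nn.ReLU6(inplace=True),
            nn.Conv2d(c_in, c_out, 1),                          # PW-Conv1
            nn.BatchNorm2d(c_out), nn.ReLU6(inplace=True))

    def forward(self, x):
        return self.block(x)

# Chain up to Bundle #3 (channel widths from Table 3, 160x320 input as in Sec. 6.1).
stem = nn.Sequential(SkyNetBundle(3, 48), nn.MaxPool2d(2),
                     SkyNetBundle(48, 96), nn.MaxPool2d(2),
                     SkyNetBundle(96, 192))
print(stem(torch.randn(1, 3, 160, 320)).shape)   # torch.Size([1, 192, 40, 80])
```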

5.2 Feature Map Bypass, Reordering, and ReLU6

By examining the DAC-SDC training data, we record the size ratio between the output bounding box and the input image and present the distribution in Figure 6. It clearly shows that 91% of the objects to be detected are smaller than 9% of the original input image size, and 31% of them are even smaller than 1% of the input image size. This means the majority of objects in this dataset are small objects, and we need to provide additional DNN features accordingly. So, we add feature map bypass and reordering to enhance the ability to detect small objects (models B and C). The bypass helps keep small-object features in the later part (closer to the output layer) of the DNN by adding low-level, high-resolution feature maps. It is also beneficial to have multiple feature maps (from different layers) before generating the bounding boxes. Since the bypass crosses a pooling layer (highlighted in Figure 4), we use reordering (shown in Figure 5) to align the size of the original feature maps (generated by Bundle #5) with the low-level features without losing information. The other feature used to improve hardware efficiency is ReLU6, which clips the output range to [0, 6]. Since ReLU6 generates a much smaller data range compared to the original ReLU ([0, +∞)), fewer bits are required to represent intermediate FMs. It also helps to better implement lower-precision floating-point formats in embedded GPUs and fixed-point data formats in embedded FPGAs.
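The reordering is essentially a space-to-depth transform. A numpy sketch follows, assuming even spatial dimensions and a 2×2 stride; applied to Bundle #3's 192-channel output, it yields the 768 reordered channels referenced in Table 3.

```python
import numpy as np

def reorder(fm, s=2):
    """Space-to-depth: (C, H, W) -> (C*s*s, H//s, W//s), no information loss.

    Each s-by-s spatial neighborhood is moved into the channel dimension,
    so the result aligns with feature maps downstream of a pooling layer
    while keeping every activation.
    """
    c, h, w = fm.shape
    assert h % s == 0 and w % s == 0
    fm = fm.reshape(c, h // s, s, w // s, s)   # split H and W by the stride
    fm = fm.transpose(0, 2, 4, 1, 3)           # bring each s*s block forward
    return fm.reshape(c * s * s, h // s, w // s)

x = np.arange(2 * 4 * 4).reshape(2, 4, 4)
print(reorder(x).shape)   # (8, 2, 2): 4x more channels, half the resolution
```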

Configurations of SkyNet (models A / B / C):

Input: 3×160×360 color image
Bundle #1: DW-Conv3 (3), PW-Conv1 (48)
2×2 max-pooling
Bundle #2: DW-Conv3 (48), PW-Conv1 (96)
2×2 max-pooling
Bundle #3: DW-Conv3 (96), PW-Conv1 (192); B and C only: bypass start, FM reordering (768)
2×2 max-pooling
Bundle #4: DW-Conv3 (192), PW-Conv1 (384)
Bundle #5: DW-Conv3 (384), PW-Conv1 (512)
Bundle #6:
  A: DW-Conv3 (512), PW-Conv1 (no bypass)
  B: bypass end, FM concatenated, DW-Conv3 (512+768), PW-Conv1 (48)
  C: bypass end, FM concatenated, DW-Conv3 (512+768), PW-Conv1 (96)
PW-Conv1 (10)
Back-end for bounding box regression

Table 3: The SkyNet architecture with the number of output channels shown in brackets. Each convolutional layer except the last one is followed by a BN and a ReLU (omitted for conciseness).
Figure 5: Feature map reordering from 2W×2H×C to W×H×4C, with shrunken width and height but an expanded number of channels. There is no information loss, in contrast to a pooling operation. In addition, this reordering pattern ensures a larger receptive field.
Figure 6: The distribution of bounding box relative size in the DAC-SDC training dataset. We capture the bounding box relative size by computing the ratio of the output bounding box size to the input image size. The green bars show the ratio distribution, and the blue curve shows the corresponding cumulative distribution.
Figure 7: Object detection results generated by SkyNet on the DAC-SDC dataset. Challenges include detecting small objects and distinguishing multiple similar objects (e.g., the images in the first row).

6 Experiments on DAC-SDC

DAC-SDC features a single object detection challenge for embedded systems, which include embedded GPUs (NVIDIA TX2) and FPGAs (Pynq-Z1 and Ultra96) with very low energy consumption. The goal is to address the most pressing needs of UAV applications, such as real-time processing capability, energy efficiency, and detection accuracy. To better reflect real-life challenges, the images in the dataset are captured by UAVs in real environments. The whole dataset is divided into two parts: a training set of 100,000 images with objects of interest across 12 main categories and 95 sub-categories, and a hidden test set of 50,000 images for official evaluation that only the contest organizers could access DJI (2018). Results generated by SkyNet are shown in Figure 7. In DAC-SDC'19, 52 GPU teams and 58 FPGA teams participated worldwide, creating a very intense competition. Our SkyNet design delivered the best inference accuracy and total score for both the GPU and FPGA tracks.

6.1 Ablation Study

We perform an ablation study on the DAC-SDC dataset to analyze the three configurations of SkyNet (models A, B, and C listed in Table 3). Combined with two activation functions (ReLU and ReLU6), six configurations of SkyNet are evaluated. We train these models in an end-to-end fashion using multi-scale training with the learning rate decaying from 1e-4 to 1e-7, and apply stochastic gradient descent (SGD) to update the parameters. To further enrich the training data, we use data augmentations that distort, jitter, crop, and resize inputs to size 160×320. The accuracy results are presented in Table 4, where SkyNet C - ReLU6 reaches the highest IoU (0.741) on the validation set. Therefore, we use this model as the proposed design in the following experiments.

DNN Model | Parameter Size | IoU
SkyNet A - ReLU | 1.27 MB | 0.653
SkyNet A - ReLU6 | 1.27 MB | 0.673
SkyNet B - ReLU | 1.57 MB | 0.685
SkyNet B - ReLU6 | 1.57 MB | 0.703
SkyNet C - ReLU | 1.82 MB | 0.713
SkyNet C - ReLU6 | 1.82 MB | 0.741
Table 4: Validation accuracy of SkyNet.

6.2 Evaluation Criteria

Comprehensive evaluations are introduced in DAC-SDC, covering detection accuracy (IoU), throughput (FPS), and energy consumption. To identify the best design, a total score is calculated following Equations 2 to 5. Assuming there are $I$ registered teams and $K$ images in the test set, the IoU score of team $i$, denoted as $R_{IoU}^i$, is calculated as:

$R_{IoU}^i = \frac{1}{K} \sum_{j=1}^{K} IoU_{i,j}$ (2)

For energy, $\bar{E}$ denotes the average energy consumption of all entries when performing DNN inference on the test dataset (Equation 3). The energy score of team $i$ ($ES_i$) is then computed using Equation 4, based on the ratio between the average energy $\bar{E}$ and the energy $E_i$ consumed by this team. The base $x$ is set to 2 and 10 for the FPGA track and the GPU track, respectively. Eventually, the total score $TS_i$ is calculated in Equation 5, including both inference accuracy ($R_{IoU}^i$) and energy consumption ($ES_i$).

$\bar{E} = \frac{1}{I} \sum_{i=1}^{I} E_i$ (3)

$ES_i = \max\left\{0,\; 1 + 0.2 \times \log_x \frac{\bar{E}}{E_i}\right\}$ (4)

$TS_i = R_{IoU}^i \times (1 + ES_i)$ (5)
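The scoring can be computed directly from these equations; a small helper is sketched below. The per-frame energy values in the example are illustrative estimates derived from the reported power and FPS, not official measurements.

```python
import math

def total_score(mean_iou, energy, avg_energy, track="gpu"):
    """Equations 2-5: total score from mean IoU and energy consumption."""
    base = 10 if track == "gpu" else 2          # x = 10 (GPU) or 2 (FPGA)
    es = max(0.0, 1 + 0.2 * math.log(avg_energy / energy, base))
    return mean_iou * (1 + es)

# SkyNet GPU entry: IoU 0.731, ~0.20 J per frame vs. an assumed 0.39 J average.
print(round(total_score(0.731, 0.20, 0.39), 3))   # ~1.504
```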
Team Name | IoU | FPS | Power (W) | Total Score
Results from 2019
SkyNet (ours) | 0.731 | 67.33 | 13.50 | 1.504
Thinker Xiong et al. (2019) | 0.713 | 28.79 | 8.55 | 1.442
DeepZS Deng et al. (2019) | 0.723 | 26.37 | 15.12 | 1.422
Results from 2018
ICT-CAS Lu et al. (2018) | 0.698 | 24.55 | 12.58 | 1.373
DeepZ Deng and Zhuo (2018) | 0.691 | 25.30 | 13.27 | 1.359
SDU-Legend Zang et al. (2018) | 0.685 | 23.64 | 10.31 | 1.358
Table 5: GPU final results from DAC-SDC'19 and '18 using the hidden test set with 50K images, evaluated on a TX2 GPU.
Team Name | IoU | FPS | Power (W) | Total Score
Results from 2019
SkyNet (ours) | 0.716 | 25.05 | 7.26 | 1.526
XJTU_Tripler Zhao et al. (2019) | 0.615 | 50.91 | 9.25 | 1.394
SystemsETHZ Kara and Alonso (2019) | 0.553 | 55.13 | 6.69 | 1.318
Results from 2018
TGIIF Zeng et al. (2018) | 0.624 | 11.96 | 4.20 | 1.267
SystemsETHZ Kara et al. (2018) | 0.492 | 25.97 | 2.45 | 1.179
iSmart2 Hao et al. (2018) | 0.573 | 7.35 | 2.59 | 1.164
Table 6: FPGA final results from DAC-SDC'19 and '18 using the hidden test set with 50K images. Designs in 2019 are evaluated on an Ultra96 FPGA, while designs in 2018 use a Pynq-Z1 FPGA.

6.3 GPU Implementation

For the TX2 GPU implementation, we keep all network parameters in Float32 to maintain the best inference accuracy. Since most of the compute-intensive parts of DNN inference are handled by NVIDIA cuDNN, which leaves little room for customized improvement, we optimize our design at the system level.

The whole procedure of running SkyNet contains four steps: 1) input fetching from flash storage in units of one batch; 2) image pre-processing, which includes input resizing and normalization; 3) DNN inference; and 4) post-processing to generate bounding boxes and buffer the results in DDR memory. The most straightforward approach is to execute these steps serially, but at the cost of low resource utilization and poor throughput. In our design, we first merge steps 1 and 2 into pre-processing, then use multithreading to execute the steps in a pipelined fashion, as shown in Figure 10. We use the NVIDIA System Profiler (L4T) to capture the latency results. On average, the proposed system-level optimizations deliver a 3.35X speedup over the original serial design and help our design reach the highest throughput performance, peaking at 67.33 FPS.
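A minimal sketch of this pipelined execution is shown below, using one thread per stage and bounded queues between stages; the stage functions are placeholders for the actual pre-processing, inference, and post-processing code.

```python
import threading
import queue

def stage(fn, q_in, q_out):
    """Run one pipeline stage: pull an item, process it, push it downstream."""
    while True:
        item = q_in.get()
        if item is None:            # poison pill: shut the pipeline down
            if q_out is not None:
                q_out.put(None)
            break
        out = fn(item)
        if q_out is not None:
            q_out.put(out)

# Placeholder stage functions (input fetching and resizing merged as pre-process).
preprocess = lambda batch: batch
infer = lambda batch: batch
postprocess = lambda batch: batch

q1, q2 = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
threads = [threading.Thread(target=stage, args=(infer, q1, q2)),
           threading.Thread(target=stage, args=(postprocess, q2, None))]
for t in threads:
    t.start()
for batch_id in range(8):           # pre-process runs on the main thread
    q1.put(preprocess(batch_id))
q1.put(None)
for t in threads:
    t.join()
```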

6.4 FPGA Implementation

To implement DNNs on the FPGA, we face even scarcer resource budgets, as the theoretical peak performance of the Ultra96 FPGA (144 GOPS @200MHz) is much lower than that of the TX2 GPU (665 GFLOPS @1300MHz). With the proposed bottom-up design flow, hardware limitations have already been captured by the Bundle design, and the Bundle is instantiated on the FPGA as a single customized hardware IP. Since the proposed network is structured from the same type of Bundle, this IP can be shared across different layers to cope with the resource constraints. Still, more optimizations are needed to further enhance performance.

Quantization, Batch Process, and Tiling

Scheme | Feature Map | Weight | Accuracy (IoU)
0 | Float32 | Float32 | 0.741
1 | 9 bits | 11 bits | 0.727
2 | 9 bits | 10 bits | 0.714
3 | 8 bits | 11 bits | 0.690
4 | 8 bits | 10 bits | 0.680
Table 7: Validation accuracy under different quantization schemes during FPGA implementation.

Since fixed-point representation is more favorable in FPGA designs, we quantize the FMs and weights from Float32 to fixed point and explore the quantization schemes in Table 7. After quantization, the SkyNet backbone suffers accuracy drops ranging from 1.4% to 6.1% across schemes 1 to 4. We finally pick scheme 1, as accuracy has a higher weight in the total score calculation (Equation 5).
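For reference, a simple symmetric fixed-point quantizer of the kind used to evaluate such schemes is sketched below; the per-tensor scale selection and rounding mode are assumptions, as the exact quantizer is implementation-specific.

```python
import numpy as np

def quantize(x, bits):
    """Symmetric linear quantization to a signed fixed-point grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax          # per-tensor scale (assumed)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                          # dequantized view for validation

w = np.random.randn(48, 3, 3).astype(np.float32)
for bits in (11, 10):
    err = np.mean(np.abs(quantize(w, bits) - w))
    print(f"{bits}-bit weights: mean abs error {err:.5f}")
```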

Since the network parameters cannot be accommodated by the FPGA on-chip memory (BRAM, with only 0.95 MB available), we have to store them in external memory (DRAM), which easily makes memory access bandwidth a bottleneck. To mitigate the bandwidth demand, input batch processing is applied to exploit data reuse opportunities: a certain number of input images (equal to the batch size) are assembled before being sent to the FPGA for DNN inference, so that the task size (the number of images processed at one time) increases while the same network parameters are fetched from DRAM only once per batch.

With a larger batch size, network inference requires more FPGA on-chip memory to buffer intermediate FMs. Since our implementation is based on an IP-shared structure, buffers instantiated on the FPGA are shared by different layers, which means a buffer may not be large enough for the FMs generated by the first few layers while being too large for the last few layers, as FMs shrink after pooling. To solve this problem, we propose the input tiling and batch scheme shown in Figure 9. Four inputs are stitched to form a larger input which can be processed as a whole. With tiling and batch processing, one shared buffer can be used across different layers without changing its size. The proposed solution inherits the weight-reuse benefit of batch processing and eliminates the waste of unused buffer space.
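The stitching itself is straightforward; a numpy sketch follows, assuming four equally sized CHW inputs arranged into a 2×2 grid. In practice, the detector outputs must also be decoded per tile.

```python
import numpy as np

def tile_batch4(imgs):
    """Stitch four (C, H, W) images into one (C, 2H, 2W) input.

    The stitched frame is processed as a single image, so four inputs
    reuse one pass over the DNN weights and one shared FM buffer.
    """
    assert len(imgs) == 4
    top = np.concatenate(imgs[:2], axis=2)     # side by side
    bottom = np.concatenate(imgs[2:], axis=2)
    return np.concatenate([top, bottom], axis=1)

frames = [np.zeros((3, 160, 320), dtype=np.float32) for _ in range(4)]
print(tile_batch4(frames).shape)               # (3, 320, 640)
```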

Figure 8: Object tracking results generated by SkyNet on the GOT-10k dataset.
Figure 9: The proposed batch and tiling design to increase data reuse opportunities and avoid on-chip memory waste.
Figure 10: Task partitioning in the SkyNet implementation on the TX2 GPU and Ultra96 FPGA.

Layer Fusion, Memory Hierarchy, and Task Partitioning

To avoid dealing with the floating-point operations (e.g., inverse square root) in BN layers, we use layer fusion to merge the parameters of each Conv and its successive BN offline. As a result, no separate BN layers or expensive floating-point operations are required during DNN inference.
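This fusion is a standard offline transform. The sketch below folds the BN statistics and affine parameters into the preceding Conv weights and bias, assuming the usual BN formulation with per-channel statistics.

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mu, var, eps=1e-5):
    """Fold BN into Conv so no BN math (incl. inverse-sqrt) runs at inference.

    y = gamma * (conv(x, w) + b - mu) / sqrt(var + eps) + beta
      = conv(x, w') + b'
    """
    scale = gamma / np.sqrt(var + eps)          # one factor per output channel
    w_fused = w * scale[:, None, None, None]    # w: (C_out, C_in, kH, kW)
    b_fused = (b - mu) * scale + beta
    return w_fused, b_fused
```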

With hardware resources shared by DNN layers, intermediate results need to be swapped in and out between on-chip and external memory. To boost performance, we instantiate the selected Bundle on hardware and implement a five-stage pipeline with Load, EXE_CONV3, EXE_CONV1, EXE_Pooling, and WriteBack stages. By using ping-pong buffers between the memory and computation units, data transfer (in the Load and WriteBack stages) can be fully overlapped with computation latency. For data transfer between adjacent execution stages (with the "EXE" prefix), we keep data on-chip without going through external memory.

To fully utilize the available computational resources, we also implement task partitioning on the Ultra96. The whole design, shown in Figure 10, is highly similar to our GPU design: workloads are distributed to both the CPU and the FPGA, creating a system-level pipeline. With all three tasks (pre-processing, SkyNet inference, and post-processing) overlapped, our FPGA design reaches 25.05 FPS.

6.5 Result Comparison

After implementing the proposed DNN on the GPU and FPGA following the strategies in Sections 6.3 and 6.4, our designs were evaluated by the DAC-SDC organizers using the hidden test set. As shown in Tables 5 and 6, we compare against the top-3 teams in DAC-SDC'19 and '18. In the GPU track, SkyNet outperforms all other competitors, delivering the best accuracy (0.731), throughput (67.33 FPS), and total score (1.504). In the FPGA track, SkyNet also reaches the best accuracy and the highest total score.

7 SkyNet Extension on GOT-10k

Since SkyNet can deliver real-time object detection on embedded systems, we set up experiments on the GOT-10k benchmark Huang et al. (2019) to demonstrate its potential for object tracking. GOT-10k is a large, high-diversity database for generic object tracking with rich motion trajectories and wide coverage of object classes. Models are evaluated with two metrics: average overlap (AO) and success rate (SR). AO is defined as the mean IoU between predicted and ground-truth bounding boxes, while SR is defined as the proportion of predictions whose IoU is beyond some threshold. During evaluation, GOT-10k only provides the ground-truth bounding box of the first frame and expects trackers to keep tracking the same object in subsequent frames by predicting bounding boxes. The predictions are then evaluated by the GOT-10k server. In this section, we integrate the SkyNet backbone with two state-of-the-art trackers (SiamRPN++ and SiamMask) and evaluate its capability for real-time tracking.

7.1 Evaluation Using SiamRPN++

The Siamese network is one of the most popular network structures for building object trackers. Siamese trackers locate the object via the correlation between features extracted from the exemplar image and the search image, where DNN-based feature extraction plays an important role. SiamRPN++ Li et al. (2019a) is the first Siamese tracker proven to profit from DNN backbones of different capacities, as long as they are properly trained. To evaluate the performance of different backbones, we train three SiamRPN++ trackers with AlexNet, ResNet-50, and SkyNet backbones on GOT-10k. We keep the sizes of the exemplar and search images as 127×127 and 255×255 (128×128 and 256×256 for SkyNet, for better implementation efficiency), respectively, and set the learning rates to decay from 1e-3 to 1e-5. Results are shown in Table 8, where SkyNet achieves nearly the same quality (AO and SR) as the ResNet-50 backbone but much better speed (1.59X faster).

Backbone | AO | SR0.50 | SR0.75 | FPS
AlexNet | 0.354 | 0.385 | 0.101 | 52.36
ResNet-50 | 0.365 | 0.411 | 0.115 | 25.90
SkyNet | 0.364 | 0.391 | 0.116 | 41.22
Table 8: Performance of SiamRPN++ trackers on GOT-10k with different backbones, evaluated on a single NVIDIA 1080Ti.

7.2 Evaluation Using SiamMask

SiamMask Wang et al. (2019) is another Siamese tracker, which outperforms SiamRPN++ by incorporating image segmentation into object tracking. Since segmentation information is not provided, it cannot be trained directly on the GOT-10k dataset. Instead, we train on the YouTube-VOS dataset Xu et al. (2018) and run object tracking on GOT-10k to compare the performance of different backbones using the SiamMask structure. We keep the same input size setup as in Section 7.1 and apply learning rates from 1e-3 to 1e-4. As shown in Table 9, the proposed SkyNet backbone outperforms ResNet-50 in all metrics when using the SiamMask tracker, with better tracking quality and a 1.73X speedup.

Backbone | AO | SR0.50 | SR0.75 | FPS
ResNet-50 | 0.380 | 0.439 | 0.153 | 17.44
SkyNet | 0.390 | 0.442 | 0.158 | 30.15
Table 9: Performance of SiamMask trackers on GOT-10k with different backbones, evaluated on a single NVIDIA 1080Ti.

8 Conclusions

In this paper, we proposed SkyNet, as well as a hardware-efficient method to generate compact DNNs for object detection on embedded GPUs and embedded FPGAs. The SkyNet design methodology is a novel bottom-up DNN design flow which captures hardware limitations using realistic hardware feedback and delivers DNNs with a good balance between software and hardware metrics, such as DNN inference accuracy and throughput. SkyNet was demonstrated in the 56th IEEE/ACM DAC-SDC low power object detection challenge and won the first place award in both the GPU and FPGA tracks. We also extended SkyNet to object tracking, where it delivered 1.60X and 1.73X higher FPS and a 37.20X smaller parameter size with comparable accuracy, compared to state-of-the-art Siamese trackers with a ResNet-50 backbone.

9 Acknowledgments

This work was partly supported by the IBM-Illinois Center for Cognitive Computing System Research (CSR) – a research collaboration as part of IBM AI Horizons Network.

References

  1. Fused-layer CNN accelerators. In Proceedings of the International Symposium on Microarchitecture (MICRO).
  2. ProxylessNAS: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332.
  3. Cloud-DNN: an open framework for mapping DNN models to cloud FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA).
  4. Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. In IEEE International Solid-State Circuits Conference (ISSCC).
  5. Decoupled classification refinement: hard false positive suppression for object detection. arXiv preprint arXiv:1810.04002.
  6. Revisiting RCNN: on awakening the classification power of Faster RCNN. In Proceedings of the European Conference on Computer Vision (ECCV).
  7. Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
  8. R-FCN: object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems.
  9. NeST: a neural network synthesis tool based on a grow-and-prune paradigm. IEEE Transactions on Computers 68 (10), pp. 1487-1497.
  10. DAC-SDC'19 3rd place winner in GPU track.
  11. DAC-SDC'18 2nd place winner in GPU track. https://github.com/jndeng/DACSDC-DeepZ. Accessed: 2020-02-28.
  12. Auto-balanced filter pruning for efficient convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.
  13. The DAC-SDC dataset for low power object detection. http://www.cse.cuhk.edu.hk/byu/2019-DAC-SDC/index.html. Accessed: 2020-02-28.
  14. Efficient multi-objective neural architecture search via Lamarckian evolution. In Proceedings of the International Conference on Learning Representations (ICLR).
  15. NVIDIA Jetson TX2 delivers twice the intelligence to the edge. NVIDIA Accelerated Computing, Parallel Forall.
  16. Ternary hybrid neural-tree networks for highly constrained IoT applications.
  17. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the International Conference on Learning Representations (ICLR).
  18. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems.
  19. DAC-SDC'18 3rd place winner in FPGA track. https://github.com/onioncc/iSmartDNN. Accessed: 2020-02-28.
  20. FPGA/DNN co-design: an efficient design methodology for IoT intelligence on the edge. In Proceedings of the 56th Annual Design Automation Conference (DAC).
  21. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  22. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  23. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  24. Searching for MobileNetV3. In Proceedings of the International Conference on Computer Vision (ICCV).
  25. GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  26. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
  27. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  28. Accuracy vs. efficiency: achieving both through FPGA-implementation aware neural architecture search. In Proceedings of the 56th Annual Design Automation Conference (DAC).
  29. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the International Symposium on Computer Architecture (ISCA).
  30. DAC-SDC'19 3rd place winner in FPGA track.
  31. DAC-SDC'18 2nd place winner in FPGA track. https://github.com/fpgasystems/spooNN. Accessed: 2020-02-28.
  32. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
  33. SiamRPN++: evolution of Siamese visual tracking with very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  34. Ternary weight networks. arXiv preprint arXiv:1605.04711.
  35. Implementing neural machine translation with bi-directional GRU and attention mechanism on FPGAs using HLS. In Proceedings of the 24th Asia and South Pacific Design Automation Conference (ASP-DAC).
  36. Scale-aware trident networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  37. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  38. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  39. SSD: single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV).
  40. DAC-SDC'18 1st place winner in GPU track. https://github.com/lvhao7896/DAC2018. Accessed: 2020-02-28.
  41. ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  42. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the International Symposium on Field-Programmable Gate Arrays (FPGA).
  43. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV).
  44. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence.
  45. You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  46. YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  47. ADMM-NN: an algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
  48. MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  49. OverFeat: integrated recognition, localization and detection using convolutional networks.
  50. Improving object detection from scratch via gated feature reuse. In Proceedings of the British Machine Vision Conference (BMVC).
  51. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  52. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  53. MnasNet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  54. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  55. FCOS: fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  56. StrassenNets: deep learning with a multiplication budget.
  57. End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  58. Efficient inference with TensorRT.
  59. Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL).
  60. Learning attentions: residual attentional Siamese network for high performance online visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  61. Fast online object tracking and segmentation: a unifying approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  62. FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  63. DAC-SDC'19 2nd place winner in GPU track.
  64. YouTube-VOS: a large-scale video object segmentation benchmark. In Proceedings of the European Conference on Computer Vision (ECCV).
  65. DAC-SDC low power object detection challenge for UAV applications. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  66. DAC-SDC'18 3rd place winner in GPU track. https://github.com/xiaoyuuuuu/dac-hdc-2018-object-detection-in-Jetson-TX2. Accessed: 2020-02-28.
  67. DAC-SDC'18 1st place winner in FPGA track. https://github.com/hirayaku/DAC2018-TGIIF. Accessed: 2020-02-28.
  68. ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  69. High-performance video content recognition with long-term recurrent convolutional network for FPGA. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL).
  70. DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs. In Proceedings of the International Conference on Computer-Aided Design (ICCAD).
  71. Hello edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128.
  72. DAC-SDC'19 2nd place winner in FPGA track.