PipeCNN: An OpenCL-Based FPGA Accelerator for Large-Scale Convolution Neuron Networks

Dong Wang, Jianjing An and Ke Xu
Institute of Information Science
Beijing Jiaotong University
Beijing 100044, China
Email: wangdong@bjtu.edu.cn
Abstract

Convolutional neural networks (CNNs) have been widely employed in many applications such as image classification, video analysis and speech recognition. Being compute-intensive, CNN computations are mainly accelerated by GPUs with high power dissipation. Recently, studies have exploited FPGAs as CNN accelerators because of their reconfigurability and energy-efficiency advantage over GPUs, especially now that OpenCL-based high-level synthesis tools are available to provide fast verification and implementation flows. Previous OpenCL-based designs focused only on creating a generic framework to identify performance-related hardware parameters, without utilizing the FPGA's special capability of pipelining kernel functions to minimize the memory bandwidth requirement. In this work, we propose an FPGA accelerator with a new architecture of deeply pipelined OpenCL kernels. Data reuse and task mapping techniques are also presented to improve design efficiency. The proposed schemes are verified by implementing two representative large-scale CNNs, AlexNet and VGG, on an Altera Stratix-V A7 FPGA. We achieve a similar peak performance of 33.9 GOPS with a 34% resource reduction on DSP blocks compared to previous work. Our design is openly accessible and thus can be reused to explore new architectures for neural network accelerators.

I Introduction

Convolutional neural networks (CNNs), as an emerging deep learning architecture, have received significant attention in various applications such as video surveillance, image search, speech recognition and robot vision. A CNN works with multiple convolution layers that extract features from the input data, followed by classification layers that make decisions. Typical large-scale CNNs [1, 2] usually consist of millions of neural units and millions of connections, and require over a billion operations to process a single input.

General-purpose CPUs, being sequential systems with limited computational resources, are inefficient for implementing compute-intensive CNN-based applications. Currently, GPUs are widely adopted as hardware accelerators for training deep neural networks; however, they are energy-inefficient for embedded applications. FPGAs, which provide massive processing elements, reconfigurable interconnections and low power dissipation, are naturally suitable for implementing neural network circuits. Studies such as [5, 6] have reported efficient CNN accelerators on embedded FPGA platforms. However, the traditional register-transfer-level (RTL) design flow adopted in these studies requires great effort in writing complex RTL code and performing time-consuming simulations and compilations before the accelerators can actually run on hardware.

High-level synthesis (HLS) tools, which enable automatic compilation from high-level programs (C/C++) to low-level RTL specifications, have recently been adopted by many studies to implement deep neural networks on FPGAs. In [3], an accelerator was implemented with the Vivado HLS tool on a Xilinx VC707 FPGA. Computation throughput and memory bandwidth are quantitatively explored with a roofline model to find the design with the best performance and lowest resource usage; however, only the convolution layers are implemented. The work of [4] proposed a fixed-point CNN design using the OpenCL framework, presenting a systematic methodology to minimize execution time under given resource constraints. However, due to its matrix-multiplication-based kernel design for the convolution layers and its GPU-like organization of separate kernels, the FPGA's particular strength in implementing deeply pipelined circuits (kernels) is not fully exploited to further improve computation throughput and minimize memory bandwidth.

The main contributions of this work are: (1) an OpenCL-based FPGA accelerator with an efficient structure of pipelined kernels is proposed for implementing large-scale CNNs; (2) the design space of the proposed architecture is fully explored on a Stratix-V A7 FPGA, and two real-world large-scale CNN models are implemented and tested; results show that the proposed scheme achieves better performance and resource utilization than previous works; (3) we have made our design openly accessible [7] for other researchers to study and to explore new accelerator architectures for deep neural networks.

II OpenCL-Based CNN Implementation

II-A OpenCL Framework

OpenCL is an open, cross-platform parallel programming language that can be used in both GPU and FPGA development. The OpenCL-based FPGA accelerator development flow is summarized in Fig. 1. In this framework, an FPGA board (the OpenCL device) is connected to a desktop CPU (the OpenCL host) through a high-speed PCIe slot, forming a heterogeneous computing system. The OpenCL code, which defines multiple parallel compute units (CUs) in the form of kernel functions, is compiled and synthesized to run on the FPGA accelerator. On the host side, a C/C++ program runs on the CPU and uses the vendor-specific application programming interface (API) to communicate with the kernels implemented on the FPGA. This work uses the Altera OpenCL SDK toolset to compile, implement and profile the OpenCL code on FPGAs.

Fig. 1: OpenCL-based FPGA design flow for CNN accelerator.
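For readers unfamiliar with this flow, the fragment below sketches a typical host program in C for such a system: the offline-compiled FPGA image is loaded with clCreateProgramWithBinary, and one cl_kernel object is created per pipeline stage (in Altera's runtime each stage is usually also given its own command queue so that all kernels can run concurrently). The file and kernel names here are illustrative assumptions, not those of the released PipeCNN host code [7].

```c
/* Minimal host-side sketch (illustrative names, most error checks omitted). */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    cl_int err;
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* Load the offline-compiled FPGA image (.aocx) as an OpenCL "binary". */
    FILE *fp = fopen("conv.aocx", "rb");            /* hypothetical file name */
    fseek(fp, 0, SEEK_END);
    size_t bin_size = ftell(fp);
    rewind(fp);
    unsigned char *binary = malloc(bin_size);
    fread(binary, 1, bin_size, fp);
    fclose(fp);

    cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &bin_size,
                                                (const unsigned char **)&binary,
                                                NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

    /* One cl_kernel object per FPGA kernel in the pipeline (hypothetical names). */
    cl_kernel memrd = clCreateKernel(prog, "memRead",  &err);
    cl_kernel conv  = clCreateKernel(prog, "coreConv", &err);
    cl_kernel memwr = clCreateKernel(prog, "memWrite", &err);

    /* Next steps (omitted): allocate global-memory buffers in the FPGA DDR3,
       copy data with clEnqueueWriteBuffer, set kernel arguments, enqueue the
       kernels, and read the results back with clEnqueueReadBuffer. */

    free(binary);
    return 0;
}
```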

II-B Proposed Accelerator Architecture

A standard CNN [1, 2] for image classification is composed of one or more stacked convolutional and pooling layers, followed by one or more fully connected (FC) layers. As analyzed in [4], the core part of the convolution layer is a 3-dimensional multiply-accumulate operation that can be defined by

$$y_l(f_o, r, c) = \sum_{f_i=0}^{C-1}\sum_{i=0}^{K-1}\sum_{j=0}^{K-1} w_l(f_o, f_i, i, j)\, x_l(f_i, r+i, c+j) \qquad (1)$$

where $x_l(f_i, r+i, c+j)$ and $y_l(f_o, r, c)$ denote the neurons at the corresponding positions of input feature map $f_i$ and output feature map $f_o$, respectively, and $w_l(f_o, f_i, i, j)$ represents the corresponding weight in the $l$-th layer that gets convolved with $x_l$; here $C$ is the number of input feature maps and $K$ is the filter size. In pooling layers, 2-D subsampling operations are performed on neighboring neurons of the same feature map. As the data traverse deeper into the network, the feature dimensions are gradually reduced. In FC layers, each output neuron is calculated as the weighted summation of all input neurons, as shown by

$$y_l(f_o) = \sum_{f_i=0}^{C-1} w_l(f_o, f_i)\, x_l(f_i) \qquad (2)$$

Some CNN models [1] also include local response normalization (LRN) layers after the pooling layers, which normalize each input neuron value by a factor that depends on its neighboring neurons.
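For reference, a plain-C software model of Eqs. (1) and (2) might look as follows; this is a functional model that follows the notation above (row-major array layouts assumed), not the accelerator code.

```c
/* Software reference model of Eq. (1) (3-D convolution) and Eq. (2) (FC layer).
   Indexing follows the notation above: C input feature maps, K x K filters. */

/* Eq. (1): one output neuron of output feature map fo at position (r, c). */
float conv3d(const float *x, const float *w,
             int C, int H, int W, int K, int fo, int r, int c) {
    float sum = 0.0f;
    for (int fi = 0; fi < C; fi++)          /* input feature maps */
        for (int i = 0; i < K; i++)         /* filter rows        */
            for (int j = 0; j < K; j++)     /* filter columns     */
                sum += w[((fo * C + fi) * K + i) * K + j] *
                       x[(fi * H + (r + i)) * W + (c + j)];
    return sum;
}

/* Eq. (2): one output neuron of a fully connected layer with C inputs. */
float fc(const float *x, const float *w, int C, int fo) {
    float sum = 0.0f;
    for (int fi = 0; fi < C; fi++)
        sum += w[fo * C + fi] * x[fi];
    return sum;
}
```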

As illustrated in Fig. 2, the proposed architecture consists of four kernels that are connected by Altera's OpenCL extension Channels/Pipes. The Convolution kernel (Conv.) is designed to implement both the 3-D multiply-accumulate operation of (1) and the inner-product operation of (2). The Pooling kernel performs subsampling directly on the output data streams of the Conv. kernel. Two data mover kernels, MemRD and MemWR, transfer feature data and weights from/to the global memory. As discussed in Section I, the cascaded kernels form a deep computation pipeline that can implement a series of basic CNN operations without storing interlayer data back to global memory, which significantly reduces the bandwidth requirement compared to the work of [4]. The LRN function is implemented separately from the pipeline since it may operate on data from adjacent feature maps or from the same feature map, which requires multiple memory access patterns. The detailed design of each kernel is as follows.

Fig. 2: The Proposed Architecture of the CNN Accelerator.
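To make the kernel-to-kernel plumbing concrete, the sketch below shows how such a cascaded pipeline can be expressed with Altera's Channel extension. Kernel names, channel depths and the fixed vector width are illustrative assumptions, and the computation bodies are deliberately simplified; the released source [7] differs in detail.

```c
/* Illustrative sketch of the channel-connected kernel pipeline. */
#pragma OPENCL EXTENSION cl_altera_channels : enable

typedef struct { float v[8]; } vec_t;   /* one VEC_SIZE-wide data word (VEC_SIZE = 8) */

/* On-chip FIFOs linking the pipeline stages; no global-memory round trips. */
channel vec_t data_ch   __attribute__((depth(64)));   /* MemRD   -> Conv.   */
channel vec_t weight_ch __attribute__((depth(64)));   /* MemRD   -> Conv.   */
channel float conv_ch   __attribute__((depth(64)));   /* Conv.   -> Pooling */
channel float pool_ch   __attribute__((depth(64)));   /* Pooling -> MemWR   */

/* MemRD: NDRange kernel, each work-item pushes one feature/weight pair. */
__kernel void memRead(__global const vec_t *restrict feature,
                      __global const vec_t *restrict weight) {
    size_t gid = get_global_id(0);
    write_channel_altera(data_ch,   feature[gid]);
    write_channel_altera(weight_ch, weight[gid]);
}

/* Conv.: single work-item kernel; only the channel handshake is shown here,
   the real multiply-accumulate datapath is described in Sec. II-B1. */
__kernel void coreConv(int num_outputs, int macs_per_output) {
    for (int o = 0; o < num_outputs; o++) {
        float acc = 0.0f;
        for (int m = 0; m < macs_per_output; m++) {
            vec_t d = read_channel_altera(data_ch);
            vec_t w = read_channel_altera(weight_ch);
            #pragma unroll
            for (int k = 0; k < 8; k++) acc += d.v[k] * w.v[k];
        }
        write_channel_altera(conv_ch, acc);
    }
}

/* Pooling: single work-item kernel; here it simply forwards data
   (see Sec. II-B3 for the line-buffer design). */
__kernel void maxPool(int num_values) {
    for (int i = 0; i < num_values; i++)
        write_channel_altera(pool_ch, read_channel_altera(conv_ch));
}

/* MemWR: NDRange kernel, drains the pipeline back to global memory. */
__kernel void memWrite(__global float *restrict output) {
    output[get_global_id(0)] = read_channel_altera(pool_ch);
}
```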
1: Define shift-register buffer of depth N as delayed accumulation buffer
2: Set required convolution counter bound M
3: #pragma unroll by factor of CU_NUM
4: for each convolution pipeline do
5:     for each output neuron do
6:         Initialize all buffer entries to zeros
7:         for i = 0 to M - 1 do
8:             Read vectorized feature data from Channel
9:             Read vectorized weight data from Channel
10:             Perform parallel multiply-add operation on the VEC_SIZE inputs
11:             Accumulate the partial sum with the oldest buffer entry
12:             Perform register shifting
13:             Store the accumulated result in the buffer
14:         end for
15:         Perform parallel summation over the N buffer entries
16:     end for
17: end for
Fig. 3: Pseudo-code of the convolution kernel.

II-B1 Convolution Kernel

A single-threaded kernel with parallel convolution data paths is designed to implement the functions of both the convolution and FC layers. Two techniques are used to improve the computation throughput and pipeline utilization. First, a multi-mode convolution circuit with a deeply pipelined multiply-add tree and delayed buffers is designed. In [3], Eq. (1) was written as a 5-level nested loop, in which complicated loop tiling and memory partition techniques are used to improve computation throughput. However, manual memory partitioning is not yet available in the Altera OpenCL SDK, and the Channel read/write operations used inside the loops also prevent tiling optimization. Therefore, we transform (1) into a structure similar to (2) and implement both functions as a 2-level nested loop, whose pseudo-code is shown in Fig. 3. When an appropriate buffer depth N is set, an efficient pipeline with an initiation interval of two can be synthesized by the Altera OpenCL compiler.

Second, data vectorization and parallel CU structures are both exploited in the design. Vectorized input features and weights are streamed through multiple Channels, and a design parameter VEC_SIZE is introduced to control the input throughput. The outermost for loop is unrolled by a factor of CU_NUM to generate multiple instances of the convolution pipeline, so that outputs in CU_NUM different output feature maps can be generated in parallel. When configured in 3-D convolution mode, the loop bound M is set to K×K×C', while in FC mode M is set to C', where C' = C/VEC_SIZE. When no pipeline stalls are caused by Channel accesses, a speedup of VEC_SIZE × CU_NUM can be achieved.
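A possible OpenCL realization of Fig. 3 is sketched below as a refinement of the earlier coreConv fragment; it is an illustration under assumed parameter values (VEC_SIZE, CU_NUM, buffer depth N), not the released kernel code [7]. The loop-carried floating-point accumulation is spread over a shift register of depth N so it no longer limits the initiation interval, the VEC_SIZE-wide multiply-add tree is fully unrolled, and the outer unroll by CU_NUM replicates the pipeline exactly as in the pseudo-code.

```c
/* Sketch of the multi-CU convolution kernel (illustrative parameter values). */
#define VEC_SIZE  8
#define CU_NUM    4          /* number of parallel convolution pipelines     */
#define ACC_DEPTH 8          /* shift-register depth N (tuned empirically)   */

#pragma OPENCL EXTENSION cl_altera_channels : enable

typedef struct { float v[VEC_SIZE]; } vec_t;

channel vec_t data_ch[CU_NUM]   __attribute__((depth(64)));
channel vec_t weight_ch[CU_NUM] __attribute__((depth(64)));
channel float conv_ch[CU_NUM]   __attribute__((depth(64)));

/* Launched as a single work-item (task) kernel. */
__kernel void coreConv(int out_num,   /* output neurons per CU              */
                       int mac_num) { /* M = K*K*C' (conv) or C' (FC mode)  */
    #pragma unroll
    for (int cu = 0; cu < CU_NUM; cu++) {              /* replicated pipelines */
        for (int o = 0; o < out_num; o++) {
            float sr[ACC_DEPTH];                       /* delayed accumulation buffer */
            #pragma unroll
            for (int i = 0; i < ACC_DEPTH; i++) sr[i] = 0.0f;

            for (int m = 0; m < mac_num; m++) {
                vec_t d = read_channel_altera(data_ch[cu]);
                vec_t w = read_channel_altera(weight_ch[cu]);

                float mac = 0.0f;                      /* VEC_SIZE-wide multiply-add tree */
                #pragma unroll
                for (int k = 0; k < VEC_SIZE; k++) mac += d.v[k] * w.v[k];

                /* Accumulate against the oldest entry, then shift the register. */
                float head = sr[ACC_DEPTH - 1] + mac;
                #pragma unroll
                for (int i = ACC_DEPTH - 1; i > 0; i--) sr[i] = sr[i - 1];
                sr[0] = head;
            }

            float sum = 0.0f;                          /* final reduction of the buffer */
            #pragma unroll
            for (int i = 0; i < ACC_DEPTH; i++) sum += sr[i];
            write_channel_altera(conv_ch[cu], sum);
        }
    }
}
```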

II-B2 Data Mover Kernels

Two multi-mode 3-D NDRange kernels are designed to fetch/store data from/to the global memory for the computation pipelines. The data and work-item mapping schemes are illustrated in Fig. 4. In convolution mode, the MemRD kernel is launched with a 3-D global work size that covers all convolution windows of the current layer, while the MemWR kernel works in an NDRange that covers the output feature maps; work-items are arranged into multiple concurrent work-groups, each with a local work size of (K, K, C'). A fixed data read/write ratio is thereby established between the two kernels. The proposed data movers enable efficient data reuse, which significantly reduces the global memory bandwidth requirement: 1) for each work-item, the fetched input feature data are replicated by registers inside the MemRD kernel and passed to all the CUs of the following Conv. kernel for parallel computation of the CU_NUM output features; 2) the fetched weights are first loaded into an on-chip cache generated by the compiler, so that different work-groups sharing the same work-group index can reuse the same weights without issuing new global memory load instructions.

In FC mode, both the input features and the weights are 1-D vectors, as defined in Eq. (2). Launching the MemRD kernel for only one classification task would significantly reduce the opportunity for weight reuse. Therefore, we introduce a batched processing capability in MemRD: a batch of classification tasks can be processed with a single kernel launch by mapping the batched input feature vectors onto a single 3-D data set.

Fig. 4: Data and work-item mapping scheme of the data mover kernels.
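The convolution-mode mapping can be pictured with the following greatly simplified MemRD sketch (argument list, address generation, CU replication and the FC-mode batching are all simplified or omitted; names and layouts are assumptions). The point to note is that the weight address depends only on the position inside the filter window, so all work-groups computing the same output feature map issue identical weight loads, which the compiler-generated cache can then serve from on-chip memory.

```c
/* Illustrative sketch of the MemRD data mover in convolution mode.
   Local work size is (K, K, C'): one work-group walks one K x K x C window. */
#define VEC_SIZE 8

#pragma OPENCL EXTENSION cl_altera_channels : enable
typedef struct { float v[VEC_SIZE]; } vec_t;
channel vec_t data_ch   __attribute__((depth(64)));
channel vec_t weight_ch __attribute__((depth(64)));

__kernel void memRead(__global const vec_t *restrict feature, /* H x W x C' */
                      __global const vec_t *restrict weight,  /* K x K x C' */
                      int width, int c_vec)                   /* W and C'   */
{
    /* Position inside the current convolution window. */
    int kx = get_local_id(0);
    int ky = get_local_id(1);
    int ci = get_local_id(2);

    /* Output pixel (window origin) handled by this work-group; stride 1 assumed. */
    int ox = get_group_id(0);
    int oy = get_group_id(1);

    /* The feature address slides with the window; the weight address does not,
       so repeated weight loads hit the compiler-generated on-chip cache instead
       of external DRAM. */
    int f_addr = ((oy + ky) * width + (ox + kx)) * c_vec + ci;
    int w_addr = (ky * get_local_size(0) + kx) * c_vec + ci;

    write_channel_altera(data_ch,   feature[f_addr]);
    write_channel_altera(weight_ch, weight[w_addr]);
}
```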

II-B3 Pooling Kernel

A line-buffer-based hardware structure is proposed for the pooling kernel, as shown in Fig. 5. The kernel first reads the data of each feature map line by line from the Channel and stores them in a group of line buffers. Once the buffers are filled, a window of feature map data is read out and sent to the pooling logic of the next stage. Two pooling schemes, max-pooling and average-pooling, are widely used in CNNs; therefore, the pooling logic supports either selecting the maximum or computing the average of the (L+1) inputs, where L is the number of line buffers. The kernel can also be turned off by setting a control register.

Fig. 5: Line-buffer-based hardware architecture of the pooling kernel.
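As an illustration of the line-buffer idea, the sketch below specializes the kernel to 2x2 max pooling with stride 2 on a row-major stream; the configurable window size, stride and average-pooling mode of the actual design are left out, and the buffer bound is an assumed constant.

```c
/* Sketch of a line-buffer-based pooling stage, specialized to 2x2 max pooling
   with stride 2 (single work-item; the released kernel [7] is more general). */
#define ROW_WIDTH 256                        /* assumed maximum feature-map width */

#pragma OPENCL EXTENSION cl_altera_channels : enable
channel float conv_ch __attribute__((depth(64)));
channel float pool_ch __attribute__((depth(64)));

__kernel void maxPool(int width, int height, int num_maps) {
    for (int fm = 0; fm < num_maps; fm++) {
        float line_buf[ROW_WIDTH];           /* holds the previous row of this map */
        float left = 0.0f, prev_above = 0.0f;

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                float cur   = read_channel_altera(conv_ch);
                float above = (y > 0) ? line_buf[x] : cur;

                /* A 2x2 window is complete at every odd row and odd column. */
                if ((y & 1) && (x & 1)) {
                    float m = fmax(fmax(prev_above, above), fmax(left, cur));
                    write_channel_altera(pool_ch, m);
                }

                line_buf[x] = cur;           /* current row becomes next "previous" row */
                prev_above  = above;
                left        = cur;
            }
        }
    }
}
```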

II-B4 LRN Kernel

We choose the piecewise linear approximation scheme presented in [4] to implement the core exponentiation function of the LRN kernel, and we improve it by introducing a new lookup-table segmentation scheme that reduces the hardware cost, as shown in Fig. 6. In the new method, the function evaluation range is divided using powers of two, with an integer parameter that controls the segmentation granularity and thus the accuracy. The approach avoids complicated table-addressing logic by operating directly on the exponent of the input. The required lookup-table size is determined by the chosen segmentation parameter. In the AlexNet implementation, a small maximum approximation error is achieved with a modest table size.

1: Load the input feature values into a local memory buffer
2: Place barrier on local memory
3: for each output neuron do
4:     Perform parallel accesses on the local memory to fetch all
5:     neighboring features
6:     Compute the normalization factor from the squared sum of the neighboring features
7:     Copy the exponent of the normalization factor
8:     Set the look-up table index according to the exponent
9:     Access the look-up table by this index
10:     Compute the approximated function value with pwlf
11:     Compute the result by scaling the input feature with the approximated value
12: end for
13: Place barrier on local memory
14: Store the results back to global memory
Fig. 6: Pseudo-code of the LRN kernel. pwlf refers to a piece-wise linear approximation operation.
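The exponent-based table addressing can be illustrated with the plain-C fragment below: the binary exponent of the normalization factor is extracted with frexpf and directly selects a slope/intercept pair of the piecewise-linear table. For simplicity the sketch uses one segment per binary exponent and placeholder table contents; the actual segmentation granularity and coefficients are design choices of the paper that are not reproduced here.

```c
/* Illustrative piecewise-linear evaluation of f(z) = z^(-beta) for the LRN
   denominator, with the segment selected directly by the binary exponent of z.
   LUT contents are placeholders; real coefficients must be precomputed offline. */
#include <math.h>

#define SEG_NUM 16              /* number of table segments (placeholder)   */
#define EXP_MIN (-4)            /* smallest exponent covered by the table   */

/* One (slope, intercept) pair per segment, filled offline for the target beta. */
static float lut_slope[SEG_NUM];
static float lut_icpt[SEG_NUM];

static float lrn_pwlf(float z)
{
    int e;
    frexpf(z, &e);              /* extract binary exponent: z = m * 2^e     */

    int idx = e - EXP_MIN;      /* exponent directly addresses the LUT      */
    if (idx < 0) idx = 0;       /* clamp out-of-range inputs                */
    if (idx >= SEG_NUM) idx = SEG_NUM - 1;

    /* Linear approximation within the selected power-of-two segment. */
    return lut_slope[idx] * z + lut_icpt[idx];
}
```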
Fig. 7: Design space exploration for the AlexNet model on the DE5-net FPGA board. The design with parameters VEC_SIZE=16 and CU_NUM=16 is too large to fit in the FPGA device and is not reported.
Fig. 8: The profiled execution timeline of the OpenCL kernels running (a) AlexNet and (b) VGG-16 models. The configured batch size is 16.

III Experimental Results

In this section, we present the implementation results of the proposed OpenCL design on the DE5-net board, which is based on an Altera Stratix-V A7 FPGA. The Stratix-V A7 FPGA provides 622K logic elements (LEs), 256 DSP blocks and 2560 M20K RAMs. Two 2GB DDR3 DRAMs are also connected to the FPGA and serve as the global memory. The OpenCL kernel code is compiled with the Altera OpenCL SDK v15.1. Two large-scale CNN models, AlexNet (8 layers) and VGG-16 (16 layers), are tested with different hardware parameter settings to explore the design space.

Fig. 7 presents the measured performance (a) and the detailed resource utilization (b-d) of the FPGA when implementing the AlexNet model. It can be observed that all three categories of hardware resources scale linearly as the number of compute units increases. The corresponding performance improvements are also significant from 2 CUs to 8 CUs. As the system throughput continues to increase, the memory bandwidth gradually reaches the limit of the on-board DRAM (around 12.8 GB/s), which causes more frequent pipeline stalls and degrades the performance gain. One can estimate from the reported data that the optimal parameters for the DE5-net board are VEC_SIZE=8 and CU_NUM=16. With these settings, the shortest image classification times achieved are 43 ms for AlexNet and 718 ms for VGG-16, respectively. Fig. 8 shows the profiled timeline of each kernel when running these two CNN models; note that the final runtime without kernel profiling is lower than that shown in the figure. To measure the power consumption, we blocked the power pins of the PCIe slot and powered the board through its external power port. The average power consumed by the board while running the two models is 27.3 W and 29.8 W, respectively.

We further compare the proposed design with other HLS-based designs in Table I. Since CNNs are multiplication-intensive, we adopt the number of DSP blocks consumed as the main factor for evaluating hardware resource utilization. Our approach achieves a 34% reduction in DSP usage (162 vs. 246 DSP blocks) while maintaining performance comparable to [4]. One can also expect the proposed architecture to obtain further performance improvements over [4] if fixed-point data types were adopted. Moreover, the proposed design implements the full-precision (32-bit floating-point) CNN forward computation, which also makes it favorable for implementing the backward-propagation flow of model training. To make the comparison more straightforward, we report the normalized performance ("performance density") in the table; it is clear that our method outperforms previous works on this metric.

TABLE I: Comparison with previous works.

                      | FPGA2016 [4]    | FPGA2015 [3]    | This work
Device                | Stratix-V GXA7  | Virtex-7 VX485T | Stratix-V GXA7
FPGA Capacity         | 622K LUTs,      | 485K LUTs,      | 622K LUTs,
                      | 256 DSP         | 2800 DSP        | 256 DSP
Design Scheme         | OpenCL          | Vivado HLS      | OpenCL
Frequency             | 120 MHz         | 100 MHz         | 181 MHz
Precision             | fixed (8-16b)   | float           | float
Classification Time   | 45.7 ms (a)     | 21.6 ms (b)     | 43 ms (b)
Throughput            | 31.8 GOPS (a)   | 61.6 GOPS (b)   | 33.9 GOPS (a)
DSP Consumed          | 246             | 2240            | 162
Performance Density   | 0.13 GOPS/DSP   | 0.027 GOPS/DSP  | 0.21 GOPS/DSP
Power                 | 25.8 W          | 18.6 W          | 27.3 W

  • (a) all operations for image classification.

  • (b) convolution operation only.

IV Conclusion

This work presents an open-source OpenCL-based FPGA accelerator for convolutional neural networks. A performance- and cost-scalable hardware architecture with efficiently pipelined kernels was proposed. The design space was explored by implementing two large-scale CNNs, AlexNet and VGG, on the DE5-net FPGA board. Results show that our scheme achieves significant improvements in performance density and resource utilization compared to previous studies.

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Neural Information Processing Systems (NIPS’12), 2012.
  • [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
  • [3] C. Zhang, P. Li, G. Sun, Y. Guan, B. J. Xiao, and J. Cong, “Optimizing FPGA-based accelerator design for deep convolutional neural networks,” in Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’15), 2015.
  • [4] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. F. Ma, S. Vrudhula, J. S. Seo, and Y. Cao, “Throughput-Optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks,” in Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’16), 2016.
  • [5] J. Qiu, J. Wang, S. Yao, et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’16), 2016.
  • [6] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie and X. Zhou, “DLAU: a scalable deep learning accelerator unit on FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2016.
  • [7] https://github.com/doonny/PipeCNN.