PipeCNN: An OpenCL-Based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Abstract
Convolutional neural networks (CNNs) have been widely employed in many applications such as image classification, video analysis and speech recognition. Being compute-intensive, CNN computations are mainly accelerated by GPUs with high power dissipation. Recently, studies have exploited FPGAs as CNN accelerators because of their reconfigurability and energy-efficiency advantage over GPUs, especially now that OpenCL-based high-level synthesis tools are available, providing fast verification and implementation flows. Previous OpenCL-based designs only focused on creating a generic framework to identify performance-related hardware parameters, without utilizing the FPGA's special capability of pipelining kernel functions to minimize the memory bandwidth requirement. In this work, we propose an FPGA accelerator with a new architecture of deeply pipelined OpenCL kernels. Data reuse and task mapping techniques are also presented to improve design efficiency. The proposed schemes are verified by implementing two representative large-scale CNNs, AlexNet and VGG, on an Altera Stratix-V A7 FPGA. We achieve a similar peak performance of 33.9 GOPS with a 34% resource reduction in DSP blocks compared to previous work. Our design is openly accessible and thus can be reused to explore new architectures for neural network accelerators.
I. Introduction
Convolutional neural network (CNN), as an emerging deep learning architecture, has received huge attention in various applications, such as video surveillance, image search, speech recognition, and robot vision. A CNN works with multiple convolution layers that extract features from the input data, followed by classification layers that make decisions. Typical large-scale CNNs [1, 2] usually consist of millions of neural units and millions of connections, and require over a billion operations to process a single input.
General-purpose CPUs, being sequential systems with limited computational resources, are inefficient for implementing CNN-based compute-intensive applications. Currently, GPUs are widely adopted as hardware accelerators for training deep neural networks. However, they are energy-inefficient for embedded applications. FPGAs, which provide massive processing elements, reconfigurable interconnections and low power dissipation, are naturally suitable for implementing neural network circuits. Studies such as [6], [5] have reported efficient CNN accelerators on embedded FPGA platforms. However, the traditional register-transfer-level (RTL) design flow adopted in these studies requires great effort in writing complex RTL code and performing time-consuming simulation and compilation before one can actually run the accelerator on hardware.
High-Level Synthesis (HLS) tools, which enable automatic compilation from high-level programs (C/C++) to low-level RTL specifications, were recently adopted by many studies to implement deep neural networks on FPGAs. In [3], an accelerator design was implemented using the Vivado HLS tool on a Xilinx VC707 FPGA. Computation throughput and memory bandwidth are quantitatively explored using a roofline model to find the design with the best performance and lowest resource usage. However, only the convolution layers are implemented. The work of [4] proposed a fixed-point CNN design using the OpenCL framework. A systematic methodology is presented to minimize execution time under given resource constraints. Due to the matrix-multiplication-based kernel design for the convolution layer and the GPU-like separated kernel organization adopted, the FPGA's special advantage of implementing deeply pipelined circuits (kernels) is not fully exploited to further improve computation throughput and minimize memory bandwidth.
The main contributions of this work are: (1) an OpenCL-based FPGA accelerator with an efficient structure of pipelined kernels is proposed for implementing large-scale CNNs; (2) the design space of the proposed architecture was fully explored on a Stratix-V A7 FPGA, and two real-world large-scale CNN models were implemented and tested. Results show that the proposed scheme achieves better performance and resource utilization than previous works; (3) we have made our design openly accessible [7] for other researchers to study and explore new accelerator architectures for deep neural networks.
II. OpenCL-Based CNN Implementation
II-A. OpenCL Framework
OpenCL is an open, cross-platform parallel programming language that can be used in both GPU and FPGA development. The OpenCL-based FPGA accelerator development flow is summarized in Fig. 1. In the framework, an FPGA board (as the OpenCL device) is connected to a desktop CPU (as the OpenCL host) through a high-speed PCIe slot, forming a heterogeneous computing system. The OpenCL code, which defines multiple parallel compute units (CUs) in the form of kernel functions, is compiled and synthesized to run on the FPGA accelerator. On the host side, a C/C++ code runs on the CPU, calling vendor-specific application programming interfaces (APIs) to communicate with the kernels implemented on the FPGA accelerator. This work uses the Altera OpenCL SDK toolset for compiling, implementing and profiling the OpenCL code on FPGAs.
II-B. Proposed Accelerator Architecture
A standard CNN [1, 2] for image classification is comprised of one or more convolution layers and pooling layers, followed by one or more fully connected (FC) layers. As analyzed in [4], the core part of the convolution layer is a 3-dimensional multiply-accumulate operation that can be defined by
D_o(f_o, y, x) = Σ_{f_i=0}^{C-1} Σ_{k_y=0}^{K-1} Σ_{k_x=0}^{K-1} W_l(f_o, f_i, k_y, k_x) × D_i(f_i, y+k_y, x+k_x)    (1)
where D_i(f_i, y, x) and D_o(f_o, y, x) denote the neurons at position (y, x) in the input feature map f_i and the output feature map f_o, respectively, K is the convolution window size, and C is the number of input feature maps. W_l(f_o, f_i, k_y, k_x) represents the corresponding weights in the l-th layer that get convolved with D_i. In pooling layers, 2-D subsampling operations are performed on neighboring neurons of the same feature map. As one traverses deeper into the neural network, feature dimensions are gradually reduced. In FC layers, each output neuron is calculated as the weighted summation of all input neurons, as shown by
D_o(f_o) = Σ_{f_i=0}^{C-1} W(f_o, f_i) × D_i(f_i)    (2)
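As a concrete reference for the two operations above, the following minimal Python model computes one output neuron of a convolution layer and of an FC layer; the dimensions and all-ones data are purely illustrative:

```python
def conv_neuron(W_l, D_i, f_o, y, x, C, K):
    """Eq. (1): one neuron D_o(f_o, y, x) of a convolution layer."""
    return sum(W_l[f_o][f_i][ky][kx] * D_i[f_i][y + ky][x + kx]
               for f_i in range(C) for ky in range(K) for kx in range(K))

def fc_neuron(W, D_i, f_o, C):
    """Eq. (2): one neuron D_o(f_o) of a fully connected layer."""
    return sum(W[f_o][f_i] * D_i[f_i] for f_i in range(C))

# Tiny illustrative sizes: C=2 input maps, K=3 window, 4x4 input, one output map
C, K = 2, 3
W_l = [[[[1.0] * K for _ in range(K)] for _ in range(C)]]
D_i = [[[1.0] * 4 for _ in range(4)] for _ in range(C)]
assert conv_neuron(W_l, D_i, 0, 0, 0, C, K) == C * K * K   # 18 MACs over the window
assert fc_neuron([[2.0] * 4], [1.0] * 4, 0, 4) == 8.0      # plain inner product
```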
In some CNN models [1], local response normalization (LRN) layers, which normalize each input neuron value by a factor that depends on the neighboring neurons, are also used after the pooling layer.
As illustrated in Fig. 2, the proposed architecture consists of four kernels that are connected by using Altera's OpenCL extension Channels/Pipes. The Convolution kernel (Conv.) is designed to implement both the 3-D multiply-accumulate operation of (1) and the inner-product operation of (2). The Pooling kernel performs subsampling directly on the output data streams of the Conv. kernel. Two data mover kernels, namely MemRD and MemWR, transfer feature data and weights from/to the global memory. As analyzed in Section I, the cascaded kernels form a deep computation pipeline that can implement a series of basic CNN operations without the need to store inter-layer data back to global memory. This significantly reduces the bandwidth requirement compared to the work of [4]. The LRN function is implemented separately from the pipeline since it may operate on data from adjacent feature maps or from the same feature map, which requires multiple memory access patterns. The detailed design of each kernel is as follows:
II-B1. Convolution Kernel
A single-threaded kernel with parallel convolution data paths is designed to implement the functions of both the convolution and FC layers. Two techniques are used to improve the computation throughput and pipeline utilization. First, a multi-mode convolution circuit with a deeply pipelined multiply-add tree and delayed buffers is designed. In [3], Eq. (1) was written as a 5-level nested loop, in which complicated loop tiling and memory partition techniques are used to improve computation throughput. However, manual memory partition capability is not yet available in Altera's OpenCL, and the Channel read/write operations used in loops also prevent tiling optimization. Therefore, we transform (1) into a structure similar to (2), and implement both functions as a 2-level nested loop structure. The pseudo-code is shown in Fig. 3. When an appropriate buffer depth N is set, an efficient pipeline with an initiation interval of two can be synthesized by Altera's OpenCL compiler.
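Fig. 3 is not reproduced in this text, but the idea behind the flattened structure can be sketched in a few lines of Python: the 3-D MAC of Eq. (1) becomes one long dot product, and a small ring of partial sums (modeling the delayed buffer) breaks the loop-carried dependency of the accumulation so that the multiply-add tree can be pipelined. The buffer depth and data below are illustrative, not taken from the actual kernel:

```python
def mac_pipelined(weights, data, depth):
    """Dot product accumulated into a ring of `depth` partial sums.

    In hardware the ring models the delayed buffer that hides adder
    latency; here it only demonstrates numerical equivalence with the
    straightforward accumulation.
    """
    partial = [0.0] * depth
    for i, (w, d) in enumerate(zip(weights, data)):
        partial[i % depth] += w * d   # each slot is reused every `depth` steps
    return sum(partial)               # final reduction of the partial sums

# The flattened dot product matches the nested-loop formulation of Eq. (1)
w = [0.5 * i for i in range(36)]      # e.g. a K*K*C = 3*3*4 window, flattened
d = [1.0] * 36
reference = sum(wi * di for wi, di in zip(w, d))
assert abs(mac_pipelined(w, d, depth=8) - reference) < 1e-9
```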
Second, data vectorization and parallel CU structures are both exploited in the design. Vectorized input features and weights are streamed through multiple Channels. A design parameter VEC_SIZE is introduced to control the input throughput. The outermost for loop is unrolled by a factor of CU_NUM to generate multiple instances of the convolution pipeline. Consequently, outputs in different output feature maps can be generated in parallel. When configured in 3-D convolution mode, N is set to the value of K×K×C', while in FC mode, N is set to C', where C' = C/VEC_SIZE. When no pipeline stalls are caused by Channel accesses, a speedup of VEC_SIZE×CU_NUM can be achieved.
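The attainable throughput of this structure can be sanity-checked with simple arithmetic: every cycle the unrolled pipeline performs VEC_SIZE × CU_NUM multiply-accumulates, each counting as two operations. A sketch, using the parameter values and kernel frequency that are reported for this design in Section III:

```python
# Parameter values reported in Section III for the DE5-net board
VEC_SIZE = 8      # vectorized input lanes per compute unit
CU_NUM = 16       # parallel convolution compute units
FREQ_HZ = 181e6   # kernel clock frequency from the comparison table

# Each multiply-accumulate counts as two operations (multiply + add)
ops_per_cycle = 2 * VEC_SIZE * CU_NUM
peak_gops = ops_per_cycle * FREQ_HZ / 1e9
print(f"peak throughput: {peak_gops:.1f} GOPS")   # ~46.3 GOPS

# The measured 33.9 GOPS is below this ideal peak because Channel stalls
# and the global memory bandwidth limit reduce pipeline utilization.
```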
II-B2. Data Mover Kernels
Two multi-mode 3-D NDRange kernels are designed to fetch/store data from/to the global memory for the computation pipelines. The data and work-item mapping schemes are illustrated in Fig. 4. In convolution mode, the MemRD kernel launches with a 3-D global work size that covers the input windows of all output feature maps, while the MemWR kernel works in an NDRange matching the output feature map dimensions. Work-items are arranged into multiple concurrent work-groups, each of which has a local work size of (K, K, C'). Therefore, a strict data read-write ratio is maintained between the two kernels. The proposed data movers enable efficient data reuse, which significantly reduces the global memory bandwidth requirement: 1) for each work-item, the fetched data of the input feature map are replicated by registers inside the MemRD kernel, and then passed to all the CUs of the following Conv. kernel for parallel computation of the CU_NUM output features; 2) the fetched weights are first loaded into an on-chip cache generated by the compiler. Different work-groups that share the same work-group index can reuse the same weights from the cache without issuing new global memory load instructions.
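The saving from input-feature replication can be illustrated with a back-of-the-envelope traffic model; the layer dimensions below are hypothetical (not taken from AlexNet or VGG) and the model counts only feature-map reads:

```python
# Hypothetical convolution layer dimensions
K, C, H_O, W_O, M = 3, 64, 32, 32, 128   # window, input maps, output size, output maps
CU_NUM = 16                               # parallel compute units

# Without replication, computing each output feature map would refetch
# the K*K*C input window of every output pixel from global memory.
reads_naive = M * H_O * W_O * K * K * C

# With register replication inside MemRD, one fetched window is broadcast
# to all CU_NUM compute units, which produce CU_NUM output maps at once.
reads_reused = (M // CU_NUM) * H_O * W_O * K * K * C

assert reads_naive // reads_reused == CU_NUM   # feature traffic shrinks by CU_NUM
```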
In FC mode, both the input feature and the weight data are 1-D vectors as defined in Eq. (2). Directly launching the MemRD kernel with only one classification task would significantly reduce the opportunity for data reuse in the weights. Therefore, we introduce a batched processing capability in MemRD: a batch of classifications can be processed with a single kernel launch by mapping the batch of input feature vectors as a single 3-D data set.
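The benefit of batching is easy to quantify: without batching, the FC weight matrix is reloaded from global memory for every image; with batching, it is loaded once per kernel launch and reused across the whole batch. The layer and batch sizes below are hypothetical:

```python
# Hypothetical FC layer: C input neurons, M output neurons, batch of B images
C, M, B = 4096, 1000, 16

# Unbatched: every image triggers a full reload of the M*C weight matrix
weight_loads_unbatched = B * M * C

# Batched: one launch loads each weight once and applies it to all B images
weight_loads_batched = M * C

assert weight_loads_unbatched // weight_loads_batched == B   # B-fold reduction
```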
II-B3. Pooling Kernel
A line-buffer-based hardware structure is proposed for the pooling kernel, as shown in Fig. 5. The kernel first reads data of the same feature map in a line-by-line manner from the Channel and then stores them in a group of line buffers. After all the buffers are filled, a window of feature map data is read out and sent to the next stage of pooling logic. In CNNs, two pooling schemes, i.e., max-pooling and average-pooling, are widely used. Therefore, the pooling logic modules support either selecting the maximum or computing the average value of the (L+1) inputs. The kernel can also be turned off by setting a control register.
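The line-buffer idea can be sketched as a small streaming model: pixels arrive one per step in row-major order, whole rows are held in line buffers, and a pooling window is emitted whenever enough lines and columns have been buffered. Only max-pooling with window and stride p is shown (averaging is analogous), and the 4×4 input is illustrative:

```python
def stream_maxpool(pixels, width, p=2):
    """Max-pool a row-major pixel stream with p line buffers (window p, stride p)."""
    lines = [[0] * width for _ in range(p)]   # line buffers, one per row of the window
    out = []
    for idx, v in enumerate(pixels):
        y, x = divmod(idx, width)
        lines[y % p][x] = v                   # write the incoming pixel into its buffer
        # once p lines are buffered and the column is window-aligned, emit one result
        if y % p == p - 1 and x % p == p - 1:
            window = [lines[i][x - j] for i in range(p) for j in range(p)]
            out.append(max(window))
    return out

feature = list(range(1, 17))                  # a 4x4 feature map, row-major
assert stream_maxpool(feature, width=4) == [6, 8, 14, 16]
```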
II-B4. LRN Kernel
We choose the piecewise linear approximation scheme presented in [4] to implement the core exponent function of the LRN kernel. We improve this scheme by introducing a new lookup table segmentation method to reduce hardware cost, as shown in Fig. 6. In this new method, we divide the function evaluation range at powers of two, with an integer parameter that controls the number of segments per range and thus the accuracy. The approach avoids complicated table addressing logic by operating directly on the exponent of the input. The lookup table size is determined by the chosen segmentation parameter. In the AlexNet implementation, the segmentation parameter is set so that the maximum approximation error remains small.
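The exact segmentation constants used in [4] and in our implementation are not reproduced here; the following sketch only illustrates the addressing idea: splitting the input into mantissa and exponent, using the exponent to select a power-of-two range and m mantissa bits to select a linear segment within it. The function f(x) = x^(-0.75) stands in for the LRN power function, and all ranges and the parameter m=4 are assumed values for illustration:

```python
import math

def build_lut(f, e_min, e_max, m):
    """One linear segment per (exponent, sub-segment) pair; 2**m segments per range."""
    seg, lut = 2 ** m, {}
    for e in range(e_min, e_max + 1):         # each range is [2**(e-1), 2**e)
        for i in range(seg):
            x0 = 2.0 ** (e - 1) * (1.0 + i / seg)
            x1 = 2.0 ** (e - 1) * (1.0 + (i + 1) / seg)
            lut[(e, i)] = (x0, f(x0), (f(x1) - f(x0)) / (x1 - x0))
    return lut

def lut_eval(lut, m, x):
    frac, e = math.frexp(x)                   # x = frac * 2**e, frac in [0.5, 1)
    i = int((frac - 0.5) * 2 * (2 ** m))      # segment index straight from the mantissa
    x0, y0, slope = lut[(e, i)]
    return y0 + slope * (x - x0)              # linear interpolation within the segment

def f(x):
    return x ** -0.75                         # stand-in for the LRN power function

lut = build_lut(f, e_min=-3, e_max=4, m=4)
err = max(abs(lut_eval(lut, 4, x) - f(x)) / f(x)
          for x in [0.07, 0.3, 0.9, 1.3, 2.7, 7.9, 13.5])
assert err < 1e-2                             # small relative error at m=4
```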
III. Experimental Results
In this section, we present the implementation results of the proposed OpenCL design on the Altera Stratix-V FPGA based DE5-net board. The Stratix-V A7 FPGA consists of 622K logic elements (LEs), 256 DSP blocks and 2560 M20K RAMs. There are also two 2GB DDR3 DRAMs connected to the FPGA that function as the global memory. The OpenCL kernel code is compiled using Altera OpenCL SDK v15.1. Two large-scale CNN models, AlexNet (8 layers) and VGG (16 layers), are tested with different hardware parameter settings to explore the design space.
Fig. 8 presents the measured performance (a) and the detailed resource utilization (b)-(d) of the FPGA for implementing the AlexNet model. It can be observed that all three categories of hardware resources scale linearly as the number of compute units increases. The corresponding improvements in performance are also significant from 2 CUs to 8 CUs. As the system throughput continues to increase, the memory bandwidth gradually reaches the on-board DRAM limit (around 12.8 GB/s), which introduces more frequent pipeline stalls and degrades the performance gain. One can estimate from the reported data that the optimal parameters are VEC_SIZE=8 and CU_NUM=16 for the DE5-net board. With these settings, the shortest image classification times achieved are 43 ms for the AlexNet model and 718 ms for the VGG-16 model, respectively. Fig. 8 shows the profiled timeline of each kernel running these two CNN models. Note that the final runtime without kernel profiling will be lower than that shown in the figure. To measure the power consumption, we blocked the power pins of the PCIe slot and powered the board through the external port. The average power consumed by the board while running these two models is 27.3W and 29.8W, respectively.
We further compare the proposed design with other HLS-based designs in Table III. Since CNNs are multiplication-intensive, we adopt the number of DSP blocks consumed as the main factor for evaluating hardware resource utilization. Our approach achieves a 34% reduction in DSP resources while maintaining comparable performance with [4]. One can also expect that the proposed architecture would obtain further performance improvements over [4] if fixed-point data types were adopted. Moreover, the proposed design implements the full-precision (32-bit floating-point) CNN forward computation, which also makes it favorable for implementing the backward propagation flow of model training. For a more straightforward comparison, we provide the normalized performance as "performance density" in the table. It is clear that our method outperforms previous works.
Comparison with previous HLS-based designs:

                      FPGA2016 [4]      FPGA2015 [3]      This work
Device                Stratix-V GXA7    Virtex-7 VX485T   Stratix-V GXA7
FPGA Capacity         622K LUTs,        485K LUTs,        622K LUTs,
                      256 DSPs          2800 DSPs         256 DSPs
Design Scheme         OpenCL            Vivado HLS        OpenCL
Frequency             120 MHz           100 MHz           181 MHz
Precision             fixed (8-16b)     float             float
Classification Time   45.7 ms (a)       21.6 ms (b)       43 ms (b)
Throughput            31.8 GOPS (a)     61.6 GOPS (b)     33.9 GOPS (a)
DSPs Consumed         246               2240              162
Performance Density   0.13 GOPS/DSP     0.027 GOPS/DSP    0.21 GOPS/DSP
Power                 25.8 W            18.6 W            27.3 W

(a) all operations for image classification.
(b) convolution operation only.
IV. Conclusion
This work presents an open-source OpenCL-based FPGA accelerator for convolutional neural networks. A performance-cost scalable hardware architecture with efficiently pipelined kernels was proposed. The design space was explored by implementing two large-scale CNNs, AlexNet and VGG, on the DE5-net FPGA board. Results show that our scheme achieves significant improvements in performance density and resource utilization compared to previous studies.
References
 [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Neural Information Processing Systems (NIPS'12), 2012.
 [2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
 [3] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15), 2015.
 [4] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. F. Ma, S. Vrudhula, J. S. Seo, and Y. Cao, "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16), 2016.
 [5] J. Qiu, J. Wang, S. Yao et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '16), 2016.
 [6] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, "DLAU: a scalable deep learning accelerator unit on FPGA," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2016.
 [7] https://github.com/doonny/PipeCNN.