CNNLab: a Novel Parallel Framework for Neural Networks using GPU and FPGA — a Practical Study with Trade-off Analysis


Maohua Zhu, Liu Liu, Electrical and Computer Engineering, UCSB
Email: {maohuazhu, liu_liu}@umail.ucsb.edu
Chao Wang, Computer Science, USTC
Email: cswang@ustc.edu.cn
Yuan Xie, Electrical and Computer Engineering, UCSB
Email: yuanxie@ece.ucsb.edu
Abstract

Designing and implementing efficient, provably correct parallel neural network processing is challenging. Existing high-level parallel abstractions such as MapReduce are insufficiently expressive, while low-level tools such as MPI and Pthreads leave machine learning experts repeatedly solving the same design challenges. Moreover, the diversity of applications and the large scale of the data pose a significant challenge to building flexible, high-performance implementations of deep learning neural networks. To improve performance while maintaining scalability, we present CNNLab, a novel deep learning framework using GPU and FPGA-based accelerators. CNNLab provides a uniform programming model, so the hardware implementation and the scheduling are invisible to programmers. At runtime, CNNLab leverages the trade-offs between GPU and FPGA before offloading tasks to the accelerators. Experimental results on a state-of-the-art Nvidia K40 GPU and an Altera DE5 FPGA board demonstrate that CNNLab provides a universal framework with efficient support for diverse applications without increasing the burden on programmers. Moreover, we analyze in detail the performance, throughput, power, energy, and performance density of both approaches. The results quantify the trade-offs between GPU and FPGA and provide useful practical experience for the deep learning research community.

I Introduction

In the past several years, machine learning has become pervasive in various research fields and commercial applications, and has delivered satisfactory products. In particular, the emergence of deep learning has accelerated the development of machine learning and artificial intelligence. Consequently, deep learning has become a research hotspot in both academia and industry [1]. In general, deep learning uses a multi-layer neural network model to extract high-level features as combinations of low-level abstractions, discovering distributed representations of the data and thereby solving complex machine learning problems. Currently, the most widely used deep learning models are Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), which have excellent capability in image recognition, speech recognition, and other complex machine learning tasks.

However, with the increasing accuracy requirements and complexity of practical applications, network sizes have grown explosively (for example, the Google cat-recognition system has one billion neuronal connections). This explosive volume of data and computation makes data centers quite power hungry. It is therefore challenging to implement high-performance deep learning networks at low power cost, especially for large-scale neural network models.

The state-of-the-art means of accelerating deep learning algorithms are the Field-Programmable Gate Array (FPGA) [2], the Application-Specific Integrated Circuit (ASIC) [3], and the Graphics Processing Unit (GPU) [4]. The GPU is well recognized for its massive computing capacity. Compared with GPU acceleration, hardware accelerators such as FPGAs and ASICs can achieve at least satisfactory performance with lower power consumption. However, both FPGAs and ASICs have relatively limited computing resources, memory, and I/O bandwidth; it is therefore challenging to implement complex, large-scale deep neural networks on hardware accelerators. To date, the problem of providing efficient middleware support across these architectures has not been adequately solved.

Another challenge is the diversity of deep learning applications and the programming effort they require. Due to the design complexity of deep learning algorithms, architectures, and accelerators, significant programming effort is needed to make good use of the accelerators across diverse application domains. If the mapping of computation is decided manually, the quality of the scheduling depends on the experience of the programmer, who typically has limited knowledge of the hardware. Efficient middleware support can alleviate this burden on high-level programmers. To tackle these problems, in this paper we present CNNLab, a middleware architecture targeting state-of-the-art GPU and FPGA-based deep learning acceleration engines. Our main contributions are the following:

  • We introduce a novel framework for deep learning applications on state-of-the-art architectures. Applications can be mapped to heterogeneous accelerators through a well-structured interface, improving flexibility and scalability. CNNLab maps the applications into computing kernels using the CUDA and OpenCL programming interfaces.

  • CNNLab is built on a heterogeneous hybrid system that includes a software processor, a GPU accelerator, and an FPGA-based hardware accelerator to speed up the computational kernels of deep learning algorithms. In particular, we provide efficient middleware support to bridge the gap between high-level neural networks and hardware accelerators.

  • We construct a hardware prototype using a state-of-the-art Nvidia K40 GPU and an Altera FPGA platform. Experimental results on real hardware demonstrate that CNNLab achieves remarkable speedup with insignificant overhead. More importantly, we explore the trade-offs between the FPGA and GPU implementations by analyzing quantitative results for running time, throughput, power, energy, and performance density.

The rest of the paper is organized as follows. Section II summarizes the problem description and motivation. Section III discusses the CNNLab abstraction, including the architecture, the programming model, and the hierarchical layers. Section IV describes the GPU and FPGA implementations of the accelerators, analyzes the detailed results, and discusses the trade-offs between the GPU and FPGA-based implementations. Section V outlines related work on neural network accelerators. Finally, Section VI presents conclusions and future work.

Fig. 1: Neural network hierarchy containing convolutional, pooling, and classifier layers.

II Problem Description and Motivation

Deep learning has recently gained great popularity in the machine learning community due to its potential to solve previously difficult learning problems. Even though Deep and Convolutional Neural Networks take diverse forms, they share enough properties that a generic description can be formalized. First, these algorithms consist of a large number of layers, which are normally executed in sequence and can therefore be implemented and evaluated separately. Second, each layer usually contains several sub-layers called feature maps; we then use the terms feature input maps and feature output maps. Overall, there are three main kinds of layers: most of the hierarchy is composed of convolutional and pooling layers, and a classifier consisting of one or more layers sits at the top of the network. The role of a convolutional layer is to apply one or several local filters to data from the input layer. Consequently, the connectivity between the input and output feature maps is local. Consider the case where the input is an image; the convolution is then a 2D transform between a subset of the input layer and a kernel of the same dimensions, as illustrated in Fig. 1. The kernel values are the synaptic weights between the input layer and the output layer. In general, a DNN involves two computational processes: prediction and training. Prediction is a feedforward computation that computes the output for each given input under the current network settings. Training includes pre-training, which locally tunes the connection weights between the units in adjacent layers, and global training, which globally tunes the connection weights with the back-propagation (BP) algorithm.

Firstly, we introduce the prediction process, which is a feedforward computation. The process computes layer by layer as in a traditional neural network, and the outputs of the current layer are the inputs of the next layer. The deep neural network is composed of an input layer, an output layer, and multiple hidden layers that obtain representations of the data at multiple levels of abstraction. The prediction computation of a DNN is thus a bottom-up feedforward process, where the output of a lower layer is the input of its upper layer. To present the prediction computation clearly, we consider a layer L with m neurons, whose lower layer has n neurons. The connectivity between the layers is full, so each pair of neurons has its own weight value, resulting in an m × n weight matrix W. L reads in the inputs x = (x_1, ..., x_n) from its lower layer and produces the outputs y = (y_1, ..., y_m). The output of neuron k (k = 1, 2, ..., m) in L can be written as y_k = f(Σ_j w_{kj} x_j + b_k), where x_j is a neuron of the lower layer, f is the activation function, w_{kj} is the weight coefficient between x_j and y_k, and b_k is the offset value. Since Σ_j w_{kj} x_j can be regarded as a multiplication between a row vector of W and the column vector x, the computation of the whole layer can be formalized as a vector-matrix multiplication followed by the activation function, shown as Equation (1):

y = f(Wx + b)    (1)

where:

x = (x_1, x_2, ..., x_n)^T    (2)
y = (y_1, y_2, ..., y_m)^T    (3)
W = [w_{kj}] (an m × n matrix), b = (b_1, b_2, ..., b_m)^T    (4)
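For concreteness, the per-layer feedforward computation of Equation (1) is a dense matrix-vector product followed by the activation function. The plain C++ sketch below is only a reference implementation of the formula (not the accelerated kernels discussed later); it assumes single-precision data, a row-major m × n weight matrix, and a sigmoid activation:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Reference (unaccelerated) feedforward step of one fully connected layer:
    // y = f(W * x + b), with W stored row-major as an m x n matrix.
    std::vector<float> fc_forward(const std::vector<float>& W,
                                  const std::vector<float>& b,
                                  const std::vector<float>& x,
                                  std::size_t m, std::size_t n) {
        std::vector<float> y(m);
        for (std::size_t k = 0; k < m; ++k) {
            float acc = b[k];                        // offset value b_k
            for (std::size_t j = 0; j < n; ++j)      // row k of W times x
                acc += W[k * n + j] * x[j];
            y[k] = 1.0f / (1.0f + std::exp(-acc));   // activation f (sigmoid here)
        }
        return y;
    }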

To accelerate the kernel functions of CNN processing, this paper presents the CNNLab architecture, which uses both GPU and FPGA platforms and explores the trade-offs in their performance and power metrics.

III The CNNLab Abstraction

Fig. 2: High-level CNNLab architecture with the corresponding neural network layer model. The NN processing is decomposed into layers and scheduled on either the GPU or the FPGA-based accelerators.

III-A Data Model and Processing Flow

In this section, we present the modeling framework of the CNNLab architecture. Fig. 2 illustrates the common infrastructure of CNNLab from a high-level perspective. Front-end cloud users access the CNNLab platform via a uniform programming interface with user definitions of the CNN model, whereby each layer is packaged and exposed in an API-like manner. Meanwhile, in the back-end, each layer is composed of specific functionalities provided by software libraries and a resource pool through consistent communication interfaces. The functionality of each layer is combined with multiple data resources through a sequence of input/output parameters. It should be noted that the detailed composition and the middleware scheduling are invisible to front-end software users. At runtime, the application is first decomposed into multiple layers according to the definition of specific parameters, and these layers are then scheduled. Whenever a pending layer has obtained its requisite input parameters, it can be offloaded to a particular accelerator for immediate execution.

Fig. 3 presents a high-level overview of the CNNLab processing flow. As the first step, the deep learning specialist provides as input a high-level description of a ConvNet architecture, together with information about the target CNNLab platform, using a general layer-based model that we designed. Next, the structure of the input NN model undergoes design space exploration and trade-off analysis in the middleware, considering the requirements of the application. Searching the design space yields a succession of mappings of the NN model onto the particular FPGA-based or GPU-based platforms, using the OpenCL or CUDA programming interface, respectively.

Fig. 3: General Processing Flow of the CNNLab.

III-B User-Defined Computation

To describe the CNN model, we associate each layer with a tuple of parameters. Currently, the following types of layers are supported, which are the ones most commonly used in the ConvNet literature:

III-B1 Convolutional Layer

The model of the Convolutional Layer is abstracted as

Conv = (In, Out, K, S, T)    (5)

where

  • In and Out are the input and output matrices of each convolutional layer, each with height × width dimensions.

  • K is the kernel size that each accelerator can process, with height × width dimensions.

  • S is the stride, which defines the step between successive convolution windows.

  • T is the type of nonlinear function to be applied, e.g. sigmoid, tanh, or ReLU. A plain reference sketch of this computation is shown below.
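As a concrete illustration of these parameters, the following plain C++ sketch implements a single-channel convolution with a K × K kernel, stride S, and ReLU as the nonlinearity T. It is illustrative only: real layers add outer loops over input and output feature maps, and the function name and memory layout are our own choices rather than CNNLab's.

    #include <algorithm>
    #include <vector>

    // Reference sketch of one convolutional layer described by (In, Out, K, S, T):
    // one input channel, one k x k kernel, stride s, ReLU nonlinearity.
    void conv2d_relu(const std::vector<float>& in, int in_h, int in_w,
                     const std::vector<float>& kernel, int k, int s,
                     std::vector<float>& out) {
        const int out_h = (in_h - k) / s + 1;
        const int out_w = (in_w - k) / s + 1;
        out.assign(out_h * out_w, 0.0f);
        for (int oy = 0; oy < out_h; ++oy)
            for (int ox = 0; ox < out_w; ++ox) {
                float acc = 0.0f;
                // 2D transform between a k x k window of the input and the kernel
                for (int ky = 0; ky < k; ++ky)
                    for (int kx = 0; kx < k; ++kx)
                        acc += in[(oy * s + ky) * in_w + (ox * s + kx)] *
                               kernel[ky * k + kx];
                out[oy * out_w + ox] = std::max(0.0f, acc);  // T = ReLU
            }
    }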

III-B2 Normalization Layer

The model of the Normalization Layer is abstracted as

Norm = (In, T, L, α, β)    (6)

where

  • In is the input matrix of the normalization layer, with height × width dimensions.

  • T is the type of normalization operation to be applied.

  • L is the local size applied in the normalization operation.

  • α and β are the parameters used in the LRN computation, as sketched in the formulation below.
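As a reference for these parameters, a common cross-channel LRN formulation (the AlexNet-style variant, assumed here purely for illustration) computes the output activation b_i(x, y) of channel i at position (x, y) from the input activations a_j(x, y) as

    b_i(x, y) = a_i(x, y) / ( κ + α · Σ_j a_j(x, y)² )^β,  with j ranging over a window of L channels centered on i,

where L is the local size above, α and β are the LRN parameters, and κ is a small constant offset.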

III-B3 Pooling Layer

The model of the Pooling Layer is abstracted as

Pool = (In, Out, T, N, S)    (7)

where

  • In and Out are the input and output matrices of the pooling layer, each with height × width dimensions.

  • T is the type of pooling operation to be applied, i.e. either max or average.

  • N is the number of pooling kernels in the pooling layer.

  • S is the stride, which defines the step between successive pooling windows. A reference sketch of the max-pooling case is shown below.
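In the same spirit, a plain C++ reference sketch of the max-pooling case (T = max) over a k × k window with stride s is shown below; it handles a single feature map, and the names are our own.

    #include <algorithm>
    #include <vector>

    // Reference sketch of a max-pooling layer over one feature map.
    void max_pool2d(const std::vector<float>& in, int in_h, int in_w,
                    int k, int s, std::vector<float>& out) {
        const int out_h = (in_h - k) / s + 1;
        const int out_w = (in_w - k) / s + 1;
        out.assign(out_h * out_w, 0.0f);
        for (int oy = 0; oy < out_h; ++oy)
            for (int ox = 0; ox < out_w; ++ox) {
                float best = in[(oy * s) * in_w + (ox * s)];
                for (int ky = 0; ky < k; ++ky)
                    for (int kx = 0; kx < k; ++kx)
                        best = std::max(best,
                            in[(oy * s + ky) * in_w + (ox * s + kx)]);
                out[oy * out_w + ox] = best;  // T = max over the pooling window
            }
    }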

Fig. 4: General Programming Framework
Fig. 5: An Example Code Snippet for Convolutional Layer using cuDNN and OpenCL kernels

III-B4 FC Layer

The model of the FC Layer is abstracted as

FC = (In, Out)    (8)

where

  • In is the input matrix of the FC layer, with height × width dimensions for the FC-dropout layer.

  • Out stands for the output of the FC layer.

III-C Programming Model and Wrapper

Based on the flexible programming framework in CNNLab, tasks can be distributed to either GPU or FPGA-based accelerators. In particular, the middleware must provide a uniform runtime for the different hardware architectures. Fig. 4 illustrates the general programming framework based on the runtime kernels. The API forwards requests via the scheduling middleware in the host code. The host code can offload part of the execution threads to CUDA kernels or OpenCL kernels, depending on the accelerator architecture and the design space exploration. Different kernels share a virtual memory space for communicating the accelerators' parameters. The scheduling process and run-time support are invisible to the programmers, as the API provides an efficient bridge from high-level applications to hardware implementations. In particular, we use two example code snippets, one using CUDA (cuDNN v5) and one using OpenCL, as a demonstration.

Fig. 5 illustrates an example code segment using the GPU and FPGA-based accelerators. The general processing flow includes the following steps: 1) platform initialization, 2) setting the input and kernel descriptors, 3) computation using the accelerators, and 4) data synchronization after execution. To offload the kernels to the GPU and FPGA-based accelerators, the cuDNN contexts and OpenCL contexts are invoked with specific primitives. The code segments are general and can be ported to other GPU- or FPGA-based hardware platforms.
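Since Fig. 5 appears here only as a caption, the listing below is a minimal, illustrative host-code sketch of the same four-step flow for one convolutional layer using cuDNN. It is not the paper's exact code: it follows the descriptor-based API of recent cuDNN releases (the descriptor-setting signatures differ slightly between cuDNN v5 and later versions), it uses illustrative shapes (a 3x224x224 input and 96 kernels of 11x11 with stride 4 and no padding, giving a 96x54x54 output), it assumes the device buffers d_in, d_filt, and d_out are already allocated and populated, and it omits error checking and descriptor cleanup.

    #include <cuda_runtime.h>
    #include <cudnn.h>

    // Four-step flow of Fig. 5 for one Conv layer (illustrative sketch only).
    void conv_forward_cudnn(const float* d_in, const float* d_filt, float* d_out) {
        // 1) Platform initialization
        cudnnHandle_t h;
        cudnnCreate(&h);

        // 2) Set input, filter, convolution, and output descriptors
        cudnnTensorDescriptor_t inDesc, outDesc;
        cudnnFilterDescriptor_t filtDesc;
        cudnnConvolutionDescriptor_t convDesc;
        cudnnCreateTensorDescriptor(&inDesc);
        cudnnCreateTensorDescriptor(&outDesc);
        cudnnCreateFilterDescriptor(&filtDesc);
        cudnnCreateConvolutionDescriptor(&convDesc);
        cudnnSetTensor4dDescriptor(inDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   1, 3, 224, 224);
        cudnnSetFilter4dDescriptor(filtDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                   96, 3, 11, 11);
        cudnnSetConvolution2dDescriptor(convDesc, 0, 0, 4, 4, 1, 1,
                                        CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
        cudnnSetTensor4dDescriptor(outDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   1, 96, 54, 54);

        // 3) Computation using the accelerator
        const float alpha = 1.0f, beta = 0.0f;
        size_t wsBytes = 0;
        void* ws = nullptr;
        cudnnGetConvolutionForwardWorkspaceSize(h, inDesc, filtDesc, convDesc, outDesc,
            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM, &wsBytes);
        cudaMalloc(&ws, wsBytes);
        cudnnConvolutionForward(h, &alpha, inDesc, d_in, filtDesc, d_filt, convDesc,
            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM, ws, wsBytes,
            &beta, outDesc, d_out);

        // 4) Data synchronization after execution
        cudaDeviceSynchronize();
        cudaFree(ws);
        cudnnDestroy(h);
    }

An OpenCL path follows the same four steps, with clCreateContext/clCreateCommandQueue for initialization, kernel arguments in place of descriptors, clEnqueueNDRangeKernel for the computation, and clFinish (or a blocking read) for synchronization.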

Layer Name Layer Type Description
Conv1 Conv-ReLU Input: 3x224x224, Kernel: 96x3x11x11, Output: 96x55x55, Stride: 4
Conv2 Conv-ReLU Input: 96x27x27, Kernel: 256x96x5x5, Output: 256x27x27, Stride: 1
Conv3 Conv-ReLU Input: 256x13x13, Kernel: 384x256x3x3, Output: 384x13x13, Stride: 1
Conv4 Conv-ReLU Input: 384x13x13, Kernel: 384x384x3x3, Output: 384x13x13, Stride: 1
Conv5 Conv-ReLU Input: 384x13x13, Kernel: 256x384x3x3, Output: 256x13x13, Stride: 1
FC6 FC-dropout Input: 256x6x6, Output: 4096
FC7 FC-dropout Input: 4096, Output: 4096
FC8 FC-softmax Input: 4096, Output: 1000
TABLE I: Description of the Experimental Neural Network Model
Fig. 6: Evaluation and Trade-off Analysis between GPU and FPGA based Acceleration Engines.

IV Experimental Results and Trade-off Analysis

IV-A Platform Setup and Network Description

To measure the performance and overheads of the CNNLab architecture, we implemented a hardware prototype on a hybrid heterogeneous system:

CPU: An Intel Core i7-4770 processor clocked at 3.4 GHz was used as the host controller. The CPU assigns computational tasks to the acceleration kernels via a PCIe x8 edge connector.

FPGA: An Intel-Altera DE5 board was used to implement the deep learning modules. We used the Altera Quartus II toolchain to evaluate speedup and hardware cost, and the PowerPlay tool to estimate power consumption. The running frequencies of the accelerators on the FPGA range from 171 MHz to 304 MHz (see Table III for details).

GPU: An Nvidia K40 was used as the GPU accelerator, with 12,288 MB of memory, a peak device-memory bandwidth of 288 GB/s, and peak single-precision floating-point performance of 4.29 TFLOPS. We use the state-of-the-art cuDNN v5 (released in April 2016) as the CUDA programming library.

Table I describes the experimental neural network model, which includes 5 convolutional layers and 3 FC layers. We use ReLU as the nonlinear function in the convolutional layers. For each layer, the parameters introduced in Section III-B are instantiated with different configurations.
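Padding is not listed in Table I; the spatial output sizes of Conv2-Conv5 are consistent with the usual output-size relation under "same" zero-padding of p = (k - 1)/2,

    out = floor((in + 2p - k) / S) + 1,   e.g. Conv2: (27 + 2·2 - 5)/1 + 1 = 27,

while Conv1's 224-to-55 mapping presumably follows the original AlexNet convention and does not correspond to a single symmetric integer padding.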

Process Layer Name Layer Type fp operations per image Device Description
Forward FC6 FC-dropout 75497472 K40-cudnn Input: 256x6x6, Output: 4096
FC7 FC-dropout 33554432 K40-cudnn Input: 4096, Output: 4096
FC8 FC-softmax 8192000 K40-cudnn Input: 4096, Output: 1000
FC6 FC-dropout 75497472 K40-cublas Input: 256x6x6, Output: 4096
FC7 FC-dropout 33554432 K40-cublas Input: 4096, Output: 4096
FC8 FC-softmax 8192000 K40-cublas Input: 4096, Output: 1000
Backward FC6 FC-dropout 150994944 K40-cudnn Input: 256x6x6, Output: 4096
FC7 FC-dropout 67108864 K40-cudnn Input: 4096, Output: 4096
FC8 FC-softmax 16384000 K40-cudnn Input: 4096, Output: 1000
FC6 FC-dropout 150994944 K40-cublas Input: 256x6x6, Output: 4096
FC7 FC-dropout 67108864 K40-cublas Input: 4096, Output: 4096
FC8 FC-softmax 16384000 K40-cublas Input: 4096, Output: 1000
TABLE II: Network Description of GPU Models
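The "fp operations per image" counts in Table II are consistent with two floating-point operations (one multiply and one add) per weight of the fully connected layer. For FC6, for example,

    2 × (256 × 6 × 6) × 4096 = 2 × 9216 × 4096 = 75,497,472,

and FC7 and FC8 follow the same pattern (2 × 4096 × 4096 and 2 × 4096 × 1000). The backward counts are exactly twice the forward counts, which is consistent with computing one matrix product for the weight gradients and one for the input gradients.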

IV-B Results Analysis and Trade-offs between FPGA and GPU

We analyze the trade-offs between the two approaches along the following aspects: performance (execution time and throughput), cost (power and energy), and performance density (throughput per watt and operations per joule).

Performance. Fig. 6(a) presents the running time of the eight layers. The GPU outperforms the FPGA on all layers, and the speedup reaches up to 1000x for the FC layers. Across the eight layers, the speedup for the convolutional layers (1-5) is lower than for the FC layers (6-8), which consist of matrix multiplication operations. We also evaluate and compare the throughput, as illustrated in Fig. 6(b). The results mirror the running times: the GPU achieves significantly higher throughput than the FPGA. For example, the peak throughput of the GPU is 1632 GFLOPS in the Conv4 layer, while the peak throughput of the FPGA is only 25.56 GFLOPS in the Conv2 layer.

Power and Energy. To establish the cost model for both approaches, we evaluate the power and energy consumption of the GPU and FPGA-based accelerators. Fig. 6(c) compares the power cost. The average power of the GPU is 97 W, while the power of the FPGA-based accelerator for the convolutional layers is only 2.23 W. The FPGA is thus power saving due to its limited hardware resources and low working frequencies (171 MHz to 304 MHz). Concerning energy, both approaches have similar consumption when running convolutional layers: the average energy for the FPGA is 10.24 J, while the GPU costs 8.67 J on average. In contrast, the FPGA consumes significantly more energy for the FC layers than the GPU, as presented in Fig. 6(d): the average energy consumption of the FPGA is 12.24 J, while the GPU takes only 0.64 J on average. These results demonstrate that the GPU achieves better energy efficiency on FC layers due to its optimized matrix multiplication operations.

Performance Density. Based on the performance and power cost, we derive the performance density of both methods. First, for the Throughput/Power metric, the GPU and FPGA have similar performance density on the convolutional layers: the GPU achieves 14.12 GFLOPS/W while the FPGA reaches 10.58 GFLOPS/W. For the FC layers, the GPU substantially outperforms the FPGA, achieving an average density of 14.20 GFLOPS/W, while the FPGA has only 0.82 GFLOPS/W. Regarding energy, we use Operations/Energy (GFLOP/J) as the metric. Here the GPU far outperforms the FPGA, achieving 14732 GFLOP/J averaged over all layers, while the FPGA reaches only 41.35 GFLOP/J on average for the convolutional layers and 3.19 GFLOP/J for the FC layers.

The above results reveal that the GPU achieves higher speedup and throughput, while the FPGA consumes far less power. Regarding energy consumption and performance density, both approaches obtain similar results for convolutional computation, and the GPU outperforms the FPGA for the FC layers.

Fig. 7: Forward Comparison between Different GPU Models (cuDNN vs cuBLAS).
Resource Conv Layer LRN FC Pooling
ALUTs 209,786 48,327 112,152 35,247
Registers 320,656 82,469 197,666 54,603
Logic utilization 172,006/234,720 (73%) 51,185/234,720 (22%) 99,753/234,720 (42%) 40,581/234,720 (17%)
I/O pins 279/1,064 (26%) 279/1,064 (26%) 279/1,064 (26%) 279/1,064 (26%)
DSP blocks 162/256 (63%) 3/256 (1%) 130/256 (51%) 0/256 (0%)
Memory bits 8,236,663/52,428,800 (16%) 3,996,240/52,428,800 (8%) 5,556,688/52,428,800 (11%) 1,419,856/52,428,800 (3%)
RAM blocks 1,428/2,560 (56%) 432/2,560 (17%) 651/2,560 (25%) 283/2,560 (11%)
Actual clock freq. 171.29 MHz 269.02 MHz 216.16 MHz 304.50 MHz
TABLE III: Resource Utilization of the Accelerator on FPGA

IV-C Comparison between Different GPU Models

The above results demonstrate that the GPU achieves significantly higher throughput and performance density, especially for the FC layers. In this section, to evaluate the impact of different GPU library models, we use both the cuDNN and the cuBLAS libraries to implement the FC layers for forward computation and back-propagation, as illustrated in Table II.
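As background for this comparison, the forward pass of an FC layer over a mini-batch is a single dense matrix multiplication, so a cuBLAS implementation is essentially one GEMM call per layer. The sketch below shows FC7 (4096 to 4096) for a batch of B images; it assumes column-major storage and pre-allocated device buffers with hypothetical names d_W, d_in, and d_out, and it omits the bias addition, activation, and dropout.

    #include <cublas_v2.h>

    // Forward pass of FC7 (4096 -> 4096) for a batch of B images as one SGEMM:
    // out (4096 x B) = W (4096 x 4096) * in (4096 x B), all column-major.
    void fc7_forward_cublas(cublasHandle_t handle, const float* d_W,
                            const float* d_in, float* d_out, int B) {
        const int n_out = 4096, n_in = 4096;
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n_out, B, n_in,          // m, n, k
                    &alpha,
                    d_W,   n_out,            // A = W,   lda = n_out
                    d_in,  n_in,             // B = in,  ldb = n_in
                    &beta,
                    d_out, n_out);           // C = out, ldc = n_out
    }

The BP pass maps onto two analogous GEMM calls (one for the weight gradients and one for the input gradients), which is why the FC layers compared in Fig. 7 and Fig. 8 translate so directly into cuBLAS.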

Fig. 7 and Fig. 8 present the comparison between cuDNN and cuBLAS for forward computation and back-propagation, respectively. We make the following observations based on the experimental results:

  • In general, the cuBLAS kernels achieve higher speedup (measured by execution time) than the cuDNN kernels. In particular, the speedup of cuBLAS over cuDNN is 1.69x for forward computation and 24.89x for BP. In comparison, the throughput of cuBLAS is 1.77x that of cuDNN for forward computation, but cuDNN achieves 1.57x the throughput of cuBLAS in the BP calculation.

  • The cuDNN and cuBLAS libraries have similar power consumption for forward computation (79.12 W and 78.73 W on average, respectively), while for BP, cuBLAS consumes significantly less power than cuDNN, with average power of 78.77 W versus 123.40 W. Accordingly, the energy consumption of cuBLAS is much lower than that of cuDNN, with average energy of 0.70 J versus 31.19 J.

  • Regarding performance density, we calculate Throughput/Power and Operations/Energy accordingly. The results demonstrate that cuBLAS substantially outperforms the cuDNN library on both performance density metrics.

Fig. 8: Backward Comparison between Different GPU Models (cuDNN vs cuBLAS).

IV-D Resource Usage and Running Frequency

Table III lists the resource utilization of the modules in the CNNLab accelerator. Of these NN layers, the convolutional layer consumes the most logic resources, as it requires the most computational power: it needs 73% of the hardware logic, 63% of the DSP blocks, and 56% of the RAM blocks. In comparison, the pooling layer takes only 17% of the logic resources and 11% of the RAM blocks. Regarding running frequency, the convolutional layer has the lowest frequency at 171.29 MHz, while the pooling layer achieves the highest at 304.50 MHz.

V Related Work

Neural network acceleration has been an emerging field during the past few years [5]. In this section, we summarize related acceleration engines based on cloud computing, GPUs, and FPGAs, respectively.

V-A Cloud-based Acceleration

Distributed computing platforms have been widely recognized as scalable and easy-to-deploy solutions [6]. Project Adam [7] describes the design and implementation of a distributed system composed of commodity server machines for training large-scale deep learning models. SINGA [8] is a distributed deep learning system for training big models over large datasets. DistBelief [9] is a software framework that can utilize computing clusters with large numbers of machines to train large models.

V-B GPU-based Accelerators

GPUs have been widely applied as acceleration engines for data-intensive applications. For example, Coates et al. [10] present a high-performance computing system with a cluster of GPU servers, using Infiniband interconnects and MPI. NGPU [11] brings neural acceleration to GPUs without hindering SIMT execution or adding excessive hardware overhead. Li et al. [12] propose an efficient GPU implementation of large-scale recurrent neural networks and demonstrate the power of scaling them up with GPUs. Vasilache et al. examine the performance profile of CNN training with fbfft on the current generation of GPUs [13]. Li et al. describe an efficient DBN implementation on the GPU, including the pre-training and fine-tuning processes [14]. More recently, GeePS [15] provides a scalable deep learning architecture on distributed GPUs with a GPU-specialized parameter server.

V-C FPGA and Hardware-based Accelerators

To overcome the power consumption issues of GPU and cloud-based frameworks, many developers seek solutions at the hardware level [16, 17, 18]. Among IC-based accelerators, DianNao [3] is one of the pioneering works solidifying neural networks in hardware circuits. Origami [19] presents a taped-out accelerator with silicon measurements of power, area, and I/O efficiency. Meanwhile, the FPGA is more flexible due to its reconfigurable logic, so it can accommodate changing applications and parameters in neural networks [20, 21]. For example, Zhang et al. [2] explore the bandwidth required for the parameters under the limitations of an FPGA chip. Suda et al. [22] present a design space exploration method based on the OpenCL programming model, which can explore the trade-offs among the parameters of the network topologies.

Besides ASIC and FPGA-based accelerators, there have been numerous directions using emerging hardware technologies, such as the Memristive Boltzmann Machine [23] and processing-in-memory techniques [24]. The Energy-efficient Inference Engine (EIE) uses compression by pruning redundant connections and having multiple connections share the same weight [25].

VI Conclusions and Future Work

FPGAs and GPUs have been demonstrated to be powerful and flexible platforms for data-intensive neural network processing in machine learning applications. In this paper, we have presented CNNLab, a middleware-supported GPU and FPGA-based framework for accelerating neural network computing models. It can offload tasks onto different accelerators under the guidance of the neural network model and its constraints. To evaluate the trade-offs between the GPU and FPGA-based platforms, we constructed a real prototype using an Intel-Altera DE5 FPGA board and an Nvidia K40 GPU platform. We measured the execution time, throughput, power consumption, energy cost, and performance density, respectively.

Experimental results show that the GPU has better speedup (100x) and throughput (100x) than the FPGA-based accelerator, while the FPGA is more power saving (50x) than the GPU. More importantly, in our case study, the energy consumption of the GPU and FPGA is similar for convolutional computation, while the GPU is more energy efficient for the FC layer calculation. Regarding performance density, both approaches achieve similar Throughput/Power in the convolutional layers (about 10 GFLOPS/W for the FPGA and 14 GFLOPS/W for the GPU), but the GPU has higher Operations/Energy than the FPGA-based accelerator, especially for the FC computation. Regarding the differences between GPU CUDA programming models, we also evaluate the metrics for the state-of-the-art cuDNN and cuBLAS libraries. The results show that cuBLAS is more energy efficient, with significantly higher speedup and lower power consumption.

Although the experimental results are encouraging, there are several promising future directions. First, the speedup of the accelerators can be further improved with compressed network models. Second, the hardware accelerators can be combined with large-scale data processing frameworks such as Spark or TensorFlow.

References

  • [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.
  • [2] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing fpga-based accelerator design for deep convolutional neural networks,” FPGA ’15, pp. 161–170, 2015.
  • [3] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” ASPLOS ’14, pp. 269–284, 2014.
  • [4] A. Yazdanbakhsh, J. Park, H. Sharma, P. Lotfi-Kamran, and H. Esmaeilzadeh, “Neural acceleration for gpu throughput processors,” MICRO’15, pp. 482–493, 2015.
  • [5] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” ICLR’16, 2015.
  • [6] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng, “Building high-level features using large scale unsupervised learning,” in ICML’12, 2012.
  • [7] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project adam: Building an efficient and scalable deep learning training system,” OSDI’14, pp. 571–582, 2014.
  • [8] B. C. Ooi, K.-L. Tan, S. Wang, W. Wang, Q. Cai, G. Chen, J. Gao, Z. Luo, A. K. Tung, Y. Wang, Z. Xie, M. Zhang, and K. Zheng, “Singa: A distributed deep learning platform,” MM ’15, pp. 685–688, 2015.
  • [9] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, “Large scale distributed deep networks,” in NIPS’12, pp. 1232–1240, 2012.
  • [10] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and N. Andrew, “Deep learning with cots hpc systems,” in ICML’13, vol. 28, pp. 1337–1345, May 2013.
  • [11] A. Yazdanbakhsh, J. Park, H. Sharma, P. Lotfi-Kamran, and H. Esmaeilzadeh, “Neural acceleration for gpu throughput processors,” MICRO’15, pp. 482–493, 2015.
  • [12] B. Li, E. Zhou, B. Huang, J. Duan, Y. Wang, N. Xu, J. Zhang, and H. Yang, “Large scale recurrent neural network on gpu,” in IJCNN’14, pp. 4062–4069, 2014.
  • [13] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, “Fast convolutional nets with fbfft: A gpu performance evaluation,” ICLR’15, 2015.
  • [14] T. Li, Y. Dou, J. Jiang, Y. Wang, and Q. Lv, “Optimized deep belief networks on cuda gpus,” in IJCNN’15, pp. 1–8, July 2015.
  • [15] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing, “Geeps: Scalable deep learning on distributed gpus with a gpu-specialized parameter server,” EuroSys ’16, pp. 4:1–4:16, 2016.
  • [16] S. Park, S. Choi, J. Lee, M. Kim, J. Park, and H. J. Yoo, “14.1 a 126.1mw real-time natural ui/ux processor with embedded deep-learning core for low-power smart glasses,” in ISSCC’16, pp. 254–255, Jan 2016.
  • [17] Y. H. Chen, T. Krishna, J. Emer, and V. Sze, “14.5 eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in ISSCC’16, pp. 262–263, Jan 2016.
  • [18] J. Sim, J. S. Park, M. Kim, D. Bae, Y. Choi, and L. S. Kim, “14.6 a 1.42tops/w deep convolutional neural network recognition processor for intelligent ioe systems,” in ISSCC’16, pp. 264–265, Jan 2016.
  • [19] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, “Origami: A convolutional network accelerator,” GLSVLSI ’15, pp. 199–204, 2015.
  • [20] G. Lacey, G. W. Taylor, and S. Areibi, “Deep learning on fpgas: Past, present, and future,” arXiv preprint arXiv:1602.04283, 2016.
  • [21] S. I. Venieris and C.-S. Bouganis, “fpgaconvnet: A framework for mapping convolutional neural networks on fpgas,” FCCM’16, pp. 40–47, 2016.
  • [22] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, “Throughput-optimized opencl-based fpga accelerator for large-scale convolutional neural networks,” FPGA, pp. 16–25, 2016.
  • [23] M. N. Bojnordi and E. Ipek, “Memristive boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning,” in HPCA’16, pp. 1–13, March 2016.
  • [24] Y. Kim, Y. Zhang, and P. Li, “A reconfigurable digital neuromorphic processor with memristive synaptic crossbar for cognitive computing,” J. Emerg. Technol. Comput. Syst., vol. 11, pp. 38:1–38:25, Apr. 2015.
  • [25] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: Efficient inference engine on compressed deep neural network,” ISCA’16, 2016.