Hardware-Driven Nonlinear Activation for Stochastic Computing Based Deep Convolutional Neural Networks

Hardware-Driven Nonlinear Activation for Stochastic Computing Based Deep Convolutional Neural Networks

Ji Li jli724@usc.edu Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA Zihao Yuan zihaoyua@usc.edu Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA Zhe Li zli89@syr.edu College of Engineering and Computer Science, Syracuse University, Syracuse, NY, USA Caiwen Ding cading@syr.edu College of Engineering and Computer Science, Syracuse University, Syracuse, NY, USA Ao Ren aren@syr.edu College of Engineering and Computer Science, Syracuse University, Syracuse, NY, USA Qinru Qiu qiqiu@syr.edu College of Engineering and Computer Science, Syracuse University, Syracuse, NY, USA Jeffrey Draper draper@isi.edu Yanzhi Wang ywang393@syr.edu College of Engineering and Computer Science, Syracuse University, Syracuse, NY, USA

Recently, Deep Convolutional Neural Networks (DCNNs) have made unprecedented progress, achieving the accuracy close to, or even better than human-level perception in various tasks. There is a timely need to map the latest software DCNNs to application-specific hardware, in order to achieve orders of magnitude improvement in performance, energy efficiency and compactness. Stochastic Computing (SC), as a low-cost alternative to the conventional binary computing paradigm, has the potential to enable massively parallel and highly scalable hardware implementation of DCNNs. One major challenge in SC based DCNNs is designing accurate nonlinear activation functions, which have a significant impact on the network-level accuracy but cannot be implemented accurately by existing SC computing blocks. In this paper, we design and optimize SC based neurons, and we propose highly accurate activation designs for the three most frequently used activation functions in software DCNNs, i.e, hyperbolic tangent, logistic, and rectified linear units. Experimental results on LeNet-5 using MNIST dataset demonstrate that compared with a binary ASIC hardware DCNN, the DCNN with the proposed SC neurons can achieve up to 61X, 151X, and 2X improvement in terms of area, power, and energy, respectively, at the cost of small precision degradation. In addition, the SC approach achieves up to 21X and 41X of the area, 41X and 72X of the power, and 198200X and 96443X of the energy, compared with CPU and GPU approaches, respectively, while the error is increased by less than 3.07%. ReLU activation is suggested for future SC based DCNNs considering its superior performance under a small bit stream length.

Deep Convolutional Neural Networks; Stochastic Computing; Deep Learning; Activation Function.


I Introduction

Deep learning has achieved unprecedented success in solving problems that have resisted the best attempts of the artificial intelligence community for many years [1, 2]. Artificial neural networks with deep learning can automatically extract a hierarchy of representations by composing nonlinear modules, which transform the representations at one level (starting from raw input, e.g., pixel values of an image) into representations at a more abstract and more invariant level [2]. Arbitrarily complex functions can be approximated in a sufficiently large neural network using nonlinear activation functions [3]. However, in practice the sizes of networks are finite, and the choice of nonlinearity affects both the learning dynamics and the network’s expressive power [4].

Recently, the Deep Convolutional Neural Network (DCNN) has achieved breakthroughs in many fundamental applications, such as image/video classification [5, 6] and visual/text recognition [7, 8]. DCNN is now recognized as the dominant method for almost all recognition and detection tasks and surpasses human performance on certain tasks [2]. Since the deep layered structures require a large amount of computation resources, from a practical standpoint, large-scale DCNNs are mainly implemented in high performance server clusters [9]. The huge power/energy consumptions of software DCNNs prevent their widespread deployment in wearable and Internet of Things (IoT) devices, which emerge with repercussions across the industry spectrum [10]. Hence, there is a timely need to map the latest DCNNs to application-specific hardware, in order to achieve orders of magnitude improvement in performance, energy efficiency and compactness.

Many existing works have explored hardware implementations of DCNNs using GPUs [11] and FPGAs [12, 13]. Nevertheless, more efficient design is required to resolve the conflict between the resource-hungry DCNNs and the resource-constrained IoT entities. Stochastic Computing (SC), which uses the probability of 1s in a random bit stream to represent a number, has the potential to enable massively parallel and ultra-low footprint hardware based DCNNs [14, 15, 16, 10, 17]. In SC, arithmetic operations like multiplication can be performed using simple logic elements and SC provides better soft error resiliency [18, 19, 15, 20]. In this regard, considerable efforts have been invested in the context of designing Artificial Neural Networks (ANNs), Deep Belief Network (DBNs) and DCNNs using SC components [10, 21, 22, 14, 16, 17, 23].

One key challenge is designing accurate nonlinear activation function. A small imprecision induced by the convolution and down sampling operations can be significantly amplified by the inaccurate nonlinear activation function, propagated to the subsequent neurons, and further amplified by the activation functions in the following neurons. Hence, without an accurate activation, the network accuracy can easily decrease to an unacceptable level. Besides, while the type of activation has a significant impact on the performance of DCNNs, only two basic Finite State Machine (FSM) based hyperbolic tangent (tanh) activations are designed for ANNs and DBNs in [15, 10], whilst other possible activations have hardly been explored, especially the Rectified Linear Units (ReLUs), which are essential for the state-of-the-art neural networks [24].

In this paper, we first propose three accurate SC-based neuron designs for DCNNs with the popular activation functions that are widely used in the software, i.e., tanh, logistic (or sigmoid), and ReLU. The SC configuration parameters are jointly optimized considering the down sampling operation, block size, and connection patterns in order to yield the maximum precision. Then we conduct a comprehensive comparison of the aforementioned neuron designs using different activations under different input sizes and stochastic bit stream lengths. After constructing the DCNNs using the proposed neurons, we further evaluate and compare the network performance of DCNNs. Experimental results demonstrate that proposed SC based DCNNs have much smaller area and power consumption than the binary ASIC DCNNs, with up to 61X, 151X, 2X improvement in terms of area, power, and energy, respectively, at the cost of accuracy degradation. Moreover, the SC approach achieves up to 21X and 41X of the area, 41X and 72X of the power, and 198200X and 96443X of the energy, compared with CPU and GPU approaches, respectively, while the test error is increased by less than 3.07%. ReLU activation is suggested for future SC based DCNNs considering its superior accuracy, area, and energy performance under a small bit stream length.

The remainder of the paper is organized as follows. Section II reviews related work. DCNN architecture and related SC components are given in Section III. Section IV presents the three activation function designs in a neuron cell using SC. Section V reports the experimental results, and this paper is concluded in Section VI.

Ii Related Work

Ii-a Activation Function Studies

The hyperbolic tangent has generally shown to provide more robust performance than logistic function in DBNs [25]. However, both suffer from the vanishing gradient problem, resulting in a slowed training process or convergence to a poor local minimum [25]. ReLUs do not have this problem, since an activated unit gives a constant gradient of 1. Recently, the parametric ReLU proposed by K. He et al. was reported to surpass human-level performance on the ImageNet large scale visual recognition challenge [24].

Ii-B Hardware-Based DCNN Studies

In order to exploit the parallelism and reduce the area, power, and energy, many existing hardware-based DCNNs have come into existence, including GP-GPUs based DCNNs [26] and FPGA-based DCNNs [13, 12]. GP-GPU was the most commonly used platform for accelerating DBNs around 2011 [26]. As for FPGA, M. Motamedi et al. developed an FPGA-based accelerator to meet performance and energy-efficiency constraints of DCNNs [13]. Recently, SC becomes a very attractive candidate for implementing ANNs and DBNs. Y. Ji et al. applied SC to a radial basis function ANN and significantly reduced the required hardware in [21], and hardware-oriented optimization for SC based DCNN was developed in [16]. K. Kim et al. proposed hardware-based DBN using SC components, in which a SC based neuron cell was designed and optimized [10]. A multiplier SC neuron and a structure optimization method were proposed in [14] for DCNN. A. Ren et al. developed a DCNN architecture with weight storage optimization and a novel max pooling design in the SC domain [17]. Besides, Z. Li et al. explored eight different neuron implementations in DCNNs using SC [23].

However, no existing works have explored the ReLU and logistic activations (popular activation choices in software) for hardware based DCNNs using SC.

Iii Overview of Hardware-Based DCNN

Iii-a General Architecture of DCNNs

Figure 1 shows a general DCNN architecture that is composed of a stack of convolutional layers, pooling layers, and fully connected layers.

Fig. 1: A general DCNN architecture.

In each convolutional layer, common patterns in local regions of inputs are extracted by convolving a filter over the inputs. The convolution result is stored in a feature map as a measure of how well the filter matches each portion of the inputs. After convolution, a subsampling step is performed by the pooling layer to aggregate statistics of these features, reduce the dimensions of data and mitigate over-fitting issues. A nonlinear activation function is applied to generate the output of the layer. The stack of convolutional and pooling layers is followed by fully connected layers, which further aggregate the local information learned in the convolutional and pooling layers for class discrimination.

By alternating the topologies of convolutional and pooling layers, powerful DCNNs can be built for specific applications, such as LeNet [27], AlexNet [11], and GoogLeNet [28]. With no loss of generality, we use LeNet-5 (i.e., the fifth generation of LeNet for digits recognition) in our discussion and experiments throughout the paper, and the proposed design methodology can accommodate other DCNNs as well.

Fig. 2: A hardware-based neuron cell in DCNN.

Iii-B Hardware-Based Neuron Cell

As the basic building block in DCNNs, a neuron performs three fundamental operations, i.e., convolution, pooling, and activation, as shown in Figure 2.

The convolution operation calculates the inner products of input () and weights (), and the pooling operation performs sub-sampling for the inner product results, i.e., generating one result out of several inner products. There are two conventional choices for pooling: max and average. The former chooses the largest element in each pooling region, whereas the latter calculates the arithmetic mean. An activation function is applied before the output, with tanh , logistic function , and ReLU being popular choices.

From a hardware design perspective, convolution and pooling can be implemented efficiently using basic building blocks (described in Section IV-A) to achieve accurate results. The nonlinear activation function, on the other hand, is usually implemented using Look-Up Tables (LUTs), requiring large memories to achieve adequate precision. Moreover, without careful design and optimization, the nonlinear activation function can result in serious accuracy degradation as it directly affects the final output accuracy of the neuron cell and the imprecision can be amplified by the subsequent calculations and activations. Hence, it is imperative to design and optimize the activation function to achieve sufficient accuracy.

Iv Proposed Hardware-Driven Nonlinear Activation for DCNNs

Iv-a Stochastic Computing for Neuron Design

Stochastic computing (SC) is an encoding scheme that represents a numeric value using a bit stream , in which the probability of ones is [15]. For instance, the bit stream “01000” contains a single one in a five-bit stream, thus it represents . In addition to this unipolar encoding format, another commonly used format is bipolar format, where a numeric value is processed by . Using bipolar format, the previous example could be represented by “10110”. Next, we present the fundamental operations of a neuron using SC components.

Convolution. Convolution performs multiplication and addition. An XNOR gate, as shown in Figure 3 (a), is used for multiplication with bipolar stochastic encoding as . Multiplexers (MUXes) can perform bipolar SC addition [15], which randomly selects one input as output. When input size is large, summation using MUXes incurs significant accuracy loss since only one input is selected at a time and all the other inputs are not used. To achieve a good trade-off between accuracy and hardware cost in terms of area, power, and energy, we adopt the Approximate Parallel Counter (APC) developed in [29] for addition instead of MUXes.

Pooling. In the SC domain, average pooling can be implemented efficiently with a simple MUX for stochastic inputs. However, as the outputs of APCs (i.e., inputs of pooling) are binary, MUX cannot be applied here. Instead, we use a binary adder to calculate the sum and remove the last 2 bits of the sum as a division by 4 operation, as illustrated in Figure 3 (b). Note that 4-to-1 average pooling is used in the LeNet-5, whereas max pooling is used in other DCNNs, such as AlexNet [11] and GoogLeNet [28]. The SC-based max pooling design is developed in [17].

Activation. Nonlinear activation is the key operation that enhances the representation capability of a neuron. There are many types of activation functions that have been explored in software DCNNs. Nevertheless, due to the design difficulty, only tanh is designed in the SC domain (e.g., an FSM-based tanh function proposed in [15], as shown in Figure 3 (c)), and there is a serious lack of studies on other activation functions for SC-based DCNNs. For the convenience of discussions, we use the naming conventions in Table I.

Fig. 3: Stochastic computing for neuron design: (a) XNOR gate for bipolar multiplication, (b) binary adder for average pooling, and (c) FSM-based tanh for stochastic inputs.

Iv-B Proposed Neuron Design and Nonlinear Activation

We found that directly applying SC to DCNNs leads to severe accuracy degradation which is not acceptable in common cases. The main reason is that the calculation of each computation block (convolution, pooling, and activation) is inaccurate in the SC domain. The overall accuracy of a neuron is affected by number of input streams, bit stream length, hardware configurations, and most importantly the activation functions. In this paper, we design and optimize the neuron structures as well as the activations by jointly considering all the aforementioned factors in order to reach the accuracy level that can be achieved by fixed point binary arithmetic.

the length of bipolar bit stream
-to- average pooling
input size: the number of input bit streams (or the number of
input and weight pairs) in each convolution block
the -th input bit of the -th convolution block
the -th weight bit of the -th convolution block
the set of all the input bits of the -th convolution block
the set of all the weight bits of the -th convolution block
the sum calculated by the -th convolution block,
which is a -bit binary number
the number of states in the state machine
the -th stochastic output bit for the SC-based hyperbolic tangent neuron
TABLE I: Naming Conventions in a Stochastic Computing Based Neuron

To be more specific, we adopt bipolar encoding scheme and average pooling. The proposed neuron designs use XNOR gates and APCs for addition and multiplication (as convolution operation), respectively. Average pooling is implemented using a binary adder (as shown in Figure 3 (b)). The nonlinear activations are designed for binary inputs since the output of APC and the pooling result are binary. The output of the proposed neuron, on the other hand, must be a stochastic bit stream, which will be fed into the neurons in the subsequent layers. The proposed neurons using hyperbolic tangent, stochastic logistic and ReLU activations are as follows:

SC-tanh: SC-Based Hyperbolic Tangent Neuron. The authors in [10] adopted a saturated up/down counter to implement a binary hyperbolic tangent activation function. Nevertheless, this activation function is designed for DBNs and cannot be applied directly for DCNNs. Algorithm 1 presents the proposed SC-based hyperbolic tangent neuron design SC-tanh() for DCNNs, where step 5-9 correspond to the inner product calculation using XNOR gates and APCs, step 11 is the average pooling with a binary adder, and step 12-18 are the tanh activation part, which is implemented with a saturated counter. The connections between two layers are based on the filters that execute convolution operations only for portions of the inputs (i.e., receptive fields), which greatly reduce the connections between consecutive layers. The counted binary number generated by APC is taken by the saturated up/down counter which represents the amount of increase and decrease. By setting half of the states’ output to 1 and the rest to 0 (i.e., setting boundary state to ), this neuron design accurately imitates the hyperbolic tangent function.

input :  are the input bit streams
input :  indicates the -to- average pooling
input :  is internal FSM state number
output :  is the -th stochastic output bit for the SC-tanh neuron
  /* max state */
  /* boundary state */
  /* current state */
1 for  to  do
          /* processing each convolution block */
2          for  to  do
                   /* inner product calculation */
3                   for  to  do
                             /* XNOR multiplication */
5                   end for
                    /* APC addition */
7          end for
         /* average pooling and tanh activation */
9          if   then /* saturated counter */
10                   0;
12          else if   then
14          if   then /* output logic */
16         else
18          end if
20 end for
Algorithm 1 Proposed SC-tanh ()

SC-logistic: Proposed SC-Based Logistic Neuron. There are two important differences between the hyperbolic tangent neuron and the logistic neuron: (i) unlike the above-mentioned hyperbolic tangent neuron, the output of which is in the range of [-1,1], the logistic neuron always outputs non-negative number (i.e., within [0,1]), and (ii) the -value of the logistic’s midpoint is instead of in hyperbolic tangent.

In order to tackle the -st difference, we introduce a history shift register array using shift registers to record the last bits of the output stochastic bit streams. A shadow counter is used to calculate the sum of the stochastic bits in the history shift register array, which is denoted by , i.e., . Hence, the proposed SC-logistic keeps tracking the last output bits and predicts the sign of the current value based on the sum calculated by the shadow counter. To be more specific, if the sum is less than half of the maximum achievable sum , the current value is predicted to be negative. Otherwise, it is predicted as positive (note that value is half 1s and half 0s in bipolar format). As any negative output raises an error, the proposed activation mitigates such errors by outputting 1 (as compensation) whenever the predicted current value is negative.

As for the -nd difference, we need to move the -value of the logistic’s midpoint to . The probability of 1s in a stochastic bit stream that represents the value is . Hence, we design the logic of activation such that of the states output s, whereas the other portion output s. This is realized by setting the boundary state to , where represents the internal FSM state number (in the saturated counter). Note that the boundary value in hyperbolic tangent is since the midpoint of -value of tanh is and is represented by a stochastic stream where half of the bits is and the other half is (i.e., ). Algorithm 2 provides the pseudo code of the proposed SC-logistic neuron.

input :  are the input bit streams
input :  is the number of registers in the temporary array
input :  is the configuration bit. 1:SC-logistic, 0:SC-ReLU
input :  indicates the -to- average pooling
input :  is internal FSM state number
output :  is the -th stochastic output bit for the SC-logistic/ReLU neuron
  /* max state */
1 if  then /* boundary state configuration */
           /* SC-logistic */
           /* SC-ReLU */
5 end if
  /* current state */
  /* is an iterator */
  /* initialize history array */
  /* initialize shadow counter */
6 for  to  do
7          if  then
                    /* negative value compensation */
9         else
10                   for  to  do
11                            for  to  do
                                      /* XNOR multiplication */
13                            end for
                             /* APC addition */
15                   end for
17                   if   then /* saturated counter */
19                   else if S  then
21                   end if
22                  if   then /* output logic */
24                  else
26                   end if
28          end if
29         while  do
                    /* update the history array */
31          end while
           /* update the shadow counter */
34 end for
Algorithm 2 Proposed SC-logistic/ReLU ()

SC-ReLU: Proposed SC-Based ReLU Neuron. According to [2], the ReLU activation becomes the most popular activation function in 2015. We apply the same design concept as in the SC-tanh neuron with the following modifications: First, we introduce a history shift register array and its shadow counter to predict the sign of the current value, and compensate the negative value in the same way SC-logistic neuron does. Second, unlike the SC-logistic neuron that changes boundary state to such that the entire waveform is moved up to , the SC-ReLU is centered at (0,0) so the boundary state needs to be kept as . Noticing the similarities between the SC-logistic and SC-ReLU, we introduce a configuration bit and combine these two neurons, and Algorithm 2 provides the pseudo code of the proposed SC-logistic and SC-ReLU neuron.

Fig. 4: The result comparison between the proposed SC neuron (bit stream ) and the corresponding original software neuron: (a) SC-tanh vs Tanh, (b) SC-logistic vs Logistic, and (c) SC-ReLU vs ReLU.

Note that in the proposed neurons, the FSM state number is generated using a simple binary search algorithm under different input sizes to yield the highest precision. Unlike the software implementation that has limited allowable degree of parallelism and high coordination overheads, the proposed specific hardware based neurons can execute the commands in Algorithm 1 and 2 fully parallelly. Figure 4 (a), (b), and (c) show that the proposed SC-tanh, SC-logistic, and SC-ReLU neurons and their corresponding software results are almost identical to each other. Detailed error analysis is provided in Section V.

V Experimental Results

We now present (i) performance evaluation of the proposed SC neurons, (ii) comparison with binary ASIC neurons, and (iii) DCNN performance evaluation and comparison. The neurons and DCNNs are synthesized in Synopsys Design Compiler with the 45nm Nangate Library [30] using Verilog.

V-a Performance Evaluation and Comparison among the Proposed Neuron Designs

For each neuron, the accuracy is dependent on (i) the bit stream length , and (ii) the input size . Longer bit stream length yields higher accuracy, and the precision can be leveraged by adjusting the bit stream length without hardware modification. On the other hand, the input size , which is determined by the DCNN topology, affects both the accuracy of the neuron as well as the hardware footprint.

Thus, considering the aforementioned factors, we evaluate the inaccuracy of the proposed SC-tanh, SC-logistic and SC-ReLU neurons under a wide range of bit stream lengths and input sizes, as shown in Figure 5 (a), (b) and (c), respectively. The corresponding hardware costs of the proposed SC-tanh, SC-logistic and SC-ReLU neurons are shown in Figure 6 (a), (b) and (c), respectively. One can observe that the precisions of SC-logistic and SC-ReLU neurons consistently outperform SC-tanh neurons under different combinations of input size and bit stream length, whereas the area, power and energy of SC-tanh neurons are slightly lower than the other two neurons. Moreover, SC-logistic and SC-ReLU neurons have better scalability in terms of accuracy than the SC-tanh neuron. Note that the inaccuracy here is calculated by comparing with the software neuron results. A lower inaccuracy only indicates that the hardware neuron with this type of activation is closer to its software version, but this does not indicate that a DCNN implemented with this neuron can yield a lower test error.

Fig. 5: Input size versus absolute inaccuracy under different bit stream lengths for (a) SC-tanh neuron, (b) SC-logistic neuron, and (c) SC-ReLU neuron.
Fig. 6: Input size versus (a) area, (b) total power, and (c) total energy for the neuron designs using tanh, logistic and ReLU activation functions.
8 bit binary ASIC proposed SC neuron improvement
tanh logistic ReLU tanh logistic ReLU tanh logistic ReLU
power 34760 34760 32176 173 223 231 201X 156X 139X
energy 128267 128267 94346 858 1130 1147 149X 114X 82X
area 42319 42319 35506 699 916 916 61X 46X 39X
TABLE II: Neuron Cell Performance Comparison with 8 Bit Fixed Point Binary Implementation when and

V-B Comparison with Binary ASIC Neurons

We further compare the proposed SC-based neurons with the binary ASIC hardware neurons. The input size is set to 25 since most neurons in LeNet-5 DCNN are connected to receptive fields. The binary nonlinear activations logistic and ReLU are implemented using LUTs, whereas binary ReLU is built with a comparator and a MUX. Clearly, the number of bits in fixed-point numbers affects both the hardware cost and the accuracy. To make the comparison fair, we use the minimum fixed point (8 bit) that yields a DCNN network accuracy that is almost identical to the software DCNN (with difference in network test error). Table II shows the neuron cell performance comparison between the proposed SC neurons and the 8 bit binary ASIC neurons. Compared with binary ASIC neurons, the proposed SC neurons achieve up to 201X, 149X, and 61X improvement in terms of power, energy, and area, respectively, indicating significant hardware savings.

activation approach bit stream valid. error test error area (mm2) power () energy ()
tanh SC 1024 1.74% 1.41% 12.5 3.1 15.8
512 1.57% 1.65% 7.9
256 1.71% 1.61% 3.9
128 1.84% 2.13% 2.0
64 2.37% 2.34% 1.0
binary - 1.42% 1.34% 769.3 470.0 2.0
CPU - 1.41% 1.34% 263 130.0 198200
GPU - 1.41% 1.34% 520 225.0 96443
logistic SC 1024 3.98% 4.49% 15.8 3.9 20.1
512 4.35% 4.70% 10.0
256 4.24% 4.34% 5.0
128 5.23% 5.58% 2.5
64 6.30% 6.06% 1.3
binary - 2.88% 3.01% 769.3 585.7 2.4
CPU - 2.87% 2.99% 263 130.0 198200
GPU - 2.87% 2.99% 520 225.0 96443
ReLU SC 1024 1.69% 1.69% 15.8 3.9 20.3
512 1.67% 1.69% 10.1
256 1.67% 1.63% 5.1
128 1.65% 1.67% 2.5
64 1.67% 1.63% 1.3
binary - 1.65% 1.65% 664.9 557.5 1.8
CPU - 1.64% 1.64% 263 130.0 198200
GPU - 1.64% 1.64% 520 225.0 96443
TABLE III: Comparison among Software DCNN, Binary ASIC DCNN, and Various SC Based DCNN Designs Implementing LeNet 5

V-C DCNN Performance Evaluation and Comparison

To evaluate the network performance, we construct the LeNet-5 DCNNs using the proposed SC neurons as well as the 8 bit binary neurons in a pipelined manner. LeNet 5 is a widely-used DCNN structure with a configuration of 784-11520-2880-3200-800-500-10. The DCNNs are evaluated with the MNIST handwritten digit image dataset [31], which consists of 60,000 training data and 10,000 testing data. We apply the same training time in software so as to make a fair comparison among different activations.

Table III concludes the performance of DCNNs using CPU, GPU, binary neurons and the proposed SC neurons. The CPU approach uses two Intel Xeon W5580, whereas the GPU approach utilizes NVIDIA Tesla C2075. Note that the power for software is estimated using Thermal Design Power (TDP), and the energy is calculated by multiplying the run time and TDP. On the other hand, the power, energy, and area for hardware are calculated using the synthesized netlists with Synopsys Design Compiler. One can observe that for each type of activation, the proposed SC based DCNNs have much smaller area and power consumption than the corresponding binary DCNNs, with up to 61X, 151X, 2X improvement in terms of area, power, and energy, respectively. Note that though the the binary ASIC has a competitive energy performance, it is an ideal pipelined structure. The extremely large area () and power () makes binary ASIC unpractical for implementation. Moreover, Table III shows that the proposed SC approach achieves up to 21X and 41X of the area, 41X and 72X of the power, and 198200X and 96443X of the energy, compared with CPU and GPU approaches, respectively, while the error is increased by less than 3.07%.

Among different activations, with a long bit stream (), SC-tanh is the most accurate. Otherwise (), SC-ReLU has the highest precision. SC-logistic has the lowest precision due to the following reasons: (i) logistic activation in software has the worst accuracy performance, and (ii) the imprecision of activation (if larger than a certain threshold) amplifies the inaccuracy in DCNNs. The bit stream length can be reduced to improve energy performance. One important observation is that the proposed SC-ReLU has better scalability than SC-tanh (i.e., with bit stream length decreasing, the accuracy degradation of SC-ReLU is slower than SC-tanh). Hence, ReLU activation is suggested for future SC based DCNNs considering its superior accuracy, area, and energy performance under a small bit stream length (e.g., ). Note that the small bit stream length leads to significant improvement in terms of delay and energy performance.

Vi Conclusion

In this paper we presented three novel SC neurons designs using tanh, logistic, and ReLU nonlinear activations. LeNet-5 DCNNs were constructed using the proposed neurons. Experimental results on the MNIST dataset demonstrated that compared to the binary ASIC DCNN, the proposed SC based DCNNs were able to significantly reduce the area, power and energy footprint with a small accuracy degradation. ReLU was suggested for future SC based DCNNs implementations.


  • [1] Y. Bengio, “Learning deep architectures for ai,” Foundations and trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
  • [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [3] Y. Cho and L. K. Saul, “Large-margin classification in infinite neural networks,” Neural computation, vol. 22, no. 10, pp. 2678–2697, 2010.
  • [4] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, “Learning activation functions to improve deep neural networks,” arXiv preprint arXiv:1412.6830, 2014.
  • [5] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [6] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
  • [7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition.” in ICML, 2014, pp. 647–655.
  • [8] Y. Li, Z. Li, and Q. Qiu, “Assisting fuzzy offline handwriting recognition using recurrent belief propagation,” in 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Dec 2016, pp. 1–8.
  • [9] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu, and W. J. Gross, “Vlsi implementation of deep neural network using integral stochastic computing,” arXiv preprint arXiv:1509.08972, 2015.
  • [10] K. Kim, J. Kim, J. Yu, J. Seo, J. Lee, and K. Choi, “Dynamic energy-accuracy trade-off using stochastic computing in deep neural networks,” in Proceedings of the 53rd Annual Design Automation Conference.      ACM, 2016, p. 124.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [12] A. Rahman, J. Lee, and K. Choi, “Efficient fpga acceleration of convolutional neural networks using logical-3d compute array,” in 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE).      IEEE, 2016, pp. 1393–1398.
  • [13] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, “Design space exploration of fpga-based deep convolutional neural networks,” in 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC).      IEEE, 2016, pp. 575–580.
  • [14] J. Li, A. Ren, Z. Li, C. Ding, B. Yuan, Q. Qiu, and Y. Wang, “Towards acceleration of deep convolutional neural networks using stochastic computing,” in The 22nd Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2017.
  • [15] B. D. Brown and H. C. Card, “Stochastic neural computation. i. computational elements,” IEEE Transactions on computers, vol. 50, no. 9, pp. 891–905, 2001.
  • [16] Z. Li, A. Ren, J. Li, Q. Qiu, Y. Wang, and B. Yuan, “Dscnn: Hardware-oriented optimization for stochastic computing based deep convolutional neural networks,” in Computer Design (ICCD), 2016 IEEE 34th International Conference on.      IEEE, 2016, pp. 678–681.
  • [17] A. Ren, J. Li, Z. Li, C. Ding, X. Qian, Q. Qiu, B. Yuan, and Y. Wang, “Sc-dcnn: highly-scalable deep convolutional neural network using stochastic computing,” arXiv preprint arXiv:1611.05939.
  • [18] J. Li and J. Draper, “Joint soft-error-rate (ser) estimation for combinational logic and sequential elements,” in VLSI (ISVLSI), 2016 IEEE Computer Society Annual Symposium on.      IEEE, 2016, pp. 737–742.
  • [19] A. Alaghi and J. P. Hayes, “Survey of stochastic computing,” ACM Transactions on Embedded computing systems (TECS), vol. 12, no. 2s.
  • [20] J. Li and J. Draper, “Accelerating soft-error-rate (ser) estimation in the presence of single event transients,” in Proceedings of the 53rd Annual Design Automation Conference.      ACM, 2016, p. 55.
  • [21] Y. Ji, F. Ran, C. Ma, and D. J. Lilja, “A hardware implementation of a radial basis function neural network using stochastic logic,” in Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition.      EDA Consortium, 2015, pp. 880–883.
  • [22] K. Sanni, G. Garreau, J. L. Molin, and A. G. Andreou, “Fpga implementation of a deep belief network architecture for character recognition using stochastic computation,” in Information Sciences and Systems (CISS), 2015 49th Annual Conference on.      IEEE, 2015, pp. 1–5.
  • [23] Z. Li, A. Ren, J. Li, Q. Qiu, B. Yuan, J. Draper, and Y. Wang, “Structural design optimization for deep convolutional neural networks using stochastic computing,” 2017.
  • [24] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
  • [25] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30.
  • [26] D. C. Ciresan, U. Meier, J. Masci, L. Maria Gambardella, and J. Schmidhuber, “Flexible, high performance convolutional neural networks for image classification,” in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, no. 1, 2011, p. 1237.
  • [27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
  • [29] K. , J. Lee, and K. Choi, “Approximate de-randomizer for stochastic circuits,” Proc. ISOCC, 2015.
  • [30] Nangate 45nm Open Library, Nangate Inc., 2009. [Online]. Available: http://www.nangate.com/
  • [31] L. Deng, “The mnist database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, vol. 29, no. 6.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description