Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks
Abstract
Recently, deep neural networks have received considerable attention due to their ability to extract and represent high-level abstractions in data sets. Deep neural networks such as fully-connected and convolutional neural networks have shown excellent performance on a wide range of recognition and classification tasks. However, their hardware implementations currently suffer from large silicon area and high power consumption due to their high degree of complexity. The power/energy consumption of neural networks is dominated by memory accesses, the majority of which occur in fully-connected networks. In fact, they contain most of the deep neural network parameters. In this paper, we propose sparsely-connected networks, showing that the number of connections in fully-connected networks can be reduced by up to 90% while improving the accuracy on three popular datasets (MNIST, CIFAR-10 and SVHN). We then propose an efficient hardware architecture based on linear-feedback shift registers to reduce the memory requirements of the proposed sparsely-connected networks. The proposed architecture can save up to 90% of memory compared to conventional implementations of fully-connected neural networks. Moreover, implementation results show up to 84% reduction in the energy consumption of a single neuron of the proposed sparsely-connected networks compared to a single neuron of fully-connected neural networks.
Arash Ardakani, Carlo Condo and Warren J. Gross 

Department of Electrical and Computer Engineering 
McGill University, Montréal, Québec, Canada 
Email: arash.ardakani@mail.mcgill.ca, carlo.condo@mail.mcgill.ca, warren.gross@mcgill.ca 
1 Introduction
Deep neural networks (DNNs) have shown remarkable performance in extracting and representing high-level abstractions in complex data (Lecun et al. (2015)). DNNs rely on multiple layers of interconnected neurons and parameters to solve complex tasks, such as image recognition and classification (Krizhevsky et al. (2012)). While they have proven very effective in said tasks, their hardware implementations still suffer from high memory and power consumption, due to the complexity and size of their models. Therefore, research efforts have been directed towards more efficient implementations of DNNs (Han et al. (2016)). In the past few years, the parallel nature of DNNs has led to the use of graphics processing units (GPUs) to execute neural network tasks (Han et al. (2015)). However, their large latency and power consumption have pushed researchers towards application-specific integrated circuits (ASICs) for hardware implementations (Cavigelli et al. (2015)). For instance, in (Han et al. (2016)), it was shown that a DNN implemented with customized hardware can accelerate the classification task by 189× and 13×, while saving 24,000× and 3,400× energy, compared to a CPU (Intel i7-5930k) and a GPU (GeForce TITAN X), respectively.
Convolutional layers in DNNs are used to extract high-level abstractions and features of data. In such layers, the connectivity between neurons follows a pattern inspired by the organization of the animal visual cortex. It was shown that the computation in the visual cortex can mathematically be described by a convolution operation (LeCun et al. (1989)). Therefore, each neuron is only connected to a few neurons according to a pattern, and a set of weights is shared among all neurons. In contrast, in a fully-connected layer, each neuron is connected to every neuron in the previous and next layers, and each connection is associated with a weight. These layers are usually used to learn nonlinear combinations of given data. Fig. 1 shows a two-layer fully-connected network. The main computation kernel performs numerous vector-matrix multiplications followed by nonlinear functions in each layer. In (Courbariaux & Bengio (2016); Horowitz (2014); Han et al. (2016)), it was shown that the power/energy consumption of DNNs is dominated by memory accesses. Fully-connected layers, which are widely used in recurrent neural networks (RNNs) and adopted in many state-of-the-art neural network architectures (Krizhevsky et al. (2012); Simonyan & Zisserman (2014); Zeiler & Fergus (2013); Szegedy et al. (2015); Lecun et al. (1998)), independently or as a part of convolutional neural networks, contain most of the weights of a DNN. For instance, the first fully-connected layer of VGGNet (Simonyan & Zisserman (2014)), which is composed of 13 convolutional layers and three fully-connected layers, contains 100M weights out of a total of 140M. Such large storage requirements in fully-connected networks result in copious power/energy consumption.
To overcome the aforementioned issue, a pruning technique was first introduced in (Han et al. (2015)) to reduce the memory required by DNN architectures for mobile applications. However, it makes use of an additional training stage, while the indices identifying the pruned connections still need to be stored in memory. More recently, several works have focused on the binarization and ternarization of the weights of DNNs (Courbariaux & Bengio (2016); Courbariaux et al. (2015); Lin et al. (2015); Kim & Smaragdis (2016)). While these approaches reduce the weight precision and thus the memory width, the number of weights remains unchanged.
In (Shafiee et al. (2016b)), an alternative deep network connectivity named StochasticNet, inspired by the brain's synaptic connections between neurons, was explored on low-power CPUs. StochasticNet is formed by randomly removing up to 61% of the connections in both fully-connected and convolutional layers of DNNs, speeding up the classification task.
In (Wen et al. (2016)), a method named structured sparsity learning (SSL) was introduced to regularize the convolutional layers’ structures of DNNs. SSL can learn a structured sparsity of DNNs to efficiently speed up the convolutional computations both on CPU and GPU platforms.
In this paper, we propose sparsely-connected networks obtained by randomly removing some of the connections in fully-connected networks. Random connection masks are generated by linear-feedback shift registers (LFSRs), which are also used in the VLSI implementation to disable the connections. Experimental results on three commonly used datasets show that the proposed networks can improve accuracy while removing up to 90% of the connections. Additionally, we apply the proposed algorithm on top of the binarizing/ternarizing technique, achieving a better misclassification rate than the best binarized/ternarized networks reported in the literature. Finally, an efficient very large scale integration (VLSI) hardware architecture of a DNN based on sparsely-connected networks is proposed, which saves up to 90% memory and 84% energy with respect to traditional architectures.
The rest of the paper is organized as follows. Section 2 briefly introduces DNNs and their hardware implementation challenges, while Section 3 describes the proposed sparsely-connected networks and their training algorithm. In Section 4 the experimental results on three datasets are presented and compared to the state of the art. Section 5 presents the proposed VLSI architecture for the sparsely-connected network, and conclusions are drawn in Section 6.
2 Preliminaries
2.1 Deep Neural Networks
DNNs are constructed using multiple layers of neurons between the input and output layers, usually referred to as hidden layers. They are used in many current image and speech applications to perform complex tasks such as recognition or classification. DNNs are trained through an initial phase, called the learning stage, that uses data to prepare the DNN for the task that follows in the inference stage. Two subcategories of DNNs that are widely used in detection and recognition tasks are convolutional neural networks (CNNs) and RNNs (Han et al. (2016)). Due to parameter reuse in convolutional layers, such layers are well studied and can be efficiently implemented on customized hardware platforms (Chen et al. (2016); Shafiee et al. (2016a); Chen et al. (2016)). On the other hand, fully-connected layers, which are widely used in RNNs such as long short-term memory networks and as a part of CNNs, require a large number of parameters to be stored in memory.
DNNs are mostly trained by the backpropagation algorithm in conjunction with the stochastic gradient descent (SGD) optimization method (Rumelhart et al. (1986)). This algorithm computes the gradient of a cost function with respect to all the weights in all the layers. A common choice for the cost function is the modified hinge loss introduced in (Tang (2013)). The obtained errors are then propagated backwards through the layers to update the weights in an attempt to minimize the cost function. Instead of using the whole dataset to update the parameters, the data are first divided into mini-batches and the parameters are updated after each mini-batch, which speeds up the convergence of the training algorithm. The weight updating speed is controlled by the learning rate. Batch normalization is also commonly used to regularize each mini-batch of data (Ioffe & Szegedy (2015)): it speeds up the training process by allowing the use of a larger learning rate.
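As a point of reference, the following Python sketch (not the authors' code; the squared-hinge formulation, batch size, and learning rate are illustrative assumptions, and batch normalization is omitted) shows a modified hinge loss and a plain mini-batch SGD epoch:

```python
import numpy as np

def squared_hinge_loss(scores, labels, margin=1.0):
    """Multiclass squared hinge loss in the spirit of Tang (2013);
    scores: (batch, classes), labels: integer class indices."""
    correct = scores[np.arange(len(labels)), labels][:, None]
    margins = np.maximum(0.0, scores - correct + margin)
    margins[np.arange(len(labels)), labels] = 0.0
    return np.mean(np.sum(margins ** 2, axis=1))

def sgd_epoch(params, grad_fn, data, targets, lr=0.01, batch_size=100):
    """One epoch of plain mini-batch SGD: the parameters are updated once per
    mini-batch rather than once per full pass over the dataset."""
    idx = np.random.permutation(len(data))
    for start in range(0, len(data), batch_size):
        batch = idx[start:start + batch_size]
        grads = grad_fn(params, data[batch], targets[batch])
        for name in params:
            params[name] -= lr * grads[name]
    return params
```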
2.2 Towards hardware implementation of DNNs
DNNs have shown excellent performance in applications such as computer vision and speech recognition. Since the ability of a DNN to perform complex tasks grows with the number of neurons, high-performance DNNs are extremely complex in hardware. AlexNet (Krizhevsky et al. (2012)) and VGGNet (Simonyan & Zisserman (2014)) are two models comprising convolutional layers followed by fully-connected layers, which are widely used in classification algorithms. Despite their very good classification performance, they require large amounts of memory to store their numerous parameters. Most of these parameters (more than 96%) lie in the fully-connected layers. In (Han et al. (2016)), it was shown that the total energy consumption of DNNs is dominated by the required memory accesses. Therefore, the majority of the power in a DNN is dissipated in its fully-connected layers. Moreover, the huge memory requirements make it possible only for very small DNNs to fit in on-chip RAMs on ASIC/FPGA platforms.
Recently, many works have tried to reduce the computational complexity of DNNs. In (Akopyan et al. (2015)), a spiking neural network based on stochastic computing (Smithson et al. (2016)) was introduced, where 1-bit calculations are performed throughout the whole architecture. In (Ardakani et al. (2015)), integral stochastic computing was used to reduce the computation latency, showing that stochastic computing can consume less energy than conventional binary-radix implementations. However, neither work reduces the DNN memory requirements.
Network pruning, compression and weight sharing have been proposed in (Han et al. (2016)), together with weight matrix sparsification and compression. However, additional indices denoting the pruned connections need to be stored along with the compressed weight matrices. In (Han et al. (2015)), it was shown that the number of indices is almost the same as the number of non-zero elements of the weight matrices, thus increasing the word length of the required memories. Moreover, the encoding and compression techniques require inverse computations to obtain the decoded and decompressed weights, introducing additional complexity compared to conventional computational architectures. Other pruning techniques presented in the literature, such as (Anwar et al. (2015)), try to reduce the memory required to store the pruned locations by introducing a structured sparsity in DNNs. However, the resulting network incurs a noticeably higher misclassification rate on the CIFAR-10 dataset.
3 Sparsely-Connected Neural Networks
Considering a fully-connected neural network layer with $m$ input and $n$ output nodes, the forward computations are performed as
$$y = f(Wx + b), \qquad (1)$$
where $W$ represents the weights and $b$ the biases, while $f$ is the nonlinear activation function, for which ReLU is used in most cases (Nair & Hinton (2010)). The network's inputs and outputs are denoted by $x$ and $y$, respectively.
Let us introduce the sparse weight matrix $W^s$ as the element-wise multiplication
$$W^s = W \circ \mathrm{Mask}, \qquad (2)$$
where $W^s$ is sparser than $W$. The binary matrix $\mathrm{Mask}$ is defined such that each of its elements $\mathrm{Mask}_{ij} \in \{0, 1\}$. Note that the dimensions of $\mathrm{Mask}$ are the same as those of the weight matrix $W$. Similarly to the fully-connected network in (1), the forward computation of the sparsely-connected network can be expressed as
$$y = f(W^s x + b). \qquad (3)$$
We propose the use of LFSRs to form each column of $\mathrm{Mask}$, similar to the approach used in stochastic computing to generate a binary stream (Gaines (1969)). In general, a $b$-bit LFSR serially generates $2^b - 1$ distinct values. A random binary stream with expected value $p$ can be obtained by comparing the LFSR output with a constant threshold $C$; this unit is hereafter referred to as a stochastic number generator (SNG). A random binary stream element is therefore 1 when the LFSR output is smaller than $C$, and 0 otherwise. Fig. 2 shows the formation of a small sparsely-connected network using binary streams generated by LFSR units; it includes a 3-bit LFSR unit with its 7 different values and an example random binary stream. A set of LFSRs with different seed values, one per column of $\mathrm{Mask}$, is required to form $\mathrm{Mask}$. By tuning the value of $C$ it is possible to change the sparsity degree of $\mathrm{Mask}$, and thus of the sparsely-connected network. Fig. 2 also shows the fully-connected network based on $W$ and the sparsely-connected version based on $W^s$.
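As an illustration of this mask-generation scheme, the following Python/NumPy sketch (not the authors' implementation; the 3-bit LFSR taps, the seeds, and the threshold value are assumptions chosen for the example) builds Mask column by column with LFSR-based SNGs and then evaluates the sparsely-connected forward pass of (3):

```python
import numpy as np

def lfsr_sequence(seed, nbits=3, taps=(3, 2), length=None):
    """Serially generate values from a maximal-length Fibonacci LFSR
    (a b-bit LFSR cycles through 2**b - 1 distinct non-zero states)."""
    state, out = seed, []
    length = length if length is not None else (2 ** nbits - 1)
    for _ in range(length):
        out.append(state)
        fb = 0
        for t in taps:                     # XOR of the tapped bits
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & ((1 << nbits) - 1)
    return np.array(out)

def sng_column(seed, rows, threshold, nbits=3):
    """Stochastic number generator (SNG): compare LFSR outputs with the
    constant threshold C to obtain a binary stream (1 when output < C)."""
    return (lfsr_sequence(seed, nbits=nbits, length=rows) < threshold).astype(np.float32)

def make_mask(rows, cols, threshold, nbits=3):
    """One LFSR with a distinct non-zero seed per column of Mask."""
    seeds = (np.arange(cols) % (2 ** nbits - 1)) + 1
    return np.stack([sng_column(s, rows, threshold, nbits) for s in seeds], axis=1)

# Sparsely-connected forward pass, Eq. (3): y = f(W_s x + b), with W_s = W o Mask.
rng = np.random.default_rng(0)
m, n = 16, 8                                   # layer inputs and outputs (example sizes)
W = rng.standard_normal((n, m)).astype(np.float32)
b = np.zeros(n, dtype=np.float32)
Mask = make_mask(n, m, threshold=2, nbits=3)   # threshold C tunes the sparsity degree
W_s = W * Mask                                 # Eq. (2)
x = rng.standard_normal(m).astype(np.float32)
y = np.maximum(W_s @ x + b, 0.0)               # ReLU activation
```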
Algorithm 1 summarizes the training algorithm for the proposed sparsely-connected network. The algorithm itself is very similar to the one used for a fully-connected network, but considers each network layer to have a mask that disables some of the connections. The forward propagation (lines 1-5) follows (3), while the derivatives in the backward computations (lines 6-16) are computed with respect to $W^s$. It is worth mentioning that most CNNs use fully-connected layers, and the proposed training algorithm can still be used for those layers in CNNs.
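To make the masked update explicit, here is a minimal single-layer Python sketch of one training step (a simplification, not Algorithm 1 itself: it uses a squared-error cost instead of the hinge loss and omits batch normalization); the key point is that gradients are taken with respect to the masked weights and the mask is applied again at update time, so disabled connections are never updated:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sparse_train_step(W, b, Mask, x, target, lr=0.01):
    """One SGD step for a single sparsely-connected layer.
    W, b: dense parameters being trained; Mask: fixed binary connection mask."""
    # Forward pass, Eqs. (2)-(3): disabled connections contribute nothing.
    W_s = W * Mask
    z = W_s @ x + b
    y = relu(z)

    # Backward pass (squared-error cost used only to keep the sketch short).
    dy = y - target
    dz = dy * (z > 0)             # ReLU derivative
    dW_s = np.outer(dz, x)        # gradient w.r.t. the sparse weights W_s
    db = dz

    # Masking the gradient guarantees that pruned connections stay at zero.
    W -= lr * (dW_s * Mask)
    b -= lr * db
    return W, b
```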
Table 1: Misclassification rate of sparsely-connected versus fully-connected networks on MNIST for different network configurations.

| Case | Method | Network Configuration | Misclassification Rate (%) | Number of Parameters |
| 1 | Fully-Connected | 784-512-512-10 | 1.18 | 669706 |
|   | Sparsely-Connected 50% | 784-512-512-10 | 1.19 | 335370 |
| 2 | Fully-Connected | 784-256-256-10 | 1.35 | 269322 |
|   | Sparsely-Connected 60% | 784-512-512-10 | 1.20 | 268503 |
|   | Sparsely-Connected 70% | 784-512-512-10 | 1.31 | 201636 |
| 3 | Fully-Connected | 784-145-145-10 | 1.41 | 136455 |
|   | Sparsely-Connected 80% | 784-512-512-10 | 1.28 | 134768 |
| 4 | Fully-Connected | 784-77-77-10 | 1.75 | 67231 |
|   | Sparsely-Connected 90% | 784-512-512-10 | 1.75 | 67901 |
| 5 | Fully-Connected | 784-12-12-10 | 4.68 | 9706 |
|   | Sparsely-Connected 90% | 784-100-100-10 | 3.16 | 8961 |
4 Experimental Results
We have validated the effectiveness of the proposed sparsely-connected network and its training algorithm on three datasets: MNIST (LeCun & Cortes (2010)), CIFAR-10 (Krizhevsky (2009)) and SVHN (Netzer et al. (2011)), using the Theano library (Team (2016)) in Python.
4.1 Experimental Results on MNIST
The MNIST dataset contains 70,000 28×28 grayscale images (60,000 for training and 10,000 for testing), falling into 10 classes. A deep fully-connected neural network is used for evaluation and the hinge loss is considered as the cost function. The training set is divided into two separate parts: the first portion is used for training and the remainder for validation. All models are trained using mini-batch SGD without momentum, together with batch normalization.
Table 1 summarizes the misclassification rate of sparsely-connected neural networks compared to fully-connected neural networks for different network configurations, using the single-precision floating-point format. We adopted a fully-connected network with a 784-512-512-10 configuration as the reference network, in which each number represents the number of inputs to each fully-connected layer. From this, we formed sparse weight matrices with different sparsity degrees. For instance, Sparsely-Connected 50% denotes sparse weight matrices in which 50% of the elements are zero. Case 1 shows that a sparsely-connected neural network with 50% fewer connections achieves approximately the same accuracy as the fully-connected network using the same network configuration. In Cases 2 and 3, the sparsely-connected networks with 60-70% and 80% fewer connections achieve a better misclassification rate than the fully-connected networks while having approximately the same number of parameters. Case 4 shows no gain in performance or number of parameters for a 90% sparsely-connected 784-512-512-10 network compared to a fully-connected 784-77-77-10 network with the same number of parameters. However, we can still reduce the connections by up to 90% using a smaller network, as shown in Case 5.
Table 2: Misclassification rates on MNIST for the proposed sparsely-connected networks combined with single-precision floating-point, binarized and ternarized training, with and without data augmentation.

| Method | Misclassification Rate (%) Without Data Augmentation | Misclassification Rate (%) With Data Augmentation | # of Parameters |
| Single-Precision Floating-Point (SPFP) | 1.33 | 0.67 | 2913290 |
| Sparsely-Connected 50% + SPFP | 1.17 | 0.64 | 1458186 |
| Sparsely-Connected 90% + SPFP | 1.33 | 0.66 | 294103 |
| BinaryConnect^a (Courbariaux et al. (2015)) | 1.23 | 0.76 | 2913290 |
| TernaryConnect^b (Lin et al. (2015)) | 1.15 | 0.74 | 2913290 |
| Sparsely-Connected 50% + BinaryConnect^a | 0.99 | 0.75 | 1458186 |
| Sparsely-Connected 60% + BinaryConnect^a | 1.03 | 0.81 | 1167165 |
| Sparsely-Connected 70% + BinaryConnect^a | 1.16 | 0.85 | 876144 |
| Sparsely-Connected 80% + BinaryConnect^a | 1.32 | 1.06 | 585124 |
| Sparsely-Connected 90% + BinaryConnect^a | 1.33 | 1.36 | 294103 |
| Sparsely-Connected 50% + TernaryConnect^b | 0.95 | 0.63 | 1458186 |
| Sparsely-Connected 60% + TernaryConnect^b | 1.05 | 0.64 | 1167165 |
| Sparsely-Connected 70% + TernaryConnect^b | 1.01 | 0.73 | 876144 |
| Sparsely-Connected 80% + TernaryConnect^b | 1.11 | 0.85 | 585124 |
| Sparsely-Connected 90% + TernaryConnect^b | 1.41 | 1.05 | 294103 |

^a The binarizing algorithm was used only in the learning phase; single-precision floating-point weights were used during the test run.
^b The ternarizing algorithm was used only in the learning phase; single-precision floating-point weights were used during the test run.
Recently, BinaryConnect and TernaryConnect neural networks have outperformed the state of the art on different datasets (Courbariaux et al. (2015); Lin et al. (2015)). In BinaryConnect, weights are represented with either -1 or 1, whereas they can be -1, 0 or 1 in TernaryConnect. These networks have emerged to facilitate hardware implementations of neural networks by reducing the memory requirements and removing multiplications. We applied our training method to the BinaryConnect and TernaryConnect training algorithms: the obtained results are provided in Table 2. The source Python codes used for comparison are the same as those used in (Courbariaux et al. (2015); Lin et al. (2015)), available online (Lin et al. (2015)). Without data augmentation, the simulation results show that up to 70% and 80% of the connections can be dropped by the proposed method from BinaryConnect and TernaryConnect networks, respectively, without any compromise in performance. Moreover, the binarized and ternarized sparsely-connected networks improve the accuracy compared to the conventional binarized and ternarized fully-connected networks. Considering data augmentation (affine transformations), our method can drop up to 50% and 70% of the connections from BinaryConnect and TernaryConnect networks, respectively, without any compromise in performance. However, data augmentation yields an even better misclassification rate when used on networks trained with single-precision floating-point weights, as shown in Table 2; in this case, our method can still drop up to 90% of the connections without any performance degradation. It is worth specifying that, throughout Section 4, we only used the binarized/ternarized algorithm during the learning phase and used single-precision floating-point weights during the test run, similar to the approach used in (Lin et al. (2015)).
4.2 Experimental Results on CIFAR-10
The CIFAR-10 dataset consists of a total of 60,000 32×32 RGB images. Similar to MNIST, we split the images into training, validation and test sets. As our model, we adopt a convolutional network comprising six convolution/pooling layers and two fully-connected layers followed by a classification layer. This architecture is inspired by VGGNet (Simonyan & Zisserman (2014)) and was also used in (Courbariaux et al. (2015)). The hinge loss is used for training, together with batch normalization.
In order to show the performance of the proposed technique, we use sparsely-connected networks instead of fully-connected networks in the convolutional network. Again, we compare our results with the binarized and ternarized models, since they are the most hardware-friendly models reported to date. As summarized in Table 3, the simulation results show a significant improvement in accuracy compared to the ordinary network while requiring significantly fewer parameters.
Table 3: Misclassification rates on CIFAR-10, with and without data augmentation.

| Method | Misclassification Rate (%) Without Data Augmentation | Misclassification Rate (%) With Data Augmentation | # of Parameters |
| Single-Precision Floating-Point (SPFP) | 12.45 | 9.77 | 14025866 |
| Sparsely-Connected 90% + SPFP | 12.05 | 9.30 | 5523184 |
| BinaryConnect^a (Courbariaux et al. (2015)) | 9.91 | 8.01 | 14025866 |
| TernaryConnect^b (Lin et al. (2015)) | 9.32 | 7.83 | 14025866 |
| Sparsely-Connected 50% + BinaryConnect^a | 8.95 | 7.27 | 9302154 |
| Sparsely-Connected 90% + BinaryConnect^a | 8.05 | 6.92 | 5523184 |
| Sparsely-Connected 50% + TernaryConnect^b | 8.45 | 7.13 | 9302154 |
| Sparsely-Connected 90% + TernaryConnect^b | 7.88 | 6.99 | 5523184 |

^a The binarizing algorithm was used only in the learning phase; single-precision floating-point weights were used during the test run.
^b The ternarizing algorithm was used only in the learning phase; single-precision floating-point weights were used during the test run.
Table 4: Misclassification rates on SVHN.

| Method | Misclassification Rate (%) | Number of Parameters |
| Single-Precision Floating-Point | 4.734615 | 14025866 |
| BinaryConnect^a (Courbariaux et al. (2015)) | 2.134615 | 14025866 |
| TernaryConnect^b (Lin et al. (2015)) | 2.9 | 14025866 |
| Sparsely-Connected 90% + BinaryConnect^a | 2.003846 | 5523184 |
| Sparsely-Connected 90% + TernaryConnect^b | 1.957692 | 5523184 |

^a The binarizing algorithm was used only in the learning phase; single-precision floating-point weights were used during the test run.
^b The ternarizing algorithm was used only in the learning phase; single-precision floating-point weights were used during the test run.
4.3 Experimental Results on SVHN
The SVHN dataset contains RGB images of street house numbers, split into a large training set and a test set. A portion of the training images is set aside for validation. Similar to the CIFAR-10 case, we use a convolutional network comprising six convolution/pooling layers and two fully-connected layers followed by a classification layer. The hinge loss is used as the cost function, together with batch normalization.
Table 4 summarizes the accuracy of the proposed sparsely-connected network used within the convolutional network model, compared to the hardware-friendly binarized and ternarized models. Despite having fewer parameters, the proposed sparsely-connected network also yields state-of-the-art accuracy.
4.4 Comparison with the State of the Art
The proposed sparsely-connected network is compared to other networks in the literature in terms of misclassification rate in Table 5. In Sections 4.1 to 4.3, we used the binarization/ternarization algorithms to train our models in the learning phase while using single-precision floating-point weights during the test run (i.e. the inference phase). The first part of Table 5 applies the same technique, while in the second part we use binarized/ternarized weights during the test run as well. We thus exploit the deterministic method introduced in (Courbariaux et al. (2015)) to perform the test run with binarized/ternarized weights: the binarized weights $W^b$ are obtained by taking the sign of the real-valued weights, while the ternarized weights $W^t$ additionally map small-magnitude weights to zero.
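The sketch below illustrates this deterministic quantization in Python (a sketch rather than the exact formulation of the paper: the sign-based binarization follows (Courbariaux et al. (2015)), while the 0.5 magnitude threshold used for ternarization is an assumption for illustration):

```python
import numpy as np

def binarize(W):
    """Deterministic binarization at test time: +1 for non-negative weights, -1 otherwise."""
    return np.where(W >= 0, 1.0, -1.0)

def ternarize(W, threshold=0.5):
    """Deterministic ternarization sketch: weights with magnitude below the
    (assumed) threshold map to 0, the remaining weights map to +/-1."""
    W_t = np.sign(W)
    W_t[np.abs(W) < threshold] = 0.0
    return W_t
```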
Table 5: Comparison with the state of the art in terms of misclassification rate on MNIST, SVHN and CIFAR-10.

| Method | MNIST | SVHN | CIFAR-10 |
| Binarized/Ternarized Weights During Test Run | | | |
| BNN (Torch7) (Courbariaux & Bengio (2016)) | 1.40% | 2.53% | 10.15% |
| BNN (Theano) (Courbariaux & Bengio (2016)) | 0.96% | 2.80% | 11.40% |
| (Baldassi et al. (2015)) | 1.35% | – | – |
| BinaryConnect (Courbariaux et al. (2015)) | 1.29% | 2.30% | 9.90% |
| EBP (Cheng et al. (2015)) | 2.2% | – | – |
| Bitwise DNNs (Kim & Smaragdis (2016)) | 1.33% | – | – |
| (Hwang & Sung (2014)) | 1.45% | – | – |
| Sparsely-Connected + BinaryConnect | 1.08% | 2.053846% | 8.66% |
| Sparsely-Connected + TernaryConnect | 0.98% | 1.992308% | 8.24% |
| Single-Precision Floating-Point Weights During Test Run | | | |
| TernaryConnect (Lin et al. (2015)) | 1.15% | 2.42% | 12.01% |
| Maxout Networks (Goodfellow et al. (2013)) | 0.94% | 2.47% | 11.68% |
| Network in Network (Lin et al. (2013)) | – | 2.35% | 10.41% |
| Gated pooling (Lee et al. (2015)) | – | 1.69% | 7.62% |
| Sparsely-Connected + BinaryConnect | 0.99% | 2.003846% | 8.05% |
| Sparsely-Connected + TernaryConnect | 0.95% | 1.957692% | 7.88% |
From the results presented in Table 5, we can see that our proposed work outperforms the state-of-the-art models with binarized/ternarized weights during the test run, while achieving performance close to the best results obtained without binarization/ternarization in the test run. The former are the most hardware-friendly models for hardware implementations of DNNs: our model shows better performance in terms of both misclassification rate and memory requirements. The obtained results suggest that the proposed network acts as a regularizer that prevents models from overfitting. Similar conclusions were also drawn in (Courbariaux et al. (2015)). It is worth noting that no data augmentation was used in our simulations throughout this paper, except for the results reported in Table 2 and Table 3.
5 VLSI Implementation of Sparsely-Connected Neural Networks
In this Section, we propose an efficient hardware architecture for the proposed sparsely-connected network. In fully-connected networks, the main computational core is the matrix-vector multiplication that computes (1). This computation is usually implemented in parallel on GPUs. However, a parallel implementation of this unit requires parallel access to memories and causes routing congestion, leading to large silicon area and power/energy consumption in customized hardware. Thus, VLSI architectures usually opt for semi-parallel implementations of such networks. In this approach, each neuron performs its computations serially, and a certain number of neurons are instantiated in parallel (Moreno et al. (2008)). Every neuron is implemented using a multiply-and-accumulate (MAC) unit as shown in Fig. 3. The number of inputs of each neuron determines the latency of this architecture. For example, considering a hidden layer with $m$ inputs and $n$ outputs, $n$ MACs are required in parallel and each MAC requires $m$ clock cycles to perform the computations of this layer. In general, a counter is required to count from $0$ to $m-1$, where $m$ is the number of inputs of each neuron. It provides the addresses for the memory in which a column of the weight matrix is stored. In this way, each input and its corresponding weight are fed to the multiplier every clock cycle (see Fig. 3). For binarized/ternarized networks, the multiplier in Fig. 3 is substituted with a multiplexer.
In Section 3, we described the formation of the Mask matrix using an SNG unit (see Fig. 2). The threshold $C$, through which it is possible to tune the sparsity degree of the network, also determines the fraction of 1s in the binary stream generated by the SNG. Therefore, memory can be saved by storing only the weights corresponding to the 1s in the SNG stream: for the Mask matrix example of Fig. 2, the compressed weight matrix stored in the on-chip memories contains only the entries of $W^s$ whose mask bit is 1. The smaller memory can significantly reduce the silicon area and the power consumption of DNN architectures. Depending on the value of $C$, the size of the memory varies: in general, the depth of the weight memory in each neuron equals the number of 1s in the corresponding SNG stream.
Fig. 3 depicts the architecture of a single neuron of the proposed sparsely-connected network. Decompression is performed using an SNG that generates the enable signal of the counter and of the accumulator. Inputs are fed into each neuron sequentially, one per clock cycle. If the output of the SNG is 1, the counter counts upward and provides an address for the memory; the multiplication of an input and its corresponding weight is then computed and the result is stored in the internal register of the accumulator. If instead the output of the SNG is 0, the counter holds its previous value, while the internal register of the accumulator is not enabled and does not load a new value. The latency of the proposed architecture is the same as that of the conventional architecture.
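The behavior of this neuron can be summarized with the following Python sketch (a behavioral model, not the VHDL implementation; the random mask below merely stands in for the on-chip SNG stream):

```python
import numpy as np

class SparseNeuron:
    """Behavioral sketch of the neuron in Fig. 3: only the weights whose mask bit
    is 1 are stored, and an SNG-gated counter addresses this compressed memory
    while the inputs stream in, one per clock cycle."""

    def __init__(self, dense_weights, mask_column, bias=0.0):
        self.weight_mem = dense_weights[mask_column == 1]  # compressed on-chip memory
        self.mask_column = mask_column                     # regenerated on-chip by the SNG
        self.bias = bias

    def forward(self, inputs):
        acc, addr = 0.0, 0                     # accumulator and address counter
        for cycle, x in enumerate(inputs):     # one input per clock cycle
            if self.mask_column[cycle] == 1:   # SNG output enables counter/accumulator
                acc += self.weight_mem[addr] * x
                addr += 1
            # when the SNG output is 0, both counter and accumulator hold their values
        return max(acc + self.bias, 0.0)       # ReLU activation

# Example: a 90%-sparse column stores roughly 10% of the weights on chip.
rng = np.random.default_rng(1)
m = 20
w = rng.standard_normal(m)
mask = (rng.random(m) < 0.1).astype(int)       # stand-in for the SNG binary stream
neuron = SparseNeuron(w, mask)
x = rng.standard_normal(m)
assert np.isclose(neuron.forward(x), max((w * mask) @ x, 0.0))
```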
Table 6: ASIC implementation results for a single neuron at different sparsity degrees.

| | Fully-Connected (FC) | Sparsely-Connected | Sparsely-Connected | Sparsely-Connected | Sparsely-Connected |
| Sparsity Degree | | | | | |
| Memory Size [bits] | | | | | |
| Area (improvement w.r.t. FC) | | | | 4221 | 2662 |
| Power [W] | | | | | |
| Energy [pJ] (improvement w.r.t. FC) | | | 220 | 154 | 110 |
| Latency [s] | | | | | |
Table 6 shows the ASIC implementation results for the neuron of Fig. 3. The proposed architectures were described in VHDL and synthesized in TSMC 65 nm CMOS technology with the Cadence RTL Compiler, for different sparsity degrees. For the provided syntheses we used a binarized network. The implementation results show up to 84% lower energy consumption and a substantially smaller area compared to the conventional fully-connected architecture.
6 Conclusion
DNNs are capable of solving complex tasks: their ability to do so depends on the number of neurons and their connections. Fully-connected layers in DNNs contain most of the neural network parameters, pushing designers to use off-chip memories, which are bandwidth-limited and consume large amounts of energy. In this paper, we proposed sparsely-connected networks and their training algorithm to substantially reduce the memory requirements of DNNs. The sparsity degree of the proposed network can be tuned by an SNG, which is implemented using an LFSR unit and a comparator. We used the proposed sparsely-connected network instead of fully-connected networks in a VGG-like network on three commonly used datasets: we achieved better accuracy with up to 90% fewer connections than the state of the art. Moreover, our simulation results confirm that the proposed network can be used as a regularizer that prevents models from overfitting. Finally, we implemented a single neuron of the sparsely-connected network in 65 nm CMOS technology for different sparsity degrees. The implementation results show that the proposed architecture can save up to 84% energy and substantial silicon area compared to the conventional fully-connected network, while also achieving a lower misclassification rate.
References
 Akopyan et al. (2015) F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. J. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha. TrueNorth: design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(10):1537–1557, Oct 2015. ISSN 0278-0070. doi: 10.1109/TCAD.2015.2474396.
 Anwar et al. (2015) Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. CoRR, abs/1512.08571, 2015. URL http://arxiv.org/abs/1512.08571.
 Ardakani et al. (2015) Arash Ardakani, François Leduc-Primeau, Naoya Onizawa, Takahiro Hanyu, and Warren J. Gross. VLSI implementation of deep neural network using integral stochastic computing. CoRR, abs/1509.08972, 2015. URL http://arxiv.org/abs/1509.08972.
 Baldassi et al. (2015) Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Phys. Rev. Lett., 115:128101, Sep 2015.
 Cavigelli et al. (2015) Lukas Cavigelli, David Gschwend, Christoph Mayer, Samuel Willi, Beat Muheim, and Luca Benini. Origami: a convolutional network accelerator. CoRR, abs/1512.04295, 2015. URL http://arxiv.org/abs/1512.04295.
 Chen et al. (2016) Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, pp. 367–379, Piscataway, NJ, USA, 2016. IEEE Press. ISBN 978-1-4673-8947-1. doi: 10.1109/ISCA.2016.40. URL http://dx.doi.org/10.1109/ISCA.2016.40.
 Chen et al. (2016) Yu-Hsin Chen, Tushar Krishna, Joel Emer, and Vivienne Sze. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. In IEEE International Solid-State Circuits Conference, ISSCC 2016, Digest of Technical Papers, pp. 262–263, 2016.
 Cheng et al. (2015) Zhiyong Cheng, Daniel Soudry, Zexi Mao, and Zhenzhong Lan. Training binary multilayer neural networks for image classification using expectation backpropagation. CoRR, abs/1503.03562, 2015.
 Courbariaux & Bengio (2016) Matthieu Courbariaux and Yoshua Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.
 Courbariaux et al. (2015) Matthieu Courbariaux, Yoshua Bengio, and JeanPierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. CoRR, abs/1511.00363, 2015.
 Gaines (1969) B. R. Gaines. Stochastic Computing Systems, pp. 37–172. Springer US, Boston, MA, 1969. ISBN 9781489958419.
 Goodfellow et al. (2013) Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In ICML, 2013.
 Han et al. (2016) S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254, June 2016. doi: 10.1109/ISCA.2016.30.
 Han et al. (2015) Song Han, Huizi Mao, and William J. Dally. Deep compression: compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.
 Horowitz (2014) Mark Horowitz. 1.1 Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14, Feb 2014. doi: 10.1109/ISSCC.2014.6757323.
 Hwang & Sung (2014) K. Hwang and W. Sung. Fixed-point feedforward deep neural network design using weights +1, 0, and -1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS), pp. 1–6, Oct 2014.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.
 Kim & Smaragdis (2016) Minje Kim and Paris Smaragdis. Bitwise neural networks. CoRR, abs/1601.06071, 2016.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc., 2012.
 LeCun et al. (1989) Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, December 1989. ISSN 08997667. doi: 10.1162/neco.1989.1.4.541. URL http://dx.doi.org/10.1162/neco.1989.1.4.541.
 LeCun & Cortes (2010) Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
 Lecun et al. (1998) Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324, 1998.
 Lecun et al. (2015) Yann Lecun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 5 2015. ISSN 00280836. doi: 10.1038/nature14539.
 Lee et al. (2015) ChenYu Lee, Patrick W. Gallagher, and Zhuowen Tu. Generalizing pooling functions in convolutional neural networks: mixed, gated, and tree. CoRR, abs/1509.08985, 2015.
 Lin et al. (2013) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013.
 Lin et al. (2015) Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. CoRR, abs/1510.03009, 2015.
 Moreno et al. (2008) F. Moreno, J. Alarcon, R. Salvador, and T. Riesgo. FPGA implementation of an image recognition system based on tiny neural networks and on-line reconfiguration. In Industrial Electronics, 2008. IECON 2008. 34th Annual Conference of IEEE, pp. 2445–2452, Nov 2008. doi: 10.1109/IECON.2008.4758340.
 Nair & Hinton (2010) Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Johannes Fürnkranz and Thorsten Joachims (eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814. Omnipress, 2010. URL http://www.icml2010.org/papers/432.pdf.
 Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. URL http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.
 Rumelhart et al. (1986) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. chapter Learning Internal Representations by Error Propagation, pp. 318–362. MIT Press, Cambridge, MA, USA, 1986. ISBN 026268053X. URL http://dl.acm.org/citation.cfm?id=104279.104293.
 Shafiee et al. (2016a) A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 14–26, June 2016a. doi: 10.1109/ISCA.2016.12.
 Shafiee et al. (2016b) M. J. Shafiee, P. Siva, and A. Wong. StochasticNet: forming deep neural networks via stochastic connectivity. IEEE Access, 4:1915–1924, 2016b. ISSN 21693536.
 Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 Smithson et al. (2016) Sean C. Smithson, Kaushik Boga, Arash Ardakani, Brett H. Meyer, and Warren J. Gross. Stochastic computing can improve upon digital spiking neural networks. In 2016 IEEE Workshop on Signal Processing Systems (SiPS), pp. 309–314, Oct 2016.
 Szegedy et al. (2015) C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, June 2015. doi: 10.1109/CVPR.2015.7298594.
 Tang (2013) Yichuan Tang. Deep learning using support vector machines. CoRR, abs/1306.0239, 2013. URL http://arxiv.org/abs/1306.0239.
 Team (2016) Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv eprints, abs/1605.02688, May 2016.
 Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. CoRR, abs/1608.03665, 2016. URL http://arxiv.org/abs/1608.03665.
 Zeiler & Fergus (2013) Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013. URL http://arxiv.org/abs/1311.2901.