Hardware-aware Pruning of DNNs using LFSR-Generated Pseudo-Random Indices
Deep neural networks (DNNs) have been emerged as the state-of-the-art algorithms in broad range of applications. To reduce the memory foot-print of DNNs, in particular for embedded applications, sparsification techniques have been proposed.Unfortunately, these techniques come with a large hardware overhead. In this paper, we present a hardware-aware pruning method where the locations of non-zero weights are derived in real-time from a Linear Feedback Shift Registers (LFSRs). Using the proposed method, we demonstrate a total saving of energy and area up to 63.96% and 64.23% for VGG-16 network on down-sampled ImageNet, respectively for iso-compression-rate and iso-accuracy.
Sparse Neural Network Linear Feedback shift Register DNN accelerator
Neural networks (NNs) have shown remarkable performance in a broad range of applications, from computer vision [6, 18] to health-care data analysis [17, 9]. These algorithms have gradually evolved to larger models with deeper layers and increasing number of parameters .While the large DNNs are powerful and provides high accuracy for complex tasks, they cannot be easily deployed on embedded mobile applications due to two major problems . Firstly, because the models are large and often over-parameterized, it must be stored in an external DRAM. The second major issue is that accessing model parameters from an external DRAM consumes large amounts of energy . For example, in the 45nm CMOS technology, accessing a 32bit DRAM memory requires 640pJ, which is 3 order of magnitudes higher than a 32bit floating point add operation (0.9 pJ) . Therefore, it is hard to deploy large DNNs on battery constrained mobile platforms.
Network compression via pruning techniques is one possible solution to fit large DNNs such as VGG-16 (138M parameters, 520MB) in on-chip SRAM [8, 13, 7]. However, it is challenging because sparse matrices add extra levels of irregularity to the weights’ addresses [8, 1]. In , a pruning method has been proposed which prunes the connections that have smaller weights than a threshold via an iterative process of pruning and retraining. Although this method prunes the network with no loss of accuracy; the method is heuristic and the threshold values need to carefully selected. Further, resultant matrix lacks structure. In , the authors prune the network by learning the important connections using a thresholding mechanism and then applying weight sharing and Huffman coding to compress the networks even further. The problem with this method is that they need to store three vectors: (1) the sparse weights matrix’s value, (2) the location or address of the non-zero matrix weights and (3) the pointer vector to keep track of the weights in each column. In addition, weight sharing adds another level of indirection and complexity . In summary, the baseline pruning method is powerful from an algorithm perspective, but its mapping to hardware is inefficient and requires a memory foot-print as high as that of the model size.
In this paper, we present a hardware-aware method to prune dense DNNs which reduces the memory footprint while preserving the original accuracy. We utilize an on-die linear feedback shift register (LFSR) using a known seed, to generate a pseudo random sequence (PRS). Next, we use this PRS to regularize and prune the network. In the last step, we retrain the sparse network so that the model can perform better with the pruned model. In addition, during inference we use the LFSR with the same seed to generate the indices in real-time to perform multiplication between the sparse weight matrix and input/activation vector. Consequently, we no longer need to store the sparse weight addresses – thereby reducing the memory foot-print significantly.
2 Proposed Hardware-Aware Pruning
Proposed pruning and baseline methods are illustrated in Figure1. The proposed pruning method consists four steps. It begins with generating a pseudo-random sequence (PRS) using two LFSRs, one for row indices and another one for column indices. We then use the generated PRS as the indices to sparsify the synapses (i.e. connections) by using standard regularization methods in an iterative approach. The specified synapses are regularized to force them to be zero in the training step and be pruned away in the pruning step. The last step is retraining the sparse model so that the sparse model can provide higher accuracy. Using a PRS for generating the locations of the zero weights in the connectivity matrix provides good accuracy, while making it easier to generate the indices on the fly, instead of being stored in a separate memory sub-bank.
We use Han et. al, 2015  as the state-of-the-art baseline pruning techniques for the proposed algorithm. In , illustrated in Figure 1, a pruning method was proposed to prune the redundant connections in an iterative process based on a threshold. First, the neural network is trained starting with random initial conditions. Next, the connections less than a threshold are pruned iteratively and finally, the pruned network is fine tuned using several epochs of retraining. Interested readers are pointed to  for further details.
2.1 Linear Feedback Shift Register (LFSR)
LFSR  is one of the most commonly used topologies for implementing and generating pseudo random bit sequences . The advantages of using an LFSR to generate the indices are: (1) they can be easily be implemented in hardware, (2) the PRS has key statistical properties that preserves the rank of the generated connectivity matrix . LFSR, consists of an array of flip-flops with an initial state called input seed (), followed by linear feedback performed by several exclusive-or (XOR) gates (). LFSRs can be mathematically described through order characteristic polynomials:
To obtain the maximum PRS length, the characteristic polynomials have to be primitive  where the maximum period without repetition is . In the proposed method we utilize two LFSRs, to generate random sequence for the indices of the rows and columns, separately. The row index indicates the address of the element in the input vector while the column address encodes the address of the output vector.
2.2 Training with PRS based Regularization
A fully connected (FC) layer of DNN performs the following function:
Where is a weight matrix, is an input, is a vector of bias values, is a non-linear activation function, typically a Rectified Linear Unit (ReLU) . For simplification, can be merged with by appending an additional column to the end of matrix . Then we can write the above equation as:
where c and d are the original weight matrix’s indices correspond to rows and columns, respectively. After selecting the synapses using the PRS sequence, we regularize them during the training process. We apply strong regularization on the selected synapses, based on LFSRs indices, in order to force the network to zero-out these synapses. We have studied both L1 and L2 regularization  to penalize non-zero weight values. For L2 regularization, a regularizer component is added to the cost (J) , as shown in Eq.4 and weights will be updated based on Eq. 5.
Where i and j are correspond to the indices from LFSR 1 and LFSR 2 for rows and columns, respectively. is the learning rate. L is the layer’s number in the network. is the regularization parameter and can be tuned. Larger will more penalized the weights values and make them closer to zero.
2.3 Pruning and Retraining
After heavily regularizing the selected weights, a pruning step is employed to make sure that all the selected weights are exactly zero as regularization makes the connection very close to zero, but not exactly zero. With the LFSR based pruning method, the activation computation of Eq. 6 becomes
where S is a sparse weight matrix where i and j belong to the first and second LFSR, respectively. The last step of the process is to retrain the pruned network to fine tune the remaining synapses so that they better compensate for the removed connections.
2.4 DNN compression
To fully understand advantages and limitations of baseline and proposed algorithms, we implement both methods in digital hardware (Figure 2). The two circuit diagrams show how hardware resources/operations differ to solve a sparsely connected fully-connected layer, with N input neuron, M output neuron and as sparsity. For the hardware implementation of our proposed method, the LFSR is used to generate the index for the input to be multiplied to the corresponding weight in sparse matrix weight. LFSR generate the pseudo random sequence with values between 1 to . In order to keep the values between number of input neurons, we multiply the generated value to the length of input neurons and select the most significant bits (MSBs). The goal is to avoid redundant clock cycles when the generated number is greater that the number of neurons. After doing multiplication and accumulation operations, the result is stored in the output buffer. The index comes from the LFSR with different input seed which is responsible for generating the sequence for output addresses. The exact number of memory reads from the input and the output buffer depend on the model size as well as the number of multiply and accumulate units available for parallel compute. For the baseline algorithm, the sparse weight matrix is compressed in three vectors including the non-zero values of the weights (S), location of the non-zero weights (I) and a pointer vector to point to the start of each column of the weight matrix (P), that should be saved in the memory. Moreover, each entry bit-width of S and I is designed to be four-bit or eight-bits, and additional memory usage ratio resulted from limited index representation is denoted by . For instance, if more than 15 zeros appear before a non-zero four-bit entry, a zero is added to vectors S and I.
3.1 Simulation results for the proposed pruning algorithm
The proposed methodology is demonstrated on three pruned networks: LeNet-300-100, LeNet-5 and VGG-16 using MNIST, CIFAR-10 and down-sampled ImageNet  data-sets. Training is carried out on Nvidia GTX 1080 Ti GPUs. The key parameters for hardware implementation are shown in Table 1. Rate of compression along with the top-1 accuracy error before and after pruning for three mentioned models are shown in Table 2. We observe that the proposed method does not affect the rank of weight matrices (Table 3). As the matter of fact the rank in the proposed approach is close to full rank (as in unpruned models) of the dense matrix. Since the PRS maintain the matrix rank, we infer that the expressibility of the weight matrices and accuracy of the network can remain unchanged.
|Technology Node||TSMC 65nm|
|Index Bitwidth||4b, 8b|
|Memory Bank Size||256B, 512B, 1KB, 4KB|
|Modified VGG-16 Unpruned||48.5%||23M|
|Modified VGG-16 Pruned||52.1%||3.3M|
FC layers (unpruned)
3.1.1 Pruning on Fully Connected Layers
Large DNNs are over-parameterized; this mostly correspond to the large fully connected layers of these networks. For instance, 124M out of 138M of the parameters are related to the 3 fully connected layer in VGG-16. Because of this, we focused on pruning fully connected layers’ connection as they consume most of the energy and memory size from hardware perspective.
3.1.2 LeNet on MNIST
LeNet-300-100 is a fully connected network which has two hidden layers of length 300 and 100 neurons each which achieves 4.9% error rate on the MNIST dataset. Accuracy loss versus different sparsity rates on MNIST is illustrated in Figure3. Three different have been tested on the proposed method and the results before and after retraining are illustrated. The results shows that moderate and strong regularization, equals to 2 and 10, respectively, have better performance before and after retraining. We picked equals to 2 to trade-off between pruning and convergence of the loss function while preventing over-fitting. We use both L1 and L2 regularization and the results have been illustrated 3. L1 has better performance before retraining while L2 achieves better performance after retraining.
The second model tested on MNIST is a convolutional network, LeNet-5 that has two convolutional layers followed by two fully connected layers. LeNet-5 achieves 1.6% error rate on MNIST dataset. The accuracy versus different sparsity of LeNet on MNIST is illustrated in Figure 4, which shows that our method achieves the same accuracy as the baseline for different sparsity rates.
3.1.3 LeNet-5 on CIFAR-10
We uses LeNet-5 on CIFAR-10 and the comparison between the accuracy of proposed method and the baseline method for 5 trials is shown in Figure 4. The results shows that the proposed method is more reliable and has less std while preserving the original accuracy as it is not based on the thresholding method.
3.1.4 VGG-16 on down-sampled Imagenet
We used VGG-16 on ImageNet data  with 1000 different classes, but initially down-sampled it to . Apart from a single crop with no rotation, we have not used any other pre-processing or augmentation. For this dataset we have utilized largest batch size, 32 images/batch, allowed by the GPU memory. We then classified down-sampled ImageNet using VGG-16 with some modification to be fit to the spatial size (i.e. ) of down-sampled ImageNet which is due to the fact that the feature size should maintain enough spatial size before each pooling layer. The fully connected layers size was changed to 2048 and the last pooling layer was eliminated. The results have shown in Figure 4 which shows that the proposed pruning method can preserve the accuracy even in high sparsity rates.
3.2 65nm CMOS Hardware Implementation
We have synthesized baseline and proposed methods with 65nm CMOS technology to measure hardware metrics. Implementation parameters are shown in Table 1. The pre-layout analysis demonstrates 1.51 to 2.94 reduction in required memory footprint between proposed method and 4-8b indexed baseline pruning technique (Figure 5). Besides memory, the overall system (memory, multiplier, accumulator and input/output buffers) parameters are also measured. The power and area measurements are demonstrated in Table 4 and 5. We observe a maximum of 63.96% power savings and 68.18% area savings across varying sparsity, indexing bit-widths and baseline designs. Although significant savings are observed for the proposed method, it should also be noted that the LFSR based column indexing introduces additional output buffer access (1 cycle read and 1 cycle write). This is included in our design and results. We note that additional power is negligible compared to the total power savings.
Total Saving (%)
Total Saving (%)
In this paper we propose a new method of indexing to use a sparse network for inference, to enhance the memory usage and energy efficiency of DNNs. To achieve that, we have utilized an LFSR based indexing by generating two pseudo random sequence as indices instead of saving the indices in a separate memory. The generated indices are used to decide which weights need to be pruned and which ones to be retained. We show that our method can prune large networks without loss of accuracy. In addition, we demonstrated, a maximum of 63.96% power savings and 68.18% area savings across varying sparsity, indexing bit-widths can be achieved.
This project was supported by the Semiconductor Research Corporation under grant JUMP CBRIC task ID 2777.004, 2777.005 and 2777.006.
-  (2019) Eyeriss v2: a flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems. Cited by: §1.
-  (2017) Cryptographic boolean functions and applications. Academic Press. Cited by: §2.1, §2.1.
-  (2016) EIE: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254. Cited by: §1.
-  (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §1, §1.
-  (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §1, Figure 1, §2.2, §2, Figure 4.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
-  (2008) Sparse deep belief net model for visual area v2. In Advances in neural information processing systems, pp. 873–880. Cited by: §1.
-  (2019) SqueezeFlow: a sparse cnn accelerator exploiting concise convolution rules. IEEE Transactions on Computers 68 (11), pp. 1663–1677. Cited by: §1.
-  (2017) Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics 19 (6), pp. 1236–1246. Cited by: §1.
-  (2002) A novel pseudo random bit generator for cryptography applications. In 9th International conference on electronics, circuits and systems, Vol. 2, pp. 489–492. Cited by: §2.1.
-  (2009) CACTI 6.0: a tool to model large caches. HP laboratories 27, pp. 28. Cited by: §2.2.
-  (2016) Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759. Cited by: §3.1.4.
-  (2017) Scnn: an accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 27–40. Cited by: §1.
-  (2018) A double stage implementation for 1-k pseudo rng using lfsr and trivium. Journal of Computer Science and Control Systems 11 (1), pp. 13–17. Cited by: §2.1.
-  (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Cited by: §3.1.4, §3.1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
-  (2018) Neural network analysis of sleep stages enables efficient diagnosis of narcolepsy. Nature communications 9 (1), pp. 5229. Cited by: §1.
-  (2018) Deep learning for computer vision: a brief review. Computational intelligence and neuroscience 2018. Cited by: §1.
-  (2016) Cambricon-x: an accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 20. Cited by: §1.