# Dynamic Sparse Graph for Efficient Deep Learning

###### Abstract

We propose to execute deep neural networks (DNNs) with dynamic and sparse graph (DSG) structure for compressive memory and accelerative execution during both training and inference. The great success of DNNs motivates the pursuing of lightweight models for the deployment onto embedded devices. However, most of the previous studies optimize for inference while neglect training or even complicate it. Training is far more intractable, since (i) the neurons dominate the memory cost rather than the weights in inference; (ii) the dynamic activation makes previous sparse acceleration via one-off optimization on fixed weight invalid; (iii) batch normalization (BN) is critical for maintaining accuracy while its activation reorganization damages the sparsity. To address these issues, DSG activates only a small amount of neurons with high selectivity at each iteration via a dimension-reduction search (DRS) and obtains the BN compatibility via a double-mask selection (DMS). Experiments show significant memory saving (1.7-4.5x) and operation reduction (2.3-4.4x) with little accuracy loss on various benchmarks.

Dynamic Sparse Graph for Efficient Deep Learning

Liu Liu, Lei Deng, Xing Hu, Maohua Zhu, Yufei Ding, Yuan Xie University of California, Santa Barbara {liu_liu, leideng, huxing, maohuazhu, yufeiding, yuanxie}@ucsb.edu Guoqi Li Tsinghua University liguoqi@mail.tsinghua.edu.cn

noticebox[b]Preprint. Work in progress.\end@float

## 1 Introduction

Deep Neural Networks (DNNs) [1] have been achieving impressive progress in a wide spectrum of domains [2, 3, 4, 5, 6], while the models are extremely memory- and compute-intensive. The high representational and computational cost motivates many researchers to investigate approaches on improving the execution performance, including matrix or tensor decomposition [7, 8, 9, 10, 11], data quantization [12, 13, 14, 15, 16, 17, 18], and network pruning [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. However, most of the previous work aims at inference while the challenges for reducing the representational and computational cost of training are not well-studied. Although some work demonstrate acceleration in distributed training [32, 33, 34], we target at single-node optimization, and our method can also boost training in a distributed fashion.

DNN training, which demands much more hardware resources in terms of both memory capacity and computation volume, is far more challenging than inference. Firstly, activation data in training will be stored for backpropagation, significantly increasing the memory consumption. Secondly, training iteratively updates model parameters using mini-batched stochastic gradient descent (SGD). We almost always expect larger mini-batches for higher throughput (Figure 1(a)), faster convergence, and better accuracy [35]. However, memory capacity is often the limiting factor (Figure 1(b)); it may cause performance degradation or even make large models with deep structures or targeting high-resolution vision tasks hard to train [3, 36].

It is difficult to apply existing sparsity techniques towards inference phase to training phase because of the following reasons: 1) Prior arts mainly compress the pre-trained and fixed weight parameters to reduce the off-chip memory access in inference [37, 38], while instead, the dynamic neuronal activations turn out to be the crucial bottleneck in training [39], making the prior inference-oriented methods inefficient. Besides, during training we need to stash vast batched activation space for the backward gradient calculation. Therefore, neuron activations creates a new memory bottleneck (Figure 1(c)). In this paper, we will sparsify the neuron activations for training compression. 2) The existing inference accelerations usually add extra optimization problems onto the critical path [26, 27, 22, 25, 40, 31], i.e., ‘complicated training simplified inference’, which embarrassingly complicates the training phase. 3) Moreover, previous studies reveal that batch normalization (BN) is crucial for improving accuracy and robustness (Figure 1(d)) through activation fusion across different samples within one mini-batch for better representation [41, 42]. BN almost becomes a standard training configuration; however, inference-oriented methods seldom discuss BN and treat BN parameters as scaling and shift factors in the forward pass. We further find that BN will damage the sparsity due to the activation reorganization (Figure 1(e)). Since this work targets both training and inference, the BN compatibility problem should be addressed.

From the view of information representation, the activation of each neuron reflects its selectivity to the current stimulus sample [41], and this selectivity dataflow propagates layer by layer forming different representation levels. Fortunately, there is much representational redundancy, for example, lots of neuron activations for each stimulus sample are so small and can be removed (Figure 1(f)). Motivated by above comprehensive analysis regarding memory and compute, we propose to search critical neurons for constructing a sparse graph at every iteration. By activating only a small amount of neurons with a high selectivity, we can significantly save memory and simplify computation with tolerable accuracy degradation. Because the neuron response dynamically changes under different stimulus samples, the sparse graph is variable. The neuron-aware dynamic and sparse graph (DSG) is fundamentally distinct from the static one in previous work on permanent weight pruning since we never prune the graph but activate part of them each time. Therefore, we maintain the model expressive power as much as possible. A graph selection method, dimension-reduction search (DRS), is designed for both compressible activations with element-wise unstructured sparsity and accelerative vector-matrix multiplication (VMM) with vector-wise structured sparsity. Through double-mask selection (DMS) design, it is also compatible with BN. We can use the same selection pattern and extend our method to inference. In a nutshell, we propose a compressible and accelerative DSG approach supported by DRS and DMS methods. It can achieve 1.7-4.5x memory compression and 2.3-4.4x computation reduction with minimal accuracy loss. This work simultaneously pioneers the approach towards efficient online training and offline inference, which can benefit the deep learning in both the cloud and the edge.

## 2 Approach

Our method forms DSGs for different inputs, which are accelerative and compressive, as shown in Figure2(a). On the one hand, choosing a small number of critical neurons to participate in computation, DSG can reduce the computational cost by eliminating calculations of non-critical neurons. On the other hand, it can further reduce the representational cost via compression on sparsified activations. Different from previous methods using permanent pruning, our approach does not prune any neuron and the associated weights; instead, it activates a sparse graph according to the input sample at each iteration. Therefore, DSG does not compromise the expressive power of the model.

Constructing DSG needs to determine which neurons are critical. A naive approach is to select critical neurons according to the output activations. If the output neurons have a small or negative activation value, i.e., not selective to current input sample, they can be removed for saving representational cost. Because these activations will be small or absolute zero after the following ReLU non-linear function (i.e., ReLU() max(0, )), it’s reasonable to set all of them to be zero. However, this naive approach requires computations of all VMM operations within each layer before the selection of critical neurons, which is very costly.

### 2.1 Dimension Reduction Search

To avoid the costly VMM operations in the mentioned naive selection, we propose an efficient method, i.e., dimension reduction search (DRS), to estimate the importance of output neurons. As shown in Figure2(b), we first reduce the dimensions of X and W, and then execute the lightweight VMM operations in the low-dimension space at minimal cost. After that, we estimate the neuron importance according to the virtual output activations. Then, a binary mask can be produced in which the zeros represent the non-critical neurons with small activations that are removable. We use a top-k search method that only keeps largest k neurons, where an inter-sample threshold sharing mechanism is leveraged to greatly reduce the search cost ^{1}^{1}1Implementation details are shown in Appendices.. Note that k is determined by the output size and a pre-configured sparsity parameter . Then we can compute the accurate activations of the critical neurons in the original high-dimension space and avoid calculating the non-critical neurons. Thus, besides the compressive sparse activations, DRS can further save a significant amount of expensive operations in high-dimensional space.

In this way, a vector-wise structured sparsity can be achieved, as shown in Figure 3(b). The ones in the selection mask (marked as colored blocks) denote the critical neurons, and the non-critical ones can bypass the memory access and computation of a corresponding whole column of the weight matrix. Furthermore, the generated sparse activations can be compressed via the zero-value compression [43, 44, 45] (Figure 3(c)). Consequently, it is critical to reduce the vector dimension but keep the activations calculated in low-dimension space as accurate as possible, compared to the ones in original high-dimension space.

### 2.2 Sparse Random Projection for Efficient DRS

Notations: Each CONV layer has a four dimensional weight tensor (, , , ), where is the number of filters, i.e., the number of output feature maps (FMs); is the number of input FMs; (, ) represents the kernel size. Thus, the CONV layer in Figure 3(a) can be converted to many VMM operations, as shown in Figure 3(b). Each row in the matrix of input FMs is the activations from a sliding window across all input FMs (), and after the VMM operation with the weight matrix () it can generate points at the same location across all output FMs. Further considering the size of each output FM and the mini-batch size of , the whole rows of VMM operations has a computational complexity of . For the FC layer with input neurons and output neurons, this complexity is . Note that here we switch the order of BN and ReLU layer from ‘CONV/FC-BN-ReLU’ to ‘CONV/FC-ReLU-BN’, because it’s hard to determine the activation value of the non-critical neurons if the following layer is BN (this value is zero for ReLU). As shown in previous work, this reorganization could bring better accuracy [46].

For the sake of simplicity, we just consider the operation for each sliding window in the CONV layer or the whole FC layer under one single input sample as a basic optimization problem. The generation of each output activation requires an inner product operation, as follows:

(1) |

where is the -th row in the matrix of input FMs (for the FC layer, there is only one X vector), is the -th column of the weight matrix , and is the neuronal transformation (e.g., ReLU function, here we abandon bias). Now, according to equation (1), the preservation of the activation is equivalent to preserve the inner product.

We introduce a dimension-reduction lemma, named Johnson–Lindenstrauss Lemma (JLL) [47], to implement the DRS with inner product preservation. This lemma states that a set of points in a high-dimensional space can be embedded into a low-dimensional space in such a way that the Euclidean distances between these points are nearly preserved. Specifically, given , a set of points in (i.e., all and ), and a number of , there exists a linear map such that

(2) |

for any given and pair, where is a hyper-parameter to control the approximation error, i.e., larger larger error. When is sufficiently small, one corollary from JLL is the following norm preservation [48, 49]:

(3) |

where Z could be any or , and denotes a probability. It means the vector norm can be preserved with a high probability controlled by . Given these basics, we can further get the inner product preservation:

(4) |

The detailed proof can be found in the Appendices.

Random projection [48, 50, 51] is widely used to construct the linear map . Specifically, the original -dimensional vector is projected to a -dimensional () one, using a random matrix R. Then we can reduce the dimension of all and by

(5) |

The random projection matrix R can be generated from Gaussian distribution [50]. In this paper, we adopt a simplified version, termed as sparse random projection [51, 52, 53] with

(6) |

for all elements in R. This R only has ternary values that can remove the multiplications during projection, and the remained additions are very sparse. Therefore, the projection overhead is negligible compared to other high-precision operations involving multiplication. Here we set with 67% sparsity in statistics.

Equation (4) indicates the low-dimensional inner product can still approximate the original high-dimensional one in equation (1) if the reduced dimension is sufficiently high. Therefore, it is possible to calculate equation (1) in a low-dimensional space for activation estimation, and select the important neurons. As shown in Figure 3(b), each sliding window dynamically selects its own important neurons for the calculation in high-dimensional space, marked in red and blue as two examples. Figure 4 visualizes two sliding windows in a real network to help understand the dynamic DRS process. Here the neuronal activation vector ( length) is reshaped to a matrix for clarity. Now For the CONV layer, the computational complexity is only , which is less than the original high-dimensional computation with complexity because we usually have . For the FC layer, we also have .

### 2.3 DMS for BN Compatibility

To deal with the important but intractable BN layer, we propose a double-mask selection (DMS) method presented in Figure 2(c). After the DRS estimation, we produce a sparsifying mask that removes the unimportant neurons. The ReLU activation function can maintain this mask by inhibiting the negative activation (actually all the activations from the CONV layer or FC layer after the DRS mask are positive under reasonably large sparsity). However, the BN layer will damage this sparsity through inter-sample activation fusion. To address this issue, we copy the same DRS mask and directly use it on the BN output. It is straightforward but reasonable because we find that although BN causes the zero activation to be non-zero (Figure 1(f)), these non-zero activations are still very small and can also be removed. This is because BN just scales and shifts the activations that won’t change the relative sort order. In this way, we can achieve fully sparse activation dataflow.

## 3 Experimental Results

### 3.1 Experiment Setup

The overall training algorithm is presented in the Appendices. Going through the dataflow where the red color denotes the sparse tensors, a widespread sparsity in both the forward and backward passes is demonstrated. Regarding the evaluation network models, we use LeNet [54] and a multi-layered perceptron (MLP) on a small-scale FASHION dataset [55], VGG8 [12, 14]/ResNet8 (a customized ResNet-variant with 3 residual blocks and 2 FC layers)/ResNet20/WRN-8-2 [56] on medium-scale CIFAR10 dataset [57], VGG8 and WRN-8-2 on another medium-scale CIFAR100 dataset [57], and ResNet18 [3]/WRN-18-2 [56]/VGG16 [2] on the large-scale ImageNet dataset [58] as workloads. The programming framework is PyTorch and the training platform is based on NVIDIA Titan Xp GPU. We adopt the zero-value compression method [43, 44, 45] for memory compression and MKL compute library [59] on Intel Xeon CPU for the acceleration evaluation.

### 3.2 Accuracy Analysis

In this section, we provide a comprehensive analysis regarding the influence of sparsity on accuracy and explore the robustness of MLP and CNN, the graph selection strategy, the BN compatibility, and the importance of width and depth.

Accuracy using DSG. Figure 5(a) presents the accuracy curves on small and medium scale models by using DSG under different sparsity levels. Three conclusions are observed: 1) The proposed DSG affects little on the accuracy when the sparsity is 60%, and the accuracy will present an abrupt descent with sparsity larger than 80%. 2) Usually, the ResNet model family is more sensitive to the sparsity increasing since fewer parameters than the VGG family. For the VGG8 on the CIFAR10 dataset, the accuracy loss is still within 0.5% when sparsity reaches 80%. 3) Compared to MLP, CNN can tolerate more sparsity. Figure 5(b) further shows the results on large scale ImageNet models. Because training large model is time costly, we only present several experimental points. Consistently, the VGG16 shows better robustness compared to the ResNet18, and the WRN with wider channels on each layer performs much better than the other two models. We will discuss the topic of width and depth later.

Graph Selection Strategy. To investigate the influence of graph selection strategy, we repeat the sparsity vs. accuracy experiments on CIFAR10 dataset under different selection methods. Two baselines are used here: the Oracle one that keeps the neurons with top-k activations after the whole VMM computation at each layer, and the random one that randomly selects neurons to keep. The results are shown in Figure 5(c), in which we can see that our DRS and the Oracle one perform much better than the random selection under high sparsity condition. Moreover, DRS achieves nearly the same accuracy with the oracle top-k selection, which indicates the proposed random projection method can find an accurate activation estimation in low-dimensional space. In detail, Figure 5(d) shows the influence of parameter that reflects the degree of dimension reduction. Lower can approach the original inner product more accurately, that brings higher accuracy but at the cost of more computation for graph selection since less dimension reduction. With , the accuracy loss is within 1% even if the sparsity reaches 80%.

BN Compatibility. Figure 5(e) focuses the BN compatibility issue. Here we use DRS for the graph sparsifying, and compare three cases: 1) removing the BN operation and using single mask; 2) keeping BN and using only single mask (the first one in Figure 2(c)); 3) keeping BN and using double masks (i.e. DMS). The one without BN is very sensitive to the graph ablation, which indicates the importance of BN for training. Comparing the two with BN, the DMS even achieves better accuracy since the regularization effect. This observation indicates the effectiveness of the proposed DMS method for simultaneously recovering the sparsity damaged by the BN layer and maintaining the accuracy.

Width or Depth. Furthermore, we investigate an interesting comparison regarding the network width and depth, as shown in Figure 5(f). On the training set, WRN with fewer but wider layers demonstrates more robustness than the deeper one with more but slimmer layers. On the validation set, the results are a little more complicated. Under small and medium sparsity, the deeper ResNet performs better (1%) than the wider one. While when the sparsity increases substantial (75%), WRN can maintain the accuracy better. This indicates that, in medium-sparse space, the deeper network has stronger representation ability because of the deep structure; however, in ultra-high-sparse space, the deeper structure is more likely to collapse since the accumulation of the pruning error layer by layer. In reality, we can determine which type of model to use according to the sparsity requirement. In Figure 5(b) on ImageNet, the reason why WRN-18-2 performs much better is that it has wider layers without reducing the depth.

### 3.3 Representational Cost Reduction

This section presents the benefits from DSG on representational cost. We measure the memory consumption over five CNN benchmarks on both the training and inference phases. For data compression, we use zero-value compression algorithm [43, 44, 45]. Figure 6 shows the memory optimization results, where the model name, mini-batch size, and the sparsity are provided. In training, besides the parameters, the activations across all layers should be stashed for the backward computation. Consistent with the observation mentioned above that the neuron activation beats weight to dominate memory overhead, which is different from the previous work on inference. We can reduce the overall representational cost by average 1.7x (2.72 GB), 3.2x (4.51 GB), and 4.2x (5.04 GB) under 50%, 80% and 90% sparsity, respectively. If only considering the neuronal activation, these ratios could be higher up to 7.1x. The memory overhead for the selection masks is minimal (2%).

During inference, only memory space to store the parameters and the activations of the layer with maximum neuron amount is required. The benefits in inference are relatively smaller than that in training since weight is the dominant memory. On the ResNet152, the extra mask overhead even offsets the compression benefit under 50% sparsity, whereas, we can still achieve up to average 7.1x memory reduction for activations and 1.7x for overall memory. Although the compression is limited for inference, it still can achieve noticeable acceleration that will be shown in the next section. Moreover, reducing costs for both training and inference is our major contribution.

### 3.4 Computational Cost Reduction

We assess the results on reducing the computational cost of both training and inference.
As shown in Figure 7, both the forward and backward pass consume much fewer operations, i.e., multiply-and-accumulate (MAC). On average, 1.4x (5.52 GMACs), 1.7x (9.43 GMACs), and 2.2x (10.74 GMACs) operation reduction are achieved in training under 50%, 80% and 90% sparsity, respectively. For inference with only forward pass, the results increase to 1.5x (2.26 GMACs), 2.8x (4.22 GMACs), and 3.9x (4.87 GMACs), respectively. The overhead of the DRS computation in low-dimensional space is relatively larger (6.5% in training and 19.5% in inference) compared to the mask overhead in memory cost. Note that the training demonstrates less improvement than the inference, which is because the acceleration of the backward pass is partial. The error propagation is accelerative, but the weight gradient generation is not because of the irregular sparsity that is hard to obtain practical acceleration. Although the computation of this part is also very sparse with much fewer operations^{2}^{2}2See Algorithm 1 in the Appendices, we don’t include its GMACs reduction for practical concern.

Finally, we evaluate the execution time on CPU using Intel MKL kernels ([59]). As shown in Figure 8(a), we evaluate the execution time of these layers after the DRS selection on VGG-8. Comparing to VMM baselines, our approach can achieve 2.0x, 5.0x, and 8.5x speedup under 50%, 80%, and 90% sparsity, respectively. When the baselines change to GEMM (general matrix multiplication), the speedup decreases to 0.6x, 1.6x, and 2.7x, respectively. The reason is that DSG generates dynamic vector-wise sparsity, which is not well supported by GEMM.

We further compare our approach with smaller dense models which could be another way to reduce computational cost. As shown in Figure 8(b), comparing with dense baseline, our approach can reduce training time with little accuracy loss. Even though the equivalent smaller dense models with the same effective nodes, i.e., reduced MACs, save more training time, the accuracy is much worse than our DSG approach.

## 4 Related Work

DNN Compression [19] achieved up to 90% weight sparsity by randomly removing connections. [20, 21] reduced the weight parameters by pruning the unimportant connections. However, the compression is mainly achieved on FC layers, that makes it ineffective for CONV layer-dominant networks, e.g., ResNet. Moreover, it is difficult to obtain practical speedup due to the irregularity of the element-wise sparsity. Even if designing ASIC from scratch [37, 38], the index overhead is enormous and it only works under high sparsity. These methods usually require a pre-trained model, iterative pruning and fine-tune retrain, that targets inference optimization.

DNN Acceleration Different from compression, the acceleration work consider more on the sparse pattern. In contrast to the fine-grain compression, coarse-grain sparsity was further proposed to optimize the execution speed. Channel-level sparsity was gained by removing unimportant weight filters [60], training penalty coefficients [22], or introducing group-lasso optimization [25, 24, 40]. [26] introduced a L2-norm group-lasso optimization for both medium-grain sparsity (row/column) and coarse-grain weight sparsity (channel/filter/layer). [27] introduced the Taylor expansion for neuron pruning. However, it just benefits the inference acceleration, and the extra solving of the optimization problem usually makes the training more complicated. [30] demonstrated predicting important neurons then bypassed the unimportant ones via low-precision pre-computation on small networks. [29] leveraged the randomized hashing to predict the important neurons. However, the hashing search aims at finding neurons whose weight bases are similar to the input vector, which cannot estimate the inner product accurately thus will probably cause significant accuracy loss on large models. [28] used a straightforward top-k pruning on the back propagated errors for training acceleration. But they only simplified the backward pass and presented the results on tiny FC models. Furthermore, the BN compatibility problem that is very important for large-model training still remains untouched. [32] pruned the gradients for accelerating distributed training, but the focus is on multi-node communication rather than the computation topic discussed in this paper.

## 5 Conclusion

In this work, we propose DSG (dynamic and sparse graph) structure for efficient DNN training and inference through a DRS (dimension reduction search) sparsity forecast for compressive memory and accelerative execution and a DMS (double-mask selection) for BN compatibility without sacrificing model’s expressive power. It can be easily extended to the inference by using the same selection pattern after training. Our experiments over various benchmarks demonstrate significant memory saving (4.5x for training and 1.7x for inference) and computation reduction (2.3x for training and 4.4x for inference). Through significantly boosting both forward and backward passes in training, as well as in inference, DSG promises efficient deep learning in both the cloud and edge.

## References

- [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015.
- [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- [4] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on audio, speech, and language processing, vol. 22, no. 10, pp. 1533–1545, 2014.
- [5] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” arXiv preprint, vol. 1612, 2016.
- [6] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
- [7] J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, “Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 6359–6363, IEEE, 2014.
- [8] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, “Tensorizing neural networks,” in Advances in Neural Information Processing Systems, pp. 442–450, 2015.
- [9] T. Garipov, D. Podoprikhin, A. Novikov, and D. Vetrov, “Ultimate tensorization: compressing convolutional and fc layers alike,” arXiv preprint arXiv:1611.03214, 2016.
- [10] Y. Yang, D. Krompass, and V. Tresp, “Tensor-train recurrent neural networks for video classification,” arXiv preprint arXiv:1707.01786, 2017.
- [11] J. M. Alvarez and M. Salzmann, “Compression-aware training of deep networks,” in Advances in Neural Information Processing Systems, pp. 856–867, 2017.
- [12] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1,” arXiv preprint arXiv:1602.02830, 2016.
- [13] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
- [14] L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li, “Gxnor-net: Training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework,” Neural Networks, vol. 100, pp. 49–58, 2018.
- [15] C. Leng, H. Li, S. Zhu, and R. Jin, “Extremely low bit neural network: Squeeze the last bit out with admm,” arXiv preprint arXiv:1707.09870, 2017.
- [16] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in Neural Information Processing Systems, pp. 1508–1518, 2017.
- [17] S. Wu, G. Li, F. Chen, and L. Shi, “Training and inference with integers in deep neural networks,” arXiv preprint arXiv:1802.04680, 2018.
- [18] J. L. McKinstry, S. K. Esser, R. Appuswamy, D. Bablani, J. V. Arthur, I. B. Yildiz, and D. S. Modha, “Discovering low-precision networks close to full-precision networks for efficient embedded inference,” arXiv preprint arXiv:1809.04191, 2018.
- [19] A. Ardakani, C. Condo, and W. J. Gross, “Sparsely-connected neural networks: towards efficient vlsi implementation of deep neural networks,” arXiv preprint arXiv:1611.01427, 2016.
- [20] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, pp. 1135–1143, 2015.
- [21] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
- [22] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2755–2763, IEEE, 2017.
- [23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
- [24] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in International Conference on Computer Vision (ICCV), vol. 2, p. 6, 2017.
- [25] J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” arXiv preprint arXiv:1707.06342, 2017.
- [26] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
- [27] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” 2016.
- [28] X. Sun, X. Ren, S. Ma, and H. Wang, “meprop: Sparsified back propagation for accelerated deep learning with reduced overfitting,” arXiv preprint arXiv:1706.06197, 2017.
- [29] R. Spring and A. Shrivastava, “Scalable and sustainable deep learning via randomized hashing,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 445–454, ACM, 2017.
- [30] Y. Lin, C. Sakr, Y. Kim, and N. Shanbhag, “Predictivenet: An energy-efficient convolutional neural network via zero prediction,” in Circuits and Systems (ISCAS), 2017 IEEE International Symposium on, pp. 1–4, IEEE, 2017.
- [31] T. Zhang, K. Zhang, S. Ye, J. Li, J. Tang, W. Wen, X. Lin, M. Fardad, and Y. Wang, “Adam-admm: A unified, systematic framework of structured weight pruning for dnns,” arXiv preprint arXiv:1807.11091, 2018.
- [32] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
- [33] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch sgd: training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
- [34] Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer, “Imagenet training in minutes,” CoRR, abs/1709.05011, 2017.
- [35] S. L. Smith, P.-J. Kindermans, and Q. V. Le, “Don’t decay the learning rate, increase the batch size,” arXiv preprint arXiv:1711.00489, 2017.
- [36] Y. Wu and K. He, “Group normalization,” arXiv preprint arXiv:1803.08494, 2018.
- [37] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pp. 243–254, IEEE, 2016.
- [38] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, et al., “Ese: Efficient speech recognition engine with sparse lstm on fpga,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84, ACM, 2017.
- [39] A. Jain, A. Phanishayee, J. Mars, L. Tang, and G. Pekhimenko, “Gist: Efficient data encoding for deep neural network training,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 776–789, IEEE, 2018.
- [40] L. Liang, L. Deng, Y. Zeng, X. Hu, Y. Ji, X. Ma, G. Li, and Y. Xie, “Crossbar-aware neural network pruning,” arXiv preprint arXiv:1807.10816, 2018.
- [41] A. S. Morcos, D. G. Barrett, N. C. Rabinowitz, and M. Botvinick, “On the importance of single directions for generalization,” arXiv preprint arXiv:1803.06959, 2018.
- [42] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
- [43] Y. Zhang, J. Yang, and R. Gupta, “Frequent value locality and value-centric data cache design,” ACM SIGPLAN Notices, vol. 35, no. 11, pp. 150–159, 2000.
- [44] N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun, C. Das, M. Kandemir, T. C. Mowry, and O. Mutlu, “A case for core-assisted bottleneck acceleration in gpus: enabling flexible data compression with assist warps,” in ACM SIGARCH Computer Architecture News, vol. 43, pp. 41–53, ACM, 2015.
- [45] M. Rhu, M. O’Connor, N. Chatterjee, J. Pool, Y. Kwon, and S. W. Keckler, “Compressing dma engine: Leveraging activation sparsity for training deep neural networks,” in High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on, pp. 78–91, IEEE, 2018.
- [46] D. Mishkin and J. Matas, “All you need is a good init,” arXiv preprint arXiv:1511.06422, 2015.
- [47] W. B. Johnson and J. Lindenstrauss, “Extensions of lipschitz mappings into a hilbert space,” Contemporary mathematics, vol. 26, no. 189-206, p. 1, 1984.
- [48] K. K. Vu, Random projection for high-dimensional optimization. PhD thesis, Université Paris-Saclay, 2016.
- [49] I. S. Kakade and G. Shakhnarovich, “Cmsc 35900 (spring 2009) large scale learning lecture: 2 random projections,” 2009.
- [50] N. Ailon and B. Chazelle, “The fast johnson–lindenstrauss transform and approximate nearest neighbors,” SIAM Journal on computing, vol. 39, no. 1, pp. 302–322, 2009.
- [51] D. Achlioptas, “Database-friendly random projections,” in Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 274–281, ACM, 2001.
- [52] E. Bingham and H. Mannila, “Random projection in dimensionality reduction: applications to image and text data,” in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 245–250, ACM, 2001.
- [53] P. Li, T. J. Hastie, and K. W. Church, “Very sparse random projections,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 287–296, ACM, 2006.
- [54] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [55] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
- [56] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
- [57] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
- [58] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009.
- [59] E. Wang, Q. Zhang, B. Shen, G. Zhang, X. Lu, Q. Wu, and Y. Wang, “Intel math kernel library,” in High-Performance Computing on the Intel® Xeon Phi™, pp. 167–188, Springer, 2014.
- [60] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, “Soft filter pruning for accelerating deep convolutional neural networks,” arXiv preprint arXiv:1808.06866, 2018.

## Appendix A DRS for Inner Product Preservation

Theorem 1. Given a set of points in (i.e. all and ), and a number of , there exist a linear map and a , for we have

(7) |

for all and .

Proof. According to the definition of inner product and vector norm, any two vectors a and b satisfy

(8) |

It is easy to further get

(9) |

Therefore, we can transform the target in equation (7) to

(10) |

which is also based on the fact that . Now recall the definition of random projection in equation (5) of the main text

(11) |

Substituting equation (11) into equation (10), we have

(12) |

Further recalling the norm preservation in equation (3) of the main text: there exist a linear map and a , for we have

(13) |

Substituting the equation (13) into equation (12) yields

(14) |

Combining equation (12) and (14), finally we have

(15) |

It can be seen that, for any given and pair, the inner product can be preserved if the is sufficiently small. Actually, previous work [51, 52, 48] discussed a lot on the random projection for various big data applications, here we re-organize these supporting materials to form a systematical proof. We hope this could help readers to follow this paper. In practical experiments, there exists a trade-off between the dimension reduction degree and the recognition accuracy. Smaller usually brings more accurate inner product estimation and better recognition accuracy while at the cost of higher computational complexity with larger , and vice versa. Because the and are not strictly bounded, the approximation may suffer from some noises. Anyway, from the abundant experiments in the main text, the effectiveness of our approach for training dynamic and sparse neural networks has been validated.

## Appendix B Implementation and overhead

The training algorithm for producing DSG is presented in Algorithm 1. Furthermore, the generation procedure of the critical neuron mask based on the virtual activations estimated in low-dimensional space is presented in Figure 9, which is a typical top-k search. The k value is determined by the activation size and the desired sparsity . To reduce the search cost, we calculate the first input sample within the current mini-batch and then conduct a top-k search over the whole virtual activation matrix for obtaining the top-k threshold under this sample. The remaining samples share the top-k threshold from the first sample to avoid costly searching overhead. At last, the overall activation mask is generated by setting the mask element to one if the estimated activation is larger than the top-k threshold and setting others to zero. In this way, we greatly reduce the search cost. Note that, for the FC layer, each sample is a vector.

Layers | Dimension | Operations (MMACs) | ||||||||
---|---|---|---|---|---|---|---|---|---|---|

, , | BL | 0.3 | 0.5 | 0.7 | 0.9 | BL | 0.3 | 0.5 | 0.7 | 0.9 |

1024, 1152, 128 | 1152 | 539 | 232 | 148 | 119 | 144 | 67.37 | 29 | 18.5 | 14.88 |

256, 1152, 256 | 1152 | 616 | 266 | 169 | 136 | 72 | 38.5 | 16.63 | 10.56 | 8.5 |

256, 2304, 256 | 2304 | 616 | 266 | 169 | 136 | 144 | 38.5 | 16.63 | 10.56 | 8.5 |

64, 2304, 512 | 2304 | 693 | 299 | 190 | 154 | 72 | 21.65 | 9.34 | 5.94 | 4.81 |

64, 4608, 512 | 4608 | 693 | 299 | 190 | 154 | 144 | 21.65 | 9.34 | 5.94 | 4.81 |

Furthermore, we investigate the influence of the on the DRS computation cost for importance estimation. We take several layers from the VGG8 on CIFAR10 as a study case, as shown in Table 1. With larger, the DRS can achieve lower dimension with much fewer operations. The average dimension reduction is 3.6x (), 8.5x (), 13.3x (), and 16.5x (). The resulting operation reduction is 3.1x, 7.1x, 11.1x, and 13.9x, respectively.