SGAD: Soft-Guided Adaptively-Dropped Neural Network
Abstract
Deep neural networks (DNNs) have been shown to contain considerable redundancy, and many efforts have been made to compress them. However, existing model compression methods treat all input samples equally, ignoring the fact that different input samples differ in how difficult they are to classify correctly. To address this problem, DNNs with an adaptive dropping mechanism are explored in this work. To inform the DNNs of how difficult an input sample is to classify, a guideline that captures information about the input samples is introduced to improve performance. Based on the developed guideline and adaptive dropping mechanism, an innovative soft-guided adaptively-dropped (SGAD) neural network is proposed in this paper. Compared with 32-layer residual neural networks, the presented SGAD can considerably reduce the FLOPs with only a marginal drop in accuracy on CIFAR-10.
Zhisheng Wang, Fangxuan Sun, Jun Lin, Zhongfeng Wang and Bo Yuan School of Electronic Science and Engineering, Nanjing University, P.R. China Department of Electrical Engineering, City University of New York, City College {zswang, fxsun}@smail.nju.edu.cn {jlin, zfwang}@nju.edu.cn, byuan@ccny.cuny.edu
Preprint. Work in progress. Authors contributed equally.
1 Introduction
Deep neural networks (DNNs) have achieved state-of-the-art accuracy and gained wide adoption in various artificial intelligence (AI) fields, such as computer vision, speech recognition, and natural language processing He et al. (2016); Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Amodei et al. (2015). However, the remarkable accuracy of DNNs comes at the expense of huge computational cost, which has already posed severe challenges to existing DNN computing hardware in terms of processing time and power consumption. Even worse, it is widely acknowledged that the computational cost of modern DNNs will continue to increase rapidly due to the ever-growing demand for improved accuracy in AI applications. Considering the limited progress of hardware technology, the huge computational cost of DNNs, if not properly addressed, would largely prevent their large-scale deployment on resource-constrained platforms, such as mobile devices and Internet-of-Things (IoT) equipment.
To address this challenge, several computation-reducing approaches have been proposed in Wen et al. (2016, 2017); Han et al. (2016); Sun et al. (2016); Garipov et al. (2016). To date, most existing works focus on modifying popular DNN architectures via different techniques (such as pruning and decomposition). In these model-pruning/decomposition works, all input samples are treated equally and are processed by all the layers of the DNN. However, shallow models with relatively limited capacity can also correctly classify some input samples; different samples in the same dataset therefore exhibit different levels of difficulty for accurate classification. By leveraging this characteristic, an input-specific adaptive computational approach can be exploited to avoid unnecessary computation.
A natural way to skip a layer is to add a bypass that directly outputs the input. Among various DNNs, residual networks He et al. (2016) (ResNets) exhibit a unique architecture that is friendly to such an adaptive computational approach; hence, this paper focuses on the adaptive computation of ResNets. There are two more reasons for choosing ResNets: 1) ResNet is currently the most popular and widely deployed DNN architecture, especially in the computer vision field; 2) previous work Veit et al. (2016) showed that ResNets can be seen as ensembles of many shallow blocks with weak dependencies, which can be utilized for adaptive computation.
In this paper, we propose a novel end-to-end trainable soft-guided adaptively-dropped neural network (SGAD) to reduce input-specific redundant computations while retaining high accuracy. As shown in Fig. 1, all blocks in the original ResNet are always busy. In SGAD, by contrast, each block is adaptively dropped according to the input sample. To smartly and efficiently decide which blocks should be dropped, a soft guideline is developed to generate a group of discrete masks. Experimental results show that the proposed SGAD can substantially reduce the number of floating-point operations (FLOPs) with negligible accuracy loss compared with ResNet-32 on CIFAR-10. On CIFAR-100, SGAD can improve accuracy while using fewer FLOPs than ResNet-32. The contributions of this paper are summarized as follows:

A novel soft-information-based guideline is proposed to quantify how difficult an input sample is to classify correctly. This guideline is then used to direct the expected drop ratio of residual blocks during training via an efficient mapping strategy. At the inference stage, the guideline can be removed without incurring additional overhead.

We introduce a small but efficient model with binary outputs, which determines the positions of the layers to be skipped for the current input sample under the direction of the proposed guideline. The straight-through estimator (STE) Bengio et al. (2013) is introduced to approximate the gradient of the non-differentiable rounding function during the training phase.

The learned dropping behavior of SGAD is explored. Our experiments show that layers of the original network (e.g., ResNet-32) that contribute less to the model capacity are more likely to be dropped in the SGAD-based model (e.g., SGAD-32).
2 Related Works
The proposed SGAD is motivated by recent studies exploring the behavior of residual networks. Veit et al. found that residual networks can be seen as ensembles of many weakly-dependent paths of varying lengths, where only the short paths are needed during training Veit et al. (2016). Besides, removing individual layers from a trained residual network at test time only leads to misclassification of a few borderline samples, with minor accuracy drop Greff et al. (2016). These observations indicate that most input samples may be classified correctly with a limited number of layers; thus, we can adaptively allocate different computation budgets to "easy" and "hard" samples. Several approaches have been proposed based on this concept.
The early-termination approaches Bolukbasi et al. (2017); Teerapittayanon et al. (2016); Panda et al. (2016) add side-branch classifiers inside a deep neural network. Input samples that are judged as classifiable by a certain side-branch classifier can then exit from the network immediately without executing the whole model. In contrast, the proposed SGAD enables adaptive-computation behavior by utilizing the ensemble nature of residual networks. Neither hand-crafted network architectures nor extra side-branch classifiers are needed, making our approach simpler and more effective.
The adaptive computation approaches are closest to our work. Spatially Adaptive Computation Time (SACT) Figurnov et al. (2016) dynamically decides the number of executed layers inside a set of residual units. SkipNet Wang et al. (2017) and BlockDrop Wu et al. (2018) utilize reinforcement learning to dynamically choose the executed residual units in a pre-trained ResNet for different input samples. Adanets Andreas and Serge (2017) enable adaptive computation graphs by adding layer-wise gating functions that decide whether to skip the computation of a certain layer. Different from these approaches, the proposed SGAD uses a shallow network, whose behavior is guided by an extra guideline during training, to generate binary vectors that adaptively mask the unused residual units. Compared with the above-mentioned approaches, SGAD achieves higher savings in computational cost with no accuracy loss in most cases.
3 SoftGuided AdaptivelyDropped Approach
In this section, we present the soft-guided adaptively-dropped neural network. First, we introduce a binary mask network (BMNet) to decide which blocks should be used for a specific input. The BMNet is quite small and hence introduces very little computation overhead. Then, to solve the non-differentiability problem incurred by using discrete binary masks in the training phase, the straight-through estimator (STE) Bengio et al. (2013) is introduced to approximate the gradient of the non-differentiable rounding function during back-propagation. Finally, we propose a soft guideline network (SGNet) to improve the overall classification accuracy. The SGNet extracts the soft information of different inputs during the training phase and thereby aids the training of the BMNet through a regularization term that forces the BMNet to drop blocks dynamically. At the inference phase, the regularization term is no longer used, so the SGNet can be removed.
3.1 Binary Mask
Generally, the main part of a ResNet consists of several blocks. Let x_l and x_{l+1} be the input and output of the (l+1)-th block, respectively. The computation of x_{l+1} is shown below:

x_{l+1} = x_l + F(x_l, W_l),    (1)

where W_l denotes the weights of the block and the details of the residual function F can be found in He et al. (2016). Besides the blocks, ResNets usually start with a single convolutional layer and end with a fully-connected layer.
Binary Mask: As indicated in the first paragraph of this section, we introduce a binary mask to determine whether each block should be skipped in the inference of a specific input sample. Specifically, for the (l+1)-th block, its output determined by the binary mask is as follows:

x_{l+1} = x_l + G_l(x_0; \theta) \cdot F(x_l, W_l),    (2)

where x_0 denotes the input of the first block, namely the output of the single convolutional layer in the ResNet, and G_l \in {0, 1} is a hypothesis with weights \theta which decides whether or not this block should be dropped. To simplify the derivation of the gradient, Eq. (2) can be rewritten in the following unrolled form:

x_L = x_l + \sum_{i=l}^{L-1} G_i(x_0; \theta) \cdot F(x_i, W_i),    (3)

where L is the total number of blocks.
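The block-skipping computation of Eq. (2) can be sketched in plain Python as follows; the scalar activations, `block_fns`, and the mask list are hypothetical stand-ins for the feature maps, the residual functions F(., W_l), and the BMNet outputs G_l.

```python
def masked_residual_forward(x, block_fns, masks):
    """Run a stack of residual blocks, skipping those whose mask is 0.

    x         -- input activation (a plain float here, for illustration)
    block_fns -- list of residual functions, standing in for F(., W_l)
    masks     -- list of binary BMNet decisions G_l (1 = execute block)
    """
    for f, m in zip(block_fns, masks):
        if m == 1:
            x = x + f(x)  # Eq. (2): x_{l+1} = x_l + G_l * F(x_l, W_l)
        # when m == 0 only the identity shortcut is taken,
        # so the block's FLOPs are saved entirely
    return x
```

For example, with three toy blocks `f(x) = 0.1 * x` and masks `[1, 0, 1]`, the middle block is bypassed and contributes neither computation nor output.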
Assume that a batch of data contains N pairs (x^{(i)}, y^{(i)}) during the training phase. The weights of the (l+1)-th block of the ResNet are denoted by W_l. The update of W_l at the (t+1)-th iteration can be written as follows:

W_l^{t+1} = W_l^{t} - \eta \cdot \partial\mathcal{L}^{t}/\partial W_l,    (4)

where \eta and \mathcal{L}^{t} denote the learning rate used in the training phase and the training loss at the t-th iteration, respectively.
For the original ResNets, the gradient can be represented as follows:

\partial\mathcal{L}/\partial W_l = (\partial\mathcal{L}/\partial x_L) \cdot (1 + \partial(\sum_{i=l+1}^{L-1} F(x_i, W_i))/\partial x_{l+1}) \cdot \partial F(x_l, W_l)/\partial W_l,    (5)

where L denotes the number of blocks in the ResNet, \partial\mathcal{L}/\partial x_L denotes the differential of \mathcal{L} with respect to x_L, and \partial F(x_l, W_l)/\partial W_l denotes the differential of F with respect to W_l. Taking the binary mask into consideration, the gradient becomes:

\partial\mathcal{L}/\partial W_l = G_l \cdot (\partial\mathcal{L}/\partial x_L) \cdot (1 + \partial(\sum_{i=l+1}^{L-1} G_i F(x_i, W_i))/\partial x_{l+1}) \cdot \partial F(x_l, W_l)/\partial W_l.    (6)
Generally, the gradients calculated in the training phase are much smaller than 1 (see the gradient magnitudes discussed in Section 4.1). Hence, the summation term inside the parentheses in Eqs. (5) and (6) can be neglected, i.e., the factor (1 + \cdot) can be seen as 1. Eq. (6) can thus be simplified to:

\partial\mathcal{L}/\partial W_l \approx G_l \cdot (\partial\mathcal{L}/\partial x_L) \cdot \partial F(x_l, W_l)/\partial W_l.    (7)
Let p_l be the ratio of input samples in a batch for which the (l+1)-th block is not dropped. Combining Eqs. (5)-(7) and the definition of p_l, the update of the weights of a ResNet with binary masks can be approximated as follows:

W_l^{t+1} \approx W_l^{t} - \eta_l \cdot \partial\mathcal{L}^{t}/\partial W_l |_{G_l = 1},    (8)

where \eta_l = p_l \eta is the actual learning rate and the gradient is averaged over the samples that keep the block. Since p_l differs between blocks, each block has a unique learning rate. Hence, the proposed binary mask can adaptively adjust the learning rate of different blocks according to their level of contribution to the model capacity (LCMC) Veit et al. (2016). To explore which blocks contribute more to the model capacity, we study the dropping behavior of SGAD in detail in Section 4.1.
The design of the binary mask network, called BMNet, is shown in Fig. 2(b). Note that small perturbations can result in quite different binary masks if the output of the sigmoid unit is near the rounding threshold (0.5), making the BMNet unstable. Inspired by Salakhutdinov and Hinton (2009), additive noise is injected before the sigmoid unit. The magnitude of the noise is increased over time so that the magnitude of the sigmoid inputs also increases, alleviating the impact of the noise. With this method, the sigmoid unit is trained to saturate at nearly 0 or 1 for all input samples. Hence, more stable and confident decisions are generated during both the training and the inference phases.
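The decision unit described above can be sketched as follows (the helper names are hypothetical; a real implementation would live inside an autograd framework such as PyTorch). The forward pass injects additive Gaussian noise before the sigmoid and rounds the result, while the straight-through estimator replaces the zero gradient of the rounding with the identity:

```python
import math
import random

def noisy_binarize(logit, noise_scale, rng=None):
    """Forward pass of the BMNet decision unit (sketch).

    Additive noise before the sigmoid pushes training toward saturated,
    confident outputs; its scale is annealed upward over time. Rounding
    the sigmoid output at 0.5 yields the binary mask.
    """
    rng = rng or random.Random(0)
    noisy = logit + rng.gauss(0.0, noise_scale)
    prob = 1.0 / (1.0 + math.exp(-noisy))  # sigmoid
    return 1 if prob >= 0.5 else 0         # hard {0, 1} decision

def ste_backward(upstream_grad):
    """Straight-through estimator: treat d round(x)/dx as 1, so the
    upstream gradient flows through the rounding unchanged."""
    return upstream_grad
```

A saturated logit (far from 0) gives the same decision regardless of moderate noise, which is exactly the stability the annealing schedule aims for.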
3.2 Soft Guideline
With the proposed BMNet, an adaptively dropped ResNet can be realized. However, how the BMNet should decide the dropping ratio is unknown. Considering that our goal is to make the network adaptively adjust its computational complexity according to the classification difficulty of the input samples, information about whether an input sample is easy to classify should be generated and sent to the BMNet to improve the correctness of its decisions. Based on this concept, an additional network, called the soft guideline network (SGNet), is proposed to produce the required information and guide the dropping behavior of the BMNet.
Soft Guideline: Generally, each input sample is coupled with a hard target, which only contains the information of the ground-truth label class. Whether an input sample is easy to classify cannot be inferred from the hard target. Inspired by Hinton et al. (2015), the soft target, namely the class probabilities produced by the softmax layer, can provide much more information than the hard target. In this paper, the soft target of the SGNet is used to obtain information indicating the difficulty of classification. More specifically, the variance of the soft target is used as the guideline. For an input sample x_i, the corresponding variance v_i can be written as follows:

v_i = (1/C) \sum_{j=1}^{C} (s_{ij} - 1/C)^2,    (9)

where C is the number of classes and the s_{ij} are the elements of the softmax output for x_i (their mean is exactly 1/C, since they sum to one). Intuitively, a smaller value of v_i indicates that the SGNet is less confident in its classification result; thus, x_i tends to be harder to classify correctly.
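Eq. (9) can be computed directly from a softmax vector; since the probabilities sum to one, their mean is always 1/C. A minimal sketch:

```python
def guideline_variance(softmax_out):
    """Variance of a softmax output vector (Eq. (9)).

    The mean of the C probabilities is exactly 1/C, so a near-uniform
    (unconfident) output gives a variance near 0, while a peaked
    (confident) output gives a large variance.
    """
    c = len(softmax_out)
    mean = 1.0 / c
    return sum((s - mean) ** 2 for s in softmax_out) / c
```

A uniform output over 4 classes yields variance 0 (a "hard" sample), while a confident output such as `[0.97, 0.01, 0.01, 0.01]` yields a much larger variance (an "easy" sample).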
In order to make the BMNet learn to adaptively drop more (fewer) residual blocks for easily (hardly) classified input samples, the guideline v_i is first transformed into an expected drop ratio d_e; easily classified samples receive a higher d_e. Then, the L1-norm between d_e and the drop ratio calculated by the BMNet over all input samples in a batch, denoted as \mathcal{L}_{reg}, is added to the loss function as a regularizer that pushes the BMNet to allocate the desired drop ratio to different input samples:

\mathcal{L}_{reg} = (1/N) \sum_{i=1}^{N} | d_e^{(i)} - d_m^{(i)} |,    (10)

where d_m is the measured average drop ratio (computed from the BMNet outputs). This regularizer pushes the actual drop ratio and the desired drop ratio closer.
Based on the above discussion, a proper transformation is needed to map the guideline variance to the expected drop ratio. The details of the transformation are given in the following part.
Mapping Strategy: One simple intuition is to map a larger v_i to a larger d_e, since input samples judged as easy by the SGNet are expected to bypass more blocks. Generally, a relatively shallow network can correctly classify a large proportion of input samples, indicating that most input samples are "easy" and only a few are hard to classify correctly. Based on this observation, an exponent-function-based mapping strategy is adopted:

d_e = d_{max} \cdot (1 - e^{-v_i / T}),    (11)

where d_{max} denotes the allowed maximum drop ratio and the constant T > 0 transforms v_i to the level of difficulty. Considering d_{max} < 1 (to avoid a model with too little complexity) and v_i \geq 0, we get 0 \leq d_e < d_{max}. The proposed mapping strategy tends to map most v_i values to a large d_e, which is consistent with the distribution of the "easy" and "hard" samples discussed above.
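Since the exact mapping function is not fully recoverable from this copy of the text, the sketch below uses one plausible exponent-based form with the stated properties: monotonically increasing in the guideline variance, bounded by the maximum drop ratio, and assigning large drop ratios to most (easy) samples. `d_max` and `temperature` are illustrative values, not the paper's.

```python
import math

def expected_drop_ratio(v, d_max=0.8, temperature=0.01):
    """Map guideline variance v to an expected drop ratio in [0, d_max).

    This is a hypothetical reconstruction of the exponent-function-based
    mapping: it saturates quickly toward d_max, so most samples (which
    have a large, "easy" variance) receive a large expected drop ratio.
    """
    return d_max * (1.0 - math.exp(-v / temperature))
```

The fast saturation is the design point: even moderately confident samples are mapped near the maximum drop ratio, matching the observation that most samples are easy.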
At the training phase, the SGAD, which includes the BMNet, the SGNet, and the ResNet, can be trained end-to-end from scratch. As shown in Fig. 2, input samples are fed to the ResNet and the SGNet simultaneously. The SGNet outputs its own classification results as well as the guideline. The BMNet takes the output of the first layer of the ResNet to produce the binary mask. The ResNet learns to adaptively drop the remaining residual blocks based on the output of the binary mask and also produces its own classification results. Then, all the weights in SGAD are updated based on the regularization loss \mathcal{L}_{reg}, the classification error of the ResNet \mathcal{L}_{res}, and the classification error of the SGNet \mathcal{L}_{sg}. The final loss function of SGAD can be expressed as:

\mathcal{L} = \alpha \mathcal{L}_{res} + \beta \mathcal{L}_{reg} + \gamma \mathcal{L}_{sg},    (12)

where \alpha, \beta, and \gamma denote the weighting factors for the ResNet, the BMNet, and the SGNet, respectively. During inference, the regularization term is no longer needed; thus, the SGNet can be removed and only the BMNet and the ResNet are kept after training.
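The overall objective of Eq. (12) together with the L1 regularizer of Eq. (10) can be sketched as below; the function and argument names are hypothetical, and the default weights follow the values reported in the training details of Section 4.

```python
def sgad_loss(loss_resnet, loss_sgnet, expected_drop, measured_drop,
              alpha=1.0, beta=1.0, gamma=0.3):
    """Total SGAD training loss (sketch of Eq. (12)).

    The L1 term (Eq. (10)) pulls the BMNet's measured per-sample drop
    ratios toward the guideline's expected ratios; alpha, beta, gamma
    weight the ResNet loss, the regularizer, and the SGNet loss.
    """
    n = len(expected_drop)
    l_reg = sum(abs(e - m) for e, m in zip(expected_drop, measured_drop)) / n
    return alpha * loss_resnet + beta * l_reg + gamma * loss_sgnet
```

When the measured drop ratios already match the expected ones, the regularizer vanishes and only the two classification losses remain.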
4 Experiments
We evaluate the performance of the proposed SGAD on two datasets: CIFAR-10 and CIFAR-100. The influence of the guideline is investigated. In addition, we also explore the contributions of different blocks to the model capacity and the dropping behavior of SGAD.
Model Size: Both ResNet-32 and ResNet-110 are adopted as baselines in our experiments; the details of their structures can be found in He et al. (2016). The design of the BMNet is crucial to the overall complexity of SGAD. On CIFAR-10, the use of the BMNets introduces only marginal computation overheads compared with ResNet-32 and ResNet-110, and the memory overheads of the BMNets relative to both baselines are likewise small. Hence, using the proposed BMNet renders only very minor overheads.
Training Details: PyTorch is used to implement SGAD. Stochastic gradient descent with momentum 0.9 is used as the optimizer. The learning rate is initialized at 0.1 and decayed after the 128th, 160th, and 192nd epochs. SGAD is trained for 220 epochs with a batch size of 128. The weighting factors \alpha, \beta, and \gamma are set to 1.0, 1.0, and 0.3 by default, respectively. In our experiments, adjusting a single hyper-parameter while leaving the others at their defaults is sufficient to control the dropping behavior and works in most cases. The last block in SGAD is kept active for all inputs in order to ensure more robust output predictions.
4.1 Comparisons and Discussion
We train the SGAD model under two typical settings: 1) a relatively smaller maximum drop ratio, resulting in a model (MF-SGAD) with more FLOPs; 2) a larger maximum drop ratio, which produces a model (LF-SGAD) with fewer FLOPs. For the latter case, we fine-tune the model from a pre-trained MF-SGAD to obtain faster convergence instead of training from random initialization. The performance of MF-SGAD and LF-SGAD is shown in Table 1. For comparison, we also provide the training results of the original ResNets. In most cases, under the smaller maximum drop ratio, SGAD achieves comparable or even better accuracy with fewer FLOPs than the original ResNets, which indicates the effectiveness of the proposed SGAD. More aggressive reduction in FLOPs can also be obtained under the larger maximum drop ratio at the cost of a small accuracy loss. For example, the FLOPs can be reduced by 77% with only a 0.87% loss in accuracy (CIFAR-10, 110 layers).
Table 1: Accuracy (%) and normalized FLOPs (nFLOPs) of MF-SGAD and LF-SGAD compared with the original ResNets.

Dataset    | Layers | ResNet accuracy | MF-SGAD accuracy | MF-SGAD nFLOPs | LF-SGAD accuracy | LF-SGAD nFLOPs
CIFAR-10   | 32     | 93.02           | 93.11            | 0.86           | 92.18            | 0.47
CIFAR-10   | 110    | 94.57           | 94.20            | 0.86           | 93.70            | 0.23
CIFAR-100  | 32     | 70.38           | 70.85            | 0.77           | 70.09            | 0.71
CIFAR-100  | 110    | 73.94           | 73.94            | 0.94           | 73.94            | 0.75
Comparisons with Existing Works: In this subsection, we compare the proposed SGAD with previous works. The results are shown in Fig. 3, which contains the results of SACT Figurnov et al. (2016), ACT Figurnov et al. (2016), SkipNet Wang et al. (2017), and BlockDrop Wu et al. (2018). The proposed SGAD outperforms all of these networks in most cases.
For ACT and SACT, since results on CIFAR are not reported, we conduct the experiments using the code provided by the authors of SACT. Compared with SACT, the FLOPs of SGAD can be considerably reduced with even higher accuracy on CIFAR-10; on CIFAR-100, the accuracy can be enhanced with fewer FLOPs. The proposed SGAD also outperforms other algorithms such as ACT and SkipNet. Compared with BlockDrop, which currently achieves the state-of-the-art results, SGAD can also improve the accuracy with less computational complexity on CIFAR-10.
Discussion: The dropping behavior of SGAD is explored here. The experiments are conducted using ResNet-32 and SGAD-32 on CIFAR-10. Fig. 4 compares the magnitude of gradients and the normalized FLOPs (nFLOPs) of each block. Since the last block is always kept active in SGAD, its nFLOPs is always 1 and is not listed. It is worth noting that in ResNet-32, every 5 blocks share the same number of output channels; "C-block" is used here to denote such a cluster of 5 blocks. From Fig. 4 we can make the following observations:

In the original ResNets, different blocks usually have different magnitudes of gradients (MGs). In each C-block, the MGs decrease gradually from the first block to the fifth block. This phenomenon shows that the first several blocks in a C-block have relatively higher LCMC than the others. This observation is consistent with the reports of previous works Veit et al. (2016); Jastrzebski et al. (2017).

According to our experiments, in each C-block, the dropping behavior is closely related to the MGs: blocks with higher MGs usually have higher nFLOPs. Combining the analysis in Section 3.1 with the experimental results, the blocks with lower MGs tend to be skipped; thus, the updates of these blocks are further decreased in SGAD. To reduce the FLOPs while maintaining performance, SGAD tries to keep the blocks with higher LCMC.

As shown in Fig. 4, some blocks are skipped by all input samples after training, which leads to zero nFLOPs for these blocks. As an additional benefit, these dead blocks can be removed during inference to reduce the memory storage requirement.
5 Conclusion and Future Work
SGAD is proposed to exploit an adaptive processing pattern for different input samples. To enable the propagation of gradients, the STE is introduced to approximate the non-differentiable rounding function during the training phase. The information contained in the softmax layer is exploited to inform the SGAD of how difficult various input samples are to classify correctly. In addition, a dedicated mapping strategy is introduced to connect this difficulty with the drop ratio. The experiments demonstrate that the proposed SGAD outperforms previous works under the same baselines. Since the reduction in FLOPs may not accurately reflect real running latencies on different hardware devices (e.g., CPUs, GPUs), we will conduct real speedup measurements in future work.
References
 Amodei et al. [2015] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, et al. Deep speech 2: Endtoend speech recognition in english and mandarin. arXiv preprint arXiv:1512.02595, 2015.
 Andreas and Serge [2017] V. Andreas and B. Serge. Convolutional networks with adaptive computation graphs. arXiv preprint arXiv:1711.11503, 2017.
 Bengio et al. [2013] Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons. arXiv preprint arXiv:1308.3432, 2013.
 Bolukbasi et al. [2017] T. Bolukbasi, J. Wang, O. Dekel, and V. Saligrama. Adaptive neural networks for fast testtime prediction. arXiv preprint arXiv:1702.07811, 2017.
 Figurnov et al. [2016] M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. P. Vetrov, and R. Salakhutdinov. Spatially adaptive computation time for residual networks. arXiv preprint arXiv:1612.02297, 2016.
 Garipov et al. [2016] T. Garipov, D. Podoprikhin, A. Novikov, and D. Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. arXiv preprint arXiv:1611.03214, 2016.
 Greff et al. [2016] K. Greff, R. K. Srivastava, and J. Schmidhuber. Highway and residual networks learn unrolled iterative estimation. arXiv preprint arXiv:1612.07771, 2016.
 Han et al. [2016] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016.
 He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 Hinton et al. [2015] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Jastrzebski et al. [2017] S. Jastrzebski, D. Arpit, N. Ballas, V. Verma, T. Che, and Y. Bengio. Residual connections encourage iterative inference. arXiv preprint arXiv:1710.04773, 2017.
 Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 Panda et al. [2016] P. Panda, A. Sengupta, and K. Roy. Conditional deep learning for energyefficient and enhanced pattern recognition. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016, pages 475–480. IEEE, 2016.
 Salakhutdinov and Hinton [2009] R. Salakhutdinov and G. Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.
 Simonyan and Zisserman [2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Sun et al. [2016] F. Sun, J. Lin, and Z. Wang. Intralayer nonuniform quantization of convolutional neural network. In Wireless Communications & Signal Processing (WCSP), 2016 8th International Conference on, pages 1–5. IEEE, 2016.
 Teerapittayanon et al. [2016] S. Teerapittayanon, B. McDanel, and H. Kung. Branchynet: Fast inference via early exiting from deep neural networks. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pages 2464–2469. IEEE, 2016.
 Veit et al. [2016] A. Veit, M. J. Wilber, and S. J. Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, pages 550–558, 2016.
 Wang et al. [2017] X. Wang, F. Yu, Z. Dou, and J. E. Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. arXiv preprint arXiv:1711.09485, 2017.
 Wen et al. [2016] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.
 Wen et al. [2017] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li. Coordinating filters for faster deep neural networks. arXiv preprint arXiv:1703.09746, 2017.
 Wu et al. [2018] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris. Blockdrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8817–8826, 2018.