S2DNAS: Transforming Static CNN Model for Dynamic Inference via Neural Architecture Search
Abstract
Recently, dynamic inference has emerged as a promising way to reduce the computational cost of deep convolutional neural networks (CNNs). In contrast to static methods (e.g., weight pruning), dynamic inference adaptively adjusts the inference process according to each input sample, which can considerably reduce the computational cost on “easy” samples while maintaining the overall model performance.
In this paper, we introduce a general framework, S2DNAS, which can transform various static CNN models to support dynamic inference via neural architecture search. To this end, based on a given CNN model, we first generate a CNN architecture space in which each architecture is a multi-stage CNN generated from the given model using some predefined transformations. Then, we propose a reinforcement learning based approach to automatically search for the optimal CNN architecture in the generated space. Finally, with the searched multi-stage network, we can perform dynamic inference by adaptively choosing a stage to evaluate for each sample. Unlike previous works that introduce irregular computations or complex controllers into the inference, or redesign a CNN model from scratch, our method generalizes to most popular CNN architectures, and the searched dynamic network can be directly deployed using existing deep learning frameworks on various hardware devices.
1 Introduction
In recent years, deep convolutional neural networks (CNNs) have achieved great success in many computer vision tasks, such as image classification [21, 13, 18], object detection [35, 33, 30], and image segmentation [12, 4]. However, the remarkable performance of CNNs always comes with a huge computational cost, which impedes their deployment on resource-constrained hardware devices. Thus, various methods have been proposed to improve the computational efficiency of CNN inference, including network pruning [22, 11, 23] and weight quantization [9, 45, 32]. Most previous methods are static approaches, which use a fixed computation graph for all test samples.
Recently, dynamic inference has emerged as a promising alternative that speeds up CNN inference by dynamically changing the computation graph according to each input sample [40, 2, 6, 5, 46, 17, 29, 8]. The basic idea is to allocate less computation to "easy" samples and more computation to "hard" ones. As a result, dynamic inference can considerably save the computational cost of "easy" samples without sacrificing the overall model performance. Moreover, dynamic inference can naturally exploit the trade-off between accuracy and computational cost to meet varying requirements (e.g., computational budgets) in real-world scenarios.
To enable the dynamic inference of a CNN model, most previous works develop dedicated strategies to dynamically skip some computation operations during CNN inference according to different input samples. To achieve this goal, these works add extra controllers in between the layers of the original model to select which computations are executed. For example, well-designed gate functions were proposed as controllers that select a subset of channels or pixels for the subsequent computation of a convolution layer [8, 5, 15]. However, these methods lead to irregular computation at the channel or spatial level, which is not efficiently supported by existing software and hardware devices [43, 47, 16]. To address this issue, a more aggressive strategy that dynamically skips whole layers was proposed for efficient inference [46, 42, 41]. Unfortunately, this strategy can only be applied to CNN models with residual connections [13]. Moreover, the controllers of some methods come with considerably complex structures, which increase the overall computational cost of inference (see experimental results in Section 4).
To mitigate these problems, researchers propose to exit "easy" input samples early at inference time [31, 40, 17, 1]. A typical solution is to add intermediate prediction layers at multiple layers of a normal CNN model and exit the inference when the confidence score of an intermediate classifier is higher than a given threshold. Figure 1a shows the paradigm of these early exiting methods [31, 1]. In this paradigm, prediction layers are directly added in between the original network, and the network is split into multiple stages along the layer depth. However, these solutions face the challenge that early classifiers are unable to leverage the semantic-level features produced by the deeper layers, which may cause a significant accuracy drop [17].
(Figure 1a: three prediction layers are added at different depths of the network.)
Huang et al. [17] proposed a novel CNN model, called MSDNet, to solve this issue. The core design of MSDNet is a two-dimensional multi-scale architecture that maintains coarse- and fine-level features in every layer, as shown in Figure 1b. Based on this design, MSDNet can leverage semantic-level features in every prediction layer and achieves the best results. However, MSDNet requires a specially designed network architecture, which cannot generalize to other CNN models and demands massive expertise in architecture design.
To solve the aforementioned issue without designing CNNs from scratch, we propose to transform a given CNN model into a channel-wise multi-stage network, which comes with the advantage that the classifiers in the early stages can leverage semantic-level features. Figure 1c intuitively demonstrates the idea behind our method. Different from the normal paradigm in Figure 1a, our method splits the original network into multiple stages along the channel width. The prediction layers are added only to the last convolutional layer, so all classifiers can leverage the semantic-level features. To reduce the computational cost of the classifiers in the early stages, we propose to cut down the number of channels of each layer in different stages (more details can be found in Section 3).
Based on the high-level idea introduced above, we present a general framework called S2DNAS. Given a specific CNN model, the framework can automatically generate a dynamic model following the paradigm shown in Figure 1c. S2DNAS consists of two components: S2D and NAS. First, the component S2D, which stands for "static to dynamic", is used to generate a CNN model space based on the given model. This space comprises different multi-stage CNNs generated from the given model using the predefined transformations. Then, NAS is used to search for the optimal model in the generated space with the help of reinforcement learning. Specifically, we devise an RNN to decide the setting of each transformation for generating the model. To exploit the trade-off between accuracy and computational cost, we design a reward function that reflects both the classification accuracy and the computational cost, inspired by prior works [39, 14, 44]. We then use a policy-gradient based algorithm [37] to train the RNN. The RNN generates better CNN models as reinforcement learning proceeds, and we can then use the searched model for dynamic inference.
To verify the effectiveness of S2DNAS, we perform extensive experiments applying our method to various CNN models. With comparable model accuracy, our method achieves greater computation reduction than previous works on dynamic inference.
2 Related Work
Static Methods for Efficient CNN Inference. Numerous methods have been proposed for improving the efficiency of CNN inference. Two representative research directions are network pruning [22, 11, 10, 23] and quantization [9, 45, 26, 32]. Specifically, network pruning aims to remove redundant weights in a well-trained CNN without sacrificing model accuracy, while network quantization aims to reduce the bit-width of both activations and weights. Most works in these two directions are static, i.e., they use the same computation graph for all test samples. Next, we introduce an emerging direction that utilizes dynamic inference to improve the efficiency of CNN inference.
Dynamic Inference. Dynamic inference is also referred to as adaptive inference in previous works [41, 24]. Most previous works aim to develop dedicated strategies to dynamically skip some computation during inference. They attempted to add extra controllers that select which computations are executed [5, 3, 34, 25, 15, 8, 46, 41, 42]. Dong et al. [5] proposed to compute a spatial attention map using extra convolutional layers and then skip the computation of inactive pixels. Gao et al. [8] proposed to compute the importance of each channel and then skip the computation of unimportant channels. However, these methods lead to irregular computation at the channel or spatial level, which is not efficiently supported by existing deep learning frameworks and hardware devices. To address this issue, a more aggressive strategy that dynamically skips whole layers or blocks was proposed [46, 41, 42]. For example, BlockDrop [46] introduced a policy network to decide which layers should be skipped. Unfortunately, this strategy can only be applied to CNN models with residual connections. Moreover, because these methods introduce extra controllers into the computation graph, the overall computational cost may remain the same or even increase in some cases. On the other hand, early exiting methods propose to divide a CNN model into multiple stages and exit the inference of "easy" samples in the early stages [31, 40, 17, 1]. The state-of-the-art is MSDNet [17], in which the authors manually design a novel multi-stage network architecture for dynamic inference.
Neural Architecture Search. Recently, neural architecture search (NAS) has emerged as a promising direction for automatically designing network architectures to meet the varying requirements of different tasks [48, 49, 14, 27, 44, 28]. There are two typical types of works in this direction: RL-based search algorithms [48] and differentiable search algorithms [28]. In this paper, according to the formulation of our specific problem, we choose an RL-based search algorithm to find the optimal model in the design space.
3 Our Approach
3.1 Overview of S2DNAS
The overview of S2DNAS is depicted in Figure 2. At a high level, S2DNAS can be divided into two components, namely, S2D and NAS. Here, S2D stands for "static-to-dynamic"; it generates a search space that comprises dynamic models based on a given static CNN model. Specifically, we define two transformations and apply them to the original model to generate different dynamic models. Each of these dynamic models is a multi-stage CNN that can be directly used for dynamic inference. All the generated models form the search space. Once the search space is generated, NAS searches for the optimal model in the space. In what follows, we give the details of these two components.
3.2 The Details of S2D
Given a CNN model m, the goal of S2D is to generate a search space A which consists of different dynamic models transformed from m. Each network in A is a multi-stage CNN model in which each stage contains one classifier. These multi-stage CNNs are generated from m using two transformations, namely, split and concat. First, we propose split to split the original model along the channel width, as Figure 3 shows. Specifically, we divide the input channels in each layer of the original model into different subsets, and each classifier uses features from different subsets for prediction. The prediction is done by adding a prediction layer (shown as yellow squares in Figure 3). Moreover, to enhance the feature interactions between different stages for a further performance boost, we propose concat to let the classifier in the current stage reuse the features from previous stages. Next, we present the details of these two transformations, split and concat. Before that, we first introduce some basic notations.
Notation. We start with the notation of a normal convolutional layer. Taking the l-th layer of a deep CNN as an example, the input of the l-th layer is denoted as X^l = {x^l_1, ..., x^l_{c_l}}, where c_l is the number of input channels and x^l_i is the i-th feature map with a resolution of w_l x h_l. We denote the weights as W^l in R^{o_l x c_l x k x k}, where o_l is the number of output channels and k is the kernel size. In the following parts, we present two transformations that can be applied to the original model. The goal of the transformations is to transform a static CNN model into a multi-stage model M = {f_1, ..., f_S}, where f_i is the classifier in the i-th stage.
Split. The split transformation is responsible for assigning different subsets of the input channels to the classifiers in different stages. We denote the number of stages as S. A direct way is to split the input channels into S subsets and allocate the i-th subset to the classifier in the i-th stage. However, this splitting method results in a considerably large search space, which poses an obstacle to the subsequent search process (i.e., NAS). To reduce the search space generated by this transformation, we propose to first divide the input channels into G groups and then assign these groups to different classifiers.
Specifically, we first evenly divide the input channels of each layer into G groups, and then choose S-1 splitting points over the groups so that consecutive groups of channels are assigned to the classifiers of the S stages.
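The grouping step above can be sketched as follows. This is an illustrative simplification of the split transformation, not the paper's exact implementation; the function name and the representation of splitting points as group indices are our own choices.

```python
def split_channels(num_channels, num_groups, split_points):
    """Sketch of the split transformation: channels are first evenly
    divided into `num_groups` groups, then the S-1 `split_points`
    (sorted group indices) assign consecutive groups of channels to
    the S stages."""
    group_size = num_channels // num_groups
    groups = [list(range(g * group_size, (g + 1) * group_size))
              for g in range(num_groups)]
    bounds = [0] + list(split_points) + [num_groups]
    # Stage s receives all channels of groups bounds[s] .. bounds[s+1]-1.
    return [[c for g in range(bounds[s], bounds[s + 1]) for c in groups[g]]
            for s in range(len(bounds) - 1)]
```

For example, `split_channels(16, 4, [1, 2])` assigns 4, 4, and 8 channels to three stages; moving the splitting points shifts capacity between early and late stages.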
Concat. The concat transformation is used to enhance the interaction between different stages. The basic idea is to enable the classifiers in later stages to reuse the features from previous stages. Formally, we use indicator matrices I^l (l = 1, ..., d) to indicate whether to enable feature reuse at different positions. Here, l denotes the l-th layer and d is the depth of the network.
Architecture Search Space. Based on the above two transformations, we can generate the search space by transforming the original CNN model. Specifically, there are two adjustable settings for the two transformations: the splitting points and the indicator matrices. Adjusting the splitting points changes the way feature groups are assigned, which trades off the accuracy and computational cost of the different classifiers. For example, we can assign more features to the early stages to improve the model performance on "easy" samples. Adjusting the indicator matrices changes the feature reuse strategy. To reduce the size of the search space, we restrict the feature layers with the same resolution to use the same split and concat settings in our experiments. By varying these two settings, we generate a search space that consists of different multi-stage models. In the following section, we demonstrate how to search for the optimal model in the generated space.
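A rough count of this space can be written down directly. The formula below is our reading of the construction, not the authors' stated formula: assigning S stages to G consecutive groups gives C(G-1, S-1) split settings per resolution level, and each candidate concat position contributes an independent binary choice.

```python
from math import comb

def space_size(num_levels, num_groups, num_stages, num_concat_positions):
    """Rough size of the generated search space, assuming layers at the
    same resolution share one split setting (C(G-1, S-1) placements of
    S-1 splitting points among G groups) and one binary concat
    indicator per candidate reuse position at that level."""
    per_level = comb(num_groups - 1, num_stages - 1) * 2 ** num_concat_positions
    return per_level ** num_levels
```

Even modest settings explode combinatorially, e.g. `space_size(3, 8, 3, 2)` already exceeds half a million candidates, which motivates the learned search in the next section.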
3.3 The Details of NAS
Once we obtain the search space from the above procedure of S2D, the goal of NAS is to find the optimal model with high accuracy and low computational cost. Note that a model is jointly determined by the settings of the above two transformations, i.e., the splitting points and the indicator matrices. With a slight abuse of notation, we also refer to an architecture a by these two settings and denote A as the space consisting of all such settings. Thus the optimization goal reduces to searching for the optimal settings of the proposed transformations that maximize our predefined metric (see details in the following section).
However, searching for the optimal setting is nontrivial due to the huge search space A; for example, in our experiment on MobileNetV2 [36], the search space is far too large to enumerate. Motivated by the recent progress in neural architecture search (NAS) [48, 49, 14, 39], we propose to use a policy-gradient based reinforcement learning algorithm for searching. The goal of the algorithm is to optimize the policy that produces the optimal model. This process can be formulated as a nested optimization problem:
(1)  max_pi  E_{a ~ pi} [ R(a, w*_a; D_val) ]
     s.t.  w*_a = argmin_w  L(a, w; D_train),

where w*_a denotes the corresponding weights of the model a and pi is the policy which generates the settings of the transformations. D_val and D_train denote the validation and training datasets, respectively, and R is the reward function for evaluating the quality of the multi-stage model.
To solve the nested optimization problem in Equation 1, we need to solve two subproblems, namely, optimizing the policy pi when the weights w are given and optimizing w when the architecture a is given. We first present how to optimize the policy pi when w is given.
Optimization of the Transformation Settings. Similar to previous works [48, 49], we use a customized recurrent neural network (RNN) to generate the distribution over different transformation settings for each layer of the CNN model. Then a policy-gradient based algorithm [37] is used to optimize the parameters of the RNN to maximize the expected reward, which is defined in Equation 2. Specifically, the reward in our paper is defined as a weighted product considering both the accuracy and the computational cost:
(2)  R(a, w; D_val) = acc(a, w; D_val) * cost(a, w; D_val)^(-beta),

where acc(a, w; D_val) is the accuracy of the multi-stage model on the dataset D_val and cost(a, w; D_val) is the average computational cost over the samples of the dataset using dynamic inference.
For a fair comparison with other works on dynamic inference, we use FLOPs as the measure of computational cost.
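The reward can be made concrete with a few lines. The weighted-product form below follows the description around Equation 2, but the exact functional form and the value of beta are our assumptions:

```python
def reward(accuracy, mean_flops, beta):
    """Weighted-product reward in the spirit of Equation 2 (form and
    beta are assumptions): higher validation accuracy raises the
    reward, while higher average dynamic-inference FLOPs lowers it."""
    return accuracy * mean_flops ** (-beta)
```

The exponent -beta controls how aggressively the search trades accuracy for computation: with beta = 0 the search optimizes accuracy alone, while larger beta favors cheaper multi-stage models.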
Optimization of the Multi-stage CNN. The inner optimization problem (i.e., solving for w when the architecture a is given) can be solved using the gradient descent algorithm. Specifically, we modify the normal classification loss function (i.e., the cross-entropy function) for training multi-stage models. Formally, the loss function is defined as:
(3)  L(a, w; D_train) = sum_{(x, y) in D_train} sum_{i=1}^{S} lambda_i * CE(f_i(x; w), y),
Here, CE denotes the cross-entropy function. The optimization of the above equation can be regarded as jointly optimizing all the classifiers in different stages, and can be implemented using stochastic gradient descent (SGD) and its variants. We use the optimized weights to assess the quality of the model generated by the RNN, which is in turn used for optimizing the RNN. In practice, to reduce the search time, following the previous work [49], we approximate w*_a by updating the weights for only several training epochs, without solving the inner optimization problem completely by training the network until convergence.
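For a single sample, the joint loss above can be sketched as a lambda-weighted sum of per-stage cross-entropies. This is an illustrative numpy version of Equation 3, not the authors' training code; the function name is our own.

```python
import numpy as np

def multistage_loss(stage_logits, label, lambdas):
    """Joint training loss in the spirit of Equation 3: a
    lambda-weighted sum of the cross-entropy of every stage's
    classifier, computed for one sample from its per-stage logits."""
    total = 0.0
    for logits, lam in zip(stage_logits, lambdas):
        shifted = logits - np.max(logits)                 # numerical stability
        log_probs = shifted - np.log(np.sum(np.exp(shifted)))
        total += lam * (-log_probs[label])
    return total
```

Summing this quantity over a minibatch and backpropagating trains all S classifiers jointly, which is exactly the modification to the standard cross-entropy objective described above.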
Dynamic Inference of the Searched CNN. Once the optimal multi-stage model is found, we can directly perform dynamic inference with it. Specifically, we set a predefined threshold for each stage; the threshold of the i-th stage is denoted as t_i. We then use these thresholds to decide at which stage the inference should stop: given an input sample x, the inference stops at the i-th stage (i < S) when the i-th classifier outputs a top-1 confidence score larger than t_i, and samples whose confidence never exceeds a threshold are classified by the last stage.
4 Experiments
To verify the effectiveness of S2DNAS, we compare it with different dynamic inference methods on different CNN models. Our experiments have covered a wide range of previous methods of dynamic inference [5, 46, 31, 40]. We also evaluate different aspects of S2DNAS, which are presented in the discussion part.
Table 1: Comparison with other dynamic inference methods on CIFAR-10 and CIFAR-100 (average FLOPs, relative reduction, and accuracy).

| Model | Method | FLOPs (C10) | Reduction (C10) | Accuracy (C10) | FLOPs (C100) | Reduction (C100) | Accuracy (C100) |
|---|---|---|---|---|---|---|---|
| ResNet20 | Original | 41M | - | 91.25% | 41M | - | 67.78% |
| | LCCL | 30M | 28% | 90.95% | 40M | 1% | 68.26% |
| | BlockDrop | 45M | 11% | 91.31% | 53M | 29% | 67.39% |
| | Naive | 34M | 18% | 91.27% | 39M | 5% | 66.77% |
| | BranchyNet | 33M | 20% | 91.37% | 45M | 9% | 67.00% |
| | S2DNAS | 16M | 61% | 91.41% | 25M | 39% | 67.29% |
| ResNet56 | Original | 126M | - | 93.03% | 126M | - | 71.32% |
| | LCCL | 102M | 19% | 92.99% | 106M | 16% | 70.33% |
| | BlockDrop | 74M | 41% | 92.98% | 129M | 2% | 72.39% |
| | Naive | 68M | 46% | 92.78% | 108M | 14% | 71.58% |
| | BranchyNet | 73M | 42% | 92.51% | 120M | 5% | 71.22% |
| | S2DNAS | 37M | 71% | 92.42% | 62M | 51% | 71.20% |
| ResNet110 | Original | 254M | - | 93.57% | 254M | - | 73.55% |
| | LCCL | 166M | 35% | 93.44% | 210M | 17% | 72.72% |
| | BlockDrop | 76M | 70% | 93.00% | 153M | 40% | 73.70% |
| | Naive | 158M | 38% | 93.13% | 217M | 15% | 73.06% |
| | BranchyNet | 147M | 42% | 93.33% | 243M | 5% | 73.25% |
| | S2DNAS | 76M | 70% | 93.39% | 113M | 56% | 73.06% |
| VGG16BN | Original | 313M | - | 93.72% | 313M | - | 72.93% |
| | LCCL | 269M | 14% | 92.75% | 264M | 16% | 70.46% |
| | Naive | 185M | 41% | 93.34% | 202M | 36% | 72.78% |
| | BranchyNet | 162M | 48% | 93.39% | 239M | 24% | 72.39% |
| | S2DNAS | 66M | 79% | 93.51% | 104M | 67% | 72.00% |
| MobileNetV2 | Original | 91M | - | 93.89% | 91M | - | 74.21% |
| | LCCL | 77M | 15% | 93.13% | 73M | 20% | 71.11% |
| | Naive | 38M | 58% | 91.90% | 61M | 33% | 74.03% |
| | BranchyNet | 35M | 61% | 91.76% | 74M | 18% | 73.71% |
| | S2DNAS | 25M | 73% | 92.25% | 39M | 57% | 73.50% |
4.1 Experiment Settings
Model Setup. We conduct experiments on three CNN architectures: ResNet [13], VGG [38], and MobileNetV2 [36].
Training Details. The CIFAR [20] dataset contains 50k training images and 10k test images. We randomly choose 5k images from the training set as the validation dataset and use the remaining 45k images for training. We use the same input preprocessing for both CIFAR-10 and CIFAR-100: the training images are zero-padded with 4 pixels and then randomly cropped to 32x32 resolution, and random horizontal flipping is used for data augmentation.
For the training of the RNN, the PPO algorithm [37] is used, with Adam [19] as the optimizer for the parameter updates; the hyperparameter settings can be found in the appendix. For the training of the multi-stage model, we use SGD with momentum as the optimizer, with a step learning rate schedule that divides the learning rate by a constant factor at fixed fractions of the total epochs. More details of the training settings of the different models can be found in the appendix.
For the hyperparameters of S2DNAS, we use the same group number G for every layer and fix the number of stages S. For the comparison with MSDNet, which contains 5 stages, we set S = 5 when applying S2DNAS to the devised model. The beta in Equation 2 and the lambda_i in Equation 3 are kept fixed across all experiments.
4.2 Classification Results
In this part, we compare our method with other methods of dynamic inference. To give a comprehensive study, we cover a wide range of methods, including LCCL [5], BlockDrop [46], Naive [31], and BranchyNet [40]. We conduct experiments on two widely-used image classification benchmarks, CIFAR-10 and CIFAR-100. To show the effectiveness of S2DNAS in reducing the computational cost of CNN models with different architectures, we apply S2DNAS to five typical CNNs with various depths, widths, and substructures.
The overall results are shown in Table 1. Note that different thresholds (the t_i defined in the previous section) lead to different trade-offs between model accuracy and computational cost. In our experiments, we chose the thresholds that led to the highest reward on the validation dataset. We also provide further results with different thresholds in the discussion subsection.
As shown in Table 1, for most of the architectures and tasks, our method (denoted as S2DNAS in Table 1) can significantly reduce the computational cost while maintaining accuracy comparable to the original CNN model. As mentioned above, we use the average FLOPs over the whole test dataset as the metric of computational cost. For ResNet20 on CIFAR-10, S2DNAS reduces the computational cost of the original net from 41M to 16M FLOPs without an accuracy drop (even with a slight increase, from 91.25% to 91.41%, as shown in Table 1), a relative cost reduction of 61%.
Our method also shows improvements over other dynamic inference methods in terms of computational cost reduction. We reproduced the previous works on these CNN models for comparison, and also implemented a normal early exiting solution (marked as Naive in Table 1), i.e., directly adding prediction layers (global average pooling followed by fully-connected layers) at intermediate layers of the original models. For example, for ResNet20 on CIFAR-10, compared with BranchyNet [40], our method achieves a slight accuracy improvement (from 91.37% to 91.41%) with a much larger computational cost reduction (61% vs. 20%).
One interesting observation is that some methods even cause an increase in computational cost. For example, BlockDrop increases the FLOPs of the original ResNet20 (from 41M to 45M on CIFAR-10). We infer that this is caused by the computationally expensive controller that BlockDrop introduces into the inference process [46]. We also notice that some of the previous works cannot be used for networks without residual connections; for instance, BlockDrop cannot be applied to VGG16BN. In contrast, our method generalizes to CNNs without residual connections: from Table 1, our method reduces the computational cost of the original VGG16BN net by 79% on CIFAR-10 and 67% on CIFAR-100 with only a slight accuracy drop.
Comparison to MSDNet. As mentioned in the introduction, a recent work proposed a specialized CNN named MSDNet for dynamic inference. Since that method cannot be directly applied to general CNN models, for the comparison with MSDNet we design a DenseNet-like [18] model based on the prior work [17], which has a structure similar to MSDNet. More details of the devised model can be found in the appendix. We then apply S2DNAS to it to generate dynamic models. The results are plotted in Figure 5; the varying FLOPs values on the x-axis are obtained by adjusting the thresholds of each classifier of the dynamic CNN models. As Figure 5 shows, in most cases our method achieves similar accuracy-computation trade-offs. On CIFAR-10, MSDNet outperforms our method when the FLOPs budget is around 15M. However, the superiority of MSDNet comes at the cost of manually designing the CNN architecture; in contrast, as Table 1 shows, our method can be applied to various general CNN models.
Table 2: Per-stage accuracy and the fraction of test samples exiting at each stage (S = 3).

| Dataset | Model | Stage 1 Acc. | Stage 1 Fraction | Stage 2 Acc. | Stage 2 Fraction | Stage 3 Acc. | Stage 3 Fraction |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | ResNet20 | 98.44% | 10.24% | 98.89% | 41.59% | 83.45% | 48.17% |
| | ResNet56 | 98.25% | 67.50% | 89.72% | 11.19% | 75.36% | 21.31% |
| | ResNet110 | 98.43% | 61.66% | 93.22% | 22.28% | 74.28% | 16.06% |
| | VGG16BN | 96.54% | 87.29% | 91.44% | 2.22% | 68.73% | 10.49% |
| | MobileNetV2 | 98.62% | 50.59% | 94.04% | 33.21% | 68.70% | 16.20% |
| CIFAR-100 | ResNet20 | 85.27% | 58.72% | 54.11% | 20.68% | 29.27% | 20.60% |
| | ResNet56 | 97.13% | 22.27% | 86.64% | 29.86% | 49.51% | 47.87% |
| | ResNet110 | 95.83% | 28.04% | 85.05% | 28.90% | 50.19% | 43.06% |
| | VGG16BN | 97.21% | 7.18% | 90.08% | 44.97% | 51.20% | 47.85% |
| | MobileNetV2 | 97.81% | 8.68% | 90.38% | 45.84% | 51.85% | 45.48% |
4.3 Discussion
Here, we present some discussions on our method for providing further insights.
Trade-off of Accuracy and Computational Cost. A key hyperparameter of dynamic inference is the threshold setting {t_i}, i = 1, ..., S, where S is the number of stages. Once the model is trained, different threshold settings lead to different trade-offs between accuracy and computational cost. To demonstrate how the thresholds affect the final model performance, we conduct experiments with different thresholds and plot the results in Figure 6. All these results show the trend that an increase in computational cost leads to a performance boost. Thus, for practical use, we can set the thresholds based on the computational budget of the given hardware device. Moreover, this property also helps to solve the anytime prediction task proposed in the prior work [17].
Difficulty Distribution of Test Dataset. The basic idea of our method is to exit "easy" samples at the early stages. In this part, we give the statistics over all samples in the test dataset (S = 3, i.e., there are three stages in the trained model). As shown in Table 2, for ResNet20 on CIFAR-10/100, the inference of about 52% / 79% of the test samples exits from the first two stages. As a result, S2DNAS can considerably reduce the average computational cost. Further, we observe that the accuracy of the classifier in the first stage is remarkably high (e.g., 98.44% for ResNet20 on CIFAR-10), which indicates that the samples exiting early are indeed classified reliably.
5 Conclusion
In this paper, we present a general framework called S2DNAS for transforming various static CNN models into multi-stage models that support dynamic inference. Empirically, our method can be applied to various CNN models to reduce the computational cost without sacrificing model performance. In contrast to previous methods for dynamic inference, our method comes with two advantages: (1) we obtain a dynamic model generated from an existing CNN model instead of manually redesigning a new CNN architecture; (2) the inference of the generated dynamic model does not introduce irregular computations or complex controllers, so the generated model can be easily deployed on various hardware devices using existing deep learning frameworks.
These advantages are appealing for deploying a given CNN model on hardware devices with limited computational resources: we can first use S2DNAS to transform the given model into a dynamic one and then deploy it on the hardware device. Moreover, our method is orthogonal to previous pruning/quantization methods, which can further reduce the computational cost of the given CNN model. All these properties imply a wide range of application scenarios where efficient CNN inference is desired.
Appendix A Details of RNN Model and its Optimization
The RNN model contains a GRU layer with 64 hidden units, predictors, and an embedding layer. The predictors output the probabilities of the different transformation settings (split settings and concat settings) for the different layers. The number of different split settings in a layer is the combination number C(G-1, S-1), in which G is the number of groups and S is the number of stages; a predictor (a fully-connected layer followed by a softmax function) is used to predict the probabilities of selecting among these settings. For the concat locations, a predictor (a fully-connected layer followed by a logistic function) predicts the probability of enabling each location. Finally, the embedding layer turns the settings sampled at the previous step into dense vectors of fixed size as the input of the GRU layer.
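One controller step can be sketched as follows. This is an illustrative sampling routine with made-up inputs, not the paper's controller: in S2DNAS the logits come from the GRU hidden state, whereas here they are plain arguments.

```python
import numpy as np

def sample_layer_settings(split_logits, concat_logits, rng):
    """One controller step (sketch): a softmax predictor samples one
    split setting for a layer, and an independent logistic predictor
    samples each binary concat decision."""
    split_logits = np.asarray(split_logits, dtype=float)
    split_probs = np.exp(split_logits - split_logits.max())
    split_probs /= split_probs.sum()
    split_choice = int(rng.choice(len(split_probs), p=split_probs))
    concat_probs = 1.0 / (1.0 + np.exp(-np.asarray(concat_logits, dtype=float)))
    concat_choice = (rng.random(len(concat_probs)) < concat_probs).astype(int)
    return split_choice, concat_choice
```

Repeating this step per layer (feeding each sampled setting back through the embedding layer) yields one complete architecture, whose reward is then used in the PPO update.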
We employ Proximal Policy Optimization (PPO) to optimize the parameters of the RNN. Adam is used for optimizing the parameters of the RNN model, with a learning rate of 0.001. The number of epochs for PPO is set to 4, the clip parameter is set to 0.1, the minibatch size is set to 4, the coefficient of value function loss is set to 0.5 and the entropy coefficient is set to 0.01.
Appendix B Details of DenseNet-like Model
To compare with MSDNet, a DenseNet-like model is devised. Specifically, we modify DenseNet-BC (k=8, depth=100) by doubling the growth rate after each transition layer and halving the number of output channels of the convolution in the bottleneck layers of DenseNet. We denote it as DenseNet*.
Appendix C Details of Training Settings
During the network architecture search, we optimize each multi-stage model for 6 epochs on the training dataset to approximate w*_a. 10k models are sampled from the architecture search space for each experiment. Then the models with the top-10 rewards receive full training. Table 3 lists the hyperparameters of the full training for the different architectures. Learning rate warmup is used for the first 100 iterations.
Table 3: Hyperparameters of the full training for different architectures.

| Model | Datasets | Batch size | Training epochs | Weight decay |
|---|---|---|---|---|
| ResNet | CIFAR | 128 | 200 | 1e-4 |
| VGG | CIFAR | 64 | 200 | 5e-4 |
| MobileNetV2 | CIFAR | 128 | 300 | 1e-4 |
| DenseNet* | CIFAR | 64 | 300 | 1e-4 |
Appendix D Demonstration of a Searched Multi-stage Model
Table 4 demonstrates the structure of the searched multi-stage ResNet56 model on CIFAR-10. From this table, we can see that each layer is split into three stages, and each stage contains a subset of the original channels. Different stages are concatenated at different layers, so the feature maps generated by previous layers are reused in the later stages. The accumulated FLOPs increase from 21M to 90M across the three stages. As a result, we save substantial computational cost when "easy" input samples stop at stage 1 or stage 2.
layer name  output size  stage 1  stage 2  stage 3  concat settings
conv1  –  6  2  8  –
conv2_x  –  –  –  –  stage2→stage1, stage3→stage1
conv3_x  –  –  –  –  stage2→stage1
conv4_x  –  –  –  –  stage3→stage2
accumulated FLOPs  –  21M  30M  90M  –
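The computational saving follows directly from the accumulated per-stage FLOPs: a sample that exits early pays only the cost accumulated up to its exit stage. The exit fractions below are assumed for illustration, not measured results.

```python
# Expected inference cost of the searched three-stage model: a sample that
# exits at an early stage pays only that stage's accumulated FLOPs.
flops = [21e6, 30e6, 90e6]    # accumulated FLOPs of stages 1-3 (from Table 4)
exit_frac = [0.5, 0.3, 0.2]   # assumed exit fractions, for illustration only

expected_flops = sum(p * f for p, f in zip(exit_frac, flops))  # 37.5 MFLOPs on average
```

Under these assumed fractions the average cost is well under half of the 90M FLOPs that a static evaluation of the full model would pay for every sample.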
Footnotes
 In this paper, the classifier refers to the whole subnetwork in the current stage.
 We do not split the input layer.
 The batch normalization and pooling layers are omitted.
 This refers to the last layer of the classifier, which is used for prediction.
 Here, we regard one multiply-accumulate (MAC) operation as one floating-point operation (FLOP).
 We use batch normalization after each convolution layer in VGG and change the stride of the first convolution layer in MobileNetV2 from 2 to 1 for CIFAR.
 Here, we only consider samples that exit from this stage.
References
 (2019) Dynamically sacrificing accuracy for reduced computation: cascaded inference based on softmax confidence. In ICANN 2019, pp. 306–320.
 (2017) Adaptive neural networks for efficient inference. In ICML 2017, pp. 527–536.
 (2019) SeerNet: predicting convolutional neural network feature-map sparsity through low-bit quantization. In CVPR 2019, pp. 11216–11225.
 (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), pp. 834–848.
 (2017) More is less: a more complicated network with less inference complexity. In CVPR 2017, pp. 1895–1903.
 (2017) Spatially adaptive computation time for residual networks. In CVPR 2017, pp. 1790–1799.
 (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
 (2019) Dynamic channel pruning: feature boosting and suppression. In ICLR 2019.
 (2015) Deep learning with limited numerical precision. In ICML 2015, pp. 1737–1746.
 (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR 2016.
 (2015) Learning both weights and connections for efficient neural networks. In NIPS 2015, pp. 1135–1143.
 (2017) Mask R-CNN. In ICCV 2017, pp. 2980–2988.
 (2016) Deep residual learning for image recognition. In CVPR 2016, pp. 770–778.
 (2018) AMC: AutoML for model compression and acceleration on mobile devices. In ECCV 2018, pp. 815–832.
 (2018) Channel gating neural networks. CoRR abs/1805.12549.
 (2019) Boosting the performance of CNN accelerators with dynamic fine-grained channel gating. In MICRO 2019, pp. 139–150.
 (2018) Multi-scale dense networks for resource efficient image classification. In ICLR 2018.
 (2017) Densely connected convolutional networks. In CVPR 2017, pp. 2261–2269.
 (2015) Adam: a method for stochastic optimization. In ICLR 2015.
 (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
 (2012) ImageNet classification with deep convolutional neural networks. In NIPS 2012, pp. 1106–1114.
 (1989) Optimal brain damage. In NIPS 1989, pp. 598–605.
 (2017) Pruning filters for efficient ConvNets. In ICLR 2017.
 (2019) Improved techniques for training adaptive deep networks. CoRR abs/1908.06294.
 (2017) Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade. In CVPR 2017, pp. 6459–6468.
 (2016) Fixed point quantization of deep convolutional networks. In ICML 2016, pp. 2849–2858.
 (2019) Auto-DeepLab: hierarchical neural architecture search for semantic image segmentation. In CVPR 2019, pp. 82–92.
 (2019) DARTS: differentiable architecture search. In ICLR 2019.
 (2018) Dynamic deep neural networks: optimizing accuracy-efficiency trade-offs by selective execution. In AAAI 2018, pp. 3675–3682.
 (2016) SSD: single shot multibox detector. In ECCV 2016, pp. 21–37.
 (2016) Conditional deep learning for energy-efficient and enhanced pattern recognition. In DATE 2016, pp. 475–480.
 (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV 2016, pp. 525–542.
 (2016) You only look once: unified, real-time object detection. In CVPR 2016, pp. 779–788.
 (2018) SBNet: sparse blocks network for fast inference. In CVPR 2018, pp. 8711–8720.
 (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS 2015, pp. 91–99.
 (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR 2018, pp. 4510–4520.
 (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347.
 (2015) Very deep convolutional networks for large-scale image recognition. In ICLR 2015.
 (2018) MnasNet: platform-aware neural architecture search for mobile. CoRR abs/1807.11626.
 (2016) BranchyNet: fast inference via early exiting from deep neural networks. In ICPR 2016, pp. 2464–2469.
 (2018) Convolutional networks with adaptive inference graphs. In ECCV 2018, pp. 3–18.
 (2018) SkipNet: learning dynamic routing in convolutional networks. In ECCV 2018, pp. 420–436.
 (2016) Learning structured sparsity in deep neural networks. In NIPS 2016, pp. 2074–2082.
 (2019) FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In CVPR 2019, pp. 10734–10742.
 (2016) Quantized convolutional neural networks for mobile devices. In CVPR 2016, pp. 4820–4828.
 (2018) BlockDrop: dynamic inference paths in residual networks. In CVPR 2018, pp. 8817–8826.
 (2017) Scalpel: customizing DNN pruning to the underlying hardware parallelism. In ISCA 2017, pp. 548–560.
 (2017) Neural architecture search with reinforcement learning. In ICLR 2017.
 (2018) Learning transferable architectures for scalable image recognition. In CVPR 2018, pp. 8697–8710.