S2DNAS: Transforming Static CNN Model for Dynamic Inference via Neural Architecture Search

Abstract

Recently, dynamic inference has emerged as a promising way to reduce the computational cost of deep convolutional neural networks (CNNs). In contrast to static methods (e.g., weight pruning), dynamic inference adaptively adjusts the inference process according to each input sample, which can considerably reduce the computational cost on “easy” samples while maintaining the overall model performance.

In this paper, we introduce a general framework, S2DNAS, which can transform various static CNN models to support dynamic inference via neural architecture search. To this end, based on a given CNN model, we first generate a CNN architecture space in which each architecture is a multi-stage CNN generated from the given model using some predefined transformations. Then, we propose a reinforcement learning based approach to automatically search for the optimal CNN architecture in the generated space. At last, with the searched multi-stage network, we can perform dynamic inference by adaptively choosing a stage to evaluate for each sample. Unlike previous works that introduce irregular computations or complex controllers in the inference or re-design a CNN model from scratch, our method can generalize to most of the popular CNN architectures and the searched dynamic network can be directly deployed using existing deep learning frameworks in various hardware devices.

1 Introduction

In the past years, deep convolutional neural networks (CNNs) have gained great success in many computer vision tasks, such as image classification [21, 13, 18], object detection [35, 33, 30], and image segmentation [12, 4]. However, the remarkable performance of CNNs always comes with huge computational cost, which impedes their deployment in resource constrained hardware devices. Thus, various methods have been proposed to improve computational efficiency of the CNN inference, including network pruning [22, 11, 23] and weight quantization [9, 45, 32]. Most of the previous methods are static approaches, which use fixed computation graphs for all test samples.

Recently, dynamic inference has emerged as a promising alternative to speed up the CNN inference by dynamically changing the computation graph according to each input sample [40, 2, 6, 5, 46, 17, 29, 8]. The basic idea is to allocate less computation for “easy” samples while more computation for “hard” ones. As a result, the dynamic inference can considerably save the computational cost of “easy” samples without sacrificing the overall model performance. Moreover, the dynamic inference can naturally exploit the trade-off between accuracy and computational cost to meet varying requirements (e.g., computational budget) in real-world scenarios.

To enable the dynamic inference of a CNN model, most previous works aim to develop dedicated strategies that dynamically skip some computation operations during the CNN inference according to different input samples. To achieve this goal, these works add extra controllers in-between the original model to select which computations are executed. For example, well-designed gate functions were proposed as controllers to select a subset of channels or pixels for the subsequent computation of the convolution layer [8, 5, 15]. However, these methods lead to irregular computation at the channel or spatial level, which is not efficiently supported by existing software and hardware devices [43, 47, 16]. To address this issue, a more aggressive strategy that dynamically skips whole layers was proposed for efficient inference [46, 42, 41]. Unfortunately, this strategy can only be applied to CNN models with residual connections [13]. Moreover, the controllers of some methods come with considerably complex structures, which can increase the overall computational cost of inference (see experimental results in Section 4).

To mitigate these problems, researchers propose to exit “easy” input samples early at inference time [31, 40, 17, 1]. A typical solution is to add intermediate prediction layers at multiple depths of a normal CNN model, and then exit the inference when the confidence score of an intermediate classifier is higher than a given threshold. Figure 1a shows the paradigm of these early exiting methods [31, 1]. In this paradigm, prediction layers are directly added in-between the original network and the network is split into multiple stages along the layer depth. However, these solutions face the challenge that early classifiers are unable to leverage the semantic-level features produced by the deeper layers, which may cause a significant accuracy drop [17]. As illustrated in Figure 1a, three prediction layers are added at different depths of the network. Thus, a classifier in an earlier stage cannot make use of the semantic-level features available to the classifiers in later stages.

Huang et al. [17] proposed a novel CNN model, called MSDNet, to solve this issue. The core design of MSDNet is a two-dimensional multi-scale architecture that maintains both coarse- and fine-level features in every layer, as shown in Figure 1b. Based on this design, MSDNet can leverage the semantic-level features in every prediction layer and achieves the best results. However, MSDNet requires a specialized network architecture, which cannot generalize to other CNN models and demands massive expertise in architecture design.

To solve the aforementioned issue without designing CNNs from scratch, we propose to transform a given CNN model into a channel-wise multi-stage network, which comes with the advantage that the classifiers in the early stages can leverage the semantic-level features. Figure 1c intuitively demonstrates the idea behind our method. Different from the normal paradigm in Figure 1a, our method splits the original network into multiple stages along the channel width. The prediction layers are added only after the last convolutional layer, so all classifiers can leverage the semantic-level features. To reduce the computational cost of the classifiers in the early stages, we propose to cut down the number of channels of each layer in different stages (more details can be found in Section 3).

Based on the high-level idea introduced above, we present a general framework called S2DNAS. Given a specific CNN model, the framework can automatically generate a dynamic model following the paradigm shown in Figure 1c. S2DNAS consists of two components: S2D and NAS. First, the component S2D, which stands for “static to dynamic”, generates a CNN model space based on the given model. This space comprises different multi-stage CNNs generated from the given model with the predefined transformations. Then, NAS searches for the optimal model in the generated space with the help of reinforcement learning. Specifically, we devise an RNN to decide the setting of each transformation for generating a model. To exploit the trade-off between accuracy and computational cost, we design a reward function that reflects both the classification accuracy and the computational cost, inspired by prior works [39, 14, 44]. We then use a policy-gradient based algorithm [37] to train the RNN. The RNN generates better CNN models as reinforcement learning proceeds, and we can use the searched model for dynamic inference.

Figure 1: Three paradigms of early exiting methods. (a) The layer-wise approach splits the network into multiple stages along the layer depth. (b) MSDNet devises a multi-stage CNN in which each stage maintains a feature pyramid. (c) Our proposed channel-wise approach splits the network along the channel width.

To verify the effectiveness of S2DNAS, we perform extensive experiments by applying our method to various CNN models. With a comparable model accuracy, our method can achieve further computation reduction in contrast to the previous works for dynamic inference.

2 Related Work

Static Methods for Efficient CNN Inference. Numerous methods have been proposed for improving the efficiency of CNN inference. Two representative research directions are network pruning [22, 11, 10, 23] and quantization [9, 45, 26, 32]. Specifically, network pruning aims to remove redundant weights in a well-trained CNN without sacrificing the model accuracy. In contrast, network quantization aims to reduce the bit-width of both activations and weights. Most works in these two directions are static, i.e., they use the same computation graph for all test samples. Next, we introduce an emerging direction that utilizes dynamic inference to improve the efficiency of CNN inference.

Dynamic Inference. Dynamic inference is also referred to as adaptive inference in previous works [41, 24]. Most previous works aim to develop dedicated strategies that dynamically skip some computation during inference. They add extra controllers to select which computations are executed [5, 3, 34, 25, 15, 8, 46, 41, 42]. Dong et al. [5] proposed to compute a spatial attention map using extra convolutional layers and then skip the computation of inactive pixels. Gao et al. [8] proposed to compute the importance of each channel and then skip the computation of the unimportant channels. However, these methods lead to irregular computation at the channel or spatial level, which is not efficiently supported by existing deep learning frameworks and hardware devices. To address this issue, a more aggressive strategy that dynamically skips whole layers or blocks was proposed [46, 41, 42]. For example, BlockDrop [46] introduced a policy network to decide which layers should be skipped. Unfortunately, this strategy can only be applied to CNN models with residual connections. Moreover, since these methods introduce extra controllers into the computational graph, the computational cost may remain the same or even increase in some cases. On the other hand, early exiting methods divide a CNN model into multiple stages and exit the inference of “easy” samples in the early stages [31, 40, 17, 1]. The state-of-the-art is MSDNet [17], in which the authors manually design a novel multi-stage network architecture for dynamic inference.

Neural Architecture Search. Recently, neural architecture search (NAS) has emerged as a promising direction to automatically design the network architecture to meet varying requirements of different tasks [48, 49, 14, 27, 44, 28]. There are two typical types of works in this research direction, RL-based searching algorithms [48] and differentiable searching algorithms [28]. In this paper, according to the formulation of our specific problem, we choose the RL-based searching algorithm to search for the optimal model in a design space.

3 Our Approach

Figure 2: Overview of S2DNAS. S2D first generates a search space from the original CNN model. Then, NAS searches for the optimal model in the generated space.

3.1 Overview of S2DNAS

The overview of S2DNAS is depicted in Figure 2. At a high level, S2DNAS can be divided into two components, namely, S2D and NAS. Here, S2D stands for “static-to-dynamic” and is used to generate a search space comprising dynamic models based on a given static CNN model. Specifically, we define two transformations and apply them to the original model to generate the different dynamic models in the search space. Each of these dynamic models is a multi-stage CNN that can be directly used for dynamic inference. All these generated models form the search space. Once the search space is generated, NAS searches for the optimal model in the space. In what follows, we give the details of these two components.

3.2 The Details of S2D

Given a CNN model $\mathcal{M}$, the goal of S2D is to generate the search space $\mathcal{A}$, which consists of different dynamic models transformed from $\mathcal{M}$. Each network in $\mathcal{A}$ is a multi-stage CNN model in which each stage contains one classifier. These multi-stage CNNs can be generated from $\mathcal{M}$ using two transformations, namely, split and concat. First, we propose split to split the original model along the channel width as Figure 3 shows. Specifically, we divide the input channels in each layer of the original model into different subsets, and each classifier uses features from a different subset for prediction. The prediction is done by adding a prediction layer (shown as yellow squares in Figure 3). Moreover, to enhance the feature interactions between different stages for a further performance boost, we propose concat to let the classifier in the current stage reuse the features from previous stages. Next, we present the details of these two transformations, split and concat. Before that, we first introduce some basic notations.

Notation. We start with the notation of a normal convolutional layer. Taking the $l$-th layer of a deep CNN as an example, the input of the $l$-th layer is denoted as $X^{(l)} = \{x^{(l)}_1, \dots, x^{(l)}_{c_l}\}$, where $c_l$ is the number of input channels and $x^{(l)}_i$ is the $i$-th feature map with a resolution of $h_l \times w_l$. We denote the weights as $W^{(l)} \in \mathbb{R}^{o_l \times c_l \times k \times k}$, where $o_l$ is the number of output channels and $k$ is the kernel size. In the following parts, we present two transformations that can be applied to the original model. The goal of the transformations is to turn a static CNN model into a multi-stage model, which can be represented as $\{f_1, f_2, \dots, f_s\}$, where $f_i$ is the classifier in the $i$-th stage. Next, we introduce the details of the two proposed transformations.

Figure 3: Illustration of how split and concat are applied to a CNN model. Note that Group is an intermediate step of split for reducing the size of the search space (see more details in the main text).

Split. The split transformation is responsible for assigning different subsets of the input channels to the classifiers in different stages. We denote the number of stages as $s$. A direct way is splitting the input channels into $s$ subsets and allocating the $i$-th subset to the classifier in the $i$-th stage. However, this splitting method results in a considerably large search space, which poses an obstacle to the subsequent search process (i.e., NAS). To reduce the search space generated by this transformation, we propose to first divide the input channels into $g$ groups and then assign these groups to the different classifiers.

Specifically, we first evenly divide the input channels into $g$ groups, so that each group consists of $c_l/g$ input channels. Taking the $l$-th layer as an example, this process can be formally denoted as $X^{(l)} = G^{(l)}_1 \cup G^{(l)}_2 \cup \dots \cup G^{(l)}_g$, where $G^{(l)}_j$ is the $j$-th channel group. Once the grouping is finished, these groups are assigned to the classifiers in different stages. Precisely, we use the split points $p_0 \le p_1 \le \dots \le p_s$ to split the groups, where $p_0 = 0$ and $p_s = g$ are two peculiar points that denote the start and end points. With the split points, we can assign the channel groups $G^{(l)}_{p_{i-1}+1}, \dots, G^{(l)}_{p_i}$ to the classifier in the $i$-th stage. Note that the connections (of the original model $\mathcal{M}$) between different classifiers are removed (see Figure 3).
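The grouping-and-assignment step above can be sketched as follows; the function name and the split-point convention are illustrative, not the authors' implementation:

```python
def split_channels(num_channels, num_groups, split_points):
    """Evenly divide channel indices into groups, then assign
    contiguous runs of groups to stages via split points.

    split_points must start at 0 and end at num_groups, e.g.
    [0, 1, 4] assigns group 0 to stage 1 and groups 1-3 to stage 2.
    """
    assert num_channels % num_groups == 0
    size = num_channels // num_groups
    groups = [list(range(j * size, (j + 1) * size)) for j in range(num_groups)]
    stages = []
    for i in range(len(split_points) - 1):
        channels = []
        for j in range(split_points[i], split_points[i + 1]):
            channels += groups[j]
        stages.append(channels)
    return stages
```

For example, with 8 channels, 4 groups, and split points `[0, 1, 4]`, the first stage receives the 2 channels of group 0 and the second stage receives the remaining 6 channels.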

Concat. The concat transformation is used for enhancing the interaction between different stages. The basic idea is to enable the classifiers in later stages to reuse the features from previous stages. Formally, we use indicator matrices $I^{(1)}, \dots, I^{(L)}$ to indicate whether to enable feature reuse at different positions, where $l$ denotes the $l$-th layer and $L$ is the depth of the CNN model. The element $I^{(l)}_{ij}$ indicates whether to reuse the features of the $j$-th stage in the $i$-th stage at the $l$-th layer, i.e., $I^{(l)}_{ij} = 1$ means that the classifier in the $i$-th stage will concat all the feature maps (of the $l$-th layer) from the $j$-th stage. Note that we restrict the previous stages from concatenating the features of the later stages, i.e., $I^{(l)}_{ij} = 0$ for $j > i$. Moreover, we force the $L$-th layer (the prediction layer) to concat the features from all the previous stages. We demonstrate a concrete example in Figure 3 to illustrate how to use the above two transformations to reshape a CNN model.
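The constraints on a per-layer concat indicator can be captured in a small validity check; representing the matrix as nested lists is our own illustration:

```python
def valid_concat_indicator(indicator, is_last_layer=False):
    """Check a per-layer concat indicator matrix I, where I[i][j] == 1
    means stage i reuses the layer's features from stage j.

    Constraints from the text: earlier stages never reuse later
    stages' features (I[i][j] == 0 for j > i), and at the last layer
    every stage must concat the features from all previous stages.
    """
    s = len(indicator)
    for i in range(s):
        for j in range(s):
            if j > i and indicator[i][j] != 0:
                return False  # no reuse of later stages
            if is_last_layer and j < i and indicator[i][j] != 1:
                return False  # last layer must reuse all previous stages
    return True
```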

Architecture Search Space. Based on the above two transformations, we can generate the search space by transforming the original CNN model. Specifically, there are two adjustable settings for the two transformations: the split points and the indicator matrices. Adjusting the split points changes how the feature groups are assigned, which trades off the accuracy and computational cost of the different classifiers. For example, we can assign more features to the early stages to improve the model performance on “easy” samples. Adjusting the indicator matrices changes the feature reuse strategy. To reduce the size of the search space, we restrict the feature layers with the same resolution to use the same split and concat settings in our experiments. By changing these two settings, we can generate the search space, which consists of different multi-stage models. In the following section, we demonstrate how to search for the optimal model in the generated space.

3.3 The Details of NAS

Once we obtain the search space from the above procedure of S2D, the goal of NAS is to find the optimal model with high accuracy and low computational cost. Note that a model is jointly determined by the settings of the above two transformations, i.e., the split points and the indicator matrices. With a slight abuse of notation, we also refer to an architecture $a$ as these two settings and denote $\mathcal{A}$ as the space that consists of these different settings. Thus the optimization goal reduces to searching for the optimal settings of the proposed transformations, which maximize our predefined metric (see details in the following section).

However, searching for the optimal setting is nontrivial due to the huge search space $\mathcal{A}$. For example, in our experiment on MobileNetV2 [7], the size of the search space is prohibitively large. Motivated by the recent progress in neural architecture search (NAS) [48, 49, 14, 39], we propose to use a policy-gradient based reinforcement learning algorithm for searching. The goal of the algorithm is to optimize the policy that produces the optimal model. This process can be formulated as a nested optimization problem:

$$\max_{\theta} \; \mathbb{E}_{a \sim \pi_{\theta}} \left[ \mathcal{R}(a, w_a^*) \right] \quad \text{s.t.} \quad w_a^* = \arg\min_{w_a} \mathcal{L}(a, w_a; \mathcal{D}_{train}) \tag{1}$$

where $w_a$ is the corresponding weights of the model $a$ and $\pi_{\theta}$ is the policy which generates the settings of the transformations. $\mathcal{D}_{val}$ and $\mathcal{D}_{train}$ denote the validation and training datasets, respectively. And $\mathcal{R}$ is the reward function for evaluating the quality of the multi-stage model.

To solve the nested optimization problem in Equation 1, we need to solve two sub-problems, namely, optimizing the policy $\pi_{\theta}$ when $w_a^*$ is given and optimizing $w_a$ when the architecture $a$ is given. We first present how to optimize the policy $\pi_{\theta}$ when $w_a^*$ is given.

Optimization of the Transformation Settings. Similar to previous works [48, 49], we use a customized recurrent neural network (RNN) to generate the distribution of different transformation settings for each layer of the CNN model. Then a policy gradient based algorithm [37] is used for optimizing the parameters of the RNN to maximize the expected reward, which is defined in Equation 2. Specifically, the reward in our paper is defined as a weighted product considering both the accuracy and the computational cost:

$$\mathcal{R}(a, w_a) = \mathrm{acc}(a, \mathcal{D}_{val}) \times \mathrm{cost}(a, \mathcal{D}_{val})^{w} \tag{2}$$

where $\mathrm{acc}(a, \mathcal{D}_{val})$ is the accuracy of the multi-stage model on the dataset $\mathcal{D}_{val}$ and $\mathrm{cost}(a, \mathcal{D}_{val})$ is the average computational cost over the samples of the dataset using dynamic inference. For a fair comparison with other works on dynamic inference, we use FLOPs as the proxy of the computational cost. $w$ is a hyper-parameter that controls the trade-off between model performance and computational cost (a negative $w$ penalizes costly models). Next, we introduce how to solve the inner optimization problem, i.e., optimizing $w_a$ on the training dataset when the model is given.
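A minimal sketch of such a weighted-product reward, assuming a MnasNet-style form with a negative exponent on the cost term (the exact exponent value used in the paper is not specified here, so -0.07 below is purely illustrative):

```python
def reward(accuracy, avg_flops, w=-0.07):
    """Weighted-product reward: higher accuracy raises the reward,
    while a higher average dynamic-inference cost lowers it through
    the negative exponent w (the value -0.07 is an illustrative
    choice, not the paper's setting)."""
    return accuracy * (avg_flops ** w)
```

With this form, a model that matches another's accuracy at lower average FLOPs always receives a strictly higher reward.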

Optimization of the Multi-stage CNN. The inner optimization problem (i.e., solving for $w_a$) can be solved using the gradient descent algorithm. Specifically, we modify the normal classification loss function (i.e., the cross-entropy function) for the case of training multi-stage models. Formally, the loss function is defined as:

$$\mathcal{L}(a, w_a; \mathcal{D}_{train}) = \sum_{(x, y) \in \mathcal{D}_{train}} \sum_{i=1}^{s} \lambda_i \, \mathrm{CE}(f_i(x), y) \tag{3}$$

Here, CE denotes the cross-entropy function and $\lambda_i$ weights the loss of the classifier in the $i$-th stage. Optimizing the above equation can be regarded as jointly optimizing all the classifiers in different stages. The optimization can be implemented using stochastic gradient descent (SGD) and its variants. We use the optimized $w_a$ to assess the quality of the model generated by the RNN, which is further used for optimizing the RNN. In practice, to reduce the search time, following the previous work [49], we approximate $w_a^*$ by updating $w_a$ for only several training epochs, without solving the inner optimization problem completely by training the network until convergence.
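For a single sample, the joint multi-stage loss can be sketched in pure Python as a weighted sum of per-stage cross-entropies (the helper names are our own):

```python
import math

def cross_entropy(probs, label):
    # standard cross-entropy for one sample: -log p(correct class)
    return -math.log(probs[label])

def multi_stage_loss(stage_probs, label, lambdas=None):
    """Weighted sum of cross-entropy losses over all stage classifiers
    for one sample; stage_probs[i] is the softmax output of the
    classifier in stage i, and lambdas are the per-stage weights."""
    if lambdas is None:
        lambdas = [1.0] * len(stage_probs)
    return sum(l * cross_entropy(p, label)
               for l, p in zip(lambdas, stage_probs))
```

Summing this quantity over the training set recovers the objective of Equation 3.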

Figure 4: The process of NAS. The RNN model is responsible for outputting the policy $\pi_{\theta}$, i.e., the settings of split and concat, which further produce a model $a$. We can then optimize $w_a$ to approximate the optimal parameters $w_a^*$. The computational cost and the accuracy of $a$ are used for evaluating the generated policy $\pi_{\theta}$.

Dynamic Inference of the Searched CNN.

Once the optimal multi-stage model is found, we can directly perform dynamic inference with it. Specifically, we set a predefined threshold $t_i$ for the $i$-th stage. Then, we can use these thresholds to decide at which stage the inference should stop. Specifically, given an input sample $x$, the inference stops at the $i$-th stage ($1 \le i \le s$) when the $i$-th classifier outputs a top-1 confidence score higher than $t_i$.
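The threshold-based early-exit loop can be sketched as follows; the classifier interface and the rule that the last stage always answers are illustrative choices:

```python
def dynamic_infer(stage_classifiers, thresholds, x):
    """Run the multi-stage model stage by stage; stop at the first
    stage whose classifier's top-1 confidence reaches its threshold.

    stage_classifiers: callables returning class probabilities for x;
    thresholds: per-stage confidence thresholds (the final stage
    always produces the answer, regardless of its confidence).
    Returns (predicted_class, exit_stage_index).
    """
    for i, classifier in enumerate(stage_classifiers):
        probs = classifier(x)
        top1 = max(probs)
        if top1 >= thresholds[i] or i == len(stage_classifiers) - 1:
            return probs.index(top1), i
```

Raising a stage's threshold pushes more samples to later (costlier) stages, which is exactly the accuracy/cost knob discussed in Section 4.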

4 Experiments

To verify the effectiveness of S2DNAS, we compare it with different dynamic inference methods on different CNN models. Our experiments have covered a wide range of previous methods of dynamic inference [5, 46, 31, 40]. We also evaluate different aspects of S2DNAS, which are presented in the discussion part.

Model Method CIFAR-10 CIFAR-100
FLOPs Reduction Accuracy FLOPs Reduction Accuracy
ResNet-20 Original 41M - 91.25% 41M - 67.78%
LCCL 30M 28% 90.95% 40M 1% 68.26%
BlockDrop 45M -11% 91.31% 53M -29% 67.39%
Naive 34M 18% 91.27% 39M 5% 66.77%
BranchyNet 33M 20% 91.37% 45M -9% 67.00%
S2DNAS 16M 61% 91.41% 25M 39% 67.29%
ResNet-56 Original 126M - 93.03% 126M - 71.32%
LCCL 102M 19% 92.99% 106M 16% 70.33%
BlockDrop 74M 41% 92.98% 129M -2% 72.39%
Naive 68M 46% 92.78% 108M 14% 71.58%
BranchyNet 73M 42% 92.51% 120M 5% 71.22%
S2DNAS 37M 71% 92.42% 62M 51% 71.20%
ResNet-110 Original 254M - 93.57% 254M - 73.55%
LCCL 166M 35% 93.44% 210M 17% 72.72%
BlockDrop 76M 70% 93.00% 153M 40% 73.70%
Naive 158M 38% 93.13% 217M 15% 73.06%
BranchyNet 147M 42% 93.33% 243M 5% 73.25%
S2DNAS 76M 70% 93.39% 113M 56% 73.06%
VGG16-BN Original 313M - 93.72% 313M - 72.93%
LCCL 269M 14% 92.75% 264M 16% 70.46%
Naive 185M 41% 93.34% 202M 36% 72.78%
BranchyNet 162M 48% 93.39% 239M 24% 72.39%
S2DNAS 66M 79% 93.51% 104M 67% 72.00%
MobileNetV2 Original 91M - 93.89% 91M - 74.21%
LCCL 77M 15% 93.13% 73M 20% 71.11%
Naive 38M 58% 91.90% 61M 33% 74.03%
BranchyNet 35M 61% 91.76% 74M 18% 73.71%
S2DNAS 25M 73% 92.25% 39M 57% 73.50%
Table 1: Evaluations on CNN models with different architectures. The number in the Reduction column denotes the relative cost reduction compared with the original model. Some results are missing because there is no implementation of these CNN models in the reference papers.

4.1 Experiment Settings

Model Setup. In our experiments, we conduct experiments on three CNN architectures: ResNet [13], VGG [38], and MobileNetV2 [36]. Moreover, to compare with MSDNet, we devise a DenseNet-like model (see more details in the appendix) which has a similar structure to the MSDNet model. We then apply S2DNAS to the devised model. We use the same RNN for all our experiments and the details of the RNN are presented in the appendix.

Training Details. The CIFAR [20] dataset contains 50k training images and 10k test images. We randomly choose 5k images from the training images as the validation dataset and leave the other 45k images as the training dataset. We use the same input preprocessing for both CIFAR-10 and CIFAR-100. To be specific, the training images are zero-padded with 4 pixels and then randomly cropped to 32x32 resolution. Random horizontal flipping is used for data augmentation.

For the training of the RNN, the PPO algorithm [37] is used, and we use Adam [19] as the optimizer to perform the parameter updates of the RNN. The details of the hyper-parameter settings can be found in the appendix. For the training of the multi-stage model, we use SGD with momentum as the optimizer. The initial learning rate is divided by a constant factor twice during training. More details of the training settings of the different models can be found in the appendix.

For the hyper-parameters of S2DNAS, we use the same group number $g$ for every layer and set the number of stages to $s = 3$. For the comparison with MSDNet, which contains 5 stages, we set $s = 5$ when performing S2DNAS on the devised model. The trade-off hyper-parameter $w$ in Equation 2 and the coefficients $\lambda_i$ in Equation 3 are fixed across all experiments.

4.2 Classification Results

In this part, we compare our method with other methods of dynamic inference. To give a comprehensive study of our method, we have covered a wide range of methods, including LCCL [5], BlockDrop [46], Naive [31] and BranchyNet [40]. We conduct experiments on two widely-used image classification benchmarks, CIFAR-10 and CIFAR-100. To show the effectiveness of S2DNAS in reducing the computational cost of CNN models with different architectures, we apply S2DNAS to five typical CNNs with various depth, width, and sub-structures.

The overall results are shown in Table 1. Note that different thresholds ( defined in the previous section) lead to different trade-offs between model accuracy and the computational cost. In our experiments, we chose the threshold which leads to the highest reward on the validation dataset. We also provide further results of using different thresholds in the discussion subsection.

As shown in Table 1, for most architectures and tasks, our method (denoted as S2DNAS in Table 1) can significantly reduce the computational cost while achieving accuracy comparable to the original CNN model. As mentioned above, we use the average FLOPs on the whole test dataset as the metric to measure the computational cost of a given CNN model. For ResNet-20 on CIFAR-10, S2DNAS reduces the computational cost of the original net from 41M to 16M FLOPs without an accuracy drop (even with a slight increase, as shown in Table 1), which corresponds to a relative cost reduction of 61%.

Our method also shows improvements over other dynamic inference methods in terms of computational cost reduction. We have reproduced the previous works on these CNN models for comparison. We have also implemented a normal early exiting solution (marked as Naive in Table 1), i.e., directly adding prediction layers (i.e., global average pooling and fully-connected layers) at intermediate layers of the original models. For example, for ResNet-20 on CIFAR-10, compared with BranchyNet [40], our method achieves a slight accuracy improvement (from 91.37% to 91.41%) with a much larger computational cost reduction.

One interesting observation is that some methods even cause an increase in computational cost. For example, BlockDrop increases the FLOPs of the original ResNet-20 by about 11%. We infer that this is caused by the computationally expensive controller that BlockDrop introduces into the inference process [46]. We also notice that some of the previous works cannot be applied to networks without residual connections. For instance, BlockDrop cannot be applied to VGG16-BN. In contrast, our method generalizes to CNNs without residual connections. As Table 1 shows, our method reduces the computational cost of the original VGG16-BN net by 79% on CIFAR-10 with only a slight accuracy drop.

Comparison to MSDNet. As mentioned in the introduction, a recent work proposed a specialized CNN named MSDNet for dynamic inference. Since the method cannot be directly applied to general CNN models, for comparison with MSDNet we design a DenseNet-like [18] model based on the prior work [17], which has a similar structure to MSDNet. More details of the devised model can be found in the appendix. We then apply S2DNAS to it and generate the dynamic models. The results are plotted in Figure 5. The varying FLOPs on the x-axis are obtained by adjusting the thresholds of each classifier of the dynamic CNN models. As Figure 5 shows, in most cases, our method can achieve similar accuracy-computation trade-offs. In the case of CIFAR-10, MSDNet outperforms our method when the FLOPs budget is around 15M. However, the superiority of MSDNet comes at the cost of manually designing the CNN architecture. In contrast, as Table 1 shows, our method can be applied to various general CNN models.

Figure 5: Comparison to MSDNet.
Dataset Model Stage1 Stage2 Stage3
Accuracy Fractions Accuracy Fractions Accuracy Fractions
CIFAR-10 ResNet-20 98.44% 10.24% 98.89% 41.59% 83.45% 48.17%
ResNet-56 98.25% 67.50% 89.72% 11.19% 75.36% 21.31%
ResNet-110 98.43% 61.66% 93.22% 22.28% 74.28% 16.06%
VGG-16BN 96.54% 87.29% 91.44% 2.22% 68.73% 10.49%
MobileNetV2 98.62% 50.59% 94.04% 33.21% 68.70% 16.20%
CIFAR-100 ResNet-20 85.27% 58.72% 54.11% 20.68% 29.27% 20.60%
ResNet-56 97.13% 22.27% 86.64% 29.86% 49.51% 47.87%
ResNet-110 95.83% 28.04% 85.05% 28.90% 50.19% 43.06%
VGG-16BN 97.21% 7.18% 90.08% 44.97% 51.20% 47.85%
MobileNetV2 97.81% 8.68% 90.38% 45.84% 51.85% 45.48%
Table 2: Accuracy and fractions of samples in test dataset that exit from each stage.

4.3 Discussion

Here, we present some discussions on our method for providing further insights.

Trade-off of Accuracy and Computational Cost. A key hyper-parameter of dynamic inference is the threshold setting $\{t_1, \dots, t_s\}$, where $s$ is the number of stages. Once the model is trained, different threshold settings lead to different trade-offs between accuracy and computational cost. To demonstrate how the thresholds affect the final model performance, we conduct experiments with different thresholds and plot the results in Figure 6. All these results show the trend that an increase in computational cost leads to a performance boost. Thus, for practical use, we can set the thresholds based on the computational budget of the given hardware device. Moreover, this property also helps to solve the anytime prediction task proposed in the prior work [17].

Difficulty Distribution of Test Dataset. The basic idea of our method is to exit “easy” samples in the early stages. In this part, we give the statistics of all the samples in the test dataset ($s = 3$, i.e., there are three stages in the trained model). As shown in Table 2, for ResNet-20 on CIFAR-10/100, about 52% and 79% of the test samples, respectively, exit from the first two stages. As a result, S2DNAS can considerably reduce the average computational cost. Further, we observe that the accuracy of the classifier in the first stage is much higher than that of the classifiers in later stages, which indicates that the classifier can easily classify those samples, i.e., those samples are “easy” samples. This observation also validates the intuition (“easy” samples can be classified using fewer computations) pointed out by some recent works [2, 5, 17, 8].

Figure 6: Trade off accuracy with computational cost by adjusting thresholds of different stages.

5 Conclusion

In this paper, we present a general framework called S2DNAS, for transforming various static CNN models into multi-stage models to support dynamic inference. Empirically, our method can be applied to various CNN models to reduce the computational cost, without sacrificing model performance. In contrast to previous methods for dynamic inference, our method comes with two advantages: (1) With our method, we can obtain a dynamic model generated from an existing CNN model instead of manually re-designing a new CNN architecture. (2) The inference of the generated dynamic model does not introduce irregular computations or complex controllers. Thus the generated model can be easily deployed on various hardware devices using existing deep learning frameworks.

These advantages are appealing for deploying a given CNN model on hardware devices with limited computational resources. To be specific, we can first use S2DNAS to transform the given model into a dynamic one and then deploy it on the hardware device. Moreover, our method is orthogonal to previous pruning/quantization methods, which can further reduce the computational cost of the given CNN model. All these properties imply a wide range of application scenarios where efficient CNN inference is desired.

Appendix A Details of RNN Model and its Optimization

The RNN model contains a GRU layer with 64 hidden units, a set of predictors, and an embedding layer. The predictors output the probabilities of the different transformation settings (split settings and concat settings) for each layer. The number of different split settings in a layer is the combination number C(G-1, M-1), in which G is the number of groups and M is the number of stages; a predictor (a fully-connected layer followed by a softmax function) outputs the probabilities of selecting among these settings. For the concat settings, another predictor (a fully-connected layer followed by a logistic function) outputs the probability of each concat location. Finally, the embedding layer turns the settings sampled at the previous step into dense vectors of fixed size, which serve as the input of the GRU layer.
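Assuming a split assigns the G ordered channel groups contiguously to the M stages, i.e., choosing M-1 cut points among the G-1 gaps between groups (our reading of the combination number above), the output size of the split predictor can be sketched as:

```python
from math import comb

def num_split_settings(G, M):
    """Number of ways to split G ordered channel groups into M
    contiguous, non-empty stages: choose M-1 cut points among
    the G-1 gaps between adjacent groups."""
    return comb(G - 1, M - 1)
```

For example, with G=8 groups and M=3 stages the softmax predictor for a layer would output a distribution over 21 split settings.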

We employ Proximal Policy Optimization (PPO) to optimize the parameters of the RNN, using Adam with a learning rate of 0.001. For PPO, the number of epochs is set to 4, the clip parameter to 0.1, the mini-batch size to 4, the coefficient of the value function loss to 0.5, and the entropy coefficient to 0.01.
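The clipped surrogate objective behind these settings can be written per sample as follows; this is a simplified scalar sketch using the coefficients above, and `ppo_loss` with its arguments is an illustrative name, not the paper's code:

```python
def ppo_loss(ratio, advantage, value_err, entropy,
             clip=0.1, vf_coef=0.5, ent_coef=0.01):
    """PPO loss for one sample: clipped policy surrogate, plus the
    value-function error weighted by 0.5, minus an entropy bonus
    weighted by 0.01. `ratio` is pi_new(a|s) / pi_old(a|s)."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip), 1.0 - clip) * advantage
    policy_loss = -min(unclipped, clipped)
    return policy_loss + vf_coef * value_err - ent_coef * entropy
```

The clipping keeps each policy update close to the sampling policy, which stabilizes training of the controller over the 4 PPO epochs per batch of sampled architectures.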

Appendix B Details of DenseNet-like Model

To compare with MSDNet, a DenseNet-like model is devised. Specifically, we modify DenseNet-BC (k=8, depth=100) by doubling the growth rate after each transition layer and halving the number of output channels of the convolution in the bottleneck layers. We denote the resulting model as DenseNet*.

Appendix C Details of Training Settings

During the network architecture search, we optimize each sampled multi-stage model for 6 epochs on the training dataset to approximate its final performance. 10k models are sampled from the architecture search space for each experiment. Then the models with the top-10 rewards are fully trained. Table 3 lists the hyper-parameters of full training for the different architectures. Learning-rate warm-up is used for the first 100 iterations.
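This sample-score-select procedure (short training as a cheap reward proxy, then full training of the best candidates) can be sketched as follows; `sample_architecture` and `evaluate_reward` are hypothetical callables standing in for the RNN controller and the short-training reward:

```python
def search(sample_architecture, evaluate_reward,
           num_samples=10000, top_k=10):
    """Sample architectures from the controller, score each with the
    (short-trained) reward, and keep the top-k for full training."""
    scored = []
    for _ in range(num_samples):
        arch = sample_architecture()
        scored.append((evaluate_reward(arch), arch))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [arch for _, arch in scored[:top_k]]
```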

Model Datasets Batch size Training epochs Weight decay
ResNet CIFAR 128 200 1e-4
VGG CIFAR 64 200 5e-4
MobileNetV2 CIFAR 128 300 1e-4
DenseNet* CIFAR 64 300 1e-4
Table 3: Training Settings

Appendix D Demonstration of a Searched Multi-stage Model

Table 4 demonstrates the structure of the searched multi-stage ResNet-56 model on CIFAR-10. From this table, we can see that each layer is split into three stages and each stage contains a subset of the original channels. Different stages are concatenated at different layers, so the feature maps generated by previous layers are reused in later stages. The accumulated FLOPs increase from 21M to 90M across the three stages. As a result, substantial computation is saved when “easy” input samples stop at stage 1 or stage 2.

layer name	per-stage structure	concat settings
conv1	3×3 conv; channels split 6 / 2 / 8 into stages 1–3	–
conv2_x	(split residual blocks)	stage2-stage1, stage3-stage1
conv3_x	(split residual blocks)	stage2-stage1
conv4_x	(split residual blocks)	stage3-stage2
prediction layers	average pool, fully-connected, softmax (one classifier per stage)	stage2-stage1, stage3-stage1, stage3-stage2
accumulated FLOPs	21M (stage 1), 30M (stage 2), 90M (stage 3)
Table 4: The structure of the searched multi-stage ResNet-56 model on CIFAR-10. Each entry gives the kernel size and the number of output channels of the corresponding building block.
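Given the accumulated per-stage FLOPs of Table 4, the average inference cost follows directly from the exit distribution; the exit fractions below are hypothetical values for illustration:

```python
def expected_flops(exit_fractions, accumulated_flops):
    """Average inference cost: each sample that exits at stage m pays
    the FLOPs accumulated up to and including that stage."""
    return sum(p * f for p, f in zip(exit_fractions, accumulated_flops))
```

For instance, if 50% of samples exit at stage 1, 30% at stage 2, and 20% at stage 3, the average cost is 37.5M FLOPs, well below the 90M of always running all three stages.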

Footnotes

  2. In this paper, the classifier refers to the whole sub-network in the current stage.
  3. We do not split the input layer.
  4. Omit the batch normalization and pooling layers.
  5. Refer to the last layer of the classifier for prediction.
  6. Here, we regard one multiply-accumulate (MAC) as one floating-point operation (FLOP).
  7. We use the batch normalization after each convolution layer in VGG and change the stride of the first convolution layer in MobileNetV2 from 2 to 1 for CIFAR.
  8. Here, we only consider samples that exit from this stage.

References

  1. K. Berestizshevsky and G. Even (2019) Dynamically sacrificing accuracy for reduced computation: cascaded inference based on softmax confidence. In Artificial Neural Networks and Machine Learning - ICANN 2019: Deep Learning - 28th International Conference on Artificial Neural Networks, Munich, Germany, September 17-19, 2019, Proceedings, Part II, pp. 306–320. External Links: Link, Document Cited by: §1, §2.
  2. T. Bolukbasi, J. Wang, O. Dekel and V. Saligrama (2017) Adaptive neural networks for efficient inference. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 527–536. External Links: Link Cited by: §1, §4.3.
  3. S. Cao, L. Ma, W. Xiao, C. Zhang, Y. Liu, L. Zhang, L. Nie and Z. Yang (2019) SeerNet: predicting convolutional neural network feature-map sparsity through low-bit quantization. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 11216–11225. External Links: Link Cited by: §2.
  4. L. Chen, G. Papandreou, I. Kokkinos, K. Murphy and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), pp. 834–848. External Links: Link, Document Cited by: §1.
  5. X. Dong, J. Huang, Y. Yang and S. Yan (2017) More is less: A more complicated network with less inference complexity. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1895–1903. External Links: Link, Document Cited by: §1, §1, §2, §4.2, §4.3, §4.
  6. M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. P. Vetrov and R. Salakhutdinov (2017) Spatially adaptive computation time for residual networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1790–1799. External Links: Link, Document Cited by: §1.
  7. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §3.3.
  8. X. Gao, Y. Zhao, L. Dudziak, R. Mullins and C. Xu (2019) Dynamic channel pruning: feature boosting and suppression. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §1, §1, §2, §4.3.
  9. S. Gupta, A. Agrawal, K. Gopalakrishnan and P. Narayanan (2015) Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 1737–1746. External Links: Link Cited by: §1, §2.
  10. S. Han, H. Mao and W. J. Dally (2016) Deep compression: compressing deep neural network with pruning, trained quantization and huffman coding. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, External Links: Link Cited by: §2.
  11. S. Han, J. Pool, J. Tran and W. J. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1135–1143. External Links: Link Cited by: §1, §2.
  12. K. He, G. Gkioxari, P. Dollár and R. B. Girshick (2017) Mask R-CNN. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2980–2988. External Links: Link, Document Cited by: §1.
  13. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. External Links: Link, Document Cited by: §1, §1, §4.1.
  14. Y. He, J. Lin, Z. Liu, H. Wang, L. Li and S. Han (2018) AMC: automl for model compression and acceleration on mobile devices. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, pp. 815–832. External Links: Link, Document Cited by: §1, §2, §3.3.
  15. W. Hua, C. D. Sa, Z. Zhang and G. E. Suh (2018) Channel gating neural networks. CoRR abs/1805.12549. External Links: Link, 1805.12549 Cited by: §1, §2.
  16. W. Hua, Y. Zhou, C. D. Sa, Z. Zhang and G. E. Suh (2019) Boosting the performance of CNN accelerators with dynamic fine-grained channel gating. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019, Columbus, OH, USA, October 12-16, 2019., pp. 139–150. External Links: Link, Document Cited by: §1.
  17. G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten and K. Q. Weinberger (2018) Multi-scale dense networks for resource efficient image classification. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §1, §1, §1, §2, §4.2, §4.3, §4.3.
  18. G. Huang, Z. Liu, L. van der Maaten and K. Q. Weinberger (2017) Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2261–2269. External Links: Link, Document Cited by: §1, §4.2.
  19. D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §4.1.
  20. A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.1.
  21. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., pp. 1106–1114. External Links: Link Cited by: §1.
  22. Y. LeCun, J. S. Denker and S. A. Solla (1989) Optimal brain damage. In Advances in Neural Information Processing Systems 2, [NIPS Conference, Denver, Colorado, USA, November 27-30, 1989], pp. 598–605. External Links: Link Cited by: §1, §2.
  23. H. Li, A. Kadav, I. Durdanovic, H. Samet and H. P. Graf (2017) Pruning filters for efficient convnets. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §1, §2.
  24. H. Li, H. Zhang, X. Qi, R. Yang and G. Huang (2019) Improved techniques for training adaptive deep networks. CoRR abs/1908.06294. External Links: Link, 1908.06294 Cited by: §2.
  25. X. Li, Z. Liu, P. Luo, C. C. Loy and X. Tang (2017) Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 6459–6468. External Links: Link, Document Cited by: §2.
  26. D. D. Lin, S. S. Talathi and V. S. Annapureddy (2016) Fixed point quantization of deep convolutional networks. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 2849–2858. External Links: Link Cited by: §2.
  27. C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille and F. Li (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 82–92. External Links: Link Cited by: §2.
  28. H. Liu, K. Simonyan and Y. Yang (2019) DARTS: differentiable architecture search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §2.
  29. L. Liu and J. Deng (2018) Dynamic deep neural networks: optimizing accuracy-efficiency trade-offs by selective execution. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 3675–3682. External Links: Link Cited by: §1.
  30. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu and A. C. Berg (2016) SSD: single shot multibox detector. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, pp. 21–37. External Links: Link, Document Cited by: §1.
  31. P. Panda, A. Sengupta and K. Roy (2016) Conditional deep learning for energy-efficient and enhanced pattern recognition. In 2016 Design, Automation & Test in Europe Conference & Exhibition, DATE 2016, Dresden, Germany, March 14-18, 2016, pp. 475–480. External Links: Link Cited by: §1, §2, §4.2, §4.
  32. M. Rastegari, V. Ordonez, J. Redmon and A. Farhadi (2016) XNOR-net: imagenet classification using binary convolutional neural networks. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pp. 525–542. External Links: Link, Document Cited by: §1, §2.
  33. J. Redmon, S. K. Divvala, R. B. Girshick and A. Farhadi (2016) You only look once: unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 779–788. External Links: Link, Document Cited by: §1.
  34. M. Ren, A. Pokrovsky, B. Yang and R. Urtasun (2018) SBNet: sparse blocks network for fast inference. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8711–8720. External Links: Link, Document Cited by: §2.
  35. S. Ren, K. He, R. B. Girshick and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 91–99. External Links: Link Cited by: §1.
  36. M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 4510–4520. External Links: Link, Document Cited by: §4.1.
  37. J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. External Links: Link, 1707.06347 Cited by: §1, §3.3, §4.1.
  38. K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §4.1.
  39. M. Tan, B. Chen, R. Pang, V. Vasudevan and Q. V. Le (2018) MnasNet: platform-aware neural architecture search for mobile. CoRR abs/1807.11626. External Links: Link, 1807.11626 Cited by: §1, §3.3.
  40. S. Teerapittayanon, B. McDanel and H. T. Kung (2016) BranchyNet: fast inference via early exiting from deep neural networks. In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016, pp. 2464–2469. External Links: Link, Document Cited by: §1, §1, §2, §4.2, §4.2, §4.
  41. A. Veit and S. J. Belongie (2018) Convolutional networks with adaptive inference graphs. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pp. 3–18. External Links: Link, Document Cited by: §1, §2.
  42. X. Wang, F. Yu, Z. Dou, T. Darrell and J. E. Gonzalez (2018) SkipNet: learning dynamic routing in convolutional networks. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, pp. 420–436. External Links: Link, Document Cited by: §1, §2.
  43. W. Wen, C. Wu, Y. Wang, Y. Chen and H. Li (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 2074–2082. External Links: Link Cited by: §1.
  44. B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia and K. Keutzer (2019) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 10734–10742. External Links: Link Cited by: §1, §2.
  45. J. Wu, C. Leng, Y. Wang, Q. Hu and J. Cheng (2016) Quantized convolutional neural networks for mobile devices. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 4820–4828. External Links: Link, Document Cited by: §1, §2.
  46. Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman and R. S. Feris (2018) BlockDrop: dynamic inference paths in residual networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8817–8826. External Links: Link, Document Cited by: §1, §1, §2, §4.2, §4.2, §4.
  47. J. Yu, A. Lukefahr, D. J. Palframan, G. S. Dasika, R. Das and S. A. Mahlke (2017) Scalpel: customizing DNN pruning to the underlying hardware parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, Toronto, ON, Canada, June 24-28, 2017, pp. 548–560. External Links: Link Cited by: §1.
  48. B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §2, §3.3, §3.3.
  49. B. Zoph, V. Vasudevan, J. Shlens and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8697–8710. External Links: Link, Document Cited by: §2, §3.3, §3.3, §3.3.