# Doubly Nested Network for Resource-Efficient Inference

## Abstract

We propose the doubly nested network (DNNet), in which every neuron represents its own sub-model that solves the same task. Every sub-model is nested both layer-wise and channel-wise. While nesting sub-models layer-wise is straightforward with deep supervision as proposed in [21], channel-wise nesting has not, to the best of our knowledge, been explored in the literature. Channel-wise nesting is non-trivial because neurons in consecutive layers are all connected to each other. In this work, we solve this problem by sorting channels topologically and connecting neurons accordingly. For this purpose, channel-causal convolutions are used. Slicing a doubly nested network gives a working sub-network. The most notable application of the proposed structure and its slicing operation is resource-efficient inference. At test time, computing resources such as the time and memory available for running the prediction algorithm can vary significantly across devices and applications. Given a budget constraint, we can slice the network accordingly and use a sub-model for inference within budget, requiring no additional computation such as training or fine-tuning after deployment. We demonstrate the effectiveness of our approach in several practical scenarios of using available resources efficiently.

## 1 Introduction

Deep neural network (DNN) based methods have shown remarkable performance across a wide variety of fields [17, 3, 11, 16, 2]. Notable deep learning architectures such as ResNet [6] and DenseNet [10] stack layers very deep or very wide and achieve very high performance as a result. Most methods that have performed well in competitions or on benchmark datasets have elaborate architectures whose capacity is typically too high for them to be deployed in practice. Computing power is never free, and therefore applying sophisticated but resource-intensive models directly to real environments with time or memory budget constraints is impractical, especially on mobile devices or embedded systems.

To address this issue, techniques have been proposed that improve memory or time efficiency by eliminating redundancy in an existing network model. Among them, pruning is the most widely used; it usually cuts back the connections between layers [1, 23], the number of layers [19], or the number of channels [7]. Another strategy uses low-rank approximation of weight tensors, minimizing the difference between the original network and the reconstructed reduced network [12, 22]. In addition, there has been growing interest in knowledge distillation based methods, which build a compact network from a pre-trained large network while maintaining the performance of the large network [8]. Overall, these techniques have in common that, in order to reduce the redundancy of the original model, the compact model must be re-trained each time a new specification or constraint is required.

Although the above techniques have shown that a learned model can be applied in low-capacity environments, one issue has rarely been addressed. In practice, a relatively large original network needs to be deployed to numerous devices with different specifications or budget constraints, rather than once to a single specific device. Unfortunately, conventional techniques re-train the original network every time the environment specification changes in terms of memory or time, so applying them is inefficient and infeasible when the target environments are diverse.

The most challenging scenario with regard to the resource budget is when both the time and the memory available for inference change dynamically, as shown in Fig. 1. If the deployed model would use time or memory exceeding the limits, we need a mechanism that adapts the model to use only the available resources while keeping its performance as close to the original as possible. It is widely known that the number of layers is one of the dominant factors determining inference time, and that the number of channels dominantly affects the maximum required memory. While there are several directions for reducing time and memory usage dynamically, a baseline is to prepare many different models with varying resource requirements. This requires training models many times, and storing all of them consumes too much space. One can expect models that differ only slightly in the number of layers or channels to be largely similar. Ideally, all models would share as many parameters as possible to reduce both training time and the memory needed to store them.

To resolve the aforementioned issues, we propose a neural network architecture called the doubly nested network (DNNet), in which all models share parameters maximally by being nested all-in-one. Nesting models layer-wise is not a new concept. It was proposed in [21], where each layer has its own side output and is deeply supervised using [13].

While nesting models layer-wise is relatively straightforward, nesting them channel-wise is challenging. One reason for the difficulty is that all channels in two consecutive layers are fully connected, so slicing them seems unnatural. To resolve this issue, we propose to sort channels topologically so that some channels are relatively more important than others. This can be implemented with channel-causal convolution. Since information flows from lower-indexed channels to higher-indexed channels and not vice versa, prediction with only the lowest channels still works.

Slicing DNNet does not require further re-training to satisfy each new specification or budget constraint. Allowing both vertical and horizontal slicing, which follows from the doubly nested structure, is very important because in some scenarios time (usually correlated with computational cost) is the limiting factor, while in others memory is the limiting factor, as illustrated in Fig. 1.

## 2 Related Works

Along with the growth of deep neural networks, considerable effort has been devoted to reducing the redundancy of an original model or to proposing resource-efficient models for environments that require low memory usage as well as fast inference. Early investigations focused on network pruning, which is a conventional way to reduce the time and space complexity of a network. Han et al. [5, 4] proposed an approach that iteratively prunes and retrains the network using regularization [5], and also showed that further reduction could be achieved by combining it with trained quantization [4]. Structured or coarse-grained sparsity has also been studied to achieve actual speed-up in common parallel processing environments, including channel-level pruning [7, 19] and layer-level pruning [20].

Although the above techniques have shown that a learned model can be applied in low-capacity environments, they commonly require retraining or fine-tuning to build a model that meets continuously changing budget constraints. To address this issue, some pioneering approaches have been proposed that target inference speed or memory footprint. Runtime neural pruning (RNP) [14] is similar to our method in that both can dynamically adjust the channel size per layer at run time. However, unlike our model, which can be deployed after slicing according to the budget constraint, RNP preserves the full architecture of the original network and prunes adaptively, which is beneficial only in terms of time and not in terms of parameter memory. In seeking a single network that can maximize accuracy on various devices with different computational constraints, Multi-Scale DenseNet (MSDNet) [9] is closely related to our approach. Its multi-scale architecture produces features at all scales (fine-to-coarse), which facilitates good classification early on but also extracts low-level features that only become useful after several more layers of processing. To slice the original network along the depth axis (i.e., layer-wise), MSDNet uses multiple early-exit classifiers in the middle of the layers. In contrast, we devise a slicing technique along the width axis (i.e., channel-wise), which has not been well investigated previously, and ultimately propose an architecture that can be sliced with a higher degree of freedom, taking both width and depth into consideration.

Selecting the optimal architecture for a given task has been one of the long-standing challenges in designing neural network-based algorithms. A few previous studies have addressed this problem by giving more flexibility to the selection of the width and depth of the network. In [15], a three-dimensional trellis-like architecture called "convolutional neural fabrics" was introduced to tackle this challenge by training a large network that conceptually embeds multiple architectures with different architectural hyper-parameters (e.g., the stride at each layer, the number of channels at each layer). This architecture has been adopted as a baseline in an algorithm for discovering cost-constrained optimal architectures [18]. Our work also benefits from a large degree of freedom in choosing the neural network architecture, but differs from these works in that they do not consider extracting partial models with various computation and memory costs from a large model without retraining or fine-tuning.

## 3 Proposed method

### 3.1 Network architecture of DNNet

Our goal is to design a single network that can be sliced and used according to a variety of specifications or budget constraints without re-training. In conventional convolutional neural networks (CNNs), there are full connections with trainable weights between consecutive layers, in the convolution layers as well as in the fully connected layers. Looking closer at the convolution operation between consecutive layers, the output activations of a specific layer are computed by aggregating all output activations of the previous layer, as shown in the left of Fig. 2 (a). That is, simply omitting some channels from the input features of a specific layer may generate unpredictable output activations, which are then propagated through the following layers and end in severe performance degradation.

To solve this problem, we suggest a new neural network architecture with restricted synaptic connections between two consecutive layers, so that a sliced partial model no longer depends on outputs from parts of the network that are not included in it. More specifically, we divide the whole network into a predefined number of groups along the channels, and design channel-causal convolution filters so that the $j$-th channel group in one layer is calculated only from the activation values of the first through $j$-th channel groups in the previous layer. This specialized architecture enables a partial model having only the $j$-th or lower channel groups to work independently of the $(j+1)$-th or higher channel groups, both in training and in inference. In this way, the convolution filters are connected in series, and the classifiers are constructed in a similar way. Concretely, the channels of the feature maps passing through the last convolution filter are conditionally connected to as many classifiers as there are channel groups, so that the $j$-th classifier receives its input from the first through the $j$-th channel group, as can be seen in the center of Fig. 2 (a).
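The channel-causal connectivity described above can be illustrated with a masked 1x1 convolution evaluated at a single spatial position. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names (`group_of`, `causal_mask`, `causal_layer`), the layer sizes, and the random weights are all hypothetical. It checks that keeping only the first `k` channels of every layer reproduces exactly the outputs that the full model computes for those channels.

```python
import random


def group_of(channel, group_size):
    """Index of the channel group that a channel belongs to."""
    return channel // group_size


def causal_mask(n_out, n_in, group_size):
    """mask[o][i] is 1 when input channel i may feed output channel o."""
    return [[1 if group_of(i, group_size) <= group_of(o, group_size) else 0
             for i in range(n_in)]
            for o in range(n_out)]


def causal_layer(x, weights, mask):
    """Masked 1x1 convolution at one spatial position, followed by ReLU."""
    out = []
    for w_row, m_row in zip(weights, mask):
        s = sum(w * m * v for w, m, v in zip(w_row, m_row, x))
        out.append(max(0.0, s))
    return out


random.seed(0)
C, GROUP_SIZE = 8, 2  # illustrative sizes: 8 channels in 4 groups
W1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C)]
W2 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C)]
M = causal_mask(C, C, GROUP_SIZE)
x = [random.uniform(0, 1) for _ in range(C)]

# Full model: two stacked channel-causal layers.
full = causal_layer(causal_layer(x, W1, M), W2, M)

# Sliced model: keep only the first k channels (two channel groups) everywhere.
k = 4
W1_s = [row[:k] for row in W1[:k]]
W2_s = [row[:k] for row in W2[:k]]
M_s = [row[:k] for row in M[:k]]
sliced = causal_layer(causal_layer(x[:k], W1_s, M_s), W2_s, M_s)

# The sliced model agrees with the full model on the channels it keeps.
same = all(abs(a - b) < 1e-12 for a, b in zip(full[:k], sliced))
```

In this sketch the slice width `k` must be a multiple of `GROUP_SIZE`, since channels within a group are connected to each other in both directions.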

Our final goal is to propose a structure that can be sliced layer-wise as well as channel-wise. To this end, we add intermediate classifiers to consecutive layers in the network so that learned feature maps from the previous layer can be reused in subsequent layers, which leads to an architecture that is sliceable along the layer axis. As a result, we assign classifiers to each layer (to each residual block in the ResNet case), where the classifier for each layer is constructed in the channel-conditional manner described above, as can be seen in the right of Fig. 2 (a). Multi-scale schemes or dense connections between layers, which can deliver high performance at a relatively high cost (e.g., in memory or inference time), could also be considered for layer-wise slicing. However, the focus of this paper is to propose a sliceable architecture and ultimately to identify the relationship between the two slicing criteria (i.e., layer and channel), not performance enhancement itself. Therefore, complementing the proposed architecture with more elaborate modules is left as future work. Finally, we obtain an architecture that can be sliced along both axes (i.e., layer and channel), with a total of $L \times G$ classifiers through a combination of $L$ layers and $G$ channel groups, as illustrated in Fig. 3. As can be seen in the figure, the convolved feature maps of each layer group pass through global average pooling, which outputs a single vector whose number of elements is the same as the number of channels of the feature map.
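To make the two slicing axes concrete, the following sketch enumerates every (layers, channel groups) slice and picks the most accurate one that fits a time and a memory budget. Everything numeric here is a hypothetical placeholder: the accuracy table, the cost functions (time assumed proportional to layers times channel groups squared, memory assumed proportional to channel groups), and the budgets are illustrative assumptions only, and `sub_models` and `pick_slice` are names introduced for this example.

```python
def sub_models(num_layers, num_groups):
    """Enumerate every (layers, channel groups) slice of the full network."""
    return [(l, g)
            for l in range(1, num_layers + 1)
            for g in range(1, num_groups + 1)]


def pick_slice(accuracy, time_cost, mem_cost, time_budget, mem_budget):
    """Return the most accurate slice that fits both budgets, or None."""
    feasible = [s for s in accuracy
                if time_cost(s) <= time_budget and mem_cost(s) <= mem_budget]
    return max(feasible, key=lambda s: accuracy[s]) if feasible else None


def time_cost(s):
    """Assumed cost model: time grows with layers x (channel groups)^2."""
    return s[0] * s[1] ** 2


def mem_cost(s):
    """Assumed cost model: peak memory grows with the number of channel groups."""
    return s[1]


slices = sub_models(16, 16)  # one classifier per slice

# Placeholder accuracy table; in practice these values would be measured
# on held-out data for every classifier of the trained network.
accuracy = {(l, g): 1.0 - 1.0 / (l * g + 1) for (l, g) in slices}

best = pick_slice(accuracy, time_cost, mem_cost, time_budget=512, mem_budget=8)
```

Because every slice is already trained, this selection is a table lookup at deployment time and involves no re-training.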

### 3.2 Training 2D sliceable networks

The classifiers used in the training phase make it possible to generate partial models that can be sliced from the original architecture. Although our model contains partial models with their own computational and memory requirements, we train all partial models simultaneously using a standard training technique rather than handling each partial model separately with its own loss function. This is enabled by adopting as many classifier nodes as there are possible partial models, together with a single aggregate loss function that encompasses the loss values from all of these classifiers.

**Loss function details** The loss of the overall architecture is the sum of the losses of the partial models, and we first describe the loss of partial models along the channel axis. Our proposed network works with any learning problem, such as classification or regression, as long as a typical deep neural network is applicable. For the sake of simplicity, we explain the setting of supervised classification; generalization to other settings follows naturally.

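Suppose that the size of the last feature map after global average pooling is $C$ and the total number of class labels is $K$. In conventional CNNs, logits are calculated by multiplying the feature map $x \in \mathbb{R}^{C}$ by a weight matrix $W \in \mathbb{R}^{K \times C}$, which results in $z = Wx \in \mathbb{R}^{K}$. We then obtain the predicted probability distribution over classes through the softmax function. Finally, we can train the network by minimizing the loss function as follows:

$$\mathcal{L} = -\sum_{n=1}^{N}\sum_{k=1}^{K} y_{nk}\,\log p_{nk} \tag{1}$$

where $y_{nk}$ is the ground truth, and $p_{nk} = \exp(z_{nk}) / \sum_{k'=1}^{K} \exp(z_{nk'})$.

To construct a conditional classifier on the channels in a cumulative manner, unlike the existing techniques, we calculate the logits of the $j$-th classifier in the following manner, i.e., $z^{(j)} = \sum_{c \in \mathcal{C}_{\le j}} W_{:,c}\,x_c$, where $\mathcal{C}_{\le j}$ denotes the channels in the first $j$ of the $G$ channel groups. This allows $z^{(j)}$ to use only the features from the first $j$ channel groups. We then pass the logits through the softmax function, $p^{(j)} = \mathrm{softmax}(z^{(j)})$. Finally, we can calculate the loss function in the conditional channel as:

$$\mathcal{L}_{\text{channel}} = -\sum_{j=1}^{G}\sum_{n=1}^{N}\sum_{k=1}^{K} y_{nk}\,\log p^{(j)}_{nk} \tag{2}$$

Since we want to obtain an architecture that can be sliced layer-wise as well as channel-wise, we attach intermediate classifiers to all layers with global average pooling. This allows us to get the feature map for each layer. If the channel-conditional classifier is applied to each of the $L$ layers similarly to (2), we obtain the loss function for the architecture that can be sliced layer-wise as well as channel-wise as follows:

$$\mathcal{L}_{\text{total}} = -\sum_{i=1}^{L}\sum_{j=1}^{G}\sum_{n=1}^{N}\sum_{k=1}^{K} y_{nk}\,\log p^{(i,j)}_{nk} \tag{3}$$

where $p^{(i,j)}_{n}$ is the softmax output of $z^{(i,j)}$ of the $n$-th sample.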
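The cumulative classifier can be sketched with a running sum over channels, so that the logits of every channel prefix are produced in a single pass and each prefix classifier gets its own cross-entropy loss. This is an illustrative sketch only: it assumes one channel per group, a classifier without bias, and a single sample, and the names `prefix_logits` and `channel_losses` as well as the toy numbers are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def prefix_logits(x, W):
    """Logits of every channel-prefix classifier.

    x: pooled feature vector with C entries.
    W: C rows of K weights (one row per channel, no bias assumed).
    Returns C logit vectors; entry j uses channels 0..j only.
    """
    running = [0.0] * len(W[0])
    out = []
    for x_c, w_c in zip(x, W):
        # Add the contribution of one more channel to the running logits.
        running = [r + x_c * w for r, w in zip(running, w_c)]
        out.append(running)
    return out


def channel_losses(x, W, label):
    """Cross-entropy of each prefix classifier for one sample."""
    return [-math.log(softmax(z)[label]) for z in prefix_logits(x, W)]


x = [0.5, 1.0, 2.0]                          # toy pooled features, C = 3
W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]    # toy weights, K = 2
logits = prefix_logits(x, W)
losses = channel_losses(x, W, label=0)
channel_loss = sum(losses)                   # summed over all prefixes
```

The last entry of `logits` equals the logits of an ordinary classifier that sees all channels, so the full model is simply the widest prefix.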

**An aggregate loss function** An individual loss for each partial model can be calculated using the obtained logit values and the target value by a conventional loss function, which is a cross-entropy in our case. We make a single objective function by combining all loss functions rather than optimizing each loss function respectively. Our baseline implementation of this aggregate objective function is simply a sum of all loss functions coming from all possible partial models. Applying standard backpropagation to this single loss function enables its gradient values to flow into all connected loss functions and optimize all partial models at the same time for each iteration.

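$$\mathcal{L}_{\text{agg}} = \sum_{i=1}^{L}\sum_{j=1}^{G} \mathcal{L}_{ij} \tag{4}$$

where $\mathcal{L}_{ij}$ is the loss of the partial model with $i$ layer groups and $j$ channel groups, $L$ is the number of layer groups, and $G$ is the number of channel groups.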

We extend this baseline aggregate loss function by introducing additional parameters each of which serves as a multiplier to the individual loss function, which is a weight to the loss function, and reflects users’ preferences. These parameters can be represented by a single matrix whose individual element indicates a relative importance of the corresponding partial models in terms of the final performance (e.g., classification accuracy).

In this study, we propose three weight matrices. The first type focuses on low-complexity models. The weight $w_{ij}$ applied to the loss function of a specific model with $i$ layer groups and $j$ channel groups decays exponentially as $i$ or $j$ gets larger, where the base of the decay is a hyper-parameter larger than one. In contrast, the second type focuses on high-performance partial models. In addition, we can customize the loss weight matrix, for example by assigning a large weight to a single partial model. We verified that this prioritization scheme works as intended; the quantitative results and other details are shown in Section 4.
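A weighted aggregate loss of this kind can be sketched as follows. The exact weight formulas are not given here, so the forms below are assumptions chosen only to match the verbal description: `low_cost_weights` uses `alpha ** -(i + j)` with zero-based indices, `high_perf_weights` is its mirror image, and `single_target_weights` boosts one partial model. All function names and the toy loss values are hypothetical.

```python
def flat_weights(L, G):
    """Baseline: every partial model gets weight 1."""
    return [[1.0] * G for _ in range(L)]


def low_cost_weights(L, G, alpha=2.0):
    """Assumed form: weight decays exponentially as i or j grows."""
    return [[alpha ** -(i + j) for j in range(G)] for i in range(L)]


def high_perf_weights(L, G, alpha=2.0):
    """Assumed mirror image: the largest weight sits on the full model."""
    return [[alpha ** -((L - 1 - i) + (G - 1 - j)) for j in range(G)]
            for i in range(L)]


def single_target_weights(L, G, target, boost=100.0):
    """Flat weights with one partial model boosted (zero-based indices)."""
    w = flat_weights(L, G)
    w[target[0]][target[1]] = boost
    return w


def aggregate_loss(losses, weights):
    """Weighted sum of the per-model losses."""
    return sum(w * loss
               for w_row, l_row in zip(weights, losses)
               for w, loss in zip(w_row, l_row))


# Toy 2 x 2 grid of per-model losses (rows: layer groups, columns: channel groups).
losses = [[1.0, 0.5], [0.5, 0.25]]
baseline_total = aggregate_loss(losses, flat_weights(2, 2))
low_cost_total = aggregate_loss(losses, low_cost_weights(2, 2))
```

With flat weights the aggregate reduces to the plain sum of all per-model losses, so the baseline is a special case of the weighted form.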

## 4 Experiments

### 4.1 Experimental setup

We evaluated the proposed method on the widely used CIFAR-10 dataset. We also used the street view house numbers (SVHN) dataset. Our architecture is based on the ResNet-32 model, which consists of 15 residual blocks and one fully-connected layer. Our architecture includes a classifier between the residual blocks, and the first feature map has 16 channels, so it contains a total of 256 (16x16) classifiers. To train this architecture, we use a momentum optimizer with a momentum term of 0.9 and a mini-batch size of 128. Our models are trained for 40k steps, with an initial learning rate of 0.1, which is divided by a factor 10 after 60k steps and 60k steps.

### 4.2 Performance change according to channel-wise slice

One of the key advantages of the proposed architecture is that it can be sliced along the channel axis with less performance degradation. Fig. 4 shows how the performance of the proposed scheme and of the comparison models changes when they are sliced by channel on the CIFAR-10 and SVHN datasets. The proposed architecture is based on the ResNet-32 model. Therefore, the lower bound of the performance is the classification accuracy obtained by truncating channels in the existing ResNet model. As expected, conventional CNNs suffer serious performance degradation under channel slicing because there are full connections between consecutive layers. On the other hand, the upper bound of the performance is the case of fine-tuning the entire network again after truncating channels in the original network. In addition, we compare against a model in which the parameters of all channels are fixed and only the classifier is re-trained after cutting channels from the original network. Our model shows relatively little performance degradation on both datasets even as the total number of channels gets smaller.

When constructing an architecture that is sliceable along the channel axis, our base model handles all the channels separately. When training our sliceable architecture, instead of processing all sliced channels separately, adjacent channels can be processed as a group. In this experiment, we compared the performance changes with respect to channel slicing when learning the architecture with 8 slices (i.e., processing 2 consecutive channels together) and 4 slices (i.e., processing 4 consecutive channels together), in addition to the base model with 16 slices. As can be seen in Fig. 5, there is a tradeoff between performance degradation and the sliceable degrees of freedom when grouping channels.
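The grouping of adjacent channels can be written down explicitly. The sketch below, using hypothetical helper names `channel_groups` and `valid_widths`, lists which channels fall into each group for a 16-channel feature map and which slice widths remain available for a given number of slices; it assumes the channel count is divisible by the number of slices.

```python
def channel_groups(num_channels, num_slices):
    """Partition consecutive channels into equally sized groups."""
    if num_channels % num_slices != 0:
        raise ValueError("num_channels must be divisible by num_slices")
    size = num_channels // num_slices
    return [list(range(g * size, (g + 1) * size)) for g in range(num_slices)]


def valid_widths(num_channels, num_slices):
    """Channel counts at which the network can be sliced."""
    size = num_channels // num_slices
    return [size * (g + 1) for g in range(num_slices)]


groups_16 = channel_groups(16, 16)  # base model: one channel per group
groups_8 = channel_groups(16, 8)    # two consecutive channels per group
groups_4 = channel_groups(16, 4)    # four consecutive channels per group
widths_4 = valid_widths(16, 4)      # slicing is possible at 4, 8, 12, 16 channels
```

Fewer, larger groups leave more connections intact inside each group but offer fewer points at which the network can be cut.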

### 4.3 Effects of slicing both in width and depth directions

**Slicing with the baseline loss function** Fig. 6 shows the classification accuracies of all partial models when training the full network with up to 16 layer groups and 16 channel groups, which is the finest-grained grouping in our setup. As intended, wider and deeper partial models show higher accuracies, without seriously degrading the performance of the largest model.

**Effects of non-flat loss weight matrices** In Section 3.2, we have introduced the loss weight matrix to prioritize the constituent partial models. Other than the flat matrix used in the baseline loss function, we explore three variants of loss weight matrices.

Fig. 6(a) and 6(b) show the changes in classification accuracy from the baseline when we apply two different loss weight matrices that give more weight to low-cost models and to high-performance models, respectively. The results show that, as expected, the partial models with higher weights achieve higher performance than the baseline, while the models with relatively lower weights are penalized. We also observe that the impact is more pronounced in the low-cost-preferred training.

We also tested the case of giving a high loss weight to a single specific model. When we assign a 100-times weight only to the model (L8, C8) in the center ($w_{ij} = 100$ for i=8, j=8, otherwise $w_{ij} = 1$), the result in Fig. 6(c) shows that the target model outperforms the baseline with the same configuration by 7.2%.

We present another example to show the advantages of the doubly nested network over the previous slicing scheme only with a single degree of freedom in the supplementary material.

## 5 Discussion

This paper proposes a neural network architecture called DNNet, in which all models share parameters maximally by being nested all-in-one. This nested structure allows slicing channel-wise or layer-wise while keeping performance as close to the original as possible. For channel-wise slicing, which has not been explored to date, we design channel-causal convolutions, which sort channels topologically and connect neurons accordingly. In addition, we add intermediate classifiers to consecutive layers in the network so that learned feature maps from the previous layer can be reused in subsequent layers, which leads to an architecture that is sliceable along the layer axis.

Through various experiments, we show that the doubly nested network is robust to the channel-level slicing without causing severe performance degradation. We also verify that the channel-level slicing can be integrated successfully with the layer-level slicing so that we can benefit from increased degrees of freedom while leading to better solutions in terms of computational efficiency.

## Acknowledgments

The authors would like to thank Jung Kwon Lee for valuable feedback on this work and Nam-Gyu Cho for the great figures illustrating the overall concept.

## References

- [1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2262–2270, 2016.
- [2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
- [3] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
- [4] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016.
- [5] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), 2015.
- [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
- [7] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1398–1406, 2017.
- [8] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
- [9] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense networks for resource efficient image classification. CoRR, abs/1703.09844, 2017.
- [10] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2261–2269, 2017.
- [11] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
- [12] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In International Conference on Learning Representations (ICLR), 2015.
- [13] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
- [14] J. Lin, Y. Rao, J. Lu, and J. Zhou. Runtime neural pruning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 2178–2188, 2017.
- [15] S. Saxena and J. Verbeek. Convolutional neural fabrics. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4053–4061, 2016.
- [16] H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2994–3003, 2017.
- [17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.
- [18] T. Veniat and L. Denoyer. Learning time-efficient deep architectures with budgeted super networks. CoRR, abs/1706.00046, 2017.
- [19] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2074–2082, 2016.
- [20] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 2082–2090, USA, 2016. Curran Associates Inc.
- [21] S. Xie and Z. Tu. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1395–1403, 2015.
- [22] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Anal. Mach. Intell., 38(10):1943–1955, 2016.
- [23] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact CNNs. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pages 662–677, 2016.