Deep Virtual Networks for Memory Efficient Inference of Multiple Tasks
Deep networks consume a large amount of memory by their nature. A natural question arises: can we reduce that memory requirement while maintaining performance? In particular, in this work we address the problem of memory efficient learning for multiple tasks. To this end, we propose a novel network architecture producing multiple networks of different configurations, termed deep virtual networks (DVNs), for different tasks. Each DVN is specialized for a single task and structured hierarchically. The hierarchical structure, which contains multiple levels of hierarchy corresponding to different numbers of parameters, enables inference under different memory budgets. The building block of a deep virtual network is a disjoint collection of parameters of a network, which we call a unit. The lowest level of hierarchy in a deep virtual network is a unit, and higher levels of hierarchy contain the lower levels' units and additional units. Given a budget on the number of parameters, a different level of a deep virtual network can be chosen to perform the task. A unit can be shared by different DVNs, allowing multiple DVNs in a single network. In addition, shared units assist the target task with additional knowledge learned from other tasks. This cooperative configuration of DVNs makes it possible to handle different tasks in a memory-aware manner. Our experiments show that the proposed method outperforms existing approaches for multiple tasks. Notably, ours is more efficient than others as it allows memory-aware inference for all tasks.
Recently, deep learning methods have made remarkable progress in computer vision and machine learning [21, 30, 13]. Although successful in many applications, it is well known that many deep neural networks have a high memory footprint [10, 17]. This limits their practical use on low-capacity platforms such as mobile phones, robots, and autonomous vehicles. The issue has been addressed by research aimed at reducing the number of parameters of a deep network to create a lightweight network [12, 14].
Unfortunately, developing such a compact network is accompanied by a tradeoff between accuracy and the number of parameters (referred to as memory; we use this term for the number of parameters throughout the paper) at test time [11, 16]. This requires effort to find a proper network that gives competitive performance under a given memory budget. Moreover, when a network model with a different memory budget is required, a new network must be defined and trained, which incurs additional training cost.
Recently, several studies have been conducted on multiple inference under different memory budgets in a single trained architecture [22, 19], called memory efficient inference. This problem can be addressed by designing a network structure (e.g., nested and fractal structures) that enables multiple inference corresponding to different memory budgets. It allows flexible accuracy-memory tradeoffs within a single network and thus avoids introducing multiple networks for different memory budgets. Note that the memory budget may vary when tasks are performed simultaneously on a memory-limited device (e.g., an autonomous vehicle with real-time visual and non-visual inference tasks to process at once).
Memory efficient inference can be an efficient strategy to provide different predictions in a network. However, prior works have applied the strategy only to single-task learning problems [31, 19], and addressing multiple tasks jointly (often called multi-task learning [2, 29]) with the strategy has received less attention. Learning multiple tasks (multiple tasks refer to multiple datasets, unless stated otherwise) simultaneously in a network requires only a single training stage and reduces the number of networks [2, 26]. This approach also has the potential to improve generalization performance by sharing knowledge that represents associated tasks [5, 39, 7]. Despite its compelling benefits, little progress has been made so far in connection with memory efficient inference. This is probably due to the difficulty of constructing a single network that allows memory efficient inference for different tasks. The difficulty lies in the structural limitation of a neural network, which cannot readily possess a different structure for each task.
In this work, we aim to develop an efficient deep learning approach that performs memory efficient inference for multiple tasks in a single network. To this end, we propose a novel architecture containing multiple networks of different configurations termed deep virtual networks (DVNs). Each DVN shares parameters of the architecture and performs memory efficient inference for its corresponding task. A virtual network resembles a virtual machine in a computer system, as multiple virtual machines can share resources of a physical computer. Figure 1 gives an overview of the proposed approach.
The proposed architecture is based on a backbone architecture, and we divide the network parameters into multiple disjoint sets which, together with their corresponding structures, are termed units. Specifically, units are collected by dividing the set of feature maps in each layer into multiple subsets throughout the layers of the architecture (see Figure 2). A DVN is structured hierarchically: it contains multiple levels of hierarchy corresponding to different numbers of units, where a lower level contains fewer units and a higher level contains more. For example, the lowest level of the hierarchy has a single unit. Each level of the hierarchy in a DVN contains all preceding lower levels' units and one additional unit. Hence, the different levels of hierarchy in a DVN enable inference under different memory budgets. In the proposed architecture, a unit can be shared by different DVNs. This allows multiple DVNs in a single deep network for multiple tasks. Each deep virtual network has a unique configuration (i.e., a hierarchical structure with a different order of units), and is specialized for a single task. The unique configuration is determined by the proposed rule discussed in Section 3.2. The proposed approach can selectively provide an inference output from its DVNs for a given task with the desired memory budget. The approach is realized in a single training stage based on a single backbone architecture (e.g., a residual network), which significantly reduces training effort and network storage.
We apply our method to joint learning scenarios of multiple tasks using popular image classification datasets. Our results show that, for all tasks, DVNs are learned successfully under different memory budgets. Moreover, the results are better than those of other approaches. We also measure the actual processing time during inference to verify the practicality of the proposal. In addition, we demonstrate our approach on the task of sequential learning.
The proposed approach introduces a new concept of virtual networks in deep learning to perform multiple tasks in a single architecture, making it highly efficient.
2 Related Work
Multi-task learning. The aim of multi-task learning is to improve the performance of multiple tasks by jointly learning them. Two popular approaches are learning a single shared architecture with multiple output branches [25, 24] and learning multiple different networks according to tasks [27, 35]. We are particularly interested in multi-task learning with a single shared network as it is memory efficient. Recently, a few approaches have been proposed to perform multiple tasks in a single network by exploiting the redundancy of the network [26, 19]. PackNet divides a set of network parameters into multiple disjoint subsets to perform multiple tasks by iteratively pruning and packing the parameters. NestedNet is a collection of networks of different sizes which are constructed in a network-in-network style manner. However, for a fixed budget, the size of the parameters assigned to each network is reduced as the number of tasks increases, which may cause a decrease in performance. Moreover, they produce only a single inference output for each task. In contrast, our approach overcomes these issues by introducing deep virtual networks that share disjoint subsets of parameters in our architecture, and their different configurations make it possible to address multiple tasks (see Figure 2).
Multi-task learning can be extended to sequential learning [24, 38, 3], where tasks are learned sequentially without accessing the datasets of old tasks. Following a popular strategy, we apply the proposed approach to sequential learning problems (see Sections 3.3 and 4.5).
Memory efficient learning. Memory efficient learning is a learning strategy to perform multiple inference according to different budgets on the number of parameters (called memory) in a single network [22, 37, 19]. It enables flexible inference under varying memory budgets, which is often called anytime prediction. To realize anytime prediction, a self-similarity based fractal structure was proposed. A feedback system based on a recurrent neural network was proposed to perform different predictions according to memory or time budgets. A nested network, which consists of multiple networks of different scales, was proposed to address different memory budgets. However, these approaches are confined to performing an individual task. In contrast, our method enables anytime prediction for multiple tasks using deep virtual networks.
To our knowledge, this work is the first to introduce deep virtual networks of different configurations from a single deep network, which enables flexible prediction under varying memory conditions for multiple tasks.
3.1 Memory efficient learning
We discuss the problem of memory efficient learning, which performs inference under different memory budgets for a single task. Assume that, given a backbone network, we divide the network parameters $\mathcal{W}$ into $n_s$ disjoint subsets, i.e., $\mathcal{W} = [\mathcal{W}_1, \mathcal{W}_2, \ldots, \mathcal{W}_{n_s}]$. We design the network to be structured hierarchically by assigning the subsets in such a way that the $l$-th level of hierarchy ($l \geq 2$) contains the subsets in the $(l-1)$-th level and one additional subset. The lowest level of the hierarchy ($l = 1$) is assigned a single subset and the highest level contains all subsets (i.e., $\mathcal{W}$). For example, when $n_s = 3$ we can assign $\mathcal{W}_1$ to the lowest level in the hierarchy, $[\mathcal{W}_1, \mathcal{W}_2]$ to the intermediate level, and $[\mathcal{W}_1, \mathcal{W}_2, \mathcal{W}_3]$ to the highest level. A hierarchical structure is determined by an order of subsets, which is designed by a user before learning. In this work, the number of levels of hierarchy, denoted as $n_h$, is set to the number of subsets, $n_s$. Each level of hierarchy defines a network corresponding to its subsets and produces an output. The hierarchical structure thus enables inference for different numbers of subsets (memory budgets).
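The nesting of levels described above can be illustrated with a minimal Python sketch. The function name `hierarchy_levels` and the string labels for the subsets are hypothetical and used only for illustration; the sketch simply takes a user-chosen order of subsets and returns, for each level, the prefix of that order.

```python
from typing import List, Sequence


def hierarchy_levels(order: Sequence[str]) -> List[List[str]]:
    """Return the subsets assigned to each level of hierarchy.

    Level l (1-indexed) holds the first l subsets of the user-chosen
    order, so every level contains the level below it plus one subset.
    """
    return [list(order[:level]) for level in range(1, len(order) + 1)]


# Three subsets give three levels: lowest, intermediate, highest.
levels = hierarchy_levels(["W1", "W2", "W3"])
for level, subsets in enumerate(levels, start=1):
    print(level, subsets)
```

A different order of the same subsets gives a different hierarchy, which is the property the deep virtual networks of Section 3.2 rely on.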
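Given a dataset $\mathcal{D}$ consisting of image-label pairs and $n_h$ levels of hierarchy, the set of parameters $\mathcal{W}$ can be optimized by the sum of loss functions

$$\min_{\mathcal{W}} \sum_{l=1}^{n_h} \mathcal{L}\big(h(\mathcal{W}, l); \mathcal{D}\big), \quad (1)$$

where $h(\mathcal{W}, l)$ is the set of parameters of $\mathcal{W}$ that are assigned to the $l$-th level of hierarchy. There is a constraint on $h$ such that a higher level set includes a lower level set, i.e., $h(\mathcal{W}, i) \subseteq h(\mathcal{W}, j)$ for $i \leq j$, which yields a structure sharing parameters. $\mathcal{L}$ is a standard loss function (e.g., cross-entropy) of the network associated with $h(\mathcal{W}, l)$. In addition, we enforce regularization on $\mathcal{W}$ (e.g., weight decay) for improved learning. By solving (1), a learned network is obtained that can perform inference corresponding to $n_h$ memory budgets.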
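As a toy illustration of a sum of per-level losses over nested parameter sets, the following NumPy sketch evaluates a linear classifier on the first few input features at each level and adds up the cross-entropy values. Everything here is an assumption made for illustration: the names `cross_entropy`, `nested_loss`, and `level_widths` are hypothetical, and a single shared linear layer stands in for the convolutional backbone (the paper appends a separate fully connected layer to each level).

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def nested_loss(features, labels, weights, level_widths):
    """Sum the losses of all levels; level l uses only the first
    level_widths[l] features and the matching rows of the weights,
    so a higher level always includes the parameters of a lower one."""
    per_level = []
    for width in level_widths:
        logits = features[:, :width] @ weights[:width]
        per_level.append(cross_entropy(logits, labels))
    return sum(per_level), per_level


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 6))   # 8 samples, 6 features
labels = rng.integers(0, 3, size=8)  # 3 classes
weights = np.zeros((6, 3))           # untrained toy weights
total, per_level = nested_loss(features, labels, weights, [2, 4, 6])
print(total, per_level)
```

Because the same weight rows appear in several levels, their gradient would receive a contribution from every level that includes them.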
The function $h$ that selects the parameters of each level can be designed by a pruning operation on $\mathcal{W}$, either element-wise or group-wise (for feature maps). Since our approach targets practical time-dependent inference, we follow the philosophy of group-wise pruning approaches [33, 14] in this work. Note that the problem (1) is applied to a single task (here, a single dataset $\mathcal{D}$) and does not consider multiple tasks (or datasets). This issue will be addressed in the following subsection with the introduction of deep virtual networks.
3.2 Deep virtual network
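Building block. Our network architecture is based on a backbone architecture, and we divide the network parameters into multiple disjoint subsets. Assume that there are $n_u$ disjoint subsets in a network, which are collected by dividing the feature maps in each layer into $n_u$ subsets across all layers (for simplicity, we omit the fully-connected layer; it is, however, appended on top of the last convolutional layer to produce an output). Formally, the set of network parameters is represented as $\mathcal{W} = \{W^{j}\}_{j=1}^{n_l}$, where $n_l$ is the number of layers and $W^{j} \in \mathbb{R}^{w_j \times h_j \times c^{j}_{in} \times c^{j}_{out}}$. The $i$-th subset of $W^{j}$ is denoted as $W^{j}_{i} \in \mathbb{R}^{w_j \times h_j \times c^{j}_{in,i} \times c^{j}_{out,i}}$. Here, $w_j$ and $h_j$ are the width and height of the convolution kernel of the $j$-th layer, respectively. $c^{j}_{in,i}$ and $c^{j}_{out,i}$ are the numbers of input and output feature maps of the $i$-th subset in the $j$-th layer, respectively, such that $\sum_{i} c^{j}_{in,i} = c^{j}_{in}$ and $\sum_{i} c^{j}_{out,i} = c^{j}_{out}$. The set of the $i$-th subsets over all layers is written as

$$\mathcal{W}_i = \{W^{1}_{i}, W^{2}_{i}, \ldots, W^{n_l}_{i}\}. \quad (2)$$

We call the corresponding network structure defined by $\mathcal{W}_i$ the unit $u_i$, which produces an inference output.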
Hierarchical structure. The proposed approach produces deep virtual networks (DVNs) of different network configurations (i.e., hierarchical structures) using shared units in a network architecture, as illustrated in Figure 2. Each unit is coupled with other units along the feature map direction to form a hierarchical structure, similar to the strategy described in Section 3.1. The number of levels of hierarchy is $n_h$, where each level of the hierarchy includes all preceding lower levels' units and one additional unit. A different hierarchical structure is constructed by a different order of units. This introduces a unique DVN which is specialized for a task. Thus, multiple DVNs of different network configurations can be realized in a single network by sharing units, for different tasks (see Figure 3). In contrast, the problem (1) is for a single task with a single network configuration, which is equivalent to producing a single DVN.
Rules for configuring virtual networks. In order to determine different network configurations of deep virtual networks, we introduce a simple rule. We assume that datasets are collected sequentially, along with their task ID numbers, and that datasets with adjacent task ID numbers are from similar domains. The proposed rule is: (i) The unit $u_k$ is assigned to the task $k$, and it becomes the lowest level in the hierarchy for the task. (ii) The unit is coupled with adjacent units that are not yet coupled. (iii) If there are two adjacent units, the unit with the lower task ID number is coupled. Let $P(\mathcal{W}, l, k)$ be a function that selects the subset of $\mathcal{W}$ of the $l$-th level of hierarchy for the task $k$, and let $\mathcal{W}_i$ denote the parameters for the unit $u_i$. The rule then yields a hierarchical structure for each task, in which each level adds one unit to the units of the level below (when units are used together, additional parameters, i.e., interconnections between units, are added to the parameters of the stand-alone units $\mathcal{W}_i$).
The configuration is different depending on the order of units (see Figure 3 for an example).
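To make rules (i) to (iii) concrete, the following Python sketch computes an order of units for a given task. It follows one literal reading of the rule, which is an assumption on our part: the coupled units always form a contiguous range of task IDs, and whenever an uncoupled unit exists on both sides of that range, the one with the lower ID is added next. The function name `unit_order` is hypothetical.

```python
from typing import List


def unit_order(task_id: int, num_units: int) -> List[int]:
    """Order in which units join the hierarchy of one task.

    Rule (i): the task's own unit is the lowest level.
    Rules (ii)-(iii): repeatedly couple an adjacent uncoupled unit,
    preferring the lower task ID when both neighbours are available.
    """
    if not 1 <= task_id <= num_units:
        raise ValueError("task_id must lie in [1, num_units]")
    order = [task_id]
    low, high = task_id - 1, task_id + 1
    while len(order) < num_units:
        if low >= 1:          # a lower-ID neighbour is still uncoupled
            order.append(low)
            low -= 1
        else:                 # only higher-ID neighbours remain
            order.append(high)
            high += 1
    return order


for task in range(1, 5):
    print(task, unit_order(task, 4))
```

Under this reading, each task obtains a different order of the same units, so each deep virtual network has a distinct hierarchy while sharing all parameters.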
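Objective function. Given datasets for $n_t$ tasks, $\{\mathcal{D}_k\}_{k=1}^{n_t}$, $n_t$ deep virtual networks, the set of parameters $\mathcal{W}$, and $n_h$ levels of hierarchy for each deep virtual network, the proposed method can be optimized by solving the sum of loss functions

$$\min_{\mathcal{W}} \sum_{k=1}^{n_t} \sum_{l=1}^{n_h} \mathcal{L}\big(P(\mathcal{W}, l, k); \mathcal{D}_k\big), \quad (4)$$

where $P(\mathcal{W}, l, k)$ is a function that selects the subset of $\mathcal{W}$ corresponding to the $l$-th level of hierarchy for the $k$-th task (or the $k$-th deep virtual network), such that $P(\mathcal{W}, i, k) \subseteq P(\mathcal{W}, j, k)$, $i \leq j$, for all $k$. Note that in the case when $n_t = 1$, the problem (4) reduces to the problem (1) for a single task.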
Learning. The unit $u_i$ is learned based on the following gradient with respect to $\mathcal{W}_i$:

$$\sum_{k=1}^{n_t} \sum_{l=\ell(i,k)}^{n_h} \frac{\partial \mathcal{L}\big(P(\mathcal{W}, l, k); \mathcal{D}_k\big)}{\partial \mathcal{W}_i}, \quad (5)$$

where $\ell(i,k)$ returns the level number at which the $i$-th unit is added to the hierarchy for the $k$-th task (see Figure 3). The unit is learned by aggregating multiple gradients from the hierarchical structures of the deep virtual networks for all tasks. Note that, for a given task $k$, the difference $n_h - \ell(i,k)$ influences the amount of the gradient (significance) of the unit $u_i$ for that task, as the gradients from more levels accumulate. The larger the difference, the higher the significance of the unit for the task $k$. The proposed approach is trained in such a way that each unit is learned to have a different significance (a different $\ell(i,k)$) for each task. Note that, using the proposed configuration rule, the total amount of gradients of a unit over all tasks is about the same as that of the other units. This prevents units from having irregular scales of gradients.
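The accumulation of gradients over levels can be illustrated by counting, for each unit and task, how many levels of that task's hierarchy contain the unit. The sketch below is only an illustration: the three unit orders are hypothetical example configurations (one order per task), and the helper names `level_of` and `loss_terms` are ours.

```python
from typing import Dict, List


def level_of(order: List[int], unit: int) -> int:
    """1-indexed level at which the unit is added to the hierarchy."""
    return order.index(unit) + 1


def loss_terms(order: List[int], unit: int) -> int:
    """Number of levels (hence per-level losses) that include the unit."""
    return len(order) - level_of(order, unit) + 1


# Hypothetical configurations for three tasks sharing three units.
orders: Dict[int, List[int]] = {1: [1, 2, 3], 2: [2, 1, 3], 3: [3, 2, 1]}

totals = {
    unit: sum(loss_terms(order, unit) for order in orders.values())
    for unit in (1, 2, 3)
}
print(totals)  # {1: 6, 2: 7, 3: 5}
```

A unit added at the lowest level of a task contributes to every level of that task, while a unit added last contributes to a single level; summed over tasks, the counts in this example stay within a narrow range.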
3.3 Deep virtual network for sequential tasks
The proposed approach can also handle sequential tasks. Assume that the old tasks, from the first to the $(t-1)$-th task, have been learned beforehand. For the current (new) task $t$, we construct an architecture with $t$ units, where $t-1$ units correspond to the old tasks and the $t$-th unit represents the current task. Based on the units, we construct $t$ deep virtual networks as described in Section 3.2.
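Given a dataset for the task $t$, $\mathcal{D}_t$, the set of parameters $\mathcal{W}$, $t$ deep virtual networks, and $n_h$ levels of hierarchy, the problem is formulated as

$$\min_{\mathcal{W}} \sum_{l=1}^{n_h} \Big[ \mathcal{L}\big(P(\mathcal{W}, l, t); \mathcal{D}_t\big) + \sum_{k=1}^{t-1} \mathcal{L}_d\big(P(\mathcal{W}, l, k); \mathcal{D}_t\big) \Big], \quad (6)$$

where $P$ is the selection function of (4) and $\mathcal{L}_d$ is a distillation loss between the output of the network whose structure is determined by $P(\mathcal{W}, l, k)$ and the output of the task $k$ from the old network when a new input from $\mathcal{D}_t$ is given. The only difference from the problem (4) (which jointly learns tasks) is that we use a distillation loss function to preserve the knowledge of the old tasks in the current sequence (due to the absence of the old datasets). For $\mathcal{L}_d$, we adopt the modified cross entropy function following the practice of LwF. The gradient of (6) with respect to $\mathcal{W}_i$ is

$$\sum_{l=\ell(i,t)}^{n_h} \frac{\partial \mathcal{L}\big(P(\mathcal{W}, l, t); \mathcal{D}_t\big)}{\partial \mathcal{W}_i} + \sum_{k=1}^{t-1} \sum_{l=\ell(i,k)}^{n_h} \frac{\partial \mathcal{L}_d\big(P(\mathcal{W}, l, k); \mathcal{D}_t\big)}{\partial \mathcal{W}_i}, \quad (7)$$

where $\ell(i,k)$ is the level at which the $i$-th unit is added to the hierarchy for the task $k$, as in (5).
In contrast to our gradient in (7), LwF learns a single set of parameters $\mathcal{W}$, which means that the network has no hierarchical structure and all tasks are performed without memory efficient inference.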
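For reference, the modified cross entropy used as a distillation loss in LwF can be sketched in NumPy as follows. This reflects the form described in the LwF paper, where both probability vectors are softened by raising them to the power $1/T$ and renormalizing, with $T = 2$ suggested there; whether the same temperature is used in this work is not stated, so the default below is an assumption. The function names `soften` and `distillation_loss` are hypothetical.

```python
import numpy as np


def soften(probs, temperature=2.0):
    """Raise probabilities to the power 1/T and renormalize."""
    powered = np.power(probs, 1.0 / temperature)
    return powered / powered.sum(axis=-1, keepdims=True)


def distillation_loss(old_probs, new_probs, temperature=2.0, eps=1e-12):
    """Cross entropy between softened old-network outputs (targets)
    and softened outputs of the network being trained."""
    target = soften(np.asarray(old_probs, dtype=float), temperature)
    pred = soften(np.asarray(new_probs, dtype=float), temperature)
    return float(-(target * np.log(pred + eps)).sum(axis=-1).mean())


old = np.array([[0.7, 0.2, 0.1]])       # recorded output of the old network
same = np.array([[0.7, 0.2, 0.1]])      # new network agrees with the old one
drifted = np.array([[0.1, 0.2, 0.7]])   # new network has drifted away
print(distillation_loss(old, same), distillation_loss(old, drifted))
```

The loss is smallest when the new outputs match the recorded old outputs, which is what discourages forgetting when the old datasets are unavailable.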
4.1 Experimental setup
We tested our approach on several supervised learning problems using visual images. The proposed method was applied to standard multi-task learning (joint learning), where we learn multiple tasks jointly, and sequential learning, where we focus on the current sequence with the learned network for old tasks. We also applied the proposed approach to hierarchical classification, which is the problem of classifying coarse-to-fine class categories. Our approach was evaluated on four benchmark datasets, CIFAR-10 and CIFAR-100, STL-10, and Tiny-ImageNet (https://tiny-imagenet.herokuapp.com/), based on two popular (backbone) models, WRN-$n$-$s$ and ResNet-$n$, where $n$ and $s$ are the number of layers and the scale factor over the number of feature maps, respectively.
We first organized three scenarios for joint learning of multiple tasks. We performed a scenario (J1) consisting of two tasks using the CIFAR-10 and CIFAR-100 datasets and another scenario (J2) of four tasks whose datasets are collected by dividing the classes of Tiny-ImageNet evenly into four subsets. The third scenario (J3) consists of three datasets, CIFAR-100, Tiny-ImageNet, and STL-10, of different image scales (from 32×32 to 96×96). For hierarchical classification (H1), CIFAR-100 was used, which contains coarse classes (20 classes) and fine classes (100 classes). For sequential learning, we considered two scenarios: a scenario (S1) with two tasks whose datasets are collected by dividing the classes of CIFAR-10 evenly into two subsets, and another scenario (S2) consisting of two tasks using CIFAR-10 and CIFAR-100.
4.2 Implementation details
All the compared architectures were based on ResNet or WRN. We followed established practice for setting the number of feature maps in residual blocks for all applied scenarios. We constructed the building block of a network for Tiny-ImageNet based on the practice for ImageNet. All the compared methods were learned from scratch for the same number of epochs and were initialized using the Xavier method. The proposed network was trained by the SGD optimizer with Nesterov momentum of 0.9, where the mini-batch sizes were 128 for CIFAR and 64 for Tiny-ImageNet. We adopted batch normalization after each convolution operation.
We constructed units with respect to feature maps across the convolution layers, except the first input layer. Our deep virtual networks have task-specific input layers for different tasks or input scales, and the dimensionality of their outputs is made identical by varying the stride size of the convolution. When two units are used together, the feature map size doubles and additional parameters (i.e., interconnections between the units) are needed to cover the increased feature map size, in addition to the parameters (intraconnections) of the stand-alone units. We also appended a fully connected layer of a compatible size on top of each level of hierarchy. All the proposed approaches were implemented with the TensorFlow library, and their evaluations were performed on an NVIDIA TITAN Xp graphics card.
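The relation between stand-alone units and interconnection parameters can be checked with simple arithmetic on a single convolution layer. The layer sizes below (a 3×3 kernel with 64 input and 64 output feature maps, split evenly into two units) are hypothetical values chosen for illustration, and the count ignores the input layer, batch normalization, and the fully connected layers.

```python
def conv_params(kernel_w: int, kernel_h: int, c_in: int, c_out: int) -> int:
    """Number of weights in one convolution layer (no bias)."""
    return kernel_w * kernel_h * c_in * c_out


full = conv_params(3, 3, 64, 64)   # both units used together
unit = conv_params(3, 3, 32, 32)   # one stand-alone unit (intraconnection)
intra = 2 * unit                   # the two stand-alone units
inter = full - intra               # interconnection between the units

print(unit / full)   # 0.25
print(inter / full)  # 0.5
```

Halving both the input and the output feature maps quarters the weights of a layer, which is consistent with each stand-alone unit having about 25% parameter density when a network is split evenly into two units.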
4.3 Joint learning
We conducted experiments for joint learning by comparing with two approaches: a grouped variant of PackNet (which achieves actual inference speed-up by dividing feature maps into multiple subsets, similar to ours), and NestedNet (with channel pruning), which can perform either multi-task learning or memory efficient learning.
For the first scenario (J1) using the two CIFAR datasets, we split the parameters almost evenly along the feature map dimension and assigned the first half and all of the parameters to the first and second task, respectively, for PackNet and NestedNet. Our architecture contains two deep virtual networks (DVNs), and each DVN consists of two units (and two levels of hierarchy), obtained by splitting the set of feature maps in every layer evenly into two subsets throughout all associated layers. Here, each stand-alone unit has 25% of the parameter density, since the interconnection parameters between the two units are not counted (see Section 4.2). For this scenario, WRN-32-4 was used for all compared approaches. Table 1 shows the results of the compared approaches. Our approach gives four evaluations according to tasks and memory budgets. Among them, the evaluations using each stand-alone unit (top) do not compromise much on performance compared to those using all units (bottom) on average. PackNet and NestedNet give performance comparable to our approach, but their maximum performance leveraging the whole network capacity is poorer than ours. The baseline gives performance comparable to the multi-task learning approaches, but it requires a 2× larger number of parameters in this problem. The average inference times (and the numbers of parameters) of our DVN using a single unit and all associated units are 0.11ms (1.9M) and 0.3ms (7.4M) for a single image, respectively. We also provide the performance curve of the proposed approach on the test sets in Figure 4.
|Method||Task 1||Task 2||Average|
Figure 5(a) shows the results for the second scenario (J2) using Tiny-ImageNet (four tasks). The parameter ratios for PackNet and NestedNet increase from task 1 to task 4: the parameters are divided into four subsets almost evenly, and the first $k$ subsets are assigned to task $k$. Our architecture contains four DVNs, each of which has four units and four levels of hierarchy. The parameter ratios across the levels of hierarchy were the same for each DVN. All compared approaches were based on ResNet-42. As shown in the figure, our approach outperforms the competitors under similar memory budgets for all tasks. Moreover, ours provides additional outputs for different memory budgets, making it highly efficient. Even though NestedNet has a similar strategy of sharing parameters, it performs worse than ours. Unlike the previous example, the baseline shows unsatisfactory results and even requires larger network storage than ours to perform the same tasks.
|Inference time (ms)||0.18||0.38||0.67||1.05|
In addition, we compared with NestedNet on the same scenario (J2) for memory efficient inference. Since NestedNet performs memory efficient inference for a single task, we trained it four times, once per task. In contrast, our architecture was trained once and performed memory efficient inference for all the tasks from our DVNs. Figure 5(b) shows that our method gained significant performance improvement over NestedNet for all the tasks. Table 2 summarizes the number of parameters and the associated speed-ups of the proposed network.
For the third scenario (J3) on three different tasks, the set of feature maps is divided into three subsets for the compared methods. Parameters were assigned in increasing ratios from task 1 (Tiny-ImageNet) to task 3 (STL-10). Each DVN has the same density ratios in its hierarchical structure. ResNet-42 was applied by carefully following the network design and learning strategy designed for ImageNet. Figure 6 shows the results for the tasks. The proposed method performs better than the compared approaches on average under similar parameter density ratios. While PackNet and NestedNet give performance comparable to ours for Tiny-ImageNet, they perform worse than ours for the other two tasks. Moreover, they produce a single output for every task with a fixed parameter density condition, while ours provides multiple outputs under different density conditions for each dataset. The inference times (and numbers of parameters) of our DVN are 0.65ms (7.5M), 1.02ms (16.8M), and 1.51ms (29.8M) for a single image from STL-10.
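Choosing a level of hierarchy for a given memory budget can be sketched as a simple lookup. The function name `choose_level` is hypothetical; the parameter counts are the three values reported above for STL-10, and the sketch assumes they are listed from the lowest to the highest level.

```python
from typing import Optional, Sequence


def choose_level(param_counts: Sequence[float], budget: float) -> Optional[int]:
    """Return the highest 1-indexed level whose parameter count fits
    within the budget, or None if even the lowest level does not fit."""
    best = None
    for level, count in enumerate(param_counts, start=1):
        if count <= budget:
            best = level
    return best


# Parameter counts of the three levels reported for STL-10.
stl10_levels = [7.5e6, 16.8e6, 29.8e6]

print(choose_level(stl10_levels, 20e6))  # 2
print(choose_level(stl10_levels, 5e6))   # None
```

Since all levels live in one trained network, switching budgets at test time only changes which units are evaluated and requires no retraining.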
|Task 1 (20)||Task 2 (100)|
|Method||Feature Extraction ||DA-CNN ||LwF ||NestedNet ||Ours|
4.4 Hierarchical classification
As another application of joint learning, we experimented with the scenario (H1), hierarchical classification. The aim is to model multiple levels of hierarchy of class categories for a dataset, where each level is considered as a task. We evaluated on CIFAR-100, which has a two-level hierarchy of class categories as described in Section 4.1. Our architecture contains two deep virtual networks, and each contains two units obtained by dividing feature maps equally into two sets. Thus, it produces four different inference outputs. We compared with NestedNet, which can perform hierarchical classification in a single network. The backbone network was WRN-32-4.
Table 3 shows the results of the applied methods. We also provide baseline results obtained by learning an individual network (WRN-32-2 or WRN-32-4) for each number of parameters and each number of classes. Overall, our approach performs better than the other compared methods in all cases. Ours and NestedNet outperform the baseline, probably because they share parameters between the tasks, which are closely related to each other. The proposed approach produces a larger number of inference outputs than NestedNet while maintaining better performance.
4.5 Sequential learning
We conducted the scenario (S1), which consists of two sequential tasks based on CIFAR-10, where the old (task 1) and new (task 2) tasks consist of the samples from the first and last five classes of the dataset, respectively. We compared our approach with other methods that can perform sequential tasks: Feature Extraction, LwF, DA-CNN (with two additional fully-connected layers), and NestedNet (whose low and high levels of hierarchy in the network represent the old and new tasks, respectively).
The proposed network consists of two units obtained by dividing feature maps evenly into two subsets (each stand-alone unit has a 25% parameter density ratio). It constructs two deep virtual networks providing four inference outputs. We applied the WRN-32-4 architecture for all compared approaches. Table 4 shows the results of the compared methods. We observe that the proposed approach outperforms the other approaches. Notably, the results using stand-alone units are better than the others on average. Feature Extraction and DA-CNN nearly preserve the performance for the first task by keeping the parameters of the first task unchanged, but they give unsatisfactory results for the following task. The results from LwF and NestedNet are much better than those mentioned above for the second task, but they are still worse than ours.
We also applied the proposal to another scenario (S2) consisting of CIFAR-10 (old, task 1) and CIFAR-100 (new, task 2). All the compared approaches were based on WRN-32-8. Our DVNs were constructed and trained under the same strategy as in (S1). The results of the scenario are summarized in Table 5. Our result using all units (right column) gives the best performance on average among the compared approaches. Moreover, our result using a stand-alone unit (left column) is also better than those of the best competitors, LwF and NestedNet, which use the same distillation loss function.
In this work, we have presented a novel architecture producing deep virtual networks (DVNs) to address multiple objectives with respect to different tasks and memory budgets. Each DVN has a unique hierarchical structure for a task and enables inference under different memory budgets. Based on the proposed network, we can adaptively choose a DVN and one of its levels of hierarchy for a given task with the desired memory budget. The efficacy of the proposed method has been demonstrated under various multi-task learning scenarios. To the best of our knowledge, this is the first work introducing the concept of virtual networks in deep learning for multi-task learning.
Acknowledgements. This work was supported by the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1, EPSRC/MURI grant EP/N019474/1, Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2017R1A2B2006136), and AIR Lab (AI Research Lab) of Hyundai Motor Company through HMC-SNU AI Consortium Fund. We would also like to acknowledge the Royal Academy of Engineering and FiveAI.
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
[3] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip H.S. Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In European Conference on Computer Vision. Springer, 2018.
[4] Adam Coates, Honglak Lee, and Andrew Y. Ng. An analysis of single layer networks in unsupervised feature learning. In AISTATS, 2011.
[5] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, pages 160–167. ACM, 2008.
[6] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, 2014.
[7] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[8] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[9] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. Morphnet: Fast & simple resource-constrained structure learning of deep networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[10] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. EIE: efficient inference engine on compressed deep neural network. In Computer Architecture, 2016 ACM/IEEE 43rd Annual International Symposium on, pages 243–254. IEEE, 2016.
[11] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016.
[12] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Conference on Neural Information Processing Systems, 2015.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[14] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision, 2017.
[15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[16] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[17] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
[19] Eunwoo Kim, Chanho Ahn, and Songhwai Oh. NestedNet: Learning nested sparse structures in deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[20] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Conference on Neural Information Processing Systems, 2012.
[22] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. In International Conference on Learning Representations, 2017.
[23] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017.
[24] Zhizhong Li and Derek Hoiem. Learning without forgetting. In European Conference on Computer Vision. Springer, 2016.
-  Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Philip S Yu. Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems, pages 1594–1603, 2017.
-  Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016.
-  Gerald J Popek and Robert P Goldberg. Formal requirements for virtualizable third generation architectures. Communications of the ACM, 17(7):412–421, 1974.
-  Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
-  Andreas Veit and Serge Belongie. Convolutional networks with adaptive inference graphs. arXiv preprint arXiv:1711.11503, 2017.
-  Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Growing a brain: Fine-tuning by increasing model capacity. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Conference on Neural Information Processing Systems, 2016.
-  Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, and Yizhou Yu. HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition. In IEEE International Conference on Computer Vision, 2015.
-  Yongxin Yang and Timothy M Hospedales. Trace norm regularised deep multi-task learning. arXiv preprint arXiv:1606.04038, 2016.
-  Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
-  Amir R Zamir, Te-Lin Wu, Lin Sun, William B Shen, Bertram E Shi, Jitendra Malik, and Silvio Savarese. Feedback networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, 2017.
-  Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, pages 94–108. Springer, 2014.
-  Shlomo Zilberstein. Using anytime algorithms in intelligent systems. AI Magazine, 17(3):73, 1996.