Model Slicing for Supporting Complex Analytics with Elastic Inference Cost and Resource Constraints
Deep learning models have been used to support analytics beyond simple aggregation, where deeper and wider models have been shown to yield great results. These models consume a huge amount of memory and computational operations. However, most large-scale industrial applications are constrained by their computational budgets. In practice, the peak workload of an inference service could be 10x higher than the average case, with the presence of unpredictable extreme cases. A large amount of computational resources could be wasted during off-peak hours, and the system may crash when the workload exceeds system capacity. How to support deep learning services with dynamic workload cost-efficiently remains a challenging problem. In this paper, we address the challenge with a general and novel training scheme called model slicing, which enables deep learning models to provide predictions within a prescribed computational resource budget dynamically. Model slicing can be viewed as an elastic computation solution without requiring more computational resources. Succinctly, each layer in the model is divided into groups of contiguous blocks of basic components (i.e., neurons in dense layers and channels in convolutional layers), and a partially ordered relation is introduced on these groups by enforcing that the groups participating in each forward pass always form a contiguous block from the first group to a dynamically determined rightmost group. Trained by dynamically indexing the rightmost group with a single parameter, the slice rate, the network is induced to build up group-wise and residual representations. Then during inference, a sub-model with fewer groups can be readily deployed for efficiency, whose computation is roughly quadratic in the width controlled by the slice rate. Extensive experiments show that models trained with model slicing can effectively support on-demand workload with elastic inference cost.
Shaofeng Cai, Gang Chen, Beng Chin Ooi, Jinyang Gao. Model Slicing for Supporting Complex Analytics with Elastic Inference Cost and Resource Constraints. PVLDB 13(2), 2019. https://doi.org/10.14778/3364324.3364325
Database management systems (DBMS) have been widely used and optimized to support OLAP-style analytics. In present-day applications, more and more data-driven machine learning based analytics have been grafted into DBMS to support complex analysis (e.g., stock prediction, disease progression analysis) and/or to enable predictive query and system optimization. To better understand the data and decipher the information that truly counts in the era of Big Data with its ever-increasing data size and complexity, many advanced large-scale machine learning models have been devised, from million-dimension linear models (e.g., Logistic Regression [richardson2007predicting], feature selection [zhang2016materialization]) to complex models like Deep Neural Networks [krizhevsky2012imagenet]. To meet the demand for more complex analytic queries, OLAP database vendors have integrated Machine Learning (ML) libraries into their systems (e.g., SQL Server pymssql (https://docs.microsoft.com/en-us/sql/connect/python/pymssql/python-sql-driver-pymssql), DB2 python_ibm_db (https://github.com/ibmdb/python-ibmdb), etc.). It is widely recognized that the integration of ML analytics into data systems yields seamless effects, since the ML task is treated as an operator of the query plan instead of an individual black-box system on top of the data system. Naturally, a higher-level abstraction provides more space for optimization. For example, query planning [pangzi14, msms], lazy evaluation [lazy10], materialization [zhang2016materialization] and operator optimization [boehm2016systemml] could be considered in a fine-grained manner.
Cost and accuracy are always the two most crucial criteria considered for analytic tasks. Much research on approximate query processing has been conducted [li2016wander, bolin2017aqp] to provide faster yet approximate analytical query results in modern large-scale analytical database systems, while such a trade-off is not equally well researched for modern ML analytic tasks, particularly deep neural network models. There are two characteristics of the inference cost of analytic tasks for deep neural network models. Firstly, with the development of high-end hardware and large-scale datasets, recent deep models are growing deeper [krizhevsky2012imagenet, he2016deep] and wider [zagoruyko2016wide, xie2017aggregated]. State-of-the-art models have been designed with up to hundreds of layers and tens of millions of parameters, which leads to a dramatic increase in the inference cost. For instance, a 152-layer ResNet [he2016deep] with over 60 million parameters requires up to 20 Giga FLOPs for the inference of one single image. The surging computational cost severely affects the viability of many deep models in industry-scale applications. Secondly, for most analytic tasks, the workload is usually not constant; e.g., the number of images per query for a person re-id [zheng2015scalable] service in peak hours could be five times higher than the workload in off-peak hours. Therefore, such a trade-off should be naturally supported in the inference phase rather than the training phase: using one single deep model with fixed inference cost to support the peak workload could lead to a huge waste of resources in off-peak hours, and may not be able to handle the unexpected extreme workload. How to trade off accuracy and cost during deep model inference remains a challenging problem of great importance.
Existing model architecture re-design [iandola2016squeezenet, howard2017mobilenets] or model compression [han2015deep, han2015learning, liu2017learning] methods are not able to handle elastic inference satisfactorily, and we shall use an application example to highlight the challenges. The Singles Day shopping festival (https://en.wikipedia.org/wiki/Singles%27_Day) around 11 November was introduced by Taobao.com and is now becoming one of the biggest online shopping festivals around the world. In 2018, the Singles Day festival generated close to 30 billion dollars of sales in one single day and attracted hundreds of millions of users from more than 200 different countries. The peak trade rate reached 0.256 million transactions per second, with 42 million transactions processed in the database in the first half hour. On Singles Day, the search traffic of the e-commerce search engine is about three times that of a common day, and could be 10x in its first hour. Meanwhile, the workload of most other services in Alibaba, such as OLTP transactions, may also hit their peak at the same time [cao2018tcprt], and consequently, it is not possible to scale up the service by acquiring more hardware resources from Alibaba Cloud. System degradation is often carried out in two simple and naive ways: first, some costly deep learning models are replaced by simple GBDT [chen2016xgboost, ke2017lightgbm] models; second, the size of the candidate item set for ranking is reduced. The search accuracy suffers dramatically due to system degradation in such a coarse-grained manner. With a deep learning model supporting elastic inference cost, system degradation management can become more fine-grained, where the inference cost and accuracy trade-off per query sample can be dynamically determined based on the current system workload.
In this paper, instead of constructing small models based on each individual workload requirement, we propose and address a related but slightly different research problem: developing a general framework to support deep learning models with elastic inference cost. We base the framework on a pay-as-you-go model to support dynamic trade-offs between computation cost and accuracy during inference time. That is, dynamic optimization is supported based on system workload, availability of resources and user requirements.
An ML model abstraction with elastic inference cost would greatly benefit the optimization of the system design for complex analytics. We shall examine the problem from a fresh system perspective and propose our solution – model slicing, a general network training mechanism supporting elastic inference cost, to satisfy the run-time memory and computation budget dynamically during the inference phase. The crux of our approach is to decompose each layer of the model into groups of contiguous blocks of basic components, i.e. neurons in dense layers and channels in convolutional layers, and facilitate group residual learning by imposing a partially ordered relation on these groups. Specifically, if one group participates in the forward pass of model computation, then all of its preceding groups in this layer are also activated under such a structural constraint. Therefore, we can use a single parameter, the slice rate, to control the proportion of groups participating in the forward pass during inference. We empirically share the slice rate among all layers in the network; thus the computational resources required can be regulated precisely by the slice rate.
The slice rate is structurally the same concept as the width multiplier [howard2017mobilenets], which controls the width of the network. However, instead of training only one fixed narrower model as in [howard2017mobilenets], we train the network in a dynamic manner to enhance the representation capacity of all the subnets it subsumes. For each forward pass during training, as illustrated in Figure 1, we sample the slice rate from a distribution predetermined in the Slice Rate Scheduling Scheme, and train the corresponding sub-layers. The main challenges of training one model that supports inference at different widths include: how to determine proper candidate subnets (i.e. scheduling the slice rate) for each training iteration; and more importantly, how to stabilize the scale of output for each component (i.e. neurons or channels) as the number of input components varies. Independently of our work, Slimmable Neural Network [yu2018slimmable] (SlimmableNet) also proposes to train a single network executable at different widths. In [yu2018slimmable], candidate subnets are considered to be equally important during training, by statically scheduling all subnets for every training pass and incorporating a set of batch normalization [ioffe2015batch] (BN) layers into each layer, one for each candidate sub-layer, to address the output scale instability issue. In contrast, we consider the importance of the subnets to be different in model slicing (e.g., the full and the base network are the two most important subnets), and propose to dynamically schedule the training accordingly; besides the multi-BN solution, we further propose a more efficient solution with the group normalization [wu2018group] (GN) layer to prevent the scale instability, which works in accordance with the dynamic group-wise training and engenders the group residual representation. We shall provide more discussions in Section 3.
The model slicing training scheme can be scrutinized under the perspective of residual learning [he2016deep, he2016identity] and knowledge distillation [hinton2015distilling]. Under the random training process of model slicing, the groups of each layer need to build up the representation incrementally, where the preceding groups carry the most fundamental information and the following groups the residual representation relatively. Structurally, the final learned network is an ensemble of $G$ subnets, with $G$ being the number of groups, each corresponding to one slice rate. The parameters of these subnets are tied together, and during each forward training pass, one subnet uniquely indexed by the slice rate is selected and trained. We conjecture that the accuracy of the resulting fully-trained network should be comparable to the network trained conventionally. Meanwhile, smaller subnets gradually distill knowledge from larger subnets as the training progresses, and thus can achieve comparable or even higher accuracy than their individually trained counterparts. Consequently, we can provide the same functionality as an ensemble of models with only one model by width slicing.
The proposed training scheme has many advantages over existing methods on various issues such as model compression, model cascade and anytime prediction. First, model slicing is readily applicable to existing neural networks, requiring no iterative retraining or dedicated library/hardware support, as compared with most compression methods [han2015learning, liu2017learning]. Second, instead of training a set of models and optimizing the scheduling of these models with different accuracy-efficiency trade-offs as in conventional model cascade [kang2017noscope, wang2017idk], model slicing provides the same functionality of producing an approximate low-cost prediction with one single model. Third, the structure of the model trained with model slicing naturally supports applications where the model is required to give predictions within a dynamically given computational budget, e.g., anytime prediction [huang2017multi, hu2019learning].
Our main technical contributions are:
We develop a general training and inference framework model slicing that enables deep neural network models to support complex analytics with the trade-off between accuracy and inference cost/resource constraints on a per-input basis.
We formally introduce the group residual learning of model slicing to general neural network models and further convolutional and recurrent neural networks. We also study the training details of model slicing and their impact in depth.
We empirically validate through extensive experiments that neural networks trained with model slicing can achieve performance comparable to an ensemble of networks with one single model and support fluctuating workload with up to 16x volatility. Example applications are also provided to illustrate the usability of model slicing. The code is available at GitHub (https://github.com/ooibc88/modelslicing) and has been included in [ooi2015singa].
The rest of the paper is organized as follows. Section 2 provides a literature survey of related works. Section 3 introduces model slicing and how it can be applied to various deep learning models, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), etc. We then show how model slicing can support fine-grained system degradation management for present-day industrial deep learning services, and we also provide an illustrating application of cascade ranking in Section 4. Experimental evaluations of model slicing are given in Section 5, under prevailing natural language processing and computer vision tasks on public benchmark datasets. Visualizations and detailed discussions of the results are also provided. Section 6 concludes the paper and points out some further research directions.
2 Related Work
2.1 Resource-aware Model Optimization
Many recent works directly devise networks [huang2017multi, wang2018skipnet, cai2019isbnet] that are more economical in producing predictions. SkipNet [wang2018skipnet] incorporates reinforcement learning into the network design, which guides the gating module on whether to bypass the current layer for each residual block. SkipNet can provide predictions more efficiently, yet in an inherently less controlled manner. In MoE [shazeer2017outrageously], a gating network is introduced to select a smaller number of networks out of a mixture-of-experts consisting of up to thousands of networks during inference for each sample. This kind of model ensemble approach aims to scale up the model capacity without introducing much overhead, while our approach enables every single model trained to scale down and support elastic inference cost.
MSDNet [huang2017multi] supports classification with computational resource budgets at test time by inserting multiple classifiers into a 2D multi-scale version of DenseNet [huang2017densely]. By early-exiting into a classifier, MSDNet can provide predictions within given computation constraints. ANNs [hu2019learning] adopts a similar design strategy of introducing auxiliary classifiers with Adaptive Loss Balancing, which supports the trade-off between accuracy and computational cost by using the intermediate features. [mcintosh2018recurrent] also develops a model that can successively improve prediction quality with each iteration, but this approach is specific to segmenting videos with RNN models. These methods can largely alleviate the computational efficiency problem. However, they are highly specialized networks, which restricts their applicability. Functionally, models trained with model slicing also reuse intermediate features and support progressive prediction, but with width slicing. Model slicing works similarly to these networks yet is more efficient, flexible and general.
2.2 Model Compression
Reducing the model size and computational cost has become a central problem in the deployment of deep learning solutions in real-world applications. Many works have been proposed to resolve the challenges of growing network size and surging resource expenditure incurred, mainly memory and computation. The mainstream solutions are to compress networks into smaller ones, including low-rank approximation [denton2014exploiting], network quantization [courbariaux2016binarized, han2015deep, han2015learning], weight pruning [han2015learning, han2015deep], network sparsification on different level of structure [wen2016learning, liu2017learning] etc.
To this end, many model compression approaches attempt to reduce the model size of trained networks. [denton2014exploiting] reduces model redundancy with tensor decomposition on the weight matrix. [courbariaux2016binarized] and [han2015learning] instead propose to quantize the network weights to save storage space. HashNet [chen2015compressing] also proposes to hash network weights into different groups and share weight values within each group. These techniques are effective in reducing model size. For instance, [han2015learning] achieves up to 35x to 49x compression rates on AlexNet [krizhevsky2012imagenet]. Although a considerable amount of storage can be saved, these techniques can hardly reduce run-time memory or inference time, and they typically need a dedicated library and/or hardware support.
Many studies propose to prune weights, filters or channels in the networks. These approaches are generally effective because, typically, deep networks are highly redundant in model representation. [han2015deep, han2015learning] iteratively prune unimportant connections of small weights in trained neural networks. [srinivas2017training] further guides the sparsification of neural networks during training by explicitly imposing sparse constraints over each weight with a gating variable. The resulting networks are highly sparse, which can be stored compactly in a sparse format. However, the inference speedup of these methods depends heavily on dedicated sparse matrix operation libraries or hardware, and the saving in run-time memory is again very limited, since most of the memory consumption comes from the activation maps instead of these weights. [wen2016learning, liu2017learning] reduce the model size more radically by imposing regularization on the channel or filter and then pruning the unimportant components. Like model slicing, channel and filter level sparsity can reduce the model size, run-time memory footprint and also lower the number of computational operations. However, these methods often require iterative fine-tuning to regain performance and support no inference time control.
2.3 Efficient Model Design
Instead of compressing existing large neural networks during or after training, recent works have also been exploring more efficient network design. ResNet [he2016deep, he2016identity] proposes residual learning via an identity mapping shortcut and the efficient bottleneck structure, which enables the training of very deep networks without introducing more parameters. [veit2016residual] shows that ResNet behaves like an ensemble of shallow networks and that it can still function normally with a certain fraction of layers removed. FractalNet [larsson2016fractalnet] contains a series of duplicated fractal architectures with interacting subpaths. FractalNet adopts drop-path training, which randomly selects certain paths during training, allowing for the extraction of fixed-depth subnetworks after training without significant performance loss. To some extent, these network architectures can support on-demand workload by slicing subnets layer-wise or path-wise. However, these methods are not generally applicable to other networks, and the accuracy drops significantly when shortening or narrowing the network.
Many recent works focus on designing lightweight networks. SqueezeNet [iandola2016squeezenet] reduces parameters and computation with the fire module. MobileNet [howard2017mobilenets] and Xception [chollet2017xception] utilize depth-wise and point-wise convolution for more parameter efficient convolutional networks. ShuffleNet [zhang2018shufflenet] proposes point-wise group convolution with channel shuffle to help the information flowing across channels. These architectures scrutinize the bottleneck in conventional convolutional neural networks and search for more efficient transformation, reducing the model size and computation greatly.
3 Model Slicing
We aim to provide a general training scheme for neural networks to support on-demand workload with elastic inference cost. More specifically, the target is to enable the neural network to produce prediction within prescribed computational resources budget for each input instance, and meanwhile maintain the accuracy.
Existing methods of model compression, model ensemble and anytime prediction can partially address this problem, but each has its limitations. Model compression methods such as network slimming [liu2017learning], which compresses the channel width of each layer, produce efficient models, but they typically take a longer training time for iterative pruning and retraining and, more importantly, have no control over the resources required during inference. Model ensemble methods, e.g., an ensemble of networks of varying depth or width, support inference-time resource control by scheduling a model for the immediate prediction task. However, deploying an ensemble of models multiplies the amount of disk storage and memory consumption; further, the scheduling of these models is a non-trivial task for the system in deployment. Many works [huang2017multi, hu2019learning, mcintosh2018recurrent] instead exploit intermediate features for faster approximate prediction. For instance, Multi-Scale DenseNet [huang2017multi] (MSDNet) inserts multiple classifiers into the model and thus supports anytime prediction by early-exiting on a classifier.
Our model slicing also exploits and reuses intermediate features produced by the model while sidesteps the aforementioned problems. The key idea is to develop a general training and inference mechanism called model slicing which slices a narrower subnet for faster computation. With model slicing, neural networks are able to dynamically control the width of the subnet and thus regulate the computational resource consumption with one single parameter slice rate. In Figure 2, we illustrate by comparing the accuracy-efficiency trade-offs of ResNet trained with different approaches. We can observe that model ensemble methods are strong baselines which trade off accuracy for lower inference cost and that the Ensemble of ResNet with varying width performs better than varying depth. This finding indicates the superiority of width slicing over depth slicing, which is corroborated by the rapid loss in accuracy of ResNet with Multi-Classifiers (single model) in Figure 2. We will show that trained with model slicing, one single model is able to provide inference performance comparable to the ensemble of varying width networks. Therefore, model slicing is an ideal solution for neural networks to support elastic inference cost and resource constraints.
3.1 Model Slicing for Neural Networks
We start by introducing model slicing to the fully-connected layer (dense layer) of general neural networks. Each dense layer in the neural network transforms the input via a weight matrix $W \in \mathbb{R}^{M \times N}$: $y = Wx$, where $x$, an $N$-dimension input vector, corresponds to $N$ input neurons and $y$ to $M$ output neurons correspondingly. Details such as the bias and non-linearity are omitted here for brevity. As illustrated in Figure 1, a gating variable $\alpha_i$ is implicitly introduced to impose a structural constraint on each input neuron $x_i$:

$y = W (\alpha \odot x), \quad \alpha_i \in \{0, 1\}$
Each gating variable $\alpha_i$ thus controls the participation of the corresponding neuron $x_i$ in each forward pass during both training and inference. Formally, the structural constraint is obtained by imposing a partially ordered relation on these gating variables:

$\alpha_{i+1} = 1 \Rightarrow \alpha_i = 1, \quad \forall\, 1 \le i < N$
which requires that the set of activated neurons during each forward pass forms a contiguous block starting from the first neuron. Based on this relation, we further divide these neurons into $G$ ordered groups, i.e. $\{\mathcal{G}_1, \dots, \mathcal{G}_G\}$, where each group corresponds to a contiguous block of neurons. We denote the index of the rightmost neuron of the first $k$ groups as $N_k$, and the corresponding sub-layer as Sub-layer-$r$, where the slice rate $r = N_k / N \in (0, 1]$. Then the set of groups participating in the current forward pass can be determined by indexing the rightmost group $\mathcal{G}_k$, and the set of neurons involved corresponds to $\{x_1, \dots, x_{N_k}\}$. Note that the group number $G$ is a pre-defined hyper-parameter, which could be set from 1 (the original layer) to $N$ (each component forms a group).
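The group-wise slicing of a dense layer can be sketched as follows. This is a minimal NumPy illustration of the indexing scheme described above, not the paper's implementation; the function and variable names are our own:

```python
import math
import numpy as np

def sliced_dense_forward(W, x, slice_rate, num_groups):
    """Forward pass of a dense layer sliced to the first groups.

    W: (M, N) weight matrix, x: (N,) input vector. Neurons are divided
    into `num_groups` contiguous groups; the slice rate selects the
    rightmost participating group, so the activated neurons always form
    a contiguous block starting from the first neuron.
    """
    M, N = W.shape
    k = math.ceil(slice_rate * num_groups)   # index of the rightmost group
    n_in = k * (N // num_groups)             # activated input neurons
    n_out = k * (M // num_groups)            # shared slice rate: outputs sliced too
    # Only the sliced part of the weight matrix participates in this pass.
    return W[:n_out, :n_in] @ x[:n_in]
```

With `slice_rate=1.0` the full layer is recovered, while smaller rates compute only the top-left block of the weight matrix.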
Empirically, the slice rate is shared among all the layers in the network, and we denote the subnet of the first $k$ groups in each layer as Subnet-$r$. Thus the width of the whole network can be regulated by the single parameter $r$. As illustrated in Figure 1, only the sliced part of the weight matrix and components are activated and required to reside in memory for inference in the current forward pass. We denote the computational operation required by the full network as $C_F$; then the computational operation required by the subnet of slice rate $r$ is roughly $r^2 \cdot C_F$. Therefore, a run-time computational resource limit $C_r$ can be dynamically satisfied by restricting the slice rate by:

$r \le \sqrt{C_r / C_F}$
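Since the cost of Subnet-$r$ is roughly quadratic in $r$, the largest slice rate fitting a run-time budget can be picked with a simple bound. The following sketch is illustrative (the function name and the valid-rate list are our own assumptions):

```python
import math

def max_slice_rate(budget_flops, full_flops, valid_rates):
    """Return the largest valid slice rate r with r**2 * full_flops <= budget_flops.

    The cost of Subnet-r is roughly quadratic in r, so the constraint is
    r <= sqrt(budget / full_cost). Falls back to the smallest rate (the
    base network) if even that exceeds the budget.
    """
    bound = math.sqrt(budget_flops / full_flops)
    feasible = [r for r in valid_rates if r <= bound]
    return max(feasible) if feasible else min(valid_rates)
```

For example, with a 20 GFLOPs full network and a 5 GFLOPs budget, the bound is $\sqrt{0.25} = 0.5$, so Subnet-$0.5$ is deployed.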
Consequently, a subnet can be readily sliced and deployed out of the network trained with model slicing, whose disk storage and run-time memory consumption are also roughly quadratic in the slice rate $r$. Besides satisfying the run-time computational constraint, another primary concern is how to maintain the performance of these subnets. To this end, we propose the model slicing training in Algorithm 1. For each training pass, a list of slice rates is sampled from the predefined slice rate list $L$ by a scheduling scheme $\mathcal{S}$, and the corresponding subnets are optimized under the current training batch. We shall elaborate on the scheduling scheme in Section 3.4.
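The training loop of Algorithm 1 can be sketched as follows for a single linear layer. This is an illustrative stand-in (the scheduler, squared-error loss, and plain SGD update are our own assumptions, not the paper's code); its purpose is only to show that all scheduled sub-layers share, and update in place, the same weight matrix:

```python
import math
import numpy as np

def train_model_slicing(W, data, scheduler, lr=0.01, num_groups=4, steps=100):
    """Sketch of the model-slicing training loop for a single linear layer.

    For each batch, the scheduler returns a list of slice rates; the
    corresponding sub-layers (which share parameters with the full layer)
    are each optimized on the same batch.
    """
    rng = np.random.default_rng(0)
    M, N = W.shape
    for step in range(steps):
        x, y = data[step % len(data)]
        for r in scheduler(rng):                        # scheduled slice rates
            k = math.ceil(r * num_groups)
            n_in, n_out = k * N // num_groups, k * M // num_groups
            pred = W[:n_out, :n_in] @ x[:n_in]          # sub-layer forward
            # Gradient of 0.5 * ||pred - y||^2 w.r.t. the sliced weights.
            grad = np.outer(pred - y[:n_out], x[:n_in])
            W[:n_out, :n_in] -= lr * grad               # in-place shared update
    return W
```

Each scheduled slice rate trains a subnet that subsumes all smaller ones, so the base groups receive gradient from every scheduled subnet while the trailing groups are trained only when larger rates are sampled.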
Notice that the parameters of all subnets are tied together and any subnet indexed by a slice rate $r$ subsumes all smaller subnets. The structural constraint of model slicing is reminiscent of residual learning [he2016deep, he2016identity], where Subnet-$r_{\min}$ (the base network) carries the base representation. With a new input group introduced as $r$ grows, each group is optimized to learn from finer input details, and thus the group residual representation. We shall provide more discussions on this effect in Section 3.5. From the viewpoint of knowledge distillation [hinton2015distilling], Subnet-$1.0$ (the full network) maintains the capacity of the full model, and as the training progresses, each Subnet-$r$ gradually distills the representation from larger subnets and transfers the knowledge to smaller ones. Under this training scheme, we conjecture that the full network can maintain its accuracy, or possibly improve it due to the regularization and ensemble effect; and in the meantime, the subnets can gradually pick up the performance by distilling knowledge from larger subnets.
3.2 Convolutional Neural Networks
Model slicing is readily applicable to convolutional neural networks in a similar manner. The most fundamental operation in CNNs comes from the convolutional layer, which can be constructed to represent any given transformation $\mathcal{F}: X \rightarrow Y$, where $X \in \mathbb{R}^{N \times H_{in} \times W_{in}}$ is the input with $N$ channels of size $H_{in} \times W_{in}$, and $Y \in \mathbb{R}^{M \times H_{out} \times W_{out}}$ the output likewise. Denoting $X = [x_1, x_2, \dots, x_N]$ and $Y = [y_1, y_2, \dots, y_M]$ in vectors of channels, the parameter set associated with each convolutional layer is a set of filter kernels $K = [k_1, k_2, \dots, k_M]$. In a way similar to the dense layer, model slicing for the convolutional layer can be represented as:

$y_j = k_j * X = \sum_{i=1}^{\lceil r \cdot N \rceil} k_j^i * x_i$
where $*$ denotes the convolution operation, and $k_j^i$ is a 2D spatial kernel associated with output channel $y_j$ that convolves on input channel $x_i$. Consequently, treating channels in convolutional layers analogously to neurons in dense layers, model slicing can be directly applied to CNNs with the same training scheme.
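Treating channels as neurons, slicing a convolutional layer just slices the filter bank along both the output-channel and input-channel dimensions. The sketch below uses a 1x1 convolution so the channel mixing reduces to a matrix product (this simplification and all names are our own illustration):

```python
import math
import numpy as np

def sliced_conv1x1_forward(K, X, slice_rate, num_groups):
    """Sliced forward pass of a 1x1 convolution.

    K: (M, N) filter bank (M output channels, N input channels),
    X: (N, H, W) input feature map. The slice rate selects the first
    contiguous block of channel groups, exactly as for neurons in a
    dense layer.
    """
    M, N = K.shape
    k = math.ceil(slice_rate * num_groups)
    n_in, n_out = k * N // num_groups, k * M // num_groups
    # 1x1 conv: every output pixel is a linear mix of the input channels.
    return np.einsum('mn,nhw->mhw', K[:n_out, :n_in], X[:n_in])
```

A spatial kernel larger than 1x1 would slice the weight tensor identically along its channel dimensions; only the per-pixel mixing would change.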
Nonetheless, the output scale instability issue arises when applying model slicing to CNNs. Specifically, each convolutional layer is typically coupled with a batch normalization layer [ioffe2015batch] to normalize outputs in the batch dimension, which stabilizes the mean and variance of input channels received by channels in the next layer. In its implementation, each batch-norm layer normalizes outputs with the batch mean $\mu_B$ and variance $\sigma_B^2$:

$\hat{x} = \gamma \cdot \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$

and keeps records of running estimates of them, which are used directly after training. Here, $\gamma$ and $\beta$ are learnable affine transformation parameters of this batch-norm layer associated with each channel. However, with model slicing, the number of inputs received by a given output channel is no longer fixed, being instead determined by the slice rate during each forward pass. Consequently, the mean and variance of the batch-norm layer on the output fluctuate drastically; thus one single set of running estimates is unable to stabilize the distribution of the output channel.
We propose to address this issue with Group Normalization [wu2018group], an adaptation of batch-norm. Group-norm divides channels into groups and normalizes channels in the same way as batch-norm, with the only difference that the mean and variance are calculated dynamically within each group. Formally, given the total number of groups $G_N$, the mean $\mu_g$ and variance $\sigma_g^2$ of the $g$-th group are estimated within the $g$-th contiguous block of $N/G_N$ channels and shared among all the channels in that group for normalization.
Group-norm normalizes channels group-wise instead of batch-wise, avoiding the running estimates of the batch mean and variance in batch-norm, whose error increases rapidly as the batch size decreases. Experiments in [wu2018group], also validated by our experiments on various network architectures, show that the accuracy of group-norm is relatively stable with respect to the batch size and the group number. Besides stabilizing the scale, another benefit of group-norm is that it engenders the group-wise representation, which is in line with the group residual learning effect of model slicing training. To introduce model slicing to CNNs, we only need to replace batch-norm with group-norm and slice the normalization layers together with convolutional layers at the granularity of the group.
3.3 Recurrent Neural Networks
Model slicing can be readily applied to recurrent layers similarly to fully-connected layers. Take the vanilla recurrent layer $h_t = \phi(W_x x_t + W_h h_{t-1})$ for demonstration; the difference is that the output $h_t$ is computed from two sets of inputs, namely the current input $x_t$ and the previous hidden state $h_{t-1}$.
Consequently, we can slice each input of the recurrent layer separately and adopt the same training scheme as for fully-connected layers. Model slicing for recurrent layers of RNN variants such as GRU [cho2014properties] and LSTM [hochreiter1997long] works similarly. Dynamic slicing is applied to all input and output sets, including hidden/memory states and various gates, regulated by the single slice rate parameter of each layer.
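For a vanilla recurrent layer, both input sets are sliced by the same rate, so the hidden state of a subnet stays self-consistent across time steps. A sketch with a tanh non-linearity (weight shapes and names are our own illustration):

```python
import math
import numpy as np

def sliced_rnn_step(W_x, W_h, x, h_prev, slice_rate, num_groups):
    """One sliced step of a vanilla RNN cell: h_t = tanh(W_x x_t + W_h h_{t-1}).

    Both the current input x_t and the previous hidden state h_{t-1} are
    sliced to their first groups, so Subnet-r produces a hidden state of
    the sliced width that it can consume again at the next time step.
    """
    H, D = W_x.shape                         # hidden size, input size
    k = math.ceil(slice_rate * num_groups)
    d, h = k * D // num_groups, k * H // num_groups
    return np.tanh(W_x[:h, :d] @ x[:d] + W_h[:h, :h] @ h_prev[:h])
```

For GRU/LSTM cells, the same slicing would apply to every gate's weight matrices and to the memory state.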
3.4 Slice Rate Scheduling Scheme
As shown in Algorithm 1, for each training pass of model slicing, a list of slice rates is sampled from a predetermined scheduling scheme F, and the corresponding subnets are then trained on the current training batch. Formally, random scheduling can be described as sampling the slice rate r from a distribution F. Denoting the ordered list of valid slice rates as r_1 < r_2 < ... < r_n, we have:
where f is the probability density function and F the cumulative distribution function, and p_i is the probability of slice rate r_i being sampled. Thereby, random scheduling (e.g., sampling from the uniform or the normal distribution) can be parameterized with a categorical distribution (p_1, ..., p_n), where each p_i denotes the relative importance of Subnet-r_i over the other subnets. Further, the importance of these subnets should be treated differently. In particular, the full and the base network (i.e., Subnet-1.0 and Subnet-r_1) should be the two most important subnets, because the full network represents the model capacity and the base network forms the basis for all the subnets. Based on this observation, we propose three categories of scheduling schemes:
Random scheduling, where each slice rate is sampled from a categorical distribution parameterized by the assigned probabilities (p_1, ..., p_n).
Static scheduling, where all valid slice rates are scheduled for the current training pass.
Random static scheduling, where both a fixed set and a set of randomly sampled slice rates are scheduled.
For random scheduling, the importance of different subnets can be represented in the assigned probabilities, where we can assign higher sampling probabilities to more important subnets (e.g., the full and base network) during training. Likewise, for random static scheduling, we can include the important subnets in the fixed set and meanwhile assign proper probabilities to the remaining subnets. We shall evaluate these slice rate scheduling schemes in Section 5.1.2.
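The sampling step of these schemes can be sketched in a few lines; this is an illustrative sketch, not the paper's training code, and the weight list, function names and the choice of one intermediate subnet in R-min-max are assumptions:

```python
import random

# Valid slice rates in order (0.375 to 1.0 in steps of 0.125, as in the experiments).
SLICE_RATES = [0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
# Hypothetical weight list: the base (first) and full (last) subnets are
# treated as the most important and are sampled more frequently.
WEIGHTS = [2.0, 1.0, 1.0, 1.0, 1.0, 2.0]

def sample_slice_rates(k, rates=SLICE_RATES, weights=WEIGHTS):
    """Random scheduling: draw k slice rates (with replacement) for one
    training pass from a categorical distribution over the valid rates."""
    return random.choices(rates, weights=weights, k=k)

def sample_min_max(rates=SLICE_RATES):
    """Random static scheduling (R-min-max): always schedule the base and
    the full network, plus one uniformly sampled intermediate subnet."""
    return [rates[0], rates[-1], random.choice(rates[1:-1])]
```

Static scheduling would simply return the full list `SLICE_RATES` every pass, which is why its cost grows linearly with the number of configured subnets.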
3.5 Group Residual Learning of Model Slicing
The model slicing training scheme is structurally reminiscent of residual learning proposed in ResNet [he2016deep, he2016identity]. In ResNet, a shortcut connection of identity mapping forwards the input x directly to the output: y = F(x) + x, where during optimization the convolutional transformation F only needs to learn the residual representation on top of the input information x, namely F(x) = y - x. Analogously, networks trained with model slicing learn to accumulate the representation as additional groups are introduced (groups of neurons in dense layers and groups of channels in convolutional layers), i.e., each newly introduced group learns a residual on top of the representation of the preceding groups.
To demonstrate the group residual learning effect in model slicing, we take the transformation of a fully-connected layer as an example and analyze the relationship between any two sub-layers of slice rates r_a and r_b with r_a < r_b. We have the transformation of Sub-layer-r_a as y_a = W_11 x_a, and the transformation of Sub-layer-r_b in block matrix multiplication as:

[ y'_a ]   [ W_11  W_12 ]   [ x_a ]
[ Δy   ] = [ W_21  W_22 ] · [ Δx  ]

Here, Δx is the supplementary input group introduced for Sub-layer-r_b and Δy is the corresponding output group. Expanding the first block row gives y'_a = W_11 x_a + W_12 Δx = y_a + W_12 Δx, so the group residual representation learning can be clarified from two angles. Firstly, the base part y'_a of the Sub-layer-r_b output is composed of the base representation y_a and the residual representation W_12 Δx. Secondly, the newly-introduced output group Δy = W_21 x_a + W_22 Δx further forms a residual representation supplementary to the base representation y'_a. Higher model capacity is therefore expected of Subnet-r_b.
The justification for the group residual learning effect in model slicing is that, as training progresses, the base representation y_a alone in Sub-layer-r_a has already been optimized for the learning task. Therefore, the supplementary group introduced in Sub-layer-r_b gradually adapts to learn the residual representation, which is corroborated by the visualization in Section 5.5.1. Furthermore, this group residual learning characteristic provides an efficient way to harness the richer representation of Subnet-r_b based on Subnet-r_a by the simple approximation y'_a ≈ y_a. With this approximation in every layer of the network, the most computationally heavy features y_a can be reused without re-evaluation, and the representation of Sub-layer-r_b can be updated by calculating only Δy at a significantly lower computational cost.
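The block decomposition can be verified numerically. The following NumPy sketch (with hypothetical layer sizes) shows that for a single linear layer the larger sub-layer's output decomposes exactly into the smaller sub-layer's output plus residual terms; across a deep network with nonlinearities, reusing the smaller subnet's features becomes the approximation described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 8, 8
W = rng.standard_normal((n_out, n_in))   # full weight matrix
x = rng.standard_normal(n_in)            # full input vector

r_a, r_b = 0.5, 0.75                     # slice rates, r_a < r_b
i_a, i_b = int(r_a * n_in), int(r_b * n_in)
o_a, o_b = int(r_a * n_out), int(r_b * n_out)

y_a = W[:o_a, :i_a] @ x[:i_a]            # Sub-layer-r_a output

# Sub-layer-r_b output via block decomposition, reusing y_a for the base rows:
y_base = y_a + W[:o_a, i_a:i_b] @ x[i_a:i_b]  # base rows + residual input group
y_new = W[o_a:o_b, :i_b] @ x[:i_b]            # newly introduced output group
y_b = np.concatenate([y_base, y_new])

# Identical to evaluating Sub-layer-r_b directly:
assert np.allclose(y_b, W[:o_b, :i_b] @ x[:i_b])
```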
We note that the model slicing training for group residual representation is applicable to the majority of neural networks. In addition, the group residual learning mechanism of model slicing is ideally suited for networks with layer transformation of multiple branches, e.g., group convolution [zhang2018shufflenet], depth-wise convolution [howard2017mobilenets] and homogeneous multi-branch residual transformation of ResNeXt [xie2017aggregated] etc.
4 Example Applications
In this section, we demonstrate how model slicing can benefit the deployment of deep learning based services. We use model slicing as the base framework to manage fine-grained system degradation for large-scale machine learning services with dynamic workloads. We also provide an example application of cascade ranking with model slicing.
4.1 Supporting Dynamic Workload Services
For a service with a dynamic workload, fine-grained system degradation management can be supported directly and efficiently with model slicing. Query samples come as a stream, and there is a dynamic latency constraint. Queries are usually batch-processed with vectorized computation for higher efficiency.
We design and implement an example solution to guarantee the latency and throughput requirements via model slicing. Denote the processing time per sample of the full model as t_f and the latency constraint as T. To satisfy the dynamic latency constraint under an unknown query workload, we can build a mini-batch at a fixed time interval and utilize the remaining time budget for processing: we first examine the number of samples n in the current batch, and then choose the largest slice rate r satisfying the constraint of Equation 3 so that the processing time for this batch is within the budget. Under such a system design, no computational resource is wasted, as the total processing time per mini-batch is exactly the time interval of the batch input. Meanwhile, all samples can be processed within the required latency.
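The slice rate selection step can be sketched as follows, assuming (as a simplification of Equation 3) that per-sample cost scales quadratically with the slice rate; the function name and cost model are illustrative, not the paper's implementation:

```python
def choose_slice_rate(batch_size, budget, t_full, rates):
    """Pick the largest slice rate r whose estimated batch processing time
    batch_size * t_full * r**2 fits within the remaining time budget
    (per-sample cost is roughly quadratic in r)."""
    for r in sorted(rates, reverse=True):
        if batch_size * t_full * r * r <= budget:
            return r
    # Even the base network exceeds the budget: degrade to the base network.
    return min(rates)
```

For example, with t_full = 1.0 and a budget of 10, a batch of 10 samples can afford the full model (r = 1.0), while a batch of 20 must degrade to r = 0.625.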
4.2 Implementing Cascade Ranking Application
Many information retrieval and data mining applications such as search and recommendation need to rank a large set of data items with respect to many user requests in an online manner. There are generally two issues in this process: 1) effectiveness, i.e., how accurate the obtained results in the final ranked list are and whether there is a sufficient number of good results; and 2) efficiency, i.e., whether the results are obtained in a timely manner from the user perspective and whether the computational cost of ranking is low from the system perspective. For large-scale ranking applications, it is of vital importance to address both issues to provide a good user experience and achieve a cost-saving solution.
Cascade ranking [wang2011cascade, liu2017cascade] is a strategy designed for such a trade-off. It utilizes a sequence of prediction functions of different costs in different stages. It can thus eliminate irrelevant items (e.g., for a query) in earlier stages with simple features and models, while segregating more relevant items in later stages with more complicated features and models. In general, functions in early stages require low inference cost, while functions in later stages require high accuracy.
One critical characteristic of cascade ranking is that the optimization target of each function may depend on all the other functions in different stages [liu2017cascade]. For instance, suppose we aim to build a two-stage cascade ranking solution and the function in stage two mistakenly drops a certain positive item. Then a stage-one function that mistakenly drops the same item is preferable to one that mistakenly drops a different positive item, even if the former has a higher error rate over the whole dataset, because overlapping mistakes leave more positive items in the final result. Detailed analyses are given in [wang2011cascade, chen2017efficient, liu2017cascade]. Therefore, we expect the predictions of positive items given by the functions in different stages to be consistent so that the accumulated false negatives are minimized. Unfortunately, most implementations of the ranking/filtering functions in cascade ranking use different model architectures with different parameters at each stage. The results of the different models are thus unlikely to be consistent.
Model slicing is an ideal solution for cascade ranking. Firstly, it provides the trade-off between model effectiveness and model efficiency within one single model; the ranking functions at different stages can be obtained simply by configuring the inference cost of the model. Secondly, as corroborated in Section 5.5, the prediction results of model slicing sub-models are inherently correlated, since the larger model actually uses the smaller model as the base of its representation. We shall illustrate the effectiveness and efficiency of model slicing in comparison with the traditional model cascade solution in a cascade ranking simulation in Section 5.4.
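A minimal sketch of such a cascade pipeline; with model slicing, each predict function could be the same network evaluated at an increasing slice rate. The stage functions and thresholds below are illustrative placeholders, not the paper's ranking functions.

```python
def cascade_rank(items, stages):
    """stages: list of (predict_fn, threshold) pairs ordered from cheap to costly.
    Each stage keeps only the items scoring at least its threshold;
    items surviving every stage form the final candidate set."""
    kept = list(items)
    for predict, threshold in stages:
        kept = [item for item in kept if predict(item) >= threshold]
    return kept
```

Because later stages only see survivors of earlier stages, consistent predictors across stages minimize the accumulated false negatives, which is exactly the property model slicing sub-models provide.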
5 Experiments
We evaluate the performance of model slicing on state-of-the-art neural networks on two categories of public benchmark tasks: we evaluate model slicing for dense layers, i.e., fully-connected and recurrent layers, on language modeling [mikolov2010recurrent, zaremba2014recurrent, press2016using] in Section 5.2, and model slicing for convolutional layers on image classification [simonyan2014very, he2016deep, zagoruyko2016wide] in Section 5.3. The experimental setup of model slicing is provided in Section 5.1; the cascade ranking simulation of the example application and the visualization of model slicing training are given in Section 5.4 and Section 5.5 respectively.
5.1 Model Slicing Setup
5.1.1 General Setup and Baselines
The slice rate r corresponds to Subnet-r and is restricted between a lower bound r_L and 1.0. In the experiments, the networks trained with model slicing are evaluated with the slice rate list in which r ranges from the lower bound of 0.25/0.375 (corresponding to around 16x/7x computational speedup) to 1.0 in increments of 0.125 (the slice granularity). We apply model slicing to all the hidden layers except the input and output layers, because both layers are necessary for inference and take a negligible amount of parameters and computation in the full network.
We compare model slicing primarily with two baselines. The first baseline is the full network trained without model slicing (single model), implemented by fixing the slice rate to 1.0 during training; during inference, we slice the corresponding Sub-layer-r of each layer in the network for comparison. The second baseline is an ensemble of networks of varying width (fixed models). In addition to these two baselines, we also compare model slicing with model compression (Network Slimming [liu2017learning]), anytime prediction (multi-classifier methods, e.g., MSDNet [huang2017multi]) and efficient prediction (SkipNet [wang2018skipnet]).
5.1.2 Slice Rate Scheduling Scheme
We evaluate the three slice rate scheduling schemes proposed in Section 3.4 with the slice rate list in Table 1. Specifically, the baseline is the ensemble of fixed models (fixed). For random scheduling, we evaluate uniform sampling (R-uniform) and weighted random sampling (R-weighted, with a predefined weight list); in particular, R-uniform-k and R-weighted-k denote random scheduling with k slice rates scheduled per forward pass. For static scheduling (Static), the subnets are regarded as equally important and thus all slice rates are scheduled, so the computation grows linearly with the number of subnets configured. For random static scheduling, we evaluate statically scheduling the base network (R-min), the full network (R-max), or both of these two subnets (R-min-max), while uniformly sampling one of the remaining subnets. The detailed training settings are given in Section 5.3.2.
Table 1 shows that weighted random sampling achieves higher accuracy than uniform sampling with a comparable training budget, and that training longer further improves the performance. In contrast, static scheduling performs consistently worse than weighted random scheduling even though it takes more training rounds. The results corroborate our conjecture that the base and the full network are of greater importance and thus should be scheduled more frequently during training.
We next evaluate random static scheduling, which statically schedules the base and/or full network while uniformly sampling the remaining subnets. We observe that statically training the base (R-min) or the full (R-max) network helps to improve the corresponding subnets; meanwhile, the performance of the neighboring subnets also improves, mainly due to the effect of knowledge distillation. We also compare model slicing with SlimmableNet [yu2018slimmable] (Slimmable), which adopts static scheduling and multiple batch-norm layers instead of one group-norm layer. The results in Table 1 reveal that SlimmableNet obtains higher accuracy in larger subnets, which may result from the longer training time, while its smaller subnets perform worse than model slicing with random scheduling (e.g., R-weighted or R-min-max), mainly due to the lack of differentiation of the varying importance of subnets in static scheduling. In the following experiments, we therefore evaluate model slicing with R-weighted-3 for small datasets and R-min-max for larger datasets for reporting purposes.
5.1.3 The Lower Bound of Slice Rate
For each subnet, the required computational resources can be evaluated beforehand. The lower bound r_L controls the width of the base network and thus should be set according to Equation 3 under the computational resource limit. Figure 3 shows the accuracy of VGG-13 trained with different lower bounds. Empirically, the accuracy drops steadily as the slice rate decreases towards the lower bound, and networks trained with different lower bounds perform rather close to each other. Given a lower bound r_L, however, the accuracy of the corresponding Subnet-r_L is slightly higher than that of the other subnets, mainly because the base network is optimized more frequently. When the slice rate decreases below the lower bound, the accuracy drops drastically. This phenomenon meets the expectation that further slicing the base network destroys the base representation, and thus the accuracy suffers significantly. The loss of accuracy is more severe for convolutional neural networks, where the representation depends heavily on all the channels of the base network. In the following experiments, we therefore evaluate lower bounds 0.375/0.25 for small (e.g., CIFAR, PTB) and large (e.g., ImageNet) datasets respectively for reporting purposes, whose computational cost is roughly 14.1%/6.25% of the full network (i.e., 7.11x/16x speedup) and which can empirically be adjusted readily according to the deployment requirement.
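Since slicing shrinks both the input and output width of each layer, the cost of Subnet-r relative to the full network is roughly quadratic in r; a one-line sketch (the function name is ours) reproduces the figures quoted above:

```python
def relative_cost(r):
    """Inference cost of Subnet-r relative to the full network: both the
    input and output width of each layer shrink by a factor of r, so the
    per-layer computation shrinks by roughly r squared."""
    return r * r

assert abs(relative_cost(0.375) - 0.140625) < 1e-12  # ~14.1% of the full cost, ~7.11x speedup
assert relative_cost(0.25) == 0.0625                 # 6.25% of the full cost, 16x speedup
```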
5.2 NNLM for Language Modeling
5.2.1 Language modeling task and dataset
The task of language modeling is to model the probability distribution over a sequence of words. Neural Network Language Modeling (NNLM) comprises both fully-connected and recurrent layers; we thus adopt NNLM to evaluate the effectiveness of model slicing for dense layers. NNLM [mikolov2010recurrent, zaremba2014recurrent, press2016using] specifies the distribution over the next word given its preceding word sequence with neural networks. Training of NNLM involves minimizing the negative log-likelihood (NLL) of the sequence: NLL = -(1/N) Σ_i log p(w_i | w_1, ..., w_{i-1}). Following the common practice for language modeling, we use perplexity (PPL) to report the performance: PPL = exp(NLL). We adopt the widely benchmarked English Penn Tree Bank (PTB) dataset and use the standard train/test/validation split of [mikolov2010recurrent].
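For concreteness, a small sketch of the perplexity computation from per-token probabilities (illustrative only, not the evaluation code used in the experiments):

```python
import math

def perplexity(token_probs):
    """PPL = exp(NLL), where NLL is the mean negative log-likelihood
    of the observed tokens under the model."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is as uncertain as a uniform choice among 4 candidates.
assert abs(perplexity([0.25, 0.25, 0.25, 0.25]) - 4.0) < 1e-9
```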
5.2.2 NNLM configuration and training details
Following [mikolov2010recurrent, zaremba2014recurrent, press2016using], the NNLM model in the experiments consists of an input embedding layer, two consecutive LSTM layers, an output dense layer, and finally a softmax layer. A dropout layer follows the embedding layer and each of the two LSTM layers. The models are trained by truncated backpropagation through time, minimizing the NLL without any regularization terms using SGD. The learning rate is quartered in the next epoch if the perplexity does not decrease on the validation set. Model slicing applies to both recurrent layers and the output dense layer with output rescaling.
5.2.3 Results of Model Slicing on NNLM
Results in Figure 4 and Table 2 show that model slicing effectively supports on-demand workloads with one single model at the cost of only minimal performance loss. The performance of the network trained without model slicing decreases drastically as the slice rate decreases. With model slicing, the performance decreases steadily and stays comparable to the corresponding fixed models. In particular, the performance of a subnet is slightly better than the corresponding fixed model when the slice rate is near 1.0; for instance, as shown in Table 2, the perplexity of Subnet-1.0 (the full network) is slightly lower than that of the full fixed model.
This validates our hypothesis that the regularization and ensemble effect could improve the full model performance. Further, the student-teacher knowledge distillation effect of the group residual learning facilitates the learning process by transferring and sharing representation, and thus helps maintain the performance of subnets.
5.3 CNNs for Image Classification
Table 3: Detailed configurations of the evaluated CNN architectures (B-Block denotes the pre-activation bottleneck block).

| Group      | Output Size | VGG-13            | ResNet-164        | ResNet-56-2         | Output Size | VGG-16                | ResNet-50          |
|------------|-------------|-------------------|-------------------|---------------------|-------------|-----------------------|--------------------|
| conv1      | 32×32       | [conv3×3, 64]×2   | [B-Block, 16]×1   | [B-Block, 16]×1     | 112×112     | [conv3×3, 64]×3       | [B-Block, 64]×1    |
| conv2      | 32×32       | [conv3×3, 128]×2  | [B-Block, 16]×18  | [B-Block, 16×2]×6   | 56×56       | [conv3×3, 128]×3      | [B-Block, 64]×3    |
| conv3      | 16×16       | [conv3×3, 256]×2  | [B-Block, 32]×18  | [B-Block, 32×2]×6   | 28×28       | [conv3×3, 256]×3      | [B-Block, 128]×4   |
| conv4      | 8×8         | [conv3×3, 512]×4  | [B-Block, 64]×18  | [B-Block, 64×2]×6   | 14×14       | [conv3×3, 512]×3      | [B-Block, 256]×6   |
| conv5      | 8×8         | -                 | -                 | -                   | 7×7         | [conv3×3, 512]×3      | [B-Block, 512]×3   |
| avgPool/FC | 10          | [avg8×8, 512]     | [avg8×8, 64×4]    | [avg8×8, 64×2×4]    | 1000        | [512×7×7, 4096, 4096] | [avg7×7, 512×4]    |
In this subsection, we evaluate model slicing for convolutional layers on image classification tasks, mainly focusing on representative types of convolutional neural networks. We first introduce dataset statistics for the evaluation. Then configurations of the networks and training details are introduced. Finally, we discuss and compare with baselines the results of model slicing training scheme for CNNs.
5.3.1 Datasets
We evaluate the results on the CIFAR [krizhevsky2009learning] and ImageNet-12 [deng2009imagenet] image classification datasets.
The CIFAR [krizhevsky2009learning] datasets consist of 32×32 color images of natural scenes. CIFAR-10 consists of images drawn from 10 classes; the training and testing sets contain 50,000 and 10,000 images respectively. Following the standard data augmentation scheme [he2016deep, huang2016deep, huang2017densely], each image is first zero-padded with 4 pixels on each side, then randomly cropped to produce a 32×32 image again, followed by a random horizontal flip. We normalize the data using the channel means and standard deviations for data pre-processing.
The ILSVRC 2012 image classification dataset contains 1.2 million images for training and another 50,000 for validation from 1000 classes. We adopt the same data augmentation scheme for training images following the convention [he2016deep, zagoruyko2016wide, huang2017densely], and apply a 224×224 center crop to images at test time. The results are reported on the validation set following common practice.
5.3.2 CNN Architectures and Training Details
Model slicing dynamically slices channels within each layer of a CNN; we thus adopt three representative architectures differing mainly in channel width for evaluation. The first architecture is VGG [simonyan2014very], whose convolutional layers are plain convolutions of medium channel width. The second architecture is the pre-activation residual network (ResNet) [he2016identity], which is composed of bottleneck blocks [he2016identity], denoted as B-Block. We evaluate model slicing on ResNet of varying depth and width, and denote the adopted architecture as ResNet-L, with L being the number of layers. The third architecture is the Wide Residual Network [zagoruyko2016wide], denoted as ResNet-L-k, with k being the widening factor of the channel width of each layer. Detailed configurations are summarized in Table 3.
To support model slicing, convolutional layers and the batch-norm layers are replaced with counterpart layers supporting model slicing. For both baseline and model slicing trained models, we train 300 epochs on CIFAR-10 with SGD of batch size 128 and initial learning rate 0.1, and 100 epochs on ImageNet-12 with SGD of batch size 128 and learning rate 0.01 with gradual warmup [he2016deep, goyal2017accurate]. The learning rate is divided by 10 at 50% and 75% of the total training epochs for CIFAR-10, and at 30%, 60% and 90% for ImageNet-12. Other training details follow the conventions [he2016identity, zagoruyko2016wide].
5.3.3 Results of Model Slicing on CNNs
Results of representative CNNs on CIFAR and ImageNet datasets are illustrated in Figure 2, Figure 5, and summarized in Table 4. In general, a CNN model trained with model slicing is able to produce prediction with elastic inference cost by dynamically scheduling a corresponding subnet whose accuracy is comparable to or even higher than its conventionally trained counterpart.
We compare the performance of model slicing with more baseline methods on ResNet in Figure 2. We can observe that ResNet-164 trained with model slicing (single model L164) achieves accuracy significantly higher than the ResNet multi-classifiers baseline, which confirms the superiority of model slicing over depth slicing. However, its performance is noticeably worse than the ensemble of ResNets of varying width, especially for lower-budget predictions. This is mainly because the convolutional layers of ResNet-164 on CIFAR are narrow. In particular, the convolutional layers in conv1/conv2 comprise 16 channels (see Table 3), and thus with slice rate 0.375, only 6 channels remain for inference, which leads to limited representational power. With twice the channel width, the single model slicing trained ResNet-56-2 achieves accuracy comparable to the strong ensemble baseline of varying depth/width and the model width compression baseline Network Slimming [liu2017learning], achieves higher accuracy than SkipNet [wang2018skipnet] at corresponding inference budgets, and obtains generally better accuracy-budget trade-offs than MSDNet [huang2017multi]. This demonstrates that model slicing works more effectively for models with wider convolutional layers, e.g., VGG-13, ResNet-56-2 and ResNet-50. For instance, the accuracy of VGG-13-lb-0.375 with slice rate 0.375 is 93.57%, which is 0.72% higher than its individually trained counterpart while taking only around 14.06% of the computation of the full network (7.11x speedup). This is also confirmed on the wider networks VGG-16 and ResNet-50 on the larger ImageNet dataset. Specifically, ResNet-50-lb-0.25 with slice rate 0.25 achieves slightly higher accuracy than the fixed model of the same width while taking only around 6.25% of the computation of the full network (16x speedup).
We can also notice in Figure 5 and Table 4 that the accuracy of CNNs trained conventionally (lower bound r_L = 1.0) decreases drastically as more channel groups are sliced off. This shows that with conventional training, channel groups in the same convolutional layer are highly dependent on the other groups in representation learning, such that slicing off even one channel group may impair the representation. With the group residual representation learning of model slicing, one single network can achieve accuracy comparable to the ensemble of networks of varying width with significantly less memory and computation.
5.4 Simulation of Cascade Ranking
We further simulate a cascade ranking scenario with six stages of classifiers. The CIFAR-10 test dataset is adopted for illustration, which contains ten types of items (classes) and 1000 items (images) per type, and VGG-13 (see Table 3) is adopted as the baseline model. The classifier (model) at each stage is required to categorize each item into a type and then filter out all the items whose predicted category is not consistent with the type predicted in previous stages. Therefore, the cascade ranking pipeline only keeps items classified consistently by all the cascade models. Typically, the pipeline deploys smaller models in early stages to efficiently filter out irrelevant items, and larger but costlier models in subsequent stages for higher retrieval quality. The baseline solution is a cascade of separately trained baseline models of varying width, which is compared with the model slicing solution whose stages are the corresponding sub-models sliced off one baseline model trained with model slicing. The parameter size and computation FLOPs of the models at each stage are provided in Table 5.
| Model Width (r) | 0.375 | 0.500 | 0.625 | 0.750 | 0.875 | 1.000 |
Table 5 summarizes the precision and the aggregate recall of each stage. The results show two advantages of the model slicing solution over the conventional cascade model solution. Firstly, in terms of effectiveness, the model slicing solution retrieves 88.67% of the correct items in total, compared with 86.03% for the conventional solution; the significantly higher aggregate recall is mainly due to the more consistent predictions between classifiers, which we shall discuss and visualize in Section 5.5.3. Secondly, in terms of efficiency, the conventional solution takes 29.3M parameters in total and 3182.8M FLOPs of computation for the retrieval of each item, while the model slicing solution takes only 9.42M parameters in one model, and the computation can be greatly reduced with the computation reuse discussed in Section 3.5.
5.5 Visualization of Model Slicing Training
5.5.1 Residual Learning Effect of Model Slicing
In CNNs trained with model slicing, each convolutional layer is followed by a group normalization layer that stabilizes the scale of the output with a scaling factor γ (see Equation 5). The scaling factor largely represents the importance of the corresponding channel. We therefore visualize the evolution of these scaling factors during model slicing training in Figure 6. Specifically, we take the first convolutional layers of conv3 and conv5 in VGG-13 (see Table 3), which correspond to low-level and high-level feature extractors respectively. We can observe an obvious stratified pattern in Figure 6: the groups of the base network gradually learn the largest scaling factor values, and the average scaling factor values gradually become smaller from the base group towards the last group. This validates our assumption that model slicing training engenders group residual learning, where the base network learns the fundamental representation and the following groups residually build up the representation.
5.5.2 Learning Curves of Model Slicing
Figure 7 illustrates the learning curves of VGG-13 trained with model slicing, compared with the full fixed model. The learning curves of the subnets reveal that the error rate drops faster in larger subnets and that smaller subnets closely follow the larger ones. This demonstrates the knowledge distillation effect, where larger subnets learn faster and gradually transfer the learned knowledge to smaller subnets. We also notice that the final accuracy of subnets with relatively large slice rates approaches that of the full fixed model, which shows that a model trained with model slicing can trade accuracy for efficiency by running inference with a smaller subnet, using less memory and computation at the cost of a minor accuracy decrease.
5.5.3 Prediction Consistency of Model Slicing
We also evaluate the consistency of the prediction results between the subnets of the model trained with model slicing. Typically, the outputs of different conventionally trained models are not the same. However, with model slicing, the model of a larger slice rate incorporates the models of smaller slice rates as part of its representation. Consequently, the subnets sliced off the model are expected to produce similar predictions, and larger subnets should be able to correct wrong predictions of smaller ones. Figure 8 shows the inclusion coefficient of wrongly predicted samples between each pair of models. The inclusion coefficient measures the fraction of the wrongly predicted samples of the larger model that are also wrongly predicted by the smaller model; it essentially measures the ratio of error overlap between the two models. Unsurprisingly, the prediction results of model slicing training are much more consistent than those of separately trained fixed models. Therefore, model slicing may not be ideal for applications such as model ensembles, which typically require diversity, but can be extremely useful for applications requiring consistent predictions such as cascade ranking, where the accumulated error is expected to be minimized.
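One plausible formalization of the inclusion coefficient described above (this is our reading of the metric; the function name and the convention for an error-free larger model are ours):

```python
def inclusion_coefficient(errors_large, errors_small):
    """Fraction of the larger model's wrongly predicted samples that are
    also wrongly predicted by the smaller model. A value of 1.0 means the
    larger model's errors are a subset of the smaller model's errors,
    i.e., the larger model only corrects mistakes and adds none."""
    errors_large, errors_small = set(errors_large), set(errors_small)
    if not errors_large:
        return 1.0  # no errors to overlap
    return len(errors_large & errors_small) / len(errors_large)
```

For example, if the larger subnet errs on samples {1, 6} and the smaller on {1, 2}, the inclusion coefficient is 0.5: half of the larger model's errors are shared.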
6 Conclusions
Relatively few efforts have been devoted to neural networks that dynamically provide predictions within a memory and computational operation budget. In this paper, we propose model slicing, a general training framework supporting elastic inference cost for neural networks. The key idea of model slicing is to impose a structural constraint on the basic components of each layer both during training and inference, and then regulate the width of the network during inference with a single parameter, the slice rate, given the resource budget on a per-input basis. We have provided detailed analysis and discussion of the training details of model slicing and evaluated it through extensive experiments.
Results on NLP and vision tasks show that neural networks trained with model slicing can effectively support on-demand workloads by dynamically slicing a subnet from the trained network. With model slicing, neural networks can achieve a significant reduction of run-time memory and computation with comparable performance, e.g., 16x speedup with slice rate 0.25. Unlike conventional model compression methods, where the computation reduction is limited, the required computation decreases quadratically with the slice rate.
Model slicing also sheds light on the learning process of neural networks. Networks trained with model slicing engender group residual learning in each layer, where components in the base network learn the fundamental representation while the following groups build up the representation residually. Meanwhile, the learning process is reminiscent of knowledge distillation. During training, larger subnets learn faster and gradually transfer the representation to smaller subnets. Finally, model slicing is readily applicable to the model compression scenario by deploying a proper subnet.
Acknowledgments
This research is supported by the National Research Foundation Singapore under its AI Singapore Programme [Award No. AISG-GC-2019-002] and Singapore Ministry of Education Academic Research Fund Tier 3 under MOE's official grant number MOE2017-T3-1-007.