Switchable Precision Neural Networks
Abstract
Instantaneous and on-demand accuracy-efficiency trade-offs have recently been explored in the context of neural network slimming. In this paper, we propose a flexible quantization strategy, termed Switchable Precision neural Networks (SPNets), to train a shared network capable of operating at multiple quantization levels. At runtime, the network can adjust its precision on the fly according to instant memory, latency, power consumption and accuracy demands. For example, by constraining the network weights to 1-bit with switchable precision activations, our shared network spans from BinaryConnect to Binarized Neural Networks, allowing dot-products to be performed using only summations or bit operations. In addition, a self-distillation scheme is proposed to increase the performance of the quantized switches. We tested our approach with three different quantizers and demonstrate the performance of SPNets against independently trained quantized models in classification accuracy for the Tiny ImageNet and ImageNet datasets using ResNet-18 and MobileNet architectures.
1 Introduction
Deep Neural Networks (DNNs) have achieved great success in a wide range of vision tasks, such as image classification [10], semantic segmentation [22] and object detection [29]. However, the large model size and expensive computational complexity remain great obstacles for many applications, especially on constrained devices with limited memory and computing resources. Network quantization is an active field of research focusing on alleviating such issues. In particular, [9, 15, 28] set the foundations for 1-bit quantization, and [16, 50] for arbitrary bitwidth quantization. Progressive quantization [2, 1, 53, 33], loss-aware quantization [13, 49], improved gradient estimators for non-differentiable functions [21] and RL-aided training [20] have focused on improved training schemes, while mixed-precision quantization [36], hardware-aware quantization [38] and architecture search for quantized models [34] have focused on alternatives to standard quantized models. However, these strategies are exclusively focused on improving the performance and efficiency of static networks.
Dynamic routing networks try to provide improvements using an alternative approach. By performing computations conditioned on the inputs, the networks are capable of saving resources by executing just the sufficient amount of operations required to map the input to the desired output. Popular strategies include skipping convolutional layers [39, 6, 41] based on the input data complexity and early classifiers [23].
Our proposed approach falls in the category of dynamic networks, like slimmable neural networks [44], but follows a different principle, aiming to provide on-demand trade-offs rather than input-dependent ones. To the best of our knowledge, dynamic quantization of DNNs has not been explored in the literature. Slimmable networks provide width (number of channels) switches that allow inference to be performed utilizing only sections of the network according to on-device demands and resource constraints. Similar to slimmable networks, we attempt to provide the first approach by developing a network whose weights and activations can be quantized at various precisions at runtime, permitting instant and adaptive accuracy-efficiency trade-offs, which we term switchable precision. In particular, 1-bit DNNs are both an interesting and challenging case for our SPNet. With binary weights, we can train a shared network ranging from BinaryConnect [9] to Binarized Neural Networks [15], where the inner product can be efficiently implemented using summations or bit operations according to on-device resource constraints. Furthermore, our SPNets frequently yield higher accuracy than individually trained quantized networks.
However, the different precision switches are difficult to optimize as a whole, for reasons we summarize in two parts. On one hand, during training, batch normalization (BN) layers use the current batch statistics to normalize the intermediate feature maps, while estimating the global statistics by accumulating a running mean and running variance, which are used as replacements during testing. However, the batch statistics of each precision switch are different. As a result, the discrepancy of feature means and variances across different switches leads to inaccurate accumulated statistics in the BN layers. To solve this problem, we follow [44] and train the switchable precision network using independent BN parameters for each switch, named switchable batch normalization (SBN).
On the other hand, we conjecture that simultaneously optimizing multiple quantized switches will be reflected in the loss manifold by progressively quantizing the loss surface, and that the higher-precision switches will assist the lower-precision ones towards a less noisy and smoother convergence, potentially leading them to a better minimum. Conversely, the network will converge only to minima which perform well at all the bitwidths, potentially harming the performance of some of the switches, particularly the higher-precision ones. During training of SPNets, the gradients of each switch are combined before running an optimizer step. However, there is no explicit mechanism impeding the individual switches from moving in distinct directions. In order to encourage the different switches to move in approximately the same direction, we propose a self-distillation strategy, where the full-precision switch provides a guiding signal for the rest of the switches. Specifically, only the full-precision teacher switch sees the ground-truth, while the low-precision student switches are trained by distilling knowledge from the full-precision teacher.
In order to increase the flexibility of our model, in addition to equipping the network with switchable precision representations, we extend our approach to three different quantizers and equip them with slimming (switchable width) capability.
Our contributions are summarized as follows:

We leverage SBN to train a shared network executable at different bitwidths on weights and activations according to runtime demands.

We propose a selfdistillation scheme to jointly train the fullprecision teacher switch and the lowprecision student switches. By doing so, the fullprecision switch provides a guiding signal to significantly improve the performance of the lowprecision switches.

We investigate the effectiveness of our method on uniform and nonuniform quantization with various quantizers through extensive experiments on image classification.
2 Related Work
Network slimming. Slimmable networks, introduced by [44], developed a procedure for training a DNN with switchable widths. The motivation is to provide instant and on-demand accuracy-efficiency trade-offs. It was further generalized by [43], allowing to efficiently train a continuously slimmable network, executable at any arbitrary width. Moreover, non-uniform slimming was introduced, allowing for layer-wise width selection. The training principle behind network slimming has found applications in the fields of pruning [42], network distillation [47], architecture search [4], adaptive inference [31] and finally, in our work, network quantization.
Network quantization. Quantization-based methods represent the network weights and/or activations with very low precision, thus yielding highly compact DNN models compared to their floating-point counterparts, with considerable memory and computation savings. BNNs [15, 28] propose to constrain both weights and activations to binary values (i.e., −1 and +1), where the multiply-accumulate operations can be replaced by purely XNOR and bitcount operations, which are in general much faster. However, BNNs still suffer from significant accuracy decreases compared with their full-precision counterparts. To narrow this accuracy gap, ternary networks [19, 51] and even higher-bit fixed-point quantization [50, 48] methods have been proposed.
In general, quantization approaches tackle two main problems. On one hand, some works aim at designing more accurate quantizers to minimize information loss. For the uniform quantizer, the works in [7, 17] explicitly parameterize and optimize the upper and/or lower bounds of the activations and weights. To reduce the quantization error, non-uniform approaches [26, 46] have been proposed to better approximate the data distribution. In particular, LQ-Net [46] proposes to jointly optimize the quantizer and the network parameters. On the other hand, because of the non-differentiable quantizer, some literature focuses on relaxing the discrete optimization problem. A typical approach is to train with regularization [13, 49, 2, 1, 33, 8], where the optimization problem becomes continuous while gradually adjusting the data distribution towards quantized values. Apart from these two challenges, with the popularization of neural architecture search (NAS), Wang et al. [37] further propose to employ reinforcement learning to automatically determine the bitwidth of each layer without human heuristics.
Knowledge distillation. Knowledge distillation (KD) is a general approach for model compression, where a powerful wide/deep teacher distills knowledge to a narrow/shallow student to improve its performance [12, 30]. In terms of the definition of the knowledge to be distilled from the teacher, existing models typically use the teacher's class probabilities [12] and/or intermediate features [30, 45]. KD has been widely used in many computer vision tasks [40, 11]. Moreover, some works [53, 24, 27] study the combination of KD and quantization, where the full-precision model provides hints to guide the low-precision model training, significantly improving the performance of the low-precision networks. Different from the previous literature, we propose a self-distillation strategy to improve our SPNet training. Specifically, only the full-precision switch sees the ground-truth, while the low-precision switches learn by distilling from the full-precision teacher.
3 Preliminaries
The optimization problem of traditional Quantized Neural Networks (QNNs) aims at minimizing an objective function given a set of trainable weights w, which take values from a predefined discrete set C_w, typically referred to as a codebook. A common QNN training procedure involves storing a real-valued version of w. During inference, the real-valued w is quantized using a predetermined pointwise quantization function Q_w. The weights are updated during the optimization process by estimating the gradients w.r.t. the real-valued copy. Additionally, the internal network activations a can be optionally quantized with their own codebook C_a by a quantization function Q_a. Given independent bitwidths b_w and b_a for the network weights and activations, |C_w| = 2^{b_w} and |C_a| = 2^{b_a}. Finally, a quantized convolutional layer with n filters is computed as follows:

z_i = Q_a(a) * Q_w(w_i),  i = 1, ..., n    (1)

where w_i is the i-th convolutional filter. c, h and w denote the number of input channels, height and width of the filters, respectively. a and z_i denote the input activations and output pre-activations of the i-th filter, where h_in, w_in, h_out, w_out represent the height and width of the input and output feature maps, respectively.
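The quantized convolution of Eq. (1) can be sketched in a few lines of PyTorch. Here `quantized_conv2d` and the 1-bit quantizers passed to it are illustrative stand-ins of our own naming, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def quantized_conv2d(a, w, quant_w, quant_a):
    """Quantized convolution of Eq. (1): z_i = Q_a(a) * Q_w(w_i).

    a: input activations (N, c, h_in, w_in)
    w: real-valued filters (n, c, h, w)
    quant_w / quant_a: pointwise quantization functions Q_w / Q_a.
    """
    return F.conv2d(quant_a(a), quant_w(w))

# Illustrative 1-bit quantizers: sign for weights, 0/1 threshold for activations.
a = torch.randn(1, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
z = quantized_conv2d(a, w, torch.sign, lambda x: (x > 0).float())
print(z.shape)  # torch.Size([1, 4, 6, 6])
```

With both operands quantized, the inner products inside `F.conv2d` could in principle be replaced by summations or bit operations on suitable hardware, which is precisely the efficiency argument made above.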
In the context of QNNs, multiple quantizers and gradient estimators have been proposed in the literature [3, 21, 50, 5, 25]. For our SPNet, we stick to common ones, which will be described starting in Sec. 3.1. Three nonlinear quantizers for the activations are implemented. For the weights, the tanh-based quantizer is used, unless specified otherwise. The motivation to use each quantizer is explained in the corresponding subsection.
3.1 Straight-Through Estimator (STE) and Base Quantizer
The STE proposed in [3] allows gradients to be estimated through the non-differentiable functions Q_w and Q_a, making the employed b-bit quantizers compatible with the backpropagation algorithm. The STE operates by passing the output gradients unaltered to the input; it is equivalent to the derivative of an identity mapping on the inputs. Commonly, the gradients for inputs outside of the quantizer's input range are suppressed, and the derivative of the quantization function becomes equivalent to the derivative of the clip function.
The b-bit quantizers described in the next sections share the same base quantization function q_b, which maps x ∈ [0, 1] to one of 2^b uniformly spaced values:

q_b(x) = round((2^b − 1) · x) / (2^b − 1)    (2)

During backpropagation, we use the STE:

∂ℓ/∂x = ∂ℓ/∂q_b(x)    (3)
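A minimal PyTorch sketch of the base quantizer q_b with the STE backward pass of Eq. (3) might look as follows (the class name `BaseQuantize` is ours, not from the paper):

```python
import torch

class BaseQuantize(torch.autograd.Function):
    """b-bit base quantizer of Eq. (2) with the STE backward of Eq. (3).

    Maps x in [0, 1] to the nearest of 2^b uniformly spaced levels.
    """

    @staticmethod
    def forward(ctx, x, bits):
        scale = 2 ** bits - 1
        return torch.round(x * scale) / scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass the output gradient unaltered to the input;
        # `bits` is not a tensor, so its gradient slot is None.
        return grad_output, None

def q_b(x, bits):
    return BaseQuantize.apply(x, bits)

x = torch.tensor([0.0, 0.3, 0.5, 0.9], requires_grad=True)
y = q_b(x, 2)          # 2-bit: levels {0, 1/3, 2/3, 1}
y.sum().backward()
print(x.grad)          # tensor([1., 1., 1., 1.])
```

Note that this sketch leaves gradient suppression outside the input range to the surrounding clip/tanh projection, as in the description above.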
3.1.1 Tanh-Based Quantizer
[50, 5] proposed to use different quantizers for weights and activations. Weight quantizers approximate the hyperbolic tangent function (tanh), constraining the weights to [−1, 1]. However, tanh is associated with the vanishing gradient problem; thus, it is attractive for activation quantizers to approximate the popular ReLU activation function as described in the next section, constraining the activations to the positive range.
The tanh function is first used to project the input to the range [−1, 1] in order to reduce the impact of large values. The quantizer is defined as follows:

q_w(x) = 2 · q_b( tanh(x) / (2 · max|tanh(x)|) + 1/2 ) − 1    (4)

where q_b, hereafter, is defined in Eq. (2), and the maximum is taken over the full weight tensor.
It is appealing to train a network with activations switchable down to 1-bit with permissible values {−1, +1}, since bit operations can be performed given that the weights are 1-bit as well. Therefore, when training these types of networks, the tanh quantizer is used on both weights and activations to allow the activations of the intermediate switches to lie in the range [−1, 1]. For this same reason, the layers were reordered as described in [28] (the typical layer ordering is QuantConv-BN-ReLU-QuantConv, while the reordered layers are QuantConv-ReLU-BN-QuantConv).
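The tanh-based weight quantizer can be sketched as below. The exact formulation is an assumption on our part (a DoReFa-style form consistent with the description of Eq. (4)): project with tanh, rescale into [0, 1], apply the base quantizer, and map back to [−1, 1]:

```python
import torch

def q_b(x, bits):
    # Base quantizer of Eq. (2); the STE backward is omitted here for brevity.
    scale = 2 ** bits - 1
    return torch.round(x * scale) / scale

def tanh_weight_quantizer(w, bits):
    """Tanh-based quantizer sketch: tanh projection into [-1, 1],
    rescale into [0, 1], base-quantize, then map back to [-1, 1]."""
    t = torch.tanh(w)
    t = t / (2 * t.abs().max()) + 0.5   # now in [0, 1]
    return 2 * q_b(t, bits) - 1         # codebook inside [-1, 1]

w = torch.randn(64)
wq = tanh_weight_quantizer(w, 1)        # 1-bit: values in {-1, +1}
print(sorted(wq.unique().tolist()))
```

At 1 bit this collapses to a (scaled) sign function, which is what makes the tanh quantizer compatible with the binary switch described above.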
3.1.2 ReLU-Based Quantizer
As mentioned in the previous section, [50, 5] employ a ReLU approximation quantizer for the activations with no layer reordering. The method proposed by [50] will be employed here, which consists of simply clipping the activations followed by the base quantizer:

q_a(x) = q_b( clip(x, 0, 1) )    (5)

In particular, [5] proposed both uniform and non-uniform spacing between the codebook elements. They first clipped the activations to the range [0, α], where α is some predetermined value, and the codebook for both cases, uniform and non-uniform, is obtained from the network's internal statistics. In our tests with ReLU-based quantization, we simply constrain the activations to the range [0, 1] with uniform quantization. In the logarithmic quantizer described in the next section, non-uniform quantization is used.
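Eq. (5) translates almost directly into code; a minimal sketch (again omitting the STE backward for brevity):

```python
import torch

def q_b(x, bits):
    # Base quantizer of Eq. (2).
    scale = 2 ** bits - 1
    return torch.round(x * scale) / scale

def relu_activation_quantizer(a, bits):
    """ReLU-based activation quantizer of Eq. (5):
    clip to [0, 1], then apply the base quantizer."""
    return q_b(torch.clamp(a, 0.0, 1.0), bits)

a = torch.tensor([-0.7, 0.2, 0.6, 1.4])
print(relu_activation_quantizer(a, 2))  # tensor([0.0000, 0.3333, 0.6667, 1.0000])
```

The clamp both implements the ReLU-like non-negativity and defines the range outside of which the STE would suppress gradients.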
3.1.3 Logarithmic Quantizer
In full-precision neural networks, the weights and activations have non-uniform distributions [25]. Taking advantage of that fact, the authors of [25] used logarithmic representations, for both weights and activations, achieving higher classification accuracy at the same resolution than uniform quantization schemes, at the expense of higher implementation and computation complexity. In the original paper, their quantization layer contains a Full Scale Range (FSR) parameter, whose values are reported in the paper, with each layer having its own FSR. In our SPNet, we use a variation of their layer in order to avoid the FSR parameter. Our modified quantizer is defined as follows:
(6) 
where
(7) 
(8) 
(9) 
(10) 
Similarly to Sec. 3.1.1, two-sided logarithmic quantization (positive and negative) can be used in order to have activations switchable down to 1-bit. Depending on the choice of activation quantization (one-sided or two-sided), layer reordering should be taken into account. In our experiments, logarithm base 2 was used; however, a different base could provide higher accuracy.
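To make the idea concrete, the following is an illustrative base-2 logarithmic quantizer in the spirit of [25]. It is NOT the paper's FSR-free modification of Eqs. (6)-(10), whose exact form differs; it simply snaps non-zero magnitudes to nearby powers of two while preserving sign (two-sided):

```python
import torch

def log_quantize(x, bits):
    """Illustrative two-sided base-2 logarithmic quantizer (a sketch,
    not the paper's Eqs. (6)-(10)). Non-zero inputs are snapped to the
    nearest power of two within 2^bits exponent levels ending at 2^0."""
    sign = torch.sign(x)
    mag = x.abs().clamp(min=1e-8)            # avoid log2(0)
    exp = torch.round(torch.log2(mag))
    # Restrict exponents to 2^bits levels: {-(2^bits - 1), ..., 0}.
    exp = torch.clamp(exp, min=-(2 ** bits - 1), max=0)
    out = sign * torch.pow(2.0, exp)
    return torch.where(x == 0, torch.zeros_like(x), out)

x = torch.tensor([0.0, 0.3, -0.6, 0.05])
print(log_quantize(x, 2))  # tensor([ 0.0000,  0.2500, -0.5000,  0.1250])
```

Because the levels are powers of two, multiplications against such values reduce to bit-shifts, which is the source of the non-uniform quantizer's efficiency appeal.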
4 Switchable Precision Neural Networks
SPNets generalize QNNs, where the learnable weights are optimized for multiple codebooks C_w and C_a of variable cardinality. The permissible values of the codebooks are determined by the choice of quantizers Q_w and Q_a, while their cardinality is determined by the bitwidths b_w and b_a.
DNNs at their current state are not naturally precision-switchable, as empirically demonstrated in Sec. 6.3. Therefore, we appeal to SBN, a technique used to train slimmable neural networks, described in detail in Sec. 4.1. Then, in Sec. 4.2, we extend our SPNet to a slimmable SPNet with mixed width and precision switches.
4.1 Switchable Batch Normalization (SBN)
DNNs at their current state are not naturally slimmable [43] nor precision-switchable due to the inconsistent behavior of batch normalization layers during training and inference. During training, BN layers utilize the current batch mean and variance to normalize the intermediate feature maps while accumulating a running mean μ and running variance σ², which are used as replacements during testing.
In a naive SPNet, the mini-batch local statistics of each quantization switch are different; however, all the switches contribute to the same accumulated global mean μ and variance σ², enabling the network to train properly, but resulting in inaccurate inference at test time.
SBN layers equip regular BN layers with a private μ and σ² for each of the switches. The overhead of the additional parameters is negligible, since BN parameters account for an insignificant portion of the total amount. It is also negligible in terms of runtime complexity, since no additional operations are involved.
Additionally, BN layers count with two learnable parameters γ and β, which provide the layer with the capacity of performing an affine mapping. Unlike μ and σ², γ and β can be updated by all the switches. Providing private versions of them is not as crucial, since shared ones still allow the network to learn, but it yields an additional accuracy boost [43]. Furthermore, private γ and β generate no additional overhead, since they can be merged with μ and σ² after training.
In Algorithm 1, we illustrate the use of SBN in one SPNet training iteration.
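A minimal sketch of SBN and of one SPNet training iteration is given below. The algorithm itself is not reproduced in this text, so the training loop is our assumption of its structure (gradients of all switches accumulated before a single optimizer step); the helpers `set_active_switch` and `quantize_to` are hypothetical names:

```python
import torch
import torch.nn as nn

class SwitchableBatchNorm2d(nn.Module):
    """BN with private running statistics (and, in this sketch, private
    affine parameters too) for each precision switch."""

    def __init__(self, num_features, num_switches):
        super().__init__()
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(num_features) for _ in range(num_switches))
        self.active = 0  # index of the currently selected switch

    def forward(self, x):
        return self.bns[self.active](x)

def set_active_switch(model, idx):
    for m in model.modules():
        if isinstance(m, SwitchableBatchNorm2d):
            m.active = idx

def train_step(model, bitwidths, x, y, criterion, optimizer, quantize_to):
    """One SPNet iteration: every switch contributes gradients, then a
    single optimizer step is taken. `quantize_to` is assumed to
    reconfigure the model's quantizers to the given bitwidth."""
    optimizer.zero_grad()
    for i, bits in enumerate(bitwidths):
        set_active_switch(model, i)
        quantize_to(model, bits)
        loss = criterion(model(x), y)
        loss.backward()  # gradients of all switches accumulate
    optimizer.step()

# Tiny demo model: conv -> SBN -> pooled linear classifier.
model = nn.Sequential(nn.Conv2d(3, 4, 3), SwitchableBatchNorm2d(4, 2),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(4, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_step(model, [32, 1], torch.randn(8, 3, 16, 16),
           torch.randint(0, 10, (8,)), nn.CrossEntropyLoss(), optimizer,
           lambda m, bits: None)  # quantization stubbed out in this demo
```

After the step, each switch's private BN has accumulated its own running statistics, which is exactly the discrepancy SBN is designed to keep separated.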
4.2 Slimmable SPNet
Network slimming [44] relies on SBN in order to allow each layer of the network to operate at different widths. Given the current width multiplier r, a slimmable convolutional layer with n filters computes only the first ⌈r · n⌉ of them.
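The width-switching rule can be sketched as a thin wrapper around a standard convolution; `SlimmableConv2d` is our illustrative name, and a full slimmable network would also slim the input channels of subsequent layers, which this sketch omits:

```python
import math
import torch
import torch.nn.functional as F

class SlimmableConv2d(torch.nn.Conv2d):
    """Conv layer that, given a width multiplier r, applies only the
    first ceil(r * n) of its n output filters (a minimal sketch)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.width_mult = 1.0

    def forward(self, x):
        n = math.ceil(self.width_mult * self.out_channels)
        bias = self.bias[:n] if self.bias is not None else None
        return F.conv2d(x, self.weight[:n], bias,
                        self.stride, self.padding)

conv = SlimmableConv2d(3, 8, 3)
conv.width_mult = 0.25
out = conv(torch.randn(1, 3, 8, 8))
print(out.shape)  # torch.Size([1, 2, 6, 6])
```

Because the narrow switch reuses a prefix of the full weight tensor, no extra parameters are introduced, mirroring the shared-weight principle of the precision switches.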
Network slimming and quantization are complementary techniques and work together effortlessly without technical complications. Therefore, we can train a single shared network with switchable width and precision to increase the flexibility. A single slimmable SP network can be trained in the cloud, and particular switches can be distributed to different deployment systems based on their specific hardware capabilities, where they can be further slimmed and quantized instantaneously on demand. In Sec. 6.2, we provide a comparison of a slimmable SPNet with the corresponding individually trained switches.
Although the benefits of an SPNet in terms of power and speed are evident when performing dot-product operations on quantized vectors, the benefits in memory footprint may not be so apparent, given that the full-precision weights must be stored in non-volatile memory at all times, regardless of the active quantization switch. However, during operation, a clone of the network is stored in the RAM of the processor in order to provide quick access; therefore, weight quantization only takes place once, every time a new switch is requested, and the quantized clone is kept in volatile memory. By quantizing activations, additional RAM savings can be obtained.
5 SelfDistillation
Knowledge distillation is a common strategy used to provide a stronger training signal, typically from a large network to a smaller one in a teacher-student scheme. In a similar fashion to knowledge distillation, by simultaneously optimizing multiple quantized switches, the high-precision switches implicitly provide guidance to the noisier low-precision updates. However, this behavior is not explicitly encouraged. Therefore, in this section we present a complementary distillation mechanism, denominated self-distillation, where only the full-precision switch sees the ground-truth, while the low-bitwidth switches try to match the internal representations as well as the output distribution of the full-precision switch. The formulated strategy allows the full-precision switch to guide the optimization process with a significant amount of information flow across all the switches.
In US-Net [43], gradients are prevented from flowing from the subnetworks to the largest-width switch. Formally, to mimic the outputs, similarly to US-Net, we use the Kullback-Leibler divergence as the distance measure between the output distributions p_f and p_q of the full-precision switch and the quantized active switch. Let SG(·) denote the stop-gradient function; the output mimic loss is:

L_KL = KL( SG(p_f) ‖ p_q )    (11)
Additionally, to create the guidance signal, [53] proposes a hint-based training strategy comparing the intermediate feature maps of the full-precision teacher and the low-precision student. Similarly, let z_f and z_q denote the internal feature maps (i.e., pre-activations) of the full-precision switch and the quantized active switch; the internal representations guidance loss becomes:

L_G = Σ_l ‖ z_f^(l) − z_q^(l) ‖²    (12)

In this case, we do not stop the gradients flowing to z_f, and z_q is quantized.
Finally, the loss for the quantized switches is their combination, while for the full-precision switch it is simply the regular cross-entropy classification loss:

L_q = λ_1 · L_KL + λ_2 · L_G    (13)

L_fp = CE(p_f, y)    (14)

where y denotes the ground-truth labels, and λ_1 and λ_2 are the scaling coefficients used to balance L_KL and L_G, respectively.
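The two distillation losses can be sketched in PyTorch as follows. The function names and the default λ values are illustrative choices of ours, not values from the paper:

```python
import torch
import torch.nn.functional as F

def output_mimic_loss(logits_q, logits_fp):
    """Eq. (11): KL divergence from the full-precision teacher's output
    distribution (gradients stopped via .detach()) to the quantized
    active switch's output distribution."""
    p_fp = F.softmax(logits_fp.detach(), dim=1)      # SG(p_f)
    log_p_q = F.log_softmax(logits_q, dim=1)
    return F.kl_div(log_p_q, p_fp, reduction="batchmean")

def guidance_loss(feats_q, feats_fp):
    """Eq. (12), sketched: squared L2 distance between internal feature
    maps, summed over layers. Gradients also flow to the teacher here."""
    return sum(F.mse_loss(zq, zf, reduction="sum")
               for zq, zf in zip(feats_q, feats_fp))

def quantized_switch_loss(logits_q, logits_fp, feats_q, feats_fp,
                          lambda1=1.0, lambda2=1e-4):
    """Eq. (13): total loss of a quantized switch; lambda1/lambda2
    (illustrative values) balance the two terms."""
    return (lambda1 * output_mimic_loss(logits_q, logits_fp)
            + lambda2 * guidance_loss(feats_q, feats_fp))
```

The small default for `lambda2` reflects the observation below that the feature-map term is otherwise several orders of magnitude larger than the KL term.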
6 Experiments
In this section, we test two different architectures, ResNet [10] and MobileNet [14], for the task of image classification on the ImageNet (ILSVRC2012) [32] and Tiny ImageNet datasets with three different quantizers: tanh-based quantization [50], ReLU-based quantization [50, 5] and non-uniform logarithmic quantization [25]. Additional experiments on ImageNet have been included in the supplementary material.
We implement our SPNets using PyTorch. For our ImageNet experiments, we start from a regular (i.e., single-switch) full-precision pretrained model; we then replicate its BN parameters across all the SBN switches and fine-tune the SPNet model. We use the standard preprocessing and augmentation as reported by [10]. When training from the pretrained full-precision model, we use the Adam [18] optimizer and decrease the learning rate by a factor of 10 at epochs 15 and 20, with a total of 25 epochs. For our Tiny ImageNet experiments, all the networks are learned from scratch using the SGD optimizer with an initial learning rate of 0.1, decreased by a factor of 10 every 30 epochs, for 100 epochs in total. As is common practice, the first and last layers are not quantized [28].
Implementation Note. Although the activations of the full-precision switch are not quantized, in some cases they must still be clipped to the same range as those of the quantized switches, since leaving them unclipped can lead the switch to diverge.
6.1 Evaluation on ImageNet
Evaluation on ResNet-18. In Table 1, we compare the Top-1 accuracy for the edge bitwidth cases in both weights and activations for a ResNet-18 architecture. Our SPNet achieves accuracy very close to the independently trained QNNs, with the full-precision switch being the most affected, while providing increased efficiency-accuracy flexibility.
Weight  Activation  SPNet  Independent 

2  2  62.4  62.6 
2  32  65.3  65.8 
32  2  65.1  65.3 
32  32  68.1  69.5 
Evaluation on MobileNet. The MobileNet architecture makes use of depthwise separable and pointwise convolutions without residual connections, drastically reducing the number of parameters without a major decrease in accuracy. However, it is very sensitive to quantization, especially on the activations. [35] redesigns the MobileNet architecture to be more quantization-friendly; for our MobileNet SPNet, we follow their architecture. The redesign involves simple layer replacements and reordering. In Table 2, we report the results for different bitwidths of weights and activations using our SPNet MobileNet and independent networks with the modified architecture. As can be observed, the impact of switchable precision on MobileNet is higher than on ResNet-18.
Weight  Activation  SPNet  Independent 

8  8  59.5  63.5 
8  32  66.0  69.0 
32  8  59.6  63.6 
32  32  66.1  69.7 
6.2 Evaluation on Tiny ImageNet
The Tiny ImageNet dataset is a downsampled ImageNet version (64×64 resolution) with 100K images for training and 10K for validation, spread across 200 classes. For the experiments on Tiny ImageNet, we used the same ResNet-18 architecture as that for ImageNet, except for the first layer, whose filter size and stride are reduced to suit the smaller input resolution.
Weight  Activation  SPNet  Independent 

1  1  43.5  40.0 
1  3  48.8  46.3 
1  8  48.8  47.0 
1  32  49.0  49.1 
Comparison of Different Quantizers.
Width  Bitwidth  Quantizer Accuracy (%)  
Weight  Activation  ReLU-based  Tanh-based  Logarithmic  Logarithmic 2  
0.25  2  2  32.5  29.9  29.1  28.9 
0.25  2  4  34.5  32.7  34.0  33.9 
0.25  2  8  34.8  32.8  34.1  34.1 
0.25  2  32  34.3  32.8  34.4  35.5 
0.25  4  2  34.1  33.7  33.3  32.5 
0.25  4  4  37.4  36.8  37.3  36.7 
0.25  4  8  37.4  36.6  37.5  37.0 
0.25  4  32  37.5  36.6  37.6  37.7 
0.25  8  2  34.4  33.0  33.9  32.6 
0.25  8  4  37.3  36.5  37.2  36.4 
0.25  8  8  37.3  36.9  37.4  36.9 
0.25  8  32  37.5  37.0  37.6  37.7 
0.25  32  2  36.3  33.6  33.5  32.8 
0.25  32  4  37.3  36.5  37.2  36.8 
0.25  32  8  37.2  37.0  37.4  36.9 
0.25  32  32  37.7  37.0  37.5  37.5 
As explained in Sec. 3.1, different quantizers can be used in different scenarios. The ReLU-based quantizer is used for activations in general. When it is desirable to have activations quantizable down to 1-bit, the tanh-based quantizer is used. The logarithmic quantizer is chosen when higher accuracy is desired at the same bitwidth, at the expense of higher complexity.
In Table 4, four quantizer configurations were tested for SPNets with multiple weight and activation switches. In our results, the tanh-based configuration obtained the lowest accuracy across all the switches, and the ReLU-based configuration frequently obtained the best results. Logarithmic configurations occasionally performed better, particularly at higher activation bitwidths. Logarithmic configurations were expected to produce the best results; however, our FSR-free modified version and choice of logarithm base may have harmed their performance. All the networks were slimmed by a factor of 0.25 to magnify the effect of the quantizers.
1-bit SPNet. We trained a network with a single bit per weight and switchable precision activations. Our network spans from the popular BinaryConnect, where dot products are computed using only summations, to Binarized Neural Networks (BNNs), where activations are quantized to 1-bit and dot products are computed using XNOR and bitcount operations.
Training a network with precision switchable down to 1-bit per activation is particularly challenging, since it causes a significant decrease in performance across all the bitwidths when using ReLU-based quantization. Therefore, networks with activations switchable down to 1-bit are trained with tanh-based quantization with layer reordering. Weights are trained by simply using the sign function with the STE.
The results in Table 3 show that our 1-bit SPNet surpasses the independently trained counterparts by a large margin when the activations are at extremely low precision.
Slimmable SPNet. Our results for the slimmable SPNet in Table 5 demonstrate the flexibility of our network; however, they show no clear accuracy advantage over independently trained networks. We tested widths of 0.25, 0.5 and 1.0, with the slimmable SPNet and the independently trained networks alternately outperforming each other. However, it is worth noting that our slimmable SPNet can simultaneously switch width and bitwidth on demand.
Width  Weight  Activation  Slim/SPNet  Independent 

0.25  2  2  30.3  26.5 
0.25  2  32  36.5  34.2 
0.25  32  2  35.6  32.8 
0.25  32  32  39.7  37.8 
0.5  2  2  40.9  41.3 
0.5  2  32  44.8  49.7 
0.5  32  2  46.1  45.8 
0.5  32  32  49.0  52.8 
1.0  2  2  47.8  47.4 
1.0  2  32  50.5  54.4 
1.0  32  2  51.8  50.7 
1.0  32  32  53.0  56.4 
Self-Distillation. Our self-distillation method of Sec. 5 proved to be very effective, improving the accuracy across all the switches, as can be observed in Table 6. Additionally, it can be observed that the validation performance of the intermediate switches is superior to that of the full-precision one; however, we credit this to overfitting, since the training accuracy is higher for the full-precision switch. We also compared the performance of the distillation losses, confirming that our feature maps distillation loss is indeed benefiting the overall performance of the network. The value of L_G is expected to be several orders of magnitude larger than that of L_KL, since it is dictated by the number of layers and the size of the intermediate feature maps. We set the hyperparameters λ_1 and λ_2 to bring L_G to the same order of magnitude as L_KL. In Figure 2, we plot the accuracy curves during training of our SPNet with and without self-distillation.
Weight  Activation  Regular  SD (L_KL)  SD (L_KL + L_G) 

2  2  50.1  53.0  53.3 
2  3  50.5  53.8  53.9 
2  32  51.2  54.2  53.8 
3  2  50.8  53.4  53.9 
3  3  51.5  54.0  54.4 
3  32  52.3  54.1  54.5 
32  2  51.4  52.6  53.3 
32  3  52.1  53.4  53.9 
32  32  52.9  52.9  52.8 
6.3 Ablation Studies
Non-Switchable Batch Normalization. We replaced our SBN layers with standard BN ones to verify and demonstrate that privatizing the BN layers for each switch is essential to successfully perform inference. Figure 3 shows the validation plots for an SPNet with 4 switches, with and without SBN. The switches of the network with SBN learn at different rates, with the high-precision ones learning faster and achieving higher accuracy than the low-precision ones, while the accuracy of the switches without SBN quickly stagnates and does not recover.
Impact of Multiple Switches. We investigated the impact of increasing the number of switches. In Table 7, we present the results of slimmed SPNets with 4 and 16 switches. The first remark is that, with an increasing number of switches, the initial learning rate should be lowered; for the network with 16 switches, we lowered the initial learning rate accordingly. The accuracy obtained with fewer switches is higher, which we conjecture is due to the additional constraints imposed by more switches; nevertheless, it indicates an interesting direction of research.
  4 switches  16 switches 
Weights bitwidth  2  4  8  32  2  4  8  32 
Activations bitwidth  2  41.3  -  -  43.8  37.7  41.3  41.4  41.4 
4  -  -  -  -  38.6  42.3  42.9  42.4 
8  -  -  -  -  38.4  42.5  42.7  42.5 
32  43.3  -  -  45.2  39.6  42.7  43.2  43.1 
Switchable Precision in Weights vs Activations. [52] found that QNN activations are more sensitive to quantization than weights. Our observations on an SPNet MobileNet confirm their findings. However, for our SPNet ResNet-18, we observe that weights and activations are about equally sensitive. For example, in Table 4, freezing the weights to 2-bit and varying the activation bitwidths over {2, 4, 8, 32} produces accuracy gaps comparable to those obtained by freezing the activations to 2-bit and varying the weight bitwidths over the same set, for both the ReLU-based and logarithmic quantizers.
Additionally, in Figure 3, we can observe that the switch “W:2 A:32” learns slower than “W:32 A:2”, but towards the end of training, it catches up.
7 Conclusion and Future Work
In this paper, we have proposed a DNN capable of operating at variable precision on demand. With this approach, we grant devices and end-users real-time control over the performance of the DNN powering their inference algorithms. Our approach is lightweight and does not require altering the model, making it compatible with connectivity-free and limited-memory devices. An additional virtue of our proposed network lies in the ability to train a single network that can be distributed to different devices based on their capabilities. We have demonstrated the flexibility of our method with multiple quantization functions and a complementary slimming strategy. Moreover, we have proposed a training procedure to increase the accuracy of our network across the available precisions. Finally, we performed multiple ablation studies to analyse the performance of our approach in different scenarios.
Following the spirit of Universally Slimmable networks, a continuously quantizable SPNet would be an interesting research direction, as would a mixed-precision SPNet, where each layer is quantized independently on demand. The latter could be used to find the optimal layer-wise bitwidth for weights and activations.
8 Acknowledgements
This work was supported by the Australian Research Council through the ARC Centre of Excellence for Robotic Vision (project number CE1401000016).
References
 (2019) Proximal mean-field for neural network quantization. In Proc. IEEE Int. Conf. Comp. Vis., pp. 4871–4880.
 (2019) ProxQuant: quantized neural networks via proximal operators. In Proc. Int. Conf. Learn. Repren.
 (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
 (2019) Once-for-all: train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791.
 (2017) Deep learning with low precision by half-wave Gaussian quantization. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5918–5926.
 (2017) Skip RNN: learning to skip state updates in recurrent neural networks. arXiv preprint arXiv:1708.06834.
 (2018) PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085.
 (2018) Learning low precision deep neural networks through regularization. arXiv preprint arXiv:1809.00095.
 (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Proc. Adv. Neural Inf. Process. Syst., pp. 3123–3131.
 (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 770–778.
 (2019) Knowledge adaptation for efficient semantic segmentation. arXiv preprint arXiv:1903.04688.
 (2014) Distilling the knowledge in a neural network. In Proc. Adv. Neural Inf. Process. Syst. Workshops.
 (2018) Loss-aware weight quantization of deep networks. In Proc. Int. Conf. Learn. Repren.
 (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
 (2016) Binarized neural networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 4107–4115.
 (2016) Quantized neural networks: training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061.
 (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4350–4359.
 (2015) Adam: a method for stochastic optimization. In Proc. Int. Conf. Learn. Repren.
 (2016) Ternary weight networks. arXiv preprint arXiv:1605.04711.
 (2019) Training compact neural networks via auxiliary overparameterization. arXiv preprint arXiv:1909.02214.
 (2018) Bi-Real Net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In Proc. Eur. Conf. Comp. Vis.
 (2015) Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3431–3440.
 (2017) Deciding how to decide: dynamic routing in artificial neural networks. In Proc. Int. Conf. Mach. Learn., pp. 2363–2372.
 (2018) Apprentice: using knowledge distillation techniques to improve low-precision network accuracy. In Proc. Int. Conf. Learn. Repren.
 (2016) Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025.
 (2017) Weighted-entropy-based quantization for deep neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5456–5464.
 (2018) Model compression via distillation and quantization. In Proc. Int. Conf. Learn. Repren.
 (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proc. Eur. Conf. Comp. Vis., pp. 525–542.
 (2016) You Only Look Once: unified, real-time object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 779–788.
 (2015) FitNets: hints for thin deep nets. In Proc. Int. Conf. Learn. Repren.
 (2019) Adaptive inference cost with convolutional neural mixture models. In Proc. IEEE Int. Conf. Comp. Vis., pp. 1872–1881.
 (2015) ImageNet large scale visual recognition challenge. Int. J. Comp. Vis. 115 (3), pp. 211–252.
 (2018) True gradient-based training of deep binary activated neural networks via continuous binarization. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 2346–2350.
 (2019) Searching for accurate binary neural architectures. In Proc. IEEE Int. Conf. Comp. Vis. Workshops.
 (2018) A quantization-friendly separable convolution for MobileNets. In Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), pp. 14–18.
 (2018) CLIP-Q: deep network compression learning by in-parallel pruning-quantization. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 7873–7882.
 (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 8612–8620.
 (2019) HAQ: hardware-aware automated quantization. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
 (2018) SkipNet: learning dynamic routing in convolutional networks. In Proc. Eur. Conf. Comp. Vis., pp. 409–424.
 (2018) Quantization Mimic: towards very tiny CNN for object detection. In Proc. Eur. Conf. Comp. Vis., pp. 267–283.
 (2018) BlockDrop: dynamic inference paths in residual networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 8817–8826.
 (2019) Network slimming by slimmable networks: towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728.
 (2019) Universally slimmable networks and improved training techniques. arXiv preprint arXiv:1903.05134.
 (2018) Slimmable neural networks. arXiv preprint arXiv:1812.08928.
 (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In Proc. Int. Conf. Learn. Repren.
 (2018) LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In Proc. Eur. Conf. Comp. Vis.
 (2019) Be your own teacher: improve the performance of convolutional neural networks via self distillation. arXiv preprint arXiv:1905.08094.
 (2017) Incremental network quantization: towards lossless CNNs with low-precision weights. In Proc. Int. Conf. Learn. Repren.
 (2018) Explicit loss-error-aware quantization for low-bit deep neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 9426–9435.
 (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
 (2017) Trained ternary quantization. In Proc. Int. Conf. Learn. Repren.
 (2019) Binary ensemble neural network: more bits per network or more networks per bit? In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4923–4932.
 (2018) Towards effective low-bitwidth convolutional neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.