Switchable Precision Neural Networks

Abstract

Instantaneous and on-demand accuracy-efficiency trade-offs have recently been explored in the context of neural network slimming. In this paper, we propose a flexible quantization strategy, termed Switchable Precision neural Networks (SP-Nets), to train a shared network capable of operating at multiple quantization levels. At runtime, the network can adjust its precision on the fly according to instant memory, latency, power consumption and accuracy demands. For example, by constraining the network weights to 1-bit with switchable precision activations, our shared network spans from BinaryConnect to Binarized Neural Networks, allowing dot-products to be performed using only summations or bit operations. In addition, a self-distillation scheme is proposed to increase the performance of the quantized switches. We test our approach with three different quantizers and demonstrate the performance of SP-Nets against independently trained quantized models in terms of classification accuracy on the Tiny ImageNet and ImageNet datasets using ResNet-18 and MobileNet architectures.


1 Introduction

Deep Neural Networks (DNNs) have achieved great success in a wide range of vision tasks, such as image classification [10], semantic segmentation [22] and object detection [29]. However, their large model size and expensive computational complexity remain great obstacles for many applications, especially on constrained devices with limited memory and computing resources. Network quantization is an active field of research focusing on alleviating such issues. In particular, [9, 15, 28] set the foundations for 1-bit quantization, while [16, 50] did so for arbitrary bitwidth quantization. Progressive quantization [2, 1, 53, 33], loss-aware quantization [13, 49], improved gradient estimators for non-differentiable functions [21] and RL-aided training [20] have focused on improved training schemes, while mixed precision quantization [36], hardware-aware quantization [38] and architecture search for quantized models [34] have focused on alternatives to standard quantized models. However, these strategies are exclusively focused on improving the performance and efficiency of static networks.

Dynamic routing networks provide improvements using an alternative approach. By performing computations conditioned on the inputs, these networks save resources by executing only the operations required to map the input to the desired output. Popular strategies include skipping convolutional layers [39, 6, 41] based on input complexity, and early-exit classifiers [23].

Our proposed approach, like that of slimmable neural networks [44], falls in the category of dynamic networks, but follows a different principle, aiming to provide on-demand trade-offs rather than input-dependent ones. To the best of our knowledge, dynamic quantization of DNNs has not been explored in the literature. Slimmable networks provide width (number of channels) switches that allow inference to use only sections of the network according to on-device demands and resource constraints. Analogously, we develop a network whose weights and activations can be quantized at various precisions at runtime, permitting instant and adaptive accuracy-efficiency trade-offs, which we term switchable precision. In particular, 1-bit DNNs are both an interesting and challenging case for our SP-Net. With binary weights, we can train a shared network ranging from BinaryConnect [9] to Binarized Neural Networks [15], where the inner product can be efficiently implemented using summations or bit operations according to on-device resource constraints. Furthermore, our SP-Nets frequently yield higher accuracy than individually trained quantized networks.

However, different precision switches are difficult to optimize as a whole, for two main reasons. On one hand, during training, batch normalization (BN) layers use the current batch statistics to normalize the intermediate feature maps, while estimating the global statistics by accumulating a running mean and running variance, which are used as replacements during testing. However, the batch statistics of each precision switch are different. As a result, the discrepancy of feature means and variances across switches leads to inaccurate accumulated statistics in the BN layers. To solve this problem, we follow [44] and train the switchable precision network with independent BN parameters for each switch, named switchable batch normalization (S-BN).

On the other hand, we conjecture that simultaneously optimizing multiple quantized switches is reflected in the loss manifold by progressively quantizing the loss surface, so that the higher precision switches assist the lower precision ones towards a less noisy and smoother convergence, potentially leading them to a better minimum. Conversely, the network will only converge to minima which perform well at all bitwidths, potentially harming the performance of some of the switches, particularly the higher precision ones. During training of SP-Nets, the gradients of each switch are accumulated before running an optimizer step. However, there is no explicit mechanism preventing the individual switches from moving in distinct directions. In order to encourage the different switches to move in approximately the same direction, we propose a self-distillation strategy, where the full-precision switch provides a guiding signal for the rest of the switches. Specifically, only the full-precision teacher switch sees the ground-truth, while the low-precision student switches are trained by distilling knowledge from the full-precision teacher.

In order to increase the flexibility of our model, in addition to equipping the network with switchable precision representations, we extend our approach to three different quantizers and further equip the network with slimming (switchable width) capability.

Our contributions are summarized as follows:

  • We leverage S-BN to train a shared network executable at different bitwidths on weights and activations according to runtime demands.

  • We propose a self-distillation scheme to jointly train the full-precision teacher switch and the low-precision student switches. By doing so, the full-precision switch provides a guiding signal to significantly improve the performance of the low-precision switches.

  • We investigate the effectiveness of our method on uniform and non-uniform quantization with various quantizers through extensive experiments on image classification.

2 Related Work

Network slimming. Slimmable networks, introduced by [44], provide a procedure for training a DNN with switchable widths. The motivation is to provide instant and on-demand accuracy-efficiency trade-offs. The idea was further generalized by [43], allowing a continuously slimmable network, executable at any arbitrary width, to be trained efficiently. Moreover, non-uniform slimming was introduced, allowing layer-wise width selection. The training principle behind network slimming has found applications in the fields of pruning [42], network distillation [47], architecture search [4], adaptive inference [31] and, in our work, network quantization.

Network quantization. Quantization-based methods represent the network weights and/or activations with very low precision, yielding highly compact DNN models with considerable memory and computation savings compared to their floating-point counterparts. BNNs [15, 28] constrain both weights and activations to binary values (i.e., +1 and -1), so that multiply-accumulate operations can be replaced purely by xnor and bit-counting operations, which are in general much faster. However, BNNs still suffer from a significant accuracy decrease compared with their full precision counterparts. To narrow this accuracy gap, ternary networks [19, 51] and even higher-bit fixed-point quantization [50, 48] methods have been proposed.

In general, quantization approaches tackle two main problems. On one hand, some works aim to design more accurate quantizers to minimize information loss. For the uniform quantizer, the works in [7, 17] explicitly parameterize and optimize the upper and/or lower bounds of the activations and weights. To reduce the quantization error, non-uniform approaches [26, 46] have been proposed to better approximate the data distribution. In particular, LQ-Net [46] proposes to jointly optimize the quantizer and the network parameters. On the other hand, because of the non-differentiable quantizer, some literature focuses on relaxing the discrete optimization problem. A typical approach is to train with regularization [13, 49, 2, 1, 33, 8], where the optimization problem becomes continuous while the data distribution is gradually adjusted towards quantized values. Apart from these two challenges, with the popularization of neural architecture search (NAS), Wang et al. [37] further propose to employ reinforcement learning to automatically determine the bitwidth of each layer without human heuristics.

Knowledge distillation. Knowledge distillation (KD) is a general approach for model compression, where a powerful wide/deep teacher distills knowledge to a narrow/shallow student to improve its performance [12, 30]. In terms of the knowledge to be distilled from the teacher, existing models typically use the teacher's class probabilities [12] and/or intermediate features [30, 45]. KD has been widely used in many computer vision tasks [40, 11]. Moreover, some works [53, 24, 27] study the combination of KD and quantization, where the full-precision model provides hints to guide the low-precision model training and significantly improves the performance of the low-precision networks. Different from the previous literature, we propose a self-distillation strategy to improve the training of our SP-Net. Specifically, only the full-precision switch sees the ground-truth, while the low-precision switches are learned by distilling from the full-precision teacher.

3 Preliminaries

Figure 1: Overview of the proposed approach. In SP-Nets multiple precision switches share a common architecture, making it capable of adjusting the precision of their representations on the fly, granting devices and end-users real-time control over the network performance (thinner connections indicate reduced bitwidth).

The optimization problem of traditional Quantized Neural Networks (QNNs) aims at minimizing an objective function given a set of trainable weights $\mathbf{W}$, which take values from a predefined discrete set $\mathcal{C}_w$, typically referred to as the codebook. A common QNN training procedure involves storing a real-valued version of $\mathbf{W}$. During inference, the real-valued $\mathbf{W}$ is quantized using a predetermined pointwise quantization function $Q_w(\cdot)$. The weights are updated during the optimization process by estimating the gradients w.r.t. the real-valued copy. Additionally, the internal network activations can optionally be quantized with their own codebook $\mathcal{C}_a$ by a function $Q_a(\cdot)$. Given independent bitwidths $b_w$ and $b_a$ for the network weights and activations, $|\mathcal{C}_w| = 2^{b_w}$ and $|\mathcal{C}_a| = 2^{b_a}$. Finally, a quantized convolutional layer with $n$ filters is computed as follows:

$\mathbf{O}_i = Q_a(\mathbf{A}) \ast Q_w(\mathbf{W}_i), \quad i = 1, \dots, n, \qquad (1)$

where $\ast$ denotes convolution and $\mathbf{W}_i \in \mathbb{R}^{c \times h \times w}$ is the $i$-th convolutional filter; $c$, $h$ and $w$ denote the number of input channels, height and width of the filters, respectively. $\mathbf{A} \in \mathbb{R}^{c \times h_{in} \times w_{in}}$ and $\mathbf{O}_i \in \mathbb{R}^{h_{out} \times w_{out}}$ denote the input activations and output pre-activations of the filter, where $h_{in}$, $w_{in}$ and $h_{out}$, $w_{out}$ represent the height and width of the input and output feature maps, respectively.
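As a concrete illustration, the sketch below expresses the quantized convolution of Eq. (1) in PyTorch; the quantizer callables Q_w and Q_a are hypothetical placeholders for any of the quantizers described in Sec. 3.1.

```python
import torch.nn.functional as F

def quantized_conv2d(x, weight, Q_w, Q_a, stride=1, padding=0):
    # Eq. (1): quantize the input activations and the real-valued filters,
    # then apply a standard convolution to obtain the output pre-activations.
    return F.conv2d(Q_a(x), Q_w(weight), stride=stride, padding=padding)
```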

In the context of QNNs, multiple quantizers and gradient estimators have been proposed in the literature [3, 21, 50, 5, 25]. For our SP-Net, we stick to common ones, which are described in Sec. 3.1. Three non-linear quantizers for the activations are implemented. For the weights, the tanh-based quantizer is used, unless specified otherwise. The motivation for each quantizer is explained in its corresponding subsection.

3.1 Straight-Through Estimator (STE) and Base Quantizer

The STE proposed in [3] allows gradients to be estimated through the non-differentiable functions $Q_w$ and $Q_a$, making the employed $b$-bit quantizers compatible with the backpropagation algorithm. The STE operates by passing the output gradients unaltered to the input, which is equivalent to the derivative of an identity mapping on the inputs. Commonly, the gradients for inputs outside of the valid quantization range are suppressed, so that the derivative of the quantization function becomes equivalent to that of a clipped identity (hard tanh) function.

The $b$-bit quantizers described in the next sections share the same base quantization function $q_b$:

$q_b(x) = \frac{1}{2^b - 1}\,\mathrm{round}\!\left((2^b - 1)\, x\right), \quad x \in [0, 1]. \qquad (2)$

During backpropagation, we use the STE:

$\frac{\partial \ell}{\partial x} = \frac{\partial \ell}{\partial q_b(x)}. \qquad (3)$
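A minimal PyTorch sketch of the base quantizer of Eq. (2) together with the STE of Eq. (3), written as a custom autograd function: the forward pass rounds its input in $[0, 1]$ to one of $2^b$ uniformly spaced levels, and the backward pass returns the incoming gradient unaltered.

```python
import torch

class BaseQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, b):
        # Eq. (2): map x in [0, 1] to the nearest of 2^b uniformly spaced levels.
        n = float(2 ** b - 1)
        return torch.round(x * n) / n

    @staticmethod
    def backward(ctx, grad_output):
        # Eq. (3), STE: pass the gradient straight through; no gradient for b.
        return grad_output, None

def q_b(x, b):
    return BaseQuantize.apply(x, b)
```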

Tanh-Based Quantizer

[50, 5] proposed to use different quantizers for weights and activations. Weight quantizers approximate the hyperbolic tangent function (tanh), constraining the weights to $[-1, 1]$. However, tanh is associated with the vanishing gradient problem; it is therefore attractive for activation quantizers to approximate the popular ReLU activation function, as described in the next section, constraining the activations to the positive range.

The tanh function is first used to project the input to the range $[-1, 1]$ in order to reduce the impact of large values. The quantizer is defined as follows:

$Q_w(x) = 2\, q_b\!\left(\frac{\tanh(x)}{2 \max\left(\left|\tanh(\mathbf{W})\right|\right)} + \frac{1}{2}\right) - 1, \qquad (4)$

where $q_b$ hereafter is defined in Eq. (2) and the maximum is taken over the weights of the layer.

It is appealing to train a network with activations switchable down to 1-bit with permissible values $\{-1, +1\}$, since bit operations can then be performed given that the weights are 1-bit as well. Therefore, when training this type of network, the tanh-based quantizer is used on both weights and activations to allow the activations of the intermediate switches to lie in the range $[-1, 1]$. For this same reason, the layers are re-ordered as described in [28] (the typical layer ordering is QuantConv → BN → ReLU → QuantConv, while the re-ordered layers are QuantConv → ReLU → BN → QuantConv).
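For illustration, a DoReFa-style sketch of the tanh-based weight quantizer of Eq. (4); the small q_b helper below re-implements the base quantizer with the STE realized via the common detach trick.

```python
import torch

def q_b(x, b):
    # Base quantizer of Eq. (2); the detach trick gives the STE of Eq. (3).
    n = float(2 ** b - 1)
    return x + (torch.round(x * n) / n - x).detach()

def tanh_quantize_weights(w, b):
    # Eq. (4) sketch: tanh bounds the weights, the result is rescaled to [0, 1],
    # quantized with q_b, and mapped back to [-1, 1].
    t = torch.tanh(w)
    t = t / (2.0 * t.abs().max()) + 0.5
    return 2.0 * q_b(t, b) - 1.0
```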

ReLU-Based Quantizer

As mentioned in the previous section, [50, 5] employ a ReLU-approximating quantizer for the activations with no layer re-ordering. The method proposed by [50] is employed here, which consists of simply clipping the activations followed by the base quantizer:

$Q_a(x) = q_b\!\left(\mathrm{clip}(x, 0, 1)\right). \qquad (5)$

In particular, [5] proposed both uniform and non-uniform spacing between the codebook elements. They first clip the activations to the range $[0, \alpha]$, where $\alpha$ is some predetermined value, and the codebook for both cases, uniform and non-uniform, is obtained from the network's internal statistics. In our tests with ReLU-based quantization, we simply constrain the activations to the range $[0, 1]$ with uniform quantization. In the logarithmic quantizer described in the next section, non-uniform quantization is used.
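A corresponding PyTorch sketch of the ReLU-based activation quantizer of Eq. (5), again using the detach trick for the STE.

```python
import torch

def q_b(x, b):
    # Base quantizer of Eq. (2) with the STE realized via the detach trick.
    n = float(2 ** b - 1)
    return x + (torch.round(x * n) / n - x).detach()

def relu_quantize_activations(x, b):
    # Eq. (5): clip the activations to [0, 1], then apply the base quantizer.
    return q_b(torch.clamp(x, 0.0, 1.0), b)
```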

Logarithmic Quantizer

In full-precision neural networks, the weights and activations have non-uniform distributions [25]. Taking advantage of this fact, the authors used logarithmic representations for both weights and activations, achieving higher classification accuracy at the same resolution than uniform quantization schemes, at the expense of higher implementation and computation complexity. In the original paper, the quantizer relies on a Full Scale Range (FSR) parameter, and each layer has its own FSR value reported in the paper. In our SP-Net, we use a variation of their quantizer in order to avoid the FSR parameter. Our modified quantizer is defined as follows:

(6)

where

(7)
(8)
(9)
(10)

Similarly to Sec. 3.1.1, two-sided logarithmic quantization (positive and negative) can be used in order to have activations switchable down to 1-bit. Depending on the choice of activation quantization (one-sided or two-sided), layer re-ordering should be taken into account. In our experiments, logarithm base-2 was used; however, base-$\sqrt{2}$ could provide higher accuracy.
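For illustration only, the sketch below implements a base-2 logarithmic quantizer in the spirit of [25], using the original FSR-style clipping rather than the FSR-free variant of Eqs. (6)-(10); the clipping bounds and the fsr default are assumptions.

```python
import torch

def log2_quantize(x, b, fsr=0):
    # Round |x| to the nearest power of two, clip the exponent to a b-bit range
    # ending at fsr, keep the sign, and map exact zeros back to zero.
    sign = torch.sign(x)
    exponent = torch.round(torch.log2(x.abs().clamp(min=1e-12)))
    exponent = torch.clamp(exponent, min=fsr - 2 ** b, max=fsr)
    q = sign * torch.pow(2.0, exponent)
    q = torch.where(x == 0, torch.zeros_like(q), q)
    # Straight-through estimator for the backward pass.
    return x + (q - x).detach()
```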

4 Switchable Precision Neural Networks

SP-Nets generalize QNNs: the learnable weights are optimized for multiple codebooks $\mathcal{C}_w$ and $\mathcal{C}_a$ of variable cardinality. The permissible values of the codebooks are determined by the choice of quantizers $Q_w$ and $Q_a$, while their cardinality is determined by the bitwidths $b_w$ and $b_a$.

DNNs at their current state are not naturally precision switchable as empirically demonstrated in Sec. 6.3. Therefore, we appeal to S-BN, a technique used to train slimmable neural networks, described in detail in Sec. 4.1. Then in Sec. 4.2, we extend our SP-Net to mixed slimmable SP-Net.

4.1 Switchable Batch Normalization (S-BN)

DNNs at their current state are not naturally slimmable [43] nor of switchable precision, due to the inconsistent behavior of batch normalization layers during training and inference. During training, BN layers utilize the current batch mean and variance to normalize the intermediate feature maps, while accumulating a running mean $\mu$ and running variance $\sigma^2$, which are used as replacements during testing.

In a naive SP-Net, the mini-batch statistics of each quantization switch are different; however, all the switches contribute to the same accumulated global mean $\mu$ and variance $\sigma^2$. The network is thus able to train properly, but test inference becomes inaccurate.

S-BN layers equip regular BN layers with a private $\mu$ and $\sigma^2$ for each of the switches. The overhead of the additional parameters is negligible, since BN parameters account for an insignificant portion of the total amount. It is also negligible in terms of run-time complexity, since no additional operations are involved.

Additionally, BN layers have two learnable parameters $\gamma$ and $\beta$, which provide the layer with the capacity to perform a linear mapping. Unlike $\mu$ and $\sigma^2$, $\gamma$ and $\beta$ can be updated by all the switches. Providing private versions of them is not as crucial, since shared ones still allow the network to learn, but it yields an additional accuracy boost [43]. Furthermore, they generate no additional overhead, since they can be merged with $\mu$ and $\sigma^2$ after training.
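A minimal sketch of an S-BN layer in PyTorch: each precision switch owns a private BatchNorm2d (which, in this simplified version, also gives each switch private $\gamma$ and $\beta$), and the active switch is selected at runtime. The class and method names are illustrative.

```python
import torch.nn as nn

class SwitchableBatchNorm2d(nn.Module):
    def __init__(self, num_features, num_switches):
        super().__init__()
        # One private BatchNorm2d (running mean/variance, gamma, beta) per switch.
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_switches)]
        )
        self.active = 0  # index of the currently selected precision switch

    def switch_to(self, idx):
        self.active = idx

    def forward(self, x):
        return self.bns[self.active](x)
```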

In Algorithm 1, we illustrate the use of S-BN in one SP-Net training iteration.

Require: SP-Net model $M$. Lists of weight and activation bitwidths bits_w, bits_a.
Get mini-batch data $x$ and ground-truth $y$;
for bit_w in bits_w do
       for bit_a in bits_a do
             Switch BN parameters in $M$ to the current bitwidths;
             Set bit_w, bit_a in $M$;
             Forward pass using input $x$;
             Compute loss w.r.t. ground-truth $y$;
             Compute gradients using STE;
       end for
      
end for
Update weights using accumulated gradients;
Algorithm 1 One SP-Net iteration using S-BN
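A PyTorch-style sketch of Algorithm 1, assuming a hypothetical model.set_precision(bit_w, bit_a) helper that sets the active weight/activation bitwidths and selects the corresponding private S-BN parameters.

```python
import torch.nn.functional as F

def sp_net_iteration(model, optimizer, x, y, bits_w, bits_a):
    optimizer.zero_grad()
    for bit_w in bits_w:
        for bit_a in bits_a:
            # Switch the S-BN parameters and the active bitwidths (hypothetical helper).
            model.set_precision(bit_w, bit_a)
            logits = model(x)                  # forward pass at the current precision
            loss = F.cross_entropy(logits, y)
            loss.backward()                    # gradients accumulate across switches
    optimizer.step()                           # single update with the accumulated gradients
```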

4.2 Slimmable SP-Net

Network slimming [44] relies on S-BN in order to allow each layer of the network to operate at different widths. Given the current width multiplier $r$, a slimmable convolutional layer with $n$ filters computes only the first $r \cdot n$ of them.

Network slimming and quantization are complementary techniques and work together without technical complications. Therefore, we can train a single shared network with switchable width and precision to increase the flexibility. A single slimmable/SP network can be trained in the cloud, and particular switches can be distributed to different deployment systems based on their specific hardware capabilities, where they can be further slimmed and quantized instantaneously on demand. In Sec. 6.2, we provide a comparison of a slimmable SP-Net with the corresponding individually trained switches.
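A simplified sketch of a convolutional layer that is both slimmable and precision-switchable, assuming hypothetical quantizer callables Q_w and Q_a: the width multiplier selects the first filters, and the quantizers handle the precision switch. For brevity, only the output filters are sliced; a complete slimmable layer would also slice the input channels to match the active width of the preceding layer.

```python
import torch.nn as nn
import torch.nn.functional as F

class SlimmableQuantConv2d(nn.Conv2d):
    def __init__(self, in_ch, out_ch, kernel_size, Q_w, Q_a, **kwargs):
        super().__init__(in_ch, out_ch, kernel_size, **kwargs)
        self.Q_w, self.Q_a = Q_w, Q_a
        self.width_mult = 1.0  # switched at runtime

    def forward(self, x):
        # Use only the first width_mult * out_channels filters, quantized.
        n = max(1, int(round(self.width_mult * self.out_channels)))
        w = self.Q_w(self.weight[:n])
        bias = self.bias[:n] if self.bias is not None else None
        return F.conv2d(self.Q_a(x), w, bias, self.stride, self.padding)
```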

Although the benefits of an SP-Net in terms of power and speed are evident when performing dot-product operations on quantized vectors, the benefits in memory footprint may be less apparent, given that the full-precision weights must be stored in non-volatile memory at all times, regardless of the active quantization switch. However, during operation, a network clone is stored in the processor's RAM in order to provide quick access; therefore, weight quantization only takes place once each time a new switch is requested, and the quantized clone is kept in volatile memory. By quantizing activations, additional RAM savings can be obtained.

5 Self-Distillation

Knowledge distillation is a common strategy used to provide a stronger training signal, typically from a large network to a smaller one in a teacher-student scheme. In a similar fashion, by simultaneously optimizing multiple quantized switches, the high precision switches implicitly provide guidance to the noisier low precision updates. However, this behavior is not explicitly encouraged. Therefore, in this section we present a complementary distillation mechanism, denominated self-distillation, where only the full-precision switch sees the ground-truth, while the low bitwidth switches try to match the internal representation as well as the output distribution of the full-precision switch. The formulated strategy allows the full-precision switch to guide the optimization process with a significant amount of information flowing across all the switches.

In US-Net [43], gradients are prevented from flowing from the sub-networks to the largest width switch. Formally, to mimic the outputs, similarly to US-Net, we use the Kullback-Leibler divergence as distance measure on the output distributions $\mathbf{p}_{fp}$ and $\mathbf{p}_{q}$ of the full-precision switch and the quantized active switch. Let $\mathrm{sg}(\cdot)$ denote the stop-gradient function; the output mimic loss is:

$\mathcal{L}_{\mathrm{output}} = D_{\mathrm{KL}}\!\left(\mathrm{sg}(\mathbf{p}_{fp}) \,\|\, \mathbf{p}_{q}\right). \qquad (11)$

Additionally, to create the guidance signal, [53] proposes a hint-based training strategy that compares the intermediate feature maps of the full-precision teacher and the low-precision student. Similarly, let $\mathbf{F}_{fp}$ and $\mathbf{F}_{q}$ denote the internal feature maps (i.e., pre-activations) of the full-precision switch and the quantized active switch; the internal representation guidance loss becomes:

$\mathcal{L}_{\mathrm{feat}} = \left\| \mathbf{F}_{fp} - \mathbf{F}_{q} \right\|_2^2. \qquad (12)$

In this case, we do not stop the gradients flowing to $\mathbf{F}_{fp}$, and $\mathbf{F}_{q}$ is computed by the quantized active switch.

Finally, the loss for the quantized switches is the combination of these two terms, while the loss for the full-precision switch is simply the regular cross-entropy classification loss:

$\mathcal{L}_{q} = \alpha\, \mathcal{L}_{\mathrm{output}} + \beta\, \mathcal{L}_{\mathrm{feat}}, \qquad (13)$
$\mathcal{L}_{fp} = \mathcal{L}_{\mathrm{CE}}\!\left(\mathbf{p}_{fp}, \mathbf{y}\right), \qquad (14)$

where $\mathbf{y}$ denotes the ground-truth labels, and $\alpha$ and $\beta$ are the scaling coefficients that balance $\mathcal{L}_{\mathrm{output}}$ and $\mathcal{L}_{\mathrm{feat}}$, respectively.
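A sketch of how the losses of Eqs. (11)-(14) can be computed in PyTorch; the stop-gradient of Eq. (11) is realized with detach(), and the feature guidance loss is written here as a mean-squared error over the matched feature maps. Function and argument names are illustrative.

```python
import torch.nn.functional as F

def self_distillation_losses(logits_fp, logits_q, feats_fp, feats_q, y,
                             alpha=1.0, beta=1.0):
    # Full-precision (teacher) switch: regular cross-entropy on the labels, Eq. (14).
    loss_fp = F.cross_entropy(logits_fp, y)

    # Output mimic loss, Eq. (11): KL( sg(p_fp) || p_q ), stop-gradient via detach().
    p_fp = F.softmax(logits_fp, dim=1).detach()
    log_p_q = F.log_softmax(logits_q, dim=1)
    loss_output = F.kl_div(log_p_q, p_fp, reduction="batchmean")

    # Feature guidance loss, Eq. (12): L2 distance between intermediate feature maps
    # (no stop-gradient on the full-precision features in this sketch).
    loss_feat = sum(F.mse_loss(fq, ff) for fq, ff in zip(feats_q, feats_fp))

    # Quantized (student) switch: weighted combination, Eq. (13).
    loss_q = alpha * loss_output + beta * loss_feat
    return loss_fp, loss_q
```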

6 Experiments

In this section, we test two different architectures, ResNet [10] and MobileNet [14], for the task of image classification on the ImageNet (ILSVRC-2012) [32] and Tiny ImageNet datasets, with three different quantizers: tanh-based quantization [50], ReLU-based quantization [50, 5] and non-uniform logarithmic quantization [25]. Additional experiments on ImageNet are included in the supplementary material.

We implement our SP-Nets using PyTorch. For our ImageNet experiments, we start from a regular (i.e., single-switch) full-precision pre-trained model, replicate its BN parameters across all the S-BN switches, and fine-tune the SP-Net model. We use the standard pre-processing and augmentation reported by [10]. For training from the pre-trained full-precision model, we use the Adam [18] optimizer and decrease the initial learning rate by a factor of 10 at epochs 15 and 20, for a total of 25 epochs. For our Tiny ImageNet experiments, all networks are learned from scratch using the SGD optimizer with an initial learning rate of 0.1, decreased by a factor of 10 every 30 epochs over 100 epochs. As is common practice, the first and last layers are not quantized [28].

Implementation Note. Although the activations of the full-precision switch are not quantized, in some cases they must still be clipped to the same range as those of the quantized switches; otherwise the full-precision switch can diverge.

6.1 Evaluation on ImageNet

Evaluation on ResNet-18. In Table 1, we compare the Top-1 accuracy for the edge bitwidth cases on both weights and activations for a ResNet-18 architecture. Our SP-Net achieves accuracy very close to the independently trained QNNs, with the full-precision switch being the most affected, while providing increased efficiency-accuracy flexibility.

Weight Activation SP-Net Independent
2 2 62.4 62.6
2 32 65.3 65.8
32 2 65.1 65.3
32 32 68.1 69.5
Table 1: Top-1 accuracy (%) for ResNet-18 on ImageNet.

Evaluation on MobileNet. The MobileNet architecture makes use of depthwise separable and pointwise convolutions without residual connections, drastically reducing the number of parameters without a major decrease in accuracy. However, it is very sensitive to quantization, especially on the activations. [35] re-designs the MobileNet architecture to be more quantization friendly; for our MobileNet SP-Net, we follow their architecture. The re-design involves simple layer replacement and re-ordering. In Table 2, we report the results for different bitwidths of weights and activations using our SP-Net MobileNet and independent networks with the modified architecture. As can be observed, the impact of switchable precision on MobileNet is higher than on ResNet-18.

Weight Activation SP-Net Independent
8 8 59.5 63.5
8 32 66.0 69.0
32 8 59.6 63.6
32 32 66.1 69.7
Table 2: Top-1 accuracy (%) for MobileNet on ImageNet.

6.2 Evaluation on Tiny ImageNet

The Tiny ImageNet dataset is a downsampled ImageNet version (64×64 images) with 100K images for training and 10K for validation, spread across 200 classes. For the experiments on Tiny ImageNet, we use the same ResNet-18 architecture as for ImageNet, except for the first layer, whose filter size and stride are reduced to suit the smaller input resolution.

Weight Activation SP-Net Independent
1 1 43.5 40.0
1 3 48.8 46.3
1 8 48.8 47.0
1 32 49.0 49.1
Table 3: Top-1 accuracy (%) for a 1-bit SP-Net ResNet-18 on Tiny ImageNet.
Figure 2: Top-1 accuracy for SP-Net ResNet-18 with and without self-distillation (SD) on Tiny ImageNet

Comparison of Different Quantizers.

Width Weight Activation ReLU-based Tanh-based Logarithmic 1 Logarithmic 2
0.25 2 2 32.5 29.9 29.1 28.9
0.25 2 4 34.5 32.7 34.0 33.9
0.25 2 8 34.8 32.8 34.1 34.1
0.25 2 32 34.3 32.8 34.4 35.5
0.25 4 2 34.1 33.7 33.3 32.5
0.25 4 4 37.4 36.8 37.3 36.7
0.25 4 8 37.4 36.6 37.5 37.0
0.25 4 32 37.5 36.6 37.6 37.7
0.25 8 2 34.4 33.0 33.9 32.6
0.25 8 4 37.3 36.5 37.2 36.4
0.25 8 8 37.3 36.9 37.4 36.9
0.25 8 32 37.5 37.0 37.6 37.7
0.25 32 2 36.3 33.6 33.5 32.8
0.25 32 4 37.3 36.5 37.2 36.8
0.25 32 8 37.2 37.0 37.4 36.9
0.25 32 32 37.7 37.0 37.5 37.5
Table 4: Comparison of different quantizers on an SP-Net ResNet-18 on Tiny ImageNet. The networks were slimmed with a width factor of 0.25 to magnify the effect of the quantizers. ReLU-based: tanh quantizer on weights, ReLU quantizer on activations. Tanh-based: tanh quantizer on both weights and activations with layer re-ordering. Logarithmic 1: logarithmic quantizer on the activations only. Logarithmic 2: logarithmic quantizer on both weights and activations.

As explained in Sec. 3.1, different quantizers suit different scenarios. The ReLU-based quantizer is used for activations in general. When it is desirable for the activations to be quantizable down to 1-bit, the tanh-based quantizer is used. The logarithmic quantizer is chosen when higher accuracy is desired at the same bitwidth, at the expense of higher complexity.

In Table 4, four quantizer configurations were tested for SP-Nets with multiple weight and activation switches (2, 4, 8 and 32 bits). In our results, the tanh-based configuration obtained the lowest accuracy across all the switches, while the ReLU-based configuration frequently obtained the best results. The logarithmic configurations occasionally performed better, particularly at higher activation bitwidths. The logarithmic configurations were expected to produce the best results; however, our FSR-free modified version and the choice of logarithm base may have harmed their performance. All the networks were slimmed by a factor of 0.25 to magnify the effect of the quantizers.

1-bit SP-Net. We trained a network with a single bit per weight and switchable precision activations. Our network spans from the popular BinaryConnect, where dot products are computed using only summations, to Binarized Neural Networks (BNNs), where activations are quantized to 1-bit and dot products are computed using xnor and bit-counting operations.

Training a network with precision switchable down to 1-bit per activation is particularly challenging, since it causes a significant decrease in performance across all the bitwidths when using ReLU-based quantization. Therefore, networks with activations switchable to 1-bit are trained with tanh-based quantization and layer re-ordering. The weights are trained by simply using the sign function with the STE.

The results in Table 3 show that our 1-bit SP-Net surpasses the independently trained counterparts by a large margin when the activations are in extremely low-precision.

Slimmable SP-Net. Our results for the slimmable SP-Net in Table 5 demonstrate the flexibility of our network; however, they show no clear accuracy advantage over independently trained networks. We tested widths of 0.25, 0.5 and 1.0, with the slimmable SP-Net and the independently trained networks alternately outperforming each other. However, it is worth noting that our slimmable SP-Net can simultaneously switch width and bitwidth on demand.

Width Weight Activation Slim/SP-Net Independent
0.25 2 2 30.3 26.5
0.25 2 32 36.5 34.2
0.25 32 2 35.6 32.8
0.25 32 32 39.7 37.8
0.5 2 2 40.9 41.3
0.5 2 32 44.8 49.7
0.5 32 2 46.1 45.8
0.5 32 32 49.0 52.8
1.0 2 2 47.8 47.4
1.0 2 32 50.5 54.4
1.0 32 2 51.8 50.7
1.0 32 32 53.0 56.4
Table 5: Top-1 accuracy (%) for a slimmable SP-Net ResNet-18 on Tiny ImageNet.

Self-Distillation. Our self-distillation method of Sec. 5 proved to be very effective, improving the accuracy across all the switches, as can be observed in Table 6. Additionally, it can be observed that the validation performance of intermediate switches is superior to that of the full-precision one; however, we credit this to over-fitting, since the training accuracy is higher for the full-precision switch. We also compared the performance of the distillation losses, confirming that our feature map distillation loss indeed benefits the overall performance of the network. The value of $\mathcal{L}_{\mathrm{feat}}$ is expected to be several orders of magnitude larger than $\mathcal{L}_{\mathrm{output}}$, since it is dictated by the number of layers and the size of the intermediate feature maps. We set the hyper-parameters $\alpha$ and $\beta$ to bring $\mathcal{L}_{\mathrm{feat}}$ to the same order of magnitude as $\mathcal{L}_{\mathrm{output}}$. In Figure 2, we plot the accuracy curves during training of our SP-Net with and without self-distillation.

Weight Activation Regular SD (output loss only) SD (output + feature loss)
2 2 50.1 53.0 53.3
2 3 50.5 53.8 53.9
2 32 51.2 54.2 53.8
3 2 50.8 53.4 53.9
3 3 51.5 54.0 54.4
3 32 52.3 54.1 54.5
32 2 51.4 52.6 53.3
32 3 52.1 53.4 53.9
32 32 52.9 52.9 52.8
Table 6: Top-1 accuracy (%) for a SP-Net network with and without self-distillation for ResNet-18 on Tiny ImageNet.

6.3 Ablation Studies

Non-Switchable Batch Normalization. We replaced our S-BN layers with standard BN layers to verify and demonstrate that privatizing the BN layers for each switch is essential to successfully perform inference. Figure 3 shows the validation plots for an SP-Net with four switches, with and without S-BN. The switches of the network with S-BN learn at different rates, with the high precision ones learning faster and achieving higher accuracy than the low precision ones, while the accuracy of the switches without S-BN quickly stagnates and does not recover.

Impact of Multiple Switches. We investigated the impact of increasing the number of switches. In Table 7, we present the results of slimmed SP-Nets with 4 and 16 switches. The first remark is that with an increasing number of switches, the initial learning rate should be lowered; for the network with 16 switches, we used a lower initial learning rate. The accuracy obtained using fewer switches is higher, which we conjecture is due to the additional constraints imposed by the extra switches; nevertheless, it indicates an interesting direction of research.

                    4 switches                 16 switches
                    Weight bitwidth            Weight bitwidth
Activation bitwidth 2    4    8    32          2    4    8    32
2                   41.3 -    -    43.8        37.7 41.3 41.4 41.4
4                   -    -    -    -           38.6 42.3 42.9 42.4
8                   -    -    -    -           38.4 42.5 42.7 42.5
32                  43.3 -    -    45.2        39.6 42.7 43.2 43.1
Table 7: Top-1 accuracy (%) of slimmed SP-Nets with 4 and 16 switches.

Switchable Precision in Weights vs Activations. [52] found that QNN activations are more sensitive to quantization than weights. Our observations on the SP-Net MobileNet confirm their findings. However, for our SP-Net ResNet-18, we observe that weights and activations are roughly equally sensitive. For example, in Table 4, freezing the weights to 2-bit and varying the activation bitwidth in {2, 4, 8, 32} produces accuracy gaps of a magnitude comparable to those obtained by freezing the activations to 2-bit and varying the weight bitwidth.

Additionally, in Figure 3, we can observe that the switch “W:2 A:32” learns slower than “W:32 A:2”, but towards the end of training, it catches up.

Figure 3: Top-1 accuracy for SP-Net ResNet-18 with and without switchable batch normalization (S-BN) on Tiny ImageNet.

7 Conclusion and Future Work

In this paper we have proposed a DNN capable of operating at variable precisions on demand. With this approach, we grant devices and end-users real-time control over the performance of the DNNs powering inference algorithms. Our approach is lightweight and does not require altering the model, making it compatible with connectivity-free and memory-limited devices. An additional virtue of our proposed network lies in the ability to train a single network that can be distributed to different devices based on their capabilities. We have demonstrated the flexibility of our method with multiple quantization functions and a complementary slimming strategy. Moreover, we have proposed a training procedure to increase the accuracy of our network across the available precisions. Finally, we performed multiple ablation studies to analyse the performance of our approach in different scenarios.

Following the spirit of Universally Slimmable networks, a continuously quantizable SP-Net would be an interesting research direction, as would a mixed-precision SP-Net, where each layer is quantized independently on demand. Such a method could be used to find the optimal layer-wise bitwidth for weights and activations.

8 Acknowledgements

This work was supported by the Australian Research Council through the ARC Centre of Excellence for Robotic Vision (project number CE1401000016).

References

  1. T. Ajanthan, P. K. Dokania, R. Hartley and P. H. Torr (2019) Proximal mean-field for neural network quantization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4871–4880. Cited by: §1, §2.
  2. Y. Bai, Y. Wang and E. Liberty (2019) Proxquant: quantized neural networks via proximal operators. In Proc. Int. Conf. Learn. Repren., Cited by: §1, §2.
  3. Y. Bengio, N. Léonard and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §3.1, §3.
  4. H. Cai, C. Gan and S. Han (2019) Once for all: train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791. Cited by: §2.
  5. Z. Cai, X. He, J. Sun and N. Vasconcelos (2017) Deep learning with low precision by half-wave gaussian quantization. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5918–5926. Cited by: §3.1.1, §3.1.2, §3.1.2, §3, §6.
  6. V. Campos, B. Jou, X. Giró-i-Nieto, J. Torres and S. Chang (2017) Skip rnn: learning to skip state updates in recurrent neural networks. arXiv preprint arXiv:1708.06834. Cited by: §1.
  7. J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan and K. Gopalakrishnan (2018) Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §2.
  8. Y. Choi, M. El-Khamy and J. Lee (2018) Learning low precision deep neural networks through regularization. arXiv preprint arXiv:1809.00095. Cited by: §2.
  9. M. Courbariaux, Y. Bengio and J. David (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Proc. Adv. Neural Inf. Process. Syst., pp. 3123–3131. Cited by: §1, §1.
  10. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 770–778. Cited by: §1, §6, §6.
  11. T. He, C. Shen, Z. Tian, D. Gong, C. Sun and Y. Yan (2019) Knowledge adaptation for efficient semantic segmentation. arXiv preprint arXiv:1903.04688. Cited by: §2.
  12. G. Hinton, O. Vinyals and J. Dean (2014) Distilling the knowledge in a neural network. In Proc. Adv. Neural Inf. Process. Syst. Workshops, Cited by: §2.
  13. L. Hou and J. T. Kwok (2018) Loss-aware weight quantization of deep networks. In Proc. Int. Conf. Learn. Repren., Cited by: §1, §2.
  14. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §6.
  15. I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv and Y. Bengio (2016) Binarized neural networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 4107–4115. Cited by: §1, §1, §2.
  16. I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv and Y. Bengio (2016) Quantized neural networks: training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061. Cited by: §1.
  17. S. Jung, C. Son, S. Lee, J. Son, J. Han, Y. Kwak, S. J. Hwang and C. Choi (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4350–4359. Cited by: §2.
  18. D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proc. Int. Conf. Learn. Repren., Cited by: §6.
  19. F. Li, B. Zhang and B. Liu (2016) Ternary weight networks. arXiv preprint arXiv:1605.04711. Cited by: §2.
  20. Y. Liu, B. Zhuang, C. Shen, H. Chen and W. Yin (2019) Training compact neural networks via auxiliary overparameterization. arXiv preprint arXiv:1909.02214. Cited by: §1.
  21. Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu and K. Cheng (2018) Bi-real net: enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In Proc. Eur. Conf. Comp. Vis., Cited by: §1, §3.
  22. J. Long, E. Shelhamer and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3431–3440. Cited by: §1.
  23. M. McGill and P. Perona (2017) Deciding how to decide: dynamic routing in artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2363–2372. Cited by: §1.
  24. A. Mishra and D. Marr (2018) Apprentice: using knowledge distillation techniques to improve low-precision network accuracy. In Proc. Int. Conf. Learn. Repren., Cited by: §2.
  25. D. Miyashita, E. H. Lee and B. Murmann (2016) Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025. Cited by: §3.1.3, §3, §6.
  26. E. Park, J. Ahn and S. Yoo (2017) Weighted-entropy-based quantization for deep neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5456–5464. Cited by: §2.
  27. A. Polino, R. Pascanu and D. Alistarh (2018) Model compression via distillation and quantization. In Proc. Int. Conf. Learn. Repren., Cited by: §2.
  28. M. Rastegari, V. Ordonez, J. Redmon and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In Proc. Eur. Conf. Comp. Vis., pp. 525–542. Cited by: §1, §2, §3.1.1, §6.
  29. J. Redmon, S. Divvala, R. Girshick and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 779–788. Cited by: §1.
  30. A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta and Y. Bengio (2015) Fitnets: hints for thin deep nets. Proc. Int. Conf. Learn. Repren.. Cited by: §2.
  31. A. Ruiz and J. Verbeek (2019) Adaptative inference cost with convolutional neural mixture models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1872–1881. Cited by: §2.
  32. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla and M. Bernstein (2015) Imagenet large scale visual recognition challenge. Int. J. Comp. Vis. 115 (3), pp. 211–252. Cited by: §6.
  33. C. Sakr, J. Choi, Z. Wang, K. Gopalakrishnan and N. Shanbhag (2018) True gradient-based training of deep binary activated neural networks via continuous binarization. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2346–2350. Cited by: §1, §2.
  34. M. Shen, K. Han, C. Xu and Y. Wang (2019) Searching for accurate binary neural architectures. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §1.
  35. T. Sheng, C. Feng, S. Zhuo, X. Zhang, L. Shen and M. Aleksic (2018) A quantization-friendly separable convolution for mobilenets. In 2018 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), pp. 14–18. Cited by: §6.1.
  36. F. Tung and G. Mori (2018) CLIP-q: deep network compression learning by in-parallel pruning-quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7873–7882. Cited by: §1.
  37. K. Wang, Z. Liu, Y. Lin, J. Lin and S. Han (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 8612–8620. Cited by: §2.
  38. K. Wang, Z. Liu, Y. Lin, J. Lin and S. Han (2019) HAQ: hardware-aware automated quantization. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §1.
  39. X. Wang, F. Yu, Z. Dou, T. Darrell and J. E. Gonzalez (2018) Skipnet: learning dynamic routing in convolutional networks. In Proc. Eur. Conf. Comp. Vis., pp. 409–424. Cited by: §1.
  40. Y. Wei, X. Pan, H. Qin, W. Ouyang and J. Yan (2018) Quantization mimic: towards very tiny cnn for object detection. In Proc. Eur. Conf. Comp. Vis., pp. 267–283. Cited by: §2.
  41. Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman and R. Feris (2018) Blockdrop: dynamic inference paths in residual networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 8817–8826. Cited by: §1.
  42. J. Yu and T. Huang (2019) Network slimming by slimmable networks: towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728. Cited by: §2.
  43. J. Yu and T. Huang (2019) Universally slimmable networks and improved training techniques. arXiv preprint arXiv:1903.05134. Cited by: §2, §4.1, §4.1, §5.
  44. J. Yu, L. Yang, N. Xu, J. Yang and T. Huang (2018) Slimmable neural networks. arXiv preprint arXiv:1812.08928. Cited by: §1, §1, §2, §4.2.
  45. S. Zagoruyko and N. Komodakis (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In Proc. Int. Conf. Learn. Repren., Cited by: §2.
  46. D. Zhang, J. Yang, D. Ye and G. Hua (2018) LQ-nets: learned quantization for highly accurate and compact deep neural networks. In Proc. Eur. Conf. Comp. Vis., Cited by: §2.
  47. L. Zhang, J. Song, A. Gao, J. Chen, C. Bao and K. Ma (2019) Be your own teacher: improve the performance of convolutional neural networks via self distillation. arXiv preprint arXiv:1905.08094. Cited by: §2.
  48. A. Zhou, A. Yao, Y. Guo, L. Xu and Y. Chen (2017) Incremental network quantization: towards lossless cnns with low-precision weights. Proc. Int. Conf. Learn. Repren.. Cited by: §2.
  49. A. Zhou, A. Yao, K. Wang and Y. Chen (2018) Explicit loss-error-aware quantization for low-bit deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9426–9435. Cited by: §1, §2.
  50. S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou (2016) DoReFa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §1, §2, §3.1.1, §3.1.2, §3, §6.
  51. C. Zhu, S. Han, H. Mao and W. J. Dally (2017) Trained ternary quantization. Proc. Int. Conf. Learn. Repren.. Cited by: §2.
  52. S. Zhu, X. Dong and H. Su (2019) Binary ensemble neural network: more bits per network or more networks per bit?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4923–4932. Cited by: §6.3.
  53. B. Zhuang, C. Shen, M. Tan, L. Liu and I. Reid (2018) Towards effective low-bitwidth convolutional neural networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §1, §2, §5.