through Weight Equalization and Bias Correction
We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision architectures and tasks. 8-bit fixed-point quantization is essential for efficient inference on modern deep learning hardware architectures. However, quantizing models to run in 8-bit is a non-trivial task, frequently leading to either a significant performance reduction or engineering time spent on training a network to be amenable to quantization. Our approach relies on equalizing the weight ranges in the network by making use of a scale-equivariance property of activation functions. In addition, the method corrects biases in the error that are introduced during quantization. This improves quantized model accuracy, and can be applied to almost any model with a straightforward API call. For common architectures, such as the MobileNet family, we achieve state-of-the-art quantized model performance. We further show that the method also extends to other computer vision architectures and tasks such as semantic segmentation and object detection.
In recent years, deep learning based computer vision models have moved from research labs into the cloud and onto edge devices. As a result, power consumption and latency of deep learning inference have become an important concern. For this reason fixed-point quantization is often employed to make inference more efficient. By quantizing FP32 values onto a regularly spaced grid, the original FP32 values can be approximated by a set of integers, a scaling factor, and an optional zero point offset . This allows for the use of faster and more power-efficient INT8 operations instead of FP32 operations in matrix multiplication or convolution computations, at the expense of lower representational power. We refer the reader to  for details on commonly used, hardware-friendly quantization methods for deep learning models.
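As a concrete illustration of the quantization grid described above, the following minimal sketch maps floats to 8-bit integers using a scaling factor and a zero-point offset, and maps them back. The function names (`quantization_params`, `quantize`, `dequantize`) are our own, and setting the range from the minimum and maximum value is only one common choice, assumed here for simplicity.

```python
def quantization_params(x_min, x_max, num_bits=8):
    """Scale and zero point of an asymmetric uniform grid covering [x_min, x_max]."""
    # Widen the range so that 0.0 lies on the grid (assumed convention).
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    q_max = 2 ** num_bits - 1
    scale = (x_max - x_min) / q_max if x_max > x_min else 1.0
    zero_point = round(-x_min / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers in [0, 2^num_bits - 1]."""
    q_max = 2 ** num_bits - 1
    return [min(max(round(v / scale) + zero_point, 0), q_max) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to the floats they represent."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.0, -0.5, 0.0, 0.3, 2.0]
scale, zero_point = quantization_params(min(weights), max(weights))
q = quantize(weights, scale, zero_point)
approx = dequantize(q, scale, zero_point)
```

In this sketch every in-range value is reproduced up to half a grid step, which is the quantization noise referred to in the text.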
Quantization of 32-bit full precision (FP32) models into 8-bit fixed point (INT8) introduces quantization noise on the weights and activations, which often leads to reduced model performance. This performance degradation ranges from very minor to catastrophic. To minimize the quantization noise or even mitigate its effect, a wide range of different methods have been introduced in the literature (see Section 2). Such methods (e.g. [16, 4, 21]) often rely on quantization-aware fine-tuning or training from scratch.
A major drawback of practical quantization methods is their reliance on data and fine-tuning. As an example, consider real-world actors that manage hardware for quantized models, such as cloud-based deep learning inference providers or cellphone manufacturers. To provide a general use quantization service they would have to receive data from the customers to fine-tune the models, or rely on their customers to do the quantization. In either case, this can add a difficult step to the process. For such stakeholders it would be preferable if FP32 models could be converted directly to INT8, without needing the know-how, data or compute necessary for running traditional quantization methods. Even for model developers that have the capability to go the extra mile to quantize their own models, automation would save significant time.
In this paper, we introduce a quantization approach that does not require data, fine-tuning or hyperparameter tuning, resulting in accuracy improvement with a simple API call. Despite these restrictions we achieve near-original model performance when quantizing FP32 models to INT8. This is achieved by adapting the weight tensors of pre-trained models such that they are more amenable to quantization, and by correcting for the bias of the error that is introduced when quantizing models. We show significant improvements in quantization performance on a wide range of computer vision models previously thought to be difficult to quantize without fine-tuning.
Levels of quantization solutions
Solutions that do not require data or re-training are rare in the literature, and the practical drawbacks of existing methods are rarely discussed. To distinguish between quantization methods by their applicability, we introduce four levels of quantization solutions, in decreasing order of practical applicability. Our hope is that this will enable other authors to explore solutions for each level, and make the comparison between methods fairer. The axes for comparison are whether or not a method requires data, whether or not a method requires error backpropagation on the quantized model, and whether or not a method is generally applicable for any architecture or requires significant model re-working. We use the following definitions throughout the paper:
- Level 1
No data and no backpropagation required. Method works for any model. As simple as an API call that only looks at the model definition and weights.
- Level 2
Requires data but no backpropagation. Works for any model. The data is used e.g. to re-calibrate the batch normalization mean and variance  or to compute layer-wise loss functions to improve quantization performance. However, no fine-tuning pipeline is required.
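- Level 3
Requires data and backpropagation. Works for any model. The quantized model needs fine-tuning to reach acceptable performance, so a full training pipeline is required.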
- Level 4
Requires data and backpropagation. Only works for specific models. In this case, the network architecture needs non-trivial reworking, and/or the architecture needs to be trained from scratch with quantization in mind (e.g. [4, 31, 21]). Takes significant extra training-time and hyperparameter tuning to work.
2 Background and Related work
There are several works that describe quantization and improving networks for lower bit inference and deployment [9, 10, 16, 34]. These methods all rely strongly on fine-tuning, making them level 3 methods, whereas data-free quantization improves performance similarly without that requirement. Our method could be used as a pre-processing step for the above methods, applied before fine-tuning, to improve quantization.
The only work we have found that improves INT8 quantization without the need for data is a whitepaper by Krishnamoorthi that describes how having an activation range per channel alleviates the problems later discussed in Section 3.1. A major drawback of this method is that it is not supported on all hardware, and that it creates unnecessary overhead in the computation because scale and offset values are needed for each channel individually. We show that our method improves on per-channel quantization, while keeping a single set of scale and offset values for the whole weight tensor instead.
Other methods to improve quantization need architecture changes or training with quantization in mind from the start [1, 21, 31, 32, 35]. These methods are even more involved than doing quantization and fine-tuning. They also incur a relatively large overhead during training because of sampling and noisy optimization, and introduce extra hyperparameters to optimize. This makes them level 4 methods.
Methods that binarize [5, 15, 27, 28] or ternarize  networks operate at great efficiency for inference as expensive multiplications and additions are replaced by bit-shift operations. However, quantizing models to binary often leads to strong performance degradation. Generally they need to be trained from scratch as well. Thus, binarization methods are also level 4 methods.
Other approaches use low-bit floating point operations, or other custom quantization implementations [8, 17, 24, 34]. Our approach could also work to improve the results for models quantized with such custom floating point formats. Other approaches use codebooks , which put stringent restrictions on the hardware for an efficient implementation. We do not consider codebooks in our approach.
In work developed concurrently with ours,  also exploits the scale invariance of the ReLU function to rescale weight channels and notice the biased error introduced by weight quantization for certain models, leading to a method that resembles our data-free quantization approach. Our work was developed independently from theirs.
While many trained FP32 models can be quantized to INT8 without much loss in performance, some models exhibit a significant drop in performance after quantization ([18, 31]). For example, when quantizing a trained MobileNetV2  model, Krishnamoorthi  reports a drop in top-1 accuracy from 70.9% to 0.1% on the ImageNet  validation set. The author restores near original model performance by either applying per-channel quantization, fine-tuning or both. Per-channel quantization defines a separate quantization range for each output channel by scaling weights before usage.
3.1 Weight tensor channel ranges
The fact that per-channel quantization yields much better performance on MobileNetV2 than per-tensor quantization suggests that, in some layers, the weight distributions differ so strongly between output channels that the same set of quantization parameters cannot be used to quantize the full weight tensor effectively. For example, in the case where one channel has weights in the range $[-128, 128]$ and another channel has weights in the range $(-0.5, 0.5)$, the weights in the latter channel will all be quantized to $0$ when quantizing to 8-bits.
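The effect of a shared quantization range on a channel with small weights can be reproduced with a few lines of code. The sketch below is illustrative only: the weight values are made up, `fake_quantize` is a hypothetical helper, and the range is simply set to the minimum and maximum of the values being quantized.

```python
def fake_quantize(values, lo, hi, num_bits=8):
    """Quantize to a uniform grid on [lo, hi] and map the result back to floats."""
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    zero_point = round(-lo / scale)
    out = []
    for v in values:
        q = min(max(round(v / scale) + zero_point, 0), levels)
        out.append(scale * (q - zero_point))
    return out


wide = [-128.0, -37.5, 4.2, 128.0]   # channel with a large weight range
narrow = [-0.45, -0.2, 0.1, 0.4]     # channel with a small weight range

# Per-tensor quantization: one range shared by both channels.
lo = min(wide + narrow)
hi = max(wide + narrow)
per_tensor_narrow = fake_quantize(narrow, lo, hi)

# Per-channel quantization: the narrow channel gets its own range.
per_channel_narrow = fake_quantize(narrow, min(narrow), max(narrow))
```

With the shared range the grid step is roughly 1, so every weight of the narrow channel collapses to zero, whereas with its own range the same channel is represented with an error of at most half of a much smaller step.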
Figure 2 shows that large differences in output channel weight ranges do indeed occur in a (trained) MobileNetV2 model. This figure shows the weight distribution of the output channel weights of the depthwise-separable layer in the model’s first inverted residual block. Due to the strong differences between channel weight ranges that this layer exhibits, it cannot be quantized with reasonable accuracy for each channel. Many layers in the network suffer from this problem, making the overall model very difficult to quantize.
We conjecture that performance of trained models after quantization can be improved by adjusting the weights for each output channel such that their ranges are more similar. We provide a level 1 method to achieve this without changing the FP32 model output in section 4.1.
3.2 Biased quantization error
A common assumption in literature (e.g. ) is that quantization error of weights is unbiased and thus cancels out in a layer’s output, ensuring that a quantized layer’s expected output does not change from the original FP32 output. However, if the quantization error is biased, either for the whole layer or for certain output channels in a weight tensor, the layer’s expected output changes, causing unpredictable effects in next layers.
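The error between the FP32 weights $W$ of a fully connected layer and the quantized weights $\widetilde{W}$ can be computed as $\epsilon = \widetilde{W} - W$. In case the (biased) errors in the weights for an output unit $j$ do not cancel out, i.e. $\sum_i \epsilon_{ji} \neq 0$, there will be a biased error in output unit $j$.
The biased error in a quantized layer’s output unit $j$ can be computed empirically using $N$ input data points as:
$$\mathbb{E}[\widetilde{y}_j - y_j] \approx \frac{1}{N} \sum_{n=1}^{N} \left[ (\widetilde{W} x_n)_j - (W x_n)_j \right] \qquad (1)$$
where $y_j$ and $\widetilde{y}_j$ are the original outputs and the outputs generated using the quantized weight matrix, respectively.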
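The empirical estimate of the biased error can be sketched as follows: average, over a set of inputs, the difference between the output of the quantized layer and the output of the original layer. The example uses a toy layer with 9 weights per output, non-negative (post-ReLU-like) random inputs, and a symmetric per-tensor quantizer; all of these are assumptions made for illustration, and the helper names are our own.

```python
import random


def matvec(W, x):
    """Compute W x for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def quantize_weights(W, num_bits=8):
    """Symmetric per-tensor fake quantization (assumed quantizer for this sketch)."""
    levels = 2 ** (num_bits - 1) - 1
    scale = max(abs(w) for row in W for w in row) / levels
    return [[scale * round(w / scale) for w in row] for row in W]


def empirical_bias(W, W_q, inputs):
    """Average of (W_q x - W x) per output unit over all inputs."""
    n = len(inputs)
    bias = [0.0] * len(W)
    for x in inputs:
        y, y_q = matvec(W, x), matvec(W_q, x)
        for j in range(len(W)):
            bias[j] += (y_q[j] - y[j]) / n
    return bias


random.seed(0)
W = [[random.gauss(0.0, 1.0) for _ in range(9)] for _ in range(4)]
W[0] = [20.0 * w for w in W[0]]  # one wide channel makes the shared grid coarse
inputs = [[abs(random.gauss(0.0, 1.0)) for _ in range(9)] for _ in range(1000)]

W_q = quantize_weights(W)
bias = empirical_bias(W, W_q, inputs)
```

Because the layer is linear, the same numbers are obtained by multiplying the weight error by the mean input, which is the observation that the data-free variant of bias correction builds on.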
Figure 3 shows the biased error per channel of a depthwise-separable convolution layer in a trained MobileNetV2 model. From this plot it is clear that for many channels in the layer’s output, the error introduced by weight quantization is biased, and influences the output statistics. The depthwise-separable layers in MobileNet are especially sensitive to this biased error effect, as each output channel has only 9 corresponding weights.
Such a biased error on the outputs can also be introduced in many settings, e.g. when weights or activations are clipped , or in non-quantization approaches, such as weight tensor factorization or channel pruning [13, 33].
In section 4.2 we introduce a method to correct for this bias. Furthermore, we show that a model’s batch normalization parameters can be used to compute the expected biased error on the output, yielding a level 1 method to fix the biased error introduced by quantization.
Our proposed data-free quantization method (DFQ) consists of three steps, on top of the normal quantization. The overall flow of the algorithm is shown in Figure 4.
4.1 Cross-layer range equalization
Positive scaling equivariance
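We observe that for a ReLU activation function $f(\cdot)$ the following scaling equivariance property holds:
$$f(sx) = s f(x) \qquad (2)$$
for any non-negative real number $s$. This follows from the definition of the ReLU:
$$\mathrm{ReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$
since $sx > 0$ if and only if $x > 0$ for $s > 0$. This equivariance also holds for the PReLU activation function.
More generally, the positive scaling equivariance can be relaxed to $f(sx) = s\hat{f}(x)$ for any piece-wise linear activation function:
$$f(x) = \begin{cases} a_1 x + b_1 & \text{if } x \leq c_1 \\ a_2 x + b_2 & \text{if } c_1 < x \leq c_2 \\ \quad \vdots \\ a_n x + b_n & \text{if } c_{n-1} < x \end{cases}$$
where $\hat{f}(\cdot)$ is parameterized as $\hat{a}_i = a_i$, $\hat{b}_i = b_i / s$ and $\hat{c}_i = c_i / s$. Note that contrary to the equivariance defined in eq. 2 we now also change the function $f(\cdot)$ into $\hat{f}(\cdot)$.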
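A quick numerical check makes the difference between the two properties concrete. For ReLU, scaling the input by a positive factor s scales the output by s. For ReLU6, taken here as an example of a piece-wise linear function, the property only holds if the clipping point is divided by s as well. The helper `relu6_hat` is a hypothetical name for that rescaled function.

```python
def relu(x):
    return max(x, 0.0)


def relu6(x, cap=6.0):
    """ReLU6 as a piece-wise linear function with breakpoints at 0 and `cap`."""
    return min(max(x, 0.0), cap)


def relu6_hat(x, s):
    """Rescaled ReLU6: same slopes, breakpoints divided by s."""
    return relu6(x, cap=6.0 / s)


xs = [-3.0, -0.5, 0.0, 0.4, 2.0, 5.9, 6.0, 11.0]
scales = [0.25, 1.0, 3.0, 7.5]

# ReLU: f(s x) == s f(x)
relu_gap = max(abs(relu(s * x) - s * relu(x)) for s in scales for x in xs)
# ReLU6: f(s x) == s f_hat(x), with the clipping point moved to 6 / s
relu6_gap = max(abs(relu6(s * x) - s * relu6_hat(x, s)) for s in scales for x in xs)
# ReLU6 without changing the function: the property is violated
naive_gap = max(abs(relu6(s * x) - s * relu6(x)) for s in scales for x in xs)
```

The first two gaps vanish up to floating-point rounding, while the third does not, which is why the activation function itself has to be changed in the relaxed case.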
4.1.1 Scaling equivariance in neural networks
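The above introduced positive scaling equivariance can be exploited in consecutive layers in neural networks. Given two layers, $h = f(W^{(1)} x + b^{(1)})$ and $y = f(W^{(2)} h + b^{(2)})$, through scaling equivariance we have that:
$$y = f(W^{(2)} f(W^{(1)} x + b^{(1)}) + b^{(2)}) = f(W^{(2)} S \hat{f}(S^{-1} W^{(1)} x + S^{-1} b^{(1)}) + b^{(2)}) = f(\hat{W}^{(2)} \hat{f}(\hat{W}^{(1)} x + \hat{b}^{(1)}) + b^{(2)})$$
where $S = \mathrm{diag}(s)$ is a diagonal matrix with value $S_{ii}$ denoting the scaling factor $s_i$ for neuron $i$. This allows us to reparameterize our model with $\hat{W}^{(2)} = W^{(2)} S$, $\hat{W}^{(1)} = S^{-1} W^{(1)}$ and $\hat{b}^{(1)} = S^{-1} b^{(1)}$. In case of CNNs the scaling will be per channel and broadcast accordingly over the spatial dimensions. The rescaling procedure is illustrated in Figure 5.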
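The reparameterization can be checked on a toy two-layer ReLU network: dividing row i of the first weight matrix and entry i of its bias by a positive factor, and multiplying column i of the second weight matrix by the same factor, leaves the output unchanged. The sketch below uses small random fully connected layers; the function names and the chosen scaling factors are arbitrary assumptions.

```python
import random


def relu(v):
    return [max(a, 0.0) for a in v]


def affine(W, b, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def two_layer(W1, b1, W2, b2, x):
    return relu(affine(W2, b2, relu(affine(W1, b1, x))))


def rescale(W1, b1, W2, s):
    """Return (S^-1 W1, S^-1 b1, W2 S) for S = diag(s) with all s_i > 0."""
    W1_hat = [[w / si for w in row] for row, si in zip(W1, s)]  # scale rows of W1
    b1_hat = [bi / si for bi, si in zip(b1, s)]
    W2_hat = [[w * si for w, si in zip(row, s)] for row in W2]  # scale columns of W2
    return W1_hat, b1_hat, W2_hat


random.seed(0)
W1 = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
b1 = [random.gauss(0, 1) for _ in range(4)]
W2 = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2)]
b2 = [random.gauss(0, 1) for _ in range(2)]
s = [0.1, 0.5, 2.0, 30.0]
x = [0.7, -1.2, 0.4]

W1_hat, b1_hat, W2_hat = rescale(W1, b1, W2, s)
y = two_layer(W1, b1, W2, b2, x)
y_hat = two_layer(W1_hat, b1_hat, W2_hat, b2, x)
```

Here `y` and `y_hat` agree up to floating-point rounding even though the per-channel weight ranges of the two parameterizations differ by up to a factor of 30.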
4.1.2 Equalizing ranges over multiple layers
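We can exploit the rescaling and reparameterization of the model to make the model more robust to quantization. Ideally the range of each channel is equal to the total range of the weight tensor, meaning we use the best possible representative power per channel. We define the precision of channel $i$ in the rescaled weight tensor $\hat{W}^{(1)}$ as the ratio between the range of the channel and the total range of the tensor:
$$\hat{p}_i^{(1)} = \frac{\hat{r}_i^{(1)}}{\hat{R}^{(1)}}$$
where $\hat{r}_i^{(1)}$ is the range of channel $i$ in $\hat{W}^{(1)}$ and $\hat{R}^{(1)}$ is the total range of the tensor; $\hat{p}_i^{(2)}$ is defined analogously for $\hat{W}^{(2)}$. We want to find $S$ such that the total precision per channel is maximized:
$$\max_S \sum_i \hat{p}_i^{(1)} \hat{p}_i^{(2)}$$
This leads to the necessary condition $\arg\max_j \frac{1}{s_j} r_j^{(1)} = \arg\max_k s_k r_k^{(2)}$, meaning the limiting channel defining the quantization range is given by $\arg\max_i r_i^{(1)} r_i^{(2)}$. We can satisfy this condition by setting $S$ such that
$$s_i = \frac{1}{r_i^{(2)}} \sqrt{r_i^{(1)} r_i^{(2)}}$$
which results in $\hat{r}_i^{(1)} = \hat{r}_i^{(2)}$ for all $i$. Thus the channels’ ranges between both tensors are matched as closely as possible.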
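A minimal sketch of the equalization step for one pair of fully connected layers is given below. It assumes that the range of a channel is twice its largest absolute weight (a symmetric range, chosen here for simplicity) and that no channel is entirely zero; the scaling factor for channel i is then the square root of the product of the two channel ranges, divided by the range of that channel in the second layer. Function names and weights are hypothetical.

```python
import math


def ranges_out(W1):
    """Range of each output channel (row) of W1, taken as 2 * max|w|."""
    return [2 * max(abs(w) for w in row) for row in W1]


def ranges_in(W2, n_channels):
    """Range of each input channel (column) of W2, taken as 2 * max|w|."""
    return [2 * max(abs(row[i]) for row in W2) for i in range(n_channels)]


def equalization_scales(W1, W2):
    """s_i = sqrt(r1_i * r2_i) / r2_i for every channel shared by W1 and W2."""
    r1, r2 = ranges_out(W1), ranges_in(W2, len(W1))
    return [math.sqrt(a * b) / b for a, b in zip(r1, r2)]


def apply_scales(W1, W2, s):
    W1_hat = [[w / si for w in row] for row, si in zip(W1, s)]  # S^-1 W1
    W2_hat = [[w * si for w, si in zip(row, s)] for row in W2]  # W2 S
    return W1_hat, W2_hat


W1 = [[0.02, -0.01], [4.0, -8.0], [0.5, 0.25]]  # 3 output channels
W2 = [[1.0, 0.1, -2.0], [-3.0, 0.05, 0.5]]      # 3 input channels
s = equalization_scales(W1, W2)
W1_hat, W2_hat = apply_scales(W1, W2, s)
r1_hat = ranges_out(W1_hat)
r2_hat = ranges_in(W2_hat, len(W1_hat))
```

After rescaling, `r1_hat` and `r2_hat` coincide channel by channel. In a full implementation the bias of the first layer would be divided by the same factors, which this weight-only sketch omits.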
When equalizing multiple layers at the same time, we iterate this process for pairs of layers that are connected to each other without input or output splits in between, until convergence.
4.1.3 Absorbing high biases
Running the equalization procedure on the weights potentially increases the biases of the layers. This in turn increases the range of the activation quantization. In order to avoid big differences between per-channel ranges in the activations we introduce a procedure that absorbs high biases into the subsequent layer. For a layer with ReLU function $r$, there is a non-negative vector $c$ such that $r(Wx + b - c) = r(Wx + b) - c$ for all values of $x$. This is the vector with the minimal activation value of each output of the layer. This is the amount that we can absorb from layer 1 into layer 2, as described below:
$$y = W^{(2)} h + b^{(2)} = W^{(2)} (r(W^{(1)} x + b^{(1)}) + c - c) + b^{(2)} = W^{(2)} (r(W^{(1)} x + \hat{b}^{(1)}) + c) + b^{(2)} = W^{(2)} \hat{h} + \hat{b}^{(2)}$$
where $\hat{b}^{(2)} = W^{(2)} c + b^{(2)}$, $\hat{h} = h - c$, and $\hat{b}^{(1)} = b^{(1)} - c$.
Without data we cannot find $c$ exactly. Therefore, we approximate $c$ based on the batch normalization statistics as $c = \max(0, \beta - 3\gamma)$, where $\beta$ and $\gamma$ are the resulting mean and standard deviation from the batch normalization layer before applying ReLU. Under the assumption that the pre-bias activations are distributed as $\mathcal{N}(\beta, \gamma^2)$, approximately only 0.135% of the values are smaller than $c$ and would be clipped due to the shift by $c$. As we will show in section 5.1.1, this approximation does not harm the full precision performance significantly but helps for activation quantization.
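The absorption step can be sketched for two small fully connected layers. The amount moved is the mean minus three standard deviations of the batch normalization statistics, floored at zero; it is subtracted from the first layer's bias and added, after multiplication by the second layer's weights, to the second layer's bias. All weights and statistics below are made-up values, and the helper names are our own.

```python
import math


def relu(v):
    return [max(a, 0.0) for a in v]


def affine(W, b, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def absorb_high_bias(b1, W2, b2, beta, gamma):
    """Move c = max(0, beta - 3 * gamma) from layer 1's bias into layer 2's bias."""
    c = [max(0.0, m - 3.0 * sd) for m, sd in zip(beta, gamma)]
    b1_hat = [bi - ci for bi, ci in zip(b1, c)]
    b2_hat = affine(W2, b2, c)  # W2 c + b2
    return c, b1_hat, b2_hat


def normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [5.0, 0.2]
W2 = [[0.5, -1.0], [2.0, 0.3]]
b2 = [0.1, -0.1]
beta, gamma = [5.0, 0.2], [1.0, 0.5]  # hypothetical batch norm statistics

c, b1_hat, b2_hat = absorb_high_bias(b1, W2, b2, beta, gamma)


def forward(b1_used, b2_used, x):
    return affine(W2, b2_used, relu(affine(W1, b1_used, x)))


x_typical = [0.3, -0.4]  # pre-activations stay above c: output is unchanged
x_rare = [-4.0, 0.0]     # first pre-activation falls below c: output changes
clipped_fraction = normal_cdf(-3.0)
```

For `x_typical` the original and the absorbed network give the same output, while `x_rare` shows the small approximation error the procedure accepts; under the normality assumption such inputs account for about 0.135% of the cases per channel.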
4.2 Quantization bias correction
As shown empirically in the motivation, quantization introduces a biased error in the activations. In this section we show how to correct for the biased error on the layer’s output, and how we can use the network’s batch normalization parameters to compute the expected biased error so that the method is level 1.
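For a fully connected layer with weight tensor $W$, quantized weights $\widetilde{W}$, and input activations $x$, we have $\widetilde{y} = \widetilde{W} x$ and therefore $\widetilde{y} = y + \epsilon x$, where we define the quantization error $\epsilon = \widetilde{W} - W$, $y$ as the layer pre-activations of the FP32 model, and $\widetilde{y}$ as that layer with quantization error added.
If the expectation of the error for neuron $i$ is non-zero, i.e. $\mathbb{E}[\epsilon x]_i \neq 0$, then the mean of the activation of neuron $i$ will change. This change in distribution can often lead to strongly detrimental behavior in the following layers. We can correct for this change by seeing that
$$\mathbb{E}[y] = \mathbb{E}[y] + \mathbb{E}[\epsilon x] - \mathbb{E}[\epsilon x] = \mathbb{E}[\widetilde{y}] - \mathbb{E}[\epsilon x]$$
Thus, subtracting the expected error on the output $\mathbb{E}[\epsilon x] = \epsilon\, \mathbb{E}[x]$ from the biased output $\widetilde{y}$ ensures that the mean for each output unit is preserved.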
For implementation, the expected error can be subtracted from the layer’s bias parameter, since the expected error vector has the same shape as the layer’s output. This method easily extends to convolutional layers as described in Appendix B.
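The correction can be sketched for a small fully connected layer as follows. The weight error (quantized minus original weights) is multiplied by the expected input and the result is subtracted from the bias. In this toy example the expected input is simply the mean of three made-up inputs and the quantizer is a coarse grid with step 0.5; both are assumptions for illustration.

```python
def affine(W, b, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def correct_bias(W, W_q, b, expected_x):
    """Return b - eps E[x], with eps = W_q - W."""
    corrected = []
    for row, row_q, bi in zip(W, W_q, b):
        expected_error = sum((wq - w) * ex for w, wq, ex in zip(row, row_q, expected_x))
        corrected.append(bi - expected_error)
    return corrected


def mean_output(W, b, inputs):
    outs = [affine(W, b, x) for x in inputs]
    return [sum(col) / len(outs) for col in zip(*outs)]


W = [[0.26, -0.74, 1.90], [0.10, 0.20, -0.30]]
W_q = [[round(w * 2) / 2 for w in row] for row in W]  # toy grid with step 0.5
b = [0.05, -0.02]
inputs = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0], [2.0, 1.0, 1.0]]
expected_x = [sum(col) / len(inputs) for col in zip(*inputs)]

b_corr = correct_bias(W, W_q, b, expected_x)
fp32_mean = mean_output(W, b, inputs)
biased_mean = mean_output(W_q, b, inputs)
corrected_mean = mean_output(W_q, b_corr, inputs)
```

The mean output of the quantized layer with the corrected bias matches the FP32 mean, whereas the uncorrected quantized layer is shifted. Only the mean is restored; the per-input error is not removed.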
4.2.1 Computing the expected input
To compute the expected error of the output of a layer, the expected input to the layer $\mathbb{E}[x]$ is required. If a model does not use batch normalization, or there are no data-usage restrictions, $\mathbb{E}[\epsilon x]$ can simply be computed by comparing the activations before and after quantization. Appendix D explains this procedure in more detail.
Clipped normal distribution
When the network includes batch normalization before a layer, we can use it to calculate $\mathbb{E}[x]$ for that layer without using data. We assume the pre-activation outputs of a layer are normally distributed, that batch normalization is applied before the activation function, and that the activation function is some form of the class of clipped linear activation functions (e.g. ReLU, ReLU6), which clips its input range to the range $[a, b]$ where $a < b$, and $b$ can be $\infty$.
Due to the centralization and normalization applied by batch normalization, the mean and standard deviation of the pre-activations are known: these are the batch normalization shift and scale parameters (henceforth referred to as $\beta$ and $\gamma$ respectively).
To compute $\mathbb{E}[x]$ from the previous layer’s batch normalization parameters, the mean and variance need to be adjusted to account for the activation function that follows the batch normalization layer. For this purpose we introduce the clipped normal distribution. A clipped-normally distributed random variable $X$ is a normally distributed random variable with mean $\mu$ and variance $\sigma^2$, whose values are clipped to the range $[a, b]$. The mean and variance of the clipped normal distribution can be computed in closed form from $\mu$, $\sigma$, $a$ and $b$. We present the mean of the clipped normal distribution for the ReLU activation function, i.e. $a = 0$ and $b = \infty$, in this section, and refer the reader to Appendix C for the closed form solution for the general clipped normal distribution.
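The expected value for channel $c$ in $x$, $\mathbb{E}[x_c]$, which is the output of a layer with batch normalization parameters $\beta_c$ and $\gamma_c$, followed by a ReLU activation function is:
$$\mathbb{E}[x_c] = \mathbb{E}[\mathrm{ReLU}(x_c^{pre})] = \gamma_c\, \mathcal{N}\!\left(\frac{-\beta_c}{\gamma_c}\right) + \beta_c \left[1 - \Phi\!\left(\frac{-\beta_c}{\gamma_c}\right)\right]$$
where $x_c^{pre}$ is the pre-activation output for channel $c$, which is assumed to be normally distributed with mean $\beta_c$ and variance $\gamma_c^2$, $\Phi(\cdot)$ is the normal CDF, and the notation $\mathcal{N}(x)$ is used to denote the standard normal PDF.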
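The closed-form mean of a normally distributed variable passed through a ReLU can be evaluated with the standard library alone. The sketch below takes the mean and standard deviation of the pre-activation (the batch normalization shift and scale) and returns the standard deviation times the standard normal density at minus mean over standard deviation, plus the mean times one minus the standard normal CDF at the same point. The function name is our own, and the standard deviation is assumed to be strictly positive.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def expected_relu_output(beta, gamma):
    """Mean of ReLU(X) for X ~ N(beta, gamma^2), with gamma > 0."""
    z = -beta / gamma
    return gamma * normal_pdf(z) + beta * (1.0 - normal_cdf(z))


# Per-channel expected inputs for the next layer, from hypothetical BN parameters.
betas = [0.0, 0.5, 10.0, -2.0]
gammas = [1.0, 2.0, 1.0, 0.5]
expected_x = [expected_relu_output(b, g) for b, g in zip(betas, gammas)]
```

For a channel whose mean lies far above zero the result is essentially the mean itself, while for a channel whose mean lies far below zero it approaches zero, as expected for a ReLU.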
In this section we present two sets of experiments to validate the performance of data-free quantization (DFQ). The order in which the procedures of DFQ are applied can be seen in Figure 4. We first show in section 5.1 the effect of the different aspects of DFQ and how they solve the problems observed earlier. Then we show in section 5.2 how DFQ generalizes to other models and tasks, and sets a new state-of-the-art for level 1 quantization. In all experiments we use asymmetric quantization for both the baseline and DFQ. Before our DFQ procedure we apply batch-normalization folding. Unless otherwise stated, we do not do any fine-tuning after quantization. All experiments are done in PyTorch.
5.1 Ablation study
In this section we investigate the effect of our methods on a pre-trained MobileNetV2 model (we use the PyTorch implementation of MobileNetV2 provided by https://github.com/tonylins/pytorch-mobilenet-v2). We validate the performance of the model on the ImageNet validation set. We first investigate the effects of different parts of our approach through a set of ablation studies.
5.1.1 Cross-layer equalization
In this section we investigate the effects of cross-layer equalization. We compare the equalization to two baselines, the original quantized model as well as the less hardware friendly per-channel quantization. Then we show how folding high biases further improves results.
The models we consider employ residual connections. For these networks we apply cross-layer equalization only to the three layers within each residual block. MobileNetV2 also uses ReLU6 activation functions, which clip the linear part of the ReLU activation at 6. In practice we observed that ReLU6 can be replaced with ReLU without a significant drop in performance, making the equalization procedure easier to apply as described in section 4.1.
The results of the equalization experiments are shown in Table 1. Similar to Krishnamoorthi, we observe that the model performance is close to random when quantizing the original model to INT8. Further, we note that replacing ReLU6 by ReLU does not significantly degrade the model performance. Applying equalization brings us to within 2% of the original model’s performance, close to the performance of per-channel quantization. We note that absorbing high biases is not lossless (a drop of 0.2% in floating point), but it boosts the quantized performance by 1% due to more precise activation quantization. Combining both methods improves performance over per-channel quantization, indicating that the more efficient per-tensor quantization could be used instead.
| Model | FP32 | INT8 |
|---|---|---|
| + absorbing bias | 71.57% | 70.92% |
| Per channel quantization | 71.72% | 70.65% |
To illustrate the effect of the cross-layer equalization, we show in Figure 3 the weight distributions per output channel of the depthwise-separable layer in the model’s first inverted residual block after applying the equalization. We observe that most channel ranges are now similar and that there are no strong outliers anymore. Note that there are still several channels which have all weight values close to zero. These channels convey little information and can be pruned from the network with hardly any loss in accuracy.
5.1.2 Bias correction
In this section we present results on bias correction for a quantized MobileNetV2 model. We furthermore present results of bias correction in combination with a naive weight-clipping baseline, and combined with the cross-layer equalization approach.
The weight-clipping baseline serves two functions: 1) as a naive baseline to the cross-layer equalization approach, and 2) to show that bias correction can be employed in any setting where biased noise is introduced. This could, for example, also be the case for tensor-factorization or channel pruning approaches to model compression [13, 33]. Weight clipping solves the problem of large differences in ranges between channels by clipping large ranges to smaller ranges, but it introduces a strongly biased error. Weight clipping is applied by first folding the batch normalization parameters into a layer’s weights, and then clipping all values to a certain range, in this case $[-15, 15]$. We tried multiple symmetric ranges, all of which provided similar results. For residual connections we calculate $\mathbb{E}[x]$ and $\mathrm{Var}[x]$ based on the sum and variance of all input expectations, taking the input to be zero mean and unit variance.
To illustrate the effect of bias correction, Figure 3 shows the per output channel biased error introduced by weight quantization. The per-channel biases are obtained as described in eq. 1. This figure shows that applying bias correction reduces the bias in the error on the output of a layer to very close to 0 for most output channels.
| Model | FP32 | INT8 |
|---|---|---|
| Clip @ 15 | 67.06% | 2.55% |
| + Bias Corr | 71.15% | 70.43% |
| Rescaling + Bias Absorption | 71.57% | 70.92% |
| + Bias Corr | 71.57% | 71.19% |
Results for the experiments described above for MobileNetV2 on the ImageNet validation set are shown in Table 2. Applying bias correction improves quantized model performance, indicating that a part of the problem of quantizing this model lies in the biased error that is introduced. However, bias correction on its own does not achieve near-floating point performance. The reason for this is most likely that the problem described in 3.1 is more severe for this model. The experiments on weight-clipping show that bias correction can mitigate performance degradation due to biased error in non-quantized models as well as quantized models. Clipping without correction in the FP32 model introduces a 4.66% loss in accuracy, and correction reduces that loss to a mere 0.57%. Furthermore, it shows that, while naive, weight clipping combined with bias correction is a fairly strong baseline for quantization for this model. Lastly, we show that bias correction improves results when combined with the cross-layer equalization and bias folding procedures. The combination of all methods is our data-free quantization (DFQ) method. The full DFQ approach achieves near-floating point performance with a loss of only 0.53% top-1 accuracy.
5.2 Comparison to other methods and models
In this section we show how DFQ generalizes to other popular computer vision tasks, namely semantic segmentation and object detection, and other model architectures such as MobileNetV1  and Resnet18 . Afterwards we compare DFQ to methods in the literature, including more complex level 3 and 4 approaches. This set of models was chosen as they are efficient and likely to be used in mobile applications where 8-bit quantization is frequently used for power efficiency.
5.2.1 Other tasks
To demonstrate the generalization of our method to semantic segmentation we apply DFQ to DeeplabV3+ with a MobileNetV2 backend [3, 30]. Performance is evaluated on the Pascal VOC segmentation challenge. For our experiments we use the publicly available PyTorch implementation (https://github.com/jfzhang95/pytorch-deeplab-xception).
We show the results of this experiment in Table 3. As observed earlier for classification we notice a significant drop in performance when quantizing the original model which makes it almost unusable in practice. Applying DFQ recovers almost all performance degradation and achieves less than 1% drop in mIOU compared to the full precision model. DFQ also outperforms the less hardware friendly per-channel quantization. To the best of our knowledge we are the first to publish quantization results on DeeplabV3+ as well as for semantic segmentation.
|QT  ^||✗||✗||✓||71.9%||70.9%||70.9%||70.0%||-||70.3%||67.3%|
To demonstrate the applicability of our method to object detection we apply DFQ to MobileNetV2 SSDLite [30, 20], evaluated on the Pascal VOC object detection challenge. In our experiments we use the publicly available PyTorch implementation of SSD (https://github.com/qfgaohao/pytorch-ssd).
The results are listed in Table 4. Similar to semantic segmentation we observe a significant drop in performance when quantizing the SSDLite model. Applying DFQ recovers almost all performance drop and achieves less than 1% drop in mAP compared to the full precision model, again outperforming per-channel quantization.
5.2.2 Comparison to other approaches
In this section we compare DFQ to other approaches in literature. We compare our results to two other level 1 approaches, direct per-layer quantization as well as per-channel quantization . In addition we also compare to multiple higher level approaches, namely quantization aware training  as well as stochastic rounding and dynamic ranges [9, 10], which are both level 3 approaches. We also compare to two level 4 approaches based on relaxed quantization , which involve training a model from scratch and to quantization friendly separable convolutions  that require a rework of the original MobileNet architecture. The results are summarized in Table 5.
For both MobileNetV1 and MobileNetV2 per-layer quantization results in an unusable model whereas DFQ stays close to full precision performance. DFQ also outperforms per-channel quantization as well as all level 3 and 4 approaches which require significant fine-tuning, training or even architecture changes.
On ResNet18 we maintain full precision performance for 8-bit fixed point quantization using DFQ. Some higher level approaches [16, 21] report slightly higher results than even our baseline model, likely because of a better training procedure than that used for the standard PyTorch ResNet18 model. Since 8-bit quantization is lossless we also compare 6-bit results. DFQ clearly outperforms traditional per-layer quantization but stays slightly below per-channel quantization and higher level approaches such as QT and RQ [16, 21].
Overall DFQ sets a new state-of-the-art for 8-bit fixed point quantization on several models and computer vision tasks. It is especially strong for mobile friendly architectures such as MobileNetV1 and MobileNetV2 which were previously hard to quantize. Even though DFQ is a simple level 1 approach, we generally show competitive performance when comparing to more complex higher level approaches.
In this work, we introduced DFQ, a data-free quantization method that helps quantized model performance significantly without the need for data, fine-tuning or hyper-parameter optimization. The method can be applied to many common computer vision architectures with a straight-forward API call. This is crucial for many practical applications where engineers want to deploy deep learning models trained in FP32 to INT8 hardware without much effort. Results are presented for common computer vision tasks like image classification, semantic segmentation and object detection. We show that our method compares favorably to per-channel quantization , meaning that the more efficient per-tensor quantization can be employed in practice instead. DFQ achieves near original model accuracy for almost every model we tested, and even competes with more complicated training-procedure based methods.
We also introduced a set of quantization levels to facilitate the discussion on the applicability of quantization methods. Methods differ in how simple they are to use for generating a quantized model, and this ease of use is a significant part of the impact potential of a quantization method in real world applications. Hopefully more work can be done in this area to make INT8 quantization ubiquitous, greatly reducing the energy consumption and latency of models run on device and in the cloud.
We would like to thank Christos Louizos, Harris Teague, Jakub Tomczak, Mihir Jain and Pim de Haan for their helpful discussions and valuable feedback.
Appendix A Optimal range equalization of two layers
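Consider two fully-connected layers with weight matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$, that we scale as in 4.1. We investigate the problem of optimizing the quantization ranges by rescaling the weight matrices by $\mathbf{S} = \mathrm{diag}(\mathbf{s})$, where $s_i > 0$, such that $\widehat{\mathbf{W}}^{(1)} = \mathbf{S}^{-1}\mathbf{W}^{(1)}$ and $\widehat{\mathbf{W}}^{(2)} = \mathbf{W}^{(2)}\mathbf{S}$ are the weight matrices after rescaling. We investigate the case of symmetric quantization, which also gives good results in practice for asymmetric quantization. We denote
$$\hat{\mathbf{r}}^{(1)} = \mathbf{S}^{-1}\mathbf{r}^{(1)}, \qquad \hat{\mathbf{r}}^{(2)} = \mathbf{S}\,\mathbf{r}^{(2)}, \qquad \hat{R}^{(k)} = \max_i \hat{r}^{(k)}_i, \quad k \in \{1, 2\},$$
where $\hat{\mathbf{r}}^{(k)}$ is the per-channel weight quantization range that is scaled by $\mathbf{S}$, $\hat{R}^{(k)}$ the range for the full weight matrix and $\mathbf{r}^{(k)}$ are the original unscaled ranges.
Using this in our optimization goal of eq. 6 leads to
$$\max_{\mathbf{S}} \sum_i \hat{p}^{(1)}_i \hat{p}^{(2)}_i = \max_{\mathbf{S}} \sum_i \frac{\hat{r}^{(1)}_i \hat{r}^{(2)}_i}{\hat{R}^{(1)} \hat{R}^{(2)}} = \max_{\mathbf{S}} \sum_i \frac{\frac{1}{s_i} r^{(1)}_i \, s_i r^{(2)}_i}{\max_j \left(\frac{1}{s_j} r^{(1)}_j\right) \cdot \max_k \left(s_k r^{(2)}_k\right)}.$$
We observe that the specific scaling $s_i$ of each channel cancels out as long as it does not increase $\hat{R}^{(k)}$, the range of the full weight matrix. We can reformulate the above to
$$\min_{\mathbf{S}} \left( \max_i \frac{1}{s_i} r^{(1)}_i \cdot \max_j s_j r^{(2)}_j \right) \tag{24}$$
which is minimal iff
$$\arg\max_i \frac{1}{s_i} r^{(1)}_i = \arg\max_j s_j r^{(2)}_j. \tag{25}$$
By contradiction, if $i \neq j$, there is a small positive $\epsilon$ such that $s'_i = s_i + \epsilon$ decreases $\max_i \frac{1}{s_i} r^{(1)}_i$ without affecting $\max_j s_j r^{(2)}_j$. Therefore such a solution would not be optimal for eq. 24.
The condition from eq. 25 implies there is a limiting channel $i = \arg\max_i r^{(1)}_i r^{(2)}_i$ which defines the quantization range of both weight matrices $\widehat{\mathbf{W}}^{(1)}$ and $\widehat{\mathbf{W}}^{(2)}$. However, our optimization goal is not affected by the choice of the other $s_j$, given that the resulting $\hat{r}^{(1)}_j$ and $\hat{r}^{(2)}_j$ are smaller than $\hat{R}^{(1)}$ and $\hat{R}^{(2)}$, respectively. To break ties between solutions we decide to set $\forall i: \hat{r}^{(1)}_i = \hat{r}^{(2)}_i$. Thus the channels' ranges between both tensors are matched as closely as possible and the introduced quantization error is spread equally among both weight tensors. This results in our final rescaling factor
$$s_i = \frac{1}{r^{(2)}_i} \sqrt{r^{(1)}_i r^{(2)}_i},$$
which satisfies our necessary condition from eq. 25 and ensures that $\forall i: \hat{r}^{(1)}_i = \hat{r}^{(2)}_i$.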
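As an illustration, the following is a minimal sketch of this rescaling for two bias-free fully-connected layers with a ReLU in between, assuming the rescaling factor takes the form $s_i = \sqrt{r^{(1)}_i r^{(2)}_i} / r^{(2)}_i$ and that the symmetric range of a channel is its largest absolute weight. The helper names (`row_ranges`, `col_ranges`, `equalize`) and the toy weights are hypothetical and not taken from any released implementation.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def row_ranges(W):
    # r^(1)_i: range of output channel i of layer 1 (row i of W1).
    return [max(abs(w) for w in row) for row in W]

def col_ranges(W):
    # r^(2)_i: range of input channel i of layer 2 (column i of W2).
    return [max(abs(row[i]) for row in W) for i in range(len(W[0]))]

def equalize(W1, W2):
    # Assumes every channel range is strictly positive.
    r1, r2 = row_ranges(W1), col_ranges(W2)
    s = [math.sqrt(a * b) / b for a, b in zip(r1, r2)]
    W1_hat = [[w / si for w in row] for row, si in zip(W1, s)]    # S^-1 W1
    W2_hat = [[w * si for w, si in zip(row, s)] for row in W2]    # W2 S
    return W1_hat, W2_hat, s

# Toy weights with strongly mismatched channel ranges (illustrative values).
W1 = [[0.5, -0.1], [4.0, 2.0], [0.02, 0.01]]   # 3 outputs, 2 inputs
W2 = [[1.0, 0.1, 8.0], [-2.0, 0.05, 4.0]]      # 2 outputs, 3 inputs
W1_hat, W2_hat, s = equalize(W1, W2)

# ReLU is positively scale-equivariant, so the composed function is unchanged.
x = [0.3, -0.7]
y = matvec(W2, relu(matvec(W1, x)))
y_hat = matvec(W2_hat, relu(matvec(W1_hat, x)))
```

After the call, `row_ranges(W1_hat)` and `col_ranges(W2_hat)` agree channel by channel, while `y` and `y_hat` match up to floating-point rounding. A layer bias, omitted here for brevity, would be divided by the same factors as the rows of `W1`.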
Appendix B Bias Correction for Convolutional Layers
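Similarly to fully connected layers, we can compute the quantization error $\boldsymbol{\epsilon} = \widetilde{\mathbf{W}} - \mathbf{W}$ from $\mathbf{W}$ and $\widetilde{\mathbf{W}}$; it becomes a constant and we have that $\mathbb{E}[\boldsymbol{\epsilon} * \mathbf{x}] = \boldsymbol{\epsilon} * \mathbb{E}[\mathbf{x}]$. Expanding this yields:
$$\left[\boldsymbol{\epsilon} * \mathbb{E}[\mathbf{x}]\right]_{c_o i j} = \sum_{c_i, m, n} \mathbb{E}\left[x_{c_i,\, i-m,\, j-n}\right] \epsilon_{c_o c_i m n} = \sum_{c_i} \mathbb{E}\left[x_{c_i}\right] \sum_{m, n} \epsilon_{c_o c_i m n},$$
where we assume that the expected value of each input channel is the same for all spatial dimensions in the input channel. Since the value of $\left[\boldsymbol{\epsilon} * \mathbb{E}[\mathbf{x}]\right]_{c_o i j}$ does not depend on the spatial dimensions $i$ and $j$, the expected error is the same for the full output channel and can be folded into the layer's bias parameter.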
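The per-output-channel correction can be sketched in a few lines of plain Python, assuming the expected error of output channel $c_o$ is the sum over input channels of the channel mean times the spatial sum of the weight error. The functions `expected_conv_error` and `correct_bias`, the nested-list weight layout `[c_out][c_in][kh][kw]`, and the example numbers are illustrative assumptions, not part of the original method description.

```python
def expected_conv_error(W, W_q, mean_x):
    """Expected output error per output channel.

    W, W_q : float and quantized kernels, nested lists [c_out][c_in][kh][kw]
    mean_x : expected value of each input channel, length c_in
    """
    errors = []
    for w_out, wq_out in zip(W, W_q):
        total = 0.0
        for w_in, wq_in, m in zip(w_out, wq_out, mean_x):
            # Spatial sum of the weight error for this (c_out, c_in) pair.
            eps_sum = sum(q - f
                          for row_f, row_q in zip(w_in, wq_in)
                          for f, q in zip(row_f, row_q))
            total += m * eps_sum
        errors.append(total)
    return errors

def correct_bias(b, W, W_q, mean_x):
    # Fold the expected error into the bias by subtracting it.
    return [b_o - e for b_o, e in zip(b, expected_conv_error(W, W_q, mean_x))]

# One output channel, two input channels, 1x2 kernel (illustrative values).
W   = [[[[0.1, 0.2]],   [[-0.3, 0.4]]]]
W_q = [[[[0.125, 0.25]], [[-0.25, 0.375]]]]
mean_x = [2.0, 1.0]
b_corrected = correct_bias([1.0], W, W_q, mean_x)
```

Because the correction is a single scalar per output channel, it adds no cost at inference time once it has been folded into the bias.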
Appendix C Clipped Normal Distribution
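Given a normally distributed random variable $X$ with mean $\mu$ and variance $\sigma^2$, and a clipped-linear function $f(\cdot)$ that clips its argument to the range $[a, b]$, s.t. $a < b$, the mean and variance of $f(X)$ can be determined using the standard rules of computing the mean and variance of a function:
$$\mu^c_{ab} = \int_{-\infty}^{\infty} f(x)\, p(x)\, dx, \qquad (\sigma^c_{ab})^2 = \int_{-\infty}^{\infty} \left(f(x) - \mu^c_{ab}\right)^2 p(x)\, dx,$$
where we define $p(x) = \mathcal{N}(x \mid \mu, \sigma^2)$, $\mu^c_{ab} = \mathbb{E}[f(X)]$, and $(\sigma^c_{ab})^2 = \mathrm{Var}[f(X)]$.
C.1 Mean of Clipped Normal Distribution
Using the fact that $f(x)$ is constant if $x \notin [a, b]$ we have that:
$$\mu^c_{ab} = a \int_{-\infty}^{a} p(x)\, dx + \int_{a}^{b} x\, p(x)\, dx + b \int_{b}^{\infty} p(x)\, dx.$$
The first and last term can be computed as $a\,\Phi(\alpha)$ and $b\,(1 - \Phi(\beta))$ respectively, where we define $\alpha = \frac{a - \mu}{\sigma}$, $\beta = \frac{b - \mu}{\sigma}$, and $\Phi(\cdot)$, the normal CDF with zero mean and unit variance.
The integral over the linear part of $f$ can be computed as:
$$\begin{aligned}
\int_a^b x\, p(x)\, dx &= C \int_a^b x\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} dx \\
&= C \int_a^b (x - \mu)\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} dx + \mu\, C \int_a^b e^{-\frac{(x-\mu)^2}{2\sigma^2}} dx \\
&= -C \sigma^2 \left[ e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right]_a^b + \mu \left(\Phi(\beta) - \Phi(\alpha)\right) \\
&= \sigma \left(\phi(\alpha) - \phi(\beta)\right) + \mu \left(\Phi(\beta) - \Phi(\alpha)\right),
\end{aligned}$$
where we define $\phi(\cdot) = \mathcal{N}(\cdot \mid 0, 1)$, i.e. the standard normal pdf, and $C = \frac{1}{\sigma\sqrt{2\pi}}$ is the normalization constant for a normal distribution with variance $\sigma^2$. Combining the three terms gives
$$\mu^c_{ab} = \sigma \left(\phi(\alpha) - \phi(\beta)\right) + \mu \left(\Phi(\beta) - \Phi(\alpha)\right) + a\,\Phi(\alpha) + b\,(1 - \Phi(\beta)).$$
C.2 Variance of Clipped Normal Distribution
We again exploit the fact that $f(x)$ is constant if $x \notin [a, b]$:
$$(\sigma^c_{ab})^2 = \int_{-\infty}^{a} (a - \mu^c_{ab})^2 p(x)\, dx + \int_{a}^{b} (x - \mu^c_{ab})^2 p(x)\, dx + \int_{b}^{\infty} (b - \mu^c_{ab})^2 p(x)\, dx.$$
The first and last term can be solved as $(a - \mu^c_{ab})^2\, \Phi(\alpha)$ and $(b - \mu^c_{ab})^2\, (1 - \Phi(\beta))$ respectively.
The second term can be decomposed as follows:
$$\int_a^b \left(x^2 - 2 x \mu^c_{ab} + (\mu^c_{ab})^2\right) p(x)\, dx = \int_a^b x^2 p(x)\, dx + Z \left((\mu^c_{ab})^2 - 2 \mu^c_{ab}\, \mu^t_{ab}\right),$$
where we use the result from the previous subsection and define $Z = \Phi(\beta) - \Phi(\alpha)$, and where $\mu^t_{ab} = \frac{1}{Z} \int_a^b x\, p(x)\, dx = \mu + \sigma \frac{\phi(\alpha) - \phi(\beta)}{Z}$ is the mean of the truncated normal distribution.
Evaluating the first term yields:
$$\int_a^b x^2 p(x)\, dx = Z (\mu^2 + \sigma^2) + \sigma \left(a\,\phi(\alpha) - b\,\phi(\beta)\right) + \sigma \mu \left(\phi(\alpha) - \phi(\beta)\right).$$
This results in:
$$\begin{aligned}
(\sigma^c_{ab})^2 ={}& Z \left(\mu^2 + \sigma^2 + (\mu^c_{ab})^2 - 2 \mu^c_{ab}\, \mu\right) + \sigma \left(a\,\phi(\alpha) - b\,\phi(\beta)\right) \\
&+ \sigma \left(\mu - 2 \mu^c_{ab}\right) \left(\phi(\alpha) - \phi(\beta)\right) + (a - \mu^c_{ab})^2\, \Phi(\alpha) + (b - \mu^c_{ab})^2\, (1 - \Phi(\beta)).
\end{aligned}$$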
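The closed forms for the clipped normal distribution are easy to get wrong by a sign, so a small numerical sketch may help. The function below evaluates the standard expressions for the mean and variance of a normal variable clipped to $[a, b]$, written in terms of the standard normal pdf and CDF. The name `clipped_normal_moments` is a hypothetical helper, and finite clipping bounds are assumed (an infinite bound would produce `inf * 0` in floating point).

```python
import math

def phi(z):
    # Standard normal pdf.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    # Standard normal CDF, expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def clipped_normal_moments(mu, sigma, a, b):
    """Mean and variance of clip(X, a, b) for X ~ N(mu, sigma^2), a < b finite."""
    alpha = (a - mu) / sigma
    beta = (b - mu) / sigma
    Z = Phi(beta) - Phi(alpha)          # probability mass inside [a, b]
    d = phi(alpha) - phi(beta)
    mean = sigma * d + mu * Z + a * Phi(alpha) + b * (1.0 - Phi(beta))
    var = (Z * (mu ** 2 + sigma ** 2 + mean ** 2 - 2.0 * mean * mu)
           + sigma * (a * phi(alpha) - b * phi(beta))
           + sigma * (mu - 2.0 * mean) * d
           + (a - mean) ** 2 * Phi(alpha)
           + (b - mean) ** 2 * (1.0 - Phi(beta)))
    return mean, var

# ReLU on a standard normal, approximated by a very large upper bound.
relu_mean, relu_var = clipped_normal_moments(0.0, 1.0, 0.0, 50.0)
# ReLU6-style clipping of a shifted, scaled normal (illustrative parameters).
relu6_mean, relu6_var = clipped_normal_moments(1.0, 2.0, 0.0, 6.0)
```

For the ReLU case the results can be checked against the known values $1/\sqrt{2\pi}$ for the mean and $\tfrac{1}{2} - \tfrac{1}{2\pi}$ for the variance of a rectified standard normal.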
Appendix D Empirical quantization bias correction
Table 6

| Method | FP32 | INT8 |
|---|---|---|
| Empirical Bias Corr | 71.72% | 58.44% |
| Clip @ 15 | 67.06% | 2.55% |
| + Empirical Bias Corr | 70.55% | 70.27% |
| Equalization + Bias Absorption | 71.57% | 70.92% |
| + Empirical Bias Corr | 71.57% | 71.41% |
If a network does not use batch normalization, or does not use batch normalization in all layers, a representative dataset can be used to compute pre-activation means before and after quantization. The difference between the two can then be subtracted from the quantized model's pre-activations. This procedure can be run with unlabeled data. It should be run after BatchNorm folding and cross-layer range equalization. Clipping should be used in the quantized network, but not in the floating point network. Since the activation function and the quantization operation are fused, this procedure is run on a network with quantized weights only. We bias correct a layer after all the layers feeding into it are already bias-corrected. The procedure is as follows:
1. Run $N$ examples through the FP32 model and collect for each layer the per-channel pre-activation means $\mathbb{E}[\mathbf{y}]$.
2. For each layer $L$ in the quantized model:
   - Collect the per-channel pre-activation means $\mathbb{E}[\widetilde{\mathbf{y}}]$ of layer $L$ for the same $N$ examples as in step 1.
   - Compute the per-channel biased quantization error $\mathbb{E}[\boldsymbol{\epsilon}] = \mathbb{E}[\widetilde{\mathbf{y}}] - \mathbb{E}[\mathbf{y}]$.
   - Subtract $\mathbb{E}[\boldsymbol{\epsilon}]$ from the layer's bias parameter.
The results of applying this procedure can be found in Table 6.
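The procedure can be sketched for a single fully-connected layer as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: `linear`, `channel_means`, and `empirical_bias_correction` are hypothetical helpers, the weights and inputs are toy values, and in a multi-layer network `q_inputs` would be the activations produced by the already corrected quantized layers feeding into this one.

```python
def linear(W, b, x):
    # Pre-activation of a fully-connected layer: W x + b.
    return [sum(w * v for w, v in zip(row, x)) + b_c for row, b_c in zip(W, b)]

def channel_means(W, b, inputs):
    # Per-channel mean of the pre-activations over a set of examples.
    outs = [linear(W, b, x) for x in inputs]
    return [sum(o[c] for o in outs) / len(outs) for c in range(len(b))]

def empirical_bias_correction(W, W_q, b, fp_inputs, q_inputs):
    mean_fp = channel_means(W, b, fp_inputs)     # step 1: E[y] from the FP32 layer
    mean_q = channel_means(W_q, b, q_inputs)     # step 2a: E[y~] from the quantized layer
    err = [q - f for q, f in zip(mean_q, mean_fp)]   # step 2b: biased error
    return [b_c - e for b_c, e in zip(b, err)]       # step 2c: fold into the bias

# Toy layer: float weights and a coarsely rounded copy (illustrative values).
W   = [[0.1, 0.2], [-0.3, 0.4]]
W_q = [[0.125, 0.25], [-0.25, 0.375]]
b   = [0.0, 1.0]
examples = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]

# For the first layer the FP32 and quantized networks see the same inputs.
b_corrected = empirical_bias_correction(W, W_q, b, examples, examples)
```

After the correction, the per-channel pre-activation means of the quantized layer over the same examples match those of the FP32 layer, which is the property the procedure is designed to restore.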