Data-Free Quantization
Through Weight Equalization and Bias Correction
Abstract
We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision architectures and tasks. 8-bit fixed-point quantization is essential for efficient inference on modern deep learning hardware architectures. However, quantizing models to run in 8-bit is a non-trivial task, frequently leading to either significant performance reduction or engineering time spent on training a network to be amenable to quantization. Our approach relies on equalizing the weight ranges in the network by making use of a scale-equivariance property of activation functions. In addition, the method corrects biases in the error that are introduced during quantization. This improves quantization accuracy, and can be applied to almost any model with a straightforward API call. For common architectures, such as the MobileNet family, we achieve state-of-the-art quantized model performance. We further show that the method also extends to other computer vision architectures and tasks such as semantic segmentation and object detection.
1 Introduction
In recent years, deep-learning-based computer vision models have moved from research labs into the cloud and onto edge devices. As a result, power consumption and latency of deep learning inference have become an important concern. For this reason fixed-point quantization is often employed to make inference more efficient. By quantizing FP32 values onto a regularly spaced grid, the original FP32 values can be approximated by a set of integers, a scaling factor, and an optional zero-point offset [16]. This allows for the use of faster and more power-efficient INT8 operations instead of FP32 operations in matrix multiplication or convolution computations, at the expense of lower representational power. We refer the reader to [18] for details on commonly used, hardware-friendly quantization methods for deep learning models.
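As an illustration of the integer grid described above, the following minimal sketch quantizes a few values with an asymmetric 8-bit scheme. The function names `quantize` and `dequantize`, the unsigned integer range and the example weights are assumptions made for this illustration; they are not taken from [16] or [18].

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers plus a scale and a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero input
    zero_point = round(-lo / scale)
    ints = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return ints, scale, zero_point


def dequantize(ints, scale, zero_point):
    """Recover the approximate float values from the integer grid."""
    return [scale * (q - zero_point) for q in ints]


weights = [-1.0, -0.25, 0.0, 0.5, 1.55]  # made-up example values
ints, scale, zero_point = quantize(weights)
approx = dequantize(ints, scale, zero_point)
max_error = max(abs(a - w) for a, w in zip(approx, weights))
print(ints, scale, zero_point, max_error)
```

For values inside the representable range, the reconstruction error of each value is at most half a grid step, which is the quantization noise discussed in the remainder of this section.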
Quantization of 32-bit full precision (FP32) models into 8-bit fixed point (INT8) introduces quantization noise on the weights and activations, which often leads to reduced model performance. This performance degradation ranges from very minor to catastrophic. To minimize the quantization noise or even mitigate its effect, a wide range of different methods have been introduced in the literature (see Section 2). Such methods (e.g. [16, 4, 21]) often rely on quantization-aware fine-tuning or training from scratch.
A major drawback of practical quantization methods is their reliance on data and fine-tuning. As an example, consider real-world actors that manage hardware for quantized models, such as cloud-based deep learning inference providers or cellphone manufacturers. To provide a general-use quantization service they would have to receive data from the customers to fine-tune the models, or rely on their customers to do the quantization. In either case, this can add a difficult step to the process. For such stakeholders it would be preferable if FP32 models could be converted directly to INT8, without needing the know-how, data or compute necessary for running traditional quantization methods. Even for model developers that have the capability to go the extra mile to quantize their own models, automation would save significant time.
In this paper, we introduce a quantization approach that does not require data, fine-tuning or hyperparameter tuning, resulting in accuracy improvement with a simple API call. Despite these restrictions we achieve near-original model performance when quantizing FP32 models to INT8. This is achieved by adapting the weight tensors of pre-trained models such that they are more amenable to quantization, and by correcting for the bias of the error that is introduced when quantizing models. We show significant improvements in quantization performance on a wide range of computer vision models previously thought to be difficult to quantize without fine-tuning.
Levels of quantization solutions
Solutions that do not require data or retraining are rare in the literature, and the practical drawbacks of existing methods are rarely discussed. To distinguish quantization methods by their applicability, we introduce four levels of quantization solutions, in decreasing order of practical applicability. Our hope is that this will enable other authors to explore solutions for each level, and make the comparison between methods fairer. The axes for comparison are whether or not a method requires data, whether or not a method requires error backpropagation on the quantized model, and whether or not a method is generally applicable for any architecture or requires significant model reworking. We use the following definitions throughout the paper:
 Level 1

No data and no backpropagation required. The method works for any model. It is as simple as an API call that only looks at the model definition and weights.
 Level 2

Requires data but no backpropagation. Works for any model. The data is used e.g. to recalibrate the batch normalization mean and variance [27] or to compute layer-wise loss functions to improve quantization performance. However, no fine-tuning pipeline is required.
 Level 3

Requires data and backpropagation. Works for any model.
 Level 4

Requires data and backpropagation. Only works for specific models. In this case, the network architecture needs non-trivial reworking, and/or the architecture needs to be trained from scratch with quantization in mind (e.g. [4, 31, 21]). Takes significant extra training time and hyperparameter tuning to work.
2 Background and Related work
There are several works that describe quantization and improving networks for lower-bit inference and deployment [9, 10, 16, 34]. These methods all rely strongly on fine-tuning, making them level 3 methods, whereas data-free quantization improves performance similarly without that requirement. Our method could be used as a preprocessing step for the above methods, before fine-tuning is applied, to improve quantization.
The only work we have found that improves INT8 quantization without the need for data is a whitepaper by Krishnamoorthi [18], which describes how having an activation range per channel alleviates the problems later discussed in 3.1. A major drawback of this method is that it is not supported on all hardware, and that it creates unnecessary overhead in the computation due to the necessity of scale and offset values for each channel individually. We show that our method improves on per-channel quantization, while keeping a single set of scale and offset values for the whole weight tensor instead.
Other methods to improve quantization need architecture changes or training with quantization in mind from the start [1, 21, 31, 32, 35]. These methods are even more involved than doing quantization and fine-tuning. They also incur a relatively large overhead during training because of sampling and noisy optimization, and introduce extra hyperparameters to optimize. This makes them level 4 methods.
Methods that binarize [5, 15, 27, 28] or ternarize [19] networks operate at great efficiency for inference, as expensive multiplications and additions are replaced by bit-shift operations. However, quantizing models to binary often leads to strong performance degradation. Generally they need to be trained from scratch as well. Thus, binarization methods are also level 4 methods.
Other approaches use low-bit floating point operations, or other custom quantization implementations [8, 17, 24, 34]. Our approach could also work to improve the results for models quantized with such custom floating point formats. Other approaches use codebooks [7], which put stringent restrictions on the hardware for an efficient implementation. We do not consider codebooks in our approach.
In work developed concurrently with ours, [22] also exploits the scale invariance of the ReLU function to rescale weight channels and notices the biased error introduced by weight quantization for certain models, leading to a method that resembles our data-free quantization approach. Our work was developed independently from theirs.
3 Motivation
While many trained FP32 models can be quantized to INT8 without much loss in performance, some models exhibit a significant drop in performance after quantization ([18, 31]). For example, when quantizing a trained MobileNetV2 [30] model, Krishnamoorthi [18] reports a drop in top-1 accuracy from 70.9% to 0.1% on the ImageNet [29] validation set. The author restores near-original model performance by either applying per-channel quantization, fine-tuning or both. Per-channel quantization defines a separate quantization range for each output channel by scaling weights before usage.
3.1 Weight tensor channel ranges
The fact that per-channel quantization yields much better performance on MobileNetV2 than per-tensor quantization suggests that, in some layers, the weight distributions differ so strongly between output channels that the same set of quantization parameters cannot be used to quantize the full weight tensor effectively. For example, in the case where one channel has weights in the range $[-128, 128]$ and another channel has weights in the range $(-0.5, 0.5)$, the weights in the latter channel will all be quantized to $0$ when quantizing to 8 bits.
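The effect of mismatched channel ranges can be reproduced with a toy example. The sketch below applies per-tensor symmetric 8-bit quantization to two channels; the helper name `quantize_symmetric` and all weight values are made up for this illustration.

```python
def quantize_symmetric(channels, num_bits=8):
    """Per-tensor symmetric quantization: one scale shared by all channels."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for channel in channels for w in channel)
    scale = max_abs / qmax
    return [[round(w / scale) for w in channel] for channel in channels], scale


wide = [-120.0, 35.0, 127.0]  # channel with a large weight range
narrow = [-0.4, 0.1, 0.45]    # channel with a small weight range
(q_wide, q_narrow), scale = quantize_symmetric([wide, narrow])
print(q_wide, q_narrow, scale)
```

Because the shared scale is dictated by the wide channel, every weight of the narrow channel rounds to the same integer and the information in that channel is lost.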
Figure 2 shows that large differences in output channel weight ranges do indeed occur in a (trained) MobileNetV2 model. This figure shows the weight distribution of the output channel weights of the depthwise-separable layer in the model’s first inverted residual block. Due to the strong differences between channel weight ranges that this layer exhibits, it cannot be quantized with reasonable accuracy for each channel. Many layers in the network suffer from this problem, making the overall model very difficult to quantize.
We conjecture that performance of trained models after quantization can be improved by adjusting the weights for each output channel such that their ranges are more similar. We provide a level 1 method to achieve this without changing the FP32 model output in section 4.1.
3.2 Biased quantization error
A common assumption in the literature (e.g. [2]) is that the quantization error of weights is unbiased and thus cancels out in a layer’s output, ensuring that a quantized layer’s expected output does not change from the original FP32 output. However, if the quantization error is biased, either for the whole layer or for certain output channels in a weight tensor, the layer’s expected output changes, causing unpredictable effects in the following layers.
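The error between the FP32 weights $\mathbf{W}$ of a fully connected layer and the quantized weights $\widetilde{\mathbf{W}}$ can be computed as $\boldsymbol{\epsilon} = \widetilde{\mathbf{W}} - \mathbf{W}$. In case the (biased) errors in the weights for an output unit $j$ do not cancel out, i.e. $\sum_i \epsilon_{ji} \neq 0$, there will be a biased error in output unit $j$.
The biased error in a quantized layer’s output unit $j$ can be computed empirically using $N$ input data points as:
$$\mathbb{E}[\widetilde{y}_j - y_j] \approx \frac{1}{N} \sum_{n=1}^{N} \left( (\widetilde{\mathbf{W}} \mathbf{x}_n)_j - (\mathbf{W} \mathbf{x}_n)_j \right) \tag{1}$$
where $y_j$ and $\widetilde{y}_j$ are the original outputs and the outputs generated using the quantized weight matrix, respectively.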
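The empirical estimate of eq. 1 can be sketched in a few lines. In the example below, the weight matrix, the deliberately coarse rounding grid and the synthetic inputs with mean 1 are assumptions chosen to make the bias visible; the helper names are hypothetical.

```python
import random


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def round_to_grid(W, step):
    """Round every weight to the nearest multiple of `step` (a toy quantizer)."""
    return [[step * round(w / step) for w in row] for row in W]


def empirical_bias(W, W_q, inputs):
    """Average of (W_q x - W x) over the inputs, one value per output unit."""
    totals = [0.0] * len(W)
    for x in inputs:
        for j, (y_q, y) in enumerate(zip(matvec(W_q, x), matvec(W, x))):
            totals[j] += y_q - y
    return [t / len(inputs) for t in totals]


random.seed(0)
W = [[0.26, 0.27, 0.24], [0.51, -0.49, 0.02]]
W_q = round_to_grid(W, step=0.5)  # deliberately coarse grid
inputs = [[random.gauss(1.0, 1.0) for _ in range(3)] for _ in range(2000)]
bias = empirical_bias(W, W_q, inputs)
print(bias)
```

With inputs of mean 1, the estimate approaches the row sums of the weight error, roughly 0.23 for the first output unit and -0.04 for the second, so the first unit carries a clearly biased error.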
Figure 3 shows the biased error per channel of a depthwise-separable convolution layer in a trained MobileNetV2 model. From this plot it is clear that for many channels in the layer’s output, the error introduced by weight quantization is biased, and influences the output statistics. The depthwise-separable layers in MobileNet are especially sensitive to this biased error effect, as each output channel has only 9 corresponding weights.
Such a biased error on the outputs can also be introduced in many other settings, e.g. when weights or activations are clipped [23], or in non-quantization approaches, such as weight tensor factorization or channel pruning [13, 33].
In section 4.2 we introduce a method to correct for this bias. Furthermore, we show that a model’s batch normalization parameters can be used to compute the expected biased error on the output, yielding a level 1 method to fix the biased error introduced by quantization.
4 Method
Our proposed data-free quantization method (DFQ) consists of three steps, on top of the normal quantization. The overall flow of the algorithm is shown in Figure 4.
4.1 Cross-layer range equalization
Positive scaling equivariance
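We observe that for a ReLU [25] activation function $f(\cdot)$ the following scaling equivariance property holds:
$$f(sx) = s f(x) \tag{2}$$
for any non-negative real number $s$. This follows from the definition of the ReLU:
$$\mathrm{ReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} \tag{3}$$
since, for $s > 0$, $sx > 0$ if and only if $x > 0$. This equivariance also holds for the PreLU [11] activation function.
More generally, the positive scaling equivariance can be relaxed to $f(sx) = s \hat{f}(x)$ for any piecewise linear activation function:
$$f(x) = \begin{cases} a_1 x + b_1 & \text{if } x \leq c_1 \\ a_2 x + b_2 & \text{if } c_1 < x \leq c_2 \\ \quad \vdots \\ a_n x + b_n & \text{if } c_{n-1} < x \end{cases} \tag{4}$$
where $\hat{f}(\cdot)$ is parameterized as $\hat{a}_i = a_i$, $\hat{b}_i = b_i / s$ and $\hat{c}_i = c_i / s$. Note that, contrary to the equivariance defined in eq. 2, we now also change the function $f(\cdot)$ into $\hat{f}(\cdot)$.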
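Both forms of the equivariance can be checked numerically. The sketch below tests the ReLU case directly and uses ReLU6 as an example of a piecewise linear function whose clipping point has to move from 6 to 6/s; the scaling factor and test points are arbitrary choices for this illustration.

```python
def relu(x):
    return max(0.0, x)


def clipped_relu(x, cap):
    """ReLU whose linear part is clipped at `cap` (cap = 6 gives ReLU6)."""
    return min(max(0.0, x), cap)


s = 2.5
xs = [-3.0, -0.5, 0.0, 0.7, 2.0, 10.0]

# ReLU: f(s * x) == s * f(x) with the same function on both sides.
relu_ok = all(abs(relu(s * x) - s * relu(x)) < 1e-12 for x in xs)

# ReLU6: f(s * x) == s * f_hat(x), where f_hat clips at 6 / s instead of 6.
relu6_ok = all(
    abs(clipped_relu(s * x, 6.0) - s * clipped_relu(x, 6.0 / s)) < 1e-12
    for x in xs
)
print(relu_ok, relu6_ok)
```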
4.1.1 Scaling equivariance in neural networks
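The positive scaling equivariance introduced above can be exploited in consecutive layers in neural networks. Given two layers, $\mathbf{h} = f(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)})$ and $\mathbf{y} = f(\mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)})$, through scaling equivariance we have that:
$$\begin{aligned} \mathbf{y} &= f(\mathbf{W}^{(2)} f(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)}) \\ &= f(\mathbf{W}^{(2)} \mathbf{S} \hat{f}(\mathbf{S}^{-1} \mathbf{W}^{(1)} \mathbf{x} + \mathbf{S}^{-1} \mathbf{b}^{(1)}) + \mathbf{b}^{(2)}) \\ &= f(\widehat{\mathbf{W}}^{(2)} \hat{f}(\widehat{\mathbf{W}}^{(1)} \mathbf{x} + \hat{\mathbf{b}}^{(1)}) + \mathbf{b}^{(2)}) \end{aligned}$$
where $\mathbf{S} = \mathrm{diag}(\mathbf{s})$ is a diagonal matrix with value $S_{ii}$ denoting the scaling factor $s_i$ for neuron $i$. This allows us to reparameterize our model with $\widehat{\mathbf{W}}^{(2)} = \mathbf{W}^{(2)} \mathbf{S}$, $\widehat{\mathbf{W}}^{(1)} = \mathbf{S}^{-1} \mathbf{W}^{(1)}$ and $\hat{\mathbf{b}}^{(1)} = \mathbf{S}^{-1} \mathbf{b}^{(1)}$. In case of CNNs the scaling will be per channel and broadcast accordingly over the spatial dimensions. The rescaling procedure is illustrated in Figure 5.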
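The reparameterization can be verified on a small fully connected example: dividing each hidden neuron's incoming weights and bias by a positive factor and multiplying its outgoing weights by the same factor leaves the output unchanged. All weights, the scaling factors and the helper names below are made up for this sketch.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def forward(W1, b1, W2, b2, x):
    """Two consecutive layers, each followed by a ReLU."""
    h = relu([a + b for a, b in zip(matvec(W1, x), b1)])
    return relu([a + b for a, b in zip(matvec(W2, h), b2)])


def rescale(W1, b1, W2, s):
    """Divide row i of W1 and b1[i] by s[i]; multiply column i of W2 by s[i]."""
    W1_hat = [[w / s[i] for w in row] for i, row in enumerate(W1)]
    b1_hat = [b / s[i] for i, b in enumerate(b1)]
    W2_hat = [[w * s[i] for i, w in enumerate(row)] for row in W2]
    return W1_hat, b1_hat, W2_hat


W1, b1 = [[0.5, -1.0], [20.0, 8.0]], [0.1, -2.0]
W2, b2 = [[1.0, 0.02], [-0.5, 0.01]], [0.0, 0.3]
s = [0.25, 4.0]  # any positive per-neuron scaling factors
x = [0.9, 0.3]

W1_hat, b1_hat, W2_hat = rescale(W1, b1, W2, s)
y = forward(W1, b1, W2, b2, x)
y_hat = forward(W1_hat, b1_hat, W2_hat, b2, x)
print(y, y_hat)
```

The two printed outputs agree up to floating point rounding, which is what makes the choice of the scaling factors free to be used for improving quantization.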
4.1.2 Equalizing ranges over multiple layers
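We can exploit the rescaling and reparameterization of the model to make the model more robust to quantization. Ideally the ranges of each channel $i$ are equal to the total range of the weight tensor, meaning we use the best possible representative power per channel. We can define the precision of a channel as the ratio:
$$\hat{p}_i^{(1)} = \frac{\hat{r}_i^{(1)}}{\hat{R}^{(1)}} \tag{5}$$
where $\hat{r}_i^{(1)}$ is the range of channel $i$ in $\widehat{\mathbf{W}}^{(1)}$ and $\hat{R}^{(1)}$ is the total range of the tensor $\widehat{\mathbf{W}}^{(1)}$. We want to find $\mathbf{S}$ such that the total precision per channel is maximized:
$$\max_{\mathbf{S}} \sum_i \hat{p}_i^{(1)} \hat{p}_i^{(2)} \tag{6}$$
In the case of symmetric quantization we have $\hat{r}_i^{(1)} = 2 \cdot \max_j |\widehat{W}_{ij}^{(1)}|$ and $\hat{R}^{(1)} = 2 \cdot \max_{ij} |\widehat{W}_{ij}^{(1)}|$. Solving eq. 6 (see appendix A) leads to the necessary condition
$$\arg\max_j \frac{1}{s_j} r_j^{(1)} = \arg\max_k s_k r_k^{(2)} \tag{7}$$
meaning the limiting channel defining the quantization range is given by $\arg\max_i r_i^{(1)} r_i^{(2)}$. We can satisfy this condition by setting $\mathbf{S}$ such that
$$s_i = \frac{1}{r_i^{(2)}} \sqrt{r_i^{(1)} r_i^{(2)}} \tag{8}$$
which results in $\forall i: \hat{r}_i^{(1)} = \hat{r}_i^{(2)}$. Thus the channel’s ranges between both tensors are matched as closely as possible.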
When equalizing multiple layers at the same time, we iterate this process for pairs of layers that are connected to each other without input or output splits in between, until convergence.
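A minimal sketch of one equalization step for a single pair of fully connected layers is given below. It uses the scaling factor that makes the per-channel ranges of both tensors equal, $s_i = \sqrt{r_i^{(1)} r_i^{(2)}} / r_i^{(2)}$, with symmetric ranges; the function names and weights are illustrative assumptions, and every channel range is assumed to be non-zero.

```python
import math


def channel_ranges(W1, W2):
    """Symmetric ranges: rows of W1 (outputs) and columns of W2 (inputs)."""
    r1 = [2 * max(abs(w) for w in row) for row in W1]
    r2 = [2 * max(abs(row[i]) for row in W2) for i in range(len(W1))]
    return r1, r2


def equalize(W1, b1, W2):
    """Rescale two connected layers so their per-channel ranges match."""
    r1, r2 = channel_ranges(W1, W2)
    s = [math.sqrt(a * b) / b for a, b in zip(r1, r2)]  # assumes non-zero ranges
    W1_hat = [[w / s[i] for w in row] for i, row in enumerate(W1)]
    b1_hat = [b / s[i] for i, b in enumerate(b1)]
    W2_hat = [[w * s[i] for i, w in enumerate(row)] for row in W2]
    return W1_hat, b1_hat, W2_hat, s


W1, b1 = [[0.5, -1.0], [20.0, 8.0]], [0.1, -2.0]
W2 = [[1.0, 0.02], [-0.5, 0.01]]
r1, r2 = channel_ranges(W1, W2)
W1_hat, b1_hat, W2_hat, s = equalize(W1, b1, W2)
r1_hat, r2_hat = channel_ranges(W1_hat, W2_hat)
print(r1, r2)
print(r1_hat, r2_hat)
```

In this toy case the channel ranges of the first layer differ by a factor of 20 before equalization and by less than a factor of 2 afterwards, while the ranges of the two tensors coincide channel by channel. For a full network this step would be repeated over all connected layer pairs until the scaling factors stop changing.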
4.1.3 Absorbing high biases
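Running the equalization procedure on the weights potentially increases the biases of the layers. This in turn increases the range of the activation quantization. In order to avoid big differences between per-channel ranges in the activations we introduce a procedure that absorbs high biases into the subsequent layer. For a layer with ReLU function $r$, there is a non-negative vector $\mathbf{c}$ such that $r(\mathbf{W}\mathbf{x} + \mathbf{b} - \mathbf{c}) = r(\mathbf{W}\mathbf{x} + \mathbf{b}) - \mathbf{c}$ for all values of $\mathbf{x}$. This is the vector with the minimal activation value of each output of the layer. This is the amount that we can absorb from layer $1$ into layer $2$, as described below:
$$\begin{aligned} \mathbf{y} &= \mathbf{W}^{(2)} \mathbf{h} + \mathbf{b}^{(2)} && (9) \\ &= \mathbf{W}^{(2)} (r(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{c} - \mathbf{c}) + \mathbf{b}^{(2)} && (10) \\ &= \mathbf{W}^{(2)} (r(\mathbf{W}^{(1)} \mathbf{x} + \hat{\mathbf{b}}^{(1)}) + \mathbf{c}) + \mathbf{b}^{(2)} && (11) \\ &= \mathbf{W}^{(2)} \hat{\mathbf{h}} + \hat{\mathbf{b}}^{(2)} && (12) \end{aligned}$$
where $\hat{\mathbf{b}}^{(2)} = \mathbf{W}^{(2)} \mathbf{c} + \mathbf{b}^{(2)}$, $\hat{\mathbf{h}} = \mathbf{h} - \mathbf{c}$, and $\hat{\mathbf{b}}^{(1)} = \mathbf{b}^{(1)} - \mathbf{c}$.
Without data we cannot find $\mathbf{c}$ exactly. Therefore, we approximate $\mathbf{c}$ based on the batch normalization statistics as $\mathbf{c} = \max(0, \boldsymbol{\beta} - 3\boldsymbol{\gamma})$, where $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ are the resulting mean and standard deviation from the batch normalization layer before applying ReLU. Under the assumption that the pre-bias activations are distributed as $\mathcal{N}(\boldsymbol{\beta}, \boldsymbol{\gamma}^2)$, approximately only $0.135\%$ are smaller than $\mathbf{c}$ and would be clipped due to the shift by $\mathbf{c}$. As we will show in section 5.1.1, this approximation does not harm the full precision performance significantly but helps for activation quantization.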
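The absorption step can be sketched as follows, assuming the absorbed amount is chosen with a three-standard-deviation rule, c = max(0, beta - 3 * gamma), where beta and gamma stand for the per-channel batch normalization mean and standard deviation. The function name `absorb_bias` and all numbers are hypothetical.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def absorb_bias(b1, W2, b2, beta, gamma):
    """Move c = max(0, beta - 3 * gamma) from layer 1's bias into layer 2's bias."""
    c = [max(0.0, m - 3.0 * sd) for m, sd in zip(beta, gamma)]
    b1_hat = [b - ci for b, ci in zip(b1, c)]
    b2_hat = [b + wc for b, wc in zip(b2, matvec(W2, c))]
    return b1_hat, b2_hat, c


def forward(W1, b1, W2, b2, x):
    """ReLU layer followed by a linear layer (pre-activation output)."""
    h = relu([a + b for a, b in zip(matvec(W1, x), b1)])
    return [a + b for a, b in zip(matvec(W2, h), b2)]


W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [5.0, 0.2]
W2, b2 = [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.3]
beta, gamma = [5.0, 0.2], [1.0, 0.5]  # assumed batch-norm mean and std

b1_hat, b2_hat, c = absorb_bias(b1, W2, b2, beta, gamma)
x = [-1.5, 0.4]  # pre-activations [3.5, 0.6] stay above c
y = forward(W1, b1, W2, b2, x)
y_hat = forward(W1, b1_hat, W2, b2_hat, x)
print(c, y, y_hat)
```

For this input the output is unchanged while the first layer's activations are shifted down by `c`, which narrows the activation range of the high-bias channel. An input whose pre-activation falls below `c` would be clipped and produce a small deviation, which is the approximation error discussed above.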
4.2 Quantization bias correction
As shown empirically in the motivation, quantization introduces a biased error in the activations. In this section we show how to correct for the biased error on the layer’s output, and how we can use the network’s batch normalization parameters to compute the expected biased error so that the method is level 1.
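For a fully connected layer with weight tensor $\mathbf{W}$, quantized weights $\widetilde{\mathbf{W}}$, and input activations $\mathbf{x}$, we have $\widetilde{\mathbf{y}} = \widetilde{\mathbf{W}} \mathbf{x}$ and therefore $\widetilde{\mathbf{y}} = \mathbf{y} + \boldsymbol{\epsilon} \mathbf{x}$, where we define the quantization error $\boldsymbol{\epsilon} = \widetilde{\mathbf{W}} - \mathbf{W}$, $\mathbf{y}$ as the layer pre-activations of the FP32 model, and $\widetilde{\mathbf{y}}$ as that layer with quantization error added.
If the expectation of the error for neuron $i$ is non-zero, i.e. $\mathbb{E}[\boldsymbol{\epsilon} \mathbf{x}]_i \neq 0$, then the mean of the activation of neuron $i$ will change. This change in distribution can often lead to strongly detrimental behavior in the following layers. We can correct for this change by seeing that
$$\begin{aligned} \mathbb{E}[\mathbf{y}] &= \mathbb{E}[\mathbf{y}] + \mathbb{E}[\boldsymbol{\epsilon} \mathbf{x}] - \mathbb{E}[\boldsymbol{\epsilon} \mathbf{x}] && (13) \\ &= \mathbb{E}[\widetilde{\mathbf{y}}] - \mathbb{E}[\boldsymbol{\epsilon} \mathbf{x}] && (14) \end{aligned}$$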
Thus, subtracting the expected error on the output from the biased output ensures that the mean for each output unit is preserved.
For implementation, the expected error can be subtracted from the layer’s bias parameter, since the expected error vector has the same shape as the layer’s output. This method easily extends to convolutional layers as described in Appendix B.
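The correction can be sketched for a fully connected layer as below, assuming the expected input is already known. The toy quantizer, the weights and the value of `expected_x` are assumptions made for this example.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def round_to_grid(W, step):
    """Round every weight to the nearest multiple of `step` (a toy quantizer)."""
    return [[step * round(w / step) for w in row] for row in W]


def correct_bias(W, W_q, bias, expected_x):
    """Subtract the expected output error (W_q - W) E[x] from the layer bias."""
    eps = [[wq - w for wq, w in zip(row_q, row)] for row_q, row in zip(W_q, W)]
    expected_error = matvec(eps, expected_x)
    return [b - e for b, e in zip(bias, expected_error)]


W = [[0.26, 0.27, 0.24], [0.51, -0.49, 0.02]]
bias = [0.1, -0.2]
W_q = round_to_grid(W, step=0.5)
expected_x = [0.8, 1.2, 0.5]  # assumed known E[x]
bias_corrected = correct_bias(W, W_q, bias, expected_x)

# By linearity, the mean output of a layer is W E[x] + b.
mean_fp32 = [m + b for m, b in zip(matvec(W, expected_x), bias)]
mean_quant = [m + b for m, b in zip(matvec(W_q, expected_x), bias)]
mean_fixed = [m + b for m, b in zip(matvec(W_q, expected_x), bias_corrected)]
print(mean_fp32, mean_quant, mean_fixed)
```

The uncorrected quantized layer shifts the mean of the first output unit noticeably, whereas the layer with the corrected bias reproduces the FP32 mean output.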
4.2.1 Computing the expected input
To compute the expected error of the output of a layer, the expected input to the layer $\mathbb{E}[\mathbf{x}]$ is required. If a model does not use batch normalization, or there are no data-usage restrictions, $\mathbb{E}[\boldsymbol{\epsilon} \mathbf{x}]$ can simply be computed by comparing the activations before and after quantization. Appendix D explains this procedure in more detail.
Clipped normal distribution
When the network includes batch normalization before a layer, we can use it to calculate $\mathbb{E}[\mathbf{x}]$ for that layer without using data. We assume the pre-activation outputs of a layer are normally distributed, that batch normalization is applied before the activation function, and that the activation function is some form of the class of clipped linear activation functions (e.g. ReLU, ReLU6), which clips its input range to the range $[a, b]$ where $a < b$, and $b$ can be $\infty$.
Due to the centralization and normalization applied by batch normalization, the mean and standard deviation of the pre-activations are known: these are the batch normalization shift and scale parameters (henceforth referred to as $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ respectively).
To compute $\mathbb{E}[\mathbf{x}]$ from the previous layer’s batch normalization parameters, the mean and variance need to be adjusted to account for the activation function that follows the batch normalization layer. For this purpose we introduce the clipped normal distribution. A clipped-normally distributed random variable $X$ is a normally distributed random variable with mean $\mu$ and variance $\sigma^2$, whose values are clipped to the range $[a, b]$. The mean and variance of the clipped normal distribution can be computed in closed form from $\mu$, $\sigma$, $a$ and $b$. We present the mean of the clipped normal distribution for the ReLU activation function, i.e. $a = 0$ and $b = \infty$, in this section, and refer the reader to Appendix C for the closed form solution for the general clipped normal distribution.
The expected value for channel $c$ in $\mathbf{x}$, $\mathbb{E}[\mathbf{x}_c]$, which is the output of a layer with batch normalization parameters $\beta_c$ and $\gamma_c$, followed by a ReLU activation function is:
$$\begin{aligned} \mathbb{E}[\mathbf{x}_c] &= \mathbb{E}[\mathrm{ReLU}(\mathbf{x}_c^{pre})] && (15) \\ &= \gamma_c \, \mathcal{N}\!\left(\frac{-\beta_c}{\gamma_c}\right) + \beta_c \left[ 1 - \Phi\!\left(\frac{-\beta_c}{\gamma_c}\right) \right] && (16) \end{aligned}$$
where $\mathbf{x}_c^{pre}$ is the pre-activation output for channel $c$, which is assumed to be normally distributed with mean $\beta_c$ and variance $\gamma_c^2$, $\Phi(\cdot)$ is the normal CDF, and the notation $\mathcal{N}(x)$ is used to denote the normal $\mathcal{N}(x \mid 0, 1)$ PDF.
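The closed-form mean of a normal variable passed through a ReLU can be checked against sampling. The sketch below assumes a positive scale parameter `gamma`; the function names and the chosen values of `beta` and `gamma` are illustrative.

```python
import math
import random


def normal_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_relu_output(beta, gamma):
    """Mean of ReLU(x) for x ~ N(beta, gamma**2), with gamma > 0."""
    z = -beta / gamma
    return gamma * normal_pdf(z) + beta * (1.0 - normal_cdf(z))


random.seed(0)
beta, gamma = 0.5, 2.0
closed_form = expected_relu_output(beta, gamma)
samples = [max(0.0, random.gauss(beta, gamma)) for _ in range(100_000)]
monte_carlo = sum(samples) / len(samples)
print(closed_form, monte_carlo)
```

The sampled mean matches the closed-form value up to sampling noise, so the expected input of the next layer can be obtained from the batch normalization parameters alone.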
5 Experiments
In this section we present two sets of experiments to validate the performance of data-free quantization (DFQ). The order in which the procedures of DFQ are applied can be seen in Figure 4. We first show in section 5.1 the effect of the different aspects of DFQ and how they solve the problems observed earlier. Then we show in section 5.2 how DFQ generalizes to other models and tasks, and sets a new state-of-the-art for level 1 quantization. In all experiments we use asymmetric quantization [16] for both the baseline and DFQ. Before our DFQ procedure we apply batch normalization folding. Unless otherwise stated, we do not do any fine-tuning after quantization. All experiments are done in PyTorch [26].
5.1 Ablation study
In this section we investigate the effect of our methods on a pre-trained MobileNetV2 [30] model (we use the PyTorch implementation of MobileNetV2 provided by https://github.com/tonylins/pytorchmobilenetv2). We validate the performance of the model on the ImageNet [29] validation set. We first investigate the effects of different parts of our approach through a set of ablation studies.
5.1.1 Cross-layer equalization
In this section we investigate the effects of cross-layer equalization. We compare the equalization to two baselines: the original quantized model and the less hardware-friendly per-channel quantization. Then we show how absorbing high biases further improves results.
Our considered models employ residual connections [12]. For these networks we apply cross-layer equalization only to the three layers within each residual block. MobileNetV2 also uses ReLU6 activation functions. ReLU6 clips the linear part of the ReLU activation at 6. In practice we observed that the ReLU6 can be replaced with a ReLU without a significant drop in performance, making the equalization procedure easier to apply as described in section 4.1.
The results of the equalization experiments are shown in Table 1. Similar to [18], we observe that the model performance is close to random when quantizing the original model to INT8. Further we note that replacing ReLU6 by ReLU does not significantly degrade the model performance. Applying equalization brings us to within 2% of the original model’s performance, close to the performance of per-channel quantization. We note that absorbing high biases is not lossless (drop of 0.2% in floating point), but it boosts the quantized performance by 1% due to more precise activation quantization. Combining both methods improves performance over per-channel quantization, indicating the more efficient per-tensor quantization could be used instead.
Model  FP32  INT8 

Original model  71.72%  0.12% 
Replace ReLU6  71.70%  0.11% 
+ equalization  71.70%  69.91% 
+ absorbing bias  71.57%  70.92% 
Per channel quantization  71.72%  70.65% 
To illustrate the effect of the cross-layer equalization, we show in Figure 3 the weight distributions per output channel of the depthwise-separable layer in the model’s first inverted residual block after applying the equalization. We observe that most channels’ ranges are now similar and that there are no strong outliers anymore. Note that there are still several channels which have all weight values close to zero. These channels convey little information and can be pruned from the network with hardly any loss in accuracy.
5.1.2 Bias correction
In this section we present results on bias correction for a quantized MobileNetV2 model. We furthermore present results of bias correction in combination with a naive weight-clipping baseline, and combined with the cross-layer equalization approach.
The weight-clipping baseline serves two functions: 1) as a naive baseline to the cross-layer equalization approach, and 2) to show that bias correction can be employed in any setting where biased noise is introduced. This could, for example, also be the case for tensor-factorization or channel pruning approaches to model compression [13, 33]. Weight clipping solves the problem of large differences in ranges between channels by clipping large ranges to smaller ranges, but it introduces a strongly biased error. Weight clipping is applied by first folding the batch normalization parameters into a layer’s weights, and then clipping all values to a certain range, in this case $[-15, 15]$. We tried multiple symmetric ranges, all of which provided similar results. For residual connections we calculate $\mathbb{E}[\mathbf{x}]$ and $\mathrm{Var}[\mathbf{x}]$ based on the sum and variance of all input expectations, taking the input to be zero mean and unit variance.
To illustrate the effect of bias correction, Figure 3 shows the per-output-channel biased error introduced by weight quantization. The per-channel biases are obtained as described in eq. 1. This figure shows that applying bias correction reduces the bias in the error on the output of a layer to very close to 0 for most output channels.
Model  FP32  INT8 

Original Model  71.72%  0.12% 
Bias Corr  71.72%  52.02% 
Clip @ 15  67.06%  2.55% 
+ Bias Corr  71.15%  70.43% 
Rescaling + Bias Absorption  71.57%  70.92% 
+ Bias Corr  71.57%  71.19% 
Results for the experiments described above for MobileNetV2 on the ImageNet validation set are shown in Table 2. Applying bias correction improves quantized model performance, indicating that a part of the problem of quantizing this model lies in the biased error that is introduced. However, bias correction on its own does not achieve near-floating-point performance. The reason for this is most likely that the problem described in 3.1 is more severe for this model. The experiments on weight clipping show that bias correction can mitigate performance degradation due to biased error in non-quantized models as well as quantized models. Clipping without correction in the FP32 model introduces a 4.66% loss in accuracy, and correction reduces that loss to a mere 0.57%. Furthermore, it shows that, while naive, weight clipping combined with bias correction is a fairly strong baseline for quantization for this model. Lastly, we show that bias correction improves results when combined with the cross-layer equalization and bias absorption procedures. The combination of all methods is our data-free quantization (DFQ) method. The full DFQ approach achieves near-floating-point performance with a loss of only 0.53% top-1 accuracy.
5.2 Comparison to other methods and models
In this section we show how DFQ generalizes to other popular computer vision tasks, namely semantic segmentation and object detection, and other model architectures such as MobileNetV1 [14] and ResNet18 [12]. Afterwards we compare DFQ to methods in the literature, including more complex level 3 and 4 approaches. This set of models was chosen as they are efficient and likely to be used in mobile applications where 8-bit quantization is frequently used for power efficiency.
5.2.1 Other tasks
Semantic segmentation
Model  FP32  INT8 

Original model  72.94  41.40 
DFQ (ours)  72.45  72.33 
Perchannel quantization  72.94  71.44 
To demonstrate the generalization of our method to semantic segmentation we apply DFQ to DeeplabV3+ with a MobileNetV2 backend [3, 30]. Performance is evaluated on the Pascal VOC segmentation challenge [6]. For our experiments we use the publicly available PyTorch implementation (https://github.com/jfzhang95/pytorchdeeplabxception).
We show the results of this experiment in Table 3. As observed earlier for classification, we notice a significant drop in performance when quantizing the original model, which makes it almost unusable in practice. Applying DFQ recovers almost all performance degradation and achieves less than 1% drop in mIOU compared to the full precision model. DFQ also outperforms the less hardware-friendly per-channel quantization. To the best of our knowledge we are the first to publish quantization results on DeeplabV3+ as well as for semantic segmentation.
Object detection
Model  FP32  INT8 

Original model  68.47  10.63 
DFQ (ours)  68.56  67.91 
Perchannel quantization  68.47  67.52 
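(~D: no data required, ~BP: no backpropagation required, ~AC: no architecture change required; "-" marks an empty entry)
Method  ~D  ~BP  ~AC  MobileNetV2 FP32  MobileNetV2 INT8  MobileNetV1 FP32  MobileNetV1 INT8  ResNet18 FP32  ResNet18 INT8  ResNet18 INT6 
DFQ (ours)  ✓  ✓  ✓  71.7%  71.2%  70.8%  70.5%  69.7%  69.7%  66.3% 
Per-layer [18]  ✓  ✓  ✓  71.9%  0.1%  70.9%  0.1%  69.7%  69.2%  63.8% 
Per-channel [18]  ✓  ✓  ✓  71.9%  69.7%  70.9%  70.3%  69.7%  69.6%  67.5% 
QT [16] ^  ✗  ✗  ✓  71.9%  70.9%  70.9%  70.0%  -  70.3%  67.3% 
SR+DR  ✗  ✗  ✓  -  -  -  71.3%  -  68.2%  59.3% 
QMN [31]  ✗  ✗  ✗  -  -  70.8%  68.0%  -  -  - 
RQ [21]  ✗  ✗  ✗  -  -  -  70.4%  -  69.9%  68.6% 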
To demonstrate the applicability of our method to object detection we apply DFQ to MobileNetV2 SSDLite [30, 20], evaluated on the Pascal VOC object detection challenge [6]. In our experiments we use the publicly available PyTorch implementation of SSD (https://github.com/qfgaohao/pytorchssd).
The results are listed in Table 4. Similar to semantic segmentation we observe a significant drop in performance when quantizing the SSDLite model. Applying DFQ recovers almost all of the performance drop and achieves less than 1% drop in mAP compared to the full precision model, again outperforming per-channel quantization.
5.2.2 Comparison to other approaches
In this section we compare DFQ to other approaches in the literature. We compare our results to two other level 1 approaches, direct per-layer quantization as well as per-channel quantization [18]. In addition we also compare to multiple higher level approaches, namely quantization-aware training [16] as well as stochastic rounding and dynamic ranges [9, 10], which are both level 3 approaches. We also compare to two level 4 approaches: relaxed quantization [21], which involves training a model from scratch, and quantization-friendly separable convolutions [31], which require a rework of the original MobileNet architecture. The results are summarized in Table 5.
For both MobileNetV1 and MobileNetV2 per-layer quantization results in an unusable model whereas DFQ stays close to full precision performance. DFQ also outperforms per-channel quantization as well as all level 3 and 4 approaches which require significant fine-tuning, training or even architecture changes.
On ResNet18 we maintain full precision performance for 8-bit fixed-point quantization using DFQ. Some higher level approaches [16, 21] report slightly higher results than even our baseline model, likely because of a better training procedure than that of the standard PyTorch ResNet18 model. Since 8-bit quantization is lossless we also compare 6-bit results. DFQ clearly outperforms traditional per-layer quantization but stays slightly below per-channel quantization and higher level approaches such as QT and RQ [16, 21].
Overall DFQ sets a new state-of-the-art for 8-bit fixed-point quantization on several models and computer vision tasks. It is especially strong for mobile-friendly architectures such as MobileNetV1 and MobileNetV2 which were previously hard to quantize. Even though DFQ is a simple level 1 approach, we generally show competitive performance when comparing to more complex higher level approaches.
6 Conclusion
In this work, we introduced DFQ, a data-free quantization method that significantly improves quantized model performance without the need for data, fine-tuning or hyperparameter optimization. The method can be applied to many common computer vision architectures with a straightforward API call. This is crucial for many practical applications where engineers want to deploy deep learning models trained in FP32 to INT8 hardware without much effort. Results are presented for common computer vision tasks like image classification, semantic segmentation and object detection. We show that our method compares favorably to per-channel quantization [18], meaning that the more efficient per-tensor quantization can be employed in practice instead. DFQ achieves near-original model accuracy for almost every model we tested, and even competes with more complicated methods based on training procedures.
We also introduced a set of quantization levels to facilitate the discussion on the applicability of quantization methods. There is a difference in how simple a method is to use for generating a quantized model, which is a significant part of the impact potential of a quantization method in real-world applications. Hopefully more work can be done in this area to make INT8 quantization ubiquitous, greatly reducing the energy consumption and latency of models run on device and in the cloud.
Acknowledgements
We would like to thank Christos Louizos, Harris Teague, Jakub Tomczak, Mihir Jain and Pim de Haan for their helpful discussions and valuable feedback.
References
 [1] J. Achterhold, J. M. Koehler, A. Schmeink, and T. Genewein. Variational network quantization. In International Conference on Learning Representations (ICLR), 2018.
 [2] R. Alvarez, R. Prabhavalkar, and A. Bakhtin. On the efficient representation and execution of deep acoustic models. In The Annual Conference of the International Speech Communication Association (Interspeech), 2016.
 [3] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In The European Conference on Computer Vision (ECCV), September 2018.
 [4] J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan. PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
 [5] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pages 3123–3131, Cambridge, MA, USA, 2015. MIT Press.
 [6] M. Everingham, S. Eslami, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, January 2015.
 [7] J. Faraone, N. J. Fraser, M. Blott, and P. H. W. Leong. SYQ: learning symmetric quantization for efficient deep neural networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 4300–4309, 2018.
 [8] D. A. Gudovskiy and L. Rigazio. ShiftCNN: Generalized low-precision architecture for inference of convolutional neural networks. CoRR, abs/1706.02393, 2017.
 [9] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1737–1746, 2015.
 [10] P. Gysel, J. J. Pimentel, M. Motamedi, and S. Ghiasi. Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Trans. Neural Netw. Learning Syst., 29(11):5784–5789, 2018.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 1026–1034, 2015.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
 [13] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1398–1406, 2017.
 [14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [15] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18:187:1–187:30, 2017.
 [16] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [17] U. Köster, T. Webb, X. Wang, M. Nassar, A. K. Bansal, W. Constable, O. Elibol, S. Hall, L. Hornof, A. Khosrowshahi, C. Kloss, R. J. Pai, and N. Rao. Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1740–1750, 2017.
 [18] R. Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, Jun 2018.
 [19] F. Li and B. Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
 [20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, pages 21–37, 2016.
 [21] C. Louizos, M. Reisser, T. Blankevoort, E. Gavves, and M. Welling. Relaxed quantization for discretized neural networks. In International Conference on Learning Representations (ICLR), 2019.
 [22] E. Meller, A. Finkelstein, U. Almog, and M. Grobman. Same, same but different - recovering neural network quantization error through weight factorization. CoRR, abs/1902.01917, 2019.
 [23] A. K. Mishra, J. J. Cook, E. Nurvitadhi, and D. Marr. WRPN: training and inference using wide reduced-precision networks. arXiv preprint arXiv:1704.03079, 2017.
 [24] D. Miyashita, E. H. Lee, and B. Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.
 [25] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 807–814, 2010.
 [26] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
 [27] J. W. T. Peters and M. Welling. Probabilistic binary neural networks. arXiv preprint arXiv:1809.03368, 2018.
 [28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pages 525–542, 2016.
 [29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [30] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [31] T. Sheng, C. Feng, S. Zhuo, X. Zhang, L. Shen, and M. Aleksic. A quantization-friendly separable convolution for MobileNets. In 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), 2018.
 [32] K. Ullrich, E. Meeds, and M. Welling. Soft weight-sharing for neural network compression. In International Conference on Learning Representations (ICLR), 2017.
 [33] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Anal. Mach. Intell., 38(10):1943–1955, 2016.
 [34] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. arXiv preprint arXiv:1702.03044, 2017.
 [35] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
Appendix A Optimal range equalization of two layers
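Consider two fully-connected layers with weight matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ that we scale as in Section 4.1. We investigate the problem of optimizing the quantization ranges by rescaling the weight matrices by $\mathbf{S} = \mathrm{diag}(\mathbf{s})$, where $s_i > 0$, such that $\widehat{\mathbf{W}}^{(1)} = \mathbf{S}^{-1}\mathbf{W}^{(1)}$ and $\widehat{\mathbf{W}}^{(2)} = \mathbf{W}^{(2)}\mathbf{S}$ are the weight matrices after rescaling. We investigate the case of symmetric quantization, which also gives good results in practice for asymmetric quantization. We denote
$$\hat{r}^{(1)}_i = \frac{1}{s_i}\, r^{(1)}_i \tag{17}$$
$$\hat{r}^{(2)}_i = s_i\, r^{(2)}_i \tag{18}$$
$$\hat{R}^{(k)} = \max_i \hat{r}^{(k)}_i \tag{19}$$
where $\hat{\mathbf{r}}^{(k)}$ is the per-channel weight quantization range that is scaled by $\mathbf{S}$, $\hat{R}^{(k)}$ is the range for the full weight matrix, and $\mathbf{r}^{(k)}$ are the original unscaled ranges.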
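Using this in our optimization goal of eq. 6 leads to
$$\max_{\mathbf{S}} \sum_i \hat{p}^{(1)}_i \hat{p}^{(2)}_i = \max_{\mathbf{S}} \sum_i \frac{\hat{r}^{(1)}_i\, \hat{r}^{(2)}_i}{\hat{R}^{(1)} \hat{R}^{(2)}} \tag{20}$$
$$= \max_{\mathbf{S}} \sum_i \frac{\frac{1}{s_i} r^{(1)}_i \cdot s_i\, r^{(2)}_i}{\max_j \left(\frac{1}{s_j} r^{(1)}_j\right) \cdot \max_k \left(s_k r^{(2)}_k\right)} \tag{21}$$
$$= \sum_i r^{(1)}_i r^{(2)}_i \; \max_{\mathbf{S}} \frac{1}{\max_j \left(\frac{1}{s_j} r^{(1)}_j\right) \cdot \max_k \left(s_k r^{(2)}_k\right)} \tag{22}$$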
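We observe that the specific scaling $s_i$ of each channel cancels out as long as it does not increase $\hat{R}^{(k)}$, the range of the full weight matrix. We can reformulate the above to
$$\min_{\mathbf{S}} \left( \max_i \left(\frac{1}{s_i} r^{(1)}_i\right) \cdot \max_j \left(s_j r^{(2)}_j\right) \right) \tag{24}$$
which is minimal iff
$$\arg\max_i \frac{1}{s_i} r^{(1)}_i = \arg\max_j s_j r^{(2)}_j \tag{25}$$
We show this by contradiction. If $i \neq j$, there is a small positive $\epsilon$ such that $s'_i = s_i + \epsilon$ strictly decreases $\max_i \frac{1}{s_i} r^{(1)}_i$ without affecting $\max_j s_j r^{(2)}_j$. Therefore such a solution would not be optimal for eq. 24.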
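The condition from eq. 25 implies there is a limiting channel $i$ which defines the quantization range of both weight matrices $\widehat{\mathbf{W}}^{(1)}$ and $\widehat{\mathbf{W}}^{(2)}$. However, our optimization goal is not affected by the choice of the other $s_j$, given that the resulting $\hat{r}^{(1)}_j$ and $\hat{r}^{(2)}_j$ are no larger than $\hat{r}^{(1)}_i$ and $\hat{r}^{(2)}_i$, respectively. To break the ties between solutions we decide to set $\hat{r}^{(1)}_i = \hat{r}^{(2)}_i$ for all $i$. Thus the channels' ranges in both tensors are matched as closely as possible and the introduced quantization error is spread equally among both weight tensors. This results in our final rescaling factor
$$s_i = \frac{1}{r^{(2)}_i} \sqrt{r^{(1)}_i r^{(2)}_i} \tag{26}$$
which satisfies our necessary condition from eq. 25 and ensures that $\hat{r}^{(1)}_i = \hat{r}^{(2)}_i$ for all $i$.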
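The rescaling rule of eq. 26 can be checked numerically. The following is a minimal NumPy sketch rather than the authors' implementation: the function name `equalize_two_layers`, the bias-free ReLU layers, and the use of the per-channel maximum absolute weight as the range are assumptions made for illustration (any constant factor in the range definition cancels in eq. 26).

```python
import numpy as np


def equalize_two_layers(w1, w2):
    """Rescale two consecutive layers so that their per-channel ranges match.

    w1: (hidden, in) weights of the first layer, one row per hidden channel.
    w2: (out, hidden) weights of the second layer, one column per hidden channel.
    """
    r1 = np.abs(w1).max(axis=1)      # range of each output channel of layer 1
    r2 = np.abs(w2).max(axis=0)      # range of each input channel of layer 2
    s = np.sqrt(r1 * r2) / r2        # rescaling factor s_i from eq. 26
    w1_hat = w1 / s[:, None]         # S^{-1} W1
    w2_hat = w2 * s[None, :]         # W2 S
    return w1_hat, w2_hat, s


rng = np.random.default_rng(0)
# Hidden channels with very different weight ranges (illustrative values).
w1 = rng.normal(size=(4, 3)) * np.array([0.1, 1.0, 10.0, 5.0])[:, None]
w2 = rng.normal(size=(2, 4))
w1_hat, w2_hat, s = equalize_two_layers(w1, w2)

r1_hat = np.abs(w1_hat).max(axis=1)
r2_hat = np.abs(w2_hat).max(axis=0)

# The network function is unchanged because ReLU(s * x) = s * ReLU(x) for s > 0.
x = rng.normal(size=(5, 3))
y_ref = np.maximum(x @ w1.T, 0.0) @ w2.T
y_eq = np.maximum(x @ w1_hat.T, 0.0) @ w2_hat.T
```

After equalization, `r1_hat` and `r2_hat` coincide channel by channel, and `y_eq` matches `y_ref` up to floating-point error.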
Appendix B Bias Correction for Convolutional Layers
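Similarly to fully connected layers, we can compute $\boldsymbol{\epsilon}$ from $\mathbf{W}$ and $\widetilde{\mathbf{W}}$; it becomes a constant and we have that $\mathbb{E}[\boldsymbol{\epsilon} * \mathbf{x}] = \boldsymbol{\epsilon} * \mathbb{E}[\mathbf{x}]$. Expanding this yields:
$$\left[\boldsymbol{\epsilon} * \mathbb{E}[\mathbf{x}]\right]_{c_o i j} = \sum_{c_i, m, n} \mathbb{E}\left[x_{c_i,\, i-m,\, j-n}\right] \epsilon_{c_o c_i m n} \tag{27}$$
$$= \sum_{c_i} \mathbb{E}\left[x_{c_i}\right] \sum_{m, n} \epsilon_{c_o c_i m n} \tag{28}$$
where we assume that the expected value of each input channel is the same for all spatial dimensions in the input channel. Since the value of $\left[\boldsymbol{\epsilon} * \mathbb{E}[\mathbf{x}]\right]_{c_o i j}$ does not depend on the spatial dimensions $i$ and $j$, the expected error is the same for the full output channel and can be folded into the layer's bias parameter.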
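As an illustration of eqs. 27 and 28, the sketch below computes the per-output-channel expected error and compares it with a direct convolution over an input whose channels are constant. The symmetric per-tensor quantizer, the loop-based `conv2d_valid` helper, the 4-bit setting, and all shapes and values are assumptions chosen for brevity; none of them is prescribed by the text.

```python
import numpy as np


def quantize_symmetric(w, n_bits=8):
    """Round weights to a symmetric, per-tensor uniform grid (illustrative)."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / scale) * scale


def conv2d_valid(x, w):
    """Naive stride-1 convolution without padding (cross-correlation form).

    x: (c_in, H, W) input, w: (c_out, c_in, kh, kw) kernel.
    """
    c_out, _, kh, kw = w.shape
    out = np.zeros((c_out, x.shape[1] - kh + 1, x.shape[2] - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3, 3, 3))           # (c_out, c_in, kh, kw)
bias = rng.normal(size=4)
mean_x = np.array([0.5, 1.0, 2.0])          # assumed E[x] per input channel

eps = quantize_symmetric(w, n_bits=4) - w   # weight quantization error
# Eq. 28: sum the error kernel over (m, n), then weight by E[x] per input channel.
expected_err = np.einsum("oimn,i->o", eps, mean_x)
bias_corrected = bias - expected_err        # fold the correction into the bias

# Direct check: convolve eps with an input equal to its per-channel mean everywhere.
x_mean = np.broadcast_to(mean_x[:, None, None], (3, 6, 6))
err_map = conv2d_valid(x_mean, eps)
```

Every spatial position of `err_map` carries the same value per output channel, which is why a single scalar per channel in the bias is enough to absorb the expected error.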
Appendix C Clipped Normal Distribution
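Given a normally distributed random variable $x$ with mean $\mu$ and variance $\sigma^2$, and a clipped-linear function $f(\cdot)$ that clips its argument to the range $[a, b]$, s.t. $a < b$, the mean and variance of $f(x)$ can be determined using the standard rules of computing the mean and variance of a function:
$$\mu^c_{ab} = \int_{-\infty}^{\infty} f(x)\, p(x)\, dx \tag{30}$$
$$\left(\sigma^c_{ab}\right)^2 = \int_{-\infty}^{\infty} \left(f(x) - \mu^c_{ab}\right)^2 p(x)\, dx \tag{31}$$
where we define $p(x) = \mathcal{N}(x;\, \mu, \sigma^2)$, $\mu^c_{ab} = \mathbb{E}[f(x)]$ and $\left(\sigma^c_{ab}\right)^2 = \mathrm{Var}[f(x)]$.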
C.1 Mean of Clipped Normal Distribution
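Using the fact that $f(x)$ is constant if $x \notin [a, b]$ we have that:
$$\mu^c_{ab} = a \int_{-\infty}^{a} p(x)\, dx + \int_{a}^{b} x\, p(x)\, dx + b \int_{b}^{\infty} p(x)\, dx$$
The first and last term can be computed as $a\,\Phi(\alpha)$ and $b\,(1 - \Phi(\beta))$ respectively, where we define $\alpha = \frac{a - \mu}{\sigma}$, $\beta = \frac{b - \mu}{\sigma}$, and $\Phi(\cdot)$, the normal CDF with zero mean and unit variance.
The integral over the linear part of $f(\cdot)$ can be computed as:
$$\int_{a}^{b} x\, p(x)\, dx = C \int_{a}^{b} x\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\, dx = \sigma\left(\phi(\alpha) - \phi(\beta)\right) + \mu\left(\Phi(\beta) - \Phi(\alpha)\right)$$
where we define $\phi(\cdot) = \mathcal{N}(\cdot\,;\, 0, 1)$, i.e. the standard normal pdf, and $C = \frac{1}{\sigma\sqrt{2\pi}}$ is the normalization constant for a normal distribution with variance $\sigma^2$.
Thus
$$\mu^c_{ab} = \sigma\left(\phi(\alpha) - \phi(\beta)\right) + \mu\left(\Phi(\beta) - \Phi(\alpha)\right) + a\,\Phi(\alpha) + b\left(1 - \Phi(\beta)\right) \tag{32}$$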
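As a quick consistency check of eq. 32, consider the unbounded ReLU case, $a = 0$ and $b \to \infty$ (a special case chosen here for illustration). Then $\alpha = -\mu/\sigma$, $\phi(\beta) \to 0$, $\Phi(\beta) \to 1$ and $b\,(1 - \Phi(\beta)) \to 0$, so the clipped mean reduces to the well-known mean of a rectified Gaussian:

$$\mu^c_{0\infty} = \sigma\,\phi\!\left(-\frac{\mu}{\sigma}\right) + \mu\left(1 - \Phi\!\left(-\frac{\mu}{\sigma}\right)\right) = \sigma\,\phi\!\left(\frac{\mu}{\sigma}\right) + \mu\,\Phi\!\left(\frac{\mu}{\sigma}\right)$$

For $\mu = 0$ and $\sigma = 1$ this gives $\phi(0) = 1/\sqrt{2\pi}$, the expected value of a standard normal variable passed through a ReLU.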
C.2 Variance of Clipped Normal Distribution
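We again exploit the fact that $f(x)$ is constant if $x \notin [a, b]$:
$$\left(\sigma^c_{ab}\right)^2 = \int_{-\infty}^{a} \left(a - \mu^c_{ab}\right)^2 p(x)\, dx + \int_{a}^{b} \left(x - \mu^c_{ab}\right)^2 p(x)\, dx + \int_{b}^{\infty} \left(b - \mu^c_{ab}\right)^2 p(x)\, dx$$
The first and last term can be solved as $\left(a - \mu^c_{ab}\right)^2 \Phi(\alpha)$ and $\left(b - \mu^c_{ab}\right)^2 \left(1 - \Phi(\beta)\right)$ respectively.
The second term can be decomposed as follows:
$$\int_{a}^{b} \left(x - \mu^c_{ab}\right)^2 p(x)\, dx = \int_{a}^{b} x^2\, p(x)\, dx + Z\left(\left(\mu^c_{ab}\right)^2 - 2\,\mu^c_{ab}\,\mu^t_{ab}\right)$$
where we use the result from the previous subsection and define $Z = \Phi(\beta) - \Phi(\alpha)$, and where $\mu^t_{ab} = \mu + \sigma\left(\phi(\alpha) - \phi(\beta)\right)/Z$ is the mean of the truncated normal distribution.
Evaluating the first term yields:
$$\int_{a}^{b} x^2\, p(x)\, dx = Z\left(\mu^2 + \sigma^2\right) + \sigma\left(a\,\phi(\alpha) - b\,\phi(\beta)\right) + \sigma\mu\left(\phi(\alpha) - \phi(\beta)\right)$$
This results in:
$$\left(\sigma^c_{ab}\right)^2 = Z\left(\mu^2 + \sigma^2 + \left(\mu^c_{ab}\right)^2 - 2\,\mu^c_{ab}\,\mu\right) + \sigma\left(a\,\phi(\alpha) - b\,\phi(\beta)\right) + \sigma\left(\mu - 2\,\mu^c_{ab}\right)\left(\phi(\alpha) - \phi(\beta)\right) + \left(a - \mu^c_{ab}\right)^2 \Phi(\alpha) + \left(b - \mu^c_{ab}\right)^2 \left(1 - \Phi(\beta)\right) \tag{33}$$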
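The closed forms for the clipped mean and variance can be compared against a Monte Carlo estimate. The sketch below is one way to do this under stated assumptions: the function name `clipped_normal_moments`, the parameter values, and the sample size are illustrative choices, and finite clipping bounds are assumed.

```python
import math

import numpy as np


def _pdf(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def clipped_normal_moments(mu, sigma, a, b):
    """Mean and variance of clip(x, a, b) for x ~ N(mu, sigma^2), finite a < b."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    z = _cdf(beta) - _cdf(alpha)              # probability mass inside [a, b]
    d_pdf = _pdf(alpha) - _pdf(beta)
    # Closed-form mean of the clipped variable (eq. 32).
    mean = sigma * d_pdf + mu * z + a * _cdf(alpha) + b * (1.0 - _cdf(beta))
    # Closed-form variance of the clipped variable (eq. 33).
    var = (z * (mu ** 2 + sigma ** 2 + mean ** 2 - 2.0 * mean * mu)
           + sigma * (a * _pdf(alpha) - b * _pdf(beta))
           + sigma * (mu - 2.0 * mean) * d_pdf
           + (a - mean) ** 2 * _cdf(alpha)
           + (b - mean) ** 2 * (1.0 - _cdf(beta)))
    return mean, var


rng = np.random.default_rng(0)
mu, sigma, a, b = 0.5, 2.0, 0.0, 6.0          # illustrative values
samples = np.clip(rng.normal(mu, sigma, size=1_000_000), a, b)
mean_cf, var_cf = clipped_normal_moments(mu, sigma, a, b)
mean_mc, var_mc = samples.mean(), samples.var()
```

With a clipping range far wider than the distribution (for example $a = -100$, $b = 100$), the function returns the unclipped moments $\mu$ and $\sigma^2$, which is a useful sanity check.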
Appendix D Empirical quantization bias correction
Model                             FP32      INT8
--------------------------------  --------  --------
Original Model                    71.72%     0.12%
Empirical Bias Corr               71.72%    58.44%
Clip @ 15                         67.06%     2.55%
+ Empirical Bias Corr             70.55%    70.27%
Equalization + Bias Absorption    71.57%    70.92%
+ Empirical Bias Corr             71.57%    71.41%
If a network does not use batch normalization, or does not use batch normalization in all layers, a representative dataset can be used to compute the pre-activation means before and after quantization. The difference between the two can then be subtracted from the quantized model's pre-activations. This procedure can be run with unlabeled data. It should be run after BatchNorm folding and cross-layer range equalization. Clipping should be used in the quantized network, but not in the floating-point network. Since the activation function and the quantization operation are fused, this procedure is run on a network with quantized weights only. We bias-correct a layer only after all the layers feeding into it have been bias-corrected. The procedure is as follows:

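1. Run $N$ examples through the FP32 model and collect for each layer the per-channel pre-activation means $\mathbb{E}[\mathbf{y}]$.

2. For each layer $L$ in the quantized model:

   (a) Collect the per-channel pre-activation means $\mathbb{E}[\tilde{\mathbf{y}}]$ of layer $L$ for the same $N$ examples as in step 1.

   (b) Compute the per-channel biased quantization error $\mathbb{E}[\tilde{\mathbf{y}}] - \mathbb{E}[\mathbf{y}]$.

   (c) Subtract this error from the layer's bias parameter.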

The results of applying this procedure can be found in Table 6.
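The empirical procedure above can be sketched for a small fully connected network. This is an illustrative NumPy toy, not the code used for the reported results: the ReLU MLP, the helper names `preactivation_means` and `empirical_bias_correction`, the symmetric per-tensor weight quantizer, the 4-bit setting, and the random calibration data are all assumptions. Only the weights are quantized, and each layer is corrected using inputs produced by the already corrected layers before it.

```python
import numpy as np


def quantize_symmetric(w, n_bits=8):
    """Round weights to a symmetric, per-tensor uniform grid (illustrative)."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / scale) * scale


def preactivation_means(weights, biases, x):
    """Per-channel pre-activation means of every layer of a ReLU MLP."""
    means, h = [], x
    for w, b in zip(weights, biases):
        y = h @ w.T + b
        means.append(y.mean(axis=0))
        h = np.maximum(y, 0.0)
    return means


def empirical_bias_correction(weights, biases, x, n_bits=8):
    """Correct the biases of a weight-quantized MLP layer by layer."""
    fp_means = preactivation_means(weights, biases, x)   # step 1
    q_weights = [quantize_symmetric(w, n_bits) for w in weights]
    q_biases = [b.copy() for b in biases]
    h = x
    for k, w_q in enumerate(q_weights):                  # step 2
        y_q = h @ w_q.T + q_biases[k]                    # (a) quantized pre-activations
        error = y_q.mean(axis=0) - fp_means[k]           # (b) biased quantization error
        q_biases[k] -= error                             # (c) fold into the bias
        # Propagate through the corrected layer so later layers see corrected inputs.
        h = np.maximum(y_q - error, 0.0)
    return q_weights, q_biases


rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
biases = [rng.normal(size=8), rng.normal(size=3)]
x = rng.normal(size=(256, 4))            # unlabeled calibration examples

q_weights, q_biases = empirical_bias_correction(weights, biases, x, n_bits=4)
fp_means = preactivation_means(weights, biases, x)
q_means = preactivation_means(q_weights, q_biases, x)
uncorrected_means = preactivation_means(q_weights, biases, x)
```

On the calibration data, `q_means` matches `fp_means` for every layer, whereas `uncorrected_means` (quantized weights with the original biases) shows the shift that the correction removes.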