ACIQ: Analytical Clipping for Integer Quantization of neural networks
Unlike traditional approaches that focus on quantization at the network level, in this work we propose to minimize the quantization effect at the tensor level. We analyze the trade-off between quantization noise and clipping distortion in low precision networks. We identify the statistics of various tensors, and derive exact expressions for the mean-square-error degradation due to clipping. By optimizing these expressions, we show marked improvements over standard quantization schemes that normally avoid clipping. For example, simply by choosing accurate clipping values, an accuracy improvement of more than 40% is obtained for the quantization of VGG16-BN to 4 bits of precision. Our results have many applications for the quantization of neural networks at both training and inference time. One immediate application is the rapid deployment of neural networks to low-precision accelerators without time-consuming fine-tuning or access to the full datasets.
Ron Banner, Yury Nahshan, Elad Hoffer, Daniel Soudry
(1) Intel - Artificial Intelligence Products Group (AIPG)
(2) Technion - Israel Institute of Technology, Haifa, Israel
1 Introduction
A significant drawback of deep learning models is their computational cost. Low precision is one of the key techniques actively studied in recent years to address this problem. With hardware support, low precision training and inference can compute more operations per second, reduce memory bandwidth and power consumption, and allow bigger networks to fit into a device.
In general, a low-precision scheme involves a floating-point to integer conversion, which introduces quantization noise into the network. This quantization noise is strongly linked to the dynamic range, defined as the range between the largest and smallest values that need to be quantized. For a given $M$-bit integer representation, a smaller dynamic range leads to a smaller spacing between the quantization levels, enabling improved resolution and smaller quantization noise. To reduce this quantization noise, the dynamic range can be limited by clipping the values in the tensor. This clipping process introduces an additional noise because of the loss of information that would otherwise be carried by the clipped portion of the tensor. Hence, a trade-off between clipping and quantization effects exists. To find the best clipping value, we need to minimize the information loss.
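To make the trade-off concrete, the following minimal NumPy sketch (ours, not the authors' released code; all names are illustrative) quantizes a synthetic Laplace-distributed tensor to 4 bits with several clipping values. Tightening the clip shrinks the step size but discards the tails, so the measured MSE first drops and then rises again:

```python
import numpy as np

def uniform_quantize(x, alpha, num_bits):
    """Clip x to [-alpha, alpha] and round to the midpoints of 2**num_bits uniform bins."""
    x_clipped = np.clip(x, -alpha, alpha)
    delta = 2.0 * alpha / (2 ** num_bits)                         # quantization step
    bin_idx = np.clip(np.floor((x_clipped + alpha) / delta), 0, 2 ** num_bits - 1)
    return -alpha + (bin_idx + 0.5) * delta                       # midpoint of the bin

# Synthetic "activations": zero-mean Laplace values, quantized to 4 bits.
x = np.random.laplace(loc=0.0, scale=1.0, size=100_000)
for alpha in (np.abs(x).max(), 8.0, 5.0, 3.0):                    # from "no clipping" to aggressive clipping
    mse = np.mean((x - uniform_quantize(x, alpha, num_bits=4)) ** 2)
    print(f"alpha = {alpha:6.2f}   MSE = {mse:.5f}")
```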
In this paper, we study the effect of clipping with the aim of improving overall quantization noise. To this end, we first study the distribution of values within these tensors. In all our measurements, the statistical distributions of weights and activations are observed to follow a bell-curve. This indicates that large values occur very rarely compared to small values, and suggests that the loss of information due to the clipping process might be compensated by improving the resolution of the more common smaller values.
To optimize this process further, it is essential to understand the underlying distributions of tensors before applying the clipping. By running a few statistical tests on a variety of convolutional models, we observe that activation tensors follow either a Gaussian or a Laplace distribution with a high degree of certainty (p-value 0.01). This modeling of activation tensors enables a clear formulation of the mean-square quantization error and constitutes the first step toward its minimization. In Section 4, we provide a rigorous formulation for optimizing the quantization of activation tensors by clipping, analyzing both the Gaussian and the Laplace priors. This formulation is henceforth referred to as Analytical Clipping for Integer Quantization (ACIQ).
These analytical results have many applications for the quantization of neural networks at both training and inference time. For example, a straightforward quantization of the weights and activations to an 8-bit fixed point representation has been shown to have a negligible effect on model accuracy. Yet, in the majority of applications, further reduction of precision quickly degrades performance, calling for an optimal clipping scheme that minimizes information loss during quantization. On a more general level, exploiting the statistics of the activation tensors to minimize their quantization cost is orthogonal to other techniques for network quantization; it can work in synergy with other schemes to achieve more than each could individually. Finally, it is easy to implement and requires only adjusting the clipping value according to an analytical formula.
We further demonstrate the applicability and usability of our analytic terms on the following challenging problem. Given a pre-trained network, we would like to quantize the weights to 8 bits of precision and most activations to 4 bits of precision without any further processing (e.g., re-training). This specific setting is of particular interest because quantizing activations to 4 bits alleviates a major memory-bandwidth bottleneck. Prior attempts using standard techniques (Krishnamoorthi, 2018) show severe degradation in accuracy. While several recent works were able to overcome this issue with additional re-training (McKinstry et al., 2018), this is not feasible in many practical settings, e.g., we often do not have access to the dataset on which the network operates. By using smart clipping, we were able to significantly improve validation accuracy for configurations with 8-bit weights and 4-bit activations, as summarized in Table 1.
The methods introduced in this work may additionally be useful for current and future applications, such as attempts to fully train in a low-precision setting (Banner et al., 2018). In this scenario, the statistics of intermediate activations and weights are known to change throughout the training process, further highlighting the need for precise and careful scale determination.
2 Previous work
In many cases, taking a model trained at full precision and directly quantizing it to 8-bit precision, without any re-training, can result in a relatively low loss of accuracy (Jacob et al., 2017a; Gong et al., 2018; Krishnamoorthi, 2018). More recently, (Banner et al., 2018; Wu et al., 2018) have shown that 8-bit precision is sufficient not just for running trained models, but also for training them. Yet, naively quantizing a full precision model below 8 bits usually incurs significant accuracy degradation.
Many attempts have been made to diminish this effect, but they usually require training the model with quantization constraints or modifying the network structure (Lin et al., 2017). Recent work has been able to quantize weights and activations to 4 bits, but requires a long training time of 110 epochs (McKinstry et al., 2018). Other concepts to enable quantization below 8 bits have been proposed (Rastegari et al., 2016; Zhou et al., 2016; Choi et al., 2018). Per-channel scaling has been shown to be important both for inference (Rastegari et al., 2016) and for training (Wu et al., 2018). The concept of non-uniform quantization has recently been suggested by (Park et al., 2018) and (Baskin et al., 2018).
To the best of our knowledge, there have been only a few previous attempts to clip activations. Nvidia proposed an iterative method to search for a good clipping threshold based on the Kullback-Leibler divergence measure (Migacz, 2017). More recently, (Choi et al., 2018; Jung et al., 2018) have proposed an activation clipping parameter that is optimized during training. Finally, Wu et al. (2018) proposed a heuristic approach to adjust the clipping values according to the desired precision. In contrast to previous works, we propose a simple and efficient method to find the optimal clipping threshold analytically from the distribution of the tensor, by minimizing the mean-square error (MSE). Our method takes into account not only the target precision but also the statistics of the tensor. We prove analytically that, for a given distribution, our method finds the clipping threshold that minimizes the MSE. We then demonstrate its usefulness on several ImageNet topologies.
3 Probability distribution fitting for activations
In this section, we construct an estimate of the underlying probability density function of the activation tensors in deep learning models. Traditionally, no clipping is performed in the standard GEMMLOWP flow. Hence, quantization levels are uniformly spaced between the largest and the smallest values in the tensor. Yet, this approach is non-optimal, due to the fact that activation tensors have bell-shaped distributions. Therefore, to reduce quantization noise (or increase resolution), it might be desirable to clip the values above a certain threshold. Specifically, we need to find the best clipping value that, on the one hand, maintains a low clipping rate, but, on the other hand, improves resolution. To do so we must first understand the underlying data distribution.
To that end, we collect the data of various tensors at different layers. We observe that the data is, in general, symmetrically distributed around a mean. We next estimate the goodness of fit to several bell-shaped distributions by measuring the maximum vertical distance between the cumulative distribution function (CDF) of the empirically observed distribution and the CDF of the reference distribution (the Kolmogorov-Smirnov test; Lopes, 2011). By considering the activations of all layers of ResNet50 on ImageNet, we obtain the average Kolmogorov-Smirnov distance to each of several candidate distributions, together with the corresponding p-values.
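As an illustration of this measurement (a sketch with hypothetical helper names, using SciPy's kstest for the Kolmogorov-Smirnov statistic), the snippet below fits Normal and Laplace priors to a flattened activation tensor and reports the distance to each:

```python
import numpy as np
from scipy import stats

def ks_distance_to_priors(activations):
    """Kolmogorov-Smirnov statistic between the empirical CDF and two fitted priors."""
    a = np.asarray(activations).ravel()
    mu, sigma = a.mean(), a.std()
    loc = np.median(a)
    b = np.mean(np.abs(a - loc))                  # maximum-likelihood Laplace scale
    d_normal, _ = stats.kstest(a, 'norm', args=(mu, sigma))
    d_laplace, _ = stats.kstest(a, 'laplace', args=(loc, b))
    return {'normal': d_normal, 'laplace': d_laplace}

# In practice the activations would be collected from the layers of a pre-trained
# model (e.g., ResNet50 on ImageNet); here we use synthetic data for illustration.
acts = np.random.laplace(0.0, 1.0, size=50_000)
print(ks_distance_to_priors(acts))
```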
The best fit is established by the Laplace and Normal distributions. In Figure 3 (see Appendix), we plot the normalized distributions of the activation tensors at different layers of ResNet50, and compare them against both priors.
4 An analytical model of tensor quantization
In this section, we provide a detailed analysis for establishing the optimal clipping values under either Gaussian or Laplace distributions. We first derive a generic expression, for any given distribution, for the expected MSE as a function of the clipping value. We then use this expression to develop a specific expression for each distribution. Finally, we establish the optimal clipping values by solving the equations for which the derivative with respect to the clipping value is set to zero.
Let $X$ be a high precision random variable with a probability density function $f(x)$. Without loss of generality, we assume that a preprocessing step has been made so that the average value in the tensor is zero, i.e., $E[X]=0$ (we do not lose generality since we can always subtract and add this mean). Assuming a bit-width $M$, we would like to quantize the values in the tensor uniformly to $2^M$ discrete values.
Commonly (e.g., in GEMMLOWP), integer tensors are uniformly quantized in the range $[-\alpha, \alpha]$, where $\alpha$ is determined by the tensor's maximal absolute value. In the following we show that this choice of $\alpha$ is suboptimal, and suggest a model where the tensor values are clipped in the range $[-\alpha, \alpha]$ to reduce quantization noise. For any $x \in \mathbb{R}$, we define the clipping function $\text{clip}(x, \alpha)$ as follows:
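Consistent with the surrounding text, the clipping function takes the standard form (a reconstruction, since the display equation did not survive extraction):

$$
\text{clip}(x, \alpha) =
\begin{cases}
x & \text{if } |x| \le \alpha,\\[2pt]
\text{sign}(x)\cdot \alpha & \text{if } |x| > \alpha.
\end{cases}
$$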
Denoting by $\alpha$ the clipping value, the range $[-\alpha, \alpha]$ is partitioned into $2^M$ equal quantization regions. Hence, the quantization step $\Delta$ between two adjacent quantized values is established as follows:
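With $2^M$ equal bins covering $[-\alpha, \alpha]$, the step is simply:

$$
\Delta = \frac{2\alpha}{2^M}.
$$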
Our model assumes that values are rounded to the midpoint of their region (bin), i.e., for every index $i \in \{0, \dots, 2^M - 1\}$, all values that fall in $[-\alpha + i\Delta,\, -\alpha + (i+1)\Delta]$ are rounded to the midpoint $q_i = -\alpha + (2i+1)\frac{\Delta}{2}$. Then, the expected mean-square-error between $X$ and its quantized version $Q(X)$ can be written as follows:
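Written out with the clipped tails measured relative to the saturation values $\pm\alpha$, a reconstruction of the expression the text refers to as Equation 3 is:

$$
E\big[(X - Q(X))^2\big] \;=\; \int_{-\infty}^{-\alpha} f(x)\,(x+\alpha)^2\,dx \;+\; \sum_{i=0}^{2^M-1} \int_{-\alpha + i\Delta}^{-\alpha + (i+1)\Delta} f(x)\,(x - q_i)^2\,dx \;+\; \int_{\alpha}^{\infty} f(x)\,(x-\alpha)^2\,dx .
$$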
Equation 3 is composed of three parts. The first and last terms quantify the contribution of the clipped tails $x < -\alpha$ and $x > \alpha$ to the expected mean-square-error. Note that for distributions that are symmetric around zero (e.g., Gaussian $N(0, \sigma^2)$ or Laplace$(0, b)$), these two terms are equal and their sum can therefore be evaluated by multiplying either term by 2.
The second term corresponds to the expected mean-square-error when the range $[-\alpha, \alpha]$ is quantized uniformly to $2^M$ discrete levels. We take the common additive orthogonal noise model for the analysis of quantization error, where the quantization noise has no correlation with $X$ and its probability density function is approximately uniform. The validity of this model is well established in cases of sufficient resolution; for more details see Marco & Neuhoff (2005). Accordingly, given a quantization level $q_i$, we approximate the second term by replacing the probability density function $f(x)$ with a uniform density, as follows:
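Under one consistent reading of this approximation (a uniform density $\tfrac{1}{2\alpha}$ over the clipping range), the second term becomes:

$$
\sum_{i=0}^{2^M-1} \int_{-\alpha + i\Delta}^{-\alpha + (i+1)\Delta} f(x)\,(x - q_i)^2\,dx \;\approx\; \frac{1}{2\alpha}\sum_{i=0}^{2^M-1} \int_{-\alpha + i\Delta}^{-\alpha + (i+1)\Delta} (x - q_i)^2\,dx \;=\; 2^M \cdot \frac{\Delta^3}{24\,\alpha}.
$$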
Substituting $\Delta = \frac{2\alpha}{2^M}$ provides an evaluation of the second term in Equation 3, and hence of the overall expected mean-square-error, as follows:
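The substitution gives $\frac{\alpha^2}{3 \cdot 2^{2M}}$ for the second term, so the approximate mean-square-error (the expression the text refers to as Equation 5, reconstructed here) reads:

$$
E\big[(X - Q(X))^2\big] \;\approx\; \int_{-\infty}^{-\alpha} f(x)\,(x+\alpha)^2\,dx \;+\; \int_{\alpha}^{\infty} f(x)\,(x-\alpha)^2\,dx \;+\; \frac{\alpha^2}{3 \cdot 2^{2M}} .
$$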
In the following we provide a closed-form solution for the case where the probability density function $f(x)$ is either Gaussian $N(0, \sigma^2)$ or Laplace$(0, b)$.
4.1 Laplace Clipping
In the following we develop an expression based on Equation 5 for the Laplace case. Since we assume that $\mu = 0$, we have the Laplace density function $f(x) = \frac{1}{2b} e^{-|x|/b}$. In order to derive a closed-form solution for Equation 5, we need to evaluate the following integral:
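This is the clipping-noise term over a single tail (referred to below as Equation 6):

$$
\int_{\alpha}^{\infty} f(x)\,(x-\alpha)^2\,dx \;=\; \int_{\alpha}^{\infty} \frac{1}{2b}\, e^{-x/b}\,(x-\alpha)^2\,dx .
$$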
Let $g(x) = -\,e^{-x/b}\left(\frac{(x-\alpha)^2}{2} + b\,(x-\alpha) + b^2\right)$. By taking the derivative of $g(x)$ with respect to $x$, it is easy to see that $g(x)$ is the correct antiderivative of the integrand in Equation 6. Hence,
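evaluating $g$ at the limits of integration (and noting that $g(x) \to 0$ as $x \to \infty$) gives:

$$
\int_{\alpha}^{\infty} \frac{1}{2b}\, e^{-x/b}\,(x-\alpha)^2\,dx \;=\; g(\infty) - g(\alpha) \;=\; b^2\, e^{-\alpha/b} .
$$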
We can finally state Equation 5 for the Laplace case as follows:
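Doubling the tail term (by symmetry) and adding the quantization term from our reconstruction of Equation 5 gives the Laplace mean-square-error (the expression the text refers to as Equation 8):

$$
E\big[(X - Q(X))^2\big] \;\approx\; 2\,b^2\, e^{-\alpha/b} \;+\; \frac{\alpha^2}{3 \cdot 2^{2M}} .
$$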
Finally, in order to find the clipping value $\alpha$ that gives the minimum MSE, the corresponding derivative with respect to $\alpha$ is set equal to zero as follows:
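Under this approximation, the optimality condition (corresponding to what the text calls Equation 9) is:

$$
\frac{\partial}{\partial \alpha}\, E\big[(X - Q(X))^2\big] \;=\; \frac{2\alpha}{3 \cdot 2^{2M}} \;-\; 2\,b\, e^{-\alpha/b} \;=\; 0
\qquad\Longleftrightarrow\qquad
\frac{\alpha}{b}\, e^{\alpha/b} \;=\; 3 \cdot 2^{2M} .
$$

This has no elementary closed-form solution, but is easily solved numerically and scales linearly with $b$; for example, under this approximation the numerical solutions are $\alpha^* \approx 2.83\,b$, $3.90\,b$, and $5.03\,b$ for $M = 2, 3, 4$.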
4.2 Gaussian Clipping
We now turn to evaluate Equation 5 for the Gaussian case. Given a Gaussian random variable $X \sim N(0, \sigma^2)$ with density $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2/2\sigma^2}$, the integrand of Equation 6 again admits a closed-form antiderivative, this time expressed in terms of the Gaussian error function. As in Subsection 4.1, one can verify by differentiation that it is the correct antiderivative, and evaluating it on the range $[\alpha, \infty)$ yields Equation 6 for the Gaussian case as follows:
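One standard closed form for this tail integral, verifiable by differentiation and consistent with the text (here $\Phi$ denotes the standard normal CDF), is:

$$
\int_{\alpha}^{\infty} f(x)\,(x-\alpha)^2\,dx \;=\; \big(\alpha^2 + \sigma^2\big)\,\big[1 - \Phi(\alpha/\sigma)\big] \;-\; \frac{\alpha\,\sigma}{\sqrt{2\pi}}\, e^{-\alpha^2/(2\sigma^2)} .
$$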
Equation 5 can thus be written for the case of a Gaussian distribution as follows:
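In our reconstruction, doubling the tail term and adding the quantization term gives (the expression the text refers to as Equation 11):

$$
E\big[(X - Q(X))^2\big] \;\approx\; 2\,\big(\alpha^2 + \sigma^2\big)\,\big[1 - \Phi(\alpha/\sigma)\big] \;-\; \sqrt{\tfrac{2}{\pi}}\;\alpha\,\sigma\, e^{-\alpha^2/(2\sigma^2)} \;+\; \frac{\alpha^2}{3 \cdot 2^{2M}} .
$$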
In order to find the optimal clipping value $\alpha$ for which the mean-square-error is minimized, we differentiate with respect to $\alpha$ and set the derivative equal to zero as follows:
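Differentiating the reconstructed expression above and simplifying yields the optimality condition (corresponding to what the text calls Equation 12, under our approximation):

$$
2\alpha\,\big[1 - \Phi(\alpha/\sigma)\big] \;-\; \sqrt{\tfrac{2}{\pi}}\;\sigma\, e^{-\alpha^2/(2\sigma^2)} \;+\; \frac{\alpha}{3 \cdot 2^{2M}} \;=\; 0 ,
$$

which, like the Laplace case, is solved numerically; the optimal $\alpha^*$ scales linearly with $\sigma$.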
5 Experiments
We evaluate the proposed analytical clipping on multiple models, and demonstrate its usefulness by suggesting a fast method to quantize neural networks aggressively and without a significant accuracy drop. Our method is simple and efficient, and does not require re-training. During quantization we only adjust the clipping values according to our analytical formula, and compare the result against the standard GEMMLOWP quantization (Jacob et al., 2017b), which avoids clipping. We consider a mixed-precision scheme of 4-bit and 8-bit quantization. All weights are kept at 8-bit precision. The goal is to quantize as many activation layers as possible to 4 bits of precision without a significant accuracy degradation. This setting is important because activation tensors constitute a major memory bottleneck for inference in large models. The code to replicate all our experiments is available online at https://github.com/submission2019/AnalyticalScaleForIntegerQuantization.
For each activation tensor, we use Equations 8 and 11 to evaluate the expected mean-square-error (MSE). Based on the expected MSE and a quantization requirement (i.e., the number of activation layers that need to be quantized to 4 bits), we decide whether to keep the tensor at 8 bits of precision or quantize it to a 4-bit representation. Our analysis in Section 4 suggests that no clipping is required when a tensor is quantized to 8 bits of precision (see Figure 1). On the other hand, clipping is important when the tensor is quantized to 4 bits of precision. In that case, we use the optimal clipping values stored in a lookup table: given a tensor with estimated statistical parameters (e.g., standard deviation or mean), we retrieve the optimal clipping value that was pre-calculated offline (by solving either Equation 9 or Equation 12) and stored for fast retrieval.
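A minimal sketch of how such a lookup table could be pre-computed offline under the approximations of Section 4 (function and variable names are ours, not from the released code):

```python
import numpy as np
from scipy import optimize, stats

def optimal_clip_laplace(num_bits, b=1.0):
    """Root of the Laplace optimality condition: 2*alpha/(3*4^M) - 2*b*exp(-alpha/b) = 0."""
    dmse = lambda alpha: 2 * alpha / (3 * 4 ** num_bits) - 2 * b * np.exp(-alpha / b)
    return optimize.brentq(dmse, 1e-6, 100 * b)

def optimal_clip_gauss(num_bits, sigma=1.0):
    """Minimize the approximate MSE for a zero-mean Gaussian tensor."""
    def mse(alpha):
        clip_term = ((alpha**2 + sigma**2) * stats.norm.sf(alpha / sigma)
                     - alpha * sigma * stats.norm.pdf(alpha / sigma))
        return 2 * clip_term + alpha**2 / (3 * 4 ** num_bits)
    return optimize.minimize_scalar(mse, bounds=(0.0, 100.0 * sigma), method='bounded').x

# Offline: tabulate the optimal clip-to-scale ratios once per bit-width.
lookup = {M: (optimal_clip_laplace(M), optimal_clip_gauss(M)) for M in (2, 3, 4)}

# Online: given an activation tensor, estimate its scale and read the clip value back.
def clip_value_for(tensor, num_bits, prior='laplace'):
    t = np.asarray(tensor).ravel()
    scale = np.mean(np.abs(t - np.median(t))) if prior == 'laplace' else t.std()
    ratio = lookup[num_bits][0 if prior == 'laplace' else 1]
    return ratio * scale
```

Because the optimal clipping value scales linearly with the distribution's scale parameter ($b$ or $\sigma$), only the ratio per bit-width needs to be stored; the per-tensor value is then a single multiplication at run time.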
We ran experiments to see how many activation layers can be quantized from 8 to 4 bits of precision without a significant accuracy drop. These experiments were made with ResNet-18, ResNet-50, ResNet-101, VGG-16, VGG-16 with batch normalization, and Inception-v3 on the ImageNet dataset. For each method, we select layers (tensors) to be quantized to 4 bits using our optimized clipping method and compare it against the standard GEMMLOWP approach. In Figure 2 we present this accuracy-quantization trade-off. Our results clearly show the significant potential of our optimized clipping method. For example, just by choosing the correct clipping values, more than half of the layers can be quantized to 4 bits with an accuracy drop of less than 1% for ResNet-18, ResNet-50 and both versions of VGG-16. Also, unlike the standard approach, the degradation is much more predictable and follows a roughly linear trend. Therefore, by adjusting the number of layers, we can conveniently and linearly control the accuracy-quantization trade-off, as can be seen from the linear relation shown in Figures 1(g), 1(h) and 1(i).
6 Conclusion
We introduce ACIQ, an optimized clipping framework for improved quantization of neural networks. Optimized clipping is shown to have a drastic impact on quantization in a variety of models. The underlying reason lies in the statistical dispersion of activations, where large values occur very rarely. We show that the bell-curve statistics of activations are best fit by either Laplace or Gaussian distributions, and formulate the clipping process as an optimization problem. The solution to this optimization problem is a polynomial-exponential equation that can be solved numerically for a variety of statistical parameters and stored in a lookup table for fast retrieval. This scheme is very simple and easy to implement either in software or in hardware.
While the results are very encouraging, this work is only the first step toward the successful deployment of clipping in neural networks. First, our main focus in this work is the quantization of activations; a similar evaluation still needs to be done for weights. On a more general level, our framework is not restricted to the inference setting and can be extended to training. For example, our preliminary results show that quantization of gradients might benefit from the clipping of small values (i.e., sparsification). Establishing the correct threshold for gradients is another important direction for future work. While much work remains to be done on optimized clipping, we believe this work clearly demonstrates the major importance of this concept for the quantization of neural networks.
References
- Banner et al. (2018) Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training of neural networks. arXiv preprint arXiv:1805.11046, 2018.
- Baskin et al. (2018) Chaim Baskin, Eli Schwartz, Evgenii Zheltonozhskii, Natan Liss, Raja Giryes, Alex M Bronstein, and Avi Mendelson. Uniq: Uniform noise injection for the quantization of neural networks. arXiv preprint arXiv:1804.10969, 2018.
- Choi et al. (2018) Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
- Gong et al. (2018) Jiong Gong, Haihao Shen, Guoming Zhang, Xiaoli Liu, Shane Li, Ge Jin, Niharika Maheshwari, Evarist Fomenko, and Eden Segal. Highly efficient 8-bit low precision inference of convolutional neural networks with intelcaffe. arXiv preprint arXiv:1805.08691, 2018.
- Jacob et al. (2017a) Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. arXiv preprint arXiv:1712.05877, 2017a.
- Jacob et al. (2017b) Benoit Jacob et al. gemmlowp: a small self-contained low-precision gemm library, 2017b.
- Jung et al. (2018) Sangil Jung, Changyong Son, Seohyung Lee, Jinwoo Son, Youngjun Kwak, Jae-Joon Han, and Changkyu Choi. Joint training of low-precision neural network with quantization interval parameters. arXiv preprint arXiv:1808.05779, 2018.
- Krishnamoorthi (2018) Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.
- Lin et al. (2017) Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345–353, 2017.
- Lopes (2011) Raul HC Lopes. Kolmogorov-smirnov test. In International encyclopedia of statistical science, pp. 718–720. Springer, 2011.
- Marco & Neuhoff (2005) Daniel Marco and David L Neuhoff. The validity of the additive noise model for uniform scalar quantizers. IEEE Transactions on Information Theory, 51(5):1739–1755, 2005.
- McKinstry et al. (2018) Jeffrey L McKinstry, Steven K Esser, Rathinakumar Appuswamy, Deepika Bablani, John V Arthur, Izzet B Yildiz, and Dharmendra S Modha. Discovering low-precision networks close to full-precision networks for efficient embedded inference. arXiv preprint arXiv:1809.04191, 2018.
- Migacz (2017) S Migacz. 8-bit inference with tensorrt. In GPU Technology Conference, 2017.
- Park et al. (2018) Eunhyeok Park, Sungjoo Yoo, and Peter Vajda. Value-aware quantization for training and inference of neural networks. arXiv preprint arXiv:1804.07802, 2018.
- Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.
- Wu et al. (2018) Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680, 2018.
- Zhou et al. (2016) Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.