Post-training Quantization with Multiple Points
Abstract
We consider the post-training quantization problem, which discretizes the weights of pretrained deep neural networks without retraining the model. We propose multi-point quantization, a quantization method that approximates a full-precision weight vector using a linear combination of multiple vectors of low-bit numbers; this is in contrast to typical quantization methods that approximate each weight using a single low-precision number. Computationally, we construct the multi-point quantization with an efficient greedy selection procedure, and adaptively decide the number of low-precision points on each quantized weight vector based on the error of its output. This allows us to achieve higher precision levels for important weights that greatly influence the outputs, yielding an “effect of mixed precision” without physical mixed-precision implementations (which require specialized hardware accelerators (Wang et al., 2019)). Empirically, our method can be implemented with common operands, bringing almost no memory or computation overhead. We show that our method outperforms a range of state-of-the-art methods on ImageNet classification and that it generalizes to more challenging tasks such as PASCAL VOC object detection.
1 Introduction
The past decade has witnessed the great success of deep neural networks (DNNs) in many fields. Nonetheless, DNNs require expensive computational resources and enormous storage space, which makes them difficult to deploy on resource-constrained devices, such as Internet of Things (IoT) devices, smartphone processors, and embedded controllers in mobile robots (Howard et al., 2017; Xu et al., 2018).
Quantization is a promising approach for building more energy-efficient deep learning systems (Han et al., 2015; Hubara et al., 2017; Zmora et al., 2018; Cheng et al., 2018). By approximating real-valued weights and activations with low-bit numbers, quantized neural networks (QNNs) trained with state-of-the-art algorithms (e.g., Courbariaux et al., 2015; Rastegari et al., 2016; Louizos et al., 2018; Li et al., 2019) can perform comparably to their full-precision counterparts (e.g., Jung et al., 2019; Li et al., 2019).
This work focuses on the problem of post-training quantization, which aims to generate a QNN from a pretrained full-precision network without accessing the original training data (e.g., Sung et al., 2015; Krishnamoorthi, 2018; Zhao et al., 2019; Meller et al., 2019; Banner et al., 2019; Nagel et al., 2019; Choukroun et al., 2019). This scenario arises widely in practice. For example, when a client wants to deploy a full-precision model provided by a machine learning service provider in low precision, the client may have no access to the original training data due to privacy policies. In addition, compared with training QNNs from scratch, post-training quantization is much more computationally efficient.
Mixed precision is a recent technique for boosting the performance of QNNs (Wang et al., 2019; Banner et al., 2019; Gong et al., 2019; Dong et al., 2019). The idea is to assign more bits to important layers (or channels) and fewer bits to unimportant ones, so as to better control the overall quantization error and balance accuracy against cost. The difficulty, however, is that current mixed-precision methods require specialized hardware (e.g., Wang et al., 2019); most commodity hardware does not support efficient mixed-precision computation (e.g., due to chip area constraints (Horowitz, 2014)). This makes mixed precision difficult to implement in practice, despite being highly desirable.
In this paper, we propose multi-point quantization for post-training quantization, which achieves flexibility similar to mixed precision while using only a single precision level. The idea is to approximate a full-precision weight vector by a linear combination of multiple low-bit vectors. This allows us to use a larger number of low-bit vectors to approximate the weights of important channels, and fewer points for insensitive channels. It enables a flexible trade-off between accuracy and cost on a per-channel basis, while using only a single precision level. Because it does not require a physical mixed-precision implementation, our method can be easily deployed on commodity hardware with common operands.
We propose a greedy algorithm that iteratively finds the optimal low-bit vectors to minimize the approximation error. The algorithm sequentially adds the low-bit vector that yields the largest improvement on the error, until a stopping criterion is met. We develop a theoretical analysis showing that the error decays exponentially with the number of low-bit vectors used. This fast decay ensures a small overhead from the additional points.
Our multi-point quantization is computationally efficient. The key advantage is that it only involves multiply-accumulate (MAC) operations during inference, which are highly optimized on common deep learning devices. We adaptively decide the number of low-precision points for each channel by measuring its output error. Empirically, only a small number of channels require a large number of points; by applying multi-point quantization to these channels, the performance of the QNN improves significantly without any training or fine-tuning, at a negligible increase in memory cost.
We conduct experiments on ImageNet classification with different neural architectures. Our method performs favorably against state-of-the-art methods; it even outperforms, in accuracy, the method of Banner et al. (2019), which exploits physical mixed precision. We also verify the generality of our approach by applying it to PASCAL VOC object detection.
2 Method
Section 2.1 introduces background on post-training quantization. We then present the main framework of multi-point quantization in Section 2.2, its application to deep neural networks in Section 2.3, and its implementation overhead in Section 2.4.
2.1 Preliminaries: Post-training Quantization
Given a pretrained full-precision neural network $f$, the goal of post-training quantization is to generate a quantized neural network (QNN) $\hat f$ with high performance. We assume the full training dataset of $f$ is unavailable, but there is a small calibration dataset $\mathcal{D}_c$, whose size is very small. The calibration set is used only for choosing a small number of hyperparameters of our algorithm; we cannot directly train on it because it is too small and would cause overfitting.
The $b$-bit linear quantization approximates real numbers using the following quantization set $\mathcal{Q}$:

$$\mathcal{Q}(\alpha, \beta, b) = \alpha\, \mathbb{U}(b) + \beta, \qquad (1)$$

where $\mathbb{U}(b)$ denotes the uniform grid on $[-1, 1]$ with increment $2/(2^b - 1)$ between elements, $\alpha$ is a scaling factor that controls the length of $\mathcal{Q}$, and $\beta$ specifies the center of $\mathcal{Q}$.
Then we map a floating-point number $w$ to $\mathcal{Q}$ by

$$q(w) = \operatorname*{arg\,min}_{q' \in \mathcal{Q}} |w - q'|, \qquad (2)$$

where $q(\cdot)$ denotes the nearest-rounding operator w.r.t. $\mathcal{Q}$. For a real vector $\boldsymbol w = [w_1, \dots, w_d]^\top$, we map it to $\mathcal{Q}^d$ elementwise:

$$q(\boldsymbol w) = [q(w_1), \dots, q(w_d)]^\top. \qquad (3)$$
Further, $q(\cdot)$ can be generalized to higher-dimensional tensors by first flattening them into one-dimensional vectors and then applying Eq. (3).
Since all values larger than $\beta + \alpha$ (or smaller than $\beta - \alpha$) will be clipped, $\alpha$ is also called the clipping factor. Supposing $\mathcal{Q}$ is used to quantize a vector $\boldsymbol w$, a naive choice of $\alpha$ is the maximal absolute value of the elements in $\boldsymbol w$; in this case, no element is clipped. However, because the weights in a layer/channel of a neural network empirically follow a bell-shaped distribution, properly shrinking $\alpha$ can boost performance. Different clipping methods have been proposed to optimize $\alpha$ (Zhao et al., 2019).
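To make the scheme concrete, the following NumPy sketch implements Eqs. (1)–(3) under the conventions above; the function name and the grid parameterization are our illustrative choices, not a reference implementation.

```python
import numpy as np

def quantize(w, alpha, beta, b):
    """Nearest rounding onto the b-bit grid Q = beta + alpha * U(b) (Eqs. 1-3),
    where U(b) is the uniform grid on [-1, 1] with increment 2 / (2**b - 1)."""
    levels = 2 ** b - 1                                             # grid increments on [-1, 1]
    u = np.clip((np.asarray(w, float) - beta) / alpha, -1.0, 1.0)   # center, scale, clip
    k = np.round((u + 1.0) * levels / 2.0)                          # nearest grid index in 0..levels
    return beta + alpha * (2.0 * k / levels - 1.0)
```

For instance, `quantize(w, alpha=np.abs(w).max(), beta=0.0, b=4)` reproduces the naive no-clipping choice of $\alpha$ discussed above.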
There are two common configurations for post-training quantization: per-layer quantization and per-channel quantization. Per-layer quantization assigns the same $\alpha$ and $\beta$ to all the weights in a layer. Per-channel quantization is more fine-grained: it uses a different $\alpha$ and $\beta$ for each channel. The latter achieves higher precision, but requires more complicated hardware design.
2.2 Multi-point Quantization and Optimization
We propose multi-point quantization, which can be implemented with common operands on commodity hardware.
Consider a linear layer in a neural network, which is either a fully-connected (FC) layer or a convolutional layer. The weight of a channel is a vector for an FC layer, or a convolution kernel for a convolutional layer. For simplicity, we only present the FC case in this section; it generalizes easily to convolutional layers. Supposing the input to this layer is a $d$-dimensional vector $\boldsymbol x \in \mathbb{R}^d$, the real-valued weight of a channel can be denoted $\boldsymbol w \in \mathbb{R}^d$. Multi-point quantization approximates $\boldsymbol w$ with a weighted sum of a set of low-precision weight vectors:
$$\boldsymbol w \approx \boldsymbol w^* = \sum_{i=1}^{n} a_i \boldsymbol w_i, \qquad (4)$$

where $a_i \in \mathbb{R}$ and $\boldsymbol w_i \in \mathcal{Q}^d$ for $i = 1, \dots, n$.
Given a fixed $n$, we want to find the optimal $\{(a_i, \boldsymbol w_i)\}_{i=1}^{n}$ that minimizes the $L_2$ norm between the real-valued weight and the weighted sum:

$$\min_{\{a_i,\; \boldsymbol w_i \in \mathcal{Q}^d\}} \left\| \boldsymbol w - \sum_{i=1}^{n} a_i \boldsymbol w_i \right\|_2^2. \qquad (5)$$
Problem (5) is a difficult combinatorial optimization. We propose an efficient greedy method for solving it, sequentially adding the best pairs one by one. Specifically, we obtain the $t$-th pair by approximating the residual left by the previous $t-1$ pairs:
$$(a_t, \boldsymbol w_t) = \operatorname*{arg\,min}_{a \in \mathbb{R},\; \boldsymbol w \in \mathcal{Q}^d} \left\| \boldsymbol r_{t-1} - a \boldsymbol w \right\|_2^2, \qquad (6)$$
where $\boldsymbol r_{t-1}$ is the residual left by the first $t-1$ pairs:

$$\boldsymbol r_{t-1} = \boldsymbol w - \sum_{i=1}^{t-1} a_i \boldsymbol w_i. \qquad (7)$$
For a fixed $a$, the optimal $\boldsymbol w$ in Eq. (6) is obtained by elementwise nearest rounding of the scaled residual:

$$\boldsymbol w_t(a) = q\left(\boldsymbol r_{t-1}/a\right). \qquad (8)$$
Now we only need to solve for the optimal $a_t$:

$$a_t = \operatorname*{arg\,min}_{a} \left\| \boldsymbol r_{t-1} - a\, q\left(\boldsymbol r_{t-1}/a\right) \right\|_2^2. \qquad (9)$$
Because $q(\cdot)$ is not differentiable, Eq. (9) is hard to optimize by gradient descent. Instead, we adopt grid search to find $a_t$ efficiently. Once the optimal $a_t$ is found, the corresponding $\boldsymbol w_t$ is

$$\boldsymbol w_t = q\left(\boldsymbol r_{t-1}/a_t\right). \qquad (10)$$
Choice of Parameters for Grid Searching $a_t$:
Grid search enumerates all values in the set $\{\ell,\, \ell + \eta,\, \ell + 2\eta,\, \dots,\, u\}$ and selects the one that achieves the lowest error. The parameters of grid search, the search range and the step size, are the interval $[\ell, u]$ and the increment $\eta$, respectively. The choice of search range and step size is critical. We first define the minimal gap of a vector, and then give our choice of search range and step size.
The minimal gap is the minimal distance between two distinct elements of a vector; it restricts the maximal value of the step size.

Definition 1 (Minimal Gap)
Given any vector $\boldsymbol v \in \mathbb{R}^d$, the minimal gap of $\boldsymbol v$ is defined as $\delta(\boldsymbol v) = \min_{i \neq j,\; v_i \neq v_j} |v_i - v_j|$.
Then we propose a choice of $[\ell, u]$ and $\eta$ (Eq. (11)): the search range is chosen to cover the scale of the residual, and the step size is kept below a fraction of the minimal gap $\delta(\boldsymbol r_{t-1})$, so that moving to an adjacent grid value of $a$ cannot abruptly change the rounding $q(\boldsymbol r_{t-1}/a)$.
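A minimal sketch of the greedy procedure (Eqs. (6)–(10)) may help fix ideas. It reuses `quantize` from the sketch above with a unit grid ($\alpha = 1$, $\beta = 0$); the helper `minimal_gap` and the caller-supplied search grid are our illustrative choices rather than the paper's exact Eq. (11).

```python
def minimal_gap(v):
    """Minimal distance between two distinct elements of v (Definition 1)."""
    u = np.unique(v)
    return np.min(np.diff(u)) if len(u) > 1 else np.inf

def greedy_multipoint(w, b, n_points, grid):
    """Greedily approximate w by sum_i a_i * w_i with each w_i on the b-bit
    unit grid: grid-search the scale a (Eq. 9), set w_t = q(r / a)
    (Eqs. 8, 10), and update the residual (Eq. 7)."""
    residual = np.asarray(w, float).copy()
    pairs = []
    for _ in range(n_points):
        best_err, best_a, best_wq = np.inf, None, None
        for a in grid:                                   # grid search over a
            w_q = quantize(residual / a, alpha=1.0, beta=0.0, b=b)
            err = np.sum((residual - a * w_q) ** 2)
            if err < best_err:
                best_err, best_a, best_wq = err, a, w_q
        pairs.append((best_a, best_wq))
        residual -= best_a * best_wq                     # Eq. (7)
    return pairs, residual
```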
2.3 Multi-point Quantization on Deep Networks
(Figure: Error of Output vs. Index of Layers.)
We now describe how to apply multi-point quantization to deep neural networks. Multi-point quantization can decrease the quantization error of a channel significantly, but every additional quantized filter requires additional memory and computation. Therefore, to apply it to deep networks, we must select the important channels and compensate for their quantization error with multi-point quantization.
For a layer with $d$-dimensional input, we adopt a simple criterion, the output error, to determine the target channels. The output error is the difference between the outputs of a channel before and after quantization. Suppose the weight of a channel is $\boldsymbol w$; its output error is defined as

$$e(\boldsymbol w) = \left\| X \boldsymbol w - X \boldsymbol w^* \right\|_2, \qquad (12)$$
where $X$ is the input batch to the layer, collected by running a forward pass of $f$ on the calibration set $\mathcal{D}_c$. Our goal is to keep the output of each channel approximately invariant. If $e(\boldsymbol w)$ is larger than a predefined threshold $\epsilon$, we apply multi-point quantization to this channel and increase $n$ until $e(\boldsymbol w) \le \epsilon$. A similar idea is leveraged to determine the optimal clipping factor $\alpha$:
$$\alpha = \operatorname*{arg\,min}_{\alpha'} \sum_{\boldsymbol w \in W} \left\| X \boldsymbol w - X q_{\alpha'}(\boldsymbol w) \right\|_2. \qquad (13)$$
Here, $W$ is the set of weights sharing the same $\alpha$, and $q_{\alpha'}(\cdot)$ denotes quantization with clipping factor $\alpha'$. For per-layer quantization, $W$ contains the weights of all the channels in a layer. For per-channel quantization, $W$ contains only one element, the weight of a single channel.
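The channel-selection loop can then be sketched as follows; the names, the cap `n_max`, and the reuse of the greedy routine above are our assumptions for illustration, not the paper's exact implementation.

```python
def quantize_layer(W, X, b, eps, grid, n_max=8):
    """For each channel weight w (rows of W), add low-bit points until the
    output error ||X w - X w*||_2 (Eq. 12) falls below the threshold eps."""
    result = []
    for w in W:
        pairs, residual = [], np.asarray(w, float).copy()
        w_star = np.zeros_like(residual)                 # current approximation w*
        while np.linalg.norm(X @ w - X @ w_star) > eps and len(pairs) < n_max:
            new, residual = greedy_multipoint(residual, b, 1, grid)
            pairs += new
            w_star = w_star + new[0][0] * new[0][1]      # accumulate a_t * w_t
        result.append(pairs)
    return result
```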
2.4 Analysis of Overhead
We first describe how the dot product can be computed with common operands under multi-point quantization, and then analyze the memory and computation overhead.
Fig. 2(a) shows the computation flowchart of a dot product in a normal QNN (Zmora et al., 2018; Jacob et al., 2018). For a $d$-dimensional input and weight with $b$ bits, computing the dot product requires $d$ multiplications between two $b$-bit integers. The result of the dot product is stored in a 32-bit accumulator, since the sum of the individual products can exceed $2b$ bits. This operation, multiply-accumulate (MAC), has been highly optimized in modern deep learning hardware (Chen et al., 2016). The 32-bit integer is then quantized according to the quantization scheme of the output.
Now we delve into the computation pipeline when $n > 1$. Because the coefficients $a_1, \dots, a_n$ are real-valued, we transform them into a hardware-friendly integer representation beforehand:

$$\bar a_i = \mathrm{round}(2^k a_i), \quad i = 1, \dots, n. \qquad (14)$$

Here, $k$ determines the precision of the quantized $a_i$; we use the same $k$ for all weights with multi-point quantization in the network. The $\bar a_i$ are stored as 32-bit integers, and the quantization of the $a_i$ can be performed offline before deploying the QNN. We point out that

$$\boldsymbol x^\top \boldsymbol w^* = \sum_{i=1}^{n} a_i \left(\boldsymbol x^\top \boldsymbol w_i\right) \approx \frac{1}{2^k} \sum_{i=1}^{n} \bar a_i \left(\boldsymbol x^\top \boldsymbol w_i\right). \qquad (15)$$
We divide the computation into three steps; readers can follow Fig. 2(b).
Step 1: Matrix Multiplication. In the first step, we compute the dot products $\boldsymbol x^\top \boldsymbol w_i$ for $i = 1, \dots, n$. The results are stored in 32-bit accumulators.
Step 2: Coefficient Multiplication and Summation. The second step multiplies each $\boldsymbol x^\top \boldsymbol w_i$ by $\bar a_i$, which takes $n$ multiplications between two 32-bit integers. We then sum the $n$ products with $n - 1$ additions.
Step 3: Bit Shift. Finally, the division by $2^k$ can be implemented efficiently by shifting the bits of the accumulated sum to the right by $k$ positions. We ignore the computation overhead of this step.
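Putting the three steps together, an integer-only version of Eq. (15) can be sketched as below; the representation $\bar a_i = \mathrm{round}(2^k a_i)$ follows Eq. (14), and the right shift realizes the division by $2^k$ (in Python, `>>` on a negative integer floors, which matches integer division by $2^k$ up to rounding).

```python
def multipoint_dot(x_int, w_ints, a_bars, k):
    """Integer-only dot product for a multi-point channel (Eqs. 14-15).
    x_int: b-bit input codes; w_ints: n low-bit weight-code vectors;
    a_bars[i] = round(2**k * a_i), stored as a 32-bit integer."""
    accs = [int(x_int @ w_i) for w_i in w_ints]           # Step 1: n MAC reductions
    total = sum(a * acc for a, acc in zip(a_bars, accs))  # Step 2: n MULs, n-1 ADDs
    return total >> k                                     # Step 3: divide by 2^k
```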
Overall Storage/Computation Overhead: We count the number of binary operations following the same bit-op counting strategy as Li et al. (2019); Zhou et al. (2016): a multiplication between a $b_1$-bit and a $b_2$-bit integer costs $b_1 b_2$ binary operations. Suppose we have a weight vector $\boldsymbol w \in \mathbb{R}^d$. We compare the memory cost and the computational cost (dot product with a $b$-bit input $\boldsymbol x$) of naive quantization ($n = 1$) and multi-point quantization ($n > 1$). The results are summarized in Table 1. Because $d$ is always large in neural networks, the memory and computation overhead is approximately proportional to $n$.


Table 1: Memory and computation cost of a quantized dot product for a $d$-dimensional weight, as counted from the three steps above.

Method  Memory (bits)  MULs  ADDs
Naive  $db$  $d$ ($b$-bit)  $d - 1$
Multi-point  $n(db + 32)$  $nd$ ($b$-bit) + $n$ (32-bit)  $n(d - 1) + n - 1$
3 Theoretical Analysis
In this section, we give a convergence analysis of the proposed optimization procedure. We prove that the quantization error of the greedy optimization decays exponentially w.r.t. the number of points.
Suppose that we want to quantize a real-valued $d$-dimensional weight $\boldsymbol w \in \mathbb{R}^d$. For simplicity, we assume binary precision in this section, which leads to $\mathcal{Q} = \{-1, +1\}$; our proof generalizes easily to higher bit widths. We follow the notation of Section 2.2: at the $t$-th iteration, the residual $\boldsymbol r_t$, the coefficient $a_t$ and the vector $\boldsymbol w_t$ are defined by Eq. (7), Eq. (9) and Eq. (10), respectively. The minimal gap $\delta(\boldsymbol v)$ of a vector $\boldsymbol v$ is defined in Definition 1.
Let the loss function be $L(\boldsymbol r_t) = \|\boldsymbol r_t\|_2^2$. We can now prove the following rate under mild assumptions.
Theorem 1 (Exponential Decay)
Suppose that at the $t$-th iteration of the algorithm, $a_t$ is obtained by grid search over the range $[\ell, u]$ with step size $\eta$. Assume that $\eta \le \eta_{\max}$ at every step before termination, where $\eta_{\max}$ is a predefined maximal step size. Then

$$L(\boldsymbol r_t) \le (1 - c)^t L(\boldsymbol r_0) + O(\eta^2)$$

for some constant $c \in (0, 1)$.
The proof is in Appendix A. Note that the $O(\eta^2)$ term is usually much smaller than the exponential term and can thus be ignored. Theorem 1 suggests that if we use a sufficiently small step size $\eta$ for the optimization, the loss decreases exponentially. Because of this exponentially fast decay, we find that a small $n$ suffices for most of the channels using multi-point quantization in practice. Fig. 3 supports the theoretical analysis with a toy experiment.
(Figure 3: quantization loss vs. iteration in a toy experiment.)
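As a quick numerical check of this rate, in the spirit of the toy experiment, one can run the greedy sketch from Section 2.2 on a random vector and track the residual norm; the dense scale grid below is an arbitrary choice of ours.

```python
rng = np.random.default_rng(0)
w = rng.normal(size=512)
grid = np.linspace(1e-3, np.abs(w).max(), 200)   # dense scale grid (our choice)
r, losses = w.copy(), []
for _ in range(10):
    (pair,), r = greedy_multipoint(r, b=1, n_points=1, grid=grid)
    losses.append(np.sum(r ** 2))
# `losses` should decay roughly geometrically, mirroring Theorem 1.
```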
4 Experiments
We evaluate our method on two tasks: ImageNet classification (Krizhevsky et al., 2012) and PASCAL VOC object detection (Everingham et al., 2007). Our evaluation covers various neural architectures.
Table 2: Per-layer quantization results on ImageNet.

Model  Bits (W/A)  Method  Acc (Top-1/Top-5) (%)  Size  OPs
VGG19-BN  32/32  Full-Precision  74.24/91.85  76.42MB  –
  4/8  w/o Multi-point  60.81/83.68  9.55MB  9.754G
  4/8  OCS (Zhao et al., 2019)  62.11/84.59  10.70MB  10.924G
  4/8  Ours  64.06/86.14  9.59MB  10.923G
ResNet-18  32/32  Full-Precision  69.76/89.08  42.56MB  –
  4/8  w/o Multi-point  54.04/78.10  5.32MB  847.78M
  4/8  OCS (Zhao et al., 2019)  58.05/81.57  6.20MB  988.51M
  4/8  Ours  61.68/84.03  5.37MB  983.22M
ResNet-101  32/32  Full-Precision  77.37/93.56  161.68MB  –
  4/8  w/o Multi-point  61.04/83.02  20.21MB  3.841G
  4/8  OCS (Zhao et al., 2019)  70.27/89.73  23.40MB  4.448G
  4/8  Ours  73.09/91.34  20.86MB  4.446G
WideResNet-50  32/32  Full-Precision  78.51/94.09  262.64MB  –
  4/8  w/o Multi-point  61.78/83.60  31.83MB  5.639G
  4/8  OCS (Zhao et al., 2019)  68.54/88.68  35.97MB  6.372G
  4/8  Ours  70.47/89.43  32.08MB  6.365G
Inception-v3  32/32  Full-Precision  77.45/93.56  82.96MB  –
  4/8  w/o Multi-point  5.17/12.85  10.37MB  2.846G
  4/8  OCS (Zhao et al., 2019)  8.49/17.75  12.16MB  3.338G
  4/8  Ours  33.89/56.07  10.42MB  3.337G
MobileNet-v2  32/32  Full-Precision  71.78/90.19  8.36MB  –
  8/8  w/o Multi-point  0.06/0.15  2.090MB  299.49M
  8/8  OCS (Zhao et al., 2019)  N/A  N/A  N/A
  8/8  Ours  70.70/89.70  2.091MB  357.29M
Table 3: Per-channel quantization results on ImageNet.

Model  Bits (W/A)  Method  Acc (Top-1/Top-5) (%)  Size  OPs
VGG19-BN  32/32  Full-Precision  74.24/91.85  76.42MB  –
  4/4  w/o Multi-point  52.08/76.19  9.55MB  4.877G
  4/4  MP (Banner et al., 2019)  70.59/90.08  9.55MB  4.877G
  4/4  Ours  71.96/90.75  9.63MB  5.525G
  4/4  Ours + Clip  72.78/91.23  9.58MB  5.354G
ResNet-18  32/32  Full-Precision  69.76/89.08  42.56MB  –
  4/4  w/o Multi-point  57.00/80.40  5.32MB  423.89M
  4/4  MP (Banner et al., 2019)  64.78/85.90  5.32MB  423.89M
  4/4  Ours  64.29/85.59  5.39MB  494.16M
  4/4  Ours + Clip  65.89/86.68  5.41MB  470.89M
ResNet-50  32/32  Full-Precision  76.15/92.87  89.44MB  –
  4/4  w/o Multi-point  65.88/86.93  11.18MB  992.28M
  4/4  MP (Banner et al., 2019)  72.52/90.80  11.18MB  992.28M
  4/4  Ours  71.88/90.43  11.33MB  1.148G
  4/4  Ours + Clip  72.67/91.11  11.32MB  1.128G
ResNet-101  32/32  Full-Precision  77.37/93.56  161.68MB  –
  4/4  w/o Multi-point  69.67/89.21  20.21MB  1.920G
  4/4  MP (Banner et al., 2019)  74.22/91.95  20.21MB  1.920G
  4/4  Ours  71.56/90.36  20.82MB  2.177G
  4/4  Ours + Clip  72.85/91.16  21.04MB  2.189G
Inception-v3  32/32  Full-Precision  77.45/93.56  82.96MB  –
  4/4  w/o Multi-point  12.12/25.24  10.37MB  1.423G
  4/4  MP (Banner et al., 2019)  60.64/82.15  10.37MB  1.423G
  4/4  Ours  61.22/83.27  10.44MB  1.692G
  4/4  Ours + Clip  65.49/86.72  10.38MB  1.519G
MobileNet-v2  32/32  Full-Precision  71.78/90.19  8.36MB  –
  4/4  w/o Multi-point  6.86/16.76  1.04MB  74.87M
  4/4  MP (Banner et al., 2019)  42.61/67.78  1.04MB  74.87M
  4/4  Ours  27.52/50.80  1.05MB  91.16M
  4/4  Ours + Clip  55.54/79.10  1.045MB  85.88M
4.1 Experiment Results on ImageNet Benchmark
We evaluate our method on the ImageNet classification benchmark. For fair comparison, we use the pretrained models provided by PyTorch.
We report both the model size and the number of operations (OPs) under different bit-width settings for all methods. The first and the last layer are not counted. We follow the same bit-op counting strategy as Li et al. (2019); Zhou et al. (2016). One OP is defined as one multiplication between an 8-bit weight and an 8-bit activation, which takes 64 binary operations; a multiplication between a $b_1$-bit and a $b_2$-bit integer is therefore counted as $b_1 b_2 / 64$ OPs. For example, a 4-bit × 8-bit multiplication costs 32 binary operations, i.e., 0.5 OPs.
We provide two categories of results: per-layer quantization and per-channel quantization. In per-layer quantization, all channels in a layer share the same $\alpha$ and $\beta$; in per-channel quantization, each channel has its own $\alpha$ and $\beta$. For both settings, we test six different networks, including VGG19-BN (Simonyan and Zisserman, 2014), ResNet-18, ResNet-101, WideResNet-50 (He et al., 2016), Inception-v3 (Szegedy et al., 2015) and MobileNet-v2 (Sandler et al., 2018).
Per-layer Quantization For per-layer quantization, we compare our method with a state-of-the-art (SOTA) baseline, Outlier Channel Splitting (OCS) (Zhao et al., 2019). OCS duplicates the channel with the maximal absolute value and halves it to mitigate the quantization error. For fair comparison, we choose the best of the four clipping methods considered in their paper for OCS (Sung et al., 2015; Migacz, 2017; Banner et al., 2019). We select the threshold $\epsilon$ such that the OPs of the QNN with multi-point quantization is about 1.15 times that of the naive QNN, and we expand the network with OCS until it has similar OPs to the QNN using multi-point quantization. The results without multi-point quantization (denoted 'w/o Multi-point' in Table 2) serve as another baseline. We quantize the activations and the weights to the same precision as the baselines. The results are presented in Table 2. Our method obtains consistent and significant gains over 'w/o Multi-point' on all models, with little increase in memory. It also consistently outperforms OCS under comparable computational budgets; in particular, on ResNet-18, ResNet-101 and Inception-v3, our method surpasses OCS by more than 2% Top-1 accuracy. OCS cannot quantize MobileNet-v2 due to its group convolution layers, while our method nearly recovers the full-precision accuracy. Our method achieves performance similar to Data-Free Quantization (Nagel et al., 2019) (71.19% Top-1 accuracy with 8-bit MobileNet-v2), which targets 8-bit quantization of MobileNets only. Note that this method is orthogonal to ours, and we expect further improvement from combining them.
Per-channel Quantization For per-channel quantization, we compare our method with another SOTA baseline, Banner et al. (2019), which requires physical per-channel mixed-precision computation since it assigns different bit widths to different channels; we denote it 'Mixed Precision (MP)'. All networks are quantized with asymmetric per-channel quantization (i.e., with a per-channel offset $\beta$). Since per-channel quantization already has high precision, weight clipping is not performed for the naive quantization, meaning that no weight is clipped. We quantize both weights and activations to 4 bits. The results are presented in Table 3.
Our method outperforms MP on VGG19-BN and Inception-v3 even without weight clipping. After performing weight clipping with Eq. (13), our method beats MP on 5 out of 6 networks, the exception being ResNet-101. On VGG19-BN, Inception-v3 and MobileNet-v2, the Top-1 accuracy of our method after clipping is more than 2% higher than MP. In these experiments, the memory overhead is always below 5% and the computation overhead is at most 17% relative to the naive QNN.
4.2 Experiment Results on PASCAL VOC Object Detection Benchmark
We test the Single Shot MultiBox Detector (SSD), a well-known object detection framework, and use an open-source implementation.
In per-layer quantization, our method improves the baseline by over 1% mAP (72.86% → 74.10%). When the weights are quantized to 3 bits, our method boosts the baseline by 4.38% mAP (42.56% → 46.94%) with a memory overhead of only 0.01MB. Our method also performs well in per-channel quantization, improving the baseline by 0.41% mAP for 4-bit quantization and 1.09% mAP for 3-bit quantization. In general, our method helps more as the bit width decreases.


(Figure 4: Top-1 accuracy (%) vs. OPs (G) for different methods.)
(Figure 5: relative increment of size vs. index of layers.)
4.3 Analysis of the Algorithm
We provide a case study of ResNet-101 under per-layer quantization to analyze the algorithm. More results can be found in the appendix.
Computation Overhead and Performance: Fig. 4 shows how the performance of different methods changes with the computational cost. Our method obtains a large gain with only a little overhead. OCS does not perform comparably with our method at first, but catches up when the computational cost is large enough. The performance of 'Random' (randomly chosen channels) is consistently the worst among the three methods, implying the importance of choosing appropriate channels for multi-point quantization.
Where Multi-point Quantization is Applied: Fig. 5 shows the relative increment of size in each layer. We observe that the layers close to the input have a larger relative increment of size than later layers. Typically, the early layers have few parameters but high computational cost, which explains why the computational overhead of our method is larger than its memory overhead.
5 Related Works
Quantized neural networks have made significant progress when training is available (Courbariaux et al., 2015; Han et al., 2015; Zhu et al., 2016; Rastegari et al., 2016; Mishra et al., 2017; Zmora et al., 2018; Cheng et al., 2018; Krishnamoorthi, 2018; Li et al., 2019). Post-training quantization addresses scenarios where training is not available (Krishnamoorthi, 2018; Meller et al., 2019; Banner et al., 2019; Zhao et al., 2019). Hardware-Aware Automated Quantization (Wang et al., 2019) is a pioneering work that applies mixed precision to improve QNN accuracy, but it requires fine-tuning the network; it inspired a line of research on training mixed-precision QNNs (Gong et al., 2019; Dong et al., 2019). Banner et al. (2019) were the first to exploit mixed precision to enhance post-training quantization.
6 Conclusions
We propose multi-point quantization for post-training quantization, which is hardware-friendly and effective. It performs favorably compared with state-of-the-art methods.
Appendix A Proof of Theorem 1
Notice that in the main text we define $(a_t, \boldsymbol w_t)$ as the optimal solution of each iteration (see Eq. (6)), while this cannot be solved exactly in practice and we use the grid-search solution instead. This slightly abuses notation, as the two pairs are in general different; we do so for notational simplicity in the main text. In this proof we distinguish them: $(a_t^*, \boldsymbol w_t^*)$ denotes the exact minimizer and $(\hat a_t, \hat{\boldsymbol w}_t)$ the grid-search solution.

In our proof, we only consider the simplest case $b = 1$, which means $\mathcal{Q} = \{-1, +1\}$; it generalizes to $b > 1$ easily. Define $\mathcal{V} = \{-1, +1\}^d$ and $\mathcal{C} = \mathrm{conv}(\mathcal{V})$, where $\mathrm{conv}$ denotes the convex hull. It is obvious that $\boldsymbol 0$ is an interior point of $\mathcal{C}$. Now we define the following intermediate update. Given the current residual vector $\boldsymbol r$ (without loss of generality, we assume all the elements of $\boldsymbol r$ are different; if some of them are equal, we simply treat them as the same element), let

$$\bar{\boldsymbol w} = \operatorname*{arg\,max}_{\boldsymbol v \in \mathcal{C}}\; \langle \boldsymbol r, \boldsymbol v \rangle.$$

Notice that the objective is linear, so the maximum over $\mathcal{C}$ is attained at a vertex, and we thus have $\bar{\boldsymbol w} = \mathrm{sign}(\boldsymbol r) \in \mathcal{V}$; hence the optimal solution under the constraint $\boldsymbol w \in \mathcal{V}$ is also $\bar{\boldsymbol w}$. Without loss of generality, we assume $\boldsymbol r \neq \boldsymbol 0$, as otherwise $L(\boldsymbol r) = 0$ and the algorithm terminates. Simple algebra shows that $\langle \boldsymbol r, \bar{\boldsymbol w} \rangle = \|\boldsymbol r\|_1 > 0$.

Given the current residual vector, we also define the exact one-step coefficient

$$a^* = \operatorname*{arg\,min}_{a}\; \|\boldsymbol r - a \bar{\boldsymbol w}\|_2^2 = \frac{\langle \boldsymbol r, \bar{\boldsymbol w} \rangle}{\|\bar{\boldsymbol w}\|_2^2} = \frac{\|\boldsymbol r\|_1}{d}.$$

By this definition, $\langle \boldsymbol r - a^* \bar{\boldsymbol w}, \bar{\boldsymbol w} \rangle = 0$, and we have the identity

$$\|\boldsymbol r - a^* \bar{\boldsymbol w}\|_2^2 = \|\boldsymbol r\|_2^2 - \frac{\langle \boldsymbol r, \bar{\boldsymbol w} \rangle^2}{d}.$$

Notice that, as $\boldsymbol 0$ is an interior point of $\mathcal{C}$, we have $\{\boldsymbol v : \|\boldsymbol v\|_2 \le \gamma\} \subseteq \mathcal{C}$ for some $\gamma > 0$. This gives

$$\langle \boldsymbol r, \bar{\boldsymbol w} \rangle = \max_{\boldsymbol v \in \mathcal{C}}\; \langle \boldsymbol r, \boldsymbol v \rangle \ge \gamma \|\boldsymbol r\|_2, \qquad \text{and thus} \qquad \|\boldsymbol r - a^* \bar{\boldsymbol w}\|_2^2 \le \Big(1 - \frac{\gamma^2}{d}\Big) \|\boldsymbol r\|_2^2.$$

Next we bound the difference between the exact update and the grid-search update. Without loss of generality, we assume $|r_1| \ge |r_2| \ge \dots \ge |r_d|$ and $a^* > 0$; the case $a^* < 0$ is similar. Under the assumption on the grid search, there exists a value $\tilde a$ in the search space such that $|\tilde a - a^*| \le \eta$. Moreover, by the assumption $\eta \le \eta_{\max}$, the step is small enough relative to the minimal gap of $\boldsymbol r$ that the rounding pattern does not change, i.e., $q(r_i / \tilde a) = q(r_i / a^*)$ for every $i$, so the quantized vector associated with $\tilde a$ is still $\bar{\boldsymbol w}$. Since the grid search returns the best value in the search space, we conclude

$$L(\boldsymbol r_t) = \|\boldsymbol r_{t-1} - \hat a_t \hat{\boldsymbol w}_t\|_2^2 \le \|\boldsymbol r_{t-1} - \tilde a \bar{\boldsymbol w}\|_2^2 = \|\boldsymbol r_{t-1} - a^* \bar{\boldsymbol w}\|_2^2 + (\tilde a - a^*)^2\, \|\bar{\boldsymbol w}\|_2^2.$$

Setting $c = \gamma^2 / d \in (0, 1)$, this gives

$$L(\boldsymbol r_t) \le (1 - c)\, L(\boldsymbol r_{t-1}) + d \eta^2.$$

Applying the above inequality iteratively, we have

$$L(\boldsymbol r_t) \le (1 - c)^t L(\boldsymbol r_0) + \frac{d \eta^2}{c} = (1 - c)^t L(\boldsymbol r_0) + O(\eta^2),$$

which completes the proof.
Appendix B Experiment Details
We provide more details of our algorithm in the experiments. For per-layer and per-channel quantization, the optimal clipping factors are obtained by uniform grid search; the search interval for the weights of the first and last layers differs from that of the other layers. The optimal clipping factors for the weights are obtained before performing multi-point quantization and are kept fixed afterwards. For fair comparison, the Batch Normalization (BN) layers are quantized in the same way as in the baselines: when comparing with OCS, the BN layers are not quantized; when comparing with Banner et al. (2019), the BN layers are absorbed into the weights and quantized together with them. A similar strategy is adopted for SSD quantization, i.e., the BN layers are kept full-precision in the per-layer setting and absorbed in the per-channel setting.
Threshold $\epsilon$ for per-layer quantization:

Network  VGG19-BN  ResNet-18  ResNet-101  WideResNet-50  Inception-v3  MobileNet-v2
$\epsilon$  50  15  0.25  1  100  10
Threshold $\epsilon$ for per-channel quantization:

Network  VGG19-BN  ResNet-18  ResNet-50  ResNet-101  Inception-v3  MobileNet-v2
$\epsilon$  10  8  0.7  0.2  50  1
Appendix C 3-bit Quantization
We present the results of 3-bit quantization in this section. 3-bit quantization is more aggressive, and the accuracy of the QNN is typically much lower than with 4 bits. As before, we report the results of per-layer and per-channel quantization. All the hyperparameters are the same as in 4-bit quantization except for the threshold $\epsilon$.
Per-layer 3-bit quantization results on ImageNet:

Model  Bits (W/A)  Method  Acc (Top-1/Top-5)  Size  OPs  $\epsilon$
VGG19-BN  32/32  Full-Precision  74.24%/91.85%  76.42MB  –  –
  3/8  w/o Multi-point  4.71%/12.33%  7.16MB  7.315G  –
  3/8  Ours  20.58%/40.38%  7.22MB  8.648G  100
ResNet-18  32/32  Full-Precision  69.76%/89.08%  42.56MB  –  –
  3/8  w/o Multi-point  9.83%/24.89%  3.99MB  635.83M  –
  3/8  Ours  26.16%/49.29%  4.01MB  714.53M  100
WideResNet-50  32/32  Full-Precision  78.51%/94.09%  262.64MB  –  –
  3/8  w/o Multi-point  4.36%/10.64%  23.87MB  4.229G  –
  3/8  Ours  18.43%/35.34%  23.97MB  4.554G  5
Per-channel 3-bit quantization results on ImageNet:

Model  Bits (W/A)  Method  Acc (Top-1/Top-5)  Size  OPs  $\epsilon$
VGG19-BN  32/32  Full-Precision  74.24%/91.85%  76.42MB  –  –
  3/3  w/o Multi-point  0.10%/0.492%  7.16MB  2.743G  –
  3/3  Ours + Clip  65.81%/87.25%  7.19MB  3.099G  50
ResNet-18  32/32  Full-Precision  69.76%/89.08%  42.56MB  –  –
  3/3  w/o Multi-point  0.11%/0.55%  3.99MB  238.44M  –
  3/3  Ours + Clip  43.75%/69.16%  4.06MB  265.90M  20
MobileNet-v2  32/32  Full-Precision  71.78%/90.19%  8.36MB  –  –
  3/3  w/o Multi-point  0.11%/0.64%  0.78MB  42.12M  –
  3/3  Ours + Clip  5.21%/14.33%  0.79MB  58.65M  50
Appendix D More Visualization
We provide more experiment results for analyzing our algorithm under 4-bit quantization. Specifically, we provide the results of per-layer quantization of WideResNet-50 (Fig. 6 and Fig. 7) and per-channel quantization of ResNet-18 (Fig. 8 and Fig. 9).
(Figure 6: Top-1 accuracy (%) vs. OPs (G), WideResNet-50, per-layer quantization.)
(Figure 7: relative increment of size vs. index of layers, WideResNet-50, per-layer quantization.)
(Figure 8: Top-1 accuracy (%) vs. OPs (M), ResNet-18, per-channel quantization.)
(Figure 9: relative increment of size vs. index of layers, ResNet-18, per-channel quantization.)
References
Banner, R., Nahshan, Y., and Soudry, D. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems, pp. 7948–7956.
Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52(1), pp. 127–138.
Cheng, J., Wang, P., Li, G., Hu, Q., and Lu, H. Recent advances in efficient computation of deep convolutional neural networks. arXiv preprint arXiv:1802.00939.
Choukroun, Y., Kravchik, E., Yang, F., and Kisilev, P. Low-bit quantization of neural networks for efficient inference. arXiv preprint arXiv:1902.06822.
Courbariaux, M., Bengio, Y., and David, J.-P. BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131.
Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. HAWQ: Hessian AWare Quantization of neural networks with mixed-precision. In The IEEE International Conference on Computer Vision (ICCV).
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
Gong, C., Jiang, Z., Wang, D., Lin, Y., Liu, Q., and Pan, D. Z. Mixed precision neural architecture search for energy efficient deep learning. In 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–7.
Han, S., Mao, H., and Dally, W. J. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Horowitz, M. 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18(1), pp. 6869–6898.
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713.
Jung, S., Son, C., Lee, S., Son, J., Han, J.-J., Kwak, Y., Hwang, S. J., and Choi, C. Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4350–4359.
Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
Li, Y., Dong, X., and Wang, W. Additive powers-of-two quantization: a non-uniform discretization for neural networks. arXiv preprint arXiv:1909.13144.
Louizos, C., Reisser, M., Blankevoort, T., Gavves, E., and Welling, M. Relaxed quantization for discretized neural networks. arXiv preprint arXiv:1810.01875.
Meller, E., Finkelstein, A., Almog, U., and Grobman, M. Same, same but different: recovering neural network quantization error through weight factorization. arXiv preprint arXiv:1902.01917.
Migacz, S. 8-bit inference with TensorRT. In GPU Technology Conference, Vol. 2, pp. 7.
Mishra, A., Nurvitadhi, E., Cook, J. J., and Marr, D. WRPN: wide reduced-precision networks. arXiv preprint arXiv:1709.01134.
Nagel, M., van Baalen, M., Blankevoort, T., and Welling, M. Data-free quantization through weight equalization and bias correction. arXiv preprint arXiv:1906.04721.
Nahshan, Y., Chmiel, B., Baskin, C., Zheltonozhskii, E., Banner, R., Bronstein, A. M., and Mendelson, A. Loss aware post-training quantization. arXiv preprint arXiv:1911.07190.
Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sung, W., Shin, S., and Hwang, K. Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620.
Xu, X., Ding, Y., Hu, S. X., Niemier, M., Cong, J., Hu, Y., and Shi, Y. Scaling for edge inference of deep neural networks. Nature Electronics 1(4), pp. 216.
Zhao, R., Hu, Y., Dotzel, J., De Sa, C., and Zhang, Z. Improving neural network quantization without retraining using outlier channel splitting. In International Conference on Machine Learning, pp. 7543–7552.
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained ternary quantization. arXiv preprint arXiv:1612.01064.
Zmora, N., Jacob, G., Zlotnik, L., Elharar, B., and Novik, G. Neural Network Distiller.