Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights
Abstract
This paper presents incremental network quantization (INQ), a novel method for efficiently converting any pretrained full-precision convolutional neural network (CNN) model into a low-precision version whose weights are constrained to be either powers of two or zero. Unlike existing methods, which suffer from noticeable accuracy loss, our INQ has the potential to resolve this issue thanks to two innovations. On one hand, we introduce three interdependent operations, namely weight partition, group-wise quantization and retraining. A well-proven measure is employed to divide the weights in each layer of a pretrained CNN model into two disjoint groups. The weights in the first group form a low-precision base, so they are quantized by a variable-length encoding method. The weights in the other group compensate for the accuracy loss caused by the quantization, so they are the ones to be retrained. On the other hand, these three operations are repeated on the latest retrained group in an iterative manner until all the weights are converted into low-precision ones, acting as an incremental network quantization and accuracy enhancement procedure. Extensive experiments on the ImageNet classification task using almost all known deep CNN architectures, including AlexNet, VGG-16, GoogleNet and ResNets, testify to the efficacy of the proposed method. Specifically, at 5-bit quantization (a variable-length encoding: 1 bit represents the zero value, and the remaining 4 bits represent at most 16 different powers of two; this notation applies to our method throughout the paper), our models achieve better accuracy than the 32-bit floating-point references. Taking ResNet-18 as an example, we further show that our quantized models with 4-bit, 3-bit and 2-bit ternary weights have improved or very similar accuracy compared with the 32-bit floating-point baseline. Besides, impressive results with the combination of network pruning and INQ are also reported. We believe that our method sheds new light on how to make deep CNNs applicable on mobile or embedded devices. The code is available at https://github.com/Zhouaojun/IncrementalNetworkQuantization.
Aojun Zhou†, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen
Intel Labs China
{aojun.zhou, anbang.yao, yiwen.guo, lin.x.xu, yurong.chen}@intel.com
† This work was done when Aojun Zhou was an intern at Intel Labs China, supervised by Anbang Yao, who proposed the original idea and is responsible for correspondence. The first three authors contributed equally to the writing of the paper.
1 Introduction
Deep convolutional neural networks (CNNs) have demonstrated record-breaking results on a variety of computer vision tasks such as image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015), face recognition (Taigman et al., 2014; Sun et al., 2014), semantic segmentation (Long et al., 2015; Chen et al., 2015a) and object detection (Girshick, 2015; Ren et al., 2015). Regardless of the availability of significantly improved training resources such as abundant annotated data, powerful computational platforms and diverse training frameworks, the promising results of deep CNNs are mainly attributed to the large number of learnable parameters, ranging from tens of millions to even hundreds of millions. Recent progress further shows clear evidence that CNNs can easily gain accuracy from increased network depth and width (He et al., 2016; Szegedy et al., 2015; 2016). However, this in turn places a heavy burden on memory and other computational resources. For instance, ResNet-152, a specific instance of the residual network architecture that won the ImageNet classification challenge in 2015, has a model size of about 230 MB and needs to perform about 11.3 billion FLOPs to classify an image crop. Therefore, it is very challenging to deploy deep CNNs on devices with limited computation and power budgets.
Substantial efforts have been devoted to speeding up and compressing CNNs during training, feed-forward test, or both. Among existing methods, network quantization attracts great attention from researchers and developers. Some network quantization works try to compress pretrained full-precision CNN models directly. Gong et al. (2014) address the storage problem of AlexNet (Krizhevsky et al., 2012) with vector quantization techniques. By replacing the weights in each of the three fully connected layers with the respective floating-point centroid values obtained from clustering, they achieve over 20× model compression at about 1% loss in top-5 recognition rate. HashedNet (Chen et al., 2015b) uses a hash function to randomly map pretrained weights into hash buckets, and all the weights in the same hash bucket are constrained to share a single floating-point value. In HashedNet, only the fully connected layers of several shallow CNN models are considered. For better compression, Han et al. (2016) present the deep compression method, which combines pruning (Han et al., 2015), vector quantization and Huffman coding, and reduces the model storage by 35× on AlexNet and 49× on VGG-16 (Simonyan & Zisserman, 2015). Vanhoucke et al. (2011) use an SSE 8-bit fixed-point implementation to improve the computation of neural networks on modern Intel x86 CPUs in feed-forward test, yielding a 3× speedup over an optimized floating-point baseline. Training CNNs by substituting the 32-bit floating-point representation with a 16-bit fixed-point representation has also been explored in Gupta et al. (2015). Other seminal works attempt to restrict CNNs to low-precision versions during the training phase. Soudry et al. (2014) propose expectation backpropagation (EBP) to estimate the posterior distribution of deterministic network weights. With EBP, the network weights can be constrained to $+1$ and $-1$ during feed-forward test in a probabilistic way. BinaryConnect (Courbariaux et al., 2015) further extends the idea behind EBP to binarize network weights directly during the training phase. It keeps two versions of the network weights: floating-point and binary. The floating-point version is used as the reference for weight binarization. BinaryConnect achieves state-of-the-art accuracy using shallow CNNs on small datasets such as MNIST (LeCun et al., 1998) and CIFAR-10. Later on, a series of efforts have been invested in training CNNs with low-precision weights, low-precision activations and even low-precision gradients, including but not limited to BinaryNet (Courbariaux et al., 2016), XNOR-Net (Rastegari et al., 2016), ternary weight network (TWN) (Li & Liu, 2016), DoReFa-Net (Zhou et al., 2016) and quantized neural network (QNN) (Hubara et al., 2016).
Despite these tremendous advances, CNN quantization remains an open problem due to two critical issues which have not been well resolved yet, especially under scenarios of using low-precision weights for quantization. The first issue is the non-negligible accuracy loss of CNN quantization methods, and the other issue is the increased number of training iterations needed to ensure convergence. In this paper, we attempt to address these two issues by presenting a novel incremental network quantization (INQ) method.
In our INQ, there is no assumption on the CNN architecture, and its basic goal is to efficiently convert any pretrained full-precision (i.e., 32-bit floating-point) CNN model into a low-precision version whose weights are constrained to be either powers of two or zero. The advantage of such low-precision models is that the original floating-point multiplication operations can be replaced by cheaper binary bit-shift operations on dedicated hardware such as FPGAs. We noticed that most existing network quantization methods adopt a global strategy in which all the weights are simultaneously converted to low-precision ones (which are usually still of floating-point type). That is, they do not consider the different importance of network weights, which limits the room for retaining network accuracy. In sharp contrast to existing methods, our INQ handles the accuracy drop caused by network quantization very carefully. To be more specific, it incorporates three interdependent operations: weight partition, group-wise quantization and retraining. Weight partition uses a pruning-inspired measure (Han et al., 2015; Guo et al., 2016) to divide the weights in each layer of a pretrained full-precision CNN model into two disjoint groups which play complementary roles in our INQ. The weights in the first group are quantized to be either powers of two or zero by a variable-length encoding method, forming a low-precision base for the original model. The weights in the other group are retrained while keeping the quantized weights fixed, compensating for the accuracy loss resulting from the quantization. Furthermore, these three operations are repeated on the latest retrained weight group in an iterative manner until all the weights are quantized, acting as an incremental network quantization and accuracy enhancement procedure (as illustrated in Figure 1).
The main insight of our INQ is that a compact combination of the proposed weight partition, group-wise quantization and retraining operations has the potential to produce a lossless low-precision CNN model from any full-precision reference. We conduct extensive experiments on the ImageNet large scale classification task using almost all known deep CNN architectures to validate the effectiveness of our method. We show that: (1) For AlexNet, VGG-16, GoogleNet and ResNets with 5-bit quantization, INQ achieves improved accuracy in comparison with their respective full-precision baselines. The absolute top-1 accuracy gain ranges from 0.13% to 2.28%, and the absolute top-5 accuracy gain ranges from 0.23% to 1.65%. (2) INQ converges easily in training. In general, retraining for fewer than 8 epochs consistently generated a lossless model with 5-bit weights in the experiments. (3) Taking ResNet-18 as an example, our quantized models with 4-bit, 3-bit and 2-bit ternary weights also have improved or very similar accuracy compared with the 32-bit floating-point baseline. (4) Taking AlexNet as an example, the combination of our network pruning and INQ outperforms the deep compression method (Han et al., 2016) by significant margins.
2 Incremental Network Quantization
In this section, we clarify the insight of our INQ, describe its key components, and detail its implementation.
2.1 Weight Quantization with Variable-Length Encoding
Suppose a pretrained full-precision (i.e., 32-bit floating-point) CNN model can be represented by $\{\mathbf{W}_l : 1 \le l \le L\}$, where $\mathbf{W}_l$ denotes the weight set of the $l$-th layer, and $L$ denotes the number of learnable layers in the model. To simplify the explanation, we only consider convolutional layers and fully connected layers. For CNN models like AlexNet, VGG-16, GoogleNet and ResNets as tested in this paper, $\mathbf{W}_l$ can be a tensor for a convolutional layer, or a matrix for a fully connected layer. For simplicity, the dimension difference is not considered in the expressions here. Given a pretrained full-precision CNN model, the main goal of our INQ is to convert all 32-bit floating-point weights to be either powers of two or zero without loss of model accuracy. Besides, we also attempt to explore the limit of the expected bit-width under the premise of guaranteeing lossless network quantization. Here, we start with our basic network quantization method for converting $\mathbf{W}_l$ into a low-precision version $\widehat{\mathbf{W}}_l$, each of whose entries is chosen from

$$\mathbf{P}_l = \{\pm 2^{n_1}, \cdots, \pm 2^{n_2}, 0\}, \tag{1}$$

where $n_1$ and $n_2$ are two integers satisfying $n_2 \le n_1$. Mathematically, $n_1$ and $n_2$ bound $\mathbf{P}_l$ in the sense that its non-zero elements are constrained to lie in the range of either $[-2^{n_1}, -2^{n_2}]$ or $[2^{n_2}, 2^{n_1}]$. That is, network weights with absolute values smaller than $2^{n_2}$ will be pruned away (i.e., set to zero) in the final low-precision model. The problem is then how to determine $n_1$ and $n_2$. In our INQ, the expected bit-width $b$ for storing the indices in $\mathbf{P}_l$ is set beforehand, so the only hyperparameter that needs to be determined is $n_1$, because $n_2$ can be computed once $b$ and $n_1$ are available. Here, $n_1$ is calculated by using a tricky yet practically effective formula:

$$n_1 = \mathrm{floor}\left(\log_2(4s/3)\right), \tag{2}$$

where $\mathrm{floor}(\cdot)$ indicates the round-down operation and $s$ is calculated by using

$$s = \max\left(\mathrm{abs}(\mathbf{W}_l)\right), \tag{3}$$

where $\mathrm{abs}(\cdot)$ is an element-wise operation and $\max(\cdot)$ outputs the largest element of its input. In fact, Equation (2) helps to match the rounding power of 2 for $s$, and it can be easily implemented in practical programming. After $n_1$ is obtained, $n_2$ can be determined as $n_2 = n_1 + 1 - 2^{(b-1)}/2$. For instance, if $b = 3$ and $n_1 = -1$, it is easy to get $n_2 = -2$.
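The computation of the two exponent bounds can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' Caffe implementation: the function name `exponent_bounds` and the flat list of weights are assumptions made for the example, and the formulas used are $n_1 = \lfloor \log_2(4s/3) \rfloor$ with $s = \max|\mathbf{W}_l|$ and $n_2 = n_1 + 1 - 2^{(b-1)}/2$.

```python
import math

def exponent_bounds(weights, bitwidth):
    """Return (n1, n2) for one layer (hypothetical helper for illustration)."""
    s = max(abs(w) for w in weights)              # largest magnitude, Equation (3)
    n1 = math.floor(math.log2(4.0 * s / 3.0))     # upper exponent, Equation (2)
    n2 = n1 + 1 - (2 ** (bitwidth - 1)) // 2      # lower exponent from the bit-width
    return n1, n2

# Reproduces the example in the text: b = 3 and n1 = -1 give n2 = -2.
print(exponent_bounds([0.6, -0.1, 0.3], 3))  # (-1, -2)
```

With the same weights and a 5-bit budget, the lower bound drops to $n_2 = n_1 - 7$, so eight exponents (sixteen signed values) become available.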
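Once $\mathbf{P}_l$ is determined, we further use the ladder of powers to convert every entry of $\mathbf{W}_l$ into a low-precision one by using

$$\widehat{\mathbf{W}}_l(i,j) = \begin{cases} \beta\,\mathrm{sgn}(\mathbf{W}_l(i,j)) & \text{if } (\alpha+\beta)/2 \le \mathrm{abs}(\mathbf{W}_l(i,j)) < 3\beta/2, \\ 0 & \text{otherwise}, \end{cases} \tag{4}$$

where $\alpha$ and $\beta$ are two adjacent elements in the sorted $\mathbf{P}_l$, making the above equation a numerical rounding to the quantum values. It should be emphasized that the factor $4/3$ in Equation (2) is set to make sure that all the elements in $\mathbf{P}_l$ correspond with the quantization rule defined in Equation (4). In other words, the factor $4/3$ in Equation (2) is closely tied to the factor $3/2$ in Equation (4).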
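The rounding rule can be illustrated with a short Python sketch. It assumes the rule $\widehat{w} = \beta\,\mathrm{sgn}(w)$ whenever $(\alpha+\beta)/2 \le |w| < 3\beta/2$ for adjacent magnitudes $\alpha < \beta$, and zero otherwise. One point is an assumption of this sketch rather than something stated in the text: the value 0 is treated as the neighbour of the smallest power $2^{n_2}$, so weights below $2^{n_2}/2$ are set to zero. An actual implementation may use a different cut-off for the smallest level. The helper names are hypothetical.

```python
import math

def candidate_magnitudes(n1, n2):
    """Sorted magnitudes of the candidate set: 0, 2**n2, ..., 2**n1."""
    return [0.0] + [2.0 ** n for n in range(n2, n1 + 1)]

def quantize(w, n1, n2):
    """Round one weight to a signed power of two or to zero."""
    mags = candidate_magnitudes(n1, n2)
    a = abs(w)
    for alpha, beta in zip(mags, mags[1:]):       # adjacent pairs (alpha, beta)
        if (alpha + beta) / 2 <= a < 3 * beta / 2:
            return math.copysign(beta, w)
    return 0.0                                    # outside every interval

# Candidate magnitudes for n1 = -1, n2 = -2 are 0, 0.25 and 0.5.
print([quantize(w, -1, -2) for w in (0.6, -0.3, 0.05, -0.4)])
# [0.5, -0.25, 0.0, -0.5]
```

Because $n_1$ is chosen so that the largest weight magnitude is below $3 \cdot 2^{n_1}/2$, the "otherwise" branch is only reached by small weights when the bounds come from the same layer.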
Here, an important point to clarify is the definition of the expected bit-width $b$. Taking 5-bit quantization as an example, since zero cannot be written as a power of two, we use 1 bit to represent the zero value, and the remaining 4 bits to represent at most 16 different values for the powers of two. That is, the number of candidate quantum values is at most $2^{b-1}+1$, so our quantization method actually adopts a variable-length encoding scheme. The quantization described above is performed in a linear scale. An alternative solution is to perform the quantization in the log scale. Although it may also be effective, it would be slightly more difficult to implement and may cause some extra computational overhead in comparison to our method.
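As a worked illustration of this counting, the candidate sets for 5-bit quantization and for the 2-bit ternary case can be written out explicitly. The expansion below is derived here from the relation $n_2 = n_1 + 1 - 2^{(b-1)}/2$ and is not taken from the original text.

```latex
% b = 5: n_2 = n_1 + 1 - 2^{4}/2 = n_1 - 7
\mathbf{P}_l = \{\pm 2^{n_1-7}, \pm 2^{n_1-6}, \ldots, \pm 2^{n_1}, 0\},
\qquad |\mathbf{P}_l| = 2 \cdot 8 + 1 = 17 = 2^{5-1} + 1 .

% b = 2 (ternary): n_2 = n_1 + 1 - 2^{1}/2 = n_1
\mathbf{P}_l = \{\pm 2^{n_1}, 0\},
\qquad |\mathbf{P}_l| = 2 \cdot 1 + 1 = 3 = 2^{2-1} + 1 .
```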
2.2 Incremental Quantization Strategy
We can naturally use the method described above to quantize any pretrained full-precision CNN model. However, noticeable accuracy loss appeared in the experiments when using small bit-width values (e.g., 5-bit, 4-bit, 3-bit and 2-bit).
In the literature, there are many existing network quantization works such as HashedNet (Chen et al., 2015b), vector quantization (Gong et al., 2014), fixed-point representation (Vanhoucke et al., 2011; Gupta et al., 2015), BinaryConnect (Courbariaux et al., 2015), BinaryNet (Courbariaux et al., 2016), XNOR-Net (Rastegari et al., 2016), TWN (Li & Liu, 2016), DoReFa-Net (Zhou et al., 2016) and QNN (Hubara et al., 2016). Similar to our basic network quantization method, they also suffer from non-negligible accuracy loss on deep CNNs, especially when applied to the ImageNet large scale classification dataset. A common feature of all these methods is that they adopt a global strategy in which all the weights are simultaneously converted into low-precision ones, which in turn causes accuracy loss. Compared with the methods focusing on pretrained models, accuracy loss becomes worse for methods such as XNOR-Net, TWN, DoReFa-Net and QNN, which intend to train low-precision CNNs from scratch.
Recall that our main goal is to achieve lossless low-precision quantization for any pretrained full-precision CNN model with no assumption on its architecture. To this end, our INQ pays special attention to the strategy for suppressing the accuracy loss caused by quantization. We are partially inspired by the latest progress in network pruning (Han et al., 2015; Guo et al., 2016). In these methods, the accuracy loss from removing less important network weights of a pretrained neural network model can be well compensated by subsequent retraining steps. Therefore, we conjecture that the nature of changing network weight importance is critical to achieving lossless network quantization.
Based on this assumption, we present INQ, which incorporates three interdependent operations: weight partition, group-wise quantization and retraining. Weight partition divides the weights in each layer of a pretrained full-precision CNN model into two disjoint groups which play complementary roles in our INQ. The weights in the first group are responsible for forming a low-precision base for the original model, so they are quantized by using Equation (4). The weights in the second group adapt to compensate for the loss in model accuracy, so they are the ones to be retrained. Once the first run of the quantization and retraining operations is finished, all three operations are further conducted on the second weight group in an iterative manner, until all the weights are converted to be either powers of two or zero, acting as an incremental network quantization and accuracy enhancement procedure. As a result, accuracy loss under low-precision CNN quantization can be well suppressed by our INQ. Illustrative results at iterative steps of our INQ are provided in Figure 2.
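For the $l$-th layer, weight partition can be defined as

$$\mathbf{A}_l^{(1)} \cup \mathbf{A}_l^{(2)} = \{\mathbf{W}_l(i,j)\}, \quad \text{and} \quad \mathbf{A}_l^{(1)} \cap \mathbf{A}_l^{(2)} = \emptyset, \tag{5}$$

where $\mathbf{A}_l^{(1)}$ denotes the first weight group that needs to be quantized, and $\mathbf{A}_l^{(2)}$ denotes the other weight group that needs to be retrained. We leave the choice of group partition strategy to the experiment section. Here, we define a binary matrix $\mathbf{T}_l$ to help distinguish the above two categories of weights. That is, $\mathbf{T}_l(i,j) = 0$ means $\mathbf{W}_l(i,j) \in \mathbf{A}_l^{(1)}$, and $\mathbf{T}_l(i,j) = 1$ means $\mathbf{W}_l(i,j) \in \mathbf{A}_l^{(2)}$.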
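One way to build such a mask is sketched below in Python for a magnitude-based (pruning-inspired) partition. The sketch makes several assumptions that are not spelled out in the text: the function name `pruning_inspired_partition` is hypothetical, the layer is flattened to a list, the step is driven by an accumulated portion of quantized weights, and the largest remaining magnitudes are moved into the quantized group first. In the mask, 0 marks a weight in the quantized group and 1 marks a weight that is still retrained.

```python
def pruning_inspired_partition(weights, mask, accumulated_portion):
    """Return a new mask after moving the largest free weights to the quantized group."""
    total = len(weights)
    target = round(accumulated_portion * total)        # quantized weights after this step
    already = sum(1 for t in mask if t == 0)           # quantized in earlier steps
    free = [i for i, t in enumerate(mask) if t == 1]   # still floating-point
    free.sort(key=lambda i: abs(weights[i]), reverse=True)
    new_mask = list(mask)
    for i in free[:max(target - already, 0)]:
        new_mask[i] = 0
    return new_mask

weights = [0.9, -0.1, 0.4, -0.7]
mask = pruning_inspired_partition(weights, [1, 1, 1, 1], 0.5)
print(mask)  # [0, 1, 1, 0]
```

Calling the function again with a larger accumulated portion keeps earlier decisions fixed and only moves additional weights, which matches the disjoint-group requirement of the partition.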
2.3 Incremental Network Quantization Algorithm
Now, we come to the training method. Taking the $l$-th layer as an example, the basic optimization problem of making its weights either powers of two or zero can be expressed as

$$\min_{\mathbf{W}_l} \; E(\mathbf{W}_l) = \mathcal{L}(\mathbf{W}_l) + \lambda R(\mathbf{W}_l) \quad \text{s.t.} \quad \mathbf{W}_l(i,j) \in \mathbf{P}_l, \; 1 \le l \le L, \tag{6}$$

where $\mathcal{L}(\mathbf{W}_l)$ is the network loss, $R(\mathbf{W}_l)$ is the regularization term, $\lambda$ is a positive coefficient, and the constraint term indicates that each weight entry $\mathbf{W}_l(i,j)$ should be chosen from the set $\mathbf{P}_l$ consisting of a fixed number of powers of two plus zero. Directly solving the above optimization problem when training from scratch is challenging, since convergence problems arise very easily.

By performing the weight partition and group-wise quantization operations beforehand, the optimization problem defined in (6) can be reshaped into an easier version. That is, we only need to optimize the following objective function

$$\min_{\mathbf{W}_l} \; E(\mathbf{W}_l) = \mathcal{L}(\mathbf{W}_l) + \lambda R(\mathbf{W}_l) \quad \text{s.t.} \quad \mathbf{W}_l(i,j) \in \mathbf{P}_l \;\; \text{if } \mathbf{T}_l(i,j) = 0, \; 1 \le l \le L, \tag{7}$$

where $\mathbf{P}_l$ is determined at the group-wise quantization operation, and the binary matrix $\mathbf{T}_l$ acts as a mask which is determined by the weight partition operation. Since $\mathbf{P}_l$ and $\mathbf{T}_l$ are known, the optimization problem (7) can be solved using the popular stochastic gradient descent (SGD) method. That is, in INQ, the update scheme for retraining is

$$\mathbf{W}_l(i,j) \leftarrow \mathbf{W}_l(i,j) - \gamma \frac{\partial E}{\partial \mathbf{W}_l(i,j)} \mathbf{T}_l(i,j), \tag{8}$$

where $\gamma$ is a positive learning rate. Note that the binary matrix $\mathbf{T}_l$ forces a zero update on the weights that have been quantized. That is, only the weights that still hold floating-point values are updated, akin to the latest pruning methods (Han et al., 2015; Guo et al., 2016) in which only the weights that are not currently removed are retrained to enhance network accuracy. The whole procedure of our INQ is summarized as Algorithm 1.
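The way partition, quantization and masked retraining might be chained can be sketched on a toy problem. The Python below is an illustration under stated assumptions, not a reproduction of Algorithm 1: the names `inq` and `toy_grad` are hypothetical, a single flattened layer is used, the exponent bounds are computed once from the pretrained weights, 0 is treated as the neighbour of the smallest power of two, out-of-range weights are clamped to the largest level as a safeguard, and the network loss is replaced by the toy objective $E(w) = \frac{1}{2}(\sum_i w_i - c)^2$.

```python
import math

def quantize(w, n1, n2):
    """Round to a signed power of two or zero (0 assumed adjacent to 2**n2)."""
    mags = [0.0] + [2.0 ** n for n in range(n2, n1 + 1)]
    if abs(w) >= 3 * mags[-1] / 2:                 # safeguard added for this sketch
        return math.copysign(mags[-1], w)
    for alpha, beta in zip(mags, mags[1:]):
        if (alpha + beta) / 2 <= abs(w) < 3 * beta / 2:
            return math.copysign(beta, w)
    return 0.0

def inq(weights, grad_fn, portions, bitwidth, lr=0.05, steps=200):
    """Incrementally quantize `weights`; grad_fn(w) returns dE/dw as a list."""
    w = list(weights)
    mask = [1] * len(w)                            # 1 = still floating-point
    s = max(abs(x) for x in w)
    n1 = math.floor(math.log2(4 * s / 3))
    n2 = n1 + 1 - (2 ** (bitwidth - 1)) // 2
    for portion in portions:
        # Weight partition: largest remaining magnitudes join the quantized group.
        target = round(portion * len(w))
        free = sorted((i for i, t in enumerate(mask) if t == 1),
                      key=lambda i: abs(w[i]), reverse=True)
        quantized_count = len(w) - len(free)
        for i in free[:max(target - quantized_count, 0)]:
            w[i] = quantize(w[i], n1, n2)          # group-wise quantization
            mask[i] = 0
        # Retraining: masked SGD, quantized entries receive a zero update.
        for _ in range(steps):
            g = grad_fn(w)
            w = [wi - lr * gi * ti for wi, gi, ti in zip(w, g, mask)]
    return w, n1, n2

TARGET_SUM = 0.85                                  # sum of the toy "pretrained" weights

def toy_grad(w):
    """Gradient of 0.5 * (sum(w) - TARGET_SUM) ** 2 with respect to each weight."""
    r = sum(w) - TARGET_SUM
    return [r] * len(w)

final, n1, n2 = inq([0.9, -0.5, 0.35, 0.1], toy_grad, [0.5, 1.0], bitwidth=3)
print(final)  # [1.0, -0.5, 0.5, 0.0]
```

In this toy run, quantizing 0.9 to 1.0 raises the sum by 0.1, and the two weights that are still free each move down by about 0.05 during retraining to absorb that error before they are quantized in the second step.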
We would like to highlight that the merits of our INQ lie in three aspects: (1) Weight partition introduces importance-aware weight quantization. (2) Group-wise weight quantization introduces much less accuracy loss than simultaneously quantizing all the network weights, thus leaving retraining more room to recover model accuracy. (3) By integrating the operations of weight partition, group-wise quantization and retraining into a nested loop, our INQ has the potential to obtain a lossless low-precision CNN model from the pretrained full-precision reference.
3 Experimental Results
To analyze the performance of our INQ, we perform extensive experiments on the ImageNet large scale classification task, which is known as the most challenging image classification benchmark so far. The ImageNet dataset has about 1.2 million training images and 50 thousand validation images. Each image is annotated as one of 1000 object classes. We apply our INQ to AlexNet, VGG-16, GoogleNet, ResNet-18 and ResNet-50, covering almost all known deep CNN architectures. Using the center crops of validation images, we report the results with two standard measures: top-1 error rate and top-5 error rate. For fair comparison, all pretrained full-precision (i.e., 32-bit floating-point) CNN models except ResNet-18 are taken from the Caffe model zoo (https://github.com/BVLC/caffe/wiki/ModelZoo). Note that He et al. (2016) do not release their pretrained ResNet-18 model to the public, so we use a publicly available re-implementation by Facebook (https://github.com/facebook/fb.resnet.torch/tree/master/pretrained). Since our method is implemented with Caffe, we make use of an open source tool (https://github.com/zhanghang1989/fbcaffeexts) to convert the pretrained ResNet-18 model from Torch to Caffe.
3.1 Results on ImageNet
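Table 1: 5-bit INQ models compared with their 32-bit floating-point references on ImageNet.

| Network | Bit-width | Top-1 error | Top-5 error | Decrease in top-1/top-5 error |
|---|---|---|---|---|
| AlexNet ref | 32 | 42.76% | 19.77% | |
| AlexNet | 5 | 42.61% | 19.54% | 0.15%/0.23% |
| VGG-16 ref | 32 | 31.46% | 11.35% | |
| VGG-16 | 5 | 29.18% | 9.70% | 2.28%/1.65% |
| GoogleNet ref | 32 | 31.11% | 10.97% | |
| GoogleNet | 5 | 30.98% | 10.72% | 0.13%/0.25% |
| ResNet-18 ref | 32 | 31.73% | 11.31% | |
| ResNet-18 | 5 | 31.02% | 10.90% | 0.71%/0.41% |
| ResNet-50 ref | 32 | 26.78% | 8.76% | |
| ResNet-50 | 5 | 25.19% | 7.55% | 1.59%/1.21% |
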
With the expected bit-width set to 5, the first set of experiments is performed to verify the efficacy of our INQ on different CNN architectures. Regarding weight partition, there are several candidate strategies, which we tried in our previous work on efficient network pruning (Guo et al., 2016). In Guo et al. (2016), we found that random partition and pruning-inspired partition are the two best choices compared with the others. Thus in this paper, we directly compare these two strategies for weight partition. In the random strategy, the weights in each layer of any pretrained full-precision deep CNN model are randomly split into two disjoint groups. In the pruning-inspired strategy, the weights are divided into two disjoint groups by comparing their absolute values with layer-wise thresholds which are automatically determined by a given splitting ratio. Here we directly use the pruning-inspired strategy, and the experimental results in Section 3.2 will show why. After retraining each pretrained full-precision model for no more than 8 epochs, we obtain the results shown in Table 1. It can be concluded that the 5-bit CNN models generated by our INQ show consistently improved top-1 and top-5 recognition rates compared with the respective full-precision references. Parameter settings are described below.
AlexNet: AlexNet has 5 convolutional layers and 3 fully connected layers. We set the accumulated portions of quantized weights at iterative steps as {0.3, 0.6, 0.8, 1}, the batch size as 256, the weight decay as 0.0005, and the momentum as 0.9.
VGG-16: Compared with AlexNet, VGG-16 has 13 convolutional layers and more parameters. We set the accumulated portions of quantized weights at iterative steps as {0.5, 0.75, 0.875, 1}, the batch size as 32, the weight decay as 0.0005, and the momentum as 0.9.
GoogleNet: Compared with AlexNet and VGG-16, GoogleNet is more difficult to quantize due to a smaller number of parameters and the increased network width. We set the accumulated portions of quantized weights at iterative steps as {0.2, 0.4, 0.6, 0.8, 1}, the batch size as 80, the weight decay as 0.0002, and the momentum as 0.9.
ResNet-18: Different from the above three networks, ResNets have batch normalization layers and relieve the vanishing gradient problem by using shortcut connections. We first test the 18-layer version for exploratory purposes and test the 50-layer version later on. The network architectures of ResNet-18 and ResNet-34 are very similar. The main difference is the number of residual blocks in each stage. We set the accumulated portions of quantized weights at iterative steps as {0.5, 0.75, 0.875, 1}, the batch size as 80, the weight decay as 0.0005, and the momentum as 0.9.
ResNet-50: Besides significantly increased network depth, ResNet-50 has a more complex network architecture in comparison to ResNet-18. However, regarding network architecture, ResNet-50 is very similar to ResNet-101 and ResNet-152. The main difference is the number of residual blocks in each stage. We set the accumulated portions of quantized weights at iterative steps as {0.5, 0.75, 0.875, 1}, the batch size as 32, the weight decay as 0.0005, and the momentum as 0.9.
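To make the meaning of the accumulated portions concrete, the following Python sketch converts such a schedule into the fraction of still-floating-point weights that is quantized at each iterative step. The helper name `per_step_fractions` is hypothetical and the conversion is an interpretation of the schedules listed above, not code from the authors.

```python
def per_step_fractions(accumulated):
    """Fraction of the remaining floating-point weights quantized at each step."""
    fractions, done = [], 0.0
    for portion in accumulated:
        fractions.append((portion - done) / (1.0 - done))
        done = portion
    return fractions

# The schedule used for VGG-16 and the ResNets halves the remaining weights each time.
print(per_step_fractions([0.5, 0.75, 0.875, 1.0]))  # [0.5, 0.5, 0.5, 1.0]
```

For the AlexNet schedule {0.3, 0.6, 0.8, 1}, the same helper gives roughly 0.30, 0.43, 0.50 and 1.0, so later steps quantize a growing share of the weights that remain.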
3.2 Analysis of Weight Partition Strategies
In our INQ, the first operation is weight partition, whose result directly affects the following group-wise quantization and retraining operations. Therefore, the second set of experiments is conducted to analyze two candidate strategies for weight partition. As mentioned in the previous section, we use the pruning-inspired strategy for weight partition. Unlike the random strategy, in which all the weights have equal probability of falling into either of the two disjoint groups, the pruning-inspired strategy considers the weights with larger absolute values to be more important than the smaller ones for forming a low-precision base for the original CNN model. We use ResNet-18 as a test case to compare the performance of these two strategies. In the experiments, the parameter settings are exactly the same as described in Section 3.1. We set 4 epochs for weight retraining. Table 2 summarizes the results of our INQ with 5-bit quantization. It can be seen that our INQ achieves a top-1 error rate of 32.11% and a top-5 error rate of 11.73% by using random partition. Comparatively, pruning-inspired partition brings 1.09% and 0.83% decreases in top-1 and top-5 error rates, respectively. Pruning-inspired partition is thus better than random partition, which is why we use it in this paper. For future work, weight partition based on quantization error could also be an option worth exploring.
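Table 2: Comparison of two weight partition strategies on ResNet-18 with 5-bit quantization.

| Strategy | Bit-width | Top-1 error | Top-5 error |
|---|---|---|---|
| Random partition | 5 | 32.11% | 11.73% |
| Pruning-inspired partition | 5 | 31.02% | 10.90% |
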
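For contrast with the magnitude-based strategy, the random strategy can be sketched as follows. This Python example is a hedged illustration only: the function name `random_partition`, the flattened mask (0 = quantized group, 1 = retrained group) and the explicit random generator argument are assumptions made here, not details given in the text.

```python
import random

def random_partition(mask, accumulated_portion, rng):
    """Move randomly chosen floating-point weights into the quantized group."""
    free = [i for i, t in enumerate(mask) if t == 1]
    target = round(accumulated_portion * len(mask))    # quantized weights after this step
    need = max(target - (len(mask) - len(free)), 0)    # how many more to select
    new_mask = list(mask)
    for i in rng.sample(free, need):                   # every free weight is equally likely
        new_mask[i] = 0
    return new_mask

rng = random.Random(0)
step1 = random_partition([1] * 8, 0.5, rng)
step2 = random_partition(step1, 0.75, rng)
print(step1.count(0), step2.count(0))  # 4 6
```

Unlike the pruning-inspired strategy, the selection ignores weight magnitudes, so large weights may stay in the retrained group until late steps.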
3.3 The Trade-off between Expected Bit-width and Model Accuracy
The third set of experiments is performed to explore the limit of the expected bit-width under which our INQ can still achieve lossless network quantization. Similar to the second set of experiments, we also use ResNet-18 as a test case, and the parameter settings for the batch size, the weight decay and the momentum are exactly the same. Lower-precision models with 4-bit, 3-bit and even 2-bit ternary weights are generated for comparison. As the expected bit-width goes down, the number of candidate quantum values decreases significantly, so we increase the number of iterative steps accordingly to enhance the accuracy of the final low-precision model. Specifically, we set the accumulated portions of quantized weights at iterative steps as {0.3, 0.5, 0.8, 0.9, 0.95, 1}, {0.2, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 1} and {0.2, 0.4, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 0.975, 1} for the 4-bit, 3-bit and 2-bit ternary models, respectively. The required number of epochs also increases as the expected bit-width goes down, and it reaches 30 when training our 2-bit ternary model. Although our 4-bit model shows slightly decreased accuracy compared with the 5-bit model, its accuracy is still better than that of the pretrained full-precision model. Comparatively, even when the expected bit-width goes down to 3, our low-precision model shows only 0.19% and 0.33% losses in top-1 and top-5 recognition rates, respectively. As for our 2-bit ternary model, although it incurs a 2.25% increase in top-1 error rate and a 1.56% increase in top-5 error rate in comparison to the pretrained full-precision reference, its accuracy is considerably better than the state-of-the-art results reported for binary-weight network (BWN) (Rastegari et al., 2016) and ternary weight network (TWN) (Li & Liu, 2016). Detailed results are summarized in Table 3 and Table 4.
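Table 3: INQ on ResNet-18 with different expected bit-widths.

| Model | Bit-width | Top-1 error | Top-5 error |
|---|---|---|---|
| ResNet-18 ref | 32 | 31.73% | 11.31% |
| INQ | 5 | 31.02% | 10.90% |
| INQ | 4 | 31.11% | 10.99% |
| INQ | 3 | 31.92% | 11.64% |
| INQ | 2 (ternary) | 33.98% | 12.87% |
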
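Table 4: Comparison of our 2-bit ternary model with BWN and TWN on ResNet-18.

| Method | Bit-width | Top-1 error | Top-5 error |
|---|---|---|---|
| BWN (Rastegari et al., 2016) | 1 | 39.20% | 17.00% |
| TWN (Li & Liu, 2016) | 2 (ternary) | 38.20% | 15.80% |
| INQ (ours) | 2 (ternary) | 33.98% | 12.87% |
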
3.4 Low-Bit Deep Compression
In the literature, the recently proposed deep compression method (Han et al., 2016) reports the best results so far on network compression without loss of model accuracy. Therefore, the last set of experiments is conducted to explore the potential of our INQ for much better deep compression. Note that Han et al. (2016) is a hybrid network compression solution combining three different techniques, namely network pruning (Han et al., 2015), vector quantization (Gong et al., 2014) and Huffman coding. Taking AlexNet as an example, network pruning achieves 9× compression, but this result is mainly obtained from the fully connected layers. Its compression performance on the convolutional layers is actually less than 3× (as can be seen in Table 4 of Han et al. (2016)). Besides, network pruning is realized by separately performing pruning and retraining in an iterative way, which is very time-consuming. It costs at least several weeks to compress AlexNet. We solved this problem with our dynamic network surgery (DNS) method (Guo et al., 2016), which achieves about 7× speedup in training and improves the performance of network pruning from 9× to 17.7×. In Han et al. (2016), after network pruning, vector quantization further improves the compression ratio from 9× to 27×, and Huffman coding finally boosts the compression ratio up to 35×. For fair comparison, we combine our proposed INQ and DNS, and compare the resulting method with Han et al. (2016). Detailed results are summarized in Table 5. When combining our proposed INQ and DNS, we achieve much better compression results compared with Han et al. (2016). Specifically, with 5-bit quantization, we can achieve 53× compression with slightly larger gains in both top-5 and top-1 recognition rates, yielding 51.43%/96.30% relative improvement in compression performance compared with the full version/fair version (i.e., the combination of network pruning and vector quantization) of Han et al. (2016), respectively. Consistently better results have also been obtained for our 4-bit and 3-bit models.
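The two improvement figures quoted above follow from the compression ratios of 53, 35 and 27 as a relative gain; the short derivation below is added here as a check.

```latex
\frac{53 - 35}{35} = \frac{18}{35} \approx 51.43\%, \qquad
\frac{53 - 27}{27} = \frac{26}{27} \approx 96.30\%.
```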
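Table 5: Comparison of the combination of our INQ and DNS with the deep compression method (Han et al., 2016) on AlexNet. Conv: convolutional layers, FC: fully connected layers, P: pruning, Q: quantization, H: Huffman coding.

| Method | Bit-width (Conv/FC) | Compression ratio | Decrease in top-1/top-5 error |
|---|---|---|---|
| Han et al. (2016) (P+Q) | 8/5 | 27× | 0.00%/0.03% |
| Han et al. (2016) (P+Q+H) | 8/5 | 35× | 0.00%/0.03% |
| Han et al. (2016) (P+Q+H) | 8/4 | | 0.01%/0.00% |
| Our method (P+Q) | 5/5 | 53× | 0.08%/0.03% |
| Han et al. (2016) (P+Q+H) | 4/2 | | 1.99%/2.60% |
| Our method (P+Q) | 4/4 | 71× | 0.52%/0.20% |
| Our method (P+Q) | 3/3 | 89× | 1.47%/0.96% |
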
Besides, we also perform a set of experiments on AlexNet to compare the performance of our INQ and vector quantization (Gong et al., 2014). For fair comparison, retraining is also used to enhance the performance of vector quantization, and we set the number of cluster centers for all of the 5 convolutional layers and 3 fully connected layers to 32 (i.e., 5-bit quantization). In the experiment, vector quantization incurs over 3% loss in model accuracy. When we change the number of cluster centers for the convolutional layers from 32 to 128, it incurs an accuracy loss of 0.98%. This is consistent with the results reported in Gong et al. (2014). Vector quantization is mainly proposed to compress the parameters in the fully connected layers of a pretrained full-precision CNN model, while our INQ addresses all network layers simultaneously and has no accuracy loss for 5-bit and 4-bit quantization. These results indicate that our INQ performs much better than vector quantization. Last but not least, the final weights for vector quantization (Gong et al., 2014), network pruning (Han et al., 2015) and deep compression (Han et al., 2016) are still floating-point values, but the final weights for our INQ are either powers of two or zero. The direct advantage of our INQ is that the original floating-point multiplication operations can be replaced by cheaper binary bit-shift operations on dedicated hardware such as FPGAs.
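The bit-shift argument can be illustrated with a small Python sketch. It assumes that activations are stored as fixed-point integers, which is an assumption of this example because the text does not specify the activation format, and the helper name `shift_multiply` is hypothetical. A weight $\pm 2^{n}$ is represented by its sign and exponent, and zero weights are simply skipped.

```python
def shift_multiply(x_fixed, sign, exponent):
    """Multiply an integer activation by sign * 2**exponent using shifts only."""
    if exponent >= 0:
        shifted = x_fixed << exponent      # multiply by 2**exponent
    else:
        shifted = x_fixed >> -exponent     # divide by 2**(-exponent), dropping low-order bits
    return sign * shifted

# Weight -2**-3 applied to the fixed-point activation 40.
print(shift_multiply(40, -1, -3))  # -5
```

When the activation is not divisible by the shifted power of two, the right shift discards the low-order bits, so a real fixed-point design would need to budget enough fractional bits for the smallest exponent in use.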
4 Conclusions
In this paper, we present INQ, a new network quantization method, to address the problem of how to convert any pre-trained full-precision (i.e., 32-bit floating-point) CNN model into a lossless low-precision version whose weights are constrained to be either powers of two or zero. Unlike existing methods which usually quantize all the network weights simultaneously, INQ is a more compact quantization framework. It incorporates three interdependent operations: weight partition, group-wise quantization and retraining. Weight partition splits the weights in each layer of a pre-trained full-precision CNN model into two disjoint groups which play complementary roles in INQ. The weights in the first group are directly quantized by a variable-length encoding method, forming a low-precision base for the original CNN model. The weights in the other group are retrained while keeping all the quantized weights fixed, compensating for the accuracy loss from network quantization. More importantly, the operations of weight partition, group-wise quantization and retraining are repeated on the latest retrained weight group in an iterative manner until all the weights are quantized, acting as an incremental network quantization and accuracy enhancement procedure. On the ImageNet large-scale classification task, we conduct extensive experiments and show that our quantized CNN models with 5-bit, 4-bit, 3-bit and even 2-bit ternary weights have improved or at least comparable accuracy against their full-precision baselines, including AlexNet, VGG-16, GoogleNet and ResNets. As for future work, we plan to extend the incremental idea behind INQ from low-precision weights to low-precision activations and low-precision gradients (we have already made some good progress on this, as shown in our supplementary materials). We will also investigate computation and power efficiency by implementing our low-precision CNN models on hardware platforms.
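The loop of weight partition, group-wise quantization and retraining summarized above can be sketched in a few lines of Python on a flat list of weights. This is a toy illustration under stated assumptions, not the authors' implementation: the partition rule (largest magnitudes first), the accumulated portions (50%, 75%, 87.5%, 100%), the exponent range `n1`/`n2`, nearest-value rounding, and the toy `retrain` callback (which keeps the sum of the weights close to a target) are all hypothetical choices made for this example.

```python
import math


def quantize_pow2(w, n1, n2):
    """Round w to the nearest value in {0, ±2**n2, ..., ±2**n1} (assumed rule)."""
    levels = [0.0] + [2.0 ** k for k in range(n2, n1 + 1)]
    nearest = min(levels, key=lambda level: abs(abs(w) - level))
    return math.copysign(nearest, w) if nearest else 0.0


def inq(weights, retrain, portions=(0.5, 0.75, 0.875, 1.0), n1=-1, n2=-4):
    """Incrementally quantize `weights`; `retrain` only updates free weights."""
    weights = list(weights)
    frozen = [False] * len(weights)
    for portion in portions:
        target = round(portion * len(weights))   # weights quantized after this step
        # Weight partition (assumption: larger magnitudes are quantized first).
        free = [i for i, is_frozen in enumerate(frozen) if not is_frozen]
        free.sort(key=lambda i: abs(weights[i]), reverse=True)
        # Group-wise quantization of the newly selected group.
        for i in free[: max(0, target - sum(frozen))]:
            weights[i] = quantize_pow2(weights[i], n1, n2)
            frozen[i] = True
        # Retraining: the remaining free weights compensate for the error.
        weights = retrain(weights, frozen)
    return weights


def make_toy_retrain(target_sum, lr=0.1, steps=50):
    """Toy objective 0.5 * (sum(w) - target_sum)**2, minimized over free weights."""
    def retrain(weights, frozen):
        w = list(weights)
        for _ in range(steps):
            err = sum(w) - target_sum
            for i, is_frozen in enumerate(frozen):
                if not is_frozen:
                    w[i] -= lr * err             # gradient step on free weights only
        return w
    return retrain


example = [0.42, -0.31, 0.07, -0.9, 0.26, -0.12, 0.55, 0.03]
quantized = inq(example, make_toy_retrain(sum(example)))
print(quantized)
```

After the last portion every weight is frozen, so the final retraining call changes nothing and all returned weights are zero or a signed power of two in the chosen range.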
References
 Chen et al. (2015a) Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015a.
 Chen et al. (2015b) Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In ICML, 2015b.
 Courbariaux et al. (2015) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
 Courbariaux et al. (2016) Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830v3, 2016.
 Girshick (2015) Ross Girshick. Fast R-CNN. In ICCV, 2015.
 Gong et al. (2014) Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115v1, 2014.
 Guo et al. (2016) Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In NIPS, 2016.
 Gupta et al. (2015) Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, 2015.
 Han et al. (2015) Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. In NIPS, 2015.
 Han et al. (2016) Song Han, Jeff Pool, John Tran, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061v1, 2016.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
 Li & Liu (2016) Fengfu Li and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711v1, 2016.
 Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. arXiv preprint arXiv:1603.05279v4, 2016.
 Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 Soudry et al. (2014) Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS, 2014.
 Sun et al. (2014) Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014.
 Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 Szegedy et al. (2016) Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261v1, 2016.
 Taigman et al. (2014) Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014.
 Vanhoucke et al. (2011) Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS, 2011.
 Zhou et al. (2016) Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160v1, 2016.
A Appendix 1: Statistical Analysis of the Quantized Weights
Taking our 5-bit AlexNet model as an example, we analyze the distribution of the quantized weights. Detailed statistical results are summarized in Table 6. We find that: (1) in the 1st and 2nd convolutional layers, the six most frequent nonzero values (three power-of-two magnitudes, each with both signs) and zero together with the eight most frequent nonzero values (four power-of-two magnitudes, each with both signs) occupy over 60% and 94% of all quantized weights, respectively; (2) the distributions of the quantized weights in the 3rd, 4th and 5th convolutional layers are similar to that of the 2nd convolutional layer, and more weights are quantized to zero in the 2nd, 3rd, 4th and 5th convolutional layers than in the 1st convolutional layer; (3) in the 1st fully connected layer, zero together with the eight most frequent nonzero values occupies about 98% of all quantized weights, and similar results can be seen for the 2nd fully connected layer; (4) generally, the distributions of the quantized weights in the convolutional layers are more scattered than those in the fully connected layers. This may partially explain why it is much easier to get good compression performance on fully connected layers than on convolutional layers when using methods such as network hashing (Chen et al., 2015b) and vector quantization (Gong et al., 2014); (5) for the 5-bit AlexNet model, the required bit-width for each layer is actually 4 rather than 5.
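| Weight | Conv1 | Conv2 | Conv3 | Conv4 | Conv5 | FC6 | FC7 | FC8 |
|---|---|---|---|---|---|---|---|---|
|  |  |  |  |  |  | 8.95% | 6.37% | 3.86% |
|  |  |  |  |  |  | 12.29% | 9.58% | 6.19% |
|  | 5.04% | 10.55% | 11.58% | 10.09% | 9.88% | 16.48% | 16.13% | 12.90% |
|  | 6.56% | 12.09% | 14.34% | 14.24% | 14.68% | 10.84% | 17.87% | 19.51% |
|  | 9.22% | 13.08% | 15.26% | 18.49% | 20.66% | 0.79% | 3.43% | 11.09% |
|  | 10.52% | 8.73% | 5.92% | 7.77% | 9.79% | 0.002% | 0.004% | 0.40% |
|  | 9.75% | 2.70% | 0.49% | 0.38% | 0.55% |  |  |  |
|  | 4.61% | 0.39% | 0.02% | 0.01% | 0.004% |  |  |  |
|  | 0.67% | 0.01% |  |  |  |  |  |  |
| 0 | 5.51% | 11.30% | 12.24% | 9.70% | 8.97% | 8.86% | 6.17% | 3.62% |
|  |  |  |  |  |  | 8.30% | 5.81% | 3.40% |
|  |  |  |  |  |  | 10.51% | 7.84% | 4.69% |
|  | 5.20% | 9.70% | 10.44% | 8.60% | 7.69% | 12.91% | 11.30% | 8.08% |
|  | 6.79% | 11.01% | 11.66% | 10.33% | 8.95% | 8.95% | 11.90% | 10.94% |
|  | 9.99% | 11.05% | 11.86% | 12.25% | 10.67% | 1.12% | 3.54% | 12.56% |
|  | 11.15% | 6.57% | 5.22% | 6.81% | 6.37% | 0.01% | 0.06% | 2.75% |
|  | 10.14% | 2.26% | 0.86% | 1.24% | 1.62% | 0.01% |  |  |
|  | 4.26% | 0.53% | 0.09% | 0.08% | 0.16% |  |  |  |
|  | 0.60% | 0.05% | 0.01% | 0.003% | 0.01% |  |  |  |
| Total | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| Bit-width | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |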
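The coverage figures quoted in the analysis (over 60%, over 94% and about 98%) and the 4-bit claim can be checked directly against the percentages in the table above. The Python sketch below is only a sanity calculation: the grouping of table entries into lists and the helper `bits_needed` are introduced here for illustration, and the count of 15 distinct values for Conv1 is read off the table (7 non-empty entries in each of the two sign blocks, plus zero).

```python
def bits_needed(num_values: int) -> int:
    """Smallest number of bits able to index `num_values` distinct values."""
    return max(1, (num_values - 1).bit_length())


# Percentages copied from the table: the most frequent magnitudes in both
# sign blocks, plus the zero row where the text includes zero in the set.
conv1_six = [9.22, 10.52, 9.75, 9.99, 11.15, 10.14]
conv2_nine = [10.55, 12.09, 13.08, 8.73, 11.30, 9.70, 11.01, 11.05, 6.57]
fc6_nine = [8.95, 12.29, 16.48, 10.84, 8.86, 8.30, 10.51, 12.91, 8.95]

coverage = {
    "Conv1": round(sum(conv1_six), 2),
    "Conv2": round(sum(conv2_nine), 2),
    "FC6": round(sum(fc6_nine), 2),
}
print(coverage)

# Conv1 uses 7 magnitudes with both signs plus zero: 15 distinct values.
conv1_bits = bits_needed(7 + 7 + 1)
print(conv1_bits)
```

The three sums come out at roughly 60.8%, 94.1% and 98.1%, and 15 distinct values fit in 4 bits, which agrees with the bit-width row of the table.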
B Appendix 2: Lossless CNNs with Low-Precision Weights and Low-Precision Activations
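| Network | Bit-width (weight/activation) | Top-1 error | Top-5 error | Decrease in top-1/top-5 error |
|---|---|---|---|---|
| VGG-16 ref | 32/32 | 31.46% | 11.35% |  |
| VGG-16 | 5/4 | 29.82% | 10.19% | 1.64%/1.16% |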
Recently, we have made some good progress on extending our INQ to lossless CNNs with both low-precision weights and low-precision activations. According to the results summarized in Table 7, our VGG-16 model with 5-bit weights and 4-bit activations shows improved top-5 and top-1 recognition rates in comparison to the pre-trained reference with 32-bit floating-point weights and 32-bit floating-point activations. To the best of our knowledge, these are the best results reported on the VGG-16 architecture so far.