Learning Accurate Low-Bit Deep Neural Networks with Stochastic Quantization



Low-bit deep neural networks (DNNs) have become critical for embedded applications due to their low storage requirements and computing efficiency. However, they suffer from a non-negligible accuracy drop. This paper proposes the stochastic quantization (SQ) algorithm for learning accurate low-bit DNNs. The motivation is the following observation. Existing training algorithms approximate all real-valued elements/filters with low-bit representations at once in each iteration. The quantization errors may be small for some elements/filters but large for others, which leads to inappropriate gradient directions during training and thus to a notable accuracy drop. Instead, SQ quantizes a portion of the elements/filters to low-bit with a stochastic probability inversely proportional to the quantization error, while keeping the other portion unchanged in full precision. The quantized and full-precision portions are updated separately with their corresponding gradients in each iteration. The SQ ratio is gradually increased until the whole network is quantized. This procedure can greatly compensate for the quantization error and thus yields better accuracy for low-bit DNNs. Experiments show that SQ consistently and significantly improves the accuracy of different low-bit DNNs on various datasets and network structures.


Yinpeng Dong (dyp17@mails.tsinghua.edu.cn) 1, Renkun Ni (rn9zm@virginia.edu) 2, Jianguo Li (jianguo.li@intel.com) 3, Yurong Chen (yurong.chen@intel.com) 3, Jun Zhu (dcszj@mail.tsinghua.edu.cn) 1, Hang Su (suhangss@mail.tsinghua.edu.cn) 1
1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 University of Virginia
3 Intel Labs China, Beijing, China

1 Introduction

Deep Neural Networks (DNNs) have demonstrated significant performance improvements in a wide range of computer vision tasks [LeCun et al.(2015)LeCun, Bengio, and Hinton], such as image classification [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Simonyan and Zisserman(2015), Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich, He et al.(2016)He, Zhang, Ren, and Sun], object detection [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik, Ren et al.(2015)Ren, He, Girshick, and Sun] and semantic segmentation [Long et al.(2015)Long, Shelhamer, and Darrell, Chen et al.(2015)Chen, Papandreou, Kokkinos, Murphy, and Yuille]. DNNs often stack tens or even hundreds of layers with millions of parameters to achieve the promising performance. Therefore, DNN based systems usually need considerable storage and computation power. This hinders the deployment of DNNs to some resource limited scenarios, especially low-power embedded devices in the emerging Internet-of-Things (IoT) domain.

Many works have been proposed to reduce model parameter size or even computation complexity due to the high redundancy in DNNs [Denil et al.(2013)Denil, Shakibi, Dinh, de Freitas, et al., Han et al.(2015)Han, Pool, Tran, and Dally]. Among them, low-bit deep neural networks [Courbariaux et al.(2015)Courbariaux, Bengio, and David, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi, Li et al.(2016)Li, Zhang, and Liu, Zhou et al.(2016)Zhou, Wu, Ni, Zhou, Wen, and Zou, Zhu et al.(2017)Zhu, Han, Mao, and Dally, Courbariaux and Bengio(2016), Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio], which aim at training DNNs with low-bitwidth weights or even activations, attract much attention due to their promised model size and computing efficiency. In particular, in BinaryConnect (BNN) [Courbariaux et al.(2015)Courbariaux, Bengio, and David], the weights are binarized to $+1$ and $-1$, and multiplications are replaced by additions and subtractions to speed up the computation. In [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi], the authors proposed binary weighted networks (BWN), in which the weights are binarized and one scaling factor is added for each filter, and extended them to XNOR-Net with both weights and activations binarized. Moreover, Ternary Weight Networks (TWN) [Li et al.(2016)Li, Zhang, and Liu] incorporate an extra $0$ state, which converts weights into ternary values $\{+1, 0, -1\}$ with 2-bit width. However, low-bit DNNs are challenged by a non-negligible accuracy drop, especially for large-scale models (e.g., ResNet [He et al.(2016)He, Zhang, Ren, and Sun]). We argue that the reason is that they quantize all the weights of a DNN to low-bits together at each training iteration. The quantization error is not consistently small for all elements/filters. It may be very large for some elements/filters, which may lead to inappropriate gradient directions during training and thus make the model converge to a relatively worse local minimum.

Besides training-based solutions for low-bit DNNs, there are also post-quantization techniques, which focus on directly quantizing pre-trained full-precision models [Lin et al.(2016)Lin, Talathi, and Annapureddy, Miyashita et al.(2016)Miyashita, Lee, and Murmann, Zhou et al.(2017)Zhou, Yao, Guo, Xu, and Chen]. These methods have at least two limitations. First, they work mainly for DNNs in classification tasks, and lack the flexibility to be employed for other tasks such as detection and segmentation. Second, in training-based solutions the low-bit quantization can be regarded as a constraint or regularizer [Courbariaux et al.(2015)Courbariaux, Bengio, and David] that steers training towards a local minimum in the low-bitwidth weight space. In contrast, it is relatively difficult for post-quantization techniques (even with fine-tuning) to transfer a local minimum from the full-precision weight space to a low-bitwidth weight space losslessly.

In this paper, we try to overcome the aforementioned issues by proposing the Stochastic Quantization (SQ) algorithm to learn accurate low-bit DNNs. Inspired by stochastic depth [Huang et al.(2016)Huang, Sun, Liu, Sedra, and Weinberger] and dropout [Srivastava et al.(2014)Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov], the SQ algorithm stochastically selects a portion of the weights in a DNN and quantizes them to low-bits in each iteration, while keeping the other portion unchanged in full precision. The selection probability is inversely proportional to the quantization error. We gradually increase the SQ ratio until the whole network is quantized.

We make a comprehensive study of different aspects of the SQ algorithm. First, we study the impact of selection granularity, and show that treating each filter-channel as a whole for stochastic selection and quantization performs better than treating each weight element separately. The reason is that the weight elements in one filter-channel interact with each other to represent a filter structure, so treating each element separately may introduce large distortion when some elements in a filter-channel are low-bit while the others remain full-precision. Second, we compare the proposed roulette algorithm to some deterministic selection algorithms. The roulette algorithm selects elements/filters to quantize with a probability inversely proportional to the quantization error. If the quantization error is large, the corresponding elements/filters are unlikely to be quantized, because they may introduce inaccurate gradient information. The roulette algorithm performs better than deterministic selection algorithms since it eliminates the requirement of finding the best initial partition, and it can explore the search space for a better solution due to the exploitation-exploration nature of stochastic algorithms. Third, we compare different functions for calculating the quantization probability from the quantization errors. Fourth, we design an appropriate scheme for updating the SQ ratio.

The proposed SQ algorithm is generally applicable for any low-bit DNNs including BWN, TWN, etc. Experiments show that SQ can consistently improve the accuracy for different low-bit DNNs (binary or ternary) on various network structures (VGGNet, ResNet, etc) and various datasets (CIFAR, ImageNet, etc). For instance, TWN trained with SQ can even beat the full-precision models in several testing cases. Our main contributions are:

  • We propose the stochastic quantization (SQ) algorithm to overcome the accuracy drop issue in existing low-bit DNNs training algorithms.

  • We comprehensively study different aspects of the SQ algorithm such as selection granularity, partition algorithm, definition of quantization probability functions, and scheme for updating the SQ ratio, which may provide valuable insights for researchers to design other stochastic algorithms in deep learning.

  • We present strong performance improvement with the proposed algorithm on various low bitwidth settings, network architectures, and benchmark datasets.

2 Low-bit DNNs

In this section, we briefly introduce some typical low-bit DNNs as prerequisites, including BNN [Courbariaux et al.(2015)Courbariaux, Bengio, and David], BWN [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] and TWN [Li et al.(2016)Li, Zhang, and Liu]. Formally, we denote the weights of the $l$-th layer in a DNN by $W_l = \{W_1, \dots, W_i, \dots, W_m\}$, where $m$ is the number of output channels, and $W_i \in \mathbb{R}^d$ is the weight vector of the $i$-th filter channel, in which $d = n \times w \times h$ in conv-layers and $d = n$ in FC-layers ($n$, $w$ and $h$ represent input channels, kernel width, and kernel height respectively). $W_l$ can also be viewed as a weight matrix with $m$ rows, where each row $W_i$ is a $d$-dimensional vector. For simplicity, we omit the subscript $l$ in the following.
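As a small illustration of the row view described above, the following sketch flattens a nested conv-layer weight into an $m \times d$ row matrix. The function names `filter_dim` and `flatten_filters` and the `[m][n][w][h]` nesting order are assumptions made for this example only; plain Python lists are used to keep it dependency-free.

```python
def filter_dim(n_in, kernel_w=1, kernel_h=1):
    """Length d of one row W_i: n*w*h for a conv layer, n for an FC layer."""
    return n_in * kernel_w * kernel_h


def flatten_filters(weights):
    """Turn a nested [m][n][w][h] conv weight into m rows of length d = n*w*h."""
    return [
        [x for channel in filt for column in channel for x in column]
        for filt in weights
    ]


# One output channel (m = 1) with n = 2 input channels and a 2x2 kernel.
conv_weight = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]]
rows = flatten_filters(conv_weight)
```

Each entry of `rows` then plays the role of one filter channel $W_i$ in the rest of the section.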

BinaryConnect uses a simple stochastic method to convert each 32-bit weight vector $W_i$ into binary values $B_i$ with the following equation

$$B_i^j = \begin{cases} +1 & \text{with probability } p = \sigma(W_i^j), \\ -1 & \text{with probability } 1 - p, \end{cases} \qquad (1)$$

where $\sigma(x) = \max(0, \min(1, \frac{x+1}{2}))$ is the hard sigmoid function, and $j$ is the index of elements in $W_i$. During the training phase, it keeps both full-precision weights and binarized weights. Binarized weights are used for gradient and forward loss computation, while full-precision weights are used to accumulate gradient updates. During the testing phase, it only needs the binarized weights, so it reduces the model size by $32\times$.
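The stochastic binarization used by BinaryConnect can be sketched as follows. This is a minimal illustration, assuming the usual hard sigmoid $\sigma(x) = \max(0, \min(1, (x+1)/2))$; the function names and the optional `rng` argument are hypothetical and not taken from the authors' code.

```python
import random


def hard_sigmoid(x):
    """Hard sigmoid: clip (x + 1) / 2 to the interval [0, 1]."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def stochastic_binarize(w_row, rng=random):
    """Map each weight to +1 with probability hard_sigmoid(w), otherwise to -1."""
    return [1 if rng.random() < hard_sigmoid(w) else -1 for w in w_row]


# Weights outside [-1, 1] are binarized deterministically.
example = stochastic_binarize([2.0, -2.0, 5.0])
```

Weights close to zero are binarized to either sign with nearly equal probability, which is where most of the randomness comes from.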

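BWN is an extension of BinaryConnect, but introduces a real-valued scaling factor $\alpha \in \mathbb{R}^+$ along with $B_i$ to approximate the full-precision weight vector $W_i$ by solving the optimization problem $\min_{\alpha, B_i} \|W_i - \alpha B_i\|$ and obtaining:

$$B_i = \mathrm{sign}(W_i), \qquad \alpha = \frac{1}{d} \sum_{j=1}^{d} |W_i^j|. \qquad (2)$$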
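A minimal sketch of the BWN approximation for one filter, assuming the closed-form solution from the XNOR-Net paper (the binary vector is the sign of the weights and the scaling factor is the mean absolute weight). The name `bwn_quantize` and the convention of mapping a zero weight to $+1$ are assumptions of this example.

```python
def bwn_quantize(w_row):
    """Return (alpha, B) such that alpha * B approximates the row w_row."""
    alpha = sum(abs(w) for w in w_row) / len(w_row)  # mean absolute value
    signs = [1 if w >= 0 else -1 for w in w_row]     # assumption: sign(0) = +1
    return alpha, signs


alpha, signs = bwn_quantize([0.5, -1.5, 1.0])
approx = [alpha * s for s in signs]  # quantized row used in the forward pass
```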


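TWN introduces an extra $0$ state over BWN and approximates the real-valued weight vector $W_i$ more accurately with a ternary value vector $T_i \in \{+1, 0, -1\}^d$ along with a scaling factor $\alpha$, while still keeping high model-size compression ($16\times$). It solves the optimization problem $\min_{\alpha, T_i} \|W_i - \alpha T_i\|$ with an approximate solution:

$$T_i^j = \begin{cases} +1 & \text{if } W_i^j > \Delta, \\ 0 & \text{if } |W_i^j| \le \Delta, \\ -1 & \text{if } W_i^j < -\Delta, \end{cases} \qquad \alpha = \frac{1}{|I_\Delta|} \sum_{j \in I_\Delta} |W_i^j|, \qquad (3)$$

where $\Delta$ is a positive threshold with the following value

$$\Delta = \frac{0.7}{d} \sum_{j=1}^{d} |W_i^j|, \qquad (4)$$

and $I_\Delta = \{j \mid |W_i^j| > \Delta\}$, with $|I_\Delta|$ denoting the cardinality of the set $I_\Delta$.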
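The ternary approximation can be sketched in the same style. This example assumes the threshold rule from the TWN paper ($\Delta$ equal to $0.7$ times the mean absolute weight) and averages the magnitudes of the weights above the threshold to get $\alpha$; the function name `twn_quantize` and the `delta_factor` parameter are hypothetical.

```python
def twn_quantize(w_row, delta_factor=0.7):
    """Return (alpha, T, delta) such that alpha * T approximates w_row."""
    d = len(w_row)
    delta = delta_factor * sum(abs(w) for w in w_row) / d
    ternary = [1 if w > delta else (-1 if w < -delta else 0) for w in w_row]
    kept = [abs(w) for w in w_row if abs(w) > delta]  # magnitudes above threshold
    alpha = sum(kept) / len(kept) if kept else 0.0
    return alpha, ternary, delta


alpha_t, ternary, delta = twn_quantize([1.0, -1.0, 0.1, 0.0])
```

Small weights are pruned to the $0$ state, so the scaling factor is computed only from the weights that survive the threshold.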

3 Method

In this section, we elaborate on the SQ algorithm. As the low-bit quantization is done layer-by-layer in existing low-bit DNNs [Courbariaux et al.(2015)Courbariaux, Bengio, and David, Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi, Li et al.(2016)Li, Zhang, and Liu], we follow the same paradigm. For simplicity, we only describe the algorithm for one layer. We propose to quantize only a portion of the weights in each layer to low-bits during forward and backward propagation, which reduces the information loss caused by full quantization. Figure 1 illustrates the stochastic quantization algorithm. We first calculate the quantization error, and then derive a quantization probability for each element/filter (see Section 3.2 for details). Given an SQ ratio $r$, we then stochastically select a portion of elements/filters to quantize by the roulette algorithm introduced in Section 3.1. We finally describe the training procedure in Section 3.3.

Figure 1: Illustration of the stochastic quantization procedure. Given the weight matrix $W$ of a conv-layer and an SQ ratio $r$, we first calculate the quantization error. Then we derive the quantization probability for each filter channel (each row of the weight matrix), and use a sampling-without-replacement roulette algorithm based on this probability to select the rows to be quantized. Finally, we obtain a weight matrix that mixes quantized and full-precision rows, and perform forward and backward propagation based on this hybrid matrix during training.

3.1 Stochastic Partition

In each iteration, given the weight matrix $W$ of each layer, we want to partition the rows of $W$ into two disjoint groups $G_q$ and $G_r$, which should satisfy

$$G_q \cup G_r = W \quad \text{and} \quad G_q \cap G_r = \emptyset, \qquad (5)$$

where $W_i \in W$ is the filter for the $i$-th output channel, i.e., the $i$-th row of the weight matrix $W$. We quantize the rows of $W$ in group $G_q$ to low-bitwidth values while keeping the others in group $G_r$ full-precision. $G_q$ contains $N_q$ items, which is restricted by the SQ ratio $r$ ($N_q = r \times m$). Meanwhile, $G_r$ contains $N_r = (1 - r) \times m$ items. The SQ ratio $r$ increases gradually to $100\%$ so that the whole weight matrix $W$ and the whole network are quantized at the end of training. We introduce a quantization probability $p \in \mathbb{R}^m$ over the rows of $W$, where the $i$-th element $p_i$ indicates the probability of the $i$-th row being quantized. The quantization probability is defined over the quantization error, which will be described in Section 3.2 (see footnote 1).

Given the SQ ratio $r$ and the quantization probability $p$ for each layer, we propose a sampling-without-replacement roulette algorithm to select $N_q$ rows to be quantized for group $G_q$, while the remaining $N_r$ rows are kept full-precision. Algorithm 1 illustrates the roulette selection procedure.

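Input: the SQ ratio $r$ and the quantization probability vector $p \in \mathbb{R}^m$ over the output channels of the weight matrix $W = \{W_1, \dots, W_m\}$.
Output: groups $G_q$ and $G_r$.
$G_q = \emptyset$; $G_r = \{W_1, \dots, W_m\}$;
$N_q = r \times m$;
for $i = 1$ to $N_q$ do
      Normalize $p$ with $\hat{p} = p / \|p\|_1$; // $\|p\|_1$ is the $L_1$ norm of $p$
      Sample a random value $v_i$ uniformly in $(0, 1]$;
      Set $s_i = 0$, and $j = 0$; // $s_i$ accumulates the normalized probability
      while $s_i < v_i$ do
            $j = j + 1$; $s_i = s_i + \hat{p}_j$; // $\hat{p}_j$ is the $j$-th element in $\hat{p}$
      end while
      $G_q = G_q \cup \{W_j\}$; $G_r = G_r \setminus \{W_j\}$;
      Set $p_j = 0$; // avoid the $j$-th channel being selected again
end for
Algorithm 1: Roulette algorithm to partition the weight matrix into quantized and real-valued groups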
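The roulette selection can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' Caffe implementation: the function name `roulette_partition`, the rounding of $r \times m$ to an integer, and the use of 0-based row indices instead of the rows themselves are assumptions. Instead of renormalizing the probabilities before every draw, the random value is scaled by the remaining probability mass, which is equivalent.

```python
import random


def roulette_partition(probs, sq_ratio, rng=random):
    """Sample round(sq_ratio * m) distinct row indices without replacement.

    Returns (quantized, full_precision) as two lists of row indices.
    """
    p = [float(x) for x in probs]   # local copy; selected entries are zeroed
    m = len(p)
    n_q = int(round(sq_ratio * m))  # assumption: round r * m to an integer
    quantized = []
    for _ in range(n_q):
        total = sum(p)
        if total <= 0.0:
            break                   # no probability mass left to draw from
        v = rng.random() * total    # same as normalizing p and drawing v in [0, 1)
        acc, j = 0.0, None
        for k, pk in enumerate(p):
            if pk <= 0.0:
                continue            # already selected (or zero probability)
            acc += pk               # accumulate probability along the wheel
            j = k
            if acc > v:
                break
        quantized.append(j)
        p[j] = 0.0                  # never select this row again
    chosen = set(quantized)
    full_precision = [k for k in range(m) if k not in chosen]
    return sorted(quantized), full_precision


q_rows, fp_rows = roulette_partition([0.1, 0.2, 0.3, 0.4], 0.5, random.Random(0))
```

Passing a seeded `random.Random` instance makes a partition reproducible, which can help when comparing against the deterministic baselines studied in Section 4.2.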

3.2 From Quantization Error to Quantization Probability

Recall that the motivation of this work is to alleviate the accuracy drop caused by inappropriate gradient directions from large quantization errors. The quantization probability over channels should therefore be based on the quantization error between the quantized and real-valued weights. If the quantization error is small, quantizing the corresponding channel loses little information, so we should assign a high quantization probability to this channel. That is, the quantization probability should be inversely proportional to the quantization error.

We generally denote $Q_i$ as the quantized version of the full-precision weight vector $W_i$. For simplicity, we omit the bitwidth or scaling factor $\alpha$ in $Q_i$. That means $Q_i$ can be $B_i$, $\alpha B_i$ or even $\alpha T_i$ for different low-bit DNNs. We measure the quantization error in terms of the normalized $L_1$ distance between $W_i$ and $Q_i$:

$$e_i = \frac{\|W_i - Q_i\|_1}{\|W_i\|_1}. \qquad (6)$$

Then we can define the quantization probability given the quantization error $e = [e_1, e_2, \dots, e_m]$. The quantization probability $p_i$ should be inversely proportional to $e_i$. We define an intermediate variable $f_i = \frac{1}{e_i + \epsilon}$ to represent the inverse relationship, where $\epsilon$ is a small positive value that avoids possible overflow. The probability function should be a monotonically non-decreasing function of $f_i$. Here we consider four different choices:

  • A constant function to make equal probability over all channels: $p_i = 1/m$. This function ignores the quantization error and amounts to a totally random selection strategy.

  • A linear function defined as $p_i = f_i / \sum_j f_j$.

  • A softmax function defined as $p_i = \exp(f_i) / \sum_j \exp(f_j)$.

  • A sigmoid function defined as $p_i = 1 / (1 + \exp(-f_i))$.

We will empirically compare the performance of these choices in Section 4.2.
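The mapping from quantization error to quantization probability can be sketched as follows, assuming the error is the $L_1$-normalized distance between a row and its quantized version and that the intermediate variable is $f_i = 1/(e_i + \epsilon)$. The function names and the default `eps=1e-7` are assumptions for this example. The softmax subtracts the largest $f_i$ before exponentiating, a standard trick that leaves the result unchanged but avoids floating-point overflow when an error is close to zero.

```python
import math


def quantization_error(w_row, q_row):
    """Normalized L1 distance between a row and its quantized version.

    Assumes w_row is not all zeros.
    """
    num = sum(abs(w - q) for w, q in zip(w_row, q_row))
    den = sum(abs(w) for w in w_row)
    return num / den


def quantization_probs(errors, kind="linear", eps=1e-7):
    """Turn per-row errors into per-row quantization probabilities."""
    f = [1.0 / (e + eps) for e in errors]  # small error -> large f
    if kind == "constant":
        return [1.0 / len(f)] * len(f)
    if kind == "linear":
        total = sum(f)
        return [x / total for x in f]
    if kind == "softmax":
        top = max(f)                        # shift for numerical stability
        exps = [math.exp(x - top) for x in f]
        total = sum(exps)
        return [x / total for x in exps]
    if kind == "sigmoid":
        return [1.0 / (1.0 + math.exp(-x)) for x in f]
    raise ValueError("unknown probability function: " + kind)


errors = [quantization_error([2.0, -2.0], [1.0, -1.0]),   # 0.5
          quantization_error([1.0, -1.0], [0.9, -0.9])]   # about 0.1
linear_p = quantization_probs(errors, "linear")
```

Note that the sigmoid variant does not sum to one by itself; a roulette procedure that normalizes the probabilities before each draw takes care of that.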

3.3 Training with Stochastic Quantization

Applying the proposed stochastic quantization algorithm to train a low-bit DNN involves four steps: stochastic partition of weights, forward propagation, backward propagation and parameter update.

Algorithm 2 illustrates the procedure for training a low-bit DNN with stochastic quantization. First, all the rows of $W$ are quantized to obtain $Q$. Note that we do not specify the quantization algorithm or the bitwidth here, since our algorithm can be flexibly applied to all the cases. We then calculate the quantization error by Eq. (6) and the corresponding quantization probability. Given these parameters, we use Algorithm 1 to partition $W$ into $G_q$ and $G_r$. We then form the hybrid weight matrix $\tilde{Q}$, composed of real-valued weights and quantized weights, based on the partition result. If $W_i \in G_q$, we use its quantized version $Q_i$ in $\tilde{Q}$. Otherwise, we use $W_i$ directly. $\tilde{Q}$ approximates the real-valued weight matrix $W$ much better than $Q$, and thus provides a much more appropriate gradient direction. We update $W$ with the hybrid gradients $\frac{\partial \mathcal{L}}{\partial \tilde{Q}}$ in each iteration as

$$W^{t+1} = W^t - \eta^t \frac{\partial \mathcal{L}}{\partial \tilde{Q}^t}, \qquad (7)$$

where $\eta^t$ is the learning rate at the $t$-th iteration. That means the quantized part is updated with gradients derived from quantized weights, while the real-valued part is still updated with gradients from real-valued weights. Lastly, the learning rate and the SQ ratio are updated with a pre-defined rule. We will specify this in Section 4.1.

When the training stops, $r$ has increased to $100\%$ and all the weights in the network are quantized. There is no need to keep the real-valued weights any longer. We perform forward propagation with low-bitwidth weights during the testing phase.

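Input: mini-batch of inputs and targets $\{X, Y\}$, loss function $\mathcal{L}(Y, \hat{Y})$; weights $W^t$, learning rate $\eta^t$ and SQ ratio $r^t$ of the $t$-th iteration.
Output: updated parameters $W^{t+1}$, learning rate $\eta^{t+1}$ and SQ ratio $r^{t+1}$.
Quantize each row $W_i$ of $W^t$, and obtain the quantized matrix $Q^t$;
if $r^t < 100\%$ then
      Calculate the quantization error $e$ by Eq. (6);
      Calculate the quantization probability $p$ (Section 3.2);
      Partition $W^t$ into $G_q$ and $G_r$ by Algorithm 1 with $p$ and $r^t$;
      Form the hybrid weight matrix $\tilde{Q}^t$, where each row $\tilde{Q}_i = Q_i$ if $W_i \in G_q$; else $\tilde{Q}_i = W_i$;
else
      $\tilde{Q}^t = Q^t$;
end if
$\hat{Y}$ = Forward($X$, $\tilde{Q}^t$); // Forward to get the target estimation
$\frac{\partial \mathcal{L}}{\partial \tilde{Q}^t}$ = Backward($\mathcal{L}$, $\tilde{Q}^t$); // Backward to get the gradient of $\tilde{Q}^t$
Update $W^{t+1}$ according to Eq. (7).
$\eta^{t+1}, r^{t+1}$ = Update($\eta^t$, $r^t$, $t$);
Algorithm 2: Training algorithm based on SQ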
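One training step of this procedure can be sketched for a single layer as follows. This is a schematic under stated assumptions, not the authors' implementation: `quantize_row`, `select_rows` and `grad_fn` are hypothetical callables supplied by the caller (a row quantizer, a partition routine returning the indices of the rows to quantize, and a routine returning the loss gradient evaluated at the hybrid weights).

```python
def form_hybrid(w_rows, q_rows, quantized_idx):
    """Use the quantized row for selected indices, the real-valued row otherwise."""
    chosen = set(quantized_idx)
    return [q_rows[i] if i in chosen else w_rows[i] for i in range(len(w_rows))]


def sq_step(w_rows, quantize_row, select_rows, grad_fn, lr, sq_ratio):
    """One SQ update: gradients are taken at the hybrid weights, applied to W."""
    q_rows = [quantize_row(row) for row in w_rows]
    if sq_ratio < 1.0:
        hybrid = form_hybrid(w_rows, q_rows, select_rows(w_rows, q_rows, sq_ratio))
    else:
        hybrid = q_rows                      # final stage: everything quantized
    grads = grad_fn(hybrid)                  # dL/dQ~, same shape as w_rows
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(w_rows, grads)]


# Toy usage: sign quantizer, always quantize row 0, and the gradient of
# L = 0.5 * ||Q~||^2, which is simply Q~ itself.
sign_row = lambda row: [1.0 if w >= 0 else -1.0 for w in row]
new_w = sq_step([[0.5, -0.2], [0.3, 0.4]], sign_row,
                lambda w, q, r: [0], lambda h: h, lr=0.1, sq_ratio=0.5)
```

The real-valued weights are the only state carried between iterations; the hybrid matrix is rebuilt from them at every step.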

4 Experiments

We conduct extensive experiments on the CIFAR-10, CIFAR-100 and ImageNet large-scale classification datasets to validate the effectiveness of the proposed algorithm. We apply the SQ algorithm to two kinds of low-bit settings (i.e., BWN and TWN), and show its advantages over the existing low-bit DNN algorithms.

4.1 Datasets and Implementation Details

We briefly introduce the datasets used in our experiments, including the corresponding network structures and training settings.

CIFAR [Krizhevsky and Hinton(2009)] consists of a training set of 50,000 and a test set of 10,000 color images of resolution $32 \times 32$, with 10 classes in CIFAR-10 and 100 classes in CIFAR-100. We adopt two network architectures. The first one is derived from VGGNet [Simonyan and Zisserman(2015)] and denoted as VGG-9 [Courbariaux et al.(2015)Courbariaux, Bengio, and David]. The network is trained with an SGD solver with momentum, weight decay and a fixed batch size. We do not use any data augmentation in this network. The learning rate is divided by a constant factor at regular iteration intervals until training terminates. The second network is the ResNet-56 architecture with the same training settings as [He et al.(2016)He, Zhang, Ren, and Sun].

ImageNet [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, et al.] is a large-scale classification dataset. We use the ILSVRC 2012 dataset in the experiments, which consists of about 1.2 million training images of 1,000 classes. We evaluate the performance on the 50K validation images. We adopt a variant of the AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] architecture with batch normalization (AlexNet-BN) and the ResNet-18 [He et al.(2016)He, Zhang, Ren, and Sun] architecture. We train AlexNet-BN by SGD with momentum, weight decay and a fixed batch size. The learning rate is divided by a constant factor at three preset iteration counts before training terminates. We adopt batch normalization in AlexNet for faster convergence and better stability. ResNet-18 is trained with an SGD solver with the same momentum and weight decay values as AlexNet-BN. Its learning rate is divided by a constant factor when the error plateaus.

We implement our code based on the Caffe framework [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell], and run training and testing on an NVIDIA Titan X GPU. We make a full experimental comparison between full-precision weight networks (FWN), Binary Weighted Networks (BWN), and Ternary Weighted Networks (TWN). We denote BWN and TWN trained with the proposed SQ algorithm as SQ-BWN and SQ-TWN respectively. For CIFAR-10 and CIFAR-100, we compare the results of FWN, BWN, TWN, SQ-BWN and SQ-TWN on two network architectures, VGG-9 and ResNet-56. For ImageNet, we compare the same five settings with the AlexNet-BN and ResNet-18 network architectures.

The SQ ratio $r$ is one critical hyper-parameter in our algorithm, which should be gradually increased to $100\%$ so that all the network weights are quantized. In this work, we divide the training into several stages with a fixed $r$ at each stage. When the training has converged in one stage, we increase $r$ and move to the next stage, until $r = 100\%$ in the final stage. We conduct an ablation study on the scheme for updating the SQ ratio in Section 4.2. We make another simplification: $r$ is the same for all layers in a DNN within one training stage. When the SQ algorithm is applied to train low-bit DNNs, the number of training iterations is $s\times$ that of the original low-bit DNNs, where $s$ is the number of SQ stages, because we use the same number of iterations as other low-bit DNNs (i.e., BWN, TWN) in each stage (although we do not need so many iterations to converge).

4.2 Ablation Study

There are several factors in the proposed method which affect the final results, including selection granularity, partition algorithm, quantization probability function and the scheme for updating the SQ ratio. Therefore, we design a series of experiments for each factor and analyze their impact on our algorithm. All the ablation studies are based on the ResNet-56 network and the CIFAR-10 dataset. Because these factors interact, we vary one factor at a time while keeping the others at their defaults. The default settings for selection granularity, partition algorithm, quantization probability function and SQ ratio scheme are channel-wise selection, stochastic partition (roulette), linear function and four-stage training with the SQ ratio $r = 50\%$, $75\%$, $87.5\%$ and $100\%$.

| Method | Channel-wise | Element-wise |
|---|---|---|
| SQ-BWN | 7.15 | 7.67 |
| SQ-TWN | 6.20 | 6.53 |

Table 1: Test error (%) of different selection granularities on CIFAR-10 with ResNet-56.

| Method | Stochastic | Deterministic | Fixed |
|---|---|---|---|
| SQ-BWN | 7.15 | 8.21 | * |
| SQ-TWN | 6.20 | 6.85 | 6.50 |

Table 2: Test error (%) of different partition algorithms on CIFAR-10 with ResNet-56. * means not converged.

| Method | Exp | Ave | Tune |
|---|---|---|---|
| SQ-BWN | 7.15 | 7.35 | 7.18 |
| SQ-TWN | 6.20 | 6.88 | 6.62 |

Table 3: Test error (%) of different schemes for updating the SQ ratio on CIFAR-10 with ResNet-56.

| Method | Linear | Constant | Softmax | Sigmoid |
|---|---|---|---|---|
| SQ-BWN | 7.15 | 7.44 | 7.51 | 7.37 |
| SQ-TWN | 6.20 | 6.30 | 6.29 | 6.28 |

Table 4: Test error (%) of different probability functions given the quantization error on CIFAR-10 with ResNet-56.

Channel-wise v.s. Element-wise.

We first study the impact of quantization granularity on performance. We adopt two strategies, i.e., channel-wise and element-wise, with the results shown in Table 1. Element-wise selection ignores the filter structure and the interactions within each channel, which leads to lower accuracy. By treating each filter-channel as a whole, we can preserve the structures in channels and obtain higher accuracy.

Stochastic v.s. Deterministic v.s. Fixed.

The proposed algorithm uses stochastic partition based on the roulette algorithm to select $N_q$ rows to be quantized. We argue that the stochastic partition eliminates the requirement of finding the best partition after initialization, and can explore the search space for a better solution due to its exploitation-exploration nature. To verify this, we compare our purely stochastic scheme to two schemes that are not fully stochastic. The first one is the deterministic scheme using sorting-based partition. Given the SQ ratio $r$ and the quantization error $e$, we sort the rows of the weight matrix $W$ based on $e$ and select the $N_q$ rows with the least quantization error as group $G_q$ in each iteration. The second one is the fixed partition, in which we only do the roulette partition in the first iteration, and keep $G_q$ and $G_r$ fixed for all the iterations in each training stage.
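For comparison with the roulette procedure, the sorting-based deterministic baseline can be sketched as below. The function name `deterministic_partition`, the rounding of the group size and the use of row indices are assumptions of this example.

```python
def deterministic_partition(errors, sq_ratio):
    """Quantize the rows with the smallest quantization error.

    Returns (quantized, full_precision) as two sorted lists of row indices.
    """
    m = len(errors)
    n_q = int(round(sq_ratio * m))                      # assumption: round r * m
    order = sorted(range(m), key=lambda i: errors[i])   # ascending error
    return sorted(order[:n_q]), sorted(order[n_q:])


q_idx, fp_idx = deterministic_partition([0.3, 0.1, 0.5, 0.2], 0.5)
```

Unlike the roulette algorithm, this rule always returns the same partition for the same errors, so rows with large errors are never quantized until the SQ ratio forces them to be.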

Table 2 compares the results. It shows that the stochastic algorithm gets clearly better results for both BWN and TWN, which indicates that the stochastic partition we introduced is a key factor for success. Another interpretation is that stochastic quantization acts as a regularizer in training, which also benefits the performance.

Quantization Probability Function.

In Section 3.2, we introduce four choices of quantization probability function given the quantization error, which are the constant, linear, softmax and sigmoid functions. We compare the results of the different functions in Table 4. Among these choices, we adopt the linear function because of its better performance and simplicity. We find that the performance of the different functions is very close, which indicates that what matters most is the stochastic partition algorithm itself.

Scheme for updating the SQ ratio.

We also study how to design the scheme for updating the SQ ratio. Basically, we assume that training with stochastic quantization is divided into several stages, and each stage has a fixed SQ ratio. In all previous settings, we use four stages with SQ ratio $r = 50\%$, $75\%$, $87.5\%$ and $100\%$. We call this the exponential scheme (Exp) since in each stage, the non-quantized part is half the size of that in the previous stage. We compare our results with two additional schemes: the average scheme (Ave), and the fine-tuned scheme (Tune). The average scheme includes five stages with $r = 20\%$, $40\%$, $60\%$, $80\%$ and $100\%$. The fine-tuned scheme fine-tunes the pre-trained full-precision models with the same stages as the exponential scheme instead of training from scratch.
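The two stage schedules can be written down compactly. The sketch below assumes the exponential scheme halves the non-quantized fraction at every stage and ends at a fully quantized stage, and that the average scheme uses equally spaced ratios; the function names and the `iters_per_stage` parameter are hypothetical.

```python
def exp_schedule(stages=4):
    """Halve the full-precision fraction each stage; the last stage is 100%."""
    return [1.0 - 0.5 ** (k + 1) for k in range(stages - 1)] + [1.0]


def ave_schedule(stages=5):
    """Increase the SQ ratio in equal steps up to 100%."""
    return [(k + 1) / stages for k in range(stages)]


def sq_ratio_at(iteration, iters_per_stage, schedule):
    """SQ ratio in effect at a given iteration, with equal-length stages."""
    stage = min(iteration // iters_per_stage, len(schedule) - 1)
    return schedule[stage]


exp = exp_schedule()   # [0.5, 0.75, 0.875, 1.0]
ave = ave_schedule()   # [0.2, 0.4, 0.6, 0.8, 1.0]
```

With equal-length stages, a schedule with `len(schedule)` stages multiplies the total number of training iterations by that many compared with a single-stage low-bit baseline.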

Table 3 shows the results of the three compared schemes. The exponential scheme performs better than both the average scheme and the fine-tuned scheme. The results also indicate that our method works well when training from scratch.

| Method | Bits | CIFAR-10, VGG-9 | CIFAR-10, ResNet-56 | CIFAR-100, VGG-9 | CIFAR-100, ResNet-56 |
|---|---|---|---|---|---|
| FWN | 32 | 9.00 | 6.69 | 30.68 | 29.49 |
| BWN | 1 | 10.67 | 16.42 | 37.68 | 35.01 |
| SQ-BWN | 1 | 9.40 | 7.15 | 35.25 | 31.56 |
| TWN | 2 | 9.87 | 7.64 | 34.80 | 32.09 |
| SQ-TWN | 2 | 8.37 | 6.20 | 34.24 | 28.90 |

Table 5: Test error (%) of VGG-9 and ResNet-56 trained with 5 different methods on the CIFAR-10 and CIFAR-100 datasets. Our SQ algorithm consistently improves the results. SQ-TWN even outperforms the full-precision models in three of the four settings.

| Method | Bits | AlexNet-BN top-1 | AlexNet-BN top-5 | ResNet-18 top-1 | ResNet-18 top-5 |
|---|---|---|---|---|---|
| FWN | 32 | 44.18 | 20.83 | 34.80 | 13.60 |
| BWN | 1 | 51.22 | 27.18 | 45.20 | 21.08 |
| SQ-BWN | 1 | 48.78 | 24.86 | 41.64 | 18.35 |
| TWN | 2 | 47.54 | 23.81 | 39.83 | 17.02 |
| SQ-TWN | 2 | 44.70 | 21.40 | 36.18 | 14.26 |

Table 6: Test error (%) of AlexNet-BN and ResNet-18 trained with 5 different methods on the ImageNet dataset.

4.3 Benchmark Results

We make a full benchmark comparison on the CIFAR-10, CIFAR-100 and ImageNet datasets between the proposed SQ algorithm and traditional algorithms on BWN and TWN. For a fair comparison, we keep the learning rate, optimization method, batch size, etc. identical in all these experiments. We adopt the best setting (i.e., channel-wise selection, stochastic partition, linear probability function, exponential scheme for the SQ ratio) for all SQ-based experiments.

Table 5 presents the results on the CIFAR-10 and CIFAR-100 datasets. In these two cases, SQ-BWN and SQ-TWN outperform BWN and TWN significantly, especially on ResNet-56, which is deeper and whose gradients can be easily misled by quantized weights with large quantization errors. For example, SQ-BWN improves the accuracy by $9.27\%$ over BWN on the CIFAR-10 dataset with the ResNet-56 network. The 2-bit SQ-TWN models can even obtain higher accuracy than the full-precision models. Our results show that SQ-TWN improves the accuracy by $0.63\%$, $0.49\%$ and $0.59\%$ over the full-precision models on CIFAR-10 with VGG-9, CIFAR-10 with ResNet-56 and CIFAR-100 with ResNet-56, respectively.
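The accuracy gains discussed above follow directly from the test errors in Table 5. The short script below recomputes them; the dictionary layout and the helper name `gain` are choices made for this example, while the numbers are copied from Table 5.

```python
# Test error (%) from Table 5, keyed by method and then by dataset/network.
table5 = {
    "FWN":    {"c10-vgg9": 9.00,  "c10-res56": 6.69,  "c100-vgg9": 30.68, "c100-res56": 29.49},
    "BWN":    {"c10-vgg9": 10.67, "c10-res56": 16.42, "c100-vgg9": 37.68, "c100-res56": 35.01},
    "SQ-BWN": {"c10-vgg9": 9.40,  "c10-res56": 7.15,  "c100-vgg9": 35.25, "c100-res56": 31.56},
    "TWN":    {"c10-vgg9": 9.87,  "c10-res56": 7.64,  "c100-vgg9": 34.80, "c100-res56": 32.09},
    "SQ-TWN": {"c10-vgg9": 8.37,  "c10-res56": 6.20,  "c100-vgg9": 34.24, "c100-res56": 28.90},
}


def gain(baseline, method, setting):
    """Error reduction (percentage points) of `method` relative to `baseline`."""
    return round(table5[baseline][setting] - table5[method][setting], 2)


bwn_gain = gain("BWN", "SQ-BWN", "c10-res56")
twn_vs_fwn = [gain("FWN", "SQ-TWN", s) for s in ("c10-vgg9", "c10-res56", "c100-res56")]
```

A negative value would mean the method is worse than the baseline, which is the case for SQ-TWN against FWN on CIFAR-100 with VGG-9.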

Figure 2: Test loss of FWN, BWN and SQ-BWN on CIFAR-10 with ResNet-56.

Figure 3: Test loss of FWN, TWN and SQ-TWN on CIFAR-10 with ResNet-56.

In Figure 2 and Figure 3, we show the curves of test loss on the CIFAR-10 dataset with the ResNet-56 network. In Figure 2, BWN does not converge well and has a much larger loss than SQ-BWN, while the losses of SQ-BWN and FWN are relatively low. In Figure 3, we can see that at the beginning of each stage, quantizing more weights leads to a large loss, which drops again after some training iterations. Finally, SQ-TWN gets a smaller loss than FWN, while the loss of TWN is larger than that of FWN.

Table 6 shows the results on the standard ImageNet 50K validation set. We compare SQ-BWN and SQ-TWN to FWN, BWN and TWN with the AlexNet-BN and ResNet-18 architectures, and all the errors are reported with single center-crop testing only. It shows that our algorithm improves the performance considerably and consistently beats the baseline methods by a large margin. We can also see that SQ-TWN yields results approaching those of FWN. Note that some works [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi, Venkatesh et al.(2016)Venkatesh, Nurvitadhi, and Marr] keep the first or last layer full-precision to alleviate possible accuracy loss. Although we do not do so, SQ-TWN still achieves near full-precision accuracy. Overall, we conclude that our SQ algorithm can consistently and significantly improve the performance.

5 Conclusion

In this paper, we propose a Stochastic Quantization (SQ) algorithm to learn accurate low-bit DNNs. We propose a roulette-based partition algorithm to select a portion of the weights to quantize at each iteration, while keeping the other portion unchanged in full precision. The hybrid weights provide much more appropriate gradients and lead to a better local minimum. We empirically demonstrate the effectiveness of our algorithm by extensive experiments on various low-bitwidth settings, network architectures and benchmark datasets. We also make our code public at https://github.com/dongyp13/Stochastic-Quantization. Future work may consider the stochastic quantization of both weights and activations.


This work was done when Yinpeng Dong and Renkun Ni were interns at Intel Labs supervised by Jianguo Li. Yinpeng Dong, Jun Zhu and Hang Su are also supported by the National Basic Research Program of China (2013CB329403), the National Natural Science Foundation of China (61620106010, 61621136008).


  1. We can also define the non-quantization probability indicating the probability each row should not be quantized.


  1. Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
  2. Matthieu Courbariaux and Yoshua Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
  3. Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
  4. Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In NIPS, 2013.
  5. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  6. Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
  7. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  8. Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
  9. Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
  10. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
  11. Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
  12. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  13. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  14. Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. In NIPS Workshop on EMDNN, 2016.
  15. Darryl D Lin, Sachin S Talathi, and V Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In ICML, 2016.
  16. Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  17. Daisuke Miyashita, Edward H Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.
  18. Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
  19. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  20. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  21. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  22. Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15, 2014.
  23. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  24. Ganesh Venkatesh, Eriko Nurvitadhi, and Debbie Marr. Accelerating deep convolutional networks using low-precision and sparsity. arXiv preprint arXiv:1610.00324, 2016.
  25. Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. In ICLR, 2017.
  26. Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
  27. Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. In ICLR, 2017.