Learning Accurate Low-Bit Deep Neural Networks with Stochastic Quantization
Abstract
Low-bit deep neural networks (DNNs) have become critical for embedded applications due to their low storage requirements and computing efficiency. However, they suffer from a non-negligible accuracy drop. This paper proposes the stochastic quantization (SQ) algorithm for learning accurate low-bit DNNs. The motivation is the following observation. Existing training algorithms approximate all the real-valued elements/filters with low-bit representations at once in each iteration. The quantization errors may be small for some elements/filters but large for others, which leads to an inappropriate gradient direction during training and thus a notable accuracy drop. Instead, SQ quantizes a portion of the elements/filters to low-bit with a stochastic probability inversely proportional to the quantization error, while keeping the other portion unchanged at full precision. The quantized and full-precision portions are updated with their corresponding gradients separately in each iteration. The SQ ratio is gradually increased until the whole network is quantized. This procedure can greatly compensate for the quantization error and thus yield better accuracy for low-bit DNNs. Experiments show that SQ can consistently and significantly improve the accuracy of different low-bit DNNs on various datasets and various network structures.
Yinpeng Dong (1), dyp17@mails.tsinghua.edu.cn
Renkun Ni (2), rn9zm@virginia.edu
Jianguo Li (3), jianguo.li@intel.com
Yurong Chen (3), yurong.chen@intel.com
Jun Zhu (1), dcszj@mail.tsinghua.edu.cn
Hang Su (1), suhangss@mail.tsinghua.edu.cn
(1) Department of Computer Science and Technology, Tsinghua University, Beijing, China
(2) University of Virginia
(3) Intel Labs China, Beijing, China
1 Introduction
Deep Neural Networks (DNNs) have demonstrated significant performance improvements in a wide range of computer vision tasks [LeCun et al., 2015], such as image classification [Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; Szegedy et al., 2015; He et al., 2016], object detection [Girshick et al., 2014; Ren et al., 2015] and semantic segmentation [Long et al., 2015; Chen et al., 2015]. DNNs often stack tens or even hundreds of layers with millions of parameters to achieve this performance. Therefore, DNN-based systems usually need considerable storage and computation power. This hinders the deployment of DNNs in resource-limited scenarios, especially low-power embedded devices in the emerging Internet-of-Things (IoT) domain.
Many works have been proposed to reduce model parameter size or even computation complexity by exploiting the high redundancy in DNNs [Denil et al., 2013; Han et al., 2015]. Among them, low-bit deep neural networks [Courbariaux et al., 2015; Rastegari et al., 2016; Li et al., 2016; Zhou et al., 2016; Zhu et al., 2017; Courbariaux and Bengio, 2016; Hubara et al., 2016], which aim at training DNNs with low-bit-width weights or even activations, have attracted much attention due to their promising model size and computing efficiency. In particular, in BinaryConnect (BNN) [Courbariaux et al., 2015], the weights are binarized to $+1$ and $-1$, and multiplications are replaced by additions and subtractions to speed up the computation. In [Rastegari et al., 2016], the authors proposed binary weighted networks (BWN), in which weight values are binarized and one scaling factor is added for each filter, and extended them to XNOR-Net with both weights and activations binarized. Moreover, Ternary Weight Networks (TWN) [Li et al., 2016] incorporate an extra $0$ state, which converts weights into ternary values $\{+1, 0, -1\}$ with a 2-bit width. However, low-bit DNNs are challenged by a non-negligible accuracy drop, especially for large-scale models (e.g., ResNet [He et al., 2016]). We argue that the reason is that they quantize all the weights of a DNN to low-bit at once at each training iteration. The quantization error is not consistently small for all elements/filters. It may be very large for some elements/filters, which may lead to an inappropriate gradient direction during training and thus make the model converge to a relatively worse local minimum.
Besides training-based solutions for low-bit DNNs, there are also post-quantization techniques, which focus on directly quantizing pre-trained full-precision models [Lin et al., 2016; Miyashita et al., 2016; Zhou et al., 2017]. These methods have at least two limitations. First, they work mainly for DNNs in classification tasks, and lack the flexibility to be employed for other tasks such as detection and segmentation. Second, low-bit quantization can be regarded as a constraint or regularizer [Courbariaux et al., 2015] in training-based solutions, steering the model towards a local minimum in the low-bit-width weight space. However, it is relatively difficult for post-quantization techniques (even with fine-tuning) to transfer a local minimum from the full-precision weight space to a low-bit-width weight space losslessly.
In this paper, we try to overcome the aforementioned issues by proposing the Stochastic Quantization (SQ) algorithm to learn accurate low-bit DNNs. Inspired by stochastic depth [Huang et al., 2016] and dropout [Srivastava et al., 2014], the SQ algorithm stochastically selects a portion of the weights in a DNN and quantizes them to low-bit in each iteration, while keeping the other portion unchanged at full precision. The selection probability is inversely proportional to the quantization error. We gradually increase the SQ ratio until the whole network is quantized.
We make a comprehensive study of different aspects of the SQ algorithm. First, we study the impact of selection granularity, and show that treating each filter channel as a whole for stochastic selection and quantization performs better than treating each weight element separately. The reason is that the weight elements in one filter channel interact with each other to represent a filter structure, so treating each element separately may introduce large distortion when some elements in a filter channel are low-bit while the others remain full-precision. Second, we compare the proposed roulette algorithm to some deterministic selection algorithms. The roulette algorithm selects elements/filters to quantize with a probability inversely proportional to the quantization error. If the quantization error is large, we will probably not quantize the corresponding elements/filters, because they may introduce inaccurate gradient information. The roulette algorithm is shown to be better than deterministic selection algorithms, since it eliminates the requirement of finding the best initial partition and is able to explore the search space for a better solution thanks to the exploitation-exploration nature of stochastic algorithms. Third, we compare different functions for calculating the quantization probability from the quantization errors. Fourth, we design an appropriate scheme for updating the SQ ratio.
The proposed SQ algorithm is generally applicable to any low-bit DNN, including BWN, TWN, etc. Experiments show that SQ can consistently improve the accuracy of different low-bit DNNs (binary or ternary) on various network structures (VGGNet, ResNet, etc.) and various datasets (CIFAR, ImageNet, etc.). For instance, TWN trained with SQ can even beat the full-precision models in several test cases. Our main contributions are:

We propose the stochastic quantization (SQ) algorithm to overcome the accuracy drop in existing training algorithms for low-bit DNNs.

We comprehensively study different aspects of the SQ algorithm, such as the selection granularity, the partition algorithm, the definition of the quantization probability function, and the scheme for updating the SQ ratio, which may provide valuable insights for researchers designing other stochastic algorithms in deep learning.

We present strong performance improvements with the proposed algorithm on various low-bit-width settings, network architectures, and benchmark datasets.
2 Low-bit DNNs
In this section, we briefly introduce some typical low-bit DNNs as prerequisites, including BNN [Courbariaux et al., 2015], BWN [Rastegari et al., 2016] and TWN [Li et al., 2016]. Formally, we denote the weights of the $l$-th layer in a DNN by $W_l = \{W_1, \dots, W_i, \dots, W_m\}$, where $m$ is the number of output channels, and $W_i \in \mathbb{R}^d$ is the weight vector of the $i$-th filter channel, in which $d = n \times w \times h$ in conv-layers and $d = n$ in FC-layers ($n$, $w$ and $h$ represent input channels, kernel width, and kernel height respectively). $W_l$ can also be viewed as a weight matrix with $m$ rows, where each row $W_i$ is a $d$-dimensional vector. For simplicity, we omit the subscript $l$ in the following.
BinaryConnect uses a simple stochastic method to convert each 32-bit weight vector $W_i$ into binary values $B_i$ with the following equation
$$B_i^j = \begin{cases} +1 & \text{with probability } p = \sigma(W_i^j), \\ -1 & \text{with probability } 1 - p, \end{cases} \tag{1}$$
where $\sigma$ is the hard sigmoid function, and $j$ is the index of elements in $W_i$. During the training phase, it keeps both full-precision weights and binarized weights. Binarized weights are used for gradient and forward loss computation, while full-precision weights are used to accumulate gradient updates. During the testing phase, it only needs the binarized weights, so that it reduces the model size by $32\times$.
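As a minimal sketch of the stochastic binarization in Eq. (1), the snippet below assumes the usual hard sigmoid, i.e. $(x + 1)/2$ clipped to $[0, 1]$. The function names `hard_sigmoid` and `stochastic_binarize`, the toy weights and the fixed seed are illustrative choices, not taken from the paper.

```python
import random

def hard_sigmoid(x):
    """Hard sigmoid: (x + 1) / 2 clipped to the interval [0, 1]."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))

def stochastic_binarize(weights, rng):
    """Draw +1 with probability hard_sigmoid(w) and -1 otherwise."""
    return [1 if rng.random() < hard_sigmoid(w) else -1 for w in weights]

rng = random.Random(0)            # fixed seed, only for reproducibility
w = [-2.0, -0.5, 0.0, 0.5, 2.0]   # toy full-precision weights
b = stochastic_binarize(w, rng)
print(b)
```

Weights at or below $-1$ always map to $-1$ and weights at or above $+1$ always map to $+1$; only the values in between are randomized.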
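BWN is an extension of BinaryConnect, but introduces a real-valued scaling factor $\alpha$ along with $B_i$ to approximate the full-precision weight vector $W_i \approx \alpha B_i$ by solving an optimization problem and obtaining:
$$B_i = \mathrm{sign}(W_i), \qquad \alpha = \frac{1}{d} \sum_{j=1}^{d} |W_i^j|. \tag{2}$$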
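The BWN approximation can be sketched in a few lines, assuming the well-known closed-form solution (binary codes given by the signs, scaling factor given by the mean absolute value). The helper name `bwn_quantize` is hypothetical, and mapping an exact zero to $+1$ is an arbitrary tie-breaking assumption.

```python
def bwn_quantize(w):
    """Return (alpha, signs) so that alpha * signs approximates w."""
    alpha = sum(abs(x) for x in w) / len(w)       # mean absolute value
    signs = [1 if x >= 0 else -1 for x in w]      # assumption: sign(0) = +1
    return alpha, signs

alpha, signs = bwn_quantize([0.5, -1.5, 1.0, -1.0])
approx = [alpha * s for s in signs]
print(alpha, approx)
```

For this toy vector the scaling factor is exactly 1, so the approximation keeps the signs and flattens all magnitudes to 1.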
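TWN introduces an extra 0 state over BWN and approximates the real-valued weight vector $W_i$ more accurately with a ternary value vector $T_i$ along with a scaling factor $\alpha$, while still keeping high model-size compression ($16\times$). It solves the optimization problem with an approximate solution:
$$T_i^j = \begin{cases} +1 & \text{if } W_i^j > \Delta, \\ 0 & \text{if } |W_i^j| \le \Delta, \\ -1 & \text{if } W_i^j < -\Delta, \end{cases} \qquad \alpha = \frac{1}{|I_\Delta|} \sum_{j \in I_\Delta} |W_i^j|, \tag{3}$$
where $\Delta$ is a positive threshold with the following value
$$\Delta = \frac{0.7}{d} \sum_{j=1}^{d} |W_i^j|, \tag{4}$$
$I_\Delta = \{ j : |W_i^j| > \Delta \}$, and $|I_\Delta|$ denotes the cardinality of the set $I_\Delta$.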
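A small sketch of the ternary approximation follows. It assumes the threshold rule from the TWN paper (0.7 times the mean absolute weight) and a scaling factor averaged over the weights whose magnitude exceeds the threshold. The name `twn_quantize` and the parameter `delta_factor` are illustrative.

```python
def twn_quantize(w, delta_factor=0.7):
    """Return (alpha, ternary, delta) for one weight vector."""
    delta = delta_factor * sum(abs(x) for x in w) / len(w)
    ternary = [1 if x > delta else -1 if x < -delta else 0 for x in w]
    kept = [abs(x) for x in w if abs(x) > delta]   # weights above the threshold
    alpha = sum(kept) / len(kept) if kept else 0.0
    return alpha, ternary, delta

alpha, t, delta = twn_quantize([1.0, -1.0, 0.1, -0.1])
print(alpha, t, delta)
```

In this toy vector the two small weights fall below the threshold and become 0, while the scaling factor is computed only from the two large ones.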
3 Method
In this section, we elaborate on the SQ algorithm. As the low-bit quantization is done layer by layer in existing low-bit DNNs [Courbariaux et al., 2015; Rastegari et al., 2016; Li et al., 2016], we follow the same paradigm. For simplicity, here we introduce the algorithm for one layer only. We propose to quantize a portion of the weights to low-bit in each layer during forward and backward propagation, to minimize the information loss caused by full quantization. Figure 1 illustrates the stochastic quantization algorithm. We first calculate the quantization error, and then derive a quantization probability for each element/filter (see Section 3.2 for details). Given an SQ ratio $r$, we then stochastically select a portion of elements/filters to quantize by a roulette algorithm introduced in Section 3.1. We finally describe the training procedure in Section 3.3.
3.1 Stochastic Partition
In each iteration, given the weight matrix $W = \{W_1, \dots, W_m\}$ of each layer, we want to partition the rows of $W$ into two disjoint groups $G_q$ and $G_r$, which should satisfy
$$G_q \cup G_r = W, \qquad G_q \cap G_r = \emptyset, \tag{5}$$
where $W_i$ is the filter for the $i$-th output channel, i.e. the $i$-th row of the weight matrix $W$.
We quantize the rows of $W$ in group $G_q$ to low-bit-width values while keeping the others in group $G_r$ at full precision.
$G_q$ contains $N_q$ items, which is restricted by the SQ ratio $r$ ($N_q = r \cdot m$). Meanwhile, $N_r = (1 - r) \cdot m$.
The SQ ratio $r$ increases gradually to $100\%$, so that the whole weight matrix $W$ and the whole network are quantized at the end of training.
We introduce a quantization probability $p \in \mathbb{R}^m$ over the rows of $W$, where the $i$-th element $p_i$ indicates the probability of the $i$-th row being quantized. The quantization probability is defined over the quantization error, which will be described in Section 3.2.
Given the SQ ratio $r$ and the quantization probability $p$ for each layer, we propose a roulette algorithm that samples without replacement to select $N_q$ rows to be quantized for group $G_q$, while the remaining $N_r$ rows are kept at full precision. Algorithm 1 illustrates the roulette selection procedure.
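Since Algorithm 1 is not reproduced in this text, the following is only a sketch of a roulette-wheel selection without replacement, written from the description above. The function name `roulette_partition`, the rounding of the group size and the round-off fallback are assumptions.

```python
import random

def roulette_partition(probs, ratio, rng):
    """Split row indices into (quantized, full_precision) index lists.

    round(ratio * m) rows are drawn one at a time; each draw picks a
    remaining row with chance proportional to its quantization probability.
    """
    m = len(probs)
    n_q = int(round(ratio * m))           # assumption: round to nearest integer
    remaining = list(range(m))
    quantized = []
    for _ in range(n_q):
        total = sum(probs[i] for i in remaining)
        v = rng.random() * total          # spin the wheel over remaining rows
        acc = 0.0
        chosen = remaining[-1]            # fallback guards against round-off
        for i in remaining:
            acc += probs[i]
            if v < acc:
                chosen = i
                break
        remaining.remove(chosen)
        quantized.append(chosen)
    return sorted(quantized), remaining

rng = random.Random(0)
probs = [0.3, 0.05, 0.2, 0.05, 0.1, 0.1, 0.15, 0.05]
g_q, g_r = roulette_partition(probs, 0.5, rng)
print(g_q, g_r)
```

Renormalizing over the remaining rows at every draw is what makes this sampling without replacement: a row that has already been moved to the quantized group can no longer be hit.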
3.2 From Quantization Error to Quantization Probability
Recall that the motivation of this work is to alleviate the accuracy drop caused by the inappropriate gradient direction that results from large quantization errors. The quantization probability over channels should therefore be based on the quantization error between the quantized and real-valued weights. If the quantization error is small, quantizing the corresponding channel loses little information, so we should assign a high quantization probability to this channel. That is, the quantization probability should be inversely proportional to the quantization error.
We generally denote by $Q_i$ the quantized version of the full-precision weight vector $W_i$. For simplicity, we omit the bit width and scaling factor in $Q_i$. That means $Q_i$ can stand for a binary or even a ternary approximation (with its scaling factor, if any) for different low-bit DNNs. We measure the quantization error in terms of the normalized $L_1$ distance between $W_i$ and $Q_i$:
$$e_i = \frac{\| W_i - Q_i \|_1}{\| W_i \|_1}. \tag{6}$$
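A worked example of the quantization error for one row, assuming the normalized distance in Eq. (6) is taken in the $L_1$ norm and that the row is not all zeros. The name `quantization_error` and the toy vectors are illustrative.

```python
def quantization_error(w, q):
    """Normalized L1 distance between a row w and its quantized version q."""
    num = sum(abs(a - b) for a, b in zip(w, q))
    den = sum(abs(a) for a in w)          # assumption: w is not the zero vector
    return num / den

w = [0.5, -1.5, 1.0, -1.0]
q = [1.0, -1.0, 1.0, -1.0]                # binary approximation with scale 1
e = quantization_error(w, q)
print(e)
```

Here the first two entries are each off by 0.5, so the error is 1.0 / 4.0 = 0.25.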
Then we can define the quantization probability $p_i$ given the quantization error $e_i$. The quantization probability should be inversely proportional to $e_i$. We define an intermediate variable $f_i = \frac{1}{e_i + \epsilon}$ to represent the inverse relationship, where $\epsilon$ is a small positive constant that avoids possible overflow. The probability function should be a monotonically non-decreasing function of $f_i$. Here we consider four different choices:

A constant function that gives equal probability to all channels: $p_i = 1/m$. This function ignores the quantization error and yields a totally random selection strategy.

A linear function defined as $p_i = f_i / \sum_j f_j$.

A softmax function defined as $p_i = \exp(f_i) / \sum_j \exp(f_j)$.

A sigmoid function defined as $p_i = 1 / (1 + \exp(-f_i))$.
We will empirically compare the performance of these choices in Section 4.2.
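The four candidate probability functions can be sketched as follows. The intermediate variable is taken as the reciprocal of the error plus a small constant; the default `eps=1e-7`, the function name `quantization_probabilities` and the max-subtraction in the softmax (a standard numerical-stability trick that does not change the result) are assumptions of this sketch.

```python
import math

def quantization_probabilities(errors, kind="linear", eps=1e-7):
    """Map per-row quantization errors to quantization probabilities."""
    m = len(errors)
    f = [1.0 / (e + eps) for e in errors]       # inverse of the error
    if kind == "constant":
        return [1.0 / m] * m
    if kind == "linear":
        total = sum(f)
        return [x / total for x in f]
    if kind == "softmax":
        top = max(f)                            # subtract max for stability
        ex = [math.exp(x - top) for x in f]
        total = sum(ex)
        return [x / total for x in ex]
    if kind == "sigmoid":
        return [1.0 / (1.0 + math.exp(-x)) for x in f]
    raise ValueError("unknown probability function: " + kind)

errors = [0.1, 0.2, 0.4]
for kind in ("constant", "linear", "softmax", "sigmoid"):
    print(kind, quantization_probabilities(errors, kind))
```

Note that the sigmoid variant does not sum to one across rows; a roulette selection that renormalizes over the remaining rows at each draw only needs relative weights, so this is not a problem in practice.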
3.3 Training with Stochastic Quantization
Training a low-bit DNN with the proposed stochastic quantization algorithm involves four steps: stochastic partition of weights, forward propagation, backward propagation and parameter update.
Algorithm 2 illustrates the procedure for training a low-bit DNN with stochastic quantization. First, all the rows of $W$ are quantized to obtain $Q$. Note that we do not specify the quantization algorithm and the bit width here, since our algorithm can be flexibly applied to all cases. We then calculate the quantization error by Eq. (6) and the corresponding quantization probability. Given these parameters, we use Algorithm 1 to partition $W$ into $G_q$ and $G_r$. We then form the hybrid weight matrix $\tilde{Q}$, composed of real-valued weights and quantized weights, based on the partition result. If $W_i \in G_q$, we use its quantized version $Q_i$ in $\tilde{Q}$. Otherwise, we use $W_i$ directly. $\tilde{Q}$ approximates the real-valued weight matrix $W$ much better than $Q$, and thus provides a much more appropriate gradient direction. We update $W$ with the hybrid gradients $\partial L / \partial \tilde{Q}$ in each iteration as
$$W^{t+1} = W^{t} - \eta^{t} \frac{\partial L}{\partial \tilde{Q}^{t}}, \tag{7}$$
where $\eta^{t}$ is the learning rate at the $t$-th iteration. That means the quantized part is updated with gradients derived from quantized weights, while the real-valued part is still updated with gradients from real-valued weights. Lastly, the learning rate and the SQ ratio are updated with a predefined rule, which we specify in Section 4.1.
When the training stops, $r$ has increased to $100\%$ and all the weights in the network are quantized. There is no need to keep the real-valued weights any longer. We perform forward propagation with low-bit-width weights during the testing phase.
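Algorithm 2 is not reproduced in this text, so the following toy sketch only illustrates the four steps described above on a single linear layer. Everything specific here is an assumption made for illustration: the BWN-style quantizer, the linear probability function, the squared-error loss on one sample, the staged ratios, and the names `bwn_row`, `row_error`, `roulette` and `sq_step`.

```python
import random

def bwn_row(w):
    """BWN-style quantization of one row: sign times mean absolute value."""
    alpha = sum(abs(v) for v in w) / len(w)
    return [alpha if v >= 0 else -alpha for v in w]

def row_error(w, q):
    """Normalized L1 quantization error (rows assumed non-zero)."""
    return sum(abs(a - b) for a, b in zip(w, q)) / sum(abs(a) for a in w)

def roulette(probs, n_q, rng):
    """Pick n_q row indices without replacement, proportionally to probs."""
    remaining, picked = list(range(len(probs))), []
    for _ in range(n_q):
        v = rng.random() * sum(probs[i] for i in remaining)
        acc, chosen = 0.0, remaining[-1]
        for i in remaining:
            acc += probs[i]
            if v < acc:
                chosen = i
                break
        remaining.remove(chosen)
        picked.append(chosen)
    return set(picked)

def sq_step(W, x, y, ratio, lr, rng, eps=1e-7):
    """One SQ iteration: partition, forward, backward, update."""
    Q = [bwn_row(w) for w in W]
    f = [1.0 / (row_error(w, q) + eps) for w, q in zip(W, Q)]
    total = sum(f)
    probs = [v / total for v in f]                      # linear probability
    g_q = roulette(probs, int(round(ratio * len(W))), rng)
    hybrid = [Q[i] if i in g_q else W[i] for i in range(len(W))]
    loss, new_W = 0.0, []
    for w, h, target in zip(W, hybrid, y):
        diff = sum(a * b for a, b in zip(h, x)) - target  # forward with hybrid row
        loss += 0.5 * diff * diff
        # the gradient w.r.t. the hybrid row (diff * x) updates the real-valued row
        new_W.append([a - lr * diff * b for a, b in zip(w, x)])
    return new_W, loss, g_q

rng = random.Random(0)
W = [[0.3, -0.2, 0.5], [-0.7, 0.1, 0.4], [0.2, 0.6, -0.9], [0.8, -0.3, 0.1]]
x, y = [1.0, 2.0, -1.0], [0.5, -0.5, 1.0, 0.0]
for ratio in (0.5, 0.75, 0.875, 1.0):                   # staged SQ ratio
    for _ in range(20):
        W, loss, g_q = sq_step(W, x, y, ratio, lr=0.05, rng=rng)
deployed = [bwn_row(w) for w in W]                      # weights kept for testing
print(loss, sorted(g_q))
```

The key design point mirrored from the text is that the forward pass uses the hybrid rows, while the update is always applied to the real-valued rows; only in the last stage are all rows quantized.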
4 Experiments
We conduct extensive experiments on the CIFAR-10, CIFAR-100 and ImageNet large-scale classification datasets to validate the effectiveness of the proposed algorithm. We apply the SQ algorithm to two kinds of low-bit settings (namely BWN and TWN), and show its advantages over existing low-bit DNN algorithms.
4.1 Datasets and Implementation Details
We briefly introduce the datasets used in our experiments, including the corresponding network structures and training settings.
CIFAR [Krizhevsky and Hinton, 2009] consists of a training set of 50,000 and a test set of 10,000 color images of resolution $32 \times 32$, with 10 classes in CIFAR-10 and 100 classes in CIFAR-100. We adopt two network architectures. The first one is derived from VGGNet [Simonyan and Zisserman, 2015] and denoted as VGG-9 [Courbariaux et al., 2015], by the architecture of “”. The network is trained with a SGD solver with momentum , weight decay and batch size . We do not use any data augmentation in this network. We start with a learning rate , divide it by after each k iterations, and terminate the training at k iterations. The second network is the ResNet-56 architecture with the same training settings as [He et al., 2016].
ImageNet [Russakovsky et al., 2015] is a large-scale classification dataset. We use the ILSVRC 2012 dataset in the experiments, which consists of about 1.2 million training images of 1000 classes. We evaluate the performance on the 50K validation images. We adopt a variant of the AlexNet [Krizhevsky et al., 2012] architecture with batch normalization (AlexNet-BN) and the ResNet-18 [He et al., 2016] architecture. We train the AlexNet-BN by SGD with momentum , weight decay and batch size . The initial learning rate is , which is divided by each after k, k and k iterations. The training is terminated after k iterations. We adopt batch normalization in AlexNet for faster convergence and better stability. ResNet-18 is trained with a SGD solver with batch size and the same momentum and weight decay values as AlexNet-BN. The learning rate starts from and is divided by when the error plateaus. We train it for k iterations.
We implement our code based on the Caffe framework [Jia et al., 2014], and run training and testing on the NVIDIA Titan X GPU. We make a full experimental comparison between full-precision weight networks (FWN), Binary Weighted Networks (BWN), and Ternary Weighted Networks (TWN). We denote BWN and TWN trained with the proposed SQ algorithm as SQ-BWN and SQ-TWN respectively. For CIFAR-10 and CIFAR-100, we compare the results of FWN, BWN, TWN, SQ-BWN and SQ-TWN on two network architectures, VGG-9 and ResNet-56. For ImageNet, we also compare these five settings with the AlexNet-BN and ResNet-18 network architectures.
The SQ ratio $r$ is one critical hyper-parameter in our algorithm, which should be gradually increased to $100\%$ so that all the network weights are quantized. In this work, we divide the training into several stages with a fixed $r$ at each stage. When the training has converged in one stage, we increase $r$ and move to the next stage, until $r = 100\%$ in the final stage. We conduct an ablation study on the scheme for updating the SQ ratio in Section 4.2. We make another simplification: $r$ is the same for all layers in a DNN within one training stage. When the SQ algorithm is applied to train low-bit DNNs, the number of training iterations is $s\times$ that of the original low-bit DNNs, where $s$ is the number of SQ stages, since we use the same number of iterations as other low-bit DNNs (i.e., BWN, TWN) in each stage (although we do not need so many iterations to converge).
4.2 Ablation Study
There are several factors in the proposed method that affect the final results, including the selection granularity, the partition algorithm, the quantization probability function and the scheme for updating the SQ ratio. Therefore, we design a series of experiments for each factor and analyze its impact on our algorithm. All the ablation studies are based on the ResNet-56 network and the CIFAR-10 dataset. Since these factors interact, we vary one factor at a time while keeping the others at their defaults. The default settings for selection granularity, partition algorithm, quantization probability function and SQ ratio scheme are channel-wise selection, stochastic partition (roulette), linear function and four-stage training with the SQ ratios $50\%$, $75\%$, $87.5\%$ and $100\%$.
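Table 1: Test error (%) of channel-wise and element-wise selection (ResNet-56, CIFAR-10).
          Channel-wise  Element-wise
SQ-BWN    7.15          7.67
SQ-TWN    6.20          6.53

Table 2: Test error (%) of stochastic, deterministic and fixed partition (ResNet-56, CIFAR-10).
          Stochastic  Deterministic  Fixed
SQ-BWN    7.15        8.21           *
SQ-TWN    6.20        6.85           6.50

Table 3: Test error (%) of different schemes for updating the SQ ratio (ResNet-56, CIFAR-10).
          Exp   Ave   Tune
SQ-BWN    7.15  7.35  7.18
SQ-TWN    6.20  6.88  6.62

Table 4: Test error (%) of different quantization probability functions (ResNet-56, CIFAR-10).
          Linear  Constant  Softmax  Sigmoid
SQ-BWN    7.15    7.44      7.51     7.37
SQ-TWN    6.20    6.30      6.29     6.28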
Channel-wise vs. Element-wise.
We first study the impact of quantization granularity on performance. We adopt two strategies, i.e., channel-wise and element-wise selection, with the results shown in Table 1. Element-wise selection ignores the filter structure and the interactions within each channel, which leads to lower accuracy. By treating each filter channel as a whole, we can preserve the structures in channels and obtain higher accuracy.
Stochastic vs. Deterministic vs. Fixed.
The proposed algorithm uses a stochastic partition based on the roulette algorithm to select the $N_q$ rows to be quantized. We argue that the stochastic partition eliminates the requirement of finding the best partition after initialization, and is able to explore the search space for a better solution thanks to its exploitation-exploration nature. To verify this, we compare our purely stochastic scheme to two schemes that are not fully stochastic. The first one is the deterministic scheme using a sorting-based partition. Given the SQ ratio $r$ and the quantization error $e$, we sort the rows of the weight matrix by $e$ and select the $N_q$ rows with the least quantization error as group $G_q$ in each iteration. The second one is the fixed partition, in which we only do the roulette partition in the first iteration, and keep $G_q$ and $G_r$ fixed for all the iterations in each training stage.
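For contrast with the roulette selection, the sorting-based deterministic baseline described above can be sketched in a few lines. The function name `deterministic_partition` and the rounding of the group size are assumptions.

```python
def deterministic_partition(errors, ratio):
    """Quantize the round(ratio * m) rows with the smallest quantization error."""
    m = len(errors)
    n_q = int(round(ratio * m))
    order = sorted(range(m), key=lambda i: errors[i])   # ascending error
    return sorted(order[:n_q]), sorted(order[n_q:])

g_q, g_r = deterministic_partition([0.4, 0.1, 0.3, 0.2], 0.5)
print(g_q, g_r)
```

With the same errors this always returns the same partition, which is exactly the lack of exploration that the stochastic scheme is meant to avoid.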
Table 2 compares the results. It shows that the stochastic algorithm gets significantly better results for both BWN and TWN, which indicates that the stochastic partition we introduce is a key factor for success. Another interpretation is that stochastic quantization acts as a regularizer in training, which also benefits the performance.
Quantization Probability Function.
In Section 3.2, we introduced four choices of quantization probability function given the quantization error: the constant, linear, softmax and sigmoid functions. We compare the results of the different functions in Table 4. Among these choices, the linear function is preferable due to its better performance and simplicity. We find that the performance of the different functions is very close, which indicates that what matters most is the stochastic partition algorithm itself.
Scheme for updating the SQ ratio.
We also study how to design the scheme for updating the SQ ratio. Basically, we assume that training with stochastic quantization is divided into several stages, and each stage has a fixed SQ ratio. In all previous settings, we use four stages with SQ ratios $50\%$, $75\%$, $87.5\%$ and $100\%$. We call this the exponential scheme (Exp), since in each stage the non-quantized part is half the size of that in the previous stage. We compare our results with two additional schemes: the average scheme (Ave), and the fine-tuned scheme (Tune). The average scheme includes five stages with $20\%$, $40\%$, $60\%$, $80\%$ and $100\%$. The fine-tuned scheme fine-tunes the pre-trained full-precision models with the same stages as the exponential scheme, instead of training from scratch.
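The two staged schedules can be generated programmatically. The sketch below assumes the exponential scheme halves the non-quantized fraction at every stage and ends fully quantized, and that the average scheme uses equal increments; the function names are illustrative.

```python
def exponential_schedule(stages):
    """Halve the non-quantized fraction each stage; the last stage is fully quantized."""
    return [1.0 - 0.5 ** k for k in range(1, stages)] + [1.0]

def average_schedule(stages):
    """Raise the SQ ratio in equal steps up to 1."""
    return [k / stages for k in range(1, stages + 1)]

print(exponential_schedule(4))   # [0.5, 0.75, 0.875, 1.0]
print(average_schedule(5))       # [0.2, 0.4, 0.6, 0.8, 1.0]
```

Because every stage is trained for as many iterations as a plain low-bit run, the number of stages directly multiplies the total training cost: four times for the exponential scheme and five times for the average scheme.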
Table 3 shows the results of the three compared schemes. The exponential scheme performs better than both the average scheme and the fine-tuned scheme. It can also be concluded that our method works well when training from scratch.
Table 5: Test error (%) of different methods on CIFAR-10 and CIFAR-100.
                 CIFAR-10            CIFAR-100
          Bits   VGG-9   ResNet-56   VGG-9   ResNet-56
FWN       32     9.00    6.69        30.68   29.49
BWN       1      10.67   16.42       37.68   35.01
SQ-BWN    1      9.40    7.15        35.25   31.56
TWN       2      9.87    7.64        34.80   32.09
SQ-TWN    2      8.37    6.20        34.24   28.90

Table 6: Top-1 and top-5 test error (%) of different methods on ImageNet.
                 AlexNet-BN        ResNet-18
          Bits   top-1   top-5     top-1   top-5
FWN       32     44.18   20.83     34.80   13.60
BWN       1      51.22   27.18     45.20   21.08
SQ-BWN    1      48.78   24.86     41.64   18.35
TWN       2      47.54   23.81     39.83   17.02
SQ-TWN    2      44.70   21.40     36.18   14.26
4.3 Benchmark Results
We make a full benchmark comparison on the CIFAR-10, CIFAR-100 and ImageNet datasets between the proposed SQ algorithm and the traditional algorithms for BWN and TWN. For a fair comparison, we keep the learning rate, optimization method, batch size, etc. identical in all these experiments. We adopt the best setting (i.e., channel-wise selection, stochastic partition, linear probability function, exponential scheme for the SQ ratio) for all SQ-based experiments.
Table 5 presents the results on the CIFAR-10 and CIFAR-100 datasets. In these two cases, SQ-BWN and SQ-TWN outperform BWN and TWN significantly, especially for ResNet-56, which is deeper and whose gradients can be easily misled by quantized weights with large quantization errors. For example, SQ-BWN improves the accuracy by $9.27\%$ over BWN on the CIFAR-10 dataset with the ResNet-56 network. The 2-bit SQ-TWN models can even obtain higher accuracy than the full-precision models. Our results show that SQ-TWN improves the accuracy by $0.63\%$, $0.49\%$ and $0.59\%$ over the full-precision models on CIFAR-10 with VGG-9, CIFAR-10 with ResNet-56 and CIFAR-100 with ResNet-56, respectively.
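The improvements quoted for the CIFAR benchmarks follow directly from the test errors in Table 5. As a quick check, the snippet below recomputes them as differences in error (in percentage points); the dictionary layout and the helper name `gain` are just an illustrative way to hold the table.

```python
# Test error (%) copied from Table 5: (dataset, network) -> method -> error.
errors = {
    ("CIFAR-10", "VGG-9"): {"FWN": 9.00, "BWN": 10.67, "SQ-BWN": 9.40, "TWN": 9.87, "SQ-TWN": 8.37},
    ("CIFAR-10", "ResNet-56"): {"FWN": 6.69, "BWN": 16.42, "SQ-BWN": 7.15, "TWN": 7.64, "SQ-TWN": 6.20},
    ("CIFAR-100", "VGG-9"): {"FWN": 30.68, "BWN": 37.68, "SQ-BWN": 35.25, "TWN": 34.80, "SQ-TWN": 34.24},
    ("CIFAR-100", "ResNet-56"): {"FWN": 29.49, "BWN": 35.01, "SQ-BWN": 31.56, "TWN": 32.09, "SQ-TWN": 28.90},
}

def gain(setting, baseline, method):
    """Error reduction (percentage points) of `method` relative to `baseline`."""
    row = errors[setting]
    return round(row[baseline] - row[method], 2)

print(gain(("CIFAR-10", "ResNet-56"), "BWN", "SQ-BWN"))   # 9.27
print(gain(("CIFAR-10", "VGG-9"), "FWN", "SQ-TWN"))       # 0.63
print(gain(("CIFAR-10", "ResNet-56"), "FWN", "SQ-TWN"))   # 0.49
print(gain(("CIFAR-100", "ResNet-56"), "FWN", "SQ-TWN"))  # 0.59
```

The remaining combination, CIFAR-100 with VGG-9, gives a negative value for SQ-TWN against FWN, which is why only three settings are listed as beating the full-precision model.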
We also show the curves of test loss on the CIFAR-10 dataset with the ResNet-56 network for the binary and ternary settings. In the binary case, BWN does not converge well and has a much larger loss than SQ-BWN, while the losses of SQ-BWN and FWN are relatively low. In the ternary case, we can see that at the beginning of each stage, quantizing more weights leads to a large loss, which soon converges after some training iterations. Finally, SQ-TWN reaches a smaller loss than FWN, while the loss of TWN is larger than that of FWN.
Table 6 shows the results on the standard ImageNet 50K validation set. We compare SQ-BWN and SQ-TWN to FWN, BWN and TWN with the AlexNet-BN and ResNet-18 architectures, and all the errors are reported with single center-crop testing only. It shows that our algorithm improves the performance considerably and consistently beats the baseline methods by a large margin. We can also see that SQ-TWN yields results approaching those of FWN. Note that some works [Rastegari et al., 2016; Venkatesh et al., 2016] keep the first or last layer at full precision to alleviate possible accuracy loss. Although we do not do so, SQ-TWN still achieves near full-precision accuracy. We therefore conclude that our SQ algorithm can consistently and significantly improve the performance.
5 Conclusion
In this paper, we propose a Stochastic Quantization (SQ) algorithm to learn accurate low-bit DNNs. We propose a roulette-based partition algorithm to select a portion of the weights to quantize at each iteration, while keeping the other portion unchanged at full precision. The hybrid weights provide much more appropriate gradients and lead to a better local minimum. We empirically demonstrate the effectiveness of our algorithm by extensive experiments on various low-bit-width settings, network architectures and benchmark datasets. We also make our code public at https://github.com/dongyp13/StochasticQuantization. Future work may consider the stochastic quantization of both weights and activations.
Acknowledgement
This work was done when Yinpeng Dong and Renkun Ni were interns at Intel Labs supervised by Jianguo Li. Yinpeng Dong, Jun Zhu and Hang Su are also supported by the National Basic Research Program of China (2013CB329403) and the National Natural Science Foundation of China (61620106010, 61621136008).
Footnotes
We can also define the non-quantization probability, indicating the probability that each row should not be quantized.
References
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
Matthieu Courbariaux and Yoshua Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In NIPS, 2013.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. In NIPS Workshop on EMDNN, 2016.
Darryl D Lin, Sachin S Talathi, and V Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In ICML, 2016.
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
Daisuke Miyashita, Edward H Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15, 2014.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
Ganesh Venkatesh, Eriko Nurvitadhi, and Debbie Marr. Accelerating deep convolutional networks using low-precision and sparsity. arXiv preprint arXiv:1610.00324, 2016.
Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. In ICLR, 2017.
Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. In ICLR, 2017.