UNIQ: Uniform Noise Injection for the Quantization of Neural Networks
Abstract
We present a novel method for training deep neural networks amenable to inference in low-precision arithmetic with quantized weights and activations. The training is performed in full precision with random noise injection emulating quantization noise. In order to circumvent the need to simulate realistic quantization noise distributions, the weight and activation distributions are uniformized by a nonlinear transformation, and uniform noise is injected. An inverse transformation is then applied. This procedure emulates a non-uniform quantile quantizer at inference time, which is shown to achieve state-of-the-art results for training low-precision networks on the CIFAR-10 and ImageNet-1K datasets. In particular, we observe no degradation in accuracy for MobileNet and ResNet-18 on ImageNet with as low as 2-bit quantization of the activations, and minimal degradation for as little as 4 bits for the weights.
Table 1: Top-1 accuracy [%] on ImageNet, grouped by computational complexity (bit operations). Bits are listed as weight bits, activation bits.

| Bit operations | Architecture | Method | Bits (W,A) | Accuracy [%] |
|---|---|---|---|---|
| Less than ten GBOPs | MobileNet | UNIQ (Ours) | 2,4 | 51.01 |
| | MobileNet | UNIQ (Ours) | 2,2 | 49.27 |
| Tens of GBOPs | ResNet-18 | UNIQ (Ours) | 8,4 | 69.92 |
| | ResNet-18 | UNIQ (Ours) | 8,2 | 69.63 |
| | MobileNet | UNIQ (Ours) | 8,4 | 68.30 |
| | MobileNet | UNIQ (Ours) | 8,2 | 68.00 |
| | ResNet-18 | UNIQ (Ours) | 4,4 | 66.02 |
| | ResNet-18 | UNIQ (Ours) | 4,2 | 65.32 |
| | MobileNet | UNIQ (Ours) | 4,4 | 64.40 |
| | MobileNet | UNIQ (Ours) | 4,2 | 64.10 |
| | AlexNet | XNOR (Rastegari et al., 2016) | 1,32 | 60.10 |
| | ResNet-18 | XNOR (Rastegari et al., 2016) | 1,1 | 51.20 |
| | AlexNet | QNN (Hubara et al., 2016b) | 1,2 | 51.03 |
| | AlexNet | XNOR (Rastegari et al., 2016) | 1,1 | 44.20 |
| | AlexNet | QNN (Hubara et al., 2016b) | 1,1 | 41.80 |
| Hundreds of GBOPs | ResNet-18 | Apprentice (Mishra & Marr, 2018) | 4,8 | 70.40 |
| | ResNet-18 | IQN (Zhou et al., 2017a) | 5,32 | 68.89 |
| | ResNet-18 | Apprentice (Mishra & Marr, 2018) | 2,32 | 68.50 |
| | MobileNet | Baseline (full precision) | 32,32 | 68.20 |
| | ResNet-18 | Apprentice (Mishra & Marr, 2018) | 2,2 | 68.00 |
| | ResNet-18 | Distillation (Polino et al., 2018) | 4,32 | 64.20 |
| | AlexNet | WRPN (Mishra et al., 2018) | 2,32 | 57.50 |
| | AlexNet | Deep Compression (Han et al., 2016) | 8,32 | 57.21 |
| | AlexNet | WRPN (Mishra et al., 2018) | 1,32 | 56.80 |
| | AlexNet | WRPN (Mishra et al., 2018) | 4,32 | 55.34 |
| TBOPs | VGG-16 | Baseline (full precision) | 32,32 | 71.00 |
| | VGG-16 | IQN (Zhou et al., 2017a) | 5,32 | 70.82 |
| | ResNet-18 | Baseline (full precision) | 32,32 | 69.60 |
| | AlexNet | Baseline (full precision) | 32,32 | 56.50 |
1 Introduction
Deep learning has established itself as an important tool in the machine learning arsenal. Deep networks have shown spectacular success in a variety of tasks in a broad range of fields including computer vision, signal processing, computational imaging and image processing, and speech and language processing (Hinton et al., 2012; Chen et al., 2016; Lai et al., 2015).
A major drawback of deep learning models is their storage and computational cost. Typical deep networks comprise millions of parameters and require billions of multiply-accumulate (MAC) operations. In many cases, this cost renders them infeasible for running on mobile devices with limited resources. While some applications allow moving the inference to the cloud, such an architecture still incurs significant bandwidth and latency limitations.
A recent trend in research focuses on developing lighter deep models, both in their memory footprint and computational complexity. The main focus of most of these works, including the present one, is on alleviating the complexity at inference time rather than simplifying the training. While training a deep model requires even more resources and time, it is usually done offline with plentiful computational resources.
One way of reducing the computational cost is quantization of the weights and activations. Quantization of the weights additionally reduces storage size and memory access. The bit widths of the activations and weights each affect the amount of required hardware logic linearly; reducing both widths by a factor of two reduces the amount of hardware by roughly a factor of four. The weights need to be stored in memory, and reducing their size makes it possible to fit bigger networks into the device, which is especially critical for embedded devices and custom hardware. Activations, on the other hand, are passed between layers; thus, if different layers are processed separately, reducing the activation size lowers the communication overhead. It should be noted that in typical architectures there are overwhelmingly more activations than weights.
Deep neural networks are usually trained and operated with both the weights and the activations represented in single-precision (32-bit) floating point. A straightforward uniform quantization of the pretrained weights to a 16- or even 8-bit fixed-point representation has been shown to have a negligible effect on model accuracy (Gupta et al., 2015). In the majority of applications, further reduction of precision quickly degrades performance; hence, non-trivial techniques are required to carry it out.
Previous studies have investigated quantizing the network weights and, sometimes, the activations to as low as ternary or binary (1-bit) representation (Rastegari et al., 2016; Hubara et al., 2016b; Zhou et al., 2016; Mishra et al., 2018; Dong et al., 2017). Such an extreme reduction of the range of possible parameter values can greatly affect accuracy. Recent works have proposed using a wider network, i.e., one with more filters per layer, to mitigate the accuracy loss (Zhu et al., 2016; Mishra et al., 2018; Polino et al., 2018). In some approaches, e.g., Zhu et al. (2016) and Zhou et al. (2017a), a learnable linear scaling layer is added after each quantized layer to improve expressiveness.
A general approach to learning a quantized model, adopted in several recent papers (Hubara et al., 2016a;b; Zhou et al., 2016; Rastegari et al., 2016; Cai et al., 2017), is to perform the forward pass using the quantized values, while keeping another set of full-precision values for the backward-pass updates. In this case, the backward pass is still almost-everywhere differentiable, while the forward pass is quantized. In the aforementioned papers, a deterministic or stochastic function is used at training time for weight and activation quantization. Another approach, introduced by Mishra & Marr (2018) and Polino et al. (2018), is based on a teacher-student setup for knowledge distillation from a full-precision (and larger) teacher model to a quantized student model. This method allows training highly accurate quantized models, but with the drawback that a larger model must be trained, e.g. ResNet-34 is used for training ResNet-18.
Most previous studies have used uniform quantization (i.e., all quantization bins have equal width), which is attractive due to its simplicity. However, unless the values are uniformly distributed, uniform quantization is not optimal in the $\ell_2$ sense, nor in any other reasonable metric. Unfortunately, neural network weights are not uniformly distributed and tend to assume bell-shaped distributions (Han et al., 2016).
Non-uniform quantization was utilized by Han et al. (2016), where the authors replaced the weight values with indexes pointing into a finite codebook of shared values. Another approach, adopted by Zhou et al. (2017a), learns the quantizer thresholds by iteratively grouping close values and retraining the weights. Lee et al. (2017) utilize logarithmic-scale quantization as an approximation of the optimal Lloyd quantizer (Lloyd, 1982). Cai et al. (2017) proposed to optimize the expectation of the quantization error in order to reduce it; normally distributed weights and half-normally distributed activations were assumed, which enables using a pre-calculated $k$-means quantizer. In Zhou et al. (2017b), balanced bins are used, so that each bin contains the same number of samples. In some sense, this idea is the closest to our approach; yet, while Zhou et al. (2017b) force the values to have an approximately uniform distribution, we pose no such constraint. Also, since calculating percentiles is expensive in their setting, Zhou et al. (2017b) approximate them with means, while our method allows using the actual percentiles, as detailed in the sequel.
Contribution. This paper makes the following contributions. Firstly, we propose a quantile quantization method with balanced (equal probability mass) bins, which is particularly suitable for neural networks, where outliers and the long tails of the bell curve are less important. We also show a simple and efficient way of reformulating this quantizer using a "uniformization trick".
Secondly, we introduce a novel method for training a neural network that performs well with quantized values. This is achieved by adding noise at training time to emulate the quantization noise introduced at inference time. The uniformization trick makes the injection of uniform noise exact and alleviates the need to draw the noise from complicated, bin-dependent distributions. While we limit our attention to quantile quantization, the proposed scheme can work with any threshold configuration while still retaining the advantage of uniformly distributed noise in every bin.
Lastly, we report a major improvement over state-of-the-art quantization techniques in the performance vs. complexity trade-off. Unlike several leading methods, our approach can be applied directly to existing architectures without the need to modify them at training time (as opposed, for example, to the teacher-student approaches that require training a bigger network, or the XNOR networks that typically increase the number of parameters by a significant factor in order to meet accuracy goals).
2 Method
In this section, we present our method for training a neural network amenable to operation in low-precision arithmetic. We start by outlining several common quantization schemes and discussing their suitability for deep neural networks. We then suggest a training procedure in which uniform random additive noise is injected into the weights and activations, simulating the quantization error. The scheme aims at improving the quantized network performance at inference time, when regular deterministic quantization is used.
2.1 Quantization
Let $X$ be a random variable drawn from some distribution described by the probability density function $f(x)$. Let $t_0 < t_1 < \dots < t_k$ (with $t_0 = -\infty$ and $t_k = \infty$) be a set of thresholds partitioning the real line into $k$ disjoint intervals (bins) $B_i = (t_{i-1}, t_i]$, and let $\{q_1, \dots, q_k\}$ be a set of representation levels. A quantizer $Q_k$ is a function mapping each bin $B_i$ to the corresponding representation level $q_i$. We denote the quantization error by $e = Q_k(x) - x$. The effect of quantization can be modelled as the addition of random noise to $x$; the noise added to the $i$-th bin admits the conditional distribution of $e$ given $x \in B_i$.
Most papers on neural network quantization focus on the uniform quantizer, which has a constant bin width $t_i - t_{i-1} = \Delta$ with representation levels at the bin centers. A $k$-means quantizer is optimal in the sense of the mean squared error $\mathbb{E}\,(Q_k(x) - x)^2$, where the expectation is taken with respect to the density $f$. Its name follows from the property that each representation level $q_i$ coincides with the $i$-th bin centroid (its mean w.r.t. $f$). While finding the optimal $k$-means quantizer is known to be an NP-hard problem, heuristic procedures such as the Lloyd-Max algorithm (Lloyd, 1982) usually produce a good approximation. The $k$-means quantizer coincides with the uniform quantizer when $X$ is uniformly distributed.
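For concreteness, the Lloyd-Max heuristic alternates between placing thresholds at the midpoints of adjacent representation levels and recomputing each level as its bin centroid. Below is a minimal sketch for scalar samples; the function name and the quantile-based initialization are our own choices, not part of the original algorithm specification:

```python
def lloyd_max(samples, k, iters=50):
    """Heuristic k-means (Lloyd-Max) scalar quantizer design."""
    samples = sorted(samples)
    n = len(samples)
    # initialize representation levels at evenly spaced sample quantiles
    levels = [samples[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        # nearest-level decision boundaries: midpoints between adjacent levels
        thresholds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        # partition samples into bins and recompute each level as the bin mean
        bins = [[] for _ in range(k)]
        for x in samples:
            bins[sum(x > t for t in thresholds)].append(x)
        levels = [sum(b) / len(b) if b else lv for b, lv in zip(bins, levels)]
    return thresholds, levels
```

Each iteration costs O(nk), which is exactly why running such a procedure in every forward pass of a large network is impractical.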
While being a popular choice in signal processing, the $k$-means quantizer encounters severe obstacles in our problem of neural network quantization. Firstly, the Lloyd-Max algorithm has a prohibitively high complexity to be used in every forward pass. Secondly, it is not easily amenable to our scheme of modelling quantization as the addition of random noise, since the noise distribution in every bin is complex and varies as the quantization thresholds change. Finally, our experiments in Section 3.6 in the sequel suggest that the use of the $\ell_2$ criterion for quantization of deep neural network classifiers does not produce the best classification results. The weights in such networks typically assume a bell-shaped distribution whose tails exert a great effect on the mean squared error, yet have little impact on classification precision.
As an alternative to $k$-means, we propose the quantile quantizer characterized by the equiprobable bins property, that is, $\Pr(x \in B_i) = 1/k$ for every $i$. The property is realized by setting $t_i = F^{-1}(i/k)$, where $F$ denotes the cumulative distribution function of $X$ and, accordingly, its inverse $F^{-1}$ denotes the quantile function. The representation level of the $i$-th bin is set to the bin median, $q_i = F^{-1}\!\left(\frac{i - 1/2}{k}\right)$.
It can be shown that the quantile quantizer minimizes the mean absolute error $\mathbb{E}\,|Q_k(x) - x|$. The use of such a more robust error criterion limits the effect of the tails of the distribution and typically places more bins closer to the distribution mean. Based on empirical observations, we conjecture that the distribution tails are not essential for good model performance, at least in classification tasks. It can also be shown that in the case of a uniform $X$, the quantile quantizer coincides with the $k$-level uniform quantizer.
The cumulative distribution $F$ and the quantile function $F^{-1}$ can be efficiently estimated empirically from the distribution of the weights and activations, and updated in every forward pass. Alternatively, one can rely on the empirical observation that regularized weights tend to assume an approximately normal distribution (Blundell et al., 2015) and use the normal quantile function.
Empirical observations also show that the network activations can be approximated as normally distributed variables saturated by the ReLU units (Cai et al., 2017). The resulting distribution is a mixture of a delta at zero and a half-normal distribution, for which the quantile function also has a closed form.
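Concretely, for a zero-mean Gaussian pre-activation saturated by a ReLU, half of the probability mass collapses onto zero and the remainder follows the normal upper tail, so the quantile function is piecewise. A small sketch using Python's standard-library `NormalDist` (the function name is ours):

```python
from statistics import NormalDist

def relu_normal_quantile(p, sigma=1.0):
    """Quantile function of Y = max(X, 0) with X ~ N(0, sigma^2):
    a point mass of probability 1/2 at zero plus a half-normal tail."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in the open interval (0, 1)")
    if p <= 0.5:
        return 0.0                          # the delta at zero covers p <= 1/2
    return sigma * NormalDist().inv_cdf(p)  # upper tail matches the normal CDF
```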
The fact that a monotonically increasing transformation preserves quantiles allows an alternative construction of the quantile quantizer. We first apply the transformation $u = F(x)$ to the input, converting it into a uniform random variable on the interval $[0, 1]$. Then, a uniform $k$-level quantizer (coinciding with the quantile quantizer on the uniformized variable) is applied, producing $\tilde{u} = Q_k(u)$; the result is transformed back into $\tilde{x} = F^{-1}(\tilde{u})$ using the inverse transformation. We refer to this procedure as the uniformization trick. Its importance will become evident in the next section.
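The trick can be sketched in a few lines: uniformize through the CDF, apply the uniform $k$-level quantizer with levels at the bin medians, and map back through the quantile function. We assume normally distributed inputs and use the standard library's `NormalDist`; the function name is ours:

```python
from statistics import NormalDist

def quantile_quantize(x, k, dist=NormalDist(0.0, 1.0)):
    """Quantile quantizer via the uniformization trick."""
    u = dist.cdf(x)              # uniformize: u = F(x) is uniform on [0, 1]
    i = min(int(u * k), k - 1)   # bin index under the uniform k-level quantizer
    u_q = (i + 0.5) / k          # representation level: the bin median
    return dist.inv_cdf(u_q)     # de-uniformize: q_i = F^{-1}((i + 1/2) / k)
```

By construction, the $k$ output levels are equiprobable under the assumed distribution, which is exactly the balanced-bins property of the quantile quantizer.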
2.2 Training quantizers by uniform noise injection
The lack of continuity, let alone smoothness, of the quantization operator renders its use in the backward pass impractical. As an alternative, at training time we replace the quantizer with the injection of random additive noise. This scheme suggests that instead of using the quantized value $Q_k(w)$ of a weight $w$ in the forward pass, $w + e$ is used, with $e$ drawn from the conditional distribution of the quantization error described by the density

$$f_{e \mid w \in B_i}(e) = \frac{f(q_i - e)}{\Pr(w \in B_i)},$$

defined for $q_i - e \in B_i$ and vanishing elsewhere. The bin $B_i$ to which $w$ belongs is established according to its value and is fixed during the backward pass. Quantization of the network activations is performed in the same manner.
The fact that the parameters do not directly undergo quantization keeps the model differentiable. In addition, gradient updates in the backward pass have an immediate impact on the forward pass, in contrast to a directly quantized model, where small updates often leave a parameter in the same bin, leaving it effectively unchanged.
However, while in signal processing it is customary to model the quantization noise distribution as uniform, this approximation breaks in the extremely low precision regimes (small $k$) considered here. Hence, the injected noise has to be drawn from a potentially non-uniform distribution, which furthermore changes as the network parameters and the quantizer thresholds are updated.
To overcome this difficulty, we resort to the uniformization trick outlined in the previous section. Instead of the quantile quantizer applied to $x$, we apply the equivalent uniform quantizer to the uniformized variable $u = F(x)$. The effect of the quantizer can again be modelled using noise injection,

$$\tilde{u} = u + e,$$

with the cardinal difference that now the noise $e$ is uniformly distributed on the interval $\left[-\frac{1}{2k}, \frac{1}{2k}\right]$.
Note that the proposed scheme, henceforth referred to as uniform noise injection quantization (UNIQ), is not restricted to the quantile quantizer discussed in this paper, but rather applies to any quantizer. In the general case, the noise injected into each bin is uniform, but its variance changes with the bin width. The latter requires finding the bin number in the forward pass, which has a negative impact on the training time.
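At training time, the whole scheme for a single scalar weight reduces to a few operations, sketched below under the same normality assumption as before (the function name, the reliance on `NormalDist`, and the clipping epsilon are ours; an actual implementation would act on whole tensors inside the automatic-differentiation graph):

```python
import random
from statistics import NormalDist

def train_noise(w, k, dist=NormalDist(0.0, 1.0), rng=random):
    """UNIQ-style training-time surrogate of the quantile quantizer:
    inject uniform noise in the uniformized domain instead of rounding."""
    u = dist.cdf(w)                            # uniformize the weight
    e = rng.uniform(-0.5 / k, 0.5 / k)         # noise of a k-level uniform quantizer
    u_noisy = min(max(u + e, 1e-9), 1 - 1e-9)  # keep strictly inside (0, 1)
    return dist.inv_cdf(u_noisy)               # map back to the weight domain
```

Because the noise is added in the uniformized domain, the perturbed weight never leaves the quantile band of width $1/k$ around its original quantile.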
Usually, quantization of neural networks is either used for training a model from scratch or applied post-training as a fine-tuning stage. Our method, as demonstrated in our experiments, works well in both cases. Our practice shows that the best results are obtained when the learning rate is reduced as the noise is added; we explain this by the need to compensate for noisier gradients.
2.3 Gradual quantization
The described method works well "as is" for small to medium-sized neural networks. For deeper networks, the basic method does not perform well, most likely due to cumulative errors arising when a long sequence of operations is applied and more noise is added at each step. We found that applying the scheme gradually to groups of layers works better in deep networks.
In order to perform gradual quantization, we split the network into $N$ blocks $b_1, \dots, b_N$, each containing about the same number of consecutive layers. We also split our budget of training epochs into $N$ stages. At the $i$-th stage, we quantize and freeze the parameters in blocks $b_1, \dots, b_{i-1}$, and inject noise into the parameters of $b_i$; for the remaining blocks, neither noise nor quantization is applied. This way, the number of parameters and activations into which noise is injected simultaneously is reduced, which allows better convergence. The deeper blocks gradually adapt to the quantization error of the previous ones and thus tend to converge relatively fast when the noise is injected into them. For fine-tuning a pretrained model, we use the same scheme, applying a single epoch per stage. An empirical analysis of the effect of the number of stages is presented in Section 3.5.
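The staged schedule above can be sketched as a small helper that partitions layer indices into consecutive blocks and divides the epoch budget among the stages; the function name and the ceiling-division split are our own conventions, since the text does not prescribe an exact partition:

```python
def stage_schedule(num_layers, num_stages, total_epochs):
    """Return a list of (layer_indices, epochs) pairs, one per stage.
    At stage i, earlier blocks are assumed quantized and frozen, block i
    receives noise injection, and later blocks run in full precision."""
    per_block = -(-num_layers // num_stages)  # ceiling division
    blocks = [list(range(s, min(s + per_block, num_layers)))
              for s in range(0, num_layers, per_block)]
    epochs = [total_epochs // len(blocks)] * len(blocks)
    for i in range(total_epochs - sum(epochs)):
        epochs[i] += 1                        # spread the remainder over early stages
    return list(zip(blocks, epochs))
```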
3 Experimental evaluation
We performed an extensive performance analysis of the UNIQ scheme compared to the most prominent recent methods for neural network quantization. The basis for comparison is the accuracy vs. the total number of bit operations in visual classification tasks on the ImageNet-1K (Russakovsky et al., 2015) dataset. The CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009) are further used to evaluate different design choices.
MobileNet (Howard et al., 2017) and ResNet-18 (He et al., 2016) architectures were used as the baselines for quantization. MobileNet was chosen as a representative of lighter models, which are better suited for the limited hardware settings where quantization is also most likely to be used. ResNet-18 was chosen due to its near state-of-the-art performance and popularity, which makes it an excellent reference for comparison with other results in the literature.
We compared the performance of different quantization methods, training from scratch vs. fine-tuning, and different partitions into stages in the gradual quantization strategy. We adopted the number of bit operations (BOPs) as the metric for arithmetic complexity. This metric is particularly informative about the performance of mixed-precision arithmetic, especially in hardware implementations on FPGAs and ASICs.
Training details
For training a quantized ResNet-18 model from random initialization on CIFAR, we used SGD with momentum and weight decay; the weight decay was set to zero and the learning rate reduced partway through training. For fine-tuning a pretrained model, we used the same settings with a shorter epoch budget. Both were trained in stages; noise was not injected into the first and last layers, and they were not quantized. The same setup was used when fine-tuning on ImageNet, but with lower learning rates for ResNet-18 and MobileNet and a different epoch and stage budget for MobileNet. In all experiments, unless otherwise stated, we fine-tuned a pretrained model.
3.1 Performance of quantized networks on ImageNet
Tables 2–3 report the top-1 accuracy of MobileNet and ResNet-18 with weights and activations quantized to several levels using UNIQ. As the baseline, we use a full-precision model with 32-bit weights and activations. We found the UNIQ method to perform especially well for extreme low-precision activations, with minor degradation in performance even with as little as 2-bit quantization. Higher precision is required for the weights: we only observed noticeable degradation when using 4 bits or fewer.
Table 1 compares the performance of networks quantized using UNIQ to other leading approaches reported in the literature. Those methods are aimed at lower-precision weights rather than activations, while UNIQ is better suited for lower-precision activations. Nevertheless, UNIQ achieves comparable performance at a significantly lower overall complexity.
We found the UNIQ method to perform well also with the smaller MobileNet architecture, which comprises only a few million parameters. This is in contrast with previous works that resort to using larger models, e.g. by doubling the number of filters of ResNet-18 and thus roughly quadrupling its number of parameters (Mishra et al., 2018; Polino et al., 2018).
Table 2: Top-1 accuracy [%] of quantized MobileNet on ImageNet for different weight and activation bit widths.

| Weight bits \ Activation bits | 2 | 4 | 32 |
|---|---|---|---|
| 2 | 49.27 | 51.01 | – |
| 4 | 64.15 | 64.43 | – |
| 8 | 68.07 | 68.32 | – |
| 32 | – | – | 68.01 |
Table 3: Top-1 accuracy [%] of quantized ResNet-18 on ImageNet for different weight and activation bit widths.

| Weight bits \ Activation bits | 2 | 4 | 32 |
|---|---|---|---|
| 2 | 48.01 | 49.77 | – |
| 4 | 65.32 | 66.02 | – |
| 8 | 69.63 | 69.92 | – |
| 32 | – | – | 70.14 |
3.2 Accuracy vs. complexity tradeoff
Since custom-precision data types are used for the network weights and activations, the number of MAC operations is not an appropriate metric of the computational complexity of the model. Therefore, we use the BOP metric quantifying the number of bit operations. Given the bit widths of two operands, it is possible to approximate the number of bit operations required for a basic arithmetic operation such as addition or multiplication. The proposed metric is useful when the inference is run on custom hardware like FPGAs or ASICs. Both are a natural choice for quantized networks, due to their use of lookup tables (LUTs) and dedicated MAC (or more general DSP) units, which are efficient with custom data types.
Another factor that must be incorporated into the BOPs calculation is the cost of fetching the parameters from external memory. Two assumptions are made in the approximation of this cost: firstly, each parameter is fetched from external memory only once; secondly, the cost of fetching a $b$-bit parameter is assumed to be $b$ BOPs. Given a neural network with $n$ parameters all represented in $b$ bits, the memory access cost is simply $nb$ BOPs.
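Under these two assumptions, the metric can be computed as follows; the per-multiplication cost model of roughly $b_w \cdot b_a$ bit operations per MAC is a simplification of ours, since the exact hardware cost also depends on the accumulator width:

```python
def bops(n_macs, b_w, b_a, n_params):
    """Rough bit-operations estimate for a quantized network:
    compute cost of the MACs plus a one-time fetch of every parameter."""
    compute = n_macs * b_w * b_a  # ~b_w * b_a bit ops per multiply-accumulate
    memory = n_params * b_w       # fetching a b_w-bit parameter costs b_w BOPs
    return compute + memory
```

Halving both operand widths quarters the compute term, which mirrors the hardware-area argument made in the introduction.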
3.3 Accuracy vs. quantization level
Table 4 reports the accuracy of a narrow version of ResNet-18 on CIFAR-10 for various levels of weight and activation quantization. We observed that when using UNIQ, quantizing the activations has a smaller effect than quantizing the weights. Degradation in accuracy is significant for 1-bit weights, without a noticeable dependence on the activation precision. For 2 or more weight bits, we observed only a minor degradation compared to full precision, even when using 1-bit (binary) activations.
Table 4: Accuracy [%] of a narrow ResNet-18 on CIFAR-10 for different weight and activation bit widths.

| Weight bits \ Activation bits | 1 | 2 | 4 | 32 |
|---|---|---|---|---|
| 1 | 78.74 | 78.74 | 78.78 | 78.77 |
| 2 | 88.79 | 88.85 | 88.63 | 88.63 |
| 4 | 90.57 | 90.60 | 90.71 | 90.71 |
| 32 | 90.74 | 90.79 | 90.82 | 90.90 |
3.4 Training from scratch vs. fine-tuning
Both training from scratch (that is, from random initialization) and fine-tuning have their advantages and disadvantages. Training from scratch takes more time but requires a single training phase with no extra training epochs, at the end of which a quantized model is obtained. Fine-tuning, on the other hand, is useful when a pretrained full-precision model is already available; it can then be quantized with a short retraining.
Table 5 compares the accuracy achieved in the two regimes on a narrow version of ResNet-18 trained on CIFAR-10 and CIFAR-100. 5-bit quantization of the weights only and 5-bit quantization of both the weights and the activations were compared. We found that both regimes work equally well, reaching accuracy close to the full-precision baseline.
Table 5: Accuracy [%] of full training vs. fine-tuning for a narrow ResNet-18. Bits are listed as weight bits, activation bits.

| Dataset | Bits (W,A) | Full training | Fine-tuning | Baseline |
|---|---|---|---|---|
| CIFAR-10 | 5,32 | 90.36 | 90.47 | 90.90 |
| CIFAR-10 | 5,5 | 90.57 | 90.53 | |
| CIFAR-100 | 5,32 | 65.56 | 65.73 | 66.30 |
| CIFAR-100 | 5,5 | 65.65 | 66.05 | |
3.5 Accuracy vs. number of quantization stages
We found that injecting noise into all layers simultaneously does not perform well for deeper networks. As described in Section 2.3, we suggest splitting the training into stages, such that at each stage noise is injected only into a subset of the layers.
To determine the optimal number of stages, we fine-tuned ResNet-18 on CIFAR-10 with a fixed epoch budget. The same bit width was used for both the weights and the activations. The first and last layers were not quantized; the other layers were partitioned evenly into the stages. Two of the epochs were reserved for fine-tuning the first and last layers after all other layers were quantized and fixed; the remaining epochs were allocated evenly to the stages.
Figure 2 reports the classification accuracy as a function of the number of quantization stages. Based on these results, we conclude that the best strategy is injecting noise into a single layer at each stage; we follow this strategy in all ResNet-18 experiments conducted in this paper. For MobileNet, which is deeper, we chose to inject noise into several consecutive layers at every stage.
3.6 Comparison of different quantizers
In the following experiment, different quantizers were compared within the uniform noise injection scheme.
The bins of the uniform quantizer were allocated evenly in a symmetric range around zero proportional to the standard deviation $\sigma$ of the parameters. For both the quantile and the $k$-means quantizers, a normal distribution of the weights was assumed, and the normal cumulative distribution and quantile functions were used for uniformization and de-uniformization of the quantized parameters. The $k$-means and uniform quantizers used a pre-calculated set of thresholds translated to the uniformized domain. Since the resulting bins in the uniformized domain had different widths, the level of noise was different in each bin. This required an additional step of finding the bin index for each parameter, approximately doubling the training time.
The three quantizers were evaluated on a ResNet-18 network trained on the CIFAR-10 dataset with quantized weights and full-precision (32-bit) activations. Table 6 reports the obtained top-1 classification accuracy. The quantile quantizer outperforms the other quantization methods and is only slightly inferior to the full-precision baseline. In terms of training time, the quantile quantizer requires about 1.6 times the baseline training time, compared to an almost fourfold increase for the $k$-means quantizer. In addition, the quantile quantizer's training time is independent of the number of quantization bins, as the noise distribution is the same in every bin, while the other methods require separate processing of each bin, increasing the training time for higher bit widths.
Table 6: Comparison of quantizers on CIFAR-10 (ResNet-18, quantized weights, full-precision activations).

| Quantization method | Accuracy [%] | Training time [h] |
|---|---|---|
| Baseline (unquantized) | 90.90 | 1.42 |
| Quantile | 90.47 | 2.28 |
| k-means | 84.80 | 5.37 |
| Uniform | 83.93 | 5.37 |
4 Conclusion
We introduced UNIQ, a training scheme for quantized neural networks. The scheme is based on uniformization of the distribution of the quantized parameters, injection of additive uniform noise, followed by de-uniformization of the obtained result. The scheme is amenable to efficient training by backpropagation in full-precision arithmetic, and achieves maximum efficiency with the quantile (balanced) quantizer investigated in this paper.
We reported state-of-the-art results for quantized neural networks on the ImageNet visual recognition task. In the high-accuracy regime (around 70% top-1 accuracy), our networks require tens of GBOPs, as opposed to the best results obtained so far by the apprentice networks of Mishra & Marr (2018) at a complexity of hundreds of GBOPs. This 2- to 5-fold improvement in complexity comes at a negligible drop in accuracy compared to both Mishra & Marr (2018) and the full-precision baseline (the latter falling in the TBOPs range).
In the low-complexity regime, we achieve around 50% top-1 accuracy with a network requiring less than ten GBOPs. For comparison, the full-precision baselines require hundreds of GBOPs or more, while the best extreme quantization schemes achieving comparable accuracy, such as XNOR networks (Rastegari et al., 2016) and QNN (Hubara et al., 2016b), require tens of GBOPs (see Table 1).
UNIQ is straightforward to implement and can be used as a "plug-and-play" modification of existing architectures. It does not require tweaking the architecture, e.g. increasing the number of filters, as done in a few previous works. It can also benefit from pretrained parameters, requiring only a short retraining. Unlike the teacher-student schemes that require training a much bigger teacher network, our method trains the target architecture directly.
One of the most surprising properties of UNIQ is that it allows extreme quantization of the activations without a significant drop in accuracy (see Table 4). This enables settings with very low-precision quantization of the weights and extreme (even binary) quantization of the activations without the need to modify the architecture.
While this paper considered a setting in which all MAC operations are performed in the same precision, more complicated bit allocations will be explored in follow-up studies.
Acknowledgments
The research was funded by ERC StG RAPID and ERC StG SPADE.
References
 Blundell et al. (2015) Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622, 2015.
 Cai et al. (2017) Cai, Z, He, X, Sun, J, and Vasconcelos, N. Deep learning with low precision by half-wave Gaussian quantization. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2017.
 Chen et al. (2016) Chen, LiangChieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. CoRR, abs/1606.00915, 2016. URL http://arxiv.org/abs/1606.00915.
 Dong et al. (2017) Dong, Yinpeng, Ni, Renkun, Li, Jianguo, Chen, Yurong, Zhu, Jun, and Su, Hang. Learning accurate lowbit deep neural networks with stochastic quantization. In British Machine Vision Conference (BMVC’17), 2017.
 Gupta et al. (2015) Gupta, Suyog, Agrawal, Ankur, Gopalakrishnan, Kailash, and Narayanan, Pritish. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pp. 1737–1746, 2015.
 Han et al. (2016) Han, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations (ICLR), 2016.
 He et al. (2016) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Hinton et al. (2012) Hinton, G., Deng, L., Yu, D., Dahl, G. E., r. Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, Nov 2012. ISSN 10535888. doi: 10.1109/MSP.2012.2205597.
 Howard et al. (2017) Howard, Andrew G, Zhu, Menglong, Chen, Bo, Kalenichenko, Dmitry, Wang, Weijun, Weyand, Tobias, Andreetto, Marco, and Adam, Hartwig. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Hubara et al. (2016a) Hubara, Itay, Courbariaux, Matthieu, Soudry, Daniel, ElYaniv, Ran, and Bengio, Yoshua. Binarized neural networks. In Advances in neural information processing systems, pp. 4107–4115, 2016a.
 Hubara et al. (2016b) Hubara, Itay, Courbariaux, Matthieu, Soudry, Daniel, ElYaniv, Ran, and Bengio, Yoshua. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.
 Krizhevsky (2009) Krizhevsky, Alex. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
 Lai et al. (2015) Lai, Siwei, Xu, Liheng, Liu, Kang, and Zhao, Jun. Recurrent convolutional neural networks for text classification. In Proceedings of the TwentyNinth AAAI Conference on Artificial Intelligence, AAAI’15, pp. 2267–2273. AAAI Press, 2015. ISBN 0262511290. URL http://dl.acm.org/citation.cfm?id=2886521.2886636.
 Lee et al. (2017) Lee, Edward H, Miyashita, Daisuke, Chai, Elaina, Murmann, Boris, and Wong, S Simon. LogNet: Energy-efficient neural networks using logarithmic computation. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 5900–5904. IEEE, 2017.
 Lloyd (1982) Lloyd, Stuart. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
 Mishra & Marr (2018) Mishra, Asit and Marr, Debbie. Apprentice: Using knowledge distillation techniques to improve lowprecision network accuracy. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.
 Mishra et al. (2018) Mishra, Asit, Nurvitadhi, Eriko, Cook, Jeffrey J, and Marr, Debbie. WRPN: Wide reduced-precision networks. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1ZvaaeAZ.
 Polino et al. (2018) Polino, Antonio, Pascanu, Razvan, and Alistarh, Dan. Model compression via distillation and quantization. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1XolQbRW.
 Rastegari et al. (2016) Rastegari, Mohammad, Ordonez, Vicente, Redmon, Joseph, and Farhadi, Ali. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.
 Russakovsky et al. (2015) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and FeiFei, Li. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s112630150816y.
 Zhou et al. (2017a) Zhou, Aojun, Yao, Anbang, Guo, Yiwen, Xu, Lin, and Chen, Yurong. Incremental network quantization: Towards lossless CNNs with low-precision weights. In International Conference on Learning Representations (ICLR), 2017a.
 Zhou et al. (2017b) Zhou, ShuChang, Wang, YuZhi, Wen, He, He, QinYao, and Zou, YuHeng. Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology, 32(4):667–682, 2017b.
 Zhou et al. (2016) Zhou, Shuchang, Wu, Yuxin, Ni, Zekun, Zhou, Xinyu, Wen, He, and Zou, Yuheng. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
 Zhu et al. (2016) Zhu, Chenzhuo, Han, Song, Mao, Huizi, and Dally, William J. Trained ternary quantization. International Conference on Learning Representations (ICLR), 2016.