Ternary Neural Networks with Fine-Grained Quantization
Abstract
We propose a novel fine-grained quantization (FGQ) method to ternarize pre-trained full-precision models, while also constraining activations to 8 and 4 bits. Using this method, we demonstrate minimal loss in classification accuracy on state-of-the-art topologies without additional training. We provide an improved theoretical formulation that forms the basis for a higher quality solution using FGQ. Our method involves ternarizing the original weight tensor in groups of N weights. Using N=4, we achieve Top-1 accuracy within 3.7% and 4.3% of the baseline full-precision results for Resnet-101 and Resnet-50 respectively, while eliminating 75% of all multiplications. These results enable a full 8/4-bit inference pipeline, with the best reported accuracy using ternary weights on the ImageNet dataset, and a potential for significant improvement in performance. For smaller networks like AlexNet, FGQ also achieves state-of-the-art results. We further study the impact of group size on both performance and accuracy. With a group size of N=64, we eliminate ~98% of the multiplications; however, this introduces a noticeable drop in accuracy, which necessitates fine-tuning the parameters at lower precision. We address this by fine-tuning Resnet-50 with 8-bit activations and ternary weights at N=64, improving the Top-1 accuracy to within a few percent of the full-precision result with modest additional training overhead. Our final quantized model can run on a full 8-bit compute pipeline using 2-bit weights, with the potential for a substantial improvement in performance compared to baseline full-precision models.
1 Introduction
Today’s deep learning models achieve state-of-the-art results on a wide variety of tasks including computer vision, natural language processing, automatic speech recognition and reinforcement learning [1]. Mathematically, this involves solving a non-convex optimization problem with on the order of millions or more parameters. Solving this optimization problem, also referred to as training the neural network, is a compute-intensive process that, for current state-of-the-art networks, requires days to weeks. Once trained, the network evaluates a function on specific input data, referred to as inference. While the compute intensity of inference is much lower than that of training, because inference is performed on a large number of inputs, the total computing resources spent on inference are likely to dwarf those spent on training. The large and somewhat unique compute requirements for both deep learning training and inference operations motivate the use of customized low-precision arithmetic [9, 2, 8, 24, 15, 13] and specialized hardware to run these computations as efficiently as possible [5, 25, 21, 19, 11]. Most of these approaches require partial or full training of the network in low precision. Training at low precision allows the network to implicitly learn the low-precision representation (along with the inherent noise); however, it introduces significant resource overheads which can be prohibitive for many resource-constrained applications, specifically those involving edge devices.
Reducing the precision of both weights and activations has significant power-performance implications on system design. Low precision not only allows increasing compute density, but also reduces pressure on the memory subsystem. Most current solutions focus on compressing the model [14, 16], going as low as binary weights, which allows storing the model in the limited on-chip local memory. However, activations (input) need to be fetched from external memory or an I/O device (camera), and fetching data contributes the majority of the system power consumption. Hence, reducing the size of activations is essential for more efficient utilization of the available computational resources. There have been a few solutions [8, 9] using lower precision representations for activations; however, they necessitate specialized hardware for efficient implementation. Further, with the widespread adoption of deep learning across various applications, such as autonomous driving and augmented reality, there is an increased demand for inference tasks to be done efficiently on edge devices. To address both the aforementioned system and application requirements, there is a general trend to move towards a full lower-precision inference pipeline [11]. This is evident from the advent of 8-bit and sub-8-bit hardware such as Google’s TPU [11] and other mainstream GPU and CPU offerings (e.g., https://devblogs.nvidia.com/parallelforall/new-pascal-gpus-accelerate-inference-in-the-data-center/). Further, there is also software support for 8-bit inference through popular frameworks such as TensorFlow and Theano, and compute libraries like NVIDIA’s TensorRT.
In this paper, we focus on enabling a sub-8-bit inference pipeline by using ternary weights and 8/4-bit activations, with minimal or no retraining, while achieving near state-of-the-art accuracy. The rationale behind our approach is to carefully convert the full-precision weights to low precision, such that the element-wise distance between full-precision and low-precision weights is small. Consequently, the low-precision weights remain in the neighborhood of the pre-trained full-precision weights in the search space of network parameters, and we expect them to generalize in a similar manner, despite no retraining.
We summarize our contributions below:

Based on an improved theoretical formulation, we propose a novel fine-grained quantization (FGQ) method to convert pre-trained models to a ternary representation with minimal loss in test accuracy, without retraining.

With ternary weights, we achieve classification accuracy (Top-1) of 73.85% with 8-bit activations (2w8a) and 70.69% with 4-bit activations (2w4a) on the ImageNet dataset [3], using a pre-trained Resnet-101 model (no retraining). To the best of our knowledge, these are the highest reported accuracies in this category on the ImageNet dataset [17].

We demonstrate the general applicability of FGQ, with state-of-the-art results (2w8a, 2w4a) on smaller models such as Resnet-50 and AlexNet [12], and also show the efficacy of using FGQ for (re)training at low precision.

We study the performance-accuracy trade-off using different group sizes. For a group of N filters, we reduce the number of multiplications to one for every N additions, thus significantly reducing computational complexity, with a potential for substantial performance improvement over baseline full-precision models.
The rest of the paper is organized as follows. Section 2 discusses related work on ternary weights and low-precision inference, and contrasts them with FGQ. Section 3 describes the FGQ formulation and the theoretical basis for this method. Section 4 presents experimental results and related discussion. Finally, Section 5 concludes by summarizing the implications of FGQ, our results, and future research directions.
2 Related Work
Deep learning inference using low-precision weights and activations is a well-researched topic. Many researchers have experimented with custom data representations to perform deep learning tasks and have shown significant benefits over the general-purpose floating point representation. [20] showed that an 8-bit dynamically scaled fixed point representation [22] can be used to speed up convolutional neural networks on general-purpose CPU hardware by carefully choosing the right data layout, batching computations, and using implementations optimized for the target hardware, demonstrating a significant improvement over an aggressively tuned floating point implementation. [5] did a comprehensive study of the effect of low-precision fixed point computation for deep learning and successfully trained smaller networks using 16-bit fixed point on specialized hardware. This suggests that fixed point representations are better suited for low(er) precision deep learning. There have also been recent efforts exploring 8-bit floating point representations [4]; however, such schemes have the additional overhead of reduced precision since the exponent is replicated for each value, whereas a fixed point representation with a single shared exponent leaves more capacity for precision. Typically for deep neural networks, at reduced bit-widths it is desirable to preserve numerical precision, since the loss in range can be compensated by the dynamic scaling of the shared exponent.
Commonly, low-precision networks are designed to be trained from scratch, leveraging the inherent ability of the network to learn the approximations introduced by low-precision computations [14, 16, 9, 2, 8, 25, 6]. This can be prohibitive in applications which rely on using previously trained models; such use cases are typical in many edge-device deployments. To address such cases, FGQ is developed with the motivation of achieving state-of-the-art accuracies without any training, hence enabling the direct use of pre-trained models. This requirement makes the quantization scheme quite complex, but also more widely applicable, and it remains easily usable in the former case, with training from scratch.
Much of the recent reduced-precision work considers low precision only for the weights while retaining the activations in full precision [23, 21, 6, 15, 16, 14, 4]. Using low precision also for activations is essential to realize the full power-performance benefits of 2-bit (ternary) weights. The hardware needs to operate at a throughput that is close to the precision of the weights (i.e., better throughput compared to using 32-bit weights). This cannot be achieved (or would be very hard to achieve) when the activations are at full precision, because streaming 32-bit activations from main memory at that rate requires much higher bandwidth, and the compute engine needs to be much wider to deliver the desired throughput. All of this increases the area and power budget, which is not desirable when designing for low-power edge devices. Hence, reducing the size of activations is essential for reducing the compute requirements at the edge. Using 8 bits and below for activations dramatically reduces the design requirements at the edge and opens up the possibility of substantial throughput improvement. [14, 16] propose low-precision networks with binary weights, while retaining the activations in full precision. [14] uses a stochastic binarization scheme, achieving state-of-the-art (SOTA) accuracies on smaller datasets (MNIST, CIFAR-10, SVHN). [16] demonstrates near-SOTA accuracies on the large ImageNet dataset using the AlexNet topology; they also demonstrate a variant with binary weights and activations, where all computations simplify to bit-count operations, but with significant loss in accuracy. Lower precision for activations has also been used: [8] uses 1 bit for both weights and activations for smaller networks.
For larger ImageNet-class networks, [9] uses 2-bit activations and binary weights, showing reasonable accuracies. However, both [8] and [9] use specialized data representations requiring custom hardware for efficient implementation. Other solutions such as [24] employ a more tailored approach with different precision for each of the weights (1 bit), activations (2 bits) and gradients (6 bits), implemented with special-purpose hardware.
[13] introduces a theoretical formulation for ternary weight networks using a threshold-based approach (symmetric threshold $\pm\Delta$) with one scaling factor per layer. They provide an approximation to the optimal ternary representation assuming the weights follow a Gaussian distribution. However, one scaling factor per layer may not approximate the full-precision network well, as the model capacity is very limited in this case. To increase model capacity, [25] modify this solution to use two symmetric thresholds and two scaling factors (separately for positive and negative weights). However, despite improving the accuracy, this approach typically makes inference inefficient by requiring multiple passes over the positive and negative values, hence increasing the bandwidth requirements. [23] propose a post-facto incremental quantization approach, which aims to find the optimal representation using an iterative method, constraining weights to either 0 or powers of 2 using a 5-bit representation, while retaining full-precision activations. All the aforementioned implementations require partial or full training of the network in low precision. Alternatively, [15] used a log quantization method on pre-trained models and achieved good accuracy by tuning the bit length for each layer without retraining.
Achieving near-SOTA accuracy on the ImageNet dataset with deeper networks [7], without any training in low precision (for both weights and activations), is still a challenge. Our work is an attempt to address this problem and improve over existing approaches.
3 Ternary Conversion of Trained Network
Our goal is to convert the full-precision trained weights $\mathbf{W}$ to ternary values $\{-\alpha, 0, +\alpha\}$, $\alpha \ge 0$, without retraining. We use a threshold ($\Delta$) based approach similar to [13]: the $i$-th element $\hat{w}_i = \mathrm{sign}(w_i)$ if $|w_i| > \Delta$, and $0$ otherwise. Then, the element-wise error is $e_i = (w_i - \alpha \hat{w}_i)^2$, and an optimal ternary representation is as follows:
$$ \alpha^{*}, \Delta^{*} = \arg\min_{\alpha \ge 0,\, \Delta > 0} \; \| \mathbf{W} - \alpha \hat{\mathbf{W}} \|_2^2 \qquad (1) $$
where $n$ is the size of $\mathbf{W}$. We hypothesize that weights that learn different types of features may follow different distributions. Combining all the weights together represents a mixture of various distributions, and ternarizing them using a single threshold ($\Delta$) and magnitude ($\alpha$) may not preserve the distributions of the individual weights. Consequently, many weights are approximated poorly (if not pruned out entirely), leading to a loss of the valuable information that they learn. We may not be able to compensate for this loss of information, as we do not train the network in low precision.
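The threshold-based ternarization above can be sketched in a few lines. The function name is illustrative, not from the paper's code; for a fixed $\Delta$, the error-minimizing $\alpha$ is the mean magnitude of the surviving elements:

```python
import numpy as np

def ternarize(w, delta):
    """Threshold-based ternarization: w_hat_i = sign(w_i) if |w_i| > delta,
    else 0. For a fixed delta, the error-minimizing scale alpha is the mean
    magnitude of the elements that survive the threshold."""
    w_hat = np.where(np.abs(w) > delta, np.sign(w), 0.0)
    kept = w_hat != 0
    alpha = np.abs(w[kept]).mean() if kept.any() else 0.0
    return w_hat, alpha
```

The reconstruction is `alpha * w_hat`, so at inference time only one multiply per scale is needed; everything else is additions of $\pm 1$ terms.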
This motivates us to use a fine-grained quantization technique involving multiple scaling factors, in order to increase model capacity and better preserve the distributions learned by the filters. Moreover, we hypothesize that the positive and negative weight distributions are not always symmetric around the mean; further refinement of this solution may be possible using two separate thresholds, $\Delta_+$ and $\Delta_-$, for positive and negative weights respectively, along with a scaling factor to ternarize the weights.
3.1 Our Formulation
Computing a separate $\alpha$ and $\Delta$ for each weight would compensate for the information loss and better preserve the underlying distributions. However, such a solution, while showing significant improvement in accuracy, does not reduce the number of multiplications, leading to a less efficient implementation. Therefore, we seek a trade-off between achieving higher accuracy and reducing the total number of multiplications. We propose a fine-grained quantization approach that creates groups of weights and ternarizes each group independently. Let us consider the weights represented as a vector $\mathbf{w} \in \mathbb{R}^n$. We partition the set of indices $\{1, \ldots, n\}$ into $k$ disjoint subsets $S_1, \ldots, S_k$, with cardinalities $n_1, \ldots, n_k$, such that $S_i \cap S_j = \emptyset$ for $i \ne j$, $\cup_i S_i = \{1, \ldots, n\}$, and $\sum_i n_i = n$. We can decompose $\mathbf{w}$ into $k$ orthogonal vectors $\mathbf{w}^{(i)} \in \mathbb{R}^n$, $i = 1, \ldots, k$, where the $j$-th component $w^{(i)}_j = w_j$ if $j \in S_i$, and $0$ otherwise. Clearly, $\sum_i \mathbf{w}^{(i)} = \mathbf{w}$; we then ternarize each orthogonal component as $\alpha_i \hat{\mathbf{w}}^{(i)}$, where the components of $\hat{\mathbf{w}}^{(i)}$ are in $\{-1, 0, +1\}$. Threshold-based pruning never turns a $0$ component into a non-zero one, so the orthogonality $\hat{\mathbf{w}}^{(i)\top} \hat{\mathbf{w}}^{(j)} = 0$, for $i \ne j$, holds, and we have $\| \mathbf{w} - \sum_i \alpha_i \hat{\mathbf{w}}^{(i)} \|_2^2 = \sum_i \| \mathbf{w}^{(i)} - \alpha_i \hat{\mathbf{w}}^{(i)} \|_2^2$. Then, for a given group of filters,
$$ \min_{\{\alpha_i, \hat{\mathbf{w}}^{(i)}\}} \Big\| \mathbf{w} - \sum_{i=1}^{k} \alpha_i \hat{\mathbf{w}}^{(i)} \Big\|_2^2 \;=\; \sum_{i=1}^{k} \min_{\alpha_i, \hat{\mathbf{w}}^{(i)}} \big\| \mathbf{w}^{(i)} - \alpha_i \hat{\mathbf{w}}^{(i)} \big\|_2^2 \qquad (2) $$
Therefore, we need to solve $k$ independent subproblems. This formulation allows a better ternary approximation to the original full-precision weights, ensuring that they remain within a neighborhood of the original solution in the complex search space of parameters, despite no retraining. Consequently, we expect the full-precision solution and its ternary counterpart to generalize in a similar manner. From a model capacity point of view, without such grouping a ternary weight vector can have only three distinct values, $\{-\alpha, 0, +\alpha\}$. With $k$ groups, however, we can represent $2k+1$ distinct values, thus increasing model capacity linearly with the number of groups.
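A minimal sketch of the grouped decomposition: each disjoint group gets its own $(\Delta_i, \alpha_i)$ and is ternarized independently. The $0.7 \cdot \mathrm{mean}(|\cdot|)$ per-group threshold below is a heuristic borrowed from the Gaussian analysis of [13] and used only for illustration; the paper instead searches for the optimal threshold per group.

```python
import numpy as np

def fgq_ternarize(w, group_size):
    """Partition a flat weight vector into disjoint contiguous groups and
    ternarize each group independently with its own threshold and scale,
    solving formulation (2) group by group."""
    w = np.asarray(w, dtype=np.float64)
    w_hat = np.zeros_like(w)
    alphas = []
    for start in range(0, w.size, group_size):
        g = w[start:start + group_size]
        delta = 0.7 * np.abs(g).mean()      # heuristic per-group threshold
        kept = np.abs(g) > delta
        alpha = np.abs(g[kept]).mean() if kept.any() else 0.0
        w_hat[start:start + group_size][kept] = np.sign(g[kept])
        alphas.append(alpha)
    return w_hat, np.array(alphas)
```

With $k$ groups the quantized tensor can take up to $2k+1$ distinct values instead of three, which is exactly the capacity argument above.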
We can solve each subproblem using a threshold-based approach in the following way. We are given a vector of $m$ elements $\mathbf{v} \in \mathbb{R}^m$, and we can use separate thresholds $\Delta_+$ and $\Delta_-$ for positive and negative weights, along with one scaling factor $\alpha$, to ternarize $\mathbf{v}$. Let $I_{\Delta} = \{ j : v_j > \Delta_+ \text{ or } v_j < -\Delta_- \}$, for some $\Delta_+, \Delta_- > 0$. We want to solve
$$ \alpha^{*}, \Delta_+^{*}, \Delta_-^{*} = \arg\min_{\alpha \ge 0,\; \Delta_+, \Delta_- > 0} \; \| \mathbf{v} - \alpha \hat{\mathbf{v}} \|_2^2 \qquad (3) $$
We have the following analytical solution.
$$ \alpha^{*} = \frac{1}{|I_{\Delta}|} \sum_{j \in I_{\Delta}} |v_j|, \qquad \Delta_+^{*}, \Delta_-^{*} = \arg\max_{\Delta_+, \Delta_- > 0} \; \frac{1}{|I_{\Delta}|} \Big( \sum_{j \in I_{\Delta}} |v_j| \Big)^{2} \qquad (4) $$
Note that for $\Delta_+ = \Delta_- = \Delta$, $I_{\Delta} = \{ j : |v_j| > \Delta \}$, and (4) reproduces the formulation in [13]:
$$ \Delta^{*} = \arg\max_{\Delta > 0} \; \frac{1}{|I_{\Delta}|} \Big( \sum_{j \in I_{\Delta}} |v_j| \Big)^{2} \qquad (5) $$
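Because each group is small, the per-group problem (3) can be solved by brute force: enumerate candidate thresholds from the group's own values (plus infinity, to allow pruning one side entirely) and keep the pair with the lowest squared error. This is a sketch under our reading of the formulation; names are illustrative.

```python
import numpy as np

def ternarize_asymmetric(g):
    """Brute-force search over candidate (delta_plus, delta_minus) thresholds
    for a small group g. For fixed thresholds, the optimal alpha is the mean
    magnitude of the kept elements; return the lowest-error solution."""
    cand_p = np.concatenate([np.unique(g[g > 0]), [np.inf]])
    cand_n = np.concatenate([np.unique(-g[g < 0]), [np.inf]])
    best_err, best_hat, best_alpha = np.sum(g ** 2), np.zeros_like(g), 0.0
    for dp in cand_p:
        for dn in cand_n:
            kept = (g >= dp) | (g <= -dn)
            if not kept.any():
                continue  # everything pruned; covered by the initial baseline
            alpha = np.abs(g[kept]).mean()
            g_hat = np.where(kept, np.sign(g), 0.0)
            err = np.sum((g - alpha * g_hat) ** 2)
            if err < best_err:
                best_err, best_hat, best_alpha = err, g_hat, alpha
    return best_hat, best_alpha, best_err
```

The search is quadratic in the group size, which is cheap for the group sizes considered here (N = 4 to 64) and is done once, offline, per group.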
The advantage of our formulation (2) is that the smaller independent subproblems can be solved efficiently using brute-force methods to achieve a better approximation. However, we also explore analytical solutions to establish the theoretical veracity of our approach. Assuming that the magnitudes of the learned weights follow an exponential distribution with parameter $\lambda$, we analytically derive the optimal $\Delta^{*}$ from the following lemma.
Lemma 1.
Using the above notations, if $|v_j| \sim \exp(\lambda)$, then $\Delta^{*}$ in (5) is $\Delta^{*} = 1/\lambda$.
From this analysis, we see the need for a higher threshold value to prune a larger number of smaller elements. This is intuitive from the shape of the model distributions, which are typically heavy-tailed. In reality, however, it may not be appropriate to use a single distribution to model the weights of all the layers of a neural network. We can apply the Kolmogorov-Smirnov (KS) test as a goodness-of-fit measure to identify an appropriate reference distribution (here we choose between Gaussian and exponential), and find $\Delta^{*}$ accordingly. We approximate a heavy-tailed distribution by an exponential one by pruning out some of the smaller elements; this gives us an exponential approximation with a smaller $\lambda$. Further, we can use maximum likelihood to estimate the parameters of these distributions: for the Gaussian case, $\hat{\sigma}$ is estimated from the sample second moment, and for the exponential case, $\hat{\lambda} = 1/\overline{|v|}$, the reciprocal of the mean magnitude. Based on such refined analysis, we observe significant improvement in the theoretical ternary error over the Gaussian assumption of [13] (Figure 1). It is interesting to observe that for the earlier convolution layers of ResNet-101 trained on ImageNet, the magnitudes of the weights follow an exponential distribution, while later layer weights are Gaussian.
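One way to implement the goodness-of-fit step is to compute the one-sample KS statistic of $|w|$ against each candidate reference CDF with MLE parameters and pick the smaller. Using a half-normal CDF for the Gaussian case, and all function names, are our own illustrative choices, not code from the paper:

```python
import numpy as np
from math import erf, sqrt

def ks_statistic(samples, cdf):
    """One-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDF of the samples and a reference CDF."""
    x = np.sort(np.asarray(samples, dtype=np.float64))
    n = x.size
    c = np.array([cdf(v) for v in x])
    hi = np.max(np.arange(1, n + 1) / n - c)   # ECDF above reference
    lo = np.max(c - np.arange(0, n) / n)       # reference above ECDF
    return max(hi, lo)

def pick_magnitude_model(w):
    """Fit exponential (lam = 1/mean|w|) and half-normal (sigma^2 = mean w^2)
    models to |w| and choose the one with the smaller KS statistic."""
    m = np.abs(np.asarray(w, dtype=np.float64))
    lam = 1.0 / m.mean()
    sigma = sqrt(float((m ** 2).mean()))
    d_exp = ks_statistic(m, lambda x: 1.0 - np.exp(-lam * x))
    d_gauss = ks_statistic(m, lambda x: erf(x / (sigma * sqrt(2.0))))
    return ('exponential', lam) if d_exp < d_gauss else ('gaussian', sigma)
```

The selected model then determines which analytical threshold is used, e.g. $\Delta^{*} = 1/\hat{\lambda}$ in the exponential case.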
3.2 Weight Grouping
Our method (2) is agnostic to how the (full-precision) weights are grouped, but leverages the consequence of grouping, which allows solving these independent subproblems more efficiently. The specifics of the grouping mechanism and the memory layout used for accessing these groups of weights are an independent problem to explore. The primary objective of grouping is to minimize the dynamic range within each group and to split the weights in such a way that the smaller groups have a uniform distribution. This helps reduce the complexity of finding an optimal solution ($\alpha$, $\Delta$) for each independent subproblem using either analytical or brute-force techniques.
However, to realize the full performance potential of ternarization, it is essential to ensure that the grouping mechanism itself does not introduce significant overhead. Similarity-based clustering algorithms such as K-means, despite being better at finding an optimal grouping of weights that may even lead to better accuracy, are not friendly to efficient implementation (in both software and hardware) because of the random grouping of elements from non-contiguous memory locations. This leads to irregular memory accesses with longer latencies to gather arbitrarily grouped weights that use a common $\alpha$ for a partial accumulation of the output.
Based on our empirical observations, we conclude that using static groups of weights partitioned along input channels achieves the best accuracy. The corresponding elements of multiple filters have significantly less variance, since they correspond to similar input features; hence, grouping such elements results in a reduced dynamic range within the group. Such a grouping also lends itself to efficient implementation on existing hardware, and in software, by using a weight tensor layout in which groups of elements are accessed from contiguous memory locations. Since the elements within a group accumulate to the same output feature, this layout is also amenable to efficient vectorization. Figure 2 shows an example of this grouping scheme applied to filters. Each group of ternary filters has scaling factors ($\alpha$) corresponding to each element of the filter.
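A sketch of this grouping applied to a convolution weight tensor: for a tensor of shape (Kout, C, R, S), the same (c, r, s) element across N consecutive filters shares one scaling factor, so the scales form a (Kout/N, C, R, S) tensor. The shape convention and the 0.7-times-mean threshold are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def group_ternarize_conv(w, n):
    """Ternarize a conv weight tensor (Kout, C, R, S), grouping the same
    (c, r, s) element across n consecutive filters under one shared alpha.
    Scaling factors come out with shape (Kout//n, C, R, S)."""
    kout = w.shape[0]
    assert kout % n == 0, "Kout must be divisible by the group size"
    g = w.reshape(kout // n, n, *w.shape[1:])
    delta = 0.7 * np.mean(np.abs(g), axis=1, keepdims=True)  # heuristic
    g_hat = np.where(np.abs(g) > delta, np.sign(g), 0.0)
    cnt = np.maximum((g_hat != 0).sum(axis=1, keepdims=True), 1)
    alpha = (np.abs(g) * (g_hat != 0)).sum(axis=1, keepdims=True) / cnt
    return g_hat.reshape(w.shape), alpha.squeeze(1)
```

Because the n grouped elements are contiguous along the output-filter axis, the layout maps directly to vectorized accumulate-then-scale loops.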
4 Experimental Results
For our experimental results, we focus on Resnet-50 and Resnet-101 [7] using the ILSVRC-2012 [3] dataset, to demonstrate the efficacy of our method on large, sophisticated models using 2-bit weights and 8-bit activations (2w8a). We extend this study by applying FGQ to activations to further reduce their precision to 4 bits (2w4a), and show results comparable with 8-bit activations for all the tested networks. Further, towards establishing the broader applicability of FGQ, we demonstrate state-of-the-art accuracy also for AlexNet [12].
Our setup consists of a modified version of Caffe [10] that emulates low-precision dynamic fixed point (DFP; note that fixed point and dynamic fixed point are used interchangeably) computations, described in Fig. 3. We use a 32-bit accumulator for all our low-precision computations to minimize the chances of an overflow. We split the pre-trained weights into groups of N elements using the mechanism described in Section 3.2, and use a brute-force technique to compute the floating point values of the threshold ($\Delta$) and scaling factor ($\alpha$) for each group. The scaling factors are then quantized to 8/4-bit fixed point, and the weights are stored in the memory format described in Section 3.2. The activations are quantized to 8/4 bits before performing the convolution operation, and the 32-bit outputs are down-converted to 8 bits and appropriately rounded before they are passed to the next layer.
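The dynamic fixed point emulation can be sketched as follows: each tensor gets one shared exponent chosen so its maximum magnitude fits the available integer bits. This is our own minimal sketch of DFP quantization, not the modified Caffe code:

```python
import numpy as np

def quantize_dfp(x, bits):
    """Dynamic fixed point: choose a shared exponent so max|x| fits in the
    signed integer range, then round. Represented value is q * 2**exp."""
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return np.zeros(np.shape(x), dtype=np.int32), 0
    exp = int(np.ceil(np.log2(max_abs))) - (bits - 1)  # shared exponent
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / 2.0 ** exp), lo, hi).astype(np.int32)
    return q, exp

def dequantize_dfp(q, exp):
    return q.astype(np.float64) * 2.0 ** exp
```

In this picture, down-converting the 32-bit accumulator outputs to 8 bits amounts to re-quantizing with bits=8 and a new shared exponent.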
Our experiments indicate that it is essential to use higher precision for the first layer (8w8a) to minimize the accumulation of quantization loss. We also observe that using pre-trained parameters in batch normalization layers leads to a loss of accuracy, due to the shift in variance introduced by quantization. We prevent this loss by recomputing the batch normalization parameters during the inference phase to compensate for the shift in variance.
We explored the accuracy-performance trade-off using different group sizes. Our experiments show that FGQ with a group size of N=4 (FGQ-N4) achieves the highest accuracy with no retraining, along with a potential performance benefit. FGQ-N4 applied to a pre-trained Resnet-101 model with 2w8a achieves a Top-1 accuracy of 73.85%, which is within 3.65% of the full-precision result. With activations reduced to 4 bits (2w4a), the Top-1 accuracy drops to 70.69%. FGQ-N4 performs equally well on Resnet-50, achieving a Top-1 accuracy of 70.76% with 2w8a, which is 4.29% off from the full-precision result, and 68.38% with 2w4a. To the best of our knowledge, these are the highest reported accuracies using 2w8a and 2w4a on the ImageNet dataset [3] with state-of-the-art networks.
To understand the general applicability of our method to a wider range of networks, we apply FGQ to the smaller AlexNet [12] model. FGQ-N4 applied to a pre-trained AlexNet model achieves 49.04% Top-1 accuracy with 2w8a without any retraining, which is 7.79% away from the baseline full-precision result (56.83%). With 2w4a we do not see any further reduction in accuracy. There are no previously published results directly comparable to FGQ, which quantizes pre-trained models and works with an end-to-end low-precision pipeline. Hence, we compare with [21, 23, 24], which are the closest in terms of the networks used and/or the target precision. Our AlexNet result using FGQ-N4 is comparable to the previously published result of [24] (50.3%) using 1w4a, which additionally employs training in low precision with full-precision gradients. Table 1 includes a comparison with previously reported results from [23] using 5-bit weights and [21] using ternary weights. While they report slightly better absolute numbers, our numbers are relatively competitive because both of these results use full-precision activations and train the network in low precision to achieve those numbers. (It should be noted that these works use Resnet-50 with slight variations and hence have slightly different baseline accuracies.) Even without any low-precision training, and with reduced precision for activations, FGQ results are competitive with the aforementioned results. With additional low-precision training with FGQ, we are able to significantly improve accuracy and get closer to state-of-the-art full-precision results, as outlined in the next section, along with the associated performance implications.
Table 1: Top-1 accuracy of FGQ-N4 (no low-precision training) compared with previously reported results that use low-precision (re)training.

| Networks | Our Baseline | FGQ-N4 2w8a | FGQ-N4 2w4a | INQ 5w32a [23] | dLAC 2w32a [21] | DoReFa 1w4a32g [24] |
| --- | --- | --- | --- | --- | --- | --- |
| Resnet-101 | 77.50% | 73.85% | 70.69% | - | - | - |
| Resnet-50 | 75.05% | 70.76% | 68.38% | 74.81% | 73.85% | - |
| AlexNet | 56.83% | 49.04% | 49.00% | - | - | 50.3% |
4.1 Discussion
In order to realize the full performance potential of ternary networks, the inference platform needs to operate at a throughput that is close to the precision of the weights. With full-precision activations, this would increase the memory bandwidth required to stream activations and demand a much wider compute engine to deliver the desired compute throughput. Building such a solution around full-precision activations would be prohibitive in terms of area and power requirements, whereas it is far more tractable when the activations are 8 or 4 bits.
Figure 3(a) shows the performance vs. accuracy trade-off for Resnet-50 [18] for an FGQ-based 8-bit inference design. Our model projects the lower bound of the performance potential based on the percentage of FMA operations that can be converted into ternary accumulations at each group size N. In the ideal case, where N is equal to the total number of weights in the layer, nearly all FMA operations become ternary accumulations, giving the best-case performance potential relative to the baseline full-precision performance. For a group size of N=4, 75% of all FMA operations can be performed in ternary. Using slightly larger sub-groups of N=8, we can replace 87.5% of FMA operations with ternary ones, while losing a small amount of additional Top-1 accuracy. At group size N=64, 98.4% of all FMA operations can be replaced by ternary accumulations, resulting in a potentially large improvement in performance. However, this performance comes at the cost of a significant drop in accuracy: using larger groups of weights results in a poorer ternary approximation to the full-precision model. Consequently, the ternary solution moves away from the full-precision local optimum and displays different generalization behavior.
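The multiply-elimination arithmetic behind these percentages is simple: with one shared scale per group of N weights, a dot product needs one multiply per group instead of one per weight. A tiny helper (ours, for illustration):

```python
def multiply_reduction(group_size):
    """Fraction of multiplies eliminated when N weights share one scaling
    factor: N multiplies collapse into N ternary adds plus 1 multiply."""
    return (group_size - 1) / group_size

# e.g. N=4 -> 0.75, N=8 -> 0.875, N=64 -> 0.984375
```

This is why accuracy-friendly small groups (N=4) already remove three quarters of the multiplies, while the marginal gain from growing N beyond 64 is tiny.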
We trained low-precision (2w8a) Resnet-50 [18] at group size N=64 on the ImageNet [3] dataset to recover the accuracy lost to ternarization. We initialized the network with a pre-trained full-precision model and fine-tuned the parameters of the low-precision network. We reduced the learning rate to the order of 1e-4 to avoid exploding gradients, retained all the other hyper-parameters from full-precision training, and performed gradient updates in full precision. After training for 20 epochs, we recovered most of the lost accuracy, bringing the Top-1 accuracy to within a few percent of the full-precision baseline. Figure 3(b) shows the reduction of training error and the improvement in validation accuracy.
5 Conclusion
We propose a fine-grained ternarization method which exploits local correlations in the dynamic range of the parameters to minimize the impact of quantization on overall accuracy. We demonstrate near-SOTA accuracy on the ImageNet dataset using pre-trained models, quantized without retraining. Using ternary weights on Resnet-101 and Resnet-50 with 8-bit activations, our results are within about 4% of the full-precision accuracy; using 4-bit activations, we see only a further small drop in accuracy. To the best of our knowledge, these are the highest reported accuracies using ternary weights and low-precision activations.
Our weight-grouping based approach allows us to obtain solutions that can be tailored for specific hardware, as well as used on general-purpose hardware, depending on the accuracy and performance requirements. Smaller group sizes such as N=4 achieve the best accuracy and perform 75% of the computations as ternary operations (simple 8-bit additions); this is better suited for implementation on specialized hardware. Larger group sizes are more suitable for current general-purpose hardware, with a larger portion of the computations as low-precision operations (98.4% for N=64), although this comes at the cost of reduced accuracy. This gap may be bridged with additional low-precision training, as shown in Section 4. Our final quantized model can run efficiently on a full 8-bit compute pipeline, thus offering a potential performance-power benefit.
We continue to actively work on closing the current accuracy gap, exploring both low-precision (re)training and extensions to the FGQ method itself. We are also looking into a deeper theoretical exploration to better understand the formal relationship between weight grouping and final accuracy, with an attempt to establish realistic bounds for given network-performance-accuracy requirements.
References
 [1] Yoshua Bengio, Ian Goodfellow, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2016.
 [2] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 [3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [4] Tim Dettmers. 8-bit approximations for parallelism in deep learning. arXiv preprint arXiv:1511.04561, 2015.
 [5] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, pages 1737–1746, 2015.
 [6] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
 [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [8] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115, 2016.
 [9] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
 [10] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 [11] N Jouppi. Google supercharges machine learning tasks with tpu custom chip. Google Blog, May, 18, 2016.
 [12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [13] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
 [14] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
 [15] Daisuke Miyashita, Edward H Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.
 [16] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
 [17] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [18] Marcel Simon, Erik Rodner, and Joachim Denzler. Imagenet pretrained models with batch normalization. arXiv preprint arXiv:1612.01452v2, 2016.
 [19] Yaman Umuroglu, Nicholas J Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. Finn: A framework for fast, scalable binarized neural network inference. arXiv preprint arXiv:1612.07119, 2016.
 [20] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4. Citeseer, 2011.
 [21] Ganesh Venkatesh, Eriko Nurvitadhi, and Debbie Marr. Accelerating deep convolutional networks using lowprecision and sparsity. arXiv preprint arXiv:1610.00324, 2016.
 [22] Darrell Williamson. Dynamically scaled fixed point arithmetic. In Communications, Computers and Signal Processing, 1991., IEEE Pacific Rim Conference on, pages 315–318. IEEE, 1991.
 [23] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with lowprecision weights. poster at International Conference on Learning Representations, 2017.
 [24] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
 [25] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
Appendix
A.1 Proof of Lemma 1
Let $n$ denote the number of elements. Let $f(w) = \lambda e^{-\lambda w}$ be the pdf of the exponential distribution with parameter $\lambda$, and $F(w) = 1 - e^{-\lambda w}$ be its cdf. Then,
$$ E(|I_{\Delta}|) = n\,(1 - F(\Delta)) = n\,e^{-\lambda \Delta}. $$
Furthermore,
$$ E\Big( \sum_{j \in I_{\Delta}} |v_j| \Big) = n \int_{\Delta}^{\infty} w\,\lambda e^{-\lambda w}\,dw = n\,e^{-\lambda \Delta} \Big( \Delta + \frac{1}{\lambda} \Big). $$
Then, the objective in (5) is, in expectation,
$$ \frac{1}{E(|I_{\Delta}|)} \Big( E\Big( \sum_{j \in I_{\Delta}} |v_j| \Big) \Big)^{2} = n\,e^{-\lambda \Delta} \Big( \Delta + \frac{1}{\lambda} \Big)^{2}. $$
Therefore, setting its derivative with respect to $\Delta$ to zero,
$$ n\,e^{-\lambda \Delta} \Big( \Delta + \frac{1}{\lambda} \Big) \big( 1 - \lambda \Delta \big) = 0, $$
which gives $\Delta^{*} = 1/\lambda$.