Ternary Neural Networks with Fine-Grained Quantization
We propose a novel fine-grained quantization (FGQ) method to ternarize pre-trained full precision models, while also constraining activations to 8 and 4 bits. Using this method, we demonstrate minimal loss in classification accuracy on state-of-the-art topologies without additional training. We provide an improved theoretical formulation that forms the basis for a higher quality solution using FGQ. Our method involves ternarizing the original weight tensor in groups of N weights. Using N=4, we achieve Top-1 accuracy within and of the baseline full precision result for Resnet-101 and Resnet-50 respectively, while eliminating of all multiplications. These results enable a full 8/4-bit inference pipeline, with the best reported accuracy using ternary weights on the ImageNet dataset, with a potential of improvement in performance. Also, for smaller networks like AlexNet, FGQ achieves state-of-the-art results. We further study the impact of group size on both performance and accuracy. With a group size of N=64, we eliminate of the multiplications; however, this introduces a noticeable drop in accuracy, which necessitates fine-tuning the parameters at lower precision. We address this by fine-tuning Resnet-50 with 8-bit activations and ternary weights at N=64, improving the Top-1 accuracy to within of the full precision result with additional training overhead. Our final quantized model can run on a full 8-bit compute pipeline using 2-bit weights and has the potential of up to improvement in performance compared to baseline full-precision models.
Today’s deep learning models achieve state-of-the-art results on a wide variety of tasks including Computer Vision, Natural Language Processing, Automatic Speech Recognition and Reinforcement Learning 2016DlBook (). Mathematically, this involves solving a non-convex optimization problem with on the order of millions of parameters or more. Solving this optimization problem, also referred to as training the neural network, is a compute-intensive process that, for current state-of-the-art networks, requires days to weeks. Once trained, the network evaluates a function on specific input data, which is referred to as inference. While the compute intensity of inference is much lower than that of training, inference is performed on a large number of inputs, so the total computing resources spent on inference are likely to dwarf those spent on training. The large and somewhat unique compute requirements of both deep learning training and inference motivate the use of customized low precision arithmetic hubara2016qnn (); courbariaux2016bnn (); hubara2016bnn (); zhou2016dorefa (); logqunat (); 2016twn () and specialized hardware to run these computations as efficiently as possible gupta2015lp (); zhu2016ttq (); venkatesh2016 (); finn2016 (); tpuGblog (). Most of these approaches require partial or full training of the network in low precision. Training at low precision allows the network to implicitly learn the low precision representation (along with the inherent noise); however, it introduces significant resource overheads which can be prohibitive for many resource-constrained applications, specifically those involving edge devices.
Reducing precision for both weights and activations has significant power-performance implications for system design. Low precision not only allows increasing compute density, but also reduces pressure on the memory sub-system. Most of the current solutions are focused on compressing the model NNFM2015 (); rastegariECCV16 (), going as low as binary weights, which allows storing the model in the limited on-chip local memory. However, activations (inputs) need to be fetched from external memory or an I/O device (camera). Fetching data accounts for the majority of the system power consumption. Hence, reducing the size of activations is essential for more efficient utilization of the available computational resources. There have been a few solutions hubara2016bnn (); hubara2016qnn () using lower precision representations for activations; however, they necessitate specialized hardware for efficient implementation. Further, with the widespread adoption of deep learning across various applications, such as autonomous driving and augmented reality, there is an increased demand for inference tasks to be done efficiently on edge devices. To address both the aforementioned system and application requirements, there is a general trend to move towards a full lower precision inference pipeline tpuGblog (). This is evident from the advent of 8-bit and sub 8-bit hardware such as Google’s TPU tpuGblog () and other mainstream GPU (https://devblogs.nvidia.com/parallelforall/new-pascal-gpus-accelerate-inference-in-the-data-center/) and CPU offerings. Further, there is also software support for 8-bit inference through popular frameworks such as TensorFlow and Theano, and compute libraries like NVidia’s TensorRT.
In this paper, we focus on enabling a sub 8-bit inference pipeline by using ternary weights and 8/4-bit activations, with minimal or no re-training, while still achieving near state-of-the-art accuracy. The rationale behind our approach is to carefully convert the full-precision weights to low precision, such that the element-wise distance between full-precision and low-precision weights is small. Consequently, the low-precision weights remain in the neighborhood of the pre-trained full-precision weights in the search space of network parameters, and we expect them to generalize in a similar manner, despite no re-training.
We summarize our contributions below:
- Based on an improved theoretical formulation, we propose a novel fine-grained quantization (FGQ) method to convert pre-trained models to a ternary representation with minimal loss in test accuracy, without re-training.
- With ternary weights, we achieve classification accuracy (Top-1) of with 8-bit activations (2w-8a) and with 4-bit activations (2w-4a), on the ImageNet dataset imagenet_data () using a pre-trained Resnet-101 model (no re-training). To the best of our knowledge, these are the highest reported accuracies in this category on the ImageNet dataset imagenet_contest ().
- We demonstrate the general applicability of FGQ, with state-of-the-art results (2w-8a, 2w-4a) on smaller models such as Resnet-50 and AlexNet alexnet (), and also show the efficacy of using FGQ for (re)training at low precision.
- We study the performance-accuracy trade-off using different group sizes. For a group of N filters, we reduce the number of multiplications to one in every N additions, thus significantly reducing computational complexity, with a potential for up to improvement over baseline full precision models.
The rest of the paper is organized as follows. Section 2 discusses related work on ternary weights and low precision inference, and contrasts it with FGQ. Section 3 describes the FGQ formulation and the theoretical basis for this method. This is followed by Section 4, which includes experimental results and related discussion. Finally, in Section 5 we conclude by summarizing the implications of FGQ, our results and future research directions.
2 Related Work
Deep learning inference using low-precision weights and activations is a well-researched topic. Many researchers have experimented with custom data representations to perform deep learning tasks and have shown significant benefits over the general purpose floating point representation. vanhoucke2011improving () have shown that an 8-bit dynamically scaled fixed point representation williamson1991dynamically () can be used to speed up convolutional neural networks on general purpose CPU hardware by carefully choosing the right data layout, batching the computation and using an implementation optimized for the target hardware. With this they show up to improvement over an aggressively tuned floating point implementation. gupta2015lp () have done a comprehensive study on the effect of low precision fixed point computation for deep learning and have successfully trained smaller networks using 16-bit fixed point on specialized hardware. This suggests that fixed point representations are better suited for low(er) precision deep learning. There have also been recent efforts exploring 8-bit floating point representations dettmers20158 (); however, such schemes have the additional overhead of reduced precision, since the exponent is replicated for each value. With a fixed point representation, in contrast, a single shared exponent leaves more bits for precision. Typically, for deep neural networks with reduced bit-widths, it is desirable to preserve numerical precision, since the loss in range can be compensated by the dynamic scaling of the shared exponent.
Commonly, low precision networks are designed to be trained from scratch, leveraging the inherent ability of the network to learn the approximations introduced by low precision computations NNFM2015 (); rastegariECCV16 (); hubara2016qnn (); courbariaux2016bnn (); hubara2016bnn (); zhu2016ttq (); han2015learning (). This can be prohibitive in applications which rely on previously trained models. Such use cases are typical in many edge device deployments. To address such cases, FGQ is developed with the goal of achieving state-of-the-art accuracies without any training, hence enabling direct use of pre-trained models. This requirement makes the quantization scheme more complex, but also more widely applicable, and the scheme remains easy to use in the former case of training from scratch.
Much of the recent reduced precision work considers low precision only for the weights while retaining the activations in full precision inq (); venkatesh2016 (); han2015learning (); logqunat (); rastegariECCV16 (); NNFM2015 (); dettmers20158 (). Using low precision for activations as well is essential to realize the full power-performance benefits of 2-bit (ternary) weights. The hardware needs to operate at a throughput that is close to the precision of the weights (i.e. better throughput compared to using 32-bit weights). This cannot be achieved (or would be very hard to achieve) when the activations are at full precision, because streaming 32-bit activations from main memory at that rate requires much higher() bandwidth and the compute engine needs to be much wider to deliver the desired throughput. All of this increases the area and power budget, which is not desirable when designing for low-power edge devices. Hence, reducing the size of activations is essential for reducing the compute requirements at the edge. Using 8 bits and below for activations dramatically reduces the design requirements on the edge and opens up the possibility of achieving throughput improvement.
NNFM2015 (); rastegariECCV16 () propose low precision networks with binary weights, while retaining the activations in full precision. NNFM2015 () use a stochastic binarization scheme, achieving state-of-the-art (SOTA) accuracies on smaller datasets (MNIST, CIFAR10, SVHN). rastegariECCV16 () demonstrate near-SOTA accuracies on the large ImageNet dataset using the AlexNet topology. Further, they also demonstrate a variant with binary weights and activations, in which all computations are simplified to bit-count operations, but with a significant loss in accuracy. Lower precision for activations has also been used: hubara2016bnn () use 1-bit for both weights and activations for smaller networks. For larger ImageNet-class networks, hubara2016qnn () use 2-bit activations and binary weights, showing reasonable accuracies. However, both hubara2016bnn (); hubara2016qnn () use specialized data representations requiring custom hardware for efficient implementation. Other solutions such as zhou2016dorefa () employ a more tailored approach with different precision for weights (1-bit), activations (2-bits) and gradients (6-bits), implemented with special-purpose hardware.
2016twn () introduces a theoretical formulation for ternary weight networks using a threshold based approach (symmetric threshold $\pm\Delta$) with one scaling factor for each layer. They provide an approximation to the optimal ternary representation assuming that the weights follow a Gaussian distribution. However, one scaling factor per layer may not approximate the full-precision network well, as the model capacity is very limited in this case. To increase model capacity, zhu2016ttq () modify this solution to use two symmetric thresholds () and two scaling factors (separately for positive and negative weights). However, despite improving the accuracy, this approach typically makes inference inefficient by requiring multiple passes over the positive and negative values, hence increasing the bandwidth requirements. inq () have proposed a post-facto incremental quantization approach, which aims to find the optimal representation using an iterative method, constraining weights to either 0 or powers of 2, using a 5-bit representation and re-training with full-precision activations. All the aforementioned implementations require partial or full training of the network in low precision. Alternatively, logqunat () used a log quantization method on pre-trained models and achieved good accuracy by tuning the bit length for each layer without re-training.
Achieving near-SOTA accuracy on the Imagenet dataset with deeper networks resnet (), without any training in low precision (for both weights and activations) is still a challenge. Our work is an attempt to address this problem and improve over existing approaches.
3 Ternary Conversion of Trained Network
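Our goal is to convert the full-precision trained weights $W$ to ternary values $\{-\alpha, 0, +\alpha\}$, $\alpha \ge 0$, without re-training. We use a threshold ($\Delta > 0$) based approach similar to 2016twn (): the $i$-th element $\hat{W}_i = \mathrm{sign}(W_i)$ if $|W_i| > \Delta$, and $0$ otherwise. Then, the element-wise error is $E(\alpha, \Delta) = \|W - \alpha\hat{W}\|_F^2$ and an optimal ternary representation $\alpha^*\hat{W}^* \approx W$ is as follows:
$$\alpha^*, \Delta^* = \arg\min_{\alpha \ge 0,\ \Delta > 0} E(\alpha, \Delta), \quad \text{s.t. } \hat{W}_i \in \{-1, 0, +1\},\ i = 1, 2, \dots, n, \tag{1}$$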
where $n$ is the size of $W$ ($W \in \mathbb{R}^n$). We hypothesize that weights that learn different types of features may follow different distributions. Combining all the weights together represents a mixture of various distributions, and ternarizing them using a single threshold ($\Delta$) and magnitude ($\alpha$) may not preserve the distributions of individual weights. Consequently, many weights are approximated poorly (if not pruned out entirely), leading to a loss of the valuable information that they have learned. We may not be able to compensate for this loss of information, as we do not train the network in low precision.
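As an illustration of the single-threshold ternarization discussed above, the following is a minimal plain-Python sketch. The function names and the exhaustive search over candidate thresholds are assumptions made here for illustration and are not taken from the authors' code. For a fixed set of retained weights, the least-squares optimal scaling factor is the mean magnitude of those weights, so only the threshold has to be searched.

```python
import random

def ternarize(weights, delta):
    """Threshold ternarization: return (alpha, codes) with codes in {-1, 0, +1}."""
    codes = [(1 if w > delta else -1 if w < -delta else 0) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    # For fixed codes, the least-squares optimal scale is the mean kept magnitude.
    alpha = sum(kept) / len(kept) if kept else 0.0
    return alpha, codes

def ternary_error(weights, alpha, codes):
    """Squared error ||W - alpha * W_hat||^2."""
    return sum((w - alpha * c) ** 2 for w, c in zip(weights, codes))

def best_threshold(weights):
    """Exhaustive search; the error only changes when delta crosses a magnitude."""
    candidates = [0.0] + sorted(abs(w) for w in weights)[:-1]
    best = None
    for delta in candidates:
        alpha, codes = ternarize(weights, delta)
        err = ternary_error(weights, alpha, codes)
        if best is None or err < best[0]:
            best = (err, delta, alpha, codes)
    return best

# Hypothetical example data: 256 zero-mean Gaussian "weights".
random.seed(0)
w = [random.gauss(0.0, 0.05) for _ in range(256)]
err, delta, alpha, codes = best_threshold(w)
```

The candidate `0.0` corresponds to keeping every weight (no pruning); every other candidate prunes all weights up to and including one of the observed magnitudes.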
This motivates us to use a fine-grained quantization technique involving multiple scaling factors in order to increase model capacity, which can lead to better preservation of the distributions learned by the filters. Moreover, we hypothesize that positive and negative weight distributions are not always symmetric around the mean, so further refinement of this solution may be possible using two separate thresholds, $\Delta_p$ and $\Delta_n$, for positive and negative weights, respectively, along with a scaling factor $\alpha$ to ternarize the weights.
3.1 Our Formulation
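Computing separate $\Delta$ and $\alpha$ for each weight compensates for information loss and better preserves the underlying distributions. However, such a solution, while showing significant improvement in accuracy, does not reduce the number of multiplications, leading to a less efficient implementation. Therefore, we seek a trade-off between achieving higher accuracy and reducing the total number of multiplications. We propose a fine-grained quantization approach that creates groups of weights and ternarizes each group independently. Let us consider the weights represented as a vector $W \in \mathbb{R}^n$. We partition the set $I$ of $n$ indices into $k$ disjoint subsets, $c_1, c_2, \dots, c_k$, with cardinality $|c_i| = n_i$, such that $c_i \cap c_j = \emptyset$ for $i \ne j$, $\cup_i c_i = I$, $\sum_i n_i = n$. We can decompose $W$ into $k$ orthogonal vectors $W^{(i)} \in \mathbb{R}^n$, $i = 1, \dots, k$, where the $j$-th component $W^{(i)}_j = W_j$ if $j \in c_i$, otherwise $0$. Clearly, $\sum_i W^{(i)} = W$; then we ternarize each orthogonal component $W^{(i)}$ as $\alpha_i \hat{W}^{(i)}$, where the components of $\hat{W}^{(i)}$ are in $\{-1, 0, +1\}$. Threshold-based pruning never turns $0$ to non-zero, and the following orthogonality holds: $W^{(i)} \perp W^{(j)}$, $\hat{W}^{(i)} \perp \hat{W}^{(j)}$, $W^{(i)} \perp \hat{W}^{(j)}$, for $i \ne j$. It follows that $(W^{(i)} - \alpha_i\hat{W}^{(i)}) \perp (W^{(j)} - \alpha_j\hat{W}^{(j)})$ for $i \ne j$, and we have $\|W - \sum_i \alpha_i\hat{W}^{(i)}\|_F^2 = \sum_i \|W^{(i)} - \alpha_i\hat{W}^{(i)}\|_F^2$. Then, for a given group of $k$ filters $\{W^{(i)}\}$, $i = 1, \dots, k$, and $\hat{W}^{(i)}_j \in \{-1, 0, +1\}$, $\forall j$,
$$\alpha_1^*, \dots, \alpha_k^*, \hat{W}^{(1)*}, \dots, \hat{W}^{(k)*} = \arg\min_{\alpha_i, \hat{W}^{(i)}} \Big\|W - \sum_i \alpha_i\hat{W}^{(i)}\Big\|_F^2 = \arg\min_{\alpha_i, \hat{W}^{(i)}} \sum_i \big\|W^{(i)} - \alpha_i\hat{W}^{(i)}\big\|_F^2. \tag{2}$$
Each term of the last sum depends only on its own pair $(\alpha_i, \hat{W}^{(i)})$, so the terms can be minimized separately.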
Therefore, we need to solve $k$ independent sub-problems. This formulation allows a better ternary approximation to the original full-precision weights, ensuring that they remain within a neighborhood of the original solution in the complex search space of parameters, despite no re-training. Consequently, we expect the full-precision solution and its ternary counterpart to generalize in a similar manner. From a model capacity point of view, we can have only three distinct values, $\{-\alpha, 0, +\alpha\}$, for a ternary weight vector without such grouping. With $k$ such groups, however, we can represent $2k+1$ distinct values, thus increasing model capacity linearly with the number of groups.
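The decomposition into independent sub-problems can be checked numerically with a small plain-Python sketch. It assumes contiguous groups of equal size and a brute-force threshold search per group; the helper names `ternarize_group` and `fgq_ternarize` are hypothetical and chosen only for this example.

```python
import random

def ternarize_group(group):
    """Brute-force the threshold for one group; return (alpha, codes, error)."""
    best = None
    for delta in [0.0] + sorted(abs(w) for w in group)[:-1]:
        codes = [(1 if w > delta else -1 if w < -delta else 0) for w in group]
        kept = [abs(w) for w, c in zip(group, codes) if c]
        alpha = sum(kept) / len(kept) if kept else 0.0
        err = sum((w - alpha * c) ** 2 for w, c in zip(group, codes))
        if best is None or err < best[2]:
            best = (alpha, codes, err)
    return best

def fgq_ternarize(weights, group_size):
    """Split weights into contiguous groups and ternarize each one independently."""
    approx, group_errors, alphas = [], [], []
    for start in range(0, len(weights), group_size):
        alpha, codes, err = ternarize_group(weights[start:start + group_size])
        approx.extend(alpha * c for c in codes)
        group_errors.append(err)
        alphas.append(alpha)
    return approx, group_errors, alphas

# Hypothetical example data: 64 weights split into 16 groups of 4.
random.seed(1)
w = [random.gauss(0.0, 0.05) for _ in range(64)]
approx, group_errors, alphas = fgq_ternarize(w, group_size=4)
total_error = sum((a - b) ** 2 for a, b in zip(w, approx))
single_group_error = fgq_ternarize(w, group_size=len(w))[1][0]
```

In this sketch the total error equals the sum of the per-group errors, the grouped approximation is never worse than using one scaling factor for the whole vector, and the approximation takes at most `2 * len(alphas) + 1` distinct values.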
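We can solve each sub-problem using a threshold-based approach in the following way. We are given a vector of elements $W \in \mathbb{R}^n$, and we can use separate thresholds for positive and negative weights, along with one scaling factor $\alpha$, to ternarize $W$. Let $I_\Delta^+ = \{i : W_i > \Delta_p\}$, $I_\Delta^- = \{i : W_i < -\Delta_n\}$, for some $\Delta_p, \Delta_n > 0$. We want to solve
$$\alpha^*, \Delta_p^*, \Delta_n^* = \arg\min_{\alpha \ge 0,\ \Delta_p > 0,\ \Delta_n > 0} \|W - \alpha\hat{W}\|_F^2, \quad \text{s.t. } \hat{W}_i \in \{-1, 0, +1\},\ i = 1, \dots, n. \tag{3}$$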
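We have the following analytical solution.
$$\Delta_p^*, \Delta_n^* = \arg\max_{\Delta_p > 0,\ \Delta_n > 0} \frac{\big(\sum_{i \in I_\Delta^+} |W_i| + \sum_{i \in I_\Delta^-} |W_i|\big)^2}{|I_\Delta^+| + |I_\Delta^-|}, \qquad \alpha^* = \frac{\sum_{i \in I_\Delta^+} |W_i| + \sum_{i \in I_\Delta^-} |W_i|}{|I_\Delta^+| + |I_\Delta^-|}. \tag{4}$$
For $\Delta_p = \Delta_n = \Delta$, with $I_\Delta = I_\Delta^+ \cup I_\Delta^-$, this reduces to the symmetric-threshold case:
$$\alpha^* = \frac{\sum_{i \in I_\Delta} |W_i|}{|I_\Delta|}, \qquad \Delta^* = \arg\max_{\Delta > 0} \frac{\big(\sum_{i \in I_\Delta} |W_i|\big)^2}{|I_\Delta|}. \tag{5}$$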
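A small sketch of how the two-threshold sub-problem could be solved by exhaustive search is given below. It relies on the observation that choosing a pair of thresholds is equivalent to choosing how many of the largest positive and largest negative weights to keep, and that for a fixed kept set the error is the sum of squared weights minus (sum of kept magnitudes)^2 / (number kept). The function name, the prefix-sum search and the synthetic data are assumptions for illustration, and the sketch assumes distinct weight magnitudes.

```python
import random

def ternarize_asymmetric(weights):
    """Brute-force (delta_p, delta_n) and the shared scale alpha for one vector."""
    pos = sorted((w for w in weights if w > 0), reverse=True)
    neg = sorted((-w for w in weights if w < 0), reverse=True)
    # Prefix sums: pos_sum[a] is the sum of the a largest positive weights.
    pos_sum, neg_sum = [0.0], [0.0]
    for v in pos:
        pos_sum.append(pos_sum[-1] + v)
    for v in neg:
        neg_sum.append(neg_sum[-1] + v)

    best_score, best_a, best_b = 0.0, 0, 0
    for a in range(len(pos) + 1):
        for b in range(len(neg) + 1):
            if a + b == 0:
                continue
            score = (pos_sum[a] + neg_sum[b]) ** 2 / (a + b)
            if score > best_score:
                best_score, best_a, best_b = score, a, b

    kept = best_a + best_b
    alpha = (pos_sum[best_a] + neg_sum[best_b]) / kept if kept else 0.0
    # Each threshold sits at the largest pruned magnitude (0 if nothing is pruned).
    delta_p = pos[best_a] if best_a < len(pos) else 0.0
    delta_n = neg[best_b] if best_b < len(neg) else 0.0
    codes = [(1 if w > delta_p else -1 if w < -delta_n else 0) for w in weights]
    return alpha, delta_p, delta_n, codes

# Hypothetical skewed data: positive and negative parts have different scales.
random.seed(2)
w = ([random.expovariate(20.0) for _ in range(96)]
     + [-random.expovariate(40.0) for _ in range(96)])
alpha, delta_p, delta_n, codes = ternarize_asymmetric(w)
error = sum((x - alpha * c) ** 2 for x, c in zip(w, codes))
```

Because every symmetric threshold corresponds to one particular choice of kept positive and negative weights, the error found by this search cannot exceed the best symmetric-threshold error on the same vector.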
The advantage of our formulation (2) is that the smaller independent sub-problems can be solved efficiently using brute-force methods to achieve a better approximation. However, we also explore analytical solutions to establish the theoretical validity of our approach. Assuming that the magnitudes of the learned weights follow an exponential distribution with parameter $\lambda$, we analytically derive the optimal $\Delta^*$ from the following lemma.
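Lemma 1. Using the above notation, if $|W_i| \sim \exp(\lambda)$, then $\Delta^*$ in (5) is
$$\Delta^* \approx \frac{1}{\lambda} \approx \frac{1}{n}\sum_i |W_i|.$$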
From this analysis, we see the need for a higher threshold value to prune a larger number of smaller elements. This is intuitive from the shape of the model distributions, which are typically heavy-tailed. In reality, however, it may not be appropriate to use a single distribution to model the weights of all the layers of a neural network. We can apply the Kolmogorov-Smirnov (K-S) test as a goodness-of-fit measure to identify an appropriate reference distribution (here we choose between Gaussian and exponential), and find $\Delta^*$ accordingly. We approximate a heavy-tailed distribution by an exponential one by pruning out some of the smaller elements. This gives us an exponential approximation with smaller $\lambda$. Further, we can use maximum likelihood estimation to obtain the parameters of such distributions. For the Gaussian $\mathcal{N}(0, \sigma)$, the estimate is $\hat{\sigma} = \sqrt{\sum_i |W_i|^2 / n} = \mathrm{rms}(W)$, and for the exponential case, the estimated parameter is $\hat{\lambda} = n / \sum_i |W_i|$. Based on this refined analysis, we observe significant improvement in the theoretical ternary error over the Gaussian assumption of 2016twn () (Figure 1). It is interesting to observe that for earlier convolution layers of ResNet-101 trained on ImageNet, the magnitudes of the weights follow an exponential distribution, while the weights of later layers are Gaussian.
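The distribution-selection step can be sketched in plain Python as follows. This is an illustrative assumption rather than the authors' procedure: it compares the K-S statistic of the weight magnitudes against a fitted exponential CDF and against the half-normal CDF implied by a zero-mean Gaussian, using the maximum likelihood estimates for the rate and the standard deviation. The function names are hypothetical.

```python
import math
import random

def ks_statistic(samples, cdf):
    """One-sample Kolmogorov-Smirnov statistic against a reference CDF."""
    xs = sorted(samples)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))

def fit_reference_distribution(weights):
    """Pick 'exponential' or 'gaussian' for the magnitudes; return (name, parameter)."""
    mags = [abs(w) for w in weights]
    n = len(mags)
    lam = n / sum(mags)                              # MLE of the exponential rate
    sigma = math.sqrt(sum(m * m for m in mags) / n)  # MLE of sigma for zero mean
    d_exp = ks_statistic(mags, lambda x: 1.0 - math.exp(-lam * x))
    # |W| is half-normal when W ~ N(0, sigma^2): P(|W| <= x) = erf(x / (sigma * sqrt(2))).
    d_gauss = ks_statistic(mags, lambda x: math.erf(x / (sigma * math.sqrt(2.0))))
    if d_exp < d_gauss:
        return "exponential", lam
    return "gaussian", sigma

# Hypothetical example data for the two cases.
random.seed(3)
exp_w = [random.choice((-1, 1)) * random.expovariate(20.0) for _ in range(2000)]
gauss_w = [random.gauss(0.0, 0.05) for _ in range(2000)]
kind_e, lam = fit_reference_distribution(exp_w)
kind_g, sigma = fit_reference_distribution(gauss_w)
delta_exp = 1.0 / lam  # threshold suggested by the lemma in the exponential case
```

In the exponential case the lemma's threshold is simply the mean magnitude of the weights, which is why `delta_exp` is the reciprocal of the estimated rate.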
3.2 Weight Grouping
Our method (2) is agnostic to how the (full-precision) weights are grouped, but leverages the consequence of grouping, which allows these to be solved as independent sub-problems more efficiently. The specifics of the grouping mechanism and the memory layout used for accessing these groups of weights form an independent problem to explore. The primary objective of grouping is to minimize the dynamic range within each group and to split the weights in such a way that the smaller groups have a uniform distribution. This helps in reducing the complexity of finding an optimal solution ($\alpha$, $\Delta$) for each independent sub-problem using either analytical or brute-force techniques.
However, to realize the full performance potential of ternarization, it is essential to ensure that the grouping mechanism itself does not introduce a significant overhead. Similarity-based clustering algorithms such as K-means, despite being better at finding an optimal grouping of weights that may even lead to better accuracy, are not amenable to efficient implementation (in both software and hardware), because they group elements from arbitrary, non-contiguous memory locations. This leads to irregular memory accesses with longer latencies, needed to gather arbitrarily grouped weights that share a common $\alpha$ for a partial accumulation of the output.
Based on our empirical observations, we conclude that using static groups of $N$ weights that are partitioned along the input channels $C$ achieves the best accuracy. The same element from multiple filters along $C$ has significantly less variance, since these elements correspond to similar input features. Hence, grouping such elements results in a reduced dynamic range within the group. Such a grouping also lends itself to efficient implementation, both on existing hardware and in software, using a layout for the weight tensor in which groups of $N$ elements are accessed from contiguous memory locations. Since the elements along $C$ accumulate to the same output feature, this layout is also amenable to efficient vectorization along $C$. Figure 2 shows an example of this grouping scheme applied to filters. Each group of $N$ ternary filters has one scaling factor ($\alpha$) for each element of the filter.
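The static grouping along input channels can be illustrated with a short plain-Python sketch. The nested-list tensor `weights[k][c][r][s]`, the function name `pack_groups` and the particular packed ordering are assumptions made for this example; the point is only that the `n` elements sharing one scaling factor end up next to each other.

```python
def pack_groups(weights, n):
    """Flatten weights[k][c][r][s] so that each group of n input channels is contiguous.

    One group holds the same filter element (r, s) taken from n consecutive
    input channels of output filter k, so each (k, channel block) pair yields
    R * S groups and therefore R * S scaling factors.
    """
    K, C = len(weights), len(weights[0])
    R, S = len(weights[0][0]), len(weights[0][0][0])
    if C % n != 0:
        raise ValueError("number of input channels must be a multiple of n")
    packed = []
    for k in range(K):
        for cb in range(C // n):
            for r in range(R):
                for s in range(S):
                    packed.append([weights[k][cb * n + j][r][s] for j in range(n)])
    return packed

# Hypothetical tensor whose values encode their own (k, c, r, s) position.
K, C, R, S, N = 2, 8, 3, 3, 4
weights = [[[[k * 1000 + c * 100 + r * 10 + s for s in range(S)]
             for r in range(R)] for c in range(C)] for k in range(K)]
groups = pack_groups(weights, N)
```

With this ordering, a convolution kernel can stream one group at a time, accumulate the `n` ternary products, and apply the group's scaling factor once.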
4 Experimental Results
For experimental results, we focused on Resnet-50 and Resnet-101 resnet () using the ILSVRC-2012 imagenet_data () dataset, to demonstrate the efficacy of our method on large, sophisticated models using 2-bit weights and 8-bit activations (2w-8a). We extended our study by applying FGQ to activations to further reduce their precision to 4 bits (2w-4a), and show results comparable with 8-bit activations for all the tested networks. Further, towards establishing the broader applicability of FGQ, we also demonstrate state-of-the-art accuracy for AlexNet alexnet ().
Our setup consists of a modified version of Caffe caffe () that emulates the low-precision dynamic fixed point (DFP; please note that fixed point and dynamic fixed point are used interchangeably) computations described in Fig. 3. We use a 32-bit accumulator for all our low precision computations to minimize the chances of an overflow. We split the pre-trained weights into groups of N elements using the mechanism described in Section 3.2, and use a brute-force technique to compute the floating point values of the threshold ($\Delta$) and scaling factor ($\alpha$) for each group. The scaling factors are then quantized to 8/4-bit fixed point and the weights are stored in the memory format described in Section 3.2. The activations are quantized to 8/4 bits before performing the convolution operation, and the 32-bit outputs are down-converted to 8-bit and appropriately rounded before they are passed to the next layer.
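A minimal sketch of quantizing a tensor to dynamic fixed point is shown below. The exact scheme used in the modified Caffe is not described in the text, so this example assumes the common variant in which all values of a tensor share one power-of-two scale chosen from the largest magnitude; the function names are hypothetical.

```python
import math

def dfp_quantize(values, bits):
    """Quantize to signed `bits`-bit integers that share one power-of-two scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0
    # Smallest exponent e such that max_abs / 2**e still fits in [-qmax, qmax].
    exponent = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exponent
    # Python's round() rounds halves to even; results are clipped to the range.
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, exponent

def dfp_dequantize(ints, exponent):
    """Map the integers back to real values using the shared exponent."""
    return [q * 2.0 ** exponent for q in ints]

acts = [0.5, -0.25, 0.1, 0.0]
q8, e8 = dfp_quantize(acts, bits=8)   # 8-bit activations
q4, e4 = dfp_quantize(acts, bits=4)   # 4-bit activations
restored = dfp_dequantize(q8, e8)
```

Because only the integers and a single exponent are stored, the products inside a layer can be accumulated in a wide integer accumulator and rescaled once at the end.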
Our experiments indicate that it is essential to use higher precision for the first layer (8w-8a) to minimize the accumulation of quantization loss. We also observe that using the pre-trained parameters in batch normalization layers leads to a loss of accuracy, due to the shift in variance introduced by the quantization. We prevent this loss by recomputing the batch normalization parameters during the inference phase to compensate for the shift in variance.
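How the batch normalization statistics might be recomputed is not spelled out in the text; the sketch below assumes the simplest option, namely re-estimating the per-channel mean and variance from the outputs of the quantized layer on a small batch. The function names and the toy data are hypothetical.

```python
import math

def recompute_bn_stats(batch):
    """batch[sample][channel] -> per-channel (means, variances) of the quantized outputs."""
    n = len(batch)
    channels = len(batch[0])
    means = [sum(row[c] for row in batch) / n for c in range(channels)]
    variances = [sum((row[c] - means[c]) ** 2 for row in batch) / n
                 for c in range(channels)]
    return means, variances

def batch_norm(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Standard batch normalization of one value with the given statistics."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

# Toy outputs of a quantized layer: 2 samples, 2 channels.
outputs = [[1.0, 10.0], [3.0, 14.0]]
means, variances = recompute_bn_stats(outputs)
normalized = batch_norm(3.0, means[0], variances[0], eps=0.0)
```

The learned scale and shift (`gamma`, `beta`) are left untouched in this sketch; only the running statistics are replaced.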
We explored the accuracy-performance trade-off using different group sizes N. Our experiments show that FGQ with a group size of N=4 (FGQ-N4) achieves the highest accuracy with no re-training and a potential performance benefit. FGQ-N4 applied to a pre-trained Resnet-101 model with 2w-8a achieves Top-1 accuracy of , which is within of the full-precision results. With activations reduced to 4 bits (2w-4a), the Top-1 accuracy drops only marginally to . FGQ-N4 performs equally well on Resnet-50, achieving Top-1 accuracy of with 2w-8a, which is off from the full-precision result, and with 2w-4a. To the best of our knowledge, these are the highest reported accuracies using 2w-8a and 2w-4a on the ImageNet dataset imagenet_data () using state-of-the-art networks.
To understand the general applicability of our method to a wider range of networks, we apply FGQ to the smaller AlexNet alexnet () model. FGQ-N4 applied to a pre-trained AlexNet model achieves Top-1 accuracy with 2w-8a without any re-training; this is away from the baseline full-precision result ( ). With 2w-4a we do not see any further reduction in accuracy. There are no previously published results that can be directly compared to FGQ, i.e. that perform quantization on pre-trained models and work with end-to-end low precision. Hence, we compare with venkatesh2016 (); inq (); zhou2016dorefa (), which are the closest in terms of the networks used and/or the target precision. Our AlexNet result using FGQ-N4 is comparable to the previously published result of zhou2016dorefa (), which is away from the baseline using 1w-4a while also employing training in low precision with full precision gradients. Table 1 has a comparison with previously reported results from inq () using 5-bit weights and venkatesh2016 () using ternary weights. While they report slightly better absolute numbers, our numbers are relatively better because both of these results use full-precision activations and train the network in low precision to achieve those numbers. Even without any low precision training and with reduced precision for activations, results with FGQ are still competitive with the other (aforementioned) results. (It should be noted that both these works use Resnet-50 with slight variations and hence have slightly different baseline accuracies. For inq () the baseline full precision Top-1 accuracy is , for venkatesh2016 () it is and for zhou2016dorefa () it is .) With additional low precision training with FGQ we are able to significantly improve accuracy and get closer to state-of-the-art full precision results, as outlined in the next section along with the associated performance implications.
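| | | no low precision training | no low precision training | with low precision re-training | with low precision re-training | with low precision re-training |
| Networks | Our Baseline | FGQ-N4 2w-8a | FGQ-N4 2w-4a | INQ 5w-32a inq () | dLAC 2w-32a venkatesh2016 () | DoReFa 1w-4a-32g zhou2016dorefa () |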
In order to realize the full performance potential of ternary networks, the inference platform needs to operate at a throughput that is close to the precision of the weights. This would increase the amount of memory bandwidth required to stream activations by and would require a compute engine that is much wider to deliver the desired compute throughput. Building such a solution around full-precision activations would be prohibitive in terms of area and power requirements, whereas it is more feasible to build such a solution when the activations are 8 or 4 bits.
Figure 3(a) shows the performance vs. accuracy trade-off for Resnet-50 marcelbatchnorm () for an FGQ based 8-bit inference design. Our model projects the lower bound of the performance potential based on the percentage of FMA operations that can be converted into ternary accumulations at each group size N. In the ideal case, where N is equal to the total number of weights in the layer, the best case performance potential is compared to the baseline full-precision performance. For a group size of N=4, of all FMA operations can be performed in ternary. Using slightly larger sub-groups of N=8 we can replace of FMA operations with ternary while losing an additional Top-1 accuracy. At group size N=64, of all FMA operations can be replaced by ternary accumulations, resulting in potential improvement in performance. But the performance comes at the cost of a significant drop in accuracy. Using larger groups of weights results in a poor ternary approximation to the full-precision model. Consequently, the ternary solution moves away from the full-precision local optimum and displays different generalization behavior.
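The link between group size and the share of operations that become plain accumulations can be illustrated with a simplified operation count. The sketch below is not the performance model behind Figure 3(a); it assumes that every group of `group_size` weights needs `group_size` signed additions and exactly one multiplication by the group's scaling factor, and all names are hypothetical.

```python
def grouped_ternary_dot(activations, codes, alphas, group_size):
    """Dot product with ternary codes: additions inside a group, one multiply per group."""
    total, multiplies = 0.0, 0
    for g, alpha in enumerate(alphas):
        lo = g * group_size
        partial = 0  # integer accumulation of +/- activations
        for a, c in zip(activations[lo:lo + group_size], codes[lo:lo + group_size]):
            if c > 0:
                partial += a
            elif c < 0:
                partial -= a
        total += alpha * partial  # the only multiplication for this group
        multiplies += 1
    return total, multiplies

def ternary_fraction(group_size):
    """Share of multiply-accumulates replaced by accumulations under this count."""
    return 1.0 - 1.0 / group_size

# Toy example: 8 integer activations, two groups of 4 ternary weights.
acts = [3, -1, 4, 1, 5, -9, 2, 6]
codes = [1, 0, -1, 1, 0, -1, 1, 1]
alphas = [0.5, 0.25]
result, multiplies = grouped_ternary_dot(acts, codes, alphas, group_size=4)
```

Under this simplified count, larger groups leave fewer multiplications, at the price of a coarser approximation of the weights.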
We have trained a low-precision (2w-8a) ResNet-50 marcelbatchnorm () at group size N=64 on the ImageNet imagenet_data () dataset to recover the accuracy lost because of ternarization. We initialized the network with a pre-trained full-precision model and fine-tuned the parameters of the low-precision network. We reduced the learning rate to the order of 1e-4 to avoid exploding gradients, retained all the other hyperparameters from full-precision training, and performed gradient updates in full precision. After training for 20 epochs, we recovered most of the lost accuracy and achieved Top-1 and Top-5, bringing it to within of the full-precision baseline accuracy. Figure 3(b) shows the reduction of training error and the improvement in validation accuracy.
We propose a fine-grained ternarization method which exploits local correlations in the dynamic range of the parameters to minimize the impact of quantization on overall accuracy. We demonstrate near-SOTA accuracy on the ImageNet dataset using pre-trained models with quantized networks without re-training. Using ternary weights on Resnet-101 and Resnet-50 with 8-bit activations, our results are within of the full precision accuracy. Using 4-bit activations we see a further drop of in accuracy. To the best of our knowledge, these are the highest reported accuracies using ternary weights and low-precision activations.
Our weight grouping based approach allows us to obtain solutions that can be tailored for specific hardware, and that can also be used on general purpose hardware, based on the accuracy and performance requirements. Smaller group sizes with N=4 achieve the best accuracy and perform of the computations as ternary operations (simple 8-bit additions), which is better suited for implementation on specialized hardware. Larger group sizes are more suitable for current general purpose hardware, with a larger portion of the computations performed as low precision operations ( for N=64), although this comes at the cost of reduced accuracy. This gap may be bridged with additional low precision training, as shown in Section 4. Our final quantized model can be run efficiently on a full 8-bit compute pipeline, thus offering a potential performance-power benefit.
We continue to actively work on closing the current accuracy gap, exploring both low precision (re)training and extensions to the FGQ method itself. Also we are looking into a more theoretical exploration to better understand the formal relationship between the weight grouping and final accuracy, with an attempt to establish realistic bounds for given network-performance-accuracy requirements.
-  Yoshua Bengio, Ian Goodfellow, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2016.
-  Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  Tim Dettmers. 8-bit approximations for parallelism in deep learning. arXiv preprint arXiv:1511.04561, 2015.
-  Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, pages 1737–1746, 2015.
-  Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115, 2016.
-  Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
-  N Jouppi. Google supercharges machine learning tasks with tpu custom chip. Google Blog, May, 18, 2016.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
-  Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
-  Daisuke Miyashita, Edward H Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.
-  Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  Marcel Simon, Erik Rodner, and Joachim Denzler. Imagenet pre-trained models with batch normalization. arXiv preprint arXiv:1612.01452v2, 2016.
-  Yaman Umuroglu, Nicholas J Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. Finn: A framework for fast, scalable binarized neural network inference. arXiv preprint arXiv:1612.07119, 2016.
-  Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4. Citeseer, 2011.
-  Ganesh Venkatesh, Eriko Nurvitadhi, and Debbie Marr. Accelerating deep convolutional networks using low-precision and sparsity. arXiv preprint arXiv:1610.00324, 2016.
-  Darrell Williamson. Dynamically scaled fixed point arithmetic. In Communications, Computers and Signal Processing, 1991., IEEE Pacific Rim Conference on, pages 315–318. IEEE, 1991.
-  Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. poster at International Conference on Learning Representations, 2017.
-  Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
-  Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
Proof of Lemma 1
Let $n$ denote the number of elements. Let $f(x) = \lambda e^{-\lambda x}$ be the pdf of the exponential distribution with parameter $\lambda > 0$, and $F(x) = 1 - e^{-\lambda x}$ be the cdf. Then,
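One way to complete the argument is sketched below. It assumes that, for large $n$, the empirical sums in the symmetric-threshold objective $\big(\sum_{i \in I_\Delta} |W_i|\big)^2 / |I_\Delta|$, with $I_\Delta = \{i : |W_i| > \Delta\}$, can be replaced by their expectations under the exponential pdf $f(x) = \lambda e^{-\lambda x}$ with cdf $F(x) = 1 - e^{-\lambda x}$. The auxiliary function $G$ is notation introduced only for this sketch.

```latex
\begin{align*}
\frac{1}{n}\sum_{i \in I_\Delta} |W_i|
  &\approx \int_{\Delta}^{\infty} x f(x)\,dx
   = e^{-\lambda\Delta}\Big(\Delta + \frac{1}{\lambda}\Big),
&
\frac{|I_\Delta|}{n}
  &\approx 1 - F(\Delta) = e^{-\lambda\Delta}, \\[4pt]
\frac{\big(\sum_{i \in I_\Delta} |W_i|\big)^2}{|I_\Delta|}
  &\approx n\,G(\Delta),
&
G(\Delta) &= e^{-\lambda\Delta}\Big(\Delta + \frac{1}{\lambda}\Big)^{2}, \\[4pt]
G'(\Delta)
  &= e^{-\lambda\Delta}\Big(\Delta + \frac{1}{\lambda}\Big)\big(1 - \lambda\Delta\big).
\end{align*}
```

Since $G'(\Delta) > 0$ for $\Delta < 1/\lambda$ and $G'(\Delta) < 0$ for $\Delta > 1/\lambda$, the maximizer is $\Delta^* = 1/\lambda$. Substituting the maximum likelihood estimate $\hat{\lambda} = n / \sum_i |W_i|$ gives $\Delta^* \approx \frac{1}{n}\sum_i |W_i|$, the mean magnitude of the weights.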