DeepShift: Towards Multiplication-Less Neural Networks
Abstract
Deep learning models, especially deep convolutional neural networks (DCNNs), have obtained high accuracies in several computer vision applications. However, for deployment in mobile environments, their high computation and power budget proves to be a major bottleneck. Convolution layers and fully connected layers, because of their intense use of multiplications, are the dominant contributors to this computation budget. This paper proposes to tackle this problem by introducing two new operations, convolutional shifts and fully-connected shifts, that replace multiplications altogether and use bitwise shift and bitwise negation instead. This family of neural network architectures (that use convolutional shifts and fully-connected shifts) is referred to as DeepShift models. With such DeepShift models, which can be implemented with no multiplications, the authors have obtained accuracies of up to 93.6% on the CIFAR10 dataset, and Top-1/Top-5 accuracies of 70.9%/90.13% on the Imagenet dataset. Extensive testing is made on various well-known CNN architectures after converting all their convolution layers and fully connected layers to their bitwise shift counterparts, and we show that in some architectures, the Top-1 accuracy drops by less than 4% and the Top-5 accuracy drops by less than 1.5%. The experiments have been conducted on the PyTorch framework, and the code for training and running is submitted along with the paper and will be made available online.
Mostafa Elhoushi, Heterogeneous Compilers Lab, Huawei Technologies, Toronto, Canada, mostafa.elhoushi@huawei.com
Farhan Shafiq, Heterogeneous Compilers Lab, Huawei Technologies, Toronto, Canada, farhan.shafiq@huawei.com
Ye Tian, Heterogeneous Compilers Lab, Huawei Technologies, Toronto, Canada, ye.tian2@huawei.com
Joey Yiwei Li, Heterogeneous Compilers Lab, Huawei Technologies, Toronto, Canada, joey.yiwei.li@huawei.com
Zihao Chen, Heterogeneous Compilers Lab, Huawei Technologies, Toronto, Canada, zihao.chen@huawei.com
Preprint. Under review.
Introduction
Deep neural networks are increasingly being targeted at mobile and IoT applications. Devices at the edge have a lower power and price budget as well as constrained memory size. In addition, the amount of communication between memory and compute plays a major role in the power requirements of a CNN. If communication between device and cloud becomes necessary (e.g., for model updates), model size can also affect connectivity costs. Therefore, for mobile and IoT inference applications, model optimization, size reduction, faster inference, and lower power consumption are key areas of research. The approaches being considered to address this need can be divided into a few categories.
One approach is to build efficient models from the ground up, resulting in novel network architectures. However, this requires substantial training resources, since multiple variants of an architecture must be tried to find the best fit.
Another approach is to start with a big model and prune it. Some of the many parameters in a network are redundant and contribute little to the output, so each parameter is assigned a rank based on its contribution to the output. Low-ranking parameters can be done away with (pruned) without affecting the accuracy too much. The ranking can be based on the L1/L2 mean of neuron weights, their mean activations, or the proportion of nonzero neurons on some validation set. After pruning, the accuracy drops, and the network is usually trained further to recover. Too much pruning at once may degrade the output, so in practice pruning is performed iteratively, in cycles of pruning and retraining. This can result in reduced model sizes and improved speeds.
Another technique is to start with a big model and reduce its size by quantizing it to floating-point or fixed-point numbers of smaller bit-width.
In some cases the quantized models are retrained to regain some of the accuracy. Key attractions of these techniques are that they can be easily applied to various kinds of networks and that they not only reduce model size but also require less complex compute units on the underlying hardware. This results in a smaller model footprint, less working memory (and cache), faster computation on supporting platforms, and lower power consumption. Also, some optimization techniques replace multiplication with binary XNOR operations. Such techniques may have high accuracy on small datasets such as MNIST or CIFAR10, but suffer high degradation on complex datasets such as Imagenet.
This paper proposes to reduce the computation and power budget of CNNs by introducing two new operations, convolutional shifts and fully-connected shifts, that replace multiplications altogether and use bitwise shift and bitwise negation instead. This family of neural network architectures is referred to as DeepShift models. Our approach focuses either on one-shot training from scratch using powers of 2 (i.e., bitwise shifts) or on converting pretrained models.
1 DeepShift Networks
As shown in Figure 1, the main concept of this paper is to replace multiplication with bitwise shift and negation. If the underlying binary representation of an input number, $x$, is in integer or fixed-point format, a bitwise shift of $s$ bits to the left (or right) is mathematically equivalent to multiplying by a positive (or negative) power of 2:
$$2^{s} x = \begin{cases} x \ll s & \text{if } s > 0 \\ x \gg |s| & \text{if } s < 0 \\ x & \text{if } s = 0 \end{cases} \tag{1}$$
A bitwise shift can only be equivalent to multiplying by a positive number, because $2^{s} > 0$ for any value of $s$. However, in neural networks, training needs the equivalent of multiplying by negative numbers in its search space, especially in convolutional neural networks, where filters with both positive and negative values contribute to detecting edges. Therefore, we also need to use the negation operation. The negation operation is mathematically equivalent to:
$$(-1)^{n} x = \begin{cases} -x & \text{if } n = 1 \\ x & \text{if } n = 0 \end{cases} \tag{2}$$
Similar to bitwise shift, negation is a computationally cheap operation, as it only involves returning the 2's complement of a number.
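The two equivalences above can be checked on ordinary integers. The following is a minimal Python sketch rather than the authors' implementation: the helper names `shift` and `negate` are illustrative, and Python's unbounded integers stand in for a fixed-width register.

```python
def shift(x: int, s: int) -> int:
    """Shift x left by s bits if s > 0, right by |s| bits if s < 0."""
    if s >= 0:
        return x << s
    return x >> (-s)  # arithmetic right shift: rounds toward negative infinity


def negate(x: int) -> int:
    """Two's-complement negation: invert every bit, then add one."""
    return ~x + 1


# A left shift matches multiplication by a positive power of two exactly.
assert shift(12, 3) == 12 * 2**3
# A right shift matches multiplication by a negative power of two when no bits are lost.
assert shift(96, -4) == 96 // 2**4 == 6
# Two's-complement negation matches multiplication by -1.
assert negate(37) == -37 and negate(-5) == 5
```

Note that a right shift discards the low-order bits, so the equivalence with multiplication holds only up to truncation: `shift(7, -1)` gives 3, not 3.5.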
In the coming sections we introduce two novel operators, LinearShift and ConvShift, both of which replace multiplication with bitwise shift and negation:
$$y = (-1)^{n} \, 2^{s} \, x \tag{3}$$
where $s$ will be referred to as the shift value and $n$ will be referred to as the negation value.
In typical CPU architectures, both bitwise shift and bitwise negation use only 1 clock cycle, while floating-point multiplication may consume up to 10 clock cycles.
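As a rough illustration of the combined operation, the sketch below scales an integer by a signed power of two using only a shift and a negation, then compares the result with ordinary multiplication. The function name `shift_negate` and the sample values are assumptions made for this example.

```python
def shift_negate(x: int, s: int, n: int) -> int:
    """Compute (-1)**n * 2**s * x with one bitwise shift and one negation."""
    shifted = x << s if s >= 0 else x >> (-s)
    return -shifted if n % 2 else shifted


x = 384  # divisible by 4, so the right shifts used below lose no bits
for s in (-2, 0, 3):
    for n in (0, 1):
        expected = int((-1) ** n * 2.0 ** s * x)
        assert shift_negate(x, s, n) == expected
```

No multiplier is needed on the shift-and-negate path; the multiplication appears here only to produce the reference value.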
1.1 LinearShift Operator
The linear operator (a.k.a. fully-connected operator) is based on matrix multiplication. The forward pass can be expressed as:
$$Y = XW + b \tag{4}$$
where $X$ is the input, which can be represented as a matrix of size $B \times m_{in}$, $Y$ is the output, which can be represented as a matrix of size $B \times m_{out}$, $W$ is the trainable weight matrix of size $m_{in} \times m_{out}$, and $b$ is the trainable bias vector of size $m_{out}$. $B$ is the batch size, while $m_{in}$ is the input feature size and $m_{out}$ is the output feature size.
The backward pass of the linear operator can be expressed as:
$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^{T}, \qquad \frac{\partial L}{\partial W} = X^{T} \frac{\partial L}{\partial Y} \tag{5}$$
where $\partial L / \partial Y$ is the gradient input to the operator (the derivative of the model loss, $L$, with respect to the operator output), $\partial L / \partial X$ is the gradient output of the operator (the derivative of the model loss with respect to the operator input), and $\partial L / \partial W$ is the derivative of the model loss with respect to the operator weights.
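The standard linear-layer gradients, dL/dX = (dL/dY) W^T and dL/dW = X^T (dL/dY), can be verified numerically on a toy example. The sketch below uses plain Python lists; all names and values are illustrative, and the loss is taken to be the sum of the outputs so that dL/dY is a matrix of ones.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]


def linear_forward(x, w, b):
    """Y = XW + b, with b broadcast over the batch dimension."""
    return [[v + bj for v, bj in zip(row, b)] for row in matmul(x, w)]


def linear_backward(x, w, grad_y):
    """Return (dL/dX, dL/dW) given dL/dY."""
    return matmul(grad_y, transpose(w)), matmul(transpose(x), grad_y)


# Toy problem: batch of 2, 3 input features, 2 output features.
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
W = [[0.5, -1.0], [0.25, 2.0], [-0.5, 1.0]]
b = [0.1, -0.2]
grad_Y = [[1.0, 1.0], [1.0, 1.0]]  # L = sum of all outputs
grad_X, grad_W = linear_backward(X, W, grad_Y)


def loss(w):
    return sum(sum(row) for row in linear_forward(X, w, b))


# Finite-difference check of one weight gradient.
eps = 1e-6
W_plus = [row[:] for row in W]
W_plus[0][0] += eps
numeric = (loss(W_plus) - loss(W)) / eps
assert abs(numeric - grad_W[0][0]) < 1e-4
```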
In this paper we introduce the shift linear operator (LinearShift), which in the forward pass replaces matrix multiplication with bitwise shift and negation. The forward pass is defined as:
$$Y = X \left( (-1)^{\mathrm{round}(N)} \cdot 2^{\mathrm{round}(S)} \right) + b \tag{6}$$
where $N$ is the matrix of negation values, $S$ is the matrix of shift values, and $\cdot$ refers to element-wise multiplication of the 2 matrices. The size of both $N$ and $S$ is $m_{in} \times m_{out}$, the same as the weight matrix of the original linear operator. $b$ is the bias vector, as in the original linear operator. $N$, $S$, and $b$ are all trainable parameters.
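A possible forward pass for such an operator is sketched below, assuming the form Y = X((-1)^round(N) · 2^round(S)) + b with integer (fixed-point) inputs. This is an illustration, not the submitted code: the function name and sample values are invented for the example, and Python's built-in `round` is used as the rounding function.

```python
def linear_shift_forward(x, n_mat, s_mat, b):
    """Forward pass that uses only shifts, negations and additions.

    x holds integer (fixed-point) inputs; n_mat and s_mat hold real-valued
    negation and shift parameters that are rounded before use.
    """
    out = []
    for row in x:
        out_row = []
        for j in range(len(b)):
            acc = b[j]
            for i, x_i in enumerate(row):
                s = round(s_mat[i][j])
                n = round(n_mat[i][j])
                term = x_i << s if s >= 0 else x_i >> (-s)
                acc += -term if n % 2 else term
            out_row.append(acc)
        out.append(out_row)
    return out


X = [[8, 16], [32, 64]]          # integer inputs
N = [[0.1, 0.9], [1.2, -0.3]]    # rounds to [[0, 1], [1, 0]]
S = [[1.4, -2.2], [-0.6, 0.3]]   # rounds to [[1, -2], [-1, 0]]
b = [0, 0]
Y = linear_shift_forward(X, N, S, b)

# Reference: the same layer computed with ordinary multiplications.
W_eff = [[(-1) ** round(n) * 2.0 ** round(s) for n, s in zip(n_row, s_row)]
         for n_row, s_row in zip(N, S)]
reference = [[sum(x_i * W_eff[i][j] for i, x_i in enumerate(row)) + b[j]
              for j in range(len(b))] for row in X]
assert Y == reference
```

The inputs were chosen so that no right shift discards a set bit; with arbitrary inputs, the shift path and the multiplication path can differ by the truncation error of the right shift.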
To help deduce the backward pass, we are going to use the term :
(7) 
Note that the backward pass results in non-integer values for the powers of $-1$ and $2$. However, in the forward pass they are rounded so that they can be implemented as bitwise negation and shift, respectively.
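To see where the non-integer values come from, consider a single shift value $s$ treated as a real number during training. The following is an illustrative derivation under that assumption, with a hypothetical learning rate $\eta$; it is not a restatement of Equation 7:

```latex
\frac{\partial}{\partial s}\, 2^{s} = 2^{s} \ln 2
\qquad\Longrightarrow\qquad
s \leftarrow s - \eta \, \frac{\partial L}{\partial s}
```

Because $\partial L / \partial s$ is generally not an integer, the updated $s$ is not an integer either, which is why rounding has to be applied again in the next forward pass.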
1.2 ConvShift Operator
The forward pass of the original convolution operator is expressed as:
$$Y = W * X + b \tag{8}$$
where $*$ denotes the convolution operation, $W$ has dimensions $c_{out} \times c_{in} \times h \times w$, $c_{in}$ is the input channel size, $c_{out}$ is the output channel size, and $h$ and $w$ are the height and width of the convolution filters. The backward pass of convolution is deduced in LeCun et al. (1999) as:
(9) 
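Likewise, our proposed convolutional shift operator (which we will refer to as ConvShift) has the forward pass defined as:
$$Y = \left( (-1)^{\mathrm{round}(N)} \cdot 2^{\mathrm{round}(S)} \right) * X + b \tag{10}$$
where $N$ and $S$ are the negation and shift matrices and have dimensions $c_{out} \times c_{in} \times h \times w$, the same as the filter weights of the original convolution operator.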
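The same idea carries over to convolution. The sketch below is a deliberately simplified case (one input channel, one output channel, stride 1, no padding) and computes a cross-correlation, as deep learning frameworks do. The function name and the sample image are assumptions made for this example.

```python
def conv_shift_2d(image, n_ker, s_ker, bias=0):
    """Single-channel, stride-1, no-padding ConvShift using shifts and negations.

    image is a 2-D list of integers; n_ker and s_ker are 2-D lists holding the
    negation and shift value of every filter tap.
    """
    kh, kw = len(s_ker), len(s_ker[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[bias] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            for i in range(kh):
                for j in range(kw):
                    s = round(s_ker[i][j])
                    n = round(n_ker[i][j])
                    pixel = image[r + i][c + j]
                    term = pixel << s if s >= 0 else pixel >> (-s)
                    out[r][c] += -term if n % 2 else term
    return out


image = [[4, 8, 4],
         [8, 16, 8],
         [4, 8, 4]]
# Equivalent filter [[-2, 2], [-2, 2]]: responds to left-to-right intensity changes.
N_ker = [[1, 0], [1, 0]]
S_ker = [[1, 1], [1, 1]]
print(conv_shift_2d(image, N_ker, S_ker))  # [[24, -24], [24, -24]]
```

The filter has both positive and negative taps, which is the situation the negation value is meant to cover: with shifts alone, this edge-detecting filter could not be expressed.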
Similarly, to deduce the backward pass, we are going to use the term :
(11) 
2 Implementation
To implement the forward and backward passes, we follow an approach similar to that of Hubara et al. (2016). We used PyTorch to define the forward pass of the 2 custom ops: LinearShift using Equation 6 and ConvShift using Equation 10. PyTorch's autograd is used to generate the backward pass. In order to emulate the precision of an actual bitwise shift hardware implementation, the input data to the LinearShift and ConvShift operators is rounded to fixed-point precision before applying the forward pass.
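The input rounding step might look like the following sketch. The fixed-point format used here (32 bits in total, 16 of them fractional) and the function names are assumptions made for illustration; the text does not state which format was used.

```python
def round_to_fixed_point(values, frac_bits=16, total_bits=32):
    """Round floats to a signed fixed-point grid and return the raw integers."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    # Saturate instead of wrapping when a value does not fit.
    return [max(lo, min(hi, round(v * scale))) for v in values]


def from_fixed_point(raw, frac_bits=16):
    """Map raw fixed-point integers back to floats."""
    return [r / (1 << frac_bits) for r in raw]


inputs = [0.1, -1.2345678, 3.0]
raw = round_to_fixed_point(inputs)
restored = from_fixed_point(raw)
# The round-trip error is at most half of one fixed-point step.
assert all(abs(a - b) <= 0.5 / (1 << 16) for a, b in zip(inputs, restored))
```

The raw integers are what a shift-based operator would consume; the float values returned by `from_fixed_point` are useful only for comparing against a floating-point reference.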
The code and some of the model binary files have been submitted in a compressed file along with the paper submission and will be made available online on GitHub upon acceptance.
3 Benchmark Results
We have tested the training and inference results on 3 datasets: MNIST (LeCun and Cortes (2010)), CIFAR10 (Krizhevsky (2009)), and Imagenet (Deng et al. (2009)). For each dataset, we have tested a group of architectures. For each architecture, we tested:

1. Original Version: evaluating the original architecture with standard convolution and linear operators.
2. DeepShift Version:
(a) Train from Scratch: converting all the convolution and linear operators to their shift counterparts, and training from scratch (i.e., random initialization of weights).
(b) Convert Original Weights: converting all the convolution and linear operators to their shift counterparts, and converting the pretrained weights of Option 1 using the following equations, with no further training:
$$S = \mathrm{round}(\log_2 |W|), \qquad N = \begin{cases} 1 & \text{if } W < 0 \\ 0 & \text{otherwise} \end{cases} \tag{12}$$
(c) Convert Original Weights + Training: converting all the convolution and linear operators to their shift counterparts, and training starting from the pretrained weights of Option 1 converted using the above equations.

For the Imagenet dataset, training from scratch (Option 2a) was not done because of the time it would consume.
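One plausible way to convert a pretrained weight is sketched below: the sign of the weight determines the negation value and the rounded base-2 logarithm of its magnitude determines the shift value. This mapping is an assumption consistent with the operator definition, not a confirmed copy of the authors' conversion code. The treatment of zero weights is also an assumption: zero has no power-of-two counterpart, so the sketch simply rejects it.

```python
import math


def convert_weight(w: float):
    """Map a real weight to (negation, shift) so that (-1)**n * 2**s approximates w."""
    if w == 0.0:
        raise ValueError("zero has no power-of-two representation")
    n = 1 if w < 0 else 0
    s = round(math.log2(abs(w)))
    return n, s


def reconstruct(n: int, s: int) -> float:
    """Return the weight value that the pair (n, s) represents."""
    return (-1) ** n * 2.0 ** s


for w in (0.3, -0.07, 1.9, -0.5):
    n, s = convert_weight(w)
    w_hat = reconstruct(n, s)
    # Rounding log2|w| keeps the ratio w_hat / w within [2**-0.5, 2**0.5].
    assert 2 ** -0.5 <= w_hat / w <= 2 ** 0.5
    print(f"{w:+.2f} -> n={n}, s={s:+d}, reconstructed {w_hat:+.4f}")
```

Under this mapping each converted weight is off by a factor of at most about 1.41 from the original, which gives some intuition for why conversion without further training loses accuracy and why additional training can recover part of it.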
3.1 MNIST Data Set
Two simple models were trained and tested on the MNIST dataset:

Simple FC: a simple fully-connected model consisting of 3 linear layers with output feature sizes of 512, 512, and 10 respectively. Dropout layers with a probability of 0.2 were inserted between the layers. Each intermediate layer was followed by a ReLU activation. The RMSProp optimizer was found to produce higher accuracies for this model.

Simple CNN: a model consisting of 2 convolutional layers and 2 linear layers. The 2 convolutional layers had output channel sizes of 20 and 50 respectively, and both had kernel sizes of 5x5 and strides of 1. Max pooling layers of window size 2x2, followed by ReLU activation, were inserted after each convolution layer. The linear layers had output feature sizes of 500 and 10 respectively. The stochastic gradient descent optimizer was found to produce higher accuracies for this model.
A learning rate of 0.01, a momentum of 0.0, and a batch size of 64 were used for training. The accuracy on the validation set is shown in Table 1. Accuracy dropped by more than 13 percentage points when the DeepShift version was trained from scratch, whereas the DeepShift version with converted pretrained weights showed only a minor accuracy reduction, and further training on top of those converted weights actually achieved higher validation accuracy than the original.
Table 1: Accuracy on the MNIST validation set.
Model  Original Version  DeepShift: Training from Scratch  DeepShift: Convert Original Weights  DeepShift: Convert Original Weights + Training
Simple FC  93.59%  78.55%  90.19%  93.78%
Simple CNN  98.91%  85.38%  98.41%  98.98%
3.2 CIFAR10 Data Set
A set of ResNet models with various depths were analyzed. The models were trained using stochastic gradient descent optimizer, with momentum of 0.9 and weight decay of . The loss criterion used was categorical cross entropy. The learning rate used to train was 0.1 and the number of epochs for training was 200.
The results of evaluating on the validation set are shown in Table 2. There was a drastic reduction in accuracy when the DeepShift models were trained from scratch. However, the DeepShift models that were trained on top of the converted pretrained weights show only a minor reduction in validation accuracy, if any: about 2.4 percentage points for ResNet20 and less than 0.5 points for every deeper model. It is worth noting that for converted weights with no further training, models with higher depth and complexity generally achieved better accuracy than those of lower complexity. This may be interpreted as the increase in model complexity compensating for the decrease in precision of the ops that were converted to ConvShift or LinearShift.
Table 2: Accuracy on the CIFAR10 validation set.
Model  Original Version  DeepShift: Training from Scratch  DeepShift: Convert Original Weights  DeepShift: Convert Original Weights + Training
ResNet20  91.73%  47.59%  83.66%  89.32%
ResNet32  92.63%  52.92%  85.84%  92.16%
ResNet44  93.10%  59.11%  87.90%  92.74%
ResNet56  93.39%  61.62%  91.03%  93.46%
ResNet110  93.68%  67.17%  90.81%  93.68%
ResNet1202  93.82%  N/A  91.22%  93.63%
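The size of the accuracy gap after conversion and training can be read off the CIFAR10 results with a few lines of arithmetic. The numbers below are copied from the results above; the variable names are illustrative.

```python
# Validation accuracies (%) for CIFAR10: (original, converted weights + training).
table2 = {
    "ResNet20": (91.73, 89.32),
    "ResNet32": (92.63, 92.16),
    "ResNet44": (93.10, 92.74),
    "ResNet56": (93.39, 93.46),
    "ResNet110": (93.68, 93.68),
    "ResNet1202": (93.82, 93.63),
}

# Positive values mean the DeepShift version is less accurate than the original.
drops = {name: round(orig - deep, 2) for name, (orig, deep) in table2.items()}
for name, drop in drops.items():
    print(f"{name:<10} {drop:+.2f} points")

worst = max(drops, key=drops.get)
```

Only the shallowest model, ResNet20, loses more than half a percentage point; ResNet56 ends up marginally above its original accuracy.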
3.3 Imagenet Data Set
The models were trained using stochastic gradient descent optimizer, with momentum of 0.9 and weight decay of . The loss criterion used was categorical cross entropy. The learning rate used to train from scratch was 0.1, while the number of epochs and learning rate used to train a pretrained converted model is specified for each model in Table 3. The learning rate for training each converted model was manually tuned by looking at the accuracy over the first few batches of training: if the accuracy dropped below that of the untrained converted DeepShift model, then the weights had "unlearnt", which meant the learning rate was too high and needed to be decreased.
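The manual tuning rule described above could be automated roughly as follows. This is a hypothetical sketch: the function name, the reduction factor of 0.1, and the sample accuracies are all invented for illustration, since the text only states that the learning rate was decreased by hand.

```python
def tune_learning_rate(early_accuracy, baseline_accuracy, lr, factor=0.1):
    """Return a reduced learning rate if early training fell below the baseline.

    early_accuracy: accuracies measured over the first few training batches.
    baseline_accuracy: accuracy of the converted model before any training.
    factor: hypothetical multiplicative reduction applied to the learning rate.
    """
    if min(early_accuracy) < baseline_accuracy:
        return lr * factor  # the weights are "unlearning": the step size is too large
    return lr


# Hypothetical numbers for illustration only.
lowered = tune_learning_rate([0.40, 0.31, 0.22], baseline_accuracy=0.41, lr=0.01)
kept = tune_learning_rate([0.43, 0.47, 0.52], baseline_accuracy=0.41, lr=0.01)
print(lowered, kept)
```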
Looking at the results in Table 3, we can see that different models had varying results. The best result obtained was for ResNet152, which had a Top-1 accuracy of 75.56% and a Top-5 accuracy of 92.75%. It is noteworthy that, due to limited time, many of the models were trained for only 4 epochs; training them for more epochs may result in better accuracies. More complex models tend to have better results when converted to DeepShift. A slim model like MobileNetv2 lost around 9 percentage points of Top-1 accuracy and around 6 points of Top-5 accuracy when all of its multiplications were removed. This is considered a strong advantage when compared to other acceleration methods (e.g., XNOR networks, quantization, or pruning) that have negative results for optimizing MobileNets. Nevertheless, other slim networks such as SqueezeNets suffered a drastic reduction in accuracy. It is yet to be analyzed why MobileNetv2 had almost 0% accuracy when its weights were converted without further training, while it had above 84% Top-5 accuracy after being trained for a few epochs.
Table 3: Top-1 / Top-5 accuracies on Imagenet.
Model  Original Version  DeepShift: Convert Original Weights  DeepShift: Convert Original Weights + Training
VGG11  69.02% / 88.63%  46.76% / 71.29%  65.61% / 86.72%
VGG11bn  70.37% / 89.81%  37.49% / 61.94%  63.52% / 85.68%
VGG13  69.93% / 89.25%  60.34% / 82.56%  68.09% / 88.22%
VGG13bn  71.59% / 90.37%  45.92% / 70.38%  57.97% / 81.83%
VGG16  71.59% / 90.38%  65.25% / 86.30%  70.28% / 89.77%
VGG16bn  73.36% / 91.52%  56.30% / 79.77%  71.98% / 90.81%
VGG19  72.38% / 90.88%  66.61% / 87.21%  69.91% / 89.46%
VGG19bn  74.22% / 91.84%  58.96% / 82.02%  72.85% / 91.15%
AlexNet  56.52% / 79.07%  42.99% / 67.40%  48.81% / 73.39%
DenseNet121  74.43% / 91.97%  46.40% / 71.95%  70.41% / 89.93%
DenseNet161  77.14% / 93.56%  61.97% / 84.64%  73.34% / 91.55%
DenseNet169  75.60% / 92.81%  39.24% / 63.93%  72.84% / 91.28%
DenseNet201  76.90% / 93.37%  51.84% / 75.83%  73.83% / 91.80%
ResNet18  69.76% / 89.08%  41.53% / 67.29%  65.81% / 86.88%
ResNet34  73.31% / 91.42%  56.26% / 80.22%  70.99% / 90.13%
ResNet50  76.13% / 92.86%  41.30% / 65.10%  68.42% / 88.66%
ResNet101  77.37% / 93.55%  52.59% / 76.57%  69.21% / 88.95%
ResNet152  78.31% / 94.05%  46.14% / 69.15%  75.56% / 92.75%
MobileNetv1  69.57% / 89.07%  0.14% / 0.67%  60.61% / 83.23%
MobileNetv2  71.81% / 90.42%  0.10% / 0.48%  62.69% / 84.77%
SqueezeNet10  58.09% / 80.42%  12.56% / 29.92%  21.71% / 44.74%
SqueezeNet11  58.18% / 80.62%  4.01% / 12.19%  15.50% / 35.25%

Further Analysis
We have analyzed the minimum and maximum shift values encountered in all the models and found that the maximum was 1 while the minimum was -70. This means that the shift values can be represented using 8 bits rather than the 32 bits used for regular convolution weights. Also, each negation value can be stored as a single bit.
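A storage layout along these lines is sketched below: the shift is kept as an 8-bit two's-complement value and the negation as one extra bit. The 9-bit packing and the function names are assumptions chosen for illustration, and the example assumes that the reported minimum shift is -70.

```python
def pack(shift: int, negate: int) -> int:
    """Pack a shift in [-128, 127] and a 1-bit negation flag into 9 bits."""
    if not -128 <= shift <= 127:
        raise ValueError("shift does not fit in 8 bits")
    return ((negate & 1) << 8) | (shift & 0xFF)


def unpack(code: int):
    """Recover (shift, negate) from a 9-bit code."""
    shift = code & 0xFF
    if shift >= 128:  # undo the two's-complement wrap of negative shifts
        shift -= 256
    return shift, (code >> 8) & 1


# Round-trip the extreme shift values, assuming the minimum is -70.
for s in (-70, 0, 1):
    for n in (0, 1):
        assert unpack(pack(s, n)) == (s, n)
```

Nine bits per weight instead of 32 would be roughly a 3.5x reduction in weight storage, before any further compression.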
4 Related Work
This section covers related work, which falls into three categories. First, innovative network architectures have been proposed that reduce model sizes as well as computation requirements. Iandola et al. (2016) proposed SqueezeNet, achieving 50x fewer parameters compared to AlexNet. Howard et al. (2017) proposed MobileNets, based on a streamlined architecture that uses depthwise separable convolutions to build lightweight DNNs, followed by MobileNetV2 (Sandler et al. (2018)), which introduced an inverted residual structure where the input and output of the residual blocks are thin bottleneck layers, opposite to traditional residual models. Chen et al. (2019) proposed reducing spatial redundancy in CNNs by replacing the usual convolution operations with Octave Convolution, a spatially less-redundant variant.
Secondly, model quantization has been an active research area and various approaches have been developed. Courbariaux et al. (2015) proposed BinaryConnect, constraining weights to only two values (+1 or -1). Rastegari et al. (2016) improved on BinaryConnect by introducing Binary Weight Networks (BWNs), which employ channel-wise scaling factors, while Li and Liu (2016) introduced Ternary Weight Networks (TWNs). Zhu et al. (2016) extended TWN with Trained Ternary Quantization (TTQ), with non-uniform and trainable scaling factors. Other works like Binarized Neural Networks (Courbariaux and Bengio (2016)), XNOR-Net (Rastegari et al. (2016)), Bi-Real Net (Liu et al. (2018)) and ABC-Net (Lin et al. (2017a)) quantize both weights and activations to either +1 or -1. Zhou et al. (2017) proposed Incremental Network Quantization (INQ) to quantize pretrained full-precision DNNs with weights constrained to zeros and powers of two. This is accomplished by iteratively partitioning the weights into two sets, one of which is quantized while the other is retrained to compensate for accuracy degradation. Cai et al. (2017) proposed low-precision activations using Half-Wave Gaussian Quantization (HWGQ) while weights are binarized using BWN. Faraone et al. (2018) introduced Symmetric Quantization (SYQ), which uses pixel-wise scaling factors and fixed-point activation quantization. DoReFa-Net (Zhou et al. (2016)), PACT (Choi et al. (2018a)) and PACT-SAWB (Choi et al. (2018b)) allow the bit-widths of weights and activations to be configured. Louizos et al. (2019) proposed uniform noise injection for non-uniform quantization (UNIQ) of both weights and activations. Lin et al. (2017b) use multi-bit quantization. Zhang et al. (2018) proposed an adaptively learnable quantizer (LQ-Nets).
Thirdly, pruning can be used to reduce model redundancy. Several pruning works have been proposed that employ various kinds of ranking mechanisms. Hanson and Pratt (1989) introduced hyperbolic and exponential biases for pruning in the late 1980s. Hassibi and Stork (1993) and LeCun et al. (1990) published some of the other early works on pruning. More recently, Han et al. (2016a) proposed Deep Compression, a method that leverages pruning, weight sharing and Huffman coding for model compression. Lin et al. (2017b) proposed Runtime Neural Pruning, a framework which prunes the deep neural network dynamically at runtime. Pruning also introduces additional complexity, since after pruning the computation units need to handle sparse matrix arithmetic, which adds some overhead.
Conclusion and Future Work
We have introduced DeepShift neural networks, which replace multiplications in the forward pass with bitwise operations (bitwise shift and negation) and can thereby lead to a dramatic reduction in computation time, power consumption, and memory requirements during inference. We have shown that the accuracies of DeepShift networks on Imagenet and other datasets are near state-of-the-art.
The hardware realization of DeepShift neural networks is yet to be done, and is needed to evaluate the actual speedup in performance. Designing parallel architectures for bitwise shift and negation on vectors rather than on individual registers may face its own challenges, but such architectures are expected to have faster execution times than their multiplication counterparts. While training from pretrained weights (rounded to their bitwise shift counterparts) results in much higher accuracies that are close to the state-of-the-art, training DeepShift networks from scratch results in low accuracies. Therefore, further research is required to find better random initialization methods for weights to enable training DeepShift networks from scratch. Also, the current design of DeepShift networks still needs multiplications in the backward pass (as non-integer powers of 2 are used). Research into using bitwise shifts in the backward pass may lead to a dramatic speedup in training as well.
Other methods that tackle neural network speed-ups, such as BNNs, perform well on small datasets (e.g., MNIST and CIFAR10) but suffer significant degradation on big datasets such as Imagenet. DeepShift networks, in contrast, proved suitable for Imagenet and can reach Top-5 accuracies above 90%.
Acknowledgments
We thank Yerlan Idelbayev for providing open-source code and model files to reproduce the accuracies of the original ResNet paper on CIFAR10 (Idelbayev [38]). We also thank the developers of PyTorch for providing example scripts and pretrained model binary files to reproduce the accuracies of various models on the Imagenet dataset.
References
 LeCun et al. [1999] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, pages 319–, London, UK, 1999. Springer-Verlag. ISBN 3540667229. URL http://dl.acm.org/citation.cfm?id=646469.691875.
 Hubara et al. [2016] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran ElYaniv, and Yoshua Bengio. Binarized neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4107–4115. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6573binarizedneuralnetworks.pdf.
 LeCun and Cortes [2010] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
 Krizhevsky [2009] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, Department of Computer Science, 2009.
 Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A LargeScale Hierarchical Image Database. In CVPR09, 2009.
 Iandola et al. [2016] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnetlevel accuracy with 50x fewer parameters and <1mb model size. CoRR, abs/1602.07360, 2016. URL http://arxiv.org/abs/1602.07360.
 Howard et al. [2017] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017. URL http://arxiv.org/abs/1704.04861.
 Sandler et al. [2018] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4510–4520, June 2018. doi: 10.1109/CVPR.2018.00474.
 Chen et al. [2019] Yunpeng Chen, Haoqi Fang, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. CoRR, abs/1904.05049, 2019.
 Courbariaux et al. [2015] Matthieu Courbariaux, Yoshua Bengio, and JeanPierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3123–3131. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5647binaryconnecttrainingdeepneuralnetworkswithbinaryweightsduringpropagations.pdf.
 Rastegari et al. [2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnornet: Imagenet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016. URL http://arxiv.org/abs/1603.05279.
 Li and Liu [2016] Fengfu Li and Bin Liu. Ternary weight networks. CoRR, abs/1605.04711, 2016. URL http://arxiv.org/abs/1605.04711.
 Zhu et al. [2016] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. CoRR, abs/1612.01064, 2016. URL http://arxiv.org/abs/1612.01064.
 Courbariaux and Bengio [2016] Matthieu Courbariaux and Yoshua Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016. URL http://arxiv.org/abs/1602.02830.
 Liu et al. [2018] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and KwangTing Cheng. Bireal net: Enhancing the performance of 1bit cnns with improved representational capability and advanced training algorithm. CoRR, abs/1808.00278, 2018. URL http://arxiv.org/abs/1808.00278.
 Lin et al. [2017a] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 345–353. Curran Associates, Inc., 2017a. URL http://papers.nips.cc/paper/6638towardsaccuratebinaryconvolutionalneuralnetwork.pdf.
 Zhou et al. [2017] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with lowprecision weights. CoRR, abs/1702.03044, 2017.
 Cai et al. [2017] Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by halfwave gaussian quantization. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5406–5414, 2017.
 Faraone et al. [2018] Julian Faraone, Nicholas J. Fraser, Michaela Blott, and Philip Heng Wai Leong. Syq: Learning symmetric quantization for efficient deep neural networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4300–4309, 2018.
 Zhou et al. [2016] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefanet: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016. URL http://arxiv.org/abs/1606.06160.
 Choi et al. [2018a] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce IJen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. PACT: Parameterized clipping activation for quantized neural networks, 2018a. URL https://openreview.net/forum?id=By5ugjyCb.
 Choi et al. [2018b] Jungwook Choi, Pierce IJen Chuang, Zhuo Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Bridging the accuracy gap for 2bit quantized neural networks (QNN). CoRR, abs/1807.06964, 2018b. URL http://arxiv.org/abs/1807.06964.
 Louizos et al. [2019] Christos Louizos, Matthias Reisser, Tijmen Blankevoort, Efstratios Gavves, and Max Welling. Relaxed quantization for discretized neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HkxjYoCqKX.
 Lin et al. [2017b] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2181–2191. Curran Associates, Inc., 2017b. URL http://papers.nips.cc/paper/6813runtimeneuralpruning.pdf.
 Zhang et al. [2018] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lqnets: Learned quantization for highly accurate and compact deep neural networks. In ECCV, 2018.
 Hanson and Pratt [1989] Stephen Jose Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with backpropagation. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 177–185. MorganKaufmann, 1989. URL http://papers.nips.cc/paper/156comparingbiasesforminimalnetworkconstructionwithbackpropagation.pdf.
 Hassibi and Stork [1993] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 164–171. MorganKaufmann, 1993. URL http://papers.nips.cc/paper/647secondorderderivativesfornetworkpruningoptimalbrainsurgeon.pdf.
 LeCun et al. [1990] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598–605. MorganKaufmann, 1990. URL http://papers.nips.cc/paper/250optimalbraindamage.pdf.
 Han et al. [2016a] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2016a.
 [30] Gemmlowp: a small self-contained low-precision GEMM library. https://github.com/google/gemmlowp. Accessed: 2019-05-23.
 [31] Intel(R) Math Kernel Library for Deep Neural Networks. https://intel.github.io/mkldnn/index.html. Accessed: 2019-05-23.
 [32] CMSIS NN library. https://www.keil.com/pack/doc/CMSIS/NN/html/index.html. Accessed: 2019-05-23.
 [33] How can Snapdragon 845's new AI boost your smartphone's IQ? https://www.qualcomm.com/news/onq/2018/02/01/howcansnapdragon845snewaiboostyoursmartphonesiq. Accessed: 2019-05-23.
 Migacz [2017] Szymon Migacz. 8bit inference with tensorrt. In GTC, 2017.
 [35] The NVIDIA Deep Learning Accelerator. http://nvdla.org/index.html. Accessed: 2019-05-23.
 Han et al. [2016b] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, and William J. Dally. Eie: Efficient inference engine on compressed deep neural network. 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 243–254, 2016b.
 Umuroglu et al. [2017] Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Heng Wai Leong, Magnus Jahre, and Kees A. Vissers. Finn: A framework for fast, scalable binarized neural network inference. In FPGA, 2017.
 [38] Yerlan Idelbayev. Proper ResNet implementation for CIFAR10/CIFAR100 in PyTorch. https://github.com/akamaster/pytorch_resnet_cifar10. Accessed: 2019-05-23.