Training Deep Neural Networks in Limited Precision
Energy and resource efficient training of DNNs will greatly extend the applications of deep learning. However, there are three major obstacles which mandate accurate calculation in high precision. In this paper, we tackle two of them, related to the loss of gradients during the parameter update and during backpropagation through a softmax nonlinearity layer in low precision training. We implemented SGD with Kahan summation by employing an additional parameter to virtually extend the bit-width of the parameters for a reliable parameter update. We also propose a simple guideline to help select the appropriate bit-width for the last FC layer followed by a softmax nonlinearity layer; it determines the lower bound of the required bit-width based on the class size of the dataset. Extensive experiments on various network architectures and benchmarks verify the effectiveness of the proposed techniques for low precision training.
Employing accelerators equipped with low precision computation elements can significantly improve the energy efficiency of operating state-of-the-art deep neural networks (DNNs) [4, 9, 34, 27, 7]. Although abundant previous works exist on converting or training DNNs for low precision inference [5, 24, 36, 12], most of them require full precision hardware (HW) for training.
Enabling training as well as inference on edge devices empowered with accelerators based on low precision computation units can open doors for many personalization applications previously restricted due to privacy issues. For example, sensitive data such as biometrics can safely be consumed for training within the trust zone of a personal device. In such scenarios, computing in low precision is critical in order to reduce power consumption and memory footprint.
A few methods have been proposed to train DNNs in limited precision [6, 8, 28, 35, 31]. However, the accuracies were not comparable to the full precision versions. Some of them even require special HW for unconventional operations, namely log-scale calculation and/or stochastic quantization [28, 12, 13, 31, 35]. As discussed in prior work, three problems have been commonly identified in training DNNs in limited precision: parameter update, softmax nonlinearity, and normalization. In previous works, these were computed in full precision to avoid catastrophic accuracy degradation.
Recently, mixed precision HW using both low and high precision units together has been proposed for accelerated training [27, 18]. The speedup arises from the reduced memory footprint and the increased number of arithmetic units enabled by the low precision operations. Massive portions of the calculations for the forward and backward propagations were accelerated by utilizing the low precision units, while the high precision units were used to carefully handle the parameter update, softmax nonlinearity, and normalizations. However, such accelerators must contain both low and high precision HW.
In this paper, we tackle two known issues in low precision training, related to the parameter update and the softmax nonlinearity. We propose to use Kahan summation in SGD for updating the network parameters with low precision computation units. In addition, we propose a simple method of selecting a sufficient bit-width for the last fully-connected layer followed by a softmax layer based on the number of output classes.
To investigate the effectiveness of the proposed methods, we selected as the target HW model an 8-bit accelerator supporting dynamic fixed-point with 8 or 16 bits for inference and training. 16-bit operation is accomplished by using the 8-bit computing units over multiple cycles without any additional HW unit. We kept the HW model as simple as possible by using uniform linear quantization without stochastic computing. Nevertheless, the proposed methods can be applied to any type of number system.
2 SGD with the Kahan summation for low precision parameter update
The parameter set w of a neural network can be optimized with the stochastic gradient descent (SGD) method as

w ← w − η g,    (1)
where L is the loss function, g = ∂L/∂w is the gradient of the loss with respect to the parameter w for a given input batch, and η is the learning rate. In general, ηg is smaller than w by a few orders of magnitude. When using limited numerical precision (e.g., 8 or 16 bits), ηg is often too small to change w due to the lack of precision when w is updated by (1). Thus, the portion of the gradient vector actually used to update the parameter can deviate significantly from the original, which can substantially deteriorate training accuracy. To overcome this issue, we propose to use an extra parameter a to maintain the partial accumulation of ηg and update the parameter only when the accumulated amount is large enough to be representable within the given precision and thus able to actually change the parameter value. The accumulation parameter acts as a carrier, aggregating the small gradients and delivering them to the parameter. The parameter update is accomplished in two stages through the accumulation parameter. This process is known as Kahan summation, which was proposed to reduce the numerical error in a total obtained by adding a sequence of finite precision numbers. We call SGD with the Kahan summation the lazy update because the gradients are not necessarily applied to the parameters at every update step.
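To make the problem concrete, the following minimal sketch (the 8-bit step of 2**-7 and the gradient magnitude are illustrative assumptions, not the paper's exact configuration) shows a small parameter update vanishing entirely under uniform linear quantization:

```python
# Illustration of the update problem: with an assumed 8-bit dynamic fixed-point
# step of 2**-7, a typical gradient step of ~1e-4 is rounded away entirely.
STEP = 2 ** -7  # quantization step for an INT8-like format over roughly [-1, 1)

def quantize(x, step=STEP):
    """Uniform linear quantization (round to the nearest representable level)."""
    return round(x / step) * step

w = quantize(0.5)         # an 8-bit representable parameter value
lr, grad = 0.01, 0.01     # eta * g = 1e-4, far below the step 2**-7 ~ 7.8e-3
w_new = quantize(w - lr * grad)
print(w_new == w)         # True: the parameter never moves
```

No matter how many times this step is repeated, the parameter stays put, since each individual update rounds back to the same level.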
When the accumulation of ηg is large enough, we update w only with the value that is within the supported precision range and preserve the remainder in the accumulator for later use. Figure 1 shows the operating principle of the lazy update in a dynamic fixed-point number system. The parameter, the gradient, and the accumulator all have the same decimal point location on the horizontal axis. If the gradients are accumulated into an accumulator over multiple iterations until its value is large enough to change the targeted parameter value, only a few bits of the accumulator will overlap with the parameter. If k bits overlap, the bit-width of the parameter can be effectively extended to (b_w + b_a − k − 1) bits excluding the sign bit, where b_w and b_a represent the number of bits for the parameter and the accumulator, respectively. For example, if we use an 8-bit integer format (INT8) for the parameter and a 16-bit integer format (INT16) for the accumulator, we can provide a maximum of 22 bits for the parameter update, which is close to the bit-width of the fraction part of the 32-bit floating point format (FP32).
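As a quick sanity check, the effective bit-width estimate (reconstructed here as b_w + b_a − k − 1 from the INT8/INT16/22-bit example in the text; one bit is reserved for the sign) reproduces the 22-bit figure:

```python
# Hedged sketch: effective parameter bit-width when an accumulator of b_a bits
# overlaps a parameter of b_w bits by k bits (formula reconstructed from the
# INT8 + INT16 -> 22-bit example in the text; one bit reserved for the sign).
def effective_bits(b_w, b_a, k):
    return b_w + b_a - k - 1

print(effective_bits(8, 16, 1))  # 22 with a minimal 1-bit overlap
```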
The lazy update can be implemented in a pure algorithmic way as shown in Algorithm 1 in low precision computing systems. In addition, it can also be applied in a simpler manner on a specialized HW. We emphasize that this idea can be applied to any numerical format: both floating-point and fixed-point representations.
Algorithm 1 Lazy update (SGD with the Kahan summation).
a is the accumulator for the lazy update shown in Figure 1.
Require: gradient g, learning rate η, previous w, previous a
Ensure: Updated w, a
  a ← a + η·g            # Accumulate gradient. Due to finite precision, only the effective value of η·g within the significant figures of a is added to a.
  w_new ← w − a          # Update w with a. Due to finite precision, only the effective value of a within the significant figures of w is applied to w.
  a ← a − (w − w_new)    # Update residual value by subtracting the applied portion; (w − w_new) differs from a due to the finite precision.
  w ← w_new              # Update parameter.
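Algorithm 1 can be sketched in plain Python with simulated fixed-point rounding. The step sizes, gradient magnitude, and iteration count below are illustrative assumptions:

```python
# A minimal sketch of Algorithm 1 (lazy update) under simulated dynamic
# fixed-point rounding. The step sizes, gradient magnitude, and iteration
# count are illustrative assumptions, not the paper's exact configuration.
W_STEP = 2 ** -7    # parameter precision (INT8-like step)
A_STEP = 2 ** -15   # accumulator precision (INT16-like step)

def q(x, step):
    """Uniform linear quantization: round to the nearest representable level."""
    return round(x / step) * step

def lazy_update(w, a, grad, lr):
    """One SGD step with the Kahan summation (Algorithm 1)."""
    a = q(a + lr * grad, A_STEP)    # accumulate the scaled gradient into a
    w_new = q(w - a, W_STEP)        # only the representable portion changes w
    a = q(a - (w - w_new), A_STEP)  # keep the unapplied residual for later
    return w_new, a

w, a = q(0.5, W_STEP), 0.0
w_plain = q(0.5, W_STEP)
for _ in range(200):                # 200 tiny updates of 1e-4 each
    w, a = lazy_update(w, a, grad=0.01, lr=0.01)
    w_plain = q(w_plain - 0.01 * 0.01, W_STEP)  # plain quantized SGD

print(w_plain)  # still 0.5: every plain update was rounded away
print(w)        # below 0.5: the lazy update delivered the accumulated gradient
```

The plain quantized update never moves because each 1e-4 step rounds back to the same 8-bit level, while the accumulator gathers the small steps until they cross the parameter's quantization threshold.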
3 Classifier with softmax and cross entropy loss
In previous works on low precision training, the last fully-connected (FC) layer was usually trained in full precision since the required precision for the last FC layer showed complicated behavior depending on the benchmark dataset [5, 27, 35]. However, there has been no clear explanation of this behavior. In this section, we explain why the last FC layer requires benchmark-dependent precision unlike other layers and propose a simple way to select an appropriate bit-width.
The last FC layer (i.e. classifier) is commonly followed by a softmax nonlinearity layer with the cross entropy loss for training in a network for a classification task as shown in Figure 1(a). The softmax layer converts the output of the last FC layer to the inferred probability of each class.
The cross entropy loss over the softmax outputs y_i is defined as

L = − Σ_i t_i log y_i,  where y_i = exp(z_i) / Σ_j exp(z_j).    (2)
Here, N is the class size (i.e. the number of classes), and z_i, y_i, and t_i are the i-th output of the last FC layer, the i-th output of the softmax layer, and the i-th element of the ground truth (GT) label, respectively.
The gradient of the cross entropy loss in (2) is calculated as

∂L/∂z_i = y_i − t_i.    (3)
Figure 1(b) shows the distribution of z, y, and ∂L/∂z at an early stage of training along with their quantized levels. Since the network is at an early stage of training, the output of the last FC layer (z in the left of Figure 1(b)) is distributed around zero. On the other hand, the output of the softmax layer (y in the center of Figure 1(b)) is concentrated at the level 1/N since its values are normalized to sum to one (i.e. Σ_i y_i = 1). For example, when N is 1000, all the values are around 0.001. Applying (3) to this case makes the gradient vector have a value close to −1 for the element with the GT label and values around 0.001 for all others, which, unfortunately, will be truncated to zero with low precision quantization (shown in the right of Figure 1(b)). We found that this one-hot-vector-like gradient tends to make training unstable. The sum of all the elements in the gradient vector is considerably biased toward a negative value since most of the small positive errors vanish due to quantization. These small values of the gradient vector play an important role in training, as noted in prior work. The magnitude of the small components is inversely proportional to the class size. Therefore, we propose to determine the bit-width for the last FC layer based on the class size.
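The truncation effect described above can be reproduced numerically. The following sketch (assuming an 8-bit quantization step of 2**-7) shows the quantized gradient collapsing to a one-hot-like vector for N = 1000:

```python
import math

# Sketch of the softmax gradient at an early training stage with N = 1000
# classes, quantized with an assumed 8-bit step of 2**-7: every off-label
# element (~1/N) truncates to zero, leaving a one-hot-like gradient.
N = 1000
z = [0.0] * N                             # logits near zero early in training
m = max(z)
e = [math.exp(v - m) for v in z]
s = sum(e)
y = [v / s for v in e]                    # softmax: every y_i is 1/N
t = [0.0] * N
t[0] = 1.0                                # one-hot ground-truth label
grad = [yi - ti for yi, ti in zip(y, t)]  # eq. (3): y - t

step = 2 ** -7                            # 8-bit step over roughly [-1, 1)
q = [round(g / step) * step for g in grad]

print(sum(grad))  # ~0: the 999 small positives balance the -0.999
print(sum(q))     # -1.0: all the small positive elements vanished
```

In full precision the gradient elements sum to (nearly) zero, but after quantization only the large negative GT element survives, producing the negatively biased gradient described above.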
The results in Section 4.1 show that simple tasks with 10 classes can be trained with dynamic INT8 for all layers. However, INT16 was required for the last FC layer to train networks on the 1000-class ImageNet benchmark even though INT8 worked for all the other layers. If we want to keep the sum of the round-off components of the gradient vector less than a tolerance ε, the following condition must be satisfied for dynamic fixed-point with linear quantization:

|class| · 2^(−b) ≤ ε,    (4)

where |class| is the class size and b is the bit-width.
Thus, we can derive a simple rule of thumb for the required bit-width b of the last FC layer followed by the softmax and cross entropy loss layers as follows:

b ≥ ⌈ log2(|class| / ε) ⌉.    (5)

We empirically found that ε = 1/16 is a reasonable choice.
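The rule of thumb in (5) can be written as a small helper. Note that the condition |class| · 2^(−b) ≤ ε with ε = 1/16 is reconstructed here from the surrounding text and the 13.97-bit figure in Section 4.2, so treat the exact form as an assumption:

```python
import math

# Helper for the rule of thumb in (5), assuming the reconstructed condition
# |class| * 2**(-b) <= eps with the empirical tolerance eps = 1/16.
def required_bits(num_classes, eps=1 / 16):
    """Lower bound on the bit-width of the last FC layer before softmax."""
    return math.ceil(math.log2(num_classes / eps))

print(required_bits(10))    # 8: INT8 suffices for the 10-class benchmarks
print(required_bits(1000))  # 14: INT16 is used for ImageNet's 1000 classes
```

Reassuringly, the helper reproduces both experimental observations: INT8 is enough for the 10-class benchmarks, while the 1000-class ImageNet benchmark needs at least 14 bits.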
4 Benchmark results
We evaluated our method on a variety of neural networks, including convolutional neural networks (CNNs) for image classification, generative adversarial networks (GANs) for image generation, and recurrent neural networks (RNNs) based on long short-term memory (LSTM) units for speech recognition. We modified Caffe and TensorFlow to support the dynamic fixed-point data formats for training.
The lazy update described in Algorithm 1 was implemented in the optimizers. We found that INT16 was necessary for the ADAM optimizer because of β2 (= 0.999 normally), while the momentum optimizer still works in INT8. For comparison purposes, the same hyperparameters were used as those for training the networks in full precision (i.e., FP32) except where otherwise noted.
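The INT16 requirement for ADAM can be seen from the size of the second-moment update: with β2 = 0.999, the per-step contribution (1 − β2)·g² falls below an 8-bit quantization step but not a 16-bit one. A quick check (the step sizes are illustrative assumptions for a dynamic fixed point covering roughly [-1, 1)):

```python
# Why the ADAM optimizer needed INT16: with beta2 = 0.999, the per-step
# second-moment contribution (1 - beta2) * g**2 is smaller than an assumed
# 8-bit quantization step but resolvable with a 16-bit step.
beta2 = 0.999
step8, step16 = 2 ** -7, 2 ** -15
update = (1 - beta2) * 0.5 ** 2   # contribution for a gradient of magnitude 0.5

print(update < step8 / 2)    # True: rounds to zero under the 8-bit step
print(update > step16 / 2)   # True: survives under the 16-bit step
```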
4.1 CNNs on MNIST, SVHN and CIFAR10
This section demonstrates the training results for four small-scale CNNs on the MNIST, SVHN, and CIFAR10 datasets. Both MNIST and SVHN are digit image datasets with 10 classes (0 to 9). MNIST consists of handwritten digit images, with 60,000 training images and 10,000 test images. SVHN is a real-world dataset containing 73,257 digit images for training and 26,032 samples for testing. CIFAR10 is composed of color images in 10 classes, with 50,000 training images and 10,000 test images.
All networks were configured with multiple convolutional (CONV) modules consisting of convolution, ReLU activation, and pooling layers. We varied the network architectures by adding batch normalization (BN) layers and skip connections to investigate the proposed method over various architectures. We trained LeNet, consisting of two CONV modules and two FC layers, on the MNIST dataset. The network for the SVHN benchmark has four CONV modules and two FC layers. CIFAR10 was trained with two networks: one has three CONV modules followed by one FC layer (CIFAR10-CNN) and the other is ResNet20 (CIFAR10-ResNet20). The CNN for SVHN and ResNet20 for CIFAR10 contained BN layers. We also investigated two optimizer types: momentum SGD and ADAM. The CNN for SVHN was trained using the ADAM optimizer (β1 = 0.9, β2 = 0.999) while the momentum optimizer (momentum = 0.9) was used for the other networks. The inputs and outputs of BN layers were also quantized, but the internal parameters and operations were kept unquantized to avoid instability during training.
Figure 3 shows the convergence of the validation accuracy measured on the benchmarks during training. We observed a huge accuracy degradation (up to 44.9%) in all cases when we trained the networks in INT8 due to the limited precision of the parameter update. The effect of limited precision is clearly shown in CIFAR10-ResNet20, where the accuracy did not improve in INT8 even after the learning rate was reduced at 120 epochs, unlike in FP32. However, when we added the lazy update to the optimizers, the accuracy converged to a level comparable with FP32 in all cases. The lazy update effectively aggregated small weight gradients to update the parameters. Table 1 summarizes the final validation accuracies for FP32, INT8, and our method (INT8 with the lazy update).
|Benchmark|FP32|INT8|INT8 with lazy update|
4.2 AlexNet on ImageNet dataset
In this section, we evaluate the lazy update on a large scale network on the ImageNet dataset. We trained AlexNet using the momentum optimizer. As mentioned in Section 3, we used INT16 for the weight parameters and for the activation and gradient values between the last FC layer and the softmax layer to accommodate the 1,000 classes of the ImageNet benchmark. From (5), the bit-width needs to be at least 14 bits (13.97 = log2(1000/ε) with ε = 1/16). We also used INT16 for the momentum optimizer (momentum = 0.9). The learning rate was initialized to 0.01 and reduced by a factor of 5 every 100k updates.
As expected, straightforward training of AlexNet in plain INT8 failed; the accuracy did not improve at all beyond the chance probability (0.001). It was possible to train the network in INT16 with some accuracy degradation (2.82%), but we obtained better accuracy by adding the lazy update in INT8, as shown in Table 2. The lazy update using INT8 led to only a 0.25% accuracy loss compared with FP32 training.
To investigate the advantage of our method from a resource perspective, we calculated the memory usage required for training AlexNet with a mini-batch of 256 samples, as shown in Figure 4. Using INT16 obviously halves the memory usage compared to FP32, and INT8 halves it again. However, implementing the lazy update in the optimizers increases memory usage. Even so, our method (INT8 + lazy update) achieves a 31.2% reduction in memory usage in addition to the improved accuracy compared to INT16.
|Model|FP32|INT16|INT8 with lazy update|
4.3 Transfer learning on sub-problems of ImageNet dataset
Transfer learning will be one of the major applications of low-precision training on edge devices. In this section, we evaluate our method in transfer learning scenarios. We made two small datasets from the ImageNet dataset by grouping classes into super classes: hunting dog and machine. The two datasets contain 58 and 44 classes, respectively. Since we use models pre-trained on the ImageNet training dataset, we constructed the new datasets from the ImageNet validation dataset to avoid reusing the same image samples during the transfer learning. The ImageNet validation dataset has 50 images for each class, of which 45 images are used for training and 5 images for validation in our experiments. Pre-trained AlexNet and Inception-v3 are used as base architectures. We replaced the classifier (i.e. the last FC layer) to adapt to the new classification tasks. Since the class sizes are small, we used INT8 for the classifier.
Table 3 summarizes the results of the transfer learning experiments. Training Inception-v3 in INT8 results in an accuracy loss of up to 4.14% on the hunting dog dataset. However, with the lazy update, there is no accuracy degradation. INT16 also shows comparable accuracy, but it requires more memory and longer operation time.
|Benchmark|FP32|INT16|INT8|INT8 with lazy update|
4.4 Image generation with GANs
We applied the lazy update scheme to train GANs in low precision fixed point formats. A GAN consists of two neural networks competing against each other. One of them (the generator) generates fake images and tries to deceive the other (the discriminator) which distinguishes whether the input images are real or not.
We trained BEGAN on the CelebA dataset (with cropping and alignment) and DCGAN on the LSUN bedrooms dataset, which are representative datasets in GAN experiments. The CelebA dataset consists of facial images, and the LSUN bedrooms dataset consists of bedroom images with 300 images held out for validation. The network parameters were quantized in INT8. However, INT16 was required for the activations to generate good quality images. The ADAM optimizer was used in INT16 with β1 = 0.5 and β2 = 0.9.
Figure 5 illustrates the images generated by BEGAN and DCGAN. The effectiveness of the lazy update is clearly shown in Figures 4(a) and 4(b). BEGAN trained in INT8 failed to produce faces altogether (Figure 4(a)), while applying the lazy update created faces that were indistinguishable from those trained in full precision (Figure 4(b)). We observed the same tendency for DCGAN on the bedrooms dataset, as shown in Figures 4(c) and 4(d).
4.5 LSTM on TIDIGITS dataset
To investigate the effectiveness of the lazy update for training RNNs in low precision fixed point formats, a simple LSTM network was trained on the TIDIGITS dataset. The TIDIGITS dataset, which is a set of spoken digits (“zero” to “nine” plus “oh”) for classification, has 2,464 digit samples in the training set and 2,486 samples in the test set. Individual digits were transformed into 39-dimensional Mel-Frequency Cepstral Coefficient (MFCC) feature vectors using a 25 ms window, 10 ms frame shift, and 20 filter bank channels. The labels for “oh” and “zero” were collapsed to a single label. The network has a single LSTM layer consisting of 200 units. The final state of the LSTM layer was fed to an FC layer with 200 units followed by a classification layer for the 10 digit classes. We used the ADAM optimizer (β1 = 0.9, β2 = 0.999) for training.
Figure 6 shows the validation accuracy measured during training on the TIDIGITS dataset. Training in INT8 failed to converge to the accuracy of FP32. However, by using the lazy update in the optimizer, the accuracy loss improved from 22.41% to 0.21%, a level competitive with full precision (see Table 4).
|Benchmark|FP32|INT8|INT8 with lazy update|
5 Related works
There has been much effort to train deep neural networks (DNNs) operating in low precision for efficient inference [5, 12, 31, 35]. These works mainly focused on obtaining networks with weights or activations in low precision. Other tensors for training, such as the gradient vectors, required a full precision data format. Thus, training was done on hardware with full precision. For example, BinaryConnect proposed training with binary weights. BNN extended binarization to the activations. QNN increased the bit-width of activations to 2, 4, and 6 bits in order to improve the training accuracy. However, in all of these works, training was done in full precision.
XNOR-net attempted to binarize all the tensors including the gradients. Along these lines, most of the multiplications were substituted with additions in both forward and backward propagation. DoReFa-Net also proposed a method for training DNNs with quantized weights, activations, and gradients. This work used a different bit-width for each tensor to further improve accuracy. XNOR-net and DoReFa-Net quantized the activation gradients, but they used full precision for the weight gradients and kept the master parameters in FP32 for the update.
Flexpoint introduced a way to implement dynamic fixed point for training. It estimates the exponent value based on the history of the maximum values to prevent overflow, and achieved accuracy similar to FP32 on various tasks. However, this numerical format requires specialized hardware and complicated calculation of the exponent value. A mixed precision training method avoids such overhead by scanning the maximum value of each tensor to determine the exponent value. However, it used an FP32 accumulator to prevent overflow during accumulation of the results of fixed point multiplications. Another mixed precision method proposed to use both half precision floating point (FP16) and FP32 for training. The main datapaths for forward and backward propagation were accelerated using FP16 while the parameter update, softmax nonlinearity, and normalizations were done in FP32.
6 Conclusion

In this paper, we addressed two major problems in low precision training. We proposed the lazy update to avoid the problem caused by precision shortage in the parameter update of DNNs on HW with limited precision computation units. The lazy update employs an additional parameter to keep the partial accumulation of small gradient values for a reliable update of the parameters during training. With the proposed methods, DNNs with various network architectures (CNN, GAN, and LSTM) can be trained from scratch on various benchmarks (MNIST, SVHN, CIFAR10, ImageNet, CelebA, LSUN bedrooms, and TIDIGITS) without accuracy degradation.
We also proposed a simple guideline to help select the appropriate bit-width for the last FC layer followed by a softmax nonlinearity layer. It determines the lower bound of the required bit-width based on the class size of the benchmark. Further study is required to alleviate the requirement of high precision computation in normalization layers and move toward energy efficient training of DNNs.
References

-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
-  D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
-  L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
-  Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017.
-  M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
-  M. Courbariaux, J.-P. David, and Y. Bengio. Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024v4, 2015.
-  D. Das, N. Mellempudi, D. Mudigere, D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, et al. Mixed precision training of convolutional neural networks using integer operations. arXiv preprint arXiv:1802.00930, 2018.
-  S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. arXiv preprint arXiv:1502.02551, 2015.
-  S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. Eie: efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 243–254. IEEE, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.
-  I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
-  W. Kahan. Further remarks on reducing truncation errors. Communications of the ACM, 8(1):40, January 1965.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  U. Köster, T. Webb, X. Wang, M. Nassar, A. K. Bansal, W. Constable, O. Elibol, S. Gray, S. Hall, L. Hornof, et al. Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In Advances in Neural Information Processing Systems, pages 1742–1752, 2017.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Y. LeCun, C. Cortes, and C. Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
-  R. G. Leonard and G. Doddington. Tidigits. Linguistic Data Consortium, Philadelphia, 1993.
-  F. Li, B. Zhang, and B. Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
-  Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
-  B. McMahan and D. Ramage. Federated learning: Collaborative machine learning without centralized training data. Google Research Blog, 2017.
-  P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaev, G. Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
-  D. Miyashita, E. H. Lee, and B. Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025v2, 2016.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
-  S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. Cambricon-x: An accelerator for sparse neural networks. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pages 1–12. IEEE, 2016.
-  S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
-  C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.