Trained Ternary Quantization
Abstract
Deep neural networks are widely used in machine learning applications. However, deploying large neural network models on mobile devices with limited power budgets can be difficult. To solve this problem, we propose Trained Ternary Quantization (TTQ), a method that reduces the precision of weights in neural networks to ternary values. This method causes very little accuracy degradation and can even improve the accuracy of some models (32-, 44- and 56-layer ResNet) on CIFAR-10 and of AlexNet on ImageNet. Moreover, our AlexNet model is trained from scratch, which means it is as easy to train as a normal full-precision model. We highlight that our trained quantization method learns both the ternary values and the ternary assignment. During inference, only the ternary values (2-bit weights) and scaling factors are needed, so our models are nearly 16× smaller than full-precision models. Our ternary models can also be viewed as sparse binary-weight networks, which can potentially be accelerated with custom circuits. Experiments on CIFAR-10 show that the ternary models obtained by our trained quantization method outperform full-precision ResNet-32, -44 and -56 models by 0.04%, 0.16% and 0.36%, respectively. On ImageNet, our model outperforms the full-precision AlexNet model by 0.3% of Top-1 accuracy and outperforms previous ternary models by 3%.
Trained Ternary Quantization
Chenzhuo Zhu† (†Work done while at Stanford CVA lab.)

Tsinghua University 
zhucz13@mails.tsinghua.edu.cn 
Song Han 

Stanford University 
songhan@stanford.edu 
Huizi Mao 

Stanford University 
huizi@stanford.edu 
William J. Dally 

Stanford University 
NVIDIA 
dally@stanford.edu 
1 Introduction
Deep neural networks are becoming the preferred approach for many machine learning applications. However, as networks get deeper, deploying a network with a large number of parameters on a small device becomes increasingly difficult. Much work has been done to reduce the size of networks. Half-precision networks (Amodei et al., 2015) cut the size of neural networks in half. XNOR-Net (Rastegari et al., 2016), DoReFa-Net (Zhou et al., 2016) and network binarization (Courbariaux et al.; Courbariaux et al., 2015; Lin et al., 2015) use aggressively quantized weights, activations and gradients to further reduce computation during training. While weight binarization benefits from smaller model size, the extreme compression rate comes with a loss of accuracy. Hubara et al. (2016) and Li & Liu (2016) propose ternary weight networks to trade off between model size and accuracy.
In this paper, we propose Trained Ternary Quantization, which uses two full-precision scaling coefficients $W^p_l$ and $W^n_l$ for each layer $l$, and quantizes the weights to $\{-W^n_l, 0, +W^p_l\}$ instead of the traditional $\{-1, 0, +1\}$ or $\{-E, 0, +E\}$, where $E$ is the mean of the absolute weight values and is not learned. Our positive and negative weights have different absolute values, and $W^p_l$ and $W^n_l$ are trainable parameters. We also maintain latent full-precision weights at training time and discard them at test time. We back-propagate the gradient both to $W^p_l$ and $W^n_l$ and to the latent full-precision weights. This makes it possible to adjust the ternary assignment (i.e., which of the three values a weight is assigned).
Our quantization method achieves higher accuracy on the CIFAR-10 and ImageNet datasets. For AlexNet on the ImageNet dataset, our method outperforms the previous state-of-the-art ternary network (Li & Liu, 2016) by 3.0% of Top-1 accuracy and the full-precision model by 1.6%. By converting most of the parameters to 2-bit values, we also compress the network by about 16×. Moreover, the advantage of few multiplications still remains, because $W^p_l$ and $W^n_l$ are fixed for each layer during inference. On custom hardware, multiplications can be pre-computed on activations, so only two multiplications per activation are required.
2 Motivations
Deep neural networks, once deployed to mobile devices, offer lower latency, no reliance on the network, and better user privacy. However, energy efficiency becomes the bottleneck for deploying deep neural networks on mobile devices because mobile devices are battery-constrained. Current deep neural network models contain hundreds of millions of parameters. Reducing the size of a DNN model makes deployment on edge devices easier.
First, a smaller model means less overhead when exporting models to clients. Take autonomous driving for example: Tesla periodically copies new models from its servers to customers’ cars. Smaller models require less communication in such over-the-air updates, making frequent updates more feasible. Another example is the Apple App Store: apps above 100 MB will not download until the device connects to Wi-Fi, so it is infeasible to put a large DNN model in an app. The second issue is energy consumption. Deep learning is energy-consuming, which is problematic for battery-constrained mobile devices. As a result, iOS 10 requires the iPhone to be plugged into a charger while performing photo analysis. Fetching DNN models from memory takes more than two orders of magnitude more energy than arithmetic operations. Smaller neural networks require less memory bandwidth to fetch the model, saving energy and extending battery life. The third issue is area cost. When deploying DNNs on Application-Specific Integrated Circuits (ASICs), a sufficiently small model can be stored directly on-chip, and smaller models enable a smaller ASIC die.
Several previous works aimed to improve the energy and spatial efficiency of deep networks. One common strategy proven useful is to quantize 32-bit weights to one or two bits, which greatly reduces model size and saves memory references. However, experimental results show that compressed weights usually come with degraded performance, which is a great loss for performance-sensitive applications. The contradiction between compression and performance motivates us to work on trained ternary quantization, minimizing the performance degradation of deep neural networks while saving as much energy and space as possible.
3 Related Work
3.1 Binary Neural Network (BNN)
Lin et al. (2015) proposed binary and ternary connections to compress neural networks and speed up computation during inference. They used similar probabilistic methods to convert 32-bit weights into binary values or ternary values, defined as:
$$
w^b_l = \begin{cases} +1, & \text{with probability } p = \sigma(w_l) \\ -1, & \text{with probability } 1 - p \end{cases}
\qquad
w^t_l = \begin{cases} +1, & \text{with probability } w_l, \text{ if } w_l \in (0, 1] \\ 0, & \text{otherwise} \\ -1, & \text{with probability } -w_l, \text{ if } w_l \in [-1, 0) \end{cases} \tag{1}
$$
where $\sigma(x) = \mathrm{clip}\big(\tfrac{x+1}{2}, 0, 1\big)$ is the hard sigmoid.
Here $w^b_l$ and $w^t_l$ denote the binary and ternary weights after quantization, and $w_l$ denotes the latent full-precision weight.
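For concreteness, the stochastic quantization above can be sketched in NumPy (a minimal sketch under the usual assumption that weights are clipped to [-1, 1]; the function names are ours, not from the paper):

```python
import numpy as np

def hard_sigmoid(x):
    # sigma(x) = clip((x + 1) / 2, 0, 1)
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def stochastic_binarize(w, rng):
    # w^b = +1 with probability p = sigma(w), -1 with probability 1 - p
    p = hard_sigmoid(w)
    return np.where(rng.random(w.shape) < p, 1.0, -1.0)

def stochastic_ternarize(w, rng):
    # w^t = sign(w) with probability |w|, 0 otherwise (w assumed in [-1, 1])
    keep = rng.random(w.shape) < np.abs(w)
    return np.sign(w) * keep

rng = np.random.default_rng(0)
w = np.clip(rng.normal(size=5), -1.0, 1.0)
print(stochastic_binarize(w, rng))
print(stochastic_ternarize(w, rng))
```

Both quantizers are unbiased in expectation, which is why the straight-through gradient in Equation 2 is a reasonable estimator.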
During backpropagation, as the above quantization equations are not differentiable, derivatives of expectations of the Bernoulli distribution are computed instead, yielding the identity function:
$$
\frac{\partial L}{\partial w_l} = \frac{\partial L}{\partial w^b_l}, \qquad \frac{\partial L}{\partial w_l} = \frac{\partial L}{\partial w^t_l} \tag{2}
$$
Here $L$ is the loss to optimize.
For BNN with binary connections, only quantized binary values are needed for inference. Therefore a smaller model can be deployed into applications.
3.2 DoReFa-Net
Zhou et al. (2016) proposed DoReFa-Net, which quantizes the weights, activations and gradients of neural networks using different bit widths. Therefore, with specifically designed low-bit multiplication algorithms or hardware, both training and inference can be accelerated.
They also introduced a much simpler method to quantize 32-bit weights to binary values, defined as:
$$
w^b_l = \mathbb{E}\big(|w_l|\big) \times \mathrm{sign}(w_l) \tag{3}
$$
Here $\mathbb{E}(|w_l|)$, the mean of the absolute values of the full-precision weights in layer $l$, serves as a layer-wise scaling factor. During backpropagation, Equation 2 still applies.
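This binarization rule (Equation 3) is a one-liner; a sketch assuming NumPy:

```python
import numpy as np

def dorefa_binarize(w):
    # w^b = E(|w|) * sign(w): binary signs scaled by the layer's mean absolute value
    return np.mean(np.abs(w)) * np.sign(w)

w = np.array([0.5, -1.0, 0.25, -0.25])
print(dorefa_binarize(w))  # mean |w| = 0.5, so every weight becomes +/-0.5
```

Because the scale is a single per-layer scalar, the expensive part of a layer can still be computed with pure sign flips, followed by one multiplication.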
3.3 Ternary Weight Networks
Li & Liu (2016) proposed TWN (Ternary Weight Networks), which reduces the accuracy loss of binary networks by introducing zero as a third quantized value. They use two symmetric thresholds $\pm\Delta_l$ and a scaling factor $W_l$ for each layer $l$ to quantize the weights to $\{-W_l, 0, +W_l\}$:
$$
w^t_l = \begin{cases} +W_l, & w_l > \Delta_l \\ 0, & |w_l| \le \Delta_l \\ -W_l, & w_l < -\Delta_l \end{cases} \tag{4}
$$
They then solve an optimization problem, minimizing the L2 distance between the full-precision and ternary weights, to obtain layer-wise values of $\Delta_l$ and $W_l$:
$$
\Delta^*_l,\ W^*_l = \underset{\Delta_l,\, W_l}{\arg\min} \big\| w_l - w^t_l \big\|_2^2
\;\Longrightarrow\;
\Delta^*_l \approx 0.7 \times \mathbb{E}\big(|w_l|\big), \quad W^*_l = \underset{i \in \{i \,\mid\, |w_l(i)| > \Delta_l\}}{\mathbb{E}}\big(|w_l(i)|\big) \tag{5}
$$
Equation 2 is again used to calculate gradients. While an additional bit is required for ternary weights, TWN achieves validation accuracy very close to that of full-precision networks according to their paper.
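The TWN rule above (Equations 4 and 5, using the approximate closed-form solutions) can be sketched as follows; the function name is ours, and NumPy is assumed:

```python
import numpy as np

def twn_quantize(w):
    # Approximate solutions from Eq. 5: Delta ~ 0.7 * E(|w|),
    # W = mean of |w| over the weights above the threshold
    delta = 0.7 * np.mean(np.abs(w))
    mask = np.abs(w) > delta
    W = np.mean(np.abs(w[mask])) if mask.any() else 0.0
    return np.where(mask, W * np.sign(w), 0.0)

w = np.array([1.0, -1.0, 0.1, -0.1])
print(twn_quantize(w))
```

Note that, unlike TTQ, the positive and negative values share one magnitude $W_l$ and neither $\Delta_l$ nor $W_l$ is learned by gradient descent.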
3.4 Deep Compression
Han et al. (2015) proposed Deep Compression to prune away trivial connections and reduce the precision of weights. Unlike the above methods, which use zero or symmetric thresholds to quantize high-precision weights, Deep Compression uses clusters to categorize weights into groups. In Deep Compression, low-precision weights are fine-tuned from a pre-trained full-precision network; the cluster assignment of each weight is established at the beginning and stays unchanged, while the representative value of each cluster is updated throughout fine-tuning.
4 Method
Our method is illustrated in Figure 1. First, we normalize the full-precision weights to the range [-1, +1] by dividing each weight by the maximum absolute weight. Next, we quantize the intermediate full-resolution weights to {-1, 0, +1} by thresholding. The threshold factor $t$ is a hyper-parameter that is the same across all layers in order to reduce the search space. Finally, we perform trained quantization by back-propagating two gradients, as shown by the dashed lines in Figure 1. We back-propagate one gradient to the full-resolution weights and the other to the scaling coefficients. The former enables learning the ternary assignments, and the latter enables learning the ternary values.
At inference time, we throw away the full-resolution weights and only use ternary weights.
4.1 Learning both Ternary Values and Ternary Assignments
During gradient descent we learn both the quantized ternary values (the codebook) and the assignment of each weight to one of these values (the codebook index).
To learn the ternary values (the codebook), we introduce two quantization factors, $W^p_l$ and $W^n_l$, for the positive and negative weights in each layer $l$. During feed-forward, quantized ternary weights are calculated as:
$$
w^t_l = \begin{cases} W^p_l, & \tilde{w}_l > \Delta_l \\ 0, & |\tilde{w}_l| \le \Delta_l \\ -W^n_l, & \tilde{w}_l < -\Delta_l \end{cases} \tag{6}
$$
Unlike previous work, where quantized weights are calculated from the 32-bit weights, the scaling coefficients $W^p_l$ and $W^n_l$ are two independent parameters and are trained together with the other parameters. Following the rule of gradient descent, the derivatives of $W^p_l$ and $W^n_l$ are calculated as:
$$
\frac{\partial L}{\partial W^p_l} = \sum_{i \in I^p_l} \frac{\partial L}{\partial w^t_l(i)}, \qquad
\frac{\partial L}{\partial W^n_l} = \sum_{i \in I^n_l} \frac{\partial L}{\partial w^t_l(i)} \tag{7}
$$
Here $I^p_l = \{i \mid \tilde{w}_l(i) > \Delta_l\}$ and $I^n_l = \{i \mid \tilde{w}_l(i) < -\Delta_l\}$. Furthermore, because of the two scaling factors, gradients of the latent full-precision weights can no longer be calculated by Equation 2. We use scaled gradients for the 32-bit weights:
$$
\frac{\partial L}{\partial \tilde{w}_l} = \begin{cases} W^p_l \times \dfrac{\partial L}{\partial w^t_l}, & \tilde{w}_l > \Delta_l \\[4pt] 1 \times \dfrac{\partial L}{\partial w^t_l}, & |\tilde{w}_l| \le \Delta_l \\[4pt] W^n_l \times \dfrac{\partial L}{\partial w^t_l}, & \tilde{w}_l < -\Delta_l \end{cases} \tag{8}
$$
Note that we use the scalar 1 as the gradient factor for zero weights. The overall quantization process is illustrated in Figure 1. The evolution of the ternary weights from different layers during training is shown in Figure 2. We observe that as training proceeds, different layers behave differently: for the first quantized conv layer, the absolute values of $W^p_l$ and $W^n_l$ get smaller and sparsity gets lower, while for the last conv layer and the fully connected layer, the absolute values of $W^p_l$ and $W^n_l$ get larger and sparsity gets higher.
We learn the ternary assignments (index to the codebook) by updating the latent fullresolution weights during training. This may cause the assignments to change between iterations. Note that the thresholds are not constants as the maximal absolute values change over time. Once an updated weight crosses the threshold, the ternary assignment is changed.
The benefits of using trained quantization factors are: i) the asymmetry of $W^p_l \neq W^n_l$ gives neural networks more model capacity; ii) quantized weights play the role of "learning rate multipliers" during back-propagation.
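As a concrete illustration, here is a minimal NumPy sketch of the forward quantization (Equation 6) and the two gradient rules (Equations 7 and 8). The function names are our own, and we make the minus sign from $w^t = -W^n_l$ on the negative side explicit in the $W^n_l$ gradient:

```python
import numpy as np

def ttq_quantize(w_latent, wp, wn, t=0.05):
    """Forward pass: normalize, threshold at Delta = t * max|w|, scale (Eq. 6)."""
    w = w_latent / np.max(np.abs(w_latent))   # normalize to [-1, 1]
    delta = t * np.max(np.abs(w))             # equals t after normalization
    pos = w > delta                           # index set I^p_l
    neg = w < -delta                          # index set I^n_l
    w_t = wp * pos - wn * neg                 # ternary weights {+Wp, 0, -Wn}
    return w_t, pos, neg

def ttq_gradients(grad_wt, pos, neg, wp, wn):
    """Backward pass: scaling-factor gradients (Eq. 7), latent-weight gradients (Eq. 8)."""
    grad_wp = np.sum(grad_wt[pos])            # sum over I^p_l
    grad_wn = -np.sum(grad_wt[neg])           # chain rule through w^t = -W^n
    scale = np.where(pos, wp, np.where(neg, wn, 1.0))
    grad_w_latent = scale * grad_wt           # scaled straight-through gradient (Eq. 8)
    return grad_wp, grad_wn, grad_w_latent
```

Because the latent weights receive a (scaled) gradient, an updated weight can cross the threshold between iterations, which is exactly how the ternary assignment changes.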
4.2 Quantization Heuristic
In previous work on ternary weight networks, Li & Liu (2016) proposed TWN, using $\pm\Delta_l$ as thresholds to reduce 32-bit weights to ternary values, where $\Delta_l$ is defined as in Equation 5. They optimized the value of $\Delta_l$ by minimizing the expected L2 distance between full-precision and ternary weights. Instead of using a strictly optimized threshold, we adopt different heuristics: 1) use the maximum absolute value of the weights as a reference for the layer's threshold and maintain a constant factor $t$ for all layers:
$$
\Delta_l = t \times \max\big(|\tilde{w}_l|\big) \tag{9}
$$
and 2) maintain a constant sparsity for all layers throughout training. By adjusting this sparsity hyper-parameter we are able to obtain ternary weight networks with various sparsities. We use the first method and set $t$ to 0.05 in experiments on the CIFAR-10 and ImageNet datasets, and use the second one to explore a wider range of sparsities in Section 6.1.1.
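A short sketch of the first heuristic (Equation 9) and the sparsity it induces, assuming NumPy and our own function names:

```python
import numpy as np

def layer_threshold(w, t=0.05):
    # Heuristic 1: Delta_l = t * max(|w_l|), with the same factor t for every layer
    return t * np.max(np.abs(w))

def layer_sparsity(w, delta):
    # Fraction of weights that fall inside [-delta, delta] and quantize to zero
    return float(np.mean(np.abs(w) <= delta))

rng = np.random.default_rng(0)
w = rng.normal(size=100_000)
delta = layer_threshold(w, t=0.05)
print(f"threshold={delta:.3f}, sparsity={layer_sparsity(w, delta):.2%}")
```

Because each layer keeps its own $\max(|\tilde{w}_l|)$, a single shared $t$ still yields a different threshold, and hence a different sparsity, per layer.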
5 Experiments
5.1 CIFAR-10
CIFAR-10 is an image classification benchmark containing 32×32 RGB images, with a training set of 50,000 and a test set of 10,000. The ResNet (He et al., 2015) structure is used for our experiments.
We use parameters pre-trained from a full-precision ResNet to initialize our model. The learning rate is set to 0.1 at the beginning and scaled by 0.1 at epochs 80, 120 and 300. An L2-normalized weight decay of 0.0002 is used as a regularizer. Most of our models converge after 160 epochs. We take a moving average over the errors of all epochs to filter out fluctuations when reporting error rates.
We compare our model with the full-precision model and a binary-weight model. We train a full-precision ResNet (He et al., 2016) on CIFAR-10 as the baseline (blue line in Figure 3). We fine-tune the trained baseline network as a 1-32-32 DoReFa-Net, where weights are 1 bit and both activations and gradients are 32 bits, giving a significant loss of accuracy (green line). Finally, we fine-tune the baseline with trained ternary weights (red line). Our model shows a substantial accuracy improvement over the binary-weight model, and our loss of accuracy relative to the full-precision model is small. We also compare our model to the Ternary Weight Network (TWN) on ResNet-20. Results show that our model achieves higher accuracy on CIFAR-10.
We extend our experiments to ternarize ResNets with 32, 44 and 56 layers. All ternary models are fine-tuned from full-precision models. Our results show that we improve the accuracy of ResNet-32, ResNet-44 and ResNet-56 by 0.04%, 0.16% and 0.36%, respectively. The deeper the model, the larger the improvement. We conjecture that this is because ternary weights provide the right model capacity and prevent overfitting for deeper networks.
Model  Full precision error (%)  Ternary error (%, ours)  Improvement (%)

ResNet-20  8.23  8.87  −0.64
ResNet-32  7.67  7.63  0.04
ResNet-44  7.18  7.02  0.16
ResNet-56  6.80  6.44  0.36
5.2 ImageNet
We further train and evaluate our model on ILSVRC12 (Russakovsky et al., 2015), a 1000-category dataset with over 1.2 million images in the training set and 50 thousand images in the validation set. Images from ILSVRC12 also have various resolutions. For all models in our experiments, we used a variant of the AlexNet (Krizhevsky et al., 2012) structure, removing the dropout layers and adding batch normalization (Ioffe & Szegedy, 2015). The same variant is also used in the experiments described in the DoReFa-Net paper.
Our ternary model of AlexNet uses full-precision weights for the first convolution layer and the last fully-connected layer; all other layer parameters are quantized to ternary values. We train our model on ImageNet from scratch using the Adam optimizer (Kingma & Ba, 2014). Mini-batch size is set to 128. The learning rate is scaled by 0.2 at epochs 56 and 64. An L2-normalized weight decay is used as a regularizer. Images are first resized and then randomly cropped before input. We report both Top-1 and Top-5 error rates on the validation set.
We compare our model to a full-precision baseline, a 1-32-32 DoReFa-Net, and TWN. After around 64 epochs, the validation error of our model dropped significantly below that of the other low-bit networks as well as the full-precision baseline. Finally, our model reaches a Top-1 error rate of 42.5%, while DoReFa-Net gets 46.1% and TWN gets 45.5%. Furthermore, our model still outperforms full-precision AlexNet (the batch normalization version, 44.1% according to the DoReFa-Net paper) by 1.6%, and is even better than the best AlexNet results reported (42.8%, per https://github.com/BVLC/caffe/wiki/Models-accuracy-on-ImageNet-2012-val). The complete results are listed in Table 2.
Error  Full precision (best reported)  DoReFa-Net (1-32-32)  TWN  Ours (TTQ)

Top-1  42.8%  46.1%  45.5%  42.5%
Top-5  19.7%  23.7%  23.2%  20.3%
We plot the training process in Figure 4; the baseline results of AlexNet are marked with dashed lines. Our ternary model effectively reduces the gap between training and validation performance, which is quite large for DoReFa-Net and TWN. This indicates that adopting trainable $W^p_l$ and $W^n_l$ helps prevent models from overfitting to the training set.
We also report the results of our method on ResNet-18B in Table 3. The full-precision error rates are obtained from Facebook's implementation. Here we cite Binary Weight Network (BWN) (Rastegari et al., 2016) results with all layers quantized, and TWN results fine-tuned from a full-precision network, while we train our TTQ model from scratch. Compared with BWN and TWN, our method obtains a substantial improvement.
Error  Full precision  BWN  TWN  Ours (TTQ)

Top-1  30.4%  39.2%  34.7%  33.4%
Top-5  10.8%  17.0%  13.8%  12.8%
6 Discussion
In this section we analyze the performance of our model with regard to weight compression and inference speedup. These two goals are achieved by reducing bit precision and introducing sparsity. We also visualize convolution kernels in quantized convolution layers and find that the basic patterns of edge/corner detectors are well learned from scratch even when precision is low.
6.1 Spatial and energy efficiency
We save about 16× storage by using ternary weights. Although switching from a binary-weight network to a ternary-weight network increases bits per weight, it brings sparsity to the weights, which gives the potential to skip computation on zero weights and achieve higher energy efficiency.
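The claimed savings can be sanity-checked with a back-of-the-envelope calculation (a sketch under our own assumptions: 2-bit codes for ternary weights plus two 32-bit scaling factors per layer):

```python
def compression_ratio(n_weights, n_layers=1, scale_bits=32, code_bits=2):
    # Full-precision storage vs. ternary codes + per-layer {Wp, Wn} scaling factors
    full_bits = 32 * n_weights
    ternary_bits = code_bits * n_weights + 2 * scale_bits * n_layers
    return full_bits / ternary_bits

# An AlexNet-scale model (~60M parameters) compresses by roughly 16x
print(round(compression_ratio(60_000_000, n_layers=8), 2))  # -> 16.0
```

The per-layer scaling factors are negligible next to the weights themselves, so the ratio is essentially 32 bits / 2 bits = 16×.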
Layer  Full precision  Pruning (NIPS’15)  Ours  

Density  Width  Density  Width  Density  Width  
conv1  100%  32 bit  84%  8 bit  100%  32 bit 
conv2  100%  32 bit  38%  8 bit  23%  2 bit 
conv3  100%  32 bit  35%  8 bit  24%  2 bit 
conv4  100%  32 bit  37%  8 bit  40%  2 bit 
conv5  100%  32 bit  37%  8 bit  43%  2 bit 
conv total  100%    37%    33%   
fc1  100%  32 bit  9%  5 bit  30%  2 bit 
fc2  100%  32 bit  9%  5 bit  36%  2 bit 
fc3  100%  32 bit  25%  5 bit  100%  32 bit 
fc total  100%    10%    37%   
All total  100%    11%    37%   
6.1.1 Trade-off between sparsity and accuracy
Figure 5 shows the relationship between sparsity and accuracy. As the sparsity of weights grows from 0 (a pure binaryweight network) to 0.5 (a ternary network with 50% zeros), both the training and validation error decrease. Increasing sparsity beyond 50% reduces the model capacity too far, increasing error. Minimum error occurs with sparsity between 30% and 50%.
We introduce only one hyper-parameter to reduce the search space. This hyper-parameter can be either the sparsity, or the threshold factor $t$ w.r.t. the max value as in Equation 9. We find that using the threshold produces better results, because fixing the threshold allows the sparsity of each layer to vary (Figure 2).
6.1.2 Sparsity and efficiency of AlexNet
We further analyze the parameters of our AlexNet model. We calculate the layer-wise density (the complement of sparsity) as shown in Table 4. Although we use different $W^p_l$ and $W^n_l$ for each layer, the corresponding multiplications can still be pre-computed when ternary weights are fetched from memory, so multiplications during the convolution and inner-product operations are still saved. Compared to Deep Compression, we accelerate inference using ternary values and, more importantly, we reduce the energy consumption of inference by saving memory references and multiplications, while achieving higher accuracy.
We notice that, even without tuning a separate $t$ for each quantized layer in Equation 9, our model achieves considerable sparsity in the convolution layers, where the majority of computation takes place. We are therefore able to squeeze forward time to less than 30% of that of full-precision networks.
As for spatial compression, by substituting 32-bit weights with 2-bit ternary weights, our model is approximately 16× smaller than the original 32-bit AlexNet.
6.2 Kernel Visualization
We visualize quantized convolution kernels in Figure 6. The left matrix shows kernels from the second convolution layer and the right one from the third. We pick the first 10 input channels and the first 10 output channels of each layer to display. Grey, black and white represent zero, negative and positive weights, respectively.
We observe filter patterns similar to those of full-precision AlexNet: edge and corner detectors of various directions can be found among the listed kernels. While these patterns are important for convolutional neural networks, the precision of each weight is not. Ternary-valued filters are capable enough of extracting key features after a full-precision first convolution layer, while saving unnecessary storage.
Furthermore, we find a number of empty filters (all zeros), and filters with a single non-zero value, in the convolution layers. More aggressive pruning can be applied to remove these redundant kernels to further compress and speed up our model.
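Such redundant kernels are easy to find programmatically; a minimal sketch (our own helper, not from the paper), assuming NumPy:

```python
import numpy as np

def find_redundant_filters(kernels):
    """kernels: ternary weights of shape (out_channels, in_channels, kh, kw)."""
    flat = kernels.reshape(kernels.shape[0], -1)
    nonzeros = np.count_nonzero(flat, axis=1)
    empty = np.flatnonzero(nonzeros == 0)      # all-zero filters
    single = np.flatnonzero(nonzeros == 1)     # filters with one non-zero weight
    return empty, single

k = np.zeros((3, 2, 3, 3))
k[1, 0, 1, 1] = 1.0                 # filter 1 has a single non-zero weight
k[2] = np.ones((2, 3, 3))           # filter 2 is dense
empty, single = find_redundant_filters(k)
print(empty, single)                # filter 0 is empty, filter 1 has one non-zero
```

Filters flagged this way could be pruned outright, since an all-zero kernel contributes nothing to its output channel.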
7 Conclusion
We introduce a novel neural network quantization method that compresses network weights to ternary values. We introduce two trained scaling coefficients, $W^p_l$ and $W^n_l$, for each layer and train these coefficients using back-propagation. During training, the gradients are back-propagated both to the latent full-resolution weights and to the scaling coefficients. We use layer-wise thresholds that are proportional to the maximum absolute values to quantize the weights. When deploying the ternary network, only the ternary weights and scaling coefficients are needed, reducing parameter size by at least 16×. Experiments show that our model reaches or even surpasses the accuracy of full-precision models on both the CIFAR-10 and ImageNet datasets. On ImageNet we exceed the accuracy of prior ternary networks (TWN) by 3%.
References
 Abadi et al. (2015) Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
 Amodei et al. (2015) Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.
 Courbariaux et al. Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training neural networks with weights and activations constrained to +1 or −1.
 Courbariaux et al. (2015) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.
 Han et al. (2015) Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2, 2015.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.
 Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 Kingma & Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
 Li & Liu (2016) Fengfu Li and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
 Lin et al. (2015) Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
 Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. arXiv preprint arXiv:1603.05279, 2016.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
 Zhou et al. (2016) Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.