Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network using Truncated Gaussian Approximation
Abstract
In the past years, deep convolutional neural networks have achieved great success in many artificial intelligence applications. However, their enormous model size and massive computation cost have become the main obstacles to deploying such powerful algorithms in low-power and resource-limited mobile systems. As a countermeasure, deep neural networks with ternarized weights (i.e., −1, 0, +1) have been widely explored to greatly reduce the model size and computational cost, with limited accuracy degradation. In this work, we propose a novel ternarized-neural-network training method which simultaneously optimizes both weights and quantizer during training, differentiating it from prior works. Instead of fixed and uniform weight ternarization, we are the first to incorporate the thresholds of weight ternarization into a closed-form representation using a truncated Gaussian approximation, enabling simultaneous optimization of weights and quantizer through back-propagation training. With both the first and last layers ternarized, experiments on the ImageNet classification task show that our ternarized ResNet-18/34/50 only have 3.9/2.52/2.16% accuracy degradation in comparison to their full-precision counterparts.
I Introduction
Artificial intelligence is nowadays one of the hottest research topics, and it has drawn tremendous effort from various fields in the past couple of years. Computer scientists have succeeded in developing Deep Neural Networks (DNNs) with transcendent performance in the domains of computer vision, speech recognition, big-data processing, etc. [1]. The state-of-the-art DNN evolves into structures with larger model size, higher computational cost and denser layer connections [2, 3, 4, 5]. Such evolution brings great challenges to computer hardware in terms of both computation and on-chip storage [6], which has led to great research effort on the topic of model compression in recent years, including channel pruning, weight sparsification, weight quantization, etc.
Weight ternarization, as a special case of weight quantization for efficiently compressing DNN models, mainly provides three benefits: 1) it converts the floating-point weights into ternary format (i.e., −1, 0, +1), which can significantly reduce the model size by 16× (from 32-bit floating point to 2-bit ternary). With a proper sparse-encoding technique, such a model compression rate can be further boosted. 2) Besides the model-size reduction, the ternarized weights enable the elimination of hardware-expensive floating-point multiplication operations, replacing them with hardware-friendly addition/subtraction operations. Thus, it can significantly reduce inference latency. 3) The ternarized weights with zero values intrinsically prune network connections, so the computations related to those zero weights can simply be skipped.
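The 16× figure in the first benefit follows directly from the bit-width arithmetic; a quick sketch (the parameter count below is an illustrative, roughly ResNet-18-sized number, not a value from the paper):

```python
def model_size_bits(num_weights, bits_per_weight):
    # Raw storage cost of a weight tensor, ignoring sparse encoding.
    return num_weights * bits_per_weight

n = 11_000_000                      # illustrative parameter count
fp32 = model_size_bits(n, 32)       # full-precision baseline
tern = model_size_bits(n, 2)        # ternary weights: 2 bits per value
print(fp32 / tern)                  # -> 16.0
```

Sparse encoding of the zero weights would push the effective rate beyond this raw 16×.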
Previous low-bit network quantization works, such as TWN [7], TTQ [8] and BNN [9], do retrain the models' weights, but a fixed weight quantizer is used and not properly updated together with the other model parameters, which leads to accuracy degradation and slow training convergence. In this work, we propose a network ternarization method which simultaneously updates both weights and quantizer (i.e., thresholds) during training. Our contributions can be summarized as:

We propose a fully trainable deep neural network ternarization method that jointly trains the quantizer thresholds, layer-wise scaling factors and model weights to achieve minimum accuracy degradation due to model compression.

Instead of fixed and uniform weight ternarization, we are the first to incorporate the thresholds of weight ternarization into a closed-form expression using a truncated Gaussian approximation, which can be optimized through back-propagation together with the network's other parameters. This differentiates our method from all previous works.

We propose a novel gradient-correctness optimization method in the straight-through estimator (STE) design. It gives a better gradient approximation for the non-differentiable staircase ternarization function, which leads to faster convergence and higher inference accuracy.

In order to validate the effectiveness of our proposed methods, we apply our model ternarization method to the object classification task on the ImageNet dataset.
The rest of this paper is organized as follows. We first give a brief introduction to related works on model compression. Then the proposed network ternarization method and the applied tricks are explained in detail. In the following section, experiments are performed on both small- and large-scale datasets with various deep neural network structures to evaluate the effectiveness of our proposed method. After that, the conclusion is drawn.
II Related Works
Recently, model compression for deep convolutional neural networks has emerged as a hot topic in the hardware deployment of artificial intelligence. There are various techniques to perform network model compression, including network pruning [10], knowledge distillation [11], weight sparsification [12], weight quantization [13], etc. [14]. As one of the most popular techniques, weight quantization is widely explored in many related works, since it can significantly shrink the model size and reduce the computational complexity [6]. The famous deep compression technique [13] optimizes the weight quantizer using k-means clustering on the pre-trained model. Even though deep compression achieves nearly no accuracy loss with 8-bit quantized weights, its performance in the low-bit case is non-ideal. Thereafter, many works have been devoted to quantizing the model parameters into binary or ternary formats, not only for the extreme model-size reduction (16×–32×) but also because the computations are simplified from floating-point multiplications (i.e., mul) into additions/subtractions (i.e., add/sub). BinaryConnect [15] is the first binary-CNN work to get close to state-of-the-art accuracy on CIFAR-10; its most effective technique is the introduction of gradient clipping. After that, both BWN [16] and DoReFa-Net [17] show better or comparable validation accuracy on the ImageNet dataset. In order to reduce computational complexity, XNOR-Net [16] also binarizes the input tensors of the convolution layers, which further converts the add/sub operations into bit-wise xnor and bit-count operations. Besides weight binarization, recent works also propose to ternarize the weights of neural networks using trained scaling factors [8]. Leng et al. employ the ADMM method to optimize neural network weights in configurable discrete levels to trade off accuracy against model size [18].
ABC-Net [19] proposes multiple parallel binary convolution layers to improve the network model capacity and accuracy while maintaining binary kernels. All of the aggressive neural network binarization or ternarization methodologies discussed above sacrifice inference accuracy, in comparison with the full-precision counterparts, to achieve a large model compression rate and computation cost reduction.
III Methodology
III-A Problem Definition
For weight quantization of neural networks, the state-of-the-art work [20] typically divides it into two sub-problems: (1) minimizing the quantization noise (i.e., mean-square error) between the floating-point weights and the quantized weights, and (2) minimizing the inference error of the neural network w.r.t. the defined objective function. In this work, instead of optimizing two separate sub-problems, we mathematically incorporate the thresholds of the weight quantizer into the neural network forward path, enabling simultaneous optimization of weights and thresholds through back-propagation. The network optimization problem can be described as obtaining the optimized ternarized weights w′_l and ternarization thresholds Δ_l:
min over {w_l}, {Δ_l}:  L( f(x; {w′_l}), t )    (1)
where L(·) is the defined network loss function, t is the target corresponding to the network input tensor x, and f(·) computes the network output w.r.t. the network parameters.
III-B Trainable ternarization under Gaussian approximation
In this subsection, we first introduce our weight ternarization methodology. Then, our proposed method to incorporate the ternarization thresholds into the neural network inference path, which makes them trainable through back-propagation, is discussed in detail.
III-B1 Network Ternarization:
For the sake of obtaining a deep neural network with ternarized weights and minimum accuracy loss, the training scheme for one iteration (as shown in Fig. 1) can generally be enumerated as four steps:

Initialize the weights with a full-precision pre-trained model. Previous works have experimentally demonstrated that fine-tuning the pre-trained model with a small learning rate normally generates a quantized model with higher accuracy. More importantly, with the pre-trained model as parameter initialization, far fewer training epochs are required for the model to converge in comparison to training from scratch.

Ternarize the full-precision weights w.r.t. the layer-wise thresholds Δ_l and scaling factors α_l in real time. The weight ternarization function can be described as:
w′_l = α_l · Tern(w_l, Δ_l),   Tern(w, Δ_l) = +1 if w > Δ_l;  0 if |w| ≤ Δ_l;  −1 if w < −Δ_l    (2)
where w_l and w′_l denote the full-precision weight base and its ternarized version for the l-th layer, respectively, and Δ_l are the ternarization thresholds. The function α_l = β(μ_l, σ_l, Δ_l) calculates the layer-wise scaling factor using the extracted mean μ_l, standard deviation σ_l and thresholds Δ_l, which is the key to enabling threshold training in our work. The closed-form expression of β(·) is described hereinafter.

For one input batch, this step only adjusts the thresholds Δ_l through back-propagation in a layer-wise manner, while suspending the update of the weights.

For the identical input batch, it repeats step 2 to synchronize the ternarized weight base w.r.t. the thresholds updated in step 3. It then disables the update of the thresholds and only allows the full-precision weight base to be updated. Since the staircase ternarization function (Tern(·) in Eq. 2) is non-differentiable owing to its zero derivative almost everywhere, we adopt the Straight-Through Estimator (STE) [21], similar to previous network quantization works [8]. It is noteworthy that we propose a new gradient-correctness algorithm in the STE which is critical to improving the convergence speed of weight retraining (see details in the following subsections).
With ternarized weights, the major computation (i.e., floating-point multiplication and accumulation; for simplicity, we neglect the bias term) is converted into more efficient and less complex floating-point addition and subtraction, since Tern(w_l, Δ_l) ∈ {−1, 0, +1}. The computation can be expressed as:
y = xᵀ w′_l = α_l · ( xᵀ Tern(w_l, Δ_l) )    (3)
where x and Tern(w_l, Δ_l) are the vectorized input and the ternary weights of layer l, respectively. Since, in the structures of state-of-the-art deep neural networks, convolution/fully-connected layers are normally followed by a batch-normalization layer or ReLU, both of which perform element-wise functions on their input tensors, the element-wise scaling by α_l in Eq. 3 can be omitted and integrated into the forthcoming batch-norm and ReLU. Beyond the above description of the ternarized model training procedure, we also formalize these operations in Algorithm 1 for clarification.
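The ternarization of Eq. 2 and the add/sub computation of Eq. 3 can be sketched in a few lines of NumPy (the threshold and scaling-factor values below are illustrative placeholders, not trained quantities):

```python
import numpy as np

def ternarize(w, delta):
    """Tern(w, delta): +1 above delta, -1 below -delta, 0 in between."""
    return np.where(w > delta, 1.0, np.where(w < -delta, -1.0, 0.0))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=8)   # full-precision weight base
x = rng.normal(size=8)              # vectorized layer input
delta, alpha = 0.03, 0.04           # illustrative threshold / scaling factor

t = ternarize(w, delta)
# alpha * (x . t): with ternary weights the dot product reduces to
# additions and subtractions of x's entries, scaled once by alpha.
y = alpha * np.dot(x, t)
y_addsub = alpha * (x[t == 1.0].sum() - x[t == -1.0].sum())
```

Folding the single remaining multiplication by alpha into a subsequent element-wise batch-norm/ReLU removes even that scaling from the inference path, as noted above.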
III-B2 Trainable thresholds utilizing truncated Gaussian distribution approximation:
It has been fully discussed in previous works [13, 22] that the weight distributions of spatial convolution layers and fully-connected layers tend to follow a Gaussian distribution, whose histogram is bell-shaped, owing to the effect of regularization. For example, in Fig. 2, we show the weight distributions and their corresponding Probability Density Functions (PDFs), using the calculated mean and standard deviation, for each parametric layer (i.e., convolution and fully-connected layers) in ResNet-18b [2]. Meanwhile, the Shapiro-Wilk normality test [23] is conducted to quantitatively identify whether the weight samples originate from a Gaussian distribution. The resulting Shapiro-Wilk test statistics indicate a good match to the normal distribution, with a minimum value of 0.82. Note that the asymmetry (i.e., skewness) of the last fully-connected layer is due to the existence of the bias term. In this work, we consider the weights of parametric layers to approximately follow a Gaussian distribution and perform the weight ternarization based on this approximation.
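The normality check described above can be reproduced with `scipy.stats.shapiro`; here on a synthetic Gaussian sample standing in for one layer's weights (the sample itself is made up, not the paper's data):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
w = rng.normal(loc=0.0, scale=0.02, size=500)  # stand-in weight sample

# A test statistic close to 1 means the sample is consistent with a
# Gaussian; the paper reports a minimum statistic of 0.82 across layers.
stat, p_value = shapiro(w)
print(stat)
```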
In order to make the thresholds trainable parameters that can be updated using back-propagation, two criteria have to be met:

the thresholds have to be parameters of the inference path in a closed-form expression;

such a closed-form expression has to be differentiable w.r.t. the thresholds.
Hereby, we make the assumption that:
Assumption 1
the weights of a designated layer approximately follow a Gaussian distribution (i.e., w_l ∼ N(μ_l, σ_l²)), where μ_l and σ_l are the calculated mean and standard deviation of the weight sample w_l.
The centroid is normally taken as the quantized value in a non-uniform quantizer setup to minimize the quantization error [24]. Thus, for weight ternarization, the layer-wise scaling factor can be set as:
α_l = E( |w_l| given |w_l| > Δ_l ) = ∫ |w| · f(w | |w_l| > Δ_l) dw    (4)
where f(w | |w_l| > Δ_l) is the conditional PDF under the condition |w_l| > Δ_l. In this work, since the weight distribution is approximately symmetric about its near-zero mean, we can approximate Eq. 4 and reformat it into:
α_l ≈ E( w_l | Δ_l < w_l < +∞ )    (5)
where the expectation is taken under the PDF and CDF of the Gaussian distribution N(μ_l, σ_l²). This calculation can directly utilize the closed-form expression of the mathematical expectation of a truncated Gaussian distribution with lower bound a = Δ_l and upper bound b = +∞. Thus, we finally obtain a closed-form expression of the scaling factor α_l embedding the trainable thresholds Δ_l:
α_l = β(μ_l, σ_l, Δ_l) = μ_l + σ_l · φ(z_l) / (1 − Φ(z_l))    (6)
z_l = (Δ_l − μ_l) / σ_l    (7)
where φ(·) and Φ(·) are the PDF and CDF of the standard normal distribution N(0, 1).
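The closed form of Eqs. 6–7 only needs the standard normal PDF and CDF, so it can be checked with the math module alone (a sketch; `scaling_factor` is our name for β(·)):

```python
import math

def pdf(z):
    # Standard normal PDF, phi(z)
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def cdf(z):
    # Standard normal CDF, Phi(z), via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def scaling_factor(mu, sigma, delta):
    """beta(mu, sigma, delta): mean of N(mu, sigma^2) truncated to
    (delta, +inf), used as the layer-wise scaling factor alpha_l."""
    z = (delta - mu) / sigma
    return mu + sigma * pdf(z) / (1.0 - cdf(z))

# Sanity check: for mu = 0, delta = 0 the truncated mean is
# sigma * sqrt(2 / pi), i.e. about 0.7979 * sigma.
print(scaling_factor(0.0, 1.0, 0.0))
```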
For visualization purposes, Fig. 3(a) plots α_l in the forward and backward paths w.r.t. the variation of Δ_l. Moreover, in order to ensure Δ_l > 0 for correct ternarization and to prevent framework convergence issues (most popular deep-learning frameworks use numerical methods, e.g. Monte-Carlo, for distribution-related calculations, which become erroneous when evaluating φ and Φ at the tail of the distribution), we apply constraints that keep Δ_l within a bounded range. Such a clipping operation is functionally equivalent to propagating Δ_l through the hard tanh function, a piecewise-linear activation with upper limit t_max and lower limit t_min; the trainable thresholds with clipping constraints can then be expressed as:
htanh(x) = max( t_min, min(x, t_max) )    (8)
Δ_l ← htanh(Δ_l)    (9)
After substituting the vanilla Δ_l with its clipped counterpart (Eq. 9) when calculating α_l, the forward (α_l) and backward (∂α_l/∂Δ_l) functions w.r.t. Δ_l are transformed into those shown in Fig. 3(b).
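The clipping itself is a one-line clamp; the bounds below are illustrative stand-ins for t_min and t_max (not values from the paper), chosen to keep Δ_l positive and away from the distribution tail:

```python
import numpy as np

def hard_tanh(x, t_min, t_max):
    # Piecewise-linear clip: identity inside (t_min, t_max), flat outside,
    # so its gradient is 1 inside the bounds and 0 outside.
    return np.clip(x, t_min, t_max)

sigma = 0.02                                   # illustrative layer std-dev
delta_raw = np.array([-0.01, 0.005, 0.5])      # raw trainable thresholds
delta = hard_tanh(delta_raw, 1e-4, 3 * sigma)  # clipped thresholds
```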
Beyond that, weight decay tends to push the trainable thresholds close to zero, which biases the ternary weight representation towards its binary counterpart and thus lowers the sparsity. Therefore, we do not apply weight decay to the thresholds during training.
In summary, we finalize the scaling-factor term and the weight ternarization function that substitute the original full-precision weights in the forward propagation path:
α_l = β( μ_l, σ_l, htanh(Δ_l) )    (10)
w′_l = α_l · Tern( w_l, htanh(Δ_l) )    (11)
III-C Straight-Through Estimator with Gradient Correctness
Almost any quantization function that maps continuous values into a discrete space encounters the same problem: such a staircase function is non-differentiable. A widely adopted countermeasure is the so-called Straight-Through Estimator (STE), which manually assigns an approximated gradient to the quantization function. We take the STE in the well-known binarized neural network [9] as an example for analysis, defined as:
Forward:  w^b = sign(w)    (12)
Backward:  ∂L/∂w = ∂L/∂w^b · 1_{|w| ≤ 1}    (13)
where L is the defined loss function. The rationale behind this STE setup is that the output w^b of the quantization function can effectively represent the full-precision input w; thus sign(·) is treated as an identity-like function whose derivative is 1 within the clipping range. However, the rough approximation in Eq. 12 and Eq. 13 leads to significant quantization error and hampers network training when the weight magnitudes deviate greatly from 1. For example, as shown in Fig. 4(a), if a layer's weights are concentrated far below magnitude 1, performing network quantization through fine-tuning results in a significant weight-distribution shift, which slows down convergence.
In order to counter the drawback of the naive STE design discussed above, we propose a method called gradient correctness for better gradient approximation. For our weight ternarization case, the full-precision weight base is represented by w′_l = α_l · Tern(w_l, Δ_l), where both terms can pass gradients back to update the embedded parameters. For assigning a proper gradient to Tern(·), we follow the STE design rule, which leads to the following expression:
∂w′_l/∂w_l = α_l · ∂Tern(w_l, Δ_l)/∂w_l ≈ 1    (14)
Thus, the STE for the ternarization function can be derived from Eq. 14 as:
∂Tern(w_l, Δ_l)/∂w_l = (1/α_l) · 1_{|w_l| ≤ 1}    (15)
As seen in Eq. 15, instead of simply assigning the gradient of Tern(·) as 1, we scale it w.r.t. the value of α_l in real time. As shown in Fig. 4(b), the STE can better approximate the gradient with the adjustable gradient-correctness term.
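A minimal NumPy sketch of the backward pass with and without the gradient-correctness term (the clipping mask follows the STE of [21]; all numeric values are illustrative):

```python
import numpy as np

def ste_backward(grad_out, w, alpha, correctness=True):
    """Gradient w.r.t. the full-precision weights for w' = alpha * Tern(w).
    Vanilla STE takes dTern/dw = 1, so the chain rule leaves dL/dw scaled
    by alpha; gradient correctness takes dTern/dw = 1/alpha (Eq. 15),
    cancelling that scaling."""
    mask = (np.abs(w) <= 1.0).astype(float)   # clipping mask as in [21]
    d_tern = (1.0 / alpha) if correctness else 1.0
    return grad_out * alpha * d_tern * mask

grad_out = np.array([0.2, -0.4])   # dL/dw' from upstream
w = np.array([0.1, -0.05])         # full-precision weight base
alpha = 0.05                       # layer-wise scaling factor, well below 1
```

With alpha = 0.05, the vanilla gradient is 20× smaller than the corrected one, matching the step-size mismatch between weights and thresholds discussed later in the ablation study.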
IV Experiment and Result Evaluation
IV-A Experiment setup
In this work, we evaluate our proposed network ternarization method on the object classification task with the CIFAR-10 and ImageNet datasets. All experiments are performed under the PyTorch deep-learning framework using 4-way Nvidia Titan XP GPUs. For clarification, in this work, both the first and last layers are ternarized during the training and test stages.
IV-A1 CIFAR-10:
CIFAR-10 contains 50 thousand training samples and 10 thousand test samples of 32×32 image size. The data augmentation method is identical to that used in [2]. For fine-tuning, we set the initial learning rate to 0.1, which is scheduled to be scaled by 0.1 at epochs 80 and 120, respectively. The mini-batch size is set to 128.
IV-A2 ImageNet:
In order to provide more comprehensive experimental results on a large dataset, we examine our model ternarization techniques on the image classification task with the ImageNet [25] (ILSVRC2012) dataset. ImageNet contains 1.2 million training images and 50 thousand validation images, labeled with 1000 categories. For the data pre-processing, we choose the scheme adopted by ResNet [2]. The augmentations applied to the training images can be sequentially enumerated as: randomly resized crop, random horizontal flip, and pixel-wise normalization. All reported classification accuracies on the validation set are single-crop results. The mini-batch size is set to 256.
IV-B Ablation studies
In order to examine the effectiveness of our proposed methods, we perform the following ablation studies. The experiments are conducted with ResNet-20 [2] on the CIFAR-10 dataset, where the differences between configurations are significant enough to be observed.
TABLE I: Ablation study with ResNet-20 on CIFAR-10.

Configuration                        Accuracy
full-precision (baseline)            91.7%
w/ gradient correctness              90.39%
w/o gradient correctness             87.89%
vanilla SGD (threshold optimizer)    90.39%
Adam (threshold optimizer)           56.31%
Initialize with                      89.96%
Initialize with                      90.24%
Initialize with                      90.12%
IV-B1 Gradient Correctness
We compare the accuracy convergence speed between the STE designs with and without gradient correctness. As shown in Fig. 5, training with gradient correctness converges much faster than without it. The main cause of the convergence-speed degradation is that when the layer-wise scaling factor α_l is less than 1, without gradient correctness, the gradient of the loss w.r.t. the weights is scaled down by α_l due to the chain rule. Thus, the weights are updated with a much smaller step size than the thresholds when the optimizers are set up with identical hyper-parameters (e.g., learning rate).
TABLE II: Validation accuracy (top-1/top-5, %) on ImageNet, compared with related works.

Method           Weights  First layer  Last layer  Top-1/Top-5   Comp. rate
ResNet-18b
Full precision   –        FP           FP          69.75/89.07   1
BWN [16]         Bin.     FP           FP          60.8/83.0     32
ABC-Net [19]     Bin.     FP*          FP*         68.3/87.9     6.4
ADMM [18]        Bin.     FP*          FP*         64.8/86.2     32
TWN [7, 18]      Tern.    FP           FP          61.8/84.2     16
TTQ [8]          Tern.    FP           FP          66.6/87.2     16
ADMM [18]        Tern.    FP*          FP*         67.0/87.5     16
APPRENTICE [11]  Tern.    FP*          FP*         68.5/–        16
this work        Tern.    Tern.        Tern.       66.01/86.78   16
ResNet-34b
Full precision   –        FP           FP          73.31/91.42   1
APPRENTICE [11]  Tern.    FP*          FP*         72.8/–        16
this work        Tern.    Tern.        Tern.       70.79/89.89   16
ResNet-50b
Full precision   –        FP           FP          76.13/92.86   1
APPRENTICE [11]  Tern.    FP*          FP*         74.7/–        16
this work        Tern.    Tern.        Tern.       73.97/91.65   16
IV-B2 Optimizer on thresholds
Vanilla SGD and Adam are the two most adopted optimizers for quantized-neural-network training. Hereby, we take these two optimizers as examples to show the training evolution. Note that, since weights and thresholds are iteratively updated for each input mini-batch, we can use different optimizers for weights and thresholds. In this experiment, we use SGD for the weights, while using either SGD or Adam for the thresholds. The results depicted in Fig. 6 show that it is better to use SGD for both to achieve higher accuracy.
IV-B3 Thresholds Initialization
In order to examine how the threshold initialization affects network training, we initialize the thresholds identically for all the layers with several different values. The experimental results reported in Fig. 7 show that the initialization does not play an important role in network ternarization in our case. The reason may be two-fold: 1) on one hand, all the layer-wise ternarization thresholds are initialized with small values whose differences are not significant; 2) on the other hand, all the thresholds are fully trainable, which mitigates the initial differences during training.
IV-C Performance on the ImageNet dataset
Beyond the ablation studies performed on the CIFAR-10 dataset, we also conduct experiments on the large-scale ImageNet dataset with ResNet-18/34/50 (type-b residual connection) network structures. The experimental results are listed in Table II together with the methods adopted in related works. For the realistic case of a neural network operating on specifically designed hardware, it is expected that all layers are ternarized. The results show that our method achieves state-of-the-art accuracy under this setting. The layer-wise thresholds are initialized with the same small value for all layers. We use the full-precision pre-trained model for weight initialization, as described in Fig. 1. The learning rate starts from 1e-4, then changes to 2e-5, 4e-6 and 2e-6 at epochs 30, 40 and 45, respectively.
V Conclusion and Future Works
In this work, we have proposed a neural network ternarization method which incorporates the thresholds as trainable parameters within the network inference path, so that both weights and thresholds are updated through back-propagation. Furthermore, we explicitly discuss the importance of the straight-through-estimator design for approximating the gradient of the staircase function. In general, our work is based on the assumption that the weights of a deep neural network tend to follow a Gaussian distribution. It turns out that this assumption yields a successful abstract model for network ternarization.
References
 [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
 [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 [3] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 5987–5995, IEEE, 2017.
 [4] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
 [5] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, vol. 1, p. 3, 2017.
 [6] Z. He, B. Gong, and D. Fan, “Optimize deep convolutional neural network with ternarized weights and high accuracy,” arXiv preprint arXiv:1807.07948, 2018.
 [7] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711, 2016.
 [8] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.
 [9] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1,” arXiv preprint arXiv:1602.02830, 2016.
 [10] J.-H. Luo, J. Wu, and W. Lin, “ThiNet: A filter level pruning method for deep neural network compression,” arXiv preprint arXiv:1707.06342, 2017.
 [11] A. Mishra and D. Marr, “Apprentice: Using knowledge distillation techniques to improve lowprecision network accuracy,” arXiv preprint arXiv:1711.05852, 2017.
 [12] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, pp. 1135–1143, 2015.
 [13] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
 [14] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
 [15] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Advances in neural information processing systems, pp. 3123–3131, 2015.
 [16] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in European Conference on Computer Vision, pp. 525–542, Springer, 2016.
 [17] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
 [18] C. Leng, H. Li, S. Zhu, and R. Jin, “Extremely low bit neural network: Squeeze the last bit out with ADMM,” arXiv preprint arXiv:1707.09870, 2017.
 [19] X. Lin, C. Zhao, and W. Pan, “Towards accurate binary convolutional neural network,” in Advances in Neural Information Processing Systems, pp. 344–352, 2017.
 [20] D. Zhang, J. Yang, D. Ye, and G. Hua, “LQ-Nets: Learned quantization for highly accurate and compact deep neural networks,” arXiv preprint arXiv:1807.10029, 2018.
 [21] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
 [22] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural networks,” arXiv preprint arXiv:1505.05424, 2015.
 [23] S. S. Shapiro and M. B. Wilk, “An analysis of variance test for normality (complete samples),” Biometrika, vol. 52, no. 3/4, pp. 591–611, 1965.
 [24] J. G. Proakis, M. Salehi, N. Zhou, and X. Li, Communication systems engineering, vol. 2. Prentice Hall New Jersey, 1994.
 [25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.