Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network using Truncated Gaussian Approximation
In recent years, deep convolutional neural networks have achieved great success in many artificial intelligence applications. However, their enormous model size and massive computation cost have become the main obstacle to deploying such powerful algorithms in low-power, resource-limited mobile systems. As a countermeasure, deep neural networks with ternarized weights (i.e., -1, 0, +1) have been widely explored to greatly reduce model size and computational cost with limited accuracy degradation. In this work, we propose a novel ternarized neural network training method that simultaneously optimizes both weights and quantizer during training, which differentiates it from prior works. Instead of fixed and uniform weight ternarization, we are the first to incorporate the thresholds of weight ternarization into a closed-form representation using truncated Gaussian approximation, enabling simultaneous optimization of weights and quantizer through back-propagation training. With both the first and last layers ternarized, experiments on the ImageNet classification task show that our ternarized ResNet-18/34/50 has only 3.9/2.52/2.16% accuracy degradation in comparison to the full-precision counterparts.
Artificial intelligence is nowadays one of the hottest research topics and has drawn tremendous effort from various fields in the past couple of years. While computer scientists have succeeded in developing Deep Neural Networks (DNN) with transcendent performance in the domains of computer vision, speech recognition, big data processing, etc., state-of-the-art DNNs have evolved into structures with larger model size, higher computational cost and denser layer connections [2, 3, 4, 5]. Such evolution brings great challenges to computer hardware in terms of both computation and on-chip storage, which has led to great research effort on model compression in recent years, including channel pruning, weight sparsification, weight quantization, etc.
Weight ternarization, as a special case of weight quantization for efficiently compressing DNN models, mainly provides three benefits: 1) It converts the floating-point weights into a ternary format (i.e., -1, 0, +1), which can significantly reduce the model size. With a proper sparse encoding technique, the model compression rate can be further boosted. 2) Besides the model size reduction, ternarized weights eliminate hardware-expensive floating-point multiplication operations, replacing them with hardware-friendly addition/subtraction operations. Thus, ternarization could significantly reduce the inference latency. 3) Ternarized weights with zero values intrinsically prune network connections, so the computations related to those zero weights can simply be skipped.
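As a rough illustration of the first benefit, the sketch below packs ternary weights into 2-bit codes, four weights per byte. The code book, the helper names `pack_ternary` and `unpack_ternary`, and the 32-bit floating-point baseline are assumptions made for this example, not details taken from this work; under them, the storage ratio is simply 32/2 = 16 before any sparse encoding.

```python
# Sketch: store ternary weights {-1, 0, +1} as 2-bit codes, four per byte.
# The code book and the float32 baseline are illustrative assumptions.

CODE = {0: 0b00, 1: 0b01, -1: 0b10}        # hypothetical 2-bit code book
DECODE = {code: value for value, code in CODE.items()}


def pack_ternary(weights):
    """Pack a list of ternary weights into bytes, four weights per byte."""
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        packed[i // 4] |= CODE[w] << (2 * (i % 4))
    return bytes(packed)


def unpack_ternary(packed, n):
    """Recover the first n ternary weights from the packed bytes."""
    return [DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]


weights = [1, 0, -1, -1, 0, 0, 1, 1]
packed = pack_ternary(weights)

# Assumed baseline: 4 bytes per full-precision (float32) weight.
ratio = (len(weights) * 4) / len(packed)
print(len(packed), ratio)
```

Any lossless 2-bit assignment would work equally well; the unused fourth code is what a sparse or entropy encoding could exploit further.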
In previous low-bit network quantization works, such as TTN, TTQ [8] and BNN [9], the models' weights are re-trained, but a fixed weight quantizer is used and is not properly updated together with the other model parameters, which leads to accuracy degradation and slow training convergence. In this work, we propose a network ternarization method that simultaneously updates both weights and quantizer (i.e., thresholds) during training. Our contributions can be summarized as:
We propose a fully trainable deep neural network ternarization method that jointly trains the quantizer threshold, layer-wise scaling factor and model weight to achieve minimum accuracy degradation due to model compression.
Instead of fixed and uniform weight ternarization, we are the first to incorporate the thresholds of weight ternarization into a closed-form expression using truncated Gaussian approximation, which can be optimized through back-propagation together with the network's other parameters. This differentiates our method from all previous works.
We propose a novel gradient correctness optimization method in straight-through-estimator design. It gives better gradient approximation for the non-differentiable staircase ternarization function, which leads to faster convergence speed and higher inference accuracy.
In order to validate the effectiveness of our proposed methods, we apply our model ternarization method on ImageNet dataset for object classification task.
The rest of this paper is organized as follows. We first give a brief introduction to related works on model compression. Then the proposed network ternarization method and the applied tricks are explained in detail. In the following section, experiments are performed on both small- and large-scale datasets with various deep neural network structures to evaluate the effectiveness of our proposed method. Finally, the conclusion is drawn.
II Related Works
Recently, model compression of deep convolutional neural networks has emerged as a hot topic in the hardware deployment of artificial intelligence. Various techniques, including network pruning [10], knowledge distillation [11], weight sparsification [12], weight quantization, etc., are used to perform network model compression. As one of the most popular techniques, weight quantization has been widely explored in many related works, since it can significantly shrink the model size and reduce the computation complexity. The well-known deep compression technique [13] optimizes the weight quantizer using k-means clustering on the pre-trained model. Even though deep compression achieves almost no accuracy loss with 8-bit quantized weights, its performance in the low-bit case is non-ideal. Thereafter, many works have been devoted to quantizing the model parameters into binary or ternary formats, not only for the extreme model size reduction, but also because the computations are simplified from floating-point multiplication (i.e., mul) operations into addition/subtraction (i.e., add/sub). BinaryConnect [15] is the first work on binary CNNs that gets close to the state-of-the-art accuracy on CIFAR-10, and its most effective technique is the introduction of gradient clipping. After that, both BWN and DoReFa-Net [17] show better or close validation accuracy on the ImageNet dataset. In order to reduce the computation complexity, XNOR-Net [16] binarizes the input tensor of the convolution layer, which further converts the add/sub operations into bit-wise xnor and bit-count operations. Besides weight binarization, there are also recent works proposing to ternarize the weights of neural networks using trained scaling factors. Leng et al. employ the ADMM method to optimize neural network weights in configurable discrete levels to trade off between accuracy and model size [18].
ABC-Net [19] proposes multiple parallel binary convolution layers to improve the network model capacity and accuracy while maintaining binary kernels. All of the aggressive neural network binarization or ternarization methodologies discussed above sacrifice inference accuracy in comparison with the full-precision counterpart to achieve a large model compression rate and computation cost reduction.
III-A Problem Definition
As for weight quantization of neural networks, state-of-the-art work typically divides it into two sub-problems: (1) minimizing the quantization noise (i.e., Mean-Square-Error) between floating-point weights and quantized weights, and (2) minimizing the inference error of the neural network w.r.t. the defined objective function. In this work, instead of optimizing two separate sub-problems, we mathematically incorporate the thresholds of the weight quantizer into the neural network forward path, enabling simultaneous optimization of weights and thresholds through back-propagation. The network optimization problem can then be described as obtaining the optimized ternarized weight $\mathbf{w}^{*}$ and ternarization threshold $\Delta^{*}$:

$$\{\mathbf{w}^{*}, \Delta^{*}\} = \arg\min_{\mathbf{w}, \Delta} \mathcal{L}\big(\mathbf{y}, f(\mathbf{x}; \mathbf{w}, \Delta)\big) \tag{1}$$

where $\mathcal{L}$ is the defined network loss function, $\mathbf{y}$ is the target corresponding to the network input tensor $\mathbf{x}$, and $f(\cdot)$ computes the network output w.r.t. the network parameters.
III-B Trainable ternarization under Gaussian approximation
In this subsection, we first introduce our weight ternarization methodology. Then, we discuss in particular our proposed method for incorporating the ternarization thresholds into the neural network inference path, which makes them trainable through back-propagation.
III-B1 Network Ternarization:
For the sake of obtaining a deep neural network with ternarized weight and minimum accuracy loss, the training scheme for one iteration (as shown in Fig. 1) can be generally enumerated as four steps:
1) Initialize the weights with the full-precision pre-trained model. Previous works have experimentally demonstrated that fine-tuning the pre-trained model with a small learning rate normally generates a quantized model with higher accuracy. More importantly, with the pre-trained model as parameter initialization, far fewer training epochs are required for the model to converge in comparison to training from scratch.
2) Ternarize the full-precision weights w.r.t. the layer-wise thresholds and scaling factor in real time. The weight ternarization function can be described as:

$$w'_l = S_l(\mu_l, \sigma_l, \Delta_l) \cdot Tern(w_l, \Delta_l) = S_l(\mu_l, \sigma_l, \Delta_l) \cdot \begin{cases} +1 & w_l > \Delta_l^{+} \\ 0 & \Delta_l^{-} \le w_l \le \Delta_l^{+} \\ -1 & w_l < \Delta_l^{-} \end{cases} \tag{2}$$

where $w_l$ and $w'_l$ denote the full-precision weight base and its ternarized version of the $l$-th layer respectively. $\Delta_l = \{\Delta_l^{-}, \Delta_l^{+}\}$ are the ternarization thresholds. $S_l(\cdot)$ calculates the layer-wise scaling factor using the extracted mean $\mu_l$, standard deviation $\sigma_l$ and thresholds $\Delta_l$, which is the key to enabling threshold training in our work. The closed-form expression of $S_l(\cdot)$ is described hereinafter.
3) For one input batch, this step only adjusts the thresholds through back-propagation in a layer-wise manner, while suspending the update of the weights.
4) For the identical input batch, step 2 is repeated to synchronize the ternarized weight base with the thresholds updated in step 3. The update of the thresholds is then disabled and only the full-precision weight base is allowed to be updated. Since the staircase ternarization function ($Tern(\cdot)$ in Eq. 2) is non-differentiable, its derivative being zero almost everywhere, we adopt the Straight-Through-Estimator (STE) [21], as in previous network quantization works. It is noteworthy that we propose a new gradient correctness algorithm in the STE, which is critical for improving the convergence speed of weight retraining (see details in the following subsections).
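Step 2 of the procedure above can be sketched in plain Python as follows. The function names, the example thresholds and the hand-picked scaling factor are hypothetical; in the method described here the scaling factor is computed from the layer statistics and thresholds rather than passed in by hand.

```python
# Sketch of threshold-based weight ternarization for one layer.
# Thresholds and scale below are illustrative values, not trained ones.

def ternary_codes(weights, thr_neg, thr_pos):
    """Map each full-precision weight to -1, 0 or +1 using two thresholds."""
    codes = []
    for w in weights:
        if w > thr_pos:
            codes.append(1)
        elif w < thr_neg:
            codes.append(-1)
        else:
            codes.append(0)
    return codes


def ternarize(weights, thr_neg, thr_pos, scale):
    """Ternarized weights: layer-wise scale times the ternary code."""
    return [scale * c for c in ternary_codes(weights, thr_neg, thr_pos)]


full_precision = [0.8, -0.05, 0.3, -0.6, 0.02]
codes = ternary_codes(full_precision, thr_neg=-0.1, thr_pos=0.1)
ternarized = ternarize(full_precision, thr_neg=-0.1, thr_pos=0.1, scale=0.5)
print(codes)
print(ternarized)
```

Keeping the integer codes separate from the scale mirrors the split between the staircase function and the scaling factor that the training procedure relies on.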
With ternarized weights, the major computation (i.e., floating-point multiplication and accumulation; for simplicity, we neglect the bias term) is converted to more efficient and less complex floating-point addition and subtraction, due to $Tern(\mathbf{w}_l) \in \{-1, 0, +1\}$. The computation can be expressed as:

$$\mathbf{x}_l^{T} \cdot \mathbf{w}'_l = \mathbf{x}_l^{T} \cdot \big(S_l \cdot Tern(\mathbf{w}_l)\big) = S_l \cdot \big(\mathbf{x}_l^{T} \cdot Tern(\mathbf{w}_l)\big) \tag{3}$$

where $\mathbf{x}_l$ and $\mathbf{w}'_l$ are the vectorized input and ternarized weight of layer $l$ respectively, $S_l$ is the layer-wise scaling factor and $Tern(\cdot)$ is the ternarization function. Since, in state-of-the-art deep neural network structures, convolution/fully-connected layers are normally followed by a batch-normalization layer or ReLU, both of which perform an element-wise function on their input tensor, the element-wise scaling in Eq. 3 can be omitted and integrated into the subsequent batch-norm and ReLU. Beyond the above description of the ternarized model training procedure, we also formalize these operations in Algorithm 1 for clarification.
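The factoring of the layer-wise scale out of the accumulation can be checked with a small sketch. The helper `ternary_dot`, the input vector and the scale value are made up for illustration; the only point is that the inner loop needs additions and subtractions alone, with a single multiplication applied afterwards.

```python
# Sketch: dot product with ternary codes using only add/sub in the inner loop.
# Input values, codes and scale are illustrative assumptions.

def ternary_dot(x, codes):
    """Accumulate x against ternary codes without any multiplication."""
    acc = 0.0
    for xi, c in zip(x, codes):
        if c == 1:
            acc += xi
        elif c == -1:
            acc -= xi
        # c == 0: the connection is pruned, so the term is skipped
    return acc


x = [0.5, -1.25, 2.0, 0.75]
codes = [1, 0, -1, 1]
scale = 0.5

# One multiplication per output, applied after the accumulation.
y = scale * ternary_dot(x, codes)

# Reference: ordinary multiply-accumulate with the scaled ternary weights.
reference = sum(xi * (scale * c) for xi, c in zip(x, codes))
print(y, reference)
```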
III-B2 Trainable thresholds utilizing truncated Gaussian distribution approximation:
It has been fully discussed in previous works [13, 22] that the weight distributions of spatial convolution layers and fully-connected layers tend to follow a Gaussian distribution, whose histogram is bell-shaped, owing to the effect of $L_2$-regularization. For example, in Fig. 2, we show the weight distributions and their corresponding Probability Density Functions (PDF), using the calculated mean and standard deviation, for each parametric layer (i.e., convolution and fully-connected layers) in ResNet-18b [2]. Meanwhile, the Shapiro-Wilk normality test [23] is conducted to identify quantitatively whether the weight sample originates from a Gaussian distribution. The test statistic of the Shapiro-Wilk normality test indicates a good match with the normal distribution, with a minimum value of 0.82. Note that the asymmetry (i.e., skewness) of the last fully-connected layer is due to the existence of the bias term. In this work, we consider the weights of parametric layers to approximately follow a Gaussian distribution and perform the weight ternarization based on this approximation.
In order to make the thresholds as trainable parameters that can be updated using back-propagation, there are two criteria that have to be met:
thresholds have to be parameters of inference path in a closed-form expression.
such closed-form expression is differentiable w.r.t the threshold.
Hereby, we make the assumption that:
the weights of the designated layer $l$ approximately follow a Gaussian distribution (i.e., $w_l \sim \mathcal{N}(\mu_l, \sigma_l^2)$), where $\mu_l$ and $\sigma_l$ are the calculated mean and standard deviation of the weight sample $w_l$.
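The centroid is normally taken as the quantized value for a nonuniform quantizer setup to minimize the quantization error. Thus, for weight ternarization, the layer-wise scaling factor can be set as:

$$S_l(\mu_l, \sigma_l, \Delta_l) = E\big(|w_l| : |w_l| > \Delta_l\big) = \int_{-\infty}^{+\infty} x \cdot f(x \mid x > \Delta_l)\, dx \tag{4}$$

where $f(x \mid x > \Delta_l)$ is the conditional PDF under the condition of $x > \Delta_l$. In this work, by setting $a = \Delta_l$ and $b = +\infty$, we can approximate Eq. 4 and reformat it into:

$$S_l(\mu_l, \sigma_l, \Delta_l) = \int_{a}^{b} x \cdot \frac{\phi(x)}{\Phi(b) - \Phi(a)}\, dx \tag{5}$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the PDF and CDF of the Gaussian distribution $\mathcal{N}(\mu_l, \sigma_l^2)$. Such a calculation can directly utilize the closed-form expression of the mathematical expectation of a truncated Gaussian distribution with lower bound $a$ and upper bound $b$. Thus, we finally obtain a closed-form expression of the scaling factor embedding the trainable thresholds $\Delta_l$:

$$\alpha = \frac{a - \mu_l}{\sigma_l}, \qquad \beta = \frac{b - \mu_l}{\sigma_l}, \qquad S_l(\mu_l, \sigma_l, \Delta_l) = \mu_l + \sigma_l \cdot \frac{\phi'(\alpha) - \phi'(\beta)}{\Phi'(\beta) - \Phi'(\alpha)} \tag{6}$$

where $\phi'(\cdot)$ and $\Phi'(\cdot)$ are the PDF and CDF of the standard normal distribution $\mathcal{N}(0, 1)$.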
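The closed-form expectation of a truncated Gaussian referred to above is a standard result, and a minimal sketch of it is given below using only the Python standard library. The function names, the choice of an unbounded upper limit by default, and the numerical cross-check are assumptions added for illustration; they are not the implementation used in this work.

```python
# Sketch: mean of a Gaussian N(mu, sigma^2) truncated to [a, b].
# Standard formula: mu + sigma * (pdf(alpha) - pdf(beta)) / (cdf(beta) - cdf(alpha)),
# with alpha = (a - mu) / sigma and beta = (b - mu) / sigma.
from statistics import NormalDist

INF = float("inf")


def truncated_gaussian_mean(mu, sigma, a, b=INF):
    """Closed-form expectation of N(mu, sigma^2) restricted to [a, b]."""
    std = NormalDist()                      # standard normal N(0, 1)
    alpha = (a - mu) / sigma
    pdf_a, cdf_a = std.pdf(alpha), std.cdf(alpha)
    if b == INF:
        pdf_b, cdf_b = 0.0, 1.0             # limits of pdf and cdf at +infinity
    else:
        beta = (b - mu) / sigma
        pdf_b, cdf_b = std.pdf(beta), std.cdf(beta)
    return mu + sigma * (pdf_a - pdf_b) / (cdf_b - cdf_a)


def numeric_mean(mu, sigma, a, upper, steps=20000):
    """Midpoint-rule estimate of the same expectation, used as a sanity check."""
    dist = NormalDist(mu, sigma)
    h = (upper - a) / steps
    num = den = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        p = dist.pdf(x)
        num += x * p
        den += p
    return num / den


closed = truncated_gaussian_mean(mu=0.0, sigma=1.0, a=0.5)
numeric = numeric_mean(mu=0.0, sigma=1.0, a=0.5, upper=10.0)
print(closed, numeric)
```

Because the expression is built from the Gaussian PDF and CDF only, it is differentiable with respect to the lower bound, which is what allows a threshold placed there to receive gradients.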
For visualization purposes, as shown in Fig. 3(a), we plot the scaling-factor function in the forward and backward paths w.r.t. the variation of the threshold $\Delta_l$. Moreover, in order to ensure correct ternarization and prevent framework convergence issues, we apply constraints on $\Delta_l$ that keep it within a bounded range. (Since most of the popular deep learning frameworks use numerical methods (e.g., the Monte-Carlo method) for distribution-related calculations, there will be errors in calculating the PDF and CDF at the tail of the distribution.) Such a clipping operation is functionally equivalent to propagating $\Delta_l$ through the hard tanh function, which is a piece-wise linear activation function with upper limit $\Delta_{max}$ and lower limit $\Delta_{min}$, so the trainable thresholds with clipping constraints can be expressed as:

$$\Delta_l \leftarrow hardtanh(\Delta_l; \Delta_{min}, \Delta_{max}) = \min\big(\max(\Delta_l, \Delta_{min}), \Delta_{max}\big)$$
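A minimal sketch of the clipping constraint is shown below. The bound values are placeholders chosen for the example, and the gradient rule (pass-through inside the range, zero outside) is the usual behaviour of a hard tanh rather than something specified in the text.

```python
# Sketch: clip a trainable threshold with a hard-tanh style function.
# LOWER and UPPER are placeholder bounds, not values from this work.

LOWER, UPPER = 0.0, 0.3


def hardtanh(x, lower, upper):
    """Piece-wise linear clipping of x into [lower, upper]."""
    return min(max(x, lower), upper)


def hardtanh_grad(x, lower, upper):
    """Derivative of the clipping: 1 strictly inside the range, 0 outside."""
    return 1.0 if lower < x < upper else 0.0


raw_thresholds = [-0.05, 0.1, 0.45]
clipped = [hardtanh(t, LOWER, UPPER) for t in raw_thresholds]
grads = [hardtanh_grad(t, LOWER, UPPER) for t in raw_thresholds]
print(clipped, grads)
```

A threshold that drifts outside the range stops receiving gradient through the clip, so it stays pinned at the bound until the clamped value itself is updated.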
Beyond that, the weight decay tends to push the trainable threshold close to zero, and biases the ternary weight representation towards the binary counterpart, thus lowering the sparsity. Therefore, we do not apply weight decay on threshold during training.
In summary, we finalize the scaling factor term and weight ternarization function to substitute the original full-precision weight in the forward propagation path:
III-C Straight Through Estimator with Gradient Correctness
Almost every quantization function that maps continuous values into a discrete space encounters the same problem: such a staircase function is non-differentiable. Thus, a widely adopted countermeasure is to use the so-called Straight-Through-Estimator (STE) to manually assign an approximated gradient to the quantization function. We take the STE in the well-known binarized neural network [9] as an example to perform the analysis, which is defined as:

$$\textrm{Forward:} \quad r_o = Sign(r_i) \tag{12}$$

$$\textrm{Backward:} \quad \frac{\partial \mathcal{L}}{\partial r_i} = \frac{\partial \mathcal{L}}{\partial r_o} \cdot \mathbf{1}_{|r_i| \le 1} \tag{13}$$

where $\mathcal{L}$ is the defined loss function, and $r_i$ and $r_o$ are the input and output of the quantization function. The rule behind such an STE setup is that the output of the quantization function $r_o$ can effectively represent the full-precision input value $r_i$. Thus, $Sign(\cdot)$ performs a similar function to $f(r_i) = r_i$, whose derivative is $\partial f(r_i) / \partial r_i = 1$. However, the rough approximation in Eq. 12 and Eq. 13 leads to significant quantization error and hampers network training when $|r_i| \ll 1$ or $|r_i| \gg 1$. For example, as shown in Fig. 4(a), if the magnitudes of a layer's weights are far from 1, performing network quantization through fine-tuning results in a significant weight distribution shift, which slows down convergence.
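The binarization STE discussed above can be sketched as a forward/backward pair. The function names and sample values are illustrative; the backward rule is the commonly used form that passes the gradient through unchanged when the input magnitude is at most 1 and blocks it otherwise.

```python
# Sketch: sign quantizer with a straight-through estimator (STE).
# Sample inputs are chosen to show the mismatch for very small magnitudes.

def sign_forward(r_i):
    """Forward pass: binarize the full-precision input to +1 or -1."""
    return 1.0 if r_i >= 0.0 else -1.0


def sign_backward(r_i, grad_out):
    """Backward pass: identity gradient inside [-1, 1], zero outside."""
    return grad_out if abs(r_i) <= 1.0 else 0.0


samples = [-1.5, -0.01, 0.01, 2.0]
outputs = [sign_forward(r) for r in samples]
grads = [sign_backward(r, 1.0) for r in samples]
print(outputs, grads)
```

The inputs 0.01 and -0.01 are mapped to outputs a hundred times larger in magnitude while still being treated as if the mapping were the identity, which is the kind of mismatch the paragraph above describes.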
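In order to counter the drawback of the naive STE design discussed above, we propose a method called gradient correctness for better gradient approximation. For our weight ternarization case, the full-precision weight base $w_l$ is represented by $w'_l = S_l(\mu_l, \sigma_l, \Delta_l) \cdot Tern(w_l)$, where $S_l$ is the layer-wise scaling factor and $Tern(\cdot)$ is the ternarization function, and both terms can pass back gradients to update their embedded parameters. To assign a proper gradient to $Tern(w_l)$, we follow the STE design rule, which leads to the following expression:

$$\frac{\partial w'_l}{\partial w_l} = \frac{\partial \big(S_l \cdot Tern(w_l)\big)}{\partial w_l} = S_l \cdot \frac{\partial Tern(w_l)}{\partial w_l} = 1 \tag{14}$$

Thus, the STE for the ternarization function can be derived from Eq. 14 as:

$$\frac{\partial Tern(w_l)}{\partial w_l} = \frac{1}{S_l} \tag{15}$$

As seen in Eq. 15, instead of simply assigning the gradient as 1, we scale $\partial Tern(w_l) / \partial w_l$ w.r.t. the value of $S_l$ in real time. As shown in Fig. 4(b), the STE can better approximate the gradient with the adjustable gradient correctness term.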
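The effect of the correction term on the weight gradient can be illustrated with two small backward functions. The names and numbers are hypothetical; the sketch only contrasts a staircase gradient fixed at 1 with one set to the reciprocal of the scaling factor, treating the scaling factor as a constant with respect to the individual weight.

```python
# Sketch: weight gradient through w' = S * Tern(w) with and without correction.
# grad_wq stands for dL/dw', the gradient arriving at the ternarized weight.

def backward_naive(grad_wq, scale):
    """Staircase gradient assigned as 1, so dL/dw = dL/dw' * S * 1."""
    return grad_wq * scale * 1.0


def backward_corrected(grad_wq, scale):
    """Staircase gradient assigned as 1/S, so dL/dw = dL/dw' * S * (1/S)."""
    return grad_wq * scale * (1.0 / scale)


scale = 0.25       # illustrative layer-wise scaling factor below 1
grad_wq = 0.8      # illustrative upstream gradient

naive = backward_naive(grad_wq, scale)
corrected = backward_corrected(grad_wq, scale)
print(naive, corrected)
```

With a scaling factor below 1, the naive rule shrinks every weight update by that factor, whereas the corrected rule passes the upstream gradient to the full-precision weight unchanged.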
IV Experiment and Result Evaluation
IV-A Experiment setup
In this work, we evaluate our proposed network ternarization method on the object classification task with the CIFAR-10 and ImageNet datasets. All experiments are performed with the PyTorch deep learning framework using 4-way Nvidia Titan-XP GPUs. For clarification, both the first and last layers are ternarized during the training and test stages.
CIFAR-10 contains 50 thousand training samples and 10 thousand test samples with an image size of 32×32. The data augmentation method follows that of prior work. For fine-tuning, we set the initial learning rate to 0.1, which is scaled by 0.1 at epochs 80 and 120. The mini-batch size is set to 128.
In order to provide more comprehensive experimental results on a large dataset, we examine our model ternarization techniques on the image classification task with the ImageNet [25] (ILSVRC2012) dataset. ImageNet contains 1.2 million training images and 50 thousand validation images, which are labeled with 1000 categories. For data pre-processing, we choose the scheme adopted by ResNet [2]. The augmentations applied to the training images are, in order: randomly resized crop, random horizontal flip, and pixel-wise normalization. All reported classification accuracies on the validation dataset are single-crop results. The mini-batch size is set to 256.
IV-B Ablation studies
In order to examine the effectiveness of our proposed methods, we have performed the following ablation studies. The experiments are conducted with ResNet-20 [2] on the CIFAR-10 dataset, where the differences are significant enough to show the effectiveness.
|STE design|Accuracy|
|w/ gradient correctness|90.39%|
|w/o gradient correctness|87.89%|
IV-B1 Gradient Correctness
We compare the convergence speed of the accuracy curves between the STE designs with and without gradient correctness. As shown in Fig. 5, network training with gradient correctness is much faster than without it. The main cause of the convergence speed degradation is that, when the layer-wise scaling factor is less than 1 and gradient correctness is not applied, the gradient of the loss function w.r.t. the weights is scaled by the scaling factor due to the chain rule. Thus, the weights are updated with a much smaller step size than the thresholds when the optimizers are set up with identical parameters (e.g., learning rate).
IV-B2 Optimizer on thresholds
Vanilla SGD and Adam are the two most widely adopted optimizers for quantized neural network training. Here, we take these two optimizers as examples to show the training evolution. Note that, since weights and thresholds are iteratively updated for each input mini-batch, we can use different optimizers for weights and thresholds. In this experiment, we use SGD for weight optimization, while using either SGD or Adam on the thresholds. The result depicted in Fig. 6 shows that it is better to use SGD for both weights and thresholds to achieve higher accuracy.
IV-B3 Thresholds Initialization
In order to examine how the threshold initialization affects network training, we initialize the thresholds of all layers with different small values. The experimental results reported in Fig. 7 show that the initialization does not play an important role for network ternarization in our case. The reason may be twofold: 1) on one hand, all the layer-wise ternarization thresholds are initialized with small values, so the difference is not significant; 2) on the other hand, all the thresholds are fully trainable, which mitigates the difference during training.
IV-C Performance on ImageNet dataset
Beyond the ablation studies performed on the CIFAR-10 dataset, we also conduct experiments on the large-scale ImageNet dataset with ResNet-18/34/50 (type-b residual connection) network structures. The experimental results are listed in Table II together with the methods adopted in related works. Since, in the realistic case of a neural network operating on specifically designed hardware, all layers are expected to be ternarized, we ternarize the first and last layers as well. The results show that our method achieves state-of-the-art results. The layer-wise thresholds are initialized as . We use the full-precision pre-trained model for weight initialization as described in Fig. 1. The learning rate starts from 1e-4 and then changes to 2e-5, 4e-6 and 2e-6 at epochs 30, 40 and 45 respectively.
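The learning-rate schedule quoted above can be written as a small lookup. This is a sketch only: the function name is made up, and it assumes zero-indexed epochs with each new rate taking effect at the start of the listed epoch.

```python
# Sketch: piecewise-constant learning-rate schedule for the ImageNet runs.
# Assumes zero-indexed epochs; a new rate applies from the milestone epoch on.
from bisect import bisect_right

MILESTONES = [30, 40, 45]              # epochs at which the rate changes
RATES = [1e-4, 2e-5, 4e-6, 2e-6]       # rate before and after each milestone


def learning_rate(epoch):
    """Return the learning rate in effect at the given epoch."""
    return RATES[bisect_right(MILESTONES, epoch)]


for epoch in (0, 29, 30, 40, 45):
    print(epoch, learning_rate(epoch))
```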
V Conclusion and future works
In this work, we have proposed a neural network ternarization method that incorporates thresholds as trainable parameters within the network inference path, so that both weights and thresholds are updated through back-propagation. Furthermore, we explicitly discuss the importance of the straight-through-estimator design for approximating the gradient of the staircase function. In general, our work is based on the assumption that the weights of deep neural networks tend to follow a Gaussian distribution. It turns out that such an assumption successfully yields an abstract model for the purpose of network ternarization.
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
[3] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 5987–5995, IEEE, 2017.
[4] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
[5] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, vol. 1, p. 3, 2017.
[6] Z. He, B. Gong, and D. Fan, “Optimize deep convolutional neural network with ternarized weights and high accuracy,” arXiv preprint arXiv:1807.07948, 2018.
[7] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711, 2016.
[8] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.
[9] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
[10] J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” arXiv preprint arXiv:1707.06342, 2017.
[11] A. Mishra and D. Marr, “Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy,” arXiv preprint arXiv:1711.05852, 2017.
[12] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, pp. 1135–1143, 2015.
[13] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
[14] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
[15] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in neural information processing systems, pp. 3123–3131, 2015.
[16] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision, pp. 525–542, Springer, 2016.
[17] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
[18] C. Leng, H. Li, S. Zhu, and R. Jin, “Extremely low bit neural network: Squeeze the last bit out with admm,” arXiv preprint arXiv:1707.09870, 2017.
[19] X. Lin, C. Zhao, and W. Pan, “Towards accurate binary convolutional neural network,” in Advances in Neural Information Processing Systems, pp. 344–352, 2017.
[20] D. Zhang, J. Yang, D. Ye, and G. Hua, “Lq-nets: Learned quantization for highly accurate and compact deep neural networks,” arXiv preprint arXiv:1807.10029, 2018.
[21] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
[22] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural networks,” arXiv preprint arXiv:1505.05424, 2015.
[23] S. S. Shapiro and M. B. Wilk, “An analysis of variance test for normality (complete samples),” Biometrika, vol. 52, no. 3/4, pp. 591–611, 1965.
[24] J. G. Proakis, M. Salehi, N. Zhou, and X. Li, Communication systems engineering, vol. 2. Prentice Hall New Jersey, 1994.
[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.