On Periodic Functions as Regularizers
for Quantization of Neural Networks
Deep learning models have been successfully used in computer vision and many other fields. We propose an unorthodox algorithm for performing quantization of the model parameters. In contrast with popular quantization schemes based on thresholds, we use a novel technique based on periodic functions, such as continuous trigonometric sine or cosine as well as non-continuous hat functions. We apply these functions component-wise and add the sum over the model parameters as a regularizer to the model loss during training. The frequency and amplitude hyper-parameters of these functions can be adjusted during training. The regularization pushes the weights into discrete points that can be encoded as integers. We show that using this technique the resulting quantized models exhibit the same accuracy as the original ones on CIFAR-10 and ImageNet datasets.
Maxim Naumov, Utku Diril, Jongsoo Park, Benjamin Ray, Jedrzej Jablonski and Andrew Tulloch
Facebook, 1 Hacker Way, Menlo Park, CA, 94025
Deep learning models require a very large amount of resources during training (repeated forward and backward propagation) as well as inference (forward propagation). Further, the latter is often performed on edge devices, such as smartphones or embedded systems, which operate within strict size, temperature and power budgets Shimpi (2011); Humrick (2017); Dolbeau (2018); NVIDIA (2018). As a result these devices can perform a limited number of operations per second, as illustrated in Tab. 1. The peak for ARM Cortex is based on fp32 “VMLA.F32 Qd, Qn Dm”, fp16 “VMLA.F16 Qd, Qn Dm” and int8 “VMLAL.S8 Qd, Dn, Dm” instructions, with estimated reciprocal throughput and width for A-7, reciprocal throughput and width for A-75. Then, peak ops are defined as (frequency/throughput)*width*cores. Also, the power is assumed to be 750mW and 1W per core for ARM Cortex A-7 and A-75, respectively.
|ARM Cortex A-7(2-core)||1.5||1.5||3||12|
|ARM Cortex A-75(4-core)||3.0||4||48||96||96|
|NVIDIA Turing Tesla T4||1.35||70||8100||16200||130000|
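The peak-ops formula quoted above, (frequency/throughput)*width*cores, can be sketched in a few lines of Python. The function names and the input values below are hypothetical and chosen only for illustration; they are not the throughput or width figures used to build Tab. 1.

```python
def peak_ops(frequency_ghz, reciprocal_throughput, width, cores):
    """Peak giga-operations per second: (frequency / throughput) * width * cores."""
    return (frequency_ghz / reciprocal_throughput) * width * cores


def ops_per_watt(peak_gops, power_watts):
    """Giga-operations per second delivered per watt of power budget."""
    return peak_gops / power_watts


# Hypothetical inputs, NOT the values behind Tab. 1.
example = peak_ops(frequency_ghz=2.0, reciprocal_throughput=1, width=8, cores=4)
print(example)                     # 64.0
print(ops_per_watt(example, 4.0))  # 16.0
```

Dividing by the power budget makes it easier to compare devices whose absolute peaks differ by orders of magnitude.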
In order to decrease the storage and compute requirements of the model during inference its parameters are often stored as integers with a low number of bits. It is common to use 8-bit integers (1 Byte), rather than 16- (2 Bytes) or 32-bit (4 Bytes) floating point numbers. The process of converting model parameters from “continuous” floating point to discrete integer numbers is called quantization.
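As a rough illustration of the storage argument, the sketch below computes the size of the parameters at different bit widths. The parameter count is a hypothetical round number, and the helper name is ours.

```python
def model_size_bytes(num_parameters, bits_per_parameter):
    """Storage needed for the parameters alone, ignoring any metadata."""
    return num_parameters * bits_per_parameter // 8


num_parameters = 1_000_000  # hypothetical model size, for illustration only
for bits in (32, 16, 8):
    print(bits, "bits:", model_size_bytes(num_parameters, bits), "bytes")
```

Moving from 32-bit floating point to 8-bit integers reduces the parameter storage by a factor of four.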
Let the original optimization problem be
$$\min_{w} \; L(w) \tag{1}$$
where $L$ is the loss measured during training. There are many different quantization schemes based on symmetric vs. asymmetric intervals, uniform vs. non-uniform discrete partitioning, different rounding modes and choices for handling the outliers, e.g. the few elements that lie outside of the range of most of the other elements.
Let us consider a uniform quantization of the parameter weights $w$, for example from either the convolution (14) or fully connected layers (15), and activations. Then, the quantized problem is commonly written as
$$\min_{w} \; L(q(w)) \tag{2}$$
where $q$ is the quantization function.
For instance, if we use only two intervals then the process is referred to as binarization and resulting element can be stored in a single bit Courbariaux et al. (2015; 2016); Hubara et al. (2018); Rastegari et al. (2016). It can be performed using a single threshold point as shown below
The ternary networks use three intervals with resulting elements stored in 2 bits Li et al. (2016); Mellempudi et al. (2017); Choi et al. (2018). Then, quantization can be performed using two threshold points and resulting in
Finally, let an arbitrary number of bits correspond to points and intervals. Let us assume that we would like to quantize floating point number , with and length . A uniform quantization can be performed symmetrically in the interval using multiplier , so that
with effective points because is double counted.
On the other hand, notice that we can shift the interval to the interval located around by adding a scalar bias term . Therefore, uniform asymmetric quantization can be performed with bias using multiplier , so that
The advantage of symmetric quantization is that for sparse parameters, with many zero elements, the sparsity is preserved. Note that computation with zeroes can be skipped in hardware Albericio et al. (2016); Venkatesh et al. (2016); Reagen et al. (2016); Chen et al. (2017); Kim et al. (2017); Parashar et al. (2017). The disadvantage is that for highly asymmetric intervals many discrete representations may be wasted.
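The trade-off between the two schemes can be seen in a small pure-Python sketch. It uses one common formulation of symmetric and asymmetric uniform quantization, which is an assumption on our part and not necessarily the exact form of (5) and (6); the function names are ours, and the input range is assumed to be non-degenerate.

```python
def quantize_symmetric(x, bits):
    """Map floats to signed integer codes in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    c = max(abs(v) for v in x)        # largest magnitude, assumed non-zero
    scale = (2 ** (bits - 1) - 1) / c
    return [round(v * scale) for v in x], scale


def quantize_asymmetric(x, bits):
    """Map floats to unsigned integer codes in [0, 2**bits - 1] using a bias."""
    lo, hi = min(x), max(x)           # assumes hi > lo
    scale = (2 ** bits - 1) / (hi - lo)
    return [round((v - lo) * scale) for v in x], scale, lo


x = [0.0, 0.3, -0.8, 0.0, 2.0]
codes_sym, _ = quantize_symmetric(x, 8)
codes_asym, _, _ = quantize_asymmetric(x, 8)
print(codes_sym)   # zeros in x stay zero codes, so sparsity is preserved
print(codes_asym)  # the whole code range is used, but zero maps to a non-zero code
```

In the symmetric case the codes below -51 are never used for this input, which is the waste mentioned above for highly asymmetric intervals.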
The non-uniform quantization assigns discrete points to the interval based on the distribution of floating point values in it Bagherinezhad et al. (2017); Wang et al. (2018). Therefore, it does not have a fixed stride from one point to the next. Its advantage is that the encoded values are more representative of the original ones, but at the same time it can be hard to map back and perform operations with them.
The techniques for handling outliers and determining maximum thresholds, e.g. using adaptive schemes, Kullback–Leibler (KL) divergence as a measure of information loss, or L2 error minimization in Caffe2, have been investigated in Jia et al. (2014); Migasz (2017); Zhou et al. (2017); Park et al. (2018).
However, independent of all of these choices, notice that a common trend among (3) - (6) is that fixed thresholds are used in the quantization function to clamp floating point values to discrete points. We point out that the matrix- and neural network-based compression techniques are outside the scope of this paper Gong et al. (2014); Denton et al. (2014); Jaderberg et al. (2014); Mishra & Marr (2018).
In this paper we will focus on a very different approach for uniform quantization using periodic functions, such as trigonometric sine (or cosine) as well as hat functions. We discuss uniform quantization, but our ideas can be generalized to the non-uniform case using variations of these periodic functions with decaying amplitude and increasing base lengths away from the origin Stenger (1993); Strang & Fix (2008).
2 Periodic Functions as Regularizers
We propose an unorthodox approach for quantizing the weights of a neural network. Instead of using a quantization function as in (2), we propose adding a regularization term to the loss, so that the resulting optimization problem is written as
$$\min_{w} \; L(w) + \lambda R(w) \tag{7}$$
where $\lambda$ is a scalar scaling parameter and $R(w)$ is the regularization term.
The regularization term is a sum of periodic functions that push the values of the weights (and potentially activations) to a set of discrete points during training. Next we will discuss different choices for these functions.
2.1 Trigonometric (Continuous) Functions
Let us focus only on the weights and use trigonometric sine, so that
$$R(w) = a \sum_{i} \sin^{2}\left(\pi f \, \frac{w_i}{c}\right) \tag{8}$$
where $a$ is the amplitude, $f$ is the frequency and $c$ is the maximum weight in absolute value as defined in (5).
Notice that for frequency=1 the function R(w) attains its minimum 0.0 when the weight values are distributed at 3 discrete locations, while for frequency=7 it attains its minimum 0.0 when the weight values are distributed at 15 discrete locations, as shown in Fig. 1(a) and 1(b), respectively.
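The count of minima can be checked numerically. The sketch below is a pure-Python rendering of the squared-sine penalty, written to match the PyTorch snippet in the experiments section; the helper `minima` and its default `c=1.0` are our own additions for illustration.

```python
import math


def sine_regularizer(weights, amplitude, frequency):
    """Amplitude times the sum of sin^2(pi * frequency * w / c), with c = max |w|."""
    c = max(abs(w) for w in weights)
    return amplitude * sum(
        math.sin(math.pi * frequency * w / c) ** 2 for w in weights
    )


def minima(frequency, c=1.0):
    """Points k * c / frequency, k = -frequency..frequency, where the penalty vanishes."""
    return [k * c / frequency for k in range(-frequency, frequency + 1)]


print(len(minima(1)), len(minima(7)))          # 3 15
print(sine_regularizer(minima(7), 1.0, 7))     # zero up to rounding error
print(sine_regularizer([1.0, 0.5], 1.0, 1))    # 0.5 lies at a maximum for frequency 1
```

A weight halfway between two grid points pays the full penalty, while weights on the grid pay none.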
2.2 Hat (Non-Continuous) Functions
Let us focus only on the weights and use a hat function, so that
$$R(w) = a \sum_{i} \left| \, 2 \left( f \left( \frac{w_i}{c} - \frac{1}{2} \right) \bmod 1 \right) - 1 \, \right| \tag{9}$$
where $a$ is the amplitude, $f$ is the frequency and $c$ is the maximum weight in absolute value as defined in (5).
Once again, notice that for frequency=1 the function R(w) attains its minimum 0.0 when the weight values are distributed at 3 discrete locations, while for frequency=7 it attains its minimum 0.0 when the weight values are distributed at 15 discrete locations, as shown in Fig. 2(a) and 2(b), respectively. Here we show a variant of the hat function corresponding to sine, while a shifted variant corresponding to cosine is also possible.
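A pure-Python sketch of the hat penalty for a single normalized weight is given below. It mirrors the expression in the PyTorch snippet in the experiments section; the function name and the choice to pass the already normalized value x = w/c are ours.

```python
def hat(x, frequency):
    """Hat penalty for a normalized weight x = w / c in [-1, 1]."""
    return abs((((x - 0.5) * frequency) % 1) * 2 - 1)


print(hat(0.0, 1), hat(0.25, 1), hat(0.5, 1))   # 0.0 0.5 1.0
print(max(hat(k / 7, 7) for k in range(-7, 8)))  # zero up to rounding error
```

Note that, as written, this expression places a minimum at x = 0 only when the frequency is odd, which holds for the frequencies 1 and 7 discussed above; for an even frequency such as 2 it evaluates to 1.0 at x = 0.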
Notice that the use of regularization for the purpose of quantization has been suggested in Hung et al. (2015). However, in this earlier work the authors use distance from fixed points (centroids) as a penalty measure to ensure quantization. This contrasts with our periodic trigonometric sine and hat functions, with amplitude and frequency hyper-parameters defined in (8) and (9), respectively.
It is important to highlight a few differences between sine (or cosine) and hat functions. Notice that the sine function has very nice properties. It is periodic, continuous and differentiable. However, it is not convex, unlike many of the existing regularizers. Also, notice that the maximum value of the regularizer R(w) is known ahead of time. It can be computed by assuming that all weights translate into value 1.0 after application of the function. Then, the regularizer can be scaled by a constant so that R(w) stays within a known bound. This can be used to facilitate and in fact define the regularizer scaling in (7), therefore reducing the number of hyper-parameters.
Also, the sine function has a gradient that is zero (or close to zero) in the neighborhood of points where it attains its minimum and maximum values. This property might make escaping the maximum or approaching the minimum slow in their respective neighborhoods. On the other hand, the hat function is non-convex and non-continuous, with constant gradient towards the minimum except for the points where it attains its minimum and maximum values, where the gradient does not exist. These tradeoffs might guide the choice between these functions, in a way similar to that of a choice between Sigmoid and ReLU activation functions.
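The difference in gradient behaviour can be made concrete with a small sketch. It assumes a single normalized weight x = w/c, unit amplitude, and the squared-sine and hat expressions from the PyTorch snippet in the experiments section; the function names are ours.

```python
import math


def sine_penalty_grad(x, frequency):
    """d/dx sin^2(pi f x) = pi f sin(2 pi f x); vanishes at minima and maxima."""
    return math.pi * frequency * math.sin(2 * math.pi * frequency * x)


def hat_penalty_grad(x, frequency):
    """Slope of |2 (u mod 1) - 1| with u = f (x - 1/2); undefined at the kinks."""
    u = ((x - 0.5) * frequency) % 1
    if u == 0.0 or u == 0.5:
        raise ValueError("gradient does not exist at a minimum or maximum")
    return 2 * frequency if u > 0.5 else -2 * frequency


# Close to the minimum at x = 0 the sine gradient is small, the hat slope is not.
print(sine_penalty_grad(0.01, 1))  # roughly 0.2
print(hat_penalty_grad(0.01, 1))   # 2
```

Under gradient descent the hat penalty therefore keeps pulling with the same strength all the way to the grid point, whereas the pull of the sine penalty fades as the weight gets close.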
Finally, notice that amplitude can be changed adaptively during the training procedure, which allows us to obtain higher test accuracy, as will be shown in the experiments section. The frequency can also be varied during training, but these experiments are outside of the scope of this paper.
2.3 From Bits to Frequency and Vice-Versa
In practice we are interested in selecting the number of bits to be used for quantization. For the sine and associated hat function the frequency $f$ corresponding to the number of bits $b$ can be found by using
$$f = 2^{b-1} - 1$$
so that frequency 1 implies 2 bits, while frequency 7 implies 4 bits.
On the other hand, for cosine the frequency $f$ corresponding to the number of bits $b$ can be found by using
$$f = 2^{b-1}$$
For instance, frequency 1 implies 1 bit, while frequency 8 implies 4 bits, and so on.
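The conversions in both directions can be sketched as follows. The closed forms are inferred from the examples in the text (frequency 1 and 7 for 2 and 4 bits with sine, frequency 1 and 8 for 1 and 4 bits with cosine) and from counting the minima, 2f+1 for sine and 2f for cosine; they should be read as our assumption, and the function names are ours.

```python
def sine_frequency(bits):
    """Frequency for the sine (and associated hat) regularizer."""
    return 2 ** (bits - 1) - 1


def cosine_frequency(bits):
    """Frequency for the cosine regularizer."""
    return 2 ** (bits - 1)


def sine_bits(frequency):
    """Bits needed to index the 2f + 1 minima of the sine regularizer."""
    return (2 * frequency).bit_length()      # ceil(log2(2f + 1))


def cosine_bits(frequency):
    """Bits needed to index the 2f minima of the cosine regularizer."""
    return (2 * frequency - 1).bit_length()  # ceil(log2(2f))


for b in (2, 4, 8):
    print(b, sine_frequency(b), cosine_frequency(b))
```

For 8-bit quantization this gives frequency 127 for sine and 128 for cosine.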
In this section we will investigate the accuracy of ResNet-20 on CIFAR-10 and ResNet-50 on ImageNet datasets He et al. (2015); Krizhevsky et al. (2009); Deng et al. (2009). We will compare the test error achieved by the original and quantized models with loss function defined in (1) and (7), respectively. The regularization term we add to the loss in (7) relies on periodic functions: trigonometric sine in (8) and hat in (9). It can be computed using the following PyTorch Paszke et al. (2017) code snippet
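```python
import math

import torch
import torch.nn as nn


def periodic_regularization(model, amplitude, frequency):
    total = 0
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            w = m.weight
            ic = 1 / w.abs().max()  # inverse of the largest |weight| in this layer
            # either the squared sine
            rw = torch.sum(amplitude *
                           torch.pow(torch.sin(math.pi * frequency * (w * ic)), 2))
            # or the hat function
            # rw = torch.sum(amplitude *
            #                torch.abs(((((w * ic) - 0.5) * frequency) % 1) * 2 - 1))
            total = total + rw
    return total
```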
The training is performed using batch size 256 with default 100 epochs for CIFAR-10 and 90 epochs for ImageNet dataset. We use a fixed schedule that adjusts the amplitude hyper-parameter every epochs. We start with a small amplitude, such as , and progressively adjust it until it reaches, say after typical epochs of training. Notice that amplitude subsumes the scaling hyper-parameter in (7), which is always set to 1.0. Note that other than using a fixed schedule we do not require any special treatment for the first or last model layers or training epochs, which is otherwise often required to produce good approximations. We show the results of representative runs.
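A fixed amplitude schedule of this kind might look like the sketch below. The geometric form, the starting value, the growth factor, the step length and the cap are all hypothetical, since the concrete values are not given here; only the target range of the final amplitude (0.01 - 0.001) comes from the text.

```python
def amplitude_at_epoch(epoch, start=1e-5, factor=10.0, every=20, final=1e-2):
    """Hypothetical schedule: multiply the amplitude by `factor` every `every` epochs,
    never exceeding `final`."""
    steps = epoch // every
    return min(start * factor ** steps, final)


for epoch in (0, 19, 20, 40, 99):
    print(epoch, amplitude_at_epoch(epoch))
```

Starting small lets the loss dominate early training, while the growing amplitude progressively tightens the weights around the discrete points.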
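```python
import torch.nn as nn


def quantize_model(model, frequency):
    def quantize_weights(m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            c = m.weight.abs().max().data      # largest |weight| in this layer
            m.weight.data.mul_(frequency / c)  # scale to [-frequency, frequency]
            m.weight.data.round_()             # snap to the nearest integer
            m.weight.data.mul_(c / frequency)  # scale back to the original range
    model.apply(quantize_weights)
```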
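To see what this rounding does to individual values, the following pure-Python sketch applies the same scale, round and rescale steps to a short list of weights and reports the integer codes and the worst-case error. The function names and the sample weights are ours, for illustration only.

```python
def quantize(weights, frequency):
    """Return integer codes and the corresponding de-quantized weights."""
    c = max(abs(w) for w in weights)
    codes = [round(w * frequency / c) for w in weights]
    return codes, [k * c / frequency for k in codes]


def max_error(weights, quantized):
    """Largest absolute difference between original and quantized weights."""
    return max(abs(a - b) for a, b in zip(weights, quantized))


weights = [1.0, 0.29, -0.57, 0.0]  # illustrative values close to the grid k/7
codes, quantized = quantize(weights, frequency=7)
print(codes)                          # [7, 2, -4, 0]
print(max_error(weights, quantized))  # small, and never above c / (2 * frequency)
```

The rounding error is bounded by half the grid spacing, c/(2f), and it shrinks towards zero for weights that the regularizer has already pushed close to the grid.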
We illustrate the difference between the original model, the original model trained with regularization (Regu), and the quantized model (Quant) for the CIFAR-10 dataset in Fig. 4. The accuracy of the original model is plotted with a black dotted line, while the accuracy of the other models is plotted with color bars. Notice that the model accuracy changes significantly depending on the number of bits used for quantization. For instance, there seems to be a clear boundary between 4 and 8 bits, separating the cases where there are and are not enough bits to represent the information. Notice that while the training succeeds in all cases, the quantization fails to produce accurate results with fewer than 8 bits.
Also, we illustrate the attained model accuracy with different starting amplitudes for the CIFAR-10 dataset in Fig. 5. The accuracy of the original model is plotted with a black dotted line, while the accuracy of the 8-bit quantized model is plotted with color bars. Notice that using adaptive rather than static amplitude allows us to reach higher test accuracy. Also, in our experiments we have found that it is good practice to choose the initial amplitude and the fixed schedule such that the final amplitude is in the range of 0.01 - 0.001, which corresponds to a reasonable value of the regularization scaling in (7). We observe similar results on the ImageNet dataset, as seen in Fig. 6.
Finally, notice that the sine and hat functions perform equally well on the ImageNet dataset, as shown in Fig. 7. Once again, the accuracy of the original model is plotted with a black dotted line, while the accuracy of the 8-bit quantized model is plotted with color bars. In all plots, amplitude denotes the final amplitude. The detailed results are also summarized in Tab. 2 and 3.
|Default Model||Quantized model (with sine)|
|test (best)||8 bits||4 bits||2 bits|
|Test error (fixed)||84.72 (87.70)||29.26||58.18||79.18||n/a||n/a|
|Test error (dyn)||84.72 (87.70)||53.28||74.14||87.46||16.66||10.20|
|Default Model||Quantized model (with sine)||(with hat)|
|8 bits||8 bits|
4 Conclusion and Future Work
We have proposed a novel technique for quantizing neural networks, based on regularization with periodic functions. We have shown that it can be effectively used to quantize ResNets on CIFAR-10 and ImageNet datasets. In our experiments we have achieved virtually no loss in accuracy vis-à-vis the standard model by using amplitude scaling on a fixed schedule through training followed by 8-bit integer quantization. While similar quality results exist for quantization of CNNs, in this note we have achieved them through a completely novel method. In the future, we would like to incorporate the quantization of activations into this approach and experiment with more classes of neural networks.
The authors would like to thank Marat Dukhan, Bram Wasti and Satish Nadathur for collecting ARM Cortex A-7 and A-75 processor instruction information as well as Misha Smelyanskiy for his helpful comments and suggestions.
- Albericio et al. (2016) J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos. Cnvlutin: Ineffectual-neuron-free deep neural network computing. ACM/IEEE 43rd Annual International Symposium on Computer Architecture, 2016.
- Bagherinezhad et al. (2017) H. Bagherinezhad, M. Rastegari, and A. Farhadi. LCNN: Lookup-based convolutional neural network. Proc. Computer Vision and Pattern Recognition, 2017.
- Chen et al. (2017) Y.H. Chen, T. Krishna, J. Emer, and V. Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52:127–138, 2017.
- Choi et al. (2018) J. Choi, P. Chuang, Z. Wang, S. Venkataramani, V. Srinivasan, and K. Gopalakrishnan. Bridging the accuracy gap for 2-bit quantized neural networks (QNN). Proc. Computer Vision and Pattern Recognition, 2018.
- Courbariaux et al. (2015) M. Courbariaux, Y. Bengio, and J.P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. CoRR, 2015. URL https://arxiv.org/abs/1511.00363.
- Courbariaux et al. (2016) M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, 2016. URL https://arxiv.org/abs/1602.02830.
- Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 248–255, 2009. URL http://www.image-net.org.
- Denton et al. (2014) E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and B. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. Proc. Neural Information Processing Systems, 2014.
- Devarakonda et al. (2017) A. Devarakonda, M. Naumov, and M. Garland. Adabatch: Adaptive batch sizes for training deep neural networks. CoRR, 2017. URL https://arxiv.org/abs/1712.02029.
- Dolbeau (2018) R. Dolbeau. Theoretical peak flops per instruction set: A tutorial. The Journal of Supercomputing, 74:1341–1377, 2018.
- Gong et al. (2014) Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. CoRR, 2014. URL https://arxiv.org/abs/1412.6115.
- Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. URL https://www.deeplearningbook.org.
- He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, 2015. URL http://arxiv.org/abs/1512.03385.
- Hubara et al. (2018) I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18:1–30, 2018.
- Humrick (2017) M. Humrick. Exploring DynamIQ and ARM’s new CPUs: Cortex-A75, Cortex-A55. AnandTech, 2017. URL https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55.
- Hung et al. (2015) P. Hung, C. Lee, S. Yang, V. S. Somayazulu, Y. Chen, and S. Chien. Bridge deep learning to the physical world: An efficient method to quantize network. IEEE Signal Processing Systems, 2015.
- Ioffe & Szegedy (2015) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proc. International Conf. Machine Learning, pp. 448–456, 2015. URL http://proceedings.mlr.press/v37/ioffe15.html.
- Jacob et al. (2017) B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. CoRR, 2017. URL https://arxiv.org/abs/1712.05877.
- Jaderberg et al. (2014) M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. BMVC, 2014.
- Jia et al. (2014) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, 2014.
- Kim et al. (2017) D. Kim, J. Ahn, and S. Yoo. ZeNA: Zero-aware neural network accelerator. IEEE Design and Test, 35:39–46, 2017.
- Krishnamoorthi (2018) R. Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. CoRR, 2018.
- Krizhevsky et al. (2009) A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research). 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.
- Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E Hinton. ImageNet classification with deep convolutional neural networks. Advances Neural Information Processing Systems, pp. 1097–1105, 2012.
- LeCun et al. (1989a) Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989a.
- LeCun et al. (1989b) Y. LeCun, L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, I. Guyon, D. Henderson, R. E. Howard, and W. Hubbard. Handwritten digit recognition: Applications of neural net chips and automatic learning. IEEE Communication, pp. 41–46, 1989b.
- LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86:2278–2324, 1998. URL http://yann.lecun.com/exdb/mnist.
- Li et al. (2016) F. Li, B. Zhang, and B. Liu. Ternary weight networks. CoRR, 2016. URL https://arxiv.org/abs/1605.04711.
- Mellempudi et al. (2017) N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with fine-grained quantization. CoRR, 2017. URL https://arxiv.org/abs/1705.01462.
- Migasz (2017) S. Migasz. 8-bit inference with TensorRT. GTC, 2017. URL http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf.
- Mishra & Marr (2018) A. Mishra and D. Marr. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. CoRR, 2018. URL https://arxiv.org/abs/1711.05852.
- NVIDIA (2018) NVIDIA. Turing architecture whitepaper. 2018.
- Parashar et al. (2017) A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM/IEEE 44th Annual International Symposium on Computer Architecture, 2017.
- Park et al. (2018) E. Park, S. Yoo, and P. Vajda. Value-aware quantization for training and inference of neural networks. CoRR, 2018. URL https://arxiv.org/abs/1804.07802.
- Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. Proc. Neural Information Processing Systems, 2017.
- Rastegari et al. (2016) M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR, 2016. URL https://arxiv.org/abs/1603.05279.
- Reagen et al. (2016) B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernandez-Lobato, G.Y. Wei, and D. Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. ACM/IEEE 43rd Annual International Symposium on Computer Architecture, 2016.
- Shimpi (2011) A. L. Shimpi. ARM’s Cortex A7: Bringing cheaper dual-core & more power efficient high-end devices. AnandTech, 2011. URL https://www.anandtech.com/show/4991/arms-cortex-a7-bringing-cheaper-dualcore-more-power-efficient//-highend-devices.
- Stenger (1993) F. Stenger. Numerical methods based on sinc and analytic functions. Springer Series in Computational Mathematics, 20, 1993.
- Strang & Fix (2008) G. Strang and G. Fix. An Analysis of the Finite Element Method. Wellesley-Cambridge Press, 2nd Ed., 2008.
- Szegedy et al. (2014) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, 2014. URL https://arxiv.org/abs/1409.4842.
- Venkatesh et al. (2016) G. Venkatesh, E. Nurvitadhi, and D. Marr. Accelerating deep convolutional networks using low-precision and sparsity. CoRR, 2016. URL https://arxiv.org/abs/1610.00324.
- Wang et al. (2018) P. Wang, Q. Hu, Y. Zhang, C. Zhang, Y. Liu, and J. Cheng. Two-step quantization for low-bit neural networks. Proc. Computer Vision and Pattern Recognition, 2018.
- Wen et al. (2016) H. Wen, S. Zhou, Z. Liang, Y. Zhang, D. Feng, X. Zhou, and C. Yao. Training bit fully convolutional network for fast semantic segmentation. CoRR, 2016. URL https://arxiv.org/abs/1612.00212.
- Zhou et al. (2017) Y. Zhou, S.M. Moosavi-Dezfooli, N.M. Cheung, and P. Frossard. Adaptive quantization for deep neural network. CoRR, 2017. URL https://arxiv.org/abs/1712.01048.
Appendix A Brief Background
Machine learning models are used in the fields of computer vision (CV) and natural language processing (NLP) among many others. In particular, deep learning models based on neural networks composed of multiple layers have achieved unprecedented gains in accuracy of image classification and object detection tasks Krizhevsky et al. (2012); Szegedy et al. (2014); LeCun et al. (1998); Krizhevsky et al. (2009); Deng et al. (2009).
In this paper we focus on the CV deep learning models that often rely on convolutional neural networks (CNNs), which are mainly composed of multiple convolution, fully connected and batch normalization layers LeCun et al. (1989a; b); Goodfellow et al. (2016); Ioffe & Szegedy (2015). For example, we will investigate ResNet-20 on CIFAR-10 and ResNet-50 on ImageNet datasets He et al. (2015); Krizhevsky et al. (2009); Deng et al. (2009). For completeness we review the most common layers next.
The convolution layer can be defined as
where input image , the filter , while denotes a convolution (in this context it is also common to use a cross-correlation rather than a convolution) and denotes a pooling operation, resulting in output with and for strides . The operation is usually repeated for output channels, resulting in .
The fully connected layer is defined as
where input , weights , bias , unit vector , the non-linear activation function is applied component-wise on intermediate and output for batch size.
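As an illustration of the fully connected layer, the sketch below assumes the common form Y = f(W X + b e^T), with one column of X per sample in the batch and e a vector of ones that broadcasts the bias across the batch. This form, the choice of ReLU as the activation, and the function names are our assumptions and may differ in detail from (15).

```python
def relu(v):
    """Example component-wise activation."""
    return max(0.0, v)


def fully_connected(W, X, b, activation=relu):
    """Compute f(W X + b e^T) for W (m x n), X (n x k) and bias b (length m)."""
    m, n, k = len(W), len(X), len(X[0])
    return [
        [activation(sum(W[i][p] * X[p][j] for p in range(n)) + b[i])
         for j in range(k)]
        for i in range(m)
    ]


W = [[1, -1], [0, 2]]
X = [[1, 2], [3, 4]]   # two samples, stored as columns
b = [0.5, -1]
print(fully_connected(W, X, b))  # [[0.0, 0.0], [5, 7]]
```

The weights W in this layer are the values that the periodic regularizer and the quantization act on.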
The typical batch normalization layer can be written as
where input , scaled diagonal matrix of weights , bias , unit vector for batch size Devarakonda et al. (2017).