Convolutional neural network compression for natural language processing
Convolutional neural networks are modern models that are very efficient in many classification tasks. They were originally created for image processing purposes. Then some trials were performed to use them in different domains like natural language processing. The artificial intelligence systems (like humanoid robots) are very often based on embedded systems with constraints on memory, power consumption etc. Therefore convolutional neural network because of its memory capacity should be reduced to be mapped to given hardware. In this paper, results are presented of compressing the efficient convolutional neural networks for sentiment analysis. The main steps are quantization and pruning processes. The method responsible for mapping compressed network to FPGA and results of this implementation are presented. The described simulations showed that 5-bit width is enough to have no drop in accuracy from floating point version of the network. Additionally, significant memory footprint reduction was achieved (from 85% up to 93%).
Natural Language processing (NLP) is considered one of three main pillars of Deep Learning along with image and video processing. Mobile devices are becoming increasingly dominant both in terms of their number as well as a computing load and network traffic. They also become more interactive in terms of NLP algorithms implemented for applications such as translation or voice-typing. Consequently, there is a significant benefit from reducing memory footprint and computational burden for deployment of NLP modules on embedded devices. It is worth noting that despite an abundance of research efforts in Deep Learning architectures compression for image processing [4, 5, 11], there are just a few projects aiming at NLP neural architectures compression [1, 2, 3, 12].
This paper shows a case study of employing common compression techniques for compressing NLP architectures. A series of quantization and pruning techniques were implemented and deployed in FPGA platform. The main goal of the paper was to examine feasibility and efficiency of using FPGAs in a domain of embedded neural computations. A set of common datasets were used to conduct reliable experiments. The experiments reviled a strong relationship between a size and the structure of datasets and performance of quantization and pruning methods used. All the layers of the neural architecture  were analyzed separately with respect to their ability to be quantized and pruned.
The structure of the paper is structured as follows. Section II provides an overview of CNN architectures, quantization and pruning processes. Section III covers the dataset used for the experiments. Section IV describes neural model used for the experiments as well as FPGA implementation details of the architecture. Finally, section VI contains the results of the experiments. Section VII summarizes contribution of this work.
Ii Convolutional Neural Network
CNNs are composed of neurons that have learnable weights and biases. Each neuron receives inputs structured as multi-dimensional vectors (tensor), perform a convolution operation on this input it with multi-dimensional filters and then optionally follows outputs with pooling and non-linearity functions. Typically, the layers of a CNN have neurons that generate outputs feature maps , as follows
where are two-dimensional (2D) convolutional kernels of dimensions , represents the convolution operation and are the bias terms.
The number of multiply-accumulate (MAC) operations and cycles spent on the execution of in a practical implementation is often used as the metric for a complexity of a CNN. Assuming each output feature map has elements (where P is equal to height multiplied by width of given feature map) the total number of MAC calculations for a convolutional filtering operation is . Our role in quantization is to reduce the complexity of each one of these operations and in network compression through pruning the goal is to reduce the total number of operations.
Ii-a Quantization process
Quantization is the procedure of constraining values from a continuous set or more dense domain to a relatively discrete set (e.g. integers), more sparse domain in which quantized input values will be represented. In our case, the domain is floating point representation. A floating-point number can be represented as
is the mantissa, is the base, and is the exponent. In case of single precision floating point format according to the IEEE-754 standard the mantissa is assigned 23bits, the exponent is assigned 8bits, and 1bit is assigned for a sign indicator. Therefore the set of values that can be defined by this format is described by
Similarly, a reduced precision IEEE-754 mini-float format assigns 10bits of mantissa, 5bits of exponent and a sign bit.
Currently, GPUs with parallel processing being applied to machine learning operate mainly using the single precision format. In contrast, embedded DSPs and the latest GPUs operate with 8-bit integer and 16-bit fixed-point processing that restricts numbers to a range
where are conventional bit-widths and is a shift down (or up) that determines fractional length, or number of fractional bits, as well as the integer length for signed numerical representation. When the fractional length varies over individual coefficients or data samples. This format is also referred to as dynamic fixed-point .
It is possible to define a general mapping from a set of floating-point data to fixed-point as follows (assuming signed representation)
In our case and where
and . The scaling factor is essentially just a shift up or down. A drawback is that a great deal of precision may be lost if the distribution of the dataset is skewed by a large mean.
Yet another approach can define the number of integer and fractional bits to represent regions of a distribution that will represent a large percentage of the range. In these cases, there will be saturation of a small percentage of the data, such as outliers, through the quantization procedure which may or may not affect the accuracy in a significant way. To determine the effects of saturation on can experiment with different saturation levels. Therefore histogram analysis is used to analyze outliers and set best levels of saturation.
Another approach is quantization that maps floating point values to integers:
The parameter can be set to ( is a input set of values to be quantized) or can be zero value. In a first case it is known as a asymmetric integer quantization (e.g. used in Tensorflow framework), in the second it is called symmetric. The compression system presented in the paper are based on the symmetric quantization and dynamic fixed point was also implemented.
All presented approaches are examples of linear quantization. It is possible also to use nonlinear version to minimize quantization loss but its hardware implementation is more sophisticated and it is more difficult to achieve gain in used hardware resources.
After training neural model we acquire a set of weights for each trainable layer. Those weights are not evenly distributed over the range of possible values for a selected data format. As presented in figure 1, most weights are concentrated around 0 and are very close to it. Therefore, their impact on resulting activation value is not significant. The mechanism of pruning removes weights which values are below a certain threshold level. Authors of  examined how pruning affects convolutional neural networks for image classification, showing little accuracy drop and high compression ratio. In this work pruning is applied to 1D convolution layer according to algorithm 1, as weights are concentrated around 0, for negative values threshold was calculated for an absolute value. Example from figure 1 presents that threshold level of 0.02 is sufficient to remove around 50% of weights. Depending on specific network implementation storing weights may require significant amounts of memory, removing weights through pruning have a direct impact on lowering storage requirements.
The order in which quantization and pruning are applied should have little effect, especially for higher precisions. If quantization is applied first then weights from a certain range will be brought to the same value and pruning will only be able to remove none or all of them. If pruning will be applied first it will be possible to cut out weights more precisely. This will only have an effect on border quantization buckets and therefore will have any significant impact only for very small precisions.
Experiments are performed on the same datasets as used in . Summary statistics of the datasets are provided in table I. Some datasets are divided into training, validation and testing data. If validation data is not specified then random 10% of training is used for it. If testing data is not prepared then 10-fold cross validation is performed.
Iv Network architecture
We employed a model CNN-static from . The model uses pre-trained 300-dimensional word embeddings GloVe , which are not fine-tuned during training. Unknown words are randomly initialized. Outputs of two parallel CNN layers with windows size 2 and 3 with 128 filters each are concatenated. The concatenation is connected to dense layer with 128 outputs. CNNs and dense layer have rectified linear units as activation functions. The last layer uses softmax activation function in case of many outputs and sigmoid activation function in case of binary classification. After each layer dropout was applied with probability 50%. Training data was divided into mini-batches of size 128. The training involves Nadam  as optimizing algorithm. Training lasts 25 epochs, the best model is saved.
Iv-a FPGA Implementation
Pruning mechanism is especially effective in FPGA implementations. On standard CPUs processing throughput improvement is not significant, multiply and accumulate operation is replaced by conditional, and on most modern processors both of those operations are supported by hardware. Greater improvements can be achieved when pruning is extensive enough to apply algorithms for sparse operations, such extreme pruning, however, will likely has a considerable impact on accuracy. When creating FPGA implementation of a certain neural network there is a higher degree of freedom. Its possible to hard code weights into logic during synthesis, as a result a complete cost of resolving condition if certain calculation should be performed, is moved from inference to compilation. Weights that will be removed by pruning are not included in the synthesized design, saving both logic and routing resources.
FPGA implementation can also give additional freedom for quantization adjustments, as it is not bound to any standard data width or format. On a standard CPU moving from 16 bit precision to 15 bit will not give any performance improvements, a calculation will be executed using the same hardware. For FPGAs however, it will result in synthesizing a smaller logic design with fewer connections, which can be beneficial for latency, throughput or calculations density.
Weights in the used neural network are mostly concentrated in Convolution operations, around 99% of total trained weights, with 1% in fully connected layers, as embeddings layer is not implemented in FPGA it is excluded from these calculations. Therefore FPGA implementation is concentrated on reducing and accelerating convolutional layer. The design was implemented and synthesized using Xilinx Vivado HLS environment according to algorithm 1. To allow for more effective pruning, all weights are hard coded into logic. Loops from lines 2, 3 and 4 are fully unrolled. Due to limitations of synthesis tools layer from the used neural network could not be used. Therefore experimental results presented in section VI-A are obtained from a smaller version of a similar network, the results, however, can be extrapolated. The process of transforming Keras model to an IP core that can be used in FPGA is presented in the figure 2.
In the first step, Keras network model is created and trained, when accuracy is satisfactory network model and weights need to be exported. Weights from an HDF5 file are processed by a python script to C++ header format. It is important that their values will be known during compilation as Vivado HLS will not be able to create hard coded network otherwise. A separate header is created for each layer. Json file is used to create infer function of a neural network. This is a top level function that performs all the computations on input data. At this, we can decide how the data and weights should be represented. To validate the generation process a simulation was performed. If results are satisfactory, Vivado HLS will synthesize networks IP core. Next, the IP core is put into Vivado design and bitstream is generated.
V Precision reduction
The precision reduction was applied to two types of places: weights of layers and outputs of activation functions. The process is performed by finding minimum and maximum of values to be reduced and uniformly dividing value range into buckets. Each bucket has assigned a middle point of represented subrange. In case of layer weights, it is straightforward. Precision reduction of activation functions requires to run computation on training data and remember minimum and maximum of activation functions for chosen layers. During testing, it is possible that computed values will be lower than minimum and larger than maximum — they need to be clipped.
The model used in this work has 5 layers: embeddings, two CNNs, and two dense layers. Activation functions are reduced after the two CNNs and the first dense layer. A number of bits used in each of the 8 places may be different.
The first set of experiments involved precision reduction to the same number of bits on every combination of the 8 places. Tested number of bits includes 32, 16, 8, 7, 6, 5, 4, 3, 2 and 1. It gives experiments for each dataset/fold. All presented results are based on integer quantization. In all experiments this type of quantization gives little bit better accuracy than in dynamic fixed point (difference up to 2%).
The results show that precision reduction to 16 bits does not change accuracy. Embeddings can be reduced to 5 bits, a majority of datasets weights of convolutional layers can be reduced to 4 bits. It is important to mark that convolutional layers are parallel and reducing one of the layers to 1 bit does not change overall accuracy significantly because the second is operating with full precision.
Tables II and III presents results for SST-2 and MR datasets for each place precision reduced separately. In the case of 32 bits there is no reduction as it is default precision. 16 bits do not change accuracy results. Accuracy increases for some settings. The maximal drop in accuracy less than 1% is observed down to 4–5 bits. Usually, quantization of activations decreases accuracy more than quantizations of weights. Precision reduction to less than 4 bits of activations of first dense layer affects negatively accuracy the most. The tables present also results for precision reduction of all places. a joint reduction can be better than only for one place.
The second experiment was focused on finding the strongest precision reduction with arbitrary chosen maximal accuracy decrease. Here we are not limiting the model reduction to the same number of bits for each place. In order to test every combination including 10 values of a number of bits, it would be necessary to perform tests, which is not computable in a reasonable time. To tackle this problem a random-restart hill climbing algorithm was employed. The algorithm starts with a randomly chosen order of places to be reduced. Then iterates the places in a cyclic way until does not increase model compression. For each layer, it tries to find the smallest number of bits that does not decrease accuracy more than a set threshold. The searching method using algorithm 2 was restarted 50 times. In this work, we set maximal decrease as 0.2% of accuracy without precision reduction.
Table IV shows results of precision reduction for each dataset. In comparison to 32 bit representation, the size of models can be reduced by more than 84%. However, the most space is taken by word embeddings (more than 93%). It could be further reduced by limiting vocabulary and reducing dimensions of embeddings. Another approach was to order the places by their influence on model size, in this case starting from optimizing embedding layer. The approach has not achieved better results than the algorithm 2. The same precision bits parameters were obtained for two datasets.
The best found precision settings are shown in table V. All precision values are equal to or less than 8 bits.
The third experiment involved a combination of quantization and pruning. It was performed on MR database, the number of bits changed in the same way as in previous experiments. Pruning was applied according to algorithm 1. pruning threshold started from 0, i.e. no pruning, with 0.005 step to 0.15, where all weights were removed and no multiply and accumulate operations were performed in convolutional layers. Figure 3 presents extensiveness of pruning for each data precision. It is visible that the ratio quickly drops, it is a result of Gaussian distribution of weights concentrated around 0. When quantization steps are greater than pruning threshold step ratio decreases more gradually.
Pruning impact on the accuracy of the network is presented in figure 4. Its effect on accuracy is highly separate from quantization. A significant drop in accuracy appears below 5 bit precision, and over 0.03-0.035 pruning threshold. Between these values and no reduction point, a rectangular plateau of only slight variations in precision can be observed.
Vi-a FPGA results
Described quantization and pruning mechanisms were implemented in FPGA to examine their impact on logic utilization. Experiments were performed on a smaller convolutional layer, with an input of size 64 and 35 feature maps, the filter of size 2 and output of size 63 and 16 feature maps. We checked what effects gives pruning and quantization separately and in conjunction. Experiments were performed on Xilinx Virtex-7 FPGA VC707 Evaluation Kit with Virtex 7 series FPGA. Test points were selected so the accuracy remained within 1% of the original. Results are summarized in table VI. When both techniques are used flip flips utilization drops times and look up tables utilization drops times, while having a little effect on the accuracy.
Vii Conclusions and future work
This paper considers several compression and pruning techniques for convolutional neural networks in sentiment analysis. The results of the experiments show that the size of the model may be reduced more than 84% with the little performance degradation of approximately 1%. Word embeddings can be compressed even more reaching 93%. Future work will concentrate on boosting the performance of given convolutional network by changing the architecture using deeper model and modifying filter scheme. The next issue will focus on using memetic algorithm and reinforcement techniques to find a configuration of the network which gives the best compression.
-  Raphael Shu and Hideki Nakayama. Compressing Word Embeddings via Deep Compositional Code Learning, 2017; arXiv:1711.01068.
-  Juan Andrés Laura, Gabriel Masi and Luis Argerich. From Imitation to Prediction, Data Compression vs Recurrent Neural Networks for Natural Language Processing, 2017; arXiv:1705.00697.
-  Artem M. Grachev, Dmitry I. Ignatov and Andrey V. Savchenko. Neural Networks Compression for Language Modeling, 2017; arXiv:1708.05963.
-  Yu Cheng, Duo Wang, Pan Zhou and Tao Zhang. A Survey of Model Compression and Acceleration for Deep Neural Networks, 2017; arXiv:1710.09282.
-  Yuhui Xu, Yongzhuang Wang, Aojun Zhou, Weiyao Lin and Hongkai Xiong. Deep Neural Network Compression with Single and Multiple Level Quantization, 2018; arXiv:1803.03289.
-  Yoon Kim. Convolutional Neural Networks for Sentence Classification, 2014; arXiv:1408.5882.
-  Pennington, Jeffrey and Socher, Richard and Manning, Christopher. Glove: Global vectors for word representation. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532-1543, 2014
-  Timothy Dozat. Incorporating Nesterov Momentum into Adam. 2015
-  Han, Song and Pool, Jeff and Tran, John and Dally, William. Learning both weights and connections for efficient neural network. Advances in neural information processing systems, 1135-1143, 2015
-  Philipp Gysel. Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks, 2016; arXiv:1605.06402.
-  Wróbel Krzysztof , Wielgosz Maciej, Pietroń Marcin, Duda Jerzy, Smywiński-Pohl Aleksander Improving text classification with vectors of reduced precision ICAART 2018 : 10th International Conference on Agents and Artificial Intelligence : proceedings : Funchal, Madeira, Portugal : 16-18 January, 2018 arXiv:1605.06402.
-  Piyush Kaul Motaz Al-Hami, Marcin Pietron,Raul Casas,Samer Hijazi Towards a Stable Quantized Convolutional Neural Networks: An Embedded Perspective ICAART 2018 : 10th International Conference on Agents and Artificial Intelligence : proceedings : Funchal, Madeira, Portugal : 16-18 January, 2018 arXiv:1605.06402.