Structured Sparsification of Gated Recurrent Neural Networks
Many techniques have recently been developed to sparsify the weights of neural networks and to remove structural units of networks, e.g. neurons. We adapt the existing sparsification approaches to gated recurrent architectures. Specifically, in addition to sparsifying weights and neurons, we propose sparsifying the preactivations of gates. This makes some gates constant and simplifies the LSTM structure. We test our approach on text classification and language modeling tasks. We observe that the resulting structure of gate sparsity depends on the task, and we connect the learned structure to the specifics of the particular tasks. Our method also improves neuron-wise compression of the model in most of the tasks.
Recurrent neural networks (RNNs) yield high-quality results in many applications but are often memory- and time-consuming due to a large number of parameters. A popular approach to RNN compression is sparsification (setting many weights to zero); it may compress an RNN by orders of magnitude with only a slight quality drop, or even with a quality improvement due to the regularization effect pruning.
Sparsification of an RNN is usually performed either at the level of individual weights (unstructured sparsification) intel; pruning; emnlp or at the level of neurons groupsparseLSTM (structured sparsification, i.e. removing weights by groups corresponding to neurons). The latter additionally accelerates the testing stage. However, most modern recurrent architectures (e.g. LSTM lstm or GRU gru) have a gated structure. We propose to add an intermediate level of sparsification between individual weights emnlp and neurons groupsparseLSTM: gates (see fig. 1, left). Precisely, we remove weights by groups corresponding to gates, which makes some gates constant, independent of the inputs, and equal to the activation function of the bias. As a result, the LSTM/GRU structure is simplified. With this intermediate level introduced, we obtain a three-level sparsification hierarchy: sparsification of individual weights helps to sparsify gates (make them constant), and sparsification of gates helps to sparsify neurons (remove them from the model).
The described idea can be implemented for any gated architecture in any sparsification framework. We implement the idea for LSTM in two frameworks, pruning groupsparseLSTM and Bayesian sparsification emnlp, and observe that the resulting gate structures (which gates are constant and which are not) vary across NLP tasks. We analyze these gate structures and connect them to the specifics of the particular tasks. The proposed method also improves neuron-wise compression of the RNN in most cases.
2 Proposed method
2.1 Main idea
In this section, we describe the three-level sparsification approach for LSTM. An LSTM cell is composed of input, forget and output gates ($i$, $f$, $o$) and information flow $g$ (which we also call a gate for brevity). All four gates are computed in a similar way, for example, for the input gate: $i = \mathrm{sigm}(W_i^x x_t + W_i^h h_{t-1} + b_i)$.
To make a gate constant, we need to zero out the corresponding row of the LSTM weight matrix (see dotted horizontal lines in fig. 2). We do not sparsify biases because they take up little memory compared to the weight matrices. For example, if we set the $k$-th row of matrices $W_i^x$ and $W_i^h$ to zero, there are no ingoing connections to the corresponding gate, so the $k$-th input gate becomes constant, independent of $x_t$ and $h_{t-1}$, and equal to $\mathrm{sigm}(b_{i,k})$. As a result, we do not need to compute the $k$-th input gate on the forward pass and can use a precomputed value. We can construct a mask (whether each gate is constant or not) and use it to insert constant values into the gate vectors $i$, $f$, $g$, $o$. This reduces the amount of computation on the forward pass.
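The masking step described above can be sketched in NumPy as follows. This is only an illustration under our own assumptions: the function name `gate_with_constant_rows`, the argument layout (one gate at a time, with separate input and recurrent matrices) and the on-the-fly mask construction are hypothetical and not taken from the authors' code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gate_with_constant_rows(W_x, W_h, b, x_t, h_prev, act=sigmoid):
    """Compute one gate vector, skipping components whose weight rows are zero.

    W_x: (n, d_in), W_h: (n, n), b: (n,). Returns (gate, is_const).
    """
    # A component is constant when its rows in both matrices are entirely zero.
    is_const = ~(W_x.any(axis=1) | W_h.any(axis=1))
    # Constant components equal act(bias); in practice this is precomputed once.
    gate = act(b)
    active = ~is_const
    # Matrix-vector products are needed only for the non-constant components.
    pre = W_x[active] @ x_t + W_h[active] @ h_prev + b[active]
    gate[active] = act(pre)
    return gate, is_const
```

For the information flow, the same function would be called with `act=np.tanh`. In a real implementation the mask and the constant values would be computed once after training rather than at every step.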
To remove a neuron, we need to zero out a corresponding column of the LSTM weight matrix and of the next layer matrix (see solid vertical lines in fig. 2). This ensures that there are no outgoing connections from the neuron, and the neuron does not affect the network output.
To sum up, our three-level hierarchy of gated RNN sparsification works as follows. Ideally, our goal is to remove a hidden neuron, since this leads to the most effective compression and acceleration. If we do not remove the hidden neuron, some of its four gates may become constant; this also saves computation and memory. If some gate is still non-constant, some of its weights may become zero; this reduces the size of the model.
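As an illustration of the hierarchy, the following sketch counts what remains at each level for one sparsified LSTM layer. The function name `sparsity_report`, the stacking order of the gates and the exact counting conventions are our assumptions; the paper does not specify how its counts are produced in code.

```python
import numpy as np

def sparsity_report(W_x, W_h, W_next):
    """Count remaining neurons, non-constant gates and non-zero weights.

    Assumed layout: W_x is (4n, d_in) and W_h is (4n, n), with rows stacked
    as [i, f, g, o]; W_next is the (d_out, n) matrix of the next layer.
    """
    n = W_h.shape[1]
    # Neuron level: neuron k is removed when it has no outgoing connections,
    # i.e. column k is zero both in W_h and in the next layer matrix.
    removed = ~(W_h.any(axis=0) | W_next.any(axis=0))
    kept = ~removed
    # Gate level: a gate is non-constant when its row has a non-zero weight.
    gate_active = (W_x.any(axis=1) | W_h.any(axis=1)).reshape(4, n)
    # Weight level: count non-zero weights in the rows of the kept neurons.
    kept_rows = np.tile(kept, 4)
    n_weights = np.count_nonzero(W_x[kept_rows]) + np.count_nonzero(W_h[kept_rows])
    return {
        "neurons": int(kept.sum()),
        "gates": int(gate_active[:, kept].sum()),
        "weights": int(n_weights),
    }
```

With this convention, gates of removed neurons are not counted, so the gate count never exceeds four times the neuron count.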
2.2 Implementation of the idea
Pruning. We apply Lasso to individual weights and group Lasso group_lasso to five groups of the LSTM weights (four gate groups and one neuron group, see fig. 2). We use the same pruning algorithm as in Intrinsic Sparse Structure (ISS) groupsparseLSTM, a structured pruning approach developed specifically for LSTM. In contrast to our approach, ISS does not sparsify gates and removes a neuron if all its ingoing and outgoing connections are set to zero.
Bayesian sparsification. We rely on Sparse Variational Dropout dmolch; emnlp to sparsify individual weights. Following chris, for each neuron we introduce a group weight that is multiplied by the output of this neuron in the computational graph (setting this group weight to zero removes the neuron). To sparsify gates, for each gate we introduce a separate group weight that is multiplied by the preactivation of the gate before the bias is added (setting this group weight to zero makes the gate constant).
|Accuracy||Bayes W+N emnlp+chris||83.98||17874x|
|Accuracy||Bayes W+N emnlp+chris||88.55||645x|
|Bits-per||Bayes W+N emnlp+chris||10.2x||1560|
|Word PTB (small), perplexity||Original||120.28 – 114.41||1x||800 – 800|
|Word PTB (small), perplexity||Bayes W+N emnlp+chris||110.25 – 104.81||11.65x||68 – 110||272 – 392|
|Word PTB (small), perplexity||Bayes W+G+N||109.98 – 104.45||11.44x||52 – 108||197 – 349|
|Word PTB (small), perplexity||Prun. W+N groupsparseLSTM||110.34 – 106.25||1.44x||72 – 123||288 – 492|
|Word PTB (small), perplexity||Prun. W+G+N||110.04 – 105.64||1.49x||64 – 115||193 – 442|
|Word PTB (large), perplexity||Original||82.57 – 78.57||1x|
|Word PTB (large), perplexity||Prun. W+N groupsparseLSTM||81.25 – 77.62||2.97x||324 – 394||1296 – 1576|
|Word PTB (large), perplexity||Prun. W+G+N||81.24 – 77.82||3.22x||252 – 394||881 – 1418|
In the pruning framework, we perform experiments on word-level language modeling (LM) on the PTB dataset ptb, following ISS (groupsparseLSTM). We use the standard model of zaremba14 in two sizes (small and large) with an embedding layer, two LSTM layers, and a fully-connected output layer (Emb + 2 LSTM + FC). Here regularization is applied only to the LSTM layers, following groupsparseLSTM, and its strength is selected using grid search so that the quality of ISS and that of our model are approximately equal.
In the Bayesian framework, we evaluate on text classification (datasets IMDb IMDB and AGNews agnews) and language modeling (dataset PTB, character-level and word-level tasks), following emnlp. The architecture for the character-level LM is LSTM + FC; for text classification it is Emb + LSTM + FC on the last hidden state; for the word-level LM it is the same as in pruning. Here we regularize and sparsify all layers, following emnlp.
3.1 Quantitative results
We compare our three-level sparsification approach (W+G+N) with the original dense model and with two-level sparsification (weights and neurons, W+N) in tab. 1. We do not compare the two frameworks with each other; our goal is to show that the proposed idea improves results in both frameworks.
In most experiments, our method improves gate-wise and neuron-wise compression of the model without a quality drop. The only exception is the character-level LM, which we discuss later. The compression numbers are not comparable between the two frameworks because in pruning only the LSTM layers are sparsified, while in the Bayesian framework all layers of the network are sparsified.
3.2 Qualitative results
Below we analyze the resulting gate structure for different tasks, models and sparsification approaches.
Gate structure depends on the task. Figure 1, right shows typical examples of the gate structures of the remaining hidden neurons obtained using the Bayesian approach. We observe that the gate structure varies for different tasks. For the word-level LM task, output gates are very important because models need both to store all the information about the input in the memory and to output only the current prediction at each timestep. On the contrary, for text classification tasks, models need to output the answer only once at the end of the sequence, hence they rarely use output gates. The character-level LM task is more challenging than the word-level one: the model uses the whole gate mechanism to solve it. We think this is the main reason why gate sparsification does not help here.
As can be seen in fig. 1, right, in the second LSTM layer of the small word-level language model, a lot of neurons have only one non-constant gate, the output gate. We investigate this effect and find that the neurons with only a non-constant output gate learn short-term dependencies, while neurons with all non-constant gates usually learn long-term dependencies. To show that, we compute the gradients of each hidden neuron of the second LSTM layer w.r.t. the input of this layer at different lags and average the norm of this gradient over the validation set (see fig. 3). The neurons with only a non-constant output gate are “short”: the gradient is large only for the latest timesteps and small for old timesteps. On the contrary, neurons with all non-constant gates are mostly “long”: the gradient is non-zero even for old timesteps. In other words, changing the input 20–100 steps ago does not affect “short” neurons much, which is not true for the “long” neurons. The presence of such “short” neurons is expected for a language model: neurons without memory quickly adapt to the latest changes in the input sequence and produce relevant output.
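The lag analysis can be illustrated with a small NumPy sketch. It is a simplification under our own assumptions: a single-layer LSTM, one input sequence instead of an average over the validation set, and central finite differences instead of backpropagation. The function names and the `[i, f, g, o]` row stacking are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_last_hidden(xs, W_x, W_h, b):
    """Run an LSTM over xs (T, d) and return the final hidden state (n,).

    Rows of W_x (4n, d), W_h (4n, n) and b (4n,) are stacked as [i, f, g, o].
    """
    n = W_h.shape[1]
    h, c = np.zeros(n), np.zeros(n)
    for x in xs:
        a = W_x @ x + W_h @ h + b
        i, f = sigmoid(a[:n]), sigmoid(a[n:2 * n])
        g, o = np.tanh(a[2 * n:3 * n]), sigmoid(a[3 * n:])
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

def grad_norm_by_lag(xs, W_x, W_h, b, lag, eps=1e-5):
    """Per-neuron norm of d h_T / d x_{T-lag}, via central differences."""
    t = len(xs) - 1 - lag
    n, d = W_h.shape[1], xs.shape[1]
    jac = np.zeros((n, d))
    for j in range(d):
        x_plus, x_minus = xs.copy(), xs.copy()
        x_plus[t, j] += eps
        x_minus[t, j] -= eps
        jac[:, j] = (lstm_last_hidden(x_plus, W_x, W_h, b)
                     - lstm_last_hidden(x_minus, W_x, W_h, b)) / (2 * eps)
    return np.linalg.norm(jac, axis=1)
```

Plotting `grad_norm_by_lag` against `lag` for each neuron gives curves of the kind discussed here: a neuron whose input, forget and information-flow rows are zero can only be influenced by old inputs through the recurrent connections of its output gate.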
In fact, for the neurons with only a non-constant output gate, the memory cell $c_t$ is either monotonically increasing or monotonically decreasing, depending on the sign of the constant information flow $g$, so $\tanh(c_t)$ always equals either $-1$ or $+1$ (except for the first few epochs, because $c_t$ is initialized with the value 0) and $h_t = o_t$ or $h_t = -o_t$. This means that these neurons are simplified to vanilla recurrent units.
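A toy simulation makes this behaviour concrete. The sketch below assumes a single neuron whose input gate, forget gate and information flow are constants determined by hypothetical bias values, with a forget gate close to 1 so that the memory cell drifts far from zero; the function name and the chosen numbers are ours.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def simulate_output_only_neuron(b_i, b_f, b_g, o_seq):
    """Simulate a neuron whose i, f and g are constant; only o varies."""
    i, f, g = sigmoid(b_i), sigmoid(b_f), np.tanh(b_g)
    c, cs, hs = 0.0, [], []
    for o_t in o_seq:
        c = f * c + i * g          # the update has a fixed sign, given by g
        cs.append(c)
        hs.append(o_t * np.tanh(c))
    return np.array(cs), np.array(hs)

# Hypothetical biases: f is close to 1, g is positive.
o_seq = np.full(50, 0.7)
cs, hs = simulate_output_only_neuron(b_i=2.0, b_f=5.0, b_g=1.0, o_seq=o_seq)
```

Here `cs` grows monotonically and `hs` approaches the output gate value, so after a short transient the neuron behaves like a unit whose output is its output gate up to sign.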
For classification tasks, memorizing information about the whole input sequence until the last timestep is important, therefore information flow is non-constant and saves information from the input to the memory. In other words, long dependencies are highly important for the classification. Gradient plots (fig. 3) confirm this claim: the values of the neurons are strongly influenced by both old and latest inputs. Gradients are bigger for the short lag only for one neuron because this neuron focuses not only on the previous hidden states but also on reading the current inputs.
Gate structure intrinsically exists in LSTM. As discussed above, the most visible gate structures are obtained for IMDB classification (a lot of constant output gates and non-constant information flow) and for the second LSTM layer of the small word-level LM task (a lot of neurons with only non-constant output gates). In our experiments, for these tasks, the same gate structures are detected even with unstructured sparsification, but with lower overall compression and fewer constant gates, see Appendix D. This shows that the gate structure intrinsically exists in LSTM and depends on the task. The proposed method utilizes this structure to achieve better compression.
We obtain a similar effect when we compare gate structures for the small word-level LM obtained using two different sparsification techniques: Bayes W+G+N (fig. 1, right) and Pruning W+G+N (fig. 4, left). The same gates become constant in these models. For the large language model (fig. 4, right), the structure is slightly different from that of the small model. This is expected because there is a significant quality gap between these two models, so their intrinsic structure may be different.
This research is in part based on the work supported by Samsung Research, Samsung Electronics.
Appendix A Technical details on the implementation of the idea in pruning
Consider a dataset of sequences and a model defined by a recurrent neural network with weights $W$ and biases $B$.
To implement our idea of three levels of sparsification, for each neuron $k$ we define five (intersecting) sets of weights $G_i^k$, $G_f^k$, $G_g^k$, $G_o^k$ and $G_h^k$. The first four sets of weights correspond to the four gates (dotted horizontal lines in fig. 2), and the last set corresponds to the neuron (solid vertical lines in fig. 2). We apply group Lasso regularization group_lasso to these groups. We also apply Lasso regularization to the individual weights.
Following groupsparseLSTM, we set to zero all individual weights whose absolute value is below the threshold. If for some $k$ all the weights in $G_h^k$ are set to zero, we remove the corresponding neuron, as it does not affect the network’s output. If for some gate (for example, $i$) all the weights in $G_i^k$ are set to zero, we mark this gate as constant.
In contrast to our approach, in groupsparseLSTM, group Lasso is applied to larger groups $G^k$: $G^k = G_i^k \cup G_f^k \cup G_g^k \cup G_o^k \cup G_h^k$.
They eliminate a neuron if all the weights in $G^k$ are zero. This approach does not lead to a sparse gate structure.
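The regularizer with five groups per neuron can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name, the matrix layout, the absence of group-size scaling and the choice to apply Lasso only to the LSTM matrices are all our assumptions.

```python
import numpy as np

def sparsity_penalty(W_x, W_h, W_next, lam_w, lam_g):
    """Lasso on individual weights plus group Lasso on gate and neuron groups.

    Assumed layout: W_x is (4n, d_in) and W_h is (4n, n), rows stacked as
    [i, f, g, o]; W_next is the (d_out, n) matrix of the next layer.
    """
    # Lasso term on the individual LSTM weights.
    lasso = np.abs(W_x).sum() + np.abs(W_h).sum()
    # Four gate groups per neuron: one full row of [W_x, W_h] each.
    rows = np.concatenate([W_x, W_h], axis=1)
    gate_norms = np.sqrt((rows ** 2).sum(axis=1))
    # One neuron group per neuron: its column in W_h and in the next layer.
    neuron_norms = np.sqrt((W_h ** 2).sum(axis=0) + (W_next ** 2).sum(axis=0))
    return lam_w * lasso + lam_g * (gate_norms.sum() + neuron_norms.sum())
```

The penalty would be added to the training loss. Merging the five groups of each neuron into a single group would recover the larger ISS-style groups, which can remove neurons but does not push individual gates to become constant.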
Appendix B Technical details on the implementation of the idea in Bayesian framework
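Sparse variational dropout. Our approach relies on Sparse Variational Dropout dmolch (SparseVD). This model treats the weights $W$ of the neural network as random variables and comprises a log-uniform prior over the weights, $p(|w_{ij}|) \propto 1/|w_{ij}|$, and a fully factorized normal approximate posterior over the weights, $q(w_{ij} \mid \theta_{ij}, \sigma_{ij}) = \mathcal{N}(w_{ij} \mid \theta_{ij}, \sigma_{ij}^2)$. Biases $B$ are treated as deterministic parameters. To find the parameters of the approximate posterior distribution and the biases, the evidence lower bound (ELBO) is optimized:
$$-\sum_{n=1}^{N} \mathbb{E}_{q(W \mid \theta, \sigma)} \log p(y^n \mid x^n, W, B) + \mathrm{KL}\big(q(W \mid \theta, \sigma) \,\|\, p(W)\big) \to \min_{\theta, \sigma, B}. \quad (1)$$
Because of the log-uniform prior, for the majority of weights the signal-to-noise ratio $\theta_{ij}^2 / \sigma_{ij}^2$ becomes close to zero, and these weights do not affect the network’s output. In emnlp, SparseVD is adapted to RNNs.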
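Our model. To sparsify the individual weights, we apply SparseVD dmolch to all the weights of the LSTM, taking into account the recurrent specifics underlined in emnlp. To compress the layer and remove the hidden neurons, we follow chris and introduce group weights $z^h$ for the hidden neurons of the LSTM.
The key component of our model is introducing group weights $z^i$, $z^f$, $z^g$, $z^o$ on the preactivations of the gates and information flow. The resulting LSTM layer looks as follows:
$$i = \mathrm{sigm}\big((W_i^x x_t + W_i^h (h_{t-1} \odot z^h)) \odot z^i + b_i\big), \quad f = \mathrm{sigm}\big((W_f^x x_t + W_f^h (h_{t-1} \odot z^h)) \odot z^f + b_f\big),$$
$$g = \tanh\big((W_g^x x_t + W_g^h (h_{t-1} \odot z^h)) \odot z^g + b_g\big), \quad o = \mathrm{sigm}\big((W_o^x x_t + W_o^h (h_{t-1} \odot z^h)) \odot z^o + b_o\big),$$
$$c_t = f \odot c_{t-1} + i \odot g, \quad h_t = o \odot \tanh(c_t).$$
The model is equivalent to multiplying the rows and columns of the weight matrices by the group weights: $\hat{W}^x_{i,kj} = W^x_{i,kj}\, z^i_k$ and $\hat{W}^h_{i,kj} = W^h_{i,kj}\, z^i_k\, z^h_j$, where $k$ indexes rows and $j$ indexes columns, and similarly for $f$, $g$ and $o$.
If some component of $z^i$, $z^f$, $z^g$, or $z^o$ is set to zero, we mark the corresponding gate as constant. If some component of $z^h$ is set to zero, we remove the corresponding neuron from the model.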
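The equivalence between group weights in the computational graph and scaled rows and columns of the weight matrices can be checked numerically. In the sketch below, `z_gate` stands for the group weights of one gate and `z_h` for the group weights of the hidden neurons; these names and both helper functions are our own illustrative choices.

```python
import numpy as np

def gate_pre_with_group_weights(W_x, W_h, x_t, h_prev, z_gate, z_h):
    """Gate preactivation (before the bias) with explicit group weights."""
    return (W_x @ x_t + W_h @ (h_prev * z_h)) * z_gate

def fold_group_weights(W_x, W_h, z_gate, z_h):
    """Fold group weights into the matrices: rows by z_gate, columns by z_h."""
    W_x_hat = W_x * z_gate[:, None]
    W_h_hat = W_h * z_gate[:, None] * z_h[None, :]
    return W_x_hat, W_h_hat

rng = np.random.default_rng(0)
n, d = 4, 3
W_x, W_h = rng.standard_normal((n, d)), rng.standard_normal((n, n))
x_t, h_prev = rng.standard_normal(d), rng.standard_normal(n)
z_gate = np.array([1.0, 0.0, 0.5, 2.0])   # zero entry: gate 1 becomes constant
z_h = np.array([1.0, 1.0, 0.0, 0.3])      # zero entry: neuron 2 is removed

explicit = gate_pre_with_group_weights(W_x, W_h, x_t, h_prev, z_gate, z_h)
W_x_hat, W_h_hat = fold_group_weights(W_x, W_h, z_gate, z_h)
folded = W_x_hat @ x_t + W_h_hat @ h_prev
```

After folding, a zero in `z_gate` shows up as an all-zero row and a zero in `z_h` as an all-zero column, which is exactly the pattern used to mark gates as constant and neurons as removed.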
Training our model. We work with the group weights $z$ in the same way as with the weights $W$: we approximate the posterior with the fully factorized normal distribution given the fully factorized log-uniform prior distribution. To estimate the expectation in (1), we sample weights from the approximate posterior distribution in the same way as in emnlp.
With the integral estimated with one Monte-Carlo sample, the first term in (1) becomes the usual loss function (for example, cross-entropy in language modeling). The second term is a regularizer depending on the parameters $\theta$ and $\sigma$ (for the exact formula, see dmolch).
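The two terms can be sketched as follows. The sampling function is a plain reparameterized draw and does not reproduce the RNN-specific sampling scheme of emnlp. The regularizer uses the approximation of the KL divergence published with Sparse Variational Dropout (dmolch), with its constants $k_1, k_2, k_3$. The function names and the parameterization through `log_sigma` are our assumptions.

```python
import numpy as np

# Constants of the KL approximation from the SparseVD paper (dmolch).
K1, K2, K3 = 0.63576, 1.87320, 1.48695

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_weights(theta, log_sigma, rng):
    """One Monte-Carlo sample w = theta + sigma * eps, with eps ~ N(0, 1)."""
    return theta + np.exp(log_sigma) * rng.standard_normal(theta.shape)

def kl_sparsevd(theta, log_sigma):
    """Approximate KL(q || p) for a log-uniform prior, summed over weights."""
    # log alpha = log(sigma^2 / theta^2); the small constant avoids log(0).
    log_alpha = 2.0 * log_sigma - np.log(theta ** 2 + 1e-8)
    neg_kl = (K1 * sigmoid(K2 + K3 * log_alpha)
              - 0.5 * np.logaddexp(0.0, -log_alpha) - K1)
    return -neg_kl.sum()
```

A training step would evaluate the usual loss with `sample_weights(...)` in place of the deterministic weights and add `kl_sparsevd(...)` for every sparsified weight matrix and group-weight vector.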
After learning, we zero out all the weights and the group weights with the signal-to-noise ratio less than 0.05. At the testing stage, we use the mean values of all the weights and the group weights.
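A sketch of this post-processing step, assuming the signal-to-noise ratio is defined as the squared posterior mean divided by the posterior variance; the function name and the `log_sigma` parameterization are hypothetical.

```python
import numpy as np

def prune_by_snr(theta, log_sigma, threshold=0.05):
    """Return test-time weights: posterior means with low-SNR entries zeroed."""
    # Assumed definition of the signal-to-noise ratio: theta^2 / sigma^2.
    snr = theta ** 2 / np.exp(2.0 * log_sigma)
    return np.where(snr < threshold, 0.0, theta)

theta = np.array([1.0, 0.1, -0.5])
log_sigma = np.zeros(3)            # sigma = 1 for every weight
pruned = prune_by_snr(theta, log_sigma)
```

The same function applies unchanged to the group weights, whose zeroed components then mark constant gates or removed neurons.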
Appendix C Experimental setup
Datasets. To evaluate our approach on the text classification task, we use two standard datasets: the IMDb dataset IMDB for binary classification and the AGNews dataset agnews for four-class classification. We set aside 15% and 5% of the training data for validation, respectively. For both datasets, we use a vocabulary of the 20,000 most frequent words. To evaluate our approach on the language modeling task, we use the Penn Treebank corpus ptb with the train/valid/test partition from mikolov11. The dataset has a vocabulary of 50 characters or 10,000 words.
All the small models including baseline are trained without dropout as in standard TensorFlow implementation. We train them from scratch for 20 epochs with SGD with a decaying learning rate schedule: an initial learning rate is equal to , the learning rate starts to decay after the -th epoch, the learning rate decay is equal to . For the two-level sparsification (W+N), we use Lasso regularization with and group Lasso regularization with . For the three-level sparsification (W+G+N), we use Lasso regularization with and group Lasso regularization with . We use the threshold to prune the weights in both models during training.
All the large models including baseline are trained in the same setting as in groupsparseLSTM except for the group Lasso regularization because we change the weight groups. We use the code provided by the authors. Particularly, we use binary dropout zaremba14 with the same dropout rates. We train the models from scratch for 55 epochs with SGD with a decaying learning rate schedule: an initial learning rate is equal to , the learning rate decreases two times during training (after epochs 18 and 36), the learning rate decay is equal to and for two- and three-level sparsification correspondingly. For the two-level sparsification (W+N), we use Lasso regularization with and group Lasso regularization with . For the three-level sparsification (W+G+N), we use Lasso regularization with and group Lasso regularization with . We use the same threshold as in the small models.
In all the Bayesian models, we sparsify the weight matrices of all layers. Since in text classification tasks usually only a small number of input words are important, we use additional multiplicative weights to sparsify the input vocabulary, following emnlp. For the networks with an embedding layer, in configurations W+N and W+G+N, we also sparsify the embedding components (by introducing group weights $z^x$ multiplied by $x_t$).
We train our networks using Adam adam. Baseline networks overfit for all our tasks, therefore, we present results for them with early stopping. Models for the text classification and the character-level LM are trained in the same setting as in emnlp (we used the code provided by the authors). For the text classification tasks, we use a learning rate equal to and train Bayesian models for 800 / 150 epochs on IMDb / AGNews. The embedding layer for IMDb / AGNews is initialized with word2vec NIPS2013_5021 / GloVe pennington2014glove. For the language modeling tasks, we train Bayesian models for 250 / 50 epochs on character-level / word-level tasks using a learning rate of .
For all the weights that we sparsify, we initialize with -3. We eliminate weights with the signal-to-noise ratio less than 0.05. To compute the number of the remaining neurons or non-constant gates, we use the corresponding rows/columns of the weight matrices and the corresponding group weights if applicable.
Appendix D Experiments with unstructured Bayesian sparsification
In this section, we present experimental results for the unstructured Bayesian sparsification (configuration Bayes W). This configuration corresponds to a model of emnlp. Table 2 shows quantitative results, and figure 5 shows the resulting gate structures for the IMDB classification task and the second LSTM layer of the word-level language modeling task. Since Bayes W model does not comprise any group weights, the overall compression of the RNN is lower than for Bayes W+G+N (tab. 1), so there are more non-constant gates. However, the patterns in gate structures are the same as in Bayes W+G+N gate structures (fig. 1): for the IMDB classification, the model has a lot of constant output gates and non-constant information flow, for language modeling, the model has neurons with only non-constant output gates.
|IMDb||Bayes W emnlp||83.62||18567x||17|
|AGNews||Bayes W emnlp||89.14||561x|
|Char PTB||Bayes W emnlp||7.9x||1718|
|Word PTB||Bayes W emnlp||114.80 – 109.85||10.52x||55 – 124||218 – 415|