
# Syntactically Informed Text Compression with Recurrent Neural Networks

## 1 Introduction

Accurate models are the key to better data compression. Compression algorithms operate in two steps – modeling and coding. Coding is the reversible process of reassigning symbols in a sequence of data such that its length is reduced. Modern coding methods already operate close to the theoretical optimum in bits per symbol. Modeling, on the other hand, is a provably unsolvable problem [19]. Consider if this were not the case: such a model would be able to accurately estimate the next symbol in a sequence of random (or compressed) data, and could therefore recursively compress its own output down to zero bytes. Rather than search for this impossible universal model, efforts have focused on generating domain-specific models that exploit intrinsic structure within data.
A previous approach to generating natural language models by Matt Mahoney [20] used a two-layer neural network as a substitute for prediction by partial matching – a popular modeling method. The simplicity of this model allows it to process input quickly, compressing Alice in Wonderland to 2.283 bpc and comparing favorably to gzip at 3.250 bpc. While effective, this approach cannot model long-term relationships in character sequences, nor does it take advantage of syntactic or semantic information. Mahoney notes that the ability to do so was one of the reasons he chose neural networks in the first place.
We address these limitations through the use of a recurrent neural network architecture and by utilizing Google’s SyntaxNet [2] to provide part of speech annotations. Our model processes sequences of characters and part of speech tags using separate recurrent layers. The output of these layers is then merged and processed with a final recurrent layer and two fully connected layers. We found that such an architecture was able to reliably predict the next character in a sequence of forty characters without explicitly memorizing the training data when provided documents of sufficient length. While our aim was to construct probability models tailored to specific input data, our results indicate that acceptable performance can be obtained from a generalized model. A generalized natural language model would be highly desirable as it would allow for neural network based compression without the need to train a model for each input document. This publication aims to serve as the foundational work for such a model.

## 2 Background

### 2.1 A note for MAA MathFest

This paper relies heavily on concepts from computer science. To ensure that it is accessible to our audience at MathFest, we’ve written it to be as self-contained as possible.

### 2.2 Data Compression

As mentioned, data compression consists of two steps – modeling and coding. Arithmetic coding [33] is a near-optimal coding method that operates by representing a sequence of probabilities as a fractional number in the interval $[0, 1)$.
To illustrate arithmetic coding, consider the following example:

• Let $M$, a message to be compressed, be the sequence of symbols [C,O,D,E,!].

• Let $\Sigma$ be an alphabet containing the symbols {A,B,C,D,E,O,!}.

The following table contains an arbitrary fixed probability model for the alphabet $\Sigma$:

We encode $M$ by reducing the range of our subinterval from its initial range of $[0, 1)$ for each symbol, as shown in Table 2.

After following the steps listed above, we arrive at a final subinterval. Any number within that subinterval can be decoded by running the steps in reverse to produce the original message, provided the same method is used to obtain the probability distribution. Much of the arithmetic, as well as the decoding process, has been omitted for brevity; readers wishing to fully understand the process are encouraged to seek out a complete worked example. Because the lower bound of the subinterval is inclusive, we can simply use 0.687895 to represent our encoded message.
Unfortunately, our example produces an output sequence that is longer than its input. This is not a fault of arithmetic coding, but a symptom of inaccurate probability estimates; since our model was arbitrary, poor output is expected. For a more successful example of arithmetic coding, consider that the message “AAAAA!” can be coded as 0.0024 – a reduction of two symbols, not counting the ever-present leading zero and the decimal point.
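To make the interval-narrowing step concrete, here is a minimal Python sketch. The two-symbol probability model is hypothetical – it is not the table used in the example above – and the function names are our own.

```python
# Minimal arithmetic-coding sketch: the interval [0, 1) is narrowed
# once per symbol. The two-symbol model below is hypothetical.
from fractions import Fraction

def cumulative_ranges(model):
    """Map each symbol to its [start, end) slice of the unit interval."""
    ranges, start = {}, Fraction(0)
    for symbol, p in model.items():
        ranges[symbol] = (start, start + p)
        start += p
    return ranges

def encode(message, model):
    """Narrow [0, 1) once per symbol; return the final [low, high)."""
    ranges = cumulative_ranges(model)
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        width = high - low
        start, end = ranges[symbol]
        low, high = low + width * start, low + width * end
    return low, high

# Hypothetical fixed model: 'A' and '!' each get half the interval.
low, high = encode("AAAAA!", {"A": Fraction(1, 2), "!": Fraction(1, 2)})
# Any number in [1/64, 1/32) - e.g. 0.015625 - identifies the message.
```

Using exact rationals (`Fraction`) sidesteps the floating-point rounding that a production coder must handle with scaled integer arithmetic.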
Modeling is the process of generating a probability distribution estimate for the input sequence. Models can be static (as above) or dynamically generated. Dynamic models allow for continuous updates to probability values in response to the symbols observed in the sequence. A naïve approach to dynamic modeling would be to initially consider all symbols equally probable, updating probabilities accordingly as symbols are processed.
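The naïve adaptive scheme just described can be sketched as a symbol-count model; the class name and two-symbol alphabet below are illustrative.

```python
from collections import Counter

class AdaptiveModel:
    """Naive dynamic model: every symbol starts equally probable,
    and probabilities shift as symbols are observed."""

    def __init__(self, alphabet):
        # A starting count of 1 makes all symbols equally probable.
        self.counts = Counter({symbol: 1 for symbol in alphabet})

    def probability(self, symbol):
        return self.counts[symbol] / sum(self.counts.values())

    def update(self, symbol):
        self.counts[symbol] += 1

model = AdaptiveModel("AB")
p_before = model.probability("A")  # 1/2 before any observations
model.update("A")
p_after = model.probability("A")   # 2/3 after seeing one 'A'
```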
Neural networks are dynamic models that update their probability distribution estimates based on dependencies learned from contexts. In the case of Mahoney’s model, only spatially local contexts can be learned. By using a recurrent network architecture, spatial as well as temporal contexts can be used for dependency modeling. Utilizing more effective neural network architectures allows for the construction of more accurate language models.

### 2.3 Recurrent Neural Networks

Recurrent neural networks are a class of neural networks well suited to modeling temporal systems such as sequences of audio or text. RNNs excel in these domains because recurrence in their hidden layers provides a memory that allows them to learn dependencies over arbitrary time intervals. We can represent the hidden state of an RNN with a simple set of recurrence equations:

$$y_j(t) = f(\mathrm{net}_j(t)), \qquad \mathrm{net}_j(t) = \sum_i x_i(t)\, v_{ji} + \sum_h y_h(t-1)\, u_{jh} + \theta_j$$

where $y_j(t)$ is the output of the hidden state (layer) at time $t$ and $y_h(t-1)$ is the hidden state output from the previous time interval. The vectors $x$, $v$, $u$, and $w$ are the input, input-hidden, hidden-hidden, and hidden-output weights, respectively. Each layer is assigned an index variable (with notation borrowed from this guide [7]) – $k$ for output nodes, $j$ for hidden, and $i$ for input nodes. The functions $f$ and $g$ (used later) are differentiable, nonlinear activation functions such as the sigmoid or hyperbolic tangent function. $\theta_j$ is a bias.

The output state can be computed as:

$$y_k(t) = g(\mathrm{net}_k(t)), \qquad \mathrm{net}_k(t) = \sum_j y_j(t)\, w_{kj} + \theta_k$$

All together, we see that a single forward pass through the network can be calculated with the following recurrence:

$$y_k(t) = g\Big(\sum_j w_{kj}\, f\Big(\sum_i v_{ji}\, x_i(t) + \sum_h u_{jh}\, y_h(t-1) + \theta_j\Big) + \theta_k\Big)$$
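As an illustration, this forward pass can be written directly in NumPy. The layer sizes, random weights, and the choice of tanh/identity activations below are arbitrary, not those of our model.

```python
import numpy as np

def rnn_step(x_t, h_prev, V, U, W, theta_h, theta_k):
    """One forward pass: hidden state from current input and previous
    hidden state (f = tanh), then output from the hidden state (g = identity)."""
    h_t = np.tanh(V @ x_t + U @ h_prev + theta_h)
    y_t = W @ h_t + theta_k
    return h_t, y_t

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 3
V = rng.normal(size=(n_hid, n_in))    # input-hidden weights
U = rng.normal(size=(n_hid, n_hid))   # hidden-hidden weights
W = rng.normal(size=(n_out, n_hid))   # hidden-output weights
h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):  # unroll over five time steps
    h, y = rnn_step(x, h, V, U, W, np.zeros(n_hid), np.zeros(n_out))
```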

### 2.4 Backpropagation Through Time

To allow for learning over arbitrary intervals, error values must be backpropagated through time. We use the cross entropy error function in our model, defined as:

$$E_n = H(d_n, y_n)$$

for the $n$th sample in the training set of length $N$ and the cross entropy function, $H$:

$$H(p, q) = -\sum_k p_k \log q_k$$

Together, our error function is:

$$E = \frac{1}{N} \sum_{n=1}^{N} E_n = -\frac{1}{N} \sum_{n=1}^{N} \sum_k d_k(n) \log y_k(n)$$

for $d_k$, the desired output of output node $k$. Weight updates are proportional to the negative cost gradient with respect to the weight that is being updated, scaled by the learning rate, $\eta$:

$$\Delta w_{kj} = -\eta\, \frac{\partial E}{\partial w_{kj}}$$

We can then compute the output error, $\delta_k$, and hidden error, $\delta_j$, which can be backpropagated through time to obtain the error of the hidden layer at the previous time interval.
Indices $j$ and $k$ are for nodes sending and receiving the activation, respectively.
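A numerical sketch of the cross entropy loss and the gradient-scaled weight update (the values below are illustrative):

```python
import numpy as np

def cross_entropy(d, y):
    """H(d, y) = -sum_k d_k log y_k for one sample."""
    return -np.sum(d * np.log(y))

def weight_update(w, grad, eta=0.1):
    """Step against the cost gradient, scaled by the learning rate eta."""
    return w - eta * grad

d = np.array([0.0, 1.0, 0.0])   # one-hot desired output
y = np.array([0.2, 0.7, 0.1])   # network's probability estimates
loss = cross_entropy(d, y)      # only the correct class contributes: -log 0.7
```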

### 2.5 Gated Recurrent Units

When backpropagating over many time intervals, error gradients tend to either vanish or explode. That is, the derivatives of the output at time $t$ with respect to unit activations at an earlier time $t'$ rapidly approach either zero or infinity as $t - t'$ increases [4]. A popular solution to this problem is to use a Gated Recurrent Unit (GRU) [8] – a recurrent unit that adaptively resets its internal state. Networks of gated recurrent units allow for modeling dependencies at multiple time scales of arbitrary length, retaining both long- and short-term memory. A single GRU consists of a hidden state along with reset and update gates.
When the reset gate, $r_j$, is closed ($r_j = 0$), the value of the GRU’s previous hidden state is ignored, effectively resetting the unit. The value of the reset gate is computed as:

$$r_j = \sigma\big([W_r x]_j + [U_r h_{t-1}]_j\big)$$

for the sigmoid activation function $\sigma$ and the unit’s input and previous hidden state, $x$ and $h_{t-1}$, respectively. The weight matrices $W_r$ and $U_r$ follow from our previous equations.
The update gate, $z_j$, is similar:

$$z_j = \sigma\big([W_z x]_j + [U_z h_{t-1}]_j\big)$$

The new hidden state1, $\tilde{h}_j^{(t)}$, is:

$$\tilde{h}_j^{(t)} = \phi\big([W x]_j + [U (r \odot h_{t-1})]_j\big)$$

where $\phi$ is the hyperbolic tangent and $\odot$ denotes element-wise multiplication.

Finally, the unit’s activation, $h_j^{(t)}$, can be calculated as a linear interpolation between the previous and candidate states:

$$h_j^{(t)} = z_j\, h_j^{(t-1)} + (1 - z_j)\, \tilde{h}_j^{(t)}$$

Cho et al. note that short-term dependencies are captured by units with frequently active reset gates, while long-term dependencies are best captured by units containing an active update gate.
The output of a single forward pass in a single-layer GRU network can be represented using notation from the previous simple recurrent model:

$$y_k(t) = g\Big(\sum_j w_{kj}\, h_j(t) + \theta_k\Big)$$

with $h_j(t)$ given by the GRU equations above.
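Numerically, a single GRU step per the equations above can be sketched as follows; the weight shapes and random values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wr, Ur, Wz, Uz, W, U):
    """One GRU forward step following Cho et al. (2014)."""
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))   # candidate state
    return z * h_prev + (1.0 - z) * h_tilde       # linear interpolation

rng = np.random.default_rng(1)
n_in, n_hid = 4, 6
Wr, Wz, W = (rng.normal(size=(n_hid, n_in)) for _ in range(3))
Ur, Uz, U = (rng.normal(size=(n_hid, n_hid)) for _ in range(3))
h = np.zeros(n_hid)
for x in rng.normal(size=(3, n_in)):  # three time steps
    h = gru_step(x, h, Wr, Ur, Wz, Uz, W, U)
```

Because the new state is a convex combination of a bounded candidate and the previous state, the activations stay within $(-1, 1)$.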

## 3 Model Architecture

A close reader will notice that the topics covered in the background section address a succession of problems. We illustrated the need for an effective probabilistic model when compressing text data, then discussed the current state-of-the-art neural network architecture for generating such a model.
This section will address the issue of improving upon a vanilla GRU network architecture that operates solely on character sequences. The improvements discussed occur at a higher level of abstraction than the gate level architectures previously described, as we are seeking to build a practical model rather than propose a new recurrent unit architecture.
Table 3 outlines notation for the layers used in our architecture. Note that our model has two separate input layers. A graphical overview of our architecture can be found at the end of this section in Figure 1.

The character input layer is a one-hot representation of forty-character sequences. This layer is paralleled by a second input layer containing part of speech information obtained from SyntaxNet. The part of speech (POS) tag input layer is a one-hot2 representation of part of speech tag sequences, each of which corresponds to the character at the same index in the character input layer.
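One-hot character encoding can be sketched as follows; the 256-symbol alphabet matches the network’s output size, while the helper name and sample text are our own.

```python
import numpy as np

def one_hot(index, size=256):
    """A vector with a single 1 at `index` and zeros elsewhere."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

# A forty-character window becomes a (40, 256) matrix, one row per character.
window = "the quick brown fox jumps over the lazy "
encoded = np.stack([one_hot(ord(c)) for c in window])
```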
The two GRU layers are also parallel, and we describe them together since identical operations are applied to each. Our implementation utilizes the hard (linearly approximated) sigmoid function in place of the standard logistic sigmoid as the GRU’s inner activation function in order to reduce computational requirements. The outer activation is the hyperbolic tangent function, applied element-wise for each node in the layer. A forward pass through each of these layers follows the GRU equations given in Section 2.5.

To prevent overfitting, a dropout layer [28] is applied to the output of each GRU layer. The output of a dropout layer is a replica of its input, except that the output of a random fraction of nodes, selected with probability $p$, is pinned to zero.
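Dropout amounts to multiplying by a random binary mask. A minimal sketch of the training-time behavior (the function name and rate are illustrative):

```python
import numpy as np

def dropout(x, p, rng):
    """Pin each node's output to zero with probability p,
    passing the rest through unchanged (training-time behavior)."""
    mask = rng.random(x.shape) >= p
    return np.where(mask, x, 0.0)

rng = np.random.default_rng(0)
out = dropout(np.ones(1000), p=0.5, rng=rng)  # roughly half the nodes zeroed
```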

A merge layer is applied to the output of the two dropout layers. This layer is a simple vector concatenation.

The merged output feeds into a final GRU layer, followed by two fully connected layers, to produce the network output.

Fully connected (dense) layers are non-recurrent neural layers in which each node is connected to every node in both the preceding and following layer. Appending two fully connected layers to a recurrent neural network was found to improve the accuracy of speech models by transforming the sequential output of the recurrent layers into a more discriminative space [26]. Adding two dense layers to our model had similar results, suggesting that the effect translates to sequence data from other domains.
The first dense layer uses the rectifier activation function $f(x) = \max(0, x)$. This function is analogous to half-wave rectification in digital signal processing, and has the advantage of being less computationally demanding than the sigmoid function.

The second dense layer employs a softmax activation that transforms its input from an arbitrary range to the interval [0,1] such that the sum of the 256 output nodes3 is 1. This is desirable, as it allows our network output to satisfy the requirements of a proper probability mass function4.
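A sketch of these two dense-layer activations (the scores are random, not the trained model’s values):

```python
import numpy as np

def relu(x):
    """Rectifier: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def softmax(x):
    """Map arbitrary real scores to probabilities that sum to 1.
    Subtracting the max keeps exp() from overflowing."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.random.default_rng(2).normal(size=256)  # one score per character
probs = softmax(relu(scores))
```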

The network output is simply the output of the final dense layer.

To keep the calculations simple, we have been operating on individual neural units. Having reached the output layer, it is important to remember that we are in fact working with vectors.

We now see why the softmax activation function is critical to the model – the network’s output always sums to one and is therefore a valid representation of probability estimates for each character:

$$\sum_k y_k = 1$$

Illustrating a full forward pass through this network would provide little value to the reader and require a significant amount of space. By following the layer descriptions in this section, we’ve essentially already completed the forward pass.
Backpropagation for an architecture of this complexity is not an easy task. Fortunately, automatic differentiation frees us from the burden of calculating the error gradient. Our implementation utilized Keras [1], a wrapper for Theano [6]. Readers seeking information on the gradient calculations should consult the Theano documentation.

## 4 Training and Evaluation

Training data was obtained from Project Gutenberg [15]. Models were trained on single books – preserving the single-stream, single-model training method used by Mahoney. The input text was passed through SyntaxNet to obtain part of speech tag information for each word. The input was then split into 40-character chunks with a sliding window, and part of speech tags were replicated such that each character in a word carried the tag of that word. The 41st character in each window was used as the target output. It is also worth noting that this system is based loosely upon the lstm_text_generation example from the Keras library; readers seeking to build upon our work should consult that example.
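The windowing scheme can be sketched as below; `make_windows` and the sample text are illustrative, not taken from our codebase.

```python
def make_windows(text, window=40, step=1):
    """Slide a `window`-character frame over the text; the character
    immediately after each frame is that frame's prediction target."""
    inputs, targets = [], []
    for i in range(0, len(text) - window, step):
        inputs.append(text[i:i + window])
        targets.append(text[i + window])
    return inputs, targets

sample = "It is a truth universally acknowledged, that a single man..."
xs, ys = make_windows(sample)
```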
RMSprop [31] was used to optimize gradient descent. RMSprop keeps a moving average of the squared gradient for each weight, as shown:

$$E[g^2]_t = 0.9\, E[g^2]_{t-1} + 0.1\, g_t^2, \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{E[g^2]_t}}\, g_t$$

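A sketch of the update on a toy loss; the learning rate and the small epsilon term are conventional additions for illustration, not values from our training runs.

```python
import numpy as np

def rmsprop_update(w, grad, mean_square, eta=0.001, decay=0.9, eps=1e-8):
    """Keep a moving average of the squared gradient per weight,
    then divide the gradient by its root before stepping."""
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    w = w - eta * grad / np.sqrt(mean_square + eps)
    return w, mean_square

w, ms = np.ones(3), np.zeros(3)
for _ in range(10):
    grad = 2.0 * w                    # gradient of the toy loss sum(w**2)
    w, ms = rmsprop_update(w, grad, ms)
```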
Models were trained on four books of various lengths for a minimum of 700 epochs per document. Variation in the number of training iterations was due to the increased computation required to model longer documents. For comparison with the referenced LSTM model, we also trained a model on the complete works of Friedrich Nietzsche. The lengths of all documents used in training are listed in Table 4.

To quantify the effect of part of speech information, models for Pride and Prejudice were trained with and without part of speech tags5.
Amazon g2.2xlarge EC2 instances were used to perform training and evaluation. For longer documents, each epoch took approximately 230 seconds, equating to roughly 48 hours of computation per document. We’ve provided a preconfigured AMI for those wishing to verify or expand6 upon our results without going through the trouble of resolving software dependencies. The AMI is publicly available as ami-2c3a7a4c.

## 5 Results

All models converged to a high level of accuracy within the training window. Figure 2 illustrates convergence and raises some noteworthy discussion points7. Unsurprisingly, the shortest document in our training set converged in the fewest epochs and attained the highest accuracy. This near-perfect accuracy is indicative of severe overfitting, and implies that our model is capable of essentially memorizing sufficiently short documents. Overfitting would be undesirable when training a general language model, but poses less of a concern in our usage case.
The model trained on A Tale of Two Cities exhibits gradient instability after epoch 650, significantly reducing its accuracy from that point onward. Unstable gradients can occur when converged models are allowed to continue training, as appears to have happened here. This example highlights the sometimes chaotic [3] behavior of recurrent neural networks. Gated recurrent units often produce relatively stable models; however, their dynamics remain poorly understood. An in-depth analysis of GRU network dynamics would likely shed light on the observed long-term instability.
The addition of part of speech information to the Pride and Prejudice model resulted in a measurable average increase in accuracy, as shown in Figure 3.
Further exploration of this metric was not performed due to computational and time constraints on the project. As a consolation, we considered the generalization performance of our document-specific models and found them to be reasonably accurate when applied to the other training documents. The generalization performance of the Pride and Prejudice model is shown in Table 5.

## 6 Discussion

The results of this effort make a strong case for a pre-trained, generalized language model that could be used in text compression. Document-specific compression benchmarks were not performed, as such metrics are slightly outside the scope of this publication. Proper compression benchmarks for a generalized model will be the focus of future work.
We anticipate that model performance could be further increased by utilizing word dependency trees provided by SyntaxNet. The combination of semantic and syntactic information would likely allow for the representation of more complex word relationships than syntactic information alone. While accuracy does not seem to be of concern for single stream, single model usage contexts such as ours, generalized models stand to benefit from the understanding of complex contextual relationships derived from semantic information.
The computational overhead associated with training a model for even as few as 100 epochs limits the practicality of our current implementation. Use of a general, pre-trained model would eliminate this problem – the time required to compute a single forward pass for the prediction of the next character is negligible.
Training a generalized model will require significantly more computational resources. Generalized models require a large number of diverse training documents, and the one-hot encoding used in our architecture is inherently memory inefficient. Even if batch training is used to alleviate memory requirements, training time would far exceed the 48 hours required to train a document-specific model.
This work also raises the interesting concept of utilizing the output of one neural network as the input to another.
The composition of neural networks can be performed in a fashion similar to the composition of functions. This should be almost intuitive, as the forward pass through a neural network is in fact a function. Neural network composition may prove to be a critical area of machine learning research. Using separately trained, domain-specific neural networks is likely a better approach to complex tasks such as language modeling than training a single, monolithic network.

## Appendix

### Footnotes

1. Note the role of the reset gate in the calculation of the new hidden state.
2. One-hot encoding is a way of representing information in which an array contains a single high bit (1) with the remaining bits low (0).
3. There are 256 characters in extended ASCII.
5. To accomplish this, we simply set the part of speech input vector to zero.
6. A full copy of our codebase is available at https://github.com/davidcox143/rnn-text-compress
7. Figures 2 and 3 are located in the Appendix.

### References

1. Keras: Deep learning library for Theano and TensorFlow.
2. Globally normalized transition-based neural networks.
Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., and Collins, M. arXiv preprint arXiv:1603.06042 (2016).
3. On the dynamics of small continuous-time recurrent neural networks.
Beer, R. D. Adaptive Behavior 3, 4 (1995), 469–509.
4. The problem of learning long-term dependencies in recurrent networks.
Bengio, Y., Frasconi, P., and Simard, P. In Neural Networks, 1993., IEEE International Conference on (1993), IEEE, pp. 1183–1188.
5. Learning long-term dependencies with gradient descent is difficult.
Bengio, Y., Simard, P., and Frasconi, P. IEEE transactions on neural networks 5, 2 (1994), 157–166.
6. Theano: A CPU and GPU math compiler in Python.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. In Proc. 9th Python in Science Conf (2010), pp. 1–7.
7. A guide to recurrent neural networks and backpropagation.
Boden, M. The Dallas project, SICS technical report (2002).
8. Learning phrase representations using RNN encoder-decoder for statistical machine translation.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. arXiv preprint arXiv:1406.1078 (2014).
9. Empirical evaluation of gated recurrent neural networks on sequence modeling.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. arXiv preprint arXiv:1412.3555 (2014).
10. Gated feedback recurrent neural networks.
Chung, J., Gülçehre, C., Cho, K., and Bengio, Y. CoRR, abs/1502.02367 (2015).
11. Data compression using adaptive coding and partial string matching.
Cleary, J., and Witten, I. IEEE transactions on Communications 32, 4 (1984), 396–402.
12. Long-term recurrent convolutional networks for visual recognition and description.
Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 2625–2634.
13. Understanding the difficulty of training deep feedforward neural networks.
Glorot, X., and Bengio, Y. In Aistats (2010), vol. 9, pp. 249–256.
14. A primer on neural network models for natural language processing.
Goldberg, Y. arXiv preprint arXiv:1510.00726 (2015).
15. Project Gutenberg.
Hart, M. Project Gutenberg, 1971.
16. Long short-term memory.
Hochreiter, S., and Schmidhuber, J. Neural computation 9, 8 (1997), 1735–1780.
17. Visualizing and understanding recurrent networks.
Karpathy, A., Johnson, J., and Fei-Fei, L. arXiv preprint arXiv:1506.02078 (2015).
18. Three approaches to the quantitative definition of information.
Kolmogorov, A. N. Problems of information transmission 1, 1 (1965), 1–7.
19. Data Compression Explained.
Mahoney, M. Dell Inc., 2010.
20. Fast text compression with neural networks.
Mahoney, M. V.
21. The PAQ1 data compression program.
Mahoney, M. V. Draft, Jan 20 (2002).
22. Adaptive weighing of context models for lossless data compression.
Mahoney, M. V.
23. How to construct deep recurrent neural networks.
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. arXiv preprint arXiv:1312.6026 (2013).
24. On the difficulty of training recurrent neural networks.
Pascanu, R., Mikolov, T., and Bengio, Y. ICML (3) 28 (2013), 1310–1318.
25. Source coding algorithms for fast data compression.
Pasco, R. C. PhD thesis, Stanford University, 1976.
26. Convolutional, long short-term memory, fully connected deep neural networks.
Sainath, T. N., Vinyals, O., Senior, A., and Sak, H. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015), IEEE, pp. 4580–4584.
27. Long short-term memory recurrent neural network architectures for large scale acoustic modeling.
Sak, H., Senior, A. W., and Beaufays, F. In INTERSPEECH (2014), pp. 338–342.
28. Dropout: a simple way to prevent neural networks from overfitting.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
29. LSTM neural networks for language modeling.
Sundermeyer, M., Schlüter, R., and Ney, H. In Interspeech (2012), pp. 194–197.
30. Going deeper with convolutions.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 1–9.
31. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.
Tieleman, T., and Hinton, G. COURSERA: Neural Networks for Machine Learning 4, 2 (2012).
32. Backpropagation through time: what it does and how to do it.
Werbos, P. J. Proceedings of the IEEE 78, 10 (1990), 1550–1560.
33. Arithmetic coding for data compression.
Witten, I. H., Neal, R. M., and Cleary, J. G. Communications of the ACM 30, 6 (1987), 520–540.
34. An empirical exploration of recurrent network architectures.
Jozefowicz, R., Zaremba, W., and Sutskever, I. In Proceedings of the 32nd International Conference on Machine Learning (ICML) (2015).