Compressive Transformers for Long-Range Sequence Modelling
We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.
Humans have a remarkable ability to remember information over long time horizons. When reading a book, we build up a compressed representation of the past narrative, such as the characters and events that have built up the story so far. We can do this even if they are separated by thousands of words from the current text, or long stretches of time between readings. During daily life, we make use of memories at varying time-scales: from locating the car keys, placed in the morning, to recalling the name of an old friend from decades ago. These feats of memorisation are not achieved by storing every sensory glimpse throughout one’s lifetime, but via lossy compression. We aggressively select, filter, or integrate input stimuli based on factors of surprise, perceived danger, or repetition — amongst other signals (Richards and Frankland, 2017).
Memory systems in artificial neural networks began with very compact representations of the past. Recurrent neural networks (RNNs, Rumelhart et al. (1986)) learn to represent the history of observations in a compressed state vector. The state is compressed because it uses far less space than the history of observations — the model only preserving information that is pertinent to the optimization of the loss. The LSTM (Hochreiter and Schmidhuber, 1997) is perhaps the most ubiquitous RNN variant; it uses learned gates on its state vector to determine what information is stored or forgotten from memory.
However, since the LSTM, great benefit has been found in not bottlenecking all historical information in the state, but instead keeping past activations around in an external memory and attending to them. The Transformer (Vaswani et al., 2017) is a sequence model which stores the hidden activation of every time-step, and integrates this information using an attention operator (Bahdanau et al., 2014). The Transformer will thus represent the past with a tensor (depth × memory size × dimension) of past observations that is, in practice, an order of magnitude larger than an LSTM’s hidden state. With this granular memory, the Transformer has brought about a step-change in state-of-the-art performance, within machine translation (Vaswani et al., 2017), language modelling (Dai et al., 2019; Shoeybi et al., 2019), video captioning (Zhou et al., 2018), and a multitude of language understanding benchmarks (Devlin et al., 2018; Yang et al., 2019) amongst others.
One drawback in storing everything is the computational cost of attending to every time-step and the storage cost of preserving this large memory. Several works have focused on reducing the computational cost of attention with sparse access mechanisms (Rae et al., 2016; Child et al., 2019; Sukhbaatar et al., 2019; Lample et al., 2019). However sparse attention does not solve the storage problem, and often requires custom sparse kernels for efficient implementation. Instead we look back to the notion of compactly representing the past. We show this can be built with simple dense linear-algebra components, such as convolutions, and can reduce both the space and compute cost of our models.
We propose the Compressive Transformer, a simple extension to the Transformer which maps past hidden activations (memories) to a smaller set of compressed representations (compressed memories). The Compressive Transformer uses the same attention mechanism over its set of memories and compressed memories, learning to query both its short-term granular memory and longer-term coarse memory. We observe this improves the modelling of text, achieving state-of-the-art results in character-based language modelling, with 0.97 bpc on Enwik8 from the Hutter Prize (Hutter, 2012), and in word-level language modelling, with 17.1 perplexity on WikiText-103 (Merity et al., 2016). Specifically, we see the Compressive Transformer improves the modelling of rare words.
We show the Compressive Transformer works not only for language, but can also model the waveform of high-frequency speech with a trend of lower negative log-likelihood than the TransformerXL and Wavenet (Oord et al., 2016) when trained over 400,000 steps. We also show the Compressive Transformer can be used as a memory component within an RL agent, IMPALA (Espeholt et al., 2018), and can successfully compress and make use of past observations.
Furthermore we present a new book-level language-modelling benchmark PG-19, extracted from texts in Project Gutenberg (https://www.gutenberg.org/), to further promote the direction of long-context sequence modelling. This is over double the size of existing LM benchmarks and contains text with much longer contexts.
2 Related Work
There have been a variety of recent attempts to extend the range of attention, particularly in the Transformer, or to replace the attention operation with something less expensive. Wu et al. (2019) show that a convolution-like operator that runs in linear time can actually exceed the performance of the quadratic-time self-attention layer in the Transformer at sentence-to-sentence translation and sentence-level language modelling. However, such a mechanism inhibits the flow of information across a large number of time-steps for a given layer, and has not been shown to be beneficial for long-range sequence modelling.
Dai et al. (2019) propose the TransformerXL, which keeps past activations around in memory. They also propose a novel relative positional embedding scheme which they see outperforms the Transformer’s original absolute positional system. Our model incorporates both of these ideas, the use of a memory to preserve prior activations and their relative positional embedding scheme.
The Sparse Transformer (Child et al., 2019) uses fixed sparse attention masks to attend to roughly $\sqrt{n}$ locations in memory. This approach still requires keeping all memories around during training; however, with careful re-materialization of activations and custom kernels, the authors are able to train the model with a reasonable budget of memory and compute. When run on Enwik8, the much larger attention window improves model performance, but overall it does not significantly outperform a simpler TransformerXL with a much smaller attention window.
The use of dynamic attention spans is explored in Sukhbaatar et al. (2019). Different attention heads can learn to have shorter or longer spans of attention — and they observe this achieves state-of-the-art in character-based language modelling. This idea could easily be combined with our contribution — a compressive memory. However an efficient implementation is not possible on current dense-linear-algebra accelerators, such as Google’s TPUs, due to the need for dynamic and sparse computation. Our approach builds on simple dense linear algebra components, such as convolutions.
We present the Compressive Transformer, a long-range sequence model which compacts past activations into a compressed memory. The Compressive Transformer is a variant of the Transformer (Vaswani et al., 2017), a deep residual network which only uses attention to propagate information over time (namely multi-head attention). We build on the ideas of the TransformerXL (Dai et al., 2019) which maintains a memory of past activations at each layer to preserve a longer history of context. The TransformerXL discards past activations when they become sufficiently old (controlled by the size of the memory). The key principle of the Compressive Transformer is to compress these old memories, instead of discarding them, and store them in an additional compressed memory.
We define $n_m$ and $n_{cm}$ to be the number of respective memory and compressive memory slots in the model per layer. The overall input sequence $S = x_1, x_2, \ldots, x_{|S|}$ represents input observations (e.g. tokens from a book). These are split into fixed-size windows of size $n_s$ for the model to process in parallel. The model observes $x = x_t, \ldots, x_{t+n_s}$ at time $t$, which we refer to as the sequence (e.g. in Figure 1). As the model moves to the next sequence, its $n_s$ hidden activations are pushed into a fixed-sized FIFO memory (like the TransformerXL). The oldest $n_s$ activations in memory are evicted, but unlike the TransformerXL we do not discard them. Instead we apply a compression operation, $f_c : \mathbb{R}^{n_s \times d} \to \mathbb{R}^{\lfloor n_s / c \rfloor \times d}$, mapping the $n_s$ oldest memories to $\lfloor n_s / c \rfloor$ compressed memories which we then store in a secondary FIFO compressed memory. Here $d$ denotes the hidden size of activations and $c$ refers to the compression rate; a higher value indicates more coarse-grained compressed memories. The full architecture is described in Algorithm 1.
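The memory update described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it handles a single layer, stores activations as plain Python lists, and uses mean pooling as the compression function. The names `update_memories`, `mean_pool`, `n_m`, `n_cm` and `c` are chosen here for illustration, and dropping a trailing group shorter than `c` is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def mean_pool(old: Sequence[Vector], c: int) -> List[Vector]:
    """Average each group of c consecutive vectors (kernel = stride = c).

    A trailing group shorter than c is dropped (an assumption of this sketch).
    """
    pooled = []
    for start in range(0, len(old) - c + 1, c):
        group = old[start:start + c]
        pooled.append([sum(column) / c for column in zip(*group)])
    return pooled


def update_memories(
    memory: List[Vector],
    compressed: List[Vector],
    new_hidden: Sequence[Vector],
    n_m: int,
    n_cm: int,
    c: int,
    compress: Callable[[Sequence[Vector], int], List[Vector]] = mean_pool,
) -> Tuple[List[Vector], List[Vector]]:
    """Push new activations into the FIFO memory and compress what falls out."""
    memory = memory + list(new_hidden)
    overflow = max(0, len(memory) - n_m)
    evicted, memory = memory[:overflow], memory[overflow:]
    # Evicted memories are compressed instead of discarded.
    compressed = compressed + compress(evicted, c)
    # The compressed memory is itself a FIFO of at most n_cm slots.
    compressed = compressed[-n_cm:] if n_cm > 0 else []
    return memory, compressed
```

With `n_m = 4`, `n_cm = 2` and `c = 2`, feeding windows of four activations evicts the previous window on every step and stores two pooled vectors in its place; once the compressed memory is full, its oldest entries are dropped.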
3.2 Compression Functions and Losses
For choices of compression functions we consider (1) max/mean pooling, where the kernel and stride are set to the compression rate $c$; (2) 1D convolution, also with kernel and stride set to $c$; (3) dilated convolutions; (4) most-used, where the memories are sorted by their average attention (usage) and the most-used are preserved. The pooling is used as a fast and simple baseline. The most-used compression scheme is inspired by the garbage collection mechanism in the Differentiable Neural Computer (Graves et al., 2016) where low-usage memories are erased. The convolutional compression functions contain parameters which require training.
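Two of the parameter-free options can be sketched as follows. This is an illustrative reading of the description, not the authors' code: `max_pool` and `most_used` are hypothetical names, `usage` is assumed to hold one average attention weight per memory, and keeping the surviving memories in temporal order is an assumption.

```python
from typing import List, Sequence

Vector = List[float]


def max_pool(memories: Sequence[Vector], c: int) -> List[Vector]:
    """Element-wise max over each group of c memories (kernel = stride = c)."""
    return [
        [max(column) for column in zip(*memories[i:i + c])]
        for i in range(0, len(memories) - c + 1, c)
    ]


def most_used(memories: Sequence[Vector], usage: Sequence[float], c: int) -> List[Vector]:
    """Keep the len(memories) // c memories with the highest usage.

    Survivors are returned in their original temporal order (an assumption).
    """
    keep = len(memories) // c
    ranked = sorted(range(len(memories)), key=lambda i: usage[i], reverse=True)
    return [memories[i] for i in sorted(ranked[:keep])]
```

Both functions reduce the number of memories by a factor of `c`; pooling blends neighbouring memories, whereas most-used keeps a subset of them unchanged.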
One can train the compression network using gradients from the loss; however for very old memories this requires backpropagating-through-time (BPTT) over long unrolls. As such we also consider some local auxiliary compression losses. We consider an auto-encoding loss where we reconstruct the original memories from the compressed memories, $\mathcal{L}^{ae} = \lVert \text{old\_mem}^{(i)} - g(\text{new\_cm}^{(i)}) \rVert_2$, where $g$ is a learned decoder that maps the compressed memories back to the shape of the original memories. This is a lossless compression objective: it attempts to retain all information in memory. We also consider an attention-reconstruction loss described in Algorithm 2 which reconstructs the content-based attention over memory, with content-based attention over the compressed memories. This is a lossy objective, as information that is no longer attended to can be discarded, and we found this worked best. We stop compression loss gradients from passing into the main network, as letting them through prevents learning. Instead the Transformer optimizes the task objective and the compression network optimizes the compression objective conditioned on task-relevant representations; there is no need to mix the losses with a tuning constant.
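The idea behind the attention-reconstruction loss can be illustrated with a small single-head sketch. Several simplifications here are assumptions rather than details from Algorithm 2: the query, key and value projections are taken to be the identity, scores are scaled by the square root of the dimension, and the loss is a summed squared difference. In an autodiff framework the hidden states and old memories would be treated as constants (stop-gradient), so only the compression function receives this gradient.

```python
import math
from typing import List, Sequence

Vector = List[float]


def attend(queries: Sequence[Vector], memories: Sequence[Vector]) -> List[Vector]:
    """Content-based attention: softmax(q . m / sqrt(d)) weighted sum of memories."""
    d = len(memories[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(d) for m in memories]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * m[j] for w, m in zip(weights, memories)) / total for j in range(d)]
        )
    return outputs


def attention_reconstruction_loss(
    hidden: Sequence[Vector], old_mem: Sequence[Vector], new_cm: Sequence[Vector]
) -> float:
    """Squared distance between attention over old memories and over compressed ones."""
    full = attend(hidden, old_mem)
    approx = attend(hidden, new_cm)
    return sum((a - b) ** 2 for f, g in zip(full, approx) for a, b in zip(f, g))
```

The loss is zero whenever attending over the compressed memories returns the same read-out as attending over the originals, even if the compressed set is much smaller, which is what makes the objective lossy.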
3.3 Temporal Range
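The TransformerXL with a memory of size $n$ has a maximum temporal range of $l \times n$ with an attention cost of $O(n_s^2 + n_s n)$, where $l$ is the number of layers and $n_s$ the sequence window size (see Dai et al. (2019) for a detailed discussion). The Compressive Transformer, with memory size $n_m$, compressed memory size $n_{cm}$ and compression rate $c$, now has a maximum temporal range of $l \times (n_m + c \cdot n_{cm})$ with an attention cost of $O(n_s^2 + n_s (n_m + n_{cm}))$. For example, setting $n_{cm} = n_m = n/2$ and $c = 3$ we obtain a maximum temporal range that is two times greater than the TransformerXL with an identical attention cost. Thus if we can learn in the compressed setting, the temporal range of the model can be significantly increased.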
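The arithmetic behind the doubling claim can be checked directly. The sketch below assumes the temporal range is the number of layers times the memory size for the TransformerXL, and the number of layers times (memory size plus compression rate times compressed memory size) for the Compressive Transformer. The layer count and sizes are hypothetical values picked only for the calculation.

```python
def txl_range(layers: int, n: int) -> int:
    """Maximum temporal range of a TransformerXL with memory size n."""
    return layers * n


def compressive_range(layers: int, n_m: int, n_cm: int, c: int) -> int:
    """Each compressed slot summarises c original time-steps."""
    return layers * (n_m + c * n_cm)


def attention_cost(n_s: int, slots: int) -> int:
    """Attention over the current window plus all stored slots (up to constants)."""
    return n_s * n_s + n_s * slots


# Hypothetical configuration: 24 layers, window 512, total memory budget 1024.
LAYERS, N_S, N = 24, 512, 1024
baseline = txl_range(LAYERS, N)
compressive = compressive_range(LAYERS, N // 2, N // 2, 3)
```

Here `compressive` is exactly twice `baseline`, while `attention_cost(N_S, N)` equals `attention_cost(N_S, N // 2 + N // 2)` because both models attend over the same number of slots.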
4 PG-19 Benchmark
As models begin to incorporate longer-range memories, it is important to train and benchmark them on data containing larger contexts. Natural language in the form of text provides us with a vast, easily accessible repository of data containing long-range dependencies. We propose a new language modelling benchmark, PG-19, using text from books extracted from Project Gutenberg (the authors intend to release the PG-19 dataset along with the split into train, validation and test subsets). We select Project Gutenberg books which were published over 100 years ago, i.e. before 1919 (hence the name PG-19), to avoid complications with international copyright, and remove short texts. The dataset contains 28,752 books, or 11GB of text, which makes it over double the size of BookCorpus and Billion Word Benchmark.
4.1 Related Datasets
The two most benchmarked word-level language modelling datasets either stress the modelling of stand-alone sentences (Billion Word Benchmark from Chelba et al. (2013)) or the modelling of a small selection of short news articles (Penn Treebank processed by Mikolov et al. (2010)). Merity et al. (2016) proposed the WikiText-103 dataset, which contains text from a high quality subset of English-language wikipedia articles. These articles are on average words long. This dataset has been a popular recent LM benchmark due to the potential to exploit longer-range dependencies (Grave et al., 2016; Rae et al., 2018; Bai et al., 2018a). However recent Transformer models, such as the TransformerXL (Dai et al., 2019) appear to be able to exploit temporal dependencies on the order of several thousand words. This motivates a larger dataset with longer contexts.
Books are a natural choice of long-form text, and provide us with stylistically rich and varied natural language. Texts extracted from books have been used for prior NLP benchmarks; such as the Children’s Book Test (Hill et al., 2015) and LAMBADA (Paperno et al., 2016). These benchmarks use text from Project Gutenberg, an online repository of books with expired US copyright, and BookCorpus (Zhu et al., 2015), a prior dataset of unpublished (at time of authorship) books. CBT and LAMBADA contain extracts from books, with a specific task of predicting held-out words. In the case of LAMBADA the held-out word is specifically designed to be predictable for humans with access to the full textual context — but difficult to guess with only a local context.
CBT and LAMBADA are useful for probing the linguistic intelligence of models, but are not ideal for training long-range language models from scratch as they truncate text extracts to at most a couple of paragraphs, and discard a lot of the books’ text. There has been prior work on training models on book data using BookCorpus directly (e.g. BERT from Devlin et al. (2018)) however BookCorpus is no longer distributed due to licensing issues, and the source of data is dynamically changing — which makes exact benchmarking difficult over time.
The NarrativeQA Book Comprehension Task (Kočiskỳ et al., 2018) uses Project Gutenberg texts paired with Wikipedia articles, which can be used as summaries. Due to the requirement of needing a corresponding summary, NarrativeQA contains a smaller selection of books: 1,527 versus the 28,752 books in PG-19. However it is reasonable that PG-19 may be useful for pre-training book summarisation models.
|Dataset||Avg. length (words)||Train Size||Vocab||Type|
|1B Word||27||4.15GB||793K||News (sentences)|
|Penn Treebank||355||5.1MB||10K||News (articles)|
A brief comparison of PG-19 to other LM datasets can be found in Table 1. We intentionally do not limit the vocabulary by unk-ing rare words, and release the dataset as an open-vocabulary benchmark. To compare models we propose to continue measuring the word-level perplexity. This can still be computed for any chosen character-based, byte-based or subword-based scheme. To do this, one calculates the total cross-entropy loss $L$ over the given validation or test subset using a chosen tokenization scheme, and then one normalizes this value by the number of words: $L / n_{words}$, where $n_{words}$ is the total number of words in the given subset, taken from Table 3. The word-level perplexity is thus $e^{L / n_{words}}$. For sake of model comparisons, it is important to use the exact number of words computed in Table 3 as the normalisation constant.
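The normalisation can be written out as a short helper. This is a sketch under the assumption that the total loss is a sum of per-token negative log-likelihoods in nats; `word_level_perplexity` and `bits_to_nats` are illustrative names, and the numbers in the usage lines are made up.

```python
import math


def bits_to_nats(total_bits: float) -> float:
    """Convert a total loss measured in bits (e.g. from bpc) to nats."""
    return total_bits * math.log(2)


def word_level_perplexity(total_loss_nats: float, n_words: int) -> float:
    """exp(total cross-entropy / number of words), whatever the tokenisation."""
    return math.exp(total_loss_nats / n_words)


# Made-up example: 1,500 subword tokens at 2.0 nats each, covering 1,000 words.
example_ppl = word_level_perplexity(1500 * 2.0, 1000)
```

Because the divisor is the word count rather than the token count, two models with different tokenisers are compared on the same footing as long as both use the same word count for the subset.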
Alongside quantitative analyses, we build an LDA topic model (Blei et al., 2003) for a qualitative inspection of the text. We present key words for several topics in the Supplementary Table 10. These topics include art, education, naval exploration, geographical description, war, ancient civilisations, and more poetic topics concerning the human condition — love, society, religion, virtue etc. This contrasts to the more objective domains of Wikipedia and news corpora.
We optimised all models with Adam (Kingma and Ba, 2014). We used a learning rate schedule with a linear warmup from to and a cosine decay back down to . For character-based LM we used warmup steps with decay steps, and for word-based LM we used warmup steps with decay steps. We found that decreasing the optimisation update frequency helped (see Section 5.5.1), namely we only applied parameter updates every steps after iterations. However we found the models would optimise well for a range of warmup/warm-down values. We clipped the gradients to have a norm of at most , which was crucial to successful optimisation.
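A linear-warmup, cosine-decay schedule and gradient-norm clipping can be sketched as below. The function names and every numeric value are hypothetical placeholders chosen to exercise the code; they are not the settings used in the experiments.

```python
import math
from typing import List


def learning_rate(step: int, lr_min: float, lr_max: float,
                  warmup_steps: int, decay_steps: int) -> float:
    """Linear warmup from lr_min to lr_max, then cosine decay back to lr_min."""
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


# Hypothetical settings, not taken from the paper.
LR_MIN, LR_MAX, WARMUP, DECAY = 1e-6, 1e-3, 100, 1000
```

The schedule starts at `LR_MIN`, reaches `LR_MAX` at the end of warmup, and returns to `LR_MIN` once `WARMUP + DECAY` steps have passed, staying there afterwards.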
|36L Compressive Transf.||43.4||33.6|
We benchmark the Compressive Transformer against the TransformerXL on the newly proposed PG-19 books dataset. Because it is open-vocabulary, we train a subword vocabulary of size with SubwordTextEncoder from the tfds package in TensorFlow and use the dataset statistics to compute word-level perplexity, as described in Section 4.2. We train a 36 layer Compressive Transformer with a window size of , both memory and compressed memory size of , and compression rate . We compare this to a 36 layer TransformerXL trained with window size and attention window . The model was trained on TPUv3 cores with a total batch size of and converged after processing around billion subword tokens. We display the results in Table 3 where we see the Compressive Transformer obtains a test perplexity of versus the TransformerXL’s . Despite the dataset size, it is clearly a challenging domain. This can suit as a first baseline on the proposed long-range language modelling benchmark. We show samples from this model in Supplementary Section E. The model is able to generate long-form narrative of varying styles: from character dialogue, first person diary entries, to descriptive third-person text.
|Model||BPC|
|7L LSTM (Graves, 2013)||1.67|
|LN HyperNetworks Ha et al. (2016)||1.34|
|LN HM-LSTM Chung et al. (2016)||1.32|
|ByteNet (Kalchbrenner et al., 2016)||1.31|
|RHN Zilly et al. (2017)||1.27|
|mLSTM Krause et al. (2016)||1.24|
|64L Transf. Al-Rfou et al. (2019)||1.06|
|24L TXL (Dai et al., 2019)||0.99|
|Sparse Transf. (Child et al., 2019)||0.991|
|Adaptive Transf. (Sukhbaatar et al., 2019)||0.98|
|24L TXL (ours)||0.98|
|24L Compressive Transformer||0.97|
|Compression fn||Compression loss||BPC|
We compare the TransformerXL and the Compressive Transformer on the standard character-level language modelling benchmark Enwik8 taken from the Hutter Prize (Hutter, 2012), which contains 100M bytes of unprocessed Wikipedia text. We select the first 90MB for training, 5MB for validation, and the latter 5MB for testing, as per convention. We train 24-layer models with a sequence window size of . During training, we set the TransformerXL’s memory size to , and for the Compressive Transformer we use memory of size and compressed memory of size with compression rate . During evaluation, we increased the TransformerXL memory size to and the compressed memory in our model to (after sweeping over the validation set), obtaining the numbers reported in Table 5. We show the effect of scaling the compressed memory size and evaluation performance in Supplementary Section B. The proposed model achieves the new state-of-the-art on this dataset with 0.97 bits-per-character.
We compare compression functions and the use of auxiliary losses in Table 5. We sweep over compression rates of , , and and report results with the best performing value for each row. BPTT signifies that no auxiliary compression loss was used to train the network other than the overall training loss. To feed gradients into the compression function we unrolled the model over double the sequence length and halved the batch size to fit the larger unroll into memory.
We train an eighteen-layered Compressive Transformer on the closed-vocabulary word-level language modelling benchmark WikiText-103, which contains articles from Wikipedia. We train the model with a compressed memory size, memory size, and a sequence window size all equal to . We trained the model over Tensor Processing Units (TPU) v3 with a batch size of per core — making for a total batch size of . The model converged in a little over hours. We found the single-layer convolution worked best, with a compression rate of . This model obtained perplexity on the test set. By tuning the memory size over the validation set — setting the memory size to , and compressed memory size to — we obtain perplexity. This is perplexity points over prior state of the art, and means the model places a higher probability on the correct word over the prior SotA TransformerXL.
It is worth noting that in Table 6 we do not list methods that use additional training data, or that make use of test-time labels to continue training the model on the test set (known as dynamic evaluation (Graves, 2013)). If we incorporate a very naive dynamic evaluation approach of loading a model checkpoint and continuing training over one epoch of the test set, then we obtain a test perplexity of 16.1. This is slightly better than the published 16.4 from Krause et al. (2019), which uses a more sophisticated dynamic evaluation approach on top of the TransformerXL. However in most settings, one does not have access to test-time labels, and thus we do not focus on this setting. Furthermore there has been great progress in showing that more data equates to much better language modelling; Shoeybi et al. (2019) find a large 8B-parameter transformer trained on 170GB of text obtains 10.7 word-level perplexity on WikiText-103. However it is not clear to what extent the WikiText-103 test set may be leaked inside these larger training corpora. For clarity of model comparisons, we compare to published results trained on the WikiText-103 training set. Certainly the direction of larger scale and more data appears to bring immediate gains to the quality of existing language models. Data scale and quality, alongside intelligent model design, are complementary lines of research towards better sequence modelling.
We break perplexity down by word frequency in Table 7 and see the Compressive Transformer makes only a small modelling improvement for frequent words (2.6% over the TransformerXL baseline) but obtains a much larger improvement of roughly 20% for infrequent words. Furthermore, we see improvement in modelling rare words over the prior state-of-the-art LSTM language model published in 2018, which demonstrates the rate of progress in this area.
|Model||Valid.||Test|
|LSTM (Graves et al., 2014)||-||48.7|
|Temporal CNN (Bai et al., 2018b)||-||45.2|
|GCNN-14 (Dauphin et al., 2016)||-||37.2|
|Quasi-RNN Bradbury et al. (2016)||32||33|
|RMC (Santoro et al., 2018)||30.8||31.9|
|LSTM+Hebb. (Rae et al., 2018)||29.0||29.2|
|Transformer (Baevski and Auli, 2019)||-||18.7|
|18L TransformerXL, M=384 (Dai et al., 2019)||-||18.3|
|18L TransformerXL, M=1024 (ours)||-||18.1|
|18L Compressive Transformer, M=1024||16.0||17.1|
|Relative gain over TXL||2.6%||9.5%||21%||19.9%||5.8%|
5.4 Compressibility of layers
We can use compression to better understand the model’s mode of operation. We inspect how compressible the Transformer’s activations are as they progress through higher layers in the network. One may expect representations to become more difficult to compress at higher layers, if more semantic information is represented there. We monitor the compression loss at each layer of our best-performing Compressive Transformer models trained on Enwik8 and WikiText-103 and display these in Supplementary Section A Figure 6. We note that the compression loss is about one order of magnitude higher for word-level language modelling (WikiText-103) than for character-level language modelling (Enwik8). Furthermore the first layer of the Transformer is highly compressible. However there is not a clear trend of compression cost increasing with layer depth.
We inspect where the network is attending to on average, to determine whether it is using its compressed memory. We average the attention weight over a sample of sequences from a trained model on Enwik8. We aggregate the attention into eighteen buckets, six for each of the compressed memory, memory, and sequence respectively. We set the size of the sequence, memory and compressed memory all to be . We plot this average attention weight per bucket in Figure 3 with a 1 standard error. We see most of the attention is placed on the current sequence; with a greater weight placed on earlier elements of the sequence due to the causal self-attention mechanism which masks future attention weights. We also observe there is an increase in attention from the oldest activations stored in the regular memory, to the activations stored in the compressed memory. This goes against the trend of older memories being accessed less frequently — and gives evidence that the network is learning to preserve salient information.
5.5.1 Optimisation Schedule
We make an observation about an interesting but undesirable meta-learning phenomenon during long-context training. When the learning rate is tuned to be much smaller (or set to zero) during training, performance degrades drastically both for the TransformerXL and the Compressive Transformer. This is displayed in Figure 3.
Usually we consider distributional shift from the training data to the test data, but we can also observe a shift in the model when transferring from a training to evaluation mode (even when the model is evaluated on the training data). In this case, this is due to the online updating of parameters whilst processing long contiguous articles. We would like the model to generalise well to scenarios where it is not continuously optimised. Updating the parameters only at article boundaries (and then resetting the state) could be one solution for long-range memory models, but this would slow down learning significantly.
Instead, we propose reducing the frequency of optimisation updates during training. We find this allows for the best of both worlds — fast initial learning with frequent updates, and better generalisation near the end of training with less frequent updates (e.g. every 4 steps). Reducing the optimisation frequency increases the effective batch size, which has also been shown to be preferable to learning rate decay in image modelling (Smith et al., 2018). We observed a final performance improvement in our TransformerXL baseline on Enwik8, from — which approximately replicates the published result — to — which matches the most recent SotA architecture. We note, the additional space and compute cost of accumulating gradients is negligible across iterations, so there was no performance regression in using this scheme.
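Reducing the update frequency amounts to accumulating gradients over several steps before applying them. The sketch below uses plain SGD on a list of floats for brevity; the experiments use Adam, and averaging (rather than summing) the accumulated gradients is an assumption of this sketch. `AccumulatingSGD` is a hypothetical name.

```python
from typing import List


class AccumulatingSGD:
    """Applies one averaged SGD update every `update_every` calls to step()."""

    def __init__(self, params: List[float], lr: float, update_every: int) -> None:
        self.params = list(params)
        self.lr = lr
        self.update_every = update_every  # may be raised later in training
        self._sum = [0.0] * len(params)
        self._count = 0

    def step(self, grads: List[float]) -> bool:
        """Accumulate grads; return True when a parameter update was applied."""
        self._sum = [s + g for s, g in zip(self._sum, grads)]
        self._count += 1
        if self._count < self.update_every:
            return False
        # Average over the accumulated steps, i.e. a larger effective batch.
        self.params = [p - self.lr * s / self._count
                       for p, s in zip(self.params, self._sum)]
        self._sum = [0.0] * len(self.params)
        self._count = 0
        return True
```

Starting with `update_every = 1` and raising it to 4 late in training would give frequent updates early on and less frequent ones near the end. The only extra state is one accumulator per parameter.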
We train the Compressive Transformer on the waveform of speech to assess its performance on different modalities. Speech is interesting because it is sampled at an incredibly high frequency, but we know it contains a lot of information on the level of phonemes and entire phrases.
To encourage long-term reasoning, we refrain from conditioning the model on speaker identity or text features, but focus on unconditional speech modelling. We train the model on 24.6 hours of 24kHz North American speech data. We chunk the sequences into windows of size , roughly ms of audio, and compare a -layer Compressive Transformer to a -layer TransformerXL and a -layer WaveNet model (Oord et al., 2016) — a state-of-the-art audio generative model used to serve production speech synthesis applications at Google (Oord et al., 2018). All networks have approximately 40M parameters, as WaveNet is more parameter-efficient per layer. We train each network with V100 GPUs, and a batch size of per core (total batch size of ) using synchronous training.
WaveNet processes an entire chunk in parallel, however the TransformerXL and Compressive Transformer are trained with a window size of and a total memory size of (for the Compressive Transformer we use memory + compressed). We thus unroll the model over the sequence. Despite this sequential unroll, the attention-based models train at only half the speed of WaveNet. We see the test-set negative-log-likelihood in Figure 5, and observe that a Compressive Transformer with a compression rate of is able to outperform the TransformerXL and maintain a slim advantage over WaveNet. However we only trained models for at most one week (with 32GPUs) and it would be advantageous to continue training until full convergence — before definitive conclusions are made.
5.7 Reinforcement Learning
Compression is a good fit for video input sequences because subsequent frames have high mutual information. Here we do not test out the Compressive Transformer on video, but progress straight to a reinforcement learning agent task that receives a video stream of visual observations — but must ultimately learn to use its memory to reason over a policy.
We test the Compressive Transformer as a drop-in replacement to an LSTM in the IMPALA setup (Espeholt et al., 2018). Otherwise, we use the same training framework and agent architecture as described in the original work with a fixed learning rate of and entropy cost coefficient of . We test the Compressive Transformer on a challenging memory task within the DMLab-30 (Beattie et al., 2016) domain, rooms_select_nonmatching_object. This requires the agent to explore a room in a visually rich 3D environment and remember the object present. The agent can then advance to a second room where it must select the object not present in the original room. This necessitates that the agent both remember events far in the past, and also learn to efficiently reason about them.
We fix both the memory and compressed memory sizes to . In Figure 5, we present results for a range of compression rates, averaged over seeds. We see that the best performing agents endowed with the Compressive Transformer are able to solve the task to human-level. We note that the model with compression rate is unable to learn the task to the same proficiency. The speed of learning and stability seem to increase proportionally with higher rates of compression (up to a limit) – i.e. the effective memory window of the agent – and we find compression rate to once again be the best performing. We see this as a promising sign that the architecture is able to efficiently learn, and suitably use, compressed representations of its visual input and hope to test this more widely in future work.
In this paper we explore the notion of compression as a means of extending the temporal receptive field of Transformer-based sequence models. We see a benefit to this approach in the domain of text, with the Compressive Transformer outperforming existing architectures at long-range language modelling. To continue innovation in this area, we also propose a new book-level LM benchmark, PG-19. This may be used to compare long-range language models, or to pre-train on other long-range reasoning language tasks, such as NarrativeQA (Kočiskỳ et al., 2018).
We see the idea of compressive memories is applicable not only to the modality of text, but also audio, in the form of modelling the waveform of speech, and vision, within a reinforcement-learning agent trained on a maze-like memory task. In both cases, we compare to very strong baselines (Wavenet (Oord et al., 2016) and IMPALA (Espeholt et al., 2018)).
The main limitation of this work is additional complexity: if the task one wishes to solve does not contain long-range reasoning then the Compressive Transformer is unlikely to provide additional benefit. However, as a means of scaling memory and attention, we do think compression is a simpler approach than dynamic or sparse attention, which often requires custom kernels to make efficient. One can build effective compression modules from simple neural network components, such as convolutions. The compression components are immediately efficient to run on GPUs and TPUs.
Memory systems for neural networks began as compressed state representations within RNNs. The recent wave of progress using attention-based models with deep and granular memories shows us that it is beneficial to refrain from immediately compressing the past. However, we hypothesise that more powerful models will contain a mixture of granular recent memories and coarser compressed memories. Future directions could include the investigation of adaptive compression rates by layer, the use of long-range shallow memory layers together with deep short-range memory, and even the use of RNNs as compressors. Compressive memories should not be forgotten about just yet.
- Character-level language modeling with deeper self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3159–3166. Cited by: Table 5.
- Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853. Cited by: Table 6.
- Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
- Trellis networks for sequence modeling. arXiv preprint arXiv:1810.06682. Cited by: §4.1.
- Convolutional sequence modeling revisited. Cited by: Table 6.
- DeepMind lab. CoRR abs/1612.03801. Cited by: §5.7.
- Latent dirichlet allocation. J. Mach. Learn. Res. 3, pp. 993–1022. Cited by: Appendix D, §4.2.
- Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576. Cited by: Table 6.
- One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005. Cited by: §4.1.
- Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. Cited by: §1, §2, Table 5.
- Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704. Cited by: Table 5.
- Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §1, §2, §3.3, §3, §4.1, Table 5, Table 6.
- Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083. Cited by: Table 6.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §4.1.
- IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp. 1406–1415. Cited by: §1, §5.7, §6.
- Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426. Cited by: §4.1.
- Neural turing machines. arXiv preprint arXiv:1410.5401. Cited by: Table 6.
- Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471. Cited by: §3.2.
- Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §5.3, Table 5.
- Hypernetworks. arXiv preprint arXiv:1609.09106. Cited by: Table 5.
- The goldilocks principle: reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301. Cited by: §4.1.
- Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
- The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751. Cited by: Appendix E.
- The human knowledge compression contest. URL http://prize.hutter1.net. Cited by: §1, §5.2.
- Neural machine translation in linear time. arXiv preprint arXiv:1610.10099. Cited by: Table 5.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.
- The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics 6, pp. 317–328. Cited by: §4.1, §6.
- Dynamic evaluation of transformer language models. CoRR abs/1904.08378. Cited by: §5.3.
- Multiplicative lstm for sequence modelling. arXiv preprint arXiv:1609.07959. Cited by: Table 5.
- Large memory layers with product keys. arXiv preprint arXiv:1907.05242. Cited by: §1.
- Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843. Cited by: §1, §4.1.
- Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, Cited by: §4.1.
- Parallel wavenet: fast high-fidelity speech synthesis. In International Conference on Machine Learning, pp. 3915–3923. Cited by: §5.6.
- Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §1, §5.6, §6.
- The lambada dataset: word prediction requiring a broad discourse context. Cited by: §4.1.
- Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems, pp. 3621–3629. Cited by: §1.
- Fast parametric learning with activation memorization. arXiv preprint arXiv:1803.10049. Cited by: §4.1, Table 6, Table 7.
- The persistence and transience of memory. Neuron 94 (6), pp. 1071–1084. Cited by: §1.
- Learning representations by back-propagating errors. Nature 323 (6088), pp. 533. Cited by: §1.
- Relational recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 7299–7310. Cited by: Table 6.
- Megatron-lm: training multi-billion parameter language models using model parallelism. Cited by: §1, §5.3.
- Don’t decay the learning rate, increase the batch size. Cited by: §5.5.1.
- Adaptive attention span in transformers. arXiv preprint arXiv:1905.07799. Cited by: Appendix B, §1, §2, Table 5.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §3.
- Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430. Cited by: §2.
- XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §1.
- End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8739–8748. Cited by: §1.
- Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19–27. Cited by: §4.1.
- Recurrent highway networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 4189–4198. Cited by: Table 5.
Appendix A Compression Across Layers
We inspect the compression loss broken down by layer index, to investigate whether the compressibility of the representations shows a trend with network depth. The compression loss here refers to the attention-reconstruction loss. We plot this for a 24-layer model trained on Enwik8 and an 18-layer model trained on WikiText-103. The compression loss for character-based language modelling is about one order of magnitude lower than that of word-level language modelling. The first layer’s representations are highly compressible; however, from then on there is no fixed trend. Some non-contiguous layers have a very similar compression loss (e.g. 4 & 6, 5 & 7), which suggests information is being routed from these layer pairs via the skip connection.
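To make the per-layer quantity concrete, the following sketch computes an attention-reconstruction loss for each layer: the L2 distance between attention over the old memories and attention over their compressed counterparts, using the same queries. It is a simplified illustration under stated assumptions, not the paper's code: attention here is single-head, uses the memories directly as keys and values (no learned projections), and all function names and toy inputs are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def attend(queries: List[Vector], memory: List[Vector]) -> List[Vector]:
    """Content-based dot-product attention with keys = values = memory."""
    d = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(d)
                  for m in memory]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * m[j] for w, m in zip(weights, memory)) / total
                        for j in range(d)])
    return outputs


def attention_reconstruction_loss(hidden: List[Vector],
                                  old_memory: List[Vector],
                                  compressed: List[Vector]) -> float:
    """L2 distance between attention over old and over compressed memories."""
    full = attend(hidden, old_memory)
    approx = attend(hidden, compressed)
    return math.sqrt(sum((a - b) ** 2
                         for row_full, row_approx in zip(full, approx)
                         for a, b in zip(row_full, row_approx)))


def loss_by_layer(hiddens, old_memories, compressed_memories) -> List[float]:
    """One loss per layer, in layer order."""
    return [attention_reconstruction_loss(h, m, c)
            for h, m, c in zip(hiddens, old_memories, compressed_memories)]


# Toy two-layer example (hypothetical values).
hiddens = [[[2.0, 0.0]], [[2.0, 0.0]]]
old_memories = [[[1.0, 0.0], [1.0, 0.0]],   # redundant: compresses losslessly
                [[1.0, 0.0], [0.0, 1.0]]]   # distinct: compression loses detail
compressed_memories = [[[1.0, 0.0]], [[0.5, 0.5]]]
losses = loss_by_layer(hiddens, old_memories, compressed_memories)
print(losses)
```

In this toy case the first layer's loss is zero because its two old memories are identical, while the second layer's loss is positive because averaging two distinct memories changes what the query retrieves.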
Appendix B Comparison of Compressed Memory Sizes
We compare the best test perplexity obtained for the Compressive Transformer trained on WikiText-103 and Enwik8 across a range of compressed memory sizes. For both models, the best model used a 1D convolution compression network with a compression rate of . The Enwik8 model was trained with an embedding size of , attention heads, layers, an MLP hidden size of , a sequence window size of , and a memory size of . We see the best compressed memory size is in this sweep, facilitating a total attention window of . The WikiText-103 model was trained with an embedding size of , adaptive inputs using the same parameters as [Sukhbaatar et al., 2019], attention heads, layers, an MLP hidden size of , a sequence window of size and a memory of size . The best compressed memory size is resulting in a total attention window of c. .
|Compressed Memory Size||512||1024||2048||3072||4096|
|Compressed Memory Size||256||512||1024||1536||2048|
Appendix C PG-19 Preprocessing
The raw texts from the Gutenberg project were minimally pre-processed by removing boilerplate license text. We then also replaced discriminatory words with a unique token using the Ofcom list of discriminatory words (https://www.ofcom.org.uk/__data/assets/pdf_file/0023/91625/OfcomQRG-AOC.pdf).
Appendix D PG-19 Topics
We present the top words for some of the topics in the PG-19 corpus. These were generated with an LDA topic model [Blei et al., 2003].
Appendix E PG-19 Samples
We show a few different samples from the Compressive Transformer trained on PG-19. We use Nucleus Sampling with [Holtzman et al., 2019]. We choose extracts of books from the test set as prefixes. We see the model is able to continue in the style of the text, creating artificial dialogue or descriptive text, and remembering the names of characters over hundreds of words.
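For readers unfamiliar with nucleus (top-p) sampling, the sketch below shows the core step: restrict sampling to the smallest set of highest-probability tokens whose cumulative mass reaches a threshold, then sample within that set. This is a minimal illustration, not the sampling code used for these samples; the function name, the toy distribution, and the threshold value in the example are assumptions.

```python
import random
from typing import List, Optional


def nucleus_sample(probs: List[float], p: float,
                   rng: Optional[random.Random] = None) -> int:
    """Sample a token index from the smallest set of tokens with mass >= p."""
    rng = rng or random.Random()
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for idx in ranked:
        nucleus.append(idx)
        mass += probs[idx]
        if mass >= p:
            break
    # random.choices renormalises the weights within the nucleus.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


# Toy next-token distribution and an illustrative threshold.
next_token_probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
draws = [nucleus_sample(next_token_probs, p=0.75, rng=rng) for _ in range(200)]
print(sorted(set(draws)))  # only indices 0 and 1 can be drawn
```

With a threshold of 0.75 in this toy distribution, the nucleus contains only the two most probable tokens, so the low-probability tail is never sampled.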
As the Compressive Transformer is trained without state resetting, it is actually slightly out of sample when provided with the (relatively) short contexts. This is because its memory and compressed memory may still be empty (whereas they are always full during training). However, we see a trend of the samples usually improving towards the end.