Understanding Neural Abstractive Summarization Models via Uncertainty

Understanding Neural Abstractive Summarization Models via Uncertainty


An advantage of seq2seq abstractive summarization models is that they generate text in a free-form manner, but this flexibility makes it difficult to interpret model behavior. In this work, we analyze summarization decoders in both blackbox and whitebox ways by studying on the entropy, or uncertainty, of the model’s token-level predictions. For two strong pre-trained models, PEGASUS Zhang et al. (2020) and BART Lewis et al. (2020) on two summarization datasets, we find a strong correlation between low prediction entropy and where the model copies tokens rather than generating novel text. The decoder’s uncertainty also connects to factors like sentence position and syntactic distance between adjacent pairs of tokens, giving a sense of what factors make a context particularly selective for the model’s next output token. Finally, we study the relationship of decoder uncertainty and attention behavior to understand how attention gives rise to these observed effects in the model. We show that uncertainty is a useful perspective for analyzing summarization and text generation models more broadly.1


1 Introduction

Recent progress in abstractive summarization has been fueled by the advent of large-scale Transformers pre-trained on autoregressive language modeling objectives Hoang et al. (2019); Khandelwal et al. (2019); Lewis et al. (2020); Zhang et al. (2020). Despite their strong performance on automatic metrics like ROUGE Lin (2004), abstractive models are not as straightforward and interpretable as their extractive counterparts. Free-form generation in these models also leads to serious downstream errors, such as factual inconsistencies with the input document Cao et al. (2018); Kryściński et al. (2020); Wang et al. (2020); Durmus et al. (2020); Goyal and Durrett (2020). Although the interpretability of NLU models has been extensively studied Ribeiro et al. (2016); Ghaeini et al. (2018); Jain and Wallace (2019); Desai and Durrett (2020), summarization models specifically have not received similar attention, with analysis efforts often focused on datasets and evaluation Kryscinski et al. (2019).

In this work, we focus on interpreting and understanding abstractive summarization models through the lens of decoder uncertainty, or the entropy of decisions during generation. While uncertainty in generation has been studied from the perspective of data Ott et al. (2018), sampling (Fan et al., 2018; Holtzman et al., 2019), and training Correia et al. (2019); Kang and Hashimoto (2020), it is underutilized as a technique for analysis and inspection of generation systems. We study two prominent summarization models, PEGASUS Zhang et al. (2020) and BART Lewis et al. (2020), fine-tuned on two English summarization datasets, CNN/Daily Mail Hermann et al. (2015) and XSum Narayan et al. (2018), to understand model behavior in each setting.

First, by comparing -grams between the input document and generated summaries, we establish two coarse types for decoded tokens, copy and generate See et al. (2017). We find that the entropy of the generation decision correlates with whether the model is copying or generating, as well as where in the sentence the token is. This paints a picture of certain contexts being more restrictive from the standpoint of generation, particularly early in sentences where a model has not “decided” what to copy yet, and illustrates the interaction of content selection and lexical choice. Second, we extend this analysis by looking at how uncertainty relates to the syntax of the generated sentence: whether uncertainty connects to syntactic notions of surprisal Roark et al. (2009) and how the entropy varies across certain syntactic productions. Finally, we derive a way to quantify decoder attention by aggregating distinct self-attention heads, revealing the correlation between the attention entropy and prediction entropy, and investigating the correspondence between the prediction entropy and the fraction of the past and future decoded tokens.

Taking this analysis together, we find that the abstractiveness of reference summaries fundamentally changes model behavior: the extractive nature of CNN/DM makes most of its decisions low entropy and copy-oriented while the model maintains higher uncertainty on XSum, yielding more abstractive summaries. More broadly, we show that uncertainty is a simple but effective tool to characterize decoder behavior in text generation.

2 Model and Experimental Setup

Our experiments use PEGASUS Zhang et al. (2020) and BART Lewis et al. (2020), two state-of-the-art seq2seq pre-trained models. We use the large version of these two models, which have 16 and 12 Transformer layers, respectively. Both models have pre-training objectives tailored somewhat to this problem domain: seq2seq modeling for denoising (BART) or infilling of masked-out sentences (PEGASUS). We directly use the pre-trained models from Wolf et al. (2019).2

As reported in the original papers and measured by ROUGE-1/2/L Lin (2004), PEGASUS achieves 44.17/21.47/41.11 on CNN/DM Hermann et al. (2015) and 47.21/24.56/39.25 on XSum Narayan et al. (2018), and BART achieves 44.16/21.28/40.90 and 45.14/22.27/37.25.


Entropy is a standard measure of uncertainty in a probabilistic distribution. Given a discrete random variable with all possible outcomes , the entropy of is defined as .

For pre-trained Transformers, the domain of the predictions (the vocabulary) is large and also differs between models. The vocabulary sizes for PEGASUS and BART are 96,103 and 50,265,3 and the prediction distribution is usually long-tailed. To combat this, nucleus sampling (Holtzman et al., 2019) is used to sample from only the top most probable outcomes (the nucleus) to avoid generating very unlikely tokens. To more fairly compare models with different vocabulary sizes, and to better reflect the actual sampling distribution, we therefore compute all entropy values in this work over the nucleus distribution. That is, we sort the prediction distribution in descending order and get a minimal set of tokens where . Then we re-normalize the distribution as follows:


where the cumulative probability . We use for all experiments. The entropy is computed based on the new distribution .

3 Model Uncertainty during Generation

In this section, we analyze and compare the prediction uncertainty from different models and different datasets by inspecting entropy values during generation, allowing us to localize uncertainty to certain positions in a decoded sentence. A principle factor that past work has investigated is the amount of copying in abstractive summarization models See et al. (2017); Paulus et al. (2018). We first aim to understand how decisions to copy document content or generate new text are reflected in the model’s uncertainty.

One complicating factor is that while BART and PEGASUS both exhibit a mix of copying and novel generation, they do not have an explicit copy operation like in past models and so these behaviors are more difficult to define. We first separate generation decisions by bigrams that appear in the input document (existing bigrams) or whether they are free-form generations (novel bigrams).4

Figure 1: Next token entropies computed on 10K generation steps from PEGASUS, PEGASUS, BART and BART respectively, broken into two cases: an Existing Bigram means the bigram just generated occurs in the input document, while a Novel Bigram is an organic model generation. These cases are associated with low entropy and high entropy actions, respectively. The x-axis shows the entropy (truncated at 5), and the y-axis shows the count of bigram falling in each bin. The dashed lines indicate the median of each distribution.

Figure 1 shows a histogram of model entropies broken down by these two categories. Most notably, there is a strong correlation between copy-like behavior and the entropy of the model’s prediction distribution. On CNN/DM, we see that low entropy decisions are largely those generating existing bigrams, and conversely, existing bigrams are usually generated with low entropy. New bigrams are generated with a broad range of high entropy values, and are much more frequent on XSum. These results align with our manual analysis of these summaries: PEGASUS and BART summaries largely consist of spans from the input document with minor compression while PEGASUS and BART summaries involve stitching together disparate concepts and paraphrasing key details. This reflects a corresponding divergence in the gold summaries, where CNN/DM summaries are far more extractive than those in XSum.

Critically, though the entropy distributions are dissimilar across the two datasets, we see regularities among the approximate copy and generate operations: on CNN/DM and XSum, the median entropy values of using existing bigrams are 0.95 and 1.20, respectively, and for generating new bigrams, 2.27 and 1.75.

With this connection between entropy and copying behavior, we make the following additional observations based on Figures 1 and 2:

Figure 2: Prediction entropy values by relative sentence positions. For example, 0.0 indicates the first 10% of tokens in a sentence, and 0.9 is the last 10% of tokens. PEGASUS and BART make highly uncertain decisions to start, but then entropy decreases, suggesting that these models may be copying based on a sentence prefix. Entropies on XSum are more constant across the sentence.

Entropy varies across token positions, especially on CNN/DM.

In Figure 2, we depict a different view of entropy, looking at the decoding process as it progresses through each sentence. Across both CNN/DM and XSum, models are most uncertain at the beginning of the sentence and least uncertain at the end of the sentence. However, the rate at which entropy drops off is quite different: on CNN/DM, the entropy after decoding 20% of tokens falls below 2, while the entropies on XSum only begin to considerably drop after decoding 80% of tokens. Our manual analysis suggests the following characterization: to generate each sentence on CNN/DM, the model makes some high-entropy decisions to identify a sentence and begin to copy its prefix, followed by a series of low entropy decisions to copy that sentence’s content. On XSum, which is highly abstractive and features single sentence summaries, content planning and generation are less clearly decoupled.

PEGASUS copies and generates more tokens with entropy .

BART and PEGASUS report similar ROUGE results on CNN/DM, but these models do not place the same distributions over summaries. PEGASUS has more low-entropy copying decisions, and its start-of-sentence entropies are also significantly lower (Figure 2). This suggests that it is more confident than BART in selecting content to discuss next. There are also more low-entropy generation decisions, particularly on XSum.

4 Entropies of Syntactic Productions

Having observed connections between sentence position and entropy, we now flesh out this analysis from the lens of syntax, focusing in particular on uncertainty at constituent boundaries. From our PEGASUS generations on CNN/DM and XSum, we obtain constituency parses for each summary sentence using the Berkeley Neural Parser Kitaev and Klein (2018) and explore connections between syntax and uncertainty in more depth.

Figure 3: Correlating syntactic distance between neighboring tokens with the entropy change in those tokens’ generation decisions for PEGASUS summaries. The median entropy change is depicted as a dashed black line. At points of high syntactic distance, the model’s behavior is less restricted by the context, correlating with higher entropy.

Low and high entropy decisions can be localized to constituent span boundaries.

Parsing has long been used to explain psycholinguistic notions of surprisal (Hale, 2001; Roark et al., 2009, inter alia), which are in turn related to uncertainty under a language model. In our case, uncertainty about generating a text is a different notion than uncertainty when a reader is processing it. Hence, rather than looking at an incremental parser’s behavior, we instead look at a simpler notion of syntactic distance Shen et al. (2018), or the number of left and right parentheses between and in a linearized constituency tree. Our hypothesis is that when these words exhibit high syntactic distance, this word boundary is a “choice point” where the model may be less restricted in what it can choose to generate next.

Figure 3 shows the correlation between syntactic distance and the percent change in entropy between the adjacent tokens. On both CNN/DM and XSum, we see two patterns emerge: generating a token within the same immediate parent constituent (i.e., zero syntactic distance) is typically a certain decision, while generating a token belonging to a new constituent is an increasingly uncertain decision. From these results, we can draw a parallel to the copy vs. generate behavior established in Section 3; for example, generating York after New might be straightforward, perhaps due to a direct copy from the document, but generating a prepositional phrase might be more challenging due to the large search space of possible constructions or the higher chance that the model might delete this constituent.

Production Rule Example
NP NP : NP [Arsenal vs Reading] [:] [the game that changed the game]
NP NP , SBAR , [driver] [,] [who has not been identified] [,]
NP CD NN NNS [16] [felony] [counts]
NP NNP CD [April] [3]
Table 1: Examples of specific NP productions with high entropy (top) and low entropy (bottom). The notation [Y] implies the constituent is generated with entropy .

Low entropy spans are often short, specific units of information.

We also investigate the average entropy of spans within a rule production to uncover what types of spans are likely to elicit certainty or uncertainty during generation. In Table 1, we see qualitatively that productions with low average entropy productions are short extracts of document content, such as 16 felony counts. These are largely factual, often containing cardinal values, and more likely to be copied. Within these constituents, the model is very certain about what to generate next, supporting the connection with low syntactic distance.

5 Understanding Decoder Self-Attention

While we have analyzed the model’s predictions, we have not yet determined how the different behaviors we see emerge from the context. Our goal is to explore what the encoder attention places its emphasis during generation and how it correlates with the prediction entropy.5

Figure 4: Correlation between attention entropy and prediction entropy of PEG(ASUS) and BART on C(NN/DM) and X(Sum). We compute the mean value of the attention entropy within each bucket of prediction entropy. The uncertainty of attention strongly correlates with the entropy of the model’s prediction.

Blocking Low-information Tokens.

Analyzing the inner workings of attention in Transformers is challenging Clark et al. (2019); Kovaleva et al. (2019), particularly because many heads are useless, redundant, or noisy, and they frequently attend to low-information tokens such as end-of-sentence markers or periods. Inspired by tf-idf Joachims (1997), we propose a method to compute a set of tokens most meaningfully attended to by the model. If a token in the encoding document is attended to across many time steps (like a word appearing in many documents in tf-idf), we want to disregard it in our analysis.

Let denote the number of decoder timesteps and be the length of the source document. We compute an aggregate attention matrix by summing the attentions across all heads and all layers. We then compute a count of how often each token is attended to above a threshold : and discard the attention values on tokens with the highest score. In practice we discard 5% of tokens from the source document.

Attention Entropy.

One natural question we can ask is whether there is a connection between entropy of the attention distribution and entropy of the decoder’s prediction. This relationship is shown in Figure 4, where each point represents the mean attention entropy within the corresponding prediction entropy bucket. The attention entropy is especially low where the prediction entropy ranges from 0 to 0.5. For cases with prediction entropy greater than 1.5, the attention entropy saturates and no longer grows with the prediction entropy except the BART. While attention entropy is probably not “causing” the low decoder entropy per se, nevertheless decoder entropy provides a lens into the inner workings of the Transformer model.

Figure 5: Vocabulary projected attention attending to the last input , current input , current output , and next output . When the prediction entropy is low, the attention mostly focus a few tokens including the current input and current output .

Projecting Attention to Vocabulary.

We hypothesize that low decoder entropies may arise if the model is heavily attending to certain relevant tokens, particularly the (about to be predicted) token of time step and the input token of this time step , equivalent to . For the predicted token , we compute the vocabulary projected attention value where we accumulate the attention of all of the occurrences of the specified token in the document. The higher the value, the more attention put to the encoding token(s) which are predicted for this time step during decoding. We can define the value for last time step input , current time step input , and the not-yet-decoded token for next time step.

We show the relationship between the vocabulary projected attention and the prediction entropy in Figure 5. Visualizations for both models and both datasets show that when the prediction entropy is low, the attention focuses heavily on a few tokens including the current input token and the current token to predict. This suggests a potential mechanism where the model indexes into the source document by attending to , then strongly identifies and “reads off” as the next token to generate.

6 Conclusion

This work analyzes pre-trained summarization models via uncertainty, or the entropy of decoding decisions. We pursue several lines of inquiry: uncertainty can help us understand copying document spans vs. generating novel text, the behavior of models in different syntactic environments, and coarse properties of the model’s attention distribution. All of these give insight into what conditions most heavily restrict the model’s generation: generating an observed bigram (copying), low syntactic distance, and attention which can easily identify decoder context in the source document. We believe this approach can power future analyses of pre-trained text generation systems.


This work was partially supported by NSF Grant IIS-1814522, NSF Grant SHF-1762299, a gift from Salesforce Inc, and an equipment grant from NVIDIA. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources used to conduct this research. Results presented in this paper were obtained using the Chameleon testbed supported by the National Science Foundation. Thanks as well to the anonymous reviewers for their helpful comments.


  1. Code is available at https://github.com/jiacheng-xu/text-sum-uncertainty
  2. Specifically, google/pegasus-cnndailymail, google/pegasus-xsm, facebook/bart-large-cnn, and facebook/bart-large-xsum for PEGASUS and BART on these two datasets.
  3. Note that entropy generally increases as the variable’s domain grows: a uniform distribution over 10,000 outcomes has entropy 9.21, while a uniform distribution over 100,000 outcomes has entropy 11.51.
  4. Bigrams are defined based on tokens rather than wordpieces, and so may consist of more than two generation steps.
  5. In PEGASUS and BART models, the encoder and decoder attention during decoding are two separate distributions where the encoder attention looks at the encoding context and the decoder attention attends to the previously decoded tokens. In this paper we chiefly examine the encoder attention to understand how the model references the input document.


  1. Faithful to the Original: Fact Aware Neural Abstractive Summarization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Cited by: §1.
  2. What Does BERT Look at? An Analysis of BERT’s Attention. In Proceedings of the ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Cited by: §5.
  3. Adaptively sparse transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2174–2184. Cited by: §1.
  4. Calibration of Pre-trained Transformers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1.
  5. FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization. In Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL), Cited by: §1.
  6. Hierarchical Neural Story Generation. In Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL), Cited by: §1.
  7. Interpreting Recurrent and Attention-Based Neural Models: a Case Study on Natural Language Inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1.
  8. Evaluating Factuality in Generation with Dependency-level Entailment. In Findings of the Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP), Cited by: §1.
  9. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics, Cited by: §4.
  10. Teaching Machines to Read and Comprehend. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Cited by: §1, §2.
  11. Efficient Adaptation of Pretrained Transformers for Abstractive Summarization. arXiv preprint arXiv:1906.00138. Cited by: §1.
  12. The Curious Case of Neural Text Degeneration. In Proceedings of the Conference on International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  13. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Cited by: §1.
  14. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In ICML, Cited by: §5.
  15. Improved natural language generation via loss truncation. arXiv preprint arXiv:2004.14589. Cited by: §1.
  16. Sample Efficient Text Summarization Using a Single Pre-Trained Transformer. arXiv preprint arXiv:1905.08836. Cited by: §1.
  17. Constituency Parsing with a Self-Attentive Encoder. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §4.
  18. Revealing the Dark Secrets of BERT. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Cited by: §5.
  19. Evaluating the Factual Consistency of Abstractive Text Summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1.
  20. Neural Text Summarization: A Critical Evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Cited by: §1.
  21. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880. External Links: Link, Document Cited by: Understanding Neural Abstractive Summarization Models via Uncertainty, §1, §1, §2.
  22. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §1, §2.
  23. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1, §2.
  24. Analyzing uncertainty in neural machine translation. In International Conference on Machine Learning, pp. 3956–3965. Cited by: §1.
  25. A Deep Reinforced Model for Abstractive Summarization. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §3.
  26. ”Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), Cited by: §1.
  27. Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 324–333. External Links: Link Cited by: §1, §4.
  28. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §1, §3.
  29. Neural Language Modeling by Jointly Learning Syntax and Lexicon. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.
  30. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries. In Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL), Cited by: §1.
  31. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771. Cited by: §2.
  32. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. Proceedings of Machine Learning Research. Cited by: Understanding Neural Abstractive Summarization Models via Uncertainty, §1, §1, §2.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description