Compressive Summarization with Plausibility and Salience Modeling
Compressive summarization systems typically rely on a crafted set of syntactic rules to determine what spans of possible summary sentences can be deleted, then learn a model of what to actually delete by optimizing for content selection (ROUGE). In this work, we propose to relax the rigid syntactic constraints on candidate spans and instead leave compression decisions to two data-driven criteria: plausibility and salience. Deleting a span is plausible if removing it maintains the grammaticality and factuality of a sentence, and spans are salient if they contain important information from the summary. Each of these is judged by a pre-trained Transformer model, and only deletions that are both plausible and not salient can be applied. When integrated into a simple extraction-compression pipeline, our method achieves strong in-domain results on benchmark summarization datasets, and human evaluation shows that the plausibility model generally selects for grammatical and factual deletions. Furthermore, the flexibility of our approach allows it to generalize cross-domain: our system fine-tuned on only 500 samples from a new domain can match or exceed an in-domain extractive model trained on much more data.
Compressive summarization systems offer an appealing tradeoff between the robustness of extractive models and the flexibility of abstractive models. Compression has historically been useful in heuristic-driven systems Knight and Marcu (2000, 2002); Wang et al. (2013) or in systems with only certain components being learned Martins and Smith (2009); Woodsend and Lapata (2012); Qian and Liu (2013). End-to-end learning-based compressive methods are not straightforward to train: exact derivations of which compressions should be applied are not available, and deriving oracles based on ROUGE Berg-Kirkpatrick et al. (2011); Durrett et al. (2016); Xu and Durrett (2019); Mendes et al. (2019) optimizes only for content selection, not grammaticality or factuality of the summary. As a result, past approaches require significant engineering, such as creating a highly specific list of syntactic compression rules to identify permissible deletions Berg-Kirkpatrick et al. (2011); Li et al. (2014); Wang et al. (2013); Xu and Durrett (2019). Such manually specified, hand-curated rules are fundamentally inflexible and hard to generalize to new domains.
In this work, we build a summarization system that compresses text in a more data-driven way. First, we create a small set of high-recall constituency-based compression rules that cover the space of legal deletions. Critically, these rules are merely used to propose candidate spans, and the ultimate deletion decisions are controlled by two data-driven models capturing different facets of the compression process. Specifically, we model plausibility and salience of span deletions. Plausibility is a domain-independent requirement that deletions maintain grammaticality and factuality, and salience is a domain-dependent notion that deletions should maximize content selection (from the standpoint of ROUGE). In order to learn plausibility, we leverage a pre-existing sentence compression dataset Filippova and Altun (2013); our model learned from this data transfers well to the summarization settings we consider. Using these two models, we build a pipelined compressive system as follows: (1) an off-the-shelf extractive model highlights important sentences; (2) for each sentence, high-recall compression rules yield span candidates; (3) two pre-trained Transformer models Clark et al. (2020) judge the plausibility and salience of spans, respectively, and only spans which are both plausible and not salient are deleted.
We evaluate our approach on several summarization benchmarks. On CNN Hermann et al. (2015), WikiHow Koupaee and Wang (2018), XSum Narayan et al. (2018), and Reddit Kim et al. (2019), our compressive system consistently outperforms strong extractive methods by roughly 2 ROUGE-1, and on CNN/Daily Mail Hermann et al. (2015), we achieve state-of-the-art ROUGE-1 by using our compression on top of MatchSum Zhong et al. (2020) extraction. We also perform additional analysis of each compression component: human evaluation shows plausibility generally yields grammatical and factual deletions, while salience is required to weigh the content relevance of plausible spans according to patterns learned during training.
Furthermore, we conduct out-of-domain experiments to examine the cross-domain generalizability of our approach. Because plausibility is a more domain-independent notion, we can hold our plausibility model constant and adapt the extraction and salience models to a new setting with a small number of examples. Our experiments consist of three transfer tasks, which mimic real-world domain shifts (e.g., newswire → social media). By fine-tuning salience with only 500 in-domain samples, we demonstrate our compressive system can match or exceed the ROUGE of an in-domain extractive model trained on tens of thousands of document-summary pairs.
2 Plausible and Salient Compression
Our principal goal is to create a compressive summarization system that makes linguistically informed deletions in a way that generalizes cross-domain, without relying on heavily-engineered rules. In this section, we discuss our framework in detail and elaborate on the notions of plausibility and salience, two learnable objectives that underlie our span-based compression.
Plausible compressions are those that, when applied, result in grammatical and factual sentences; that is, sentences that are syntactically permissible, linguistically acceptable to native speakers Chomsky (1956); Schütze (1996), and factually correct from the perspective of the original sentence. Satisfying these three criteria is challenging: acceptability is inherently subjective and measuring factuality in text generation is a major open problem Kryściński et al. (2020); Wang et al. (2020); Durmus et al. (2020); Goyal and Durrett (2020). Figure 1 gives examples of plausible deletions: note that “of dozens of California wineries” would be grammatical to delete but significantly impacts factuality.
We can learn this notion of plausibility in a data-driven way with appropriately labeled corpora. In particular, Filippova and Altun (2013) construct a corpus from news headlines which can suit our purposes: these headlines preserve the important facts of the corresponding article sentence while omitting minor details, and they are written in an acceptable way. We can therefore leverage this type of supervision to learn a model that specifically identifies plausible deletions.
As we have described it, plausibility is a domain-independent notion that asks if a compression maintains grammaticality and factuality. However, depending on the summarization task, a compressive system may not want to apply all plausible compressions. In Figure 1, for instance, deleting all plausible spans results in a loss of key information. In addition to plausibility, we use a domain-dependent notion of salience, or whether a span should be included in summaries of the form we want to produce.
Labeled oracles for this notion of content relevance (Gillick and Favre, 2009; Berg-Kirkpatrick et al., 2011, inter alia) can be derived from gold-standard summaries using ROUGE Lin (2004). We compare the ROUGE score of an extract with and without a particular span as a proxy for its importance, then learn a model to classify which spans improve ROUGE if deleted. By deleting spans which are both plausible and not salient in Figure 1, we obtain a compressed sentence that captures core summary content with 28% fewer tokens, while still being fully grammatical and factual.
2.3 Syntactic Compression Rules
The base set of spans which we judge for plausibility and salience comes from a recall-oriented set of compression rules over a constituency grammar; that is, they largely cover the space of valid deletions, but include invalid ones as well.
Our rules allow for deletion of the following: (1) parentheticals (PRN) and fragments (FRAG); (2) adjectives (JJ) and adjectival phrases (ADJP); (3) adverbs (RB) and adverbial phrases (ADVP); (4) prepositional phrases (PP); (5) appositive noun phrases (NP–[,–NP–,]); (6) relative clauses (SBAR); and (7) conjoined noun phrases (e.g., NP–[CC–NP]), verb phrases (e.g., VP–[CC–VP]), and sentences (e.g., S–[CC–S]). Brackets specify the constituent span(s) to be deleted, e.g., CC–NP in NP–[CC–NP].
Much more refined rules would be needed to ensure grammaticality: for example, in She was [at the tennis courts], deletion of the PP leads to an unacceptable sentence. However, this base set of spans is nevertheless a good set of building blocks, and reliance on syntax gives a useful inductive bias for generalization to other domains Swayamdipta et al. (2018).
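As a rough illustration of how such recall-oriented rules propose candidates, the sketch below walks a constituency tree and collects every constituent whose label falls in the purely label-based subset of the rules above (rules 1 to 4 and 6). The nested-tuple tree encoding, the `DELETABLE_LABELS` set, and the function name are assumptions made for this example; the appositive and conjunction rules, which depend on sibling context, are omitted.

```python
# Hypothetical tree encoding: (label, child, child, ...) with plain strings as leaves.
DELETABLE_LABELS = {"PRN", "FRAG", "JJ", "ADJP", "RB", "ADVP", "PP", "SBAR"}

def propose_spans(tree, start=0, out=None):
    """Return (next_token_index, spans); each span is (label, start, end), end exclusive."""
    if out is None:
        out = []
    if isinstance(tree, str):  # a leaf covers exactly one token
        return start + 1, out
    label, children = tree[0], tree[1:]
    pos = start
    for child in children:
        pos, _ = propose_spans(child, pos, out)
    if label in DELETABLE_LABELS:
        out.append((label, start, pos))
    return pos, out

# "She was [at the tennis courts]": the PP is proposed even though deleting it
# is unacceptable, which is why a separate plausibility model is needed.
tree = ("S",
        ("NP", ("PRP", "She")),
        ("VP", ("VBD", "was"),
               ("PP", ("IN", "at"),
                      ("NP", ("DT", "the"), ("NN", "tennis"), ("NNS", "courts")))))
n_tokens, spans = propose_spans(tree)
```

The proposer deliberately over-generates; filtering is left entirely to the learned models.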
3 Summarization System
We now describe our compressive summarization system that leverages our notions of plausibility and salience. For an input document, an off-the-shelf extractive model first chooses relevant sentences, then for each extracted sentence, our two compression models decide which sub-sentential spans to delete. Although the plausibility and salience models have different objectives, they both output a posterior over constituent spans, and thus use the same base model architecture.
We structure our model’s decisions in terms of separate sentence extraction and compression decisions. Let $S_1, \dots, S_n$ denote random variables for sentence extraction where $S_i = 1$ indicates that the $i$th sentence is selected to appear in the summary. Let $C^{\text{PL}}_{11}, \dots, C^{\text{PL}}_{nm}$ denote random variables for the plausibility model, where $C^{\text{PL}}_{ij} = 1$ indicates that the $j$th span of the $i$th sentence is plausible. An analogous set of $C^{\text{SAL}}_{ij}$ is included for the salience model. These variables are modeled independently and fully specify a compressive summary; we describe this process more explicitly in Section 4.4.
Our system takes as input a document $D$ with sentences $s_1, \dots, s_n$, where each sentence $s_i$ has words $w_{i1}, \dots, w_{im}$. We constrain $n$ to be the maximum number of sentences that collectively have fewer than 512 wordpieces when tokenized. Each sentence has an associated constituency parse $T_i$ Kitaev and Klein (2018) comprised of constituents $c = (t, i', j')$ where $t$ is the constituent’s part-of-speech tag and $(i', j')$ are the indices of the text span. Let $C(T_i)$ denote the set of spans proposed for deletion by our compression rules (see Section 2.3).
Our extraction model is a re-implementation of the BERTSum model Liu and Lapata (2019), which predicts a set of sentences to select as an extractive summary. The model encodes the document sentences using BERT Devlin et al. (2019), prepending [CLS] to each sentence and adding [SEP] as a delimiter between sentences.
During fine-tuning, the [CLS] tokens are treated as sentence-level representations. We collect the [CLS] vectors $h_i \in \mathbb{R}^d$ over all sentences $s_i$, dot each with a weight vector $w \in \mathbb{R}^d$, and use a sigmoid to obtain selection probabilities: $P(S_i = 1 \mid D) = \sigma(h_i^\top w)$.
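The scoring step just described (a dot product with a weight vector followed by a sigmoid) can be sketched in a few lines. This is a minimal illustration on plain Python lists, assuming the [CLS] vectors have already been produced by the encoder; `selection_probs` and its argument names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def selection_probs(cls_vectors, w):
    """Selection probability sigmoid(h . w) for each sentence-level [CLS] vector h."""
    return [sigmoid(sum(h_d * w_d for h_d, w_d in zip(h, w))) for h in cls_vectors]

# Three toy 2-dimensional [CLS] vectors and a toy weight vector.
probs = selection_probs([[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]], [2.0, 0.0])
```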
Depicted in Figure 2, the compression model (instantiated twice; once for plausibility and once for salience) is a sentence-level model that judges which constituent spans should be deleted. We encode a single sentence at a time, adding [CLS] and [SEP] as in the extraction model. We obtain token-level representations using a pre-trained Transformer encoder: $[h_{i1}, \dots, h_{im}] = \text{Encoder}([w_{i1}, \dots, w_{im}])$.
We create a span representation for each candidate constituent $c_k$ proposed by the compression rules. For the $k$th constituent, using its span indices $(i', j')$, we select its corresponding token representations $[h_{i'}, \dots, h_{j'}]$. We then use span attention Lee et al. (2017) to reduce this span to a fixed-length vector $h^{\text{span}}_k$. Finally, we compute deletion probabilities using a weight vector $w \in \mathbb{R}^d$ as follows: $P(C_k = 1 \mid s_i) = \sigma({h^{\text{span}}_k}^\top w)$, where $C_k$ is either a plausibility or salience random variable.
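To make the span-scoring step concrete, here is a minimal sketch of attention pooling over a span followed by a sigmoid. It assumes a single linear attention scorer (`attn_w`), which is a simplification of the span attention in Lee et al. (2017); all names are hypothetical and the token vectors are plain Python lists.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def span_vector(token_reprs, start, end, attn_w):
    """Attention-weighted average of the token vectors in [start, end)."""
    span = token_reprs[start:end]
    weights = softmax([dot(h, attn_w) for h in span])
    dim = len(span[0])
    return [sum(a * h[d] for a, h in zip(weights, span)) for d in range(dim)]

def deletion_prob(token_reprs, start, end, attn_w, out_w):
    """Sigmoid of the pooled span vector dotted with an output weight vector."""
    h_span = span_vector(token_reprs, start, end, attn_w)
    return 1.0 / (1.0 + math.exp(-dot(h_span, out_w)))
```

The same function would be instantiated with separate parameters for the plausibility and salience models, since both score the same candidate spans.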
As alluded to in Section 2.3, there are certain cases where the syntactic compression rules license deleting a chain of constituents rather than individual ones. A common example of this is in conjoined noun phrases (NP–[CC–NP]) where if the second noun phrase NP is deleted, its preceding coordinating conjunction CC can also be deleted without affecting the grammaticality of the sentence. To avoid changing the compression model substantially, we relegate secondary deletions to a postprocessing step, where if a primary constituent like NP is deleted at test-time, its secondary constituents are also automatically deleted.
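The postprocessing step can be pictured as a simple set expansion. The sketch below assumes spans are `(start, end)` token-index pairs and that a mapping from each primary span to its secondary spans (`secondary_of`, a hypothetical name) has been built from the parse beforehand.

```python
def expand_deletions(deleted, secondary_of):
    """Add the secondary spans (e.g., a preceding CC) of every deleted primary span."""
    expanded = set(deleted)
    for span in deleted:
        expanded.update(secondary_of.get(span, ()))
    return expanded

# The NP at tokens [3, 5) was deleted, so its conjunction at [2, 3) goes too;
# the NP at [7, 9) was kept, so its conjunction is untouched.
secondary_of = {(3, 5): [(2, 3)], (7, 9): [(6, 7)]}
result = expand_deletions({(3, 5)}, secondary_of)
```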
4 Training and Inference
The extraction and compression models in our summarization system are trained separately, but both are used in a pipeline during inference. Because the summarization datasets we use do not come with labels for extraction and compression, we chiefly rely on structured oracles that provide supervision for our models. In this section, we describe our oracle design decisions, learning objectives, and inference procedures.
4.1 Extraction Supervision
4.2 Compression Supervision
Because plausibility and salience are two different views of compression, as introduced in Section 2.3, we have different methods for deriving their supervision. However, their oracles share the same high-level structure, which operates procedurally as follows: an oracle takes in as input an uncompressed sentence $x$, compressed sentence or paragraph $y$, and a similarity function $f$. Using the candidate constituents $c_1, \dots, c_K$ proposed by the compression rules for $x$, if $x$ without a constituent $c_k$ results in $f(x \setminus c_k, y) > f(x, y)$, we assign $c_k$ a positive “delete” label, otherwise we assign it a negative “keep” label. Intuitively, this oracle measures whether the deletion of a constituent causes $x$ to become closer to $y$. We set $f$ to ROUGE Lin (2004), primarily for computational efficiency, although more complex similarity functions such as BERTScore Zhang et al. (2020b) could be used without modifying our core approach. Below, we elaborate on the nature of $x$ and $y$ for plausibility and salience, respectively.
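The oracle procedure above can be sketched as follows. This is only an illustration: it substitutes a simple unigram-overlap F1 for a full ROUGE implementation, treats spans as `(start, end)` token indices, and assumes a strict improvement is required for a “delete” label; the function names are hypothetical.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 over token lists, a simplified stand-in for ROUGE-1."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def oracle_labels(x, y, spans, f=rouge1_f1):
    """Label each span 1 ("delete") if removing it from x raises f(x, y), else 0 ("keep")."""
    base = f(x, y)
    labels = []
    for start, end in spans:
        compressed = x[:start] + x[end:]
        labels.append(1 if f(compressed, y) > base else 0)
    return labels

x = "the jogger suffered multiple fractures according to police".split()
y = "the jogger suffered fractures".split()
labels = oracle_labels(x, y, [(5, 8), (1, 2)])
```

Here removing "according to police" raises precision without hurting recall, so it is labeled for deletion, whereas removing "jogger" lowers the score and is kept.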
We leverage labeled, parallel sentence compression data from news headlines to learn plausibility. Filippova and Altun (2013) create a dataset of 200,000 news headlines, each paired with the lead sentence of its corresponding article, where each headline $y$ is a compressed extract of the lead sentence $x$. Critically, the headline is a subtree of the dependency relations induced by the lead sentence, ensuring that $x$ and $y$ will have very similar syntactic structure. Filippova and Altun (2013) further conduct a human evaluation of the headline and lead sentence pairs and conclude that, with 95% confidence, annotators find the pairs “indistinguishable” in terms of readability and informativeness. This dataset therefore suits our purposes for plausibility as we have defined it.
Though the sentence compression data described above offers a reasonable prior on span-level deletions, the salience of a particular deletion is a domain-dependent notion that should be learned from in-domain data. One way to approximate this is to consider whether the deletion of a span in a sentence of an extractive summary increases ROUGE with the reference summary Xu and Durrett (2019), allowing us to estimate what types of spans are likely or unlikely to appear in a summary. We can therefore derive salience labels directly from labeled summarization data.
In aggregate, our system requires training three models: an extraction model ($\theta_E$), a plausibility model ($\theta_P$), and a salience model ($\theta_S$).
The extraction model optimizes log likelihood over each selection decision $S^{(i)}_j$ in document $D_i$, defined as $\sum_{j} \log P(S^{(i)}_j = S^{*(i)}_j \mid D_i)$, where $S^{*(i)}_j$ is the gold label for selecting the $j$th sentence in the $i$th document.
The plausibility model optimizes log likelihood over the oracle decision for each constituent $C^{\text{PL}}_{ij}$ in sentence $s_i$, defined as $\sum_{j} \log P(C^{\text{PL}}_{ij} = C^{*\text{PL}}_{ij} \mid s_i)$. The salience model operates analogously over the $C^{\text{SAL}}_{ij}$ variables.
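Both objectives are sums of log probabilities over independent binary decisions. A minimal sketch, assuming the model outputs the probability of the positive label for each decision and that gold labels are 0 or 1 (`log_likelihood` is a hypothetical helper, not a training loop):

```python
import math

def log_likelihood(probs, gold):
    """Sum of log P(decision = gold label) over independent binary decisions."""
    return sum(math.log(p if g == 1 else 1.0 - p) for p, g in zip(probs, gold))

# Confident, correct predictions score higher (closer to zero) than incorrect ones.
good = log_likelihood([0.9, 0.1], [1, 0])
bad = log_likelihood([0.9, 0.1], [0, 1])
```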
While our sentence selection and compression stages are modeled independently, structurally we need to combine these decisions to yield a coherent summary, recognizing that these models have not been optimized directly for ROUGE.
Our pipeline consists of three steps: (1) For an input document $D$, we select the top-$k$ sentences with the highest posterior selection probabilities $P(S_i = 1 \mid D; \theta_E)$. (2) Next, for each selected sentence $s_i$, we obtain plausible compressions $Z_P = \{c_{ij} : P(C^{\text{PL}}_{ij} = 1 \mid s_i; \theta_P) > \lambda_P\}$ and salient compressions $Z_S = \{c_{ij} : P(C^{\text{SAL}}_{ij} = 1 \mid s_i; \theta_S) > \lambda_S\}$, where $\lambda_P$ and $\lambda_S$ are hyperparameters discovered with held-out samples. (3) Finally, we only delete constituent spans licensed by both the plausibility and salience models, denoted as $Z_P \cap Z_S$, for each sentence. The remaining tokens among all selected sentences form the compressive summary.
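Steps (2) and (3) amount to thresholding two score lists and intersecting the results. The sketch below assumes each candidate span is a `(start, end)` token-index pair and that both models output the probability that deleting the span is licensed; the function and parameter names are hypothetical.

```python
def compress(tokens, spans, p_plaus, p_sal, lam_p, lam_s):
    """Delete only spans whose plausibility and salience scores both exceed their thresholds."""
    z_p = {s for s, p in zip(spans, p_plaus) if p > lam_p}
    z_s = {s for s, p in zip(spans, p_sal) if p > lam_s}
    drop = set()
    for start, end in z_p & z_s:
        drop.update(range(start, end))
    return [tok for i, tok in enumerate(tokens) if i not in drop]

tokens = "the jogger suffered multiple fractures , clotworthy said".split()
spans = [(3, 4), (5, 8)]
summary = compress(tokens, spans, p_plaus=[0.9, 0.8], p_sal=[0.2, 0.7],
                   lam_p=0.5, lam_s=0.5)
```

In this toy case, "multiple" is plausible to delete but the salience score keeps it, while the attribution clause clears both thresholds and is removed.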
We do not perform joint inference over the plausibility and salience models because plausibility is a necessary precondition in span-based deletion, as defined in Section 2.1. If, for example, a compression has a low plausibility score but high salience score, it will get deleted during joint inference, but this may negatively affect the well-formedness of the summary. As we demonstrate in Section 6.3, the plausibility model enforces strong guardrails that prevent the salience model from deleting arbitrary spans that result in higher ROUGE but at the expense of syntactic or semantic errors.
|Type||Model||R1||R2||RL|
|cmp||MatchSum + CUPS||44.69||20.71||40.86|
5 Experimental Setup
We benchmark our system first with an automatic evaluation based on ROUGE-1/2/L F1 Lin (2004).
We seek to answer three questions: (1) How does our compressive system stack up against our own extractive baseline and past extractive approaches? (2) Do our plausibility and salience modules successfully model their respective phenomena? (3) How can these pieces be used to improve cross-domain summarization?
Systems for Comparison.
We refer to our full compressive system as CUPS.
Because our approach is fundamentally extractive (albeit with compression), we chiefly compare against state-of-the-art extractive models: BERTSum Liu and Lapata (2019), the canonical architecture for sentence-level extraction with pre-trained encoders, and MatchSum Zhong et al. (2020), a summary-level semantic matching model that uses BERTSum to prune irrelevant sentences. These models outperform recent compressive systems Xu and Durrett (2019); Mendes et al. (2019); updating the architectures of these models and extending their oracle extraction procedures to the range of datasets we consider is not straightforward.
To contextualize our results, we also compare against a state-of-the-art abstractive model, PEGASUS Zhang et al. (2020a), a seq2seq Transformer pre-trained with “gap-sentences.” This comparison is not entirely apples-to-apples, as this pre-training objective uses very large text corpora (up to 3.8TB) in a summarization-specific fashion. We expect our approach to stack with further advances in pre-training.
Extractive, abstractive, and compressive approaches are typed as ext, abs, and cmp, respectively, throughout the experiments.
6 In-Domain Experiments
|opening statements in the murder trial of movie theater massacre suspect james holmes are scheduled for april 27, more than a month ahead of schedule, a colorado court spokesman said. holmes, 27, is charged as the sole gunman who stormed a crowded movie theater at a midnight showing of "the dark knight rises" in aurora, colorado, and opened fire, killing 12 people and wounding 58 more in july 2012. holmes, a one-time neuroscience doctoral student, faces 166 counts, including murder and attempted murder charges.|
|the accident happened in santa ynez california, near where crosby lives. crosby was driving at approximately 50 mph when he struck the jogger, according to california highway patrol spokesman don clotworthy. the jogger suffered multiple fractures, and was airlifted to a hospital in santa barbara, clotworthy said.|
|update: jonathan hyla said in an phone interview monday that his interview with cate blanchett was mischaracterized when an edited version went viral around the web last week. “she wasn’t upset,” he told cnn. blanchett ended the interview laughing, hyla said, and “she was in on the joke.”|
6.1 Benchmark Results
Compression consistently improves ROUGE, even when coupled with a strong extractive model.
Across the board, we see improvements in ROUGE when using CUPS. Our results particularly contrast with recent trends in compressive summarization where span-based compression (in joint and pipelined forms) decreases ROUGE over sentence extractive baselines Zhang et al. (2018); Mendes et al. (2019). Gains are especially pronounced on datasets with more abstractive summaries, where applying compression roughly adds +2 ROUGE-1; however, we note there is a large gap between extractive and abstractive approaches on tasks like XSum due to the amount of paraphrasing in reference summaries Narayan et al. (2018). Nonetheless, our system outperforms strong extractive models on these datasets, and also yields competitive results on CNN/DM. In addition, Table 3 includes representative summaries produced by our compressive system. The summaries are highly compressive: spans not contributing to the main event or story are deleted, while maintaining grammaticality and factuality.
Our compression module can also improve over other off-the-shelf extractive models.
The pipelined nature of our approach allows us to replace the current BERTSum Liu and Lapata (2019) extractor with any arbitrary, black-box model that retrieves important sentences. We apply our compression module on system outputs from MatchSum Zhong et al. (2020), the current state-of-the-art extractive model, and also see gains in this setting with no additional modification to the system.
6.2 Plausibility Study
Given that our system achieves high ROUGE, we now investigate whether its compressed sentences are grammatical and factual. The plausibility model is responsible for modeling these phenomena, as defined in Section 2.1, thus we analyze its compression decisions in detail. Specifically, we run the plausibility model on 50 summaries from each of CNN and Reddit, and have annotators judge whether the predicted plausible compressions are grammatical and factual with respect to the original sentence.
Because the plausibility model uses candidate spans from the high-recall compression rules (defined in Section 2.3), we compare our plausibility model against the baseline consisting of simply the spans identified by these rules. The results are shown in Table 4. On both CNN and Reddit, the plausibility model’s deletions are highly grammatical, and we also see evidence that the plausibility model makes more semantically-informed deletions to maintain factuality, especially on CNN.
|+ Plausibility Model||96.0||89.7||93.1||66.7|
Factuality performance is lower on Reddit, but incorporating the plausibility model on top of the compression rules results in a 6% gain in precision. There is still, however, a large gap between factuality in this setting and factuality on CNN, which we suspect is because Reddit summaries are different in style and structure than CNN summaries: they largely consist of short event narratives Kim et al. (2019), and so annotators may disagree on the degree to which deleting spans such as subordinate clauses impacts the meaning of the events described.
|NYT → CNN||CNN → Reddit||XSum → WikiHow||Average|
|+ Fine-Tune (500)||31.90||13.04||28.42||23.76||5.66||18.95||29.44||8.25||27.41||28.37||8.98||24.93|
|+ Fine-Tune (500)||33.98||13.25||30.39||25.01||5.96||20.10||30.52||8.44||28.48||29.84||9.22||26.32|
6.3 Compression Analysis
The experiments above demonstrate the plausibility model generally selects spans that, if deleted, preserve grammaticality and factuality. In this section, we dive deeper into how the plausibility and salience models work together in the final trained summary model, presenting evidence of typical compression patterns. We analyze (1) our default system CUPS, which deletes spans $Z_P \cap Z_S$ (those approved by both the plausibility and salience models); and (2) a variant CUPS-NoPl (without plausibility but with salience), which only deletes spans $Z_S$ (those approved by the salience model), to specifically understand what compressions the salience model makes without the plausibility model’s guardrails. Using 100 randomly sampled documents from CNN, we conduct a series of experiments detailed below.
On average, per sentence, 16% of candidate spans deleted by the salience model alone are not plausible.
For each sentence, our system exposes a list of spans for deletion, denoted by $Z_P \cap Z_S$ and $Z_S$ for CUPS and CUPS-NoPl, respectively, where $Z_P$ and $Z_S$ are the span sets approved by the plausibility and salience models. Because $Z_S$ is identical across both variants, we can compute the plausibility model’s rejection rate (16%), defined as $1 - |Z_P \cap Z_S| / |Z_S|$. Put another way, how many compressions does the plausibility model reject if partnered with the salience model? On average, per sentence, the plausibility model rejects 16% of spans approved by the salience model alone, so it does non-trivial filtering of the compressions. We observe a drop in the token-level compression ratio, from 26% in CUPS-NoPl to 24% in CUPS, which is partially a result of this. From a ROUGE-1/2 standpoint, the slight reduction in compression yields a peculiar effect: on this subset of summaries, CUPS-NoPl achieves 36.23/14.61 while CUPS achieves 36.1/14.79, demonstrating the plausibility model trades off some salient deletions (-R1) for overall grammaticality (+R2) Paulus et al. (2018).
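The rejection rate is the fraction of salience-approved spans that the plausibility model does not also approve. A minimal per-sentence sketch, assuming the two approved span sets are available as Python sets and treating a sentence with no salience-approved spans as having a rate of zero (both the helper name and that convention are assumptions):

```python
def rejection_rate(z_p, z_s):
    """Fraction of salience-approved spans that the plausibility model rejects."""
    if not z_s:
        return 0.0
    return 1.0 - len(z_p & z_s) / len(z_s)

# Four spans pass the salience model; two of them also pass plausibility.
rate = rejection_rate({(0, 2), (4, 6), (9, 11)}, {(4, 6), (9, 11), (12, 14), (15, 18)})
```

Averaging this quantity over sentences gives the per-sentence figure reported in the text.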
Using salience to discriminate between plausible spans increases ROUGE.
With CUPS, we perform a line search on the salience threshold $\lambda_S$, which controls the confidence threshold for deleting non-salient spans as described in Section 4.4.
7 Out-of-Domain Experiments
Additionally, we examine the cross-domain generalizability of our compressive summarization system. We set up three source → target transfer tasks guided by real-world settings: (1) NYT → CNN (one newswire outlet to another), (2) CNN → Reddit (newswire to social media, a low-resource domain), and (3) XSum → WikiHow (single to multiple sentence summaries with heavy paraphrasing).
For each transfer task, we experiment with two types of settings: (1) zero-shot transfer, where our system with parameters $[\theta_E; \theta_P; \theta_S]$ is directly evaluated on the target test set; and (2) fine-tuned transfer, where $[\theta_E; \theta_S]$ are fine-tuned with 500 target samples, then the resulting system with parameters $[\theta_E'; \theta_P; \theta_S']$ is evaluated on the target test set. As defined in Section 2.1, plausibility is a domain-independent notion, thus we do not fine-tune the plausibility model $\theta_P$.
Table 5 shows the results. Our system maintains strong zero-shot out-of-domain performance despite distribution shifts: extraction outperforms the lead-$k$ baseline, and compression adds roughly +1 ROUGE-1. This increase is largely due to compression improving ROUGE precision: extraction is adept at retrieving content-heavy sentences with high recall, and compression helps focus on salient content within those sentences.
More importantly, we see that performance via fine-tuning on 500 samples matches or exceeds in-domain extraction ROUGE. On NYT → CNN and CNN → Reddit, our system outperforms in-domain extraction baselines (trained on tens of thousands of examples), and on XSum → WikiHow, it comes within 0.3 in-domain average ROUGE. These results suggest that our system could be applied widely by crowdsourcing a relatively small number of summaries in a new domain.
8 Related Work
Our work follows in a line of systems that use auxiliary training data or objectives to learn sentence compression Martins and Smith (2009); Woodsend and Lapata (2012); Qian and Liu (2013). Unlike these past approaches, our compression system uses both a plausibility model optimized for grammaticality and a salience model optimized for ROUGE. Almeida and Martins (2013) leverage such modules and learn them jointly in a multi-task learning setup, but face an intractable inference problem in their model which needs sophisticated approximations. Our approach, by contrast, does not need such approximations or expensive inference machinery like ILP solvers Martins and Smith (2009); Berg-Kirkpatrick et al. (2011); Durrett et al. (2016). The highly decoupled nature of our pipelined compressive system is an advantage in terms of training simplicity: we use only simple MLE-based objectives for extraction and compression, as opposed to recent compressive methods that use joint training Xu and Durrett (2019); Mendes et al. (2019) or reinforcement learning Zhang et al. (2018). Moreover, we demonstrate our compression module can stack with state-of-the-art sentence extraction models, achieving additional gains in ROUGE.
One significant line of prior work in compressive summarization relies on heavily engineered rules for syntactic compression Berg-Kirkpatrick et al. (2011); Li et al. (2014); Wang et al. (2013); Xu and Durrett (2019). By relying on our data-driven objectives to ultimately perform compression, our approach can rely on a leaner, much more minimal set of constituency rules to extract candidate spans.
Gehrmann et al. (2018) also extract sub-sentential spans in a “bottom-up” fashion, but their method does not incorporate grammaticality and only works best with an abstractive model; thus, we do not compare to it in this work.
Recent work also demonstrates elementary discourse units (EDUs), spans of sub-sentential clauses, capture salient content more effectively than entire sentences Hirao et al. (2013); Li et al. (2016); Durrett et al. (2016); Xu et al. (2020). Our approach is significantly more flexible because it does not rely on an a priori chunking of a sentence, but instead can delete variably sized spans based on what is contextually permissible. Furthermore, these approaches require RST discourse parsers and in some cases coreference systems Xu et al. (2020), which are less accurate than the constituency parsers we use.
In this work, we present a compressive summarization system that decomposes span-level compression into two learnable objectives, plausibility and salience, on top of a minimal set of rules derived from a constituency tree. Experiments across both in-domain and out-of-domain settings demonstrate our approach outperforms strong extractive baselines while creating well-formed summaries.
This work was partially supported by NSF Grant IIS-1814522, NSF Grant SHF-1762299, a gift from Salesforce Inc., and an equipment grant from NVIDIA. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources used to conduct this research. Results presented in this paper were obtained using the Chameleon testbed supported by the National Science Foundation. Thanks as well to the anonymous reviewers for their helpful comments.
Appendix A Summarization Datasets
Table 1 lists training, development, and test splits for each dataset used in our experiments.
|New York Times||3||137,772||17,222||17,220|
|Max Sequence Length||512||256|
|Hyperparameter: Plausibility ($\lambda_P$)|
|Hyperparameter: Salience ($\lambda_S$)|
Appendix B Training Details
Table 2 details the hyperparameters for training the extraction and compression models. These hyperparameters are largely borrowed from previous work Devlin et al. (2019), and we do not perform any additional grid searches in the interest of simplicity. The pre-trained encoders are set to either bert-base-uncased or google/electra-base-discriminator from HuggingFace Transformers Wolf et al. (2019). Following previous work Liu and Lapata (2019); Zhong et al. (2020), we use the best performing model among the top three validation checkpoints.
Appendix C Inference Details
Our system uses two hyperparameters at test-time to control the level of compression performed by the plausibility and salience models. Table 3 shows the BERT- and ELECTRA-based system hyperparameters, respectively. We sweep the salience model threshold with a granularity of 0.05; across all datasets used in the in-domain experiments (CNN/DM, CNN, WikiHow, XSum, and Reddit), this process takes roughly 8 hours on a 32GB NVIDIA V100 GPU.
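The threshold sweep can be pictured as a one-dimensional grid search. The sketch below assumes thresholds range over [0, 1] in steps of 0.05 and that `score_fn` is a hypothetical callable returning held-out ROUGE for a given threshold; neither the range nor the function is specified in the text.

```python
def line_search(score_fn, step=0.05):
    """Return the threshold in [0, 1], on a grid of spacing `step`, that maximizes score_fn."""
    n = round(1.0 / step)
    candidates = [round(i * step, 10) for i in range(n + 1)]  # round away float drift
    return max(candidates, key=score_fn)

# Toy objective peaking at 0.6, standing in for held-out ROUGE.
best = line_search(lambda t: -(t - 0.6) ** 2)
```

In practice each evaluation of `score_fn` requires a full inference pass over the held-out set, which is what makes the sweep take hours.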
Appendix D Plausibility Study
|cmp||MatchSum + CUPS||32.83||9.24||30.53||26.42||5.09||19.76||26.60||6.60||21.43|
We conduct our human evaluation on Amazon Mechanical Turk, and set up the following requirements: annotators must (1) reside in the US; (2) have a HIT acceptance rate 95%; and (3) complete at least 50 HITs prior to this one. Each HIT comes with detailed instructions (including a set of representative examples) and 6 assignments. One of these assignments is a randomly chosen example from the instructions (the challenge question), and the other five are samples we use in our actual study. In each assignment, annotators are presented with the original sentence and a candidate span, and asked if deleting the span negatively impacts the grammaticality and factuality of the resulting, compressed sentence. Each annotator is paid 50 cents upon completing the HIT; this pay rate was calibrated to pay roughly $10/hour.
After all assignments are completed, we filter low-quality annotators according to two heuristics. An annotator is removed if they complete the assignment in under 60 seconds or answer the challenge question incorrectly. We see a substantial increase in agreement for both the grammaticality and factuality studies among the remaining annotators. The absolute agreement scores, as measured by Krippendorff’s α Krippendorff (1980), are shown in Table 4. Consistent with prior grammaticality evaluations in summarization Xu and Durrett (2019); Xu et al. (2020), agreement scores are objectively low due to the difficulty of the tasks, so we compare the annotations with expert judgements. An expert annotator (one of the authors of this paper uninvolved with the development of the plausibility model) performed the CNN annotation task. Using the majority vote among the crowdsourced annotations, we find that the regular and expert annotators concur 80% of the time on grammaticality and 60% of the time on factuality, which establishes a higher degree of confidence in the crowdsourced annotations when aggregated.
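The filtering and aggregation steps described here can be sketched as follows. This is a minimal illustration, not the annotation pipeline used in the study: the `Response` record, the field names, and the choice to drop every response from an annotator who fails either heuristic on any HIT are assumptions, as is breaking ties in the majority vote toward the negative label.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Response:
    annotator: str          # annotator id
    item: str               # id of the (sentence, candidate span) pair
    label: bool             # True if the deletion is judged acceptable
    seconds: float          # time spent on the HIT containing this item
    passed_challenge: bool  # challenge question answered correctly


def filter_responses(responses: List[Response],
                     min_seconds: float = 60.0) -> List[Response]:
    """Drop all responses from annotators who were too fast or who
    missed the challenge question on any HIT."""
    removed = {r.annotator for r in responses
               if r.seconds < min_seconds or not r.passed_challenge}
    return [r for r in responses if r.annotator not in removed]


def majority_vote(responses: List[Response]) -> Dict[str, bool]:
    """Aggregate labels per item; a tie resolves to False (assumption)."""
    votes: Dict[str, List[bool]] = defaultdict(list)
    for r in responses:
        votes[r.item].append(r.label)
    return {item: 2 * sum(labels) > len(labels)
            for item, labels in votes.items()}


def agreement(crowd: Dict[str, bool], expert: Dict[str, bool]) -> float:
    """Fraction of shared items on which crowd and expert labels match."""
    shared = [i for i in crowd if i in expert]
    if not shared:
        raise ValueError("no items in common")
    return sum(crowd[i] == expert[i] for i in shared) / len(shared)
```

The last function computes simple percentage agreement with the expert, which is the quantity reported as 80% and 60% above; it is distinct from a chance-corrected agreement coefficient.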
Appendix E System Results with BERT
Table 5 (CNN/DM, CNN, WikiHow, XSum, Reddit) shows results using BERT as the pre-trained encoder. While the absolute ROUGE results with BERT are lower than with ELECTRA, we still see a large improvement compared to the sentence extractive baseline.
Appendix F Extended MatchSum Results
On WikiHow, XSum, and Reddit, we additionally experiment with replacing the sentences extracted from CUPS with MatchSum Zhong et al. (2020) system outputs. From the results (see Table 6), we see that our system with MatchSum extraction achieves the most gains on Reddit, but its average performance on WikiHow and XSum is more comparable to the standard CUPS system.
Appendix G Plausibility Ablation
Table 7 shows results on CNN, WikiHow, XSum, and Reddit when the plausibility model is removed from CUPS. Consistent with the analysis in Section 6.3, we see the plausibility model is primarily responsible for gains in ROUGE-2, but in its absence, the salience model can delete arbitrary spans, resulting in gains in ROUGE-1 and ROUGE-L. This ablation demonstrates the need to analyze summaries outside of ROUGE, since notions of grammaticality and factuality cannot easily be ascertained by computing lexical overlap with a reference summary.
Appendix H Out-of-Domain Results
In Tables 8, 9, and 10, we show ROUGE results with standard deviations across 5 independent runs for the fine-tuning experiments on NYT → CNN, CNN → Reddit, and XSum → WikiHow, respectively. Despite fine-tuning with a random batch of 500 samples each time, we consistently see low variance across the runs, demonstrating that our system does not have an affinity towards particular samples in an out-of-domain setting.
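The reporting protocol (a fresh random subset of 500 samples per run, then a mean and standard deviation over the runs) can be sketched as below. The helper names are hypothetical, and the choice of the sample standard deviation (`statistics.stdev`) rather than the population form is an assumption, since the text does not say which was used.

```python
import random
import statistics
from typing import List, Sequence


def sample_subset(num_examples: int, seed: int, k: int = 500) -> List[int]:
    """Draw k distinct example indices for one fine-tuning run."""
    rng = random.Random(seed)  # a separate seed per run gives a separate subset
    return rng.sample(range(num_examples), k)


def summarize_runs(scores: Sequence[float]) -> str:
    """Format per-run scores as 'mean (std)' with two decimals."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (assumption)
    return f"{mean:.2f} ({std:.2f})"
```

Each run would fine-tune on `sample_subset(n, seed)` for a different seed, and the resulting ROUGE scores would be passed to `summarize_runs` to produce one table cell.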
Furthermore, we present an ablation of salience for the aforementioned transfer tasks in Table 11. On NYT → CNN, salience only helps increase ROUGE-L, but we see consistent increases in average ROUGE on CNN → Reddit and XSum → WikiHow. We can expect larger gains by fine-tuning salience on more samples, but even with 500 out-of-domain samples, our compression module benefits from the inclusion of the salience model.
Table 8: NYT → CNN
|Type||Model||R1 (std)||R2 (std)||RL (std)|
|ext||CUPS||33.74 (0.08)||13.19 (0.11)||30.46 (0.11)|
|cmp||CUPS||33.98 (0.06)||13.25 (0.11)||30.39 (0.07)|
Table 9: CNN → Reddit
|Type||Model||R1 (std)||R2 (std)||RL (std)|
|ext||CUPS||24.30 (0.20)||5.78 (0.08)||19.87 (0.11)|
|cmp||CUPS||25.01 (0.15)||5.96 (0.08)||20.10 (0.09)|
Table 10: XSum → WikiHow
|Type||Model||R1 (std)||R2 (std)||RL (std)|
|ext||CUPS||30.22 (0.05)||8.43 (0.03)||28.30 (0.03)|
|cmp||CUPS||30.52 (0.06)||8.44 (0.01)||28.48 (0.04)|
|NYT → CNN||CNN → Reddit||XSum → WikiHow|
Appendix I Reproducibility
Table 12 shows system results on the development sets of CNN/DM, CNN, WikiHow, XSum, and Reddit to aid the reproducibility of our system; both the extractive (ext) and compressive (cmp) variants of CUPS are included. Furthermore, in Table 13, we report several metrics to aid the training of the extraction and compression models. These metrics were recorded by training models on a 32GB NVIDIA V100 GPU with the hyperparameters listed in Table 2.
|Time Elapsed (hrs/min)||6h 48m||3h 4m||5h 52m||5h 5m||6h 6m||1h 59m||—|
|Time Elapsed (hrs/min)||3h 32m||1h 27m||2h 38m||3h 26m||3h 38m||0h 56m||1h 59m|
- Code and datasets available at https://github.com/shreydesai/cups
- BERT can be replaced with other pre-trained encoders, such as ELECTRA Clark et al. (2020), which we use for most experiments.
- The encoders between the extraction and compression modules are fine-tuned separately; in other words, our modules do not share any parameters.
- See Appendices B and C for training and inference hyperparameters, respectively.
- We found that using beam search to derive the oracle yielded higher oracle ROUGE, but also a significantly harder learning problem, and the extractive model trained on this oracle actually performed worse at test time.
- Our pipeline overall requires 3x more parameters than a standard Transformer-based extractive model (e.g., BERTSum). However, our compression module (which accounts for 2/3 of these parameters) can be applied on top of any off-the-shelf extractive model, so stronger extractive models with more parameters can be combined with our approach as well.
- Following previous work, we use pyrouge with the default command-line arguments: -c 95 -m -n 2
- See Appendix A for dataset splits.
- Compressive Summarization with Plausibility and Salience
- See Appendix D for further information on the annotation task and agreement scores.
- Our assumption is that posterior probabilities are calibrated, which holds true for various pre-trained Transformers across a range of tasks Desai and Durrett (2020).
- Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §8.
- Jointly Learning to Extract and Compress. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §1, §2.2, §8, §8.
- Syntactic Structures. Cited by: §2.1.
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1, §5, footnote 2.
- Calibration of Pre-trained Transformers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: footnote 11.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Cited by: Appendix B, §3.2.
- FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization. In Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL), Cited by: §2.1.
- Learning-Based Single-Document Summarization with Compression and Anaphoricity Constraints. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §1, §8, §8.
- Overcoming the Lack of Parallel Data in Sentence Compression. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: Table 13, §1, §4.2.
- Bottom-Up Abstractive Summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §8.
- A Scalable Global Model for Summarization. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing (ILP for NLP), Cited by: §2.2.
- Evaluating Factuality in Generation with Dependency-level Entailment. In Findings of the Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP), Cited by: §2.1.
- Teaching Machines to Read and Comprehend. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Cited by: Table 1, §1, §5.
- Single-Document Summarization as a Tree Knapsack Problem. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §8.
- Abstractive Summarization of Reddit Posts with Multi-level Memory Networks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Cited by: Table 1, §1, §5, §6.2.
- Constituency Parsing with a Self-Attentive Encoder. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §3.1.
- Statistics-Based Summarization—Step One: Sentence Compression. In Proceedings of the National Conference on Artificial Intelligence (AAAI) and Conference on Innovative Applications of Artificial Intelligence (IAAI), Cited by: §1.
- Summarization beyond Sentence Extraction: A Probabilistic Approach to Sentence Compression. Artificial Intelligence. Cited by: §1.
- WikiHow: A Large Scale Text Summarization Dataset. arXiv preprint arXiv:1810.09305. Cited by: Table 1, §1, §5.
- Content Analysis: An Introduction to Its Methodology. Sage. Cited by: Table 4, Appendix D.
- Evaluating the Factual Consistency of Abstractive Text Summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.1.
- End-to-end Neural Coreference Resolution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §3.3.
- Improving Multi-documents Summarization by Sentence Compression based on Expanded Constituent Parse Trees. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1, §8.
- The Role of Discourse Units in Near-Extractive Summarization. In Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue, Cited by: §8.
- ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §2.2, §4.1, §4.2, §5.
- Text Summarization with Pretrained Encoders. In Proceedings of the Conference on Empirical Methods of Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Cited by: §3.2, §4.1, §6.1.
- Single Document Summarization as Tree Induction. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Cited by: Appendix B, §5, §5.
- Summarization with a Joint Model for Sentence Extraction and Compression. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing (ILP for NLP), Cited by: §1, §8.
- Jointly Extracting and Compressing Documents with Summary State Representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Cited by: §1, §5, §6.1, §8.
- Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: Table 1, §1, §5, §6.1.
- A Deep Reinforced Model for Abstractive Summarization. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §6.3.
- Fast Joint Compression and Summarization via Graph Cuts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §1, §8.
- The New York Times Annotated Corpus. Cited by: Table 1, §5.
- The Empirical Base of Linguistics: Grammaticality Judgments and Linguistic Methodology. Cited by: §2.1.
- Syntactic Scaffolds for Semantic Structures. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.3.
- Asking and Answering Questions to Evaluate the Factual Consistency of Summaries. In Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL), Cited by: §2.1.
- A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §1, §8.
- HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771. Cited by: Appendix B.
- Multiple Aspect Summarization Using Integer Linear Programming. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing (EMNLP) and Computational Natural Language Learning (CoNLL), Cited by: §1, §8.
- Neural Extractive Text Summarization with Syntactic Compression. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Cited by: Appendix D, §1, §4.2, §5, §8, §8.
- Discourse-Aware Neural Extractive Text Summarization. In Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL), Cited by: Appendix D, §8.
- PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §5.
- BERTScore: Evaluating Text Generation with BERT. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.2.
- Neural Latent Extractive Document Summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §6.1, §8.
- Extractive Summarization as Text Matching. In Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL), Cited by: Appendix B, Table 6, Appendix F, §1, Table 2, §5, §6.1.