Discriminative Adversarial Search for Abstractive Summarization

We introduce a novel approach for sequence decoding, Discriminative Adversarial Search (DAS), which has the desirable properties of alleviating the effects of exposure bias without requiring external metrics. Inspired by Generative Adversarial Networks (GANs), wherein a discriminator is used to improve the generator, our method differs from GANs in that the generator parameters are not updated at training time and the discriminator is only used to drive sequence generation at inference time.

We investigate the effectiveness of the proposed approach on the task of Abstractive Summarization: the results obtained show that a naive application of DAS improves over the state-of-the-art methods, with further gains obtained via discriminator retraining. Moreover, we show how DAS can be effective for cross-domain adaptation. Finally, all results reported are obtained without additional rule-based filtering strategies, commonly used by the best performing systems available: this indicates that DAS can effectively be deployed without relying on post-hoc modifications of the generated outputs.


1 Introduction

In the context of Natural Language Generation (NLG), a majority of approaches propose sequence-to-sequence models trained via maximum likelihood estimation; a Teacher Forcing Williams and Zipser (1989) strategy is applied during training: ground-truth tokens are sequentially fed into the model to predict the next token. Conversely, at inference time, ground-truth tokens are not available: the model only has access to its previous outputs. In the literature Bengio et al. (2015); Ranzato et al. (2015), this mismatch is referred to as exposure bias: as mistakes accumulate, the model can diverge from the distribution seen at training time, resulting in poor generation outputs.

Several works have focused on alleviating this issue, proposing to optimize a sequence level metric such as BLEU or ROUGE: Wiseman and Rush (2016) used beam search optimisation while Ranzato et al. (2015) framed text generation as a reinforcement learning problem, using the chosen metric as reward. Still, these automated metrics suffer from known limitations: Sulem et al. (2018) showed how BLEU metrics do not reflect meaning preservation, while Novikova et al. (2017) pointed out that, for NLG tasks, they do not map well to human judgements.

Similar findings have been reported for ROUGE, in the context of abstractive summarization Paulus et al. (2017): for the same input, several correct outputs are possible; nonetheless, the generated output is often compared to a single human reference, given the lack of annotated data. Complementary metrics have been proposed to evaluate NLG tasks, based on Question Answering Scialom et al. (2019) or learned from human evaluation data Böhm et al. (2019). Arguably, though, the correlation of such metrics to human judgments is still unsatisfactory.

To tackle exposure bias, Generative Adversarial Networks (GANs) Goodfellow et al. (2014) represent a natural alternative to the proposed approaches: rather than learning a specific metric, the model learns to generate text that a discriminator cannot differentiate from human-produced content. However, the discrete nature of text makes the classifier signal non-differentiable. A solution would be to use reinforcement learning with the classifier prediction as a reward signal. Yet, due to reward sparsity and mode collapse Zhou et al. (2020), text GANs have so far failed to be competitive with state-of-the-art models trained with teacher forcing on NLG tasks Caccia et al. (2018); Clark et al. (2019), and are mostly evaluated on synthetic datasets.

Inspired by Generative Adversarial Networks, we propose an alternative approach for sequence decoding: first, a discriminator is trained to distinguish human-produced texts from machine-generated ones. Then, this discriminator is integrated into a beam search: at each decoding step, the generator output probabilities are refined according to the likelihood that the candidate sequence is human-produced. This is equivalent to optimizing the search for a custom and dynamic metric, learned to fit the human examples.

Under the proposed paradigm, the discriminator causes the output sequences to diverge from those originally produced by the generator. These sequences, adversarial to the discriminator, can be used to further fine-tune the discriminator: following the procedure used for GANs, the discriminator can be iteratively trained on the new predictions it has contributed to improve. This effectively creates a positive feedback loop for training the discriminator: until convergence, the generated sequences improve and become harder to distinguish from human-produced text. Additionally, the proposed approach allows us to dispense with custom rule-based strategies commonly used at decoding time, such as length penalty and n-gram repetition avoidance.

In GANs, the discriminator is used to improve the generator and is dropped at inference time. Our proposed approach differs in that, instead, we do not modify the generator parameters at training time, and use the discriminator at inference time to drive the generation towards human-like textual content.

The main contributions of this work can be summarized as:

  1. we propose Discriminative Adversarial Search (DAS), a novel sequence decoding approach that alleviates the effects of exposure bias and optimizes on the data distribution itself rather than on external metrics;

  2. we apply DAS to the abstractive summarization task, showing that even a naive application of DAS (i.e., without the self-retraining procedure) improves over the state of the art for various metrics;

  3. we report further significant improvements when applying discriminator retraining;

  4. finally, we show how the proposed approach can be effectively used for domain adaptation.

2 Related Work

2.1 Exposure Bias

Several research efforts have tackled the issue of exposure bias resulting from Teacher Forcing. Inspired by Venkatraman et al. (2015), Bengio et al. (2015) proposed a variation of Teacher Forcing wherein the ground truth tokens are incrementally replaced by the predicted words. Further, Professor Forcing Lamb et al. (2016) was devised as an adversarial approach in which the model learns to generate without distinction between training and inference time, when it has no more access to the ground truth tokens. Using automated metrics at coarser (sequence) rather than finer (token) granularity to optimize the model, Wiseman and Rush (2016) proposed a beam search variant to optimise the BLEU score in neural translation. Framing NLG as a Reinforcement Learning problem, Ranzato et al. (2015) used the reward as the metric to optimise. Paulus et al. (2017) applied a similar approach in abstractive summarization tasks, using the ROUGE metric as a reward; the authors observed that, despite the successful application of reinforcement, higher ROUGE does not yield better models: other metrics for NLG are needed. Finally, Zhang et al. (2019) proposed to select, among the different beams decoded, the one obtaining the highest BLEU score and then to fine-tune the model on that sequence.

2.2 Discriminators for Text Generation

Recent works have applied text classifiers as discriminators for different NLG tasks. Kryscinski et al. (2019) used them to detect factual consistency in the context of abstractive summarization; Zellers et al. (2019) applied discriminators to detect fake news, in a news generation scenario, reporting high accuracy (over 90%). Recently, Clark et al. (2019) proposed to train encoders as discriminators rather than language models, as an alternative to BERT Devlin et al. (2019); they obtained better performance while reducing training time. Further, it has been pointed out that abstractive summarization systems tend to be too extractive Kryściński et al. (2018), mainly because of the copy mechanism Vinyals et al. (2015). To improve the abstractiveness of the generated outputs, Gehrmann et al. (2018) proposed to train a classifier to detect which words from the input could be copied, and applied it as a filter during inference: to some extent, our work can be seen as a generalisation of this approach.

2.3 Text Decoding

Beam search is the de-facto algorithm used to decode generated sequences of text in NLG tasks. This decoding strategy selects the sequence with the highest probability, offering more flexibility than a greedy approach. Beam search has contributed to performance improvements of state-of-the-art models for many tasks, such as Neural Machine Translation, Summarization, and Question Generation Ott et al. (2018); Dong et al. (2019). However, external rules are usually added to further constrain the generation, such as the filtering mechanism for copy described above Gehrmann et al. (2018) or the inclusion of a length penalty factor Wu et al. (2016). Hokamp and Liu (2017) reported improvements when adding lexical constraints to beam search. Observing that neural models are prone to repetitions, while human-produced summaries contain more than 99% unique 3-grams, Paulus et al. (2017) introduced a rule in the beam search forbidding the repetition of 3-grams. Whether trained from scratch Paulus et al. (2017); Gehrmann et al. (2018) or based on pre-trained language models Dong et al. (2019), the current state-of-the-art results in abstractive summarization have been achieved using length penalty and 3-gram repetition avoidance.
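The 3-gram repetition-avoidance rule mentioned above can be sketched as a per-step filter; the helper below is a hypothetical illustration (names are ours), not the implementation used by the cited systems:

```python
def blocked_token_ids(prefix, vocab_size):
    """Tokens that would repeat an existing 3-gram if appended to `prefix`.

    Illustration of the 3-gram repetition-avoidance rule of Paulus et al.
    (2017): during beam search, any token completing a trigram already
    present in the hypothesis is excluded (in practice, its log-probability
    is set to -inf before selecting the next tokens).
    """
    if len(prefix) < 3:
        return set()
    # all trigrams already present in the hypothesis
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    last_two = tuple(prefix[-2:])
    return {t for t in range(vocab_size) if last_two + (t,) in seen}
```

With prefix `[1, 2, 3, 1, 2]`, appending token `3` would recreate the trigram `(1, 2, 3)`, so it is the only blocked candidate.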

3 Datasets

          len_src   len_tgt   abstr. (%)
CNN/DM    810.69    61.04     10.23
TL;DR     248.95    30.71     36.88
Table 1: Statistics of the CNN/DM and TL;DR summarization datasets. We report lengths in tokens for source articles (len_src) and summaries (len_tgt). Abstractiveness (abstr.) is the percentage of tokens in the target summary that are not present in the source article.

While the proposed approach is applicable to any Natural Language Generation (NLG) task, we focus on Abstractive Summarization in this study. One of the most popular datasets for summarization is the CNN/Daily Mail (CNN/DM) dataset Hermann et al. (2015); Nallapati et al. (2016). It is composed of news articles paired with multi-sentence summaries. The summaries were written by professional writers and consist of several bullet points corresponding to the important information present in the paired articles. For fair comparison, we used the exact same dataset version as previous works See et al. (2017); Gehrmann et al. (2018); Dong et al. (2019).1

Furthermore, to assess the possible benefits of the proposed approach in a domain adaptation setup, we conduct experiments on TL;DR, a large scale summarization dataset built from social media data Völske et al. (2017). We choose this dataset for two main reasons: first, its data is relatively out-of-domain compared to the samples in CNN/DM; second, its characteristics are also quite different: compared to CNN/DM, the TL;DR summaries are half as long and three times more abstractive, as detailed in Table 1. The training set is composed of around 3M examples and is publicly available,2 while the test set is kept hidden because of an ongoing public leaderboard evaluation. Hence, we randomly sampled 100k examples for training, 5k for validation and 5k for test. For reproducibility purposes, we make the TL;DR split used in this work publicly available.

4 Discriminative Adversarial Search

The proposed model is composed of a generator G (described in 4.1) coupled with a sequential discriminator D (described in 4.2): at inference time, for every new token generated by G, the score and the label assigned by D are used to refine the probabilities within a beam search, so as to select the top candidate sequences.

4.1 Generator

Abstractive summarization is usually cast as a sequence to sequence task:

$$p(y \mid x; \theta) = \prod_{t=1}^{n} p(y_t \mid y_{<t}, x; \theta) \quad (1)$$

where x is the input text, y is the summary composed of n tokens, and θ represents the parameters of the generator G. Under this framework, an abstractive summarizer is thus trained using article (x) and summary (y) pairs (e.g., via log-likelihood maximization).
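As a toy illustration of the log-likelihood objective under teacher forcing (names and the list-of-probabilities stand-in are ours, not the paper's):

```python
import math

def teacher_forcing_nll(gold_token_probs):
    """Negative log-likelihood of a summary under teacher forcing.

    gold_token_probs[t] is a stand-in for p(y_t | y_<t, x; theta): the
    probability the generator assigns to the gold token at step t when
    the gold prefix y_<t is fed in (no actual model is run here).
    Minimizing this sum over the training pairs is the maximum-likelihood
    objective of Eq. 1.
    """
    return -sum(math.log(p) for p in gold_token_probs)
```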

4.2 Discriminator

The objective of the discriminator D is to label a sequence as being human-produced or machine-generated. We use the discriminator to obtain a label at each generation step, rather than only for the entire generated sequence. For simplicity, we cast the problem as sequence to sequence, with a slight modification from our generator: at each generation step, the discriminator, instead of predicting the next token among the entire vocabulary V, generates a label among two classes. The maximum-likelihood training objective is the minimization of:

$$\mathcal{L}(\phi) = - \sum_{t=1}^{n} \log p(l \mid y_{1..t}, x; \phi), \qquad l \in \{\text{human}, \text{generated}\} \quad (2)$$

where φ are the discriminator's learnable parameters; at step t the model has access only to the context x (the source article) and the (partial) sequence y_{1..t} generated up to step t. We refer to it as a sequential discriminator, since it learns to predict the label for any partial sequence y_{1..t} of y.

The training samples for the discriminator are labeled as: i) human, corresponding to the gold summaries of the training set, or ii) generated, corresponding to the generator predictions, again on the training set. All the summary prefixes, from the first token up to the full sequence of n tokens, are labeled accordingly, allowing to train the discriminator on partial sequences. Summaries exceeding the maximum target length are truncated, consistently with previous works Dong et al. (2019).
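The construction of per-step discriminator examples can be sketched as follows; the function and label names are our own assumptions for illustration:

```python
def partial_sequence_examples(article, summary_tokens, label):
    """Expand one (article, summary) pair into per-step training examples.

    Every prefix y_1..t of the summary inherits the pair's label
    ('human' for gold summaries, 'generated' for model outputs), so the
    discriminator learns to classify partial sequences as well as
    complete ones.
    """
    return [(article, summary_tokens[:t], label)
            for t in range(1, len(summary_tokens) + 1)]
```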

4.3 Discriminative Beam Reranking

At inference time, our aim is to maximise the probability of y according to Equation 1. Thus, the best candidate sequence is the one that maximizes:

$$y^{*} = \underset{y}{\arg\max} \; \log p(y \mid x; \theta) \quad (3)$$

At each generation step t, the generator assigns a probability to each of the tokens in its vocabulary V. In the beam search procedure, more than one hypothesis is kept, allowing to explore a graph of possible sequences and finally select the one maximising the score among them. In practice, to limit the exploration, only the k most probable hypotheses are kept. This parameter k, which we refer to as the beam size throughout this paper, ranges between 1 and 5 in the literature.

In our method, we propose a new score s refining the generator score during the beam search with the log probability assigned by the discriminator, such that:

$$s(y_{1..t}) = \log p(y_{1..t} \mid x; \theta) + \alpha \, \log p(\text{human} \mid y_{1..t}, x; \phi) \quad (4)$$

where p(human | y_{1..t}, x; φ) is the discriminator output probability for the human label on y_{1..t}; α is used as a weighting factor: for α = 0, the discriminator has no impact and s reduces to the generator score; otherwise, the probabilities of the predicted tokens are proportionally affected by the probabilities assigned by the discriminator to the sequence. While theoretically such probabilities can be computed for the entire vocabulary, in practice, to keep the complexity manageable, we limit the exploration to the N best candidates, similarly to what is usually done within classical beam search.
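The combined score of Eq. 4 reduces to a one-line function (the function and argument names are ours):

```python
import math

def das_score(gen_logprob, p_human, alpha):
    """Combined beam score s(y_1..t), sketching Eq. 4.

    gen_logprob : generator log-probability log p(y_1..t | x)
    p_human     : discriminator probability that y_1..t is human-produced
    alpha       : weighting factor; alpha = 0 recovers plain beam search
    """
    return gen_logprob + alpha * math.log(p_human)
```

With alpha > 0, of two candidates with equal generator score, the one the discriminator finds more human-like is ranked higher.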

This procedure is illustrated in Algorithm 1.

Require: k: beam size, T: maximum length, N: number of re-ranked hypotheses, α: discriminator weighting factor
1:  H ← {Start-Of-Sentence}
2:  for t = 1 to T do
3:     Compute generator scores for all hypotheses in H
4:     Push the best hypotheses from H onto the candidate set C
5:     for each of the N best candidates in C do
6:        Compute the discriminator score
7:        Update the candidate score following Eq. 4
8:     end for
9:     H ← k best hypotheses from C
10:    if all k hypotheses are finished then
11:       return the best finished hypothesis
12:    end if
13: end for
Algorithm 1: DAS, a beam search algorithm with the proposed discriminator re-ranking mechanism highlighted.
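A self-contained toy version of the above procedure might look as follows; `gen_step` and `disc_score` stand in for the generator and discriminator, and all names and defaults are our assumptions rather than the paper's settings:

```python
import math
from heapq import nlargest

def das_beam_search(gen_step, disc_score, eos, k=4, n_rerank=8,
                    alpha=1.0, max_len=20):
    """Toy sketch of a DAS-style beam search.

    gen_step(prefix)   -> dict {token: log-prob of that next token}
    disc_score(prefix) -> probability in (0, 1] that the prefix is
                          human-produced
    """
    beams = [((), 0.0)]            # (prefix, generator log-prob)
    finished = []
    for _ in range(max_len):
        # expand every hypothesis with the generator
        candidates = [(prefix + (tok,), logp + lp)
                      for prefix, logp in beams
                      for tok, lp in gen_step(prefix).items()]
        # keep the n_rerank best by generator score only
        top = nlargest(n_rerank, candidates, key=lambda c: c[1])
        # re-rank them with the discriminator (Eq. 4) and keep k
        reranked = nlargest(
            k, top,
            key=lambda c: c[1] + alpha * math.log(disc_score(c[0])))
        beams = []
        for prefix, logp in reranked:
            (finished if prefix[-1] == eos else beams).append((prefix, logp))
        if not beams:
            break
    pool = finished or beams
    return max(pool,
               key=lambda c: c[1] + alpha * math.log(disc_score(c[0])))[0]
```

With alpha = 0 this degenerates to plain beam search; with a large alpha, the discriminator can steer decoding toward a sequence the generator alone would not have ranked first.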

4.4 Retraining the Discriminator

Figure 1: DAS self-training procedure: the generated examples are improved by the discriminator, and then fed back to the discriminator in a self-training loop.

Under the proposed paradigm, as mentioned in Section 1, the discriminator can be fine-tuned on the outputs that were improved via the re-ranking (see Eq. 4). Hence, inspired by GAN training, we iteratively retrain the discriminator on the new predictions until convergence. The retraining procedure is detailed in Figure 1.
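The self-training loop of Figure 1 can be sketched as below; `das_decode` and `fine_tune` are hypothetical callables standing in for DAS decoding and discriminator fine-tuning, and are not the paper's API:

```python
def retrain_discriminator(disc, das_decode, fine_tune, articles, gold,
                          rounds=1):
    """Sketch of the discriminator self-training loop (names assumed).

    das_decode(disc, article)  -> summary decoded with the DAS procedure
                                  using the current discriminator
    fine_tune(disc, examples)  -> discriminator fine-tuned on labeled
                                  (article, summary, label) triples
    """
    for _ in range(rounds):
        # summaries that already benefited from (i.e. fooled) the
        # current discriminator
        generated = [das_decode(disc, a) for a in articles]
        examples = ([(a, s, 'generated') for a, s in zip(articles, generated)]
                    + [(a, s, 'human') for a, s in zip(articles, gold)])
        disc = fine_tune(disc, examples)
    return disc
```

Each round thus confronts the discriminator with its own hardest cases, until the generated sequences stop improving.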

5 Experimental Protocol


In this work, we build upon the Unified Language Model for natural language understanding and generation (UniLM) proposed by Dong et al. (2019); it is the current state-of-the-art model for summarization.3 This model can be described as a Transformer Vaswani et al. (2017) whose weights are first initialised from BERT. However, BERT is an encoder trained with bi-directional self-attention: it can be used in Natural Language Understanding (NLU) tasks but not directly for generation (NLG). Dong et al. (2019) proposed a method to unify BERT for NLU and NLG: resuming its training, this time with a unidirectional loss; after this step, the model can be directly fine-tuned on any NLG task.

For our ablation experiments, to save time and computation, we do not use UniLM (345 million parameters). Instead, we follow the approach proposed by the authors Dong et al. (2019), with two differences: 1) we start from BERT-base (110 million parameters) and 2) we do not extend the pre-training. We observed only a small degradation compared to starting from UniLM. We refer to this smaller model as BERT-gen.

For our final results we instead use the exact same UniLM checkpoint made publicly available by Dong et al. (2019) for Abstractive Summarization.


As detailed in Section 4.2, the discriminator model is also based on a sequence to sequence architecture. Thus, we can reuse BERT-gen, initialising it in the same way as the generator. The training data from CNN/DM is used to train the model; for each sample, the discriminator has access to two training examples: the human reference (labeled as human) and a generated summary (labeled as generated).

Hence, the full training data available for the discriminator amounts to twice the number of generator training pairs. However, as detailed in the following Section, the discriminator does not need a lot of data to achieve high accuracy. Therefore, we only used 150k training examples, split into 90% for training, 5% for validation and 5% for test. Unless otherwise specified, this data is only used to train/evaluate the discriminator.

Implementation details

All models are implemented in PyText Aly et al. (2018). For all our experiments we used a single RTX 2080 Ti GPU.

To train the discriminator, we used the Adam optimiser with the hyper-parameters recommended for BERT, a batch size of 4 and an accumulated batch size of 32. We trained it for 5 epochs; each epoch took 100 minutes on 150k samples.

During discriminator retraining, the generator is needed and thus additional memory is required: all else equal, we decreased the batch size to 2. The self-training process takes one epoch to converge, in about 500 minutes: 200 minutes for training the discriminator and 300 minutes to generate the summaries with the search procedure described in Algorithm 1.

5.1 Metrics

The evaluation of NLG models remains an open research question. Most previous works report n-gram-based metrics such as ROUGE Lin (2004) or BLEU Papineni et al. (2002). ROUGE-n is a recall-oriented metric counting the percentage of n-grams in the gold summaries that are present in the evaluated summary. Conversely, BLEU is precision-oriented.

However, as discussed in Section 1, these metrics do not correlate sufficiently with human judgments. For summarization, Louis and Nenkova (2013) showed how this issue becomes even more severe when few gold references are available. Unfortunately, annotating large scale datasets is not realistic: in practice, all the large scale summarization datasets rely on web-scraping to gather text-summary pairs.

For these reasons, See et al. (2017) suggested to systematically compare summarization systems with other metrics such as novelty and the number of repetitions. Following the authors’ recommendation, we report the following measures for all our experiments: i) Novelty (nov-n), as the percentage of novel n-grams w.r.t. the source text, indicating the abstractiveness of a system; ii) Repetition (rep-n), as the percentage of n-grams that occur more than once in the summary; and iii) Length (len), as the length in tokens of the summary.

It is important to note that the objective is not to maximize those measures, but to minimize the difference w.r.t. human-quality summaries. Hence, for any measure m above, we report the difference Δm = m(system) − m(human), so that 0 indicates no gap with the human summaries.
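A minimal sketch of the reported measures; the token-level treatment and the sign convention for the difference Δ are our assumptions:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty(summary, source, n):
    """nov-n: percentage of summary n-grams absent from the source."""
    grams = ngrams(summary, n)
    source_grams = set(ngrams(source, n))
    return 100.0 * sum(g not in source_grams for g in grams) / len(grams)

def repetition(summary, n):
    """rep-n: percentage of summary n-grams occurring more than once."""
    grams = ngrams(summary, n)
    counts = Counter(grams)
    return 100.0 * sum(counts[g] > 1 for g in grams) / len(grams)

def delta(system_value, human_value):
    """Reported score: signed gap w.r.t. human summaries (0 = no gap)."""
    return system_value - human_value
```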

6 Preliminary study

Figure 2: Accuracy of two discriminators: one is given access to the source context x while the other is not. The abscissa corresponds to the partial length t of the sequence to discriminate, so that t = 2 corresponds to discriminating knowing the first two tokens only, and so on.

High discriminator accuracy is of utmost importance for DAS to improve the decoding search. In Fig. 2 we plot the discriminator accuracy against the generation step t, each point corresponding to the prediction for the partial sequence y_{1..t} of the summary (see Eq. 2). As an ablation, we report the accuracy for a discriminator which is not given access to the source article x. As one would expect, the scores improve with higher t, from 65% for the shortest prefixes to 98% for the longest ones: the longer the evaluated summary prefix, the easier it is to discriminate. This observed high accuracy indicates the potential benefit of using the discriminator signal to improve the generated summaries.

When trained without access to the source article x (orange plot in Fig. 2), the discriminator has access to little contextual and semantic information, and its accuracy is lower than that of a discriminator with access to x. In Fig. 2, the shaded area between the two curves represents the discrimination performance improvement attributable to the source article x. It increases at first and starts shrinking afterwards. After step 61, corresponding to the average length of the human summaries (see Table 1), the performance of the discriminator without context quickly increases, indicating that the generated sequences contain relatively easy-to-spot mistakes. This might be due to the increased difficulty for the generator to produce long and correct sequences, as errors accumulate over time.

6.1 Impact of N and α

To assess the behavior of DAS, we conducted experiments with BERT-gen for both the generator and the discriminator, using different values for N and α. All models are trained on the same training data from CNN/DM, and the figures reported in Tables 2 and 3 are the evaluation results averaged across 3 runs on three different subsets (of size 1k) randomly sampled from the validation split.

In this preliminary study, we compare i) BERT-gen, i.e. the model without a discriminator, ii) DAS-naive, where the discriminator is not self-retrained, and iii) DAS, where the discriminator is iteratively retrained. As previously mentioned, for the repetition, novelty and length measures, we report the difference w.r.t. human summaries: the lower the better, with 0 indicating no difference w.r.t. human.

N              DAS-naive     DAS
1 (BERT-gen)   27.70±0.3     27.70±0.3
5              27.51±0.3     29.70±0.3
10             29.18±0.3     29.81±0.2

1 (BERT-gen)   11.71±0.1     11.71±0.1
5              11.22±0.1     10.05±0.2
10             10.83±0.3      9.82±0.1

1 (BERT-gen)   -9.84±0.1     -9.84±0.1
5              -7.24±0.1     -3.05±0.1
10             -3.14±0.1     -1.42±0.1

1 (BERT-gen)   -21.49±1.2    -21.49±1.2
5              -17.54±0.5    -11.26±0.4
10             -13.77±0.8    -10.45±0.4

Table 2: Scores for DAS-naive and DAS with varying N (mean ± std; each block of rows corresponds to one of the reported measures).

The parameter N corresponds to the number of candidate hypotheses re-ranked by the discriminator (see Sec. 4.3). With N = 1, no re-ranking is performed and the model is equivalent to BERT-gen. In Table 2, for DAS-naive and DAS, we vary only N, with all other parameters fixed. We empirically observe that both increasing N and retraining the discriminator help to better fit the human distribution (i.e., lower Δ): compared to BERT-gen, DAS models generate more novel words, are shorter and less repetitive, and also obtain performance gains in terms of BLEU.

α                DAS-naive     DAS
0.0 (BERT-gen)   27.70±0.3     27.70±0.3
0.5              27.51±0.3     29.70±0.3
1                28.38±0.3     29.25±0.2
5                24.26±0.4     27.47±0.4

0.0 (BERT-gen)   11.71±0.1     11.71±0.1
0.5              11.22±0.1     10.05±0.2
1                10.70±0.2      9.33±0.1
5                 7.98±0.2      6.57±0.2

0.0 (BERT-gen)   -9.84±0.1     -9.84±0.1
0.5              -7.24±0.1     -3.05±0.1
1                -4.11±0.1     -3.10±0.1
5                -7.11±0.1     -3.85±0.1

0.0 (BERT-gen)   -21.49±1.2    -21.49±1.2
0.5              -17.54±0.5    -11.26±0.4
1                -12.85±0.4     -8.93±0.4
5                 -2.19±0.3     -5.49±0.3

Table 3: Scores for DAS-naive and DAS with varying α (mean ± std; each block of rows corresponds to one of the reported measures).

The factor α controls the impact of the discriminator predictions when selecting the next token to generate (see Eq. 4). With α = 0, the discriminator is deactivated and only the generator probabilities are used (corresponding to Eq. 1): the model is then equivalent to BERT-gen. Consistently with the results obtained when varying N, we observe DAS > DAS-naive > BERT-gen for moderate values of α. However, for the largest value of α, i) BLEU scores are found to decrease and ii) the length resembles less that of human summaries. This could indicate that a limit has been reached: the higher the α, the more the discriminator influences the selection of the next word. When this selection departs too far from the generator's top probabilities, it may adversely affect the generator, which struggles to represent a sequence too far from what was seen during training.

7 Results and discussion

len nov-1 nov-3 rep-1 rep-3 ROUGE-1 ROUGE-L BLEU-1
See et al. (2017) - - - - - 36.38 34.24 -
Gehrmann et al. (2018) - - - - - 41.22 38.34 -
Kryściński et al. (2018) - 10.10 32.84 - - 40.19 37.52 -
UniLM -40.37 8.35 7.98 -27.99 0.12 43.08 40.34 34.24
UniLM (no rules) -45.57 8.58 7.98 -31.41 -6.88 42.98 40.54 34.46
DAS-naive -29.75 6.05 2.80 -28.21 -4.60 42.90 40.05 35.69
DAS -16.81 6.69 2.59 -25.21 -2.40 44.05 40.58 35.94
Table 4: Results on CNN/DM test set for the previous works as well as our proposed models.
len nov-1 nov-3 rep-1 rep-3 ROUGE-1 ROUGE-L BLEU-1
UniLM -12.11 27.16 5.49 -6.87 0.19 18.66 15.49 16.91
UniLM (no rules) -13.11 30.16 5.69 -7.87 -3.77 18.76 14.49 17.14
DAS-naive -10.76 19.68 4.58 -10.81 -5.05 18.19 13.30 15.41
DAS -2.72 19.05 1.01 -3.42 -1.33 19.76 14.92 17.59
Table 5: Results on TL;DR test set for our proposed model in transfer learning scenarios.

In our preliminary study, we identified the best performing configuration of N and α. We apply this configuration in our main experiments, for fair comparison using the state-of-the-art UniLM model checkpoint.4 We report the results obtained on the CNN/DM test set in Table 4.

Confirming the outcomes of our preliminary study, DAS performs favorably compared to previous works on all the metrics. Compared to UniLM, both DAS-naive and DAS are closer to the target data distribution: they significantly reduce the gap with human-produced summaries over all measures. The length of DAS summaries differs from the human length by 16.81 tokens on average, as opposed to 40.37 tokens for UniLM and 45.57 without the length penalty. DAS is also more abstractive, averaging only 2.59 novel 3-grams fewer than the human summaries, as opposed to 7.98 for UniLM. Notably, the proposed approach also outperforms Kryściński et al. (2018) in terms of novelty, although their model was trained with novelty as a reward in a reinforcement learning setup.

UniLM applies a 3-gram repetition avoidance rule, which is why its 3-gram repetition rate is extremely close to that of human summaries (a difference of only 0.12). Without any such post-hoc rule, DAS generation remains closer to the human repetition rate than UniLM without it. Incidentally, our approach also outperforms previous works and achieves, to the best of our knowledge, a new state of the art for ROUGE.

Domain Adaptation

Figure 3: Learning curves for discriminators trained on TL;DR with 1k, 10k and 100k examples. The abscissa corresponds to the partial length t of the sequence to discriminate, so that t = 2 corresponds to discriminating knowing only the context and the first two tokens of the summary, and so on.

Further, we explore a domain adaptation scenario, applying DAS to a second dataset, TL;DR. This dataset is built from social media data, as opposed to the news articles of CNN/DM, and differs from the latter in several respects, as described in Section 3. In this scenario, we keep the previously used generator (i.e., the UniLM checkpoint trained on CNN/DM) and only train the discriminator, on a subset of TL;DR training samples. This setup has practical applications in scenarios where limited data is available: indeed, learning to generate is harder than learning to discriminate, and requires a large amount of examples Gehrmann et al. (2018). A discriminator can be trained with relatively few samples: in Fig. 3 we show the learning curves for discriminators trained from scratch on TL;DR training subsets of varying size. The samples are balanced: a training set size of 10k means that 5k gold summaries are used, along with 5k generated ones. We observe that only 1k examples allow the discriminator to reach an accuracy of 82.5% at the early discrimination steps. This score, higher than the one obtained on CNN/DM (see Fig. 2), is explained by the relatively lower quality of the out-of-domain generator outputs, which makes it easier for the discriminator to recognise the correct label.

Figure 4: Vocabulary frequency for the most frequent words generated by the different models, for CNN/DM (left) and TL;DR (right) corpora.

The results obtained on TL;DR are reported in Table 5. Given the strong accuracy of the discriminator, we observe larger improvements of DAS over UniLM than on CNN/DM: the generated summaries differ from the human length by only 2.72 tokens, as opposed to 12.11 for UniLM. They also contain more novelty and fewer repetitions. In terms of ROUGE and BLEU, DAS also compares favorably, with the exception of ROUGE-L.

This might be due to the shorter length of DAS summaries compared to UniLM: ROUGE is a recall-oriented metric and ROUGE-L is computed on the longest common subsequence w.r.t. the ground truth. Models participating in the public TL;DR leaderboard5 are omitted here, since they are trained on TL;DR data and evaluated on a hidden test set. Nonetheless, assuming that the distribution of our sampled test set is similar to that of the official test set, we observe that our approach obtains performance comparable to the state of the art, under a domain-adaptation setup and using only 1k training examples (exclusively for the discriminator) out of an available training set of 3M examples.


In Fig. 4 we report the vocabulary frequency distributions for the different models and the human summaries. We observe that DAS comes closest to the human distribution, followed by DAS-naive, both significantly outperforming UniLM. This shows the benefit of DAS at inference time for producing more human-like summaries.

Figure 5: Distribution of the 3-grams repetitions over their position in the sequence on CNN/DM.

Turning to the problem of exposure bias, in Fig. 5 we report the distribution of 3-gram repetitions over their relative position in the sequence for the different models, on CNN/DM. We observe that the gap between UniLM and human grows faster along the sequence than that between DAS and human, indicating that DAS contributes to reducing the effect of exposure bias. Rather than exclusively targeting exposure bias (as in Scheduled Sampling or Professor Forcing), or relying on automatic metrics as in reinforcement learning approaches, we propose optimizing toward a discriminator instead of discrete metrics: besides reducing the exposure bias issue, this allows improvements on the various other aspects captured by a discriminator.

8 Conclusion

We introduced Discriminative Adversarial Search (DAS), a novel methodology for sequence decoding that directly optimizes on the data distribution rather than on external metrics.

Applied to abstractive summarization, we observe that the distribution of the generated sequences is indeed closer to that of human-written summaries over several measures, while also improving over the state of the art. We report extensive ablation analyses, and show the benefits of the approach in a domain-adaptation setup. Our results highlight the effectiveness of discriminators for text generation.

In future work, we plan to apply DAS to other tasks such as machine translation and dialogue systems.

Footnotes

  1. Publicly available at https://github.com/microsoft/unilm#abstractive-summarization—-cnn—daily-mail
  2. https://zenodo.org/record/1168855
  3. Code and models available at https://github.com/microsoft/unilm#abstractive-summarization—-cnn—daily-mail
  4. As publicly released by the authors.
  5. https://tldr.webis.de/

References

  1. PyText: a seamless path from NLP research to production. arXiv preprint arXiv:1812.08729.
  2. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179.
  3. Better rewards yield better summaries: learning to summarise without references. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3101–3111.
  4. Language GANs falling short. arXiv preprint arXiv:1811.02549.
  5. ELECTRA: pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.
  6. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
  7. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pp. 13042–13054.
  8. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792.
  9. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  10. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1693–1701.
  11. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1535–1546.
  12. Evaluating the factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840.
  13. Improving abstraction in text summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1808–1817.
  14. Professor forcing: a new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pp. 4601–4609.
  15. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81.
  16. Automatically assessing machine summary content without a gold standard. Computational Linguistics 39(2), pp. 267–300.
  17. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.
  18. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2241–2252.
  19. Analyzing uncertainty in neural machine translation. arXiv preprint arXiv:1803.00047.
  20. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
  21. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
  22. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
  23. Answers unite! Unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3237–3247.
  24. Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
  25. BLEU is not suitable for the evaluation of text simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 738–744.
  26. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  27. Improving multi-step prediction of learned time series models. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
  28. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700.
  29. TL;DR: mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, Copenhagen, Denmark, pp. 59–63.
  30. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1(2), pp. 270–280.
  31. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1296–1306.
  32. Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  33. Defending against neural fake news. In Advances in Neural Information Processing Systems, pp. 9051–9062.
  34. Bridging the gap between training and inference for neural machine translation. arXiv preprint arXiv:1906.02448.
  35. Self-adversarial learning with comparative discrimination for text generation. In International Conference on Learning Representations.