
Answers Unite!
Unsupervised Metrics for Reinforced Summarization Models

Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano
Sorbonne Université, CNRS, LIP6, F-75005 Paris, France
reciTAL, Paris, France

Abstractive summarization approaches based on Reinforcement Learning (RL) have recently been proposed to overcome classical likelihood maximization. RL makes it possible to consider complex, possibly non-differentiable, metrics that globally assess the quality and relevance of the generated outputs. ROUGE, the most widely used summarization metric, is known to suffer from a bias towards lexical similarity as well as from suboptimal accounting for fluency and readability of the generated abstracts. We thus explore and propose alternative evaluation measures: the reported human-evaluation analysis shows that the proposed metrics, based on Question Answering, compare favorably to ROUGE – with the additional property of not requiring reference summaries. Training an RL-based model on these metrics leads to improvements (both in terms of human and automated metrics) over current approaches that use ROUGE as a reward.

1 Introduction

Summarization systems aim at generating relevant and informative summaries given a variable-length text as input. They can be roughly divided into two main categories: those adopting an extractive approach, i.e. identifying the most informative pieces of the input text and concatenating them to form the output summary; and those producing abstractive summaries, i.e. generating an output text whose tokens are not necessarily present in the input text.

While closer to human summarization, abstractive summarization is a much harder task and the need for faithful evaluation metrics is crucial to measure and drive the progress of such systems. The standard for evaluation of summarization systems is ROUGE lin2004rouge: this metric can be considered as an adaptation of BLEU papineni2002bleu, a scoring method for evaluation of machine translation systems; both based on n-gram co-occurrences, the latter favors precision while the former emphasizes recall.

Recent research works paulus2017deep; pasunuru2018multi; arumae2019guiding have proposed to use evaluation metrics – and ROUGE in particular – to learn the model parameters through Reinforcement Learning (RL) techniques. This makes the choice of a good evaluation metric even more important. Unfortunately, ROUGE is known to suffer from several problems: in particular, its poor accounting for fluency and readability of the generated abstracts, as well as its bias towards lexical similarity ng2015better. To emphasize the latter point, since ROUGE evaluates a summary against given human references, summarization models incur the risk of being unfairly penalized: a high-quality summary might still have very few tokens/n-grams in common with the reference it is evaluated against.

In this work, we propose to overcome n-gram matching based metrics, such as ROUGE, by developing metrics which are better predictors of the quality of summaries. The contributions of this paper can be summarized as follows:

  • Extending recent works matan2019; chen2018semantic, we introduce new metrics, based on Question Answering, that do not require human annotations.

  • We report a quantitative comparison of various summarization metrics, based on correlations with human assessments.

  • We leverage the accuracy of the proposed metrics in several reinforcement learning schemes for summarization, including two unsupervised settings: in-domain (raw texts from the target documents) and out-of-domain (raw texts from another document collection).

  • Besides a quantitative evaluation of the generated summaries, we qualitatively evaluate the performances of the different approaches through human assessment.

Our main results can be summarized as follows:

  1. We show that fitting human judgments from carefully chosen measures allows one to successfully train a reinforcement learning-based model, improving over the state-of-the-art (in terms of ROUGE and human assessments).

  2. We show that dropping the requirement for human-generated reference summaries, as enabled by the proposed metrics, allows texts to be leveraged in a self-supervised manner and brings clear benefits in terms of performance.

Section 2 introduces the metrics. Section 3 reviews related summarization systems and presents our proposed approaches. Section 4 presents our experimental results and discussions.

2 Evaluation Metrics

This section first describes our selection of existing summarization metrics and introduces our proposals. Then, we quantitatively compare them for abstractive summarization. For a comprehensive list of evaluation metrics, we refer the reader to liu2016not.

2.1 n-gram-based metrics


Text-Rank

Automated summarization started with the development of extractive text summarization models. Many unsupervised models were developed that compute a score between a sentence and the document(s), the score reflecting whether the sentence should be selected for building a summary NenkovaAutomaticSummarization2011. Such scores can thus be used as a proxy for summary quality. We chose Text-Rank mihalcea2004textrank – an extractive non-parametric summarization system inspired by PageRank page1999pagerank – since it performs well on extractive tasks and could easily be adapted to our needs. The algorithm builds a graph of the sentences within a text based on their co-occurrences. Then, it assigns an importance score to each sentence based on a random walk on the resulting graph. The most important elements of the graph are considered to be the ones that best describe the text. As a derivative usage, we propose to use these importance scores to assess the quality of abstractive summaries in our study. This metric is referred to as Text-Rank in the following.
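As an illustration, the random-walk scoring described above can be sketched as follows. This is a minimal pure-Python version using word-overlap edge weights and power iteration; the exact similarity function and damping factor used by mihalcea2004textrank may differ.

```python
import math

def sentence_similarity(s1, s2):
    """Word-overlap similarity in the spirit of mihalcea2004textrank:
    shared words, normalized by the log lengths of both sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) <= 1 or len(w2) <= 1:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))

def textrank_scores(sentences, damping=0.85, iterations=50):
    """Run a PageRank-style random walk on the sentence graph and
    return one importance score per sentence."""
    n = len(sentences)
    weights = [[sentence_similarity(a, b) if i != j else 0.0
                for j, b in enumerate(sentences)]
               for i, a in enumerate(sentences)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out = sum(weights[j])
                if weights[j][i] > 0 and out > 0:
                    rank += weights[j][i] / out * scores[j]
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores
```

Sentences sharing vocabulary with the rest of the text accumulate rank, while isolated sentences fall back to the teleport probability (1 − damping) / n.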


ROUGE

Arguably the most popular summarization metric at the moment, ROUGE provides a set of measures to compare automatically generated texts against one or more references lin2004rouge. In particular, ROUGE-N is based on the count of overlapping n-grams, while ROUGE-L accounts for the longest common sub-sequence between the candidate and its corresponding reference(s).
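For concreteness, ROUGE-N recall against a single reference can be sketched as below. This is a simplified version: the official ROUGE script additionally supports stemming, stopword removal, and multi-reference aggregation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    """ROUGE-N (recall-oriented): reference n-grams also found in the
    candidate, divided by the total number of reference n-grams."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())
```

For example, the candidate "the cat sat" recovers 3 of the 6 reference unigrams of "the cat sat on the mat", giving a ROUGE-1 recall of 0.5.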


Novelty

As noted by see2017get, abstractive summarization models do not produce novel n-grams as often as the reference summaries do. Thus, to favor the generation of unseen words and produce more abstractive summaries, kryscinski2018improving integrated novelty as a reward for reinforcement learning. It is defined as the fraction of unique n-grams in the summary that are novel, normalized by the length ratio of the generated and reference summaries.
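A sketch of this definition follows; note that this is an illustrative reading of the formula above (novel meaning "absent from the source"), and that the exact tokenization and normalization of kryscinski2018improving may differ.

```python
def novelty(summary, source, reference, n=2):
    """Fraction of the summary's unique n-grams absent from the source
    text, scaled by the generated/reference length ratio."""
    def grams(text):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    summary_grams = grams(summary)
    if not summary_grams:
        return 0.0
    novel = len(summary_grams - grams(source))
    length_ratio = len(summary.split()) / max(len(reference.split()), 1)
    return (novel / len(summary_grams)) * length_ratio
```

The length-ratio factor keeps the reward from being trivially maximized by very short, fully novel outputs.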

2.2 Beyond n-grams

2.2.1 Language Modeling

We investigate the use of language models as an evaluation metric. shafieibavani-etal-2018-summarization proposed to exploit word embeddings to train a model able to rate the generated summaries. In line with recent neural language models (LM), we propose to consider the perplexity of the generated summary under the BERT LM devlin2018bert, which has demonstrated state-of-the-art results on many NLP tasks. For our experiments, we used the publicly available pre-trained English “base” model.
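Perplexity can be computed from the per-token probabilities returned by any LM; obtaining such probabilities from BERT, a masked LM, requires a pseudo-likelihood estimate that is not shown here. A minimal sketch:

```python
import math

def perplexity(token_probs):
    """Perplexity of a text given the LM's probability for each of its
    tokens: exp of the mean negative log-probability. Lower values mean
    the LM finds the text more fluent."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
```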

2.2.2 Question-Answering based Metrics

Question-Answering is related to summarization: the first work in this direction wu2002towards introduced the notion of Answer-Focused Summarization, where answers to relevant questions on the source text are used to build the corresponding summary. Based on the intuition that a good-quality summary should provide the answers to the most relevant questions on a given input text, several works have proposed to adapt Question Answering (QA) for summary quality evaluation.

In that vein, pasunuru2018multi proposed to measure whether answers contain the most salient tokens. Closer to our work, matan2019 proposed APES, a novel metric for evaluating summarization, based on the hypothesis that the quality of a generated summary is linked to the number of questions (from a set of relevant ones) that can be answered by reading it. In their proposed setup, two components are thus needed: (a) a set of relevant questions for each source document; and (b) a QA system. For each summary to assess, questions are successively generated from a reference summary, by masking each of the named entities present in this reference, following the methodology described in hermann2015teaching. This results in as many (input, question, answer) triplets as there are named entities in the reference summary, where input denotes the summary to assess, question the sentence containing the masked entity, and answer the masked entity to retrieve. Thus, for each summary to assess, metrics can be derived from the ability of the QA system to retrieve the correct answers from each of the associated triplets.
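The triplet construction can be sketched as follows. In this hypothetical helper, the entity list is assumed to be supplied by an upstream NER step, and the naive split on ". " stands in for a proper sentence tokenizer.

```python
def make_cloze_triplets(reference_summary, entities, summary_to_assess):
    """Build (input, question, answer) triplets hermann2015teaching-style:
    each named entity found in a reference sentence is masked to form a
    cloze question, with the entity itself as the gold answer."""
    triplets = []
    for sentence in reference_summary.split(". "):
        for entity in entities:
            if entity in sentence:
                triplets.append({
                    "input": summary_to_assess,          # text the QA system reads
                    "question": sentence.replace(entity, "[MASK]"),
                    "answer": entity,
                })
    return triplets
```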

F1 score

For each triplet, an F1 score is computed from the answers retrieved by the considered QA system. This score, commonly used for QA evaluation rajpurkar2016squad, measures the average overlap between predictions and ground-truth answers. For each summary to assess, the metric is the average of the F1 scores computed over its triplets. In the following, we denote this metric as QA_fscore(sup).
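The token-overlap F1 and its per-summary average can be sketched as below (the official SQuAD script additionally lowercases, strips punctuation and articles before comparing):

```python
from collections import Counter

def qa_f1(prediction, ground_truth):
    """Token-overlap F1 between a predicted answer and the gold answer,
    as in SQuAD evaluation (rajpurkar2016squad)."""
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    common = Counter(pred) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def qa_fscore(qa_predictions, answers):
    """QA_fscore for one summary: average F1 over its triplets."""
    return sum(qa_f1(p, a) for p, a in zip(qa_predictions, answers)) / len(answers)
```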

QA confidence

Complementary to the F1 score, we propose to also consider the confidence of the QA system in its retrieved answer. This corresponds, for each triplet, to the probability assigned to the true answer by the QA model. For each summary to assess, confidence scores are averaged over its associated triplets. In the following, we denote this metric as QA_conf(sup).

Besides considering the simple presence of the expected answers in the generated summary, QA-based metrics also account to some extent for readability. They indeed require that the considered QA system, trained on natural language, be able to find the answer in the input to assess, despite the variability of the generated texts.

Extension to the unsupervised setting

While being a useful complement to ROUGE, the two QA-based metrics described above still require human-generated summaries. In this paper, we propose to extend the previously described QA-based approach to an unsupervised setting.

With this aim, we extended the above metrics to the document level (i.e., questions and answers are generated from the source article text rather than from the reference summary), dispensing with the need for human-generated reference summaries. Thus, in line with the APES approach described above, we propose two unsupervised QA-based metrics, which we refer to as QA_fscore(unsup) and QA_conf(unsup). Accounting for both quality and informativeness of a generated summary, these metrics have the appealing property of not requiring reference summaries.

2.3 Quantitative Analysis

We exploit human judgments obtained by paulus2017deep for three types of automatically generated summaries on 100 samples of the CNN/Daily Mail summarization dataset (see details in section 4.1). The summaries were generated by the three different systems proposed in the original work, and scored via Amazon Mechanical Turk for Readability (how well written the summary is) and Relevance (how well the summary captures the important parts of the article), with scores from 1 to 10 for both criteria.

                     Readability  Relevance
Readability          1.0          0.77 **
Relevance            0.77 **      1.0
ROUGE-1 (sup)        0.14 *       0.18 **
ROUGE-2 (sup)        0.12 *       0.18 **
ROUGE-L (sup)        0.13 *       0.18 **
Text-Rank (unsup)    0.14 *       0.13 **
Novelty (unsup)      -0.13 *      -0.1 *
BERT LM (unsup)      0.21 **      0.08 *
QA_fscore (sup)      0.14 *       0.19 **
QA_conf (sup)        0.19 **      0.23 **
QA_fscore (unsup)    0.08         0.2 **
QA_conf (unsup)      0.33 **      0.31 **
Table 1: Spearman's ρ for the different metrics w.r.t. Readability and Relevance (*: p < 0.05, **: p < 0.01).

In Table 1, we report Spearman's rank correlations on this data, comparing summary rankings obtained according to the assessed metrics. The scores reflect the ability of the various metrics to reproduce human preferences (in terms of readability and relevance). First, we observe that readability and relevance are naturally intertwined: intuitively, an unreadable summary will bear very little information, which partly explains the high correlation between readability and relevance.

From this correlation analysis against human judgments, we observe that, as expected, the Language Model metric captures readability better than ROUGE, while falling short on relevance.

On the other hand, the results obtained with the proposed QA-based metrics indicate their potential benefits, especially under the unsupervised setting: QA_conf(unsup) captures both readability and relevance better than all the other reported metrics, including ROUGE, and QA_fscore(unsup) also correlates with relevance better than ROUGE. We thus conclude that the proposed metrics, which correlate favorably with human judgments of readability and relevance, are worthy of deeper experimental investigation: in the following sections, we provide a thorough evaluation of their contributions as reinforcement learning reward signals.

2.4 Learned Metric

Finally, we also leverage the qualitative data obtained by paulus2017deep – amounting to 50 samples evaluated by annotators in terms of readability and relevance – to learn an aggregate evaluation metric. We use a Ridge regression to learn to predict the geometric mean of readability and relevance from the metrics defined above. The geometric mean was chosen since we want the generated summary to be both readable and relevant.

We randomly sampled 50% of the data to fit the linear model with various subsets of our base metrics, then measured the correlation w.r.t. the expected geometric mean on the remaining 50% of the data. We repeated this procedure 1000 times. Our experiments show that the best performing set of metrics consists of ROUGE-L in conjunction with QA_fscore and QA_conf, both computed at article level – and hence unsupervised.

This learned metric, denoted QA_learned, is thus defined (with the unsup versions of the QA-based scores) as:

QA_learned = α · ROUGE-L + β · QA_fscore + γ · QA_conf    (1)

where α, β and γ are the coefficients fitted by the regression. We leverage this learned metric in our RL-based summarization model, as described below.
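Putting section 2.4 together, the regression target and the fitted combination can be sketched as below. The geometric-mean target illustrates why a summary must score well on both axes, and the coefficients alpha, beta, gamma are placeholders for the fitted values, which are not reproduced here.

```python
import math

def target_score(readability, relevance):
    """Regression target: geometric mean of the two human scores.
    High only when the summary is BOTH readable and relevant."""
    return math.sqrt(readability * relevance)

def qa_learned(rouge_l, qa_fscore_unsup, qa_conf_unsup, alpha, beta, gamma):
    """Learned metric: linear combination of ROUGE-L and the two
    unsupervised QA scores; alpha, beta, gamma come from the ridge fit."""
    return alpha * rouge_l + beta * qa_fscore_unsup + gamma * qa_conf_unsup
```

For instance, a summary rated (9, 1) gets a target of 3 while one rated (5, 5) gets 5: imbalanced summaries are penalized.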

2.5 Implementation details

As QA system, we use the BERT “base” pre-trained model devlin2018bert, fine-tuned on the SQuAD dataset rajpurkar2016squad using the recommended parameters for the task. This differs from the approach adopted by matan2019, who trained their QA model on CNN-DM (the same data used for the summarization task).

3 Summarization Models

Abstractive summarization systems were originally designed as a post-processing step over an extractive system, e.g. by compressing sentences NenkovaAutomaticSummarization2011. Much recent work focuses on neural sequence-to-sequence architectures sutskever2014sequence, which allow the problem to be considered as a whole rather than as a two-step process; such models have reached state-of-the-art performance. To tackle summarization, which deals with long texts and possibly out-of-vocabulary tokens, see2017get proposed to leverage attention over the input bahdanau2014neural, as well as a copy mechanism vinyals2015pointer.

One problem of sequence-to-sequence models is that they tend to repeat text in the output. To deal with this problem, see2017get use a coverage mechanism, and paulus2017deep introduced Intra-Decoder Attention with the same goal of avoiding duplicate information within the output sequences.

More recently, the model proposed by see2017get was further extended by gehrmann2018bottom with the addition of an attention mask during inference: a sequence tagger is pre-trained to select which input tokens should be copied, and is used to filter the copy mechanism. Such a filter, called Bottom-Up Copy Attention, was shown to prevent the copying of overly long sequences from the source text, hence resulting in more abstractive summaries. On the CNN/Daily Mail dataset, gehrmann2018bottom found this two-step process to yield significant improvements in terms of ROUGE – resulting in the current state-of-the-art system. We base our experiments on this model.

The differentiable loss function commonly used for training summarization models, the negative log-likelihood, has several known limitations: among them, exposure bias and the failure to cope with the large number of potentially valid summaries.

To overcome this, approaches based on reinforcement learning have recently been proposed, allowing the models to learn via reward signals. ranzato2015sequence used the REINFORCE algorithm williams1992simple to train RNNs for several generation tasks, showing improvements over previous supervised approaches. narayan-etal-2018-ranking used such an approach in an extractive summarization setting, learning to select the most relevant sentences within the input text in order to construct its summary. paulus2017deep combined supervised and reinforcement learning, demonstrating improvements over competing approaches both in terms of ROUGE and in human evaluation. However, the main limitation of these works is that they rely on standard summarization metrics, which are known to be biased.

Finally, closer to our work, arumae2019guiding proposed to use question-answering rewards to learn an extractive summarization model in a reinforcement learning setup. Compared to what we propose, their system is extractive, and relies on hand-written summaries.

3.1 Mixed Training Objectives

In our experiments, we follow the reinforcement learning scheme described below. The main difference with previous works lies in our reward function, which is based on our study of metrics (section 2). We consider a mixed loss combining the supervised and reinforcement learning schemes:

L_mixed = λ · L_ml + (1 − λ) · L_rl    (2)

where the maximum likelihood loss L_ml and the reinforcement loss L_rl are defined in the following paragraphs.

Maximum Likelihood

Under a supervised training setup, the teacher forcing algorithm williams1989learning can be applied; this corresponds to maximizing the likelihood (ML), or equivalently to minimizing the negative log-likelihood (NLL) loss defined as:

L_ml = − Σ_{t=1..m} log p(y*_t | y*_1, …, y*_{t−1}, X)    (3)

where X = (x_1, …, x_n) is the input text of n tokens and Y* = (y*_1, …, y*_m) is the corresponding reference summary of m tokens.
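Given the model's probability for each reference token under teacher forcing, the NLL loss reduces to a simple sum, sketched here:

```python
import math

def nll_loss(reference_token_probs):
    """Negative log-likelihood of the reference summary: the sum of
    -log p(y*_t | y*_<t, X) over its tokens (teacher forcing)."""
    return -sum(math.log(p) for p in reference_token_probs)
```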

Policy Learning

Several RL-based summarization models kryscinski2018improving; li2018actor; pasunuru2018multi; paulus2017deep apply the self-critical policy gradient training algorithm rennie2017self. Following paulus2017deep, we use the REINFORCE algorithm williams1992simple, taking as baseline the sequence Ŷ = (ŷ_1, …, ŷ_m) obtained by greedy decoding according to the conditional distribution p(· | X). The model is also sampled one token at a time, using its Markov property, giving rise to the sequence Y^s = (y^s_1, …, y^s_m).

Following the self-critical scheme, with r(Y) denoting the reward for an output sequence Y, the loss to be minimized is then defined as:

L_rl = (r(Ŷ) − r(Y^s)) Σ_{t=1..m} log p(y^s_t | y^s_1, …, y^s_{t−1}, X)    (4)
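As an illustration, the self-critical loss for one sampled summary can be sketched as follows, given the rewards of the greedy and sampled sequences and the per-token probabilities of the sampled one (decoding and reward computation are assumed to happen upstream):

```python
import math

def self_critical_loss(reward_sampled, reward_greedy, sampled_token_probs):
    """Self-critical policy-gradient loss (rennie2017self): the greedy
    reward acts as a baseline. Minimizing this loss increases the
    probability of sampled summaries that beat the greedy baseline,
    and decreases it for those that fall short."""
    log_prob = sum(math.log(p) for p in sampled_token_probs)
    return (reward_greedy - reward_sampled) * log_prob
```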
As ROUGE is the most widely used evaluation metric, paulus2017deep used ROUGE-L as the reward function r and tested the following three different setups:

  • ML: the model trained with the maximum likelihood loss only (λ = 1);

  • RL: the model trained with the reinforcement loss only (λ = 0);

  • ML+RL: the model trained with the mixed loss (0 < λ < 1).

The human evaluation conducted on the three models shows that RL performs worse than ML, and ML+RL performs best for both readability and relevance. The authors also conclude that “despite their common use for evaluation, ROUGE scores have their shortcomings and should not be the only metric to optimize on summarization model for long sequences”, which translates into the very high optimal λ they report. We show that using a more sensible metric to optimize leads to a better model, and to a lower λ.

4 Experiments

In our experiments, we evaluate the effect of substituting the ROUGE reward in the reinforcement learning model of paulus2017deep with our proposed metric (section 2). We moreover study the effect of using metrics that do not require human-generated summaries.

4.1 Data Used

Task-specific corpora for building and evaluating summarization models associate a human-generated reference summary with each text provided. We resort to the CNN/Daily Mail (CNN-DM) dataset hermann2015teaching; nallapati2016abstractive for our experiments. It includes 287,113 article/summary pairs for training, 13,368 for validation, and 11,490 for testing. The summary corresponding to each article consists of several bullet points displayed on the respective news outlet webpage. On average, summaries contain 66 tokens and 4.9 bullet points. Consistently with see2017get and gehrmann2018bottom, we use the non-anonymized version of the dataset, the same training/validation splits, and truncate source documents and summaries to 400 and 100 tokens, respectively.

To assess the possible benefits of reinforcing on the proposed QA-based metrics, which do not require human-generated reference summaries, we employ TL;DR, a large-scale dataset for automatic summarization built on social media data, comprising roughly 4 million training pairs volske2017tl. Both the CNN-DM and TL;DR datasets are in English.

4.2 Models

For all our experiments, we build on top of the publicly available OpenNMT implementation, consistently with gehrmann2018bottom, which we refer to as the baseline. The encoder is a one-layer bi-LSTM with 512 hidden states; the one-layer decoder also has 512 hidden states. The embedding size is set to 128. The model is trained with Adagrad, with an initial learning rate of 0.15 and an initial accumulator value of 0.1. We continue training until convergence: when the validation perplexity does not decrease after an epoch, the learning rate is halved. We use gradient clipping with a maximum norm of 2.

gehrmann2018bottom showed that increasing the number of hidden states leads to slight improvements in performance, at the cost of increased training time; thus, as reinforcement learning is computationally expensive, we build on top of the smallest model – nonetheless, we include the largest model by gehrmann2018bottom in our discussion of results.

All the experimented reinforcement approaches use the mixed training objectives defined in equation 2, with the ML part corresponding to the previously described baseline model pretrained on the CNN-DM dataset. Compared models differ on the considered reward signals. They also differ on their use of additional unsupervised data, either In-Domain or Out-of-Domain, as discussed below.

4.2.1 Reward Signals

The three reward signals used throughout our experiments are detailed below:

  1. ROUGE: we use ROUGE-L alone as the reward signal within the baseline architecture, consistently with paulus2017deep;

  2. QA_learned: we compute the reward by applying the learned coefficients to the three components of the learned metric, as obtained in Section 2.4;

  3. QA_equally: we apply the mixed training objective function, using as reward the three components of the learned metric (ROUGE-L, QA_fscore and QA_conf) equally weighted; this corresponds to setting a value of 1 for α, β and γ in Eq. 1, and allows us to see to what extent learning is sensitive to fitting human assessments.

For (2) and (3), we set λ (Eq. 2) to 0.5. We also ran experiments with the value used by paulus2017deep, and report here the best performance, which was obtained with the former. This shows that, compared to paulus2017deep, we do not need to rely as heavily on the NLL loss to prevent the model from generating unreadable summaries.

4.2.2 In-Domain vs Out-of-Domain

Finally, we experiment with the proposed QA_fscore and QA_conf metrics in an unsupervised fashion, as they can be computed at article level – i.e. without accessing the reference human-generated summaries. We investigate the potential benefits of this approach both in-domain and out-of-domain: for the former, we resort to the test set of the CNN/Daily Mail (CNN-DM) dataset; for the latter, we leverage the TL;DR corpus.

As CNN-DM is built from mainstream news articles while the TL;DR data comes from social media sources, we consider the latter as out-of-domain in comparison. From TL;DR, which includes circa 4 million samples, we randomly draw sample subsets of size comparable to the CNN-DM training, validation and testing splits.

Due to computational costs, we restrict these experiments to the model trained under reinforcement using the QA_learned metric. Under this setup, the model has access at training time to both:

  • supervised samples, for which a reference summary is given (and thus all metrics, including ROUGE and NLL, can be computed as a training objective), coming from the training set of the CNN/Daily Mail corpus;

  • unsupervised samples, for which no reference is available, thus allowing only QA_fscore(unsup) and QA_conf(unsup) to be computed. Three unsupervised settings are considered in the following:

    TL;DR, corresponding to the out-of-domain setting where we use articles from the TL;DR dataset;

    CNN-DM (VAL), corresponding to an in-domain setting where we use texts from the validation set from the CNN/Daily Mail dataset;

    and CNN-DM (TEST), corresponding to an in-domain setting where we use the articles from the test set (thus containing texts used for evaluation purposes).

While all the data comes from the CNN-DM training set in the supervised setups, in the unsupervised setups we set the proportion of unsupervised data to 50% (either CNN-DM VAL or CNN-DM TEST for in-domain, or TL;DR for out-of-domain data). Thus, for 50% of the data, the model has access only to the QA_fscore and QA_conf reward signals, since the ROUGE-L reward can only be computed on supervised batches.

Therefore, for all the unsupervised setups, in order to keep the reward signal consistent throughout training, we double the weight associated with ROUGE-L when this reward is computable, and set it to 0 otherwise.
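For the equally weighted reward, the batch-level weighting described above can be sketched as follows; the helper name is illustrative, and the learned variant would scale the fitted coefficients analogously.

```python
def batch_reward(rouge_l, qa_fscore, qa_conf, has_reference):
    """Per-sample reward in the 50/50 supervised/unsupervised setting:
    ROUGE-L requires a reference, so its weight is doubled on supervised
    samples and zeroed on unsupervised ones, keeping its expected
    contribution over a full training run unchanged."""
    rouge_weight = 2.0 if has_reference else 0.0
    return rouge_weight * rouge_l + qa_fscore + qa_conf
```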

4.3 Results

                                R-1    R-2    R-L    QA_fscore  QA_conf

see2017get                      39.53  17.28  36.38  -      -
gehrmann2018bottom              41.22  18.68  38.34  -      -
ML+RL paulus2017deep            39.87  15.82  36.90  -      -
RL paulus2017deep               41.16  15.75  39.08  -      -
pasunuru2018multi               40.43  18.00  37.10  -      -
chen2018fast                    40.88  17.80  38.54  -      -
baseline                        42.24  17.78  37.44  14.91  40.12
 + ROUGE                        45.62  16.30  41.60  13.64  37.90
 + QA_equally                   43.36  18.06  38.33  16.06  41.01
 + QA_learned                   42.71  17.81  37.94  15.19  41.39
 + QA_learned + TL;DR           42.75  17.57  37.88  15.75  41.54
 + QA_learned + CNN-DM (VAL)    43.00  17.66  38.23  16.16  41.75
 + QA_learned + CNN-DM (TEST)   42.74  17.25  37.96  16.17  42.14
Table 2: Comparison with previous works. On top, we report the results obtained by gehrmann2018bottom using their largest architecture, as well as those by see2017get. Next, we report results recently obtained by reinforcement learning approaches. Finally, we indicate the scores obtained by our baseline – the “small” model by gehrmann2018bottom – and the six reinforced models we build on top of it.

In Table 2, we report the results obtained in our experiments, in comparison with previously proposed approaches. We observe that, unsurprisingly, reinforcing on ROUGE-L yields significant improvements over the state of the art in terms of ROUGE, but at the cost of lower QA-based metrics. Conversely, reinforcing on the proposed metric consistently improves all of its components (ROUGE-L, QA_fscore and QA_conf).

However, increasing the reward does not necessarily correlate with better summaries: the human inspection reported by paulus2017deep shows that summaries generated when reinforcing on ROUGE-L are consistently on the low end in terms of readability and relevance.

A closer inspection of the generated summaries revealed that the sequences generated by this model seem to qualitatively degrade as the number of produced tokens grows: they often start with a reasonable sub-sequence, but quickly diverge towards meaningless outputs. This can be explained by the aforementioned drawbacks of ROUGE, which are likely amplified when used both as evaluation and reward: the system might be optimizing for ROUGE, at the price of losing the information captured with the NLL loss by its language model.

We hence conducted a human evaluation for the different setups, reported in Tables 3 and 4, assessing their outputs for readability and relevance in line with paulus2017deep. We randomly sampled 50 articles from the CNN-DM test set; since the learned metric used in our experiments is derived from the subset manually evaluated in paulus2017deep, we ensured that there was no overlap with it. For each of these 50 articles, three English speakers evaluated the summaries generated by the 7 different systems reported in Table 2.

                                Readability  Relevance
human reference                 7.27*        7.4**
baseline                        7.07         5.82
 + ROUGE                        2.14**       5.48**
 + QA_equally                   5.94**       6.34**
 + QA_learned                   6.96         6.21**
 + QA_learned + TL;DR           6.60*        6.26**
 + QA_learned + CNN-DM (VAL)    6.40*        6.75**
 + QA_learned + CNN-DM (TEST)   6.89         6.80**

Table 3: Human assessment (average scores): two-tailed t-test results are reported for each model compared to the baseline (*: p < 0.05, **: p < 0.01).
Each cell reports significance for Readability / Relevance between the row and column models:

                                vs human ref  vs baseline  vs + ROUGE  vs + QA_equally  vs + QA_learned  vs + TL;DR  vs + VAL
baseline                        * / **        -
 + ROUGE                        ** / **       ** / **      -
 + QA_equally                   ** / **       ** / **      ** / **     -
 + QA_learned                   ** / **       - / **       ** / **     ** / -           -
 + QA_learned + TL;DR           ** / **       * / **       ** / **     ** / -           * / -            -
 + QA_learned + CNN-DM (VAL)    ** / **       * / **       ** / **     ** / *           ** / **          - / *       -
 + QA_learned + CNN-DM (TEST)   ** / **       - / **       ** / **     ** / *           - / **           - / *       * / -

Table 4: Human assessment: two-tailed t-test results are reported for each model pair for Readability / Relevance (*: p < 0.05, **: p < 0.01).

We observe that reinforcing using the proposed metric – which includes QA-based components – leads to comparable performance in terms of ROUGE w.r.t. state-of-the-art approaches, while clear benefits emerge from the results of the human evaluation: a significant improvement in terms of relevance, particularly when leveraging in-domain data in an unsupervised setup. Not surprisingly, we observe an improvement for our model when reinforced through the learned metric compared to the equally weighted one. The slightly lower relevance scores observed for QA_learned w.r.t. QA_equally are consistent with the lower ROUGE-L and QA_fscore reported in Table 2. This is explained by the lower coefficients for ROUGE-L and QA_fscore (see 2.4), and the relatively stronger correlation of those two metrics with relevance (see Table 1).

Consistently with the figures reported in Table 2, the human evaluation results – reported in Tables 3 and 4 – confirm the progressive improvements of our different proposed models when using unsupervised data closer to the test set documents:

  • adding unsupervised data from the out-of-domain TL;DR corpus brings a slight improvement when using QA_learned;

  • when using data from the same domain (i.e. the CNN-DM validation set), the improvements increase;

  • finally, when unsupervised samples come from the same set as those used for testing, we observe even better results.

These results show that using the proposed QA-based metrics, which do not depend on reference summaries, makes it possible to leverage raw text data; and that fine-tuning (without supervision) on the documents to be summarized is beneficial.

To elaborate further, we notice that applying the learned coefficients of Eq. 1 to the results obtained by the models reinforced on QA_learned and QA_equally (see Table 2) yields very similar scores (namely, 136.43 for QA_equally and 136.4 for QA_learned). However, the qualitative analysis reported in Tables 3 and 4 shows that, while they perform similarly in terms of relevance, a significantly lower score for readability is obtained with QA_equally. This can be explained by the stronger weight of ROUGE-L in this setup, which might degrade the quality of the output, consistently with the observations reported in paulus2017deep as well as in our ROUGE experiment.

Another observation from Tables 3 and 4 is that while QA_learned performs significantly better in terms of readability than QA_learned + CNN-DM (VAL), the opposite holds for relevance. This could be explained by the setup difference during training: as detailed in section 4.2.2, in the unsupervised setups (e.g. QA_learned + CNN-DM (VAL)), only the QA-based metrics are computed on the portion of data for which no reference is available. While the testing (TEST) and validation (VAL) splits come from the same dataset (CNN-DM), we observe that using the samples from TEST in an unsupervised fashion maintains comparably high relevance w.r.t. QA_learned + CNN-DM (VAL), while also obtaining readability similar to QA_learned. This shows the possible benefits of exposing the model to the evaluation data in unsupervised setups. To further study our unsupervised metrics, we performed additional experiments on the TL;DR corpus, and observed an improvement of more than one absolute point w.r.t. CNN-DM TEST in terms of ROUGE-L, QA_fscore (unsup) and QA_conf (unsup).

This indicates that the proposed unsupervised metrics allow the model to better transfer to new domains such as TL;DR. These results pave the way for leveraging large numbers of texts, in a self-supervised manner, to train automatic summarization models.

5 Conclusions

We have presented the analysis of novel QA-based metrics (a Python package will be made available), and have shown promising results when using them as a reward in an RL setup. Crucially, those metrics do not require a human reference, as they can be computed from the text to be summarized.

From our experiments, this proves particularly beneficial, as it allows leveraging both in-domain and out-of-domain unlabeled data.

The promising results obtained indicate a path towards partially self-supervised training of summarization models, and suggest that progress in automated question generation can bring benefits for automatic summarization.

Finally, to our knowledge, this paper is the first to compare two architectures with the same reinforcement setup on the same data: the one proposed by see2017get and extended by gehrmann2018bottom, versus the one by paulus2017deep. In terms of ROUGE, we observe better results than those reported by paulus2017deep – see Table 2 – indicating a possible edge for the architecture proposed by see2017get.

