Improving Abstractive Text Summarization with History Aggregation


Recent neural sequence-to-sequence models have provided feasible solutions for abstractive summarization. However, such models still struggle to capture long-range dependencies in the summarization task. A high-quality summarization system usually depends on a strong encoder that can distill important information from long input texts, so that the decoder can generate salient summaries from the encoder’s memory. In this paper, we propose an aggregation mechanism based on the Transformer model to address the challenge of long text representation. Our model can review history information to give the encoder more memory capacity. Empirically, we apply our aggregation mechanism to the Transformer model and experiment on the CNN/DailyMail dataset, achieving higher-quality summaries than several strong baseline models on the ROUGE metrics.


1 Introduction

Text summarization is the task of automatically compressing a long text into a shorter version while keeping the salient information. Approaches fall into two categories: extractive and abstractive summarization. The extractive approach usually selects sentences or phrases directly from the source text. In contrast, the abstractive approach first understands the semantic information of the source text and then generates novel words that do not appear in it. Extractive summarization is easier, but abstractive summarization is closer to the way humans process text. This paper focuses on the abstractive approach. Unlike other sequence generation tasks in NLP (Natural Language Processing) such as NMT (Neural Machine Translation), in which the lengths of the input and output texts are close, the summarization task has a severe imbalance between input and output lengths. This means that the summarization task must model long-distance text dependencies.

Because RNNs can process sequential text, variants of the sequence-to-sequence model[37] based on them have emerged on a large scale and generate promising results. To handle long-distance text dependencies, Bahdanau et al.[1] first propose the attention mechanism, which allows each decoder step to refer to all encoder hidden states. Rush et al.[34] first incorporate the attention mechanism into the summarization task. Other attention-based models also ease the problem of long input texts for summarization, such as Bahdanau attention[35], hierarchical attention[30], graph-based attention[39] and simple attention[25]. Celikyilmaz et al.[2] segment the text, encode each segment independently, and then broadcast each encoding to the others. Though these systems are promising, they exhibit undesirable behaviors such as producing inaccurate factual details and repeating themselves, because it is hard for a one-pass encoder to decide where to attend and what to ignore.

Modeling an effective encoder for representing a long text is still a challenge, and we address the long text dependency problem with an aggregation mechanism. The key idea of the aggregation mechanism is to collect history information and then compute attention between the encoder’s final hidden states and the history information to re-distribute the encoder’s final states. Intuitively, the encoder reads a long input text several times to understand it more clearly. We build our model by extending the Transformer model[40] with our novel aggregation mechanism. Empirically, we first analyze the features of summarization and translation datasets. Then we experiment with different numbers of encoder and decoder layers, and the results reveal that the capacity of the encoder is more important than that of the decoder, which implies that we should focus more on the encoder. Finally, we experiment on the CNN/DailyMail dataset, and our model generates higher-quality summaries than the strong Pointer Generator and Transformer baselines on ROUGE metrics and in human evaluations.

The main contributions of this paper are as follows:

  • We put forward a novel aggregation mechanism to redistribute context states of text with collected history information. Then we equip the Transformer model with the aggregation mechanism.

  • Our model outperforms the Transformer baseline model by 1.01 ROUGE-1, 0.30 ROUGE-2 and 1.27 ROUGE-L points on the CNN/DailyMail dataset and by 5.31 ROUGE-1, 4.56 ROUGE-2 and 5.19 ROUGE-L points on our self-built Chinese news dataset.

2 Related Work

In this section, we first introduce extractive summarization then introduce abstractive summarization.

2.1 Extractive Summarization

Extractive summarization aims to select salient sentences directly from source documents. It is usually modeled as a sentence ranking problem that selects sentences with high scores[17], a sequence labeling (binary label) problem[6] or an integer linear programming problem[42]. These models mostly rely on manually engineered features, which are now being replaced by neural networks that extract features automatically. Cheng et al.[4] obtain sentence representations with a convolutional neural network (CNN) and document representations with a recurrent neural network (RNN), and then select sentences/words using a hierarchical extractor. Nallapati et al.[28] treat summarization as a sequence labeling task. They obtain sentence and document representations using RNNs, and after a classification layer each sentence receives a label indicating whether it should be selected. Zhou et al.[49] present a model for extractive summarization that jointly learns to score and select sentences. Zhang et al.[47] put forward a latent variable model to tackle the problem of sentence label bias.

2.2 Abstractive Summarization

Abstractive summarization aims to rewrite source texts with understandable semantic meaning. Most methods for this task are based on sequence-to-sequence models. Rush et al.[34] first incorporate the attention mechanism into abstractive summarization and achieve state-of-the-art scores on the DUC-2004 and Gigaword datasets. Chopra et al.[5] improve model performance with an RNN decoder. Nallapati et al.[30] adopt a hierarchical network to process long source texts with hierarchical structure. Gu et al.[10] are the first to show that a copy mechanism can take advantage of both extractive and abstractive summarization by copying words from the source text (extractive summarization) and generating original words (abstractive summarization). See et al.[35] incorporate copy and coverage mechanisms to avoid generating inaccurate and repeated words. Celikyilmaz et al.[2] split the text into paragraphs, apply an encoder to each paragraph, and then broadcast each paragraph encoding to the others. Recently, Vaswani et al.[40] give a new view of the sequence-to-sequence model: their Transformer replaces the RNN with self-attention and uses multi-head attention to capture different semantic information.

Lately, more and more researchers focus on combining abstractive and extractive summarization. Hsu et al.[14] build a unified model using an inconsistency loss. Gehrmann et al.[9] first train a content selector to select and mask salient information, then train an abstractive model (Pointer Generator) to generate abstractive summaries.

3 Model

Figure 1: Aggregation Transformer model overview. Compared with the Transformer baseline model, we apply the aggregation layer between encoder and decoder. The aggregation layer can collect history information to redistribute the encoder’s final hidden states.

In this section, we first describe the attention mechanism and the Transformer baseline model; after that, we introduce the pointer and BPE mechanisms. Our novel aggregation mechanism is described in the last part. The code for our model is available online.

Notation We have pairs of texts $(X, Y)$, where $x \in X$ is a long text and $y \in Y$ is the summary of the corresponding $x$. The lengths of $x$ and $y$ are $L_x$ and $L_y$ respectively. Each text $x$ is composed of a sequence of words $w$, and we embed each word $w$ into a vector $e$. So we represent document $x$ with the embedding vectors $\{e_1, \dots, e_{L_x}\}$, and we obtain the representation of $y$ in the same way as for $x$.

3.1 Attention Mechanism

The attention mechanism is widely used in text summarization models because it produces a significance distribution over the source words for each decoding step. Bahdanau et al.[1] first propose the attention mechanism, in which the attention weight distribution is calculated as:

$$e_{t,i} = v^\top \tanh(W_h h_i + W_s s_t + b), \qquad \alpha_t = \mathrm{softmax}(e_t)$$

where $h_i$ is the encoder hidden state of the $i$-th word and $s_t$ is the decoder hidden state at time step $t$. The weights $v$, $W_h$, $W_s$ and the bias $b$ are learnable parameters. $\alpha_t$ is a probability distribution that represents the importance of the different source words for the decoder at time step $t$.

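The Transformer redefines the attention mechanism more concisely. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $\top$ denotes the transpose, $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$, $\mathbb{R}$ is the real field, $n, m$ are the lengths of the query and key/value sequences, and $d_k, d_v$ are the dimensions of key and value. For the summarization model we assume $d_k = d_v$. Self-attention is obtained from basic attention with $Q = K = V$. Multi-head attention concatenates multiple basic attentions with different parameters. We formulate multi-head attention as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$

where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$, and the matrices $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ are learnable parameters.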
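As an illustration of the two attention functions described in this section, the following is a minimal NumPy sketch, not the implementation used in our experiments. The function names, the toy sizes and the random projection matrices are assumptions made only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output (n, d_v), weights (n, m)
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v hold one (d_model, d_k) projection per head (hypothetical layout);
    # W_o is the (h * d_k, d_model) output projection.
    heads = [attention(Q @ wq, K @ wk, V @ wv)[0]
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy self-attention: 5 tokens, model width 8, 2 heads (all values random).
rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = ([rng.normal(size=(d_model, d_k)) for _ in range(h)]
                 for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)  # Q = K = V
_, weights = attention(X, X, X)
```

Each row of `weights` is a probability distribution over the key positions, and `out` keeps the input shape, which is what allows attention layers to be stacked.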

3.2 Transformer Baseline Model

Our baseline model corresponds to the Transformer model used in NMT tasks. It differs from previous sequence-to-sequence models in that it uses attention in place of RNNs. The Transformer model is divided into an encoder and a decoder, which we discuss in turn below.

Input The attention defined in the Transformer is order-agnostic, like a bag-of-words (BOW) model, so we have to add extra position information to the input. The position is encoded with heuristic sine and cosine functions:

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$

where $pos$ is the position of the word in the text, $i$ is the dimension index of the embedding, and the dimension of the model is $d_{model}$. The input of the network is the sum of the source text word embeddings and the position embeddings.
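The sine and cosine position encoding can be sketched in a few lines of NumPy. This is an illustrative sketch that assumes the standard Transformer formulation with base 10000 and an even model dimension; the function name `positional_encoding` is hypothetical.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Assumes d_model is even so sine and cosine columns pair up.
    pos = np.arange(max_len)[:, None]        # (max_len, 1) word positions
    two_i = np.arange(0, d_model, 2)[None]   # (1, d_model / 2) even dimension indices
    angle = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)              # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=4, d_model=6)
```

The resulting matrix is added element-wise to the word embeddings, so it must have the same width as the embeddings.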

Encoder The goal of the encoder is to extract the features of the input text and map it to a vector representation. The encoder is a stack of encoder layers. Each layer consists of multi-head self-attention and position-wise feed-forward sublayers. We employ a residual connection around each of the two sublayers, followed by layer normalization. The multi-head attention sublayer extracts different semantic information, and the position-wise feed-forward sublayer then computes the final hidden states of each encoder layer. The $l$-th encoder layer is formulated as:

$$A^l = \mathrm{LN}\big(H^{l-1} + \mathrm{MultiHead}(H^{l-1}, H^{l-1}, H^{l-1})\big)$$
$$H^l = \mathrm{LN}\big(A^l + \mathrm{FFN}(A^l)\big), \qquad \mathrm{FFN}(A^l) = \max(0, A^l W_1 + b_1)\, W_2 + b_2$$

where $A^l$ is the multi-head self-attention output after the residual connection, $\mathrm{LN}$ is the layer normalization function, and $H^l$ is the output of encoder layer $l$, with $H^0$ being the network input (word embeddings plus position embeddings). The weights $W_1$, $W_2$ and the biases $b_1$, $b_2$ are learnable parameters, and $\mathrm{FFN}$ is the position-wise feed-forward sublayer. This sublayer can also be described as two convolution operations with kernel size 1.

Decoder The decoder generates salient and fluent text from the encoder hidden states. The decoder is a stack of decoder layers. Each layer consists of masked multi-head self-attention, multi-head attention, and feed-forward sublayers. As in the encoder, we employ residual connections around each of the sublayers, followed by layer normalization ($\mathrm{LN}$). We take the $l$-th decoder layer as an example. We use masked multi-head self-attention to encode the summary as the vector $M^l$:

$$M^l = \mathrm{LN}\big(D^{l-1} + \mathrm{MaskedMultiHead}(D^{l-1}, D^{l-1}, D^{l-1})\big)$$

where $D^{l-1}$ is the sum of the word embeddings and position embeddings of the generated words in the first layer, and the output of the $(l-1)$-th decoder layer in the other layers. $\mathrm{MaskedMultiHead}$ is masked multi-head self-attention, with the same mask as in the Transformer decoder. Then we apply multi-head attention between the encoder and decoder:

$$C^l = \mathrm{LN}\big(M^l + \mathrm{MultiHead}(M^l, H^{enc}, H^{enc})\big)$$

where $M^l$ is the hidden state of the decoder masked multi-head attention and $H^{enc}$ is the output of the last encoder layer. Finally, we use position-wise feed-forward ($\mathrm{FFN}$) and layer normalization sublayers to compute the final states $D^l$:

$$D^l = \mathrm{LN}\big(C^l + \mathrm{FFN}(C^l)\big)$$

where the weights and biases of $\mathrm{FFN}$ are learnable parameters. Projecting the final decoder hidden states to the vocabulary size then gives the vocabulary probability distribution $P_{vocab}$.

3.3 Pointer and BPE Mechanism

In generation tasks, we must deal with the out-of-vocabulary (OOV) problem. If we do not tackle this problem, the generated text contains only words from a limited vocabulary and every OOV word is replaced by an unknown-word token. The problem is worse in the summarization task: specific nouns (names, places, etc.) with low frequency are often the key information of a summary, yet the vocabulary is built from the most frequent words, so those specific nouns may not appear in it.

The pointer and byte pair encoding (BPE) mechanisms are both used to tackle the OOV problem. BPE was originally a simple data compression technique that iteratively replaces the most frequent pair of bytes with an unused byte. Sennrich et al.[36] first use this technique for word segmentation by merging characters instead of bytes, so that a fixed vocabulary can hold more subwords and alleviate the OOV problem.
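To make the merge procedure concrete, the sketch below learns a few BPE merges in the character-merging style described above. It is a simplified illustration: the toy word-frequency table, the end-of-word marker `</w>` and the helper names are assumptions of this example, not details of our preprocessing.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every standalone occurrence of "a b" with the merged symbol "ab".
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy vocabulary: words split into characters, with hypothetical frequencies.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

merges = []
for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent pair (first one on ties)
    merges.append(best)
    vocab = merge_pair(best, vocab)
```

After three merges the frequent suffix `est</w>` has become a single subword, so rare words ending in it can be represented without an unknown-word token.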

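The pointer mechanism allows both copying words from the source text and generating words from a fixed vocabulary. At each decoder time step $t$, the generation probability is calculated as:

$$p_{gen} = \sigma(W_g d_t + b_g)$$

where the weight $W_g$ and the bias $b_g$ are learnable parameters, $\sigma$ is the sigmoid function, and $d_t$ is the last decoder output state. We compute the final word distribution via the pointer network:

$$P_{final} = p_{gen}\, P_{vocab} + (1 - p_{gen}) \sum_{i} \alpha_{t,i}\, \mathbf{1}_{x_i}$$

where $x$ is the representation of the input, $\mathbf{1}_{x_i}$ is the one-hot indicator vector for $x_i$, $\alpha_t$ is the probability distribution over the source words and $P_{final}$ is the final probability distribution.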
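The mixing of the generation and copy distributions can be illustrated with a small dictionary-based sketch. It assumes the usual pointer-generator form, in which the copy probability of a word is the sum of the attention weights on all source positions holding that word; the function name and the toy numbers are hypothetical.

```python
def final_distribution(p_gen, p_vocab, attention, source_tokens):
    # Generation part: scale the fixed-vocabulary distribution by p_gen.
    final = {word: p_gen * p for word, p in p_vocab.items()}
    # Copy part: add (1 - p_gen) * attention weight for each source position.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final

p_vocab = {"the": 0.5, "grid": 0.3, "<unk>": 0.2}   # toy vocabulary distribution
source_tokens = ["national", "grid", "national"]     # "national" is out of vocabulary
attention = [0.5, 0.3, 0.2]                          # toy attention over the source
final = final_distribution(0.6, p_vocab, attention, source_tokens)
```

In this toy case the OOV word "national" receives probability 0.4 * (0.5 + 0.2) = 0.28 purely from copying, and the final distribution still sums to one.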

3.4 Aggregation Mechanism

The overview of our model is shown in Figure 1. To enhance memory ability, we add the aggregation mechanism between the encoder and decoder to collect history information. The aggregation mechanism reconstructs the encoder’s final hidden states by reviewing history information. We put forward two basic aggregation approaches that prove effective in our task.

Figure 2: The overview of projection aggregation mechanism with 4 encoder layers.

The first approach uses fully connected networks to collect history information (see Figure 2). The input first goes through the normal encoder layers to obtain the output of each layer; we then select the outputs of the middle layers and concatenate them as the input of the fully connected networks to obtain the history information. Finally, we compute multi-head attention between the history state and the output of the last encoder layer. This process can be formulated as:


where the weights and bias of the fully connected networks are learnable parameters, and the number of selected layers is a hyper-parameter to be explored. Then we add the multi-head attention layer between the last encoder layer output and the history information. The output of the attention is the final state of the encoder:


where the history information serves as the query and the output of the last encoder layer serves as the key and value.
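A rough sketch of the projection aggregation idea is given below. It is written under several assumptions that are not confirmed by the text: the selected "middle" layers are taken to be the `k` layers directly before the last one, the fully connected network is a single linear layer, and single-head attention stands in for multi-head attention to keep the example short. All names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Single-head scaled dot-product attention (simplification of multi-head).
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def projection_aggregation(layer_outputs, k, W, b):
    # layer_outputs: list of (n, d) arrays, one per encoder layer, last = final.
    final = layer_outputs[-1]
    selected = layer_outputs[-1 - k:-1]            # assumed: k layers before the last
    concat = np.concatenate(selected, axis=-1)     # (n, k * d)
    history = concat @ W + b                       # linear projection back to (n, d)
    # History acts as the query; the final encoder states are key and value.
    return attention(history, final, final)

rng = np.random.default_rng(0)
n, d, num_layers, k = 6, 8, 4, 2
layer_outputs = [rng.normal(size=(n, d)) for _ in range(num_layers)]
W = rng.normal(size=(k * d, d)) * 0.1
b = np.zeros(d)
new_final = projection_aggregation(layer_outputs, k, W, b)
```

The returned matrix has the same shape as the original final encoder states, so the decoder can consume it without any change.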

Figure 3: The overview of attention aggregation mechanism with 4 encoder layers.

The second approach uses the attention mechanism to collect history information (see Figure 3). We select the outputs of the middle encoder layers and iteratively compute multi-head attention between the current encoder layer output and the previous history information. The history information at each step can be calculated as follows:


where the index runs over the selected encoder layers, and the inputs are the previous history state and the output of the corresponding encoder layer. We iterate this calculation up to the last selected encoder layer to obtain the final history hidden states, which we use as the final states of the encoder.
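The iterative attention aggregation can be sketched as follows. This is only an illustration under stated assumptions: the history is initialized with the output of the first selected layer, the selected layers are the `k` layers before the last one followed by the last layer itself, the history is used as the query, and single-head attention replaces multi-head attention for brevity. The function names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Single-head scaled dot-product attention (simplification of multi-head).
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def attention_aggregation(layer_outputs, k):
    # Assumed selection: k history layers followed by the final encoder layer.
    selected = layer_outputs[-1 - k:]
    history = selected[0]                  # assumed initial history state
    for h in selected[1:]:
        # Previous history is the query; the current layer output is key/value.
        history = attention(history, h, h)
    return history                         # used as the encoder final states

rng = np.random.default_rng(0)
n, d, num_layers = 6, 8, 4
layer_outputs = [rng.normal(size=(n, d)) for _ in range(num_layers)]
agg_one = attention_aggregation(layer_outputs, k=1)
agg_two = attention_aggregation(layer_outputs, k=2)
```

With `k=1` the loop runs once, which corresponds to a single attention step between the history and the final encoder states.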

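Finally, we define the objective function. Given the golden summary $y^*$ and the input text $x$, we minimize the negative log-likelihood of the target word sequence. The training objective is:

$$L(\theta) = \frac{1}{N} \sum_{(x, y^*)} -\log P(y^* \mid x; \theta)$$

where $\theta$ denotes the model parameters and $N$ is the number of source-summary text pairs in the training set. The loss for one sample is the sum of the losses of the generated words over the time steps $t$:

$$-\log P(y^* \mid x; \theta) = -\sum_{t=1}^{T} \log P(y^*_t \mid y^*_{<t}, x; \theta)$$

where $P(y^*_t \mid y^*_{<t}, x; \theta)$ is calculated at decoder time step $t$ and $T$ is the total number of decoding steps.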
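The per-sample loss is simply a sum of negative log-probabilities over decoding steps, which the following sketch illustrates. Averaging over the samples is an assumption of this example, and the function names are hypothetical.

```python
import math

def sequence_nll(gold_word_probs):
    # gold_word_probs[t] is the probability the model assigns to the
    # gold summary word at decoding step t.
    return -sum(math.log(p) for p in gold_word_probs)

def corpus_loss(samples):
    # Assumed here: the objective averages the per-sample losses.
    return sum(sequence_nll(s) for s in samples) / len(samples)

loss_one = sequence_nll([0.5, 0.25])            # -(log 0.5 + log 0.25) = log 8
loss_all = corpus_loss([[0.5, 0.25], [1.0]])    # second sample contributes zero
```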

4 Experiments

In this section, we first define the setup of our experiment and then analyze the results of our experiments.

4.1 Experimental Setup

Dataset We conduct our experiments on the CNN/DailyMail dataset[13, 30], which has been widely used for long-document summarization tasks. The corpus is constructed by collecting online news articles and human-generated summaries from the CNN and Daily Mail websites. We choose the non-anonymized version[35], which does not replace named entities with unique identifiers. The dataset contains pairs of articles and summaries. The details of this dataset are in section 4.2.

Model ROUGE-1 ROUGE-2 ROUGE-L
lead-3 40.24 17.52 36.34
words-1vt2k-temp-att [30] 36.64 15.66 33.42
ConvS2S [8] 39.75 17.29 36.54
Pointer Generator + Coverage [35] 39.53 17.28 36.38
Pointer Generator + Coverage + cbdec + RL[15] 40.66 17.87 37.06
Inconsistency Loss [14] 40.68 17.97 37.13
rnn-ext + abs + RL + rerank [3] 40.88 17.80 38.54
Transformer 40.05 17.72 36.77
Aggregation Transformer(attention) 41.06 18.02 38.04
Table 1: Comparison of different model results on the CNN/DailyMail test dataset using F1 scores of ROUGE-1, ROUGE-2 and ROUGE-L with confidence interval. The first part is previous abstractive baseline models, the second part is the Transformer baseline model and our Transformer model with aggregation mechanism. The best scores are bolded.

Training Details We conduct our experiments on 1 NVIDIA Tesla V100. During training and testing we truncate the source text to words and we build a shared vocabulary for the encoder and decoder with small vocabulary size k, made possible by the use of the pointer or BPE mechanism. Word embeddings are learned during training. We use the Adam optimizer with initial learning rate and parameter in the training phase. We adapt the learning rate according to the loss on the validation set (halving the learning rate if the validation loss has not gone down for two epochs). And we use regularization with all . The training process converges in about 200,000 steps for each model.

In the generation phase, we use the beam search algorithm to produce multiple summary candidates in parallel to get better summaries, and during the search we add repeated words to a blacklist to avoid duplication. To avoid favoring shorter generated summaries, we apply a length penalty. In detail, we set beam size , no-repeated n-gram size and length penalty parameter . We also constrain the maximum and minimum length of the generated summary to and respectively.
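The blacklist of repeated n-grams used during beam search can be sketched as below. This is an illustrative helper rather than the decoding code used in our experiments: given the tokens generated so far on one beam, it returns the tokens that would complete an n-gram already present in that prefix. The function name and the example prefixes are hypothetical.

```python
def banned_next_tokens(prefix, n):
    # Tokens that would repeat an n-gram already generated in `prefix`.
    context = tuple(prefix[len(prefix) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(prefix) - n + 1):
        # If an earlier (n-1)-gram equals the current context, the token
        # that followed it would create a repeated n-gram.
        if tuple(prefix[i:i + n - 1]) == context:
            banned.add(prefix[i + n - 1])
    return banned

banned_bigram = banned_next_tokens(["the", "pylon", "is", "the"], n=2)
banned_trigram = banned_next_tokens(["a", "b", "c", "a", "b"], n=3)
```

During search, the scores of the banned tokens can be set to negative infinity before the top candidates of each beam are selected.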

We evaluate our system using the F-measures of the ROUGE-1, ROUGE-2 and ROUGE-L metrics, which respectively measure the overlap of unigrams, the overlap of bigrams, and the longest common subsequence between the golden summary and the system summary. The scores are computed with the Python pyrouge package.
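For intuition about what these metrics measure, the sketch below computes ROUGE-N precision, recall and F1 from clipped n-gram counts, plus the longest common subsequence length that underlies ROUGE-L. It is a simplified illustration and not a replacement for the pyrouge package, which applies additional preprocessing; the function names and example sentences are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())            # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def lcs_length(a, b):
    # Dynamic programming over one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

reference = "the cat sat on the mat".split()
candidate = "the cat on the mat".split()
r1 = rouge_n(candidate, reference, 1)
r2 = rouge_n(candidate, reference, 2)
lcs = lcs_length(candidate, reference)
```

ROUGE-L precision and recall follow by dividing `lcs` by the candidate and reference lengths respectively.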

Experiment explorations We explore the influence of different hyper-parameter setups on the model’s performance, covering 11 different experiment settings.

Firstly, we explore the number of Transformer encoder/decoder layers (see Table 3).

Secondly, we examine the different aggregation methods with 1 aggregation layer (see Table 4). This exploration includes our baseline model (m1), the Transformer model with the add function (m2), the projection aggregation method (m4) and the attention aggregation method (m6).

Thirdly, we explore the performance with different numbers of aggregation layers (see Table 4). There are 3 groups of experiments: the Transformer adding the last 2 layers (m2) and the last 3 layers (m3), the Transformer with the projection aggregation method using 1 layer (m4) and 2 layers (m5), and the Transformer with the attention aggregation method using 1 layer (m6) and 2 layers (m7). For all models except those in the exploration of encoder/decoder layers, we use 4 encoder and 4 decoder layers.

Human Evaluation ROUGE scores are widely used for the automatic evaluation of summarization, but they have great limitations in capturing semantic and syntactic information. We therefore use manual evaluation to verify the performance of our models. We perform a small-scale human evaluation in which we randomly select about 100 generated summaries from each of the 3 models (Pointer Generator, Transformer, and aggregation Transformer) and randomly shuffle the order of the 3 summaries to anonymize model identities. We then let 20 anonymous volunteers with excellent English literacy skills each score 10 random summaries for each of the 3 models on a scale from 1 to 5 (a high score means a high-quality summary), and we use the average score of each summary as its final score. The evaluation criteria are as follows: (1) salient: the summary contains the important points of the source text, (2) fluency: the summary is consistent with human reading habits and has few grammatical errors, (3) non-repeated: the summary does not contain much redundant wording.

4.2 Results

Dataset Train Valid Test
CNN/DailyMail(summarization) 287226 13368 11490
max-token-len(art/abs) 2882 / 2096 2134 / 1684 2377 / 678
avg-token-len(art/abs) 790 / 55 768 / 61 777 / 58
Our Dataset(summarization) 48600 4800 6600
max-token-len(art/abs) 1914 / 80 1687 / 80 1670 / 80
avg-token-len(art/abs) 768 / 65 763 / 65 769 / 65
iwslt14-de-en(translation) 160239 7283 6750
max-token-len(de/en) 244 / 228 169 / 154 245 / 217
avg-token-len(de/en) 24 / 24 24 / 24 23 / 22
wmt17-en-de(translation) 3961179 40058 3003
max-token-len(en/de) 250 / 250 224 / 233 101 / 93
avg-token-len(en/de) 28 / 29 28 / 29 26 / 27
Table 2: Comparison of translation and summarization datasets. We remove sentence tags from the source text and split it on whitespace, then count the maximum and average token length in each dataset.

Dataset Analysis To demonstrate the difference between the summarization and translation tasks, we compare datasets for the two tasks (see Table 2). The summarization dataset CNN/DailyMail contains 287226 training pairs, 13368 validation pairs, and 11490 test pairs. The translation datasets iwslt14 and wmt17 have 160239/3961179 training pairs, 7283/40058 validation pairs, and 6750/3003 test pairs respectively. The comparison reveals the characteristics of the two tasks. A summarization source text can contain more than 2000 words and is on average more than 10 times longer than the target text, while a translation source text contains at most 250 words and is on average about as long as the target text. Because of that, summarization needs a strong encoder with memory ability to decide where to attend and what to ignore.
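The length statistics of this kind can be reproduced with a few lines of code. The sketch below assumes that sentence tags have already been removed and that tokens are obtained by whitespace splitting; the function name and the two toy texts are hypothetical.

```python
def length_stats(texts):
    # Token length = number of whitespace-separated tokens in each text.
    lengths = [len(text.split()) for text in texts]
    return max(lengths), sum(lengths) / len(lengths)

articles = [
    "national grid has revealed the uk 's first new pylon",
    "it is a third shorter",
]
max_len, avg_len = length_stats(articles)
```

Applying the same function to the source and target sides of a dataset gives the kind of length ratio that separates summarization from translation.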

Source Text(truncated 500): (……) national grid has revealed the uk ’s first new pylon for nearly 90 years . called the t-pylon -lrb- artist ’s illustration shown -rrb- it is a third shorter than the old lattice pylons . but it is able to carry just as much power - 400,000 volts . it is designed to be less obtrusive and will be used for clean energy purposes . national grid is building a training line of the less obtrusive t-pylons at their eakring training academy in nottinghamshire . britain ’s first pylon , erected in july 1928 near edinburgh , was designed by architectural luminary sir reginald blomfield , inspired by the greek root of the word ‘ pylon ’ -lrb- meaning gateway of an egyptian temple -rrb- . the campaign against them - they were unloved even then - was run by rudyard kipling , john maynard keynes and hilaire belloc . five years later , the biggest peacetime construction project seen in britain , the connection of 122 power stations by 4,000 miles of cable , was completed . it marked the birth of the national grid and was a major stoking of the nation ’s industrial engine and a vital asset during the second world war (……)
Ground Truth: national grid has revealed the uk ’s first new pylon for nearly 90 years . called the t-pylon it is a third shorter than the old lattice pylons . but it is able to carry just as much power - 400,000 volts . it is designed to be less obtrusive and will be used for clean energy .
Transformer Baseline: the t-pylon -lrb- artist ’s shown -rrb- it is a third shorter than the old lattice pylons . but it is able to carry just as much power - 400,000 volts . it is designed to be less obtrusive and will be used for clean energy purposes .
Our model: national grid has revealed the uk ’s first new pylon for nearly 90 years . called the t-pylon it is a third shorter than the old lattice pylons . but it is able to carry just as much power - 400,000 volts . it is designed to be less obtrusive and will be used for clean energy purposes .

Figure 4: Comparison of the ground truth summary and the summaries generated by 2 abstractive summarization models on the CNN/DailyMail dataset. Red represents missed information, blue marks unnecessary information and green signifies appropriate information.

Quantitative Analysis The experimental results are given in Table 1. Overall, our model improves on all other baselines (as reported in their articles) in ROUGE-1 and ROUGE-2 F1 scores, while it gets a lower ROUGE-L F1 score than the RL (Reinforcement Learning) model[3]. According to Celikyilmaz et al.[2], the ROUGE-L F1 score is not correlated with summary quality, and our model generates the most novel words among the compared models in the novelty experiment (Figure 5). Novel words are harmful to ROUGE-2 and ROUGE-L F1 scores. This result also indicates that our model is more abstractive.

Figure 4 shows the ground truth summary and the summaries generated by the Transformer baseline model and by our aggregation Transformer using the attention aggregation method. The source text shown is the main fragment of the truncated text. Compared with the aggregation Transformer, the summary generated by the Transformer baseline model has two problems. Firstly, it lacks the salient information marked in red in the source text. Secondly, it contains the unnecessary information marked in blue in the source text.

We hold the opinion that the Transformer baseline model has weaker memory ability than our model. It therefore cannot recall information far from its current state, which leads to missing salient information, and it may remember irrelevant information, which leads to unnecessary words in the summaries. Our model uses the aggregation mechanism, which reviews the earlier information to enhance the model’s memory capacity. As a result, the aggregation mechanism helps our model generate salient and non-repetitive words in summaries.

E/D R1-P R1-R R1-F1 R2-P R2-R R2-F1 RL-P RL-R RL-F1
4/4 40.46 41.53 40.05 18.11 18.42 17.72 36.42 37.15 36.77
4/3 40.88 40.47 39.75 18.40 17.93 17.63 37.07 36.70 36.50
4/2 41.70 39.23 39.54 18.78 17.26 17.47 37.96 35.87 36.51
2/4 39.88 41.26 39.57 17.67 18.07 17.30 35.97 37.00 35.98
3/4 40.46 40.01 39.80 18.05 18.10 17.54 36.63 37.07 36.43
Table 3: Comparison of different numbers of encoder (E) and decoder (D) layers. We report results on the CNN/DailyMail test dataset using precision (P), recall (R) and F1 scores of ROUGE-1 (R1), ROUGE-2 (R2) and ROUGE-L (RL).

Encoder/Decoder Layers Analysis The first exploration experiment consists of Transformer models with different numbers of encoder and decoder layers. We only experiment with configurations in which the number of encoder/decoder layers is no more than 4. We also tried 6 encoder and decoder layers; however, this showed no notable difference from 4 encoder and decoder layers while adding many parameters and taking more time to converge. Therefore our Transformer baseline model has 4 encoder and 4 decoder layers.

The results of decreasing the number of encoder or decoder layers are shown in Table 3. Comparing the models, we get lower precision but higher recall when the encoder layers are decreased, and the opposite when the decoder layers are decreased. Meanwhile, we get a higher ROUGE-1 F1 score and lower ROUGE-2 and ROUGE-L F1 scores in the models with fewer decoder layers than in the models with correspondingly fewer encoder layers. Therefore, we can conclude that the encoder captures the features of the source text while the decoder keeps the summaries consistent.

Model ROUGE-1 ROUGE-2 ROUGE-L
(m1)Transformer 40.05 17.72 36.77
(m2)Transformer(add 1 layer) 39.79 17.52 36.32
(m3)Transformer(add 2 layer) 39.69 17.34 36.15
(m4)Agg-Transformer(proj 1 layer) 40.58 17.77 36.60
(m5)Agg-Transformer(proj 2 layer) 40.67 17.84 36.70
(m6)Agg-Transformer(attn 1 layer) 41.06 18.02 38.04
(m7)Agg-Transformer(attn 2 layer) 40.03 17.59 36.60
Table 4: The aggregation mechanism experiments. Our experiments use 3 aggregation methods with 2 different numbers of aggregation layers.

Aggregation mechanism Analysis The second exploration experiment compares our baseline models (m1, m2) with aggregation Transformer models using different aggregation mechanisms (m4, m6) in Table 4. If we simply add the last layer(s) in the baseline model (m2), the scores decrease, contrary to our expectation. Although simply adding the last layer(s) can re-distribute the encoder final states with history states, it averages the importance weights of those layers, which may make things worse. Compared with the baseline model, our aggregation models (m4, m6) improve the ROUGE-1 and ROUGE-2 scores, and m6 also improves ROUGE-L. We compute attention between the history (query) and the encoder final states (key/value) to re-distribute the final states, so that the encoder gains the ability to fuse history information with different importance.

The third exploration contains 3 groups of experiments: the add group (m2, m3), the projection group (m4, m5) and the attention group (m6, m7). The aggregation Transformer models here use different numbers of aggregation layers. We also experimented with 3 aggregation layers in each of the 3 groups, but these models all get extremely low ROUGE scores (roughly ROUGE-1 39.3, ROUGE-2 14.5, ROUGE-L 34.3 for all 3 models). They all incorporate the output of the first encoder layer, which may lack semantic information and thus harm the re-distribution of the encoder final states. So we do not compare with those models explicitly.

For the add aggregation group, the ROUGE scores go down as we increase the number of added layers. If we add more layers, the final state distribution tends towards a uniform distribution, which confuses the decoder about the key ideas of the source text. For that reason, we get worse scores when we add more layers.

For the projection aggregation group, the ROUGE scores rise as we increase the aggregation layers. If we aggregate more layers, the history states contain more information, which leads to the performance improvement. However, we lose a lot of information when the number of aggregation layers increases further, and we achieve the best result with 2 aggregation layers.

For the attention aggregation group, we get the best score with 1 aggregation layer, and the ROUGE scores decline if we increase the aggregation layers. One attention layer is enough to focus on history states, because too many attention layers may depend excessively on history states. If the encoder final distribution focuses more on shallow layers, which introduce a lot of useless information, the encoder is hindered in capturing salient features.

Figure 5: The statistics of novel n-grams and sentences. Our model can generate far more novel n-grams and sentences than Pointer Generator and Transformer baseline.

Abstractive analysis Figure 5 shows that our model copies whole sentences from source texts at a rate close to that of the reference summaries. However, there is still a huge gap in novel n-gram generation, and this is the main area for improvement.

In particular, the Pointer Generator model tends to produce summaries with few novel words because of its low rate of novel word generation. The Transformer baseline model can generate novel summaries, and our model achieves a further improvement (0.5, 4.6, 7.8 and 10.1% novelty improvement for n-grams with n = 1, 2, 3, 4) compared to the Transformer baseline model. Because our model reviews history states and re-distributes the encoder final states, it obtains a more accurate semantic representation. This also suggests that our aggregation mechanism improves the memory capability of the encoder.
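The notion of novelty used in this analysis can be illustrated with a short sketch that measures the share of summary n-grams absent from the source text. Counting every n-gram occurrence in the summary (rather than unique n-grams) is an assumption of this example, and the function name and toy inputs are hypothetical.

```python
def novel_ngram_rate(source, summary, n):
    # Share of summary n-grams that never occur in the source text.
    source_ngrams = {tuple(source[i:i + n]) for i in range(len(source) - n + 1)}
    summary_ngrams = [tuple(summary[i:i + n]) for i in range(len(summary) - n + 1)]
    if not summary_ngrams:
        return 0.0
    novel = sum(gram not in source_ngrams for gram in summary_ngrams)
    return novel / len(summary_ngrams)

source = "a b c d".split()
summary = "a b e".split()
unigram_rate = novel_ngram_rate(source, summary, 1)   # only "e" is novel
bigram_rate = novel_ngram_rate(source, summary, 2)    # only ("b", "e") is novel
```

A purely extractive summary scores zero for every n, while a more abstractive one shows rates that grow with n.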

| Model | Salient | Fluency | Non-Repeated |
|---|---|---|---|
| Pointer Generator | 3.37 | 3.12 | 3.17 |
| Pointer Generator + Coverage | 3.42 | 3.23 | 3.61 |
| Transformer | 3.56 | 3.30 | 3.67 |
| Transformer + Aggregation | **3.87** | **3.37** | **3.78** |

Table 5: Human evaluation of three models. We compare the average score of salient, fluency and non-repeated. The best scores are bolded.

Human Evaluation We conduct our human evaluation with the setup described in section 4.1, and the results are shown in Table 5. We compare only three models on the salient, fluency and non-repeated criteria, and our model gets the highest score on all of them. On the fluency criterion, however, none of the models scores well, which suggests that all current models still struggle to understand semantic information. The Pointer Generator is our baseline abstractive summarization approach and has the lowest scores. Adding the coverage mechanism to the Pointer Generator avoids generating overlapping words, which makes summaries more fluent and less repetitive. The Transformer is a newer abstractive summarization model based on the attention mechanism, and it performs better than the Pointer Generator model. Equipping the Transformer model with the aggregation mechanism brings a clear improvement on all 3 criteria.

4.3 Our Chinese Experiments

We build our Chinese summarization dataset by crawling a news website (footnote 7) and processing the raw web page contents into character-based texts. The details of our dataset are shown in Table 2; the average lengths of its source texts and summaries are similar to those of the CNN/DM dataset. It is a temporary dataset that currently contains only 60,000 text pairs in total, and we are still adding data to it.

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Lead-3 | 54.09 | 42.46 | 34.56 |
| Pointer Generator | 55.49 | 43.59 | 48.03 |
| Pointer Generator + Coverage | 55.64 | 43.80 | 48.08 |
| Transformer | 52.69 | 39.86 | 43.66 |
| Transformer + Aggregation | **58.00** | **44.42** | **48.85** |

Table 6: Experiments on our Chinese dataset. We only experiment on three baseline models and evaluate results with ROUGE F metrics. The best scores are bolded.

We also experiment on our Chinese dataset and evaluate the results with ROUGE (footnote 8) metrics. Our model gets the highest scores, although the Pointer Generator model also gets rather high ROUGE scores (see Table 6). This is because the dataset contains few novel words, which suits the Pointer Generator model. Our dataset contains (6.17, 14.51, 17.99, 20.10)% novel (1,2,3,4)-grams and 59.90% novel sentences; by comparison, the novel n-gram and sentence frequencies of CNN/DM in Figure 5 are (14.47, 54.75, 73.32, 82, 98.16)% respectively. The Pointer Generator model generates summaries containing fewer novel words and sentences, which leads to high scores on our Chinese dataset. Finally, compared with the Transformer baseline model, our model improves by 5.31 in ROUGE-1, 4.56 in ROUGE-2 and 5.19 in ROUGE-L.
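As a quick check of the improvements quoted above, the following sketch recomputes the gains from the Table 6 scores. The column order (ROUGE-1, ROUGE-2, ROUGE-L) is inferred from the improvements reported in the text, and the helper name `rouge_gain` is only illustrative.

```python
# ROUGE F scores copied from Table 6.
# Assumed column order: ROUGE-1, ROUGE-2, ROUGE-L.
table6 = {
    "Pointer Generator + Coverage": (55.64, 43.80, 48.08),
    "Transformer": (52.69, 39.86, 43.66),
    "Transformer + Aggregation": (58.00, 44.42, 48.85),
}


def rouge_gain(baseline, model):
    """Absolute per-metric gain of `model` over `baseline`, in ROUGE points."""
    return tuple(round(m - b, 2) for b, m in zip(baseline, model))


ours = table6["Transformer + Aggregation"]
gains_vs_transformer = rouge_gain(table6["Transformer"], ours)
gains_vs_pg_coverage = rouge_gain(table6["Pointer Generator + Coverage"], ours)

print(gains_vs_transformer)  # (5.31, 4.56, 5.19)
print(gains_vs_pg_coverage)  # (2.36, 0.62, 0.77)
```

The first tuple matches the gains over the Transformer baseline reported in the text. The second shows that the margin over the Pointer Generator with coverage is much smaller, in line with the observation that the copy-oriented model suits this dataset.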

5 Conclusions

In this paper, we propose a new aggregation mechanism for the Transformer model that enhances the memory ability of the encoder. With the aggregation mechanism, our model outperforms both the Transformer baseline model and the Pointer Generator on the CNN/DailyMail dataset in terms of ROUGE scores. We explore three aggregation methods (add, projection and attention), among which the attention method performs best. We also explore how the number of aggregation layers affects performance in order to improve the best score. We build a Chinese dataset for the summarization task and give its statistics in Table 2. Our proposed method also achieves the best performance on our Chinese dataset.

In the future, we will explore memory networks to collect history information and try to send history information directly to the decoding process to improve performance on the summarization task. The aggregation mechanism can also be transferred to other generation tasks.


This work was supported by the National Natural Science Foundation of China (Grant No. 61602474).


  1. School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
  2. Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
  3. Corresponding author: Chuang Zhang,


  1. D. Bahdanau, K. Cho and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1, §3.1.
  2. A. Celikyilmaz, A. Bosselut, X. He and Y. Choi (2018) Deep communicating agents for abstractive summarization. arXiv preprint arXiv:1803.10357. Cited by: §1, §2.2, §4.2.
  3. Y. Chen and M. Bansal (2018) Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv: Computation and Language. Cited by: §4.2, Table 1.
  4. J. Cheng and M. Lapata (2016) Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252. Cited by: §2.1.
  5. S. Chopra, M. Auli and A. M. Rush (2016) Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 93–98. Cited by: §2.2.
  6. J. M. Conroy and D. P. O'Leary (2001) Text summarization via hidden markov models. pp. 406–407. Cited by: §2.1.
  7. X. Duan, M. Yin, M. Zhang, B. Chen and W. Luo (2019) Zero-shot cross-lingual abstractive sentence summarization through teaching generation and attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3162–3172.
  8. J. Gehring, M. Auli, D. Grangier, D. Yarats and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. arXiv: Computation and Language. Cited by: Table 1.
  9. S. Gehrmann, Y. Deng and A. M. Rush (2018) Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792. Cited by: §2.2.
  10. J. Gu, Z. Lu, H. Li and V. O. K. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. 1, pp. 1631–1640. Cited by: §2.2.
  11. C. Gulcehre, S. Ahn, R. Nallapati, B. Zhou and Y. Bengio (2016) Pointing the unknown words. arXiv: Computation and Language.
  12. H. Guo, R. Pasunuru and M. Bansal (2018) Soft layer-specific multi-task summarization with entailment and question generation. arXiv: Computation and Language.
  13. K. M. Hermann, T. Kociský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman and P. Blunsom (2015) Teaching machines to read and comprehend. arXiv: Computation and Language. Cited by: §4.1.
  14. W. Hsu, C. Lin, M. Lee, K. Min, J. Tang and M. Sun (2018) A unified model for extractive and abstractive summarization using inconsistency loss. 1, pp. 132–141. Cited by: §2.2, Table 1.
  15. Y. Jiang and M. Bansal (2018) Closed-book training to improve summarization encoder memory. pp. 4067–4077. Cited by: Table 1.
  16. P. Kouris, G. Alexandridis and A. Stafylopatis (2019) Abstractive text summarization based on deep learning and semantic content generalization. pp. 5082–5092.
  17. J. M. Kupiec, J. O. Pedersen and F. R. Chen (1995) A trainable document summarizer. pp. 68–73. Cited by: §2.1.
  18. L. Lebanoff, K. Song, F. Dernoncourt, D. S. Kim, S. Kim, W. Chang and F. Liu (2019) Scoring sentence singletons and pairs for abstractive summarization. arXiv: Computation and Language.
  19. C. Lin (2004) Rouge: a package for automatic evaluation of summaries.
  20. H. Lin and V. Ng (2019) Abstractive summarization: a survey of the state of the art. 33, pp. 9815–9822.
  21. J. Lin, X. Sun, S. Ma and Q. Su (2018) Global encoding for abstractive summarization. arXiv: Computation and Language.
  22. L. Liu, Y. Lu, M. Yang, Q. Qu, J. Zhu and H. Li (2017) Generative adversarial network for abstractive text summarization. arXiv: Computation and Language.
  23. Y. Liu and M. Lapata (2019) Hierarchical transformers for multi-document summarization. pp. 5070–5081.
  24. Y. Liu and M. Lapata (2019) Text summarization with pretrained encoders. arXiv: Computation and Language.
  25. K. Lopyrev (2015) Generating news headlines with recurrent neural networks. arXiv: Computation and Language. Cited by: §1.
  26. T. Makino, T. Iwakura, H. Takamura and M. Okumura (2019) Global optimization under length constraint for neural text summarization. pp. 1039–1048.
  27. E. Moroshko, G. Feigenblat, H. Roitman and D. Konopnicki (2019) An editorial network for enhanced document summarization. arXiv: Computation and Language.
  28. R. Nallapati, F. Zhai and B. Zhou (2016) SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. arXiv: Computation and Language. Cited by: §2.1.
  29. R. Nallapati, B. Zhou and M. Ma (2017) Classify or select: neural architectures for extractive document summarization. arXiv: Computation and Language.
  30. R. Nallapati, B. Zhou, C. N. D. Santos, C. Gulcehre and B. Xiang (2016) Abstractive text summarization using sequence-to-sequence rnns and beyond. pp. 280–290. Cited by: §1, §2.2, §4.1, Table 1.
  31. S. Narayan, S. B. Cohen and M. Lapata (2018) Ranking sentences for extractive summarization with reinforcement learning. 1, pp. 1747–1759.
  32. M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. pp. 48–53.
  33. T. Oya, Y. Mehdad, G. Carenini and R. T. Ng (2014) A template-based abstractive meeting summarization: leveraging summary and source text relationships. pp. 45–53.
  34. A. M. Rush, S. Chopra and J. Weston (2015) A neural attention model for abstractive sentence summarization. arXiv: Computation and Language. Cited by: §1, §2.2.
  35. A. See, P. J. Liu and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. 1, pp. 1073–1083. Cited by: §1, §2.2, §4.1, Table 1.
  36. R. Sennrich, B. Haddow and A. Birch (2015) Neural machine translation of rare words with subword units. arXiv: Computation and Language. Cited by: §3.3.
  37. I. Sutskever, O. Vinyals and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
  38. S. Takase, J. Suzuki, N. Okazaki, T. Hirao and M. Nagata (2016) Neural headline generation on abstract meaning representation. pp. 1054–1059.
  39. J. Tan, X. Wan and J. Xiao (2017) Abstractive document summarization with a graph-based attentional neural model. 1, pp. 1171–1181. Cited by: §1.
  40. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin (2017) Attention is all you need. arXiv: Computation and Language. Cited by: §1, §2.2.
  41. O. Vinyals, M. Fortunato and N. Jaitly (2015) Pointer networks. pp. 2692–2700.
  42. K. Woodsend and M. Lapata (2010) Automatic generation of story highlights. pp. 565–574. Cited by: §2.1.
  43. K. Yao, L. Zhang, T. Luo and Y. Wu (2018) Deep reinforcement learning for extractive document summarization. Neurocomputing 284, pp. 52–62.
  44. Y. You, W. Jia, T. Liu and W. Yang (2019) Improving abstractive document summarization with salient information modeling. pp. 2132–2141.
  45. W. Zeng, W. Luo, S. Fidler and R. Urtasun (2017) Efficient summarization with read-again and copy mechanism. arXiv: Computation and Language.
  46. H. Zhang, Y. Gong, Y. Yan, N. Duan, J. Xu, J. Wang, M. Gong and M. Zhou (2019) Pretraining-based natural language generation for text summarization. arXiv: Computation and Language.
  47. X. Zhang, M. Lapata, F. Wei and M. Zhou (2018) Neural latent extractive document summarization. pp. 779–784. Cited by: §2.1.
  48. X. Zhang, F. Wei and M. Zhou (2019) HIBERT: document level pre-training of hierarchical bidirectional transformers for document summarization. pp. 5059–5069.
  49. Q. Zhou, N. Yang, F. Wei, S. Huang, M. Zhou and T. Zhao (2018) Neural document summarization by jointly learning to score and select sentences. 1, pp. 654–663. Cited by: §2.1.