Improving Abstractive Text Summarization with History Aggregation
Recent neural sequence-to-sequence models have provided feasible solutions for abstractive summarization. However, such models still struggle with long-range text dependencies in the summarization task. A high-quality summarization system usually depends on a strong encoder that can refine important information from long input texts, so that the decoder can generate salient summaries from the encoder's memory. In this paper, we propose an aggregation mechanism based on the Transformer model to address the challenge of long text representation. Our model can review history information to give the encoder more memory capacity. Empirically, we apply our aggregation mechanism to the Transformer model and experiment on the CNN/DailyMail dataset, achieving higher-quality summaries than several strong baseline models on the ROUGE metrics.
1 Introduction
The task of text summarization is to automatically compress a long text to a shorter version while keeping the salient information. It can be divided into two approaches: extractive and abstractive summarization. The extractive approach selects sentences or phrases from the source text directly. In contrast, the abstractive approach first understands the semantic information of the source text and then generates novel words that do not appear in the source text. Extractive summarization is easier, but abstractive summarization is closer to the way humans process text. This paper focuses on the abstractive approach. Unlike other sequence generation tasks in NLP (Natural Language Processing) such as NMT (Neural Machine Translation), in which the lengths of the input and output text are close, the summarization task exhibits a severe imbalance between the two lengths. This means that a summarization model must capture long-distance text dependencies.
As RNNs have the ability to process sequential text, variants of sequence-to-sequence models based on them have emerged on a large scale and can generate promising results. To handle long-distance text dependencies, Bahdanau et al. first propose the attention mechanism, which allows each decoder step to refer to all encoder hidden states. Rush et al. first incorporate the attention mechanism into the summarization task. There are also other attention-based models that ease the problem of long input texts for summarization, such as Bahdanau attention, hierarchical attention, graph-based attention and simple attention. Celikyilmaz et al. segment the text, encode each segment independently, and then broadcast each segment's encoding to the others. Though these systems are promising, they exhibit undesirable behaviors such as producing inaccurate factual details and repeating themselves, as it is hard for a one-pass encoder to decide where to attend and what to ignore.
Modeling an effective encoder for representing a long text is still a challenge, and we are committed to solving the long text dependency problem with an aggregation mechanism. The key idea of the aggregation mechanism is to collect history information and then compute attention between the encoder's final hidden states and the history information in order to re-distribute the encoder's final states. Intuitively, the encoder reads the long input text several times to understand it clearly. We build our model by reconstructing the Transformer model with our novel aggregation mechanism. Empirically, we first analyze the characteristics of summarization and translation datasets. Then we experiment with different numbers of encoder and decoder layers, and the results reveal that the encoder layers matter more than the decoder layers, which implies that we should focus more on the encoder. Finally, we experiment on the CNN/DailyMail dataset, and our model generates higher-quality summaries than the strong Pointer Generator and Transformer baselines on ROUGE metrics and human evaluations.
The main contributions of this paper are as follows:
We put forward a novel aggregation mechanism to redistribute context states of text with collected history information. Then we equip the Transformer model with the aggregation mechanism.
Our model outperforms the Transformer baseline by 1.01 ROUGE-1, 0.30 ROUGE-2 and 1.27 ROUGE-L points on the CNN/DailyMail dataset, and by 5.31 ROUGE-1, 4.56 ROUGE-2 and 5.19 ROUGE-L points on our newly built Chinese news dataset.
2 Related Work
In this section, we first introduce extractive summarization and then abstractive summarization.
2.1 Extractive Summarization
Extractive summarization aims to select salient sentences from source documents directly. This task is usually modeled as sentence ranking (selecting the sentences with the highest scores), sequence labeling (binary labels) or integer linear programming. Earlier models mostly leverage manually engineered features, which have now been replaced by neural networks that extract features automatically. Cheng et al. obtain sentence representations using a convolutional neural network (CNN) and document representations using a recurrent neural network (RNN), and then select sentences/words with a hierarchical extractor. Nallapati et al. treat summarization as a sequence labeling task: they obtain sentence and document representations using RNNs, and after a classification layer each sentence receives a label indicating whether it should be selected. Zhou et al. present a model for extractive summarization that jointly learns to score and select sentences. Zhang et al. put forward a latent variable model to tackle the problem of sentence label bias.
2.2 Abstractive Summarization
Abstractive summarization aims to rewrite source texts while preserving their semantic meaning. Most methods for this task are based on sequence-to-sequence models. Rush et al. first incorporate the attention mechanism into abstractive summarization and achieve state-of-the-art scores on the DUC-2004 and Gigaword datasets. Chopra et al. improve the model performance with an RNN decoder. Nallapati et al. adopt a hierarchical network to process long source texts with hierarchical structure. Gu et al. are the first to show that a copy mechanism can take advantage of both extractive and abstractive summarization by copying words from the source text (extractive) and generating original words (abstractive). See et al. incorporate copy and coverage mechanisms to avoid generating inaccurate and repeated words. Celikyilmaz et al. split the text into paragraphs, apply an encoder to each paragraph, and then broadcast each paragraph's encoding to the others. Recently, Vaswani et al. give a new view of the sequence-to-sequence model: it employs self-attention to replace RNNs and uses multi-head attention to capture different semantic information.
Lately, more and more researchers focus on combining abstractive and extractive summarization. Hsu et al. build a unified model by using an inconsistency loss. Gehrmann et al. first train a content selector to select and mask salient information, and then train an abstractive model (Pointer Generator) to generate the summary.
3 Model
In this section, we first describe the attention mechanism and the Transformer baseline model; after that, we introduce the pointer and BPE mechanisms. Our novel aggregation mechanism is described in the last part. The code for our model is available online.
Notation We have pairs of texts $(X, Y)$, where $X$ is a long text and $Y$ is the summary of the corresponding $X$. The lengths of $X$ and $Y$ are $n$ and $m$ respectively. Each text is composed of a sequence of words $x_i$, and we embed each word $x_i$ into a vector $e_i$. So we represent a document $X$ with the embedding sequence $E_X = (e_1, \dots, e_n)$, and we obtain the representation $E_Y$ of $Y$ in the same way.
3.1 Attention Mechanism
The attention mechanism is widely used in text summarization models, as it produces a word-significance distribution over the source text for each decode step. Bahdanau et al. first propose the attention mechanism, where the attention weight distribution can be calculated as:

$$e_{ti} = v^\top \tanh(W_h h_i + W_s s_t + b), \qquad a_t = \mathrm{softmax}(e_t)$$

where $h_i$ is the encoder hidden state at the $i$-th word and $s_t$ is the decoder hidden state at time step $t$. The matrices $W_h, W_s$, vector $v$ and scalar $b$ are learnable parameters. $a_t$ is a probability distribution that represents the importance of the different source words for the decoder at time step $t$.
The Transformer redefines the attention mechanism more concisely. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$; the keys and values are also packed together into matrices $K$ and $V$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $\top$ denotes transposition, $Q \in \mathbb{R}^{n_q \times d_k}$, $K \in \mathbb{R}^{n_k \times d_k}$ and $V \in \mathbb{R}^{n_k \times d_v}$, $\mathbb{R}$ is the real field, $n_q, n_k$ are the lengths of the query and key/value sequences, and $d_k, d_v$ are the dimensions of keys and values. For the summarization model we assume $d_k = d_v$. Self-attention is defined from basic attention by setting $Q = K = V$. Multi-head attention concatenates multiple basic attentions with different parameters; we formulate multi-head attention as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

where $W_i^Q, W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$ are learnable parameters.
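As a minimal sketch of these definitions (not the paper's actual implementation), scaled dot-product and multi-head attention can be written in NumPy; the random matrices here stand in for the learned projections $W_i^Q, W_i^K, W_i^V, W^O$:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # row-wise softmax
    return w @ V                                      # (n_q, d_v)

def multi_head_attention(Q, K, V, heads, rng):
    """Concatenate `heads` independent attention heads, then project back."""
    d_model = Q.shape[-1]
    d_k = d_model // heads
    outs = []
    for _ in range(heads):
        # random stand-ins for the learned per-head projections
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        outs.append(attention(Q @ Wq, K @ Wk, V @ Wv))
    Wo = rng.standard_normal((heads * d_k, d_model))  # stand-in for W^O
    return np.concatenate(outs, axis=-1) @ Wo         # (n_q, d_model)
```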
3.2 Transformer Baseline Model
Our baseline model corresponds to the Transformer model used in NMT tasks. It differs from previous sequence-to-sequence models in that it uses attention instead of RNNs. The Transformer model can be divided into an encoder and a decoder, which we discuss respectively below.
Input The attention defined in the Transformer is a bag-of-words (BOW) model, so we have to add extra position information to the input. The position is encoded with heuristic sine and cosine functions:

$$PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$

where $pos$ is the position of the word in the text, $i$ is the dimension index of the embedding, and $d_{model}$ is the model dimension. The input of the network is the sum of the source text word embeddings and the position embeddings.
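The sinusoidal encoding can be sketched as follows (a plain NumPy version, assuming an even $d_{model}$):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal position embeddings: sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe
```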
Encoder The goal of the encoder is to extract the features of the input text and map them to a vector representation. The encoder is a stack of encoder layers, each consisting of a multi-head self-attention sublayer and a position-wise feed-forward sublayer. We employ a residual connection around each of the two sublayers, followed by layer normalization. The multi-head attention sublayer extracts different semantic information, and the position-wise feed-forward sublayer then computes each encoder layer's final hidden states. The $l$-th encoder layer is formulated as:

$$A^l = \mathrm{LN}\!\left(H^{l-1} + \mathrm{MultiHead}(H^{l-1}, H^{l-1}, H^{l-1})\right), \qquad H^l = \mathrm{LN}\!\left(A^l + \mathrm{FFN}(A^l)\right)$$

where $A^l$ is the multi-head self-attention output after the residual connection, $\mathrm{LN}$ is the layer normalization function, and $H^l$ is the output of encoder layer $l$, with $H^0$ being the input embeddings. $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$ is the position-wise feed-forward sublayer, whose matrices $W_1, W_2$ and biases $b_1, b_2$ are learnable parameters; this sublayer can also be described as two convolution operations with kernel size 1.
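A single-head toy version of one encoder layer, with the residual connections and layer normalization described above (the learned gain and bias of LayerNorm are omitted for brevity, so this is a sketch rather than the full sublayer):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(X):
    """Single-head self-attention with Q = K = V = X."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

def encoder_layer(H, W1, b1, W2, b2):
    """Self-attention then position-wise FFN, each with residual + LayerNorm."""
    A = layer_norm(H + self_attention(H))
    F = np.maximum(0.0, A @ W1 + b1) @ W2 + b2   # max(0, xW1+b1)W2 + b2
    return layer_norm(A + F)
```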
Decoder The decoder generates salient and fluent text from the encoder hidden states. The decoder is a stack of decoder layers, each consisting of a masked multi-head self-attention sublayer, a multi-head attention sublayer and a feed-forward sublayer. As in the encoder, we employ residual connections around each of the sublayers, followed by layer normalization. Taking the $l$-th decoder layer as an example, we use masked multi-head self-attention to encode the summary as a vector $S^l$:

$$S^l = \mathrm{LN}\!\left(D^{l-1} + \mathrm{MaskedMultiHead}(D^{l-1}, D^{l-1}, D^{l-1})\right)$$

where $D^{l-1}$ is the sum of the word embeddings and position embeddings of the generated words in the first layer, and the output of the $(l-1)$-th decoder layer otherwise. The mask is the same as in the original Transformer decoder: it prevents each position from attending to subsequent positions. Then we execute multi-head attention between encoder and decoder:
$$C^l = \mathrm{LN}\!\left(S^l + \mathrm{MultiHead}(S^l, H^L, H^L)\right)$$

where $S^l$ is the hidden state of the decoder's masked multi-head attention and $H^L$ is the output of the last encoder layer. Finally, we use the position-wise feed-forward and layer normalization sublayers to compute the final states $D^l$:

$$D^l = \mathrm{LN}\!\left(C^l + \mathrm{FFN}(C^l)\right)$$

where the feed-forward matrices and biases are learnable parameters. Projecting the decoder's final hidden states to the vocabulary size and applying a softmax, we obtain the vocabulary probability distribution $P_{vocab}$.
3.3 Pointer and BPE Mechanism
In generation tasks, we must deal with the out-of-vocabulary (OOV) problem. If we do not tackle it, the generated text contains only words from a limited vocabulary and replaces OOV words with a special unknown token. Things get worse in the summarization task: specific nouns (names, places, etc.) with low frequency are key information for the summary, but the vocabulary is built from the most frequent words, so those specific nouns may not occur in it.
The pointer and byte pair encoding (BPE) mechanisms are both used to tackle the OOV problem. The original BPE mechanism is a simple data compression technique that replaces the most frequent pair of bytes with an unused byte. Sennrich et al. first use this technique for word segmentation by merging characters instead of bytes, so that a fixed vocabulary can hold more subwords and alleviate the OOV problem.
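The BPE merge-learning step can be illustrated with a toy pure-Python version (the `</w>` end-of-word marker follows the convention of Sennrich et al.; this sketch omits the frequency thresholds of practical implementations):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):               # apply the merge in place
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab
```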
The pointer mechanism allows both copying words from the source text and generating words from a fixed vocabulary. At each decoder time step $t$, the generation probability can be calculated as:

$$p_{gen} = \sigma\!\left(w^\top D^L_t + b\right)$$

where the vector $w$ and scalar $b$ are learnable parameters and $D^L_t$ is the last decoder layer's output state. We compute the final word distribution via the pointer network:

$$P(w) = p_{gen}\, P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, x_i = w} a_{ti}$$

where $x_i$ is the $i$-th input word, the sum gathers the attention weights on every source position where $w$ occurs, $a_t$ is the attention distribution over source words, and $P(w)$ is the final probability distribution.
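The mixing of the two distributions can be sketched as follows, using an extended vocabulary in which source OOV words occupy extra slots beyond the fixed vocabulary (the indices here are purely illustrative):

```python
def pointer_final_dist(p_vocab, attn, src_ids, p_gen):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass on w.

    p_vocab : generator distribution over the extended vocabulary
              (OOV slots hold probability 0 under the generator).
    attn    : attention weights over source positions.
    src_ids : extended-vocabulary id of each source position.
    """
    final = [p_gen * p for p in p_vocab]
    for a, idx in zip(attn, src_ids):
        final[idx] += (1.0 - p_gen) * a   # copy mass onto the source word
    return final
```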
3.4 Aggregation Mechanism
The overview of our model is shown in Figure 1. To enhance memory ability, we add the aggregation mechanism between encoder and decoder to collect history information. The aggregation mechanism reconstructs the encoder's final hidden states by reviewing history information. We put forward two primitive aggregation approaches that prove effective in our task.
The first approach uses fully connected networks to collect history information (see Figure 2). We first run the normal encoder layers to obtain the outputs of each layer, select the outputs of $k$ middle layers, and concatenate them as the input of a fully connected network to obtain the history information $H^{hist}$:

$$H^{hist} = \left[H^{l_1}; \dots; H^{l_k}\right] W_f + b_f$$

where the matrix $W_f$ and bias $b_f$ are learnable parameters and $k$ is a hyper-parameter to be explored. Then we add a multi-head attention layer between the last encoder layer output $H^L$ and the history information $H^{hist}$; the output of this attention is the final state of the encoder:

$$H^{final} = \mathrm{MultiHead}(H^{hist}, H^L, H^L)$$

where $H^{hist}$ serves as the query and $H^L$ as key and value.
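A single-head sketch of the projection aggregation approach (the attention projection matrices are omitted for brevity, and the `W_f`, `b_f` arguments stand in for the learned parameters):

```python
import numpy as np

def attend(Q, K, V):
    """Single-head scaled dot-product attention."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def projection_aggregation(selected, H_last, W_f, b_f):
    """Concatenate selected layer outputs, project them to a history state,
    then attend from history (query) to the last encoder states (key/value)."""
    hist = np.concatenate(selected, axis=-1) @ W_f + b_f   # (n, d_model)
    return attend(hist, H_last, H_last)
```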
The second approach uses the attention mechanism to collect history information (see Figure 3). We select the outputs of middle encoder layers and iteratively compute multi-head attention between the current encoder layer output and the previous history information. The $i$-th history information can be calculated as follows:

$$H^{hist}_i = \mathrm{MultiHead}(H^{hist}_{i-1}, H^{l_i}, H^{l_i})$$

where $l_i$ is the index of the $i$-th selected encoder layer, $H^{hist}_{i-1}$ is the previous history state and $H^{l_i}$ is the encoder output of layer $l_i$. Iterating until the last selected encoder layer, we obtain the final history hidden states and use them as the final states of the encoder.
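The iterative attention aggregation can be sketched similarly (single-head, with attention projection matrices omitted):

```python
import numpy as np

def attend(Q, K, V):
    """Single-head scaled dot-product attention."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def attention_aggregation(selected):
    """Iteratively refresh the history state with each selected layer's output:
    the previous history queries the next layer's states."""
    hist = selected[0]
    for H in selected[1:]:
        hist = attend(hist, H, H)
    return hist
```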
Finally, we define the objective function. Given the golden summary $Y$ and input text $X$, we minimize the negative log-likelihood of the target word sequence. The training objective can be described as:

$$L(\theta) = -\frac{1}{N} \sum_{k=1}^{N} \log P\!\left(Y^{(k)} \mid X^{(k)}; \theta\right)$$

where $\theta$ denotes the model parameters and $N$ is the number of source-summary text pairs in the training set. The loss for one sample is the sum of the losses of the generated words at each time step $t$:

$$\log P(Y \mid X; \theta) = \sum_{t=1}^{T} \log P\!\left(y_t \mid y_{<t}, X; \theta\right)$$

where $P(y_t \mid y_{<t}, X; \theta)$ is calculated at decoder time step $t$ and $T$ is the total number of decoding steps.
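The objective can be computed directly from the per-step probabilities assigned to the target words; a minimal sketch:

```python
import math

def summary_nll(step_probs):
    """Negative log-likelihood of one summary: sum over decoding steps of
    -log P(y_t | y_<t, X)."""
    return sum(-math.log(p) for p in step_probs)

def batch_loss(all_step_probs):
    """Average the per-summary loss over the N training pairs."""
    return sum(summary_nll(s) for s in all_step_probs) / len(all_step_probs)
```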
4 Experiments
In this section, we first describe the setup of our experiments and then analyze the results.
4.1 Experimental Setup
We conduct our experiments on the CNN/DailyMail dataset [13, 30], which has been widely used for long document summarization tasks. The corpus was constructed by collecting online news articles and their human-generated summaries from the CNN and Daily Mail websites. We choose the non-anonymized version of the dataset.
|Model||ROUGE-1||ROUGE-2||ROUGE-L|
|Pointer Generator + Coverage||39.53||17.28||36.38|
|Pointer Generator + Coverage + cbdec + RL||40.66||17.87||37.06|
|Inconsistency Loss||40.68||17.97||37.13|
|rnn-ext + abs + RL + rerank||40.88||17.80||38.54|
Training Details We conduct our experiments on one NVIDIA Tesla V100. During training and testing we truncate the source text, and we build a shared vocabulary of modest size for the encoder and decoder, which the use of the pointer or BPE mechanism makes possible. Word embeddings are learned during training. We use the Adam optimizer and adapt the learning rate according to the loss on the validation set (halving the learning rate if the validation loss does not decrease within two epochs). We also apply regularization to all learnable parameters. The training process converges in about 200,000 steps for each model.
In the generation phase, we use the beam search algorithm to produce multiple summary candidates in parallel and add repeated words to a blacklist during the search to avoid duplication. To avoid favoring shorter generated summaries, we apply a length penalty. In detail, we fix the beam size, the no-repeat n-gram size and the length-penalty parameter, and we also constrain the maximum and minimum length of the generated summary.
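The no-repeat n-gram constraint used during beam search can be sketched as a simple membership test (the exact blacklist logic of the actual decoder may differ):

```python
def blocks_repeat(tokens, candidate, n=3):
    """Return True if appending `candidate` would recreate an n-gram
    already present in the partially decoded summary `tokens`."""
    if len(tokens) < n - 1:
        return False
    new_gram = tuple(tokens[-(n - 1):] + [candidate])
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return new_gram in seen
```

During beam expansion, any candidate word for which this returns True would be pruned (or assigned probability zero) before the beam is re-ranked.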
We evaluate our system using F-measures of the ROUGE-1, ROUGE-2 and ROUGE-L metrics, which respectively measure the overlap of unigrams, bigrams and the longest common subsequence between the golden summary and the system summary. The scores are computed with the pyrouge Python package.
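ROUGE-L's core quantity is the longest common subsequence; a minimal sentence-level sketch of the F-measure (the official ROUGE script additionally handles tokenization, stemming options and summary-level aggregation):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two token lists, from LCS precision and recall."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return 2 * p * r / (p + r)
```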
Experiment explorations We explore the influence of different hyper-parameter setups on the model's performance, covering 11 different experiment settings.
Firstly, we explore the number of Transformer encoder/decoder layers (see Table 3).
Secondly, we examine the different aggregation methods with 1 aggregation layer (see Table 4). The exploration includes our baseline model (m1) and the Transformer model with the add function (m2), the projection aggregation method (m4) and the attention aggregation method (m6).
Thirdly, we explore the effect of the number of aggregation layers (see Table 4). There are 3 groups of experiments with different numbers of aggregation layers: the Transformer adding the last 2 layers (m2) and last 3 layers (m3); the Transformer with the projection aggregation method using 1 layer (m4) and 2 layers (m5); and the Transformer with the attention aggregation method using 1 layer (m6) and 2 layers (m7). For all models except the encoder/decoder layer exploration, we use 4 encoder and 4 decoder layers.
Human Evaluation The ROUGE scores are widely used for the automatic evaluation of summarization, but they have great limitations regarding semantic and syntactic information. We therefore use manual evaluation to assess the performance of our models. We perform a small-scale human evaluation: we randomly select about 100 generated summaries from each of the 3 models (Pointer Generator, Transformer, and aggregation Transformer) and randomly shuffle the order of the 3 summaries to anonymize model identities. Then we let 20 anonymous volunteers with excellent English literacy skills score 10 random summaries from each of the 3 models on a scale from 1 to 5 (a high score means a high-quality summary), and we use the average score of each summary as its final score. The evaluation criteria are as follows: (1) salience: summaries contain the important points of the source text; (2) fluency: summaries are consistent with human reading habits and have few grammatical errors; (3) non-repetition: summaries do not contain too many redundant words.
|max-token-len(art/abs)||2882 / 2096||2134 / 1684||2377 / 678|
|avg-token-len(art/abs)||790 / 55||768 / 61||777 / 58|
|max-token-len(art/abs)||1914 / 80||1687 / 80||1670 / 80|
|avg-token-len(art/abs)||768 / 65||763 / 65||769 / 65|
|max-token-len(de/en)||244 / 228||169 / 154||245 / 217|
|avg-token-len(de/en)||24 / 24||24 / 24||23 / 22|
|max-token-len(en/de)||250 / 250||224 / 233||101 / 93|
|avg-token-len(en/de)||28 / 29||28 / 29||26 / 27|
Dataset Analysis To demonstrate the difference between the summarization and translation tasks, we compare their datasets (see Table 2). The summarization dataset CNN/DailyMail contains 287226 training pairs, 13368 validation pairs, and 11490 test pairs. The translation datasets iwslt14 and wmt17 have 160239/3961179 training pairs, 7283/40058 validation pairs, and 6750/3003 test pairs respectively. The comparison reveals the characteristics of the two tasks: the summarization source text can exceed 2000 words and is on average 10 times longer than the target text, while the translation source text contains at most 250 words and is on average about the same length as the target text. Because of that, we need a strong encoder with memory ability that decides where to attend and what to ignore.
|Source Text(truncated 500): (……) national grid has revealed the uk ’s first new pylon for nearly 90 years . called the t-pylon -lrb- artist ’s illustration shown -rrb- it is a third shorter than the old lattice pylons . but it is able to carry just as much power - 400,000 volts . it is designed to be less obtrusive and will be used for clean energy purposes . national grid is building a training line of the less obtrusive t-pylons at their eakring training academy in nottinghamshire . britain ’s first pylon , erected in july 1928 near edinburgh , was designed by architectural luminary sir reginald blomfield , inspired by the greek root of the word ‘ pylon ’ -lrb- meaning gateway of an egyptian temple -rrb- . the campaign against them - they were unloved even then - was run by rudyard kipling , john maynard keynes and hilaire belloc . five years later , the biggest peacetime construction project seen in britain , the connection of 122 power stations by 4,000 miles of cable , was completed . it marked the birth of the national grid and was a major stoking of the nation ’s industrial engine and a vital asset during the second world war (……)|
|Ground Truth: national grid has revealed the uk ’s first new pylon for nearly 90 years . called the t-pylon it is a third shorter than the old lattice pylons . but it is able to carry just as much power - 400,000 volts . it is designed to be less obtrusive and will be used for clean energy .|
|Transformer Baseline: the t-pylon -lrb- artist ’s shown -rrb- it is a third shorter than the old lattice pylons . but it is able to carry just as much power - 400,000 volts . it is designed to be less obtrusive and will be used for clean energy purposes .|
|Our model: national grid has revealed the uk ’s first new pylon for nearly 90 years . called the t-pylon it is a third shorter than the old lattice pylons . but it is able to carry just as much power - 400,000 volts . it is designed to be less obtrusive and will be used for clean energy purposes .|
Quantitative Analysis The experimental results are given in Table 1. Overall, our model improves over all other baselines (as reported in their articles) on ROUGE-1 and ROUGE-2 F1 scores, while it gets a lower ROUGE-L F1 score than the RL (Reinforcement Learning) model. As Celikyilmaz et al. note, the ROUGE-L F1 score is not well correlated with summary quality, and our model generates the most novel words compared with the other baselines in the novelty experiment (Figure 5). Novel words hurt the ROUGE-2 and ROUGE-L F1 scores; this result also indicates that our model is more abstractive.
Figure 4 shows the ground truth summary and the summaries generated by the Transformer baseline model and by our aggregation Transformer using the attention aggregation method. The source text is the main fragment of the truncated text. Compared with the aggregation Transformer, the summary generated by the Transformer baseline model has two problems: first, it lacks the salient information marked in red in the source text; second, it contains the unnecessary information marked in blue in the source text.
We hold the opinion that the Transformer baseline model has weak memory ability compared to our model. Therefore, it cannot retain information far from its current states, which leads to missing salient information, and it may remember irrelevant information, which leads to unnecessary words in the generated summaries. Our model uses the aggregation mechanism to review the primitive information and enhance the model's memory capacity; as a result, it generates salient and non-repetitive summaries.
Encoder/Decoder Layers Analysis The first exploration experiment covers Transformer models with different numbers of encoder and decoder layers; we only experiment with at most 4 encoder/decoder layers. We also tried 6 encoder and decoder layers, but there was no notable difference from 4 layers, while the parameter count and convergence time increased considerably. Therefore our Transformer baseline model has 4 encoder and 4 decoder layers.
The results of decreasing the number of encoder or decoder layers are shown in Table 3. Comparing the results, we get lower precision but higher recall when the encoder layers decrease, and the opposite when the decoder layers decrease. Meanwhile, removing one decoder layer yields a higher ROUGE-1 F1 score but lower ROUGE-2 and ROUGE-L F1 scores than removing one encoder layer. We therefore conclude that the encoder captures the features of the source text while the decoder keeps the summaries consistent.
|Model||ROUGE-1||ROUGE-2||ROUGE-L|
|(m2)Transformer(add 1 layer)||39.79||17.52||36.32|
|(m3)Transformer(add 2 layer)||39.69||17.34||36.15|
|(m4)Agg-Transformer(proj 1 layer)||40.58||17.77||36.60|
|(m5)Agg-Transformer(proj 2 layer)||40.67||17.84||36.70|
|(m6)Agg-Transformer(attn 1 layer)||41.06||18.02||38.04|
|(m7)Agg-Transformer(attn 2 layer)||40.03||17.59||36.60|
Aggregation Mechanism Analysis The second exploration experiment compares our baseline model (m1) and its add variant (m2) with the aggregation Transformer models using different aggregation mechanisms (m4, m6) in Table 4. If we simply add the last layer(s) to the baseline model (m2), the scores decrease, contrary to our expectation. Although simply adding the last layer(s) does re-distribute the encoder final states with history states, it averages the importance weights of those layers, which may make things worse. Compared with the baseline model, the scores of our aggregation models (m4, m6) improve. We compute attention between the history (query) and the encoder final states (key/value) to re-distribute the final states, so the encoder gains the ability to fuse history information with different importance.
The third exploration contains 3 groups of experiments: the add group (m2, m3), the projection group (m4, m5) and the attention group (m6, m7). The aggregation Transformer models here use different numbers of aggregation layers. We also experimented with 3 aggregation layers in all 3 groups, but those models all get extraordinarily low ROUGE scores (roughly ROUGE-1 39.3, ROUGE-2 14.5, ROUGE-L 34.3). They all incorporate the output of the first encoder layer, which may carry little semantic information and thus harm the re-distribution of the encoder final states, so we do not compare with those models explicitly.
For the add aggregation group, the ROUGE scores go down as we increase the number of added layers. If we add more layers, the final state distribution tends toward a uniform distribution, which confuses the decoder about the key ideas of the source text; for that reason, adding more layers yields worse scores.
For the projection aggregation group, the ROUGE scores rise as we increase the number of aggregation layers: aggregating more layers lets the history states contain more information, which improves performance. However, projecting too many layers loses a lot of information, and we achieve the best result with 2 aggregation layers.
For the attention aggregation group, we get the best score with 1 aggregation layer, and the ROUGE scores decline if we increase the number of aggregation layers. One attention layer suffices to focus on the history states; too many attention layers may depend excessively on them. If the encoder's final distribution focuses more on shallow layers, which introduce a lot of useless information, it harms the encoder's ability to capture salient features.
Abstractiveness Analysis Figure 5 shows that our model copies whole sentences from the source texts at a rate close to that of the reference summaries. However, there is still a huge gap in n-gram novelty, and this is the main area for improvement.
In particular, the Pointer Generator model tends to produce summaries with few novel words because of its low rate of novel word generation. The Transformer baseline model can generate novel summaries, and our model achieves a great improvement over it (0.5, 4.6, 7.8 and 10.1% novelty improvement for increasing n-gram sizes). Because our model reviews history states and re-distributes the encoder final states, we get a more accurate semantic representation. This also shows that our aggregation mechanism improves the memory capability of the encoder.
|Model||Salience||Fluency||Non-repetition|
|Pointer Generator + Coverage||3.42||3.23||3.61|
|Transformer + Aggregation||3.87||3.37||3.78|
Human Evaluation We conduct our human evaluation with the setup in Section 4.1, and the results are shown in Table 5. We compare the three models on the salience, fluency and non-repetition criteria, and our model gets the highest score on all of them. On the fluency criterion, however, none of the models scores well, which suggests that all current models still struggle to capture semantic information. The Pointer Generator, our baseline abstractive summarization approach, has the lowest scores, even though it uses the coverage mechanism to avoid generating overlapping words, which should make summaries more fluent and less repetitive. The Transformer is a newer abstractive summarization model based on the attention mechanism, and it performs better than the Pointer Generator. Equipping the Transformer with the aggregation mechanism yields a great improvement on all 3 criteria.
4.3 Our Chinese Experiments
We build our Chinese summarization dataset by crawling a news website.
|Model||ROUGE-1||ROUGE-2||ROUGE-L|
|Pointer Generator + Coverage||55.64||43.80||48.08|
|Transformer + Aggregation||58.00||44.42||48.85|
We also experiment on our Chinese dataset and evaluate the results with ROUGE.
5 Conclusion
In this paper, we propose a new aggregation mechanism for the Transformer model, which enhances the encoder's memory ability. With the aggregation mechanism, our model obtains the best performance compared to the Transformer baseline model and the Pointer Generator on the CNN/DailyMail dataset in terms of ROUGE scores. We explore different aggregation methods: add, projection and attention, of which the attention method performs best. We also explore different numbers of aggregation layers to find the best configuration. In addition, we build a Chinese dataset for the summarization task and report its statistics in Table 2; our proposed method also achieves the best performance on this Chinese dataset.
In the future, we will explore memory networks to collect history information and try to feed history information directly into the decoding process to improve summarization performance. The aggregation mechanism can also be transferred to other generation tasks.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant No. 61602474).
- footnotetext: School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
- footnotetext: Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
- footnotetext: Corresponding author: Chuang Zhang, firstname.lastname@example.org
- (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1, §3.1.
- (2018) Deep communicating agents for abstractive summarization. arXiv preprint arXiv:1803.10357. Cited by: §1, §2.2, §4.2.
- (2018) Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv: Computation and Language. Cited by: §4.2, Table 1.
- (2016) Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252. Cited by: §2.1.
- (2016) Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 93–98. Cited by: §2.2.
- (2001) Text summarization via hidden Markov models. pp. 406–407. Cited by: §2.1.
- (2019) Zero-shot cross-lingual abstractive sentence summarization through teaching generation and attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3162–3172.
- (2017) Convolutional sequence to sequence learning. arXiv: Computation and Language. Cited by: Table 1.
- (2018) Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792. Cited by: §2.2.
- (2016) Incorporating copying mechanism in sequence-to-sequence learning. 1, pp. 1631–1640. Cited by: §2.2.
- (2016) Pointing the unknown words. arXiv: Computation and Language.
- (2018) Soft layer-specific multi-task summarization with entailment and question generation. arXiv: Computation and Language.
- (2015) Teaching machines to read and comprehend. arXiv: Computation and Language. Cited by: §4.1.
- (2018) A unified model for extractive and abstractive summarization using inconsistency loss. 1, pp. 132–141. Cited by: §2.2, Table 1.
- (2018) Closed-book training to improve summarization encoder memory. pp. 4067–4077. Cited by: Table 1.
- (2019) Abstractive text summarization based on deep learning and semantic content generalization. pp. 5082–5092.
- (1995) A trainable document summarizer. pp. 68–73. Cited by: §2.1.
- (2019) Scoring sentence singletons and pairs for abstractive summarization. arXiv: Computation and Language.
- (2004) Rouge: a package for automatic evaluation of summaries.
- (2019) Abstractive summarization: a survey of the state of the art. 33, pp. 9815–9822.
- (2018) Global encoding for abstractive summarization. arXiv: Computation and Language.
- (2017) Generative adversarial network for abstractive text summarization. arXiv: Computation and Language.
- (2019) Hierarchical transformers for multi-document summarization. pp. 5070–5081.
- (2019) Text summarization with pretrained encoders. arXiv: Computation and Language.
- (2015) Generating news headlines with recurrent neural networks. arXiv: Computation and Language. Cited by: §1.
- (2019) Global optimization under length constraint for neural text summarization. pp. 1039–1048.
- (2019) An editorial network for enhanced document summarization. arXiv: Computation and Language.
- (2016) SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. arXiv: Computation and Language. Cited by: §2.1.
- (2017) Classify or select: neural architectures for extractive document summarization. arXiv: Computation and Language.
- (2016) Abstractive text summarization using sequence-to-sequence rnns and beyond. pp. 280–290. Cited by: §1, §2.2, §4.1, Table 1.
- (2018) Ranking sentences for extractive summarization with reinforcement learning. 1, pp. 1747–1759.
- (2019) Fairseq: a fast, extensible toolkit for sequence modeling. pp. 48–53.
- (2014) A template-based abstractive meeting summarization: leveraging summary and source text relationships. pp. 45–53.
- (2015) A neural attention model for abstractive sentence summarization. arXiv: Computation and Language. Cited by: §1, §2.2.
- (2017) Get to the point: summarization with pointer-generator networks. 1, pp. 1073–1083. Cited by: §1, §2.2, §4.1, Table 1.
- (2015) Neural machine translation of rare words with subword units. arXiv: Computation and Language. Cited by: §3.3.
- (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
- (2016) Neural headline generation on abstract meaning representation. pp. 1054–1059.
- (2017) Abstractive document summarization with a graph-based attentional neural model. 1, pp. 1171–1181. Cited by: §1.
- (2017) Attention is all you need. arXiv: Computation and Language. Cited by: §1, §2.2.
- (2015) Pointer networks. pp. 2692–2700.
- (2010) Automatic generation of story highlights. pp. 565–574. Cited by: §2.1.
- (2018) Deep reinforcement learning for extractive document summarization. Neurocomputing 284, pp. 52–62.
- (2019) Improving abstractive document summarization with salient information modeling. pp. 2132–2141.
- (2017) Efficient summarization with read-again and copy mechanism. arXiv: Computation and Language.
- (2019) Pretraining-based natural language generation for text summarization. arXiv: Computation and Language.
- (2018) Neural latent extractive document summarization. pp. 779–784. Cited by: §2.1.
- (2019) HIBERT: document level pre-training of hierarchical bidirectional transformers for document summarization. pp. 5059–5069.
- (2018) Neural document summarization by jointly learning to score and select sentences. 1, pp. 654–663. Cited by: §2.1.