Headline Generation: Learning from Decomposed Document Titles

Oleg Vasilyev, Tom Grek and John Bohannon
San Francisco, California

We propose a novel method for generating titles for unstructured text documents. We reframe the problem as a sequential question-answering task. A deep neural network is trained on document-title pairs that have the property of decomposability, in which the vocabulary of the document title is a subset of the vocabulary of the document body. To train the model we use a corpus of millions of publicly available document-title pairs: news articles and headlines. We present the results of a randomized double-blind trial in which subjects were unaware of which titles were human-written and which were machine-generated. When trained on approximately 1.5 million news articles, the model generates headlines that humans judge to be as good as or better than the original human-written headlines in the majority of cases.


 A Preprint
Oleg Vasilyev, Tom Grek and John Bohannon Primer San Francisco, California {oleg,tom,bohannon}@primer.ai

July 3, 2019

1 Introduction

The title of a document can be considered as the shortest possible summary of the document. Automatically generating this concise summary requires extremely low error tolerance. The slightest grammatical deficiency or factual error can render a title functionally useless [1, 2, 3, 4].

The general task of text summarization is well reviewed [5, 2, 6, 7] and is divided broadly into two methods: extractive and abstractive.

Extractive summarization extracts long spans of text from a document, usually whole sentences, and assembles them [8, 9]. Although it lacks expressivity, the method has the advantage that sentences are at least guaranteed to be coherent because they were originally written by humans.

Abstractive summarization does away with this limitation, drawing upon a dictionary [5]. In principle this removes limits to expressivity, but it comes at a steep cost. Existing abstractive summarization systems suffer from grammatical errors, factual hallucinations, and incoherence [10, 4, 3, 7, 1].

A recent attempt at a compromise between the two methods uses abstractive summarization with an extractive fall-back [10]. By extracting unknown words and phrases from a text with a pointer-generator network, it is possible to avoid some of the errors of abstractive methods.

In this paper we present an approach that completely abandons the use of a dictionary for abstractive text generation. It is generally recognized that this dictionary is a significant source of error [11]. We instead compose a headline by extracting all necessary words from the text of the document. Even a document of modest length - typical news articles are between 400 and 800 words long - contains sufficient vocabulary to express a functional headline. Our motivation is to force the text generation model to concentrate on the text itself without the distraction of a large external vocabulary.

The method we present thus blurs the distinction between extractive and abstractive summarization. It can be considered either as abstractive with a purpose-limited dictionary provided by the text itself, or as extractive with high flexibility in its choice of text spans and assembly. For training and evaluation of our approach we use Primer’s database of daily news documents.

2 Headline decomposition

For creating a training set we select only the documents with ‘decomposable’ titles. Decomposition is defined by the following greedy longest-match-first algorithm that performs iterative search through the document text:

  1. Find in the text the longest substring of the title; the substring must start at the beginning of the title. If more than one substring of equal length exists, take the first one encountered from the beginning of the text.

  2. Repeat the following until the end of the title: Find the next longest title substring; the substring must start after the end of the previously found substring. If not found, terminate and discard the title.

More formally:

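The title is represented as a concatenation of strings $T = t_1 t_2 \ldots t_n$, where each string $t_i$ is as short as possible, provided that there is no pair $(t_i, t_{i+1})$ such that the last character of $t_i$ and the first character of $t_{i+1}$ are both letters or are both digits. The text is represented in the same way: $D = d_1 d_2 \ldots d_m$.

The algorithm gathers training samples as tuples $(D, Q, A)$, where $(D, Q)$ is the input and $A$ is the output. For all samples of the document, $D$ is the same. Samples are gathered from each document as follows:

$S$ = empty list
$Q$ = empty string
While the remainder of $T$ after $Q = t_1 \ldots t_i$ is not an empty string:
      Find the largest $k \geq 1$ with $A = t_{i+1} \ldots t_{i+k}$ and $A = d_j \ldots d_{j+k-1}$ and $1 \leq j \leq m-k+1$
      If more than one $j$ exists for the found $k$, select the smallest $j$.
      If $k$ is not found:
          $S$ = empty list
          Stop (the title is not decomposable)
      Add $(D, Q, A)$ to $S$
      $Q = Q A$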

Consider the following example of a document with the title “Christopher John: I am very happy to be here”. The following substrings are sequentially found in the text of the document:

‘Christopher John’, ‘: ’, ‘I am very happy to ’, ‘be here’

The text did not have the string ‘Christopher John:’ nor the string ‘I am very happy to be here’. But the text did contain smaller pieces of the title. If any of the smallest pieces do not exist in the text, for example the word ‘happy’ or the ‘:’ punctuation mark, then the document does not have a decomposable title and would not be included in our training data set.
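The greedy longest-match-first decomposition can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `tokenize`, `find_first` and `decompose`, the regular expression, and the example body text are all hypothetical. Pieces are runs of letters, runs of digits, or single other characters, and matching is done on whole pieces.

```python
import re

# Assumed tokenization: runs of letters, runs of digits, or any single other character.
TOKEN = re.compile(r"[^\W\d_]+|\d+|[\W_]")


def tokenize(s):
    return TOKEN.findall(s)


def find_first(tokens, piece):
    """Return the first index at which `piece` occurs in `tokens`, or -1."""
    k = len(piece)
    for j in range(len(tokens) - k + 1):
        if tokens[j:j + k] == piece:
            return j
    return -1


def decompose(title, text):
    """Return (question, answer) pairs, or [] if the title is not decomposable."""
    t, d = tokenize(title), tokenize(text)
    samples, question, i = [], "", 0
    while i < len(t):
        # Try the longest remaining run of title pieces first.
        for k in range(len(t) - i, 0, -1):
            if find_first(d, t[i:i + k]) >= 0:
                break
        else:
            return []  # even a single piece is missing from the text: discard the title
        answer = "".join(t[i:i + k])
        samples.append((question, answer))
        question += answer
        i += k
    return samples


# Hypothetical body text, constructed to reproduce the example decomposition.
title = "Christopher John: I am very happy to be here"
text = 'Christopher John arrived on Monday. He said: "I am very happy to join. It is good to be here."'
pieces = [answer for _, answer in decompose(title, text)]
```

The quadratic search is chosen for readability only; a production version would likely index the text pieces first.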

It turns out that for typical news articles in English, about 10% of the titles are fully decomposable. Therefore the training data set is about 10 times smaller than a representative sample of news documents.

For title generation we arrange the decomposable title as a sequence of question-answers. These question-answer pairs are then used as input for a traditional question-answer model. Returning to our example, Table 1 illustrates how the sequence of question-answers looks.

Question | Answer (found in text)
(empty) | Christopher John
Christopher John | :
Christopher John: | I am very happy to
Christopher John: I am very happy to | be here
Christopher John: I am very happy to be here | _
Table 1: Every generation starts with an empty ‘question’ and ends with the termination ‘answer’

The termination answer, the symbol ‘_’, is always available because we add it to the end of every document. Thus, each document produces two or more training samples. Two is the minimal possible number: It happens when the whole title is found at the first question-answer search, and the termination answer is given by the second question-answer. Notice that the very first ‘question’ is always an empty string.
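As an illustration, the question-answer sequence for one decomposed title, including the termination pair, could be built as follows. This is a sketch only: the helper name `to_qa_samples` and the constant `TERMINATOR` are assumptions introduced here.

```python
TERMINATOR = "_"  # termination symbol, assumed to be appended to every document


def to_qa_samples(pieces):
    """Turn the title pieces into (question, answer) pairs plus the termination pair."""
    samples, question = [], ""
    for piece in pieces:
        samples.append((question, piece))  # the question is the title built so far
        question += piece
    samples.append((question, TERMINATOR))  # the full title asks for the terminator
    return samples


pieces = ["Christopher John", ": ", "I am very happy to ", "be here"]
samples = to_qa_samples(pieces)
```

A title found whole in the text yields exactly two samples: the title itself and the termination pair.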

For our question-answering model we use the BERT transformer base uncased model [12] adapted to question-answering by the Huggingface team [13]. At generation time we ask ‘questions’ and accumulate text in the form of ‘answers’ until the termination symbol is an answer.
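The generation loop can be sketched as below. The callable `answer_fn(question, context)` is a hypothetical stand-in for the fine-tuned question-answering model, and the step limit `max_steps` and the way the terminator is appended to the text are assumptions, not details given in this paper.

```python
from typing import Callable

TERMINATOR = "_"  # assumed termination symbol


def generate_title(text: str,
                   answer_fn: Callable[[str, str], str],
                   max_steps: int = 20) -> str:
    """Ask 'questions' and accumulate 'answers' until the terminator is returned."""
    context = text + " " + TERMINATOR  # the terminator must be available as an answer
    title = ""  # the very first 'question' is an empty string
    for _ in range(max_steps):  # guard against a model that never terminates
        answer = answer_fn(title, context)
        if answer == TERMINATOR:
            break
        title += answer
    return title


# Usage with a scripted stand-in for the model:
scripted = iter(["Christopher John", ": ", "I am very happy to ", "be here", "_"])
generated = generate_title("(document text)", lambda question, context: next(scripted))
```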

How different are the generated titles from the human-written originals? Does the system tend more toward abstractive or extractive behavior? We know in advance that about 90% of generated titles will necessarily differ from the original titles, simply because roughly 90% of news article titles are not decomposable. Moreover, even for documents with fully decomposable titles, our model generates titles different from the human-written original. It is extremely rare that a generated title is an exact match to the original. On average, a generated title consists of 4 to 5 ‘answers’ from different locations in the text.

Does the system simply memorize specific titles or decomposition patterns? We know that in our data less than 0.5% of daily news titles match titles from the previous week. And those that do overlap are generic titles such as “TV/Radio”, “Tuesday”, and “Tuscarawas County”. Such titles are typically not decomposable. Generally, we find it difficult even to overfit our model on the training set. In order to obtain the majority of generated titles repeating the training titles, we had to decrease the training set to thousands of decomposable titles and to increase the number of epochs above ten. On this basis, we are confident that the system is not simply memorizing.

The model (evaluated as explained in the next section) was trained on a question-answer dataset created from 10 weeks of Primer data, starting October 1, 2018. On a typical day, this data stream consists of between 100K and 200K English-language news documents, of which approximately 10% enter our training data. The training data set represents 7 million question-answer examples (including the termination answers). The question-answer examples were obtained from 1.5M documents with decomposable titles. The training was done by fine-tuning the BERT transformer uncased base model for 3 epochs.

Examples of real and generated titles are presented in Table 2.

Example 1
Real title: Put Some Healthful Into Holiday Eating
Generated title: How to eat a healthful diet for a festive season
Generation spans: ‘How’, ‘to’, ‘eat a healthful diet’, ‘for a’, ‘festive season’
Text (only the top is shown here): By Robert Preidt, HealthDay Reporter (HealthDay) SUNDAY, Dec. 9, 2018 (HealthDay News) - You can eat a healthful diet during the holidays with just a few tweaks to traditional recipes, the American Heart Association says. "We want to help people overcome their nutrition struggles and pave the way for a healthful festive season," registered dietitian Annessa Chumbley said in an association news release. …

Example 2
Real title: Dermatologists want you to avoid this skin-care ingredient
Generated title: Why our skin hates fragrance
Generation spans: ‘why’, ‘our’, ‘skin hates’, ‘fragrance’
Text (only the top is shown here): Our skin hates this ingredient found in most products we use. A dermatologist says, "Those with sensitive skin usually have a form of inflammatory disease that compromises their skin barrier." If you have sensitive skin, then you probably also have an entire checklist of skin-care ingredients to avoid. And most likely, fragrance is in this list-not because you don't want your products to smell good, but because your dermatologist probably told you that it's not the best thing for your skin type. …
Table 2: Examples of generated headlines. The spans are marked yellow. Some spans are not in the shown part of the text because they were found lower in the text.

3 Human evaluation

Human evaluation is especially important for our headline generation, because we know in advance that our generated titles in most cases should not be close to real titles as measured by automated text overlap metrics such as ROUGE [14].

For evaluation, we presented to human evaluators the following task. You see the body text of an article and two titles. The task is to independently score the titles, each on their own merit. One of the titles is real while the other is machine-generated (in randomly shuffled order). The titles are scored on a 5-point scale: Very Bad, Bad, OK, Good, Very Good.

The evaluation was done using Prodigy [15]. In the absence of codified standards for evaluating models using human graders, we opted to present each text and title pair individually, ensuring that our graders could not examine more than one generated (or real) title simultaneously. We limited the task to 100 documents at a time, which takes about 1 hour on average to complete.

The 100 documents for human evaluation were randomly picked as one document per source from popular news sources, on the day after the model training period. The sources used were chosen by ranking those most frequently cited on English Wikipedia, excluding non-journalistic sources and exclusively sports and business-focused sources.

The top five of our selected sources were:

  1. nytimes.com

  2. washingtonpost.com

  3. bbc.co.uk

  4. forbes.com

  5. cbc.ca

The evaluation presented here was undertaken by 10 evaluators: 2 authors of this paper and 8 hired external evaluators. Each evaluator scored a total of 200 titles: one real and one generated for each of the 100 documents. Fig 1 shows the instructions text for the task.

You are shown two or more possible headlines for an article. Score each headline for its quality independently. (Some headlines might be better than the others, or worse, or they could all be the same quality.) Sometimes headlines are good, and sometimes they are bad. You must be the judge!

What makes a good headline? It should be…

  • Informative. It should tell you what the article is about, including key details.

  • Easy to read. It should not be too long or full of extra details.

  • Well-written. It should not have grammatical errors or awkward wording.

What makes a bad headline?

  • Irrelevant details included.

  • Factually incorrect. (This is the worst of all!)

Figure 1: Instructions for evaluating headlines

In order to produce an estimate from the distribution of the gathered scores, we performed Bootstrap with 1 million samples, where each sampling was obtained by two mutually independent random selections with replacement: selection of the evaluators and selection of the documents. The results of comparison of scores given by the same evaluator to the real vs. generated titles of the same document are shown in Fig 2, with 95% confidence intervals.

Figure 2: Normalized distribution of the comparison between generated and real titles, with 95% confidence intervals. The scores are given by the same person to the titles for the same texts.
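The two-way resampling can be illustrated with the sketch below, which estimates the fraction of cases in which the generated title is scored at least as high as the real one. The function name, the chosen statistic, the simple percentile rule and the reduced default of 1000 samples are assumptions made for illustration; the score matrices are indexed as `scores[evaluator][document]`.

```python
import random


def bootstrap_fraction(real, generated, n_samples=1000, seed=0):
    """Return (2.5th percentile, median, 97.5th percentile) of the bootstrapped fraction."""
    rng = random.Random(seed)
    n_eval, n_docs = len(real), len(real[0])
    stats = []
    for _ in range(n_samples):
        # Two mutually independent selections with replacement.
        evals = [rng.randrange(n_eval) for _ in range(n_eval)]
        docs = [rng.randrange(n_docs) for _ in range(n_docs)]
        hits = sum(generated[e][d] >= real[e][d] for e in evals for d in docs)
        stats.append(hits / (n_eval * n_docs))
    stats.sort()

    def percentile(p):
        return stats[min(len(stats) - 1, int(p * len(stats)))]

    return percentile(0.025), percentile(0.5), percentile(0.975)
```

Resampling evaluators as well as documents lets the interval reflect disagreement between graders, not only variation between articles.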

The distribution of the scores, with 95% confidence intervals, is shown in Fig 3.

Figure 3: Normalized distribution of the scores given by human evaluators to generated and real titles, with 95% confidence intervals.

The pie chart in Fig 4 is equivalent to Fig 3 stripped of the confidence intervals.

Figure 4: Median values from Bootstrap distribution

A couple more characterizations from the Bootstrap that are of practical interest:

  1. Real title is or better, while the generated title is or worse: median = 17%

  2. Generated title is or better, while the real title is or worse: median = 9%

Table 3 shows several examples of differently scored real and generated titles.

Example 1
Real title: Postal pension
Generated title: No acknowledgement of pension amendments
Real title median score: 1.0
Generated title median score: 3.0
Text (only the top is shown here): Dear Claudienne I worked with the Post and Telecommunications Department in St Catherine from 1968 to 1977. When I applied for my pension the Post Office headquarters sent my file to the Ministry of Finance (MOF) on October 5, 2016. However, to date (October 13, 2017) I have received no letter of acknowledgement or even a telephone call with an update from the MOF. …

Example 2
Real title: It’s been a long 2 years’ for Kelly, Kudlow says
Generated title: John Kelly replacement to be announced in next few days
Real title median score: 2.0
Generated title median score: 3.5
Text (only the top is shown here): White House economic adviser Larry Kudlow, right, said the replacement for chief of staff John Kelly, center, would be announced Monday or early in the week. | Mark Wilson/ Share on Facebook Share on Twitter It is unclear if White House chief of staff John Kelly decided to resign from his post or was forced to leave …

Example 3
Real title: Bad Movie Diaries: A Christmas Prince: The Royal Wedding (2018)
Generated title: Jim Vorel and Kenneth Lowe discuss A Christmas Prince and its sequel, The Royal Wedding
Real title median score: 2.5
Generated title median score: 4.0
Text (only the top is shown here): Jim Vorel and Kenneth Lowe are connoisseurs of terrible movies. In this occasional series, they watch and then discuss the fallout of a particularly painful film. Be wary of spoilers. Ken: A very happy holiday to you again, Jim. As I sip my eggnog here in the winter wonderland that is 50-degree, tornado-ravaged Central Illinois …

Example 4
Real title: Moxie Girl cleaning company experiences growth amidst vacation rental market boom
Generated title: Moxie Girl is turning around homes quickly
Real title median score: 4.0
Generated title median score: 2.0
Text (only the top is shown here): Unless it's messy, dirty or somewhat disheveled, vacation-home renters probably don't notice when anything's amiss at their home away from home and instantly feel, well, at home. Amanda Thomas, the CEO and founder of Moxie Girl, makes sure that happens. The influx of investment properties used for vacation …

Example 5
Real title: Schiff Says Trump May ‘Face The Very Real Prospect Of Jail Time’
Generated title: Schiff: Trump will have jail time
Real title median score: 4.0
Generated title median score: 1.5
Text (only the top is shown here): The congressman expected to become the new chairman of the House Intelligence Committee is predicting dark days ahead for President Donald Trump, including potential jail time. Democratic Rep. Adam Schiff of California was on "Face the Nation" …

Example 6
Real title: Motorist shot on West Capitol Drive, street closed in both directions
Generated title: Motorist shot in traffic accident on West Capitol Drive, shut both directions after motorist shot
Real title median score: 4.0
Generated title median score: 1.0
Text (only the top is shown here): Share This Story! Let friends in your social network know what you are reading about A motorist was shot on W. Capitol Dr. near N. 7th St. Sunday morning. Post to Facebook Sent! A link has been sent to your friend’s email address. Posted! A link has been posted to your Facebook feed. A traffic accident on West Capitol Drive and North 7th Street involved a shooting of a motorist …
Table 3: Examples of real and generated titles, with median scores. The median is taken across all the scores; the score values are: 0 = Very Bad; 1 = Bad; 2 = OK; 3 = Good; 4 = Very Good

The scores shown are medians, taken over the scores of all evaluators. For example, here are the scores given by the evaluators to the first example in Table 3:

  1. Scores of the real title: [0, 1, 1, 1, 1, 1, 2, 2, 2, 2]

  2. Scores of the generated title: [2, 2, 2, 3, 3, 3, 3, 3, 3, 4]
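As a quick check of the medians reported for this example, the two score lists can be passed to Python's standard `statistics.median` (the variable names are introduced here for illustration):

```python
from statistics import median

real_scores = [0, 1, 1, 1, 1, 1, 2, 2, 2, 2]
generated_scores = [2, 2, 2, 3, 3, 3, 3, 3, 3, 4]

# With ten scores, the median is the mean of the 5th and 6th sorted values.
print(median(real_scores), median(generated_scores))  # prints: 1.0 3.0
```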

Assessment of headline quality is highly subjective, so the inter-rater reliability is low. Krippendorff’s alpha [16, 17] for scores treated as interval data is 0.27. If the scores for generated and real headlines are considered separately, the alpha is higher for the generated headlines (0.31) and lower for the real headlines (0.17). This may reflect the fact that the generated headlines follow the text more closely (by the nature of our generation algorithm), which makes the evaluator's job easier and the judgment more certain.

The real vs generated comparison rating (with values -1, 0, 1 for worse, same or better) also has low inter-rater reliability: the alpha is 0.23. The low inter-rater reliability reflects low agreement between the human evaluators, but it does not negate the Bootstrap results presented above, since the resampling includes selection of evaluators.
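For readers unfamiliar with the statistic, Krippendorff's alpha for interval data is one minus the ratio of observed to expected disagreement, with squared differences as the distance. The sketch below is an illustrative standard-library version that assumes no missing ratings; it is not the library cited as [17], and the function name `interval_alpha` is an assumption.

```python
def interval_alpha(ratings):
    """Krippendorff's alpha for interval data.

    ratings[u] is the list of scores given to unit u (for example, one title);
    every unit needs at least two scores, and not all scores may be identical.
    """
    values = [v for unit in ratings for v in unit]
    n = len(values)
    # Observed disagreement: squared differences between scores within each unit.
    d_o = sum(
        sum((a - b) ** 2 for a in unit for b in unit) / (len(unit) - 1)
        for unit in ratings
    ) / n
    # Expected disagreement: squared differences between all pairs of scores.
    d_e = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))
    return 1.0 - d_o / d_e
```

Perfect agreement gives an alpha of 1, agreement at chance level gives a value near 0, and systematic disagreement gives a negative value.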

4 Do we miss having a dictionary?

When decomposing a title, we included the resulting question-answer samples only if it was possible to decompose the whole title. If at least one word was not found in the text, nothing from that title would enter our training set. Let us now consider the possibility of making more titles decomposable by adding a dictionary. The advantage would not necessarily be a larger training set, since there are enough training samples from news documents, but a dictionary could increase the expressivity of text generation. We hypothesized that access to a dictionary would damage the model’s performance by causing text generation errors [11].

How large would an external dictionary have to be to be useful for the task of headline generation? We assume that the dictionary could be used in a manner similar to a pointer-generator network [10], so that a word from the dictionary could be picked instead of a span from the text if this provides a better generation path. For creating the question-answer training dataset this means that a title may be decomposed not only into spans from the text, but also with dictionary words. This defines a vocabulary of words that appear in the title but are not in the document text. When decomposing a title, we would first try to find our next ‘answer’ in the text, and if not found we check if a word from the dictionary can be used instead as the next ‘answer’.

We restrict ourselves to word tokens consisting of alphabetic characters. Taking 6 months of news articles (Jun - Nov 2018), we have 0.8 million distinct ‘in-title-not-in-text’ words out of 2.2 million distinct ‘in-title’ words, and 16.2 million distinct ‘in-text’ words. The ‘in-text’ dictionary of course has a familiar distribution of stop words, with ‘the’, ‘of’, ‘and’ at the top. The ‘in-title’ dictionary is similar, except that the names of months come close to the top. But the ‘in-title-not-in-text’ dictionary is different. Names of the months and some special time-related abbreviations rise to the top, for example ‘June’, ‘EDT’, ‘CEST’ etc. We argue that it is better to pick such words from the text rather than to ‘hallucinate’ them from a dictionary. Hence our filtering for the dictionary:

  1. Create a dictionary (with counts) of cased words found in title but not in text.

  2. If a word has a count of at least 100 occurrences as all lowercase (in title) then combine the uppercase and lowercase counts and keep the word as lowercase.

  3. Remove all words that are not lowercase.

  4. Select whatever dictionary size is needed as top N words by counts of occurrences.
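The four filtering steps can be sketched as follows. This is one possible reading of the steps, with hypothetical names (`filter_dictionary`, `cased_counts`, `min_lower`); the input is assumed to map each cased ‘in-title-not-in-text’ word to its count, and the example counts are made up.

```python
from collections import Counter


def filter_dictionary(cased_counts, top_n, min_lower=100):
    """Apply steps 2-4 to a mapping from cased word to count (step 1 is the input)."""
    merged = Counter()
    for word, count in cased_counts.items():
        lower = word.lower()
        if word != lower and cased_counts.get(lower, 0) >= min_lower:
            merged[lower] += count  # step 2: fold the cased variant into the lowercase word
        else:
            merged[word] += count
    # Step 3: remove all words that are not lowercase.
    lowercase_only = Counter({w: c for w, c in merged.items() if w.islower()})
    # Step 4: keep the top N words by count.
    return lowercase_only.most_common(top_n)


# Made-up counts for illustration only.
example = {"says": 150, "Says": 50, "June": 300, "june": 2, "head": 120}
top_words = filter_dictionary(example, top_n=2)
```

In this toy input ‘June’ is dropped because its lowercase form is rare, which matches the stated intent of taking such words from the text instead of the dictionary.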

The top 10 words are shown in Table 4.

word count
seeks 678931
head 325633
increases 313320
us 312013
line 296889
issues 295236
decreases 284587
says 265168
publishes 220836
announces 211991
Table 4: Top of the selected ‘in-title-not-in-text’ words.

If we keep all the words above the count 500, the dictionary size is 9016. Let us see how much the use of the dictionary changes the nature of our training set. The bigger the dictionary, the more titles become decomposable not purely by spans from the text but also by including the dictionary words. Table 5 shows how increase of the dictionary size affects our training set. The counts are normalized by the number of decomposable documents in the absence of the dictionary: The first row shows that for each decomposable document there are 5.0 training samples produced by a text span and 0.0 training samples produced by a dictionary word. The latter is obvious because the first row is for the dictionary size equal to 0 words.

Dictionary size | Decomposable documents | Samples by span | Samples by dictionary
0 | 1.0 | 5.0 | 0.0
100 | 1.6 | 9.2 | 0.7
200 | 1.8 | 10.6 | 1.0
300 | 2.1 | 12.4 | 1.4
400 | 2.3 | 13.6 | 1.7
500 | 2.4 | 14.7 | 2.0
1000 | 3.1 | 19.1 | 3.3
9016 | 5.8 | 39.9 | 11.4
Table 5: Growth of the number of decomposable documents and training samples with increase of the dictionary size. Each training sample (‘answer’, i.e. a piece of a title) can be obtained either by a text span or by a dictionary word.

Adding only the top 100 words of the dictionary increases the number of decomposable documents by 60%, mostly by ‘saving’ the decomposition of a title with a single dictionary word. Most of a title is still composed of spans, with less than 8% of samples obtained from the dictionary (0.7 of 9.9 samples). Notice that the number of samples has grown more than the number of decomposable titles. This means that the added titles have a more complex decomposition, which may carry a steep cost during model training.
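The percentages quoted here follow directly from Table 5. The short sketch below recomputes, for each dictionary size, the share of samples that come from the dictionary and the number of samples per decomposable document; the variable names are introduced for illustration, and the numbers are copied from the table.

```python
# (dictionary size, decomposable documents, samples by span, samples by dictionary)
rows = [
    (0, 1.0, 5.0, 0.0), (100, 1.6, 9.2, 0.7), (200, 1.8, 10.6, 1.0),
    (300, 2.1, 12.4, 1.4), (400, 2.3, 13.6, 1.7), (500, 2.4, 14.7, 2.0),
    (1000, 3.1, 19.1, 3.3), (9016, 5.8, 39.9, 11.4),
]

dictionary_share = {}   # fraction of samples produced by a dictionary word
samples_per_doc = {}    # training samples per decomposable document
for size, docs, by_span, by_dict in rows:
    total = by_span + by_dict
    dictionary_share[size] = by_dict / total
    samples_per_doc[size] = total / docs
    print(f"{size:>5}: {dictionary_share[size]:6.1%} from dictionary, "
          f"{samples_per_doc[size]:.1f} samples per document")
```

For a dictionary of 100 words the share is about 7.1%, and the number of samples per decomposable document rises from 5.0 to about 6.2.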

As further words are added, the training set changes far less dramatically. In principle, an increase of the dictionary size should improve flexibility of title generation, but also increase the probability of hallucinations and incoherence. From the growth of the training patterns we observe, it is reasonable to assume that the added dictionary should be kept small, limited to the top 1000 words or even to the first 100 - 200.

We have not fully explored this direction, but our attempt to add a dictionary of 500 words caused such deterioration of quality of generated titles (with the same amount of training data) that it did not merit an evaluation. The model we used for that occasion used a simple combination of the question-answer output and the dictionary output (the latter as a softmax classifier) on top of BERT.

Our intuition is that we do not need a dictionary for this task. The vocabulary needed for expression of a functional document title can be found in the text itself.

5 Conclusion

As we found from human evaluation, the original headlines are scored higher in quality than generated ones on average. This is true regardless of the fact that it is difficult to guess whether a headline is real or generated.

Our approach can be considered supervised or unsupervised, since the decomposable titles are selected without any human help. Curiously, we observe that our generation produces headlines which look not like real titles and usually not like the decomposable titles used in the training set.

We also observe that our model, being trained on news articles, generates functional titles when applied to other document types such as emails, movie plot summaries, and even legal documents. For example, when given the full LaTeX source of this paper, the model is not confused by the LaTeX commands (which it has never seen) and produces the title of this paper verbatim. The model of course considers the title string as a very informative statement positioned below far less informative specifications of user packages, and above the author, abstract, and introduction sections. If everything above the introduction is deleted, the model generates the title "Extractive summarization from text during headline generation", which is a reasonable description of this very document.

We are also exploring (with encouraging results) the use of this headline generation model for generating a bullet-point summary of a text, by generating headlines for appropriately identified segments of the text; this is work in progress. Finally, we hope that simply using larger models and larger, better filtered data may deliver still better results.

6 Acknowledgements

We thank Ethan Chan for many helpful comments. And most of all we thank the journalists of the world who created the data and continue to inform the public.


References

  • [1] Konstantin Lopyrev. Generating News Headlines with Recurrent Neural Networks. arXiv preprint arXiv:1512.01712v1, 2015.
  • [2] Ayana, Shiqi Shen, Yu Zhao, Zhiyuan Liu, Maosong Sun. Neural Headline Generation with Sentence-wise Optimization. arXiv preprint arXiv:1604.01904v2, 2016.
  • [3] Shun Kiyono, Sho Takase, Jun Suzuki, Naoaki Okazaki, Kentaro Inui and Masaaki Nagata. Source-side Prediction for Neural Headline Generation. arXiv preprint arXiv:1712.08302, 2017.
  • [4] Peng Xu and Pascale Fung. A novel repetition normalized adversarial reward for headline generation. arXiv preprint arXiv:1902.07110, 2019.
  • [5] Chandra Khatri, Gyanit Singh and Nish Parikh. Abstractive and Extractive Text Summarization using Document Context Vector and Recurrent Neural Networks. arXiv preprint arXiv:1807.08000v2, 2018.
  • [6] Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez and Krys Kochut. Text Summarization Techniques: A Brief Survey. arXiv preprint arXiv:1707.02268v3, 2017.
  • [7] Soheil Esmaeilzadeh, Gao Xian Peh, Angela Xu. Neural Abstractive Text Summarization and Fake News Detection. arXiv preprint arXiv:1904.00788v1, 2019.
  • [8] Güneş Erkan and Dragomir R. Radev. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. In Journal of Artificial Intelligence Research 22:457–479, 2004.
  • [9] Rada Mihalcea and Paul Tarau. Textrank: Bringing order into texts. In Association for Computational Linguistics, 2004.
  • [10] Abigail See, Peter J. Liu and Christopher D. Manning. Get To The Point: Summarization with Pointer-Generator Networks. arXiv preprint arXiv:1704.04368v2, 2017.
  • [11] Sachin Kumar and Yulia Tsvetkov. Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs. arXiv preprint arXiv:1812.04616v3, 2019.
  • [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805v1, 2018.
  • [13] https://github.com/huggingface/pytorch-pretrained-BERT
  • [14] Chin-Yew Lin. Rouge: a Package for Automatic Evaluation of Summaries. In Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL, 2004.
  • [15] https://prodi.gy
  • [16] Klaus Krippendorff. Content analysis: An introduction to its methodology, 4th edition. Sage Publications. 2018.
  • [17] https://github.com/pln-fing-udelar/fast-krippendorff