WikiHow: A Large Scale Text Summarization Dataset
Sequence-to-sequence models have recently achieved state-of-the-art performance in summarization. However, few large-scale, high-quality datasets are available, and almost all of them consist of news articles with a specific writing style. Moreover, abstractive human-style systems that describe content at a deeper level require data with higher levels of abstraction. In this paper, we present WikiHow, a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and therefore represent a high diversity of styles. We evaluate the performance of existing methods on WikiHow to present its challenges and set baselines for further improvement.
Summarization, the process of generating a shorter version of a piece of text while preserving important context information, is one of the most challenging NLP tasks. Sequence-to-sequence neural networks have recently obtained significant performance improvements on summarization Rush et al. (2015); Chopra et al. (2016). However, the availability of large-scale datasets is key to the success of these models. Moreover, the length of the articles and the diversity of their styles can create further complications.
Almost all existing summarization datasets, such as DUC Harman and Over (2004), Gigaword Napoles et al. (2012), New York Times Sandhaus (2008) and CNN/Daily Mail Nallapati et al. (2016), consist of news articles. News articles have their own specific style, and therefore systems trained only on news may not generalize well. In addition, the existing datasets may not be large enough to train a sequence-to-sequence model (DUC), their summaries may be limited to headlines (Gigaword), they may be more useful for extractive summarization (New York Times), or their abstraction level may be limited (CNN/Daily Mail).
To overcome the issues of the existing datasets, we present a new large-scale dataset called WikiHow using the online WikiHow knowledge base.
We introduce a large-scale, diverse dataset with various writing styles, convenient for long-sequence text summarization.
We introduce level of abstractedness and compression ratio metrics to show how abstractive the new dataset is.
We evaluate the performance of the existing systems on WikiHow to create benchmarks and understand the challenges better.
2 Existing Datasets
There are several datasets used to evaluate the summarization systems. We briefly describe the properties of these datasets as follows.
DUC: The Document Understanding Conference dataset Harman and Over (2004) contains 500 news articles and their summaries capped at 75 bytes. The summaries are written by human authors, and there is more than one summary per article, which is its major advantage over other existing datasets. The DUC dataset is too small to train models with a large number of parameters and is therefore used along with other datasets Rush et al. (2015); Nallapati et al. (2017).
Gigaword: Another collection of news articles used for summarization is Gigaword Napoles et al. (2012). The original articles in the dataset do not have summaries paired with them. However, some prior work Rush et al. (2015); Chopra et al. (2016) used a subset of this dataset and constructed pairs of summaries by using the first line of the article and its headline, making the dataset suitable for short text summarization tasks.
New York Times: The New York Times (NYT) dataset Sandhaus (2008) is a large collection of articles published between 1996 and 2007. While this dataset has been mainly used for extractive systems Hong and Nenkova (2014); Durrett et al. (2016), Paulus et al. (2017) are the first to evaluate their abstractive system using NYT.
CNN/Daily Mail: This dataset, widely used in recent summarization papers Nallapati et al. (2016); See et al. (2017); Nallapati et al. (2017), consists of online CNN and Daily Mail news articles and was originally developed for question answering systems. The highlights associated with each article are concatenated to form the summary. Two versions of this dataset exist, depending on the preprocessing. Nallapati et al. (2017) used entity anonymization to create the anonymized version of the dataset, while See et al. (2017) replaced the anonymized entities with their actual values to create the non-anonymized version.
NEWSROOM: This corpus Grusky et al. (2018) is the most recent large-scale dataset introduced for text summarization. It consists of diverse summaries combining abstractive and extractive strategies, yet it is another news dataset and the average length of its summaries is limited.
3 WikiHow Dataset
|Average Article Length||579.8|
|Average Summary Length||62.1|
The existing summarization datasets consist of news articles. These articles are written by journalists and follow the journalistic style. Journalists usually follow the Inverted Pyramid style Pöttker (2003) (depicted in Figure 1) to prioritize and structure a text: they mention the most important, interesting or attention-grabbing elements of a story in the opening paragraphs and add details and background information later. This writing style may be the reason why lead-3 baselines (where the first three sentences are selected to form the summary) usually score higher than the existing summarization systems. We introduce a new dataset called WikiHow, obtained from the WikiHow data dump. This dataset contains articles written by ordinary people, not journalists, describing the steps of doing a task throughout the text. Therefore, the Inverted Pyramid does not apply to it, as all parts of the text can be of similar importance.
3.1 WikiHow Knowledge Base
The WikiHow knowledge base contains online articles describing procedural tasks on various topics (from arts and entertainment to computers and electronics) with multiple methods or steps, and new articles are added to it regularly. Each article consists of a title starting with “How to” and a short description of the article. There are two types of articles: the first type describes single-method tasks in different steps, while the second type presents multiple steps of different methods for a task. Each step description starts with a bold line summarizing that step and is followed by a more detailed explanation. A truncated example of a WikiHow article and how the data pairs are constructed is shown in Figure 2.
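The step structure described above (a bold summary line followed by a detailed explanation) suggests one plausible way to form an article and summary pair. The sketch below is an illustration under that assumption only: the exact pairing procedure is given in Figure 2, which is not reproduced here, and the `Step` type and `build_pair` function are hypothetical names.

```python
from typing import List, Tuple

# Hypothetical representation of one step:
# (bold summary line, detailed explanation).
Step = Tuple[str, str]

def build_pair(steps: List[Step]) -> Tuple[str, str]:
    """Assumed pairing: join the bold lines into the summary and the
    detailed explanations into the article."""
    summary = " ".join(headline.strip() for headline, _ in steps)
    article = " ".join(detail.strip() for _, detail in steps)
    return article, summary

steps = [
    ("Gather your tools.", "You will need a wrench and a bucket."),
    ("Turn off the water supply.", "The valve is usually under the sink."),
]
article, summary = build_pair(steps)
print("ARTICLE:", article)
print("SUMMARY:", summary)
```

For multi-method articles, the same function could be applied per method or over all steps of all methods; which of the two applies to the dataset is not stated in this section.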
3.2 Data Extraction and Dataset Construction
We made use of the python Scrapy
4 WikiHow Properties
The large scale of the WikiHow dataset, with more than 230,000 pairs, and its average article and summary lengths make it a better choice than the DUC and Gigaword corpora. We also define two metrics to represent the abstraction level of WikiHow and compare it with CNN/Daily Mail, which is known as one of the most abstractive and common datasets in recent summarization papers Nallapati et al. (2016, 2017); See et al. (2017); Paulus et al. (2017).
|Seq-to-seq with attention||31.33||11.81||28.83||12.03||22.04||6.27||20.87||10.06|
|Pointer-generator + coverage||39.53||17.28||36.38||17.32||28.53||9.23||26.54||10.56|
4.1 Level of Abstractedness
The abstractedness of the dataset is measured by counting the unique n-grams in the reference summary that do not appear in the article. The comparison is shown in Figure 3. In the WikiHow pairs, the articles and the summaries share only unigrams, bigrams and trigrams; no longer n-grams are common to both. The higher level of abstractedness creates new challenges for summarization systems, as they have to be more creative in generating novel summaries.
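The measurement described above can be sketched as follows. This is a minimal illustration, not the authors' script: whitespace tokenization, lowercasing, and normalizing by the number of unique summary n-grams are all assumptions, and the function names and the example texts are made up for the sketch.

```python
from typing import List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of unique n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_ratio(article: str, summary: str, n: int) -> float:
    """Fraction of unique summary n-grams that never occur in the article."""
    article_ngrams = ngrams(article.lower().split(), n)
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0  # summary shorter than n tokens
    novel = summary_ngrams - article_ngrams
    return len(novel) / len(summary_ngrams)

# Toy pair, invented for illustration only.
article = "preheat the oven then mix the flour and sugar in a large bowl"
summary = "combine the dry ingredients in a bowl"
for n in (1, 2, 3):
    print(n, round(novel_ngram_ratio(article, summary, n), 2))
```

Averaging this ratio over all pairs for each n would give one curve per dataset, which is the kind of comparison Figure 3 reports.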
4.2 Compression Ratio
We define the compression ratio to characterize the summarization. We first calculate the average sentence length for both the articles and the summaries. The compression ratio is then defined as the ratio of the average length of article sentences to the average length of summary sentences. The higher the compression ratio, the more difficult the summarization task, as it needs to capture higher levels of abstraction and semantics. Table 3 shows the results for WikiHow and CNN/Daily Mail. The higher compression ratio of WikiHow shows the need for higher levels of abstraction.
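As a worked example, the ratio can be computed from the sentence lengths reported in the table below, reading the definition as article sentence length divided by summary sentence length (which matches the two table rows). The assignment of the first column to WikiHow and the second to CNN/Daily Mail is an assumption: the column headers are not shown, and this assignment is the one consistent with the statement that WikiHow has the higher ratio.

```python
# Average sentence lengths taken from the table.
# Which column belongs to which dataset is an assumption (see the text above).
stats = {
    "WikiHow": {"article": 100.68, "summary": 42.27},
    "CNN/Daily Mail": {"article": 118.73, "summary": 82.63},
}

def compression_ratio(article_sentence_len: float, summary_sentence_len: float) -> float:
    """Average article sentence length divided by average summary sentence length."""
    return article_sentence_len / summary_sentence_len

ratios = {
    name: compression_ratio(s["article"], s["summary"])
    for name, s in stats.items()
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Under these assumptions the ratios come out to roughly 2.38 and 1.44.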
|Article Sentence Length||100.68||118.73|
|Summary Sentence Length||42.27||82.63|
We evaluate existing extractive and abstractive baselines on the WikiHow dataset. The systems used and the results generated for WikiHow and CNN/Daily Mail are described in the following sections.
5.1 Evaluated Systems
TextRank Extractive system: An extractive summarization system Mihalcea and Tarau (2004); Barrios et al. (2016) using a graph-based ranking method to select sentences from the article and form the summary.
Sequence-to-sequence model with attention: A baseline system applied by Chopra et al. (2016); Nallapati et al. (2016) to the abstractive summarization task, generating summaries from a predefined vocabulary. This baseline is not able to handle out-of-vocabulary (OOV) words.
Pointer-generator abstractive system: A pointer-generator mechanism See et al. (2017) allowing the model to freely switch between copying a word from the input sequence and generating a word from the predefined vocabulary.
Pointer-generator with coverage abstractive system: The pointer-generator baseline with added coverage loss See et al. (2017) to reduce the repetition in the final generated summary.
Lead-3 baseline: A baseline selecting the first three sentences of the article to form the summary. This baseline cannot be directly used for the WikiHow dataset, as the first sentences of each article only describe a small portion of the whole article. We therefore created the Lead-3 baseline for WikiHow by extracting the first sentence of each paragraph and concatenating these sentences to form the summary.
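The adapted Lead-3 baseline described in the last item can be sketched as below. This is an illustration, not the authors' implementation: the regex-based sentence splitter is a naive stand-in for whatever tokenizer was actually used, and the function names and example paragraphs are hypothetical.

```python
import re
from typing import List

def first_sentence(paragraph: str) -> str:
    """Return the first sentence, splitting naively after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip(), maxsplit=1)
    return parts[0]

def paragraph_lead_summary(paragraphs: List[str]) -> str:
    """Concatenate the first sentence of every non-empty paragraph."""
    leads = [first_sentence(p) for p in paragraphs if p.strip()]
    return " ".join(leads)

# Toy article, invented for illustration only.
paragraphs = [
    "Gather your tools. You will need a wrench and a bucket.",
    "Turn off the water supply. The valve is usually under the sink.",
    "Unscrew the old faucet. Keep the bucket underneath to catch drips.",
]
print(paragraph_lead_summary(paragraphs))
```

A naive splitter like this breaks on abbreviations such as "e.g."; a proper sentence tokenizer would be the safer choice on real articles.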
To study the performance of the evaluated systems, we used the Pyrouge package
We present WikiHow, a new large-scale summarization dataset consisting of diverse articles from the WikiHow knowledge base. The WikiHow features discussed in the paper can create new challenges for summarization systems. We hope that the new dataset attracts researchers' attention as a choice for evaluating their systems.
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
- Federico Barrios, Federico López, Luis Argerich, and Rosa Wachenchauzer. 2016. Variations of the similarity function of TextRank for automated summarization. arXiv preprint arXiv:1602.03606.
- Sumit Chopra, Michael Auli, and Alexander M Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In HLT-NAACL, pages 93–98.
- Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. arXiv preprint arXiv:1603.08887.
- Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719.
- Donna Harman and Paul Over. 2004. The effects of human variation in DUC summarization evaluation. Text Summarization Branches Out.
- Kai Hong and Ani Nenkova. 2014. Improving the estimation of word importance for news multi-document summarization. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 712–721.
- Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8. Barcelona, Spain.
- Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
- Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. AAAI.
- Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. CoNLL 2016, page 280.
- Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 95–100. Association for Computational Linguistics.
- Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
- Horst Pöttker. 2003. News and its communicative quality: The inverted pyramid - when and why did it appear? Journalism Studies, 4(4):501–511.
- Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- Evan Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia, 6(12):e26752.
- Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. ACL.