Multi-Document Summarization using Distributed Bag-of-Words Model
As the number of documents on the web grows exponentially, multi-document summarization is becoming increasingly important, since it can provide the main ideas of a document set in a short time. In this paper, we present an unsupervised centroid-based document-level reconstruction framework using the distributed bag-of-words model. Specifically, our approach selects summary sentences so as to minimize the reconstruction error between the summary and the documents. We apply sentence selection and beam search to further improve the performance of our model. Experimental results show that the performance of our model is competitive with state-of-the-art unsupervised algorithms on standard benchmark datasets.
Kaustubh Mani, IIT Kharagpur
Ishan Verma, TCS Innovation Labs
Lipika Dey, TCS Innovation Labs
1 Introduction

Multi-document summarization is the process of representing a set of documents with a short piece of text by capturing the relevant information and filtering out the redundant information. Two prominent approaches to multi-document summarization are extractive and abstractive summarization. Extractive summarization systems aim to extract salient snippets, sentences, or passages from documents, while abstractive summarization systems aim to concisely paraphrase the content of the documents.
Most unsupervised extractive summarization techniques are based on sentence ranking, where sentences are selected on the basis of a ranking model. Several ranking models, such as PageRank and topic modeling, have been proposed. Redundancy within the selected sentences is one of the main reasons these models fail. To overcome this problem, data-reconstruction-based techniques (He et al., 2012; Yao et al., 2015; Liu et al., 2015) have been proposed. Since these methods try to reconstruct all the sentences in a document set, they fail to consider the global information content of the documents.
In this paper, we propose a centroid-based document-level reconstruction framework using the distributed bag-of-words (PV-DBOW) model (Le and Mikolov, 2014). Summary sentences are selected so as to minimize the reconstruction error between the summary and the documents. We further employ sentence selection and beam search to reduce redundancy and increase diversity among the selected sentences.
The major contributions of our work include:
We propose the use of Distributed Bag of Words (PV-DBOW) model for multi-document summarization. To the best of our knowledge, this model has not been used for unsupervised document summarization before.
Since the document and sentence representation technique is crucial to our model, we compare the performance of the distributed bag-of-words (PV-DBOW) model with several other document representation techniques.
We conduct experiments on DUC 2006 and DUC 2007 benchmark datasets to show the improvement of our model over previous unsupervised data reconstruction based summarization systems.
2 Proposed Framework
Several summarization methods use the bag-of-words (BOW) model for sentence ranking and sentence selection (Erkan and Radev, 2004; Radev et al., 2004). The bag-of-words model fails to encode the semantic relationship between words when comparing sentences. Continuous Bag of Words (CBOW) based methods (Rossiello et al., 2017), built on word2vec, perform better by taking the context of words into consideration, but fail to encode the global content of a document. Paragraph vectors (Le and Mikolov, 2014) have recently been proposed as a method for learning fixed-length distributed representations from variable-length pieces of text. The method has proven effective for representing documents and sentences in several natural language processing tasks, such as sentiment classification (Le and Mikolov, 2014), topic detection (Hashimoto et al., 2016), and document similarity (Dai et al., 2015).

In this work, we use the Distributed Bag of Words (PV-DBOW) model to represent documents and sentences. First, we train the PV-DBOW model to compute document vectors for all the documents in a document set. We then represent the main content of the document set by its centroid vector, calculated by averaging the document vectors. Summary sentences are selected so as to minimize the reconstruction error between the documents and the summary. Sentence selection is performed to reduce redundancy in the summary, and beam search is used to further minimize the reconstruction error by exploring a large search space of candidate summaries.
2.1 Distributed Memory Model
The Distributed Memory model is similar to the word2vec model of Mikolov et al. (2013); the only difference is the addition of a document vector on the input side (Figure 1).
During training, the document vector and all the word vectors are initialized randomly. The document vector and the word vectors are concatenated or averaged and fed into a softmax classifier to predict the next word within a fixed-size context window. The softmax weights are also randomly initialized.
More precisely, given a document vector $D$ and a context window of size $k$, the model maximizes the average log-likelihood of each word $w_t$ given $D$ and the surrounding words:

$$\frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid D, w_{t-k}, \dots, w_{t+k})$$

where $T$ is the number of words in the document.
At inference stage, the softmax weight and the word vectors are fixed and the document vector is updated using stochastic gradient descent.
2.2 Distributed Bag of Words Model
Distributed Bag of Words model is a simpler version of paragraph vectors, which takes the document vector D as input and forces the model to predict words in a text window of n words randomly sampled from the document.
During training, the document vector D and the softmax weights U are randomly initialized and updated using stochastic gradient descent via backpropagation. At inference stage, for a new document or sentence, the document vector D is randomly initialized and updated by gradient descent while keeping the softmax weights U fixed. Unlike the distributed memory model (PV-DM), which tries to predict the next word given a context, PV-DBOW predicts the context directly from the document vector. This enables the model to encode higher n-gram representations, making it more suitable for our task of document reconstruction. Compared to the distributed memory (PV-DM) version of paragraph vectors, PV-DBOW has fewer parameters and thus needs less data to train.
2.3 Document Reconstruction
We treat the summarization task as a multi-document reconstruction problem. We assume that a good summary is one which can reconstruct the main content of a document set, and that the centroid of all the documents is representative of all the meaningful content in the document set. This assumption is inspired by Radev et al. (2004), where the idea was first introduced.
Further, to validate this assumption, we randomly select four document sets and their respective model summaries from the DUC 2006 dataset, compute their vector representations using the PV-DBOW model, and project the documents and summaries onto a two-dimensional space (see Figure 3). We find that the centroid of the document vectors and the centroid of the summary vectors are very close to each other. Hence, in our framework, the main content of a document set is represented by the centroid of the document vectors.
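The projection method used for Figure 3 is not stated above; a PCA sketch in plain numpy (center the vectors, keep the top two principal directions) illustrates the idea:

```python
import numpy as np

def project_2d(vectors):
    """PCA via SVD: project row vectors onto their top-2 principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)  # center so components pass through the centroid
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T  # shape (n_points, 2)

rng = np.random.default_rng(0)
docs = rng.normal(size=(10, 100))  # stand-ins for PV-DBOW document vectors
points = project_2d(docs)
centroid_2d = points.mean(axis=0)  # the centroid projects to the 2-D mean
```

Because the projection is linear, the centroid of the vectors maps exactly to the mean of the projected points, so the centroids plotted in two dimensions faithfully correspond to centroids in the original vector space.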
Given a multi-document set $D = [D_1, D_2, \dots, D_n]$, the centroid vector $C$ is computed as

$$C = \frac{1}{n} \sum_{i=1}^{n} \mathrm{DBOW}(D_i) \qquad (2)$$

where $n$ is the total number of documents in the multi-document set and $\mathrm{DBOW}$ denotes the Distributed Bag of Words model (PV-DBOW). Our basic model builds the summary by iteratively selecting the sentences with the minimum reconstruction error

$$E(S) = \lVert C - \mathrm{DBOW}(S) \rVert^2 \qquad (3)$$

where $S$ denotes a candidate summary.
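In numpy, the centroid and a reconstruction error are a few lines; the squared-Euclidean form of the error is an assumption of this sketch:

```python
import numpy as np

def centroid(doc_vectors):
    """Centroid of the PV-DBOW document vectors (equation 2)."""
    return np.mean(doc_vectors, axis=0)

def reconstruction_error(summary_vec, C):
    """Squared Euclidean distance between a candidate summary vector and
    the centroid -- the error measure assumed in this sketch."""
    return float(np.sum((C - summary_vec) ** 2))

docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy document vectors
C = centroid(docs)  # [2/3, 2/3]
err = reconstruction_error(np.array([0.5, 0.5]), C)  # 2 * (1/6)^2 = 1/18
```

In the full pipeline, `doc_vectors` would come from the trained PV-DBOW model and `summary_vec` from inferring a vector for the concatenated candidate summary.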
2.4 Sentence Selection
Given a document set, we create a candidate set of sentences $S = [s_1, s_2, \dots, s_m]$, which contains all the sentences in the document set. Sentence vectors for all the sentences in the candidate set are computed using the trained PV-DBOW model. The sentences in the candidate set are sorted according to their reconstruction error, given by equation (3). The reconstruction error is minimized by iteratively moving sentences from the candidate set into the summary set until the summary length exceeds a maximum limit K. At each iteration, we calculate the cosine similarity between the candidate sentence vector and the vectors of the sentences already present in the summary set; sentences whose cosine similarity exceeds a threshold are not selected into the summary set.
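A sketch of this selection loop; the threshold value `tau` and the choice to skip, rather than stop at, over-length sentences are illustrative, since the text does not specify either:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_sentences(sent_vecs, sent_lens, errors, K, tau=0.9):
    """Greedily pick sentences in order of increasing reconstruction error,
    skipping any sentence too similar (cosine > tau) to one already chosen
    and any that would push the summary past the K-word budget."""
    order = np.argsort(errors)  # candidates sorted by reconstruction error
    chosen, length = [], 0
    for i in order:
        if length + sent_lens[i] > K:
            continue
        if any(cosine(sent_vecs[i], sent_vecs[j]) > tau for j in chosen):
            continue  # redundant with the summary built so far
        chosen.append(i)
        length += sent_lens[i]
    return chosen

# Toy run: sentence 1 is nearly identical to sentence 0 and gets skipped.
vecs = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
chosen = select_sentences(vecs, sent_lens=[5, 5, 5],
                          errors=[0.1, 0.2, 0.3], K=10)
```

Sorting by error once up front makes the loop greedy: each accepted sentence is the lowest-error candidate that is not redundant with what is already in the summary.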
2.5 Beam Search
Beam search is a heuristic state-space search algorithm, essentially a modification of breadth-first search. The algorithm loops over the entire candidate set S and selects sentences until the summary length exceeds the maximum length limit K. At each iteration, sentences from the candidate set are added to the summaries present in the summary set, the vector for each summary is computed using the trained PV-DBOW model, and its reconstruction error is calculated. The summaries in the summary set are sorted by their reconstruction error, and only the top k summaries are retained for the next iteration; k is often referred to as the beam width. After the algorithm terminates, the summary set containing k summaries is returned. Out of these, we take the summary with the minimum reconstruction error as the output of the beam search algorithm.
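The procedure above can be sketched as follows; `score` stands in for the reconstruction error of a candidate summary (in our framework, the distance between its PV-DBOW vector and the centroid), and all names are illustrative:

```python
def beam_search(sentences, score, max_len, k=3):
    """Beam search over candidate summaries: keep the k lowest-error
    summaries each iteration, extend each with every unused sentence,
    and stop once no summary can grow within the length limit."""
    beams = [((), 0)]  # (tuple of sentence indices, summary word count)
    while True:
        expanded = []
        for ids, length in beams:
            for i, sent in enumerate(sentences):
                n = len(sent.split())
                if i in ids or length + n > max_len:
                    continue
                expanded.append((ids + (i,), length + n))
        if not expanded:  # nothing can be extended: search is done
            break
        expanded.sort(key=lambda b: score(b[0]))  # ascending error
        beams = expanded[:k]
    return min(beams, key=lambda b: score(b[0]))[0]

# Toy run: a hypothetical score that prefers summaries of exactly 5 words.
sents = ["a b", "c d e", "f"]
score = lambda ids: abs(5 - sum(len(sents[i].split()) for i in ids))
best = beam_search(sents, score, max_len=5, k=2)
```

Compared with the greedy loop of Section 2.4, the beam keeps k partial summaries alive at once, so an early sentence choice that looks good in isolation but combines poorly can still be beaten by an alternative path.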
3 Experiments

3.1 Datasets

We conducted experiments on two standard summarization benchmark datasets, DUC 2006 and DUC 2007, provided by NIST (http://www.nist.gov/index.html). DUC 2006 and DUC 2007 contain 50 and 45 document sets, respectively. Each document set consists of 25 news articles and 4 human-written summaries as ground truth. The summary length is limited to 250 words (whitespace delimited).
Neural-network-based models are difficult to train on small datasets. We therefore first train our model on a combined corpus of the Thomson Reuters Text Research Collection (TRC2) from the Reuters Corpora (Lewis et al., 2004) and the CNN/DailyMail dataset, and then fine-tune on the DUC 2006 and DUC 2007 datasets. The training data is consistent across all the neural-network-based models mentioned in the paper. We use the gensim library (https://radimrehurek.com/gensim/index.html) in Python to train our PV-DBOW model. The hyper-parameters of the model are selected through grid search on the DUC 2005 dataset. In the CBOW model, the document vector is represented by the weighted average of all the word vectors in the document.
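For the CBOW baseline, the weighted average mentioned above can be sketched as follows; the weighting scheme (e.g. tf-idf) is not specified in the text, so the `weights` argument here is a hypothetical input:

```python
import numpy as np

def doc_vector(tokens, word_vecs, weights=None):
    """CBOW-style document vector: weighted average of the word vectors.
    `weights` maps token -> weight (e.g. tf-idf); defaults to uniform."""
    vecs, ws = [], []
    for t in tokens:
        if t in word_vecs:  # ignore out-of-vocabulary tokens
            vecs.append(word_vecs[t])
            ws.append(1.0 if weights is None else weights.get(t, 1.0))
    V, w = np.array(vecs), np.array(ws)
    return (w[:, None] * V).sum(axis=0) / w.sum()

# Toy vocabulary of 2-dimensional word vectors.
word_vecs = {"cat": np.array([1.0, 0.0]), "mat": np.array([0.0, 1.0])}
```

With uniform weights this reduces to the plain mean of the word vectors, which is why CBOW document vectors cannot distinguish word order or document-level structure the way paragraph vectors can.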
3.2 Evaluation Metric
We use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics (Lin, 2005), which have been widely adopted by DUC for automatic summarization evaluation. ROUGE measures summary quality by counting overlapping units such as n-grams, word sequences, and word pairs between the generated summary (produced by an algorithm) and the model summary (human-written). We choose ROUGE-N and ROUGE-SU4 in our experiments. Formally, ROUGE-N is an n-gram recall, and ROUGE-SU4 is a unigram plus skip-bigram match with a maximum skip distance of 4 between a system-generated summary and a set of model summaries.
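Concretely, a skip-bigram with maximum skip distance 4 is an in-order word pair with at most four words between its members. A small sketch of extracting them (the gap convention `j - i - 1 <= max_skip` follows Lin's definition):

```python
def skip_bigrams(tokens, max_skip=4):
    """All in-order token pairs (tokens[i], tokens[j]) with at most
    `max_skip` tokens between them, i.e. j - i - 1 <= max_skip."""
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + max_skip + 2, len(tokens)))]
```

For the four-word sentence "police killed the gunman" all six in-order pairs qualify, while with `max_skip=0` the function reduces to ordinary bigrams; ROUGE-SU4 then scores the overlap of these pairs (plus unigrams) between system and model summaries.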
4 Compared Methods
As our framework is unsupervised, we compare our model with state-of-the-art unsupervised summarization systems. Document-reconstruction-based methods such as SpOpt (Yao et al., 2015), DocRebuild (Ma et al.), and DSDR (He et al., 2012) are the direct baselines for comparison. SpOpt uses a sparse representation model which selects sentences and performs sentence compression simultaneously. DocRebuild uses the distributed memory (PV-DM) model to represent documents and selects sentences using a document-level reconstruction framework. (DocRebuild also uses other summarization systems such as SpOpt and DSDR to improve its performance; for a fair comparison, we have implemented only the basic version of their model.) DSDR selects sentences from the candidate set by linearly reconstructing all the sentences in the document set, and minimizes the reconstruction error using sparse coding. We also include two weaker baselines, Random and Lead (Wasson, 1998). Random selects sentences at random for each document set. Lead sorts the documents in a document set chronologically and selects the leading sentences from each document one by one. We use PV-DBOW to denote our basic model, PV-DBOW+SS to denote our model with sentence selection, and PV-DBOW+BS to denote our model with beam search.
We also compare PV-DBOW model with other document representation techniques like CBOW and PV-DM using the same document reconstruction framework (Table 3).
5 Results and Discussion
The results of all the experiments performed on the DUC 2006 and DUC 2007 datasets are shown in Table 1 and Table 2, respectively. As shown in the tables, Random and Lead give the poorest performance. DSDR improves on them by introducing a data-reconstruction-based system. DocRebuild performs better by using a document-level reconstruction framework. SpOpt improves the performance even further by performing sentence compression and imposing a diversity constraint. Our basic model outperforms all the baselines, and PV-DBOW with beam search achieves the best performance. The improvement in Rouge-2 and Rouge-SU4 scores is more significant than in Rouge-1 scores; higher Rouge-2 and Rouge-SU4 scores suggest that our model is better at capturing n-grams than individual words.
To show the effectiveness of our model, we randomly pick 5 document sets from the DUC 2006 dataset and compute the vectors for our model-generated summaries and the reference summaries. For each document set, we plot the documents along with the system-generated summary and the centroid of the 4 reference summaries. In Figure 4, each color corresponds to a document set; system-generated summaries are denoted by (x) and centroids of reference summaries by (+). The figure shows that our system-generated summaries are very close to the centroid of the reference summaries for each document set.
Table 1: Rouge scores (%) on DUC 2006.

| Model | Rouge-1 | Rouge-2 | Rouge-SU4 |
| --- | --- | --- | --- |
| PV-DBOW + SS | 41.400 | 9.299 | 14.895 |
| PV-DBOW + BS | 41.421 | 9.418 | 14.976 |

Table 2: Rouge scores (%) on DUC 2007.

| Model | Rouge-1 | Rouge-2 | Rouge-SU4 |
| --- | --- | --- | --- |
| PV-DBOW + SS | 42.617 | 11.124 | 16.462 |
| PV-DBOW + BS | 42.723 | 11.231 | 16.508 |
Experimental results (Table 3) also show that PV-DBOW is a better model than PV-DM (Ma et al.) and CBOW for representing documents and sentences in document-reconstruction-based multi-document summarization.
6 Related Work
Our model is closely related to data-reconstruction-based summarization, which was first proposed by He et al. (2012). Since then, several other data reconstruction based approaches (Yao et al., 2015; Ma et al.) have been proposed. Liu et al. (2015) proposed a two-level sparse representation model to reconstruct the sentences in the document set subject to a diversity constraint. Wang et al. (2008) proposed a model based on non-negative matrix factorization (NMF) to group the sentences into clusters. Recently, several neural-network-based models have been proposed for both extractive (Cao et al., 2016; Nallapati et al., 2017) and abstractive summarization (Rush et al., 2015; Nallapati et al., 2016).
7 Conclusion and Future Work
In this paper, we present a document-level reconstruction framework based on the distributed bag-of-words (PV-DBOW) model. The main content of the document set is represented by a centroid vector computed using the PV-DBOW model, and summary sentences are selected so as to minimize the reconstruction error. We apply sentence selection and beam search to further improve the performance of our model. Our model outperforms state-of-the-art unsupervised systems and shows significant improvements in Rouge-2 and Rouge-SU4 scores. Since paragraph vectors can model variable-length texts, our framework can be extended to a phrase-level extraction-based summarization system. We leave this as future work.
References

- Cao et al. (2016) Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei, and Yanran Li. 2016. Attsum: Joint learning of focusing and summarization with neural attention. arXiv preprint arXiv:1604.00125.
- Dai et al. (2015) Andrew M Dai, Christopher Olah, and Quoc V Le. 2015. Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998 .
- Erkan and Radev (2004) Günes Erkan and Dragomir R Radev. 2004. Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22:457–479.
- Hashimoto et al. (2016) Kazuma Hashimoto, Georgios Kontonatsios, Makoto Miwa, and Sophia Ananiadou. 2016. Topic detection using paragraph vectors to support active learning in systematic reviews. Journal of biomedical informatics 62:59–65.
- He et al. (2012) Zhanying He, Chun Chen, Jiajun Bu, Can Wang, Lijun Zhang, Deng Cai, and Xiaofei He. 2012. Document summarization based on data reconstruction. In AAAI.
- Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). pages 1188–1196.
- Lewis et al. (2004) David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. 2004. Rcv1: A new benchmark collection for text categorization research. Journal of machine learning research 5(Apr):361–397.
- Lin (2005) C Lin. 2005. Recall-oriented understudy for gisting evaluation (ROUGE).
- Liu et al. (2015) He Liu, Hongliang Yu, and Zhi-Hong Deng. 2015. Multi-document summarization based on two-level sparse representation model. In AAAI. pages 196–202.
- Ma et al. Shulei Ma, Zhi-Hong Deng, and Yunlun Yang. An unsupervised multi-document summarization framework based on neural document model.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 .
- Nallapati et al. (2017) Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI.
- Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023 .
- Radev et al. (2004) Dragomir R Radev, Hongyan Jing, Małgorzata Styś, and Daniel Tam. 2004. Centroid-based summarization of multiple documents. Information Processing & Management 40(6):919–938.
- Rossiello et al. (2017) Gaetano Rossiello, Pierpaolo Basile, and Giovanni Semeraro. 2017. Centroid-based text summarization through compositionality of word embeddings. MultiLing 2017 page 12.
- Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685 .
- Wang et al. (2008) Dingding Wang, Tao Li, Shenghuo Zhu, and Chris Ding. 2008. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, pages 307–314.
- Wasson (1998) Mark Wasson. 1998. Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In Proceedings of the 17th international conference on Computational linguistics-Volume 2. Association for Computational Linguistics, pages 1364–1368.
- Yao et al. (2015) Jin-ge Yao, Xiaojun Wan, and Jianguo Xiao. 2015. Compressive document summarization via sparse optimization. In IJCAI. pages 1376–1382.