Multi-Document Summarization using Distributed Bag-of-Words Model

Kaustubh Mani
IIT Kharagpur
Ishan Verma
TCS Innovation Labs
Lipika Dey
TCS Innovation Labs

As the number of documents on the web grows exponentially, multi-document summarization is becoming increasingly important, since it can convey the main ideas of a document set in a short time. In this paper, we present an unsupervised centroid-based document-level reconstruction framework using the distributed bag-of-words model. Specifically, our approach selects summary sentences so as to minimize the reconstruction error between the summary and the documents. We apply sentence selection and beam search to further improve the performance of our model. Experimental results show that our model is competitive with state-of-the-art unsupervised algorithms on standard benchmark datasets.

1 Introduction

Multi-document summarization is a process of representing a set of documents with a short piece of text by capturing the relevant information and filtering out the redundant information. Two prominent approaches to multi-document summarization are extractive and abstractive summarization. Extractive summarization systems aim to extract salient snippets, sentences or passages from documents, while abstractive summarization systems aim to concisely paraphrase the content of the documents.

Most unsupervised extractive summarization techniques are based on sentence ranking, where sentences are selected on the basis of a ranking model. Several ranking models, such as PageRank and topic modeling, have been proposed. Redundancy within the selected sentences is one of the main reasons these models fail. To overcome this problem, data reconstruction based techniques (He et al., 2012; Yao et al., 2015; Liu et al., 2015) have been proposed. Since these methods try to reconstruct all the sentences in a document set, they fail to consider the global information content of a document.

In this paper, we propose a centroid-based document-level reconstruction framework using the distributed bag-of-words (PV-DBOW) model (Le and Mikolov, 2014). Summary sentences are selected so as to minimize the reconstruction error between the summary and the documents. We further employ sentence selection and beam search to reduce redundancy and increase diversity among the selected sentences.

The major contributions of our work include:

  • We propose the use of Distributed Bag of Words (PV-DBOW) model for multi-document summarization. To the best of our knowledge, this model has not been used for unsupervised document summarization before.

  • Since the document and sentence representation technique is crucial to our model, we compare the performance of the distributed bag of words (PV-DBOW) model with several other document representation techniques.

  • We conduct experiments on DUC 2006 and DUC 2007 benchmark datasets to show the improvement of our model over previous unsupervised data reconstruction based summarization systems.

2 Proposed Framework

Several summarization methods use the bag of words (BOW) model for sentence ranking and sentence selection (Erkan and Radev, 2004; Radev et al., 2004). The bag of words model fails to encode the semantic relationship between words when comparing sentences. Continuous Bag of Words (CBOW, a word2vec architecture) based methods (Rossiello et al., 2017) perform better by taking the context of words into consideration, but fail to encode the global content of a document. Paragraph vectors (Le and Mikolov, 2014) were proposed as a method for learning fixed-length distributed representations from variable-length pieces of text. The method has proven effective for representing documents and sentences in several natural language processing tasks such as sentiment classification (Le and Mikolov, 2014), topic detection (Hashimoto et al., 2016) and document similarity (Dai et al., 2015). In this work, we use the Distributed Bag of Words (PV-DBOW) model to represent documents and sentences. First, we train the PV-DBOW model to compute document vectors for all the documents in a document set. We then represent the main content of the document set by its centroid vector, which is calculated by averaging the document vectors. Next, we select summary sentences so as to minimize the reconstruction error between the documents and the summary. Sentence selection is performed to reduce redundancy in the summary, and beam search is used to further minimize the reconstruction error by exploring a large search space of candidate summaries.

Figure 1: The distributed memory model predicts next word in a given context of fixed size.
Figure 2: The distributed bag of words model uses document vector to directly predict words in a randomly selected context.

2.1 Distributed Memory Model

The Distributed Memory model (PV-DM) is similar to the word2vec model proposed by Mikolov et al. (2013); the only difference is the addition of a document vector on the input side (Figure 1).

During training, the document vector and all the word vectors are initialized randomly. The document vector and the word vectors are concatenated or averaged and fed into a softmax classifier that predicts the next word from a context window of fixed size. The softmax weights are also randomly initialized.

More precisely, the model tries to maximize the log-likelihood of word $w_t$ given a document vector $D$ and the words $w_{t-c}, w_{t-c+1}, \ldots, w_{t-1}$ in a context window of fixed size $c$:

$$\sum_{t} \log p(w_t \mid D, w_{t-c}, \ldots, w_{t-1}) \qquad (1)$$

At the inference stage, the softmax weights and the word vectors are fixed and the document vector is updated using stochastic gradient descent.

2.2 Distributed Bag of Words Model

The Distributed Bag of Words model is a simpler version of paragraph vectors, which takes the document vector D as input and forces the model to predict words in a text window of n words randomly sampled from the document.

During training, the document vector D and the softmax weights U are randomly initialized and updated using stochastic gradient descent via backpropagation. At the inference stage, for a new document or sentence, the document vector D is randomly initialized and updated by gradient descent while the softmax weights U are kept fixed. Unlike the distributed memory model (PV-DM), which tries to predict the next word given a context, PV-DBOW predicts the context directly from the document vector. This enables the model to encode higher-order n-gram representations, making it more suitable for our task of document reconstruction. Compared to the distributed memory (PV-DM) version of paragraph vectors, PV-DBOW has fewer parameters and thus needs less data to train.
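The inference step described above (fit D while U stays fixed) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a full softmax over a tiny vocabulary, whereas practical PV-DBOW implementations typically use hierarchical softmax or negative sampling, and the names `infer_doc_vector` and `neg_log_likelihood` as well as the weight matrix below are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_probs(doc_vec, U):
    """p(word | doc_vec) for every vocabulary word, one row of U per word."""
    return softmax([sum(u * d for u, d in zip(row, doc_vec)) for row in U])


def neg_log_likelihood(doc_vec, U, word_ids):
    """Average of -log p(word | doc_vec) over the observed words."""
    probs = word_probs(doc_vec, U)
    return -sum(math.log(probs[w]) for w in word_ids) / len(word_ids)


def infer_doc_vector(U, word_ids, dim, steps=200, lr=0.5, seed=0):
    """Fit a document vector by gradient descent; U is never modified."""
    rng = random.Random(seed)
    doc_vec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    for _ in range(steps):
        probs = word_probs(doc_vec, U)
        grad = []
        for j in range(dim):
            # Gradient of the average loss: expected row of U under the
            # model minus the mean row of U over the observed words.
            expected = sum(p * row[j] for p, row in zip(probs, U))
            observed = sum(U[w][j] for w in word_ids) / len(word_ids)
            grad.append(expected - observed)
        doc_vec = [d - lr * g for d, g in zip(doc_vec, grad)]
    return doc_vec


# Hypothetical fixed softmax weights for a 4-word vocabulary, dimension 2.
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
words = [0, 0, 1]  # word ids observed in the new document
vec = infer_doc_vector(U, words, dim=2)
print(neg_log_likelihood([0.0, 0.0], U, words), neg_log_likelihood(vec, U, words))
```

The loss of the fitted vector should be lower than that of the zero vector, which corresponds to a uniform prediction over the vocabulary.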

2.3 Document Reconstruction

We treat the summarization task as a multi-document reconstruction problem. We assume that a good summary is one that can reconstruct the main content of a document set, and that the centroid of all the documents is representative of the meaningful content in the document set. This assumption is inspired by Radev et al. (2004), where the idea was first introduced.

Further, to validate our assumption we randomly select four document sets and their respective model summaries from DUC 2006 dataset, compute their vector representation using PV-DBOW model and project the documents and summaries onto two-dimensional space (see Figure 3). We find that the centroid of document vectors and the centroid of summary vectors are very close to each other. Hence in our framework, the main content of a document set is represented by the centroid of the document vectors.

Given a multi-document set $D = [d_1, d_2, \ldots, d_n]$, the centroid vector $C$ is given by:

$$C = \frac{1}{n} \sum_{i=1}^{n} \mathrm{DBOW}(d_i) \qquad (2)$$

where n is the total number of documents in the multi-document set, and DBOW represents the Distributed Bag of Words model (PV-DBOW). Our basic model builds the summary by iteratively selecting the sentences with the minimum reconstruction error, given by equation (3).

$$\mathrm{ReconError}(\hat{S}) = \lVert C - \mathrm{DBOW}(\hat{S}) \rVert_2^2 \qquad (3)$$

where $\hat{S}$ denotes a candidate summary.
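As a rough illustration of the centroid and reconstruction error, the sketch below works on precomputed vectors. It assumes the error is the squared Euclidean distance between the centroid and the candidate summary vector, which is an assumption of this sketch; the vectors are made-up stand-ins for PV-DBOW output, and `centroid` and `recon_error` are hypothetical helper names.

```python
def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def recon_error(centroid_vec, summary_vec):
    """Squared Euclidean distance (assumed error measure)."""
    return sum((c - s) ** 2 for c, s in zip(centroid_vec, summary_vec))


# Made-up document vectors standing in for PV-DBOW embeddings.
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = centroid(doc_vectors)

# Made-up vectors for two candidate summaries.
candidates = {"a": [0.5, 0.5], "b": [1.0, 0.0]}
best = min(candidates, key=lambda name: recon_error(C, candidates[name]))
print(best)
```

Candidate "a" lies closer to the centroid than "b", so it is the one a minimum-error selection would keep.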

2.4 Sentence Selection

Given a document set, we create a candidate set of sentences $S = [s_1, s_2, \ldots, s_m]$, which contains all the sentences in the document set. Sentence vectors for all the sentences in the candidate set are computed using the trained PV-DBOW model. The sentences in the candidate set are sorted according to their reconstruction error given by (3). Reconstruction error is minimized by iteratively moving sentences from the candidate set into the summary set until the summary length exceeds a maximum limit K. At each iteration, we calculate the cosine similarity between the candidate sentence vector and the vectors of the sentences already present in the summary set. A sentence whose cosine similarity with any selected sentence exceeds a threshold $\theta$ is not added to the summary set.

Figure 3: Visualization of document sets and their summaries. Documents are represented by circles (o) and summaries by stars (*).
Algorithm 1 Sentence Selection
Input: candidate set S, ReconError, similarity threshold θ, length limit K, DBOW
Output: Summary
    S ← SORT(S, ReconError)
    for sentence s_i in S do
         if len(Summary) > K then
              return Summary
         select ← True
         for s_j in Summary do
              if sim(s_i, s_j) > θ then
                  select ← False
         if select then
              Summary ← Summary ∪ {s_i}
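The greedy selection in Algorithm 1 can be sketched in Python as follows. The sketch assumes sentence vectors and reconstruction errors are already computed, measures summary length in whitespace-delimited words, and stops once the length exceeds the limit, as described in the text. The function names and toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_sentences(sentences, vectors, errors, threshold, max_words):
    """Pick low-error sentences, skipping near-duplicates of chosen ones."""
    order = sorted(range(len(sentences)), key=lambda i: errors[i])
    chosen, length = [], 0
    for i in order:
        if length > max_words:
            break
        # Keep the sentence only if it is dissimilar to everything chosen.
        if all(cosine(vectors[i], vectors[j]) <= threshold for j in chosen):
            chosen.append(i)
            length += len(sentences[i].split())
    return [sentences[i] for i in chosen]


# Toy data: the second sentence nearly duplicates the first.
sentences = ["the cat sat", "a cat sat down", "dogs bark loudly"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
errors = [0.1, 0.2, 0.3]
print(select_sentences(sentences, vectors, errors, threshold=0.8, max_words=10))
```

With these toy values the second sentence is skipped because its cosine similarity to the first exceeds the threshold, so the first and third sentences are returned.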

2.5 Beam Search

Beam search is a heuristic state-space search algorithm that is essentially a modification of breadth-first search. The algorithm loops over the entire candidate set S and selects sentences until the summary length exceeds the maximum length limit K. At each iteration, sentences in the candidate set are added to the summaries present in the summary set, the vector for each summary is computed using the trained PV-DBOW model, and the reconstruction error is calculated. The summaries in the summary set are sorted according to their reconstruction error, and only the top k summaries are retained for the next iteration; k is often referred to as the beam width. After the algorithm terminates, a summary set containing k summaries is returned. Of these, we take the summary with the minimum reconstruction error as the output of the beam search algorithm.
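A minimal beam search sketch along these lines is given below, under several stated assumptions. The `embed` argument is a hypothetical stand-in for PV-DBOW inference on a candidate summary (the toy run simply averages sentence vectors). The error is assumed to be squared Euclidean distance to the centroid. For simplicity, the sketch keeps summaries within the length limit rather than stopping once the limit is exceeded.

```python
def beam_search(vectors, lengths, centroid_vec, max_len, beam_width, embed):
    """Return the tuple of sentence indices with the lowest error found."""

    def error(summary):
        vec = embed([vectors[i] for i in summary])
        return sum((c - v) ** 2 for c, v in zip(centroid_vec, vec))

    beam = [()]  # each partial summary is a sorted tuple of sentence indices
    while True:
        expanded = set()
        for summary in beam:
            used = sum(lengths[i] for i in summary)
            for i in range(len(vectors)):
                if i not in summary and used + lengths[i] <= max_len:
                    expanded.add(tuple(sorted(summary + (i,))))
        if not expanded:
            break  # no summary in the beam can grow any further
        # Keep only the beam_width lowest-error summaries; ties break on indices.
        beam = sorted(expanded, key=lambda s: (error(s), s))[:beam_width]
    return min(beam, key=lambda s: (error(s), s)) if beam[0] else ()


def mean_vector(vecs):
    """Toy stand-in for embedding a summary: average its sentence vectors."""
    return [sum(column) / len(vecs) for column in zip(*vecs)]


# Made-up sentence vectors, unit lengths, and a made-up centroid.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
lengths = [1, 1, 1, 1]
print(beam_search(vectors, lengths, [0.5, 0.5], max_len=2, beam_width=2,
                  embed=mean_vector))
```

Representing summaries as sorted tuples in a set prevents the same sentence combination from occupying several beam slots in different orders. That only makes sense when the embedding ignores sentence order, as the toy average does.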

3 Experiments

We conducted experiments on two standard summarization benchmark datasets, DUC 2006 and DUC 2007, provided by NIST (http://www/) for evaluation. DUC 2006 and DUC 2007 contain 50 and 45 document sets respectively. Each document set consists of 25 news articles and 4 human-written summaries as ground truth. The summary length is limited to 250 words (whitespace delimited).

3.1 Implementation

Neural network based models are difficult to train on small datasets. We therefore first train our model on a combined corpus of the Thomson Reuters Text Research Collection (TRC2) in Reuters Corpora (Lewis et al., 2004) and the CNN/Dailymail dataset, and then fine-tune it on the DUC 2006 and DUC 2007 datasets. The training data is the same for all the neural network based models mentioned in the paper. We use a Python library to train our PV-DBOW model. The hyper-parameters of the model are selected through grid search on the DUC 2005 dataset. In the CBOW model, the document vector is represented by the weighted average of all the word vectors in the document.

Model      Rouge-1   Rouge-2   Rouge-SU4
CBOW       38.649    7.942     13.584
PV-DM      39.826    8.514     13.875
PV-DBOW    42.679    10.916    16.320
Table 1: Average F-measure (%) on DUC 2007

3.2 Evaluation Metric

We use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics (Lin, 2005), which have been widely adopted by DUC for automatic summarization evaluation. ROUGE measures summary quality by counting overlapping units, such as n-grams, word sequences, and word pairs, between the generated summary (produced by algorithms) and the model summary (human labeled). We choose ROUGE-N and ROUGE-SU4 in our experiments. Formally, ROUGE-N is an n-gram recall, and ROUGE-SU4 is a unigram plus skip-bigram match with a maximum skip distance of 4, between a system-generated summary and a set of model summaries.
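To make the n-gram recall concrete, here is a simplified sketch of ROUGE-N recall with clipped counts. It assumes lowercased, whitespace-delimited tokens and omits the stemming, stopword handling, and F-measure computation of the official ROUGE toolkit; `rouge_n_recall` is a hypothetical helper name, not the toolkit's interface.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n):
    """Clipped n-gram matches divided by total reference n-grams."""
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = total = 0
    for reference in references:
        ref_counts = ngrams(reference.lower().split(), n)
        # An n-gram matches at most as often as it occurs in the candidate.
        matched += sum(min(count, cand_counts[gram])
                       for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


candidate = "the cat sat on the mat"
references = ["the cat was on the mat"]
print(rouge_n_recall(candidate, references, 1))
print(rouge_n_recall(candidate, references, 2))
```

In this toy pair, 5 of the 6 reference unigrams and 3 of the 5 reference bigrams are matched by the candidate.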

4 Compared Methods

As our framework is unsupervised, we compare our model with state-of-the-art unsupervised summarization systems. Document reconstruction based methods such as SpOpt (Yao et al., 2015), DocRebuild (Ma et al.) and DSDR (He et al., 2012) are the direct baselines for comparison. SpOpt uses a sparse representation model that selects sentences and performs sentence compression simultaneously. DocRebuild uses the distributed memory (PV-DM) model to represent documents and selects sentences using a document-level reconstruction framework. (DocRebuild also uses other summarization systems such as SpOpt and DSDR to improve its performance; for a fair comparison, we have implemented only the basic version of the model.) DSDR selects sentences from the candidate set by linearly reconstructing all the sentences in the document set, and minimizes the reconstruction error using sparse coding. We also report two weaker baselines, Random and Lead (Wasson, 1998). Random selects sentences at random for each document set. Lead sorts the documents in a document set chronologically and selects the leading sentences from each document one by one. We use PV-DBOW to denote our basic model, PV-DBOW + SS to denote our model with sentence selection, and PV-DBOW + BS to denote our model with beam search.

We also compare the PV-DBOW model with other document representation techniques, namely CBOW and PV-DM, using the same document reconstruction framework (Table 1).

Figure 4: Visualization of model generated summaries (x) and the Centroid of reference summaries (+) for five document sets in DUC 2006.

5 Results and Discussion

The results for all the experiments performed on the DUC 2006 and DUC 2007 datasets are shown in Table 2 and Table 3 respectively. As shown in the tables, Random and Lead give the poorest performance. DSDR improves the performance by introducing a data reconstruction based system. DocRebuild performs better by using a document-level reconstruction framework. SpOpt improves the performance even further by performing sentence compression and imposing a diversity constraint. Our basic model outperforms all the baselines, and PV-DBOW with beam search achieves the best scores on all metrics except Rouge-SU4 on DUC 2006, where the basic model scores highest. The improvement in Rouge-2 and Rouge-SU4 scores is larger than the improvement in Rouge-1 scores. Higher Rouge-2 and Rouge-SU4 scores suggest that our model is better at capturing n-grams than individual words.

To show the effectiveness of our model, we randomly pick 5 document sets from the DUC 2006 dataset and compute the vectors for our model-generated summaries and the reference summaries. For each document set, we plot the documents along with the system-generated summary and the centroid of the 4 reference summaries. In Figure 4, each color corresponds to a document set, system-generated summaries are denoted by (x), and centroids of reference summaries are denoted by (+). The figure shows that the summaries generated by our system are very close to the centroid of the reference summaries for each document set.

Model           Rouge-1   Rouge-2   Rouge-SU4
Random          33.879    5.184     10.092
Lead            34.892    6.539     11.148
DSDR            35.484    6.142     11.834
DocRebuild      37.257    6.835     12.399
SpOpt           40.418    8.388     14.232
PV-DBOW         41.282    9.269     15.040
PV-DBOW + SS    41.400    9.299     14.895
PV-DBOW + BS    41.421    9.418     14.976
Table 2: Average F-measure (%) on DUC 2006

Model           Rouge-1   Rouge-2   Rouge-SU4
Random          34.279    5.822     10.092
Lead            36.367    8.361     12.973
DSDR            37.351    7.892     12.936
DocRebuild      39.826    8.514     13.875
SpOpt           41.674    9.905     15.665
PV-DBOW         42.679    10.916    16.320
PV-DBOW + SS    42.617    11.124    16.462
PV-DBOW + BS    42.723    11.231    16.508
Table 3: Average F-measure (%) on DUC 2007

Experimental results (Table 1) also show that PV-DBOW is a better model for representing documents and sentences than PV-DM (Ma et al.) and CBOW for document reconstruction based multi-document summarization.

6 Related Work

Our model is closely related to data reconstruction based summarization, which was first proposed by He et al. (2012). Since then, several other data reconstruction based approaches (Yao et al., 2015; Ma et al.) have been proposed. Liu et al. (2015) proposed a two-level sparse representation model to reconstruct the sentences in the document set subject to a diversity constraint. Wang et al. (2008) proposed a model based on nonnegative matrix factorization (NMF) to group the sentences into clusters. Recently, several neural network based models have been proposed for both extractive (Cao et al., 2016; Nallapati et al., 2017) and abstractive summarization (Rush et al., 2015; Nallapati et al., 2016).

7 Conclusion and Future Work

In this paper, we present a document-level reconstruction framework based on the distributed bag of words model (PV-DBOW). The main content of the document set is represented by a centroid vector computed using the PV-DBOW model, and summary sentences are selected so as to minimize the reconstruction error. We apply sentence selection and beam search to further improve the performance of our model. Our model outperforms the state-of-the-art unsupervised systems, with the largest improvements in Rouge-2 and Rouge-SU4 scores. Since paragraph vectors can be used to model variable-length texts, our model can be extended to a phrase-level extraction based summarization system. We leave this as future work.


References

  • Cao et al. (2016) Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei, and Yanran Li. 2016. Attsum: Joint learning of focusing and summarization with neural attention. arXiv preprint arXiv:1604.00125.
  • Dai et al. (2015) Andrew M Dai, Christopher Olah, and Quoc V Le. 2015. Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998.
  • Erkan and Radev (2004) Günes Erkan and Dragomir R Radev. 2004. Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22:457–479.
  • Hashimoto et al. (2016) Kazuma Hashimoto, Georgios Kontonatsios, Makoto Miwa, and Sophia Ananiadou. 2016. Topic detection using paragraph vectors to support active learning in systematic reviews. Journal of biomedical informatics 62:59–65.
  • He et al. (2012) Zhanying He, Chun Chen, Jiajun Bu, Can Wang, Lijun Zhang, Deng Cai, and Xiaofei He. 2012. Document summarization based on data reconstruction. In AAAI.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). pages 1188–1196.
  • Lewis et al. (2004) David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. 2004. Rcv1: A new benchmark collection for text categorization research. Journal of machine learning research 5(Apr):361–397.
  • Lin (2005) C Lin. 2005. Recall-oriented understudy for gisting evaluation (ROUGE). Retrieved August 20, 2005.
  • Liu et al. (2015) He Liu, Hongliang Yu, and Zhi-Hong Deng. 2015. Multi-document summarization based on two-level sparse representation model. In AAAI. pages 196–202.
  • Ma et al. Shulei Ma, Zhi-Hong Deng, and Yunlun Yang. An unsupervised multi-document summarization framework based on neural document model.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Nallapati et al. (2017) Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents.
  • Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023.
  • Radev et al. (2004) Dragomir R Radev, Hongyan Jing, Małgorzata Styś, and Daniel Tam. 2004. Centroid-based summarization of multiple documents. Information Processing & Management 40(6):919–938.
  • Rossiello et al. (2017) Gaetano Rossiello, Pierpaolo Basile, and Giovanni Semeraro. 2017. Centroid-based text summarization through compositionality of word embeddings. MultiLing 2017 page 12.
  • Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
  • Wang et al. (2008) Dingding Wang, Tao Li, Shenghuo Zhu, and Chris Ding. 2008. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, pages 307–314.
  • Wasson (1998) Mark Wasson. 1998. Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In Proceedings of the 17th international conference on Computational linguistics-Volume 2. Association for Computational Linguistics, pages 1364–1368.
  • Yao et al. (2015) Jin-ge Yao, Xiaojun Wan, and Jianguo Xiao. 2015. Compressive document summarization via sparse optimization. In IJCAI. pages 1376–1382.