Sentence Compression in Spanish driven by Discourse Segmentation and Language Models

Alejandro Molina    Juan-Manuel Torres-Moreno    Iria da Cunha    Eric SanJuan    Gerardo Sierra

Previous work has shown that Automatic Text Summarization (ATS) by sentence extraction can be improved using sentence compression. In this work we present a sentence compression approach guided by sentence-level discourse segmentation and probabilistic language models (LM). The results presented here show that the proposed solution is able to generate coherent summaries with grammatical compressed sentences. The approach is simple enough to be transposed into other languages.

  Laboratoire Informatique d’Avignon,

BP 91228 84911, Avignon, Cedex 09, France


 École Polytechnique de Montréal,

CP. 6128 succursale Centre-ville, Montréal, Québec, Canada

 Universitat Pompeu Fabra, Barcelona, Spain.

 Instituto de Ingeniería, UNAM Mexico, DF.

1 Introduction

Automatic Text Summarization (ATS) is indispensable to cope with ever increasing volumes of valuable information. An abstract is by far the most concrete and most recognized kind of text condensation [ANS:79]. Sentence extraction generates summaries by selecting sentences from the source text [luhn:58, edmundson:69, torres:11].

Sentence compression can be used to improve extractive summarization [knight:00, molina:linguamatica:10]. Previous work suggests that sentence segmentation could be helpful for generating sentence compressions [sporleder:05].

In this work we present a new automatic sentence compression approach. First, sentences are segmented using a discourse segmenter, and then compression candidates are generated. Finally, the best candidate, i.e., the most grammatical one, is selected based on its probability as a sequence in a language model.

The rest of the paper is organized as follows. In §2 we recall the main concepts of sentence compression. Then, in §3, we present our approach to generating compression candidates. The evaluation of compression candidates is introduced in §4. Experimental results are shown in §5. Finally, §6 presents conclusions and future work.

2 Sentence compression

Sentence compression can be considered as summarization at the sentence level. The sentence compression task is defined as follows:

“Consider an input sentence as a sequence of words $W = (w_1, w_2, \ldots, w_n)$. An algorithm may drop any subset of these words. The words that remain (order unchanged) form a compression” [knight:02].

There are interesting algorithms for deciding which words to remove from a sentence, but humans also tend to delete long phrases when writing an abstract [pitler:10].

Recent studies have found good results by concentrating on clauses instead of isolated words. In [steinberger:06] an algorithm first divides sentences into clauses prior to any elimination, and then compression candidates are scored using Latent Semantic Analysis [deerwester:90]. However, this algorithm includes no component to mitigate grammaticality issues. Although its results are good in general, in some cases the main subject of the sentence is removed. The authors attempted to solve this issue by including features in a machine learning approach [steinberger:07].

As an alternative to clauses, some studies explore discourse structures to tackle the sentence compression task. Discourse chunking [sporleder:05] is an alternative to discourse parsing that has a direct application to sentence compression. Its authors plausibly argue that, while discourse parsing at the document level still poses a significant challenge, sentence-level discourse chunking could be an alternative in languages with limited full discourse parsing tools. In addition, some sentence-level discourse models have shown accuracies comparable to human performance [soricut:03].

3 Compression Candidates Generation

In this work, we use a sentence-level discourse segmentation approach. Formally, “Discourse segmentation is the process of decomposing discourse into Elementary Discourse Units (EDUs), which may be simple sentences or clauses in a complex sentence, and from which discourse trees are constructed” [tofiloski:09]. Discourse segmentation is only the first stage of discourse parsing (the others are the detection of rhetorical relations and the building of discourse trees). However, we can consider segmentation at the sentence level in order to identify segments to be eliminated in the sentence compression task. This decomposition of a sentence into EDUs using only local information is called shallow discourse segmentation. In [molina:micai:11], the authors use a discourse segmenter to segment sentences in Spanish. The discourse segmenter is described in [dacunha:10] and is based on the Rhetorical Structure Theory framework [mann:87].

We propose that compression candidates be generated by deleting some discourse segments from the original sentence. Let $S$ be a sentence and $(s_1, s_2, \ldots, s_k)$ the sequence of its $k$ discourse segments. A candidate, $CC_i$, is a subsequence of $S$ that preserves the original order of the segments. The original sentence always forms a candidate, i.e., $CC_0 = S$. This is convenient because sometimes there is no shorter grammatical version of the sentence, especially for short sentences that consist of a single EDU. Since we do not consider the empty subsequence as a candidate, there are $2^k - 1$ candidates. Furthermore, since we rarely have more than 5 discourse segments in a sentence, we usually create between 1 and 31 candidates. This dramatically reduces the solution space compared with word-level deletion, given that $k \ll n$, where $n$ is the number of words in the sentence. The compression candidates are constructed using a binary counter. In Example 1 we show all the candidates associated with a sentence extracted from our corpus.

Example 1.
$CC_0$: [Además ella participó ese mismo año en el concierto en tributo a Freddie Mercury,][hablando acerca de la prevención necesaria][para combatir el SIDA.] (English translation: [Also she participated that year in the concert in tribute to Freddie Mercury,][talking about prevention needed][to fight AIDS.])
$CC_1$: [hablando acerca de la prevención necesaria][para combatir el SIDA.]
$CC_2$: [Además ella participó ese mismo año en el concierto en tributo a Freddie Mercury,][para combatir el SIDA.]
$CC_3$: [para combatir el SIDA.]
$CC_4$: [Además ella participó ese mismo año en el concierto en tributo a Freddie Mercury,][hablando acerca de la prevención necesaria]
$CC_5$: [hablando acerca de la prevención necesaria]
$CC_6$: [Además ella participó ese mismo año en el concierto en tributo a Freddie Mercury,]
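
The binary-counter construction can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `generate_candidates` and the convention that bit j of the counter marks segment j as deleted are assumptions, chosen so that the candidates come out in the same order as in Example 1.

```python
def generate_candidates(segments):
    """Enumerate all non-empty, order-preserving subsequences of `segments`.

    A counter runs from 0 to 2**k - 2. Bit j set means segment j is deleted.
    Counter value 0 deletes nothing, so the first candidate is the original
    sentence. The all-ones value (which would delete everything) is skipped.
    """
    k = len(segments)
    candidates = []
    for deleted_mask in range(2 ** k - 1):
        kept = [seg for j, seg in enumerate(segments)
                if not (deleted_mask >> j) & 1]
        candidates.append(kept)
    return candidates


segments = [
    "Además ella participó ese mismo año en el concierto en tributo a Freddie Mercury,",
    "hablando acerca de la prevención necesaria",
    "para combatir el SIDA.",
]
for i, candidate in enumerate(generate_candidates(segments)):
    print(i, " ".join(candidate))
```

For three segments the loop yields seven candidates, and for five segments it yields 31.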

4 Compression Candidates Scoring with Language Model

A Language Model (LM) estimates the probability distribution of natural language. Statistical language modeling [chen:99, manning:99] is a technique widely used to assign a probability to a sequence of words. We assume that good compression candidates must have a high probability as sequences in an LM. In general, for a sentence $W = (w_1, w_2, \ldots, w_n)$, the probability of $W$ is:

$$P(W) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1})$$

where $w_i^j = (w_i, \ldots, w_j)$. The probabilities in an LM are estimated by counting sequences in a corpus. Even though we will never have enough data to compute the statistics for all possible sentences, we can base our estimates on large corpora and interpolation methods. In our experiments we use a large corpus of 1T words (LDC Catalog No.: LDC2009T25, ISBN: 1-58563-525-1) to obtain the sequence counts, and an LM interpolation based on Jelinek-Mercer smoothing [chen:99], where $N$ is the order of the model:

$$P_{interp}(w_i \mid w_{i-N+1}^{i-1}) = \lambda_{w_{i-N+1}^{i-1}} \, P_{ML}(w_i \mid w_{i-N+1}^{i-1}) + (1 - \lambda_{w_{i-N+1}^{i-1}}) \, P_{interp}(w_i \mid w_{i-N+2}^{i-1}) \qquad (4)$$

In equation (4) the maximum likelihood estimate of a sequence is interpolated with the smoothed lower-order distribution.

For a given candidate, $CC_i = (w_1, w_2, \ldots, w_m)$, we assign an LM-based score, $Score_{LM}(CC_i)$, based on its probability as in equation (4). For our experiments we use the SRILM Language Modeling Toolkit [stolcke:02].
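
To make the scoring step concrete, here is a toy sketch of an interpolated bigram model used to rank candidates. It is not SRILM and not the model used in the experiments: the class `InterpolatedBigramLM`, the single fixed interpolation weight `lam` (instead of context-dependent weights), the add-one unigram floor, the bigram order and the tiny English corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


class InterpolatedBigramLM:
    """Toy bigram model with Jelinek-Mercer interpolation (one fixed lambda)."""

    def __init__(self, sentences, lam=0.7):
        self.lam = lam
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.contexts = Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            self.unigrams.update(tokens[1:])  # <s> is never predicted
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[(prev, word)] += 1
                self.contexts[prev] += 1
        self.total = sum(self.unigrams.values())
        self.vocab_size = len(self.unigrams) + 1  # +1 for one unseen-word class

    def prob(self, word, prev):
        # Lowest order: add-one smoothed unigram, so unseen words keep some mass.
        p_uni = (self.unigrams[word] + 1) / (self.total + self.vocab_size)
        ctx = self.contexts[prev]
        if ctx == 0:
            return p_uni  # unseen context: fall back to the lower order
        # Maximum likelihood bigram estimate, interpolated with the unigram.
        p_ml = self.bigrams[(prev, word)] / ctx
        return self.lam * p_ml + (1 - self.lam) * p_uni

    def log_prob(self, tokens):
        padded = ["<s>"] + tokens + ["</s>"]
        return sum(math.log(self.prob(w, p)) for p, w in zip(padded, padded[1:]))


corpus = [
    "she sang at the concert".split(),
    "she sang at the concert to fight the disease".split(),
    "they talked about prevention".split(),
]
lm = InterpolatedBigramLM(corpus)
candidates = [
    "she sang at the concert to fight the disease".split(),
    "she sang at the concert".split(),
    "to fight the disease".split(),
]
best = max(candidates, key=lm.log_prob)
print(" ".join(best))
```

An unnormalised sequence probability tends to favour shorter candidates, since every extra word multiplies in another factor below one; the text does not say whether any length normalisation was applied, so none is used here.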


5 Experimental Results

For the experiments, two annotators were asked to compress each sentence following the instructions in [molina:micai:11]. The corpus contains four sub-corpora: Wikipedia sections, brief news, scientific abstracts and short stories. Each sub-corpus has 20 texts of no more than 50 phrases each (1939 tokens). We randomly selected eight documents for evaluation, two from each sub-corpus.

We have generated abstract summaries by selecting the best compression candidate for each sentence, considering two different approaches:

  1. All system: Selecting the best scored candidate for each sentence.

  2. First system: Selecting the best candidate from those that include the first segment.
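
The two selection strategies can be sketched as follows. This is only an illustration: candidates are represented as tuples of kept segment indices, and the scores are made-up numbers standing in for LM scores; neither the representation nor the values come from the paper.

```python
def select_all(candidates, score):
    """All system: the best-scoring candidate among all candidates."""
    return max(candidates, key=score)


def select_first(candidates, score):
    """First system: the best-scoring candidate that keeps segment 0."""
    with_first = [c for c in candidates if 0 in c]
    return max(with_first, key=score)


# Candidates for a three-segment sentence, as tuples of kept segment indices.
candidates = [(0, 1, 2), (1, 2), (0, 2), (2,), (0, 1), (1,), (0,)]

# Hypothetical scores (higher is better), invented for illustration only.
toy_scores = {
    (0, 1, 2): -42.0, (1, 2): -30.5, (0, 2): -33.1, (2,): -12.4,
    (0, 1): -35.8, (1,): -20.9, (0,): -25.0,
}

print(select_all(candidates, toy_scores.get))    # (2,)
print(select_first(candidates, toy_scores.get))  # (0,)
```

Because the original sentence is always a candidate and contains the first segment, the filtered list in `select_first` is never empty.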

For comparison we created the Random system, a baseline that applies a random compression: it eliminates words from a given sentence at the same rate as the human annotators.
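
A random baseline of this kind could look like the sketch below. The details are assumptions: the text does not say whether the rate is applied per sentence or per document, nor how rounding is handled, and the function name `random_compression` is hypothetical. Here the rate is applied per sentence and at least one word is always kept.

```python
import random


def random_compression(words, rate, rng=None):
    """Delete about `rate` of the words at random, keeping the original order."""
    rng = rng or random.Random()
    n_delete = round(rate * len(words))
    # Assumption: never delete every word of a non-empty sentence.
    n_delete = max(0, min(n_delete, len(words) - 1))
    deleted = set(rng.sample(range(len(words)), n_delete))
    return [w for i, w in enumerate(words) if i not in deleted]


sentence = "Also she participated that year in the concert in tribute".split()
print(" ".join(random_compression(sentence, rate=0.2, rng=random.Random(0))))
```

Passing a seeded `random.Random` makes a run repeatable, which is convenient when the baseline is compared against other systems.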

After compression, three judges (different from the annotators) read the eight summaries. The judges did not know the source of each final summary. They marked each sentence in the final summary that they found grammatically incorrect in its context. In addition, they evaluated the global coherence of the summaries after compression. Coherence is scored with a categorical variable: a value of -1 is assigned to incoherent summaries, 0 to summaries in which some coherence is found, and +1 to coherent summaries. The Compression Rate (CR) is defined as the proportion of content eliminated from the original document.
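
As a small worked illustration of the CR definition, the rate can be computed as below. Measuring content in whitespace-separated words is an assumption; the text does not state the unit, and the function name `compression_rate` is hypothetical.

```python
def compression_rate(original_tokens, compressed_tokens):
    """Proportion of the original content that was eliminated (0.0 to 1.0)."""
    if not original_tokens:
        raise ValueError("original text must not be empty")
    return 1.0 - len(compressed_tokens) / len(original_tokens)


original = "Also she participated that year in the concert talking about prevention".split()
compressed = "Also she participated that year in the concert".split()
print(f"CR = {100 * compression_rate(original, compressed):.1f}%")  # CR = 27.3%
```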

In Table 1, we compare our systems and the two human annotators. The results confirm that scoring compressions before producing a summary improves the text quality of summaries with compressed sentences. Surprisingly, for the second annotator (Human 2), the proportion of compressions judged grammatically incorrect is greater than that of our systems. Perhaps this annotator misunderstood the compression instructions. However, taking the results of the first annotator (Human 1) and of the Random baseline system as limits, we consider that the proportion of ill-formed sentences produced by our systems is very low. We confirmed our initial intuition that preserving the first segment tends to keep the main subject in most cases. Introducing this simple heuristic in the First system improves the grammatical quality of its output.

Table 2 shows the results of comparing our systems using the two human summaries as references. We wanted to evaluate the content quality of the summaries with the ROUGE package [lin:2003]. ROUGE is used to evaluate summaries because some results show that it correlates well with human judgements [lin:2004rpa]. The results in Table 2 are the opposite of what we expected. We assumed that the Random system would obtain worse results than the First system; given the human judgments of coherence and grammar in Table 1, we expected the same ranking of the systems. However, in Table 2 the best ROUGE value, using human-made summaries with compressed sentences as references, is obtained by the Random system. Other than that, the First system again outperforms the All system.

As an alternative, we compare the divergence of the texts with respect to the original uncompressed text using the FRESA package [torres:10c, saggion:10, torres:10poli]. The FRESA score $F$ assesses the quality of the summaries. Lower values of $F$ mean a more significant difference with respect to the original text (i.e., more radical compressions). The results of the divergence tests are shown in Table 3. Considering the $F$ values, FRESA finds the First system closer to Human 1 than Human 2 is, while Human 2 is closer to the performance of the Random system. These values are congruent with the coherence and grammar judgments shown in Table 1. The $F$ value of the All system suggests that it is the most aggressive approach.

For all tables we use the following notation about the sources: All system=all compression candidates, First system=candidates including the first segment, Random system=random compression (baseline), Human=human compressions.

Algorithm   Average CR (%)   Coherence   Agrammatical compressions (%)
All         30.50            +0.37        8.12
First       18.80            +0.62        6.98
Random      22.97            -0.50       76.60
Human 1     22.34            +1.00        0.00
Human 2     15.98            +0.75       20.68
Table 1: Grammaticality tests.

Algorithm   ROUGE scores
All         0.6999   0.6041   0.5860
First       0.8069   0.6897   0.6775
Random      0.8880   0.7085   0.7257
Table 2: Content tests using the ROUGE package. The reference summaries were the two human compressions.

Algorithm   $F_1$     $F_2$     $F_{SU4}$   $F$
All         0.9197    0.9124    0.9078      0.9133
First       0.9461    0.9472    0.94512     0.9461
Random      0.9593    0.9536    0.9535      0.9555
Human 1     0.9460    0.9427    0.9372      0.9420
Human 2     0.9594    0.9562    0.9547      0.9567
Table 3: Divergence tests using FRESA without references. $F$ is the mean of the scores for 1-grams ($F_1$), 2-grams ($F_2$) and SU4-grams ($F_{SU4}$).

6 Conclusions and future work

In this work we have introduced the concept of Sentence Compression driven by Discourse Segmentation and Language Models. We have found that probabilistic language models can be helpful for evaluating compression candidates. The results in Spanish presented in this paper are very encouraging. We believe that this approach is sufficiently language-independent to be transposed into other languages such as English or French. In future work we aim to improve the score (4) by adding content restrictions.

Evaluation of compressed sentences and summaries with compressions is still a challenge in languages other than English that do not have reference corpora. We think that more studies are necessary in order to evaluate if ROUGE or FRESA are good methods for compressed text evaluations.


This work was partially supported by Consejo Nacional de Ciencia y Tecnología (CONACYT) México, grant number 211963.

